aboutsummaryrefslogtreecommitdiff
path: root/virt/kvm
AgeCommit message (Collapse)Author
2015-11-24KVM: arm/arm64: vgic: Trust the LR state for HW IRQsChristoffer Dall
We were probing the physial distributor state for the active state of a HW virtual IRQ, because we had seen evidence that the LR state was not cleared when the guest deactivated a virtual interrupted. However, this issue turned out to be a software bug in the GIC, which was solved by: 84aab5e68c2a5e1e18d81ae8308c3ce25d501b29 (KVM: arm/arm64: arch_timer: Preserve physical dist. active state on LR.active, 2015-11-24) Therefore, get rid of the complexities and just look at the LR. Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-11-24KVM: arm/arm64: arch_timer: Preserve physical dist. active state on LR.activeChristoffer Dall
We were incorrectly removing the active state from the physical distributor on the timer interrupt when the timer output level was deasserted. We shouldn't be doing this without considering the virtual interrupt's active state, because the architecture requires that when an LR has the HW bit set and the pending or active bits set, then the physical interrupt must also have the corresponding bits set. This addresses an issue where we have been observing an inconsistency between the LR state and the physical distributor state where the LR state was active and the physical distributor was not active, which shouldn't happen. Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-11-05Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM updates from Paolo Bonzini: "First batch of KVM changes for 4.4. s390: A bunch of fixes and optimizations for interrupt and time handling. PPC: Mostly bug fixes. ARM: No big features, but many small fixes and prerequisites including: - a number of fixes for the arch-timer - introducing proper level-triggered semantics for the arch-timers - a series of patches to synchronously halt a guest (prerequisite for IRQ forwarding) - some tracepoint improvements - a tweak for the EL2 panic handlers - some more VGIC cleanups getting rid of redundant state x86: Quite a few changes: - support for VT-d posted interrupts (i.e. PCI devices can inject interrupts directly into vCPUs). This introduces a new component (in virt/lib/) that connects VFIO and KVM together. The same infrastructure will be used for ARM interrupt forwarding as well. - more Hyper-V features, though the main one Hyper-V synthetic interrupt controller will have to wait for 4.5. These will let KVM expose Hyper-V devices. - nested virtualization now supports VPID (same as PCID but for vCPUs) which makes it quite a bit faster - for future hardware that supports NVDIMM, there is support for clflushopt, clwb, pcommit - support for "split irqchip", i.e. LAPIC in kernel + IOAPIC/PIC/PIT in userspace, which reduces the attack surface of the hypervisor - obligatory smattering of SMM fixes - on the guest side, stable scheduler clock support was rewritten to not require help from the hypervisor" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (123 commits) KVM: VMX: Fix commit which broke PML KVM: x86: obey KVM_X86_QUIRK_CD_NW_CLEARED in kvm_set_cr0() KVM: x86: allow RSM from 64-bit mode KVM: VMX: fix SMEP and SMAP without EPT KVM: x86: move kvm_set_irq_inatomic to legacy device assignment KVM: device assignment: remove pointless #ifdefs KVM: x86: merge kvm_arch_set_irq with kvm_set_msi_inatomic KVM: x86: zero apic_arb_prio on reset drivers/hv: share Hyper-V SynIC constants with userspace KVM: x86: handle SMBASE as physical address in RSM KVM: x86: add read_phys to x86_emulate_ops KVM: x86: removing unused variable KVM: don't pointlessly leave KVM_COMPAT=y in non-KVM configs KVM: arm/arm64: Merge vgic_set_lr() and vgic_sync_lr_elrsr() KVM: arm/arm64: Clean up vgic_retire_lr() and surroundings KVM: arm/arm64: Optimize away redundant LR tracking KVM: s390: use simple switch statement as multiplexer KVM: s390: drop useless newline in debugging data KVM: s390: SCA must not cross page boundaries KVM: arm: Do not indent the arguments of DECLARE_BITMAP ...
2015-11-04KVM: x86: merge kvm_arch_set_irq with kvm_set_msi_inatomicPaolo Bonzini
We do not want to do too much work in atomic context, in particular not walking all the VCPUs of the virtual machine. So we want to distinguish the architecture-specific injection function for irqfd from kvm_set_msi. Since it's still empty, reuse the newly added kvm_arch_set_irq and rename it to kvm_arch_set_irq_inatomic. Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-04KVM: don't pointlessly leave KVM_COMPAT=y in non-KVM configsJan Beulich
The symbol was missing a KVM dependency. Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-04Merge tag 'kvm-arm-for-4.4' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/ARM Changes for v4.4-rc1 Includes a number of fixes for the arch-timer, introducing proper level-triggered semantics for the arch-timers, a series of patches to synchronously halt a guest (prerequisite for IRQ forwarding), some tracepoint improvements, a tweak for the EL2 panic handlers, some more VGIC cleanups getting rid of redundant state, and finally a stylistic change that gets rid of some ctags warnings. Conflicts: arch/x86/include/asm/kvm_host.h
2015-11-04KVM: arm/arm64: Merge vgic_set_lr() and vgic_sync_lr_elrsr()Pavel Fedin
Now we see that vgic_set_lr() and vgic_sync_lr_elrsr() are always used together. Merge them into one function, saving from second vgic_ops dereferencing every time. Signed-off-by: Pavel Fedin <p.fedin@samsung.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-11-04KVM: arm/arm64: Clean up vgic_retire_lr() and surroundingsPavel Fedin
1. Remove unnecessary 'irq' argument, because irq number can be retrieved from the LR. 2. Since cff9211eb1a1f58ce7f5a2d596b617928fd4be0e ("arm/arm64: KVM: Fix arch timer behavior for disabled interrupts ") LR_STATE_PENDING is queued back by vgic_retire_lr() itself. Also, it clears vlr.state itself. Therefore, we remove the same, now duplicated, check with all accompanying bit manipulations from vgic_unqueue_irqs(). 3. vgic_retire_lr() is always accompanied by vgic_irq_clear_queued(). Since it already does more than just clearing the LR, move vgic_irq_clear_queued() inside of it. Signed-off-by: Pavel Fedin <p.fedin@samsung.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-11-04KVM: arm/arm64: Optimize away redundant LR trackingPavel Fedin
Currently we use vgic_irq_lr_map in order to track which LRs hold which IRQs, and lr_used bitmap in order to track which LRs are used or free. vgic_irq_lr_map is actually used only for piggy-back optimization, and can be easily replaced by iteration over lr_used. This is good because in future, when LPI support is introduced, number of IRQs will grow up to at least 16384, while numbers from 1024 to 8192 are never going to be used. This would be a huge memory waste. In its turn, lr_used is also completely redundant since ae705930fca6322600690df9dc1c7d0516145a93 ("arm/arm64: KVM: Keep elrsr/aisr in sync with software model"), because together with lr_used we also update elrsr. This allows to easily replace lr_used with elrsr, inverting all conditions (because in elrsr '1' means 'free'). Signed-off-by: Pavel Fedin <p.fedin@samsung.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-11-03Merge branch 'irq-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "The irq departement delivers: - Rework the irqdomain core infrastructure to accomodate ACPI based systems. This is required to support ARM64 without creating artificial device tree nodes. - Sanitize the ACPI based ARM GIC initialization by making use of the new firmware independent irqdomain core - Further improvements to the generic MSI management - Generalize the irq migration on CPU hotplug - Improvements to the threaded interrupt infrastructure - Allow the migration of "chained" low level interrupt handlers - Allow optional force masking of interrupts in disable_irq[_nosysnc] - Support for two new interrupt chips - Sigh! - A larger set of errata fixes for ARM gicv3 - The usual pile of fixes, updates, improvements and cleanups all over the place" * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits) Document that IRQ_NONE should be returned when IRQ not actually handled PCI/MSI: Allow the MSI domain to be device-specific PCI: Add per-device MSI domain hook of/irq: Use the msi-map property to provide device-specific MSI domain of/irq: Split of_msi_map_rid to reuse msi-map lookup irqchip/gic-v3-its: Parse new version of msi-parent property PCI/MSI: Use of_msi_get_domain instead of open-coded "msi-parent" parsing of/irq: Use of_msi_get_domain instead of open-coded "msi-parent" parsing of/irq: Add support code for multi-parent version of "msi-parent" irqchip/gic-v3-its: Add handling of PCI requester id. PCI/MSI: Add helper function pci_msi_domain_get_msi_rid(). of/irq: Add new function of_msi_map_rid() Docs: dt: Add PCI MSI map bindings irqchip/gic-v2m: Add support for multiple MSI frames irqchip/gic-v3: Fix translation of LPIs after conversion to irq_fwspec irqchip/mxs: Add Alphascale ASM9260 support irqchip/mxs: Prepare driver for hardware with different offsets irqchip/mxs: Panic if ioremap or domain creation fails irqdomain: Documentation updates irqdomain/msi: Use fwnode instead of of_node ...
2015-10-22arm/arm64: KVM: Add tracepoints for vgic and timerChristoffer Dall
The VGIC and timer code for KVM arm/arm64 doesn't have any tracepoints or tracepoint infrastructure defined. Rewriting some of the timer code handling showed me how much we need this, so let's add these simple trace points once and for all and we can easily expand with additional trace points in these files as we go along. Cc: Wei Huang <wei@redhat.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-10-22arm/arm64: KVM: Support edge-triggered forwarded interruptsChristoffer Dall
We mark edge-triggered interrupts with the HW bit set as queued to prevent the VGIC code from injecting LRs with both the Active and Pending bits set at the same time while also setting the HW bit, because the hardware does not support this. However, this means that we must also clear the queued flag when we sync back a LR where the state on the physical distributor went from active to inactive because the guest deactivated the interrupt. At this point we must also check if the interrupt is pending on the distributor, and tell the VGIC to queue it again if it is. Since these actions on the sync path are extremely close to those for level-triggered interrupts, rename process_level_irq to process_queued_irq, allowing it to cater for both cases. Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-10-22arm/arm64: KVM: Rework the arch timer to use level-triggered semanticsChristoffer Dall
The arch timer currently uses edge-triggered semantics in the sense that the line is never sampled by the vgic and lowering the line from the timer to the vgic doesn't have any effect on the pending state of virtual interrupts in the vgic. This means that we do not support a guest with the otherwise valid behavior of (1) disable interrupts (2) enable the timer (3) disable the timer (4) enable interrupts. Such a guest would validly not expect to see any interrupts on real hardware, but will see interrupts on KVM. This patch fixes this shortcoming through the following series of changes. First, we change the flow of the timer/vgic sync/flush operations. Now the timer is always flushed/synced before the vgic, because the vgic samples the state of the timer output. This has the implication that we move the timer operations in to non-preempible sections, but that is fine after the previous commit getting rid of hrtimer schedules on every entry/exit. Second, we change the internal behavior of the timer, letting the timer keep track of its previous output state, and only lower/raise the line to the vgic when the state changes. Note that in theory this could have been accomplished more simply by signalling the vgic every time the state *potentially* changed, but we don't want to be hitting the vgic more often than necessary. Third, we get rid of the use of the map->active field in the vgic and instead simply set the interrupt as active on the physical distributor whenever the input to the GIC is asserted and conversely clear the physical active state when the input to the GIC is deasserted. Fourth, and finally, we now initialize the timer PPIs (and all the other unused PPIs for now), to be level-triggered, and modify the sync code to sample the line state on HW sync and re-inject a new interrupt if it is still pending at that time. Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-10-22arm/arm64: KVM: Use appropriate define in VGIC reset codeChristoffer Dall
We currently initialize the SGIs to be enabled in the VGIC code, but we use the VGIC_NR_PPIS define for this purpose, instead of the the more natural VGIC_NR_SGIS. Change this slightly confusing use of the defines. Note: This should have no functional change, as both names are defined to the number 16. Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-10-22arm/arm64: KVM: Implement GICD_ICFGR as RO for PPIsChristoffer Dall
The GICD_ICFGR allows the bits for the SGIs and PPIs to be read only. We currently simulate this behavior by writing a hardcoded value to the register for the SGIs and PPIs on every write of these bits to the register (ignoring what the guest actually wrote), and by writing the same value as the reset value to the register. This is a bit counter-intuitive, as the register is RO for these bits, and we can just implement it that way, allowing us to control the value of the bits purely in the reset code. Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-10-22arm/arm64: KVM: vgic: Factor out level irq processing on guest exitChristoffer Dall
Currently vgic_process_maintenance() processes dealing with a completed level-triggered interrupt directly, but we are soon going to reuse this logic for level-triggered mapped interrupts with the HW bit set, so move this logic into a separate static function. Probably the most scary part of this commit is convincing yourself that the current flow is safe compared to the old one. In the following I try to list the changes and why they are harmless: Move vgic_irq_clear_queued after kvm_notify_acked_irq: Harmless because the only potential effect of clearing the queued flag wrt. kvm_set_irq is that vgic_update_irq_pending does not set the pending bit on the emulated CPU interface or in the pending_on_cpu bitmask if the function is called with level=1. However, the point of kvm_notify_acked_irq is to call kvm_set_irq with level=0, and we set the queued flag again in __kvm_vgic_sync_hwstate later on if the level is stil high. Move vgic_set_lr before kvm_notify_acked_irq: Also, harmless because the LR are cpu-local operations and kvm_notify_acked only affects the dist Move vgic_dist_irq_clear_soft_pend after kvm_notify_acked_irq: Also harmless, because now we check the level state in the clear_soft_pend function and lower the pending bits if the level is low. Reviewed-by: Eric Auger <eric.auger@linaro.org> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-10-22arm/arm64: KVM: arch_timer: Only schedule soft timer on vcpu_blockChristoffer Dall
We currently schedule a soft timer every time we exit the guest if the timer did not expire while running the guest. This is really not necessary, because the only work we do in the timer work function is to kick the vcpu. Kicking the vcpu does two things: (1) If the vpcu thread is on a waitqueue, make it runnable and remove it from the waitqueue. (2) If the vcpu is running on a different physical CPU from the one doing the kick, it sends a reschedule IPI. The second case cannot happen, because the soft timer is only ever scheduled when the vcpu is not running. The first case is only relevant when the vcpu thread is on a waitqueue, which is only the case when the vcpu thread has called kvm_vcpu_block(). Therefore, we only need to make sure a timer is scheduled for kvm_vcpu_block(), which we do by encapsulating all calls to kvm_vcpu_block() with kvm_timer_{un}schedule calls. Additionally, we only schedule a soft timer if the timer is enabled and unmasked, since it is useless otherwise. Note that theoretically userspace can use the SET_ONE_REG interface to change registers that should cause the timer to fire, even if the vcpu is blocked without a scheduled timer, but this case was not supported before this patch and we leave it for future work for now. Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-10-22KVM: Add kvm_arch_vcpu_{un}blocking callbacksChristoffer Dall
Some times it is useful for architecture implementations of KVM to know when the VCPU thread is about to block or when it comes back from blocking (arm/arm64 needs to know this to properly implement timers, for example). Therefore provide a generic architecture callback function in line with what we do elsewhere for KVM generic-arch interactions. Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-10-20arm/arm64: KVM: Fix disabled distributor operationChristoffer Dall
We currently do a single update of the vgic state when the distributor enable/disable control register is accessed and then bypass updating the state for as long as the distributor remains disabled. This is incorrect, because updating the state does not consider the distributor enable bit, and this you can end up in a situation where an interrupt is marked as pending on the CPU interface, but not pending on the distributor, which is an impossible state to be in, and triggers a warning. Consider for example the following sequence of events: 1. An interrupt is marked as pending on the distributor - the interrupt is also forwarded to the CPU interface 2. The guest turns off the distributor (it's about to do a reboot) - we stop updating the CPU interface state from now on 3. The guest disables the pending interrupt - we remove the pending state from the distributor, but don't touch the CPU interface, see point 2. Since the distributor disable bit really means that no interrupts should be forwarded to the CPU interface, we modify the code to keep updating the internal VGIC state, but always set the CPU interface pending bits to zero when the distributor is disabled. Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-10-20arm/arm64: KVM: Clear map->active on pend/active clearChristoffer Dall
When a guest reboots or offlines/onlines CPUs, it is not uncommon for it to clear the pending and active states of an interrupt through the emulated VGIC distributor. However, since the architected timers are defined by the architecture to be level triggered and the guest rightfully expects them to be that, but we emulate them as edge-triggered, we have to mimic level-triggered behavior for an edge-triggered virtual implementation. We currently do not signal the VGIC when the map->active field is true, because it indicates that the guest has already been signalled of the interrupt as required. Normally this field is set to false when the guest deactivates the virtual interrupt through the sync path. We also need to catch the case where the guest deactivates the interrupt through the emulated distributor, again allowing guests to boot even if the original virtual timer signal hit before the guest's GIC initialization sequence is run. Reviewed-by: Eric Auger <eric.auger@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-10-20arm/arm64: KVM: Fix arch timer behavior for disabled interruptsChristoffer Dall
We have an interesting issue when the guest disables the timer interrupt on the VGIC, which happens when turning VCPUs off using PSCI, for example. The problem is that because the guest disables the virtual interrupt at the VGIC level, we never inject interrupts to the guest and therefore never mark the interrupt as active on the physical distributor. The host also never takes the timer interrupt (we only use the timer device to trigger a guest exit and everything else is done in software), so the interrupt does not become active through normal means. The result is that we keep entering the guest with a programmed timer that will always fire as soon as we context switch the hardware timer state and run the guest, preventing forward progress for the VCPU. Since the active state on the physical distributor is really part of the timer logic, it is the job of our virtual arch timer driver to manage this state. The timer->map->active boolean field indicates whether we have signalled this interrupt to the vgic and if that interrupt is still pending or active. As long as that is the case, the hardware doesn't have to generate physical interrupts and therefore we mark the interrupt as active on the physical distributor. We also have to restore the pending state of an interrupt that was queued to an LR but was retired from the LR for some reason, while remaining pending in the LR. Cc: Marc Zyngier <marc.zyngier@arm.com> Reported-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-10-20KVM: arm/arm64: Do not inject spurious interruptsPavel Fedin
When lowering a level-triggered line from userspace, we forgot to lower the pending bit on the emulated CPU interface and we also did not re-compute the pending_on_cpu bitmap for the CPU affected by the change. Update vgic_update_irq_pending() to fix the two issues above and also raise a warning in vgic_quue_irq_to_lr if we encounter an interrupt pending on a CPU which is neither marked active nor pending. [ Commit text reworked completely - Christoffer ] Signed-off-by: Pavel Fedin <p.fedin@samsung.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-10-16kvm/irqchip: allow only multiple irqchip routes per GSIAndrey Smetanin
Any other irq routing types (MSI, S390_ADAPTER, upcoming Hyper-V SynIC) map one-to-one to GSI. Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Vitaly Kuznetsov <vkuznets@redhat.com> CC: "K. Y. Srinivasan" <kys@microsoft.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-16kvm/eventfd: add arch-specific set_irqAndrey Smetanin
Allow for arch-specific interrupt types to be set. For that, add kvm_arch_set_irq() which takes interrupt type-specific action if it recognizes the interrupt type given, and -EWOULDBLOCK otherwise. The default implementation always returns -EWOULDBLOCK. Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Vitaly Kuznetsov <vkuznets@redhat.com> CC: "K. Y. Srinivasan" <kys@microsoft.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-16kvm/eventfd: factor out kvm_notify_acked_gsi()Andrey Smetanin
Factor out kvm_notify_acked_gsi() helper to iterate over EOI listeners and notify those matching the given gsi. It will be reused in the upcoming Hyper-V SynIC implementation. Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Vitaly Kuznetsov <vkuznets@redhat.com> CC: "K. Y. Srinivasan" <kys@microsoft.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-16kvm/eventfd: avoid loop inside irqfd_update()Andrey Smetanin
The loop(for) inside irqfd_update() is unnecessary because any other value for irq_entry.type will just trigger schedule_work(&irqfd->inject) in irqfd_wakeup. Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Vitaly Kuznetsov <vkuznets@redhat.com> CC: "K. Y. Srinivasan" <kys@microsoft.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-14kvm: fix waitqueue_active without memory barrier in virt/kvm/async_pf.cKosuke Tatsukawa
async_pf_execute() seems to be missing a memory barrier which might cause the waker to not notice the waiter and miss sending a wake_up as in the following figure. async_pf_execute kvm_vcpu_block ------------------------------------------------------------------------ spin_lock(&vcpu->async_pf.lock); if (waitqueue_active(&vcpu->wq)) /* The CPU might reorder the test for the waitqueue up here, before prior writes complete */ prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE); /*if (kvm_vcpu_check_block(vcpu) < 0) */ /*if (kvm_arch_vcpu_runnable(vcpu)) { */ ... return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE && !vcpu->arch.apf.halted) || !list_empty_careful(&vcpu->async_pf.done) ... return 0; list_add_tail(&apf->link, &vcpu->async_pf.done); spin_unlock(&vcpu->async_pf.lock); waited = true; schedule(); ------------------------------------------------------------------------ The attached patch adds the missing memory barrier. I found this issue when I was looking through the linux source code for places calling waitqueue_active() before wake_up*(), but without preceding memory barriers, after sending a patch to fix a similar issue in drivers/tty/n_tty.c (Details about the original issue can be found here: https://lkml.org/lkml/2015/9/28/849). Signed-off-by: Kosuke Tatsukawa <tatsu@ab.jp.nec.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-09arm/arm64: KVM: Only allow 64bit hosts to build VGICv3Jean-Philippe Brucker
Hardware virtualisation of GICv3 is only supported by 64bit hosts for the moment. Some VGICv3 bits are missing from the 32bit side, and this patch allows to still be able to build 32bit hosts when CONFIG_ARM_GIC_V3 is selected. To this end, we introduce a new option, CONFIG_KVM_ARM_VGIC_V3, that is only enabled on the 64bit side. The selection is done unconditionally because CONFIG_ARM_GIC_V3 is always enabled on arm64. Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2015-10-01KVM: Update Posted-Interrupts Descriptor when vCPU is blockedFeng Wu
This patch updates the Posted-Interrupts Descriptor when vCPU is blocked. pre-block: - Add the vCPU to the blocked per-CPU list - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR post-block: - Remove the vCPU from the per-CPU list Signed-off-by: Feng Wu <feng.wu@intel.com> [Concentrate invocation of pre/post-block hooks to vcpu_block. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-01KVM: Add an arch specific hooks in 'struct kvm_kernel_irqfd'Feng Wu
This patch adds an arch specific hooks 'arch_update' in 'struct kvm_kernel_irqfd'. On Intel side, it is used to update the IRTE when VT-d posted-interrupts is used. Signed-off-by: Feng Wu <feng.wu@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-01KVM: eventfd: add irq bypass consumer managementEric Auger
This patch adds the registration/unregistration of an irq_bypass_consumer on irqfd assignment/deassignment. Signed-off-by: Eric Auger <eric.auger@linaro.org> Signed-off-by: Feng Wu <feng.wu@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-01KVM: introduce kvm_arch functions for IRQ bypassEric Auger
This patch introduces - kvm_arch_irq_bypass_add_producer - kvm_arch_irq_bypass_del_producer - kvm_arch_irq_bypass_stop - kvm_arch_irq_bypass_start They make possible to specialize the KVM IRQ bypass consumer in case CONFIG_KVM_HAVE_IRQ_BYPASS is set. Signed-off-by: Eric Auger <eric.auger@linaro.org> [Add weak implementations of the callbacks. - Feng] Signed-off-by: Feng Wu <feng.wu@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-01KVM: create kvm_irqfd.hEric Auger
Move _irqfd_resampler and _irqfd struct declarations in a new public header: kvm_irqfd.h. They are respectively renamed into kvm_kernel_irqfd_resampler and kvm_kernel_irqfd. Those datatypes will be used by architecture specific code, in the context of IRQ bypass manager integration. Signed-off-by: Eric Auger <eric.auger@linaro.org> Signed-off-by: Feng Wu <feng.wu@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-01kvm: add capability for any-length ioeventfdsJason Wang
Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-01kvm: use kmalloc() instead of kzalloc() during iodev register/unregisterJason Wang
All fields of kvm_io_range were initialized or copied explicitly afterwards. So switch to use kmalloc(). Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-01KVM: x86: Add EOI exit bitmap inferenceSteve Rutherford
In order to support a userspace IOAPIC interacting with an in kernel APIC, the EOI exit bitmaps need to be configurable. If the IOAPIC is in userspace (i.e. the irqchip has been split), the EOI exit bitmaps will be set whenever the GSI Routes are configured. In particular, for the low MSI routes are reservable for userspace IOAPICs. For these MSI routes, the EOI Exit bit corresponding to the destination vector of the route will be set for the destination VCPU. The intention is for the userspace IOAPICs to use the reservable MSI routes to inject interrupts into the guest. This is a slight abuse of the notion of an MSI Route, given that MSIs classically bypass the IOAPIC. It might be worthwhile to add an additional route type to improve clarity. Compile tested for Intel x86. Signed-off-by: Steve Rutherford <srutherford@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-09-25KVM: disable halt_poll_ns as default for s390xDavid Hildenbrand
We observed some performance degradation on s390x with dynamic halt polling. Until we can provide a proper fix, let's enable halt_poll_ns as default only for supported architectures. Architectures are now free to set their own halt_poll_ns default value. Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-09-17Merge tag 'kvm-arm-for-4.3-rc2-2' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master Second set of KVM/ARM changes for 4.3-rc2 - Workaround for a Cortex-A57 erratum - Bug fix for the debugging infrastructure - Fix for 32bit guests with more than 4GB of address space on a 32bit host - A number of fixes for the (unusual) case when we don't use the in-kernel GIC emulation - Removal of ThumbEE handling on arm64, since these have been dropped from the architecture before anyone actually ever built a CPU - Remove the KVM_ARM_MAX_VCPUS limitation which has become fairly pointless
2015-09-17arm/arm64: KVM: Remove 'config KVM_ARM_MAX_VCPUS'Ming Lei
This patch removes config option of KVM_ARM_MAX_VCPUS, and like other ARCHs, just choose the maximum allowed value from hardware, and follows the reasons: 1) from distribution view, the option has to be defined as the max allowed value because it need to meet all kinds of virtulization applications and need to support most of SoCs; 2) using a bigger value doesn't introduce extra memory consumption, and the help text in Kconfig isn't accurate because kvm_vpu structure isn't allocated until request of creating VCPU is sent from QEMU; 3) the main effect is that the field of vcpus[] in 'struct kvm' becomes a bit bigger(sizeof(void *) per vcpu) and need more cache lines to hold the structure, but 'struct kvm' is one generic struct, and it has worked well on other ARCHs already in this way. Also, the world switch frequecy is often low, for example, it is ~2000 when running kernel building load in VM from APM xgene KVM host, so the effect is very small, and the difference can't be observed in my test at all. Cc: Dann Frazier <dann.frazier@canonical.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2015-09-16KVM: add halt_attempted_poll to VCPU statsPaolo Bonzini
This new statistic can help diagnosing VCPUs that, for any reason, trigger bad behavior of halt_poll_ns autotuning. For example, say halt_poll_ns = 480000, and wakeups are spaced exactly like 479us, 481us, 479us, 481us. Then KVM always fails polling and wastes 10+20+40+80+160+320+480 = 1110 microseconds out of every 479+481+479+481+479+481+479 = 3359 microseconds. The VCPU then is consuming about 30% more CPU than it would use without polling. This would show as an abnormally high number of attempted polling compared to the successful polls. Acked-by: Christian Borntraeger <borntraeger@de.ibm.com< Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-09-15kvm: fix zero length mmio searchingJason Wang
Currently, if we had a zero length mmio eventfd assigned on KVM_MMIO_BUS. It will never be found by kvm_io_bus_cmp() since it always compares the kvm_io_range() with the length that guest wrote. This will cause e.g for vhost, kick will be trapped by qemu userspace instead of vhost. Fixing this by using zero length if an iodevice is zero length. Cc: stable@vger.kernel.org Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-09-15kvm: fix double free for fast mmio eventfdJason Wang
We register wildcard mmio eventfd on two buses, once for KVM_MMIO_BUS and once on KVM_FAST_MMIO_BUS but with a single iodev instance. This will lead to an issue: kvm_io_bus_destroy() knows nothing about the devices on two buses pointing to a single dev. Which will lead to double free[1] during exit. Fix this by allocating two instances of iodevs then registering one on KVM_MMIO_BUS and another on KVM_FAST_MMIO_BUS. CPU: 1 PID: 2894 Comm: qemu-system-x86 Not tainted 3.19.0-26-generic #28-Ubuntu Hardware name: LENOVO 2356BG6/2356BG6, BIOS G7ET96WW (2.56 ) 09/12/2013 task: ffff88009ae0c4b0 ti: ffff88020e7f0000 task.ti: ffff88020e7f0000 RIP: 0010:[<ffffffffc07e25d8>] [<ffffffffc07e25d8>] ioeventfd_release+0x28/0x60 [kvm] RSP: 0018:ffff88020e7f3bc8 EFLAGS: 00010292 RAX: dead000000200200 RBX: ffff8801ec19c900 RCX: 000000018200016d RDX: ffff8801ec19cf80 RSI: ffffea0008bf1d40 RDI: ffff8801ec19c900 RBP: ffff88020e7f3bd8 R08: 000000002fc75a01 R09: 000000018200016d R10: ffffffffc07df6ae R11: ffff88022fc75a98 R12: ffff88021e7cc000 R13: ffff88021e7cca48 R14: ffff88021e7cca50 R15: ffff8801ec19c880 FS: 00007fc1ee3e6700(0000) GS:ffff88023e240000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f8f389d8000 CR3: 000000023dc13000 CR4: 00000000001427e0 Stack: ffff88021e7cc000 0000000000000000 ffff88020e7f3be8 ffffffffc07e2622 ffff88020e7f3c38 ffffffffc07df69a ffff880232524160 ffff88020e792d80 0000000000000000 ffff880219b78c00 0000000000000008 ffff8802321686a8 Call Trace: [<ffffffffc07e2622>] ioeventfd_destructor+0x12/0x20 [kvm] [<ffffffffc07df69a>] kvm_put_kvm+0xca/0x210 [kvm] [<ffffffffc07df818>] kvm_vcpu_release+0x18/0x20 [kvm] [<ffffffff811f69f7>] __fput+0xe7/0x250 [<ffffffff811f6bae>] ____fput+0xe/0x10 [<ffffffff81093f04>] task_work_run+0xd4/0xf0 [<ffffffff81079358>] do_exit+0x368/0xa50 [<ffffffff81082c8f>] ? recalc_sigpending+0x1f/0x60 [<ffffffff81079ad5>] do_group_exit+0x45/0xb0 [<ffffffff81085c71>] get_signal+0x291/0x750 [<ffffffff810144d8>] do_signal+0x28/0xab0 [<ffffffff810f3a3b>] ? do_futex+0xdb/0x5d0 [<ffffffff810b7028>] ? __wake_up_locked_key+0x18/0x20 [<ffffffff810f3fa6>] ? SyS_futex+0x76/0x170 [<ffffffff81014fc9>] do_notify_resume+0x69/0xb0 [<ffffffff817cb9af>] int_signal+0x12/0x17 Code: 5d c3 90 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 7f 20 e8 06 d6 a5 c0 48 8b 43 08 48 8b 13 48 89 df 48 89 42 08 <48> 89 10 48 b8 00 01 10 00 00 RIP [<ffffffffc07e25d8>] ioeventfd_release+0x28/0x60 [kvm] RSP <ffff88020e7f3bc8> Cc: stable@vger.kernel.org Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-09-15kvm: factor out core eventfd assign/deassign logicJason Wang
This patch factors out core eventfd assign/deassign logic and leaves the argument checking and bus index selection to callers. Cc: stable@vger.kernel.org Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-09-15kvm: don't try to register to KVM_FAST_MMIO_BUS for non mmio eventfdJason Wang
We only want zero length mmio eventfd to be registered on KVM_FAST_MMIO_BUS. So check this explicitly when arg->len is zero to make sure this. Cc: stable@vger.kernel.org Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-09-14KVM: make the declaration of functions within 80 charactersWei Yang
After 'commit 0b8ba4a2b658 ("KVM: fix checkpatch.pl errors in kvm/coalesced_mmio.h")', the declaration of the two function will exceed 80 characters. This patch reduces the TAPs to make each line in 80 characters. Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-09-14Merge tag 'kvm-arm-for-4.3-rc2' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master KVM/ARM changes for 4.3-rc2 - Fix timer interrupt injection after the rework that went in during the merge window - Reset the timer to zero on reboot - Make sure the TCR_EL2 RES1 bits are really set to 1 - Fix a PSCI affinity bug for non-existing vcpus
2015-09-14KVM: fix polling for guest halt continued even if disable itWanpeng Li
If there is already some polling ongoing, it's impossible to disable the polling, since as soon as somebody sets halt_poll_ns to 0, polling will never stop, as grow and shrink are only handled if halt_poll_ns is != 0. This patch fix it by reset vcpu->halt_poll_ns in order to stop polling when polling is disabled. Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-09-10Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge third patch-bomb from Andrew Morton: - even more of the rest of MM - lib/ updates - checkpatch updates - small changes to a few scruffy filesystems - kmod fixes/cleanups - kexec updates - a dma-mapping cleanup series from hch * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (81 commits) dma-mapping: consolidate dma_set_mask dma-mapping: consolidate dma_supported dma-mapping: cosolidate dma_mapping_error dma-mapping: consolidate dma_{alloc,free}_noncoherent dma-mapping: consolidate dma_{alloc,free}_{attrs,coherent} mm: use vma_is_anonymous() in create_huge_pmd() and wp_huge_pmd() mm: make sure all file VMAs have ->vm_ops set mm, mpx: add "vm_flags_t vm_flags" arg to do_mmap_pgoff() mm: mark most vm_operations_struct const namei: fix warning while make xmldocs caused by namei.c ipc: convert invalid scenarios to use WARN_ON zlib_deflate/deftree: remove bi_reverse() lib/decompress_unlzma: Do a NULL check for pointer lib/decompressors: use real out buf size for gunzip with kernel fs/affs: make root lookup from blkdev logical size sysctl: fix int -> unsigned long assignments in INT_MIN case kexec: export KERNEL_IMAGE_SIZE to vmcoreinfo kexec: align crash_notes allocation to make it be inside one physical page kexec: remove unnecessary test in kimage_alloc_crash_control_pages() kexec: split kexec_load syscall from kexec core code ...
2015-09-10mmu-notifier: add clear_young callbackVladimir Davydov
In the scope of the idle memory tracking feature, which is introduced by the following patch, we need to clear the referenced/accessed bit not only in primary, but also in secondary ptes. The latter is required in order to estimate wss of KVM VMs. At the same time we want to avoid flushing tlb, because it is quite expensive and it won't really affect the final result. Currently, there is no function for clearing pte young bit that would meet our requirements, so this patch introduces one. To achieve that we have to add a new mmu-notifier callback, clear_young, since there is no method for testing-and-clearing a secondary pte w/o flushing tlb. The new method is not mandatory and currently only implemented by KVM. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08kvm: irqchip: fix memory leakSudip Mukherjee
We were taking the exit path after checking ue->flags and return value of setup_routing_entry(), but 'e' was not freed incase of a failure. Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>