aboutsummaryrefslogtreecommitdiff
path: root/hw/vfio/common.c
AgeCommit message (Collapse)Author
2020-11-01vfio: fix incorrect print typeZhengui li
The type of input variable is unsigned int while the printer type is int. So fix incorrect print type. Signed-off-by: Zhengui li <lizhengui@huawei.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2020-11-01vfio: Add routine for finding VFIO_DEVICE_GET_INFO capabilitiesMatthew Rosato
Now that VFIO_DEVICE_GET_INFO supports capability chains, add a helper function to find specific capabilities in the chain. Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2020-11-01vfio: Find DMA available capabilityMatthew Rosato
The underlying host may be limiting the number of outstanding DMA requests for type 1 IOMMU. Add helper functions to check for the DMA available capability and retrieve the current number of DMA mappings allowed. Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> [aw: vfio_get_info_dma_avail moved inside CONFIG_LINUX] Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2020-11-01vfio: Create shared routine for scanning info capabilitiesMatthew Rosato
Rather than duplicating the same loop in multiple locations, create a static function to do the work. Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2020-11-01qapi: Add VFIO devices migration stats in Migration statsKirti Wankhede
Added amount of bytes transferred to the VM at destination by all VFIO devices Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2020-11-01vfio: Add ioctl to get dirty pages bitmap during dma unmapKirti Wankhede
With vIOMMU, IO virtual address range can get unmapped while in pre-copy phase of migration. In that case, unmap ioctl should return pages pinned in that range and QEMU should find its correcponding guest physical addresses and report those dirty. Suggested-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com> Reviewed-by: Neo Jia <cjia@nvidia.com> [aw: fix error_report types, fix cpu_physical_memory_set_dirty_lebitmap() cast] Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2020-11-01vfio: Dirty page tracking when vIOMMU is enabledKirti Wankhede
When vIOMMU is enabled, register MAP notifier from log_sync when all devices in container are in stop and copy phase of migration. Call replay and get dirty pages from notifier callback. Suggested-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2020-11-01vfio: Add vfio_listener_log_sync to mark dirty pagesKirti Wankhede
vfio_listener_log_sync gets list of dirty pages from container using VFIO_IOMMU_GET_DIRTY_BITMAP ioctl and mark those pages dirty when all devices are stopped and saving state. Return early for the RAM block section of mapped MMIO region. Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com> Reviewed-by: Neo Jia <cjia@nvidia.com> [aw: fix error_report types, fix cpu_physical_memory_set_dirty_lebitmap() cast] Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2020-11-01vfio: Get migration capability flags for containerKirti Wankhede
Added helper functions to get IOMMU info capability chain. Added function to get migration capability information from that capability chain for IOMMU container. Similar change was proposed earlier: https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg03759.html Disable migration for devices if IOMMU module doesn't support migration capability. Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com> Cc: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Cc: Eric Auger <eric.auger@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2020-11-01vfio: Add function to unmap VFIO regionKirti Wankhede
This function will be used for migration region. Migration region is mmaped when migration starts and will be unmapped when migration is complete. Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com> Reviewed-by: Neo Jia <cjia@nvidia.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2020-09-16util: rename qemu_open() to qemu_open_old()Daniel P. Berrangé
We want to introduce a new version of qemu_open() that uses an Error object for reporting problems and make this it the preferred interface. Rename the existing method to release the namespace for the new impl. Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
2020-07-02vfio: Convert to ram_block_discard_disable()David Hildenbrand
VFIO is (except devices without a physical IOMMU or some mediated devices) incompatible with discarding of RAM. The kernel will pin basically all VM memory. Let's convert to ram_block_discard_disable(), which can now fail, in contrast to qemu_balloon_inhibit(). Leave "x-balloon-allowed" named as it is for now. Reviewed-by: Tony Krowiak <akrowiak@linux.ibm.com> Acked-by: Cornelia Huck <cohuck@redhat.com> Cc: Cornelia Huck <cohuck@redhat.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Tony Krowiak <akrowiak@linux.ibm.com> Cc: Halil Pasic <pasic@linux.ibm.com> Cc: Pierre Morel <pmorel@linux.ibm.com> Cc: Eric Farman <farman@linux.ibm.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20200626072248.78761-4-david@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-02-18Report stringified errno in VFIO related errorsMichal Privoznik
In a few places we report errno formatted as a negative integer. This is not as user friendly as it can be. Use strerror() and/or error_setg_errno() instead. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> Reviewed-by: Ján Tomko <jtomko@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Message-Id: <4949c3ecf1a32189b8a4b5eb4b0fd04c1122501d.1581674006.git.mprivozn@redhat.com> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
2019-10-04memory: allow memory_region_register_iommu_notifier() to failEric Auger
Currently, when a notifier is attempted to be registered and its flags are not supported (especially the MAP one) by the IOMMU MR, we generally abruptly exit in the IOMMU code. The failure could be handled more nicely in the caller and especially in the VFIO code. So let's allow memory_region_register_iommu_notifier() to fail as well as notify_flag_changed() callback. All sites implementing the callback are updated. This patch does not yet remove the exit(1) in the amd_iommu code. in SMMUv3 we turn the warning message into an error message saying that the assigned device would not work properly. Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-10-04vfio: Turn the container error into an Error handleEric Auger
The container error integer field is currently used to store the first error potentially encountered during any vfio_listener_region_add() call. However this fails to propagate detailed error messages up to the vfio_connect_container caller. Instead of using an integer, let's use an Error handle. Messages are slightly reworded to accomodate the propagation. Signed-off-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-16Include qemu/main-loop.h lessMarkus Armbruster
In my "build everything" tree, changing qemu/main-loop.h triggers a recompile of some 5600 out of 6600 objects (not counting tests and objects that don't depend on qemu/osdep.h). It includes block/aio.h, which in turn includes qemu/event_notifier.h, qemu/notify.h, qemu/processor.h, qemu/qsp.h, qemu/queue.h, qemu/thread-posix.h, qemu/thread.h, qemu/timer.h, and a few more. Include qemu/main-loop.h only where it's needed. Touching it now recompiles only some 1700 objects. For block/aio.h and qemu/event_notifier.h, these numbers drop from 5600 to 2800. For the others, they shrink only slightly. Signed-off-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20190812052359.30071-21-armbru@redhat.com> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com>
2019-08-16Include sysemu/reset.h a lot lessMarkus Armbruster
In my "build everything" tree, changing sysemu/reset.h triggers a recompile of some 2600 out of 6600 objects (not counting tests and objects that don't depend on qemu/osdep.h). The main culprit is hw/hw.h, which supposedly includes it for convenience. Include sysemu/reset.h only where it's needed. Touching it now recompiles less than 200 objects. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: Alistair Francis <alistair.francis@wdc.com> Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-Id: <20190812052359.30071-9-armbru@redhat.com>
2019-06-13vfio/common: Introduce vfio_set_irq_signaling helperEric Auger
The code used to assign an interrupt index/subindex to an eventfd is duplicated many times. Let's introduce an helper that allows to set/unset the signaling for an ACTION_TRIGGER, ACTION_MASK or ACTION_UNMASK action. In the error message, we now use errno in case of any VFIO_DEVICE_SET_IRQS ioctl failure. Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Li Qiang <liq3ea@gmail.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2019-03-12vfio: Make vfio_get_region_info_cap publicAlexey Kardashevskiy
This makes vfio_get_region_info_cap() to be used in quirks. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Alex Williamson <alex.williamson@redhat.com> Message-Id: <20190307050518.64968-3-aik@ozlabs.ru> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-02-21hw/vfio/common: Refactor container initializationEric Auger
We introduce the vfio_init_container_type() helper. It computes the highest usable iommu type and then set the container and the iommu type. Its usage in vfio_connect_container() makes the code ready for addition of new iommu types. Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2019-02-21vfio/common: Work around kernel overflow bug in DMA unmapAlex Williamson
A kernel bug was introduced in v4.15 via commit 71a7d3d78e3c which adds a test for address space wrap-around in the vfio DMA unmap path. Unfortunately due to overflow, the kernel detects an unmap of the last page in the 64-bit address space as a wrap-around. In QEMU, a Q35 guest with VT-d emulation and guest IOMMU enabled will attempt to make such an unmap request during VM system reset, triggering an error: qemu-kvm: VFIO_UNMAP_DMA: -22 qemu-kvm: vfio_dma_unmap(0x561f059948f0, 0xfef00000, 0xffffffff01100000) = -22 (Invalid argument) Here the IOVA start address (0xfef00000) and the size parameter (0xffffffff01100000) add to exactly 2^64, triggering the bug. A kernel fix is queued for the Linux v5.0 release to address this. This patch implements a workaround to retry the unmap, excluding the final page of the range when we detect an unmap failing which matches the requirements for this issue. This is expected to be a safe and complete workaround as the VT-d address space does not extend to the full 64-bit space and therefore the last page should never be mapped. This workaround can be removed once all kernels with this bug are sufficiently deprecated. Link: https://bugzilla.redhat.com/show_bug.cgi?id=1662291 Reported-by: Pei Zhang <pezhang@redhat.com> Debugged-by: Peter Xu <peterx@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2019-01-11qemu/queue.h: typedef QTAILQ headsPaolo Bonzini
This will be needed when we change the QTAILQ head and elem structs to unions. However, it is also consistent with the usage elsewhere in QEMU for other list head structs (see for example FsMountList). Note that most QTAILQs only need their name in order to do backwards walks. Those do not break with the struct->union change, and anyway the change will also remove the need to name heads when doing backwards walks, so those are not touched here. Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-01-11vfio: make vfio_address_spaces staticPaolo Bonzini
It is not used outside hw/vfio/common.c, so it does not need to be extern. Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-23vfio/pci: Fix failure to close file descriptor on errorAlex Williamson
A new error path fails to close the device file descriptor when triggered by a ballooning incompatibility within the group. Fix it. Fixes: 238e91728503 ("vfio/ccw/pci: Allow devices to opt-in for ballooning") Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-08-21vfio/spapr: Allow backing bigger guest IOMMU pages with smaller physical pagesAlexey Kardashevskiy
At the moment the PPC64/pseries guest only supports 4K/64K/16M IOMMU pages and POWER8 CPU supports the exact same set of page size so so far things worked fine. However POWER9 supports different set of sizes - 4K/64K/2M/1G and the last two - 2M and 1G - are not even allowed in the paravirt interface (RTAS DDW) so we always end up using 64K IOMMU pages, although we could back guest's 16MB IOMMU pages with 2MB pages on the host. This stores the supported host IOMMU page sizes in VFIOContainer and uses this later when creating a new DMA window. This uses the system page size (64k normally, 2M/16M/1G if hugepages used) as the upper limit of the IOMMU pagesize. This changes the type of @pagesize to uint64_t as this is what memory_region_iommu_get_min_page_size() returns and clz64() takes. There should be no behavioral changes on platforms other than pseries. The guest will keep using the IOMMU page size selected by the PHB pagesize property as this only changes the underlying hardware TCE table granularity. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2018-08-17vfio/ccw/pci: Allow devices to opt-in for ballooningAlex Williamson
If a vfio assigned device makes use of a physical IOMMU, then memory ballooning is necessarily inhibited due to the page pinning, lack of page level granularity at the IOMMU, and sufficient notifiers to both remove the page on balloon inflation and add it back on deflation. However, not all devices are backed by a physical IOMMU. In the case of mediated devices, if a vendor driver is well synchronized with the guest driver, such that only pages actively used by the guest driver are pinned by the host mdev vendor driver, then there should be no overlap between pages available for the balloon driver and pages actively in use by the device. Under these conditions, ballooning should be safe. vfio-ccw devices are always mediated devices and always operate under the constraints above. Therefore we can consider all vfio-ccw devices as balloon compatible. The situation is far from straightforward with vfio-pci. These devices can be physical devices with physical IOMMU backing or mediated devices where it is unknown whether a physical IOMMU is in use or whether the vendor driver is well synchronized to the working set of the guest driver. The safest approach is therefore to assume all vfio-pci devices are incompatible with ballooning, but allow user opt-in should they have further insight into mediated devices. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-08-17vfio: Inhibit ballooning based on group attachment to a containerAlex Williamson
We use a VFIOContainer to associate an AddressSpace to one or more VFIOGroups. The VFIOContainer represents the DMA context for that AdressSpace for those VFIOGroups and is synchronized to changes in that AddressSpace via a MemoryListener. For IOMMU backed devices, maintaining the DMA context for a VFIOGroup generally involves pinning a host virtual address in order to create a stable host physical address and then mapping a translation from the associated guest physical address to that host physical address into the IOMMU. While the above maintains the VFIOContainer synchronized to the QEMU memory API of the VM, memory ballooning occurs outside of that API. Inflating the memory balloon (ie. cooperatively capturing pages from the guest for use by the host) simply uses MADV_DONTNEED to "zap" pages from QEMU's host virtual address space. The page pinning and IOMMU mapping above remains in place, negating the host's ability to reuse the page, but the host virtual to host physical mapping of the page is invalidated outside of QEMU's memory API. When the balloon is later deflated, attempting to cooperatively return pages to the guest, the page is simply freed by the guest balloon driver, allowing it to be used in the guest and incurring a page fault when that occurs. The page fault maps a new host physical page backing the existing host virtual address, meanwhile the VFIOContainer still maintains the translation to the original host physical address. At this point the guest vCPU and any assigned devices will map different host physical addresses to the same guest physical address. Badness. The IOMMU typically does not have page level granularity with which it can track this mapping without also incurring inefficiencies in using page size mappings throughout. MMU notifiers in the host kernel also provide indicators for invalidating the mapping on balloon inflation, not for updating the mapping when the balloon is deflated. For these reasons we assume a default behavior that the mapping of each VFIOGroup into the VFIOContainer is incompatible with memory ballooning and increment the balloon inhibitor to match the attached VFIOGroups. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-06-15iommu: Add IOMMU index argument to notifier APIsPeter Maydell
Add support for multiple IOMMU indexes to the IOMMU notifier APIs. When initializing a notifier with iommu_notifier_init(), the caller must pass the IOMMU index that it is interested in. When a change happens, the IOMMU implementation must pass memory_region_notify_iommu() the IOMMU index that has changed and that notifiers must be called for. IOMMUs which support only a single index don't need to change. Callers which only really support working with IOMMUs with a single index can use the result of passing MEMTXATTRS_UNSPECIFIED to memory_region_iommu_attrs_to_index(). Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Message-id: 20180604152941.20374-3-peter.maydell@linaro.org
2018-05-31Make address_space_translate{, _cached}() take a MemTxAttrs argumentPeter Maydell
As part of plumbing MemTxAttrs down to the IOMMU translate method, add MemTxAttrs as an argument to address_space_translate() and address_space_translate_cached(). Callers either have an attrs value to hand, or don't care and can use MEMTXATTRS_UNSPECIFIED. Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-id: 20180521140402.23318-4-peter.maydell@linaro.org
2018-04-05vfio: Use a trace point when a RAM section cannot be DMA mappedEric Auger
Commit 567b5b309abe ("vfio/pci: Relax DMA map errors for MMIO regions") added an error message if a passed memory section address or size is not aligned to the page size and thus cannot be DMA mapped. This patch fixes the trace by printing the region name and the memory region section offset within the address space (instead of offset_within_region). We also turn the error_report into a trace event. Indeed, In some cases, the traces can be confusing to non expert end-users and let think the use case does not work (whereas it works as before). This is the case where a BAR is successively mapped at different GPAs and its sections are not compatible with dma map. The listener is called several times and traces are issued for each intermediate mapping. The end-user cannot easily match those GPAs against the final GPA output by lscpi. So let's keep those information to informed users. In mid term, the plan is to advise the user about BAR relocation relevance. Fixes: 567b5b309abe ("vfio/pci: Relax DMA map errors for MMIO regions") Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-03-13vfio-pci: Allow mmap of MSIX BARAlexey Kardashevskiy
At the moment we unconditionally avoid mapping MSIX data of a BAR and emulate MSIX table in QEMU. However it is 1) not always necessary as a platform may provide a paravirt interface for MSIX configuration; 2) can affect the speed of MMIO access by emulating them in QEMU when frequently accessed registers share same system page with MSIX data, this is particularly a problem for systems with the page size bigger than 4KB. A new capability - VFIO_REGION_INFO_CAP_MSIX_MAPPABLE - has been added to the kernel [1] which tells the userspace that mapping of the MSIX data is possible now. This makes use of it so from now on QEMU tries mapping the entire BAR as a whole and emulate MSIX on top of that. [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a32295c612c57990d17fb0f41e7134394b2f35f6 Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-03-13vfio/pci: Relax DMA map errors for MMIO regionsAlexey Kardashevskiy
At the moment if vfio_memory_listener is registered in the system memory address space, it maps/unmaps every RAM memory region for DMA. It expects system page size aligned memory sections so vfio_dma_map would not fail and so far this has been the case. A mapping failure would be fatal. A side effect of such behavior is that some MMIO pages would not be mapped silently. However we are going to change MSIX BAR handling so we will end having non-aligned sections in vfio_memory_listener (more details is in the next patch) and vfio_dma_map will exit QEMU. In order to avoid fatal failures on what previously was not a failure and was just silently ignored, this checks the section alignment to the smallest supported IOMMU page size and prints an error if not aligned; it also prints an error if vfio_dma_map failed despite the page size check. Both errors are not fatal; only MMIO RAM regions are checked (aka "RAM device" regions). If the amount of errors printed is overwhelming, the MSIX relocation could be used to avoid excessive error output. This is unlikely to cause any behavioral change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [aw: Fix Int128 bit ops] Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-03-13vfio/common: cleanup in vfio_region_finalizeGerd Hoffmann
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com> Reviewed by: Kirti Wankhede <kwankhede@nvidia.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-02-07Merge remote-tracking branch 'remotes/bonzini/tags/for-upstream' into stagingPeter Maydell
* socket option parsing fix (Daniel) * SCSI fixes (Fam) * Readline double-free fix (Greg) * More HVF attribution fixes (Izik) * WHPX (Windows Hypervisor Platform Extensions) support (Justin) * POLLHUP handler (Klim) * ivshmem fixes (Ladi) * memfd memory backend (Marc-André) * improved error message (Marcelo) * Memory fixes (Peter Xu, Zhecheng) * Remove obsolete code and comments (Peter M.) * qdev API improvements (Philippe) * Add CONFIG_I2C switch (Thomas) # gpg: Signature made Wed 07 Feb 2018 15:24:08 GMT # gpg: using RSA key BFFBD25F78C7AE83 # gpg: Good signature from "Paolo Bonzini <bonzini@gnu.org>" # gpg: aka "Paolo Bonzini <pbonzini@redhat.com>" # Primary key fingerprint: 46F5 9FBD 57D6 12E7 BFD4 E2F7 7E15 100C CD36 69B1 # Subkey fingerprint: F133 3857 4B66 2389 866C 7682 BFFB D25F 78C7 AE83 * remotes/bonzini/tags/for-upstream: (47 commits) Add the WHPX acceleration enlightenments Introduce the WHPX impl Add the WHPX vcpu API Add the Windows Hypervisor Platform accelerator. tests/test-filter-redirector: move close() tests: use memfd in vhost-user-test vhost-user-test: make read-guest-mem setup its own qemu tests: keep compiling failing vhost-user tests Add memfd based hostmem memfd: add hugetlbsize argument memfd: add hugetlb support memfd: add error argument, instead of perror() cpus: join thread when removing a vCPU cpus: hvf: unregister thread with RCU cpus: tcg: unregister thread with RCU, fix exiting of loop on unplug cpus: dummy: unregister thread with RCU, exit loop on unplug cpus: kvm: unregister thread with RCU cpus: hax: register/unregister thread with RCU, exit loop on unplug ivshmem: Disable irqfd on device reset ivshmem: Improve MSI irqfd error handling ... Signed-off-by: Peter Maydell <peter.maydell@linaro.org> # Conflicts: # cpus.c
2018-02-07vfio: listener unregister before unset containerPeter Xu
After next patch, listener unregister will need the container to be alive. Let's move this unregister phase to be before unset container, since that operation will free the backend container in kernel, otherwise we'll get these after next patch: qemu-system-x86_64: VFIO_UNMAP_DMA: -22 qemu-system-x86_64: vfio_dma_unmap(0x559bf53a4590, 0x0, 0xa0000) = -22 (Invalid argument) Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20180122060244.29368-4-peterx@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-02-06vfio/common: Remove redundant copy of local variableAlexey Kardashevskiy
There is already @hostwin in vfio_listener_region_add() so there is no point in having the other one. Fixes: 2e4109de8e58 ("vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-02-06vfio/spapr: Use iommu memory region's get_attr()Alexey Kardashevskiy
In order to enable TCE operations support in KVM, we have to inform the KVM about VFIO groups being attached to specific LIOBNs. The KVM already knows about VFIO groups, the only bit missing is which in-kernel TCE table (the one with user visible TCEs) should update the attached broups. There is an KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE attribute of the VFIO KVM device which receives a groupfd/tablefd couple. This uses a new memory_region_iommu_get_attr() helper to get the IOMMU fd and calls KVM to establish the link. As get_attr() is not implemented yet, this should cause no behavioural change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-12-13vfio/spapr: Allow fallback to SPAPR TCE IOMMU v1Alexey Kardashevskiy
The vfio_iommu_spapr_tce driver advertises kernel's support for v1 and v2 IOMMU support, however it is not always possible to use the requested IOMMU type. For example, a pseries host platform does not support dynamic DMA windows so v2 cannot initialize and QEMU fails to start. This adds a fallback to the v1 IOMMU if v2 cannot be used. Fixes: 318f67ce1371 ("vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2)") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-12-13vfio/common: init giommu_list and hostwin_list of vfio containerLiu, Yi L
The init of giommu_list and hostwin_list is missed during container initialization. Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-12-13vfio: Fix vfio-kvm group registrationAlex Williamson
Commit 8c37faa475f3 ("vfio-pci, ppc64/spapr: Reorder group-to-container attaching") moved registration of groups with the vfio-kvm device from vfio_get_group() to vfio_connect_container(), but it missed the case where a group is attached to an existing container and takes an early exit. Perhaps this is a less common case on ppc64/spapr, but on x86 (without viommu) all groups are connected to the same container and thus only the first group gets registered with the vfio-kvm device. This becomes a problem if we then hot-unplug the devices associated with that first group and we end up with KVM being misinformed about any vfio connections that might remain. Fix by including the call to vfio_kvm_device_add_group() in this early exit path. Fixes: 8c37faa475f3 ("vfio-pci, ppc64/spapr: Reorder group-to-container attaching") Cc: qemu-stable@nongnu.org # qemu-2.10+ Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Peter Xu <peterx@redhat.com> Tested-by: Peter Xu <peterx@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-07-17vfio-pci, ppc64/spapr: Reorder group-to-container attachingAlexey Kardashevskiy
At the moment VFIO PCI device initialization works as follows: vfio_realize vfio_get_group vfio_connect_container register memory listeners (1) update QEMU groups lists vfio_kvm_device_add_group Then (example for pseries) the machine reset hook triggers region_add() for all regions where listeners from (1) are listening: ppc_spapr_reset spapr_phb_reset spapr_tce_table_enable memory_region_add_subregion vfio_listener_region_add vfio_spapr_create_window This scheme works fine until we need to handle VFIO PCI device hotplug and we want to enable PPC64/sPAPR in-kernel TCE acceleration on, i.e. after PCI hotplug we need a place to call ioctl(vfio_kvm_device_fd, KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE). Since the ioctl needs a LIOBN fd (from sPAPRTCETable) and a IOMMU group fd (from VFIOGroup), vfio_listener_region_add() seems to be the only place for this ioctl(). However this only works during boot time because the machine reset happens strictly after all devices are finalized. When hotplug happens, vfio_listener_region_add() is called when a memory listener is registered but when this happens: 1. new group is not added to the container->group_list yet; 2. VFIO KVM device is unaware of the new IOMMU group. This moves bits around to have all necessary VFIO infrastructure in place for both initial startup and hotplug cases. [aw: ie, register vfio groups with kvm prior to memory listener registration such that kvm-vfio pseudo device ioctls are available during the region_add callback] Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-07-14memory/iommu: QOM'fy IOMMU MemoryRegionAlexey Kardashevskiy
This defines new QOM object - IOMMUMemoryRegion - with MemoryRegion as a parent. This moves IOMMU-related fields from MR to IOMMU MR. However to avoid dymanic QOM casting in fast path (address_space_translate, etc), this adds an @is_iommu boolean flag to MR and provides new helper to do simple cast to IOMMU MR - memory_region_get_iommu. The flag is set in the instance init callback. This defines memory_region_is_iommu as memory_region_get_iommu()!=NULL. This switches MemoryRegion to IOMMUMemoryRegion in most places except the ones where MemoryRegion may be an alias. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Message-Id: <20170711035620.4232-2-aik@ozlabs.ru> Acked-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-07-10vfio: Test realized when using VFIOGroup.device_list iteratorAlex Williamson
VFIOGroup.device_list is effectively our reference tracking mechanism such that we can teardown a group when all of the device references are removed. However, we also use this list from our machine reset handler for processing resets that affect multiple devices. Generally device removals are fully processed (exitfn + finalize) when this reset handler is invoked, however if the removal is triggered via another reset handler (piix4_reset->acpi_pcihp_reset) then the device exitfn may run, but not finalize. In this case we hit asserts when we start trying to access PCI helpers since much of the PCI state of the device is released. To resolve this, add a pointer to the Object DeviceState in our common base-device and skip non-realized devices as we iterate. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-05-25memory: remove the last param in memory_region_iommu_replay()Peter Xu
We were always passing in that one as "false" to assume that's an read operation, and we also assume that IOMMU translation would always have that read permission. A better permission would be IOMMU_NONE since the replay is after all not a real read operation, but just a page table rebuilding process. CC: David Gibson <david@gibson.dropbear.id.au> CC: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Jason Wang <jasowang@redhat.com>
2017-05-03vfio: enable 8-byte reads/writes to vfioJose Ricardo Ziviani
This patch enables 8-byte writes and reads to VFIO. Such implemention is already done but it's missing the 'case' to handle such accesses in both vfio_region_write and vfio_region_read and the MemoryRegionOps: impl.max_access_size and impl.min_access_size. After this patch, 8-byte writes such as: qemu_mutex_lock locked mutex 0x10905ad8 vfio_region_write (0001:03:00.0:region1+0xc0, 0x4140c, 4) vfio_region_write (0001:03:00.0:region1+0xc4, 0xa0000, 4) qemu_mutex_unlock unlocked mutex 0x10905ad8 goes like this: qemu_mutex_lock locked mutex 0x10905ad8 vfio_region_write (0001:03:00.0:region1+0xc0, 0xbfd0008, 8) qemu_mutex_unlock unlocked mutex 0x10905ad8 Signed-off-by: Jose Ricardo Ziviani <joserz@linux.vnet.ibm.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-05-03vfio: Set MemoryRegionOps:max_access_size and min_access_sizeJose Ricardo Ziviani
Sets valid.max_access_size and valid.min_access_size to ensure safe 8-byte accesses to vfio. Today, 8-byte accesses are broken into pairs of 4-byte calls that goes unprotected: qemu_mutex_lock locked mutex 0x10905ad8 vfio_region_write (0001:03:00.0:region1+0xc0, 0x2020c, 4) qemu_mutex_unlock unlocked mutex 0x10905ad8 qemu_mutex_lock locked mutex 0x10905ad8 vfio_region_write (0001:03:00.0:region1+0xc4, 0xa0000, 4) qemu_mutex_unlock unlocked mutex 0x10905ad8 which occasionally leads to: qemu_mutex_lock locked mutex 0x10905ad8 vfio_region_write (0001:03:00.0:region1+0xc0, 0x2030c, 4) qemu_mutex_unlock unlocked mutex 0x10905ad8 qemu_mutex_lock locked mutex 0x10905ad8 vfio_region_write (0001:03:00.0:region1+0xc0, 0x1000c, 4) qemu_mutex_unlock unlocked mutex 0x10905ad8 qemu_mutex_lock locked mutex 0x10905ad8 vfio_region_write (0001:03:00.0:region1+0xc4, 0xb0000, 4) qemu_mutex_unlock unlocked mutex 0x10905ad8 qemu_mutex_lock locked mutex 0x10905ad8 vfio_region_write (0001:03:00.0:region1+0xc4, 0xa0000, 4) qemu_mutex_unlock unlocked mutex 0x10905ad8 causing strange errors in guest OS. With this patch, such accesses are protected by the same lock guard: qemu_mutex_lock locked mutex 0x10905ad8 vfio_region_write (0001:03:00.0:region1+0xc0, 0x2000c, 4) vfio_region_write (0001:03:00.0:region1+0xc4, 0xb0000, 4) qemu_mutex_unlock unlocked mutex 0x10905ad8 This happens because the 8-byte write should be broken into 4-byte writes by memory.c:access_with_adjusted_size() in order to be under the same lock. Today, it's done in exec.c:address_space_write_continue() which was able to handle only 4 bytes due to a zero'ed valid.max_access_size (see exec.c:memory_access_size()). Signed-off-by: Jose Ricardo Ziviani <joserz@linux.vnet.ibm.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-04-20memory: add section range info for IOMMU notifierPeter Xu
In this patch, IOMMUNotifier.{start|end} are introduced to store section information for a specific notifier. When notification occurs, we not only check the notification type (MAP|UNMAP), but also check whether the notified iova range overlaps with the range of specific IOMMU notifier, and skip those notifiers if not in the listened range. When removing an region, we need to make sure we removed the correct VFIOGuestIOMMU by checking the IOMMUNotifier.start address as well. This patch is solving the problem that vfio-pci devices receive duplicated UNMAP notification on x86 platform when vIOMMU is there. The issue is that x86 IOMMU has a (0, 2^64-1) IOMMU region, which is splitted by the (0xfee00000, 0xfeefffff) IRQ region. AFAIK this (splitted IOMMU region) is only happening on x86. This patch also helps vhost to leverage the new interface as well, so that vhost won't get duplicated cache flushes. In that sense, it's an slight performance improvement. Suggested-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <1491562755-23867-2-git-send-email-peterx@redhat.com> [ehabkost: included extra vhost_iommu_region_del() change from Peter Xu] Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
2017-02-17vfio: allow to notify unmap for very large regionPeter Xu
Linux vfio driver supports to do VFIO_IOMMU_UNMAP_DMA for a very big region. This can be leveraged by QEMU IOMMU implementation to cleanup existing page mappings for an entire iova address space (by notifying with an IOTLB with extremely huge addr_mask). However current vfio_iommu_map_notify() does not allow that. It make sure that all the translated address in IOTLB is falling into RAM range. The check makes sense, but it should only be a sensible checker for mapping operations, and mean little for unmap operations. This patch moves this check into map logic only, so that we'll get faster unmap handling (no need to translate again), and also we can then better support unmapping a very big region when it covers non-ram ranges or even not-existing ranges. Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2017-02-17vfio: introduce vfio_get_vaddr()Peter Xu
A cleanup for vfio_iommu_map_notify(). Now we will fetch vaddr even if the operation is unmap, but it won't hurt much. One thing to mention is that we need the RCU read lock to protect the whole translation and map/unmap procedure. Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2017-02-17vfio: trace map/unmap for notify as wellPeter Xu
We traces its range, but we don't know whether it's a MAP/UNMAP. Let's dump it as well. Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>