* [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices
@ 2024-01-17  8:02 Eric Auger
  2024-01-17  8:02 ` [RFC 1/7] hw/pci: Introduce PCIIOMMUOps::set_host_iova_regions Eric Auger
                   ` (7 more replies)
  0 siblings, 8 replies; 20+ messages in thread
From: Eric Auger @ 2024-01-17  8:02 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, zhenzhong.duan, peterx, yanghliu,
	pbonzini
  Cc: mst, clg

In [1] we attempted to fix a case where a VFIO-PCI device protected
with a virtio-iommu was assigned to an x86 guest. On x86 the physical
IOMMU may have an address width (gaw) of 39 or 48 bits whereas the
virtio-iommu used to expose a 64b address space by default.
Hence the guest was trying to use the full 64b space and we hit
DMA MAP failures. To work around this issue we managed to pass
usable IOVA regions (excluding the out of range space) from VFIO
to the virtio-iommu device. This was made feasible by introducing
a new IOMMU Memory Region callback dubbed iommu_set_iova_regions().
The latter gets called when the IOMMU MR is enabled, which
causes vfio_listener_region_add() to be called.
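
For reference, the current path can be summarized as follows (simplified
call-flow sketch, using the names of the code removed in patches 6 and 7):

  IOMMU MR gets enabled (vIOMMU protection activated)
    -> vfio_listener_region_add()
         -> memory_region_iommu_set_iova_ranges()
              -> IOMMUMemoryRegionClass::iommu_set_iova_ranges()
                   -> virtio_iommu_set_iova_ranges()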

However with VFIO-PCI hotplug, this technique fails due to the
race between the call to the callback from the memory listener's
region_add and the virtio-iommu probe request. Indeed the probe
request gets issued before the attach to the domain. So in that case
the usable regions are communicated after the probe request and fail
to be conveyed to the guest. To be honest the problem was hinted at by
Jean-Philippe in [1] and I should have been more careful listening
to him and testing with hotplug :-(

For coldplugged devices the technique works because we make sure all
the IOMMU MRs are enabled once, on machine init done, as 94df5b2180
("virtio-iommu: Fix 64kB host page size VFIO device assignment")
did for the granule freeze. But I would be keen to get rid of this trick.

Using an IOMMU MR Op is impractical because this relies on the IOMMU
MR having been enabled and the corresponding vfio_listener_region_add()
having been executed. Instead this series proposes to replace the usage
of this API with the recently introduced PCIIOMMUOps: ba7d12eb8c
("hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps"). That way, the
callback can be called earlier, once the usable IOVA regions have been
collected by VFIO, without the need for the IOMMU MR to be enabled.
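
With this series the flow would instead look roughly like this
(simplified sketch, names taken from the patches below):

  vfio_realize()                      /* also runs on hotplug */
    -> vfio_pci_set_iova_ranges()
         -> pci_device_iommu_bus()
         -> PCIIOMMUOps::set_host_iova_ranges()
              -> virtio_iommu_set_host_iova_ranges()
                   turning the usable ranges into reserved regions
                   before the guest issues the probe request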

This looks cleaner. In the short term this may also be used for
passing the page size mask, which would allow us to get rid of the
hacky transient IOMMU MR enablement mentioned above.
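
To illustrate that possible follow-up, such a callback could be modeled
on set_host_iova_ranges(). This is a purely hypothetical sketch, the
set_host_page_size_mask name and prototype below are not part of this
series:

  /*
   * Hypothetical additional PCIIOMMUOps member, mirroring
   * set_host_iova_ranges(). It would be called by VFIO once the host
   * page size mask is known, without requiring the IOMMU MR to be
   * enabled.
   */
  int (*set_host_page_size_mask)(PCIBus *bus, void *opaque, int devfn,
                                 uint64_t page_size_mask, Error **errp);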

[1] [PATCH v4 00/12] VIRTIO-IOMMU/VFIO: Don't assume 64b IOVA space
    https://lore.kernel.org/all/20231019134651.842175-1-eric.auger@redhat.com/

[2] https://lore.kernel.org/all/20230929161547.GB2957297@myrica/

Extra Notes:
With that series, the reserved memory regions are communicated in time
so that the virtio-iommu probe request grabs them. However this is not
sufficient. In some cases (my case), I still see some DMA MAP failures
and the guest keeps on using IOVA ranges outside the geometry of the
physical IOMMU. This is due to the fact that the VFIO-PCI device is in
the same iommu group as the pcie root port. Normally the kernel
iova_reserve_iommu_regions (dma-iommu.c) is supposed to call reserve_iova()
for each reserved IOVA, which carves them out of the allocator. When
iommu_dma_init_domain() gets called for the hotplugged vfio-pci device
the iova domain is already allocated and set and we don't call
iova_reserve_iommu_regions() again for the vfio-pci device. So its
corresponding reserved regions are not properly taken into account.
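
In other words the kernel path boils down to (heavily simplified sketch
of the dma-iommu.c logic, not verbatim kernel code):

  iommu_dma_init_domain(domain, base, limit, dev)
    -> iova domain already initialized by the first device of the group
       (e.g. the root port)?
         yes: return 0
              /* iova_reserve_iommu_regions() never runs for dev, so its
                 reserved regions are not carved out of the allocator */
         no:  init_iova_domain(...)
              iova_reserve_iommu_regions(dev, domain)
              /* calls reserve_iova() for each reserved region */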

This is not trivial to fix because theoretically the 1st attached
device could already have allocated IOVAs within the reserved regions
of the second device. Also we are somehow hijacking the reserved
memory regions to model the geometry of the physical IOMMU, so I am not
sure any attempt to fix that upstream will be accepted. At the moment
one solution is to make sure assigned devices end up in singleton
groups. Another solution is to work on a different approach where the
gaw can be passed as an option to the virtio-iommu device, similarly to
what is done with the intel iommu.
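
For the record the intel iommu already exposes such a knob through its
aw-bits property; a hypothetical equivalent for the virtio-iommu (the
property name below is illustrative only and does not exist in this
series) could look like:

  # existing: constrain the vIOMMU address width with intel-iommu
  -device intel-iommu,aw-bits=48
  # hypothetical equivalent for virtio-iommu
  -device virtio-iommu-pci,aw-bits=48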

This series can be found at:
https://github.com/eauger/qemu/tree/hotplug-resv-rfc


Eric Auger (7):
  hw/pci: Introduce PCIIOMMUOps::set_host_iova_regions
  hw/pci: Introduce pci_device_iommu_bus
  vfio/pci: Pass the usable IOVA ranges through PCIIOMMUOps
  virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions
  virtio-iommu: Remove the implementation of iommu_set_iova_ranges
  hw/vfio: Remove memory_region_iommu_set_iova_ranges() call
  memory: Remove IOMMU MR iommu_set_iova_range API

 include/exec/memory.h    |  32 -------
 include/hw/pci/pci.h     |  16 ++++
 hw/pci/pci.c             |  16 ++++
 hw/vfio/common.c         |  10 --
 hw/vfio/pci.c            |  27 ++++++
 hw/virtio/virtio-iommu.c | 201 ++++++++++++++++++++-------------------
 system/memory.c          |  13 ---
 7 files changed, 160 insertions(+), 155 deletions(-)

-- 
2.41.0




* [RFC 1/7] hw/pci: Introduce PCIIOMMUOps::set_host_iova_regions
  2024-01-17  8:02 [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices Eric Auger
@ 2024-01-17  8:02 ` Eric Auger
  2024-01-17  8:02 ` [RFC 2/7] hw/pci: Introduce pci_device_iommu_bus Eric Auger
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Eric Auger @ 2024-01-17  8:02 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, zhenzhong.duan, peterx, yanghliu,
	pbonzini
  Cc: mst, clg

This new callback will be used to convey usable IOVA regions
from VFIO-PCI devices to vIOMMUs (esp. virtio-iommu). The advantage
is that this callback can be called very early, once the device is
known to be protected by a vIOMMU, after get_address_space()
has been called by the parent device. The current solution for
conveying IOVA regions relies on IOMMU MR callbacks, but this requires
an IOMMU MR to be connected with the VFIO-PCI device, which generally
comes with the enablement of the IOMMU MR (vIOMMU protection activated).
The downside is that it comes pretty late and, in the case of the
virtio-iommu, after the probe request.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 include/hw/pci/pci.h | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index fa6313aabc..63c018b35a 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -385,6 +385,21 @@ typedef struct PCIIOMMUOps {
      * @devfn: device and function number
      */
    AddressSpace * (*get_address_space)(PCIBus *bus, void *opaque, int devfn);
+
+    /**
+     * @set_host_iova_ranges: convey the usable iova ranges for a given device
+     *
+     * Optional callback which returns 0 on success or an error value
+     * otherwise. It should be called after @get_address_space().
+     *
+     * @bus: the #PCIBus being accessed.
+     * @opaque: the data passed to pci_setup_iommu().
+     * @devfn: device and function number
+     * @iova_ranges: list of IOVA ranges usable by the device
+     * @errp: error handle
+     */
+   int (*set_host_iova_ranges)(PCIBus *bus, void *opaque, int devfn,
+                               GList *iova_ranges, Error **errp);
 } PCIIOMMUOps;
 
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
-- 
2.41.0




* [RFC 2/7] hw/pci: Introduce pci_device_iommu_bus
  2024-01-17  8:02 [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices Eric Auger
  2024-01-17  8:02 ` [RFC 1/7] hw/pci: Introduce PCIIOMMUOps::set_host_iova_regions Eric Auger
@ 2024-01-17  8:02 ` Eric Auger
  2024-01-18  7:32   ` Duan, Zhenzhong
  2024-01-17  8:02 ` [RFC 3/7] vfio/pci: Pass the usable IOVA ranges through PCIIOMMUOps Eric Auger
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Eric Auger @ 2024-01-17  8:02 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, zhenzhong.duan, peterx, yanghliu,
	pbonzini
  Cc: mst, clg

This helper will allow subsequent patches to retrieve the IOMMU bus
and call its associated PCIIOMMUOps callbacks.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 include/hw/pci/pci.h |  1 +
 hw/pci/pci.c         | 16 ++++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 63c018b35a..649b327f9f 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -403,6 +403,7 @@ typedef struct PCIIOMMUOps {
 } PCIIOMMUOps;
 
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
+PCIBus *pci_device_iommu_bus(PCIDevice *dev);
 
 /**
  * pci_setup_iommu: Initialize specific IOMMU handlers for a PCIBus
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 76080af580..5bf07662fe 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2672,6 +2672,22 @@ static void pci_device_class_base_init(ObjectClass *klass, void *data)
     }
 }
 
+PCIBus *pci_device_iommu_bus(PCIDevice *dev)
+{
+    PCIBus *bus = pci_get_bus(dev);
+    PCIBus *iommu_bus = bus;
+
+    while (iommu_bus && !iommu_bus->iommu_ops && iommu_bus->parent_dev) {
+        PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
+
+        iommu_bus = parent_bus;
+    }
+    if (pci_bus_bypass_iommu(bus)) {
+        return NULL;
+    }
+    return iommu_bus;
+}
+
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
 {
     PCIBus *bus = pci_get_bus(dev);
-- 
2.41.0




* [RFC 3/7] vfio/pci: Pass the usable IOVA ranges through PCIIOMMUOps
  2024-01-17  8:02 [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices Eric Auger
  2024-01-17  8:02 ` [RFC 1/7] hw/pci: Introduce PCIIOMMUOps::set_host_iova_regions Eric Auger
  2024-01-17  8:02 ` [RFC 2/7] hw/pci: Introduce pci_device_iommu_bus Eric Auger
@ 2024-01-17  8:02 ` Eric Auger
  2024-01-17  8:02 ` [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions Eric Auger
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Eric Auger @ 2024-01-17  8:02 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, zhenzhong.duan, peterx, yanghliu,
	pbonzini
  Cc: mst, clg

Pass the collected usable IOVA regions using the PCIIOMMUOps
set_host_iova_ranges() callback, if implemented.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/vfio/pci.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index d7fe06715c..63937952bb 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2485,6 +2485,28 @@ int vfio_pci_get_pci_hot_reset_info(VFIOPCIDevice *vdev,
     return 0;
 }
 
+static int vfio_pci_set_iova_ranges(VFIOPCIDevice *vdev, Error **errp)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    PCIDevice *pdev = &vdev->pdev;
+    VFIOContainerBase *bcontainer = vbasedev->bcontainer;
+    PCIBus *bus, *iommu_bus;
+
+    if (!bcontainer->iova_ranges) {
+        return 0;
+    }
+
+    bus = pci_get_bus(pdev);
+    iommu_bus = pci_device_iommu_bus(pdev);
+    if (iommu_bus && iommu_bus->iommu_ops &&
+        iommu_bus->iommu_ops->set_host_iova_ranges) {
+        return iommu_bus->iommu_ops->set_host_iova_ranges(
+                   bus, iommu_bus->iommu_opaque,
+                   pdev->devfn, bcontainer->iova_ranges, errp);
+    }
+    return 0;
+}
+
 static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single)
 {
     VFIODevice *vbasedev = &vdev->vbasedev;
@@ -3004,6 +3026,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         goto error;
     }
 
+    ret = vfio_pci_set_iova_ranges(vdev, errp);
+    if (ret) {
+        goto error;
+    }
+
     vfio_populate_device(vdev, &err);
     if (err) {
         error_propagate(errp, err);
-- 
2.41.0




* [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions
  2024-01-17  8:02 [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices Eric Auger
                   ` (2 preceding siblings ...)
  2024-01-17  8:02 ` [RFC 3/7] vfio/pci: Pass the usable IOVA ranges through PCIIOMMUOps Eric Auger
@ 2024-01-17  8:02 ` Eric Auger
  2024-01-18  7:43   ` Duan, Zhenzhong
  2024-01-17  8:02 ` [RFC 5/7] virtio-iommu: Remove the implementation of iommu_set_iova_ranges Eric Auger
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Eric Auger @ 2024-01-17  8:02 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, zhenzhong.duan, peterx, yanghliu,
	pbonzini
  Cc: mst, clg

Reuse the implementation of virtio_iommu_set_iova_ranges() which
will be removed in subsequent patches.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/virtio/virtio-iommu.c | 134 +++++++++++++++++++++++++++++----------
 1 file changed, 101 insertions(+), 33 deletions(-)

diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
index 8a4bd933c6..716a3fcfbf 100644
--- a/hw/virtio/virtio-iommu.c
+++ b/hw/virtio/virtio-iommu.c
@@ -461,8 +461,109 @@ static AddressSpace *virtio_iommu_find_add_as(PCIBus *bus, void *opaque,
     return &sdev->as;
 }
 
+/**
+ * rebuild_resv_regions: rebuild resv regions with both the
+ * info of host resv ranges and property set resv ranges
+ */
+static int rebuild_resv_regions(IOMMUDevice *sdev)
+{
+    GList *l;
+    int i = 0;
+
+    /* free the existing list and rebuild it from scratch */
+    g_list_free_full(sdev->resv_regions, g_free);
+    sdev->resv_regions = NULL;
+
+    /* First add host reserved regions if any, all tagged as RESERVED */
+    for (l = sdev->host_resv_ranges; l; l = l->next) {
+        ReservedRegion *reg = g_new0(ReservedRegion, 1);
+        Range *r = (Range *)l->data;
+
+        reg->type = VIRTIO_IOMMU_RESV_MEM_T_RESERVED;
+        range_set_bounds(&reg->range, range_lob(r), range_upb(r));
+        sdev->resv_regions = resv_region_list_insert(sdev->resv_regions, reg);
+        trace_virtio_iommu_host_resv_regions(sdev->iommu_mr.parent_obj.name, i,
+                                             range_lob(&reg->range),
+                                             range_upb(&reg->range));
+        i++;
+    }
+    /*
+     * then add higher priority reserved regions set by the machine
+     * through properties
+     */
+    add_prop_resv_regions(sdev);
+    return 0;
+}
+
+static int virtio_iommu_set_host_iova_ranges(PCIBus *bus, void *opaque,
+                                             int devfn, GList *iova_ranges,
+                                             Error **errp)
+{
+    VirtIOIOMMU *s = opaque;
+    IOMMUPciBus *sbus = g_hash_table_lookup(s->as_by_busptr, bus);
+    IOMMUDevice *sdev;
+    GList *current_ranges;
+    GList *l, *tmp, *new_ranges = NULL;
+    int ret = -EINVAL;
+
+    if (!sbus) {
+        error_report("%s no sbus", __func__);
+    }
+
+    sdev = sbus->pbdev[devfn];
+
+    current_ranges = sdev->host_resv_ranges;
+
+    warn_report("%s: host_resv_regions=%d", __func__, !!sdev->host_resv_ranges);
+    /* check that each new resv region is included in an existing one */
+    if (sdev->host_resv_ranges) {
+        range_inverse_array(iova_ranges,
+                            &new_ranges,
+                            0, UINT64_MAX);
+
+        for (tmp = new_ranges; tmp; tmp = tmp->next) {
+            Range *newr = (Range *)tmp->data;
+            bool included = false;
+
+            for (l = current_ranges; l; l = l->next) {
+                Range * r = (Range *)l->data;
+
+                if (range_contains_range(r, newr)) {
+                    included = true;
+                    break;
+                }
+            }
+            if (!included) {
+                goto error;
+            }
+        }
+        /* all new reserved ranges are included in existing ones */
+        ret = 0;
+        goto out;
+    }
+
+    if (sdev->probe_done) {
+        warn_report("%s: Notified about new host reserved regions after probe",
+                    __func__);
+    }
+
+    range_inverse_array(iova_ranges,
+                        &sdev->host_resv_ranges,
+                        0, UINT64_MAX);
+    rebuild_resv_regions(sdev);
+
+    return 0;
+error:
+    error_setg(errp, "%s Conflicting host reserved ranges set!",
+               __func__);
+out:
+    g_list_free_full(new_ranges, g_free);
+    return ret;
+}
+
 static const PCIIOMMUOps virtio_iommu_ops = {
     .get_address_space = virtio_iommu_find_add_as,
+    .set_host_iova_ranges = virtio_iommu_set_host_iova_ranges,
 };
 
 static int virtio_iommu_attach(VirtIOIOMMU *s,
@@ -1158,39 +1259,6 @@ static int virtio_iommu_set_page_size_mask(IOMMUMemoryRegion *mr,
     return 0;
 }
 
-/**
- * rebuild_resv_regions: rebuild resv regions with both the
- * info of host resv ranges and property set resv ranges
- */
-static int rebuild_resv_regions(IOMMUDevice *sdev)
-{
-    GList *l;
-    int i = 0;
-
-    /* free the existing list and rebuild it from scratch */
-    g_list_free_full(sdev->resv_regions, g_free);
-    sdev->resv_regions = NULL;
-
-    /* First add host reserved regions if any, all tagged as RESERVED */
-    for (l = sdev->host_resv_ranges; l; l = l->next) {
-        ReservedRegion *reg = g_new0(ReservedRegion, 1);
-        Range *r = (Range *)l->data;
-
-        reg->type = VIRTIO_IOMMU_RESV_MEM_T_RESERVED;
-        range_set_bounds(&reg->range, range_lob(r), range_upb(r));
-        sdev->resv_regions = resv_region_list_insert(sdev->resv_regions, reg);
-        trace_virtio_iommu_host_resv_regions(sdev->iommu_mr.parent_obj.name, i,
-                                             range_lob(&reg->range),
-                                             range_upb(&reg->range));
-        i++;
-    }
-    /*
-     * then add higher priority reserved regions set by the machine
-     * through properties
-     */
-    add_prop_resv_regions(sdev);
-    return 0;
-}
 
 /**
  * virtio_iommu_set_iova_ranges: Conveys the usable IOVA ranges
-- 
2.41.0




* [RFC 5/7] virtio-iommu: Remove the implementation of iommu_set_iova_ranges
  2024-01-17  8:02 [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices Eric Auger
                   ` (3 preceding siblings ...)
  2024-01-17  8:02 ` [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions Eric Auger
@ 2024-01-17  8:02 ` Eric Auger
  2024-01-17  8:02 ` [RFC 6/7] hw/vfio: Remove memory_region_iommu_set_iova_ranges() call Eric Auger
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Eric Auger @ 2024-01-17  8:02 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, zhenzhong.duan, peterx, yanghliu,
	pbonzini
  Cc: mst, clg

Now that we use PCIIOMMUOps to convey information about usable IOVA
ranges, we no longer need to implement the iommu_set_iova_ranges
IOMMU MR callback.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/virtio/virtio-iommu.c | 67 ----------------------------------------
 1 file changed, 67 deletions(-)

diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
index 716a3fcfbf..1e2fc86f2f 100644
--- a/hw/virtio/virtio-iommu.c
+++ b/hw/virtio/virtio-iommu.c
@@ -1260,72 +1260,6 @@ static int virtio_iommu_set_page_size_mask(IOMMUMemoryRegion *mr,
 }
 
 
-/**
- * virtio_iommu_set_iova_ranges: Conveys the usable IOVA ranges
- *
- * The function turns those into reserved ranges. Once some
- * reserved ranges have been set, new reserved regions cannot be
- * added outside of the original ones.
- *
- * @mr: IOMMU MR
- * @iova_ranges: list of usable IOVA ranges
- * @errp: error handle
- */
-static int virtio_iommu_set_iova_ranges(IOMMUMemoryRegion *mr,
-                                        GList *iova_ranges,
-                                        Error **errp)
-{
-    IOMMUDevice *sdev = container_of(mr, IOMMUDevice, iommu_mr);
-    GList *current_ranges = sdev->host_resv_ranges;
-    GList *l, *tmp, *new_ranges = NULL;
-    int ret = -EINVAL;
-
-    /* check that each new resv region is included in an existing one */
-    if (sdev->host_resv_ranges) {
-        range_inverse_array(iova_ranges,
-                            &new_ranges,
-                            0, UINT64_MAX);
-
-        for (tmp = new_ranges; tmp; tmp = tmp->next) {
-            Range *newr = (Range *)tmp->data;
-            bool included = false;
-
-            for (l = current_ranges; l; l = l->next) {
-                Range * r = (Range *)l->data;
-
-                if (range_contains_range(r, newr)) {
-                    included = true;
-                    break;
-                }
-            }
-            if (!included) {
-                goto error;
-            }
-        }
-        /* all new reserved ranges are included in existing ones */
-        ret = 0;
-        goto out;
-    }
-
-    if (sdev->probe_done) {
-        warn_report("%s: Notified about new host reserved regions after probe",
-                    mr->parent_obj.name);
-    }
-
-    range_inverse_array(iova_ranges,
-                        &sdev->host_resv_ranges,
-                        0, UINT64_MAX);
-    rebuild_resv_regions(sdev);
-
-    return 0;
-error:
-    error_setg(errp, "IOMMU mr=%s Conflicting host reserved ranges set!",
-               mr->parent_obj.name);
-out:
-    g_list_free_full(new_ranges, g_free);
-    return ret;
-}
-
 static void virtio_iommu_system_reset(void *opaque)
 {
     VirtIOIOMMU *s = opaque;
@@ -1621,7 +1555,6 @@ static void virtio_iommu_memory_region_class_init(ObjectClass *klass,
     imrc->replay = virtio_iommu_replay;
     imrc->notify_flag_changed = virtio_iommu_notify_flag_changed;
     imrc->iommu_set_page_size_mask = virtio_iommu_set_page_size_mask;
-    imrc->iommu_set_iova_ranges = virtio_iommu_set_iova_ranges;
 }
 
 static const TypeInfo virtio_iommu_info = {
-- 
2.41.0




* [RFC 6/7] hw/vfio: Remove memory_region_iommu_set_iova_ranges() call
  2024-01-17  8:02 [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices Eric Auger
                   ` (4 preceding siblings ...)
  2024-01-17  8:02 ` [RFC 5/7] virtio-iommu: Remove the implementation of iommu_set_iova_ranges Eric Auger
@ 2024-01-17  8:02 ` Eric Auger
  2024-01-17  8:02 ` [RFC 7/7] memory: Remove IOMMU MR iommu_set_iova_range API Eric Auger
  2024-01-18  7:10 ` [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices Duan, Zhenzhong
  7 siblings, 0 replies; 20+ messages in thread
From: Eric Auger @ 2024-01-17  8:02 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, zhenzhong.duan, peterx, yanghliu,
	pbonzini
  Cc: mst, clg

As we have just removed the only implementation of the
iommu_set_iova_ranges IOMMU MR callback in the virtio-iommu,
let's remove the call to the memory wrapper. Usable IOVA ranges
are now conveyed through the PCIIOMMUOps in VFIO-PCI.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/vfio/common.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0b3352f2a9..d88602e050 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -636,16 +636,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
             goto fail;
         }
 
-        if (bcontainer->iova_ranges) {
-            ret = memory_region_iommu_set_iova_ranges(giommu->iommu_mr,
-                                                      bcontainer->iova_ranges,
-                                                      &err);
-            if (ret) {
-                g_free(giommu);
-                goto fail;
-            }
-        }
-
         ret = memory_region_register_iommu_notifier(section->mr, &giommu->n,
                                                     &err);
         if (ret) {
-- 
2.41.0




* [RFC 7/7] memory: Remove IOMMU MR iommu_set_iova_range API
  2024-01-17  8:02 [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices Eric Auger
                   ` (5 preceding siblings ...)
  2024-01-17  8:02 ` [RFC 6/7] hw/vfio: Remove memory_region_iommu_set_iova_ranges() call Eric Auger
@ 2024-01-17  8:02 ` Eric Auger
  2024-01-18  7:10 ` [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices Duan, Zhenzhong
  7 siblings, 0 replies; 20+ messages in thread
From: Eric Auger @ 2024-01-17  8:02 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, zhenzhong.duan, peterx, yanghliu,
	pbonzini
  Cc: mst, clg

Since the host IOVA ranges are now passed through the
PCIIOMMUOps set_host_iova_ranges callback, and since we have
removed the only implementation of iommu_set_iova_ranges() in
the virtio-iommu as well as its only call site in vfio/common,
let's retire the IOMMU MR API and its memory wrapper.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 include/exec/memory.h | 32 --------------------------------
 system/memory.c       | 13 -------------
 2 files changed, 45 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 177be23db7..7677b572c5 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -527,26 +527,6 @@ struct IOMMUMemoryRegionClass {
      int (*iommu_set_page_size_mask)(IOMMUMemoryRegion *iommu,
                                      uint64_t page_size_mask,
                                      Error **errp);
-    /**
-     * @iommu_set_iova_ranges:
-     *
-     * Propagate information about the usable IOVA ranges for a given IOMMU
-     * memory region. Used for example to propagate host physical device
-     * reserved memory region constraints to the virtual IOMMU.
-     *
-     * Optional method: if this method is not provided, then the default IOVA
-     * aperture is used.
-     *
-     * @iommu: the IOMMUMemoryRegion
-     *
-     * @iova_ranges: list of ordered IOVA ranges (at least one range)
-     *
-     * Returns 0 on success, or a negative error. In case of failure, the error
-     * object must be created.
-     */
-     int (*iommu_set_iova_ranges)(IOMMUMemoryRegion *iommu,
-                                  GList *iova_ranges,
-                                  Error **errp);
 };
 
 typedef struct RamDiscardListener RamDiscardListener;
@@ -1896,18 +1876,6 @@ int memory_region_iommu_set_page_size_mask(IOMMUMemoryRegion *iommu_mr,
                                            uint64_t page_size_mask,
                                            Error **errp);
 
-/**
- * memory_region_iommu_set_iova_ranges - Set the usable IOVA ranges
- * for a given IOMMU MR region
- *
- * @iommu: IOMMU memory region
- * @iova_ranges: list of ordered IOVA ranges (at least one range)
- * @errp: pointer to Error*, to store an error if it happens.
- */
-int memory_region_iommu_set_iova_ranges(IOMMUMemoryRegion *iommu,
-                                        GList *iova_ranges,
-                                        Error **errp);
-
 /**
  * memory_region_name: get a memory region's name
  *
diff --git a/system/memory.c b/system/memory.c
index a229a79988..b2fb12f87f 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -1909,19 +1909,6 @@ int memory_region_iommu_set_page_size_mask(IOMMUMemoryRegion *iommu_mr,
     return ret;
 }
 
-int memory_region_iommu_set_iova_ranges(IOMMUMemoryRegion *iommu_mr,
-                                        GList *iova_ranges,
-                                        Error **errp)
-{
-    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr);
-    int ret = 0;
-
-    if (imrc->iommu_set_iova_ranges) {
-        ret = imrc->iommu_set_iova_ranges(iommu_mr, iova_ranges, errp);
-    }
-    return ret;
-}
-
 int memory_region_register_iommu_notifier(MemoryRegion *mr,
                                           IOMMUNotifier *n, Error **errp)
 {
-- 
2.41.0




* RE: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices
  2024-01-17  8:02 [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices Eric Auger
                   ` (6 preceding siblings ...)
  2024-01-17  8:02 ` [RFC 7/7] memory: Remove IOMMU MR iommu_set_iova_range API Eric Auger
@ 2024-01-18  7:10 ` Duan, Zhenzhong
  2024-01-18  9:43   ` Eric Auger
  7 siblings, 1 reply; 20+ messages in thread
From: Duan, Zhenzhong @ 2024-01-18  7:10 UTC (permalink / raw)
  To: Eric Auger, eric.auger.pro, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, peterx, yanghliu, pbonzini
  Cc: mst, clg

Hi Eric,

>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Cc: mst@redhat.com; clg@redhat.com
>Subject: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling
>for hotplugged devices
>
>In [1] we attempted to fix a case where a VFIO-PCI device protected
>with a virtio-iommu was assigned to an x86 guest. On x86 the physical
>IOMMU may have an address width (gaw) of 39 or 48 bits whereas the
>virtio-iommu used to expose a 64b address space by default.
>Hence the guest was trying to use the full 64b space and we hit
>DMA MAP failures. To work around this issue we managed to pass
>usable IOVA regions (excluding the out of range space) from VFIO
>to the virtio-iommu device. This was made feasible by introducing
>a new IOMMU Memory Region callback dubbed iommu_set_iova_regions().
>This latter gets called when the IOMMU MR is enabled which
>causes the vfio_listener_region_add() to be called.
>
>However with VFIO-PCI hotplug, this technique fails due to the
>race between the call to the callback in the add memory listener
>and the virtio-iommu probe request. Indeed the probe request gets
>called before the attach to the domain. So in that case the usable
>regions are communicated after the probe request and fail to be
>conveyed to the guest. To be honest the problem was hinted by
>Jean-Philippe in [1] and I should have been more careful at
>listening to him and testing with hotplug :-(

It looks like the global virtio_iommu_config.bypass is never cleared in
the guest. When the guest virtio_iommu driver enables the IOMMU, should
it clear this bypass attribute?

If it were cleared in viommu_probe(), then qemu would call
virtio_iommu_set_config() and then virtio_iommu_switch_address_space_all()
to enable the IOMMU MRs. Then both coldplugged and hotplugged devices
would work.
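
A rough sketch of such a config write on the guest side (illustrative
only, assuming VIRTIO_IOMMU_F_BYPASS_CONFIG has been negotiated, not the
actual driver code):

  /* illustrative: clear global bypass once the driver takes over */
  if (virtio_has_feature(vdev, VIRTIO_IOMMU_F_BYPASS_CONFIG))
          virtio_cwrite8(vdev,
                         offsetof(struct virtio_iommu_config, bypass),
                         0);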

Intel iommu has a similar bit, GCMD_REG.TE: when the guest intel_iommu
driver sets it at probe time, vtd_address_space_refresh_all() is called
on the qemu side to enable the IOMMU MRs.

>
>For coldplugged device the technique works because we make sure all
>the IOMMU MR are enabled once on the machine init done: 94df5b2180
>("virtio-iommu: Fix 64kB host page size VFIO device assignment")
>for granule freeze. But I would be keen to get rid of this trick.
>
>Using an IOMMU MR Ops is unpractical because this relies on the IOMMU
>MR to have been enabled and the corresponding vfio_listener_region_add()
>to be executed. Instead this series proposes to replace the usage of this
>API by the recently introduced PCIIOMMUOps: ba7d12eb8c  ("hw/pci:
>modify
>pci_setup_iommu() to set PCIIOMMUOps"). That way, the callback can be
>called earlier, once the usable IOVA regions have been collected by
>VFIO, without the need for the IOMMU MR to be enabled.
>
>This looks cleaner. In the short term this may also be used for
>passing the page size mask, which would allow to get rid of the
>hacky transient IOMMU MR enablement mentionned above.
>
>[1] [PATCH v4 00/12] VIRTIO-IOMMU/VFIO: Don't assume 64b IOVA space
>    https://lore.kernel.org/all/20231019134651.842175-1-
>eric.auger@redhat.com/
>
>[2] https://lore.kernel.org/all/20230929161547.GB2957297@myrica/
>
>Extra Notes:
>With that series, the reserved memory regions are communicated on time
>so that the virtio-iommu probe request grabs them. However this is not
>sufficient. In some cases (my case), I still see some DMA MAP failures
>and the guest keeps on using IOVA ranges outside the geometry of the
>physical IOMMU. This is due to the fact the VFIO-PCI device is in the
>same iommu group as the pcie root port. Normally the kernel
>iova_reserve_iommu_regions (dma-iommu.c) is supposed to call
>reserve_iova()
>for each reserved IOVA, which carves them out of the allocator. When
>iommu_dma_init_domain() gets called for the hotplugged vfio-pci device
>the iova domain is already allocated and set and we don't call
>iova_reserve_iommu_regions() again for the vfio-pci device. So its
>corresponding reserved regions are not properly taken into account.

I suspect there is the same issue with coldplugged devices. If those
devices are in the same group, iova_reserve_iommu_regions() is only
called for the first device, so the other devices' reserved regions are
missed.

Curious how you get the passthrough device and the pcie root port into
the same group. When I start an x86 guest with a passthrough device, I
see that the passthrough device and the pcie root port are in different
groups.

-[0000:00]-+-00.0
           +-01.0
           +-02.0
           +-03.0-[01]----00.0

/sys/kernel/iommu_groups/3/devices:
0000:00:03.0
/sys/kernel/iommu_groups/7/devices:
0000:01:00.0

My qemu cmdline:
-device pcie-root-port,id=root0,slot=0
-device vfio-pci,host=6f:01.0,id=vfio0,bus=root0

Thanks
Zhenzhong

>
>This is not trivial to fix because theoretically the 1st attached
>devices could already have allocated IOVAs within the reserved regions
>of the second device. Also we are somehow hijacking the reserved
>memory regions to model the geometry of the physical IOMMU so not sure
>any attempt to fix that upstream will be accepted. At the moment one
>solution is to make sure assigned devices end up in singleton group.
>Another solution is to work on a different approach where the gaw
>can be passed as an option to the virtio-iommu device, similarly at
>what is done with intel iommu.
>
>This series can be found at:
>https://github.com/eauger/qemu/tree/hotplug-resv-rfc
>
>
>Eric Auger (7):
>  hw/pci: Introduce PCIIOMMUOps::set_host_iova_regions
>  hw/pci: Introduce pci_device_iommu_bus
>  vfio/pci: Pass the usable IOVA ranges through PCIIOMMUOps
>  virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions
>  virtio-iommu: Remove the implementation of iommu_set_iova_ranges
>  hw/vfio: Remove memory_region_iommu_set_iova_ranges() call
>  memory: Remove IOMMU MR iommu_set_iova_range API
>
> include/exec/memory.h    |  32 -------
> include/hw/pci/pci.h     |  16 ++++
> hw/pci/pci.c             |  16 ++++
> hw/vfio/common.c         |  10 --
> hw/vfio/pci.c            |  27 ++++++
> hw/virtio/virtio-iommu.c | 201 ++++++++++++++++++++-------------------
> system/memory.c          |  13 ---
> 7 files changed, 160 insertions(+), 155 deletions(-)
>
>--
>2.41.0




* RE: [RFC 2/7] hw/pci: Introduce pci_device_iommu_bus
  2024-01-17  8:02 ` [RFC 2/7] hw/pci: Introduce pci_device_iommu_bus Eric Auger
@ 2024-01-18  7:32   ` Duan, Zhenzhong
  0 siblings, 0 replies; 20+ messages in thread
From: Duan, Zhenzhong @ 2024-01-18  7:32 UTC (permalink / raw)
  To: Eric Auger, eric.auger.pro, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, peterx, yanghliu, pbonzini
  Cc: mst, clg



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: [RFC 2/7] hw/pci: Introduce pci_device_iommu_bus
>
>This helper will allow subsequent patches to retrieve the IOMMU bus
>and call its associated PCIIOMMUOps callbacks.
>
>Signed-off-by: Eric Auger <eric.auger@redhat.com>
>---
> include/hw/pci/pci.h |  1 +
> hw/pci/pci.c         | 16 ++++++++++++++++
> 2 files changed, 17 insertions(+)
>
>diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>index 63c018b35a..649b327f9f 100644
>--- a/include/hw/pci/pci.h
>+++ b/include/hw/pci/pci.h
>@@ -403,6 +403,7 @@ typedef struct PCIIOMMUOps {
> } PCIIOMMUOps;
>
> AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
>+PCIBus *pci_device_iommu_bus(PCIDevice *dev);
>
> /**
>  * pci_setup_iommu: Initialize specific IOMMU handlers for a PCIBus
>diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>index 76080af580..5bf07662fe 100644
>--- a/hw/pci/pci.c
>+++ b/hw/pci/pci.c
>@@ -2672,6 +2672,22 @@ static void
>pci_device_class_base_init(ObjectClass *klass, void *data)
>     }
> }
>
>+PCIBus *pci_device_iommu_bus(PCIDevice *dev)
>+{
>+    PCIBus *bus = pci_get_bus(dev);
>+    PCIBus *iommu_bus = bus;
>+
>+    while (iommu_bus && !iommu_bus->iommu_ops && iommu_bus-
>>parent_dev) {
>+        PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
>+
>+        iommu_bus = parent_bus;

Variable parent_bus can be removed.

>+    }
>+    if (pci_bus_bypass_iommu(bus)) {
>+        return NULL;
>+    }
>+    return iommu_bus;
>+}
>+
> AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> {
>     PCIBus *bus = pci_get_bus(dev);
>--
>2.41.0




* RE: [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions
  2024-01-17  8:02 ` [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions Eric Auger
@ 2024-01-18  7:43   ` Duan, Zhenzhong
  2024-01-18 12:25     ` Eric Auger
  0 siblings, 1 reply; 20+ messages in thread
From: Duan, Zhenzhong @ 2024-01-18  7:43 UTC (permalink / raw)
  To: Eric Auger, eric.auger.pro, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, peterx, yanghliu, pbonzini
  Cc: mst, clg

Hi Eric,

>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps
>set_host_resv_regions
>
>Reuse the implementation of virtio_iommu_set_iova_ranges() which
>will be removed in subsequent patches.
>
>Signed-off-by: Eric Auger <eric.auger@redhat.com>
>---
> hw/virtio/virtio-iommu.c | 134 +++++++++++++++++++++++++++++---------
>-
> 1 file changed, 101 insertions(+), 33 deletions(-)
>
>diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
>index 8a4bd933c6..716a3fcfbf 100644
>--- a/hw/virtio/virtio-iommu.c
>+++ b/hw/virtio/virtio-iommu.c
>@@ -461,8 +461,109 @@ static AddressSpace
>*virtio_iommu_find_add_as(PCIBus *bus, void *opaque,
>     return &sdev->as;
> }
>
>+/**
>+ * rebuild_resv_regions: rebuild resv regions with both the
>+ * info of host resv ranges and property set resv ranges
>+ */
>+static int rebuild_resv_regions(IOMMUDevice *sdev)
>+{
>+    GList *l;
>+    int i = 0;
>+
>+    /* free the existing list and rebuild it from scratch */
>+    g_list_free_full(sdev->resv_regions, g_free);
>+    sdev->resv_regions = NULL;
>+
>+    /* First add host reserved regions if any, all tagged as RESERVED */
>+    for (l = sdev->host_resv_ranges; l; l = l->next) {
>+        ReservedRegion *reg = g_new0(ReservedRegion, 1);
>+        Range *r = (Range *)l->data;
>+
>+        reg->type = VIRTIO_IOMMU_RESV_MEM_T_RESERVED;
>+        range_set_bounds(&reg->range, range_lob(r), range_upb(r));
>+        sdev->resv_regions = resv_region_list_insert(sdev->resv_regions, reg);
>+        trace_virtio_iommu_host_resv_regions(sdev-
>>iommu_mr.parent_obj.name, i,
>+                                             range_lob(&reg->range),
>+                                             range_upb(&reg->range));
>+        i++;
>+    }
>+    /*
>+     * then add higher priority reserved regions set by the machine
>+     * through properties
>+     */
>+    add_prop_resv_regions(sdev);
>+    return 0;
>+}
>+
>+static int virtio_iommu_set_host_iova_ranges(PCIBus *bus, void *opaque,
>+                                             int devfn, GList *iova_ranges,
>+                                             Error **errp)
>+{
>+    VirtIOIOMMU *s = opaque;
>+    IOMMUPciBus *sbus = g_hash_table_lookup(s->as_by_busptr, bus);
>+    IOMMUDevice *sdev;
>+    GList *current_ranges;
>+    GList *l, *tmp, *new_ranges = NULL;
>+    int ret = -EINVAL;
>+
>+    if (!sbus) {
>+        error_report("%s no sbus", __func__);
>+    }

Do we plan to support multiple devices in the same iommu group?
as_by_busptr is hashed by bus, which is the aliased bus of the device.
So sbus may be NULL if the device's bus is different from the aliased bus.
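
If that can happen, a guard along these lines may be needed (just a
sketch):

    if (!sbus) {
        error_setg(errp, "%s: no IOMMUPciBus found for this PCI bus",
                   __func__);
        return -EINVAL;
    }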

>+
>+    sdev = sbus->pbdev[devfn];
>+
>+    current_ranges = sdev->host_resv_ranges;
>+
>+    warn_report("%s: host_resv_regions=%d", __func__, !!sdev-
>>host_resv_ranges);
>+    /* check that each new resv region is included in an existing one */
>+    if (sdev->host_resv_ranges) {

Maybe we could just error out, as vfio_realize should not call
set_host_iova_ranges() twice.

Thanks
Zhenzhong
>+        range_inverse_array(iova_ranges,
>+                            &new_ranges,
>+                            0, UINT64_MAX);
>+
>+        for (tmp = new_ranges; tmp; tmp = tmp->next) {
>+            Range *newr = (Range *)tmp->data;
>+            bool included = false;
>+
>+            for (l = current_ranges; l; l = l->next) {
>+                Range * r = (Range *)l->data;
>+
>+                if (range_contains_range(r, newr)) {
>+                    included = true;
>+                    break;
>+                }
>+            }
>+            if (!included) {
>+                goto error;
>+            }
>+        }
>+        /* all new reserved ranges are included in existing ones */
>+        ret = 0;
>+        goto out;
>+    }
>+
>+    if (sdev->probe_done) {
>+        warn_report("%s: Notified about new host reserved regions after
>probe",
>+                    __func__);
>+    }
>+
>+    range_inverse_array(iova_ranges,
>+                        &sdev->host_resv_ranges,
>+                        0, UINT64_MAX);
>+    rebuild_resv_regions(sdev);
>+
>+    return 0;
>+error:
>+    error_setg(errp, "%s Conflicting host reserved ranges set!",
>+               __func__);
>+out:
>+    g_list_free_full(new_ranges, g_free);
>+    return ret;
>+}
>+
> static const PCIIOMMUOps virtio_iommu_ops = {
>     .get_address_space = virtio_iommu_find_add_as,
>+    .set_host_iova_ranges = virtio_iommu_set_host_iova_ranges,
> };
>
> static int virtio_iommu_attach(VirtIOIOMMU *s,
>@@ -1158,39 +1259,6 @@ static int
>virtio_iommu_set_page_size_mask(IOMMUMemoryRegion *mr,
>     return 0;
> }
>
>-/**
>- * rebuild_resv_regions: rebuild resv regions with both the
>- * info of host resv ranges and property set resv ranges
>- */
>-static int rebuild_resv_regions(IOMMUDevice *sdev)
>-{
>-    GList *l;
>-    int i = 0;
>-
>-    /* free the existing list and rebuild it from scratch */
>-    g_list_free_full(sdev->resv_regions, g_free);
>-    sdev->resv_regions = NULL;
>-
>-    /* First add host reserved regions if any, all tagged as RESERVED */
>-    for (l = sdev->host_resv_ranges; l; l = l->next) {
>-        ReservedRegion *reg = g_new0(ReservedRegion, 1);
>-        Range *r = (Range *)l->data;
>-
>-        reg->type = VIRTIO_IOMMU_RESV_MEM_T_RESERVED;
>-        range_set_bounds(&reg->range, range_lob(r), range_upb(r));
>-        sdev->resv_regions = resv_region_list_insert(sdev->resv_regions, reg);
>-        trace_virtio_iommu_host_resv_regions(sdev-
>>iommu_mr.parent_obj.name, i,
>-                                             range_lob(&reg->range),
>-                                             range_upb(&reg->range));
>-        i++;
>-    }
>-    /*
>-     * then add higher priority reserved regions set by the machine
>-     * through properties
>-     */
>-    add_prop_resv_regions(sdev);
>-    return 0;
>-}
>
> /**
>  * virtio_iommu_set_iova_ranges: Conveys the usable IOVA ranges
>--
>2.41.0




* Re: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices
  2024-01-18  7:10 ` [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices Duan, Zhenzhong
@ 2024-01-18  9:43   ` Eric Auger
  2024-01-19  6:46     ` Duan, Zhenzhong
  2024-01-25 18:48     ` Jean-Philippe Brucker
  0 siblings, 2 replies; 20+ messages in thread
From: Eric Auger @ 2024-01-18  9:43 UTC (permalink / raw)
  To: Duan, Zhenzhong, eric.auger.pro, qemu-devel, qemu-arm,
	jean-philippe, alex.williamson, peter.maydell, peterx, yanghliu,
	pbonzini
  Cc: mst, clg

Hi Zhenzhong,
On 1/18/24 08:10, Duan, Zhenzhong wrote:
> Hi Eric,
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Cc: mst@redhat.com; clg@redhat.com
>> Subject: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling
>> for hotplugged devices
>>
>> In [1] we attempted to fix a case where a VFIO-PCI device protected
>> with a virtio-iommu was assigned to an x86 guest. On x86 the physical
>> IOMMU may have an address width (gaw) of 39 or 48 bits whereas the
>> virtio-iommu used to expose a 64b address space by default.
>> Hence the guest was trying to use the full 64b space and we hit
>> DMA MAP failures. To work around this issue we managed to pass
>> usable IOVA regions (excluding the out of range space) from VFIO
>> to the virtio-iommu device. This was made feasible by introducing
>> a new IOMMU Memory Region callback dubbed iommu_set_iova_regions().
>> This latter gets called when the IOMMU MR is enabled which
>> causes the vfio_listener_region_add() to be called.
>>
>> However with VFIO-PCI hotplug, this technique fails due to the
>> race between the call to the callback in the add memory listener
>> and the virtio-iommu probe request. Indeed the probe request gets
>> called before the attach to the domain. So in that case the usable
>> regions are communicated after the probe request and fail to be
>> conveyed to the guest. To be honest the problem was hinted by
>> Jean-Philippe in [1] and I should have been more careful at
>> listening to him and testing with hotplug :-(
> It looks the global virtio_iommu_config.bypass is never cleared in guest.
> When guest virtio_iommu driver enable IOMMU, should it clear this
> bypass attribute?
> If it could be cleared in viommu_probe(), then qemu will call
> virtio_iommu_set_config() then virtio_iommu_switch_address_space_all()
> to enable IOMMU MR. Then both coldplugged and hotplugged devices will work.

this field is iommu wide while the probe applies to a single device. In
general I would prefer not to be dependent on the MR enablement. We know
that the device is likely to be protected and we can collect its
requirements beforehand.
>
> Intel iommu has a similar bit in register GCMD_REG.TE, when guest
> intel_iommu driver probe set it, on qemu side, vtd_address_space_refresh_all()
> is called to enable IOMMU MRs.
interesting.

Would be curious to get Jean Philippe's pov.
>
>> For coldplugged device the technique works because we make sure all
>> the IOMMU MR are enabled once on the machine init done: 94df5b2180
>> ("virtio-iommu: Fix 64kB host page size VFIO device assignment")
>> for granule freeze. But I would be keen to get rid of this trick.
>>
>> Using an IOMMU MR Ops is unpractical because this relies on the IOMMU
>> MR to have been enabled and the corresponding vfio_listener_region_add()
>> to be executed. Instead this series proposes to replace the usage of this
>> API by the recently introduced PCIIOMMUOps: ba7d12eb8c  ("hw/pci:
>> modify
>> pci_setup_iommu() to set PCIIOMMUOps"). That way, the callback can be
>> called earlier, once the usable IOVA regions have been collected by
>> VFIO, without the need for the IOMMU MR to be enabled.
>>
>> This looks cleaner. In the short term this may also be used for
>> passing the page size mask, which would allow to get rid of the
>> hacky transient IOMMU MR enablement mentionned above.
>>
>> [1] [PATCH v4 00/12] VIRTIO-IOMMU/VFIO: Don't assume 64b IOVA space
>>    https://lore.kernel.org/all/20231019134651.842175-1-
>> eric.auger@redhat.com/
>>
>> [2] https://lore.kernel.org/all/20230929161547.GB2957297@myrica/
>>
>> Extra Notes:
>> With that series, the reserved memory regions are communicated on time
>> so that the virtio-iommu probe request grabs them. However this is not
>> sufficient. In some cases (my case), I still see some DMA MAP failures
>> and the guest keeps on using IOVA ranges outside the geometry of the
>> physical IOMMU. This is due to the fact the VFIO-PCI device is in the
>> same iommu group as the pcie root port. Normally the kernel
>> iova_reserve_iommu_regions (dma-iommu.c) is supposed to call
>> reserve_iova()
>> for each reserved IOVA, which carves them out of the allocator. When
>> iommu_dma_init_domain() gets called for the hotplugged vfio-pci device
>> the iova domain is already allocated and set and we don't call
>> iova_reserve_iommu_regions() again for the vfio-pci device. So its
>> corresponding reserved regions are not properly taken into account.
> I suspect there is same issue with coldplugged devices. If those devices
> are in same group, get iova_reserve_iommu_regions() is only called
> for first device. But other devices's reserved regions are missed.

Correct
>
> Curious how you make passthrough device and pcie root port under same
> group.
> When I start a x86 guest with passthrough device, I see passthrough
> device and pcie root port are in different group.
>
> -[0000:00]-+-00.0
>            +-01.0
>            +-02.0
>            +-03.0-[01]----00.0
>
> /sys/kernel/iommu_groups/3/devices:
> 0000:00:03.0
> /sys/kernel/iommu_groups/7/devices:
> 0000:01:00.0
>
> My qemu cmdline:
> -device pcie-root-port,id=root0,slot=0
> -device vfio-pci,host=6f:01.0,id=vfio0,bus=root0

I just replayed the scenario:
- if you have a coldplugged vfio-pci device, the pci root port and the
  passed-through device end up in different iommu groups. On my end I use
  ioh3420 but you confirmed it is the same for the generic pcie-root-port.
- however, if you hotplug the vfio-pci device, that's a different story:
  they end up in the same group. Don't ask me why. I tried with both
  virtio-iommu and intel iommu and I end up with the same topology.
  That looks really weird to me.

I initially thought this was an ACS issue but I am now puzzled.

Thanks!

Eric
>
> Thanks
> Zhenzhong
>
>> This is not trivial to fix because theoretically the 1st attached
>> devices could already have allocated IOVAs within the reserved regions
>> of the second device. Also we are somehow hijacking the reserved
>> memory regions to model the geometry of the physical IOMMU so not sure
>> any attempt to fix that upstream will be accepted. At the moment one
>> solution is to make sure assigned devices end up in singleton group.
>> Another solution is to work on a different approach where the gaw
>> can be passed as an option to the virtio-iommu device, similarly at
>> what is done with intel iommu.
>>
>> This series can be found at:
>> https://github.com/eauger/qemu/tree/hotplug-resv-rfc
>>
>>
>> Eric Auger (7):
>>  hw/pci: Introduce PCIIOMMUOps::set_host_iova_regions
>>  hw/pci: Introduce pci_device_iommu_bus
>>  vfio/pci: Pass the usable IOVA ranges through PCIIOMMUOps
>>  virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions
>>  virtio-iommu: Remove the implementation of iommu_set_iova_ranges
>>  hw/vfio: Remove memory_region_iommu_set_iova_ranges() call
>>  memory: Remove IOMMU MR iommu_set_iova_range API
>>
>> include/exec/memory.h    |  32 -------
>> include/hw/pci/pci.h     |  16 ++++
>> hw/pci/pci.c             |  16 ++++
>> hw/vfio/common.c         |  10 --
>> hw/vfio/pci.c            |  27 ++++++
>> hw/virtio/virtio-iommu.c | 201 ++++++++++++++++++++-------------------
>> system/memory.c          |  13 ---
>> 7 files changed, 160 insertions(+), 155 deletions(-)
>>
>> --
>> 2.41.0




* Re: [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions
  2024-01-18  7:43   ` Duan, Zhenzhong
@ 2024-01-18 12:25     ` Eric Auger
  2024-01-19  7:00       ` Duan, Zhenzhong
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Auger @ 2024-01-18 12:25 UTC (permalink / raw)
  To: Duan, Zhenzhong, eric.auger.pro, qemu-devel, qemu-arm,
	jean-philippe, alex.williamson, peter.maydell, peterx, yanghliu,
	pbonzini
  Cc: mst, clg

Hi Zhenzhong,
On 1/18/24 08:43, Duan, Zhenzhong wrote:
> Hi Eric,
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Subject: [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps
>> set_host_resv_regions
>>
>> Reuse the implementation of virtio_iommu_set_iova_ranges() which
>> will be removed in subsequent patches.
>>
>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>> ---
>> hw/virtio/virtio-iommu.c | 134 +++++++++++++++++++++++++++++---------
>> -
>> 1 file changed, 101 insertions(+), 33 deletions(-)
>>
>> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
>> index 8a4bd933c6..716a3fcfbf 100644
>> --- a/hw/virtio/virtio-iommu.c
>> +++ b/hw/virtio/virtio-iommu.c
>> @@ -461,8 +461,109 @@ static AddressSpace
>> *virtio_iommu_find_add_as(PCIBus *bus, void *opaque,
>>     return &sdev->as;
>> }
>>
>> +/**
>> + * rebuild_resv_regions: rebuild resv regions with both the
>> + * info of host resv ranges and property set resv ranges
>> + */
>> +static int rebuild_resv_regions(IOMMUDevice *sdev)
>> +{
>> +    GList *l;
>> +    int i = 0;
>> +
>> +    /* free the existing list and rebuild it from scratch */
>> +    g_list_free_full(sdev->resv_regions, g_free);
>> +    sdev->resv_regions = NULL;
>> +
>> +    /* First add host reserved regions if any, all tagged as RESERVED */
>> +    for (l = sdev->host_resv_ranges; l; l = l->next) {
>> +        ReservedRegion *reg = g_new0(ReservedRegion, 1);
>> +        Range *r = (Range *)l->data;
>> +
>> +        reg->type = VIRTIO_IOMMU_RESV_MEM_T_RESERVED;
>> +        range_set_bounds(&reg->range, range_lob(r), range_upb(r));
>> +        sdev->resv_regions = resv_region_list_insert(sdev->resv_regions, reg);
>> +        trace_virtio_iommu_host_resv_regions(sdev-
>>> iommu_mr.parent_obj.name, i,
>> +                                             range_lob(&reg->range),
>> +                                             range_upb(&reg->range));
>> +        i++;
>> +    }
>> +    /*
>> +     * then add higher priority reserved regions set by the machine
>> +     * through properties
>> +     */
>> +    add_prop_resv_regions(sdev);
>> +    return 0;
>> +}
>> +
>> +static int virtio_iommu_set_host_iova_ranges(PCIBus *bus, void *opaque,
>> +                                             int devfn, GList *iova_ranges,
>> +                                             Error **errp)
>> +{
>> +    VirtIOIOMMU *s = opaque;
>> +    IOMMUPciBus *sbus = g_hash_table_lookup(s->as_by_busptr, bus);
>> +    IOMMUDevice *sdev;
>> +    GList *current_ranges;
>> +    GList *l, *tmp, *new_ranges = NULL;
>> +    int ret = -EINVAL;
>> +
>> +    if (!sbus) {
>> +        error_report("%s no sbus", __func__);
>> +    }
> Do we plan to support multiple devices in same iommu group?
> as_by_busptr is hashed by bus which is an aliased bus of the device.
> So sbus may be NULL if device's bus is different from aliased bus.
If I understand your remark properly, this means that
virtio_iommu_set_host_iova_ranges() should take the aliased bus as an
argument and not the bus, right?
I think we shall support non-singleton groups too.

>
>> +
>> +    sdev = sbus->pbdev[devfn];
>> +
>> +    current_ranges = sdev->host_resv_ranges;
>> +
>> +    warn_report("%s: host_resv_regions=%d", __func__, !!sdev-
>>> host_resv_ranges);
>> +    /* check that each new resv region is included in an existing one */
>> +    if (sdev->host_resv_ranges) {
> May be we could just error out as vfio_realize should not call
> set_host_iova_ranges() twice.
So if we have several devices in the group, set_host_iova_ranges()
may be called several times, right?

Eric
>
> Thanks
> Zhenzhong
>> +        range_inverse_array(iova_ranges,
>> +                            &new_ranges,
>> +                            0, UINT64_MAX);
>> +
>> +        for (tmp = new_ranges; tmp; tmp = tmp->next) {
>> +            Range *newr = (Range *)tmp->data;
>> +            bool included = false;
>> +
>> +            for (l = current_ranges; l; l = l->next) {
>> +                Range * r = (Range *)l->data;
>> +
>> +                if (range_contains_range(r, newr)) {
>> +                    included = true;
>> +                    break;
>> +                }
>> +            }
>> +            if (!included) {
>> +                goto error;
>> +            }
>> +        }
>> +        /* all new reserved ranges are included in existing ones */
>> +        ret = 0;
>> +        goto out;
>> +    }
>> +
>> +    if (sdev->probe_done) {
>> +        warn_report("%s: Notified about new host reserved regions after probe",
>> +                    __func__);
>> +    }
>> +
>> +    range_inverse_array(iova_ranges,
>> +                        &sdev->host_resv_ranges,
>> +                        0, UINT64_MAX);
>> +    rebuild_resv_regions(sdev);
>> +
>> +    return 0;
>> +error:
>> +    error_setg(errp, "%s Conflicting host reserved ranges set!",
>> +               __func__);
>> +out:
>> +    g_list_free_full(new_ranges, g_free);
>> +    return ret;
>> +}
>> +
>> static const PCIIOMMUOps virtio_iommu_ops = {
>>     .get_address_space = virtio_iommu_find_add_as,
>> +    .set_host_iova_ranges = virtio_iommu_set_host_iova_ranges,
>> };
>>
>> static int virtio_iommu_attach(VirtIOIOMMU *s,
>> @@ -1158,39 +1259,6 @@ static int
>> virtio_iommu_set_page_size_mask(IOMMUMemoryRegion *mr,
>>     return 0;
>> }
>>
>> -/**
>> - * rebuild_resv_regions: rebuild resv regions with both the
>> - * info of host resv ranges and property set resv ranges
>> - */
>> -static int rebuild_resv_regions(IOMMUDevice *sdev)
>> -{
>> -    GList *l;
>> -    int i = 0;
>> -
>> -    /* free the existing list and rebuild it from scratch */
>> -    g_list_free_full(sdev->resv_regions, g_free);
>> -    sdev->resv_regions = NULL;
>> -
>> -    /* First add host reserved regions if any, all tagged as RESERVED */
>> -    for (l = sdev->host_resv_ranges; l; l = l->next) {
>> -        ReservedRegion *reg = g_new0(ReservedRegion, 1);
>> -        Range *r = (Range *)l->data;
>> -
>> -        reg->type = VIRTIO_IOMMU_RESV_MEM_T_RESERVED;
>> -        range_set_bounds(&reg->range, range_lob(r), range_upb(r));
>> -        sdev->resv_regions = resv_region_list_insert(sdev->resv_regions, reg);
>> -        trace_virtio_iommu_host_resv_regions(sdev->iommu_mr.parent_obj.name, i,
>> -                                             range_lob(&reg->range),
>> -                                             range_upb(&reg->range));
>> -        i++;
>> -    }
>> -    /*
>> -     * then add higher priority reserved regions set by the machine
>> -     * through properties
>> -     */
>> -    add_prop_resv_regions(sdev);
>> -    return 0;
>> -}
>>
>> /**
>>  * virtio_iommu_set_iova_ranges: Conveys the usable IOVA ranges
>> --
>> 2.41.0



^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices
  2024-01-18  9:43   ` Eric Auger
@ 2024-01-19  6:46     ` Duan, Zhenzhong
  2024-01-25 18:48     ` Jean-Philippe Brucker
  1 sibling, 0 replies; 20+ messages in thread
From: Duan, Zhenzhong @ 2024-01-19  6:46 UTC (permalink / raw)
  To: eric.auger, eric.auger.pro, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, peterx, yanghliu, pbonzini
  Cc: mst, clg



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry
>handling for hotplugged devices
>
>Hi Zhenzhong,
>On 1/18/24 08:10, Duan, Zhenzhong wrote:
>> Hi Eric,
>>
>>> -----Original Message-----
>>> From: Eric Auger <eric.auger@redhat.com>
>>> Cc: mst@redhat.com; clg@redhat.com
>>> Subject: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry
>handling
>>> for hotplugged devices
>>>
>>> In [1] we attempted to fix a case where a VFIO-PCI device protected
>>> with a virtio-iommu was assigned to an x86 guest. On x86 the physical
>>> IOMMU may have an address width (gaw) of 39 or 48 bits whereas the
>>> virtio-iommu used to expose a 64b address space by default.
>>> Hence the guest was trying to use the full 64b space and we hit
>>> DMA MAP failures. To work around this issue we managed to pass
>>> usable IOVA regions (excluding the out of range space) from VFIO
>>> to the virtio-iommu device. This was made feasible by introducing
>>> a new IOMMU Memory Region callback dubbed
>iommu_set_iova_regions().
>>> This latter gets called when the IOMMU MR is enabled which
>>> causes the vfio_listener_region_add() to be called.
>>>
>>> However with VFIO-PCI hotplug, this technique fails due to the
>>> race between the call to the callback in the add memory listener
>>> and the virtio-iommu probe request. Indeed the probe request gets
>>> called before the attach to the domain. So in that case the usable
>>> regions are communicated after the probe request and fail to be
>>> conveyed to the guest. To be honest the problem was hinted by
>>> Jean-Philippe in [1] and I should have been more careful at
>>> listening to him and testing with hotplug :-(
>> It looks the global virtio_iommu_config.bypass is never cleared in guest.
>> When guest virtio_iommu driver enable IOMMU, should it clear this
>> bypass attribute?
>> If it could be cleared in viommu_probe(), then qemu will call
>> virtio_iommu_set_config() then virtio_iommu_switch_address_space_all()
>> to enable IOMMU MR. Then both coldplugged and hotplugged devices will
>work.
>
>this field is iommu wide while the probe applies to a single device. In
>general I would prefer not to be dependent on the MR enablement. We know
>that the device is likely to be protected and we can collect its
>requirements beforehand.

Agree that your new patch is cleaner.

>>
>> Intel iommu has a similar bit in register GCMD_REG.TE, when guest
>> intel_iommu driver probe set it, on qemu side,
>vtd_address_space_refresh_all()
>> is called to enable IOMMU MRs.
>interesting.
>
>Would be curious to get Jean Philippe's pov.
>>
>>> For coldplugged device the technique works because we make sure all
>>> the IOMMU MR are enabled once on the machine init done: 94df5b2180
>>> ("virtio-iommu: Fix 64kB host page size VFIO device assignment")
>>> for granule freeze. But I would be keen to get rid of this trick.
>>>
>>> Using an IOMMU MR Ops is unpractical because this relies on the
>IOMMU
>>> MR to have been enabled and the corresponding
>vfio_listener_region_add()
>>> to be executed. Instead this series proposes to replace the usage of this
>>> API by the recently introduced PCIIOMMUOps: ba7d12eb8c  ("hw/pci:
>>> modify
>>> pci_setup_iommu() to set PCIIOMMUOps"). That way, the callback can be
>>> called earlier, once the usable IOVA regions have been collected by
>>> VFIO, without the need for the IOMMU MR to be enabled.
>>>
>>> This looks cleaner. In the short term this may also be used for
>>> passing the page size mask, which would allow to get rid of the
>>> hacky transient IOMMU MR enablement mentionned above.
>>>
>>> [1] [PATCH v4 00/12] VIRTIO-IOMMU/VFIO: Don't assume 64b IOVA
>space
>>>    https://lore.kernel.org/all/20231019134651.842175-1-
>>> eric.auger@redhat.com/
>>>
>>> [2] https://lore.kernel.org/all/20230929161547.GB2957297@myrica/
>>>
>>> Extra Notes:
>>> With that series, the reserved memory regions are communicated on
>time
>>> so that the virtio-iommu probe request grabs them. However this is not
>>> sufficient. In some cases (my case), I still see some DMA MAP failures
>>> and the guest keeps on using IOVA ranges outside the geometry of the
>>> physical IOMMU. This is due to the fact the VFIO-PCI device is in the
>>> same iommu group as the pcie root port. Normally the kernel
>>> iova_reserve_iommu_regions (dma-iommu.c) is supposed to call
>>> reserve_iova()
>>> for each reserved IOVA, which carves them out of the allocator. When
>>> iommu_dma_init_domain() gets called for the hotplugged vfio-pci device
>>> the iova domain is already allocated and set and we don't call
>>> iova_reserve_iommu_regions() again for the vfio-pci device. So its
>>> corresponding reserved regions are not properly taken into account.
>> I suspect there is the same issue with coldplugged devices. If those devices
>> are in the same group, iova_reserve_iommu_regions() is only called
>> for the first device. The other devices' reserved regions are missed.
>
>Correct
>>
>> Curious how you make passthrough device and pcie root port under same
>> group.
>> When I start a x86 guest with passthrough device, I see passthrough
>> device and pcie root port are in different group.
>>
>> -[0000:00]-+-00.0
>>            +-01.0
>>            +-02.0
>>            +-03.0-[01]----00.0
>>
>> /sys/kernel/iommu_groups/3/devices:
>> 0000:00:03.0
>> /sys/kernel/iommu_groups/7/devices:
>> 0000:01:00.0
>>
>> My qemu cmdline:
>> -device pcie-root-port,id=root0,slot=0
>> -device vfio-pci,host=6f:01.0,id=vfio0,bus=root0
>
>I just replayed the scenario:
>- if you have a coldplugged vfio-pci device, the pci root port and the
>passthroughed device end up in different iommu groups. On my end I use
>ioh3420 but you confirmed that's the same for the generic pcie-root-port
>- however if you hotplug the vfio-pci device that's a different story:
>they end up in the same group. Don't ask me why. I tried with
>both virtio-iommu and intel iommu and I end up with the same topology.
>That looks really weird to me.

That's strange. I tested two vfio devices with ioh3420, one coldplugged, the other hotplugged.
Each ioh3420 and its vfio device end up in the same group.

-[0000:00]-+-00.0
           +-01.0
           +-02.0
           +-03.0-[01]----00.0
           +-04.0-[02]----00.0

/sys/kernel/iommu_groups/3/devices:
0000:00:03.0  0000:01:00.0

/sys/kernel/iommu_groups/4/devices:
0000:00:04.0  0000:02:00.0

Thanks
Zhenzhong

>
>I initially thought this was an ACS issue but I am now puzzled.
>
>Thanks!
>
>Eric
>>
>> Thanks
>> Zhenzhong
>>
>>> This is not trivial to fix because theoretically the 1st attached
>>> devices could already have allocated IOVAs within the reserved regions
>>> of the second device. Also we are somehow hijacking the reserved
>>> memory regions to model the geometry of the physical IOMMU so not
>sure
>>> any attempt to fix that upstream will be accepted. At the moment one
>>> solution is to make sure assigned devices end up in singleton group.
>>> Another solution is to work on a different approach where the gaw
>>> can be passed as an option to the virtio-iommu device, similarly at
>>> what is done with intel iommu.
>>>
>>> This series can be found at:
>>> https://github.com/eauger/qemu/tree/hotplug-resv-rfc
>>>
>>>
>>> Eric Auger (7):
>>>  hw/pci: Introduce PCIIOMMUOps::set_host_iova_regions
>>>  hw/pci: Introduce pci_device_iommu_bus
>>>  vfio/pci: Pass the usable IOVA ranges through PCIIOMMUOps
>>>  virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions
>>>  virtio-iommu: Remove the implementation of iommu_set_iova_ranges
>>>  hw/vfio: Remove memory_region_iommu_set_iova_ranges() call
>>>  memory: Remove IOMMU MR iommu_set_iova_range API
>>>
>>> include/exec/memory.h    |  32 -------
>>> include/hw/pci/pci.h     |  16 ++++
>>> hw/pci/pci.c             |  16 ++++
>>> hw/vfio/common.c         |  10 --
>>> hw/vfio/pci.c            |  27 ++++++
>>> hw/virtio/virtio-iommu.c | 201 ++++++++++++++++++++-------------------
>>> system/memory.c          |  13 ---
>>> 7 files changed, 160 insertions(+), 155 deletions(-)
>>>
>>> --
>>> 2.41.0


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions
  2024-01-18 12:25     ` Eric Auger
@ 2024-01-19  7:00       ` Duan, Zhenzhong
  2024-01-22  7:17         ` Duan, Zhenzhong
  0 siblings, 1 reply; 20+ messages in thread
From: Duan, Zhenzhong @ 2024-01-19  7:00 UTC (permalink / raw)
  To: eric.auger, eric.auger.pro, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, peterx, yanghliu, pbonzini
  Cc: mst, clg



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps
>set_host_resv_regions
>
>Hi Zhenzhong,
>On 1/18/24 08:43, Duan, Zhenzhong wrote:
>> Hi Eric,
>>
>>> -----Original Message-----
>>> From: Eric Auger <eric.auger@redhat.com>
>>> Subject: [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps
>>> set_host_resv_regions
>>>
>>> Reuse the implementation of virtio_iommu_set_iova_ranges() which
>>> will be removed in subsequent patches.
>>>
>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>> ---
>>> hw/virtio/virtio-iommu.c | 134 +++++++++++++++++++++++++++++----------
>>> 1 file changed, 101 insertions(+), 33 deletions(-)
>>>
>>> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
>>> index 8a4bd933c6..716a3fcfbf 100644
>>> --- a/hw/virtio/virtio-iommu.c
>>> +++ b/hw/virtio/virtio-iommu.c
>>> @@ -461,8 +461,109 @@ static AddressSpace
>>> *virtio_iommu_find_add_as(PCIBus *bus, void *opaque,
>>>     return &sdev->as;
>>> }
>>>
>>> +/**
>>> + * rebuild_resv_regions: rebuild resv regions with both the
>>> + * info of host resv ranges and property set resv ranges
>>> + */
>>> +static int rebuild_resv_regions(IOMMUDevice *sdev)
>>> +{
>>> +    GList *l;
>>> +    int i = 0;
>>> +
>>> +    /* free the existing list and rebuild it from scratch */
>>> +    g_list_free_full(sdev->resv_regions, g_free);
>>> +    sdev->resv_regions = NULL;
>>> +
>>> +    /* First add host reserved regions if any, all tagged as RESERVED */
>>> +    for (l = sdev->host_resv_ranges; l; l = l->next) {
>>> +        ReservedRegion *reg = g_new0(ReservedRegion, 1);
>>> +        Range *r = (Range *)l->data;
>>> +
>>> +        reg->type = VIRTIO_IOMMU_RESV_MEM_T_RESERVED;
>>> +        range_set_bounds(&reg->range, range_lob(r), range_upb(r));
>>> +        sdev->resv_regions = resv_region_list_insert(sdev->resv_regions, reg);
>>> +        trace_virtio_iommu_host_resv_regions(sdev->iommu_mr.parent_obj.name, i,
>>> +                                             range_lob(&reg->range),
>>> +                                             range_upb(&reg->range));
>>> +        i++;
>>> +    }
>>> +    /*
>>> +     * then add higher priority reserved regions set by the machine
>>> +     * through properties
>>> +     */
>>> +    add_prop_resv_regions(sdev);
>>> +    return 0;
>>> +}
>>> +
>>> +static int virtio_iommu_set_host_iova_ranges(PCIBus *bus, void *opaque,
>>> +                                             int devfn, GList *iova_ranges,
>>> +                                             Error **errp)
>>> +{
>>> +    VirtIOIOMMU *s = opaque;
>>> +    IOMMUPciBus *sbus = g_hash_table_lookup(s->as_by_busptr, bus);
>>> +    IOMMUDevice *sdev;
>>> +    GList *current_ranges;
>>> +    GList *l, *tmp, *new_ranges = NULL;
>>> +    int ret = -EINVAL;
>>> +
>>> +    if (!sbus) {
>>> +        error_report("%s no sbus", __func__);
>>> +    }
>> Do we plan to support multiple devices in same iommu group?
>> as_by_busptr is hashed by bus which is an aliased bus of the device.
>> So sbus may be NULL if device's bus is different from aliased bus.
>If I understand you remark properly this means that
>
>virtio_iommu_set_host_iova_ranges should take as arg the aliased bus and
>not the bus, right?
>I think we shall support non singleton groups too.

Not exactly. I think we should pass the device's real BDF, not the aliased BDF.
Otherwise we set up the reserved ranges of all devices in the same group on the
aliased BDF.

I suspect that the hash lookup with the real BDF index will return NULL if
multiple devices are in the same group. If that's true, then iova_ranges is
never passed to the guest.
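
To make the failure mode concrete, here is a rough sketch (not code from this
series) of the lookup path under discussion, using only the structures visible
in the quoted patch; the fallback hinted at in the comment is hypothetical:

static IOMMUDevice *lookup_sdev(VirtIOIOMMU *s, PCIBus *bus, int devfn)
{
    /* as_by_busptr is keyed by the bus passed to get_address_space() */
    IOMMUPciBus *sbus = g_hash_table_lookup(s->as_by_busptr, bus);

    if (!sbus) {
        /*
         * If set_host_iova_ranges() is called with the device's real
         * bus while the hash table was populated with the aliased bus,
         * we end up here and the host ranges are silently dropped.
         * A possible fix is to translate (bus, devfn) to the aliased
         * (bus, devfn) before the lookup, or at least to fail loudly.
         */
        return NULL;
    }
    return sbus->pbdev[devfn];
}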

>
>>
>>> +
>>> +    sdev = sbus->pbdev[devfn];
>>> +
>>> +    current_ranges = sdev->host_resv_ranges;
>>> +
>>> +    warn_report("%s: host_resv_regions=%d", __func__, !!sdev->host_resv_ranges);
>>> +    /* check that each new resv region is included in an existing one */
>>> +    if (sdev->host_resv_ranges) {
>> May be we could just error out as vfio_realize should not call
>> set_host_iova_ranges() twice.
>so if we have several devices in the group,
>
>set_host_iova_ranges()

Yes,

>
> may be called several times. Right?

but it should be called on a different sdev each time, due to the different real BDFs, not only on the aliased BDF.

Thanks
Zhenzhong

>
>Eric
>>
>> Thanks
>> Zhenzhong
>>> +        range_inverse_array(iova_ranges,
>>> +                            &new_ranges,
>>> +                            0, UINT64_MAX);
>>> +
>>> +        for (tmp = new_ranges; tmp; tmp = tmp->next) {
>>> +            Range *newr = (Range *)tmp->data;
>>> +            bool included = false;
>>> +
>>> +            for (l = current_ranges; l; l = l->next) {
>>> +                Range * r = (Range *)l->data;
>>> +
>>> +                if (range_contains_range(r, newr)) {
>>> +                    included = true;
>>> +                    break;
>>> +                }
>>> +            }
>>> +            if (!included) {
>>> +                goto error;
>>> +            }
>>> +        }
>>> +        /* all new reserved ranges are included in existing ones */
>>> +        ret = 0;
>>> +        goto out;
>>> +    }
>>> +
>>> +    if (sdev->probe_done) {
>>> +        warn_report("%s: Notified about new host reserved regions after probe",
>>> +                    __func__);
>>> +    }
>>> +
>>> +    range_inverse_array(iova_ranges,
>>> +                        &sdev->host_resv_ranges,
>>> +                        0, UINT64_MAX);
>>> +    rebuild_resv_regions(sdev);
>>> +
>>> +    return 0;
>>> +error:
>>> +    error_setg(errp, "%s Conflicting host reserved ranges set!",
>>> +               __func__);
>>> +out:
>>> +    g_list_free_full(new_ranges, g_free);
>>> +    return ret;
>>> +}
>>> +
>>> static const PCIIOMMUOps virtio_iommu_ops = {
>>>     .get_address_space = virtio_iommu_find_add_as,
>>> +    .set_host_iova_ranges = virtio_iommu_set_host_iova_ranges,
>>> };
>>>
>>> static int virtio_iommu_attach(VirtIOIOMMU *s,
>>> @@ -1158,39 +1259,6 @@ static int
>>> virtio_iommu_set_page_size_mask(IOMMUMemoryRegion *mr,
>>>     return 0;
>>> }
>>>
>>> -/**
>>> - * rebuild_resv_regions: rebuild resv regions with both the
>>> - * info of host resv ranges and property set resv ranges
>>> - */
>>> -static int rebuild_resv_regions(IOMMUDevice *sdev)
>>> -{
>>> -    GList *l;
>>> -    int i = 0;
>>> -
>>> -    /* free the existing list and rebuild it from scratch */
>>> -    g_list_free_full(sdev->resv_regions, g_free);
>>> -    sdev->resv_regions = NULL;
>>> -
>>> -    /* First add host reserved regions if any, all tagged as RESERVED */
>>> -    for (l = sdev->host_resv_ranges; l; l = l->next) {
>>> -        ReservedRegion *reg = g_new0(ReservedRegion, 1);
>>> -        Range *r = (Range *)l->data;
>>> -
>>> -        reg->type = VIRTIO_IOMMU_RESV_MEM_T_RESERVED;
>>> -        range_set_bounds(&reg->range, range_lob(r), range_upb(r));
>>> -        sdev->resv_regions = resv_region_list_insert(sdev->resv_regions, reg);
>>> -        trace_virtio_iommu_host_resv_regions(sdev->iommu_mr.parent_obj.name, i,
>>> -                                             range_lob(&reg->range),
>>> -                                             range_upb(&reg->range));
>>> -        i++;
>>> -    }
>>> -    /*
>>> -     * then add higher priority reserved regions set by the machine
>>> -     * through properties
>>> -     */
>>> -    add_prop_resv_regions(sdev);
>>> -    return 0;
>>> -}
>>>
>>> /**
>>>  * virtio_iommu_set_iova_ranges: Conveys the usable IOVA ranges
>>> --
>>> 2.41.0


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions
  2024-01-19  7:00       ` Duan, Zhenzhong
@ 2024-01-22  7:17         ` Duan, Zhenzhong
  0 siblings, 0 replies; 20+ messages in thread
From: Duan, Zhenzhong @ 2024-01-22  7:17 UTC (permalink / raw)
  To: eric.auger, eric.auger.pro, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, peterx, yanghliu, pbonzini
  Cc: mst, clg

Hi Eric,

>-----Original Message-----
>From: Duan, Zhenzhong
>Subject: RE: [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps
>set_host_resv_regions
>
>
>
>>-----Original Message-----
>>From: Eric Auger <eric.auger@redhat.com>
>>Subject: Re: [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps
>>set_host_resv_regions
>>
>>Hi Zhenzhong,
>>On 1/18/24 08:43, Duan, Zhenzhong wrote:
>>> Hi Eric,
>>>
>>>> -----Original Message-----
>>>> From: Eric Auger <eric.auger@redhat.com>
>>>> Subject: [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps
>>>> set_host_resv_regions
>>>>
>>>> Reuse the implementation of virtio_iommu_set_iova_ranges() which
>>>> will be removed in subsequent patches.
>>>>
>>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>>> ---
>>>> hw/virtio/virtio-iommu.c | 134 +++++++++++++++++++++++++++++----------
>>>> 1 file changed, 101 insertions(+), 33 deletions(-)
>>>>
>>>> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
>>>> index 8a4bd933c6..716a3fcfbf 100644
>>>> --- a/hw/virtio/virtio-iommu.c
>>>> +++ b/hw/virtio/virtio-iommu.c
>>>> @@ -461,8 +461,109 @@ static AddressSpace
>>>> *virtio_iommu_find_add_as(PCIBus *bus, void *opaque,
>>>>     return &sdev->as;
>>>> }
>>>>
>>>> +/**
>>>> + * rebuild_resv_regions: rebuild resv regions with both the
>>>> + * info of host resv ranges and property set resv ranges
>>>> + */
>>>> +static int rebuild_resv_regions(IOMMUDevice *sdev)
>>>> +{
>>>> +    GList *l;
>>>> +    int i = 0;
>>>> +
>>>> +    /* free the existing list and rebuild it from scratch */
>>>> +    g_list_free_full(sdev->resv_regions, g_free);
>>>> +    sdev->resv_regions = NULL;
>>>> +
>>>> +    /* First add host reserved regions if any, all tagged as RESERVED */
>>>> +    for (l = sdev->host_resv_ranges; l; l = l->next) {
>>>> +        ReservedRegion *reg = g_new0(ReservedRegion, 1);
>>>> +        Range *r = (Range *)l->data;
>>>> +
>>>> +        reg->type = VIRTIO_IOMMU_RESV_MEM_T_RESERVED;
>>>> +        range_set_bounds(&reg->range, range_lob(r), range_upb(r));
>>>> +        sdev->resv_regions = resv_region_list_insert(sdev->resv_regions, reg);
>>>> +        trace_virtio_iommu_host_resv_regions(sdev->iommu_mr.parent_obj.name, i,
>>>> +                                             range_lob(&reg->range),
>>>> +                                             range_upb(&reg->range));
>>>> +        i++;
>>>> +    }
>>>> +    /*
>>>> +     * then add higher priority reserved regions set by the machine
>>>> +     * through properties
>>>> +     */
>>>> +    add_prop_resv_regions(sdev);
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static int virtio_iommu_set_host_iova_ranges(PCIBus *bus, void *opaque,
>>>> +                                             int devfn, GList *iova_ranges,
>>>> +                                             Error **errp)
>>>> +{
>>>> +    VirtIOIOMMU *s = opaque;
>>>> +    IOMMUPciBus *sbus = g_hash_table_lookup(s->as_by_busptr, bus);
>>>> +    IOMMUDevice *sdev;
>>>> +    GList *current_ranges;
>>>> +    GList *l, *tmp, *new_ranges = NULL;
>>>> +    int ret = -EINVAL;
>>>> +
>>>> +    if (!sbus) {
>>>> +        error_report("%s no sbus", __func__);
>>>> +    }
>>> Do we plan to support multiple devices in same iommu group?
>>> as_by_busptr is hashed by bus which is an aliased bus of the device.
>>> So sbus may be NULL if device's bus is different from aliased bus.
>>If I understand you remark properly this means that
>>
>>virtio_iommu_set_host_iova_ranges should take as arg the aliased bus and
>>not the bus, right?
>>I think we shall support non singleton groups too.
>
>Not exactly. I think we should pass device's real BDF, not aliased BDF. Or else
>we setup reserved ranges of all devices in same group to aliased BDF.
>
>I'm just suspecting that the hash lookup with real BDF index will return NULL
>if
>multiple devices are in same group. If that’s true, then iova_ranges is never
>passed to guest.

I guess https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg04153.html
may help the discussion here; it is a complement to this series that makes
multiple devices in the same IOMMU group work. Comments welcome.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices
  2024-01-18  9:43   ` Eric Auger
  2024-01-19  6:46     ` Duan, Zhenzhong
@ 2024-01-25 18:48     ` Jean-Philippe Brucker
  2024-01-29 16:38       ` Eric Auger
  1 sibling, 1 reply; 20+ messages in thread
From: Jean-Philippe Brucker @ 2024-01-25 18:48 UTC (permalink / raw)
  To: Eric Auger
  Cc: Duan, Zhenzhong, eric.auger.pro, qemu-devel, qemu-arm,
	alex.williamson, peter.maydell, peterx, yanghliu, pbonzini, mst,
	clg

Hi,

On Thu, Jan 18, 2024 at 10:43:55AM +0100, Eric Auger wrote:
> Hi Zhenzhong,
> On 1/18/24 08:10, Duan, Zhenzhong wrote:
> > Hi Eric,
> >
> >> -----Original Message-----
> >> From: Eric Auger <eric.auger@redhat.com>
> >> Cc: mst@redhat.com; clg@redhat.com
> >> Subject: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling
> >> for hotplugged devices
> >>
> >> In [1] we attempted to fix a case where a VFIO-PCI device protected
> >> with a virtio-iommu was assigned to an x86 guest. On x86 the physical
> >> IOMMU may have an address width (gaw) of 39 or 48 bits whereas the
> >> virtio-iommu used to expose a 64b address space by default.
> >> Hence the guest was trying to use the full 64b space and we hit
> >> DMA MAP failures. To work around this issue we managed to pass
> >> usable IOVA regions (excluding the out of range space) from VFIO
> >> to the virtio-iommu device. This was made feasible by introducing
> >> a new IOMMU Memory Region callback dubbed iommu_set_iova_regions().
> >> This latter gets called when the IOMMU MR is enabled which
> >> causes the vfio_listener_region_add() to be called.
> >>
> >> However with VFIO-PCI hotplug, this technique fails due to the
> >> race between the call to the callback in the add memory listener
> >> and the virtio-iommu probe request. Indeed the probe request gets
> >> called before the attach to the domain. So in that case the usable
> >> regions are communicated after the probe request and fail to be
> >> conveyed to the guest. To be honest the problem was hinted by
> >> Jean-Philippe in [1] and I should have been more careful at
> >> listening to him and testing with hotplug :-(
> > It looks the global virtio_iommu_config.bypass is never cleared in guest.
> > When guest virtio_iommu driver enable IOMMU, should it clear this
> > bypass attribute?
> > If it could be cleared in viommu_probe(), then qemu will call
> > virtio_iommu_set_config() then virtio_iommu_switch_address_space_all()
> > to enable IOMMU MR. Then both coldplugged and hotplugged devices will work.
> 
> this field is iommu wide while the probe applies on a one device.In
> general I would prefer not to be dependent on the MR enablement. We know
> that the device is likely to be protected and we can collect its
> requirements beforehand.
> 
> >
> > Intel iommu has a similar bit in register GCMD_REG.TE, when guest
> > intel_iommu driver probe set it, on qemu side, vtd_address_space_refresh_all()
> > is called to enable IOMMU MRs.
> interesting.
> 
> Would be curious to get Jean Philippe's pov.

I'd rather not rely on this, it's hard to justify a driver change based
only on QEMU internals. And QEMU can't count on the driver always clearing
bypass. There could be situations where the guest can't afford to do it,
like if an endpoint is owned by the firmware and has to keep running.

There may be a separate argument for clearing bypass. With a coldplugged
VFIO device the flow is:

1. Map the whole guest address space in VFIO to implement boot-bypass.
   This allocates all guest pages, which takes a while and is wasteful.
   I've actually crashed a host that way, when spawning a guest with too
   much RAM.
2. Start the VM
3. When the virtio-iommu driver attaches a (non-identity) domain to the
   assigned endpoint, then unmap the whole address space in VFIO, and most
   pages are given back to the host.

We can't disable boot-bypass because the BIOS needs it. But instead the
flow could be:

1. Start the VM, with only the virtual endpoints. Nothing to pin.
2. The virtio-iommu driver disables bypass during boot
3. Hotplug the VFIO device. With bypass disabled there is no need to pin
   the whole guest address space, unless the guest explicitly asks for an
   identity domain.

However, I don't know if this is a realistic scenario that will actually
be used.

By the way, do you have an easy way to reproduce the issue described here?
I've had to enable iommu.forcedac=1 on the command-line, otherwise Linux
just allocates 32-bit IOVAs.
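
(For anyone trying to reproduce this: something along the following lines
should exercise the 64-bit path; the kernel image and root= argument are
placeholders:

  qemu-system-x86_64 ... -kernel vmlinuz \
      -append "root=/dev/vda1 iommu.forcedac=1"

With iommu.forcedac=1 the guest dma-iommu layer skips the 32-bit allocation
attempt and hands out IOVAs up to the full DMA mask, so out-of-geometry
mappings show up even with light DMA traffic.)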

> >
> >> For coldplugged device the technique works because we make sure all
> >> the IOMMU MR are enabled once on the machine init done: 94df5b2180
> >> ("virtio-iommu: Fix 64kB host page size VFIO device assignment")
> >> for granule freeze. But I would be keen to get rid of this trick.
> >>
> >> Using an IOMMU MR Ops is unpractical because this relies on the IOMMU
> >> MR to have been enabled and the corresponding vfio_listener_region_add()
> >> to be executed. Instead this series proposes to replace the usage of this
> >> API by the recently introduced PCIIOMMUOps: ba7d12eb8c  ("hw/pci:
> >> modify
> >> pci_setup_iommu() to set PCIIOMMUOps"). That way, the callback can be
> >> called earlier, once the usable IOVA regions have been collected by
> >> VFIO, without the need for the IOMMU MR to be enabled.
> >>
> >> This looks cleaner. In the short term this may also be used for
> >> passing the page size mask, which would allow to get rid of the
> >> hacky transient IOMMU MR enablement mentionned above.
> >>
> >> [1] [PATCH v4 00/12] VIRTIO-IOMMU/VFIO: Don't assume 64b IOVA space
> >>    https://lore.kernel.org/all/20231019134651.842175-1-
> >> eric.auger@redhat.com/
> >>
> >> [2] https://lore.kernel.org/all/20230929161547.GB2957297@myrica/
> >>
> >> Extra Notes:
> >> With that series, the reserved memory regions are communicated on time
> >> so that the virtio-iommu probe request grabs them. However this is not
> >> sufficient. In some cases (my case), I still see some DMA MAP failures
> >> and the guest keeps on using IOVA ranges outside the geometry of the
> >> physical IOMMU. This is due to the fact the VFIO-PCI device is in the
> >> same iommu group as the pcie root port. Normally the kernel
> >> iova_reserve_iommu_regions (dma-iommu.c) is supposed to call
> >> reserve_iova()
> >> for each reserved IOVA, which carves them out of the allocator. When
> >> iommu_dma_init_domain() gets called for the hotplugged vfio-pci device
> >> the iova domain is already allocated and set and we don't call
> >> iova_reserve_iommu_regions() again for the vfio-pci device. So its
> >> corresponding reserved regions are not properly taken into account.
> > I suspect there is same issue with coldplugged devices. If those devices
> > are in same group, get iova_reserve_iommu_regions() is only called
> > for first device. But other devices's reserved regions are missed.
> 
> Correct
> >
> > Curious how you make passthrough device and pcie root port under same
> > group.
> > When I start a x86 guest with passthrough device, I see passthrough
> > device and pcie root port are in different group.
> >
> > -[0000:00]-+-00.0
> >            +-01.0
> >            +-02.0
> >            +-03.0-[01]----00.0
> >
> > /sys/kernel/iommu_groups/3/devices:
> > 0000:00:03.0
> > /sys/kernel/iommu_groups/7/devices:
> > 0000:01:00.0
> >
> > My qemu cmdline:
> > -device pcie-root-port,id=root0,slot=0
> > -device vfio-pci,host=6f:01.0,id=vfio0,bus=root0
> 
> I just replayed the scenario:
> - if you have a coldplugged vfio-pci device, the pci root port and the
> passthroughed device end up in different iommu groups. On my end I use
> ioh3420 but you confirmed that's the same for the generic pcie-root-port
> - however if you hotplug the vfio-pci device that's a different story:
> they end up in the same group. Don't ask me why. I tried with
> both virtio-iommu and intel iommu and I end up with the same topology.
> That looks really weird to me.

It also took me a while to get hotplug to work on x86:
pcie_cap_slot_plug_cb() didn't get called, instead it would call
ich9_pm_device_plug_cb(). Not sure what I'm doing wrong.
To work around that I instantiated a second pxb-pcie root bus and then a
pcie root port on there. So my command-line looks like:

 -device virtio-iommu
 -device pxb-pcie,id=pcie.1,bus_nr=1
 -device pcie-root-port,chassis=2,id=pcie.2,bus=pcie.1

 device_add vfio-pci,host=00:04.0,bus=pcie.2

And somehow pcieport and the assigned device do end up in separate IOMMU
groups.
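
(A quick way to dump the resulting grouping from inside the guest, matching
the /sys listings shown earlier in the thread:

  find /sys/kernel/iommu_groups/ -type l

Each symlink is one device; the parent directory is its group.)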

Thanks,
Jean



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices
  2024-01-25 18:48     ` Jean-Philippe Brucker
@ 2024-01-29 16:38       ` Eric Auger
  2024-01-30 18:22         ` Jean-Philippe Brucker
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Auger @ 2024-01-29 16:38 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Duan, Zhenzhong, eric.auger.pro, qemu-devel, qemu-arm,
	alex.williamson, peter.maydell, peterx, yanghliu, pbonzini, mst,
	clg

Hi Jean,

On 1/25/24 19:48, Jean-Philippe Brucker wrote:
> Hi,
>
> On Thu, Jan 18, 2024 at 10:43:55AM +0100, Eric Auger wrote:
>> Hi Zhenzhong,
>> On 1/18/24 08:10, Duan, Zhenzhong wrote:
>>> Hi Eric,
>>>
>>>> -----Original Message-----
>>>> From: Eric Auger <eric.auger@redhat.com>
>>>> Cc: mst@redhat.com; clg@redhat.com
>>>> Subject: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling
>>>> for hotplugged devices
>>>>
>>>> In [1] we attempted to fix a case where a VFIO-PCI device protected
>>>> with a virtio-iommu was assigned to an x86 guest. On x86 the physical
>>>> IOMMU may have an address width (gaw) of 39 or 48 bits whereas the
>>>> virtio-iommu used to expose a 64b address space by default.
>>>> Hence the guest was trying to use the full 64b space and we hit
>>>> DMA MAP failures. To work around this issue we managed to pass
>>>> usable IOVA regions (excluding the out of range space) from VFIO
>>>> to the virtio-iommu device. This was made feasible by introducing
>>>> a new IOMMU Memory Region callback dubbed iommu_set_iova_regions().
>>>> This latter gets called when the IOMMU MR is enabled which
>>>> causes the vfio_listener_region_add() to be called.
>>>>
>>>> However with VFIO-PCI hotplug, this technique fails due to the
>>>> race between the call to the callback in the add memory listener
>>>> and the virtio-iommu probe request. Indeed the probe request gets
>>>> called before the attach to the domain. So in that case the usable
>>>> regions are communicated after the probe request and fail to be
>>>> conveyed to the guest. To be honest the problem was hinted by
>>>> Jean-Philippe in [1] and I should have been more careful at
>>>> listening to him and testing with hotplug :-(
>>> It looks the global virtio_iommu_config.bypass is never cleared in guest.
>>> When guest virtio_iommu driver enable IOMMU, should it clear this
>>> bypass attribute?
>>> If it could be cleared in viommu_probe(), then qemu will call
>>> virtio_iommu_set_config() then virtio_iommu_switch_address_space_all()
>>> to enable IOMMU MR. Then both coldplugged and hotplugged devices will work.
>> this field is iommu wide while the probe applies on a one device.In
>> general I would prefer not to be dependent on the MR enablement. We know
>> that the device is likely to be protected and we can collect its
>> requirements beforehand.
>>
>>> Intel iommu has a similar bit in register GCMD_REG.TE, when guest
>>> intel_iommu driver probe set it, on qemu side, vtd_address_space_refresh_all()
>>> is called to enable IOMMU MRs.
>> interesting.
>>
>> Would be curious to get Jean Philippe's pov.
> I'd rather not rely on this, it's hard to justify a driver change based
> only on QEMU internals. And QEMU can't count on the driver always clearing
> bypass. There could be situations where the guest can't afford to do it,
> like if an endpoint is owned by the firmware and has to keep running.
>
> There may be a separate argument for clearing bypass. With a coldplugged
> VFIO device the flow is:
>
> 1. Map the whole guest address space in VFIO to implement boot-bypass.
>    This allocates all guest pages, which takes a while and is wasteful.
>    I've actually crashed a host that way, when spawning a guest with too
>    much RAM.
interesting
> 2. Start the VM
> 3. When the virtio-iommu driver attaches a (non-identity) domain to the
>    assigned endpoint, then unmap the whole address space in VFIO, and most
>    pages are given back to the host.
>
> We can't disable boot-bypass because the BIOS needs it. But instead the
> flow could be:
>
> 1. Start the VM, with only the virtual endpoints. Nothing to pin.
> 2. The virtio-iommu driver disables bypass during boot
We needed this boot-bypass mode for booting with virtio-blk-scsi
protected with virtio-iommu for instance.
That was needed because we don't have any virtio-iommu driver in edk2 as
opposed to intel iommu driver, right?
> 3. Hotplug the VFIO device. With bypass disabled there is no need to pin
>    the whole guest address space, unless the guest explicitly asks for an
>    identity domain.
>
> However, I don't know if this is a realistic scenario that will actually
> be used.
>
> By the way, do you have an easy way to reproduce the issue described here?
> I've had to enable iommu.forcedac=1 on the command-line, otherwise Linux
> just allocates 32-bit IOVAs.
I don't have a simple generic reproducer. It happens when assigning this
device:
Ethernet Controller E810-C for QSFP (Ethernet Network Adapter E810-C-Q2)

I have not encountered that issue with another device yet.
I see on guest side in dmesg:
[    6.849292] ice 0000:00:05.0: Using 64-bit DMA addresses

That's emitted in dma-iommu.c iommu_dma_alloc_iova().
Looks like the guest first tries to allocate an IOVA in the 32-bit AS
and, if this fails, uses the whole dma_limit.
Seems the 32b IOVA alloc failed here ;-)
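
A minimal, self-contained toy model of that fallback (not the kernel code;
the bump allocator and the numbers are made up purely to illustrate the
behaviour described above):

#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MASK32 0xffffffffULL

/* toy bump allocator: returns 0 on failure (0 is never handed out here) */
static uint64_t toy_alloc(uint64_t *next, uint64_t size, uint64_t limit)
{
    uint64_t iova;

    if (*next + size > limit) {
        return 0;
    }
    iova = *next;
    *next += size;
    return iova;
}

/* the policy: try below 4GiB first, then fall back to the full dma_limit */
static uint64_t alloc_iova(uint64_t *next, uint64_t size,
                           uint64_t dma_limit, bool forcedac)
{
    uint64_t iova = 0;

    if (!forcedac && dma_limit > MASK32) {
        iova = toy_alloc(next, size, MASK32);
    }
    if (!iova) {
        /* roughly where the real code logs "Using 64-bit DMA addresses" */
        printf("falling back to 64-bit DMA addresses\n");
        iova = toy_alloc(next, size, dma_limit);
    }
    return iova;
}

int main(void)
{
    uint64_t next = MASK32 - 0x1000;   /* pretend the 32-bit space is nearly full */
    uint64_t limit = (1ULL << 39) - 1; /* e.g. a 39-bit gaw */

    printf("iova1=0x%" PRIx64 "\n", alloc_iova(&next, 0x1000, limit, false));
    printf("iova2=0x%" PRIx64 "\n", alloc_iova(&next, 0x1000, limit, false));
    return 0;
}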

Thanks

Eric





>
>>>> For coldplugged device the technique works because we make sure all
>>>> the IOMMU MR are enabled once on the machine init done: 94df5b2180
>>>> ("virtio-iommu: Fix 64kB host page size VFIO device assignment")
>>>> for granule freeze. But I would be keen to get rid of this trick.
>>>>
>>>> Using an IOMMU MR Ops is unpractical because this relies on the IOMMU
>>>> MR to have been enabled and the corresponding vfio_listener_region_add()
>>>> to be executed. Instead this series proposes to replace the usage of this
>>>> API by the recently introduced PCIIOMMUOps: ba7d12eb8c  ("hw/pci:
>>>> modify
>>>> pci_setup_iommu() to set PCIIOMMUOps"). That way, the callback can be
>>>> called earlier, once the usable IOVA regions have been collected by
>>>> VFIO, without the need for the IOMMU MR to be enabled.
>>>>
>>>> This looks cleaner. In the short term this may also be used for
>>>> passing the page size mask, which would allow to get rid of the
>>>> hacky transient IOMMU MR enablement mentionned above.
>>>>
>>>> [1] [PATCH v4 00/12] VIRTIO-IOMMU/VFIO: Don't assume 64b IOVA space
>>>>    https://lore.kernel.org/all/20231019134651.842175-1-
>>>> eric.auger@redhat.com/
>>>>
>>>> [2] https://lore.kernel.org/all/20230929161547.GB2957297@myrica/
>>>>
>>>> Extra Notes:
>>>> With that series, the reserved memory regions are communicated on time
>>>> so that the virtio-iommu probe request grabs them. However this is not
>>>> sufficient. In some cases (my case), I still see some DMA MAP failures
>>>> and the guest keeps on using IOVA ranges outside the geometry of the
>>>> physical IOMMU. This is due to the fact the VFIO-PCI device is in the
>>>> same iommu group as the pcie root port. Normally the kernel
>>>> iova_reserve_iommu_regions (dma-iommu.c) is supposed to call
>>>> reserve_iova()
>>>> for each reserved IOVA, which carves them out of the allocator. When
>>>> iommu_dma_init_domain() gets called for the hotplugged vfio-pci device
>>>> the iova domain is already allocated and set and we don't call
>>>> iova_reserve_iommu_regions() again for the vfio-pci device. So its
>>>> corresponding reserved regions are not properly taken into account.
>>> I suspect there is same issue with coldplugged devices. If those devices
>>> are in same group, get iova_reserve_iommu_regions() is only called
>>> for first device. But other devices's reserved regions are missed.
>> Correct
>>> Curious how you make passthrough device and pcie root port under same
>>> group.
>>> When I start a x86 guest with passthrough device, I see passthrough
>>> device and pcie root port are in different group.
>>>
>>> -[0000:00]-+-00.0
>>>            +-01.0
>>>            +-02.0
>>>            +-03.0-[01]----00.0
>>>
>>> /sys/kernel/iommu_groups/3/devices:
>>> 0000:00:03.0
>>> /sys/kernel/iommu_groups/7/devices:
>>> 0000:01:00.0
>>>
>>> My qemu cmdline:
>>> -device pcie-root-port,id=root0,slot=0
>>> -device vfio-pci,host=6f:01.0,id=vfio0,bus=root0
>> I just replayed the scenario:
>> - if you have a coldplugged vfio-pci device, the pci root port and the
>> passthroughed device end up in different iommu groups. On my end I use
>> ioh3420 but you confirmed that's the same for the generic pcie-root-port
>> - however if you hotplug the vfio-pci device that's a different story:
>> they end up in the same group. Don't ask me why. I tried with
>> both virtio-iommu and intel iommu and I end up with the same topology.
>> That looks really weird to me.
> It also took me a while to get hotplug to work on x86:
> pcie_cap_slot_plug_cb() didn't get called, instead it would call
> ich9_pm_device_plug_cb(). Not sure what I'm doing wrong.
> To work around that I instantiated a second pxb-pcie root bus and then a
> pcie root port on there. So my command-line looks like:
>
>  -device virtio-iommu
>  -device pxb-pcie,id=pcie.1,bus_nr=1
>  -device pcie-root-port,chassis=2,id=pcie.2,bus=pcie.1
>
>  device_add vfio-pci,host=00:04.0,bus=pcie.2
>
> And somehow pcieport and the assigned device do end up in separate IOMMU
> groups.
>
> Thanks,
> Jean
>



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices
  2024-01-29 16:38       ` Eric Auger
@ 2024-01-30 18:22         ` Jean-Philippe Brucker
  2024-01-31 11:22           ` Eric Auger
  0 siblings, 1 reply; 20+ messages in thread
From: Jean-Philippe Brucker @ 2024-01-30 18:22 UTC (permalink / raw)
  To: Eric Auger
  Cc: Duan, Zhenzhong, eric.auger.pro, qemu-devel, qemu-arm,
	alex.williamson, peter.maydell, peterx, yanghliu, pbonzini, mst,
	clg

On Mon, Jan 29, 2024 at 05:38:55PM +0100, Eric Auger wrote:
> > There may be a separate argument for clearing bypass. With a coldplugged
> > VFIO device the flow is:
> >
> > 1. Map the whole guest address space in VFIO to implement boot-bypass.
> >    This allocates all guest pages, which takes a while and is wasteful.
> >    I've actually crashed a host that way, when spawning a guest with too
> >    much RAM.
> interesting
> > 2. Start the VM
> > 3. When the virtio-iommu driver attaches a (non-identity) domain to the
> >    assigned endpoint, then unmap the whole address space in VFIO, and most
> >    pages are given back to the host.
> >
> > We can't disable boot-bypass because the BIOS needs it. But instead the
> > flow could be:
> >
> > 1. Start the VM, with only the virtual endpoints. Nothing to pin.
> > 2. The virtio-iommu driver disables bypass during boot
> We needed this boot-bypass mode for booting with virtio-blk-scsi
> protected with virtio-iommu for instance.
> That was needed because we don't have any virtio-iommu driver in edk2 as
> opposed to intel iommu driver, right?

Yes. What I had in mind is the x86 SeaBIOS which doesn't have any IOMMU
driver and accesses the default SATA device:

 $ qemu-system-x86_64 -M q35 -device virtio-iommu,boot-bypass=off
 qemu: virtio_iommu_translate sid=250 is not known!!
 qemu: no buffer available in event queue to report event
 qemu: AHCI: Failed to start FIS receive engine: bad FIS receive buffer address

But it's the same problem with edk2. Also a guest OS without a
virtio-iommu driver needs boot-bypass. Once firmware boot is complete, the
OS with a virtio-iommu driver normally can turn bypass off in the config
space, it's not useful anymore. If it needs to put some endpoints in
bypass, then it can attach them to a bypass domain.

> > 3. Hotplug the VFIO device. With bypass disabled there is no need to pin
> >    the whole guest address space, unless the guest explicitly asks for an
> >    identity domain.
> >
> > However, I don't know if this is a realistic scenario that will actually
> > be used.
> >
> > By the way, do you have an easy way to reproduce the issue described here?
> > I've had to enable iommu.forcedac=1 on the command-line, otherwise Linux
> > just allocates 32-bit IOVAs.
> I don't have a simple generic reproducer. It happens when assigning this
> device:
> Ethernet Controller E810-C for QSFP (Ethernet Network Adapter E810-C-Q2)
> 
> I have not encountered that issue with another device yet.
> I see on guest side in dmesg:
> [    6.849292] ice 0000:00:05.0: Using 64-bit DMA addresses
> 
> That's emitted in dma-iommu.c iommu_dma_alloc_iova().
> Looks like the guest first tries to allocate an iova in the 32-bit AS
> and if this fails use the whole dma_limit.
> Seems the 32b IOVA alloc failed here ;-)

Interesting, are you running some demanding workload and a lot of CPUs?
That's a lot of IOVAs used up, I'm curious about what kind of DMA pattern
does that.

Thanks,
Jean


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices
  2024-01-30 18:22         ` Jean-Philippe Brucker
@ 2024-01-31 11:22           ` Eric Auger
  0 siblings, 0 replies; 20+ messages in thread
From: Eric Auger @ 2024-01-31 11:22 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Duan, Zhenzhong, eric.auger.pro, qemu-devel, qemu-arm,
	alex.williamson, peter.maydell, peterx, yanghliu, pbonzini, mst,
	clg

Hi Jean,

On 1/30/24 19:22, Jean-Philippe Brucker wrote:
> On Mon, Jan 29, 2024 at 05:38:55PM +0100, Eric Auger wrote:
>>> There may be a separate argument for clearing bypass. With a coldplugged
>>> VFIO device the flow is:
>>>
>>> 1. Map the whole guest address space in VFIO to implement boot-bypass.
>>>    This allocates all guest pages, which takes a while and is wasteful.
>>>    I've actually crashed a host that way, when spawning a guest with too
>>>    much RAM.
>> interesting
>>> 2. Start the VM
>>> 3. When the virtio-iommu driver attaches a (non-identity) domain to the
>>>    assigned endpoint, then unmap the whole address space in VFIO, and most
>>>    pages are given back to the host.
>>>
>>> We can't disable boot-bypass because the BIOS needs it. But instead the
>>> flow could be:
>>>
>>> 1. Start the VM, with only the virtual endpoints. Nothing to pin.
>>> 2. The virtio-iommu driver disables bypass during boot
>> We needed this boot-bypass mode for booting with virtio-blk-scsi
>> protected with virtio-iommu for instance.
>> That was needed because we don't have any virtio-iommu driver in edk2 as
>> opposed to intel iommu driver, right?
> Yes. What I had in mind is the x86 SeaBIOS which doesn't have any IOMMU
> driver and accesses the default SATA device:
>
>  $ qemu-system-x86_64 -M q35 -device virtio-iommu,boot-bypass=off
>  qemu: virtio_iommu_translate sid=250 is not known!!
>  qemu: no buffer available in event queue to report event
>  qemu: AHCI: Failed to start FIS receive engine: bad FIS receive buffer address
>
> But it's the same problem with edk2. Also a guest OS without a
> virtio-iommu driver needs boot-bypass. Once firmware boot is complete, the
> OS with a virtio-iommu driver normally can turn bypass off in the config
> space, it's not useful anymore. If it needs to put some endpoints in
> bypass, then it can attach them to a bypass domain.

yup
>
>>> 3. Hotplug the VFIO device. With bypass disabled there is no need to pin
>>>    the whole guest address space, unless the guest explicitly asks for an
>>>    identity domain.
>>>
>>> However, I don't know if this is a realistic scenario that will actually
>>> be used.
>>>
>>> By the way, do you have an easy way to reproduce the issue described here?
>>> I've had to enable iommu.forcedac=1 on the command-line, otherwise Linux
>>> just allocates 32-bit IOVAs.
>> I don't have a simple generic reproducer. It happens when assigning this
>> device:
>> Ethernet Controller E810-C for QSFP (Ethernet Network Adapter E810-C-Q2)
>>
>> I have not encountered that issue with another device yet.
>> I see on guest side in dmesg:
>> [    6.849292] ice 0000:00:05.0: Using 64-bit DMA addresses
>>
>> That's emitted in dma-iommu.c iommu_dma_alloc_iova().
>> Looks like the guest first tries to allocate an iova in the 32-bit AS
>> and if this fails use the whole dma_limit.
>> Seems the 32b IOVA alloc failed here ;-)
> Interesting, are you running some demanding workload and a lot of CPUs?
> That's a lot of IOVAs used up, I'm curious about what kind of DMA pattern
> does that.
Well nothing smart, just booting the guest with the assigned NIC. 8 vcpus

Thanks

Eric
>
> Thanks,
> Jean
>



^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2024-01-31 11:23 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-17  8:02 [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices Eric Auger
2024-01-17  8:02 ` [RFC 1/7] hw/pci: Introduce PCIIOMMUOps::set_host_iova_regions Eric Auger
2024-01-17  8:02 ` [RFC 2/7] hw/pci: Introduce pci_device_iommu_bus Eric Auger
2024-01-18  7:32   ` Duan, Zhenzhong
2024-01-17  8:02 ` [RFC 3/7] vfio/pci: Pass the usable IOVA ranges through PCIIOMMUOps Eric Auger
2024-01-17  8:02 ` [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions Eric Auger
2024-01-18  7:43   ` Duan, Zhenzhong
2024-01-18 12:25     ` Eric Auger
2024-01-19  7:00       ` Duan, Zhenzhong
2024-01-22  7:17         ` Duan, Zhenzhong
2024-01-17  8:02 ` [RFC 5/7] virtio-iommu: Remove the implementation of iommu_set_iova_ranges Eric Auger
2024-01-17  8:02 ` [RFC 6/7] hw/vfio: Remove memory_region_iommu_set_iova_ranges() call Eric Auger
2024-01-17  8:02 ` [RFC 7/7] memory: Remove IOMMU MR iommu_set_iova_range API Eric Auger
2024-01-18  7:10 ` [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices Duan, Zhenzhong
2024-01-18  9:43   ` Eric Auger
2024-01-19  6:46     ` Duan, Zhenzhong
2024-01-25 18:48     ` Jean-Philippe Brucker
2024-01-29 16:38       ` Eric Auger
2024-01-30 18:22         ` Jean-Philippe Brucker
2024-01-31 11:22           ` Eric Auger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).