* [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices
@ 2024-01-17  8:02 Eric Auger
  2024-01-17  8:02 ` [RFC 1/7] hw/pci: Introduce PCIIOMMUOps::set_host_iova_regions Eric Auger
                   ` (7 more replies)
  0 siblings, 8 replies; 20+ messages in thread
From: Eric Auger @ 2024-01-17  8:02 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, zhenzhong.duan, peterx, yanghliu,
	pbonzini
  Cc: mst, clg

In [1] we attempted to fix a case where a VFIO-PCI device protected
with a virtio-iommu was assigned to an x86 guest. On x86 the physical
IOMMU may have an address width (gaw) of 39 or 48 bits whereas the
virtio-iommu used to expose a 64b address space by default.
Hence the guest was trying to use the full 64b space and we hit
DMA MAP failures. To work around this issue we managed to pass
usable IOVA regions (excluding the out of range space) from VFIO
to the virtio-iommu device. This was made feasible by introducing
a new IOMMU Memory Region callback dubbed iommu_set_iova_regions().
The latter gets called when the IOMMU MR is enabled, which
causes vfio_listener_region_add() to be called.
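
For reference, the current path can be summarized as follows (simplified
call-flow sketch, using the names of the code removed in patches 6 and 7):

  IOMMU MR gets enabled (vIOMMU protection activated)
    -> vfio_listener_region_add()
         -> memory_region_iommu_set_iova_ranges()
              -> IOMMUMemoryRegionClass::iommu_set_iova_ranges()
                   -> virtio_iommu_set_iova_ranges()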

However with VFIO-PCI hotplug, this technique fails due to the
race between the call to the callback from the memory listener's
region_add and the virtio-iommu probe request. Indeed the probe
request gets issued before the attach to the domain. So in that case
the usable regions are communicated after the probe request and fail
to be conveyed to the guest. To be honest the problem was hinted at by
Jean-Philippe in [1] and I should have been more careful listening
to him and testing with hotplug :-(

For coldplugged devices the technique works because we make sure all
the IOMMU MRs are enabled once, on machine init done, as 94df5b2180
("virtio-iommu: Fix 64kB host page size VFIO device assignment")
did for the granule freeze. But I would be keen to get rid of this trick.

Using an IOMMU MR Op is impractical because this relies on the IOMMU
MR having been enabled and the corresponding vfio_listener_region_add()
having been executed. Instead this series proposes to replace the usage
of this API with the recently introduced PCIIOMMUOps: ba7d12eb8c
("hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps"). That way, the
callback can be called earlier, once the usable IOVA regions have been
collected by VFIO, without the need for the IOMMU MR to be enabled.
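
With this series the flow would instead look roughly like this
(simplified sketch, names taken from the patches below):

  vfio_realize()                      /* also runs on hotplug */
    -> vfio_pci_set_iova_ranges()
         -> pci_device_iommu_bus()
         -> PCIIOMMUOps::set_host_iova_ranges()
              -> virtio_iommu_set_host_iova_ranges()
                   turning the usable ranges into reserved regions
                   before the guest issues the probe request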

This looks cleaner. In the short term this may also be used for
passing the page size mask, which would allow us to get rid of the
hacky transient IOMMU MR enablement mentioned above.
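
To illustrate that possible follow-up, such a callback could be modeled
on set_host_iova_ranges(). This is a purely hypothetical sketch, the
set_host_page_size_mask name and prototype below are not part of this
series:

  /*
   * Hypothetical additional PCIIOMMUOps member, mirroring
   * set_host_iova_ranges(). It would be called by VFIO once the host
   * page size mask is known, without requiring the IOMMU MR to be
   * enabled.
   */
  int (*set_host_page_size_mask)(PCIBus *bus, void *opaque, int devfn,
                                 uint64_t page_size_mask, Error **errp);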

[1] [PATCH v4 00/12] VIRTIO-IOMMU/VFIO: Don't assume 64b IOVA space
    https://lore.kernel.org/all/20231019134651.842175-1-eric.auger@redhat.com/

[2] https://lore.kernel.org/all/20230929161547.GB2957297@myrica/

Extra Notes:
With that series, the reserved memory regions are communicated in time
so that the virtio-iommu probe request grabs them. However this is not
sufficient. In some cases (my case), I still see some DMA MAP failures
and the guest keeps on using IOVA ranges outside the geometry of the
physical IOMMU. This is due to the fact that the VFIO-PCI device is in
the same iommu group as the pcie root port. Normally the kernel
iova_reserve_iommu_regions (dma-iommu.c) is supposed to call reserve_iova()
for each reserved IOVA, which carves them out of the allocator. When
iommu_dma_init_domain() gets called for the hotplugged vfio-pci device
the iova domain is already allocated and set and we don't call
iova_reserve_iommu_regions() again for the vfio-pci device. So its
corresponding reserved regions are not properly taken into account.
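
In other words the kernel path boils down to (heavily simplified sketch
of the dma-iommu.c logic, not verbatim kernel code):

  iommu_dma_init_domain(domain, base, limit, dev)
    -> iova domain already initialized by the first device of the group
       (e.g. the root port)?
         yes: return 0
              /* iova_reserve_iommu_regions() never runs for dev, so its
                 reserved regions are not carved out of the allocator */
         no:  init_iova_domain(...)
              iova_reserve_iommu_regions(dev, domain)
              /* calls reserve_iova() for each reserved region */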

This is not trivial to fix because theoretically the 1st attached
device could already have allocated IOVAs within the reserved regions
of the second device. Also we are somehow hijacking the reserved
memory regions to model the geometry of the physical IOMMU, so I am not
sure any attempt to fix that upstream will be accepted. At the moment
one solution is to make sure assigned devices end up in singleton
groups. Another solution is to work on a different approach where the
gaw can be passed as an option to the virtio-iommu device, similarly to
what is done with the intel iommu.
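
For the record the intel iommu already exposes such a knob through its
aw-bits property; a hypothetical equivalent for the virtio-iommu (the
property name below is illustrative only and does not exist in this
series) could look like:

  # existing: constrain the vIOMMU address width with intel-iommu
  -device intel-iommu,aw-bits=48
  # hypothetical equivalent for virtio-iommu
  -device virtio-iommu-pci,aw-bits=48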

This series can be found at:
https://github.com/eauger/qemu/tree/hotplug-resv-rfc


Eric Auger (7):
  hw/pci: Introduce PCIIOMMUOps::set_host_iova_regions
  hw/pci: Introduce pci_device_iommu_bus
  vfio/pci: Pass the usable IOVA ranges through PCIIOMMUOps
  virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions
  virtio-iommu: Remove the implementation of iommu_set_iova_ranges
  hw/vfio: Remove memory_region_iommu_set_iova_ranges() call
  memory: Remove IOMMU MR iommu_set_iova_range API

 include/exec/memory.h    |  32 -------
 include/hw/pci/pci.h     |  16 ++++
 hw/pci/pci.c             |  16 ++++
 hw/vfio/common.c         |  10 --
 hw/vfio/pci.c            |  27 ++++++
 hw/virtio/virtio-iommu.c | 201 ++++++++++++++++++++-------------------
 system/memory.c          |  13 ---
 7 files changed, 160 insertions(+), 155 deletions(-)

-- 
2.41.0




* [RFC 1/7] hw/pci: Introduce PCIIOMMUOps::set_host_iova_regions
  2024-01-17  8:02 [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices Eric Auger
@ 2024-01-17  8:02 ` Eric Auger
  2024-01-17  8:02 ` [RFC 2/7] hw/pci: Introduce pci_device_iommu_bus Eric Auger
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Eric Auger @ 2024-01-17  8:02 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, zhenzhong.duan, peterx, yanghliu,
	pbonzini
  Cc: mst, clg

This new callback will be used to convey usable IOVA regions
from VFIO-PCI devices to vIOMMUs (esp. virtio-iommu). The advantage
is that this callback can be called very early, once the device is
known to be protected by a vIOMMU, after get_address_space()
has been called by the parent device. The current solution for
conveying IOVA regions relies on IOMMU MR callbacks, but this requires
an IOMMU MR to be connected with the VFIO-PCI device, which generally
comes with the enablement of the IOMMU MR (vIOMMU protection activated).
The downside is that it comes pretty late and, in the case of the
virtio-iommu, after the probe request.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 include/hw/pci/pci.h | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index fa6313aabc..63c018b35a 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -385,6 +385,21 @@ typedef struct PCIIOMMUOps {
      * @devfn: device and function number
      */
    AddressSpace * (*get_address_space)(PCIBus *bus, void *opaque, int devfn);
+
+    /**
+     * @set_host_iova_ranges: convey the usable iova ranges for a given device
+     *
+     * Optional callback which returns 0 on success or an error value
+     * otherwise. It should be called after @get_address_space().
+     *
+     * @bus: the #PCIBus being accessed.
+     * @opaque: the data passed to pci_setup_iommu().
+     * @devfn: device and function number
+     * @iova_ranges: list of IOVA ranges usable by the device
+     * @errp: error handle
+     */
+   int (*set_host_iova_ranges)(PCIBus *bus, void *opaque, int devfn,
+                               GList *iova_ranges, Error **errp);
 } PCIIOMMUOps;
 
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
-- 
2.41.0




* [RFC 2/7] hw/pci: Introduce pci_device_iommu_bus
  2024-01-17  8:02 [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices Eric Auger
  2024-01-17  8:02 ` [RFC 1/7] hw/pci: Introduce PCIIOMMUOps::set_host_iova_regions Eric Auger
@ 2024-01-17  8:02 ` Eric Auger
  2024-01-18  7:32   ` Duan, Zhenzhong
  2024-01-17  8:02 ` [RFC 3/7] vfio/pci: Pass the usable IOVA ranges through PCIIOMMUOps Eric Auger
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Eric Auger @ 2024-01-17  8:02 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, zhenzhong.duan, peterx, yanghliu,
	pbonzini
  Cc: mst, clg

This helper will allow subsequent patches to retrieve the IOMMU bus
and call its associated PCIIOMMUOps callbacks.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 include/hw/pci/pci.h |  1 +
 hw/pci/pci.c         | 16 ++++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 63c018b35a..649b327f9f 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -403,6 +403,7 @@ typedef struct PCIIOMMUOps {
 } PCIIOMMUOps;
 
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
+PCIBus *pci_device_iommu_bus(PCIDevice *dev);
 
 /**
  * pci_setup_iommu: Initialize specific IOMMU handlers for a PCIBus
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 76080af580..5bf07662fe 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2672,6 +2672,22 @@ static void pci_device_class_base_init(ObjectClass *klass, void *data)
     }
 }
 
+PCIBus *pci_device_iommu_bus(PCIDevice *dev)
+{
+    PCIBus *bus = pci_get_bus(dev);
+    PCIBus *iommu_bus = bus;
+
+    while (iommu_bus && !iommu_bus->iommu_ops && iommu_bus->parent_dev) {
+        PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
+
+        iommu_bus = parent_bus;
+    }
+    if (pci_bus_bypass_iommu(bus)) {
+        return NULL;
+    }
+    return iommu_bus;
+}
+
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
 {
     PCIBus *bus = pci_get_bus(dev);
-- 
2.41.0




* [RFC 3/7] vfio/pci: Pass the usable IOVA ranges through PCIIOMMUOps
  2024-01-17  8:02 [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices Eric Auger
  2024-01-17  8:02 ` [RFC 1/7] hw/pci: Introduce PCIIOMMUOps::set_host_iova_regions Eric Auger
  2024-01-17  8:02 ` [RFC 2/7] hw/pci: Introduce pci_device_iommu_bus Eric Auger
@ 2024-01-17  8:02 ` Eric Auger
  2024-01-17  8:02 ` [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions Eric Auger
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Eric Auger @ 2024-01-17  8:02 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, zhenzhong.duan, peterx, yanghliu,
	pbonzini
  Cc: mst, clg

Pass the collected usable IOVA regions using the PCIIOMMUOps
set_host_iova_ranges() callback, if implemented.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/vfio/pci.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index d7fe06715c..63937952bb 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2485,6 +2485,28 @@ int vfio_pci_get_pci_hot_reset_info(VFIOPCIDevice *vdev,
     return 0;
 }
 
+static int vfio_pci_set_iova_ranges(VFIOPCIDevice *vdev, Error **errp)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    PCIDevice *pdev = &vdev->pdev;
+    VFIOContainerBase *bcontainer = vbasedev->bcontainer;
+    PCIBus *bus, *iommu_bus;
+
+    if (!bcontainer->iova_ranges) {
+        return 0;
+    }
+
+    bus = pci_get_bus(pdev);
+    iommu_bus = pci_device_iommu_bus(pdev);
+    if (iommu_bus && iommu_bus->iommu_ops &&
+        iommu_bus->iommu_ops->set_host_iova_ranges) {
+        return iommu_bus->iommu_ops->set_host_iova_ranges(
+                   bus, iommu_bus->iommu_opaque,
+                   pdev->devfn, bcontainer->iova_ranges, errp);
+    }
+    return 0;
+}
+
 static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single)
 {
     VFIODevice *vbasedev = &vdev->vbasedev;
@@ -3004,6 +3026,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         goto error;
     }
 
+    ret = vfio_pci_set_iova_ranges(vdev, errp);
+    if (ret) {
+        goto error;
+    }
+
     vfio_populate_device(vdev, &err);
     if (err) {
         error_propagate(errp, err);
-- 
2.41.0




* [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions
  2024-01-17  8:02 [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices Eric Auger
                   ` (2 preceding siblings ...)
  2024-01-17  8:02 ` [RFC 3/7] vfio/pci: Pass the usable IOVA ranges through PCIIOMMUOps Eric Auger
@ 2024-01-17  8:02 ` Eric Auger
  2024-01-18  7:43   ` Duan, Zhenzhong
  2024-01-17  8:02 ` [RFC 5/7] virtio-iommu: Remove the implementation of iommu_set_iova_ranges Eric Auger
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Eric Auger @ 2024-01-17  8:02 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, zhenzhong.duan, peterx, yanghliu,
	pbonzini
  Cc: mst, clg

Reuse the implementation of virtio_iommu_set_iova_ranges() which
will be removed in subsequent patches.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/virtio/virtio-iommu.c | 134 +++++++++++++++++++++++++++++----------
 1 file changed, 101 insertions(+), 33 deletions(-)

diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
index 8a4bd933c6..716a3fcfbf 100644
--- a/hw/virtio/virtio-iommu.c
+++ b/hw/virtio/virtio-iommu.c
@@ -461,8 +461,109 @@ static AddressSpace *virtio_iommu_find_add_as(PCIBus *bus, void *opaque,
     return &sdev->as;
 }
 
+/**
+ * rebuild_resv_regions: rebuild resv regions with both the
+ * info of host resv ranges and property set resv ranges
+ */
+static int rebuild_resv_regions(IOMMUDevice *sdev)
+{
+    GList *l;
+    int i = 0;
+
+    /* free the existing list and rebuild it from scratch */
+    g_list_free_full(sdev->resv_regions, g_free);
+    sdev->resv_regions = NULL;
+
+    /* First add host reserved regions if any, all tagged as RESERVED */
+    for (l = sdev->host_resv_ranges; l; l = l->next) {
+        ReservedRegion *reg = g_new0(ReservedRegion, 1);
+        Range *r = (Range *)l->data;
+
+        reg->type = VIRTIO_IOMMU_RESV_MEM_T_RESERVED;
+        range_set_bounds(&reg->range, range_lob(r), range_upb(r));
+        sdev->resv_regions = resv_region_list_insert(sdev->resv_regions, reg);
+        trace_virtio_iommu_host_resv_regions(sdev->iommu_mr.parent_obj.name, i,
+                                             range_lob(&reg->range),
+                                             range_upb(&reg->range));
+        i++;
+    }
+    /*
+     * then add higher priority reserved regions set by the machine
+     * through properties
+     */
+    add_prop_resv_regions(sdev);
+    return 0;
+}
+
+static int virtio_iommu_set_host_iova_ranges(PCIBus *bus, void *opaque,
+                                             int devfn, GList *iova_ranges,
+                                             Error **errp)
+{
+    VirtIOIOMMU *s = opaque;
+    IOMMUPciBus *sbus = g_hash_table_lookup(s->as_by_busptr, bus);
+    IOMMUDevice *sdev;
+    GList *current_ranges;
+    GList *l, *tmp, *new_ranges = NULL;
+    int ret = -EINVAL;
+
+    if (!sbus) {
+        error_report("%s no sbus", __func__);
+    }
+
+    sdev = sbus->pbdev[devfn];
+
+    current_ranges = sdev->host_resv_ranges;
+
+    warn_report("%s: host_resv_regions=%d", __func__, !!sdev->host_resv_ranges);
+    /* check that each new resv region is included in an existing one */
+    if (sdev->host_resv_ranges) {
+        range_inverse_array(iova_ranges,
+                            &new_ranges,
+                            0, UINT64_MAX);
+
+        for (tmp = new_ranges; tmp; tmp = tmp->next) {
+            Range *newr = (Range *)tmp->data;
+            bool included = false;
+
+            for (l = current_ranges; l; l = l->next) {
+                Range * r = (Range *)l->data;
+
+                if (range_contains_range(r, newr)) {
+                    included = true;
+                    break;
+                }
+            }
+            if (!included) {
+                goto error;
+            }
+        }
+        /* all new reserved ranges are included in existing ones */
+        ret = 0;
+        goto out;
+    }
+
+    if (sdev->probe_done) {
+        warn_report("%s: Notified about new host reserved regions after probe",
+                    __func__);
+    }
+
+    range_inverse_array(iova_ranges,
+                        &sdev->host_resv_ranges,
+                        0, UINT64_MAX);
+    rebuild_resv_regions(sdev);
+
+    return 0;
+error:
+    error_setg(errp, "%s Conflicting host reserved ranges set!",
+               __func__);
+out:
+    g_list_free_full(new_ranges, g_free);
+    return ret;
+}
+
 static const PCIIOMMUOps virtio_iommu_ops = {
     .get_address_space = virtio_iommu_find_add_as,
+    .set_host_iova_ranges = virtio_iommu_set_host_iova_ranges,
 };
 
 static int virtio_iommu_attach(VirtIOIOMMU *s,
@@ -1158,39 +1259,6 @@ static int virtio_iommu_set_page_size_mask(IOMMUMemoryRegion *mr,
     return 0;
 }
 
-/**
- * rebuild_resv_regions: rebuild resv regions with both the
- * info of host resv ranges and property set resv ranges
- */
-static int rebuild_resv_regions(IOMMUDevice *sdev)
-{
-    GList *l;
-    int i = 0;
-
-    /* free the existing list and rebuild it from scratch */
-    g_list_free_full(sdev->resv_regions, g_free);
-    sdev->resv_regions = NULL;
-
-    /* First add host reserved regions if any, all tagged as RESERVED */
-    for (l = sdev->host_resv_ranges; l; l = l->next) {
-        ReservedRegion *reg = g_new0(ReservedRegion, 1);
-        Range *r = (Range *)l->data;
-
-        reg->type = VIRTIO_IOMMU_RESV_MEM_T_RESERVED;
-        range_set_bounds(&reg->range, range_lob(r), range_upb(r));
-        sdev->resv_regions = resv_region_list_insert(sdev->resv_regions, reg);
-        trace_virtio_iommu_host_resv_regions(sdev->iommu_mr.parent_obj.name, i,
-                                             range_lob(&reg->range),
-                                             range_upb(&reg->range));
-        i++;
-    }
-    /*
-     * then add higher priority reserved regions set by the machine
-     * through properties
-     */
-    add_prop_resv_regions(sdev);
-    return 0;
-}
 
 /**
  * virtio_iommu_set_iova_ranges: Conveys the usable IOVA ranges
-- 
2.41.0




* [RFC 5/7] virtio-iommu: Remove the implementation of iommu_set_iova_ranges
  2024-01-17  8:02 [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices Eric Auger
                   ` (3 preceding siblings ...)
  2024-01-17  8:02 ` [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions Eric Auger
@ 2024-01-17  8:02 ` Eric Auger
  2024-01-17  8:02 ` [RFC 6/7] hw/vfio: Remove memory_region_iommu_set_iova_ranges() call Eric Auger
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Eric Auger @ 2024-01-17  8:02 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, zhenzhong.duan, peterx, yanghliu,
	pbonzini
  Cc: mst, clg

Now that we use PCIIOMMUOps to convey information about usable IOVA
ranges, we no longer need to implement the iommu_set_iova_ranges
IOMMU MR callback.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/virtio/virtio-iommu.c | 67 ----------------------------------------
 1 file changed, 67 deletions(-)

diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
index 716a3fcfbf..1e2fc86f2f 100644
--- a/hw/virtio/virtio-iommu.c
+++ b/hw/virtio/virtio-iommu.c
@@ -1260,72 +1260,6 @@ static int virtio_iommu_set_page_size_mask(IOMMUMemoryRegion *mr,
 }
 
 
-/**
- * virtio_iommu_set_iova_ranges: Conveys the usable IOVA ranges
- *
- * The function turns those into reserved ranges. Once some
- * reserved ranges have been set, new reserved regions cannot be
- * added outside of the original ones.
- *
- * @mr: IOMMU MR
- * @iova_ranges: list of usable IOVA ranges
- * @errp: error handle
- */
-static int virtio_iommu_set_iova_ranges(IOMMUMemoryRegion *mr,
-                                        GList *iova_ranges,
-                                        Error **errp)
-{
-    IOMMUDevice *sdev = container_of(mr, IOMMUDevice, iommu_mr);
-    GList *current_ranges = sdev->host_resv_ranges;
-    GList *l, *tmp, *new_ranges = NULL;
-    int ret = -EINVAL;
-
-    /* check that each new resv region is included in an existing one */
-    if (sdev->host_resv_ranges) {
-        range_inverse_array(iova_ranges,
-                            &new_ranges,
-                            0, UINT64_MAX);
-
-        for (tmp = new_ranges; tmp; tmp = tmp->next) {
-            Range *newr = (Range *)tmp->data;
-            bool included = false;
-
-            for (l = current_ranges; l; l = l->next) {
-                Range * r = (Range *)l->data;
-
-                if (range_contains_range(r, newr)) {
-                    included = true;
-                    break;
-                }
-            }
-            if (!included) {
-                goto error;
-            }
-        }
-        /* all new reserved ranges are included in existing ones */
-        ret = 0;
-        goto out;
-    }
-
-    if (sdev->probe_done) {
-        warn_report("%s: Notified about new host reserved regions after probe",
-                    mr->parent_obj.name);
-    }
-
-    range_inverse_array(iova_ranges,
-                        &sdev->host_resv_ranges,
-                        0, UINT64_MAX);
-    rebuild_resv_regions(sdev);
-
-    return 0;
-error:
-    error_setg(errp, "IOMMU mr=%s Conflicting host reserved ranges set!",
-               mr->parent_obj.name);
-out:
-    g_list_free_full(new_ranges, g_free);
-    return ret;
-}
-
 static void virtio_iommu_system_reset(void *opaque)
 {
     VirtIOIOMMU *s = opaque;
@@ -1621,7 +1555,6 @@ static void virtio_iommu_memory_region_class_init(ObjectClass *klass,
     imrc->replay = virtio_iommu_replay;
     imrc->notify_flag_changed = virtio_iommu_notify_flag_changed;
     imrc->iommu_set_page_size_mask = virtio_iommu_set_page_size_mask;
-    imrc->iommu_set_iova_ranges = virtio_iommu_set_iova_ranges;
 }
 
 static const TypeInfo virtio_iommu_info = {
-- 
2.41.0




* [RFC 6/7] hw/vfio: Remove memory_region_iommu_set_iova_ranges() call
  2024-01-17  8:02 [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices Eric Auger
                   ` (4 preceding siblings ...)
  2024-01-17  8:02 ` [RFC 5/7] virtio-iommu: Remove the implementation of iommu_set_iova_ranges Eric Auger
@ 2024-01-17  8:02 ` Eric Auger
  2024-01-17  8:02 ` [RFC 7/7] memory: Remove IOMMU MR iommu_set_iova_range API Eric Auger
  2024-01-18  7:10 ` [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices Duan, Zhenzhong
  7 siblings, 0 replies; 20+ messages in thread
From: Eric Auger @ 2024-01-17  8:02 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, zhenzhong.duan, peterx, yanghliu,
	pbonzini
  Cc: mst, clg

As we have just removed the only implementation of the
iommu_set_iova_ranges IOMMU MR callback in the virtio-iommu,
let's remove the call to the memory wrapper. Usable IOVA ranges
are now conveyed through the PCIIOMMUOps in VFIO-PCI.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/vfio/common.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0b3352f2a9..d88602e050 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -636,16 +636,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
             goto fail;
         }
 
-        if (bcontainer->iova_ranges) {
-            ret = memory_region_iommu_set_iova_ranges(giommu->iommu_mr,
-                                                      bcontainer->iova_ranges,
-                                                      &err);
-            if (ret) {
-                g_free(giommu);
-                goto fail;
-            }
-        }
-
         ret = memory_region_register_iommu_notifier(section->mr, &giommu->n,
                                                     &err);
         if (ret) {
-- 
2.41.0




* [RFC 7/7] memory: Remove IOMMU MR iommu_set_iova_range API
  2024-01-17  8:02 [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices Eric Auger
                   ` (5 preceding siblings ...)
  2024-01-17  8:02 ` [RFC 6/7] hw/vfio: Remove memory_region_iommu_set_iova_ranges() call Eric Auger
@ 2024-01-17  8:02 ` Eric Auger
  2024-01-18  7:10 ` [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices Duan, Zhenzhong
  7 siblings, 0 replies; 20+ messages in thread
From: Eric Auger @ 2024-01-17  8:02 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, zhenzhong.duan, peterx, yanghliu,
	pbonzini
  Cc: mst, clg

Since the host IOVA ranges are now passed through the
PCIIOMMUOps set_host_iova_ranges callback, and since we have
removed the only implementation of iommu_set_iova_ranges() in
the virtio-iommu as well as its only call site in vfio/common,
let's retire the IOMMU MR API and its memory wrapper.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 include/exec/memory.h | 32 --------------------------------
 system/memory.c       | 13 -------------
 2 files changed, 45 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 177be23db7..7677b572c5 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -527,26 +527,6 @@ struct IOMMUMemoryRegionClass {
      int (*iommu_set_page_size_mask)(IOMMUMemoryRegion *iommu,
                                      uint64_t page_size_mask,
                                      Error **errp);
-    /**
-     * @iommu_set_iova_ranges:
-     *
-     * Propagate information about the usable IOVA ranges for a given IOMMU
-     * memory region. Used for example to propagate host physical device
-     * reserved memory region constraints to the virtual IOMMU.
-     *
-     * Optional method: if this method is not provided, then the default IOVA
-     * aperture is used.
-     *
-     * @iommu: the IOMMUMemoryRegion
-     *
-     * @iova_ranges: list of ordered IOVA ranges (at least one range)
-     *
-     * Returns 0 on success, or a negative error. In case of failure, the error
-     * object must be created.
-     */
-     int (*iommu_set_iova_ranges)(IOMMUMemoryRegion *iommu,
-                                  GList *iova_ranges,
-                                  Error **errp);
 };
 
 typedef struct RamDiscardListener RamDiscardListener;
@@ -1896,18 +1876,6 @@ int memory_region_iommu_set_page_size_mask(IOMMUMemoryRegion *iommu_mr,
                                            uint64_t page_size_mask,
                                            Error **errp);
 
-/**
- * memory_region_iommu_set_iova_ranges - Set the usable IOVA ranges
- * for a given IOMMU MR region
- *
- * @iommu: IOMMU memory region
- * @iova_ranges: list of ordered IOVA ranges (at least one range)
- * @errp: pointer to Error*, to store an error if it happens.
- */
-int memory_region_iommu_set_iova_ranges(IOMMUMemoryRegion *iommu,
-                                        GList *iova_ranges,
-                                        Error **errp);
-
 /**
  * memory_region_name: get a memory region's name
  *
diff --git a/system/memory.c b/system/memory.c
index a229a79988..b2fb12f87f 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -1909,19 +1909,6 @@ int memory_region_iommu_set_page_size_mask(IOMMUMemoryRegion *iommu_mr,
     return ret;
 }
 
-int memory_region_iommu_set_iova_ranges(IOMMUMemoryRegion *iommu_mr,
-                                        GList *iova_ranges,
-                                        Error **errp)
-{
-    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr);
-    int ret = 0;
-
-    if (imrc->iommu_set_iova_ranges) {
-        ret = imrc->iommu_set_iova_ranges(iommu_mr, iova_ranges, errp);
-    }
-    return ret;
-}
-
 int memory_region_register_iommu_notifier(MemoryRegion *mr,
                                           IOMMUNotifier *n, Error **errp)
 {
-- 
2.41.0




* RE: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices
  2024-01-17  8:02 [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices Eric Auger
                   ` (6 preceding siblings ...)
  2024-01-17  8:02 ` [RFC 7/7] memory: Remove IOMMU MR iommu_set_iova_range API Eric Auger
@ 2024-01-18  7:10 ` Duan, Zhenzhong
  2024-01-18  9:43   ` Eric Auger
  7 siblings, 1 reply; 20+ messages in thread
From: Duan, Zhenzhong @ 2024-01-18  7:10 UTC (permalink / raw)
  To: Eric Auger, eric.auger.pro, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, peterx, yanghliu, pbonzini
  Cc: mst, clg

Hi Eric,

>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Cc: mst@redhat.com; clg@redhat.com
>Subject: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling
>for hotplugged devices
>
>In [1] we attempted to fix a case where a VFIO-PCI device protected
>with a virtio-iommu was assigned to an x86 guest. On x86 the physical
>IOMMU may have an address width (gaw) of 39 or 48 bits whereas the
>virtio-iommu used to expose a 64b address space by default.
>Hence the guest was trying to use the full 64b space and we hit
>DMA MAP failures. To work around this issue we managed to pass
>usable IOVA regions (excluding the out of range space) from VFIO
>to the virtio-iommu device. This was made feasible by introducing
>a new IOMMU Memory Region callback dubbed iommu_set_iova_regions().
>This latter gets called when the IOMMU MR is enabled which
>causes the vfio_listener_region_add() to be called.
>
>However with VFIO-PCI hotplug, this technique fails due to the
>race between the call to the callback in the add memory listener
>and the virtio-iommu probe request. Indeed the probe request gets
>called before the attach to the domain. So in that case the usable
>regions are communicated after the probe request and fail to be
>conveyed to the guest. To be honest the problem was hinted by
>Jean-Philippe in [1] and I should have been more careful at
>listening to him and testing with hotplug :-(

It looks like the global virtio_iommu_config.bypass is never cleared in
the guest. When the guest virtio_iommu driver enables the IOMMU, should
it clear this bypass attribute?

If it were cleared in viommu_probe(), then qemu would call
virtio_iommu_set_config() and then virtio_iommu_switch_address_space_all()
to enable the IOMMU MRs. Then both coldplugged and hotplugged devices
would work.
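
A rough sketch of such a config write on the guest side (illustrative
only, assuming VIRTIO_IOMMU_F_BYPASS_CONFIG has been negotiated, not the
actual driver code):

  /* illustrative: clear global bypass once the driver takes over */
  if (virtio_has_feature(vdev, VIRTIO_IOMMU_F_BYPASS_CONFIG))
          virtio_cwrite8(vdev,
                         offsetof(struct virtio_iommu_config, bypass),
                         0);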

Intel iommu has a similar bit, GCMD_REG.TE: when the guest intel_iommu
driver sets it at probe time, vtd_address_space_refresh_all() is called
on the qemu side to enable the IOMMU MRs.

>
>For coldplugged device the technique works because we make sure all
>the IOMMU MR are enabled once on the machine init done: 94df5b2180
>("virtio-iommu: Fix 64kB host page size VFIO device assignment")
>for granule freeze. But I would be keen to get rid of this trick.
>
>Using an IOMMU MR Ops is unpractical because this relies on the IOMMU
>MR to have been enabled and the corresponding vfio_listener_region_add()
>to be executed. Instead this series proposes to replace the usage of this
>API by the recently introduced PCIIOMMUOps: ba7d12eb8c  ("hw/pci:
>modify
>pci_setup_iommu() to set PCIIOMMUOps"). That way, the callback can be
>called earlier, once the usable IOVA regions have been collected by
>VFIO, without the need for the IOMMU MR to be enabled.
>
>This looks cleaner. In the short term this may also be used for
>passing the page size mask, which would allow to get rid of the
>hacky transient IOMMU MR enablement mentionned above.
>
>[1] [PATCH v4 00/12] VIRTIO-IOMMU/VFIO: Don't assume 64b IOVA space
>    https://lore.kernel.org/all/20231019134651.842175-1-
>eric.auger@redhat.com/
>
>[2] https://lore.kernel.org/all/20230929161547.GB2957297@myrica/
>
>Extra Notes:
>With that series, the reserved memory regions are communicated on time
>so that the virtio-iommu probe request grabs them. However this is not
>sufficient. In some cases (my case), I still see some DMA MAP failures
>and the guest keeps on using IOVA ranges outside the geometry of the
>physical IOMMU. This is due to the fact the VFIO-PCI device is in the
>same iommu group as the pcie root port. Normally the kernel
>iova_reserve_iommu_regions (dma-iommu.c) is supposed to call
>reserve_iova()
>for each reserved IOVA, which carves them out of the allocator. When
>iommu_dma_init_domain() gets called for the hotplugged vfio-pci device
>the iova domain is already allocated and set and we don't call
>iova_reserve_iommu_regions() again for the vfio-pci device. So its
>corresponding reserved regions are not properly taken into account.

I suspect there is the same issue with coldplugged devices. If those
devices are in the same group, iova_reserve_iommu_regions() is only
called for the first device, so the other devices' reserved regions are
missed.

Curious how you get the passthrough device and the pcie root port into
the same group. When I start an x86 guest with a passthrough device, I
see that the passthrough device and the pcie root port are in different
groups.

-[0000:00]-+-00.0
           +-01.0
           +-02.0
           +-03.0-[01]----00.0

/sys/kernel/iommu_groups/3/devices:
0000:00:03.0
/sys/kernel/iommu_groups/7/devices:
0000:01:00.0

My qemu cmdline:
-device pcie-root-port,id=root0,slot=0
-device vfio-pci,host=6f:01.0,id=vfio0,bus=root0

Thanks
Zhenzhong

>
>This is not trivial to fix because theoretically the 1st attached
>devices could already have allocated IOVAs within the reserved regions
>of the second device. Also we are somehow hijacking the reserved
>memory regions to model the geometry of the physical IOMMU so not sure
>any attempt to fix that upstream will be accepted. At the moment one
>solution is to make sure assigned devices end up in singleton group.
>Another solution is to work on a different approach where the gaw
>can be passed as an option to the virtio-iommu device, similarly at
>what is done with intel iommu.
>
>This series can be found at:
>https://github.com/eauger/qemu/tree/hotplug-resv-rfc
>
>
>Eric Auger (7):
>  hw/pci: Introduce PCIIOMMUOps::set_host_iova_regions
>  hw/pci: Introduce pci_device_iommu_bus
>  vfio/pci: Pass the usable IOVA ranges through PCIIOMMUOps
>  virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions
>  virtio-iommu: Remove the implementation of iommu_set_iova_ranges
>  hw/vfio: Remove memory_region_iommu_set_iova_ranges() call
>  memory: Remove IOMMU MR iommu_set_iova_range API
>
> include/exec/memory.h    |  32 -------
> include/hw/pci/pci.h     |  16 ++++
> hw/pci/pci.c             |  16 ++++
> hw/vfio/common.c         |  10 --
> hw/vfio/pci.c            |  27 ++++++
> hw/virtio/virtio-iommu.c | 201 ++++++++++++++++++++-------------------
> system/memory.c          |  13 ---
> 7 files changed, 160 insertions(+), 155 deletions(-)
>
>--
>2.41.0




* RE: [RFC 2/7] hw/pci: Introduce pci_device_iommu_bus
  2024-01-17  8:02 ` [RFC 2/7] hw/pci: Introduce pci_device_iommu_bus Eric Auger
@ 2024-01-18  7:32   ` Duan, Zhenzhong
  0 siblings, 0 replies; 20+ messages in thread
From: Duan, Zhenzhong @ 2024-01-18  7:32 UTC (permalink / raw)
  To: Eric Auger, eric.auger.pro, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, peterx, yanghliu, pbonzini
  Cc: mst, clg



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: [RFC 2/7] hw/pci: Introduce pci_device_iommu_bus
>
>This helper will allow subsequent patches to retrieve the IOMMU bus
>and call its associated PCIIOMMUOps callbacks.
>
>Signed-off-by: Eric Auger <eric.auger@redhat.com>
>---
> include/hw/pci/pci.h |  1 +
> hw/pci/pci.c         | 16 ++++++++++++++++
> 2 files changed, 17 insertions(+)
>
>diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>index 63c018b35a..649b327f9f 100644
>--- a/include/hw/pci/pci.h
>+++ b/include/hw/pci/pci.h
>@@ -403,6 +403,7 @@ typedef struct PCIIOMMUOps {
> } PCIIOMMUOps;
>
> AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
>+PCIBus *pci_device_iommu_bus(PCIDevice *dev);
>
> /**
>  * pci_setup_iommu: Initialize specific IOMMU handlers for a PCIBus
>diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>index 76080af580..5bf07662fe 100644
>--- a/hw/pci/pci.c
>+++ b/hw/pci/pci.c
>@@ -2672,6 +2672,22 @@ static void
>pci_device_class_base_init(ObjectClass *klass, void *data)
>     }
> }
>
>+PCIBus *pci_device_iommu_bus(PCIDevice *dev)
>+{
>+    PCIBus *bus = pci_get_bus(dev);
>+    PCIBus *iommu_bus = bus;
>+
>+    while (iommu_bus && !iommu_bus->iommu_ops && iommu_bus-
>>parent_dev) {
>+        PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
>+
>+        iommu_bus = parent_bus;

Variable parent_bus can be removed.

>+    }
>+    if (pci_bus_bypass_iommu(bus)) {
>+        return NULL;
>+    }
>+    return iommu_bus;
>+}
>+
> AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> {
>     PCIBus *bus = pci_get_bus(dev);
>--
>2.41.0




* RE: [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions
  2024-01-17  8:02 ` [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions Eric Auger
@ 2024-01-18  7:43   ` Duan, Zhenzhong
  2024-01-18 12:25     ` Eric Auger
  0 siblings, 1 reply; 20+ messages in thread
From: Duan, Zhenzhong @ 2024-01-18  7:43 UTC (permalink / raw)
  To: Eric Auger, eric.auger.pro, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, peterx, yanghliu, pbonzini
  Cc: mst, clg

Hi Eric,

>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps
>set_host_resv_regions
>
>Reuse the implementation of virtio_iommu_set_iova_ranges() which
>will be removed in subsequent patches.
>
>Signed-off-by: Eric Auger <eric.auger@redhat.com>
>---
> hw/virtio/virtio-iommu.c | 134 +++++++++++++++++++++++++++++---------
>-
> 1 file changed, 101 insertions(+), 33 deletions(-)
>
>diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
>index 8a4bd933c6..716a3fcfbf 100644
>--- a/hw/virtio/virtio-iommu.c
>+++ b/hw/virtio/virtio-iommu.c
>@@ -461,8 +461,109 @@ static AddressSpace
>*virtio_iommu_find_add_as(PCIBus *bus, void *opaque,
>     return &sdev->as;
> }
>
>+/**
>+ * rebuild_resv_regions: rebuild resv regions with both the
>+ * info of host resv ranges and property set resv ranges
>+ */
>+static int rebuild_resv_regions(IOMMUDevice *sdev)
>+{
>+    GList *l;
>+    int i = 0;
>+
>+    /* free the existing list and rebuild it from scratch */
>+    g_list_free_full(sdev->resv_regions, g_free);
>+    sdev->resv_regions = NULL;
>+
>+    /* First add host reserved regions if any, all tagged as RESERVED */
>+    for (l = sdev->host_resv_ranges; l; l = l->next) {
>+        ReservedRegion *reg = g_new0(ReservedRegion, 1);
>+        Range *r = (Range *)l->data;
>+
>+        reg->type = VIRTIO_IOMMU_RESV_MEM_T_RESERVED;
>+        range_set_bounds(&reg->range, range_lob(r), range_upb(r));
>+        sdev->resv_regions = resv_region_list_insert(sdev->resv_regions, reg);
>+        trace_virtio_iommu_host_resv_regions(sdev-
>>iommu_mr.parent_obj.name, i,
>+                                             range_lob(&reg->range),
>+                                             range_upb(&reg->range));
>+        i++;
>+    }
>+    /*
>+     * then add higher priority reserved regions set by the machine
>+     * through properties
>+     */
>+    add_prop_resv_regions(sdev);
>+    return 0;
>+}
>+
>+static int virtio_iommu_set_host_iova_ranges(PCIBus *bus, void *opaque,
>+                                             int devfn, GList *iova_ranges,
>+                                             Error **errp)
>+{
>+    VirtIOIOMMU *s = opaque;
>+    IOMMUPciBus *sbus = g_hash_table_lookup(s->as_by_busptr, bus);
>+    IOMMUDevice *sdev;
>+    GList *current_ranges;
>+    GList *l, *tmp, *new_ranges = NULL;
>+    int ret = -EINVAL;
>+
>+    if (!sbus) {
>+        error_report("%s no sbus", __func__);
>+    }

Do we plan to support multiple devices in the same iommu group?
as_by_busptr is hashed by bus, which is the aliased bus of the device.
So sbus may be NULL if the device's bus is different from the aliased bus.
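
If that can happen, a guard along these lines may be needed (just a
sketch):

    if (!sbus) {
        error_setg(errp, "%s: no IOMMUPciBus found for this PCI bus",
                   __func__);
        return -EINVAL;
    }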

>+
>+    sdev = sbus->pbdev[devfn];
>+
>+    current_ranges = sdev->host_resv_ranges;
>+
>+    warn_report("%s: host_resv_regions=%d", __func__, !!sdev-
>>host_resv_ranges);
>+    /* check that each new resv region is included in an existing one */
>+    if (sdev->host_resv_ranges) {

Maybe we could just error out, as vfio_realize should not call
set_host_iova_ranges() twice.

Thanks
Zhenzhong
>+        range_inverse_array(iova_ranges,
>+                            &new_ranges,
>+                            0, UINT64_MAX);
>+
>+        for (tmp = new_ranges; tmp; tmp = tmp->next) {
>+            Range *newr = (Range *)tmp->data;
>+            bool included = false;
>+
>+            for (l = current_ranges; l; l = l->next) {
>+                Range * r = (Range *)l->data;
>+
>+                if (range_contains_range(r, newr)) {
>+                    included = true;
>+                    break;
>+                }
>+            }
>+            if (!included) {
>+                goto error;
>+            }
>+        }
>+        /* all new reserved ranges are included in existing ones */
>+        ret = 0;
>+        goto out;
>+    }
>+
>+    if (sdev->probe_done) {
>+        warn_report("%s: Notified about new host reserved regions after
>probe",
>+                    __func__);
>+    }
>+
>+    range_inverse_array(iova_ranges,
>+                        &sdev->host_resv_ranges,
>+                        0, UINT64_MAX);
>+    rebuild_resv_regions(sdev);
>+
>+    return 0;
>+error:
>+    error_setg(errp, "%s Conflicting host reserved ranges set!",
>+               __func__);
>+out:
>+    g_list_free_full(new_ranges, g_free);
>+    return ret;
>+}
>+
> static const PCIIOMMUOps virtio_iommu_ops = {
>     .get_address_space = virtio_iommu_find_add_as,
>+    .set_host_iova_ranges = virtio_iommu_set_host_iova_ranges,
> };
>
> static int virtio_iommu_attach(VirtIOIOMMU *s,
>@@ -1158,39 +1259,6 @@ static int
>virtio_iommu_set_page_size_mask(IOMMUMemoryRegion *mr,
>     return 0;
> }
>
>-/**
>- * rebuild_resv_regions: rebuild resv regions with both the
>- * info of host resv ranges and property set resv ranges
>- */
>-static int rebuild_resv_regions(IOMMUDevice *sdev)
>-{
>-    GList *l;
>-    int i = 0;
>-
>-    /* free the existing list and rebuild it from scratch */
>-    g_list_free_full(sdev->resv_regions, g_free);
>-    sdev->resv_regions = NULL;
>-
>-    /* First add host reserved regions if any, all tagged as RESERVED */
>-    for (l = sdev->host_resv_ranges; l; l = l->next) {
>-        ReservedRegion *reg = g_new0(ReservedRegion, 1);
>-        Range *r = (Range *)l->data;
>-
>-        reg->type = VIRTIO_IOMMU_RESV_MEM_T_RESERVED;
>-        range_set_bounds(&reg->range, range_lob(r), range_upb(r));
>-        sdev->resv_regions = resv_region_list_insert(sdev->resv_regions, reg);
>-        trace_virtio_iommu_host_resv_regions(sdev-
>>iommu_mr.parent_obj.name, i,
>-                                             range_lob(&reg->range),
>-                                             range_upb(&reg->range));
>-        i++;
>-    }
>-    /*
>-     * then add higher priority reserved regions set by the machine
>-     * through properties
>-     */
>-    add_prop_resv_regions(sdev);
>-    return 0;
>-}
>
> /**
>  * virtio_iommu_set_iova_ranges: Conveys the usable IOVA ranges
>--
>2.41.0




* Re: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices
  2024-01-18  7:10 ` [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices Duan, Zhenzhong
@ 2024-01-18  9:43   ` Eric Auger
  2024-01-19  6:46     ` Duan, Zhenzhong
  2024-01-25 18:48     ` Jean-Philippe Brucker
  0 siblings, 2 replies; 20+ messages in thread
From: Eric Auger @ 2024-01-18  9:43 UTC (permalink / raw)
  To: Duan, Zhenzhong, eric.auger.pro, qemu-devel, qemu-arm,
	jean-philippe, alex.williamson, peter.maydell, peterx, yanghliu,
	pbonzini
  Cc: mst, clg

Hi Zhenzhong,
On 1/18/24 08:10, Duan, Zhenzhong wrote:
> Hi Eric,
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Cc: mst@redhat.com; clg@redhat.com
>> Subject: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling
>> for hotplugged devices
>>
>> In [1] we attempted to fix a case where a VFIO-PCI device protected
>> with a virtio-iommu was assigned to an x86 guest. On x86 the physical
>> IOMMU may have an address width (gaw) of 39 or 48 bits whereas the
>> virtio-iommu used to expose a 64b address space by default.
>> Hence the guest was trying to use the full 64b space and we hit
>> DMA MAP failures. To work around this issue we managed to pass
>> usable IOVA regions (excluding the out of range space) from VFIO
>> to the virtio-iommu device. This was made feasible by introducing
>> a new IOMMU Memory Region callback dubbed iommu_set_iova_regions().
>> This latter gets called when the IOMMU MR is enabled which
>> causes the vfio_listener_region_add() to be called.
>>
>> However with VFIO-PCI hotplug, this technique fails due to the
>> race between the call to the callback in the add memory listener
>> and the virtio-iommu probe request. Indeed the probe request gets
>> called before the attach to the domain. So in that case the usable
>> regions are communicated after the probe request and fail to be
>> conveyed to the guest. To be honest the problem was hinted by
>> Jean-Philippe in [1] and I should have been more careful at
>> listening to him and testing with hotplug :-(
> It looks the global virtio_iommu_config.bypass is never cleared in guest.
> When guest virtio_iommu driver enable IOMMU, should it clear this
> bypass attribute?
> If it could be cleared in viommu_probe(), then qemu will call
> virtio_iommu_set_config() then virtio_iommu_switch_address_space_all()
> to enable IOMMU MR. Then both coldplugged and hotplugged devices will work.

this field is iommu wide while the probe applies to a single device. In
general I would prefer not to be dependent on the MR enablement. We know
that the device is likely to be protected and we can collect its
requirements beforehand.
>
> Intel iommu has a similar bit in register GCMD_REG.TE, when guest
> intel_iommu driver probe set it, on qemu side, vtd_address_space_refresh_all()
> is called to enable IOMMU MRs.
interesting.

Would be curious to get Jean Philippe's pov.
>
>> For coldplugged device the technique works because we make sure all
>> the IOMMU MR are enabled once on the machine init done: 94df5b2180
>> ("virtio-iommu: Fix 64kB host page size VFIO device assignment")
>> for granule freeze. But I would be keen to get rid of this trick.
>>
>> Using an IOMMU MR Ops is unpractical because this relies on the IOMMU
>> MR to have been enabled and the corresponding vfio_listener_region_add()
>> to be executed. Instead this series proposes to replace the usage of this
>> API by the recently introduced PCIIOMMUOps: ba7d12eb8c  ("hw/pci:
>> modify
>> pci_setup_iommu() to set PCIIOMMUOps"). That way, the callback can be
>> called earlier, once the usable IOVA regions have been collected by
>> VFIO, without the need for the IOMMU MR to be enabled.
>>
>> This looks cleaner. In the short term this may also be used for
>> passing the page size mask, which would allow to get rid of the
>> hacky transient IOMMU MR enablement mentionned above.
>>
>> [1] [PATCH v4 00/12] VIRTIO-IOMMU/VFIO: Don't assume 64b IOVA space
>>    https://lore.kernel.org/all/20231019134651.842175-1-
>> eric.auger@redhat.com/
>>
>> [2] https://lore.kernel.org/all/20230929161547.GB2957297@myrica/
>>
>> Extra Notes:
>> With that series, the reserved memory regions are communicated on time
>> so that the virtio-iommu probe request grabs them. However this is not
>> sufficient. In some cases (my case), I still see some DMA MAP failures
>> and the guest keeps on using IOVA ranges outside the geometry of the
>> physical IOMMU. This is due to the fact the VFIO-PCI device is in the
>> same iommu group as the pcie root port. Normally the kernel
>> iova_reserve_iommu_regions (dma-iommu.c) is supposed to call
>> reserve_iova()
>> for each reserved IOVA, which carves them out of the allocator. When
>> iommu_dma_init_domain() gets called for the hotplugged vfio-pci device
>> the iova domain is already allocated and set and we don't call
>> iova_reserve_iommu_regions() again for the vfio-pci device. So its
>> corresponding reserved regions are not properly taken into account.
> I suspect there is same issue with coldplugged devices. If those devices
> are in same group, get iova_reserve_iommu_regions() is only called
> for first device. But other devices's reserved regions are missed.

Correct
>
> Curious how you make passthrough device and pcie root port under same
> group.
> When I start a x86 guest with passthrough device, I see passthrough
> device and pcie root port are in different group.
>
> -[0000:00]-+-00.0
>            +-01.0
>            +-02.0
>            +-03.0-[01]----00.0
>
> /sys/kernel/iommu_groups/3/devices:
> 0000:00:03.0
> /sys/kernel/iommu_groups/7/devices:
> 0000:01:00.0
>
> My qemu cmdline:
> -device pcie-root-port,id=root0,slot=0
> -device vfio-pci,host=6f:01.0,id=vfio0,bus=root0

I just replayed the scenario:
- if you have a coldplugged vfio-pci device, the pci root port and the
  passed-through device end up in different iommu groups. On my end I use
  ioh3420 but you confirmed it is the same for the generic pcie-root-port.
- however, if you hotplug the vfio-pci device, that's a different story:
  they end up in the same group. Don't ask me why. I tried with both
  virtio-iommu and intel iommu and I end up with the same topology.
  That looks really weird to me.

I initially thought this was an ACS issue but I am now puzzled.

Thanks!

Eric
>
> Thanks
> Zhenzhong
>
>> This is not trivial to fix because theoretically the 1st attached
>> devices could already have allocated IOVAs within the reserved regions
>> of the second device. Also we are somehow hijacking the reserved
>> memory regions to model the geometry of the physical IOMMU so not sure
>> any attempt to fix that upstream will be accepted. At the moment one
>> solution is to make sure assigned devices end up in singleton group.
>> Another solution is to work on a different approach where the gaw
>> can be passed as an option to the virtio-iommu device, similarly at
>> what is done with intel iommu.
>>
>> This series can be found at:
>> https://github.com/eauger/qemu/tree/hotplug-resv-rfc
>>
>>
>> Eric Auger (7):
>>  hw/pci: Introduce PCIIOMMUOps::set_host_iova_regions
>>  hw/pci: Introduce pci_device_iommu_bus
>>  vfio/pci: Pass the usable IOVA ranges through PCIIOMMUOps
>>  virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions
>>  virtio-iommu: Remove the implementation of iommu_set_iova_ranges
>>  hw/vfio: Remove memory_region_iommu_set_iova_ranges() call
>>  memory: Remove IOMMU MR iommu_set_iova_range API
>>
>> include/exec/memory.h    |  32 -------
>> include/hw/pci/pci.h     |  16 ++++
>> hw/pci/pci.c             |  16 ++++
>> hw/vfio/common.c         |  10 --
>> hw/vfio/pci.c            |  27 ++++++
>> hw/virtio/virtio-iommu.c | 201 ++++++++++++++++++++-------------------
>> system/memory.c          |  13 ---
>> 7 files changed, 160 insertions(+), 155 deletions(-)
>>
>> --
>> 2.41.0




* Re: [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions
  2024-01-18  7:43   ` Duan, Zhenzhong
@ 2024-01-18 12:25     ` Eric Auger
  2024-01-19  7:00       ` Duan, Zhenzhong
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Auger @ 2024-01-18 12:25 UTC (permalink / raw)
  To: Duan, Zhenzhong, eric.auger.pro, qemu-devel, qemu-arm,
	jean-philippe, alex.williamson, peter.maydell, peterx, yanghliu,
	pbonzini
  Cc: mst, clg

Hi Zhenzhong,
On 1/18/24 08:43, Duan, Zhenzhong wrote:
> Hi Eric,
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Subject: [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps
>> set_host_resv_regions
>>
>> Reuse the implementation of virtio_iommu_set_iova_ranges() which
>> will be removed in subsequent patches.
>>
>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>> ---
>> hw/virtio/virtio-iommu.c | 134 +++++++++++++++++++++++++++++---------
>> -
>> 1 file changed, 101 insertions(+), 33 deletions(-)
>>
>> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
>> index 8a4bd933c6..716a3fcfbf 100644
>> --- a/hw/virtio/virtio-iommu.c
>> +++ b/hw/virtio/virtio-iommu.c
>> @@ -461,8 +461,109 @@ static AddressSpace
>> *virtio_iommu_find_add_as(PCIBus *bus, void *opaque,
>>     return &sdev->as;
>> }
>>
>> +/**
>> + * rebuild_resv_regions: rebuild resv regions with both the
>> + * info of host resv ranges and property set resv ranges
>> + */
>> +static int rebuild_resv_regions(IOMMUDevice *sdev)
>> +{
>> +    GList *l;
>> +    int i = 0;
>> +
>> +    /* free the existing list and rebuild it from scratch */
>> +    g_list_free_full(sdev->resv_regions, g_free);
>> +    sdev->resv_regions = NULL;
>> +
>> +    /* First add host reserved regions if any, all tagged as RESERVED */
>> +    for (l = sdev->host_resv_ranges; l; l = l->next) {
>> +        ReservedRegion *reg = g_new0(ReservedRegion, 1);
>> +        Range *r = (Range *)l->data;
>> +
>> +        reg->type = VIRTIO_IOMMU_RESV_MEM_T_RESERVED;
>> +        range_set_bounds(&reg->range, range_lob(r), range_upb(r));
>> +        sdev->resv_regions = resv_region_list_insert(sdev->resv_regions, reg);
>> +        trace_virtio_iommu_host_resv_regions(sdev-
>>> iommu_mr.parent_obj.name, i,
>> +                                             range_lob(&reg->range),
>> +                                             range_upb(&reg->range));
>> +        i++;
>> +    }
>> +    /*
>> +     * then add higher priority reserved regions set by the machine
>> +     * through properties
>> +     */
>> +    add_prop_resv_regions(sdev);
>> +    return 0;
>> +}
>> +
>> +static int virtio_iommu_set_host_iova_ranges(PCIBus *bus, void *opaque,
>> +                                             int devfn, GList *iova_ranges,
>> +                                             Error **errp)
>> +{
>> +    VirtIOIOMMU *s = opaque;
>> +    IOMMUPciBus *sbus = g_hash_table_lookup(s->as_by_busptr, bus);
>> +    IOMMUDevice *sdev;
>> +    GList *current_ranges;
>> +    GList *l, *tmp, *new_ranges = NULL;
>> +    int ret = -EINVAL;
>> +
>> +    if (!sbus) {
>> +        error_report("%s no sbus", __func__);
>> +    }
> Do we plan to support multiple devices in same iommu group?
> as_by_busptr is hashed by bus which is an aliased bus of the device.
> So sbus may be NULL if device's bus is different from aliased bus.
If I understand your remark properly, this means that
virtio_iommu_set_host_iova_ranges() should take the aliased bus as an
argument and not the bus, right?
I think we shall support non-singleton groups too.

>
>> +
>> +    sdev = sbus->pbdev[devfn];
>> +
>> +    current_ranges = sdev->host_resv_ranges;
>> +
>> +    warn_report("%s: host_resv_regions=%d", __func__, !!sdev-
>>> host_resv_ranges);
>> +    /* check that each new resv region is included in an existing one */
>> +    if (sdev->host_resv_ranges) {
> May be we could just error out as vfio_realize should not call
> set_host_iova_ranges() twice.
So if we have several devices in the group, set_host_iova_ranges()
may be called several times, right?

Eric
>
> Thanks
> Zhenzhong
>> +        range_inverse_array(iova_ranges,
>> +                            &new_ranges,
>> +                            0, UINT64_MAX);
>> +
>> +        for (tmp = new_ranges; tmp; tmp = tmp->next) {
>> +            Range *newr = (Range *)tmp->data;
>> +            bool included = false;
>> +
>> +            for (l = current_ranges; l; l = l->next) {
>> +                Range * r = (Range *)l->data;
>> +
>> +                if (range_contains_range(r, newr)) {
>> +                    included = true;
>> +                    break;
>> +                }
>> +            }
>> +            if (!included) {
>> +                goto error;
>> +            }
>> +        }
>> +        /* all new reserved ranges are included in existing ones */
>> +        ret = 0;
>> +        goto out;
>> +    }
>> +
>> +    if (sdev->probe_done) {
>> +        warn_report("%s: Notified about new host reserved regions after probe",
>> +                    __func__);
>> +    }
>> +
>> +    range_inverse_array(iova_ranges,
>> +                        &sdev->host_resv_ranges,
>> +                        0, UINT64_MAX);
>> +    rebuild_resv_regions(sdev);
>> +
>> +    return 0;
>> +error:
>> +    error_setg(errp, "%s Conflicting host reserved ranges set!",
>> +               __func__);
>> +out:
>> +    g_list_free_full(new_ranges, g_free);
>> +    return ret;
>> +}
>> +
>> static const PCIIOMMUOps virtio_iommu_ops = {
>>     .get_address_space = virtio_iommu_find_add_as,
>> +    .set_host_iova_ranges = virtio_iommu_set_host_iova_ranges,
>> };
>>
>> static int virtio_iommu_attach(VirtIOIOMMU *s,
>> @@ -1158,39 +1259,6 @@ static int
>> virtio_iommu_set_page_size_mask(IOMMUMemoryRegion *mr,
>>     return 0;
>> }
>>
>> -/**
>> - * rebuild_resv_regions: rebuild resv regions with both the
>> - * info of host resv ranges and property set resv ranges
>> - */
>> -static int rebuild_resv_regions(IOMMUDevice *sdev)
>> -{
>> -    GList *l;
>> -    int i = 0;
>> -
>> -    /* free the existing list and rebuild it from scratch */
>> -    g_list_free_full(sdev->resv_regions, g_free);
>> -    sdev->resv_regions = NULL;
>> -
>> -    /* First add host reserved regions if any, all tagged as RESERVED */
>> -    for (l = sdev->host_resv_ranges; l; l = l->next) {
>> -        ReservedRegion *reg = g_new0(ReservedRegion, 1);
>> -        Range *r = (Range *)l->data;
>> -
>> -        reg->type = VIRTIO_IOMMU_RESV_MEM_T_RESERVED;
>> -        range_set_bounds(&reg->range, range_lob(r), range_upb(r));
>> -        sdev->resv_regions = resv_region_list_insert(sdev->resv_regions, reg);
>> -        trace_virtio_iommu_host_resv_regions(sdev->iommu_mr.parent_obj.name, i,
>> -                                             range_lob(&reg->range),
>> -                                             range_upb(&reg->range));
>> -        i++;
>> -    }
>> -    /*
>> -     * then add higher priority reserved regions set by the machine
>> -     * through properties
>> -     */
>> -    add_prop_resv_regions(sdev);
>> -    return 0;
>> -}
>>
>> /**
>>  * virtio_iommu_set_iova_ranges: Conveys the usable IOVA ranges
>> --
>> 2.41.0



^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices
  2024-01-18  9:43   ` Eric Auger
@ 2024-01-19  6:46     ` Duan, Zhenzhong
  2024-01-25 18:48     ` Jean-Philippe Brucker
  1 sibling, 0 replies; 20+ messages in thread
From: Duan, Zhenzhong @ 2024-01-19  6:46 UTC (permalink / raw)
  To: eric.auger, eric.auger.pro, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, peterx, yanghliu, pbonzini
  Cc: mst, clg



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry
>handling for hotplugged devices
>
>Hi Zhenzhong,
>On 1/18/24 08:10, Duan, Zhenzhong wrote:
>> Hi Eric,
>>
>>> -----Original Message-----
>>> From: Eric Auger <eric.auger@redhat.com>
>>> Cc: mst@redhat.com; clg@redhat.com
>>> Subject: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry
>handling
>>> for hotplugged devices
>>>
>>> In [1] we attempted to fix a case where a VFIO-PCI device protected
>>> with a virtio-iommu was assigned to an x86 guest. On x86 the physical
>>> IOMMU may have an address width (gaw) of 39 or 48 bits whereas the
>>> virtio-iommu used to expose a 64b address space by default.
>>> Hence the guest was trying to use the full 64b space and we hit
>>> DMA MAP failures. To work around this issue we managed to pass
>>> usable IOVA regions (excluding the out of range space) from VFIO
>>> to the virtio-iommu device. This was made feasible by introducing
>>> a new IOMMU Memory Region callback dubbed
>iommu_set_iova_regions().
>>> This latter gets called when the IOMMU MR is enabled which
>>> causes the vfio_listener_region_add() to be called.
>>>
>>> However with VFIO-PCI hotplug, this technique fails due to the
>>> race between the call to the callback in the add memory listener
>>> and the virtio-iommu probe request. Indeed the probe request gets
>>> called before the attach to the domain. So in that case the usable
>>> regions are communicated after the probe request and fail to be
>>> conveyed to the guest. To be honest the problem was hinted by
>>> Jean-Philippe in [1] and I should have been more careful at
>>> listening to him and testing with hotplug :-(
>> It looks the global virtio_iommu_config.bypass is never cleared in guest.
>> When guest virtio_iommu driver enable IOMMU, should it clear this
>> bypass attribute?
>> If it could be cleared in viommu_probe(), then qemu will call
>> virtio_iommu_set_config() then virtio_iommu_switch_address_space_all()
>> to enable IOMMU MR. Then both coldplugged and hotplugged devices will
>work.
>
>this field is iommu wide while the probe applies to a single device. In
>general I would prefer not to be dependent on the MR enablement. We know
>that the device is likely to be protected and we can collect its
>requirements beforehand.

Agree that your new patch is cleaner.

>>
>> Intel iommu has a similar bit in register GCMD_REG.TE, when guest
>> intel_iommu driver probe set it, on qemu side,
>vtd_address_space_refresh_all()
>> is called to enable IOMMU MRs.
>interesting.
>
>Would be curious to get Jean Philippe's pov.
>>
>>> For coldplugged device the technique works because we make sure all
>>> the IOMMU MR are enabled once on the machine init done: 94df5b2180
>>> ("virtio-iommu: Fix 64kB host page size VFIO device assignment")
>>> for granule freeze. But I would be keen to get rid of this trick.
>>>
>>> Using an IOMMU MR Ops is unpractical because this relies on the
>IOMMU
>>> MR to have been enabled and the corresponding
>vfio_listener_region_add()
>>> to be executed. Instead this series proposes to replace the usage of this
>>> API by the recently introduced PCIIOMMUOps: ba7d12eb8c  ("hw/pci:
>>> modify
>>> pci_setup_iommu() to set PCIIOMMUOps"). That way, the callback can be
>>> called earlier, once the usable IOVA regions have been collected by
>>> VFIO, without the need for the IOMMU MR to be enabled.
>>>
>>> This looks cleaner. In the short term this may also be used for
>>> passing the page size mask, which would allow to get rid of the
>>> hacky transient IOMMU MR enablement mentionned above.
>>>
>>> [1] [PATCH v4 00/12] VIRTIO-IOMMU/VFIO: Don't assume 64b IOVA
>space
>>>    https://lore.kernel.org/all/20231019134651.842175-1-
>>> eric.auger@redhat.com/
>>>
>>> [2] https://lore.kernel.org/all/20230929161547.GB2957297@myrica/
>>>
>>> Extra Notes:
>>> With that series, the reserved memory regions are communicated on
>time
>>> so that the virtio-iommu probe request grabs them. However this is not
>>> sufficient. In some cases (my case), I still see some DMA MAP failures
>>> and the guest keeps on using IOVA ranges outside the geometry of the
>>> physical IOMMU. This is due to the fact the VFIO-PCI device is in the
>>> same iommu group as the pcie root port. Normally the kernel
>>> iova_reserve_iommu_regions (dma-iommu.c) is supposed to call
>>> reserve_iova()
>>> for each reserved IOVA, which carves them out of the allocator. When
>>> iommu_dma_init_domain() gets called for the hotplugged vfio-pci device
>>> the iova domain is already allocated and set and we don't call
>>> iova_reserve_iommu_regions() again for the vfio-pci device. So its
>>> corresponding reserved regions are not properly taken into account.
>> I suspect there is the same issue with coldplugged devices. If those devices
>> are in the same group, iova_reserve_iommu_regions() is only called
>> for the first device. The other devices' reserved regions are missed.
>
>Correct
>>
>> Curious how you make passthrough device and pcie root port under same
>> group.
>> When I start a x86 guest with passthrough device, I see passthrough
>> device and pcie root port are in different group.
>>
>> -[0000:00]-+-00.0
>>            +-01.0
>>            +-02.0
>>            +-03.0-[01]----00.0
>>
>> /sys/kernel/iommu_groups/3/devices:
>> 0000:00:03.0
>> /sys/kernel/iommu_groups/7/devices:
>> 0000:01:00.0
>>
>> My qemu cmdline:
>> -device pcie-root-port,id=root0,slot=0
>> -device vfio-pci,host=6f:01.0,id=vfio0,bus=root0
>
>I just replayed the scenario:
>- if you have a coldplugged vfio-pci device, the pci root port and the
>passthroughed device end up in different iommu groups. On my end I use
>ioh3420 but you confirmed that's the same for the generic pcie-root-port
>- however if you hotplug the vfio-pci device that's a different story:
>they end up in the same group. Don't ask me why. I tried with
>both virtio-iommu and intel iommu and I end up with the same topology.
>That looks really weird to me.

That's strange. I tested two vfio devices with ioh3420, one coldplugged, the other hotplugged.
Each ioh3420 and its vfio device end up in the same group.

-[0000:00]-+-00.0
           +-01.0
           +-02.0
           +-03.0-[01]----00.0
           +-04.0-[02]----00.0

/sys/kernel/iommu_groups/3/devices:
0000:00:03.0  0000:01:00.0

/sys/kernel/iommu_groups/4/devices:
0000:00:04.0  0000:02:00.0

Thanks
Zhenzhong

>
>I initially thought this was an ACS issue but I am now puzzled.
>
>Thanks!
>
>Eric
>>
>> Thanks
>> Zhenzhong
>>
>>> This is not trivial to fix because theoretically the 1st attached
>>> devices could already have allocated IOVAs within the reserved regions
>>> of the second device. Also we are somehow hijacking the reserved
>>> memory regions to model the geometry of the physical IOMMU so not
>sure
>>> any attempt to fix that upstream will be accepted. At the moment one
>>> solution is to make sure assigned devices end up in singleton group.
>>> Another solution is to work on a different approach where the gaw
>>> can be passed as an option to the virtio-iommu device, similarly at
>>> what is done with intel iommu.
>>>
>>> This series can be found at:
>>> https://github.com/eauger/qemu/tree/hotplug-resv-rfc
>>>
>>>
>>> Eric Auger (7):
>>>  hw/pci: Introduce PCIIOMMUOps::set_host_iova_regions
>>>  hw/pci: Introduce pci_device_iommu_bus
>>>  vfio/pci: Pass the usable IOVA ranges through PCIIOMMUOps
>>>  virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions
>>>  virtio-iommu: Remove the implementation of iommu_set_iova_ranges
>>>  hw/vfio: Remove memory_region_iommu_set_iova_ranges() call
>>>  memory: Remove IOMMU MR iommu_set_iova_range API
>>>
>>> include/exec/memory.h    |  32 -------
>>> include/hw/pci/pci.h     |  16 ++++
>>> hw/pci/pci.c             |  16 ++++
>>> hw/vfio/common.c         |  10 --
>>> hw/vfio/pci.c            |  27 ++++++
>>> hw/virtio/virtio-iommu.c | 201 ++++++++++++++++++++-------------------
>>> system/memory.c          |  13 ---
>>> 7 files changed, 160 insertions(+), 155 deletions(-)
>>>
>>> --
>>> 2.41.0


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions
  2024-01-18 12:25     ` Eric Auger
@ 2024-01-19  7:00       ` Duan, Zhenzhong
  2024-01-22  7:17         ` Duan, Zhenzhong
  0 siblings, 1 reply; 20+ messages in thread
From: Duan, Zhenzhong @ 2024-01-19  7:00 UTC (permalink / raw)
  To: eric.auger, eric.auger.pro, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, peterx, yanghliu, pbonzini
  Cc: mst, clg



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps
>set_host_resv_regions
>
>Hi Zhenzhong,
>On 1/18/24 08:43, Duan, Zhenzhong wrote:
>> Hi Eric,
>>
>>> -----Original Message-----
>>> From: Eric Auger <eric.auger@redhat.com>
>>> Subject: [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps
>>> set_host_resv_regions
>>>
>>> Reuse the implementation of virtio_iommu_set_iova_ranges() which
>>> will be removed in subsequent patches.
>>>
>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>> ---
>>> hw/virtio/virtio-iommu.c | 134 +++++++++++++++++++++++++++++----------
>>> 1 file changed, 101 insertions(+), 33 deletions(-)
>>>
>>> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
>>> index 8a4bd933c6..716a3fcfbf 100644
>>> --- a/hw/virtio/virtio-iommu.c
>>> +++ b/hw/virtio/virtio-iommu.c
>>> @@ -461,8 +461,109 @@ static AddressSpace
>>> *virtio_iommu_find_add_as(PCIBus *bus, void *opaque,
>>>     return &sdev->as;
>>> }
>>>
>>> +/**
>>> + * rebuild_resv_regions: rebuild resv regions with both the
>>> + * info of host resv ranges and property set resv ranges
>>> + */
>>> +static int rebuild_resv_regions(IOMMUDevice *sdev)
>>> +{
>>> +    GList *l;
>>> +    int i = 0;
>>> +
>>> +    /* free the existing list and rebuild it from scratch */
>>> +    g_list_free_full(sdev->resv_regions, g_free);
>>> +    sdev->resv_regions = NULL;
>>> +
>>> +    /* First add host reserved regions if any, all tagged as RESERVED */
>>> +    for (l = sdev->host_resv_ranges; l; l = l->next) {
>>> +        ReservedRegion *reg = g_new0(ReservedRegion, 1);
>>> +        Range *r = (Range *)l->data;
>>> +
>>> +        reg->type = VIRTIO_IOMMU_RESV_MEM_T_RESERVED;
>>> +        range_set_bounds(&reg->range, range_lob(r), range_upb(r));
>>> +        sdev->resv_regions = resv_region_list_insert(sdev->resv_regions, reg);
>>> +        trace_virtio_iommu_host_resv_regions(sdev->iommu_mr.parent_obj.name, i,
>>> +                                             range_lob(&reg->range),
>>> +                                             range_upb(&reg->range));
>>> +        i++;
>>> +    }
>>> +    /*
>>> +     * then add higher priority reserved regions set by the machine
>>> +     * through properties
>>> +     */
>>> +    add_prop_resv_regions(sdev);
>>> +    return 0;
>>> +}
>>> +
>>> +static int virtio_iommu_set_host_iova_ranges(PCIBus *bus, void *opaque,
>>> +                                             int devfn, GList *iova_ranges,
>>> +                                             Error **errp)
>>> +{
>>> +    VirtIOIOMMU *s = opaque;
>>> +    IOMMUPciBus *sbus = g_hash_table_lookup(s->as_by_busptr, bus);
>>> +    IOMMUDevice *sdev;
>>> +    GList *current_ranges;
>>> +    GList *l, *tmp, *new_ranges = NULL;
>>> +    int ret = -EINVAL;
>>> +
>>> +    if (!sbus) {
>>> +        error_report("%s no sbus", __func__);
>>> +    }
>> Do we plan to support multiple devices in same iommu group?
>> as_by_busptr is hashed by bus which is an aliased bus of the device.
>> So sbus may be NULL if device's bus is different from aliased bus.
>If I understand you remark properly this means that
>
>virtio_iommu_set_host_iova_ranges should take as arg the aliased bus and
>not the bus, right?
>I think we shall support non singleton groups too.

Not exactly. I think we should pass the device's real BDF, not the aliased BDF.
Otherwise we set up the reserved ranges of all devices in the same group on the
aliased BDF.

I suspect that the hash lookup with the real BDF index will return NULL if
multiple devices are in the same group. If that's true, then iova_ranges is
never passed to the guest.
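
To make the failure mode concrete, here is a rough sketch (not code from this
series) of the lookup path under discussion, using only the structures visible
in the quoted patch; the fallback hinted at in the comment is hypothetical:

static IOMMUDevice *lookup_sdev(VirtIOIOMMU *s, PCIBus *bus, int devfn)
{
    /* as_by_busptr is keyed by the bus passed to get_address_space() */
    IOMMUPciBus *sbus = g_hash_table_lookup(s->as_by_busptr, bus);

    if (!sbus) {
        /*
         * If set_host_iova_ranges() is called with the device's real
         * bus while the hash table was populated with the aliased bus,
         * we end up here and the host ranges are silently dropped.
         * A possible fix is to translate (bus, devfn) to the aliased
         * (bus, devfn) before the lookup, or at least to fail loudly.
         */
        return NULL;
    }
    return sbus->pbdev[devfn];
}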

>
>>
>>> +
>>> +    sdev = sbus->pbdev[devfn];
>>> +
>>> +    current_ranges = sdev->host_resv_ranges;
>>> +
>>> +    warn_report("%s: host_resv_regions=%d", __func__, !!sdev->host_resv_ranges);
>>> +    /* check that each new resv region is included in an existing one */
>>> +    if (sdev->host_resv_ranges) {
>> May be we could just error out as vfio_realize should not call
>> set_host_iova_ranges() twice.
>so if we have several devices in the group,
>
>set_host_iova_ranges()

Yes,

>
> may be called several times. Right?

but it should be called on a different sdev each time, due to the different real BDFs, not only on the aliased BDF.

Thanks
Zhenzhong

>
>Eric
>>
>> Thanks
>> Zhenzhong
>>> +        range_inverse_array(iova_ranges,
>>> +                            &new_ranges,
>>> +                            0, UINT64_MAX);
>>> +
>>> +        for (tmp = new_ranges; tmp; tmp = tmp->next) {
>>> +            Range *newr = (Range *)tmp->data;
>>> +            bool included = false;
>>> +
>>> +            for (l = current_ranges; l; l = l->next) {
>>> +                Range * r = (Range *)l->data;
>>> +
>>> +                if (range_contains_range(r, newr)) {
>>> +                    included = true;
>>> +                    break;
>>> +                }
>>> +            }
>>> +            if (!included) {
>>> +                goto error;
>>> +            }
>>> +        }
>>> +        /* all new reserved ranges are included in existing ones */
>>> +        ret = 0;
>>> +        goto out;
>>> +    }
>>> +
>>> +    if (sdev->probe_done) {
>>> +        warn_report("%s: Notified about new host reserved regions after probe",
>>> +                    __func__);
>>> +    }
>>> +
>>> +    range_inverse_array(iova_ranges,
>>> +                        &sdev->host_resv_ranges,
>>> +                        0, UINT64_MAX);
>>> +    rebuild_resv_regions(sdev);
>>> +
>>> +    return 0;
>>> +error:
>>> +    error_setg(errp, "%s Conflicting host reserved ranges set!",
>>> +               __func__);
>>> +out:
>>> +    g_list_free_full(new_ranges, g_free);
>>> +    return ret;
>>> +}
>>> +
>>> static const PCIIOMMUOps virtio_iommu_ops = {
>>>     .get_address_space = virtio_iommu_find_add_as,
>>> +    .set_host_iova_ranges = virtio_iommu_set_host_iova_ranges,
>>> };
>>>
>>> static int virtio_iommu_attach(VirtIOIOMMU *s,
>>> @@ -1158,39 +1259,6 @@ static int
>>> virtio_iommu_set_page_size_mask(IOMMUMemoryRegion *mr,
>>>     return 0;
>>> }
>>>
>>> -/**
>>> - * rebuild_resv_regions: rebuild resv regions with both the
>>> - * info of host resv ranges and property set resv ranges
>>> - */
>>> -static int rebuild_resv_regions(IOMMUDevice *sdev)
>>> -{
>>> -    GList *l;
>>> -    int i = 0;
>>> -
>>> -    /* free the existing list and rebuild it from scratch */
>>> -    g_list_free_full(sdev->resv_regions, g_free);
>>> -    sdev->resv_regions = NULL;
>>> -
>>> -    /* First add host reserved regions if any, all tagged as RESERVED */
>>> -    for (l = sdev->host_resv_ranges; l; l = l->next) {
>>> -        ReservedRegion *reg = g_new0(ReservedRegion, 1);
>>> -        Range *r = (Range *)l->data;
>>> -
>>> -        reg->type = VIRTIO_IOMMU_RESV_MEM_T_RESERVED;
>>> -        range_set_bounds(&reg->range, range_lob(r), range_upb(r));
>>> -        sdev->resv_regions = resv_region_list_insert(sdev->resv_regions, reg);
>>> -        trace_virtio_iommu_host_resv_regions(sdev->iommu_mr.parent_obj.name, i,
>>> -                                             range_lob(&reg->range),
>>> -                                             range_upb(&reg->range));
>>> -        i++;
>>> -    }
>>> -    /*
>>> -     * then add higher priority reserved regions set by the machine
>>> -     * through properties
>>> -     */
>>> -    add_prop_resv_regions(sdev);
>>> -    return 0;
>>> -}
>>>
>>> /**
>>>  * virtio_iommu_set_iova_ranges: Conveys the usable IOVA ranges
>>> --
>>> 2.41.0


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions
  2024-01-19  7:00       ` Duan, Zhenzhong
@ 2024-01-22  7:17         ` Duan, Zhenzhong
  0 siblings, 0 replies; 20+ messages in thread
From: Duan, Zhenzhong @ 2024-01-22  7:17 UTC (permalink / raw)
  To: eric.auger, eric.auger.pro, qemu-devel, qemu-arm, jean-philippe,
	alex.williamson, peter.maydell, peterx, yanghliu, pbonzini
  Cc: mst, clg

Hi Eric,

>-----Original Message-----
>From: Duan, Zhenzhong
>Subject: RE: [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps
>set_host_resv_regions
>
>
>
>>-----Original Message-----
>>From: Eric Auger <eric.auger@redhat.com>
>>Subject: Re: [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps
>>set_host_resv_regions
>>
>>Hi Zhenzhong,
>>On 1/18/24 08:43, Duan, Zhenzhong wrote:
>>> Hi Eric,
>>>
>>>> -----Original Message-----
>>>> From: Eric Auger <eric.auger@redhat.com>
>>>> Subject: [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps
>>>> set_host_resv_regions
>>>>
>>>> Reuse the implementation of virtio_iommu_set_iova_ranges() which
>>>> will be removed in subsequent patches.
>>>>
>>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>>> ---
>>>> hw/virtio/virtio-iommu.c | 134 +++++++++++++++++++++++++++++----------
>>>> 1 file changed, 101 insertions(+), 33 deletions(-)
>>>>
>>>> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
>>>> index 8a4bd933c6..716a3fcfbf 100644
>>>> --- a/hw/virtio/virtio-iommu.c
>>>> +++ b/hw/virtio/virtio-iommu.c
>>>> @@ -461,8 +461,109 @@ static AddressSpace
>>>> *virtio_iommu_find_add_as(PCIBus *bus, void *opaque,
>>>>     return &sdev->as;
>>>> }
>>>>
>>>> +/**
>>>> + * rebuild_resv_regions: rebuild resv regions with both the
>>>> + * info of host resv ranges and property set resv ranges
>>>> + */
>>>> +static int rebuild_resv_regions(IOMMUDevice *sdev)
>>>> +{
>>>> +    GList *l;
>>>> +    int i = 0;
>>>> +
>>>> +    /* free the existing list and rebuild it from scratch */
>>>> +    g_list_free_full(sdev->resv_regions, g_free);
>>>> +    sdev->resv_regions = NULL;
>>>> +
>>>> +    /* First add host reserved regions if any, all tagged as RESERVED */
>>>> +    for (l = sdev->host_resv_ranges; l; l = l->next) {
>>>> +        ReservedRegion *reg = g_new0(ReservedRegion, 1);
>>>> +        Range *r = (Range *)l->data;
>>>> +
>>>> +        reg->type = VIRTIO_IOMMU_RESV_MEM_T_RESERVED;
>>>> +        range_set_bounds(&reg->range, range_lob(r), range_upb(r));
>>>> +        sdev->resv_regions = resv_region_list_insert(sdev->resv_regions, reg);
>>>> +        trace_virtio_iommu_host_resv_regions(sdev->iommu_mr.parent_obj.name, i,
>>>> +                                             range_lob(&reg->range),
>>>> +                                             range_upb(&reg->range));
>>>> +        i++;
>>>> +    }
>>>> +    /*
>>>> +     * then add higher priority reserved regions set by the machine
>>>> +     * through properties
>>>> +     */
>>>> +    add_prop_resv_regions(sdev);
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static int virtio_iommu_set_host_iova_ranges(PCIBus *bus, void *opaque,
>>>> +                                             int devfn, GList *iova_ranges,
>>>> +                                             Error **errp)
>>>> +{
>>>> +    VirtIOIOMMU *s = opaque;
>>>> +    IOMMUPciBus *sbus = g_hash_table_lookup(s->as_by_busptr, bus);
>>>> +    IOMMUDevice *sdev;
>>>> +    GList *current_ranges;
>>>> +    GList *l, *tmp, *new_ranges = NULL;
>>>> +    int ret = -EINVAL;
>>>> +
>>>> +    if (!sbus) {
>>>> +        error_report("%s no sbus", __func__);
>>>> +    }
>>> Do we plan to support multiple devices in same iommu group?
>>> as_by_busptr is hashed by bus which is an aliased bus of the device.
>>> So sbus may be NULL if device's bus is different from aliased bus.
>>If I understand you remark properly this means that
>>
>>virtio_iommu_set_host_iova_ranges should take as arg the aliased bus and
>>not the bus, right?
>>I think we shall support non singleton groups too.
>
>Not exactly. I think we should pass device's real BDF, not aliased BDF. Or else
>we setup reserved ranges of all devices in same group to aliased BDF.
>
>I'm just suspecting that the hash lookup with real BDF index will return NULL
>if
>multiple devices are in same group. If that’s true, then iova_ranges is never
>passed to guest.

I guess https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg04153.html
may help the discussion here; it is a complement to this series that makes
multiple devices in the same IOMMU group work. Comments welcome.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices
  2024-01-18  9:43   ` Eric Auger
  2024-01-19  6:46     ` Duan, Zhenzhong
@ 2024-01-25 18:48     ` Jean-Philippe Brucker
  2024-01-29 16:38       ` Eric Auger
  1 sibling, 1 reply; 20+ messages in thread
From: Jean-Philippe Brucker @ 2024-01-25 18:48 UTC (permalink / raw)
  To: Eric Auger
  Cc: Duan, Zhenzhong, eric.auger.pro, qemu-devel, qemu-arm,
	alex.williamson, peter.maydell, peterx, yanghliu, pbonzini, mst,
	clg

Hi,

On Thu, Jan 18, 2024 at 10:43:55AM +0100, Eric Auger wrote:
> Hi Zhenzhong,
> On 1/18/24 08:10, Duan, Zhenzhong wrote:
> > Hi Eric,
> >
> >> -----Original Message-----
> >> From: Eric Auger <eric.auger@redhat.com>
> >> Cc: mst@redhat.com; clg@redhat.com
> >> Subject: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling
> >> for hotplugged devices
> >>
> >> In [1] we attempted to fix a case where a VFIO-PCI device protected
> >> with a virtio-iommu was assigned to an x86 guest. On x86 the physical
> >> IOMMU may have an address width (gaw) of 39 or 48 bits whereas the
> >> virtio-iommu used to expose a 64b address space by default.
> >> Hence the guest was trying to use the full 64b space and we hit
> >> DMA MAP failures. To work around this issue we managed to pass
> >> usable IOVA regions (excluding the out of range space) from VFIO
> >> to the virtio-iommu device. This was made feasible by introducing
> >> a new IOMMU Memory Region callback dubbed iommu_set_iova_regions().
> >> This latter gets called when the IOMMU MR is enabled which
> >> causes the vfio_listener_region_add() to be called.
> >>
> >> However with VFIO-PCI hotplug, this technique fails due to the
> >> race between the call to the callback in the add memory listener
> >> and the virtio-iommu probe request. Indeed the probe request gets
> >> called before the attach to the domain. So in that case the usable
> >> regions are communicated after the probe request and fail to be
> >> conveyed to the guest. To be honest the problem was hinted by
> >> Jean-Philippe in [1] and I should have been more careful at
> >> listening to him and testing with hotplug :-(
> > It looks the global virtio_iommu_config.bypass is never cleared in guest.
> > When guest virtio_iommu driver enable IOMMU, should it clear this
> > bypass attribute?
> > If it could be cleared in viommu_probe(), then qemu will call
> > virtio_iommu_set_config() then virtio_iommu_switch_address_space_all()
> > to enable IOMMU MR. Then both coldplugged and hotplugged devices will work.
> 
> this field is iommu wide while the probe applies on a one device.In
> general I would prefer not to be dependent on the MR enablement. We know
> that the device is likely to be protected and we can collect its
> requirements beforehand.
> 
> >
> > Intel iommu has a similar bit in register GCMD_REG.TE, when guest
> > intel_iommu driver probe set it, on qemu side, vtd_address_space_refresh_all()
> > is called to enable IOMMU MRs.
> interesting.
> 
> Would be curious to get Jean Philippe's pov.

I'd rather not rely on this, it's hard to justify a driver change based
only on QEMU internals. And QEMU can't count on the driver always clearing
bypass. There could be situations where the guest can't afford to do it,
like if an endpoint is owned by the firmware and has to keep running.

There may be a separate argument for clearing bypass. With a coldplugged
VFIO device the flow is:

1. Map the whole guest address space in VFIO to implement boot-bypass.
   This allocates all guest pages, which takes a while and is wasteful.
   I've actually crashed a host that way, when spawning a guest with too
   much RAM.
2. Start the VM
3. When the virtio-iommu driver attaches a (non-identity) domain to the
   assigned endpoint, then unmap the whole address space in VFIO, and most
   pages are given back to the host.

We can't disable boot-bypass because the BIOS needs it. But instead the
flow could be:

1. Start the VM, with only the virtual endpoints. Nothing to pin.
2. The virtio-iommu driver disables bypass during boot
3. Hotplug the VFIO device. With bypass disabled there is no need to pin
   the whole guest address space, unless the guest explicitly asks for an
   identity domain.

However, I don't know if this is a realistic scenario that will actually
be used.

By the way, do you have an easy way to reproduce the issue described here?
I've had to enable iommu.forcedac=1 on the command-line, otherwise Linux
just allocates 32-bit IOVAs.
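
(For anyone trying to reproduce this: something along the following lines
should exercise the 64-bit path; the kernel image and root= argument are
placeholders:

  qemu-system-x86_64 ... -kernel vmlinuz \
      -append "root=/dev/vda1 iommu.forcedac=1"

With iommu.forcedac=1 the guest dma-iommu layer skips the 32-bit allocation
attempt and hands out IOVAs up to the full DMA mask, so out-of-geometry
mappings show up even with light DMA traffic.)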

> >
> >> For coldplugged device the technique works because we make sure all
> >> the IOMMU MR are enabled once on the machine init done: 94df5b2180
> >> ("virtio-iommu: Fix 64kB host page size VFIO device assignment")
> >> for granule freeze. But I would be keen to get rid of this trick.
> >>
> >> Using an IOMMU MR Ops is unpractical because this relies on the IOMMU
> >> MR to have been enabled and the corresponding vfio_listener_region_add()
> >> to be executed. Instead this series proposes to replace the usage of this
> >> API by the recently introduced PCIIOMMUOps: ba7d12eb8c  ("hw/pci:
> >> modify
> >> pci_setup_iommu() to set PCIIOMMUOps"). That way, the callback can be
> >> called earlier, once the usable IOVA regions have been collected by
> >> VFIO, without the need for the IOMMU MR to be enabled.
> >>
> >> This looks cleaner. In the short term this may also be used for
> >> passing the page size mask, which would allow to get rid of the
> >> hacky transient IOMMU MR enablement mentionned above.
> >>
> >> [1] [PATCH v4 00/12] VIRTIO-IOMMU/VFIO: Don't assume 64b IOVA space
> >>    https://lore.kernel.org/all/20231019134651.842175-1-
> >> eric.auger@redhat.com/
> >>
> >> [2] https://lore.kernel.org/all/20230929161547.GB2957297@myrica/
> >>
> >> Extra Notes:
> >> With that series, the reserved memory regions are communicated on time
> >> so that the virtio-iommu probe request grabs them. However this is not
> >> sufficient. In some cases (my case), I still see some DMA MAP failures
> >> and the guest keeps on using IOVA ranges outside the geometry of the
> >> physical IOMMU. This is due to the fact the VFIO-PCI device is in the
> >> same iommu group as the pcie root port. Normally the kernel
> >> iova_reserve_iommu_regions (dma-iommu.c) is supposed to call
> >> reserve_iova()
> >> for each reserved IOVA, which carves them out of the allocator. When
> >> iommu_dma_init_domain() gets called for the hotplugged vfio-pci device
> >> the iova domain is already allocated and set and we don't call
> >> iova_reserve_iommu_regions() again for the vfio-pci device. So its
> >> corresponding reserved regions are not properly taken into account.
> > I suspect there is same issue with coldplugged devices. If those devices
> > are in same group, get iova_reserve_iommu_regions() is only called
> > for first device. But other devices's reserved regions are missed.
> 
> Correct
> >
> > Curious how you make passthrough device and pcie root port under same
> > group.
> > When I start a x86 guest with passthrough device, I see passthrough
> > device and pcie root port are in different group.
> >
> > -[0000:00]-+-00.0
> >            +-01.0
> >            +-02.0
> >            +-03.0-[01]----00.0
> >
> > /sys/kernel/iommu_groups/3/devices:
> > 0000:00:03.0
> > /sys/kernel/iommu_groups/7/devices:
> > 0000:01:00.0
> >
> > My qemu cmdline:
> > -device pcie-root-port,id=root0,slot=0
> > -device vfio-pci,host=6f:01.0,id=vfio0,bus=root0
> 
> I just replayed the scenario:
> - if you have a coldplugged vfio-pci device, the pci root port and the
> passthroughed device end up in different iommu groups. On my end I use
> ioh3420 but you confirmed that's the same for the generic pcie-root-port
> - however if you hotplug the vfio-pci device that's a different story:
> they end up in the same group. Don't ask me why. I tried with
> both virtio-iommu and intel iommu and I end up with the same topology.
> That looks really weird to me.

It also took me a while to get hotplug to work on x86:
pcie_cap_slot_plug_cb() didn't get called, instead it would call
ich9_pm_device_plug_cb(). Not sure what I'm doing wrong.
To work around that I instantiated a second pxb-pcie root bus and then a
pcie root port on there. So my command-line looks like:

 -device virtio-iommu
 -device pxb-pcie,id=pcie.1,bus_nr=1
 -device pcie-root-port,chassis=2,id=pcie.2,bus=pcie.1

 device_add vfio-pci,host=00:04.0,bus=pcie.2

And somehow pcieport and the assigned device do end up in separate IOMMU
groups.
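
(A quick way to dump the resulting grouping from inside the guest, matching
the /sys listings shown earlier in the thread:

  find /sys/kernel/iommu_groups/ -type l

Each symlink is one device; the parent directory is its group.)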

Thanks,
Jean



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices
  2024-01-25 18:48     ` Jean-Philippe Brucker
@ 2024-01-29 16:38       ` Eric Auger
  2024-01-30 18:22         ` Jean-Philippe Brucker
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Auger @ 2024-01-29 16:38 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Duan, Zhenzhong, eric.auger.pro, qemu-devel, qemu-arm,
	alex.williamson, peter.maydell, peterx, yanghliu, pbonzini, mst,
	clg

Hi Jean,

On 1/25/24 19:48, Jean-Philippe Brucker wrote:
> Hi,
>
> On Thu, Jan 18, 2024 at 10:43:55AM +0100, Eric Auger wrote:
>> Hi Zhenzhong,
>> On 1/18/24 08:10, Duan, Zhenzhong wrote:
>>> Hi Eric,
>>>
>>>> -----Original Message-----
>>>> From: Eric Auger <eric.auger@redhat.com>
>>>> Cc: mst@redhat.com; clg@redhat.com
>>>> Subject: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling
>>>> for hotplugged devices
>>>>
>>>> In [1] we attempted to fix a case where a VFIO-PCI device protected
>>>> with a virtio-iommu was assigned to an x86 guest. On x86 the physical
>>>> IOMMU may have an address width (gaw) of 39 or 48 bits whereas the
>>>> virtio-iommu used to expose a 64b address space by default.
>>>> Hence the guest was trying to use the full 64b space and we hit
>>>> DMA MAP failures. To work around this issue we managed to pass
>>>> usable IOVA regions (excluding the out of range space) from VFIO
>>>> to the virtio-iommu device. This was made feasible by introducing
>>>> a new IOMMU Memory Region callback dubbed iommu_set_iova_regions().
>>>> This latter gets called when the IOMMU MR is enabled which
>>>> causes the vfio_listener_region_add() to be called.
>>>>
>>>> However with VFIO-PCI hotplug, this technique fails due to the
>>>> race between the call to the callback in the add memory listener
>>>> and the virtio-iommu probe request. Indeed the probe request gets
>>>> called before the attach to the domain. So in that case the usable
>>>> regions are communicated after the probe request and fail to be
>>>> conveyed to the guest. To be honest the problem was hinted by
>>>> Jean-Philippe in [1] and I should have been more careful at
>>>> listening to him and testing with hotplug :-(
>>> It looks the global virtio_iommu_config.bypass is never cleared in guest.
>>> When guest virtio_iommu driver enable IOMMU, should it clear this
>>> bypass attribute?
>>> If it could be cleared in viommu_probe(), then qemu will call
>>> virtio_iommu_set_config() then virtio_iommu_switch_address_space_all()
>>> to enable IOMMU MR. Then both coldplugged and hotplugged devices will work.
>> this field is iommu wide while the probe applies on a one device.In
>> general I would prefer not to be dependent on the MR enablement. We know
>> that the device is likely to be protected and we can collect its
>> requirements beforehand.
>>
>>> Intel iommu has a similar bit in register GCMD_REG.TE, when guest
>>> intel_iommu driver probe set it, on qemu side, vtd_address_space_refresh_all()
>>> is called to enable IOMMU MRs.
>> interesting.
>>
>> Would be curious to get Jean Philippe's pov.
> I'd rather not rely on this, it's hard to justify a driver change based
> only on QEMU internals. And QEMU can't count on the driver always clearing
> bypass. There could be situations where the guest can't afford to do it,
> like if an endpoint is owned by the firmware and has to keep running.
>
> There may be a separate argument for clearing bypass. With a coldplugged
> VFIO device the flow is:
>
> 1. Map the whole guest address space in VFIO to implement boot-bypass.
>    This allocates all guest pages, which takes a while and is wasteful.
>    I've actually crashed a host that way, when spawning a guest with too
>    much RAM.
interesting
> 2. Start the VM
> 3. When the virtio-iommu driver attaches a (non-identity) domain to the
>    assigned endpoint, then unmap the whole address space in VFIO, and most
>    pages are given back to the host.
>
> We can't disable boot-bypass because the BIOS needs it. But instead the
> flow could be:
>
> 1. Start the VM, with only the virtual endpoints. Nothing to pin.
> 2. The virtio-iommu driver disables bypass during boot
We needed this boot-bypass mode for booting with virtio-blk-scsi
protected with virtio-iommu for instance.
That was needed because we don't have any virtio-iommu driver in edk2 as
opposed to intel iommu driver, right?
> 3. Hotplug the VFIO device. With bypass disabled there is no need to pin
>    the whole guest address space, unless the guest explicitly asks for an
>    identity domain.
>
> However, I don't know if this is a realistic scenario that will actually
> be used.
>
> By the way, do you have an easy way to reproduce the issue described here?
> I've had to enable iommu.forcedac=1 on the command-line, otherwise Linux
> just allocates 32-bit IOVAs.
I don't have a simple generic reproducer. It happens when assigning this
device:
Ethernet Controller E810-C for QSFP (Ethernet Network Adapter E810-C-Q2)

I have not encountered that issue with another device yet.
I see on guest side in dmesg:
[    6.849292] ice 0000:00:05.0: Using 64-bit DMA addresses

That's emitted in dma-iommu.c iommu_dma_alloc_iova().
Looks like the guest first tries to allocate an IOVA in the 32-bit AS
and, if this fails, uses the whole dma_limit.
Seems the 32b IOVA alloc failed here ;-)
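
A minimal, self-contained toy model of that fallback (not the kernel code;
the bump allocator and the numbers are made up purely to illustrate the
behaviour described above):

#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MASK32 0xffffffffULL

/* toy bump allocator: returns 0 on failure (0 is never handed out here) */
static uint64_t toy_alloc(uint64_t *next, uint64_t size, uint64_t limit)
{
    uint64_t iova;

    if (*next + size > limit) {
        return 0;
    }
    iova = *next;
    *next += size;
    return iova;
}

/* the policy: try below 4GiB first, then fall back to the full dma_limit */
static uint64_t alloc_iova(uint64_t *next, uint64_t size,
                           uint64_t dma_limit, bool forcedac)
{
    uint64_t iova = 0;

    if (!forcedac && dma_limit > MASK32) {
        iova = toy_alloc(next, size, MASK32);
    }
    if (!iova) {
        /* roughly where the real code logs "Using 64-bit DMA addresses" */
        printf("falling back to 64-bit DMA addresses\n");
        iova = toy_alloc(next, size, dma_limit);
    }
    return iova;
}

int main(void)
{
    uint64_t next = MASK32 - 0x1000;   /* pretend the 32-bit space is nearly full */
    uint64_t limit = (1ULL << 39) - 1; /* e.g. a 39-bit gaw */

    printf("iova1=0x%" PRIx64 "\n", alloc_iova(&next, 0x1000, limit, false));
    printf("iova2=0x%" PRIx64 "\n", alloc_iova(&next, 0x1000, limit, false));
    return 0;
}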

Thanks

Eric





>
>>>> For coldplugged device the technique works because we make sure all
>>>> the IOMMU MR are enabled once on the machine init done: 94df5b2180
>>>> ("virtio-iommu: Fix 64kB host page size VFIO device assignment")
>>>> for granule freeze. But I would be keen to get rid of this trick.
>>>>
>>>> Using an IOMMU MR Ops is unpractical because this relies on the IOMMU
>>>> MR to have been enabled and the corresponding vfio_listener_region_add()
>>>> to be executed. Instead this series proposes to replace the usage of this
>>>> API by the recently introduced PCIIOMMUOps: ba7d12eb8c  ("hw/pci:
>>>> modify
>>>> pci_setup_iommu() to set PCIIOMMUOps"). That way, the callback can be
>>>> called earlier, once the usable IOVA regions have been collected by
>>>> VFIO, without the need for the IOMMU MR to be enabled.
>>>>
>>>> This looks cleaner. In the short term this may also be used for
>>>> passing the page size mask, which would allow to get rid of the
>>>> hacky transient IOMMU MR enablement mentionned above.
>>>>
>>>> [1] [PATCH v4 00/12] VIRTIO-IOMMU/VFIO: Don't assume 64b IOVA space
>>>>    https://lore.kernel.org/all/20231019134651.842175-1-
>>>> eric.auger@redhat.com/
>>>>
>>>> [2] https://lore.kernel.org/all/20230929161547.GB2957297@myrica/
>>>>
>>>> Extra Notes:
>>>> With that series, the reserved memory regions are communicated on time
>>>> so that the virtio-iommu probe request grabs them. However this is not
>>>> sufficient. In some cases (my case), I still see some DMA MAP failures
>>>> and the guest keeps on using IOVA ranges outside the geometry of the
>>>> physical IOMMU. This is due to the fact the VFIO-PCI device is in the
>>>> same iommu group as the pcie root port. Normally the kernel
>>>> iova_reserve_iommu_regions (dma-iommu.c) is supposed to call
>>>> reserve_iova()
>>>> for each reserved IOVA, which carves them out of the allocator. When
>>>> iommu_dma_init_domain() gets called for the hotplugged vfio-pci device
>>>> the iova domain is already allocated and set and we don't call
>>>> iova_reserve_iommu_regions() again for the vfio-pci device. So its
>>>> corresponding reserved regions are not properly taken into account.
>>> I suspect there is same issue with coldplugged devices. If those devices
>>> are in same group, get iova_reserve_iommu_regions() is only called
>>> for first device. But other devices's reserved regions are missed.
>> Correct
>>> Curious how you make passthrough device and pcie root port under same
>>> group.
>>> When I start a x86 guest with passthrough device, I see passthrough
>>> device and pcie root port are in different group.
>>>
>>> -[0000:00]-+-00.0
>>>            +-01.0
>>>            +-02.0
>>>            +-03.0-[01]----00.0
>>>
>>> /sys/kernel/iommu_groups/3/devices:
>>> 0000:00:03.0
>>> /sys/kernel/iommu_groups/7/devices:
>>> 0000:01:00.0
>>>
>>> My qemu cmdline:
>>> -device pcie-root-port,id=root0,slot=0
>>> -device vfio-pci,host=6f:01.0,id=vfio0,bus=root0
>> I just replayed the scenario:
>> - if you have a coldplugged vfio-pci device, the pci root port and the
>> passthroughed device end up in different iommu groups. On my end I use
>> ioh3420 but you confirmed that's the same for the generic pcie-root-port
>> - however if you hotplug the vfio-pci device that's a different story:
>> they end up in the same group. Don't ask me why. I tried with
>> both virtio-iommu and intel iommu and I end up with the same topology.
>> That looks really weird to me.
> It also took me a while to get hotplug to work on x86:
> pcie_cap_slot_plug_cb() didn't get called, instead it would call
> ich9_pm_device_plug_cb(). Not sure what I'm doing wrong.
> To work around that I instantiated a second pxb-pcie root bus and then a
> pcie root port on there. So my command-line looks like:
>
>  -device virtio-iommu
>  -device pxb-pcie,id=pcie.1,bus_nr=1
>  -device pcie-root-port,chassis=2,id=pcie.2,bus=pcie.1
>
>  device_add vfio-pci,host=00:04.0,bus=pcie.2
>
> And somehow pcieport and the assigned device do end up in separate IOMMU
> groups.
>
> Thanks,
> Jean
>



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices
  2024-01-29 16:38       ` Eric Auger
@ 2024-01-30 18:22         ` Jean-Philippe Brucker
  2024-01-31 11:22           ` Eric Auger
  0 siblings, 1 reply; 20+ messages in thread
From: Jean-Philippe Brucker @ 2024-01-30 18:22 UTC (permalink / raw)
  To: Eric Auger
  Cc: Duan, Zhenzhong, eric.auger.pro, qemu-devel, qemu-arm,
	alex.williamson, peter.maydell, peterx, yanghliu, pbonzini, mst,
	clg

On Mon, Jan 29, 2024 at 05:38:55PM +0100, Eric Auger wrote:
> > There may be a separate argument for clearing bypass. With a coldplugged
> > VFIO device the flow is:
> >
> > 1. Map the whole guest address space in VFIO to implement boot-bypass.
> >    This allocates all guest pages, which takes a while and is wasteful.
> >    I've actually crashed a host that way, when spawning a guest with too
> >    much RAM.
> interesting
> > 2. Start the VM
> > 3. When the virtio-iommu driver attaches a (non-identity) domain to the
> >    assigned endpoint, then unmap the whole address space in VFIO, and most
> >    pages are given back to the host.
> >
> > We can't disable boot-bypass because the BIOS needs it. But instead the
> > flow could be:
> >
> > 1. Start the VM, with only the virtual endpoints. Nothing to pin.
> > 2. The virtio-iommu driver disables bypass during boot
> We needed this boot-bypass mode for booting with virtio-blk-scsi
> protected with virtio-iommu for instance.
> That was needed because we don't have any virtio-iommu driver in edk2 as
> opposed to intel iommu driver, right?

Yes. What I had in mind is the x86 SeaBIOS which doesn't have any IOMMU
driver and accesses the default SATA device:

 $ qemu-system-x86_64 -M q35 -device virtio-iommu,boot-bypass=off
 qemu: virtio_iommu_translate sid=250 is not known!!
 qemu: no buffer available in event queue to report event
 qemu: AHCI: Failed to start FIS receive engine: bad FIS receive buffer address

But it's the same problem with edk2. Also a guest OS without a
virtio-iommu driver needs boot-bypass. Once firmware boot is complete, the
OS with a virtio-iommu driver normally can turn bypass off in the config
space, it's not useful anymore. If it needs to put some endpoints in
bypass, then it can attach them to a bypass domain.

> > 3. Hotplug the VFIO device. With bypass disabled there is no need to pin
> >    the whole guest address space, unless the guest explicitly asks for an
> >    identity domain.
> >
> > However, I don't know if this is a realistic scenario that will actually
> > be used.
> >
> > By the way, do you have an easy way to reproduce the issue described here?
> > I've had to enable iommu.forcedac=1 on the command-line, otherwise Linux
> > just allocates 32-bit IOVAs.
> I don't have a simple generic reproducer. It happens when assigning this
> device:
> Ethernet Controller E810-C for QSFP (Ethernet Network Adapter E810-C-Q2)
> 
> I have not encountered that issue with another device yet.
> I see on guest side in dmesg:
> [    6.849292] ice 0000:00:05.0: Using 64-bit DMA addresses
> 
> That's emitted in dma-iommu.c iommu_dma_alloc_iova().
> Looks like the guest first tries to allocate an iova in the 32-bit AS
> and if this fails use the whole dma_limit.
> Seems the 32b IOVA alloc failed here ;-)

Interesting, are you running some demanding workload and a lot of CPUs?
That's a lot of IOVAs used up, I'm curious about what kind of DMA pattern
does that.

Thanks,
Jean


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices
  2024-01-30 18:22         ` Jean-Philippe Brucker
@ 2024-01-31 11:22           ` Eric Auger
  0 siblings, 0 replies; 20+ messages in thread
From: Eric Auger @ 2024-01-31 11:22 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Duan, Zhenzhong, eric.auger.pro, qemu-devel, qemu-arm,
	alex.williamson, peter.maydell, peterx, yanghliu, pbonzini, mst,
	clg

Hi Jean,

On 1/30/24 19:22, Jean-Philippe Brucker wrote:
> On Mon, Jan 29, 2024 at 05:38:55PM +0100, Eric Auger wrote:
>>> There may be a separate argument for clearing bypass. With a coldplugged
>>> VFIO device the flow is:
>>>
>>> 1. Map the whole guest address space in VFIO to implement boot-bypass.
>>>    This allocates all guest pages, which takes a while and is wasteful.
>>>    I've actually crashed a host that way, when spawning a guest with too
>>>    much RAM.
>> interesting
>>> 2. Start the VM
>>> 3. When the virtio-iommu driver attaches a (non-identity) domain to the
>>>    assigned endpoint, then unmap the whole address space in VFIO, and most
>>>    pages are given back to the host.
>>>
>>> We can't disable boot-bypass because the BIOS needs it. But instead the
>>> flow could be:
>>>
>>> 1. Start the VM, with only the virtual endpoints. Nothing to pin.
>>> 2. The virtio-iommu driver disables bypass during boot
>> We needed this boot-bypass mode for booting with virtio-blk-scsi
>> protected with virtio-iommu for instance.
>> That was needed because we don't have any virtio-iommu driver in edk2 as
>> opposed to intel iommu driver, right?
> Yes. What I had in mind is the x86 SeaBIOS which doesn't have any IOMMU
> driver and accesses the default SATA device:
>
>  $ qemu-system-x86_64 -M q35 -device virtio-iommu,boot-bypass=off
>  qemu: virtio_iommu_translate sid=250 is not known!!
>  qemu: no buffer available in event queue to report event
>  qemu: AHCI: Failed to start FIS receive engine: bad FIS receive buffer address
>
> But it's the same problem with edk2. Also a guest OS without a
> virtio-iommu driver needs boot-bypass. Once firmware boot is complete, the
> OS with a virtio-iommu driver normally can turn bypass off in the config
> space, it's not useful anymore. If it needs to put some endpoints in
> bypass, then it can attach them to a bypass domain.

yup
>
>>> 3. Hotplug the VFIO device. With bypass disabled there is no need to pin
>>>    the whole guest address space, unless the guest explicitly asks for an
>>>    identity domain.
>>>
>>> However, I don't know if this is a realistic scenario that will actually
>>> be used.
>>>
>>> By the way, do you have an easy way to reproduce the issue described here?
>>> I've had to enable iommu.forcedac=1 on the command-line, otherwise Linux
>>> just allocates 32-bit IOVAs.
>> I don't have a simple generic reproducer. It happens when assigning this
>> device:
>> Ethernet Controller E810-C for QSFP (Ethernet Network Adapter E810-C-Q2)
>>
>> I have not encountered that issue with another device yet.
>> I see on guest side in dmesg:
>> [    6.849292] ice 0000:00:05.0: Using 64-bit DMA addresses
>>
>> That's emitted in dma-iommu.c iommu_dma_alloc_iova().
>> Looks like the guest first tries to allocate an iova in the 32-bit AS
>> and if this fails use the whole dma_limit.
>> Seems the 32b IOVA alloc failed here ;-)
> Interesting, are you running some demanding workload and a lot of CPUs?
> That's a lot of IOVAs used up, I'm curious about what kind of DMA pattern
> does that.
Well nothing smart, just booting the guest with the assigned NIC. 8 vcpus

Thanks

Eric
>
> Thanks,
> Jean
>



^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2024-01-31 11:23 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-17  8:02 [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices Eric Auger
2024-01-17  8:02 ` [RFC 1/7] hw/pci: Introduce PCIIOMMUOps::set_host_iova_regions Eric Auger
2024-01-17  8:02 ` [RFC 2/7] hw/pci: Introduce pci_device_iommu_bus Eric Auger
2024-01-18  7:32   ` Duan, Zhenzhong
2024-01-17  8:02 ` [RFC 3/7] vfio/pci: Pass the usable IOVA ranges through PCIIOMMUOps Eric Auger
2024-01-17  8:02 ` [RFC 4/7] virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions Eric Auger
2024-01-18  7:43   ` Duan, Zhenzhong
2024-01-18 12:25     ` Eric Auger
2024-01-19  7:00       ` Duan, Zhenzhong
2024-01-22  7:17         ` Duan, Zhenzhong
2024-01-17  8:02 ` [RFC 5/7] virtio-iommu: Remove the implementation of iommu_set_iova_ranges Eric Auger
2024-01-17  8:02 ` [RFC 6/7] hw/vfio: Remove memory_region_iommu_set_iova_ranges() call Eric Auger
2024-01-17  8:02 ` [RFC 7/7] memory: Remove IOMMU MR iommu_set_iova_range API Eric Auger
2024-01-18  7:10 ` [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices Duan, Zhenzhong
2024-01-18  9:43   ` Eric Auger
2024-01-19  6:46     ` Duan, Zhenzhong
2024-01-25 18:48     ` Jean-Philippe Brucker
2024-01-29 16:38       ` Eric Auger
2024-01-30 18:22         ` Jean-Philippe Brucker
2024-01-31 11:22           ` Eric Auger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).