All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 00/15] vfio: VFIO migration support with vIOMMU
@ 2023-05-30 17:59 Joao Martins
  2023-05-30 17:59 ` [PATCH v3 01/15] hw/pci: Refactor pci_device_iommu_address_space() Joao Martins
                   ` (14 more replies)
  0 siblings, 15 replies; 30+ messages in thread
From: Joao Martins @ 2023-05-30 17:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

Hey folks,

This series introduces support for vIOMMU with VFIO device migration,
particurlarly related to how we do the dirty page tracking. This is a v3
follow-up containing the vIOMMU subset of v2[0], but with some new changes,
some simplifications and finer grained scope.

Today vIOMMUs serve two purposes: 1) enable interrupt remaping 2)
provide dma translation services for guests to provide some form of
guest kernel managed DMA e.g. for nested virt based usage; (1) is specially
required for big VMs with VFs with more than 255 vcpus. We tackle both
and remove the migration blocker when vIOMMU is present provided the
conditions are met. I have both use-cases here in one series, but I am happy
to tackle them in separate series.

As I found out we don't necessarily need to expose the whole vIOMMU
functionality in order to just support interrupt remapping. x86 IOMMUs
on Windows Server 2018[2] and Linux >=5.10, with qemu 7.1+ (or really
Linux guests with commit c40aaaac10 and since qemu commit 8646d9c773d8)
can instantiate a IOMMU just for interrupt remapping without needing to
be advertised/support DMA translation. AMD IOMMU in theory can provide
the same, but Linux doesn't quite support the IR-only part there yet,
only intel-iommu.

The series is organized as following:

Patches 1-5: Today we can't gather vIOMMU details before the guest
establishes their first DMA mapping via the vIOMMU. So these first four
patches add a way for vIOMMUs to be asked of their properties.
Essentially by being able to return IOMMU MR to upper layers. The last
patch of this uses the new pci_setup_iommu_info() in intel-iommu, which
then lets VFIO fetch the IOMMU MR and use it to gather the necessary
properties that are added in follow-up patches. I choose the least churn
possible way for now (as opposed to a treewide conversion) and allow easy
conversion a posteriori.

Patches 6-8: Handle configs with vIOMMU interrupt remapping but without
DMA translation allowed. Today the 'dma-translation' attribute is
x86-iommu only, but the way this series is structured nothing stops from
other vIOMMUs supporting it too as long as they use
pci_setup_iommu_info() and the necessary IOMMU MR get_attr attributes
are handled. The blocker is thus relaxed when vIOMMUs are able to toggle
the toggle/report DMA_TRANSLATION attribute. With the patches up to this set,
we've then tackled item (1) of the second paragraph.

Patches 9-15: Simplified a lot from v2 (patch 9) to only track the complete
IOVA address space, leveraging the logic we use to compose the dirty ranges.
The blocker is once again relaxed for vIOMMUs that advertise their IOVA
addressing limits. This tackles item (2). So far I mainly use it with
intel-iommu, although I have a small set of patches for virtio-iommu per
Alex's suggestion in v2.

Comments, suggestions welcome. Thanks for the review!

Regards,
	Joao

Changes since v2[3]:
* New patches 1-9 to be able to handle vIOMMUs without DMA translation, and
introduce ways to know various IOMMU model attributes via the IOMMU MR. This
is partly meant to address a comment in previous versions where we can't
access the IOMMU MR prior to the DMA mapping happening. Before this series
vfio giommu_list is only tracking 'mapped GIOVA' and that controlled by the
guest. As well as better tackling of the IOMMU usage for interrupt-remapping
only purposes. 
* Dropped Peter Xu ack on patch 9 given that the code changed a bit.
* Adjust patch 14 to adjust for the VFIO bitmaps no longer being pointers.
* The patches that existed in v2 of vIOMMU dirty tracking, are mostly
* untouched, except patch 12 which was greatly simplified.

Changes since v1[4]:
- Rebased on latest master branch. As part of it, made some changes in
  pre-copy to adjust it to Juan's new patches:
  1. Added a new patch that passes threshold_size parameter to
     .state_pending_{estimate,exact}() handlers.
  2. Added a new patch that refactors vfio_save_block().
  3. Changed the pre-copy patch to cache and report pending pre-copy
     size in the .state_pending_estimate() handler.
- Removed unnecessary P2P code. This should be added later on when P2P
  support is added. (Alex)
- Moved the dirty sync to be after the DMA unmap in vfio_dma_unmap()
  (patch #11). (Alex)
- Stored vfio_devices_all_device_dirty_tracking()'s value in a local
  variable in vfio_get_dirty_bitmap() so it can be re-used (patch #11).
- Refactored the viommu device dirty tracking ranges creation code to
  make it clearer (patch #15).
- Changed overflow check in vfio_iommu_range_is_device_tracked() to
  emphasize that we specifically check for 2^64 wrap around (patch #15).
- Added R-bs / Acks.

[0] https://lore.kernel.org/qemu-devel/20230222174915.5647-1-avihaih@nvidia.com/
[1] https://lore.kernel.org/qemu-devel/c66d2d8e-f042-964a-a797-a3d07c260a3b@oracle.com/
[2] https://learn.microsoft.com/en-us/windows-hardware/design/device-experiences/oem-kernel-dma-protection
[3] https://lore.kernel.org/qemu-devel/20230222174915.5647-1-avihaih@nvidia.com/
[4] https://lore.kernel.org/qemu-devel/20230126184948.10478-1-avihaih@nvidia.com/

Avihai Horon (4):
  memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute
  intel-iommu: Implement IOMMU_ATTR_MAX_IOVA get_attr() attribute
  vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap()
  vfio/common: Optimize device dirty page tracking with vIOMMU

Joao Martins (11):
  hw/pci: Refactor pci_device_iommu_address_space()
  hw/pci: Add a pci_setup_iommu_info() helper
  hw/pci: Add a pci_device_iommu_memory_region() helper
  intel-iommu: Switch to pci_setup_iommu_info()
  vfio/common: Track the IOMMU MR behind the device in addition to the AS
  memory/iommu: Add IOMMU_ATTR_DMA_TRANSLATION attribute
  intel-iommu: Implement get_attr() method
  vfio/common: Relax vIOMMU detection when DMA translation is off
  vfio/common: Move dirty tracking ranges update to helper
  vfio/common: Support device dirty page tracking with vIOMMU
  vfio/common: Block migration with vIOMMUs without address width limits

 hw/i386/intel_iommu.c         |  40 ++++-
 hw/pci/pci.c                  |  21 ++-
 hw/vfio/common.c              | 265 +++++++++++++++++++++++++++-------
 hw/vfio/pci.c                 |   4 +
 include/exec/memory.h         |   4 +-
 include/hw/pci/pci.h          |  38 ++++-
 include/hw/pci/pci_bus.h      |   1 +
 include/hw/vfio/vfio-common.h |   1 +
 8 files changed, 309 insertions(+), 65 deletions(-)

-- 
2.39.3



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v3 01/15] hw/pci: Refactor pci_device_iommu_address_space()
  2023-05-30 17:59 [PATCH v3 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
@ 2023-05-30 17:59 ` Joao Martins
  2023-05-30 22:04   ` Philippe Mathieu-Daudé
  2023-05-30 17:59 ` [PATCH v3 02/15] hw/pci: Add a pci_setup_iommu_info() helper Joao Martins
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 30+ messages in thread
From: Joao Martins @ 2023-05-30 17:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

Rename pci_device_iommu_address_space() into pci_device_iommu_info().
In the new function return a new type PCIAddressSpace that encapsulates
the AddressSpace pointer that originally was returned.

The new type is added in preparation to expanding it to include the IOMMU
memory region as a new field, such that we are able to fetch attributes of
the vIOMMU e.g. at vfio migration setup.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/pci/pci.c         |  9 ++++++---
 include/hw/pci/pci.h | 21 ++++++++++++++++++++-
 2 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 1cc7c89036b5..ecf8a543aa77 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2633,11 +2633,12 @@ static void pci_device_class_base_init(ObjectClass *klass, void *data)
     }
 }
 
-AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
+PCIAddressSpace pci_device_iommu_info(PCIDevice *dev)
 {
     PCIBus *bus = pci_get_bus(dev);
     PCIBus *iommu_bus = bus;
     uint8_t devfn = dev->devfn;
+    AddressSpace *as = NULL;
 
     while (iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
         PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
@@ -2678,10 +2679,12 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
 
         iommu_bus = parent_bus;
     }
+
+    as = &address_space_memory;
     if (!pci_bus_bypass_iommu(bus) && iommu_bus && iommu_bus->iommu_fn) {
-        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
+        as = iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
     }
-    return &address_space_memory;
+    return as_to_pci_as(as);
 }
 
 void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque)
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index e6d0574a2999..9ffaf47fe2ab 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -363,9 +363,28 @@ void pci_bus_get_w64_range(PCIBus *bus, Range *range);
 
 void pci_device_deassert_intx(PCIDevice *dev);
 
+typedef struct PCIAddressSpace {
+    AddressSpace *as;
+} PCIAddressSpace;
+
 typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
+static inline PCIAddressSpace as_to_pci_as(AddressSpace *as)
+{
+    PCIAddressSpace ret = { .as = as };
+
+    return ret;
+}
+static inline AddressSpace *pci_as_to_as(PCIAddressSpace pci_as)
+{
+    return pci_as.as;
+}
+
+PCIAddressSpace pci_device_iommu_info(PCIDevice *dev);
+static inline AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
+{
+    return pci_as_to_as(pci_device_iommu_info(dev));
+}
 
-AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
 void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
 
 pcibus_t pci_bar_address(PCIDevice *d,
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v3 02/15] hw/pci: Add a pci_setup_iommu_info() helper
  2023-05-30 17:59 [PATCH v3 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
  2023-05-30 17:59 ` [PATCH v3 01/15] hw/pci: Refactor pci_device_iommu_address_space() Joao Martins
@ 2023-05-30 17:59 ` Joao Martins
  2023-05-30 17:59 ` [PATCH v3 03/15] hw/pci: Add a pci_device_iommu_memory_region() helper Joao Martins
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Joao Martins @ 2023-05-30 17:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

Provide a second PCI iommu bus initialization function which returns
PCIAddressSpace rather than an AddressSpace. The function is meant to
superseed pci_setup_iommu().

Under the hood in pci_device_iommu_info() if the new function pointer is
set in the device bus, use that instead and return the new object.

This is preparation for vIOMMU MR to be stored in PCIAddressSpace, thus
made available regardless of guest behaviour.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/pci/pci.c             | 14 ++++++++++++--
 include/hw/pci/pci.h     |  2 ++
 include/hw/pci/pci_bus.h |  1 +
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index ecf8a543aa77..ac10333b5097 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2681,8 +2681,12 @@ PCIAddressSpace pci_device_iommu_info(PCIDevice *dev)
     }
 
     as = &address_space_memory;
-    if (!pci_bus_bypass_iommu(bus) && iommu_bus && iommu_bus->iommu_fn) {
-        as = iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
+    if (!pci_bus_bypass_iommu(bus) && iommu_bus) {
+        if (iommu_bus->iommu_fn) {
+            as = iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
+        } else if (iommu_bus->iommu_as_fn) {
+            return iommu_bus->iommu_as_fn(bus, iommu_bus->iommu_opaque, devfn);
+        }
     }
     return as_to_pci_as(as);
 }
@@ -2693,6 +2697,12 @@ void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque)
     bus->iommu_opaque = opaque;
 }
 
+void pci_setup_iommu_info(PCIBus *bus, PCIIOMMUASFunc fn, void *opaque)
+{
+    bus->iommu_as_fn = fn;
+    bus->iommu_opaque = opaque;
+}
+
 static void pci_dev_get_w64(PCIBus *b, PCIDevice *dev, void *opaque)
 {
     Range *range = opaque;
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 9ffaf47fe2ab..d2c87d87a24e 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -368,6 +368,7 @@ typedef struct PCIAddressSpace {
 } PCIAddressSpace;
 
 typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
+typedef PCIAddressSpace (*PCIIOMMUASFunc)(PCIBus *, void *, int);
 static inline PCIAddressSpace as_to_pci_as(AddressSpace *as)
 {
     PCIAddressSpace ret = { .as = as };
@@ -386,6 +387,7 @@ static inline AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
 }
 
 void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
+void pci_setup_iommu_info(PCIBus *bus, PCIIOMMUASFunc fn, void *opaque);
 
 pcibus_t pci_bar_address(PCIDevice *d,
                          int reg, uint8_t type, pcibus_t size);
diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
index 56531759578f..a2795b23a3b0 100644
--- a/include/hw/pci/pci_bus.h
+++ b/include/hw/pci/pci_bus.h
@@ -34,6 +34,7 @@ struct PCIBus {
     BusState qbus;
     enum PCIBusFlags flags;
     PCIIOMMUFunc iommu_fn;
+    PCIIOMMUASFunc iommu_as_fn;
     void *iommu_opaque;
     uint8_t devfn_min;
     uint32_t slot_reserved_mask;
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v3 03/15] hw/pci: Add a pci_device_iommu_memory_region() helper
  2023-05-30 17:59 [PATCH v3 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
  2023-05-30 17:59 ` [PATCH v3 01/15] hw/pci: Refactor pci_device_iommu_address_space() Joao Martins
  2023-05-30 17:59 ` [PATCH v3 02/15] hw/pci: Add a pci_setup_iommu_info() helper Joao Martins
@ 2023-05-30 17:59 ` Joao Martins
  2023-06-05 16:57   ` Peter Xu
  2023-05-30 17:59 ` [PATCH v3 04/15] intel-iommu: Switch to pci_setup_iommu_info() Joao Martins
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 30+ messages in thread
From: Joao Martins @ 2023-05-30 17:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

Today's VFIO model of understanding if an IOMMU is behind the PCIDevice
is to check whether the address space backing the device is memory or
not, which is done via pci_device_iommu_address_space(pdev).

On the other hand, the IOMMU MR is used for example to fetch attributes,
or when doing an vIOMMU map and figuring out if it's under the vIOMMU.
However, such object is only available today by the time the IOMMU map
notifier is called which depends on the guest doing a DMA map or not.
Thus there's no way to get access to the IOMMU memory region early i.e.
at vfio device initialization where we attest migration support and
impose LM blockers or not.

Much like pci_device_iommu_address_space() fetches the IOMMU AS, add a
pci_device_iommu_memory_region() which lets it return an the IOMMU MR
associated with it. The IOMMU MR is returned correctly for vIOMMUs using
pci_setup_iommu_info(). Note that today most vIOMMUs create the address
space and IOMMU MR at the same time, it's just mainly that there's API
to make the latter available.

This is in preparation for VFIO to track both the AS and IOMMU MR into
VFIOSpace without being tied to a map taking place by the guest. The
IOMMU MR will then provide access to upper layers about various IOMMU
attributes.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/hw/pci/pci.h | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index d2c87d87a24e..0177f50e96a3 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -365,13 +365,14 @@ void pci_device_deassert_intx(PCIDevice *dev);
 
 typedef struct PCIAddressSpace {
     AddressSpace *as;
+    IOMMUMemoryRegion *iommu_mr;
 } PCIAddressSpace;
 
 typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
 typedef PCIAddressSpace (*PCIIOMMUASFunc)(PCIBus *, void *, int);
 static inline PCIAddressSpace as_to_pci_as(AddressSpace *as)
 {
-    PCIAddressSpace ret = { .as = as };
+    PCIAddressSpace ret = { .as = as, .iommu_mr = NULL };
 
     return ret;
 }
@@ -379,6 +380,13 @@ static inline AddressSpace *pci_as_to_as(PCIAddressSpace pci_as)
 {
     return pci_as.as;
 }
+static inline PCIAddressSpace to_pci_as(AddressSpace *as,
+                                        IOMMUMemoryRegion *iommu_mr)
+{
+    PCIAddressSpace ret = { .as = as, .iommu_mr = iommu_mr };
+
+    return ret;
+}
 
 PCIAddressSpace pci_device_iommu_info(PCIDevice *dev);
 static inline AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
@@ -386,6 +394,13 @@ static inline AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
     return pci_as_to_as(pci_device_iommu_info(dev));
 }
 
+static inline IOMMUMemoryRegion *pci_device_iommu_memory_region(PCIDevice *dev)
+{
+    PCIAddressSpace ret = pci_device_iommu_info(dev);
+
+    return ret.iommu_mr;
+}
+
 void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
 void pci_setup_iommu_info(PCIBus *bus, PCIIOMMUASFunc fn, void *opaque);
 
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v3 04/15] intel-iommu: Switch to pci_setup_iommu_info()
  2023-05-30 17:59 [PATCH v3 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
                   ` (2 preceding siblings ...)
  2023-05-30 17:59 ` [PATCH v3 03/15] hw/pci: Add a pci_device_iommu_memory_region() helper Joao Martins
@ 2023-05-30 17:59 ` Joao Martins
  2023-05-30 17:59 ` [PATCH v3 05/15] vfio/common: Track the IOMMU MR behind the device in addition to the AS Joao Martins
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Joao Martins @ 2023-05-30 17:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

Use the PCI IOMMU setup function that allows returning the
IOMMUMemoryRegion in addition to the AddressSpace. To faciliate
conversion use the new to_pci_as()/pci_as_to_as() helpers.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/intel_iommu.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 94d52f4205d2..c344d49fe290 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -4021,7 +4021,7 @@ static void vtd_reset(DeviceState *dev)
     vtd_address_space_refresh_all(s);
 }
 
-static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
+static PCIAddressSpace vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
 {
     IntelIOMMUState *s = opaque;
     VTDAddressSpace *vtd_as;
@@ -4029,7 +4029,7 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
 
     vtd_as = vtd_find_add_as(s, bus, devfn, PCI_NO_PASID);
-    return &vtd_as->as;
+    return to_pci_as(&vtd_as->as, &vtd_as->iommu);
 }
 
 static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
@@ -4155,9 +4155,10 @@ static void vtd_realize(DeviceState *dev, Error **errp)
                                       g_free, g_free);
     vtd_init(s);
     sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, Q35_HOST_BRIDGE_IOMMU_ADDR);
-    pci_setup_iommu(bus, vtd_host_dma_iommu, dev);
+    pci_setup_iommu_info(bus, vtd_host_dma_iommu, dev);
     /* Pseudo address space under root PCI bus. */
-    x86ms->ioapic_as = vtd_host_dma_iommu(bus, s, Q35_PSEUDO_DEVFN_IOAPIC);
+    x86ms->ioapic_as = pci_as_to_as(vtd_host_dma_iommu(bus,
+                                                s, Q35_PSEUDO_DEVFN_IOAPIC));
     qemu_add_machine_init_done_notifier(&vtd_machine_done_notify);
 }
 
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v3 05/15] vfio/common: Track the IOMMU MR behind the device in addition to the AS
  2023-05-30 17:59 [PATCH v3 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
                   ` (3 preceding siblings ...)
  2023-05-30 17:59 ` [PATCH v3 04/15] intel-iommu: Switch to pci_setup_iommu_info() Joao Martins
@ 2023-05-30 17:59 ` Joao Martins
  2023-05-30 17:59 ` [PATCH v3 06/15] memory/iommu: Add IOMMU_ATTR_DMA_TRANSLATION attribute Joao Martins
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Joao Martins @ 2023-05-30 17:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

vfio_get_group() allocates and fills the group/container/space on
success which will store the AddressSpace inside the VFIOSpace struct.
Use the newly added pci_device_iommu_memory_region() and store the IOMMU
MR too.

Today, this means only intel-iommu sets it.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/vfio/pci.c                 | 4 ++++
 include/hw/vfio/vfio-common.h | 1 +
 2 files changed, 5 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 73874a94de12..9dc9a7d834ec 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2900,6 +2900,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     VFIOPCIDevice *vdev = VFIO_PCI(pdev);
     VFIODevice *vbasedev = &vdev->vbasedev;
     VFIODevice *vbasedev_iter;
+    VFIOAddressSpace *space;
     VFIOGroup *group;
     char *tmp, *subsys, group_path[PATH_MAX], *group_name;
     Error *err = NULL;
@@ -2961,6 +2962,9 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         goto error;
     }
 
+    space = group->container->space;
+    space->iommu_mr = pci_device_iommu_memory_region(pdev);
+
     QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
         if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
             error_setg(errp, "device is already attached");
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index eed244f25f34..92e6514fd757 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -70,6 +70,7 @@ typedef struct VFIOMigration {
 
 typedef struct VFIOAddressSpace {
     AddressSpace *as;
+    IOMMUMemoryRegion *iommu_mr;
     QLIST_HEAD(, VFIOContainer) containers;
     QLIST_ENTRY(VFIOAddressSpace) list;
 } VFIOAddressSpace;
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v3 06/15] memory/iommu: Add IOMMU_ATTR_DMA_TRANSLATION attribute
  2023-05-30 17:59 [PATCH v3 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
                   ` (4 preceding siblings ...)
  2023-05-30 17:59 ` [PATCH v3 05/15] vfio/common: Track the IOMMU MR behind the device in addition to the AS Joao Martins
@ 2023-05-30 17:59 ` Joao Martins
  2023-05-30 17:59 ` [PATCH v3 07/15] intel-iommu: Implement get_attr() method Joao Martins
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Joao Martins @ 2023-05-30 17:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

Add a new IOMMU attribute IOMMU_ATTR_DMA_TRANSLATION which indicates
whether the IOMMU supports DMA Translation.

This attribute will be used by VFIO device dirty page tracking such that
it can restrict the IOVA under tracking to the memory map when vIOMMU is
configured in IR-only mode without dma translation supported/advertised.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/exec/memory.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index c3661b2276c7..a02496f34180 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -316,7 +316,8 @@ typedef struct MemoryRegionClass {
 
 
 enum IOMMUMemoryRegionAttr {
-    IOMMU_ATTR_SPAPR_TCE_FD
+    IOMMU_ATTR_SPAPR_TCE_FD,
+    IOMMU_ATTR_DMA_TRANSLATION,
 };
 
 /*
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v3 07/15] intel-iommu: Implement get_attr() method
  2023-05-30 17:59 [PATCH v3 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
                   ` (5 preceding siblings ...)
  2023-05-30 17:59 ` [PATCH v3 06/15] memory/iommu: Add IOMMU_ATTR_DMA_TRANSLATION attribute Joao Martins
@ 2023-05-30 17:59 ` Joao Martins
  2023-05-30 17:59 ` [PATCH v3 08/15] vfio/common: Relax vIOMMU detection when DMA translation is off Joao Martins
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Joao Martins @ 2023-05-30 17:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

Implement get_attr() method and use the IntelIOMMUState::dma_translation
field to report the IOMMU_ATTR_DMA_TRANSLATION attribute.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/intel_iommu.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index c344d49fe290..1906f3a67960 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3861,6 +3861,29 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
     return;
 }
 
+static int vtd_iommu_get_attr(IOMMUMemoryRegion *iommu_mr,
+                              enum IOMMUMemoryRegionAttr attr, void *data)
+{
+    VTDAddressSpace *vtd_as = container_of(iommu_mr, VTDAddressSpace, iommu);
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    int ret = 0;
+
+    switch (attr) {
+    case IOMMU_ATTR_DMA_TRANSLATION:
+    {
+        bool *enabled = data;
+
+        *enabled = s->dma_translation;
+        break;
+    }
+    default:
+        ret = -EINVAL;
+        break;
+    }
+
+    return ret;
+}
+
 /* Do the initialization. It will also be called when reset, so pay
  * attention when adding new initialization stuff.
  */
@@ -4194,6 +4217,7 @@ static void vtd_iommu_memory_region_class_init(ObjectClass *klass,
     imrc->translate = vtd_iommu_translate;
     imrc->notify_flag_changed = vtd_iommu_notify_flag_changed;
     imrc->replay = vtd_iommu_replay;
+    imrc->get_attr = vtd_iommu_get_attr;
 }
 
 static const TypeInfo vtd_iommu_memory_region_info = {
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v3 08/15] vfio/common: Relax vIOMMU detection when DMA translation is off
  2023-05-30 17:59 [PATCH v3 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
                   ` (6 preceding siblings ...)
  2023-05-30 17:59 ` [PATCH v3 07/15] intel-iommu: Implement get_attr() method Joao Martins
@ 2023-05-30 17:59 ` Joao Martins
  2023-05-30 21:39   ` Philippe Mathieu-Daudé
  2023-05-30 17:59 ` [PATCH v3 09/15] memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute Joao Martins
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 30+ messages in thread
From: Joao Martins @ 2023-05-30 17:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

Relax the vIOMMU migration blocker when the underlying IOMMU reports that
DMA translation disabled. When it is disabled there will be no DMA mappings
via the vIOMMU and the guest only uses it for Interrupt Remapping.

The latter is done via the vfio_viommu_preset() return value whereby in
addition to validating that the address space is memory, we also check
whether the IOMMU MR has DMA translation on. Assume it is enabled if
there's no IOMMU MR or if no dma-translation property is supported.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/vfio/common.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index fa8fd949b1cf..060acccb3443 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -414,12 +414,32 @@ void vfio_unblock_multiple_devices_migration(void)
     multiple_devices_migration_blocker = NULL;
 }
 
+static bool vfio_viommu_dma_translation_enabled(VFIOAddressSpace *space)
+{
+    bool enabled = false;
+    int ret;
+
+    if (!space->iommu_mr) {
+        return true;
+    }
+
+    ret = memory_region_iommu_get_attr(space->iommu_mr,
+                                       IOMMU_ATTR_DMA_TRANSLATION,
+                                       &enabled);
+    if (ret || enabled) {
+        return true;
+    }
+
+    return false;
+}
+
 static bool vfio_viommu_preset(void)
 {
     VFIOAddressSpace *space;
 
     QLIST_FOREACH(space, &vfio_address_spaces, list) {
-        if (space->as != &address_space_memory) {
+        if ((space->as != &address_space_memory) &&
+            vfio_viommu_dma_translation_enabled(space)) {
             return true;
         }
     }
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v3 09/15] memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute
  2023-05-30 17:59 [PATCH v3 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
                   ` (7 preceding siblings ...)
  2023-05-30 17:59 ` [PATCH v3 08/15] vfio/common: Relax vIOMMU detection when DMA translation is off Joao Martins
@ 2023-05-30 17:59 ` Joao Martins
  2023-05-30 17:59 ` [PATCH v3 10/15] intel-iommu: Implement IOMMU_ATTR_MAX_IOVA get_attr() attribute Joao Martins
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Joao Martins @ 2023-05-30 17:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

From: Avihai Horon <avihaih@nvidia.com>

Add a new IOMMU attribute IOMMU_ATTR_MAX_IOVA which indicates the
maximal IOVA that an IOMMU can use.

This attribute will be used by VFIO device dirty page tracking so it can
track the entire IOVA space when needed (i.e. when vIOMMU is enabled).

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Acked-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index a02496f34180..b09b1c9846fe 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -318,6 +318,7 @@ typedef struct MemoryRegionClass {
 enum IOMMUMemoryRegionAttr {
     IOMMU_ATTR_SPAPR_TCE_FD,
     IOMMU_ATTR_DMA_TRANSLATION,
+    IOMMU_ATTR_MAX_IOVA,
 };
 
 /*
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v3 10/15] intel-iommu: Implement IOMMU_ATTR_MAX_IOVA get_attr() attribute
  2023-05-30 17:59 [PATCH v3 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
                   ` (8 preceding siblings ...)
  2023-05-30 17:59 ` [PATCH v3 09/15] memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute Joao Martins
@ 2023-05-30 17:59 ` Joao Martins
  2023-05-30 21:45   ` Philippe Mathieu-Daudé
  2023-05-30 17:59 ` [PATCH v3 11/15] vfio/common: Move dirty tracking ranges update to helper Joao Martins
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 30+ messages in thread
From: Joao Martins @ 2023-05-30 17:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

From: Avihai Horon <avihaih@nvidia.com>

Implement get_attr() method and use the address width property to report
the IOMMU_ATTR_MAX_IOVA attribute.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/intel_iommu.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 1906f3a67960..829dd6eadc6c 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3876,6 +3876,13 @@ static int vtd_iommu_get_attr(IOMMUMemoryRegion *iommu_mr,
         *enabled = s->dma_translation;
         break;
     }
+    case IOMMU_ATTR_MAX_IOVA:
+    {
+        hwaddr *max_iova = data;
+
+        *max_iova = (1ULL << s->aw_bits) - 1;
+        break;
+    }
     default:
         ret = -EINVAL;
         break;
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v3 11/15] vfio/common: Move dirty tracking ranges update to helper
  2023-05-30 17:59 [PATCH v3 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
                   ` (9 preceding siblings ...)
  2023-05-30 17:59 ` [PATCH v3 10/15] intel-iommu: Implement IOMMU_ATTR_MAX_IOVA get_attr() attribute Joao Martins
@ 2023-05-30 17:59 ` Joao Martins
  2023-05-30 17:59 ` [PATCH v3 12/15] vfio/common: Support device dirty page tracking with vIOMMU Joao Martins
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Joao Martins @ 2023-05-30 17:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

Separate the code that update the ranges from the listener, in
preparation to expand its use to vIOMMU support.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/vfio/common.c | 36 +++++++++++++++++++++---------------
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 060acccb3443..b8d97577f856 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1432,20 +1432,10 @@ typedef struct VFIODirtyRangesListener {
     MemoryListener listener;
 } VFIODirtyRangesListener;
 
-static void vfio_dirty_tracking_update(MemoryListener *listener,
-                                       MemoryRegionSection *section)
+static void vfio_dirty_tracking_update(hwaddr iova, hwaddr end,
+                                       VFIODirtyRanges *range)
 {
-    VFIODirtyRangesListener *dirty = container_of(listener,
-                                                  VFIODirtyRangesListener,
-                                                  listener);
-    VFIODirtyRanges *range = &dirty->ranges;
-    hwaddr iova, end, *min, *max;
-
-    if (!vfio_listener_valid_section(section, "tracking_update") ||
-        !vfio_get_section_iova_range(dirty->container, section,
-                                     &iova, &end, NULL)) {
-        return;
-    }
+    hwaddr *min, *max;
 
     /*
      * The address space passed to the dirty tracker is reduced to two ranges:
@@ -1469,12 +1459,28 @@ static void vfio_dirty_tracking_update(MemoryListener *listener,
     }
 
     trace_vfio_device_dirty_tracking_update(iova, end, *min, *max);
-    return;
+}
+
+static void vfio_listener_dirty_tracking_update(MemoryListener *listener,
+                                                MemoryRegionSection *section)
+{
+    VFIODirtyRangesListener *dirty = container_of(listener,
+                                                  VFIODirtyRangesListener,
+                                                  listener);
+    hwaddr iova, end;
+
+    if (!vfio_listener_valid_section(section, "tracking_update") ||
+        !vfio_get_section_iova_range(dirty->container, section,
+                                     &iova, &end, NULL)) {
+        return;
+    }
+
+    vfio_dirty_tracking_update(iova, end, &dirty->ranges);
 }
 
 static const MemoryListener vfio_dirty_tracking_listener = {
     .name = "vfio-tracking",
-    .region_add = vfio_dirty_tracking_update,
+    .region_add = vfio_listener_dirty_tracking_update,
 };
 
 static void vfio_dirty_tracking_init(VFIOContainer *container,
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v3 12/15] vfio/common: Support device dirty page tracking with vIOMMU
  2023-05-30 17:59 [PATCH v3 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
                   ` (10 preceding siblings ...)
  2023-05-30 17:59 ` [PATCH v3 11/15] vfio/common: Move dirty tracking ranges update to helper Joao Martins
@ 2023-05-30 17:59 ` Joao Martins
  2023-05-30 17:59 ` [PATCH v3 13/15] vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap() Joao Martins
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Joao Martins @ 2023-05-30 17:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

Currently, device dirty page tracking with vIOMMU is not supported,
and a blocker is added and the migration is prevented.

When vIOMMU is used, IOVA ranges are DMA mapped/unmapped on the fly as
requesting by the vIOMMU. These IOVA ranges can potentially be mapped
anywhere in the vIOMMU IOVA space as advertised by the VMM.

To support device dirty tracking when vIOMMU enabled instead create the
dirty ranges based on the vIOMMU provided limits, which leads to the
tracking of the whole IOVA space regardless of what devices use.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/vfio/common.c | 64 ++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 54 insertions(+), 10 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index b8d97577f856..d2897aceedae 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -447,6 +447,31 @@ static bool vfio_viommu_preset(void)
     return false;
 }
 
+static int vfio_viommu_get_max_iova(hwaddr *max_iova)
+{
+    VFIOAddressSpace *space;
+    int ret = -EINVAL;
+
+    QLIST_FOREACH(space, &vfio_address_spaces, list) {
+        if (space->as == &address_space_memory) {
+            continue;
+        }
+
+        if (!space->iommu_mr) {
+            break;
+        }
+
+        ret = memory_region_iommu_get_attr(space->iommu_mr,
+                                           IOMMU_ATTR_MAX_IOVA,
+                                           max_iova);
+        if (ret) {
+            break;
+        }
+    }
+
+    return ret;
+}
+
 int vfio_block_giommu_migration(Error **errp)
 {
     int ret;
@@ -1483,10 +1508,11 @@ static const MemoryListener vfio_dirty_tracking_listener = {
     .region_add = vfio_listener_dirty_tracking_update,
 };
 
-static void vfio_dirty_tracking_init(VFIOContainer *container,
+static int vfio_dirty_tracking_init(VFIOContainer *container,
                                      VFIODirtyRanges *ranges)
 {
     VFIODirtyRangesListener dirty;
+    int ret;
 
     memset(&dirty, 0, sizeof(dirty));
     dirty.ranges.min32 = UINT32_MAX;
@@ -1494,17 +1520,29 @@ static void vfio_dirty_tracking_init(VFIOContainer *container,
     dirty.listener = vfio_dirty_tracking_listener;
     dirty.container = container;
 
-    memory_listener_register(&dirty.listener,
-                             container->space->as);
+    if (vfio_viommu_preset()) {
+        hwaddr iommu_max_iova;
+
+        ret = vfio_viommu_get_max_iova(&iommu_max_iova);
+        if (ret) {
+            return ret;
+        }
+
+        vfio_dirty_tracking_update(0, iommu_max_iova, &dirty.ranges);
+    } else {
+        memory_listener_register(&dirty.listener,
+                                 container->space->as);
+        /*
+         * The memory listener is synchronous, and used to calculate the range
+         * to dirty tracking. Unregister it after we are done as we are not
+         * interested in any follow-up updates.
+         */
+        memory_listener_unregister(&dirty.listener);
+    }
 
     *ranges = dirty.ranges;
 
-    /*
-     * The memory listener is synchronous, and used to calculate the range
-     * to dirty tracking. Unregister it after we are done as we are not
-     * interested in any follow-up updates.
-     */
-    memory_listener_unregister(&dirty.listener);
+    return 0;
 }
 
 static void vfio_devices_dma_logging_stop(VFIOContainer *container)
@@ -1609,7 +1647,13 @@ static int vfio_devices_dma_logging_start(VFIOContainer *container)
     VFIOGroup *group;
     int ret = 0;
 
-    vfio_dirty_tracking_init(container, &ranges);
+    ret = vfio_dirty_tracking_init(container, &ranges);
+    if (ret) {
+        error_report("Failed to init DMA logging ranges, err %d",
+                      ret);
+        return -EOPNOTSUPP;
+    }
+
     feature = vfio_device_feature_dma_logging_start_create(container,
                                                            &ranges);
     if (!feature) {
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v3 13/15] vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap()
  2023-05-30 17:59 [PATCH v3 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
                   ` (11 preceding siblings ...)
  2023-05-30 17:59 ` [PATCH v3 12/15] vfio/common: Support device dirty page tracking with vIOMMU Joao Martins
@ 2023-05-30 17:59 ` Joao Martins
  2023-05-30 17:59 ` [PATCH v3 14/15] vfio/common: Optimize device dirty page tracking with vIOMMU Joao Martins
  2023-05-30 17:59 ` [PATCH v3 15/15] vfio/common: Block migration with vIOMMUs without address width limits Joao Martins
  14 siblings, 0 replies; 30+ messages in thread
From: Joao Martins @ 2023-05-30 17:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

From: Avihai Horon <avihaih@nvidia.com>

Extract vIOMMU code from vfio_sync_dirty_bitmap() to a new function and
restructure the code.

This is done in preparation for optimizing vIOMMU deviice dirty page
tracking. No functional changes intended.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/vfio/common.c | 63 +++++++++++++++++++++++++++++-------------------
 1 file changed, 38 insertions(+), 25 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index d2897aceedae..733f0bd7825f 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1939,37 +1939,50 @@ static int vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container,
                                                 &vrdl);
 }
 
+static int vfio_sync_iommu_dirty_bitmap(VFIOContainer *container,
+                                        MemoryRegionSection *section)
+{
+    VFIOGuestIOMMU *giommu;
+    bool found = false;
+    Int128 llend;
+    vfio_giommu_dirty_notifier gdn;
+    int idx;
+
+    QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
+        if (MEMORY_REGION(giommu->iommu_mr) == section->mr &&
+            giommu->n.start == section->offset_within_region) {
+            found = true;
+            break;
+        }
+    }
+
+    if (!found) {
+        return 0;
+    }
+
+    gdn.giommu = giommu;
+    idx = memory_region_iommu_attrs_to_index(giommu->iommu_mr,
+                                             MEMTXATTRS_UNSPECIFIED);
+
+    llend = int128_add(int128_make64(section->offset_within_region),
+                       section->size);
+    llend = int128_sub(llend, int128_one());
+
+    iommu_notifier_init(&gdn.n, vfio_iommu_map_dirty_notify, IOMMU_NOTIFIER_MAP,
+                        section->offset_within_region, int128_get64(llend),
+                        idx);
+    memory_region_iommu_replay(giommu->iommu_mr, &gdn.n);
+
+    return 0;
+}
+
 static int vfio_sync_dirty_bitmap(VFIOContainer *container,
                                   MemoryRegionSection *section)
 {
     ram_addr_t ram_addr;
 
     if (memory_region_is_iommu(section->mr)) {
-        VFIOGuestIOMMU *giommu;
-
-        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
-            if (MEMORY_REGION(giommu->iommu_mr) == section->mr &&
-                giommu->n.start == section->offset_within_region) {
-                Int128 llend;
-                vfio_giommu_dirty_notifier gdn = { .giommu = giommu };
-                int idx = memory_region_iommu_attrs_to_index(giommu->iommu_mr,
-                                                       MEMTXATTRS_UNSPECIFIED);
-
-                llend = int128_add(int128_make64(section->offset_within_region),
-                                   section->size);
-                llend = int128_sub(llend, int128_one());
-
-                iommu_notifier_init(&gdn.n,
-                                    vfio_iommu_map_dirty_notify,
-                                    IOMMU_NOTIFIER_MAP,
-                                    section->offset_within_region,
-                                    int128_get64(llend),
-                                    idx);
-                memory_region_iommu_replay(giommu->iommu_mr, &gdn.n);
-                break;
-            }
-        }
-        return 0;
+        return vfio_sync_iommu_dirty_bitmap(container, section);
     } else if (memory_region_has_ram_discard_manager(section->mr)) {
         return vfio_sync_ram_discard_listener_dirty_bitmap(container, section);
     }
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v3 14/15] vfio/common: Optimize device dirty page tracking with vIOMMU
  2023-05-30 17:59 [PATCH v3 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
                   ` (12 preceding siblings ...)
  2023-05-30 17:59 ` [PATCH v3 13/15] vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap() Joao Martins
@ 2023-05-30 17:59 ` Joao Martins
  2023-05-30 17:59 ` [PATCH v3 15/15] vfio/common: Block migration with vIOMMUs without address width limits Joao Martins
  14 siblings, 0 replies; 30+ messages in thread
From: Joao Martins @ 2023-05-30 17:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

From: Avihai Horon <avihaih@nvidia.com>

When vIOMMU is enabled, syncing dirty page bitmaps is done by replaying
the vIOMMU mappings and querying the dirty bitmap for each mapping.

With device dirty tracking this causes a lot of overhead, since the HW
is queried many times (even with small idle guest this can end up with
thousands of calls to HW).

Optimize this by de-coupling dirty bitmap query from vIOMMU replay.
Now a single dirty bitmap is queried per vIOMMU MR section, which is
then used for all corresponding vIOMMU mappings within that MR section.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/vfio/common.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 72 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 733f0bd7825f..5b211380306a 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1857,8 +1857,36 @@ out:
 typedef struct {
     IOMMUNotifier n;
     VFIOGuestIOMMU *giommu;
+    VFIOBitmap vbmap;
 } vfio_giommu_dirty_notifier;
 
+static int vfio_iommu_set_dirty_bitmap(VFIOContainer *container,
+                                       vfio_giommu_dirty_notifier *gdn,
+                                       hwaddr iova, hwaddr size,
+                                       ram_addr_t ram_addr)
+{
+    VFIOBitmap *vbmap = &gdn->vbmap;
+    VFIOBitmap dst_vbmap;
+    hwaddr start_iova = REAL_HOST_PAGE_ALIGN(gdn->n.start);
+    hwaddr copy_offset;
+    int ret;
+
+    ret = vfio_bitmap_alloc(&dst_vbmap, size);
+    if (ret) {
+        return -ENOMEM;
+    }
+
+    copy_offset = (iova - start_iova) / qemu_real_host_page_size();
+    bitmap_copy_with_src_offset(dst_vbmap.bitmap, vbmap->bitmap, copy_offset,
+                                dst_vbmap.pages);
+
+    cpu_physical_memory_set_dirty_lebitmap(dst_vbmap.bitmap, ram_addr,
+                                           dst_vbmap.pages);
+    g_free(dst_vbmap.bitmap);
+
+    return 0;
+}
+
 static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
 {
     vfio_giommu_dirty_notifier *gdn = container_of(n,
@@ -1879,8 +1907,15 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
 
     rcu_read_lock();
     if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) {
-        ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
-                                    translated_addr);
+        if (gdn->vbmap.bitmap) {
+            ret = vfio_iommu_set_dirty_bitmap(container, gdn, iova,
+                                              iotlb->addr_mask + 1,
+                                              translated_addr);
+        } else {
+            ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
+                                        translated_addr);
+        }
+
         if (ret) {
             error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx") = %d (%s)",
@@ -1961,6 +1996,7 @@ static int vfio_sync_iommu_dirty_bitmap(VFIOContainer *container,
     }
 
     gdn.giommu = giommu;
+    gdn.vbmap.bitmap = NULL;
     idx = memory_region_iommu_attrs_to_index(giommu->iommu_mr,
                                              MEMTXATTRS_UNSPECIFIED);
 
@@ -1968,10 +2004,44 @@ static int vfio_sync_iommu_dirty_bitmap(VFIOContainer *container,
                        section->size);
     llend = int128_sub(llend, int128_one());
 
+    /*
+     * Optimize device dirty tracking if the MR section is at least partially
+     * tracked. Optimization is done by querying a single dirty bitmap for the
+     * entire range instead of querying dirty bitmap for each vIOMMU mapping.
+     */
+    if (vfio_devices_all_device_dirty_tracking(container)) {
+        hwaddr start = REAL_HOST_PAGE_ALIGN(section->offset_within_region);
+        hwaddr end = int128_get64(llend);
+        hwaddr iommu_max_iova;
+        hwaddr size;
+        int ret;
+
+        ret = vfio_viommu_get_max_iova(&iommu_max_iova);
+        if (ret) {
+            return -EINVAL;
+        }
+
+        size = REAL_HOST_PAGE_ALIGN(MIN(iommu_max_iova, end) - start);
+
+        ret = vfio_bitmap_alloc(&gdn.vbmap, size);
+        if (ret) {
+            return -ENOMEM;
+        }
+
+        ret = vfio_devices_query_dirty_bitmap(container, &gdn.vbmap,
+                                              start, size);
+        if (ret) {
+            g_free(gdn.vbmap.bitmap);
+
+            return ret;
+        }
+    }
+
     iommu_notifier_init(&gdn.n, vfio_iommu_map_dirty_notify, IOMMU_NOTIFIER_MAP,
                         section->offset_within_region, int128_get64(llend),
                         idx);
     memory_region_iommu_replay(giommu->iommu_mr, &gdn.n);
+    g_free(gdn.vbmap.bitmap);
 
     return 0;
 }
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v3 15/15] vfio/common: Block migration with vIOMMUs without address width limits
  2023-05-30 17:59 [PATCH v3 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
                   ` (13 preceding siblings ...)
  2023-05-30 17:59 ` [PATCH v3 14/15] vfio/common: Optimize device dirty page tracking with vIOMMU Joao Martins
@ 2023-05-30 17:59 ` Joao Martins
  14 siblings, 0 replies; 30+ messages in thread
From: Joao Martins @ 2023-05-30 17:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

Only block the case when the underlying vIOMMU model does not report any
address space limits, in addition to DMA translation being off, or no
vIOMMU being present. The limits are needed such that can define the IOVA
limits to arm the device dirty tracker.

Additionally, reword the migration blocker error message to clarify that
we the configured vIOMMU does not support migration, as opposed to
implying that just being there blocks migration.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/vfio/common.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 5b211380306a..b0cf86559032 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -474,15 +474,17 @@ static int vfio_viommu_get_max_iova(hwaddr *max_iova)
 
 int vfio_block_giommu_migration(Error **errp)
 {
+    hwaddr max;
     int ret;
 
     if (giommu_migration_blocker ||
-        !vfio_viommu_preset()) {
+        !vfio_viommu_preset() ||
+        (vfio_viommu_preset() && !vfio_viommu_get_max_iova(&max))) {
         return 0;
     }
 
     error_setg(&giommu_migration_blocker,
-               "Migration is currently not supported with vIOMMU enabled");
+               "Migration is currently not supported with the configured vIOMMU");
     ret = migrate_add_blocker(giommu_migration_blocker, errp);
     if (ret < 0) {
         error_free(giommu_migration_blocker);
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 08/15] vfio/common: Relax vIOMMU detection when DMA translation is off
  2023-05-30 17:59 ` [PATCH v3 08/15] vfio/common: Relax vIOMMU detection when DMA translation is off Joao Martins
@ 2023-05-30 21:39   ` Philippe Mathieu-Daudé
  2023-05-31  9:39     ` Joao Martins
  0 siblings, 1 reply; 30+ messages in thread
From: Philippe Mathieu-Daudé @ 2023-05-30 21:39 UTC (permalink / raw)
  To: Joao Martins, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Michael S. Tsirkin, Marcel Apfelbaum,
	Jason Wang, Richard Henderson, Eduardo Habkost, Avihai Horon,
	Jason Gunthorpe

On 30/5/23 19:59, Joao Martins wrote:
> Relax the vIOMMU migration blocker when the underlying IOMMU reports that
> DMA translation disabled. When it is disabled there will be no DMA mappings
> via the vIOMMU and the guest only uses it for Interrupt Remapping.
> 
> The latter is done via the vfio_viommu_preset() return value whereby in
> addition to validating that the address space is memory, we also check
> whether the IOMMU MR has DMA translation on.

> Assume it is enabled if
> there's no IOMMU MR or if no dma-translation property is supported.

This comment ^ ...

> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   hw/vfio/common.c | 22 +++++++++++++++++++++-
>   1 file changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index fa8fd949b1cf..060acccb3443 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -414,12 +414,32 @@ void vfio_unblock_multiple_devices_migration(void)
>       multiple_devices_migration_blocker = NULL;
>   }
>   
> +static bool vfio_viommu_dma_translation_enabled(VFIOAddressSpace *space)
> +{
> +    bool enabled = false;
> +    int ret;
> +
> +    if (!space->iommu_mr) {
> +        return true;
> +    }
> +
> +    ret = memory_region_iommu_get_attr(space->iommu_mr,
> +                                       IOMMU_ATTR_DMA_TRANSLATION,
> +                                       &enabled);
> +    if (ret || enabled) {

... could be helpful if duplicated here.

> +        return true;
> +    }
> +
> +    return false;
> +}



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 10/15] intel-iommu: Implement IOMMU_ATTR_MAX_IOVA get_attr() attribute
  2023-05-30 17:59 ` [PATCH v3 10/15] intel-iommu: Implement IOMMU_ATTR_MAX_IOVA get_attr() attribute Joao Martins
@ 2023-05-30 21:45   ` Philippe Mathieu-Daudé
  2023-05-31  9:54     ` Joao Martins
  0 siblings, 1 reply; 30+ messages in thread
From: Philippe Mathieu-Daudé @ 2023-05-30 21:45 UTC (permalink / raw)
  To: Joao Martins, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Michael S. Tsirkin, Marcel Apfelbaum,
	Jason Wang, Richard Henderson, Eduardo Habkost, Avihai Horon,
	Jason Gunthorpe

On 30/5/23 19:59, Joao Martins wrote:
> From: Avihai Horon <avihaih@nvidia.com>
> 
> Implement get_attr() method and use the address width property to report
> the IOMMU_ATTR_MAX_IOVA attribute.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   hw/i386/intel_iommu.c | 7 +++++++
>   1 file changed, 7 insertions(+)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 1906f3a67960..829dd6eadc6c 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -3876,6 +3876,13 @@ static int vtd_iommu_get_attr(IOMMUMemoryRegion *iommu_mr,
>           *enabled = s->dma_translation;
>           break;
>       }
> +    case IOMMU_ATTR_MAX_IOVA:
> +    {
> +        hwaddr *max_iova = data;

Shouldn't we cast to uintptr_t to be safe?

> +        *max_iova = (1ULL << s->aw_bits) - 1;

Alternatively:

            *max_iova = MAKE_64BIT_MASK(0, s->aw_bits);

> +        break;
> +    }
>       default:
>           ret = -EINVAL;
>           break;



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 01/15] hw/pci: Refactor pci_device_iommu_address_space()
  2023-05-30 17:59 ` [PATCH v3 01/15] hw/pci: Refactor pci_device_iommu_address_space() Joao Martins
@ 2023-05-30 22:04   ` Philippe Mathieu-Daudé
  2023-05-31 10:03     ` Joao Martins
  0 siblings, 1 reply; 30+ messages in thread
From: Philippe Mathieu-Daudé @ 2023-05-30 22:04 UTC (permalink / raw)
  To: Joao Martins, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Michael S. Tsirkin, Marcel Apfelbaum,
	Jason Wang, Richard Henderson, Eduardo Habkost, Avihai Horon,
	Jason Gunthorpe

Hi Joao,

On 30/5/23 19:59, Joao Martins wrote:
> Rename pci_device_iommu_address_space() into pci_device_iommu_info().
> In the new function return a new type PCIAddressSpace that encapsulates
> the AddressSpace pointer that originally was returned.
> 
> The new type is added in preparation to expanding it to include the IOMMU
> memory region as a new field, such that we are able to fetch attributes of
> the vIOMMU e.g. at vfio migration setup.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   hw/pci/pci.c         |  9 ++++++---
>   include/hw/pci/pci.h | 21 ++++++++++++++++++++-

Please consider using scripts/git.orderfile.

>   2 files changed, 26 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 1cc7c89036b5..ecf8a543aa77 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2633,11 +2633,12 @@ static void pci_device_class_base_init(ObjectClass *klass, void *data)
>       }
>   }
>   
> -AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> +PCIAddressSpace pci_device_iommu_info(PCIDevice *dev)
>   {

This function is PCI specific, ...

>   }
>   
>   void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque)
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index e6d0574a2999..9ffaf47fe2ab 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -363,9 +363,28 @@ void pci_bus_get_w64_range(PCIBus *bus, Range *range);
>   
>   void pci_device_deassert_intx(PCIDevice *dev);
>   
> +typedef struct PCIAddressSpace {
> +    AddressSpace *as;

... but here I fail to understand what is PCI specific in this
structure. You are just trying to an AS with a IOMMU MR, right?

> +} PCIAddressSpace;
> +
>   typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
> +static inline PCIAddressSpace as_to_pci_as(AddressSpace *as)
> +{
> +    PCIAddressSpace ret = { .as = as };
> +
> +    return ret;
> +}
> +static inline AddressSpace *pci_as_to_as(PCIAddressSpace pci_as)
> +{
> +    return pci_as.as;
> +}
> +
> +PCIAddressSpace pci_device_iommu_info(PCIDevice *dev);
> +static inline AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> +{
> +    return pci_as_to_as(pci_device_iommu_info(dev));
> +}
>   
> -AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
>   void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
>   
>   pcibus_t pci_bar_address(PCIDevice *d,



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 08/15] vfio/common: Relax vIOMMU detection when DMA translation is off
  2023-05-30 21:39   ` Philippe Mathieu-Daudé
@ 2023-05-31  9:39     ` Joao Martins
  0 siblings, 0 replies; 30+ messages in thread
From: Joao Martins @ 2023-05-31  9:39 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Michael S. Tsirkin, Marcel Apfelbaum,
	Jason Wang, Richard Henderson, Eduardo Habkost, Avihai Horon,
	Jason Gunthorpe



On 30/05/2023 22:39, Philippe Mathieu-Daudé wrote:
> On 30/5/23 19:59, Joao Martins wrote:
>> Relax the vIOMMU migration blocker when the underlying IOMMU reports that
>> DMA translation disabled. When it is disabled there will be no DMA mappings
>> via the vIOMMU and the guest only uses it for Interrupt Remapping.
>>
>> The latter is done via the vfio_viommu_preset() return value whereby in
>> addition to validating that the address space is memory, we also check
>> whether the IOMMU MR has DMA translation on.
> 
>> Assume it is enabled if
>> there's no IOMMU MR or if no dma-translation property is supported.
> 
> This comment ^ ...
> 
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   hw/vfio/common.c | 22 +++++++++++++++++++++-
>>   1 file changed, 21 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index fa8fd949b1cf..060acccb3443 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -414,12 +414,32 @@ void vfio_unblock_multiple_devices_migration(void)
>>       multiple_devices_migration_blocker = NULL;
>>   }
>>   +static bool vfio_viommu_dma_translation_enabled(VFIOAddressSpace *space)
>> +{
>> +    bool enabled = false;
>> +    int ret;
>> +
>> +    if (!space->iommu_mr) {
>> +        return true;
>> +    }
>> +
>> +    ret = memory_region_iommu_get_attr(space->iommu_mr,
>> +                                       IOMMU_ATTR_DMA_TRANSLATION,
>> +                                       &enabled);
>> +    if (ret || enabled) {
> 
> ... could be helpful if duplicated here.
> 

I'll add it.

>> +        return true;
>> +    }
>> +
>> +    return false;
>> +}
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 10/15] intel-iommu: Implement IOMMU_ATTR_MAX_IOVA get_attr() attribute
  2023-05-30 21:45   ` Philippe Mathieu-Daudé
@ 2023-05-31  9:54     ` Joao Martins
  2023-05-31 13:59       ` Philippe Mathieu-Daudé
  0 siblings, 1 reply; 30+ messages in thread
From: Joao Martins @ 2023-05-31  9:54 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Michael S. Tsirkin, Marcel Apfelbaum,
	Jason Wang, Richard Henderson, Eduardo Habkost, Avihai Horon,
	Jason Gunthorpe



On 30/05/2023 22:45, Philippe Mathieu-Daudé wrote:
> On 30/5/23 19:59, Joao Martins wrote:
>> From: Avihai Horon <avihaih@nvidia.com>
>>
>> Implement get_attr() method and use the address width property to report
>> the IOMMU_ATTR_MAX_IOVA attribute.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   hw/i386/intel_iommu.c | 7 +++++++
>>   1 file changed, 7 insertions(+)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 1906f3a67960..829dd6eadc6c 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -3876,6 +3876,13 @@ static int vtd_iommu_get_attr(IOMMUMemoryRegion *iommu_mr,
>>           *enabled = s->dma_translation;
>>           break;
>>       }
>> +    case IOMMU_ATTR_MAX_IOVA:
>> +    {
>> +        hwaddr *max_iova = data;
> 
> Shouldn't we cast to uintptr_t to be safe?
> 
Perhaps you mean something like this:

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 829dd6eadc6c..479307f1228f 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3878,7 +3878,7 @@ static int vtd_iommu_get_attr(IOMMUMemoryRegion *iommu_mr,
     }
     case IOMMU_ATTR_MAX_IOVA:
     {
-        hwaddr *max_iova = data;
+        hwaddr *max_iova = (hwaddr *)(uintptr_t) data;

         *max_iova = (1ULL << s->aw_bits) - 1;
         break;

I guess the thinking is to prevent 32-bit failures.

>> +        *max_iova = (1ULL << s->aw_bits) - 1;
> 
> Alternatively:
> 
>            *max_iova = MAKE_64BIT_MASK(0, s->aw_bits);
> 

I'll switch to your suggestion. Wasn't aware of this macro :)

>> +        break;
>> +    }
>>       default:
>>           ret = -EINVAL;
>>           break;
> 


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 01/15] hw/pci: Refactor pci_device_iommu_address_space()
  2023-05-30 22:04   ` Philippe Mathieu-Daudé
@ 2023-05-31 10:03     ` Joao Martins
  2023-06-22 20:50       ` Michael S. Tsirkin
  0 siblings, 1 reply; 30+ messages in thread
From: Joao Martins @ 2023-05-31 10:03 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Michael S. Tsirkin, Marcel Apfelbaum,
	Jason Wang, Richard Henderson, Eduardo Habkost, Avihai Horon,
	Jason Gunthorpe

On 30/05/2023 23:04, Philippe Mathieu-Daudé wrote:
> Hi Joao,
> 
> On 30/5/23 19:59, Joao Martins wrote:
>> Rename pci_device_iommu_address_space() into pci_device_iommu_info().
>> In the new function return a new type PCIAddressSpace that encapsulates
>> the AddressSpace pointer that originally was returned.
>>
>> The new type is added in preparation to expanding it to include the IOMMU
>> memory region as a new field, such that we are able to fetch attributes of
>> the vIOMMU e.g. at vfio migration setup.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   hw/pci/pci.c         |  9 ++++++---
>>   include/hw/pci/pci.h | 21 ++++++++++++++++++++-
> 
> Please consider using scripts/git.orderfile.
> 
Will do -- wasn't aware of that script.

>>   2 files changed, 26 insertions(+), 4 deletions(-)
>>
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index 1cc7c89036b5..ecf8a543aa77 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -2633,11 +2633,12 @@ static void pci_device_class_base_init(ObjectClass
>> *klass, void *data)
>>       }
>>   }
>>   -AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>> +PCIAddressSpace pci_device_iommu_info(PCIDevice *dev)
>>   {
> 
> This function is PCI specific, ...
> 
>>   }
>>     void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque)
>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>> index e6d0574a2999..9ffaf47fe2ab 100644
>> --- a/include/hw/pci/pci.h
>> +++ b/include/hw/pci/pci.h
>> @@ -363,9 +363,28 @@ void pci_bus_get_w64_range(PCIBus *bus, Range *range);
>>     void pci_device_deassert_intx(PCIDevice *dev);
>>   +typedef struct PCIAddressSpace {
>> +    AddressSpace *as;
> 
> ... but here I fail to understand what is PCI specific in this
> structure. You are just trying to an AS with a IOMMU MR, right?
> 
Right. The patch is trying to better split the changes to use one function to
return everything (via pci_device_iommu_info) with the PCIAddressSpace
intermediate structure as retval, such that patch 3 just adds a
IOMMUMemoryRegion* in the latter for usage with the
pci_device_iommu_memory_region().

I've named the structure with a 'PCI' prefix, because it seemed to me that it is
the only case (AIUI) that cares about whether a PCI has a different address
space that the memory map.

>> +} PCIAddressSpace;
>> +
>>   typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
>> +static inline PCIAddressSpace as_to_pci_as(AddressSpace *as)
>> +{
>> +    PCIAddressSpace ret = { .as = as };
>> +
>> +    return ret;
>> +}
>> +static inline AddressSpace *pci_as_to_as(PCIAddressSpace pci_as)
>> +{
>> +    return pci_as.as;
>> +}
>> +
>> +PCIAddressSpace pci_device_iommu_info(PCIDevice *dev);
>> +static inline AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>> +{
>> +    return pci_as_to_as(pci_device_iommu_info(dev));
>> +}
>>   -AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
>>   void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
>>     pcibus_t pci_bar_address(PCIDevice *d,
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 10/15] intel-iommu: Implement IOMMU_ATTR_MAX_IOVA get_attr() attribute
  2023-05-31  9:54     ` Joao Martins
@ 2023-05-31 13:59       ` Philippe Mathieu-Daudé
  0 siblings, 0 replies; 30+ messages in thread
From: Philippe Mathieu-Daudé @ 2023-05-31 13:59 UTC (permalink / raw)
  To: Joao Martins, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Michael S. Tsirkin, Marcel Apfelbaum,
	Jason Wang, Richard Henderson, Eduardo Habkost, Avihai Horon,
	Jason Gunthorpe

On 31/5/23 11:54, Joao Martins wrote:
> 
> 
> On 30/05/2023 22:45, Philippe Mathieu-Daudé wrote:
>> On 30/5/23 19:59, Joao Martins wrote:
>>> From: Avihai Horon <avihaih@nvidia.com>
>>>
>>> Implement get_attr() method and use the address width property to report
>>> the IOMMU_ATTR_MAX_IOVA attribute.
>>>
>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>> ---
>>>    hw/i386/intel_iommu.c | 7 +++++++
>>>    1 file changed, 7 insertions(+)


>>> +    case IOMMU_ATTR_MAX_IOVA:
>>> +    {
>>> +        hwaddr *max_iova = data;
>>
>> Shouldn't we cast to uintptr_t to be safe?
>>
> Perhaps you mean something like this:
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 829dd6eadc6c..479307f1228f 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -3878,7 +3878,7 @@ static int vtd_iommu_get_attr(IOMMUMemoryRegion *iommu_mr,
>       }
>       case IOMMU_ATTR_MAX_IOVA:
>       {
> -        hwaddr *max_iova = data;
> +        hwaddr *max_iova = (hwaddr *)(uintptr_t) data;
> 
>           *max_iova = (1ULL << s->aw_bits) - 1;
>           break;
> 
> I guess the thinking is to prevent 32-bit failures.

Exactly.

>>> +        *max_iova = (1ULL << s->aw_bits) - 1;
>>
>> Alternatively:
>>
>>             *max_iova = MAKE_64BIT_MASK(0, s->aw_bits);
>>
> 
> I'll switch to your suggestion. Wasn't aware of this macro :)

Thanks, it is a no-brainer when reviewing.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 03/15] hw/pci: Add a pci_device_iommu_memory_region() helper
  2023-05-30 17:59 ` [PATCH v3 03/15] hw/pci: Add a pci_device_iommu_memory_region() helper Joao Martins
@ 2023-06-05 16:57   ` Peter Xu
  2023-06-06 11:22     ` Joao Martins
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2023-06-05 16:57 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Alex Williamson, Cedric Le Goater, Paolo Bonzini,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe

On Tue, May 30, 2023 at 06:59:25PM +0100, Joao Martins wrote:
> Much like pci_device_iommu_address_space() fetches the IOMMU AS, add a
> pci_device_iommu_memory_region() which lets it return an the IOMMU MR
> associated with it. The IOMMU MR is returned correctly for vIOMMUs using
> pci_setup_iommu_info(). Note that today most vIOMMUs create the address
> space and IOMMU MR at the same time, it's just mainly that there's API
> to make the latter available.

Have you looked into other archs outside x86?  IIRC on some other arch one
address space can have >1 IOMMU memory regions.. at least with such AS and
MR layering it seems always possible?  Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 03/15] hw/pci: Add a pci_device_iommu_memory_region() helper
  2023-06-05 16:57   ` Peter Xu
@ 2023-06-06 11:22     ` Joao Martins
  2023-06-06 15:03       ` Peter Xu
  0 siblings, 1 reply; 30+ messages in thread
From: Joao Martins @ 2023-06-06 11:22 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Alex Williamson, Cedric Le Goater, Paolo Bonzini,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe

On 05/06/2023 17:57, Peter Xu wrote:
> On Tue, May 30, 2023 at 06:59:25PM +0100, Joao Martins wrote:
>> Much like pci_device_iommu_address_space() fetches the IOMMU AS, add a
>> pci_device_iommu_memory_region() which lets it return an the IOMMU MR
>> associated with it. The IOMMU MR is returned correctly for vIOMMUs using
>> pci_setup_iommu_info(). Note that today most vIOMMUs create the address
>> space and IOMMU MR at the same time, it's just mainly that there's API
>> to make the latter available.
> 
> Have you looked into other archs outside x86?  IIRC on some other arch one
> address space can have >1 IOMMU memory regions.. at least with such AS and
> MR layering it seems always possible?  Thanks,
> 

I looked at all callers of pci_setup_iommu() restricting to those that actually
track an IOMMUMemoryRegion when they create a address space... as this is where
pci_device_iommu_memory_region() is applicable. From looking at those[*], I see
always a 1:1 association between the AS and the IOMMU-MR in their initialization
when iommu_fn is called. Unless I missed something... Is there an arch you were
thinking specifically?

[I am not sure we can track today an 1:N AS->IOMMU association today in Qemu]

[*] alpha, arm smmu, ppc, s390, virtio, and some pci bridges (pnv_phb3 and pnv_phb4)

	Joao


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 03/15] hw/pci: Add a pci_device_iommu_memory_region() helper
  2023-06-06 11:22     ` Joao Martins
@ 2023-06-06 15:03       ` Peter Xu
  2023-06-06 15:05         ` Peter Xu
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2023-06-06 15:03 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Alex Williamson, Cedric Le Goater, Paolo Bonzini,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe

On Tue, Jun 06, 2023 at 12:22:16PM +0100, Joao Martins wrote:
> On 05/06/2023 17:57, Peter Xu wrote:
> > On Tue, May 30, 2023 at 06:59:25PM +0100, Joao Martins wrote:
> >> Much like pci_device_iommu_address_space() fetches the IOMMU AS, add a
> >> pci_device_iommu_memory_region() which lets it return an the IOMMU MR
> >> associated with it. The IOMMU MR is returned correctly for vIOMMUs using
> >> pci_setup_iommu_info(). Note that today most vIOMMUs create the address
> >> space and IOMMU MR at the same time, it's just mainly that there's API
> >> to make the latter available.
> > 
> > Have you looked into other archs outside x86?  IIRC on some other arch one
> > address space can have >1 IOMMU memory regions.. at least with such AS and
> > MR layering it seems always possible?  Thanks,
> > 
> 
> I looked at all callers of pci_setup_iommu() restricting to those that actually
> track an IOMMUMemoryRegion when they create a address space... as this is where
> pci_device_iommu_memory_region() is applicable. From looking at those[*], I see
> always a 1:1 association between the AS and the IOMMU-MR in their initialization
> when iommu_fn is called. Unless I missed something... Is there an arch you were
> thinking specifically?

If only observing the ones that "track an IOMMUMemoryRegion when they
create a address space", probably we're fine.  I was thinking ppc but I
don't really know the details, and I assume that's not in the scope.
Copying David Gibson just in case he got some comments here.

> 
> [I am not sure we can track today an 1:N AS->IOMMU association today in Qemu]

IIUC we can?  The address space only have a root MR, and with that after
translate() upon the root mr (per address_space_translate_iommu(), it can
even be a few rounds of nested translations) it can go into whatever MR
under it IIUC.  Different ranges can map to a different IOMMU MR logically.

> 
> [*] alpha, arm smmu, ppc, s390, virtio, and some pci bridges (pnv_phb3 and pnv_phb4)

I just worried what we need here is not a MR object but a higher level
object like the vIOMMU object.  We used to have a requirement with Scalable
IOV (SVA) on Intel.  I tried to dig a bit in my inbox, not sure whether
it's the latest status, just to show what I meant:

https://lore.kernel.org/r/20210302203827.437645-6-yi.l.liu@intel.com

Copy Yi too for that too.  From that aspect it makes more sense to me to
fetching things from either an IOMMUops or "an iommu object", rather than
relying on a specific MR (it'll also make it even harder when we can have
>1 vIOMMUs so different MR can point to different IOMMUs in the future).

I assume the two goals have similar requirement, iiuc.  If that's the case,
we'd better make sure we'll have one way to work for both.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 03/15] hw/pci: Add a pci_device_iommu_memory_region() helper
  2023-06-06 15:03       ` Peter Xu
@ 2023-06-06 15:05         ` Peter Xu
  2023-06-06 17:44           ` Joao Martins
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2023-06-06 15:05 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Alex Williamson, Cedric Le Goater, Paolo Bonzini,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Yi Liu, David Gibson

[ I forgot to really copy anyone, as usual.. trying again ]

On Tue, Jun 06, 2023 at 11:03:11AM -0400, Peter Xu wrote:
> On Tue, Jun 06, 2023 at 12:22:16PM +0100, Joao Martins wrote:
> > On 05/06/2023 17:57, Peter Xu wrote:
> > > On Tue, May 30, 2023 at 06:59:25PM +0100, Joao Martins wrote:
> > >> Much like pci_device_iommu_address_space() fetches the IOMMU AS, add a
> > >> pci_device_iommu_memory_region() which lets it return an the IOMMU MR
> > >> associated with it. The IOMMU MR is returned correctly for vIOMMUs using
> > >> pci_setup_iommu_info(). Note that today most vIOMMUs create the address
> > >> space and IOMMU MR at the same time, it's just mainly that there's API
> > >> to make the latter available.
> > > 
> > > Have you looked into other archs outside x86?  IIRC on some other arch one
> > > address space can have >1 IOMMU memory regions.. at least with such AS and
> > > MR layering it seems always possible?  Thanks,
> > > 
> > 
> > I looked at all callers of pci_setup_iommu() restricting to those that actually
> > track an IOMMUMemoryRegion when they create a address space... as this is where
> > pci_device_iommu_memory_region() is applicable. From looking at those[*], I see
> > always a 1:1 association between the AS and the IOMMU-MR in their initialization
> > when iommu_fn is called. Unless I missed something... Is there an arch you were
> > thinking specifically?
> 
> If only observing the ones that "track an IOMMUMemoryRegion when they
> create a address space", probably we're fine.  I was thinking ppc but I
> don't really know the details, and I assume that's not in the scope.
> Copying David Gibson just in case he got some comments here.
> 
> > 
> > [I am not sure we can track today an 1:N AS->IOMMU association today in Qemu]
> 
> IIUC we can?  The address space only have a root MR, and with that after
> translate() upon the root mr (per address_space_translate_iommu(), it can
> even be a few rounds of nested translations) it can go into whatever MR
> under it IIUC.  Different ranges can map to a different IOMMU MR logically.
> 
> > 
> > [*] alpha, arm smmu, ppc, s390, virtio, and some pci bridges (pnv_phb3 and pnv_phb4)
> 
> I just worried what we need here is not a MR object but a higher level
> object like the vIOMMU object.  We used to have a requirement with Scalable
> IOV (SVA) on Intel.  I tried to dig a bit in my inbox, not sure whether
> it's the latest status, just to show what I meant:
> 
> https://lore.kernel.org/r/20210302203827.437645-6-yi.l.liu@intel.com
> 
> Copy Yi too for that too.  From that aspect it makes more sense to me to
> fetching things from either an IOMMUops or "an iommu object", rather than
> relying on a specific MR (it'll also make it even harder when we can have
> >1 vIOMMUs so different MR can point to different IOMMUs in the future).
> 
> I assume the two goals have similar requirement, iiuc.  If that's the case,
> we'd better make sure we'll have one way to work for both.
> 
> -- 
> Peter Xu

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 03/15] hw/pci: Add a pci_device_iommu_memory_region() helper
  2023-06-06 15:05         ` Peter Xu
@ 2023-06-06 17:44           ` Joao Martins
  0 siblings, 0 replies; 30+ messages in thread
From: Joao Martins @ 2023-06-06 17:44 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Alex Williamson, Cedric Le Goater, Paolo Bonzini,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Yi Liu, David Gibson

On 06/06/2023 16:05, Peter Xu wrote:
> On Tue, Jun 06, 2023 at 11:03:11AM -0400, Peter Xu wrote:
>> On Tue, Jun 06, 2023 at 12:22:16PM +0100, Joao Martins wrote:
>>> On 05/06/2023 17:57, Peter Xu wrote:
>>>> On Tue, May 30, 2023 at 06:59:25PM +0100, Joao Martins wrote:
>>>>> Much like pci_device_iommu_address_space() fetches the IOMMU AS, add a
>>>>> pci_device_iommu_memory_region() which lets it return an the IOMMU MR
>>>>> associated with it. The IOMMU MR is returned correctly for vIOMMUs using
>>>>> pci_setup_iommu_info(). Note that today most vIOMMUs create the address
>>>>> space and IOMMU MR at the same time, it's just mainly that there's API
>>>>> to make the latter available.
>>>>
>>>> Have you looked into other archs outside x86?  IIRC on some other arch one
>>>> address space can have >1 IOMMU memory regions.. at least with such AS and
>>>> MR layering it seems always possible?  Thanks,
>>>>
>>>
>>> I looked at all callers of pci_setup_iommu() restricting to those that actually
>>> track an IOMMUMemoryRegion when they create a address space... as this is where
>>> pci_device_iommu_memory_region() is applicable. From looking at those[*], I see
>>> always a 1:1 association between the AS and the IOMMU-MR in their initialization
>>> when iommu_fn is called. Unless I missed something... Is there an arch you were
>>> thinking specifically?
>>
>> If only observing the ones that "track an IOMMUMemoryRegion when they
>> create a address space", probably we're fine.  I was thinking ppc but I
>> don't really know the details, and I assume that's not in the scope.
>> Copying David Gibson just in case he got some comments here.
>>/me nods

>>>
>>> [I am not sure we can track today an 1:N AS->IOMMU association today in Qemu]
>>
>> IIUC we can?  The address space only have a root MR, and with that after
>> translate() upon the root mr (per address_space_translate_iommu(), it can
>> even be a few rounds of nested translations) it can go into whatever MR
>> under it IIUC.  Different ranges can map to a different IOMMU MR logically.
>>

I'll look some more into address_space_translate_iommu(). From a data structure
PoV wasn't obvious how two different AS can be routed via two different IOMMU
MRs from different AS (or vice versa). Thanks for clarifying;

>>>
>>> [*] alpha, arm smmu, ppc, s390, virtio, and some pci bridges (pnv_phb3 and pnv_phb4)
>>
>> I just worried what we need here is not a MR object but a higher level
>> object like the vIOMMU object.  We used to have a requirement with Scalable
>> IOV (SVA) on Intel.  I tried to dig a bit in my inbox, not sure whether
>> it's the latest status, just to show what I meant:
>>
>> https://lore.kernel.org/r/20210302203827.437645-6-yi.l.liu@intel.com
>>
Oh nice; I wasn't aware of this series.

>> Copy Yi too for that too.  From that aspect it makes more sense to me to
>> fetching things from either an IOMMUops or "an iommu object", rather than
>> relying on a specific MR (it'll also make it even harder when we can have
>> 1 vIOMMUs so different MR can point to different IOMMUs in the future).
>>
>> I assume the two goals have similar requirement, iiuc.  If that's the case,
>> we'd better make sure we'll have one way to work for both.

Yeap, makes sense, definitely more future-proof. We essentially were trying to
do the exact same thing in the PCI layer just different purposes. All I meant in
this series is a way to fetch some vIOMMU attribute that tell me if DMA
translation is enabled or not and max IOVA for the IOMMU under a particular PCI
device.

Perhaps I would instead do a bit like this series and have a
pci_setup_iommu_ops() and convert existing users to it as a separate step or
series, to avoid regressing those who don't care.

I am happy to pick those up if Yi's is OK with it -- should help for the nesting
work down the road too.

	Joao


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 01/15] hw/pci: Refactor pci_device_iommu_address_space()
  2023-05-31 10:03     ` Joao Martins
@ 2023-06-22 20:50       ` Michael S. Tsirkin
  2023-06-22 21:01         ` Joao Martins
  0 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2023-06-22 20:50 UTC (permalink / raw)
  To: Joao Martins
  Cc: Philippe Mathieu-Daudé,
	qemu-devel, Alex Williamson, Cedric Le Goater, Paolo Bonzini,
	Peter Xu, David Hildenbrand, Marcel Apfelbaum, Jason Wang,
	Richard Henderson, Eduardo Habkost, Avihai Horon,
	Jason Gunthorpe

On Wed, May 31, 2023 at 11:03:23AM +0100, Joao Martins wrote:
> On 30/05/2023 23:04, Philippe Mathieu-Daudé wrote:
> > Hi Joao,
> > 
> > On 30/5/23 19:59, Joao Martins wrote:
> >> Rename pci_device_iommu_address_space() into pci_device_iommu_info().
> >> In the new function return a new type PCIAddressSpace that encapsulates
> >> the AddressSpace pointer that originally was returned.
> >>
> >> The new type is added in preparation to expanding it to include the IOMMU
> >> memory region as a new field, such that we are able to fetch attributes of
> >> the vIOMMU e.g. at vfio migration setup.
> >>
> >> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> >> ---
> >>   hw/pci/pci.c         |  9 ++++++---
> >>   include/hw/pci/pci.h | 21 ++++++++++++++++++++-
> > 
> > Please consider using scripts/git.orderfile.
> > 
> Will do -- wasn't aware of that script.
> 
> >>   2 files changed, 26 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> >> index 1cc7c89036b5..ecf8a543aa77 100644
> >> --- a/hw/pci/pci.c
> >> +++ b/hw/pci/pci.c
> >> @@ -2633,11 +2633,12 @@ static void pci_device_class_base_init(ObjectClass
> >> *klass, void *data)
> >>       }
> >>   }
> >>   -AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> >> +PCIAddressSpace pci_device_iommu_info(PCIDevice *dev)
> >>   {
> > 
> > This function is PCI specific, ...
> > 
> >>   }
> >>     void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque)
> >> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> >> index e6d0574a2999..9ffaf47fe2ab 100644
> >> --- a/include/hw/pci/pci.h
> >> +++ b/include/hw/pci/pci.h
> >> @@ -363,9 +363,28 @@ void pci_bus_get_w64_range(PCIBus *bus, Range *range);
> >>     void pci_device_deassert_intx(PCIDevice *dev);
> >>   +typedef struct PCIAddressSpace {
> >> +    AddressSpace *as;
> > 
> > ... but here I fail to understand what is PCI specific in this
> > structure. You are just trying to an AS with a IOMMU MR, right?
> > 
> Right. The patch is trying to better split the changes to use one function to
> return everything (via pci_device_iommu_info) with the PCIAddressSpace
> intermediate structure as retval, such that patch 3 just adds a
> IOMMUMemoryRegion* in the latter for usage with the
> pci_device_iommu_memory_region().
> 
> I've named the structure with a 'PCI' prefix, because it seemed to me that it is
> the only case (AIUI) that cares about whether a PCI has a different address
> space that the memory map.


yea keep that pls. It should be possible to figure out the header
from the name.

> >> +} PCIAddressSpace;
> >> +
> >>   typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
> >> +static inline PCIAddressSpace as_to_pci_as(AddressSpace *as)
> >> +{
> >> +    PCIAddressSpace ret = { .as = as };
> >> +
> >> +    return ret;
> >> +}
> >> +static inline AddressSpace *pci_as_to_as(PCIAddressSpace pci_as)
> >> +{
> >> +    return pci_as.as;
> >> +}
> >> +
> >> +PCIAddressSpace pci_device_iommu_info(PCIDevice *dev);
> >> +static inline AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> >> +{
> >> +    return pci_as_to_as(pci_device_iommu_info(dev));
> >> +}
> >>   -AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
> >>   void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
> >>     pcibus_t pci_bar_address(PCIDevice *d,
> > 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 01/15] hw/pci: Refactor pci_device_iommu_address_space()
  2023-06-22 20:50       ` Michael S. Tsirkin
@ 2023-06-22 21:01         ` Joao Martins
  0 siblings, 0 replies; 30+ messages in thread
From: Joao Martins @ 2023-06-22 21:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Philippe Mathieu-Daudé,
	qemu-devel, Alex Williamson, Cedric Le Goater, Paolo Bonzini,
	Peter Xu, David Hildenbrand, Marcel Apfelbaum, Jason Wang,
	Richard Henderson, Eduardo Habkost, Avihai Horon,
	Jason Gunthorpe



On 22/06/2023 21:50, Michael S. Tsirkin wrote:
> On Wed, May 31, 2023 at 11:03:23AM +0100, Joao Martins wrote:
>> On 30/05/2023 23:04, Philippe Mathieu-Daudé wrote:
>>> Hi Joao,
>>>
>>> On 30/5/23 19:59, Joao Martins wrote:
>>>> Rename pci_device_iommu_address_space() into pci_device_iommu_info().
>>>> In the new function return a new type PCIAddressSpace that encapsulates
>>>> the AddressSpace pointer that originally was returned.
>>>>
>>>> The new type is added in preparation to expanding it to include the IOMMU
>>>> memory region as a new field, such that we are able to fetch attributes of
>>>> the vIOMMU e.g. at vfio migration setup.
>>>>
>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>> ---
>>>>   hw/pci/pci.c         |  9 ++++++---
>>>>   include/hw/pci/pci.h | 21 ++++++++++++++++++++-
>>>
>>> Please consider using scripts/git.orderfile.
>>>
>> Will do -- wasn't aware of that script.
>>
>>>>   2 files changed, 26 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>>> index 1cc7c89036b5..ecf8a543aa77 100644
>>>> --- a/hw/pci/pci.c
>>>> +++ b/hw/pci/pci.c
>>>> @@ -2633,11 +2633,12 @@ static void pci_device_class_base_init(ObjectClass
>>>> *klass, void *data)
>>>>       }
>>>>   }
>>>>   -AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>>>> +PCIAddressSpace pci_device_iommu_info(PCIDevice *dev)
>>>>   {
>>>
>>> This function is PCI specific, ...
>>>
>>>>   }
>>>>     void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque)
>>>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>>>> index e6d0574a2999..9ffaf47fe2ab 100644
>>>> --- a/include/hw/pci/pci.h
>>>> +++ b/include/hw/pci/pci.h
>>>> @@ -363,9 +363,28 @@ void pci_bus_get_w64_range(PCIBus *bus, Range *range);
>>>>     void pci_device_deassert_intx(PCIDevice *dev);
>>>>   +typedef struct PCIAddressSpace {
>>>> +    AddressSpace *as;
>>>
>>> ... but here I fail to understand what is PCI specific in this
>>> structure. You are just trying to an AS with a IOMMU MR, right?
>>>
>> Right. The patch is trying to better split the changes to use one function to
>> return everything (via pci_device_iommu_info) with the PCIAddressSpace
>> intermediate structure as retval, such that patch 3 just adds a
>> IOMMUMemoryRegion* in the latter for usage with the
>> pci_device_iommu_memory_region().
>>
>> I've named the structure with a 'PCI' prefix, because it seemed to me that it is
>> the only case (AIUI) that cares about whether a PCI has a different address
>> space that the memory map.
> 
> 
> yea keep that pls. It should be possible to figure out the header
> from the name.
> 

OK.

I am about to respin v4 series. It mainly reworks the first four patch enterily.
Essentially I'm following Peter's suggestion of picking Yi's old patches[0][1]
and avoid the direct manipulation of an IOMMU MR. The structure is very similar,
but the difference is avoid the direct manipulation of an IOMMU MR[2]. The end
goal in hw/pci is similar, fetching the backing IOMMU attribute from a PCI device.

[0] https://lore.kernel.org/all/20210302203827.437645-5-yi.l.liu@intel.com/
[1] https://lore.kernel.org/all/20210302203827.437645-6-yi.l.liu@intel.com/
[2] https://lore.kernel.org/qemu-devel/ZH9Kr6mrKNqUgcYs@x1n/

>>>> +} PCIAddressSpace;
>>>> +
>>>>   typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
>>>> +static inline PCIAddressSpace as_to_pci_as(AddressSpace *as)
>>>> +{
>>>> +    PCIAddressSpace ret = { .as = as };
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +static inline AddressSpace *pci_as_to_as(PCIAddressSpace pci_as)
>>>> +{
>>>> +    return pci_as.as;
>>>> +}
>>>> +
>>>> +PCIAddressSpace pci_device_iommu_info(PCIDevice *dev);
>>>> +static inline AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>>>> +{
>>>> +    return pci_as_to_as(pci_device_iommu_info(dev));
>>>> +}
>>>>   -AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
>>>>   void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
>>>>     pcibus_t pci_bar_address(PCIDevice *d,
>>>
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2023-06-22 21:02 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-30 17:59 [PATCH v3 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
2023-05-30 17:59 ` [PATCH v3 01/15] hw/pci: Refactor pci_device_iommu_address_space() Joao Martins
2023-05-30 22:04   ` Philippe Mathieu-Daudé
2023-05-31 10:03     ` Joao Martins
2023-06-22 20:50       ` Michael S. Tsirkin
2023-06-22 21:01         ` Joao Martins
2023-05-30 17:59 ` [PATCH v3 02/15] hw/pci: Add a pci_setup_iommu_info() helper Joao Martins
2023-05-30 17:59 ` [PATCH v3 03/15] hw/pci: Add a pci_device_iommu_memory_region() helper Joao Martins
2023-06-05 16:57   ` Peter Xu
2023-06-06 11:22     ` Joao Martins
2023-06-06 15:03       ` Peter Xu
2023-06-06 15:05         ` Peter Xu
2023-06-06 17:44           ` Joao Martins
2023-05-30 17:59 ` [PATCH v3 04/15] intel-iommu: Switch to pci_setup_iommu_info() Joao Martins
2023-05-30 17:59 ` [PATCH v3 05/15] vfio/common: Track the IOMMU MR behind the device in addition to the AS Joao Martins
2023-05-30 17:59 ` [PATCH v3 06/15] memory/iommu: Add IOMMU_ATTR_DMA_TRANSLATION attribute Joao Martins
2023-05-30 17:59 ` [PATCH v3 07/15] intel-iommu: Implement get_attr() method Joao Martins
2023-05-30 17:59 ` [PATCH v3 08/15] vfio/common: Relax vIOMMU detection when DMA translation is off Joao Martins
2023-05-30 21:39   ` Philippe Mathieu-Daudé
2023-05-31  9:39     ` Joao Martins
2023-05-30 17:59 ` [PATCH v3 09/15] memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute Joao Martins
2023-05-30 17:59 ` [PATCH v3 10/15] intel-iommu: Implement IOMMU_ATTR_MAX_IOVA get_attr() attribute Joao Martins
2023-05-30 21:45   ` Philippe Mathieu-Daudé
2023-05-31  9:54     ` Joao Martins
2023-05-31 13:59       ` Philippe Mathieu-Daudé
2023-05-30 17:59 ` [PATCH v3 11/15] vfio/common: Move dirty tracking ranges update to helper Joao Martins
2023-05-30 17:59 ` [PATCH v3 12/15] vfio/common: Support device dirty page tracking with vIOMMU Joao Martins
2023-05-30 17:59 ` [PATCH v3 13/15] vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap() Joao Martins
2023-05-30 17:59 ` [PATCH v3 14/15] vfio/common: Optimize device dirty page tracking with vIOMMU Joao Martins
2023-05-30 17:59 ` [PATCH v3 15/15] vfio/common: Block migration with vIOMMUs without address width limits Joao Martins

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.