All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU
@ 2023-06-22 21:48 Joao Martins
  2023-06-22 21:48 ` [PATCH v4 01/15] hw/pci: Add a pci_setup_iommu_ops() helper Joao Martins
                   ` (16 more replies)
  0 siblings, 17 replies; 55+ messages in thread
From: Joao Martins @ 2023-06-22 21:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins, Yi Liu

Hey,

This series introduces support for vIOMMU with VFIO device migration,
particurlarly related to how we do the dirty page tracking.

Today vIOMMUs serve two purposes: 1) enable interrupt remaping 2)
provide dma translation services for guests to provide some form of
guest kernel managed DMA e.g. for nested virt based usage; (1) is specially
required for big VMs with VFs with more than 255 vcpus. We tackle both
and remove the migration blocker when vIOMMU is present provided the
conditions are met. I have both use-cases here in one series, but I am happy
to tackle them in separate series.

As I found out we don't necessarily need to expose the whole vIOMMU
functionality in order to just support interrupt remapping. x86 IOMMUs
on Windows Server 2018[2] and Linux >=5.10, with qemu 7.1+ (or really
Linux guests with commit c40aaaac10 and since qemu commit 8646d9c773d8)
can instantiate a IOMMU just for interrupt remapping without needing to
be advertised/support DMA translation. AMD IOMMU in theory can provide
the same, but Linux doesn't quite support the IR-only part there yet,
only intel-iommu.

The series is organized as following:

Patches 1-5: Today we can't gather vIOMMU details before the guest
establishes their first DMA mapping via the vIOMMU. So these first four
patches add a way for vIOMMUs to be asked of their properties at start
of day. I choose the least churn possible way for now (as opposed to a
treewide conversion) and allow easy conversion a posteriori. As
suggested by Peter Xu[7], I have ressurected Yi's patches[5][6] which
allows us to fetch PCI backing vIOMMU attributes, without necessarily
tieing the caller (VFIO or anyone else) to an IOMMU MR like I
was doing in v3.

Patches 6-8: Handle configs with vIOMMU interrupt remapping but without
DMA translation allowed. Today the 'dma-translation' attribute is
x86-iommu only, but the way this series is structured nothing stops from
other vIOMMUs supporting it too as long as they use
pci_setup_iommu_ops() and the necessary IOMMU MR get_attr attributes
are handled. The blocker is thus relaxed when vIOMMUs are able to toggle
the toggle/report DMA_TRANSLATION attribute. With the patches up to this set,
we've then tackled item (1) of the second paragraph.

Patches 9-15: Simplified a lot from v2 (patch 9) to only track the complete
IOVA address space, leveraging the logic we use to compose the dirty ranges.
The blocker is once again relaxed for vIOMMUs that advertise their IOVA
addressing limits. This tackles item (2). So far I mainly use it with
intel-iommu, although I have a small set of patches for virtio-iommu per
Alex's suggestion in v2.

Comments, suggestions welcome. Thanks for the review!

Regards,
	Joao

Changes since v3[8]:
* Pick up Yi's patches[5][6], and rework the first four patches.
  These are a bit better splitted, and make the new iommu_ops *optional*
  as opposed to a treewide conversion. Rather than returning an IOMMU MR
  and let VFIO operate on it to fetch attributes, we instead let the
  underlying IOMMU driver fetch the desired IOMMU MR and ask for the
  desired IOMMU attribute. Callers only care about PCI Device backing
  vIOMMU attributes regardless of its topology/association. (Peter Xu)
  These patches are a bit better splitted compared to original ones,
  and I've kept all the same authorship and note the changes from
  original where applicable.
* Because of the rework of the first four patches, switch to
  individual attributes in the VFIOSpace that track dma_translation
  and the max_iova. All are expected to be unused when zero to retain
  the defaults of today in common code.
* Improve the migration blocker message of the last patch to be
  more obvious that vIOMMU migration blocker is added when no vIOMMU
  address space limits are advertised. (Patch 15)
* Cast to uintptr_t in IOMMUAttr data in intel-iommu (Philippe).
* Switch to MAKE_64BIT_MASK() instead of plain left shift (Philippe).
* Change diffstat of patches with scripts/git.orderfile (Philippe).

Changes since v2[3]:
* New patches 1-9 to be able to handle vIOMMUs without DMA translation, and
introduce ways to know various IOMMU model attributes via the IOMMU MR. This
is partly meant to address a comment in previous versions where we can't
access the IOMMU MR prior to the DMA mapping happening. Before this series
vfio giommu_list is only tracking 'mapped GIOVA' and that controlled by the
guest. As well as better tackling of the IOMMU usage for interrupt-remapping
only purposes. 
* Dropped Peter Xu ack on patch 9 given that the code changed a bit.
* Adjust patch 14 to adjust for the VFIO bitmaps no longer being pointers.
* The patches that existed in v2 of vIOMMU dirty tracking, are mostly
* untouched, except patch 12 which was greatly simplified.

Changes since v1[4]:
- Rebased on latest master branch. As part of it, made some changes in
  pre-copy to adjust it to Juan's new patches:
  1. Added a new patch that passes threshold_size parameter to
     .state_pending_{estimate,exact}() handlers.
  2. Added a new patch that refactors vfio_save_block().
  3. Changed the pre-copy patch to cache and report pending pre-copy
     size in the .state_pending_estimate() handler.
- Removed unnecessary P2P code. This should be added later on when P2P
  support is added. (Alex)
- Moved the dirty sync to be after the DMA unmap in vfio_dma_unmap()
  (patch #11). (Alex)
- Stored vfio_devices_all_device_dirty_tracking()'s value in a local
  variable in vfio_get_dirty_bitmap() so it can be re-used (patch #11).
- Refactored the viommu device dirty tracking ranges creation code to
  make it clearer (patch #15).
- Changed overflow check in vfio_iommu_range_is_device_tracked() to
  emphasize that we specifically check for 2^64 wrap around (patch #15).
- Added R-bs / Acks.

[0] https://lore.kernel.org/qemu-devel/20230222174915.5647-1-avihaih@nvidia.com/
[1] https://lore.kernel.org/qemu-devel/c66d2d8e-f042-964a-a797-a3d07c260a3b@oracle.com/
[2] https://learn.microsoft.com/en-us/windows-hardware/design/device-experiences/oem-kernel-dma-protection
[3] https://lore.kernel.org/qemu-devel/20230222174915.5647-1-avihaih@nvidia.com/
[4] https://lore.kernel.org/qemu-devel/20230126184948.10478-1-avihaih@nvidia.com/
[5] https://lore.kernel.org/all/20210302203827.437645-5-yi.l.liu@intel.com/
[6] https://lore.kernel.org/all/20210302203827.437645-6-yi.l.liu@intel.com/
[7] https://lore.kernel.org/qemu-devel/ZH9Kr6mrKNqUgcYs@x1n/
[8] https://lore.kernel.org/qemu-devel/20230530175937.24202-1-joao.m.martins@oracle.com/

Avihai Horon (4):
  memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute
  intel-iommu: Implement IOMMU_ATTR_MAX_IOVA get_attr() attribute
  vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap()
  vfio/common: Optimize device dirty page tracking with vIOMMU

Joao Martins (7):
  memory/iommu: Add IOMMU_ATTR_DMA_TRANSLATION attribute
  intel-iommu: Implement get_attr() method
  vfio/common: Track whether DMA Translation is enabled on the vIOMMU
  vfio/common: Relax vIOMMU detection when DMA translation is off
  vfio/common: Move dirty tracking ranges update to helper
  vfio/common: Support device dirty page tracking with vIOMMU
  vfio/common: Block migration with vIOMMUs without address width limits

Yi Liu (4):
  hw/pci: Add a pci_setup_iommu_ops() helper
  hw/pci: Refactor pci_device_iommu_address_space()
  hw/pci: Introduce pci_device_iommu_get_attr()
  intel-iommu: Switch to pci_setup_iommu_ops()

 include/exec/memory.h         |   4 +-
 include/hw/pci/pci.h          |  11 ++
 include/hw/pci/pci_bus.h      |   1 +
 include/hw/vfio/vfio-common.h |   2 +
 hw/i386/intel_iommu.c         |  53 +++++++-
 hw/pci/pci.c                  |  58 +++++++-
 hw/vfio/common.c              | 241 ++++++++++++++++++++++++++--------
 hw/vfio/pci.c                 |  22 +++-
 8 files changed, 329 insertions(+), 63 deletions(-)

-- 
2.17.2



^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH v4 01/15] hw/pci: Add a pci_setup_iommu_ops() helper
  2023-06-22 21:48 [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
@ 2023-06-22 21:48 ` Joao Martins
  2023-10-02 15:12   ` Cédric Le Goater
  2023-10-06  8:45   ` Eric Auger
  2023-06-22 21:48 ` [PATCH v4 02/15] hw/pci: Refactor pci_device_iommu_address_space() Joao Martins
                   ` (15 subsequent siblings)
  16 siblings, 2 replies; 55+ messages in thread
From: Joao Martins @ 2023-06-22 21:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins, Yi Liu

From: Yi Liu <yi.l.liu@intel.com>

Add a pci_setup_iommu_ops() that uses a newly added structure
(PCIIOMMUOps) instead of using PCIIOMMUFunc. The old pci_setup_iommu()
that uses PCIIOMMUFunc is still kept for other IOMMUs to get an
an address space for a PCI device in vendor specific way.

In preparation to expand to supplying vIOMMU attributes, add a
alternate helper pci_setup_iommu_ops() to setup the PCI device IOMMU.
For now the PCIIOMMUOps just defines the address_space, but it will
be extended to have another callback.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
[joao: Massage commit message and subject, and make it a complementary
rather than changing every single consumer of pci_setup_iommu()]
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
v1: https://lore.kernel.org/all/20210302203827.437645-5-yi.l.liu@intel.com/
---
 include/hw/pci/pci.h     |  7 +++++++
 include/hw/pci/pci_bus.h |  1 +
 hw/pci/pci.c             | 26 +++++++++++++++++++++++---
 3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index e6d0574a2999..f59aef5a329a 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -368,6 +368,13 @@ typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
 void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
 
+typedef struct PCIIOMMUOps PCIIOMMUOps;
+struct PCIIOMMUOps {
+    AddressSpace * (*get_address_space)(PCIBus *bus,
+                                void *opaque, int32_t devfn);
+};
+void pci_setup_iommu_ops(PCIBus *bus, const PCIIOMMUOps *iommu_ops, void *opaque);
+
 pcibus_t pci_bar_address(PCIDevice *d,
                          int reg, uint8_t type, pcibus_t size);
 
diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
index 56531759578f..fb770b236d69 100644
--- a/include/hw/pci/pci_bus.h
+++ b/include/hw/pci/pci_bus.h
@@ -35,6 +35,7 @@ struct PCIBus {
     enum PCIBusFlags flags;
     PCIIOMMUFunc iommu_fn;
     void *iommu_opaque;
+    const PCIIOMMUOps *iommu_ops;
     uint8_t devfn_min;
     uint32_t slot_reserved_mask;
     pci_set_irq_fn set_irq;
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index bf38905b7dc0..4e32c09e81d6 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2639,7 +2639,15 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
     PCIBus *iommu_bus = bus;
     uint8_t devfn = dev->devfn;
 
-    while (iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
+    /*
+     * get_address_space() callback is mandatory when iommu uses
+     * pci_setup_iommu_ops(), so needs to ensure its presence in
+     * the iommu_bus search.
+     */
+    while (iommu_bus &&
+           !(iommu_bus->iommu_fn ||
+            (iommu_bus->iommu_ops && iommu_bus->iommu_ops->get_address_space)) &&
+           iommu_bus->parent_dev) {
         PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
 
         /*
@@ -2678,8 +2686,14 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
 
         iommu_bus = parent_bus;
     }
-    if (!pci_bus_bypass_iommu(bus) && iommu_bus && iommu_bus->iommu_fn) {
-        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
+    if (!pci_bus_bypass_iommu(bus) && iommu_bus) {
+        if (iommu_bus->iommu_fn) {
+           return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
+        } else if (iommu_bus->iommu_ops &&
+                   iommu_bus->iommu_ops->get_address_space) {
+           return iommu_bus->iommu_ops->get_address_space(bus,
+                                           iommu_bus->iommu_opaque, devfn);
+        }
     }
     return &address_space_memory;
 }
@@ -2690,6 +2704,12 @@ void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque)
     bus->iommu_opaque = opaque;
 }
 
+void pci_setup_iommu_ops(PCIBus *bus, const PCIIOMMUOps *ops, void *opaque)
+{
+    bus->iommu_ops = ops;
+    bus->iommu_opaque = opaque;
+}
+
 static void pci_dev_get_w64(PCIBus *b, PCIDevice *dev, void *opaque)
 {
     Range *range = opaque;
-- 
2.17.2



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v4 02/15] hw/pci: Refactor pci_device_iommu_address_space()
  2023-06-22 21:48 [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
  2023-06-22 21:48 ` [PATCH v4 01/15] hw/pci: Add a pci_setup_iommu_ops() helper Joao Martins
@ 2023-06-22 21:48 ` Joao Martins
  2023-10-02 15:22   ` Cédric Le Goater
                     ` (2 more replies)
  2023-06-22 21:48 ` [PATCH v4 03/15] hw/pci: Introduce pci_device_iommu_get_attr() Joao Martins
                   ` (14 subsequent siblings)
  16 siblings, 3 replies; 55+ messages in thread
From: Joao Martins @ 2023-06-22 21:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins, Yi Liu

From: Yi Liu <yi.l.liu@intel.com>

Refactor pci_device_iommu_address_space() and move the
code that fetches the device bus and iommu bus into its
own private helper pci_device_get_iommu_bus_devfn().

This is in preparation to introduce pci_device_iommu_get_attr()
which will need to use it too.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
[joao: Commit message, and better splitting]
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
Splitted from v1:
https://lore.kernel.org/all/20210302203827.437645-6-yi.l.liu@intel.com/
---
 hw/pci/pci.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 4e32c09e81d6..90ae92a43d85 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2632,8 +2632,8 @@ static void pci_device_class_base_init(ObjectClass *klass, void *data)
         assert(conventional || pcie || cxl);
     }
 }
-
-AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
+static void pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **pdevbus,
+                                           PCIBus **pbus, uint8_t *pdevfn)
 {
     PCIBus *bus = pci_get_bus(dev);
     PCIBus *iommu_bus = bus;
@@ -2686,6 +2686,18 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
 
         iommu_bus = parent_bus;
     }
+
+    *pdevbus = bus;
+    *pbus = iommu_bus;
+    *pdevfn = devfn;
+}
+
+AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
+{
+    PCIBus *bus, *iommu_bus;
+    uint8_t devfn;
+
+    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
     if (!pci_bus_bypass_iommu(bus) && iommu_bus) {
         if (iommu_bus->iommu_fn) {
            return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
-- 
2.17.2



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v4 03/15] hw/pci: Introduce pci_device_iommu_get_attr()
  2023-06-22 21:48 [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
  2023-06-22 21:48 ` [PATCH v4 01/15] hw/pci: Add a pci_setup_iommu_ops() helper Joao Martins
  2023-06-22 21:48 ` [PATCH v4 02/15] hw/pci: Refactor pci_device_iommu_address_space() Joao Martins
@ 2023-06-22 21:48 ` Joao Martins
  2023-06-22 21:48 ` [PATCH v4 04/15] intel-iommu: Switch to pci_setup_iommu_ops() Joao Martins
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 55+ messages in thread
From: Joao Martins @ 2023-06-22 21:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins, Yi Liu

From: Yi Liu <yi.l.liu@intel.com>

Introduce pci_device_iommu_get_attr() to get vIOMMU attributes
from the PCI device.

This is in preparation to ask if vIOMMU has dma translation enabled
and also to get IOVA boundaries.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
[joao: Massage commit message; add one more argument in
 pci_device_get_iommu_bus_devfn(); rename to pci_device_iommu_get_attr()
 to align with the other already namespaced function. ]
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
follow-up version from:
https://lore.kernel.org/all/20210302203827.437645-6-yi.l.liu@intel.com/
---
 include/hw/pci/pci.h |  4 ++++
 hw/pci/pci.c         | 16 ++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index f59aef5a329a..10c81287b6b3 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -372,8 +372,12 @@ typedef struct PCIIOMMUOps PCIIOMMUOps;
 struct PCIIOMMUOps {
     AddressSpace * (*get_address_space)(PCIBus *bus,
                                 void *opaque, int32_t devfn);
+    int (*get_iommu_attr)(PCIBus *bus, void *opaque, int32_t devfn,
+                          enum IOMMUMemoryRegionAttr attr, void *data);
 };
 void pci_setup_iommu_ops(PCIBus *bus, const PCIIOMMUOps *iommu_ops, void *opaque);
+int pci_device_iommu_get_attr(PCIDevice *dev, enum IOMMUMemoryRegionAttr attr,
+                              void *data);
 
 pcibus_t pci_bar_address(PCIDevice *d,
                          int reg, uint8_t type, pcibus_t size);
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 90ae92a43d85..91ba6f0927a4 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2710,6 +2710,22 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
     return &address_space_memory;
 }
 
+int pci_device_iommu_get_attr(PCIDevice *dev, enum IOMMUMemoryRegionAttr attr,
+                              void *data)
+{
+    PCIBus *bus, *iommu_bus;
+    uint8_t devfn;
+
+    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
+    if (!pci_bus_bypass_iommu(bus) && iommu_bus &&
+        iommu_bus->iommu_ops && iommu_bus->iommu_ops->get_iommu_attr) {
+        return iommu_bus->iommu_ops->get_iommu_attr(bus, iommu_bus->iommu_opaque,
+                                                    devfn, attr, data);
+    }
+
+    return -ENOENT;
+}
+
 void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque)
 {
     bus->iommu_fn = fn;
-- 
2.17.2



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v4 04/15] intel-iommu: Switch to pci_setup_iommu_ops()
  2023-06-22 21:48 [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
                   ` (2 preceding siblings ...)
  2023-06-22 21:48 ` [PATCH v4 03/15] hw/pci: Introduce pci_device_iommu_get_attr() Joao Martins
@ 2023-06-22 21:48 ` Joao Martins
  2023-06-22 21:48 ` [PATCH v4 05/15] memory/iommu: Add IOMMU_ATTR_DMA_TRANSLATION attribute Joao Martins
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 55+ messages in thread
From: Joao Martins @ 2023-06-22 21:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins, Yi Liu

From: Yi Liu <yi.l.liu@intel.com>

Use the PCI IOMMU setup function that supply a PCIIOMMUOps
argument. This is in preparation to support fetching vIOMMU
information via pci_device_get_iommu_attr() which will require
switching the driver to pci_setup_iommu_ops().

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
[joao: Split from the original patch]
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
Splitted from:
https://lore.kernel.org/all/20210302203827.437645-5-yi.l.liu@intel.com/#Z2e.:20210302203827.437645-5-yi.l.liu::40intel.com:1hw:i386:intel_iommu.c
---
 hw/i386/intel_iommu.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 94d52f4205d2..1606d1b952d0 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -4032,6 +4032,10 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &vtd_as->as;
 }
 
+static PCIIOMMUOps vtd_iommu_ops = {
+    .get_address_space = vtd_host_dma_iommu,
+};
+
 static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
 {
     X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
@@ -4155,7 +4159,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
                                       g_free, g_free);
     vtd_init(s);
     sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, Q35_HOST_BRIDGE_IOMMU_ADDR);
-    pci_setup_iommu(bus, vtd_host_dma_iommu, dev);
+    pci_setup_iommu_ops(bus, &vtd_iommu_ops, dev);
     /* Pseudo address space under root PCI bus. */
     x86ms->ioapic_as = vtd_host_dma_iommu(bus, s, Q35_PSEUDO_DEVFN_IOAPIC);
     qemu_add_machine_init_done_notifier(&vtd_machine_done_notify);
-- 
2.17.2



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v4 05/15] memory/iommu: Add IOMMU_ATTR_DMA_TRANSLATION attribute
  2023-06-22 21:48 [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
                   ` (3 preceding siblings ...)
  2023-06-22 21:48 ` [PATCH v4 04/15] intel-iommu: Switch to pci_setup_iommu_ops() Joao Martins
@ 2023-06-22 21:48 ` Joao Martins
  2023-10-06 13:08   ` Eric Auger
  2023-06-22 21:48 ` [PATCH v4 06/15] intel-iommu: Implement get_attr() method Joao Martins
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 55+ messages in thread
From: Joao Martins @ 2023-06-22 21:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

Add a new IOMMU attribute IOMMU_ATTR_DMA_TRANSLATION which indicates
whether the IOMMU supports DMA Translation.

This attribute will be used by VFIO device dirty page tracking so it can
restrict the IOVA under tracking to the memory map when vIOMMU is
enabled only for interrupt remapping without dma translation enabled.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/exec/memory.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 47c2e0221c35..5d6c2ab1f397 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -319,7 +319,8 @@ typedef struct MemoryRegionClass {
 
 
 enum IOMMUMemoryRegionAttr {
-    IOMMU_ATTR_SPAPR_TCE_FD
+    IOMMU_ATTR_SPAPR_TCE_FD,
+    IOMMU_ATTR_DMA_TRANSLATION,
 };
 
 /*
-- 
2.17.2



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v4 06/15] intel-iommu: Implement get_attr() method
  2023-06-22 21:48 [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
                   ` (4 preceding siblings ...)
  2023-06-22 21:48 ` [PATCH v4 05/15] memory/iommu: Add IOMMU_ATTR_DMA_TRANSLATION attribute Joao Martins
@ 2023-06-22 21:48 ` Joao Martins
  2023-09-08  6:23   ` Duan, Zhenzhong
  2023-10-02 15:23   ` Cédric Le Goater
  2023-06-22 21:48 ` [PATCH v4 07/15] vfio/common: Track whether DMA Translation is enabled on the vIOMMU Joao Martins
                   ` (10 subsequent siblings)
  16 siblings, 2 replies; 55+ messages in thread
From: Joao Martins @ 2023-06-22 21:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

Implement IOMMU MR get_attr() method and use the dma_translation
property to report the IOMMU_ATTR_DMA_TRANSLATION attribute.
Additionally add the necessary get_iommu_attr into the PCIIOMMUOps to
support pci_device_iommu_get_attr().

The callback in there acts as a IOMMU-specific address space walker
which will call get_attr in the IOMMUMemoryRegion backing the device to
fetch the desired attribute.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/intel_iommu.c | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 1606d1b952d0..ed2a46e008df 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3861,6 +3861,29 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
     return;
 }
 
+static int vtd_iommu_get_attr(IOMMUMemoryRegion *iommu_mr,
+                              enum IOMMUMemoryRegionAttr attr, void *data)
+{
+    VTDAddressSpace *vtd_as = container_of(iommu_mr, VTDAddressSpace, iommu);
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    int ret = 0;
+
+    switch (attr) {
+    case IOMMU_ATTR_DMA_TRANSLATION:
+    {
+        bool *enabled = (bool *)(uintptr_t) data;
+
+        *enabled = s->dma_translation;
+        break;
+    }
+    default:
+        ret = -EINVAL;
+        break;
+    }
+
+    return ret;
+}
+
 /* Do the initialization. It will also be called when reset, so pay
  * attention when adding new initialization stuff.
  */
@@ -4032,8 +4055,24 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &vtd_as->as;
 }
 
+static int vtd_get_iommu_attr(PCIBus *bus, void *opaque, int32_t devfn,
+                              enum IOMMUMemoryRegionAttr attr, void *data)
+{
+    IntelIOMMUState *s = opaque;
+    VTDAddressSpace *vtd_as;
+
+    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
+
+    vtd_as = vtd_find_add_as(s, bus, devfn, PCI_NO_PASID);
+    if (!vtd_as)
+	return -EINVAL;
+
+    return memory_region_iommu_get_attr(&vtd_as->iommu, attr, data);
+}
+
 static PCIIOMMUOps vtd_iommu_ops = {
     .get_address_space = vtd_host_dma_iommu,
+    .get_iommu_attr = vtd_get_iommu_attr,
 };
 
 static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
@@ -4197,6 +4236,7 @@ static void vtd_iommu_memory_region_class_init(ObjectClass *klass,
     imrc->translate = vtd_iommu_translate;
     imrc->notify_flag_changed = vtd_iommu_notify_flag_changed;
     imrc->replay = vtd_iommu_replay;
+    imrc->get_attr = vtd_iommu_get_attr;
 }
 
 static const TypeInfo vtd_iommu_memory_region_info = {
-- 
2.17.2



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v4 07/15] vfio/common: Track whether DMA Translation is enabled on the vIOMMU
  2023-06-22 21:48 [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
                   ` (5 preceding siblings ...)
  2023-06-22 21:48 ` [PATCH v4 06/15] intel-iommu: Implement get_attr() method Joao Martins
@ 2023-06-22 21:48 ` Joao Martins
  2023-07-09 15:10   ` Avihai Horon
  2023-10-06 13:09   ` Eric Auger
  2023-06-22 21:48 ` [PATCH v4 08/15] vfio/common: Relax vIOMMU detection when DMA translation is off Joao Martins
                   ` (9 subsequent siblings)
  16 siblings, 2 replies; 55+ messages in thread
From: Joao Martins @ 2023-06-22 21:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

vfio_get_group() allocates and fills the group/container/space on
success which will store the AddressSpace inside the VFIOSpace struct.
Use the newly added pci_device_iommu_get_attr() to see if DMA
translation is enabled or not. Assume that by default it is enabled.

Today, this means only intel-iommu supports it.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/hw/vfio/vfio-common.h |  1 +
 hw/vfio/pci.c                 | 15 ++++++++++++++-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index eed244f25f34..f41860988d6b 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -70,6 +70,7 @@ typedef struct VFIOMigration {
 
 typedef struct VFIOAddressSpace {
     AddressSpace *as;
+    bool no_dma_translation;
     QLIST_HEAD(, VFIOContainer) containers;
     QLIST_ENTRY(VFIOAddressSpace) list;
 } VFIOAddressSpace;
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 73874a94de12..8a98e6ffc480 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2900,6 +2900,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     VFIOPCIDevice *vdev = VFIO_PCI(pdev);
     VFIODevice *vbasedev = &vdev->vbasedev;
     VFIODevice *vbasedev_iter;
+    VFIOAddressSpace *space;
     VFIOGroup *group;
     char *tmp, *subsys, group_path[PATH_MAX], *group_name;
     Error *err = NULL;
@@ -2907,7 +2908,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     struct stat st;
     int groupid;
     int i, ret;
-    bool is_mdev;
+    bool is_mdev, dma_translation;
     char uuid[UUID_FMT_LEN];
     char *name;
 
@@ -2961,6 +2962,18 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         goto error;
     }
 
+    space = group->container->space;
+
+    /*
+     * Support for toggling DMA translation is optional.
+     * By default, DMA translation is assumed to be enabled i.e.
+     * space::no_dma_translation is 0.
+     */
+    dma_translation = true;
+    pci_device_iommu_get_attr(pdev, IOMMU_ATTR_DMA_TRANSLATION,
+                              &dma_translation);
+    space->no_dma_translation = !dma_translation;
+
     QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
         if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
             error_setg(errp, "device is already attached");
-- 
2.17.2



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v4 08/15] vfio/common: Relax vIOMMU detection when DMA translation is off
  2023-06-22 21:48 [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
                   ` (6 preceding siblings ...)
  2023-06-22 21:48 ` [PATCH v4 07/15] vfio/common: Track whether DMA Translation is enabled on the vIOMMU Joao Martins
@ 2023-06-22 21:48 ` Joao Martins
  2023-06-22 21:48 ` [PATCH v4 09/15] memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute Joao Martins
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 55+ messages in thread
From: Joao Martins @ 2023-06-22 21:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

Relax the vIOMMU migration blocker when the underlying IOMMU reports DMA
translation disabled. When it is disabled there will be no DMA mappings
via the vIOMMU and the guest can only use it for Interrupt Remapping.

The latter is done via vfio_viommu_preset() return value where in
addition to validating that the address space is memory, we also check
whether the vIOMMU backing the PCI device has DMA translation on. It
is assumed to be enabled, if the IOMMU model does not support toggling
on/off the dma-translation property.

Intel IOMMU right now is the only case supporting, although AMD IOMMU
can in theory provide the same functionality.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/vfio/common.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index fa8fd949b1cf..17c1d882e221 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -419,7 +419,8 @@ static bool vfio_viommu_preset(void)
     VFIOAddressSpace *space;
 
     QLIST_FOREACH(space, &vfio_address_spaces, list) {
-        if (space->as != &address_space_memory) {
+        if ((space->as != &address_space_memory) &&
+            !space->no_dma_translation) {
             return true;
         }
     }
-- 
2.17.2



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v4 09/15] memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute
  2023-06-22 21:48 [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
                   ` (7 preceding siblings ...)
  2023-06-22 21:48 ` [PATCH v4 08/15] vfio/common: Relax vIOMMU detection when DMA translation is off Joao Martins
@ 2023-06-22 21:48 ` Joao Martins
  2023-06-22 21:48 ` [PATCH v4 10/15] intel-iommu: Implement IOMMU_ATTR_MAX_IOVA get_attr() attribute Joao Martins
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 55+ messages in thread
From: Joao Martins @ 2023-06-22 21:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

From: Avihai Horon <avihaih@nvidia.com>

Add a new IOMMU attribute IOMMU_ATTR_MAX_IOVA which indicates the
maximal IOVA that an IOMMU can use.

This attribute will be used by VFIO device dirty page tracking so it can
track the entire IOVA space when needed (i.e. when vIOMMU is enabled).

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Acked-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 5d6c2ab1f397..742bff82dc77 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -321,6 +321,7 @@ typedef struct MemoryRegionClass {
 enum IOMMUMemoryRegionAttr {
     IOMMU_ATTR_SPAPR_TCE_FD,
     IOMMU_ATTR_DMA_TRANSLATION,
+    IOMMU_ATTR_MAX_IOVA,
 };
 
 /*
-- 
2.17.2



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v4 10/15] intel-iommu: Implement IOMMU_ATTR_MAX_IOVA get_attr() attribute
  2023-06-22 21:48 [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
                   ` (8 preceding siblings ...)
  2023-06-22 21:48 ` [PATCH v4 09/15] memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute Joao Martins
@ 2023-06-22 21:48 ` Joao Martins
  2023-07-09 15:17   ` Avihai Horon
  2023-06-22 21:48 ` [PATCH v4 11/15] vfio/common: Move dirty tracking ranges update to helper Joao Martins
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 55+ messages in thread
From: Joao Martins @ 2023-06-22 21:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

From: Avihai Horon <avihaih@nvidia.com>

Implement get_attr() method and use the address width property to report
the IOMMU_ATTR_MAX_IOVA attribute.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/intel_iommu.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index ed2a46e008df..989993e303a6 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3876,6 +3876,13 @@ static int vtd_iommu_get_attr(IOMMUMemoryRegion *iommu_mr,
         *enabled = s->dma_translation;
         break;
     }
+    case IOMMU_ATTR_MAX_IOVA:
+    {
+        hwaddr *max_iova = (hwaddr *)(uintptr_t) data;
+
+        *max_iova = MAKE_64BIT_MASK(0, s->aw_bits);;
+        break;
+    }
     default:
         ret = -EINVAL;
         break;
-- 
2.17.2



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v4 11/15] vfio/common: Move dirty tracking ranges update to helper
  2023-06-22 21:48 [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
                   ` (9 preceding siblings ...)
  2023-06-22 21:48 ` [PATCH v4 10/15] intel-iommu: Implement IOMMU_ATTR_MAX_IOVA get_attr() attribute Joao Martins
@ 2023-06-22 21:48 ` Joao Martins
  2023-06-22 21:48 ` [PATCH v4 12/15] vfio/common: Support device dirty page tracking with vIOMMU Joao Martins
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 55+ messages in thread
From: Joao Martins @ 2023-06-22 21:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

Separate the changes that updates the ranges from the listener, to
make it reusable in preparation to expand its use to vIOMMU support.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/vfio/common.c | 36 +++++++++++++++++++++---------------
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 17c1d882e221..ecfb9afb3fb6 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1413,20 +1413,10 @@ typedef struct VFIODirtyRangesListener {
     MemoryListener listener;
 } VFIODirtyRangesListener;
 
-static void vfio_dirty_tracking_update(MemoryListener *listener,
-                                       MemoryRegionSection *section)
+static void vfio_dirty_tracking_update(hwaddr iova, hwaddr end,
+                                       VFIODirtyRanges *range)
 {
-    VFIODirtyRangesListener *dirty = container_of(listener,
-                                                  VFIODirtyRangesListener,
-                                                  listener);
-    VFIODirtyRanges *range = &dirty->ranges;
-    hwaddr iova, end, *min, *max;
-
-    if (!vfio_listener_valid_section(section, "tracking_update") ||
-        !vfio_get_section_iova_range(dirty->container, section,
-                                     &iova, &end, NULL)) {
-        return;
-    }
+    hwaddr *min, *max;
 
     /*
      * The address space passed to the dirty tracker is reduced to two ranges:
@@ -1450,12 +1440,28 @@ static void vfio_dirty_tracking_update(MemoryListener *listener,
     }
 
     trace_vfio_device_dirty_tracking_update(iova, end, *min, *max);
-    return;
+}
+
+static void vfio_listener_dirty_tracking_update(MemoryListener *listener,
+                                                MemoryRegionSection *section)
+{
+    VFIODirtyRangesListener *dirty = container_of(listener,
+                                                  VFIODirtyRangesListener,
+                                                  listener);
+    hwaddr iova, end;
+
+    if (!vfio_listener_valid_section(section, "tracking_update") ||
+        !vfio_get_section_iova_range(dirty->container, section,
+                                     &iova, &end, NULL)) {
+        return;
+    }
+
+    vfio_dirty_tracking_update(iova, end, &dirty->ranges);
 }
 
 static const MemoryListener vfio_dirty_tracking_listener = {
     .name = "vfio-tracking",
-    .region_add = vfio_dirty_tracking_update,
+    .region_add = vfio_listener_dirty_tracking_update,
 };
 
 static void vfio_dirty_tracking_init(VFIOContainer *container,
-- 
2.17.2



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v4 12/15] vfio/common: Support device dirty page tracking with vIOMMU
  2023-06-22 21:48 [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
                   ` (10 preceding siblings ...)
  2023-06-22 21:48 ` [PATCH v4 11/15] vfio/common: Move dirty tracking ranges update to helper Joao Martins
@ 2023-06-22 21:48 ` Joao Martins
  2023-07-09 15:24   ` Avihai Horon
  2023-09-08  6:11   ` Duan, Zhenzhong
  2023-06-22 21:48 ` [PATCH v4 13/15] vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap() Joao Martins
                   ` (4 subsequent siblings)
  16 siblings, 2 replies; 55+ messages in thread
From: Joao Martins @ 2023-06-22 21:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

Currently, device dirty page tracking with vIOMMU is not supported,
and a blocker is added and the migration is prevented.

When vIOMMU is used, IOVA ranges are DMA mapped/unmapped on the fly as
requesting by the vIOMMU. These IOVA ranges can potentially be mapped
anywhere in the vIOMMU IOVA space as advertised by the VMM.

To support device dirty tracking when vIOMMU enabled instead create the
dirty ranges based on the vIOMMU provided limits, which leads to the
tracking of the whole IOVA space regardless of what devices use.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/hw/vfio/vfio-common.h |  1 +
 hw/vfio/common.c              | 58 +++++++++++++++++++++++++++++------
 hw/vfio/pci.c                 |  7 +++++
 3 files changed, 56 insertions(+), 10 deletions(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index f41860988d6b..c4bafad084b4 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -71,6 +71,7 @@ typedef struct VFIOMigration {
 typedef struct VFIOAddressSpace {
     AddressSpace *as;
     bool no_dma_translation;
+    hwaddr max_iova;
     QLIST_HEAD(, VFIOContainer) containers;
     QLIST_ENTRY(VFIOAddressSpace) list;
 } VFIOAddressSpace;
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index ecfb9afb3fb6..85fddef24026 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -428,6 +428,25 @@ static bool vfio_viommu_preset(void)
     return false;
 }
 
+static int vfio_viommu_get_max_iova(hwaddr *max_iova)
+{
+    VFIOAddressSpace *space;
+
+    *max_iova = 0;
+
+    QLIST_FOREACH(space, &vfio_address_spaces, list) {
+        if (space->as == &address_space_memory) {
+            continue;
+        }
+
+        if (*max_iova < space->max_iova) {
+            *max_iova = space->max_iova;
+        }
+    }
+
+    return *max_iova == 0;
+}
+
 int vfio_block_giommu_migration(Error **errp)
 {
     int ret;
@@ -1464,10 +1483,11 @@ static const MemoryListener vfio_dirty_tracking_listener = {
     .region_add = vfio_listener_dirty_tracking_update,
 };
 
-static void vfio_dirty_tracking_init(VFIOContainer *container,
+static int vfio_dirty_tracking_init(VFIOContainer *container,
                                      VFIODirtyRanges *ranges)
 {
     VFIODirtyRangesListener dirty;
+    int ret;
 
     memset(&dirty, 0, sizeof(dirty));
     dirty.ranges.min32 = UINT32_MAX;
@@ -1475,17 +1495,29 @@ static void vfio_dirty_tracking_init(VFIOContainer *container,
     dirty.listener = vfio_dirty_tracking_listener;
     dirty.container = container;
 
-    memory_listener_register(&dirty.listener,
-                             container->space->as);
+    if (vfio_viommu_preset()) {
+        hwaddr iommu_max_iova;
+
+        ret = vfio_viommu_get_max_iova(&iommu_max_iova);
+        if (ret) {
+            return -EINVAL;
+        }
+
+        vfio_dirty_tracking_update(0, iommu_max_iova, &dirty.ranges);
+    } else {
+        memory_listener_register(&dirty.listener,
+                                 container->space->as);
+        /*
+         * The memory listener is synchronous, and used to calculate the range
+         * to dirty tracking. Unregister it after we are done as we are not
+         * interested in any follow-up updates.
+         */
+        memory_listener_unregister(&dirty.listener);
+    }
 
     *ranges = dirty.ranges;
 
-    /*
-     * The memory listener is synchronous, and used to calculate the range
-     * to dirty tracking. Unregister it after we are done as we are not
-     * interested in any follow-up updates.
-     */
-    memory_listener_unregister(&dirty.listener);
+    return 0;
 }
 
 static void vfio_devices_dma_logging_stop(VFIOContainer *container)
@@ -1590,7 +1622,13 @@ static int vfio_devices_dma_logging_start(VFIOContainer *container)
     VFIOGroup *group;
     int ret = 0;
 
-    vfio_dirty_tracking_init(container, &ranges);
+    ret = vfio_dirty_tracking_init(container, &ranges);
+    if (ret) {
+        error_report("Failed to init DMA logging ranges, err %d",
+                      ret);
+        return -EOPNOTSUPP;
+    }
+
     feature = vfio_device_feature_dma_logging_start_create(container,
                                                            &ranges);
     if (!feature) {
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 8a98e6ffc480..3bda5618c5b5 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2974,6 +2974,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
                               &dma_translation);
     space->no_dma_translation = !dma_translation;
 
+    /*
+     * Support for advertised IOMMU address space boundaries is optional.
+     * By default, it is not advertised i.e. space::max_iova is 0.
+     */
+    pci_device_iommu_get_attr(pdev, IOMMU_ATTR_MAX_IOVA,
+                              &space->max_iova);
+
     QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
         if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
             error_setg(errp, "device is already attached");
-- 
2.17.2



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v4 13/15] vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap()
  2023-06-22 21:48 [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
                   ` (11 preceding siblings ...)
  2023-06-22 21:48 ` [PATCH v4 12/15] vfio/common: Support device dirty page tracking with vIOMMU Joao Martins
@ 2023-06-22 21:48 ` Joao Martins
  2023-06-22 21:48 ` [PATCH v4 14/15] vfio/common: Optimize device dirty page tracking with vIOMMU Joao Martins
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 55+ messages in thread
From: Joao Martins @ 2023-06-22 21:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

From: Avihai Horon <avihaih@nvidia.com>

Extract vIOMMU code from vfio_sync_dirty_bitmap() to a new function and
restructure the code.

This is done in preparation for optimizing vIOMMU deviice dirty page
tracking. No functional changes intended.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/vfio/common.c | 63 +++++++++++++++++++++++++++++-------------------
 1 file changed, 38 insertions(+), 25 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 85fddef24026..c530e9d87f21 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1914,37 +1914,50 @@ static int vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container,
                                                 &vrdl);
 }
 
+static int vfio_sync_iommu_dirty_bitmap(VFIOContainer *container,
+                                        MemoryRegionSection *section)
+{
+    VFIOGuestIOMMU *giommu;
+    bool found = false;
+    Int128 llend;
+    vfio_giommu_dirty_notifier gdn;
+    int idx;
+
+    QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
+        if (MEMORY_REGION(giommu->iommu_mr) == section->mr &&
+            giommu->n.start == section->offset_within_region) {
+            found = true;
+            break;
+        }
+    }
+
+    if (!found) {
+        return 0;
+    }
+
+    gdn.giommu = giommu;
+    idx = memory_region_iommu_attrs_to_index(giommu->iommu_mr,
+                                             MEMTXATTRS_UNSPECIFIED);
+
+    llend = int128_add(int128_make64(section->offset_within_region),
+                       section->size);
+    llend = int128_sub(llend, int128_one());
+
+    iommu_notifier_init(&gdn.n, vfio_iommu_map_dirty_notify, IOMMU_NOTIFIER_MAP,
+                        section->offset_within_region, int128_get64(llend),
+                        idx);
+    memory_region_iommu_replay(giommu->iommu_mr, &gdn.n);
+
+    return 0;
+}
+
 static int vfio_sync_dirty_bitmap(VFIOContainer *container,
                                   MemoryRegionSection *section)
 {
     ram_addr_t ram_addr;
 
     if (memory_region_is_iommu(section->mr)) {
-        VFIOGuestIOMMU *giommu;
-
-        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
-            if (MEMORY_REGION(giommu->iommu_mr) == section->mr &&
-                giommu->n.start == section->offset_within_region) {
-                Int128 llend;
-                vfio_giommu_dirty_notifier gdn = { .giommu = giommu };
-                int idx = memory_region_iommu_attrs_to_index(giommu->iommu_mr,
-                                                       MEMTXATTRS_UNSPECIFIED);
-
-                llend = int128_add(int128_make64(section->offset_within_region),
-                                   section->size);
-                llend = int128_sub(llend, int128_one());
-
-                iommu_notifier_init(&gdn.n,
-                                    vfio_iommu_map_dirty_notify,
-                                    IOMMU_NOTIFIER_MAP,
-                                    section->offset_within_region,
-                                    int128_get64(llend),
-                                    idx);
-                memory_region_iommu_replay(giommu->iommu_mr, &gdn.n);
-                break;
-            }
-        }
-        return 0;
+        return vfio_sync_iommu_dirty_bitmap(container, section);
     } else if (memory_region_has_ram_discard_manager(section->mr)) {
         return vfio_sync_ram_discard_listener_dirty_bitmap(container, section);
     }
-- 
2.17.2



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v4 14/15] vfio/common: Optimize device dirty page tracking with vIOMMU
  2023-06-22 21:48 [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
                   ` (12 preceding siblings ...)
  2023-06-22 21:48 ` [PATCH v4 13/15] vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap() Joao Martins
@ 2023-06-22 21:48 ` Joao Martins
  2023-06-22 21:48 ` [PATCH v4 15/15] vfio/common: Block migration with vIOMMUs without address width limits Joao Martins
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 55+ messages in thread
From: Joao Martins @ 2023-06-22 21:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

From: Avihai Horon <avihaih@nvidia.com>

When vIOMMU is enabled, syncing dirty page bitmaps is done by replaying
the vIOMMU mappings and querying the dirty bitmap for each mapping.

With device dirty tracking this causes a lot of overhead, since the HW
is queried many times (even with small idle guest this can end up with
thousands of calls to HW).

Optimize this by de-coupling dirty bitmap query from vIOMMU replay.
Now a single dirty bitmap is queried per vIOMMU MR section, which is
then used for all corresponding vIOMMU mappings within that MR section.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/vfio/common.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 72 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index c530e9d87f21..62f91e8e102d 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1832,8 +1832,36 @@ out:
 typedef struct {
     IOMMUNotifier n;
     VFIOGuestIOMMU *giommu;
+    VFIOBitmap vbmap;
 } vfio_giommu_dirty_notifier;
 
+static int vfio_iommu_set_dirty_bitmap(VFIOContainer *container,
+                                       vfio_giommu_dirty_notifier *gdn,
+                                       hwaddr iova, hwaddr size,
+                                       ram_addr_t ram_addr)
+{
+    VFIOBitmap *vbmap = &gdn->vbmap;
+    VFIOBitmap dst_vbmap;
+    hwaddr start_iova = REAL_HOST_PAGE_ALIGN(gdn->n.start);
+    hwaddr copy_offset;
+    int ret;
+
+    ret = vfio_bitmap_alloc(&dst_vbmap, size);
+    if (ret) {
+        return -ENOMEM;
+    }
+
+    copy_offset = (iova - start_iova) / qemu_real_host_page_size();
+    bitmap_copy_with_src_offset(dst_vbmap.bitmap, vbmap->bitmap, copy_offset,
+                                dst_vbmap.pages);
+
+    cpu_physical_memory_set_dirty_lebitmap(dst_vbmap.bitmap, ram_addr,
+                                           dst_vbmap.pages);
+    g_free(dst_vbmap.bitmap);
+
+    return 0;
+}
+
 static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
 {
     vfio_giommu_dirty_notifier *gdn = container_of(n,
@@ -1854,8 +1882,15 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
 
     rcu_read_lock();
     if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) {
-        ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
-                                    translated_addr);
+        if (gdn->vbmap.bitmap) {
+            ret = vfio_iommu_set_dirty_bitmap(container, gdn, iova,
+                                              iotlb->addr_mask + 1,
+                                              translated_addr);
+        } else {
+            ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
+                                        translated_addr);
+        }
+
         if (ret) {
             error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx") = %d (%s)",
@@ -1936,6 +1971,7 @@ static int vfio_sync_iommu_dirty_bitmap(VFIOContainer *container,
     }
 
     gdn.giommu = giommu;
+    gdn.vbmap.bitmap = NULL;
     idx = memory_region_iommu_attrs_to_index(giommu->iommu_mr,
                                              MEMTXATTRS_UNSPECIFIED);
 
@@ -1943,10 +1979,44 @@ static int vfio_sync_iommu_dirty_bitmap(VFIOContainer *container,
                        section->size);
     llend = int128_sub(llend, int128_one());
 
+    /*
+     * Optimize device dirty tracking if the MR section is at least partially
+     * tracked. Optimization is done by querying a single dirty bitmap for the
+     * entire range instead of querying dirty bitmap for each vIOMMU mapping.
+     */
+    if (vfio_devices_all_device_dirty_tracking(container)) {
+        hwaddr start = REAL_HOST_PAGE_ALIGN(section->offset_within_region);
+        hwaddr end = int128_get64(llend);
+        hwaddr iommu_max_iova;
+        hwaddr size;
+        int ret;
+
+        ret = vfio_viommu_get_max_iova(&iommu_max_iova);
+        if (ret) {
+            return -EINVAL;
+        }
+
+        size = REAL_HOST_PAGE_ALIGN(MIN(iommu_max_iova, end) - start);
+
+        ret = vfio_bitmap_alloc(&gdn.vbmap, size);
+        if (ret) {
+            return -ENOMEM;
+        }
+
+        ret = vfio_devices_query_dirty_bitmap(container, &gdn.vbmap,
+                                              start, size);
+        if (ret) {
+            g_free(gdn.vbmap.bitmap);
+
+            return ret;
+        }
+    }
+
     iommu_notifier_init(&gdn.n, vfio_iommu_map_dirty_notify, IOMMU_NOTIFIER_MAP,
                         section->offset_within_region, int128_get64(llend),
                         idx);
     memory_region_iommu_replay(giommu->iommu_mr, &gdn.n);
+    g_free(gdn.vbmap.bitmap);
 
     return 0;
 }
-- 
2.17.2



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v4 15/15] vfio/common: Block migration with vIOMMUs without address width limits
  2023-06-22 21:48 [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
                   ` (13 preceding siblings ...)
  2023-06-22 21:48 ` [PATCH v4 14/15] vfio/common: Optimize device dirty page tracking with vIOMMU Joao Martins
@ 2023-06-22 21:48 ` Joao Martins
  2023-09-08  6:28   ` Duan, Zhenzhong
  2023-06-22 22:18 ` [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
  2023-09-07 11:11 ` Joao Martins
  16 siblings, 1 reply; 55+ messages in thread
From: Joao Martins @ 2023-06-22 21:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Joao Martins

Only block the case when the underlying vIOMMU model does not report any
address space limits, in addition to DMA translation being off or no
vIOMMU present. The limits are needed such that can define the IOVA limits
that arm the device dirty tracker.

Additionally, reword the migration blocker error message to clarify that
we the configured vIOMMU does not support migration, as opposed to
implying that just being there blocks migration.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/vfio/common.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 62f91e8e102d..c3cc0dd47044 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -449,15 +449,18 @@ static int vfio_viommu_get_max_iova(hwaddr *max_iova)
 
 int vfio_block_giommu_migration(Error **errp)
 {
+    hwaddr max;
     int ret;
 
     if (giommu_migration_blocker ||
-        !vfio_viommu_preset()) {
+        !vfio_viommu_preset() ||
+        (vfio_viommu_preset() && !vfio_viommu_get_max_iova(&max))) {
         return 0;
     }
 
     error_setg(&giommu_migration_blocker,
-               "Migration is currently not supported with vIOMMU enabled");
+               "Migration with vIOMMU is currently not supported "
+               "without vIOMMU address space boundaries");
     ret = migrate_add_blocker(giommu_migration_blocker, errp);
     if (ret < 0) {
         error_free(giommu_migration_blocker);
-- 
2.17.2



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU
  2023-06-22 21:48 [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
                   ` (14 preceding siblings ...)
  2023-06-22 21:48 ` [PATCH v4 15/15] vfio/common: Block migration with vIOMMUs without address width limits Joao Martins
@ 2023-06-22 22:18 ` Joao Martins
  2023-09-07 11:11 ` Joao Martins
  16 siblings, 0 replies; 55+ messages in thread
From: Joao Martins @ 2023-06-22 22:18 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Yi Liu

On 22/06/2023 22:48, Joao Martins wrote:
> Hey,
> 
> This series introduces support for vIOMMU with VFIO device migration,
> particurlarly related to how we do the dirty page tracking.
> 
> Today vIOMMUs serve two purposes: 1) enable interrupt remaping 2)
> provide dma translation services for guests to provide some form of
> guest kernel managed DMA e.g. for nested virt based usage; (1) is specially
> required for big VMs with VFs with more than 255 vcpus. We tackle both
> and remove the migration blocker when vIOMMU is present provided the
> conditions are met. I have both use-cases here in one series, but I am happy
> to tackle them in separate series.
> 
> As I found out we don't necessarily need to expose the whole vIOMMU
> functionality in order to just support interrupt remapping. x86 IOMMUs
> on Windows Server 2018[2] and Linux >=5.10, with qemu 7.1+ (or really
> Linux guests with commit c40aaaac10 and since qemu commit 8646d9c773d8)
> can instantiate a IOMMU just for interrupt remapping without needing to
> be advertised/support DMA translation. AMD IOMMU in theory can provide
> the same, but Linux doesn't quite support the IR-only part there yet,
> only intel-iommu.
> 
> The series is organized as following:
> 
> Patches 1-5: Today we can't gather vIOMMU details before the guest
> establishes their first DMA mapping via the vIOMMU. So these first four
> patches add a way for vIOMMUs to be asked of their properties at start
> of day. I choose the least churn possible way for now (as opposed to a
> treewide conversion) and allow easy conversion a posteriori. As
> suggested by Peter Xu[7], I have ressurected Yi's patches[5][6] which
> allows us to fetch PCI backing vIOMMU attributes, without necessarily
> tieing the caller (VFIO or anyone else) to an IOMMU MR like I
> was doing in v3.
> 
> Patches 6-8: Handle configs with vIOMMU interrupt remapping but without
> DMA translation allowed. Today the 'dma-translation' attribute is
> x86-iommu only, but the way this series is structured nothing stops from
> other vIOMMUs supporting it too as long as they use
> pci_setup_iommu_ops() and the necessary IOMMU MR get_attr attributes
> are handled. The blocker is thus relaxed when vIOMMUs are able to toggle
> the toggle/report DMA_TRANSLATION attribute. With the patches up to this set,
> we've then tackled item (1) of the second paragraph.
> 
> Patches 9-15: Simplified a lot from v2 (patch 9) to only track the complete
> IOVA address space, leveraging the logic we use to compose the dirty ranges.
> The blocker is once again relaxed for vIOMMUs that advertise their IOVA
> addressing limits. This tackles item (2). So far I mainly use it with
> intel-iommu, although I have a small set of patches for virtio-iommu per
> Alex's suggestion in v2.
> 
> Comments, suggestions welcome. Thanks for the review!
> 

By mistake, I've spuriously sent this a little too early. There's some styling
errors in patch 1, 6 and 10. I've fixed the problems already, but I won't respin
the series as I don't wanna patch bomb folks again. I will give at least a week
or 2 before I do that. My apologies :/

Meanwhile, here's the diff of those fixes:

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 989993e303a6..7fad59126215 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3880,7 +3880,7 @@ static int vtd_iommu_get_attr(IOMMUMemoryRegion *iommu_mr,
     {
         hwaddr *max_iova = (hwaddr *)(uintptr_t) data;

-        *max_iova = MAKE_64BIT_MASK(0, s->aw_bits);;
+        *max_iova = MAKE_64BIT_MASK(0, s->aw_bits);
         break;
     }
     default:
@@ -4071,8 +4071,9 @@ static int vtd_get_iommu_attr(PCIBus *bus, void *opaque,
int32_t devfn,
     assert(0 <= devfn && devfn < PCI_DEVFN_MAX);

     vtd_as = vtd_find_add_as(s, bus, devfn, PCI_NO_PASID);
-    if (!vtd_as)
-       return -EINVAL;
+    if (!vtd_as) {
+        return -EINVAL;
+    }

     return memory_region_iommu_get_attr(&vtd_as->iommu, attr, data);
 }
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 91ba6f0927a4..0cf000a9c1ff 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2700,10 +2700,10 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
     pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
     if (!pci_bus_bypass_iommu(bus) && iommu_bus) {
         if (iommu_bus->iommu_fn) {
-           return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
+            return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
         } else if (iommu_bus->iommu_ops &&
                    iommu_bus->iommu_ops->get_address_space) {
-           return iommu_bus->iommu_ops->get_address_space(bus,
+            return iommu_bus->iommu_ops->get_address_space(bus,
                                            iommu_bus->iommu_opaque, devfn);
         }
     }


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 07/15] vfio/common: Track whether DMA Translation is enabled on the vIOMMU
  2023-06-22 21:48 ` [PATCH v4 07/15] vfio/common: Track whether DMA Translation is enabled on the vIOMMU Joao Martins
@ 2023-07-09 15:10   ` Avihai Horon
  2023-07-10 13:44     ` Joao Martins
  2023-10-06 13:09   ` Eric Auger
  1 sibling, 1 reply; 55+ messages in thread
From: Avihai Horon @ 2023-07-09 15:10 UTC (permalink / raw)
  To: Joao Martins, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Jason Gunthorpe


On 23/06/2023 0:48, Joao Martins wrote:
> External email: Use caution opening links or attachments
>
>
> vfio_get_group() allocates and fills the group/container/space on
> success which will store the AddressSpace inside the VFIOSpace struct.
> Use the newly added pci_device_iommu_get_attr() to see if DMA
> translation is enabled or not. Assume that by default it is enabled.
>
> Today, this means only intel-iommu supports it.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   include/hw/vfio/vfio-common.h |  1 +
>   hw/vfio/pci.c                 | 15 ++++++++++++++-
>   2 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index eed244f25f34..f41860988d6b 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -70,6 +70,7 @@ typedef struct VFIOMigration {
>
>   typedef struct VFIOAddressSpace {
>       AddressSpace *as;
> +    bool no_dma_translation;

I find this negation a bit confusing, especially when below local 
variable is "dma_translation" (there is also double negation in next patch).
Maybe rename to "dma_translation" or "have_dma_translation"?

Thanks.

>       QLIST_HEAD(, VFIOContainer) containers;
>       QLIST_ENTRY(VFIOAddressSpace) list;
>   } VFIOAddressSpace;
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 73874a94de12..8a98e6ffc480 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2900,6 +2900,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>       VFIOPCIDevice *vdev = VFIO_PCI(pdev);
>       VFIODevice *vbasedev = &vdev->vbasedev;
>       VFIODevice *vbasedev_iter;
> +    VFIOAddressSpace *space;
>       VFIOGroup *group;
>       char *tmp, *subsys, group_path[PATH_MAX], *group_name;
>       Error *err = NULL;
> @@ -2907,7 +2908,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>       struct stat st;
>       int groupid;
>       int i, ret;
> -    bool is_mdev;
> +    bool is_mdev, dma_translation;
>       char uuid[UUID_FMT_LEN];
>       char *name;
>
> @@ -2961,6 +2962,18 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>           goto error;
>       }
>
> +    space = group->container->space;
> +
> +    /*
> +     * Support for toggling DMA translation is optional.
> +     * By default, DMA translation is assumed to be enabled i.e.
> +     * space::no_dma_translation is 0.
> +     */
> +    dma_translation = true;
> +    pci_device_iommu_get_attr(pdev, IOMMU_ATTR_DMA_TRANSLATION,
> +                              &dma_translation);
> +    space->no_dma_translation = !dma_translation;
> +
>       QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
>           if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
>               error_setg(errp, "device is already attached");
> --
> 2.17.2
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 10/15] intel-iommu: Implement IOMMU_ATTR_MAX_IOVA get_attr() attribute
  2023-06-22 21:48 ` [PATCH v4 10/15] intel-iommu: Implement IOMMU_ATTR_MAX_IOVA get_attr() attribute Joao Martins
@ 2023-07-09 15:17   ` Avihai Horon
  2023-07-10 13:44     ` Joao Martins
  0 siblings, 1 reply; 55+ messages in thread
From: Avihai Horon @ 2023-07-09 15:17 UTC (permalink / raw)
  To: Joao Martins, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Jason Gunthorpe


On 23/06/2023 0:48, Joao Martins wrote:
> External email: Use caution opening links or attachments
>
>
> From: Avihai Horon <avihaih@nvidia.com>
>
> Implement get_attr() method and use the address width property to report
> the IOMMU_ATTR_MAX_IOVA attribute.

Nit: get_attr() method was already implemented in patch #6.
Maybe just "Use address width property to report IOMMU_ATTR_MAX_IOVA 
attribute"?

Thanks.

>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   hw/i386/intel_iommu.c | 7 +++++++
>   1 file changed, 7 insertions(+)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index ed2a46e008df..989993e303a6 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -3876,6 +3876,13 @@ static int vtd_iommu_get_attr(IOMMUMemoryRegion *iommu_mr,
>           *enabled = s->dma_translation;
>           break;
>       }
> +    case IOMMU_ATTR_MAX_IOVA:
> +    {
> +        hwaddr *max_iova = (hwaddr *)(uintptr_t) data;
> +
> +        *max_iova = MAKE_64BIT_MASK(0, s->aw_bits);;
> +        break;
> +    }
>       default:
>           ret = -EINVAL;
>           break;
> --
> 2.17.2
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 12/15] vfio/common: Support device dirty page tracking with vIOMMU
  2023-06-22 21:48 ` [PATCH v4 12/15] vfio/common: Support device dirty page tracking with vIOMMU Joao Martins
@ 2023-07-09 15:24   ` Avihai Horon
  2023-07-10 13:49     ` Joao Martins
  2023-09-08  6:11   ` Duan, Zhenzhong
  1 sibling, 1 reply; 55+ messages in thread
From: Avihai Horon @ 2023-07-09 15:24 UTC (permalink / raw)
  To: Joao Martins, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Jason Gunthorpe


On 23/06/2023 0:48, Joao Martins wrote:
> External email: Use caution opening links or attachments
>
>
> Currently, device dirty page tracking with vIOMMU is not supported,
> and a blocker is added and the migration is prevented.
>
> When vIOMMU is used, IOVA ranges are DMA mapped/unmapped on the fly as
> requesting by the vIOMMU. These IOVA ranges can potentially be mapped
> anywhere in the vIOMMU IOVA space as advertised by the VMM.
>
> To support device dirty tracking when vIOMMU enabled instead create the
> dirty ranges based on the vIOMMU provided limits, which leads to the
> tracking of the whole IOVA space regardless of what devices use.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   include/hw/vfio/vfio-common.h |  1 +
>   hw/vfio/common.c              | 58 +++++++++++++++++++++++++++++------
>   hw/vfio/pci.c                 |  7 +++++
>   3 files changed, 56 insertions(+), 10 deletions(-)
>
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index f41860988d6b..c4bafad084b4 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -71,6 +71,7 @@ typedef struct VFIOMigration {
>   typedef struct VFIOAddressSpace {
>       AddressSpace *as;
>       bool no_dma_translation;
> +    hwaddr max_iova;
>       QLIST_HEAD(, VFIOContainer) containers;
>       QLIST_ENTRY(VFIOAddressSpace) list;
>   } VFIOAddressSpace;
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index ecfb9afb3fb6..85fddef24026 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -428,6 +428,25 @@ static bool vfio_viommu_preset(void)
>       return false;
>   }
>
> +static int vfio_viommu_get_max_iova(hwaddr *max_iova)
> +{
> +    VFIOAddressSpace *space;
> +
> +    *max_iova = 0;
> +
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        if (space->as == &address_space_memory) {
> +            continue;
> +        }
> +
> +        if (*max_iova < space->max_iova) {
> +            *max_iova = space->max_iova;
> +        }
> +    }

Looks like max_iova is a per VFIOAddressSpace property, so why do we 
need to iterate over all address spaces?

Thanks.

> +
> +    return *max_iova == 0;
> +}
> +
>   int vfio_block_giommu_migration(Error **errp)
>   {
>       int ret;
> @@ -1464,10 +1483,11 @@ static const MemoryListener vfio_dirty_tracking_listener = {
>       .region_add = vfio_listener_dirty_tracking_update,
>   };
>
> -static void vfio_dirty_tracking_init(VFIOContainer *container,
> +static int vfio_dirty_tracking_init(VFIOContainer *container,
>                                        VFIODirtyRanges *ranges)
>   {
>       VFIODirtyRangesListener dirty;
> +    int ret;
>
>       memset(&dirty, 0, sizeof(dirty));
>       dirty.ranges.min32 = UINT32_MAX;
> @@ -1475,17 +1495,29 @@ static void vfio_dirty_tracking_init(VFIOContainer *container,
>       dirty.listener = vfio_dirty_tracking_listener;
>       dirty.container = container;
>
> -    memory_listener_register(&dirty.listener,
> -                             container->space->as);
> +    if (vfio_viommu_preset()) {
> +        hwaddr iommu_max_iova;
> +
> +        ret = vfio_viommu_get_max_iova(&iommu_max_iova);
> +        if (ret) {
> +            return -EINVAL;
> +        }
> +
> +        vfio_dirty_tracking_update(0, iommu_max_iova, &dirty.ranges);
> +    } else {
> +        memory_listener_register(&dirty.listener,
> +                                 container->space->as);
> +        /*
> +         * The memory listener is synchronous, and used to calculate the range
> +         * to dirty tracking. Unregister it after we are done as we are not
> +         * interested in any follow-up updates.
> +         */
> +        memory_listener_unregister(&dirty.listener);
> +    }
>
>       *ranges = dirty.ranges;
>
> -    /*
> -     * The memory listener is synchronous, and used to calculate the range
> -     * to dirty tracking. Unregister it after we are done as we are not
> -     * interested in any follow-up updates.
> -     */
> -    memory_listener_unregister(&dirty.listener);
> +    return 0;
>   }
>
>   static void vfio_devices_dma_logging_stop(VFIOContainer *container)
> @@ -1590,7 +1622,13 @@ static int vfio_devices_dma_logging_start(VFIOContainer *container)
>       VFIOGroup *group;
>       int ret = 0;
>
> -    vfio_dirty_tracking_init(container, &ranges);
> +    ret = vfio_dirty_tracking_init(container, &ranges);
> +    if (ret) {
> +        error_report("Failed to init DMA logging ranges, err %d",
> +                      ret);
> +        return -EOPNOTSUPP;
> +    }
> +
>       feature = vfio_device_feature_dma_logging_start_create(container,
>                                                              &ranges);
>       if (!feature) {
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 8a98e6ffc480..3bda5618c5b5 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2974,6 +2974,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>                                 &dma_translation);
>       space->no_dma_translation = !dma_translation;
>
> +    /*
> +     * Support for advertised IOMMU address space boundaries is optional.
> +     * By default, it is not advertised i.e. space::max_iova is 0.
> +     */
> +    pci_device_iommu_get_attr(pdev, IOMMU_ATTR_MAX_IOVA,
> +                              &space->max_iova);
> +
>       QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
>           if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
>               error_setg(errp, "device is already attached");
> --
> 2.17.2
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 07/15] vfio/common: Track whether DMA Translation is enabled on the vIOMMU
  2023-07-09 15:10   ` Avihai Horon
@ 2023-07-10 13:44     ` Joao Martins
  0 siblings, 0 replies; 55+ messages in thread
From: Joao Martins @ 2023-07-10 13:44 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Jason Gunthorpe

On 09/07/2023 16:10, Avihai Horon wrote:
> On 23/06/2023 0:48, Joao Martins wrote:
>> vfio_get_group() allocates and fills the group/container/space on
>> success which will store the AddressSpace inside the VFIOSpace struct.
>> Use the newly added pci_device_iommu_get_attr() to see if DMA
>> translation is enabled or not. Assume that by default it is enabled.
>>
>> Today, this means only intel-iommu supports it.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   include/hw/vfio/vfio-common.h |  1 +
>>   hw/vfio/pci.c                 | 15 ++++++++++++++-
>>   2 files changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index eed244f25f34..f41860988d6b 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -70,6 +70,7 @@ typedef struct VFIOMigration {
>>
>>   typedef struct VFIOAddressSpace {
>>       AddressSpace *as;
>> +    bool no_dma_translation;
> 
> I find this negation a bit confusing, especially when below local variable is
> "dma_translation" (there is also double negation in next patch).
> Maybe rename to "dma_translation" or "have_dma_translation"?
> 

Good idea -- I can switch to that.

> Thanks.
> 
>>       QLIST_HEAD(, VFIOContainer) containers;
>>       QLIST_ENTRY(VFIOAddressSpace) list;
>>   } VFIOAddressSpace;
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 73874a94de12..8a98e6ffc480 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -2900,6 +2900,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>>       VFIOPCIDevice *vdev = VFIO_PCI(pdev);
>>       VFIODevice *vbasedev = &vdev->vbasedev;
>>       VFIODevice *vbasedev_iter;
>> +    VFIOAddressSpace *space;
>>       VFIOGroup *group;
>>       char *tmp, *subsys, group_path[PATH_MAX], *group_name;
>>       Error *err = NULL;
>> @@ -2907,7 +2908,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>>       struct stat st;
>>       int groupid;
>>       int i, ret;
>> -    bool is_mdev;
>> +    bool is_mdev, dma_translation;
>>       char uuid[UUID_FMT_LEN];
>>       char *name;
>>
>> @@ -2961,6 +2962,18 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>>           goto error;
>>       }
>>
>> +    space = group->container->space;
>> +
>> +    /*
>> +     * Support for toggling DMA translation is optional.
>> +     * By default, DMA translation is assumed to be enabled i.e.
>> +     * space::no_dma_translation is 0.
>> +     */
>> +    dma_translation = true;
>> +    pci_device_iommu_get_attr(pdev, IOMMU_ATTR_DMA_TRANSLATION,
>> +                              &dma_translation);
>> +    space->no_dma_translation = !dma_translation;
>> +
>>       QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
>>           if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
>>               error_setg(errp, "device is already attached");
>> -- 
>> 2.17.2
>>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 10/15] intel-iommu: Implement IOMMU_ATTR_MAX_IOVA get_attr() attribute
  2023-07-09 15:17   ` Avihai Horon
@ 2023-07-10 13:44     ` Joao Martins
  2023-10-02 15:42       ` Cédric Le Goater
  0 siblings, 1 reply; 55+ messages in thread
From: Joao Martins @ 2023-07-10 13:44 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Jason Gunthorpe



On 09/07/2023 16:17, Avihai Horon wrote:
> 
> On 23/06/2023 0:48, Joao Martins wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> From: Avihai Horon <avihaih@nvidia.com>
>>
>> Implement get_attr() method and use the address width property to report
>> the IOMMU_ATTR_MAX_IOVA attribute.
> 
> Nit: get_attr() method was already implemented in patch #6.
> Maybe just "Use address width property to report IOMMU_ATTR_MAX_IOVA attribute"?
> 
Yeap, makes sense.

> Thanks.
> 
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   hw/i386/intel_iommu.c | 7 +++++++
>>   1 file changed, 7 insertions(+)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index ed2a46e008df..989993e303a6 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -3876,6 +3876,13 @@ static int vtd_iommu_get_attr(IOMMUMemoryRegion *iommu_mr,
>>           *enabled = s->dma_translation;
>>           break;
>>       }
>> +    case IOMMU_ATTR_MAX_IOVA:
>> +    {
>> +        hwaddr *max_iova = (hwaddr *)(uintptr_t) data;
>> +
>> +        *max_iova = MAKE_64BIT_MASK(0, s->aw_bits);;
>> +        break;
>> +    }
>>       default:
>>           ret = -EINVAL;
>>           break;
>> -- 
>> 2.17.2
>>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 12/15] vfio/common: Support device dirty page tracking with vIOMMU
  2023-07-09 15:24   ` Avihai Horon
@ 2023-07-10 13:49     ` Joao Martins
  0 siblings, 0 replies; 55+ messages in thread
From: Joao Martins @ 2023-07-10 13:49 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Jason Gunthorpe

On 09/07/2023 16:24, Avihai Horon wrote:
> On 23/06/2023 0:48, Joao Martins wrote:
>> Currently, device dirty page tracking with vIOMMU is not supported,
>> and a blocker is added and the migration is prevented.
>>
>> When vIOMMU is used, IOVA ranges are DMA mapped/unmapped on the fly as
>> requesting by the vIOMMU. These IOVA ranges can potentially be mapped
>> anywhere in the vIOMMU IOVA space as advertised by the VMM.
>>
>> To support device dirty tracking when vIOMMU enabled instead create the
>> dirty ranges based on the vIOMMU provided limits, which leads to the
>> tracking of the whole IOVA space regardless of what devices use.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   include/hw/vfio/vfio-common.h |  1 +
>>   hw/vfio/common.c              | 58 +++++++++++++++++++++++++++++------
>>   hw/vfio/pci.c                 |  7 +++++
>>   3 files changed, 56 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index f41860988d6b..c4bafad084b4 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -71,6 +71,7 @@ typedef struct VFIOMigration {
>>   typedef struct VFIOAddressSpace {
>>       AddressSpace *as;
>>       bool no_dma_translation;
>> +    hwaddr max_iova;
>>       QLIST_HEAD(, VFIOContainer) containers;
>>       QLIST_ENTRY(VFIOAddressSpace) list;
>>   } VFIOAddressSpace;
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index ecfb9afb3fb6..85fddef24026 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -428,6 +428,25 @@ static bool vfio_viommu_preset(void)
>>       return false;
>>   }
>>
>> +static int vfio_viommu_get_max_iova(hwaddr *max_iova)
>> +{
>> +    VFIOAddressSpace *space;
>> +
>> +    *max_iova = 0;
>> +
>> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
>> +        if (space->as == &address_space_memory) {
>> +            continue;
>> +        }
>> +
>> +        if (*max_iova < space->max_iova) {
>> +            *max_iova = space->max_iova;
>> +        }
>> +    }
> 
> Looks like max_iova is a per VFIOAddressSpace property, so why do we need to
> iterate over all address spaces?
> 

This was more futureproof-ing when Qemu supports multiple vIOMMU. In theory this
tracks device address space, and if two different devices stand behind different
vIOMMU, then this loop would compute the highest IOVA that we would track by the
host device dirty tracker.

But I realize this might introduce unnecessary complexity, and we should 'obey'
the advertised vIOMMU max_iova for the device. With Zhenzhong blocker cleanup I
can make this just fetch the max_iova in the space and be done with it.

	Joao

> Thanks.
> 
>> +
>> +    return *max_iova == 0;
>> +}
>> +
>>   int vfio_block_giommu_migration(Error **errp)
>>   {
>>       int ret;
>> @@ -1464,10 +1483,11 @@ static const MemoryListener
>> vfio_dirty_tracking_listener = {
>>       .region_add = vfio_listener_dirty_tracking_update,
>>   };
>>
>> -static void vfio_dirty_tracking_init(VFIOContainer *container,
>> +static int vfio_dirty_tracking_init(VFIOContainer *container,
>>                                        VFIODirtyRanges *ranges)
>>   {
>>       VFIODirtyRangesListener dirty;
>> +    int ret;
>>
>>       memset(&dirty, 0, sizeof(dirty));
>>       dirty.ranges.min32 = UINT32_MAX;
>> @@ -1475,17 +1495,29 @@ static void vfio_dirty_tracking_init(VFIOContainer
>> *container,
>>       dirty.listener = vfio_dirty_tracking_listener;
>>       dirty.container = container;
>>
>> -    memory_listener_register(&dirty.listener,
>> -                             container->space->as);
>> +    if (vfio_viommu_preset()) {
>> +        hwaddr iommu_max_iova;
>> +
>> +        ret = vfio_viommu_get_max_iova(&iommu_max_iova);
>> +        if (ret) {
>> +            return -EINVAL;
>> +        }
>> +
>> +        vfio_dirty_tracking_update(0, iommu_max_iova, &dirty.ranges);
>> +    } else {
>> +        memory_listener_register(&dirty.listener,
>> +                                 container->space->as);
>> +        /*
>> +         * The memory listener is synchronous, and used to calculate the range
>> +         * to dirty tracking. Unregister it after we are done as we are not
>> +         * interested in any follow-up updates.
>> +         */
>> +        memory_listener_unregister(&dirty.listener);
>> +    }
>>
>>       *ranges = dirty.ranges;
>>
>> -    /*
>> -     * The memory listener is synchronous, and used to calculate the range
>> -     * to dirty tracking. Unregister it after we are done as we are not
>> -     * interested in any follow-up updates.
>> -     */
>> -    memory_listener_unregister(&dirty.listener);
>> +    return 0;
>>   }
>>
>>   static void vfio_devices_dma_logging_stop(VFIOContainer *container)
>> @@ -1590,7 +1622,13 @@ static int vfio_devices_dma_logging_start(VFIOContainer
>> *container)
>>       VFIOGroup *group;
>>       int ret = 0;
>>
>> -    vfio_dirty_tracking_init(container, &ranges);
>> +    ret = vfio_dirty_tracking_init(container, &ranges);
>> +    if (ret) {
>> +        error_report("Failed to init DMA logging ranges, err %d",
>> +                      ret);
>> +        return -EOPNOTSUPP;
>> +    }
>> +
>>       feature = vfio_device_feature_dma_logging_start_create(container,
>>                                                              &ranges);
>>       if (!feature) {
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 8a98e6ffc480..3bda5618c5b5 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -2974,6 +2974,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>>                                 &dma_translation);
>>       space->no_dma_translation = !dma_translation;
>>
>> +    /*
>> +     * Support for advertised IOMMU address space boundaries is optional.
>> +     * By default, it is not advertised i.e. space::max_iova is 0.
>> +     */
>> +    pci_device_iommu_get_attr(pdev, IOMMU_ATTR_MAX_IOVA,
>> +                              &space->max_iova);
>> +
>>       QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
>>           if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
>>               error_setg(errp, "device is already attached");
>> -- 
>> 2.17.2
>>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU
  2023-06-22 21:48 [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
                   ` (15 preceding siblings ...)
  2023-06-22 22:18 ` [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
@ 2023-09-07 11:11 ` Joao Martins
  2023-09-07 12:40   ` Cédric Le Goater
  16 siblings, 1 reply; 55+ messages in thread
From: Joao Martins @ 2023-09-07 11:11 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: Alex Williamson, Paolo Bonzini, Peter Xu, David Hildenbrand,
	Philippe Mathieu-Daude, Michael S. Tsirkin, Marcel Apfelbaum,
	Jason Wang, Richard Henderson, Eduardo Habkost, Avihai Horon,
	Jason Gunthorpe, qemu-devel, Yi Liu

On 22/06/2023 22:48, Joao Martins wrote:
> Hey,
> 
> This series introduces support for vIOMMU with VFIO device migration,
> particurlarly related to how we do the dirty page tracking.
> 
> Today vIOMMUs serve two purposes: 1) enable interrupt remaping 2)
> provide dma translation services for guests to provide some form of
> guest kernel managed DMA e.g. for nested virt based usage; (1) is specially
> required for big VMs with VFs with more than 255 vcpus. We tackle both
> and remove the migration blocker when vIOMMU is present provided the
> conditions are met. I have both use-cases here in one series, but I am happy
> to tackle them in separate series.
> 
> As I found out we don't necessarily need to expose the whole vIOMMU
> functionality in order to just support interrupt remapping. x86 IOMMUs
> on Windows Server 2018[2] and Linux >=5.10, with qemu 7.1+ (or really
> Linux guests with commit c40aaaac10 and since qemu commit 8646d9c773d8)
> can instantiate a IOMMU just for interrupt remapping without needing to
> be advertised/support DMA translation. AMD IOMMU in theory can provide
> the same, but Linux doesn't quite support the IR-only part there yet,
> only intel-iommu.
> 
> The series is organized as following:
> 
> Patches 1-5: Today we can't gather vIOMMU details before the guest
> establishes their first DMA mapping via the vIOMMU. So these first four
> patches add a way for vIOMMUs to be asked of their properties at start
> of day. I choose the least churn possible way for now (as opposed to a
> treewide conversion) and allow easy conversion a posteriori. As
> suggested by Peter Xu[7], I have ressurected Yi's patches[5][6] which
> allows us to fetch PCI backing vIOMMU attributes, without necessarily
> tieing the caller (VFIO or anyone else) to an IOMMU MR like I
> was doing in v3.
> 
> Patches 6-8: Handle configs with vIOMMU interrupt remapping but without
> DMA translation allowed. Today the 'dma-translation' attribute is
> x86-iommu only, but the way this series is structured nothing stops from
> other vIOMMUs supporting it too as long as they use
> pci_setup_iommu_ops() and the necessary IOMMU MR get_attr attributes
> are handled. The blocker is thus relaxed when vIOMMUs are able to toggle
> the toggle/report DMA_TRANSLATION attribute. With the patches up to this set,
> we've then tackled item (1) of the second paragraph.
> 
> Patches 9-15: Simplified a lot from v2 (patch 9) to only track the complete
> IOVA address space, leveraging the logic we use to compose the dirty ranges.
> The blocker is once again relaxed for vIOMMUs that advertise their IOVA
> addressing limits. This tackles item (2). So far I mainly use it with
> intel-iommu, although I have a small set of patches for virtio-iommu per
> Alex's suggestion in v2.
> 
> Comments, suggestions welcome. Thanks for the review!
> 
Cedric, you mentioned that you take a look at this after you come back, not sure
if that's still the plan. But it's been a while since the last version, so would
you have me repost/rebase on the latest (post your PR)?

Additionally, I should say that I have an alternative (as a single small patch),
where vIOMMU usage is allowed ... but behind a VFIO command line option, and as
soon as attempt *any* vIOMMU mapping we fail-to-start/block the migration. I
haven't posted that alternative as early in the dirty tracking work the idea was
to avoid guest vIOMMU usage dependency to allow migration (which made this
patchset the way it is). But thought it was OK to remind, if it was only be
allowed if the admin explicitly states such its intent behind a x- command line
option.


> Regards,
> 	Joao
> 
> Changes since v3[8]:
> * Pick up Yi's patches[5][6], and rework the first four patches.
>   These are a bit better splitted, and make the new iommu_ops *optional*
>   as opposed to a treewide conversion. Rather than returning an IOMMU MR
>   and let VFIO operate on it to fetch attributes, we instead let the
>   underlying IOMMU driver fetch the desired IOMMU MR and ask for the
>   desired IOMMU attribute. Callers only care about PCI Device backing
>   vIOMMU attributes regardless of its topology/association. (Peter Xu)
>   These patches are a bit better splitted compared to original ones,
>   and I've kept all the same authorship and note the changes from
>   original where applicable.
> * Because of the rework of the first four patches, switch to
>   individual attributes in the VFIOSpace that track dma_translation
>   and the max_iova. All are expected to be unused when zero to retain
>   the defaults of today in common code.
> * Improve the migration blocker message of the last patch to be
>   more obvious that vIOMMU migration blocker is added when no vIOMMU
>   address space limits are advertised. (Patch 15)
> * Cast to uintptr_t in IOMMUAttr data in intel-iommu (Philippe).
> * Switch to MAKE_64BIT_MASK() instead of plain left shift (Philippe).
> * Change diffstat of patches with scripts/git.orderfile (Philippe).
> 
> Changes since v2[3]:
> * New patches 1-9 to be able to handle vIOMMUs without DMA translation, and
> introduce ways to know various IOMMU model attributes via the IOMMU MR. This
> is partly meant to address a comment in previous versions where we can't
> access the IOMMU MR prior to the DMA mapping happening. Before this series
> vfio giommu_list is only tracking 'mapped GIOVA' and that controlled by the
> guest. As well as better tackling of the IOMMU usage for interrupt-remapping
> only purposes. 
> * Dropped Peter Xu ack on patch 9 given that the code changed a bit.
> * Adjust patch 14 to adjust for the VFIO bitmaps no longer being pointers.
> * The patches that existed in v2 of vIOMMU dirty tracking, are mostly
> * untouched, except patch 12 which was greatly simplified.
> 
> Changes since v1[4]:
> - Rebased on latest master branch. As part of it, made some changes in
>   pre-copy to adjust it to Juan's new patches:
>   1. Added a new patch that passes threshold_size parameter to
>      .state_pending_{estimate,exact}() handlers.
>   2. Added a new patch that refactors vfio_save_block().
>   3. Changed the pre-copy patch to cache and report pending pre-copy
>      size in the .state_pending_estimate() handler.
> - Removed unnecessary P2P code. This should be added later on when P2P
>   support is added. (Alex)
> - Moved the dirty sync to be after the DMA unmap in vfio_dma_unmap()
>   (patch #11). (Alex)
> - Stored vfio_devices_all_device_dirty_tracking()'s value in a local
>   variable in vfio_get_dirty_bitmap() so it can be re-used (patch #11).
> - Refactored the viommu device dirty tracking ranges creation code to
>   make it clearer (patch #15).
> - Changed overflow check in vfio_iommu_range_is_device_tracked() to
>   emphasize that we specifically check for 2^64 wrap around (patch #15).
> - Added R-bs / Acks.
> 
> [0] https://lore.kernel.org/qemu-devel/20230222174915.5647-1-avihaih@nvidia.com/
> [1] https://lore.kernel.org/qemu-devel/c66d2d8e-f042-964a-a797-a3d07c260a3b@oracle.com/
> [2] https://learn.microsoft.com/en-us/windows-hardware/design/device-experiences/oem-kernel-dma-protection
> [3] https://lore.kernel.org/qemu-devel/20230222174915.5647-1-avihaih@nvidia.com/
> [4] https://lore.kernel.org/qemu-devel/20230126184948.10478-1-avihaih@nvidia.com/
> [5] https://lore.kernel.org/all/20210302203827.437645-5-yi.l.liu@intel.com/
> [6] https://lore.kernel.org/all/20210302203827.437645-6-yi.l.liu@intel.com/
> [7] https://lore.kernel.org/qemu-devel/ZH9Kr6mrKNqUgcYs@x1n/
> [8] https://lore.kernel.org/qemu-devel/20230530175937.24202-1-joao.m.martins@oracle.com/
> 
> Avihai Horon (4):
>   memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute
>   intel-iommu: Implement IOMMU_ATTR_MAX_IOVA get_attr() attribute
>   vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap()
>   vfio/common: Optimize device dirty page tracking with vIOMMU
> 
> Joao Martins (7):
>   memory/iommu: Add IOMMU_ATTR_DMA_TRANSLATION attribute
>   intel-iommu: Implement get_attr() method
>   vfio/common: Track whether DMA Translation is enabled on the vIOMMU
>   vfio/common: Relax vIOMMU detection when DMA translation is off
>   vfio/common: Move dirty tracking ranges update to helper
>   vfio/common: Support device dirty page tracking with vIOMMU
>   vfio/common: Block migration with vIOMMUs without address width limits
> 
> Yi Liu (4):
>   hw/pci: Add a pci_setup_iommu_ops() helper
>   hw/pci: Refactor pci_device_iommu_address_space()
>   hw/pci: Introduce pci_device_iommu_get_attr()
>   intel-iommu: Switch to pci_setup_iommu_ops()
> 
>  include/exec/memory.h         |   4 +-
>  include/hw/pci/pci.h          |  11 ++
>  include/hw/pci/pci_bus.h      |   1 +
>  include/hw/vfio/vfio-common.h |   2 +
>  hw/i386/intel_iommu.c         |  53 +++++++-
>  hw/pci/pci.c                  |  58 +++++++-
>  hw/vfio/common.c              | 241 ++++++++++++++++++++++++++--------
>  hw/vfio/pci.c                 |  22 +++-
>  8 files changed, 329 insertions(+), 63 deletions(-)
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU
  2023-09-07 11:11 ` Joao Martins
@ 2023-09-07 12:40   ` Cédric Le Goater
  2023-09-07 15:20     ` Joao Martins
  0 siblings, 1 reply; 55+ messages in thread
From: Cédric Le Goater @ 2023-09-07 12:40 UTC (permalink / raw)
  To: Joao Martins
  Cc: Alex Williamson, Paolo Bonzini, Peter Xu, David Hildenbrand,
	Philippe Mathieu-Daude, Michael S. Tsirkin, Marcel Apfelbaum,
	Jason Wang, Richard Henderson, Eduardo Habkost, Avihai Horon,
	Jason Gunthorpe, qemu-devel, Yi Liu

Hello Joao,

> Cedric, you mentioned that you take a look at this after you come back, not sure
> if that's still the plan. But it's been a while since the last version, so would
> you have me repost/rebase on the latest (post your PR)?

Yes please. That's next on the TODO list (after some downstream work
regarding the postcopy crash). My rough plan for 8.2 is :

  * P2P
  * VIOMMU
  * dynamic MSI-X
  * fixes

I think it is a bit early for iommufd and I will probably lack time.
The recent migration addition is requiring some attention in many
areas.

> Additionally, I should say that I have an alternative (as a single small patch),
> where vIOMMU usage is allowed ... but behind a VFIO command line option, and as
> soon as attempt *any* vIOMMU mapping we fail-to-start/block the migration. I
> haven't posted that alternative as early in the dirty tracking work the idea was
> to avoid guest vIOMMU usage dependency to allow migration (which made this
> patchset the way it is). But thought it was OK to remind, if it was only be
> allowed if the admin explicitly states such its intent behind a x- command line
> option.

I don't remember seeing it. It is worth resending as an RFC so that
people can comment.

Thanks,

C.





^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU
  2023-09-07 12:40   ` Cédric Le Goater
@ 2023-09-07 15:20     ` Joao Martins
  0 siblings, 0 replies; 55+ messages in thread
From: Joao Martins @ 2023-09-07 15:20 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Alex Williamson, Paolo Bonzini, Peter Xu, David Hildenbrand,
	Philippe Mathieu-Daude, Michael S. Tsirkin, Marcel Apfelbaum,
	Jason Wang, Richard Henderson, Eduardo Habkost, Avihai Horon,
	Jason Gunthorpe, qemu-devel, Yi Liu

On 07/09/2023 13:40, Cédric Le Goater wrote:
> Hello Joao,
> 
>> Cedric, you mentioned that you take a look at this after you come back, not sure
>> if that's still the plan. But it's been a while since the last version, so would
>> you have me repost/rebase on the latest (post your PR)?
> 
> Yes please. That's next on the TODO list (after some downstream work
> regarding the postcopy crash). My rough plan for 8.2 is :
> 
>  * P2P
>  * VIOMMU
>  * dynamic MSI-X
>  * fixes
> 

Thanks for sharing

> I think it is a bit early for iommufd and I will probably lack time.
> The recent migration addition is requiring some attention in many
> areas.
> 
>> Additionally, I should say that I have an alternative (as a single small patch),
>> where vIOMMU usage is allowed ... but behind a VFIO command line option, and as
>> soon as attempt *any* vIOMMU mapping we fail-to-start/block the migration. I
>> haven't posted that alternative as early in the dirty tracking work the idea was
>> to avoid guest vIOMMU usage dependency to allow migration (which made this
>> patchset the way it is). But thought it was OK to remind, if it was only be
>> allowed if the admin explicitly states such its intent behind a x- command line
>> option.
> 
> I don't remember seeing it. It is worth resending as an RFC so that
> people can comment.

I haven't send it, largelly because in the first versions of dirty tracking the
situation revolved around whether or not we depend on guest vIOMMU usage
(passthrough or not) vs tracking something agnostic to guest (raw IOVA ranges).

In any case I can send out the patch and move the discussion there whether it's
a good idea or not (it's a simple patch)

	Joao


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 12/15] vfio/common: Support device dirty page tracking with vIOMMU
  2023-06-22 21:48 ` [PATCH v4 12/15] vfio/common: Support device dirty page tracking with vIOMMU Joao Martins
  2023-07-09 15:24   ` Avihai Horon
@ 2023-09-08  6:11   ` Duan, Zhenzhong
  2023-09-08 10:11     ` Joao Martins
  1 sibling, 1 reply; 55+ messages in thread
From: Duan, Zhenzhong @ 2023-09-08  6:11 UTC (permalink / raw)
  To: Joao Martins, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe

Hi Joao,

On 6/23/2023 5:48 AM, Joao Martins wrote:
> Currently, device dirty page tracking with vIOMMU is not supported,
> and a blocker is added and the migration is prevented.
>
> When vIOMMU is used, IOVA ranges are DMA mapped/unmapped on the fly as
> requesting by the vIOMMU. These IOVA ranges can potentially be mapped
> anywhere in the vIOMMU IOVA space as advertised by the VMM.
>
> To support device dirty tracking when vIOMMU enabled instead create the
> dirty ranges based on the vIOMMU provided limits, which leads to the
> tracking of the whole IOVA space regardless of what devices use.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   include/hw/vfio/vfio-common.h |  1 +
>   hw/vfio/common.c              | 58 +++++++++++++++++++++++++++++------
>   hw/vfio/pci.c                 |  7 +++++
>   3 files changed, 56 insertions(+), 10 deletions(-)
>
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index f41860988d6b..c4bafad084b4 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -71,6 +71,7 @@ typedef struct VFIOMigration {
>   typedef struct VFIOAddressSpace {
>       AddressSpace *as;
>       bool no_dma_translation;
> +    hwaddr max_iova;
>       QLIST_HEAD(, VFIOContainer) containers;
>       QLIST_ENTRY(VFIOAddressSpace) list;
>   } VFIOAddressSpace;
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index ecfb9afb3fb6..85fddef24026 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -428,6 +428,25 @@ static bool vfio_viommu_preset(void)
>       return false;
>   }
>   
> +static int vfio_viommu_get_max_iova(hwaddr *max_iova)
> +{
> +    VFIOAddressSpace *space;
> +
> +    *max_iova = 0;
> +
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        if (space->as == &address_space_memory) {
> +            continue;
> +        }

Just curious why address_space_memory is bypassed?

Imagine two vfio devices linked to two host bridge, one has bypass_iommu set

and the other not. Don't we need to include the guest memory ranges in

address_space_memory?

> +
> +        if (*max_iova < space->max_iova) {
> +            *max_iova = space->max_iova;
> +        }
> +    }
> +
> +    return *max_iova == 0;
> +}
> +
>   int vfio_block_giommu_migration(Error **errp)
>   {
>       int ret;
> @@ -1464,10 +1483,11 @@ static const MemoryListener vfio_dirty_tracking_listener = {
>       .region_add = vfio_listener_dirty_tracking_update,
>   };
>   
> -static void vfio_dirty_tracking_init(VFIOContainer *container,
> +static int vfio_dirty_tracking_init(VFIOContainer *container,
>                                        VFIODirtyRanges *ranges)
>   {
>       VFIODirtyRangesListener dirty;
> +    int ret;
>   
>       memset(&dirty, 0, sizeof(dirty));
>       dirty.ranges.min32 = UINT32_MAX;
> @@ -1475,17 +1495,29 @@ static void vfio_dirty_tracking_init(VFIOContainer *container,
>       dirty.listener = vfio_dirty_tracking_listener;
>       dirty.container = container;
>   
> -    memory_listener_register(&dirty.listener,
> -                             container->space->as);
> +    if (vfio_viommu_preset()) {
> +        hwaddr iommu_max_iova;
> +
> +        ret = vfio_viommu_get_max_iova(&iommu_max_iova);
> +        if (ret) {
> +            return -EINVAL;
> +        }
> +
> +        vfio_dirty_tracking_update(0, iommu_max_iova, &dirty.ranges);
> +    } else {
> +        memory_listener_register(&dirty.listener,
> +                                 container->space->as);
> +        /*
> +         * The memory listener is synchronous, and used to calculate the range
> +         * to dirty tracking. Unregister it after we are done as we are not
> +         * interested in any follow-up updates.
> +         */
> +        memory_listener_unregister(&dirty.listener);
> +    }
>   
>       *ranges = dirty.ranges;
>   
> -    /*
> -     * The memory listener is synchronous, and used to calculate the range
> -     * to dirty tracking. Unregister it after we are done as we are not
> -     * interested in any follow-up updates.
> -     */
> -    memory_listener_unregister(&dirty.listener);
> +    return 0;
>   }
>   
>   static void vfio_devices_dma_logging_stop(VFIOContainer *container)
> @@ -1590,7 +1622,13 @@ static int vfio_devices_dma_logging_start(VFIOContainer *container)
>       VFIOGroup *group;
>       int ret = 0;
>   
> -    vfio_dirty_tracking_init(container, &ranges);
> +    ret = vfio_dirty_tracking_init(container, &ranges);
> +    if (ret) {
> +        error_report("Failed to init DMA logging ranges, err %d",
> +                      ret);
> +        return -EOPNOTSUPP;
> +    }
> +
>       feature = vfio_device_feature_dma_logging_start_create(container,
>                                                              &ranges);

No clear how much dirty range size could impact device dirty tracking.

Maybe some devices linking to vIOMMU with small aw_bits or bypassing vIOMMU

with small guest memory could benefit if we use per address space's 
dirty range

Thanks

Zhenzhong



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 06/15] intel-iommu: Implement get_attr() method
  2023-06-22 21:48 ` [PATCH v4 06/15] intel-iommu: Implement get_attr() method Joao Martins
@ 2023-09-08  6:23   ` Duan, Zhenzhong
  2023-09-08 10:11     ` Joao Martins
  2023-10-02 15:23   ` Cédric Le Goater
  1 sibling, 1 reply; 55+ messages in thread
From: Duan, Zhenzhong @ 2023-09-08  6:23 UTC (permalink / raw)
  To: Joao Martins, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe


On 6/23/2023 5:48 AM, Joao Martins wrote:
> Implement IOMMU MR get_attr() method and use the dma_translation
> property to report the IOMMU_ATTR_DMA_TRANSLATION attribute.
> Additionally add the necessary get_iommu_attr into the PCIIOMMUOps to
> support pci_device_iommu_get_attr().
>
> The callback in there acts as a IOMMU-specific address space walker
> which will call get_attr in the IOMMUMemoryRegion backing the device to
> fetch the desired attribute.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   hw/i386/intel_iommu.c | 40 ++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 40 insertions(+)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 1606d1b952d0..ed2a46e008df 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -3861,6 +3861,29 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>       return;
>   }
>   
> +static int vtd_iommu_get_attr(IOMMUMemoryRegion *iommu_mr,
> +                              enum IOMMUMemoryRegionAttr attr, void *data)
> +{
> +    VTDAddressSpace *vtd_as = container_of(iommu_mr, VTDAddressSpace, iommu);
> +    IntelIOMMUState *s = vtd_as->iommu_state;
> +    int ret = 0;
> +
> +    switch (attr) {
> +    case IOMMU_ATTR_DMA_TRANSLATION:
> +    {
> +        bool *enabled = (bool *)(uintptr_t) data;
> +
> +        *enabled = s->dma_translation;
> +        break;
> +    }
> +    default:
> +        ret = -EINVAL;
> +        break;
> +    }
> +
> +    return ret;
> +}
> +
>   /* Do the initialization. It will also be called when reset, so pay
>    * attention when adding new initialization stuff.
>    */
> @@ -4032,8 +4055,24 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>       return &vtd_as->as;
>   }
>   
> +static int vtd_get_iommu_attr(PCIBus *bus, void *opaque, int32_t devfn,
> +                              enum IOMMUMemoryRegionAttr attr, void *data)
> +{
> +    IntelIOMMUState *s = opaque;
> +    VTDAddressSpace *vtd_as;
> +
> +    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
> +
> +    vtd_as = vtd_find_add_as(s, bus, devfn, PCI_NO_PASID);
> +    if (!vtd_as)
> +	return -EINVAL;
> +
> +    return memory_region_iommu_get_attr(&vtd_as->iommu, attr, data);
> +}
> +
>   static PCIIOMMUOps vtd_iommu_ops = {
>       .get_address_space = vtd_host_dma_iommu,
> +    .get_iommu_attr = vtd_get_iommu_attr,
>   };
>   
>   static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> @@ -4197,6 +4236,7 @@ static void vtd_iommu_memory_region_class_init(ObjectClass *klass,
>       imrc->translate = vtd_iommu_translate;
>       imrc->notify_flag_changed = vtd_iommu_notify_flag_changed;
>       imrc->replay = vtd_iommu_replay;
> +    imrc->get_attr = vtd_iommu_get_attr;

Do we have other user of imrc->get_attr? If not, what about squashing

vtd_iommu_get_attr into vtd_get_iommu_attr and have imrc->get_attr removed?

Thanks
Zhenzhong

>   }
>   
>   static const TypeInfo vtd_iommu_memory_region_info = {


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 15/15] vfio/common: Block migration with vIOMMUs without address width limits
  2023-06-22 21:48 ` [PATCH v4 15/15] vfio/common: Block migration with vIOMMUs without address width limits Joao Martins
@ 2023-09-08  6:28   ` Duan, Zhenzhong
  2023-09-08 10:11     ` Joao Martins
  0 siblings, 1 reply; 55+ messages in thread
From: Duan, Zhenzhong @ 2023-09-08  6:28 UTC (permalink / raw)
  To: Joao Martins, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe


On 6/23/2023 5:48 AM, Joao Martins wrote:
> Only block the case when the underlying vIOMMU model does not report any
> address space limits, in addition to DMA translation being off or no
> vIOMMU present. The limits are needed such that can define the IOVA limits
> that arm the device dirty tracker.
>
> Additionally, reword the migration blocker error message to clarify that
> we the configured vIOMMU does not support migration, as opposed to
> implying that just being there blocks migration.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   hw/vfio/common.c | 7 +++++--
>   1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 62f91e8e102d..c3cc0dd47044 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -449,15 +449,18 @@ static int vfio_viommu_get_max_iova(hwaddr *max_iova)
>   
>   int vfio_block_giommu_migration(Error **errp)
>   {
> +    hwaddr max;
>       int ret;
>   
>       if (giommu_migration_blocker ||
> -        !vfio_viommu_preset()) {
> +        !vfio_viommu_preset() ||
> +        (vfio_viommu_preset() && !vfio_viommu_get_max_iova(&max))) {

Could be simplified as below:

+        !vfio_viommu_preset() || !vfio_viommu_get_max_iova(&max))) {

Thanks
Zhenzhong



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 06/15] intel-iommu: Implement get_attr() method
  2023-09-08  6:23   ` Duan, Zhenzhong
@ 2023-09-08 10:11     ` Joao Martins
  0 siblings, 0 replies; 55+ messages in thread
From: Joao Martins @ 2023-09-08 10:11 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe



On 08/09/2023 07:23, Duan, Zhenzhong wrote:
> 
> On 6/23/2023 5:48 AM, Joao Martins wrote:
>> Implement IOMMU MR get_attr() method and use the dma_translation
>> property to report the IOMMU_ATTR_DMA_TRANSLATION attribute.
>> Additionally add the necessary get_iommu_attr into the PCIIOMMUOps to
>> support pci_device_iommu_get_attr().
>>
>> The callback in there acts as a IOMMU-specific address space walker
>> which will call get_attr in the IOMMUMemoryRegion backing the device to
>> fetch the desired attribute.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   hw/i386/intel_iommu.c | 40 ++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 40 insertions(+)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 1606d1b952d0..ed2a46e008df 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -3861,6 +3861,29 @@ static void vtd_iommu_replay(IOMMUMemoryRegion
>> *iommu_mr, IOMMUNotifier *n)
>>       return;
>>   }
>>   +static int vtd_iommu_get_attr(IOMMUMemoryRegion *iommu_mr,
>> +                              enum IOMMUMemoryRegionAttr attr, void *data)
>> +{
>> +    VTDAddressSpace *vtd_as = container_of(iommu_mr, VTDAddressSpace, iommu);
>> +    IntelIOMMUState *s = vtd_as->iommu_state;
>> +    int ret = 0;
>> +
>> +    switch (attr) {
>> +    case IOMMU_ATTR_DMA_TRANSLATION:
>> +    {
>> +        bool *enabled = (bool *)(uintptr_t) data;
>> +
>> +        *enabled = s->dma_translation;
>> +        break;
>> +    }
>> +    default:
>> +        ret = -EINVAL;
>> +        break;
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>>   /* Do the initialization. It will also be called when reset, so pay
>>    * attention when adding new initialization stuff.
>>    */
>> @@ -4032,8 +4055,24 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus,
>> void *opaque, int devfn)
>>       return &vtd_as->as;
>>   }
>>   +static int vtd_get_iommu_attr(PCIBus *bus, void *opaque, int32_t devfn,
>> +                              enum IOMMUMemoryRegionAttr attr, void *data)
>> +{
>> +    IntelIOMMUState *s = opaque;
>> +    VTDAddressSpace *vtd_as;
>> +
>> +    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
>> +
>> +    vtd_as = vtd_find_add_as(s, bus, devfn, PCI_NO_PASID);
>> +    if (!vtd_as)
>> +    return -EINVAL;
>> +
>> +    return memory_region_iommu_get_attr(&vtd_as->iommu, attr, data);
>> +}
>> +
>>   static PCIIOMMUOps vtd_iommu_ops = {
>>       .get_address_space = vtd_host_dma_iommu,
>> +    .get_iommu_attr = vtd_get_iommu_attr,
>>   };
>>     static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
>> @@ -4197,6 +4236,7 @@ static void
>> vtd_iommu_memory_region_class_init(ObjectClass *klass,
>>       imrc->translate = vtd_iommu_translate;
>>       imrc->notify_flag_changed = vtd_iommu_notify_flag_changed;
>>       imrc->replay = vtd_iommu_replay;
>> +    imrc->get_attr = vtd_iommu_get_attr;
> 
> Do we have other user of imrc->get_attr? If not, what about squashing
> 
> vtd_iommu_get_attr into vtd_get_iommu_attr and have imrc->get_attr removed?
> 

There's an user, and that's VFIO, but it's later in the series. It felt more
natural for bisection to do things this way.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 12/15] vfio/common: Support device dirty page tracking with vIOMMU
  2023-09-08  6:11   ` Duan, Zhenzhong
@ 2023-09-08 10:11     ` Joao Martins
  2023-09-08 11:52       ` Duan, Zhenzhong
  0 siblings, 1 reply; 55+ messages in thread
From: Joao Martins @ 2023-09-08 10:11 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe

On 08/09/2023 07:11, Duan, Zhenzhong wrote:
> Hi Joao,
> 
> On 6/23/2023 5:48 AM, Joao Martins wrote:
>> Currently, device dirty page tracking with vIOMMU is not supported,
>> and a blocker is added and the migration is prevented.
>>
>> When vIOMMU is used, IOVA ranges are DMA mapped/unmapped on the fly as
>> requesting by the vIOMMU. These IOVA ranges can potentially be mapped
>> anywhere in the vIOMMU IOVA space as advertised by the VMM.
>>
>> To support device dirty tracking when vIOMMU enabled instead create the
>> dirty ranges based on the vIOMMU provided limits, which leads to the
>> tracking of the whole IOVA space regardless of what devices use.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   include/hw/vfio/vfio-common.h |  1 +
>>   hw/vfio/common.c              | 58 +++++++++++++++++++++++++++++------
>>   hw/vfio/pci.c                 |  7 +++++
>>   3 files changed, 56 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index f41860988d6b..c4bafad084b4 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -71,6 +71,7 @@ typedef struct VFIOMigration {
>>   typedef struct VFIOAddressSpace {
>>       AddressSpace *as;
>>       bool no_dma_translation;
>> +    hwaddr max_iova;
>>       QLIST_HEAD(, VFIOContainer) containers;
>>       QLIST_ENTRY(VFIOAddressSpace) list;
>>   } VFIOAddressSpace;
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index ecfb9afb3fb6..85fddef24026 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -428,6 +428,25 @@ static bool vfio_viommu_preset(void)
>>       return false;
>>   }
>>   +static int vfio_viommu_get_max_iova(hwaddr *max_iova)
>> +{
>> +    VFIOAddressSpace *space;
>> +
>> +    *max_iova = 0;
>> +
>> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
>> +        if (space->as == &address_space_memory) {
>> +            continue;
>> +        }
> 
> Just curious why address_space_memory is bypassed?
> 

But address_space_memory part is done by memory listeners, but I see your point
conceptually on not considering it

> Imagine two vfio devices linked to two host bridge, one has bypass_iommu set
> 
> and the other not. Don't we need to include the guest memory ranges in
> 
> address_space_memory?

I am probably making a bad assumption that vIOMMU maximum adress space is a
superset of the CPU address space. But as you just reminded me, there is a user
case where all it takes is one bypass_iommu=true on a 2T guest setup with
aw-bits=39.

I'll rework this to consider this.

> 
>> +
>> +        if (*max_iova < space->max_iova) {
>> +            *max_iova = space->max_iova;
>> +        }
>> +    }
>> +
>> +    return *max_iova == 0;
>> +}
>> +
>>   int vfio_block_giommu_migration(Error **errp)
>>   {
>>       int ret;
>> @@ -1464,10 +1483,11 @@ static const MemoryListener
>> vfio_dirty_tracking_listener = {
>>       .region_add = vfio_listener_dirty_tracking_update,
>>   };
>>   -static void vfio_dirty_tracking_init(VFIOContainer *container,
>> +static int vfio_dirty_tracking_init(VFIOContainer *container,
>>                                        VFIODirtyRanges *ranges)
>>   {
>>       VFIODirtyRangesListener dirty;
>> +    int ret;
>>         memset(&dirty, 0, sizeof(dirty));
>>       dirty.ranges.min32 = UINT32_MAX;
>> @@ -1475,17 +1495,29 @@ static void vfio_dirty_tracking_init(VFIOContainer
>> *container,
>>       dirty.listener = vfio_dirty_tracking_listener;
>>       dirty.container = container;
>>   -    memory_listener_register(&dirty.listener,
>> -                             container->space->as);
>> +    if (vfio_viommu_preset()) {
>> +        hwaddr iommu_max_iova;
>> +
>> +        ret = vfio_viommu_get_max_iova(&iommu_max_iova);
>> +        if (ret) {
>> +            return -EINVAL;
>> +        }
>> +
>> +        vfio_dirty_tracking_update(0, iommu_max_iova, &dirty.ranges);
>> +    } else {
>> +        memory_listener_register(&dirty.listener,
>> +                                 container->space->as);
>> +        /*
>> +         * The memory listener is synchronous, and used to calculate the range
>> +         * to dirty tracking. Unregister it after we are done as we are not
>> +         * interested in any follow-up updates.
>> +         */
>> +        memory_listener_unregister(&dirty.listener);
>> +    }
>>         *ranges = dirty.ranges;
>>   -    /*
>> -     * The memory listener is synchronous, and used to calculate the range
>> -     * to dirty tracking. Unregister it after we are done as we are not
>> -     * interested in any follow-up updates.
>> -     */
>> -    memory_listener_unregister(&dirty.listener);
>> +    return 0;
>>   }
>>     static void vfio_devices_dma_logging_stop(VFIOContainer *container)
>> @@ -1590,7 +1622,13 @@ static int vfio_devices_dma_logging_start(VFIOContainer
>> *container)
>>       VFIOGroup *group;
>>       int ret = 0;
>>   -    vfio_dirty_tracking_init(container, &ranges);
>> +    ret = vfio_dirty_tracking_init(container, &ranges);
>> +    if (ret) {
>> +        error_report("Failed to init DMA logging ranges, err %d",
>> +                      ret);
>> +        return -EOPNOTSUPP;
>> +    }
>> +
>>       feature = vfio_device_feature_dma_logging_start_create(container,
>>                                                              &ranges);
> 
> No clear how much dirty range size could impact device dirty tracking.
> 
> Maybe some devices linking to vIOMMU with small aw_bits or bypassing vIOMMU
> 

So, my intended usecase with this series starts with DMA_TRANSLATION=off, where
vIOMMU is restricted to interrupt remapping, yet guest only uses it for
interrupt remapping. Right now only supported by intel-iommu, but my
understanding of AMD IOMMU is that is also possible there (haven't checked
smmu-v3). This unblocks guests with more >255. I have another patch separate to
this that hopefully relaxes vIOMMU blockage for these older guests.

Now with real emulated vIOMMU DMA translation usage, intel-iommu there's only
39, 48, 57 address space width (mimmicing the levels). And it's true that only
the first value is supportable and somewhat limiting as you say.

virtio-iommu has more going there as you can limit input iova to be the size of
the CPU address space. Eric latest series[0] is actually nice on that end of
fixing that issue (my alternative I had there was to introduce a gaw-bits
equivalent, but it's better done like [0]).

[0]
https://lore.kernel.org/qemu-devel/20230904080451.424731-1-eric.auger@redhat.com/

> with small guest memory could benefit if we use per address space's dirty range
> 


Yeah, I'll need to fix that, when there's big guest memory and small vIOMMU
address space. Hopefully I picked up on your comment right :)


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 15/15] vfio/common: Block migration with vIOMMUs without address width limits
  2023-09-08  6:28   ` Duan, Zhenzhong
@ 2023-09-08 10:11     ` Joao Martins
  0 siblings, 0 replies; 55+ messages in thread
From: Joao Martins @ 2023-09-08 10:11 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe



On 08/09/2023 07:28, Duan, Zhenzhong wrote:
> 
> On 6/23/2023 5:48 AM, Joao Martins wrote:
>> Only block the case when the underlying vIOMMU model does not report any
>> address space limits, in addition to DMA translation being off or no
>> vIOMMU present. The limits are needed such that can define the IOVA limits
>> that arm the device dirty tracker.
>>
>> Additionally, reword the migration blocker error message to clarify that
>> we the configured vIOMMU does not support migration, as opposed to
>> implying that just being there blocks migration.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   hw/vfio/common.c | 7 +++++--
>>   1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 62f91e8e102d..c3cc0dd47044 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -449,15 +449,18 @@ static int vfio_viommu_get_max_iova(hwaddr *max_iova)
>>     int vfio_block_giommu_migration(Error **errp)
>>   {
>> +    hwaddr max;
>>       int ret;
>>         if (giommu_migration_blocker ||
>> -        !vfio_viommu_preset()) {
>> +        !vfio_viommu_preset() ||
>> +        (vfio_viommu_preset() && !vfio_viommu_get_max_iova(&max))) {
> 
> Could be simplified as below:
> 
> +        !vfio_viommu_preset() || !vfio_viommu_get_max_iova(&max))) {
> 

True.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 12/15] vfio/common: Support device dirty page tracking with vIOMMU
  2023-09-08 10:11     ` Joao Martins
@ 2023-09-08 11:52       ` Duan, Zhenzhong
  2023-09-08 11:54         ` Joao Martins
  0 siblings, 1 reply; 55+ messages in thread
From: Duan, Zhenzhong @ 2023-09-08 11:52 UTC (permalink / raw)
  To: Joao Martins, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe


On 9/8/2023 6:11 PM, Joao Martins wrote:
> On 08/09/2023 07:11, Duan, Zhenzhong wrote:
>> Hi Joao,
>>
>> On 6/23/2023 5:48 AM, Joao Martins wrote:
>>> Currently, device dirty page tracking with vIOMMU is not supported,
>>> and a blocker is added and the migration is prevented.
>>>
>>> When vIOMMU is used, IOVA ranges are DMA mapped/unmapped on the fly as
>>> requesting by the vIOMMU. These IOVA ranges can potentially be mapped
>>> anywhere in the vIOMMU IOVA space as advertised by the VMM.
>>>
>>> To support device dirty tracking when vIOMMU enabled instead create the
>>> dirty ranges based on the vIOMMU provided limits, which leads to the
>>> tracking of the whole IOVA space regardless of what devices use.
>>>
>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>> ---
>>>    include/hw/vfio/vfio-common.h |  1 +
>>>    hw/vfio/common.c              | 58 +++++++++++++++++++++++++++++------
>>>    hw/vfio/pci.c                 |  7 +++++
>>>    3 files changed, 56 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>> index f41860988d6b..c4bafad084b4 100644
>>> --- a/include/hw/vfio/vfio-common.h
>>> +++ b/include/hw/vfio/vfio-common.h
>>> @@ -71,6 +71,7 @@ typedef struct VFIOMigration {
>>>    typedef struct VFIOAddressSpace {
>>>        AddressSpace *as;
>>>        bool no_dma_translation;
>>> +    hwaddr max_iova;
>>>        QLIST_HEAD(, VFIOContainer) containers;
>>>        QLIST_ENTRY(VFIOAddressSpace) list;
>>>    } VFIOAddressSpace;
>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>> index ecfb9afb3fb6..85fddef24026 100644
>>> --- a/hw/vfio/common.c
>>> +++ b/hw/vfio/common.c
>>> @@ -428,6 +428,25 @@ static bool vfio_viommu_preset(void)
>>>        return false;
>>>    }
>>>    +static int vfio_viommu_get_max_iova(hwaddr *max_iova)
>>> +{
>>> +    VFIOAddressSpace *space;
>>> +
>>> +    *max_iova = 0;
>>> +
>>> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
>>> +        if (space->as == &address_space_memory) {
>>> +            continue;
>>> +        }
>> Just curious why address_space_memory is bypassed?
>>
> But address_space_memory part is done by memory listeners

Only this part. Still think about the case with two vfio devices, one 
bypass iommu, the other not.

The device bypassing iommu will get address_space_memory, the other get 
iommu

address space. vfio_viommu_preset() return true for any device, so we 
never run into

memory listener even for device bypassing iommu?


Thanks

Zhenzhong



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 12/15] vfio/common: Support device dirty page tracking with vIOMMU
  2023-09-08 11:52       ` Duan, Zhenzhong
@ 2023-09-08 11:54         ` Joao Martins
  0 siblings, 0 replies; 55+ messages in thread
From: Joao Martins @ 2023-09-08 11:54 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe

On 08/09/2023 12:52, Duan, Zhenzhong wrote:
> On 9/8/2023 6:11 PM, Joao Martins wrote:
>> On 08/09/2023 07:11, Duan, Zhenzhong wrote:
>>> Hi Joao,
>>>
>>> On 6/23/2023 5:48 AM, Joao Martins wrote:
>>>> Currently, device dirty page tracking with vIOMMU is not supported,
>>>> and a blocker is added and the migration is prevented.
>>>>
>>>> When vIOMMU is used, IOVA ranges are DMA mapped/unmapped on the fly as
>>>> requesting by the vIOMMU. These IOVA ranges can potentially be mapped
>>>> anywhere in the vIOMMU IOVA space as advertised by the VMM.
>>>>
>>>> To support device dirty tracking when vIOMMU enabled instead create the
>>>> dirty ranges based on the vIOMMU provided limits, which leads to the
>>>> tracking of the whole IOVA space regardless of what devices use.
>>>>
>>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>> ---
>>>>    include/hw/vfio/vfio-common.h |  1 +
>>>>    hw/vfio/common.c              | 58 +++++++++++++++++++++++++++++------
>>>>    hw/vfio/pci.c                 |  7 +++++
>>>>    3 files changed, 56 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>>> index f41860988d6b..c4bafad084b4 100644
>>>> --- a/include/hw/vfio/vfio-common.h
>>>> +++ b/include/hw/vfio/vfio-common.h
>>>> @@ -71,6 +71,7 @@ typedef struct VFIOMigration {
>>>>    typedef struct VFIOAddressSpace {
>>>>        AddressSpace *as;
>>>>        bool no_dma_translation;
>>>> +    hwaddr max_iova;
>>>>        QLIST_HEAD(, VFIOContainer) containers;
>>>>        QLIST_ENTRY(VFIOAddressSpace) list;
>>>>    } VFIOAddressSpace;
>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>> index ecfb9afb3fb6..85fddef24026 100644
>>>> --- a/hw/vfio/common.c
>>>> +++ b/hw/vfio/common.c
>>>> @@ -428,6 +428,25 @@ static bool vfio_viommu_preset(void)
>>>>        return false;
>>>>    }
>>>>    +static int vfio_viommu_get_max_iova(hwaddr *max_iova)
>>>> +{
>>>> +    VFIOAddressSpace *space;
>>>> +
>>>> +    *max_iova = 0;
>>>> +
>>>> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
>>>> +        if (space->as == &address_space_memory) {
>>>> +            continue;
>>>> +        }
>>> Just curious why address_space_memory is bypassed?
>>>
>> But address_space_memory part is done by memory listeners
> 
> Only this part. Still think about the case with two vfio devices, one bypass
> iommu, the other not.
> 
> The device bypassing iommu will get address_space_memory, the other get iommu
> 
> address space. vfio_viommu_preset() return true for any device, so we never run
> into
> 
> memory listener even for device bypassing iommu?

Yeap, that's correct. When I said earlier 'reworking this' I meant this part
exactly to do both.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 01/15] hw/pci: Add a pci_setup_iommu_ops() helper
  2023-06-22 21:48 ` [PATCH v4 01/15] hw/pci: Add a pci_setup_iommu_ops() helper Joao Martins
@ 2023-10-02 15:12   ` Cédric Le Goater
  2023-10-06  8:38     ` Joao Martins
  2023-10-06  8:45   ` Eric Auger
  1 sibling, 1 reply; 55+ messages in thread
From: Cédric Le Goater @ 2023-10-02 15:12 UTC (permalink / raw)
  To: Joao Martins, qemu-devel
  Cc: Alex Williamson, Paolo Bonzini, Peter Xu, David Hildenbrand,
	Philippe Mathieu-Daude, Michael S. Tsirkin, Marcel Apfelbaum,
	Jason Wang, Richard Henderson, Eduardo Habkost, Avihai Horon,
	Jason Gunthorpe, Yi Liu

Hello Joao,

On 6/22/23 23:48, Joao Martins wrote:
> From: Yi Liu <yi.l.liu@intel.com>
> 
> Add a pci_setup_iommu_ops() that uses a newly added structure
> (PCIIOMMUOps) instead of using PCIIOMMUFunc. The old pci_setup_iommu()
> that uses PCIIOMMUFunc is still kept for other IOMMUs to get an
> an address space for a PCI device in vendor specific way.
> 
> In preparation to expand to supplying vIOMMU attributes, add a
> alternate helper pci_setup_iommu_ops() to setup the PCI device IOMMU.
> For now the PCIIOMMUOps just defines the address_space, but it will
> be extended to have another callback.
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> [joao: Massage commit message and subject, and make it a complementary
> rather than changing every single consumer of pci_setup_iommu()]
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
> v1: https://lore.kernel.org/all/20210302203827.437645-5-yi.l.liu@intel.com/
> ---
>   include/hw/pci/pci.h     |  7 +++++++
>   include/hw/pci/pci_bus.h |  1 +
>   hw/pci/pci.c             | 26 +++++++++++++++++++++++---
>   3 files changed, 31 insertions(+), 3 deletions(-)
> 
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index e6d0574a2999..f59aef5a329a 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -368,6 +368,13 @@ typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
>   AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
>   void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
>   
> +typedef struct PCIIOMMUOps PCIIOMMUOps;
> +struct PCIIOMMUOps {
> +    AddressSpace * (*get_address_space)(PCIBus *bus,
> +                                void *opaque, int32_t devfn);
> +};
> +void pci_setup_iommu_ops(PCIBus *bus, const PCIIOMMUOps *iommu_ops, void *opaque);
> +

I think you should first convert all PHBs to PCIIOMMUOps to avoid all the
tests as below and adapt pci_setup_iommu_ops() with the new parameter.

Thanks,

C.

>   pcibus_t pci_bar_address(PCIDevice *d,
>                            int reg, uint8_t type, pcibus_t size);
>   
> diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
> index 56531759578f..fb770b236d69 100644
> --- a/include/hw/pci/pci_bus.h
> +++ b/include/hw/pci/pci_bus.h
> @@ -35,6 +35,7 @@ struct PCIBus {
>       enum PCIBusFlags flags;
>       PCIIOMMUFunc iommu_fn;
>       void *iommu_opaque;
> +    const PCIIOMMUOps *iommu_ops;
>       uint8_t devfn_min;
>       uint32_t slot_reserved_mask;
>       pci_set_irq_fn set_irq;
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index bf38905b7dc0..4e32c09e81d6 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2639,7 +2639,15 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>       PCIBus *iommu_bus = bus;
>       uint8_t devfn = dev->devfn;
>   
> -    while (iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
> +    /*
> +     * get_address_space() callback is mandatory when iommu uses
> +     * pci_setup_iommu_ops(), so needs to ensure its presence in
> +     * the iommu_bus search.
> +     */
> +    while (iommu_bus &&
> +           !(iommu_bus->iommu_fn ||
> +            (iommu_bus->iommu_ops && iommu_bus->iommu_ops->get_address_space)) &&
> +           iommu_bus->parent_dev) {
>           PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
>   
>           /*
> @@ -2678,8 +2686,14 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>   
>           iommu_bus = parent_bus;
>       }
> -    if (!pci_bus_bypass_iommu(bus) && iommu_bus && iommu_bus->iommu_fn) {
> -        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
> +    if (!pci_bus_bypass_iommu(bus) && iommu_bus) {
> +        if (iommu_bus->iommu_fn) {
> +           return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
> +        } else if (iommu_bus->iommu_ops &&
> +                   iommu_bus->iommu_ops->get_address_space) {
> +           return iommu_bus->iommu_ops->get_address_space(bus,
> +                                           iommu_bus->iommu_opaque, devfn);
> +        }
>       }
>       return &address_space_memory;
>   }
> @@ -2690,6 +2704,12 @@ void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque)
>       bus->iommu_opaque = opaque;
>   }
>   
> +void pci_setup_iommu_ops(PCIBus *bus, const PCIIOMMUOps *ops, void *opaque)
> +{
> +    bus->iommu_ops = ops;
> +    bus->iommu_opaque = opaque;
> +}
> +
>   static void pci_dev_get_w64(PCIBus *b, PCIDevice *dev, void *opaque)
>   {
>       Range *range = opaque;



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 02/15] hw/pci: Refactor pci_device_iommu_address_space()
  2023-06-22 21:48 ` [PATCH v4 02/15] hw/pci: Refactor pci_device_iommu_address_space() Joao Martins
@ 2023-10-02 15:22   ` Cédric Le Goater
  2023-10-06  8:39     ` Joao Martins
  2023-10-06  8:52   ` Eric Auger
  2023-10-06  9:11   ` Eric Auger
  2 siblings, 1 reply; 55+ messages in thread
From: Cédric Le Goater @ 2023-10-02 15:22 UTC (permalink / raw)
  To: Joao Martins, qemu-devel
  Cc: Alex Williamson, Paolo Bonzini, Peter Xu, David Hildenbrand,
	Philippe Mathieu-Daude, Michael S. Tsirkin, Marcel Apfelbaum,
	Jason Wang, Richard Henderson, Eduardo Habkost, Avihai Horon,
	Jason Gunthorpe, Yi Liu

On 6/22/23 23:48, Joao Martins wrote:
> From: Yi Liu <yi.l.liu@intel.com>
> 
> Refactor pci_device_iommu_address_space() and move the
> code that fetches the device bus and iommu bus into its
> own private helper pci_device_get_iommu_bus_devfn().
> 
> This is in preparation to introduce pci_device_iommu_get_attr()
> which will need to use it too.

Where is this routine used ?

Thanks,

C.


> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> [joao: Commit message, and better splitting]
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
> Splitted from v1:
> https://lore.kernel.org/all/20210302203827.437645-6-yi.l.liu@intel.com/
> ---
>   hw/pci/pci.c | 16 ++++++++++++++--
>   1 file changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 4e32c09e81d6..90ae92a43d85 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2632,8 +2632,8 @@ static void pci_device_class_base_init(ObjectClass *klass, void *data)
>           assert(conventional || pcie || cxl);
>       }
>   }
> -
> -AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> +static void pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **pdevbus,
> +                                           PCIBus **pbus, uint8_t *pdevfn)
>   {
>       PCIBus *bus = pci_get_bus(dev);
>       PCIBus *iommu_bus = bus;
> @@ -2686,6 +2686,18 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>   
>           iommu_bus = parent_bus;
>       }
> +
> +    *pdevbus = bus;
> +    *pbus = iommu_bus;
> +    *pdevfn = devfn;
> +}
> +
> +AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> +{
> +    PCIBus *bus, *iommu_bus;
> +    uint8_t devfn;
> +
> +    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
>       if (!pci_bus_bypass_iommu(bus) && iommu_bus) {
>           if (iommu_bus->iommu_fn) {
>              return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 06/15] intel-iommu: Implement get_attr() method
  2023-06-22 21:48 ` [PATCH v4 06/15] intel-iommu: Implement get_attr() method Joao Martins
  2023-09-08  6:23   ` Duan, Zhenzhong
@ 2023-10-02 15:23   ` Cédric Le Goater
  2023-10-06  8:42     ` Joao Martins
  1 sibling, 1 reply; 55+ messages in thread
From: Cédric Le Goater @ 2023-10-02 15:23 UTC (permalink / raw)
  To: Joao Martins, qemu-devel
  Cc: Alex Williamson, Paolo Bonzini, Peter Xu, David Hildenbrand,
	Philippe Mathieu-Daude, Michael S. Tsirkin, Marcel Apfelbaum,
	Jason Wang, Richard Henderson, Eduardo Habkost, Avihai Horon,
	Jason Gunthorpe

On 6/22/23 23:48, Joao Martins wrote:
> Implement IOMMU MR get_attr() method and use the dma_translation
> property to report the IOMMU_ATTR_DMA_TRANSLATION attribute.
> Additionally add the necessary get_iommu_attr into the PCIIOMMUOps to
> support pci_device_iommu_get_attr().
> 
> The callback in there acts as a IOMMU-specific address space walker
> which will call get_attr in the IOMMUMemoryRegion backing the device to
> fetch the desired attribute.

This looks like two patches to me and the previous should be merged with
the one introducing vtd_iommu_get_attr().

Thanks,

C.

> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   hw/i386/intel_iommu.c | 40 ++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 40 insertions(+)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 1606d1b952d0..ed2a46e008df 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -3861,6 +3861,29 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>       return;
>   }
>   
> +static int vtd_iommu_get_attr(IOMMUMemoryRegion *iommu_mr,
> +                              enum IOMMUMemoryRegionAttr attr, void *data)
> +{
> +    VTDAddressSpace *vtd_as = container_of(iommu_mr, VTDAddressSpace, iommu);
> +    IntelIOMMUState *s = vtd_as->iommu_state;
> +    int ret = 0;
> +
> +    switch (attr) {
> +    case IOMMU_ATTR_DMA_TRANSLATION:
> +    {
> +        bool *enabled = (bool *)(uintptr_t) data;
> +
> +        *enabled = s->dma_translation;
> +        break;
> +    }
> +    default:
> +        ret = -EINVAL;
> +        break;
> +    }
> +
> +    return ret;
> +}
> +
>   /* Do the initialization. It will also be called when reset, so pay
>    * attention when adding new initialization stuff.
>    */
> @@ -4032,8 +4055,24 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>       return &vtd_as->as;
>   }
>   
> +static int vtd_get_iommu_attr(PCIBus *bus, void *opaque, int32_t devfn,
> +                              enum IOMMUMemoryRegionAttr attr, void *data)
> +{
> +    IntelIOMMUState *s = opaque;
> +    VTDAddressSpace *vtd_as;
> +
> +    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
> +
> +    vtd_as = vtd_find_add_as(s, bus, devfn, PCI_NO_PASID);
> +    if (!vtd_as)
> +	return -EINVAL;
> +
> +    return memory_region_iommu_get_attr(&vtd_as->iommu, attr, data);
> +}
> +
>   static PCIIOMMUOps vtd_iommu_ops = {
>       .get_address_space = vtd_host_dma_iommu,
> +    .get_iommu_attr = vtd_get_iommu_attr,
>   };
>   
>   static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> @@ -4197,6 +4236,7 @@ static void vtd_iommu_memory_region_class_init(ObjectClass *klass,
>       imrc->translate = vtd_iommu_translate;
>       imrc->notify_flag_changed = vtd_iommu_notify_flag_changed;
>       imrc->replay = vtd_iommu_replay;
> +    imrc->get_attr = vtd_iommu_get_attr;
>   }
>   
>   static const TypeInfo vtd_iommu_memory_region_info = {



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 10/15] intel-iommu: Implement IOMMU_ATTR_MAX_IOVA get_attr() attribute
  2023-07-10 13:44     ` Joao Martins
@ 2023-10-02 15:42       ` Cédric Le Goater
  2023-10-06  8:43         ` Joao Martins
  0 siblings, 1 reply; 55+ messages in thread
From: Cédric Le Goater @ 2023-10-02 15:42 UTC (permalink / raw)
  To: Joao Martins, Avihai Horon, qemu-devel
  Cc: Alex Williamson, Paolo Bonzini, Peter Xu, David Hildenbrand,
	Philippe Mathieu-Daude, Michael S. Tsirkin, Marcel Apfelbaum,
	Jason Wang, Richard Henderson, Eduardo Habkost, Jason Gunthorpe

On 7/10/23 15:44, Joao Martins wrote:
> 
> 
> On 09/07/2023 16:17, Avihai Horon wrote:
>>
>> On 23/06/2023 0:48, Joao Martins wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> From: Avihai Horon <avihaih@nvidia.com>
>>>
>>> Implement get_attr() method and use the address width property to report
>>> the IOMMU_ATTR_MAX_IOVA attribute.
>>
>> Nit: get_attr() method was already implemented in patch #6.
>> Maybe just "Use address width property to report IOMMU_ATTR_MAX_IOVA attribute"?
>>
> Yeap, makes sense.

I would merge with the previous patch also.

> 
>> Thanks.
>>
>>>
>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>> ---
>>>    hw/i386/intel_iommu.c | 7 +++++++
>>>    1 file changed, 7 insertions(+)
>>>
>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>> index ed2a46e008df..989993e303a6 100644
>>> --- a/hw/i386/intel_iommu.c
>>> +++ b/hw/i386/intel_iommu.c
>>> @@ -3876,6 +3876,13 @@ static int vtd_iommu_get_attr(IOMMUMemoryRegion *iommu_mr,
>>>            *enabled = s->dma_translation;
>>>            break;
>>>        }
>>> +    case IOMMU_ATTR_MAX_IOVA:
>>> +    {
>>> +        hwaddr *max_iova = (hwaddr *)(uintptr_t) data;
>>> +
>>> +        *max_iova = MAKE_64BIT_MASK(0, s->aw_bits);;

one ; is enough.

Thanks,

C.

>>> +        break;
>>> +    }
>>>        default:
>>>            ret = -EINVAL;
>>>            break;
>>> -- 
>>> 2.17.2
>>>
> 



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 01/15] hw/pci: Add a pci_setup_iommu_ops() helper
  2023-10-02 15:12   ` Cédric Le Goater
@ 2023-10-06  8:38     ` Joao Martins
  2023-10-06  8:50       ` Cédric Le Goater
  0 siblings, 1 reply; 55+ messages in thread
From: Joao Martins @ 2023-10-06  8:38 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Paolo Bonzini, Peter Xu, David Hildenbrand,
	Philippe Mathieu-Daude, Michael S. Tsirkin, Marcel Apfelbaum,
	Jason Wang, Richard Henderson, Eduardo Habkost, Avihai Horon,
	Jason Gunthorpe, Yi Liu

On 02/10/2023 16:12, Cédric Le Goater wrote:
> Hello Joao,
> 
> On 6/22/23 23:48, Joao Martins wrote:
>> From: Yi Liu <yi.l.liu@intel.com>
>>
>> Add a pci_setup_iommu_ops() that uses a newly added structure
>> (PCIIOMMUOps) instead of using PCIIOMMUFunc. The old pci_setup_iommu()
>> that uses PCIIOMMUFunc is still kept for other IOMMUs to get an
>> an address space for a PCI device in vendor specific way.
>>
>> In preparation to expand to supplying vIOMMU attributes, add a
>> alternate helper pci_setup_iommu_ops() to setup the PCI device IOMMU.
>> For now the PCIIOMMUOps just defines the address_space, but it will
>> be extended to have another callback.
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> [joao: Massage commit message and subject, and make it a complementary
>> rather than changing every single consumer of pci_setup_iommu()]
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>> v1: https://lore.kernel.org/all/20210302203827.437645-5-yi.l.liu@intel.com/
>> ---
>>   include/hw/pci/pci.h     |  7 +++++++
>>   include/hw/pci/pci_bus.h |  1 +
>>   hw/pci/pci.c             | 26 +++++++++++++++++++++++---
>>   3 files changed, 31 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>> index e6d0574a2999..f59aef5a329a 100644
>> --- a/include/hw/pci/pci.h
>> +++ b/include/hw/pci/pci.h
>> @@ -368,6 +368,13 @@ typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *,
>> int);
>>   AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
>>   void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
>>   +typedef struct PCIIOMMUOps PCIIOMMUOps;
>> +struct PCIIOMMUOps {
>> +    AddressSpace * (*get_address_space)(PCIBus *bus,
>> +                                void *opaque, int32_t devfn);
>> +};
>> +void pci_setup_iommu_ops(PCIBus *bus, const PCIIOMMUOps *iommu_ops, void
>> *opaque);
>> +
> 
> I think you should first convert all PHBs to PCIIOMMUOps to avoid all the
> tests as below and adapt pci_setup_iommu_ops() with the new parameter.
> 

OK, that's Yi's original patch:

https://lore.kernel.org/all/20210302203827.437645-5-yi.l.liu@intel.com/

I went with this one is that 1) it might take eons to get every single IOMMU
maintainer ack; and 2) it would allow each IOMMU to move at its own speed
specially as I can't test most of the other ones. essentially iterative, rather
than invasive change? Does that make sense?

	Joao


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 02/15] hw/pci: Refactor pci_device_iommu_address_space()
  2023-10-02 15:22   ` Cédric Le Goater
@ 2023-10-06  8:39     ` Joao Martins
  2023-10-06  8:40       ` Joao Martins
  0 siblings, 1 reply; 55+ messages in thread
From: Joao Martins @ 2023-10-06  8:39 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Paolo Bonzini, Peter Xu, David Hildenbrand,
	Philippe Mathieu-Daude, Michael S. Tsirkin, Marcel Apfelbaum,
	Jason Wang, Richard Henderson, Eduardo Habkost, Avihai Horon,
	Jason Gunthorpe, Yi Liu



On 02/10/2023 16:22, Cédric Le Goater wrote:
> On 6/22/23 23:48, Joao Martins wrote:
>> From: Yi Liu <yi.l.liu@intel.com>
>>
>> Refactor pci_device_iommu_address_space() and move the
>> code that fetches the device bus and iommu bus into its
>> own private helper pci_device_get_iommu_bus_devfn().
>>
>> This is in preparation to introduce pci_device_iommu_get_attr()
>> which will need to use it too.
> 
> Where is this routine used ?
> 

Patch 7 to understand if a configured vIOMMU supports DMA translation,
regardless of whether the guest is doing.

> Thanks,
> 
> C.
> 
> 
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> [joao: Commit message, and better splitting]
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>> Splitted from v1:
>> https://lore.kernel.org/all/20210302203827.437645-6-yi.l.liu@intel.com/
>> ---
>>   hw/pci/pci.c | 16 ++++++++++++++--
>>   1 file changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index 4e32c09e81d6..90ae92a43d85 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -2632,8 +2632,8 @@ static void pci_device_class_base_init(ObjectClass
>> *klass, void *data)
>>           assert(conventional || pcie || cxl);
>>       }
>>   }
>> -
>> -AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>> +static void pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **pdevbus,
>> +                                           PCIBus **pbus, uint8_t *pdevfn)
>>   {
>>       PCIBus *bus = pci_get_bus(dev);
>>       PCIBus *iommu_bus = bus;
>> @@ -2686,6 +2686,18 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice
>> *dev)
>>             iommu_bus = parent_bus;
>>       }
>> +
>> +    *pdevbus = bus;
>> +    *pbus = iommu_bus;
>> +    *pdevfn = devfn;
>> +}
>> +
>> +AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>> +{
>> +    PCIBus *bus, *iommu_bus;
>> +    uint8_t devfn;
>> +
>> +    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
>>       if (!pci_bus_bypass_iommu(bus) && iommu_bus) {
>>           if (iommu_bus->iommu_fn) {
>>              return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 02/15] hw/pci: Refactor pci_device_iommu_address_space()
  2023-10-06  8:39     ` Joao Martins
@ 2023-10-06  8:40       ` Joao Martins
  0 siblings, 0 replies; 55+ messages in thread
From: Joao Martins @ 2023-10-06  8:40 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Paolo Bonzini, Peter Xu, David Hildenbrand,
	Philippe Mathieu-Daude, Michael S. Tsirkin, Marcel Apfelbaum,
	Jason Wang, Richard Henderson, Eduardo Habkost, Avihai Horon,
	Jason Gunthorpe, Yi Liu



On 06/10/2023 09:39, Joao Martins wrote:
> 
> 
> On 02/10/2023 16:22, Cédric Le Goater wrote:
>> On 6/22/23 23:48, Joao Martins wrote:
>>> From: Yi Liu <yi.l.liu@intel.com>
>>>
>>> Refactor pci_device_iommu_address_space() and move the
>>> code that fetches the device bus and iommu bus into its
>>> own private helper pci_device_get_iommu_bus_devfn().
>>>
>>> This is in preparation to introduce pci_device_iommu_get_attr()
>>> which will need to use it too.
>>
>> Where is this routine used ?
>>
> 
> Patch 7 to understand if a configured vIOMMU supports DMA translation,
> regardless of whether the guest is doing.
> 

And then patch 12 to see the max IOVA boundaries of the vIOMMU possible address
space.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 06/15] intel-iommu: Implement get_attr() method
  2023-10-02 15:23   ` Cédric Le Goater
@ 2023-10-06  8:42     ` Joao Martins
  0 siblings, 0 replies; 55+ messages in thread
From: Joao Martins @ 2023-10-06  8:42 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Paolo Bonzini, Peter Xu, David Hildenbrand,
	Philippe Mathieu-Daude, Michael S. Tsirkin, Marcel Apfelbaum,
	Jason Wang, Richard Henderson, Eduardo Habkost, Avihai Horon,
	Jason Gunthorpe



On 02/10/2023 16:23, Cédric Le Goater wrote:
> On 6/22/23 23:48, Joao Martins wrote:
>> Implement IOMMU MR get_attr() method and use the dma_translation
>> property to report the IOMMU_ATTR_DMA_TRANSLATION attribute.
>> Additionally add the necessary get_iommu_attr into the PCIIOMMUOps to
>> support pci_device_iommu_get_attr().
>>
>> The callback in there acts as a IOMMU-specific address space walker
>> which will call get_attr in the IOMMUMemoryRegion backing the device to
>> fetch the desired attribute.
> 
> This looks like two patches to me and the previous should be merged with
> the one introducing vtd_iommu_get_attr().
> 

The reason was that one is a bit useless without the other. But I can spread
into two patches. And I can merge the previous to the one introducing
vtd_iommu_get_attr()

> Thanks,
> 
> C.
> 
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   hw/i386/intel_iommu.c | 40 ++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 40 insertions(+)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 1606d1b952d0..ed2a46e008df 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -3861,6 +3861,29 @@ static void vtd_iommu_replay(IOMMUMemoryRegion
>> *iommu_mr, IOMMUNotifier *n)
>>       return;
>>   }
>>   +static int vtd_iommu_get_attr(IOMMUMemoryRegion *iommu_mr,
>> +                              enum IOMMUMemoryRegionAttr attr, void *data)
>> +{
>> +    VTDAddressSpace *vtd_as = container_of(iommu_mr, VTDAddressSpace, iommu);
>> +    IntelIOMMUState *s = vtd_as->iommu_state;
>> +    int ret = 0;
>> +
>> +    switch (attr) {
>> +    case IOMMU_ATTR_DMA_TRANSLATION:
>> +    {
>> +        bool *enabled = (bool *)(uintptr_t) data;
>> +
>> +        *enabled = s->dma_translation;
>> +        break;
>> +    }
>> +    default:
>> +        ret = -EINVAL;
>> +        break;
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>>   /* Do the initialization. It will also be called when reset, so pay
>>    * attention when adding new initialization stuff.
>>    */
>> @@ -4032,8 +4055,24 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus,
>> void *opaque, int devfn)
>>       return &vtd_as->as;
>>   }
>>   +static int vtd_get_iommu_attr(PCIBus *bus, void *opaque, int32_t devfn,
>> +                              enum IOMMUMemoryRegionAttr attr, void *data)
>> +{
>> +    IntelIOMMUState *s = opaque;
>> +    VTDAddressSpace *vtd_as;
>> +
>> +    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
>> +
>> +    vtd_as = vtd_find_add_as(s, bus, devfn, PCI_NO_PASID);
>> +    if (!vtd_as)
>> +    return -EINVAL;
>> +
>> +    return memory_region_iommu_get_attr(&vtd_as->iommu, attr, data);
>> +}
>> +
>>   static PCIIOMMUOps vtd_iommu_ops = {
>>       .get_address_space = vtd_host_dma_iommu,
>> +    .get_iommu_attr = vtd_get_iommu_attr,
>>   };
>>     static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
>> @@ -4197,6 +4236,7 @@ static void
>> vtd_iommu_memory_region_class_init(ObjectClass *klass,
>>       imrc->translate = vtd_iommu_translate;
>>       imrc->notify_flag_changed = vtd_iommu_notify_flag_changed;
>>       imrc->replay = vtd_iommu_replay;
>> +    imrc->get_attr = vtd_iommu_get_attr;
>>   }
>>     static const TypeInfo vtd_iommu_memory_region_info = {
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 10/15] intel-iommu: Implement IOMMU_ATTR_MAX_IOVA get_attr() attribute
  2023-10-02 15:42       ` Cédric Le Goater
@ 2023-10-06  8:43         ` Joao Martins
  0 siblings, 0 replies; 55+ messages in thread
From: Joao Martins @ 2023-10-06  8:43 UTC (permalink / raw)
  To: Cédric Le Goater, Avihai Horon, qemu-devel
  Cc: Alex Williamson, Paolo Bonzini, Peter Xu, David Hildenbrand,
	Philippe Mathieu-Daude, Michael S. Tsirkin, Marcel Apfelbaum,
	Jason Wang, Richard Henderson, Eduardo Habkost, Jason Gunthorpe



On 02/10/2023 16:42, Cédric Le Goater wrote:
> On 7/10/23 15:44, Joao Martins wrote:
>>
>>
>> On 09/07/2023 16:17, Avihai Horon wrote:
>>>
>>> On 23/06/2023 0:48, Joao Martins wrote:
>>>> External email: Use caution opening links or attachments
>>>>
>>>>
>>>> From: Avihai Horon <avihaih@nvidia.com>
>>>>
>>>> Implement get_attr() method and use the address width property to report
>>>> the IOMMU_ATTR_MAX_IOVA attribute.
>>>
>>> Nit: get_attr() method was already implemented in patch #6.
>>> Maybe just "Use address width property to report IOMMU_ATTR_MAX_IOVA attribute"?
>>>
>> Yeap, makes sense.
> 
> I would merge with the previous patch also.
> 

OK

>>
>>> Thanks.
>>>
>>>>
>>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>> ---
>>>>    hw/i386/intel_iommu.c | 7 +++++++
>>>>    1 file changed, 7 insertions(+)
>>>>
>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>> index ed2a46e008df..989993e303a6 100644
>>>> --- a/hw/i386/intel_iommu.c
>>>> +++ b/hw/i386/intel_iommu.c
>>>> @@ -3876,6 +3876,13 @@ static int vtd_iommu_get_attr(IOMMUMemoryRegion
>>>> *iommu_mr,
>>>>            *enabled = s->dma_translation;
>>>>            break;
>>>>        }
>>>> +    case IOMMU_ATTR_MAX_IOVA:
>>>> +    {
>>>> +        hwaddr *max_iova = (hwaddr *)(uintptr_t) data;
>>>> +
>>>> +        *max_iova = MAKE_64BIT_MASK(0, s->aw_bits);;
> 
> one ; is enough.
> 
Oh, spurious one, let me remove it.

> Thanks,
> 
> C.
> 
>>>> +        break;
>>>> +    }
>>>>        default:
>>>>            ret = -EINVAL;
>>>>            break;
>>>> -- 
>>>> 2.17.2
>>>>
>>
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 01/15] hw/pci: Add a pci_setup_iommu_ops() helper
  2023-06-22 21:48 ` [PATCH v4 01/15] hw/pci: Add a pci_setup_iommu_ops() helper Joao Martins
  2023-10-02 15:12   ` Cédric Le Goater
@ 2023-10-06  8:45   ` Eric Auger
  2023-10-06 11:03     ` Joao Martins
  1 sibling, 1 reply; 55+ messages in thread
From: Eric Auger @ 2023-10-06  8:45 UTC (permalink / raw)
  To: Joao Martins, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Yi Liu

Hi Joao,

On 6/22/23 23:48, Joao Martins wrote:
> From: Yi Liu <yi.l.liu@intel.com>
> 
> Add a pci_setup_iommu_ops() that uses a newly added structure
> (PCIIOMMUOps) instead of using PCIIOMMUFunc. The old pci_setup_iommu()
> that uses PCIIOMMUFunc is still kept for other IOMMUs to get an
> an address space for a PCI device in vendor specific way.
double 'an'
> 
> In preparation to expand to supplying vIOMMU attributes, add a
> alternate helper pci_setup_iommu_ops() to setup the PCI device IOMMU.
> For now the PCIIOMMUOps just defines the address_space, but it will
> be extended to have another callback.
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> [joao: Massage commit message and subject, and make it a complementary
> rather than changing every single consumer of pci_setup_iommu()]
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
> v1: https://lore.kernel.org/all/20210302203827.437645-5-yi.l.liu@intel.com/
> ---
>  include/hw/pci/pci.h     |  7 +++++++
>  include/hw/pci/pci_bus.h |  1 +
>  hw/pci/pci.c             | 26 +++++++++++++++++++++++---
>  3 files changed, 31 insertions(+), 3 deletions(-)
> 
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index e6d0574a2999..f59aef5a329a 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -368,6 +368,13 @@ typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
>  AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
>  void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
>  
> +typedef struct PCIIOMMUOps PCIIOMMUOps;
> +struct PCIIOMMUOps {
> +    AddressSpace * (*get_address_space)(PCIBus *bus,
> +                                void *opaque, int32_t devfn);
> +};
> +void pci_setup_iommu_ops(PCIBus *bus, const PCIIOMMUOps *iommu_ops, void *opaque);
> +
>  pcibus_t pci_bar_address(PCIDevice *d,
>                           int reg, uint8_t type, pcibus_t size);
>  
> diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
> index 56531759578f..fb770b236d69 100644
> --- a/include/hw/pci/pci_bus.h
> +++ b/include/hw/pci/pci_bus.h
> @@ -35,6 +35,7 @@ struct PCIBus {
>      enum PCIBusFlags flags;
>      PCIIOMMUFunc iommu_fn;
>      void *iommu_opaque;
> +    const PCIIOMMUOps *iommu_ops;
>      uint8_t devfn_min;
>      uint32_t slot_reserved_mask;
>      pci_set_irq_fn set_irq;
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index bf38905b7dc0..4e32c09e81d6 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2639,7 +2639,15 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>      PCIBus *iommu_bus = bus;
>      uint8_t devfn = dev->devfn;
>  
> -    while (iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
> +    /*
> +     * get_address_space() callback is mandatory when iommu uses
> +     * pci_setup_iommu_ops(), so needs to ensure its presence in
> +     * the iommu_bus search.
the fact that it is mandatory should also be documented along with the
PCIIOMMUOps struct definition and enforced at  pci_setup_iommu_ops()
removing the need for checking iommu_bus->iommu_ops->get_address_space
> +     */
> +    while (iommu_bus &&
> +           !(iommu_bus->iommu_fn ||
> +            (iommu_bus->iommu_ops && iommu_bus->iommu_ops->get_address_space)) &&
> +           iommu_bus->parent_dev) {
>          PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
>  
>          /*
> @@ -2678,8 +2686,14 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>  
>          iommu_bus = parent_bus;
>      }
> -    if (!pci_bus_bypass_iommu(bus) && iommu_bus && iommu_bus->iommu_fn) {
> -        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
> +    if (!pci_bus_bypass_iommu(bus) && iommu_bus) {
> +        if (iommu_bus->iommu_fn) {
> +           return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
> +        } else if (iommu_bus->iommu_ops &&
> +                   iommu_bus->iommu_ops->get_address_space) {
> +           return iommu_bus->iommu_ops->get_address_space(bus,
> +                                           iommu_bus->iommu_opaque, devfn);
> +        }
>      }
>      return &address_space_memory;
>  }
> @@ -2690,6 +2704,12 @@ void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque)
>      bus->iommu_opaque = opaque;
>  }
>  
> +void pci_setup_iommu_ops(PCIBus *bus, const PCIIOMMUOps *ops, void *opaque)
> +{
> +    bus->iommu_ops = ops;
> +    bus->iommu_opaque = opaque;
maybe assert if both iommu_fn and iommu_ops are set to make sure the
conversion is not done partially? If I understand it correctly either of
those 2 ops should be set and not both.
> +}
> +
>  static void pci_dev_get_w64(PCIBus *b, PCIDevice *dev, void *opaque)
>  {
>      Range *range = opaque;

Thanks

Eric



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 01/15] hw/pci: Add a pci_setup_iommu_ops() helper
  2023-10-06  8:38     ` Joao Martins
@ 2023-10-06  8:50       ` Cédric Le Goater
  2023-10-06 11:06         ` Joao Martins
  0 siblings, 1 reply; 55+ messages in thread
From: Cédric Le Goater @ 2023-10-06  8:50 UTC (permalink / raw)
  To: Joao Martins, qemu-devel
  Cc: Alex Williamson, Paolo Bonzini, Peter Xu, David Hildenbrand,
	Philippe Mathieu-Daude, Michael S. Tsirkin, Marcel Apfelbaum,
	Jason Wang, Richard Henderson, Eduardo Habkost, Avihai Horon,
	Jason Gunthorpe, Yi Liu

On 10/6/23 10:38, Joao Martins wrote:
> On 02/10/2023 16:12, Cédric Le Goater wrote:
>> Hello Joao,
>>
>> On 6/22/23 23:48, Joao Martins wrote:
>>> From: Yi Liu <yi.l.liu@intel.com>
>>>
>>> Add a pci_setup_iommu_ops() that uses a newly added structure
>>> (PCIIOMMUOps) instead of using PCIIOMMUFunc. The old pci_setup_iommu()
>>> that uses PCIIOMMUFunc is still kept for other IOMMUs to get an
>>> an address space for a PCI device in vendor specific way.
>>>
>>> In preparation to expand to supplying vIOMMU attributes, add a
>>> alternate helper pci_setup_iommu_ops() to setup the PCI device IOMMU.
>>> For now the PCIIOMMUOps just defines the address_space, but it will
>>> be extended to have another callback.
>>>
>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>> [joao: Massage commit message and subject, and make it a complementary
>>> rather than changing every single consumer of pci_setup_iommu()]
>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>> ---
>>> v1: https://lore.kernel.org/all/20210302203827.437645-5-yi.l.liu@intel.com/
>>> ---
>>>    include/hw/pci/pci.h     |  7 +++++++
>>>    include/hw/pci/pci_bus.h |  1 +
>>>    hw/pci/pci.c             | 26 +++++++++++++++++++++++---
>>>    3 files changed, 31 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>>> index e6d0574a2999..f59aef5a329a 100644
>>> --- a/include/hw/pci/pci.h
>>> +++ b/include/hw/pci/pci.h
>>> @@ -368,6 +368,13 @@ typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *,
>>> int);
>>>    AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
>>>    void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
>>>    +typedef struct PCIIOMMUOps PCIIOMMUOps;
>>> +struct PCIIOMMUOps {
>>> +    AddressSpace * (*get_address_space)(PCIBus *bus,
>>> +                                void *opaque, int32_t devfn);
>>> +};
>>> +void pci_setup_iommu_ops(PCIBus *bus, const PCIIOMMUOps *iommu_ops, void
>>> *opaque);
>>> +
>>
>> I think you should first convert all PHBs to PCIIOMMUOps to avoid all the
>> tests as below and adapt pci_setup_iommu_ops() with the new parameter.
>>
> 
> OK, that's Yi's original patch:
> 
> https://lore.kernel.org/all/20210302203827.437645-5-yi.l.liu@intel.com/
> 
> I went with this one is that 1) it might take eons to get every single IOMMU
> maintainer ack; and 2) it would allow each IOMMU to move at its own speed
> specially as I can't test most of the other ones. essentially iterative, rather
> than invasive change? Does that make sense?

I think it is ok to make global changes to replace a function by a struct
of ops. This is not major (unless the extra indirection has a major perf
impact on some platforms). Getting acks from everyone will be difficult
since some PHBs are orphans.

C.




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 02/15] hw/pci: Refactor pci_device_iommu_address_space()
  2023-06-22 21:48 ` [PATCH v4 02/15] hw/pci: Refactor pci_device_iommu_address_space() Joao Martins
  2023-10-02 15:22   ` Cédric Le Goater
@ 2023-10-06  8:52   ` Eric Auger
  2023-10-06 11:07     ` Joao Martins
  2023-10-06  9:11   ` Eric Auger
  2 siblings, 1 reply; 55+ messages in thread
From: Eric Auger @ 2023-10-06  8:52 UTC (permalink / raw)
  To: Joao Martins, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Yi Liu

Hi Joao,

On 6/22/23 23:48, Joao Martins wrote:
> From: Yi Liu <yi.l.liu@intel.com>
> 
> Refactor pci_device_iommu_address_space() and move the
> code that fetches the device bus and iommu bus into its
> own private helper pci_device_get_iommu_bus_devfn().
> 
> This is in preparation to introduce pci_device_iommu_get_attr()
> which will need to use it too.

you may add:
no functional change intended.
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> [joao: Commit message, and better splitting]
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric

> ---
> Splitted from v1:
> https://lore.kernel.org/all/20210302203827.437645-6-yi.l.liu@intel.com/
> ---
>  hw/pci/pci.c | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 4e32c09e81d6..90ae92a43d85 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2632,8 +2632,8 @@ static void pci_device_class_base_init(ObjectClass *klass, void *data)
>          assert(conventional || pcie || cxl);
>      }
>  }
> -
> -AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> +static void pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **pdevbus,
> +                                           PCIBus **pbus, uint8_t *pdevfn)
>  {
>      PCIBus *bus = pci_get_bus(dev);
>      PCIBus *iommu_bus = bus;
> @@ -2686,6 +2686,18 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>  
>          iommu_bus = parent_bus;
>      }
> +
> +    *pdevbus = bus;
> +    *pbus = iommu_bus;
> +    *pdevfn = devfn;
> +}
> +
> +AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> +{
> +    PCIBus *bus, *iommu_bus;
> +    uint8_t devfn;
> +
> +    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
>      if (!pci_bus_bypass_iommu(bus) && iommu_bus) {
>          if (iommu_bus->iommu_fn) {
>             return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 02/15] hw/pci: Refactor pci_device_iommu_address_space()
  2023-06-22 21:48 ` [PATCH v4 02/15] hw/pci: Refactor pci_device_iommu_address_space() Joao Martins
  2023-10-02 15:22   ` Cédric Le Goater
  2023-10-06  8:52   ` Eric Auger
@ 2023-10-06  9:11   ` Eric Auger
  2 siblings, 0 replies; 55+ messages in thread
From: Eric Auger @ 2023-10-06  9:11 UTC (permalink / raw)
  To: Joao Martins, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Yi Liu

Hi Joao,

On 6/22/23 23:48, Joao Martins wrote:
> From: Yi Liu <yi.l.liu@intel.com>
> 
> Refactor pci_device_iommu_address_space() and move the
> code that fetches the device bus and iommu bus into its
> own private helper pci_device_get_iommu_bus_devfn().
> 
> This is in preparation to introduce pci_device_iommu_get_attr()
> which will need to use it too.
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> [joao: Commit message, and better splitting]
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
> Splitted from v1:
> https://lore.kernel.org/all/20210302203827.437645-6-yi.l.liu@intel.com/
> ---
>  hw/pci/pci.c | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 4e32c09e81d6..90ae92a43d85 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2632,8 +2632,8 @@ static void pci_device_class_base_init(ObjectClass *klass, void *data)
>          assert(conventional || pcie || cxl);
>      }
>  }
> -
> -AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> +static void pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **pdevbus,
> +                                           PCIBus **pbus, uint8_t *pdevfn)
actually I would rename pbus arg to match what it is, ie the iommu_bus
also not sure the 'p' prefix is needed

By the way can the iommu_bus be null? I see it initialized to
pci_get_bus(dev);

What does characterize an iommu bus? It must have either iommu_fn or
iommu_ops set, right?

Eric

>  {
>      PCIBus *bus = pci_get_bus(dev);
>      PCIBus *iommu_bus = bus;
> @@ -2686,6 +2686,18 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>  
>          iommu_bus = parent_bus;
>      }
> +
> +    *pdevbus = bus;
> +    *pbus = iommu_bus;
> +    *pdevfn = devfn;
> +}
> +
> +AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> +{
> +    PCIBus *bus, *iommu_bus;
> +    uint8_t devfn;
> +
> +    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
>      if (!pci_bus_bypass_iommu(bus) && iommu_bus) {
>          if (iommu_bus->iommu_fn) {
>             return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 01/15] hw/pci: Add a pci_setup_iommu_ops() helper
  2023-10-06  8:45   ` Eric Auger
@ 2023-10-06 11:03     ` Joao Martins
  0 siblings, 0 replies; 55+ messages in thread
From: Joao Martins @ 2023-10-06 11:03 UTC (permalink / raw)
  To: Eric Auger, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Yi Liu



On 06/10/2023 09:45, Eric Auger wrote:
> Hi Joao,
> 
> On 6/22/23 23:48, Joao Martins wrote:
>> From: Yi Liu <yi.l.liu@intel.com>
>>
>> Add a pci_setup_iommu_ops() that uses a newly added structure
>> (PCIIOMMUOps) instead of using PCIIOMMUFunc. The old pci_setup_iommu()
>> that uses PCIIOMMUFunc is still kept for other IOMMUs to get an
>> an address space for a PCI device in vendor specific way.
> double 'an'

OK

>>
>> In preparation to expand to supplying vIOMMU attributes, add a
>> alternate helper pci_setup_iommu_ops() to setup the PCI device IOMMU.
>> For now the PCIIOMMUOps just defines the address_space, but it will
>> be extended to have another callback.
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> [joao: Massage commit message and subject, and make it a complementary
>> rather than changing every single consumer of pci_setup_iommu()]
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>> v1: https://lore.kernel.org/all/20210302203827.437645-5-yi.l.liu@intel.com/
>> ---
>>  include/hw/pci/pci.h     |  7 +++++++
>>  include/hw/pci/pci_bus.h |  1 +
>>  hw/pci/pci.c             | 26 +++++++++++++++++++++++---
>>  3 files changed, 31 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>> index e6d0574a2999..f59aef5a329a 100644
>> --- a/include/hw/pci/pci.h
>> +++ b/include/hw/pci/pci.h
>> @@ -368,6 +368,13 @@ typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
>>  AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
>>  void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
>>  
>> +typedef struct PCIIOMMUOps PCIIOMMUOps;
>> +struct PCIIOMMUOps {
>> +    AddressSpace * (*get_address_space)(PCIBus *bus,
>> +                                void *opaque, int32_t devfn);
>> +};
>> +void pci_setup_iommu_ops(PCIBus *bus, const PCIIOMMUOps *iommu_ops, void *opaque);
>> +
>>  pcibus_t pci_bar_address(PCIDevice *d,
>>                           int reg, uint8_t type, pcibus_t size);
>>  
>> diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
>> index 56531759578f..fb770b236d69 100644
>> --- a/include/hw/pci/pci_bus.h
>> +++ b/include/hw/pci/pci_bus.h
>> @@ -35,6 +35,7 @@ struct PCIBus {
>>      enum PCIBusFlags flags;
>>      PCIIOMMUFunc iommu_fn;
>>      void *iommu_opaque;
>> +    const PCIIOMMUOps *iommu_ops;
>>      uint8_t devfn_min;
>>      uint32_t slot_reserved_mask;
>>      pci_set_irq_fn set_irq;
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index bf38905b7dc0..4e32c09e81d6 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -2639,7 +2639,15 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>>      PCIBus *iommu_bus = bus;
>>      uint8_t devfn = dev->devfn;
>>  
>> -    while (iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
>> +    /*
>> +     * get_address_space() callback is mandatory when iommu uses
>> +     * pci_setup_iommu_ops(), so needs to ensure its presence in
>> +     * the iommu_bus search.
> the fact that it is mandatory should also be documented along with the
> PCIIOMMUOps struct definition and enforced at  pci_setup_iommu_ops()
> removing the need for checking iommu_bus->iommu_ops->get_address_space

OK

>> +     */
>> +    while (iommu_bus &&
>> +           !(iommu_bus->iommu_fn ||
>> +            (iommu_bus->iommu_ops && iommu_bus->iommu_ops->get_address_space)) &&
>> +           iommu_bus->parent_dev) {
>>          PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
>>  
>>          /*
>> @@ -2678,8 +2686,14 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>>  
>>          iommu_bus = parent_bus;
>>      }
>> -    if (!pci_bus_bypass_iommu(bus) && iommu_bus && iommu_bus->iommu_fn) {
>> -        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
>> +    if (!pci_bus_bypass_iommu(bus) && iommu_bus) {
>> +        if (iommu_bus->iommu_fn) {
>> +           return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
>> +        } else if (iommu_bus->iommu_ops &&
>> +                   iommu_bus->iommu_ops->get_address_space) {
>> +           return iommu_bus->iommu_ops->get_address_space(bus,
>> +                                           iommu_bus->iommu_opaque, devfn);
>> +        }
>>      }
>>      return &address_space_memory;
>>  }
>> @@ -2690,6 +2704,12 @@ void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque)
>>      bus->iommu_opaque = opaque;
>>  }
>>  
>> +void pci_setup_iommu_ops(PCIBus *bus, const PCIIOMMUOps *ops, void *opaque)
>> +{
>> +    bus->iommu_ops = ops;
>> +    bus->iommu_opaque = opaque;
> maybe assert if both iommu_fn and iommu_ops are set to make sure the
> conversion is not done partially? If I understand it correctly either of
> those 2 ops should be set and not both.

Good idea.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 01/15] hw/pci: Add a pci_setup_iommu_ops() helper
  2023-10-06  8:50       ` Cédric Le Goater
@ 2023-10-06 11:06         ` Joao Martins
  2023-10-06 17:09           ` Cédric Le Goater
  0 siblings, 1 reply; 55+ messages in thread
From: Joao Martins @ 2023-10-06 11:06 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Paolo Bonzini, Peter Xu, David Hildenbrand,
	Philippe Mathieu-Daude, Michael S. Tsirkin, Marcel Apfelbaum,
	Jason Wang, Richard Henderson, Eduardo Habkost, Avihai Horon,
	Jason Gunthorpe, Yi Liu



On 06/10/2023 09:50, Cédric Le Goater wrote:
> On 10/6/23 10:38, Joao Martins wrote:
>> On 02/10/2023 16:12, Cédric Le Goater wrote:
>>> Hello Joao,
>>>
>>> On 6/22/23 23:48, Joao Martins wrote:
>>>> From: Yi Liu <yi.l.liu@intel.com>
>>>>
>>>> Add a pci_setup_iommu_ops() that uses a newly added structure
>>>> (PCIIOMMUOps) instead of using PCIIOMMUFunc. The old pci_setup_iommu()
>>>> that uses PCIIOMMUFunc is still kept for other IOMMUs to get an
>>>> an address space for a PCI device in vendor specific way.
>>>>
>>>> In preparation to expand to supplying vIOMMU attributes, add a
>>>> alternate helper pci_setup_iommu_ops() to setup the PCI device IOMMU.
>>>> For now the PCIIOMMUOps just defines the address_space, but it will
>>>> be extended to have another callback.
>>>>
>>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>>> [joao: Massage commit message and subject, and make it a complementary
>>>> rather than changing every single consumer of pci_setup_iommu()]
>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>> ---
>>>> v1: https://lore.kernel.org/all/20210302203827.437645-5-yi.l.liu@intel.com/
>>>> ---
>>>>    include/hw/pci/pci.h     |  7 +++++++
>>>>    include/hw/pci/pci_bus.h |  1 +
>>>>    hw/pci/pci.c             | 26 +++++++++++++++++++++++---
>>>>    3 files changed, 31 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>>>> index e6d0574a2999..f59aef5a329a 100644
>>>> --- a/include/hw/pci/pci.h
>>>> +++ b/include/hw/pci/pci.h
>>>> @@ -368,6 +368,13 @@ typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *,
>>>> int);
>>>>    AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
>>>>    void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
>>>>    +typedef struct PCIIOMMUOps PCIIOMMUOps;
>>>> +struct PCIIOMMUOps {
>>>> +    AddressSpace * (*get_address_space)(PCIBus *bus,
>>>> +                                void *opaque, int32_t devfn);
>>>> +};
>>>> +void pci_setup_iommu_ops(PCIBus *bus, const PCIIOMMUOps *iommu_ops, void
>>>> *opaque);
>>>> +
>>>
>>> I think you should first convert all PHBs to PCIIOMMUOps to avoid all the
>>> tests as below and adapt pci_setup_iommu_ops() with the new parameter.
>>>
>>
>> OK, that's Yi's original patch:
>>
>> https://lore.kernel.org/all/20210302203827.437645-5-yi.l.liu@intel.com/
>>
>> I went with this one is that 1) it might take eons to get every single IOMMU
>> maintainer ack; and 2) it would allow each IOMMU to move at its own speed
>> specially as I can't test most of the other ones. essentially iterative, rather
>> than invasive change? Does that make sense?
> 
> I think it is ok to make global changes to replace a function by a struct
> of ops. This is not major (unless the extra indirection has a major perf
> impact on some platforms). 

It should be a mechanical change. As the pci_setup_iommu_ops() should be
functionally equivalent to pci_setup_iommu() [...]

> Getting acks from everyone will be difficult
> since some PHBs are orphans.

[...] This is what gets me a bit hesitant


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 02/15] hw/pci: Refactor pci_device_iommu_address_space()
  2023-10-06  8:52   ` Eric Auger
@ 2023-10-06 11:07     ` Joao Martins
  0 siblings, 0 replies; 55+ messages in thread
From: Joao Martins @ 2023-10-06 11:07 UTC (permalink / raw)
  To: Eric Auger, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe, Yi Liu

On 06/10/2023 09:52, Eric Auger wrote:
> Hi Joao,
> 
> On 6/22/23 23:48, Joao Martins wrote:
>> From: Yi Liu <yi.l.liu@intel.com>
>>
>> Refactor pci_device_iommu_address_space() and move the
>> code that fetches the device bus and iommu bus into its
>> own private helper pci_device_get_iommu_bus_devfn().
>>
>> This is in preparation to introduce pci_device_iommu_get_attr()
>> which will need to use it too.
> 
> you may add:
> no functional change intended.

OK

>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> [joao: Commit message, and better splitting]
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thank you


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 05/15] memory/iommu: Add IOMMU_ATTR_DMA_TRANSLATION attribute
  2023-06-22 21:48 ` [PATCH v4 05/15] memory/iommu: Add IOMMU_ATTR_DMA_TRANSLATION attribute Joao Martins
@ 2023-10-06 13:08   ` Eric Auger
  0 siblings, 0 replies; 55+ messages in thread
From: Eric Auger @ 2023-10-06 13:08 UTC (permalink / raw)
  To: Joao Martins, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe

Hi Joao,

On 6/22/23 23:48, Joao Martins wrote:
> Add a new IOMMU attribute IOMMU_ATTR_DMA_TRANSLATION which indicates
> whether the IOMMU supports DMA Translation.
doesn't it return whether the DMA translation is enabled instead?
> 
> This attribute will be used by VFIO device dirty page tracking so it can
> restrict the IOVA under tracking to the memory map when vIOMMU is
> enabled only for interrupt remapping without dma translation enabled.
The above sentence sounds a bit cryptic to me. Knowing whether DMAR is
enabled allows you to restrict the scope of dirty page tracking, is that
a correct understand?

Thanks

Eric
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  include/exec/memory.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 47c2e0221c35..5d6c2ab1f397 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -319,7 +319,8 @@ typedef struct MemoryRegionClass {
>  
>  
>  enum IOMMUMemoryRegionAttr {
> -    IOMMU_ATTR_SPAPR_TCE_FD
> +    IOMMU_ATTR_SPAPR_TCE_FD,
> +    IOMMU_ATTR_DMA_TRANSLATION,
>  };
>  
>  /*



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 07/15] vfio/common: Track whether DMA Translation is enabled on the vIOMMU
  2023-06-22 21:48 ` [PATCH v4 07/15] vfio/common: Track whether DMA Translation is enabled on the vIOMMU Joao Martins
  2023-07-09 15:10   ` Avihai Horon
@ 2023-10-06 13:09   ` Eric Auger
  1 sibling, 0 replies; 55+ messages in thread
From: Eric Auger @ 2023-10-06 13:09 UTC (permalink / raw)
  To: Joao Martins, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daude, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
	Avihai Horon, Jason Gunthorpe

Hi Joao,

On 6/22/23 23:48, Joao Martins wrote:
> vfio_get_group() allocates and fills the group/container/space on
> success which will store the AddressSpace inside the VFIOSpace struct.
VFIOAddressSpace
> Use the newly added pci_device_iommu_get_attr() to see if DMA
> translation is enabled or not. Assume that by default it is enabled.
> 
> Today, this means only intel-iommu supports it.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  include/hw/vfio/vfio-common.h |  1 +
>  hw/vfio/pci.c                 | 15 ++++++++++++++-
>  2 files changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index eed244f25f34..f41860988d6b 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -70,6 +70,7 @@ typedef struct VFIOMigration {
>  
>  typedef struct VFIOAddressSpace {
>      AddressSpace *as;
> +    bool no_dma_translation;
>      QLIST_HEAD(, VFIOContainer) containers;
>      QLIST_ENTRY(VFIOAddressSpace) list;
>  } VFIOAddressSpace;
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 73874a94de12..8a98e6ffc480 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2900,6 +2900,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>      VFIOPCIDevice *vdev = VFIO_PCI(pdev);
>      VFIODevice *vbasedev = &vdev->vbasedev;
>      VFIODevice *vbasedev_iter;
> +    VFIOAddressSpace *space;
>      VFIOGroup *group;
>      char *tmp, *subsys, group_path[PATH_MAX], *group_name;
>      Error *err = NULL;
> @@ -2907,7 +2908,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>      struct stat st;
>      int groupid;
>      int i, ret;
> -    bool is_mdev;
> +    bool is_mdev, dma_translation;
>      char uuid[UUID_FMT_LEN];
>      char *name;
>  
> @@ -2961,6 +2962,18 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>          goto error;
>      }
>  
> +    space = group->container->space;
> +
> +    /*
> +     * Support for toggling DMA translation is optional.
> +     * By default, DMA translation is assumed to be enabled i.e.
> +     * space::no_dma_translation is 0.
> +     */
> +    dma_translation = true;
nit can be set at init.
> +    pci_device_iommu_get_attr(pdev, IOMMU_ATTR_DMA_TRANSLATION,
> +                              &dma_translation);> +    space->no_dma_translation = !dma_translation;
> +
>      QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
>          if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
>              error_setg(errp, "device is already attached");

Thanks

Eric



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 01/15] hw/pci: Add a pci_setup_iommu_ops() helper
  2023-10-06 11:06         ` Joao Martins
@ 2023-10-06 17:09           ` Cédric Le Goater
  2023-10-06 17:59             ` Joao Martins
  0 siblings, 1 reply; 55+ messages in thread
From: Cédric Le Goater @ 2023-10-06 17:09 UTC (permalink / raw)
  To: Joao Martins, qemu-devel
  Cc: Alex Williamson, Paolo Bonzini, Peter Xu, David Hildenbrand,
	Philippe Mathieu-Daude, Michael S. Tsirkin, Marcel Apfelbaum,
	Jason Wang, Richard Henderson, Eduardo Habkost, Avihai Horon,
	Jason Gunthorpe, Yi Liu

Hello Joao,

>>>> I think you should first convert all PHBs to PCIIOMMUOps to avoid all the
>>>> tests as below and adapt pci_setup_iommu_ops() with the new parameter.
>>>>
>>>
>>> OK, that's Yi's original patch:
>>>
>>> https://lore.kernel.org/all/20210302203827.437645-5-yi.l.liu@intel.com/
>>>
>>> I went with this one is that 1) it might take eons to get every single IOMMU
>>> maintainer ack; and 2) it would allow each IOMMU to move at its own speed
>>> specially as I can't test most of the other ones. essentially iterative, rather
>>> than invasive change? Does that make sense?
>>
>> I think it is ok to make global changes to replace a function by a struct
>> of ops. This is not major (unless the extra indirection has a major perf
>> impact on some platforms).
> 
> It should be a mechanical change. As the pci_setup_iommu_ops() should be
> functionally equivalent to pci_setup_iommu() [...]

Thanks for going back to the previous proposal.

>> Getting acks from everyone will be difficultsince some PHBs are orphans.
> 
> [...] This is what gets me a bit hesitant

orphans shouldn't be an issue, nor the PPC emulated machines. We will see
what other maintainers have to say.

Thanks,

C.
  



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 01/15] hw/pci: Add a pci_setup_iommu_ops() helper
  2023-10-06 17:09           ` Cédric Le Goater
@ 2023-10-06 17:59             ` Joao Martins
  2023-10-09 13:01               ` Cédric Le Goater
  0 siblings, 1 reply; 55+ messages in thread
From: Joao Martins @ 2023-10-06 17:59 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Paolo Bonzini, Peter Xu, David Hildenbrand,
	Philippe Mathieu-Daude, Michael S. Tsirkin, Marcel Apfelbaum,
	Jason Wang, Richard Henderson, Eduardo Habkost, Avihai Horon,
	Jason Gunthorpe, Yi Liu

On 06/10/2023 18:09, Cédric Le Goater wrote:
>>> Getting acks from everyone will be difficultsince some PHBs are orphans.
>>
>> [...] This is what gets me a bit hesitant
> 
> orphans shouldn't be an issue, nor the PPC emulated machines. We will see
> what other maintainers have to say.

How about this as a compromise: to have a separate patch at the end of the
series that converts every other PHB? This way the rest can iterate, while we
await maintainers feedback without potentially blocking everything else.

Also, one other patch I'll add to this series at the end is this one:

https://lore.kernel.org/qemu-devel/20230908120521.50903-1-joao.m.martins@oracle.com/

This way the vIOMMU series is a complete thing for old and new guests, as
opposed to just new.

	Joao


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 01/15] hw/pci: Add a pci_setup_iommu_ops() helper
  2023-10-06 17:59             ` Joao Martins
@ 2023-10-09 13:01               ` Cédric Le Goater
  0 siblings, 0 replies; 55+ messages in thread
From: Cédric Le Goater @ 2023-10-09 13:01 UTC (permalink / raw)
  To: Joao Martins, qemu-devel
  Cc: Alex Williamson, Paolo Bonzini, Peter Xu, David Hildenbrand,
	Philippe Mathieu-Daude, Michael S. Tsirkin, Marcel Apfelbaum,
	Jason Wang, Richard Henderson, Eduardo Habkost, Avihai Horon,
	Jason Gunthorpe, Yi Liu

On 10/6/23 19:59, Joao Martins wrote:
> On 06/10/2023 18:09, Cédric Le Goater wrote:
>>>> Getting acks from everyone will be difficultsince some PHBs are orphans.
>>>
>>> [...] This is what gets me a bit hesitant
>>
>> orphans shouldn't be an issue, nor the PPC emulated machines. We will see
>> what other maintainers have to say.
> 
> How about this as a compromise: to have a separate patch at the end of the
> series that converts every other PHB? This way the rest can iterate, while we
> await maintainers feedback without potentially blocking everything else.

In patch [1], impacted files are :

* PCIIOMMUFunc -> PCIIOMMUOps change in models

   hw/ppc/spapr_pci.c       (David)  R-b
   hw/i386/intel_iommu.c    (Peter Xu) R-b
   hw/arm/smmu-common.c     (Eric)
   hw/virtio/virtio-iommu.c (Eric)
   hw/pci-host/pnv_phb3.c   (Cédric)
   hw/pci-host/pnv_phb4.c   (Cédric)
   hw/pci-host/designware.c (Peter M.)
   hw/pci-host/prep.c       (Hervé)
   hw/pci-host/sabre.c      (Mark)
   hw/s390x/s390-pci-bus.c  (Thomas)
   hw/alpha/typhoon.c       (Richard)
   hw/hppa/dino.c           (Richard)

* Common PCI

   hw/pci/pci.c             (Michael)
   include/hw/pci/pci.h     (Michael)
   include/hw/pci/pci_bus.h (Michael)

* Orphans
  
   hw/i386/amd_iommu.c      (Orphan)
   hw/ppc/ppc440_pcix.c     (Orphan)
   hw/pci-host/ppce500.c    (Orphan)

I will add my R-b on the PPC parts I maintain. The rest doesn't seem it
would raise issues and if so, it should be quick to have since the change
is simple.

[1] https://lore.kernel.org/all/20210302203827.437645-5-yi.l.liu@intel.com/

> 
> Also, one other patch I'll add to this series at the end is this one:
> 
> https://lore.kernel.org/qemu-devel/20230908120521.50903-1-joao.m.martins@oracle.com/
> 
> This way the vIOMMU series is a complete thing for old and new guests, as
> opposed to just new.

Ok good. Let's have it.

Thanks,

C.



^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2023-10-09 13:02 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-22 21:48 [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
2023-06-22 21:48 ` [PATCH v4 01/15] hw/pci: Add a pci_setup_iommu_ops() helper Joao Martins
2023-10-02 15:12   ` Cédric Le Goater
2023-10-06  8:38     ` Joao Martins
2023-10-06  8:50       ` Cédric Le Goater
2023-10-06 11:06         ` Joao Martins
2023-10-06 17:09           ` Cédric Le Goater
2023-10-06 17:59             ` Joao Martins
2023-10-09 13:01               ` Cédric Le Goater
2023-10-06  8:45   ` Eric Auger
2023-10-06 11:03     ` Joao Martins
2023-06-22 21:48 ` [PATCH v4 02/15] hw/pci: Refactor pci_device_iommu_address_space() Joao Martins
2023-10-02 15:22   ` Cédric Le Goater
2023-10-06  8:39     ` Joao Martins
2023-10-06  8:40       ` Joao Martins
2023-10-06  8:52   ` Eric Auger
2023-10-06 11:07     ` Joao Martins
2023-10-06  9:11   ` Eric Auger
2023-06-22 21:48 ` [PATCH v4 03/15] hw/pci: Introduce pci_device_iommu_get_attr() Joao Martins
2023-06-22 21:48 ` [PATCH v4 04/15] intel-iommu: Switch to pci_setup_iommu_ops() Joao Martins
2023-06-22 21:48 ` [PATCH v4 05/15] memory/iommu: Add IOMMU_ATTR_DMA_TRANSLATION attribute Joao Martins
2023-10-06 13:08   ` Eric Auger
2023-06-22 21:48 ` [PATCH v4 06/15] intel-iommu: Implement get_attr() method Joao Martins
2023-09-08  6:23   ` Duan, Zhenzhong
2023-09-08 10:11     ` Joao Martins
2023-10-02 15:23   ` Cédric Le Goater
2023-10-06  8:42     ` Joao Martins
2023-06-22 21:48 ` [PATCH v4 07/15] vfio/common: Track whether DMA Translation is enabled on the vIOMMU Joao Martins
2023-07-09 15:10   ` Avihai Horon
2023-07-10 13:44     ` Joao Martins
2023-10-06 13:09   ` Eric Auger
2023-06-22 21:48 ` [PATCH v4 08/15] vfio/common: Relax vIOMMU detection when DMA translation is off Joao Martins
2023-06-22 21:48 ` [PATCH v4 09/15] memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute Joao Martins
2023-06-22 21:48 ` [PATCH v4 10/15] intel-iommu: Implement IOMMU_ATTR_MAX_IOVA get_attr() attribute Joao Martins
2023-07-09 15:17   ` Avihai Horon
2023-07-10 13:44     ` Joao Martins
2023-10-02 15:42       ` Cédric Le Goater
2023-10-06  8:43         ` Joao Martins
2023-06-22 21:48 ` [PATCH v4 11/15] vfio/common: Move dirty tracking ranges update to helper Joao Martins
2023-06-22 21:48 ` [PATCH v4 12/15] vfio/common: Support device dirty page tracking with vIOMMU Joao Martins
2023-07-09 15:24   ` Avihai Horon
2023-07-10 13:49     ` Joao Martins
2023-09-08  6:11   ` Duan, Zhenzhong
2023-09-08 10:11     ` Joao Martins
2023-09-08 11:52       ` Duan, Zhenzhong
2023-09-08 11:54         ` Joao Martins
2023-06-22 21:48 ` [PATCH v4 13/15] vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap() Joao Martins
2023-06-22 21:48 ` [PATCH v4 14/15] vfio/common: Optimize device dirty page tracking with vIOMMU Joao Martins
2023-06-22 21:48 ` [PATCH v4 15/15] vfio/common: Block migration with vIOMMUs without address width limits Joao Martins
2023-09-08  6:28   ` Duan, Zhenzhong
2023-09-08 10:11     ` Joao Martins
2023-06-22 22:18 ` [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU Joao Martins
2023-09-07 11:11 ` Joao Martins
2023-09-07 12:40   ` Cédric Le Goater
2023-09-07 15:20     ` Joao Martins

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.