* [PATCH rfcv1 0/6] Check and sync host IOMMU cap/ecap with vIOMMU
@ 2024-01-15 10:13 Zhenzhong Duan
  2024-01-15 10:13 ` [PATCH rfcv1 1/6] backends/iommufd_device: introduce IOMMUFDDevice Zhenzhong Duan
                   ` (5 more replies)
  0 siblings, 6 replies; 46+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Zhenzhong Duan

Hi,

This series introduces a framework that lets a vIOMMU get the hw IOMMU's
cap/ecap information through the IOMMUFD interface and check or sync it
against the vIOMMU's own cap/ecap config.

The framework works by having the device side, e.g. VFIO, register an
IOMMUFDDevice with the vIOMMU. IOMMUFDDevice carries the data needed to
achieve that. Currently only VFIO devices are supported, but it
could also be used for other devices, e.g. VDPA.

This is also a prerequisite for the upcoming iommufd nesting series:
'intel_iommu: Enable stage-1 translation'.

PATCH1-4: initialize IOMMUFDDevice and pass to vIOMMU
PATCH5-6: cap/ecap sync mechanism between host IOMMU and vIOMMU

QEMU code can be found at:
https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting_preq_rfcv1

Thanks
Zhenzhong

Yi Liu (3):
  hw/pci: introduce pci_device_set/unset_iommu_device()
  intel_iommu: add set/unset_iommu_device callback
  intel_iommu: add a framework to check and sync host IOMMU cap/ecap

Zhenzhong Duan (3):
  backends/iommufd_device: introduce IOMMUFDDevice
  vfio: initialize IOMMUFDDevice and pass to vIOMMU
  intel_iommu: extract out vtd_cap_init to initialize cap/ecap

 MAINTAINERS                     |   4 +-
 include/hw/i386/intel_iommu.h   |  14 ++
 include/hw/pci/pci.h            |  39 +++++-
 include/hw/vfio/vfio-common.h   |   2 +
 include/sysemu/iommufd_device.h |  31 +++++
 backends/iommufd_device.c       |  50 +++++++
 hw/i386/intel_iommu.c           | 239 ++++++++++++++++++++++++++------
 hw/pci/pci.c                    |  49 ++++++-
 hw/vfio/iommufd.c               |   2 +
 hw/vfio/pci.c                   |  24 +++-
 backends/meson.build            |   2 +-
 11 files changed, 402 insertions(+), 54 deletions(-)
 create mode 100644 include/sysemu/iommufd_device.h
 create mode 100644 backends/iommufd_device.c

-- 
2.34.1




* [PATCH rfcv1 1/6] backends/iommufd_device: introduce IOMMUFDDevice
  2024-01-15 10:13 [PATCH rfcv1 0/6] Check and sync host IOMMU cap/ecap with vIOMMU Zhenzhong Duan
@ 2024-01-15 10:13 ` Zhenzhong Duan
  2024-01-17 14:11   ` Eric Auger
  2024-01-18 12:42   ` Eric Auger
  2024-01-15 10:13 ` [PATCH rfcv1 2/6] hw/pci: introduce pci_device_set/unset_iommu_device() Zhenzhong Duan
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 46+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Zhenzhong Duan, Yi Sun

IOMMUFDDevice represents a device in iommufd and can be used as
a communication interface between devices (e.g., VFIO, VDPA) and
vIOMMU.

Currently it includes the iommufd handle and device id information,
which vIOMMU can use to get hw IOMMU information.

In future nested translation support, vIOMMU is going to have
more iommufd related operations such as allocating a hwpt for a device,
attaching/detaching a hwpt, etc. So IOMMUFDDevice will be further expanded.

IOMMUFDDevice is deliberately not a QOM object because we don't want
it to be visible from the user interface.

Introduce a helper iommufd_device_init to initialize IOMMUFDDevice.
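
For illustration only, the init pattern described above can be sketched
outside QEMU as follows. MockBackend, MockIdev, MockVfioDevice,
mock_idev_init and demo_init are hypothetical stand-ins, not QEMU names;
only the two fields and the size check mirror what this patch adds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for IOMMUFDBackend; only the fd matters here. */
typedef struct MockBackend {
    int fd;
} MockBackend;

/* Mirrors the two fields this patch puts in IOMMUFDDevice. */
typedef struct MockIdev {
    MockBackend *iommufd;
    uint32_t dev_id;
} MockIdev;

/* Hypothetical owner that embeds the device by value, as VFIO might. */
typedef struct MockVfioDevice {
    MockIdev idev;
    int devid;
} MockVfioDevice;

/* Same shape as iommufd_device_init(): check the size, fill the fields. */
static void mock_idev_init(void *opaque, size_t instance_size,
                           MockBackend *be, uint32_t dev_id)
{
    MockIdev *idev = opaque;

    /* The caller's storage must be at least as large as the base struct. */
    assert(sizeof(MockIdev) <= instance_size);

    idev->iommufd = be;
    idev->dev_id = dev_id;
}

/* Initializes an embedded device and returns the dev_id it recorded. */
static uint32_t demo_init(void)
{
    static MockBackend be = { .fd = 3 };
    MockVfioDevice vdev = { .devid = 42 };

    mock_idev_init(&vdev.idev, sizeof(vdev.idev), &be, (uint32_t)vdev.devid);
    return vdev.idev.dev_id;
}
```

Taking a void pointer plus instance_size leaves room for a caller to
embed the base struct at the start of a larger one later on, which is
presumably why the helper asserts on the size instead of assuming it.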

Originally-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 MAINTAINERS                     |  4 +--
 include/sysemu/iommufd_device.h | 31 ++++++++++++++++++++
 backends/iommufd_device.c       | 50 +++++++++++++++++++++++++++++++++
 backends/meson.build            |  2 +-
 4 files changed, 84 insertions(+), 3 deletions(-)
 create mode 100644 include/sysemu/iommufd_device.h
 create mode 100644 backends/iommufd_device.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 00ec1f7eca..606dfeb2b1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2171,8 +2171,8 @@ M: Yi Liu <yi.l.liu@intel.com>
 M: Eric Auger <eric.auger@redhat.com>
 M: Zhenzhong Duan <zhenzhong.duan@intel.com>
 S: Supported
-F: backends/iommufd.c
-F: include/sysemu/iommufd.h
+F: backends/iommufd*.c
+F: include/sysemu/iommufd*.h
 F: include/qemu/chardev_open.h
 F: util/chardev_open.c
 F: docs/devel/vfio-iommufd.rst
diff --git a/include/sysemu/iommufd_device.h b/include/sysemu/iommufd_device.h
new file mode 100644
index 0000000000..795630324b
--- /dev/null
+++ b/include/sysemu/iommufd_device.h
@@ -0,0 +1,31 @@
+/*
+ * IOMMUFD Device
+ *
+ * Copyright (C) 2024 Intel Corporation.
+ *
+ * Authors: Yi Liu <yi.l.liu@intel.com>
+ *          Zhenzhong Duan <zhenzhong.duan@intel.com>
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef SYSEMU_IOMMUFD_DEVICE_H
+#define SYSEMU_IOMMUFD_DEVICE_H
+
+#include <linux/iommufd.h>
+#include "sysemu/iommufd.h"
+
+typedef struct IOMMUFDDevice IOMMUFDDevice;
+
+/* This is an abstraction of host IOMMUFD device */
+struct IOMMUFDDevice {
+    IOMMUFDBackend *iommufd;
+    uint32_t dev_id;
+};
+
+int iommufd_device_get_info(IOMMUFDDevice *idev,
+                            enum iommu_hw_info_type *type,
+                            uint32_t len, void *data);
+void iommufd_device_init(void *_idev, size_t instance_size,
+                         IOMMUFDBackend *iommufd, uint32_t dev_id);
+#endif
diff --git a/backends/iommufd_device.c b/backends/iommufd_device.c
new file mode 100644
index 0000000000..f6e7ca1dbf
--- /dev/null
+++ b/backends/iommufd_device.c
@@ -0,0 +1,50 @@
+/*
+ * QEMU abstract of Host IOMMU
+ *
+ * Copyright (C) 2024 Intel Corporation.
+ *
+ * Authors: Yi Liu <yi.l.liu@intel.com>
+ *          Zhenzhong Duan <zhenzhong.duan@intel.com>
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+#include "qemu/error-report.h"
+#include "sysemu/iommufd_device.h"
+
+int iommufd_device_get_info(IOMMUFDDevice *idev,
+                            enum iommu_hw_info_type *type,
+                            uint32_t len, void *data)
+{
+    struct iommu_hw_info info = {
+        .size = sizeof(info),
+        .flags = 0,
+        .dev_id = idev->dev_id,
+        .data_len = len,
+        .__reserved = 0,
+        .data_uptr = (uintptr_t)data,
+    };
+    int ret;
+
+    ret = ioctl(idev->iommufd->fd, IOMMU_GET_HW_INFO, &info);
+    if (ret) {
+        error_report("Failed to get info %m");
+    } else {
+        *type = info.out_data_type;
+    }
+
+    return ret;
+}
+
+void iommufd_device_init(void *_idev, size_t instance_size,
+                         IOMMUFDBackend *iommufd, uint32_t dev_id)
+{
+    IOMMUFDDevice *idev = (IOMMUFDDevice *)_idev;
+
+    g_assert(sizeof(IOMMUFDDevice) <= instance_size);
+
+    idev->iommufd = iommufd;
+    idev->dev_id = dev_id;
+}
diff --git a/backends/meson.build b/backends/meson.build
index 8b2b111497..c437cdb363 100644
--- a/backends/meson.build
+++ b/backends/meson.build
@@ -24,7 +24,7 @@ if have_vhost_user
   system_ss.add(when: 'CONFIG_VIRTIO', if_true: files('vhost-user.c'))
 endif
 system_ss.add(when: 'CONFIG_VIRTIO_CRYPTO', if_true: files('cryptodev-vhost.c'))
-system_ss.add(when: 'CONFIG_IOMMUFD', if_true: files('iommufd.c'))
+system_ss.add(when: 'CONFIG_IOMMUFD', if_true: files('iommufd.c', 'iommufd_device.c'))
 if have_vhost_user_crypto
   system_ss.add(when: 'CONFIG_VIRTIO_CRYPTO', if_true: files('cryptodev-vhost-user.c'))
 endif
-- 
2.34.1




* [PATCH rfcv1 2/6] hw/pci: introduce pci_device_set/unset_iommu_device()
  2024-01-15 10:13 [PATCH rfcv1 0/6] Check and sync host IOMMU cap/ecap with vIOMMU Zhenzhong Duan
  2024-01-15 10:13 ` [PATCH rfcv1 1/6] backends/iommufd_device: introduce IOMMUFDDevice Zhenzhong Duan
@ 2024-01-15 10:13 ` Zhenzhong Duan
  2024-01-17 14:11   ` Eric Auger
  2024-01-22 16:55   ` Cédric Le Goater
  2024-01-15 10:13 ` [PATCH rfcv1 3/6] intel_iommu: add set/unset_iommu_device callback Zhenzhong Duan
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 46+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Yi Sun, Zhenzhong Duan, Marcel Apfelbaum

From: Yi Liu <yi.l.liu@intel.com>

This adds pci_device_set/unset_iommu_device() to set/unset
IOMMUFDDevice for a given PCIe device. The caller of set
should fail if the set operation fails.

Extract out pci_device_get_iommu_bus_devfn() to facilitate
implementation of pci_device_set/unset_iommu_device().
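
As a rough sketch of the optional-callback dispatch, under simplified
assumptions: MockBus, MockIommuOps, MockIdev and the mock_* helpers
below are hypothetical, and the bus walk done by
pci_device_get_iommu_bus_devfn() is left out. A missing callback or a
bypassed bus is treated as success, so only a vIOMMU that implements
set_iommu_device can make the caller fail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for IOMMUFDDevice. */
typedef struct MockIdev {
    unsigned dev_id;
} MockIdev;

/* Both callbacks are optional, as in the PCIIOMMUOps additions. */
typedef struct MockIommuOps {
    int (*set_iommu_device)(void *opaque, int devfn, MockIdev *idev);
    void (*unset_iommu_device)(void *opaque, int devfn);
} MockIommuOps;

/* Hypothetical bus: ops table, opaque vIOMMU state, bypass flag. */
typedef struct MockBus {
    const MockIommuOps *iommu_ops;
    void *iommu_opaque;
    int bypass_iommu;
} MockBus;

/* Forward to the vIOMMU if it implements the hook, else succeed. */
static int mock_set_iommu_device(MockBus *bus, int devfn, MockIdev *idev)
{
    if (!bus->bypass_iommu && bus->iommu_ops &&
        bus->iommu_ops->set_iommu_device) {
        return bus->iommu_ops->set_iommu_device(bus->iommu_opaque,
                                                devfn, idev);
    }
    return 0;
}

/* Same guard for unset; nothing to report back. */
static void mock_unset_iommu_device(MockBus *bus, int devfn)
{
    if (!bus->bypass_iommu && bus->iommu_ops &&
        bus->iommu_ops->unset_iommu_device) {
        bus->iommu_ops->unset_iommu_device(bus->iommu_opaque, devfn);
    }
}

/* Example vIOMMU hook that rejects every device. */
static int reject_all(void *opaque, int devfn, MockIdev *idev)
{
    (void)opaque;
    (void)devfn;
    (void)idev;
    return -1;
}

static const MockIommuOps rejecting_ops = {
    .set_iommu_device = reject_all,
};
```

With these definitions, a bus without ops and a bypassed bus both
return 0 from mock_set_iommu_device(), while a bus using rejecting_ops
returns -1 and the caller is expected to fail.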

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/hw/pci/pci.h | 39 ++++++++++++++++++++++++++++++++++-
 hw/pci/pci.c         | 49 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index fa6313aabc..a810c0ec74 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -7,6 +7,8 @@
 /* PCI includes legacy ISA access.  */
 #include "hw/isa/isa.h"
 
+#include "sysemu/iommufd_device.h"
+
 extern bool pci_available;
 
 /* PCI bus */
@@ -384,10 +386,45 @@ typedef struct PCIIOMMUOps {
      *
      * @devfn: device and function number
      */
-   AddressSpace * (*get_address_space)(PCIBus *bus, void *opaque, int devfn);
+    AddressSpace * (*get_address_space)(PCIBus *bus, void *opaque, int devfn);
+    /**
+     * @set_iommu_device: set iommufd device for a PCI device to vIOMMU
+     *
+     * Optional callback, if not implemented in vIOMMU, then vIOMMU can't
+     * utilize iommufd specific features.
+     *
+     * Return 0 if iommufd device is accepted, or else return a negative
+     * value with errp set.
+     *
+     * @bus: the #PCIBus of the PCI device.
+     *
+     * @opaque: the data passed to pci_setup_iommu().
+     *
+     * @devfn: device and function number of the PCI device.
+     *
+     * @idev: the data structure representing iommufd device.
+     *
+     */
+    int (*set_iommu_device)(PCIBus *bus, void *opaque, int32_t devfn,
+                            IOMMUFDDevice *idev, Error **errp);
+    /**
+     * @unset_iommu_device: unset iommufd device for a PCI device from vIOMMU
+     *
+     * Optional callback.
+     *
+     * @bus: the #PCIBus of the PCI device.
+     *
+     * @opaque: the data passed to pci_setup_iommu().
+     *
+     * @devfn: device and function number of the PCI device.
+     */
+    void (*unset_iommu_device)(PCIBus *bus, void *opaque, int32_t devfn);
 } PCIIOMMUOps;
 
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
+int pci_device_set_iommu_device(PCIDevice *dev, IOMMUFDDevice *idev,
+                                Error **errp);
+void pci_device_unset_iommu_device(PCIDevice *dev);
 
 /**
  * pci_setup_iommu: Initialize specific IOMMU handlers for a PCIBus
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 76080af580..3848662f95 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2672,7 +2672,10 @@ static void pci_device_class_base_init(ObjectClass *klass, void *data)
     }
 }
 
-AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
+static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
+                                           PCIBus **aliased_pbus,
+                                           PCIBus **piommu_bus,
+                                           uint8_t *aliased_pdevfn)
 {
     PCIBus *bus = pci_get_bus(dev);
     PCIBus *iommu_bus = bus;
@@ -2717,6 +2720,18 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
 
         iommu_bus = parent_bus;
     }
+    *aliased_pbus = bus;
+    *piommu_bus = iommu_bus;
+    *aliased_pdevfn = devfn;
+}
+
+AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
+{
+    PCIBus *bus;
+    PCIBus *iommu_bus;
+    uint8_t devfn;
+
+    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
     if (!pci_bus_bypass_iommu(bus) && iommu_bus->iommu_ops) {
         return iommu_bus->iommu_ops->get_address_space(bus,
                                  iommu_bus->iommu_opaque, devfn);
@@ -2724,6 +2739,38 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
     return &address_space_memory;
 }
 
+int pci_device_set_iommu_device(PCIDevice *dev, IOMMUFDDevice *idev,
+                                Error **errp)
+{
+    PCIBus *bus;
+    PCIBus *iommu_bus;
+    uint8_t devfn;
+
+    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
+    if (!pci_bus_bypass_iommu(bus) && iommu_bus &&
+        iommu_bus->iommu_ops && iommu_bus->iommu_ops->set_iommu_device) {
+        return iommu_bus->iommu_ops->set_iommu_device(pci_get_bus(dev),
+                                                      iommu_bus->iommu_opaque,
+                                                      dev->devfn, idev, errp);
+    }
+    return 0;
+}
+
+void pci_device_unset_iommu_device(PCIDevice *dev)
+{
+    PCIBus *bus;
+    PCIBus *iommu_bus;
+    uint8_t devfn;
+
+    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
+    if (!pci_bus_bypass_iommu(bus) && iommu_bus &&
+        iommu_bus->iommu_ops && iommu_bus->iommu_ops->unset_iommu_device) {
+        return iommu_bus->iommu_ops->unset_iommu_device(pci_get_bus(dev),
+                                                        iommu_bus->iommu_opaque,
+                                                        dev->devfn);
+    }
+}
+
 void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *ops, void *opaque)
 {
     /*
-- 
2.34.1




* [PATCH rfcv1 3/6] intel_iommu: add set/unset_iommu_device callback
  2024-01-15 10:13 [PATCH rfcv1 0/6] Check and sync host IOMMU cap/ecap with vIOMMU Zhenzhong Duan
  2024-01-15 10:13 ` [PATCH rfcv1 1/6] backends/iommufd_device: introduce IOMMUFDDevice Zhenzhong Duan
  2024-01-15 10:13 ` [PATCH rfcv1 2/6] hw/pci: introduce pci_device_set/unset_iommu_device() Zhenzhong Duan
@ 2024-01-15 10:13 ` Zhenzhong Duan
  2024-01-17 15:44   ` Eric Auger
  2024-01-22 17:09   ` Cédric Le Goater
  2024-01-15 10:13 ` [PATCH rfcv1 4/6] vfio: initialize IOMMUFDDevice and pass to vIOMMU Zhenzhong Duan
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 46+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Yi Sun, Zhenzhong Duan, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost

From: Yi Liu <yi.l.liu@intel.com>

This adds set/unset_iommu_device() implementation in Intel vIOMMU.
In the set call, the IOMMUFDDevice is recorded in a hash table indexed
by PCI BDF.
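
A minimal stand-alone sketch of the bookkeeping, assuming a small
fixed-size table in place of the GHashTable and a depth counter in
place of the vIOMMU lock. All Mock* and mock_* names are hypothetical.
Entries are keyed by (bus pointer, devfn), a duplicate set is rejected,
and the lock is released on every return path, including the duplicate
case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_MAX_DEVS 8

/* One recorded device, keyed by (bus, devfn). */
typedef struct MockEntry {
    const void *bus;
    uint8_t devfn;
    void *idev;
    bool used;
} MockEntry;

typedef struct MockIommu {
    MockEntry table[MOCK_MAX_DEVS];
    int lock_depth;              /* stand-in for the vIOMMU lock */
} MockIommu;

static void mock_lock(MockIommu *s)
{
    assert(s->lock_depth == 0);
    s->lock_depth++;
}

static void mock_unlock(MockIommu *s)
{
    assert(s->lock_depth == 1);
    s->lock_depth--;
}

/* Linear lookup by (bus, devfn); returns NULL when absent. */
static MockEntry *mock_lookup(MockIommu *s, const void *bus, uint8_t devfn)
{
    for (size_t i = 0; i < MOCK_MAX_DEVS; i++) {
        MockEntry *e = &s->table[i];

        if (e->used && e->bus == bus && e->devfn == devfn) {
            return e;
        }
    }
    return NULL;
}

/* Returns 0 on success, -1 on a duplicate key or a full table. */
static int mock_set(MockIommu *s, const void *bus, uint8_t devfn, void *idev)
{
    int ret = -1;

    if (!idev) {
        return 0;                /* non-IOMMUFD case: nothing to record */
    }

    mock_lock(s);
    if (mock_lookup(s, bus, devfn)) {
        goto out;                /* already registered */
    }
    for (size_t i = 0; i < MOCK_MAX_DEVS; i++) {
        MockEntry *e = &s->table[i];

        if (!e->used) {
            e->bus = bus;
            e->devfn = devfn;
            e->idev = idev;
            e->used = true;
            ret = 0;
            break;
        }
    }
out:
    mock_unlock(s);
    return ret;
}

/* Drops the entry if present; a missing entry is not an error. */
static void mock_unset(MockIommu *s, const void *bus, uint8_t devfn)
{
    MockEntry *e;

    mock_lock(s);
    e = mock_lookup(s, bus, devfn);
    if (e) {
        e->used = false;
    }
    mock_unlock(s);
}
```

Funnelling the error paths through a single exit label keeps the
lock/unlock pairing easy to audit when more failure cases are added.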

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/hw/i386/intel_iommu.h | 10 +++++
 hw/i386/intel_iommu.c         | 79 +++++++++++++++++++++++++++++++++++
 2 files changed, 89 insertions(+)

diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 7fa0a695c8..c65fdde56f 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -62,6 +62,7 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
 typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
 typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
 typedef struct VTDPASIDEntry VTDPASIDEntry;
+typedef struct VTDIOMMUFDDevice VTDIOMMUFDDevice;
 
 /* Context-Entry */
 struct VTDContextEntry {
@@ -148,6 +149,13 @@ struct VTDAddressSpace {
     IOVATree *iova_tree;
 };
 
+struct VTDIOMMUFDDevice {
+    PCIBus *bus;
+    uint8_t devfn;
+    IOMMUFDDevice *idev;
+    IntelIOMMUState *iommu_state;
+};
+
 struct VTDIOTLBEntry {
     uint64_t gfn;
     uint16_t domain_id;
@@ -292,6 +300,8 @@ struct IntelIOMMUState {
     /* list of registered notifiers */
     QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
 
+    GHashTable *vtd_iommufd_dev;             /* VTDIOMMUFDDevice */
+
     /* interrupt remapping */
     bool intr_enabled;              /* Whether guest enabled IR */
     dma_addr_t intr_root;           /* Interrupt remapping table pointer */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index ed5677c0ae..95faf697eb 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -237,6 +237,13 @@ static gboolean vtd_as_equal(gconstpointer v1, gconstpointer v2)
            (key1->pasid == key2->pasid);
 }
 
+static gboolean vtd_as_idev_equal(gconstpointer v1, gconstpointer v2)
+{
+    const struct vtd_as_key *key1 = v1;
+    const struct vtd_as_key *key2 = v2;
+
+    return (key1->bus == key2->bus) && (key1->devfn == key2->devfn);
+}
 /*
  * Note that we use pointer to PCIBus as the key, so hashing/shifting
  * based on the pointer value is intended. Note that we deal with
@@ -3812,6 +3819,74 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
     return vtd_dev_as;
 }
 
+static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int32_t devfn,
+                                    IOMMUFDDevice *idev, Error **errp)
+{
+    IntelIOMMUState *s = opaque;
+    VTDIOMMUFDDevice *vtd_idev;
+    struct vtd_as_key key = {
+        .bus = bus,
+        .devfn = devfn,
+    };
+    struct vtd_as_key *new_key;
+
+    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
+
+    /* Non-IOMMUFD case */
+    if (!idev) {
+        return 0;
+    }
+
+    vtd_iommu_lock(s);
+
+    vtd_idev = g_hash_table_lookup(s->vtd_iommufd_dev, &key);
+
+    if (vtd_idev) {
+        error_setg(errp, "IOMMUFD device already exists");
+        return -1;
+    }
+
+    new_key = g_malloc(sizeof(*new_key));
+    new_key->bus = bus;
+    new_key->devfn = devfn;
+
+    vtd_idev = g_malloc0(sizeof(VTDIOMMUFDDevice));
+    vtd_idev->bus = bus;
+    vtd_idev->devfn = (uint8_t)devfn;
+    vtd_idev->iommu_state = s;
+    vtd_idev->idev = idev;
+
+    g_hash_table_insert(s->vtd_iommufd_dev, new_key, vtd_idev);
+
+    vtd_iommu_unlock(s);
+
+    return 0;
+}
+
+static void vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int32_t devfn)
+{
+    IntelIOMMUState *s = opaque;
+    VTDIOMMUFDDevice *vtd_idev;
+    struct vtd_as_key key = {
+        .bus = bus,
+        .devfn = devfn,
+    };
+
+    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
+
+    vtd_iommu_lock(s);
+
+    vtd_idev = g_hash_table_lookup(s->vtd_iommufd_dev, &key);
+    if (!vtd_idev) {
+        vtd_iommu_unlock(s);
+        return;
+    }
+
+    g_hash_table_remove(s->vtd_iommufd_dev, &key);
+
+    vtd_iommu_unlock(s);
+}
+
 /* Unmap the whole range in the notifier's scope. */
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
 {
@@ -4107,6 +4182,8 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
 
 static PCIIOMMUOps vtd_iommu_ops = {
     .get_address_space = vtd_host_dma_iommu,
+    .set_iommu_device = vtd_dev_set_iommu_device,
+    .unset_iommu_device = vtd_dev_unset_iommu_device,
 };
 
 static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
@@ -4230,6 +4307,8 @@ static void vtd_realize(DeviceState *dev, Error **errp)
                                      g_free, g_free);
     s->vtd_address_spaces = g_hash_table_new_full(vtd_as_hash, vtd_as_equal,
                                       g_free, g_free);
+    s->vtd_iommufd_dev = g_hash_table_new_full(vtd_as_hash, vtd_as_idev_equal,
+                                               g_free, g_free);
     vtd_init(s);
     pci_setup_iommu(bus, &vtd_iommu_ops, dev);
     /* Pseudo address space under root PCI bus. */
-- 
2.34.1




* [PATCH rfcv1 4/6] vfio: initialize IOMMUFDDevice and pass to vIOMMU
  2024-01-15 10:13 [PATCH rfcv1 0/6] Check and sync host IOMMU cap/ecap with vIOMMU Zhenzhong Duan
                   ` (2 preceding siblings ...)
  2024-01-15 10:13 ` [PATCH rfcv1 3/6] intel_iommu: add set/unset_iommu_device callback Zhenzhong Duan
@ 2024-01-15 10:13 ` Zhenzhong Duan
  2024-01-17 15:37   ` Joao Martins
                     ` (2 more replies)
  2024-01-15 10:13 ` [PATCH rfcv1 5/6] intel_iommu: extract out vtd_cap_init to initialize cap/ecap Zhenzhong Duan
  2024-01-15 10:13 ` [PATCH rfcv1 6/6] intel_iommu: add a framework to check and sync host IOMMU cap/ecap Zhenzhong Duan
  5 siblings, 3 replies; 46+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Zhenzhong Duan, Yi Sun

Initialize IOMMUFDDevice in vfio and pass it to vIOMMU, so that vIOMMU
can get hw IOMMU information.

In VFIO legacy backend mode, we still pass a NULL IOMMUFDDevice to vIOMMU,
in case vIOMMU needs some processing for VFIO legacy backend mode.
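
The ordering this implies for the realize path can be sketched as
below. MockDev and the mock_* helpers are hypothetical and stand in for
vfio_realize() and its callees. The device is registered with the
vIOMMU first, with NULL in legacy mode, and any later failure unwinds
through a label that unregisters it before the common teardown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device state, reduced to what the ordering needs. */
typedef struct MockDev {
    bool has_iommufd;   /* iommufd backend vs. legacy backend */
    bool fail_caps;     /* force the later step to fail */
    bool idev_set;      /* currently registered with the vIOMMU */
    bool torn_down;     /* common teardown has run */
    int idev;           /* placeholder for the embedded IOMMUFDDevice */
} MockDev;

/* Register with the vIOMMU; idev is NULL for the legacy backend. */
static int mock_set_idev(MockDev *d, int *idev)
{
    (void)idev;
    d->idev_set = true;
    return 0;
}

static void mock_unset_idev(MockDev *d)
{
    d->idev_set = false;
}

static int mock_add_caps(MockDev *d)
{
    return d->fail_caps ? -1 : 0;
}

/* Register first, then run later steps, and unwind in reverse order. */
static int mock_realize(MockDev *d)
{
    int ret;

    ret = mock_set_idev(d, d->has_iommufd ? &d->idev : NULL);
    if (ret) {
        goto out_teardown;
    }

    ret = mock_add_caps(d);
    if (ret) {
        goto out_unset_idev;
    }

    return 0;

out_unset_idev:
    mock_unset_idev(d);
out_teardown:
    d->torn_down = true;
    return ret;
}
```

On success the device stays registered until the exit path unsets it.
On a later failure the registration is dropped before the remaining
teardown runs.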

Originally-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/hw/vfio/vfio-common.h |  2 ++
 hw/vfio/iommufd.c             |  2 ++
 hw/vfio/pci.c                 | 24 +++++++++++++++++++-----
 3 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 9b7ef7d02b..fde0d0ca60 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -31,6 +31,7 @@
 #endif
 #include "sysemu/sysemu.h"
 #include "hw/vfio/vfio-container-base.h"
+#include "sysemu/iommufd_device.h"
 
 #define VFIO_MSG_PREFIX "vfio %s: "
 
@@ -126,6 +127,7 @@ typedef struct VFIODevice {
     bool dirty_tracking;
     int devid;
     IOMMUFDBackend *iommufd;
+    IOMMUFDDevice idev;
 } VFIODevice;
 
 struct VFIODeviceOps {
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 9bfddc1360..cbd035f148 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -309,6 +309,7 @@ static int iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
     VFIOContainerBase *bcontainer;
     VFIOIOMMUFDContainer *container;
     VFIOAddressSpace *space;
+    IOMMUFDDevice *idev = &vbasedev->idev;
     struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
     int ret, devfd;
     uint32_t ioas_id;
@@ -428,6 +429,7 @@ found_container:
     QLIST_INSERT_HEAD(&bcontainer->device_list, vbasedev, container_next);
     QLIST_INSERT_HEAD(&vfio_device_list, vbasedev, global_next);
 
+    iommufd_device_init(idev, sizeof(*idev), container->be, vbasedev->devid);
     trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev->num_irqs,
                                    vbasedev->num_regions, vbasedev->flags);
     return 0;
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index d7fe06715c..2c3a5d267b 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3107,11 +3107,21 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 
     vfio_bars_register(vdev);
 
-    ret = vfio_add_capabilities(vdev, errp);
+    if (vbasedev->iommufd) {
+        ret = pci_device_set_iommu_device(pdev, &vbasedev->idev, errp);
+    } else {
+        ret = pci_device_set_iommu_device(pdev, NULL, errp);
+    }
     if (ret) {
+        error_prepend(errp, "Failed to set iommu_device: ");
         goto out_teardown;
     }
 
+    ret = vfio_add_capabilities(vdev, errp);
+    if (ret) {
+        goto out_unset_idev;
+    }
+
     if (vdev->vga) {
         vfio_vga_quirk_setup(vdev);
     }
@@ -3128,7 +3138,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
             error_setg(errp,
                        "cannot support IGD OpRegion feature on hotplugged "
                        "device");
-            goto out_teardown;
+            goto out_unset_idev;
         }
 
         ret = vfio_get_dev_region_info(vbasedev,
@@ -3137,13 +3147,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         if (ret) {
             error_setg_errno(errp, -ret,
                              "does not support requested IGD OpRegion feature");
-            goto out_teardown;
+            goto out_unset_idev;
         }
 
         ret = vfio_pci_igd_opregion_init(vdev, opregion, errp);
         g_free(opregion);
         if (ret) {
-            goto out_teardown;
+            goto out_unset_idev;
         }
     }
 
@@ -3229,6 +3239,8 @@ out_deregister:
     if (vdev->intx.mmap_timer) {
         timer_free(vdev->intx.mmap_timer);
     }
+out_unset_idev:
+    pci_device_unset_iommu_device(pdev);
 out_teardown:
     vfio_teardown_msi(vdev);
     vfio_bars_exit(vdev);
@@ -3257,6 +3269,7 @@ static void vfio_instance_finalize(Object *obj)
 static void vfio_exitfn(PCIDevice *pdev)
 {
     VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+    VFIODevice *vbasedev = &vdev->vbasedev;
 
     vfio_unregister_req_notifier(vdev);
     vfio_unregister_err_notifier(vdev);
@@ -3271,7 +3284,8 @@ static void vfio_exitfn(PCIDevice *pdev)
     vfio_teardown_msi(vdev);
     vfio_pci_disable_rp_atomics(vdev);
     vfio_bars_exit(vdev);
-    vfio_migration_exit(&vdev->vbasedev);
+    vfio_migration_exit(vbasedev);
+    pci_device_unset_iommu_device(pdev);
 }
 
 static void vfio_pci_reset(DeviceState *dev)
-- 
2.34.1




* [PATCH rfcv1 5/6] intel_iommu: extract out vtd_cap_init to initialize cap/ecap
  2024-01-15 10:13 [PATCH rfcv1 0/6] Check and sync host IOMMU cap/ecap with vIOMMU Zhenzhong Duan
                   ` (3 preceding siblings ...)
  2024-01-15 10:13 ` [PATCH rfcv1 4/6] vfio: initialize IOMMUFDDevice and pass to vIOMMU Zhenzhong Duan
@ 2024-01-15 10:13 ` Zhenzhong Duan
  2024-01-17 17:36   ` Eric Auger
  2024-01-15 10:13 ` [PATCH rfcv1 6/6] intel_iommu: add a framework to check and sync host IOMMU cap/ecap Zhenzhong Duan
  5 siblings, 1 reply; 46+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Zhenzhong Duan, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, Marcel Apfelbaum

This is a prerequisite for host cap/ecap sync.

No functional change intended.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu.c | 92 +++++++++++++++++++++++--------------------
 1 file changed, 50 insertions(+), 42 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 95faf697eb..4c1d058ebd 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -4009,30 +4009,10 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
     return;
 }
 
-/* Do the initialization. It will also be called when reset, so pay
- * attention when adding new initialization stuff.
- */
-static void vtd_init(IntelIOMMUState *s)
+static void vtd_cap_init(IntelIOMMUState *s)
 {
     X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
 
-    memset(s->csr, 0, DMAR_REG_SIZE);
-    memset(s->wmask, 0, DMAR_REG_SIZE);
-    memset(s->w1cmask, 0, DMAR_REG_SIZE);
-    memset(s->womask, 0, DMAR_REG_SIZE);
-
-    s->root = 0;
-    s->root_scalable = false;
-    s->dmar_enabled = false;
-    s->intr_enabled = false;
-    s->iq_head = 0;
-    s->iq_tail = 0;
-    s->iq = 0;
-    s->iq_size = 0;
-    s->qi_enabled = false;
-    s->iq_last_desc_type = VTD_INV_DESC_NONE;
-    s->iq_dw = false;
-    s->next_frcd_reg = 0;
     s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
              VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
              VTD_CAP_MGAW(s->aw_bits);
@@ -4049,27 +4029,6 @@ static void vtd_init(IntelIOMMUState *s)
     }
     s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
 
-    /*
-     * Rsvd field masks for spte
-     */
-    vtd_spte_rsvd[0] = ~0ULL;
-    vtd_spte_rsvd[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->aw_bits,
-                                                  x86_iommu->dt_supported);
-    vtd_spte_rsvd[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->aw_bits);
-    vtd_spte_rsvd[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->aw_bits);
-    vtd_spte_rsvd[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->aw_bits);
-
-    vtd_spte_rsvd_large[2] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->aw_bits,
-                                                         x86_iommu->dt_supported);
-    vtd_spte_rsvd_large[3] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->aw_bits,
-                                                         x86_iommu->dt_supported);
-
-    if (s->scalable_mode || s->snoop_control) {
-        vtd_spte_rsvd[1] &= ~VTD_SPTE_SNP;
-        vtd_spte_rsvd_large[2] &= ~VTD_SPTE_SNP;
-        vtd_spte_rsvd_large[3] &= ~VTD_SPTE_SNP;
-    }
-
     if (x86_iommu_ir_supported(x86_iommu)) {
         s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
         if (s->intr_eim == ON_OFF_AUTO_ON) {
@@ -4102,7 +4061,56 @@ static void vtd_init(IntelIOMMUState *s)
     if (s->pasid) {
         s->ecap |= VTD_ECAP_PASID;
     }
+}
+
+/*
+ * Do the initialization. It will also be called when reset, so pay
+ * attention when adding new initialization stuff.
+ */
+static void vtd_init(IntelIOMMUState *s)
+{
+    X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
+
+    memset(s->csr, 0, DMAR_REG_SIZE);
+    memset(s->wmask, 0, DMAR_REG_SIZE);
+    memset(s->w1cmask, 0, DMAR_REG_SIZE);
+    memset(s->womask, 0, DMAR_REG_SIZE);
+
+    s->root = 0;
+    s->root_scalable = false;
+    s->dmar_enabled = false;
+    s->intr_enabled = false;
+    s->iq_head = 0;
+    s->iq_tail = 0;
+    s->iq = 0;
+    s->iq_size = 0;
+    s->qi_enabled = false;
+    s->iq_last_desc_type = VTD_INV_DESC_NONE;
+    s->iq_dw = false;
+    s->next_frcd_reg = 0;
+
+    /*
+     * Rsvd field masks for spte
+     */
+    vtd_spte_rsvd[0] = ~0ULL;
+    vtd_spte_rsvd[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->aw_bits,
+                                                  x86_iommu->dt_supported);
+    vtd_spte_rsvd[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->aw_bits);
+    vtd_spte_rsvd[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->aw_bits);
+    vtd_spte_rsvd[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->aw_bits);
+
+    vtd_spte_rsvd_large[2] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->aw_bits,
+                                                    x86_iommu->dt_supported);
+    vtd_spte_rsvd_large[3] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->aw_bits,
+                                                    x86_iommu->dt_supported);
+
+    if (s->scalable_mode || s->snoop_control) {
+        vtd_spte_rsvd[1] &= ~VTD_SPTE_SNP;
+        vtd_spte_rsvd_large[2] &= ~VTD_SPTE_SNP;
+        vtd_spte_rsvd_large[3] &= ~VTD_SPTE_SNP;
+    }
 
+    vtd_cap_init(s);
     vtd_reset_caches(s);
 
     /* Define registers with default values and bit semantics */
-- 
2.34.1




* [PATCH rfcv1 6/6] intel_iommu: add a framework to check and sync host IOMMU cap/ecap
  2024-01-15 10:13 [PATCH rfcv1 0/6] Check and sync host IOMMU cap/ecap with vIOMMU Zhenzhong Duan
                   ` (4 preceding siblings ...)
  2024-01-15 10:13 ` [PATCH rfcv1 5/6] intel_iommu: extract out vtd_cap_init to initialize cap/ecap Zhenzhong Duan
@ 2024-01-15 10:13 ` Zhenzhong Duan
  2024-01-17 17:56   ` Eric Auger
  2024-01-23  8:39   ` Cédric Le Goater
  5 siblings, 2 replies; 46+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Yi Sun, Zhenzhong Duan, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost, Marcel Apfelbaum

From: Yi Liu <yi.l.liu@intel.com>

Add a framework to check and synchronize host IOMMU cap/ecap with
vIOMMU cap/ecap.

Currently only stage-2 translation is supported, which is backed by a
shadow page table on the host side, so we don't need an exact match of
every cap/ecap bit between vIOMMU and host. However, we can still
use this framework to ensure that the host's and vIOMMU's address
widths are compatible, i.e., vIOMMU's aw_bits <= host aw_bits,
which was not checked before.
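
A minimal sketch of that width check, with hypothetical helper names
(sketch_*) that are not part of the patch. It assumes the MGAW field in
bits 21:16 of the capability register stores the width minus one, which
is how QEMU's VTD_CAP_MGAW() macro builds the virtual register; the
patch itself compares against the raw field value, so the "+ 1" is an
assumption to verify against the VT-d specification:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* MGAW occupies bits 21:16 of the VT-d capability register. */
#define SKETCH_CAP_MGAW_SHIFT 16
#define SKETCH_CAP_MGAW_MASK  0x3fULL

/*
 * Hypothetical helper: decode the host address width in bits.
 * Assumption: the field stores (width - 1). Drop the "+ 1" if the
 * raw field value is what should be compared.
 */
static uint8_t sketch_host_aw_bits(uint64_t cap_reg)
{
    uint64_t field = (cap_reg >> SKETCH_CAP_MGAW_SHIFT) & SKETCH_CAP_MGAW_MASK;

    return (uint8_t)(field + 1);
}

/* The vIOMMU width is compatible when it does not exceed the host width. */
static bool sketch_aw_bits_ok(uint8_t viommu_aw_bits, uint64_t host_cap_reg)
{
    return viommu_aw_bits <= sketch_host_aw_bits(host_cap_reg);
}
```

Under this assumption a host reporting a field value of 47 accepts a
vIOMMU configured with aw-bits=48 and rejects one configured with a
larger width.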

When stage-1 translation is supported in the future, a.k.a. scalable
modern mode, we need to ensure compatibility of each bit. Some
bits are user controllable; they should be checked against the host
side to ensure compatibility. Other bits are not; they should be synced
into vIOMMU cap/ecap for compatibility.

The sequence will be:

vtd_cap_init() initializes iommu->cap/ecap. ---- vtd_cap_init()
iommu->host_cap/ecap is initialized as iommu->cap/ecap.  ---- vtd_init()
iommu->host_cap/ecap is checked against host cap/ecap and some bits are updated. ---- vtd_sync_hw_info()
iommu->cap/ecap is finalized as iommu->host_cap/ecap.  ---- vtd_machine_done_hook()

iommu->host_cap/ecap is temporary storage that holds the intermediate
value while merging host cap/ecap with vIOMMU's initially configured
cap/ecap.
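
The sequence above can be sketched as follows, using a simplified
stand-in struct rather than the real IntelIOMMUState. All sketch_* names
are hypothetical, and the bitwise-AND merge in the sync step is purely
illustrative, since the actual per-bit policy is left as a TODO in this
patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for IntelIOMMUState; only the fields the sequence touches. */
typedef struct {
    uint64_t cap, ecap;           /* values exposed to the guest */
    uint64_t host_cap, host_ecap; /* scratch copy merged with host info */
    bool cap_finalized;
} SketchIOMMU;

/* Steps 1 and 2: build default cap/ecap and seed the scratch copy. */
static void sketch_init(SketchIOMMU *s, uint64_t def_cap, uint64_t def_ecap)
{
    if (s->cap_finalized) {
        return; /* a later reset must not undo the finalized values */
    }
    s->cap = def_cap;
    s->ecap = def_ecap;
    s->host_cap = s->cap;
    s->host_ecap = s->ecap;
}

/* Step 3: called per device; illustrative policy keeps common bits only. */
static void sketch_sync_hw_info(SketchIOMMU *s, uint64_t hw_cap,
                                uint64_t hw_ecap)
{
    s->host_cap &= hw_cap;
    s->host_ecap &= hw_ecap;
}

/* Step 4: machine-done hook publishes the merged values to the guest. */
static void sketch_machine_done(SketchIOMMU *s)
{
    s->cap = s->host_cap;
    s->ecap = s->host_ecap;
    s->cap_finalized = true;
}
```

The point of the scratch copy is that the guest-visible cap/ecap stay
untouched while devices are still being attached, and change exactly
once, when the machine-done hook runs.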

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/hw/i386/intel_iommu.h |  4 ++
 hw/i386/intel_iommu.c         | 78 +++++++++++++++++++++++++++++++----
 2 files changed, 75 insertions(+), 7 deletions(-)

diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index c65fdde56f..b8abbcce12 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -292,6 +292,9 @@ struct IntelIOMMUState {
     uint64_t cap;                   /* The value of capability reg */
     uint64_t ecap;                  /* The value of extended capability reg */
 
+    uint64_t host_cap;              /* The value of host capability reg */
+    uint64_t host_ecap;             /* The value of host ext-capability reg */
+
     uint32_t context_cache_gen;     /* Should be in [1,MAX] */
     GHashTable *iotlb;              /* IOTLB */
 
@@ -314,6 +317,7 @@ struct IntelIOMMUState {
     bool dma_translation;           /* Whether DMA translation supported */
     bool pasid;                     /* Whether to support PASID */
 
+    bool cap_finalized;             /* Whether VTD capability finalized */
     /*
      * Protects IOMMU states in general.  Currently it protects the
      * per-IOMMU IOTLB cache, and context entry cache in VTDAddressSpace.
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 4c1d058ebd..be03fcbf52 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3819,6 +3819,47 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
     return vtd_dev_as;
 }
 
+static bool vtd_sync_hw_info(IntelIOMMUState *s, struct iommu_hw_info_vtd *vtd,
+                             Error **errp)
+{
+    uint64_t addr_width;
+
+    addr_width = (vtd->cap_reg >> 16) & 0x3fULL;
+    if (s->aw_bits > addr_width) {
+        error_setg(errp, "User aw-bits: %u > host address width: %lu",
+                   s->aw_bits, addr_width);
+        return false;
+    }
+
+    /* TODO: check and sync host cap/ecap into vIOMMU cap/ecap */
+
+    return true;
+}
+
+/*
+ * virtual VT-d which wants nested needs to check the host IOMMU
+ * nesting cap info behind the assigned devices. Thus that vIOMMU
+ * could bind guest page table to host.
+ */
+static bool vtd_check_idev(IntelIOMMUState *s, IOMMUFDDevice *idev,
+                           Error **errp)
+{
+    struct iommu_hw_info_vtd vtd;
+    enum iommu_hw_info_type type = IOMMU_HW_INFO_TYPE_INTEL_VTD;
+
+    if (iommufd_device_get_info(idev, &type, sizeof(vtd), &vtd)) {
+        error_setg(errp, "Failed to get IOMMU capability!!!");
+        return false;
+    }
+
+    if (type != IOMMU_HW_INFO_TYPE_INTEL_VTD) {
+        error_setg(errp, "IOMMU hardware is not compatible!!!");
+        return false;
+    }
+
+    return vtd_sync_hw_info(s, &vtd, errp);
+}
+
 static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int32_t devfn,
                                     IOMMUFDDevice *idev, Error **errp)
 {
@@ -3837,6 +3878,10 @@ static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int32_t devfn,
         return 0;
     }
 
+    if (!vtd_check_idev(s, idev, errp)) {
+        return -1;
+    }
+
     vtd_iommu_lock(s);
 
     vtd_idev = g_hash_table_lookup(s->vtd_iommufd_dev, &key);
@@ -4071,10 +4116,11 @@ static void vtd_init(IntelIOMMUState *s)
 {
     X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
 
-    memset(s->csr, 0, DMAR_REG_SIZE);
-    memset(s->wmask, 0, DMAR_REG_SIZE);
-    memset(s->w1cmask, 0, DMAR_REG_SIZE);
-    memset(s->womask, 0, DMAR_REG_SIZE);
+    /* CAP/ECAP are initialized in machine create done stage */
+    memset(s->csr + DMAR_GCMD_REG, 0, DMAR_REG_SIZE - DMAR_GCMD_REG);
+    memset(s->wmask + DMAR_GCMD_REG, 0, DMAR_REG_SIZE - DMAR_GCMD_REG);
+    memset(s->w1cmask + DMAR_GCMD_REG, 0, DMAR_REG_SIZE - DMAR_GCMD_REG);
+    memset(s->womask + DMAR_GCMD_REG, 0, DMAR_REG_SIZE - DMAR_GCMD_REG);
 
     s->root = 0;
     s->root_scalable = false;
@@ -4110,13 +4156,16 @@ static void vtd_init(IntelIOMMUState *s)
         vtd_spte_rsvd_large[3] &= ~VTD_SPTE_SNP;
     }
 
-    vtd_cap_init(s);
+    if (!s->cap_finalized) {
+        vtd_cap_init(s);
+        s->host_cap = s->cap;
+        s->host_ecap = s->ecap;
+    }
+
     vtd_reset_caches(s);
 
     /* Define registers with default values and bit semantics */
     vtd_define_long(s, DMAR_VER_REG, 0x10UL, 0, 0);
-    vtd_define_quad(s, DMAR_CAP_REG, s->cap, 0, 0);
-    vtd_define_quad(s, DMAR_ECAP_REG, s->ecap, 0, 0);
     vtd_define_long(s, DMAR_GCMD_REG, 0, 0xff800000UL, 0);
     vtd_define_long_wo(s, DMAR_GCMD_REG, 0xff800000UL);
     vtd_define_long(s, DMAR_GSTS_REG, 0, 0, 0);
@@ -4241,6 +4290,12 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
     return true;
 }
 
+static void vtd_setup_capability_reg(IntelIOMMUState *s)
+{
+    vtd_define_quad(s, DMAR_CAP_REG, s->cap, 0, 0);
+    vtd_define_quad(s, DMAR_ECAP_REG, s->ecap, 0, 0);
+}
+
 static int vtd_machine_done_notify_one(Object *child, void *unused)
 {
     IntelIOMMUState *iommu = INTEL_IOMMU_DEVICE(x86_iommu_get_default());
@@ -4259,6 +4314,14 @@ static int vtd_machine_done_notify_one(Object *child, void *unused)
 
 static void vtd_machine_done_hook(Notifier *notifier, void *unused)
 {
+    IntelIOMMUState *iommu = INTEL_IOMMU_DEVICE(x86_iommu_get_default());
+
+    iommu->cap = iommu->host_cap;
+    iommu->ecap = iommu->host_ecap;
+    iommu->cap_finalized = true;
+
+    vtd_setup_capability_reg(iommu);
+
     object_child_foreach_recursive(object_get_root(),
                                    vtd_machine_done_notify_one, NULL);
 }
@@ -4292,6 +4355,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
 
     QLIST_INIT(&s->vtd_as_with_notifiers);
     qemu_mutex_init(&s->iommu_lock);
+    s->cap_finalized = false;
     memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
                           "intel_iommu", DMAR_REG_SIZE);
     memory_region_add_subregion(get_system_memory(),
-- 
2.34.1




* Re: [PATCH rfcv1 2/6] hw/pci: introduce pci_device_set/unset_iommu_device()
  2024-01-15 10:13 ` [PATCH rfcv1 2/6] hw/pci: introduce pci_device_set/unset_iommu_device() Zhenzhong Duan
@ 2024-01-17 14:11   ` Eric Auger
  2024-01-18  7:58     ` Duan, Zhenzhong
  2024-01-22 16:55   ` Cédric Le Goater
  1 sibling, 1 reply; 46+ messages in thread
From: Eric Auger @ 2024-01-17 14:11 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, peterx, jasowang, mst, jgg, nicolinc,
	joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Yi Sun, Marcel Apfelbaum

Hi Zhenzhong,

On 1/15/24 11:13, Zhenzhong Duan wrote:
> From: Yi Liu <yi.l.liu@intel.com>
>
> This adds pci_device_set/unset_iommu_device() to set/unset
> IOMMUFDDevice for a given PCIe device. Caller of set
> should fail if set operation fails.
>
> Extract out pci_device_get_iommu_bus_devfn() to facilitate
> implementation of pci_device_set/unset_iommu_device().
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  include/hw/pci/pci.h | 39 ++++++++++++++++++++++++++++++++++-
>  hw/pci/pci.c         | 49 +++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 86 insertions(+), 2 deletions(-)
>
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index fa6313aabc..a810c0ec74 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -7,6 +7,8 @@
>  /* PCI includes legacy ISA access.  */
>  #include "hw/isa/isa.h"
>  
> +#include "sysemu/iommufd_device.h"
> +
>  extern bool pci_available;
>  
>  /* PCI bus */
> @@ -384,10 +386,45 @@ typedef struct PCIIOMMUOps {
>       *
>       * @devfn: device and function number
>       */
> -   AddressSpace * (*get_address_space)(PCIBus *bus, void *opaque, int devfn);
> +    AddressSpace * (*get_address_space)(PCIBus *bus, void *opaque, int devfn);
> +    /**
> +     * @set_iommu_device: set iommufd device for a PCI device to vIOMMU
> +     *
> +     * Optional callback, if not implemented in vIOMMU, then vIOMMU can't
> +     * utilize iommufd specific features.
> +     *
> +     * Return true if iommufd device is accepted, or else return false with
> +     * errp set.
> +     *
> +     * @bus: the #PCIBus of the PCI device.
> +     *
> +     * @opaque: the data passed to pci_setup_iommu().
> +     *
> +     * @devfn: device and function number of the PCI device.
> +     *
> +     * @idev: the data structure representing iommufd device.
> +     *
> +     */
> +    int (*set_iommu_device)(PCIBus *bus, void *opaque, int32_t devfn,
> +                            IOMMUFDDevice *idev, Error **errp);
> +    /**
> +     * @unset_iommu_device: unset iommufd device for a PCI device from vIOMMU
> +     *
> +     * Optional callback.
> +     *
> +     * @bus: the #PCIBus of the PCI device.
> +     *
> +     * @opaque: the data passed to pci_setup_iommu().
> +     *
> +     * @devfn: device and function number of the PCI device.
> +     */
> +    void (*unset_iommu_device)(PCIBus *bus, void *opaque, int32_t devfn);
>  } PCIIOMMUOps;
>  
>  AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
> +int pci_device_set_iommu_device(PCIDevice *dev, IOMMUFDDevice *idev,
> +                                Error **errp);
> +void pci_device_unset_iommu_device(PCIDevice *dev);
>  
>  /**
>   * pci_setup_iommu: Initialize specific IOMMU handlers for a PCIBus
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 76080af580..3848662f95 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2672,7 +2672,10 @@ static void pci_device_class_base_init(ObjectClass *klass, void *data)
>      }
>  }
>  
> -AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> +static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
> +                                           PCIBus **aliased_pbus,
> +                                           PCIBus **piommu_bus,
> +                                           uint8_t *aliased_pdevfn)
nit: I would drop the p in aliased_pbus and aliased_pdevfn. Maybe you
should allow the caller to pass NULL for aliased_pbus and aliased_pdevfn,
as is the case for pci_device_set_iommu_device(). I may reuse that
helper in [RFC 2/7] hw/pci: Introduce pci_device_iommu_bus
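
One possible shape for NULL-tolerant output parameters, as a
hypothetical sketch only: SketchBus and sketch_get_iommu_bus_devfn()
are stand-ins, and the parent walk is reduced to a simple loop that
ignores the aliasing handled by the real helper:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for PCIBus: a parent link plus an "IOMMU ops set" flag. */
typedef struct SketchBus {
    struct SketchBus *parent;
    int has_iommu_ops;
} SketchBus;

/*
 * Walk up to the first bus that has IOMMU ops (or the root), then
 * store only the outputs the caller asked for. Any output pointer
 * may be NULL.
 */
static void sketch_get_iommu_bus_devfn(SketchBus *bus, uint8_t devfn,
                                       SketchBus **aliased_bus,
                                       SketchBus **iommu_bus,
                                       uint8_t *aliased_devfn)
{
    SketchBus *cur = bus;

    while (cur && !cur->has_iommu_ops && cur->parent) {
        cur = cur->parent;
    }

    if (aliased_bus) {
        *aliased_bus = bus;
    }
    if (iommu_bus) {
        *iommu_bus = cur;
    }
    if (aliased_devfn) {
        *aliased_devfn = devfn;
    }
}
```

A caller that only needs the IOMMU bus can then pass NULL for the other
two outputs instead of declaring unused locals.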
>  {
>      PCIBus *bus = pci_get_bus(dev);
>      PCIBus *iommu_bus = bus;
> @@ -2717,6 +2720,18 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>  
>          iommu_bus = parent_bus;
>      }
> +    *aliased_pbus = bus;
> +    *piommu_bus = iommu_bus;
> +    *aliased_pdevfn = devfn;
> +}
> +
> +AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> +{
> +    PCIBus *bus;
> +    PCIBus *iommu_bus;
> +    uint8_t devfn;
> +
> +    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
>      if (!pci_bus_bypass_iommu(bus) && iommu_bus->iommu_ops) {
>          return iommu_bus->iommu_ops->get_address_space(bus,
>                                   iommu_bus->iommu_opaque, devfn);
> @@ -2724,6 +2739,38 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>      return &address_space_memory;
>  }
>  
> +int pci_device_set_iommu_device(PCIDevice *dev, IOMMUFDDevice *idev,
> +                                Error **errp)
> +{
> +    PCIBus *bus;
> +    PCIBus *iommu_bus;
> +    uint8_t devfn;
> +
> +    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
> +    if (!pci_bus_bypass_iommu(bus) && iommu_bus &&
> +        iommu_bus->iommu_ops && iommu_bus->iommu_ops->set_iommu_device) {
> +        return iommu_bus->iommu_ops->set_iommu_device(pci_get_bus(dev),
> +                                                      iommu_bus->iommu_opaque,
> +                                                      dev->devfn, idev, errp);
> +    }
> +    return 0;
> +}
> +
> +void pci_device_unset_iommu_device(PCIDevice *dev)
> +{
> +    PCIBus *bus;
> +    PCIBus *iommu_bus;
> +    uint8_t devfn;
> +
> +    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
> +    if (!pci_bus_bypass_iommu(bus) && iommu_bus &&
> +        iommu_bus->iommu_ops && iommu_bus->iommu_ops->unset_iommu_device) {
> +        return iommu_bus->iommu_ops->unset_iommu_device(pci_get_bus(dev),
> +                                                        iommu_bus->iommu_opaque,
> +                                                        dev->devfn);
> +    }
> +}
> +
>  void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *ops, void *opaque)
>  {
>      /*
Thanks

Eric




* Re: [PATCH rfcv1 1/6] backends/iommufd_device: introduce IOMMUFDDevice
  2024-01-15 10:13 ` [PATCH rfcv1 1/6] backends/iommufd_device: introduce IOMMUFDDevice Zhenzhong Duan
@ 2024-01-17 14:11   ` Eric Auger
  2024-01-18  2:58     ` Duan, Zhenzhong
  2024-01-18 12:42   ` Eric Auger
  1 sibling, 1 reply; 46+ messages in thread
From: Eric Auger @ 2024-01-17 14:11 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, peterx, jasowang, mst, jgg, nicolinc,
	joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Yi Sun

Hi Zhenzhong,

On 1/15/24 11:13, Zhenzhong Duan wrote:
> IOMMUFDDevice represents a device in iommufd and can be used as
> a communication interface between devices (i.e., VFIO, VDPA) and
> vIOMMU.
>
> Currently it includes iommufd handler and device id information
iommufd handle
> which could be used by vIOMMU to get hw IOMMU information.
>
> In future nested translation support, vIOMMU is going to have
> more iommufd related operations like allocate hwpt for a device,
> attach/detach hwpt, etc. So IOMMUFDDevice will be further expanded.
>
> IOMMUFDDevice is willingly not a QOM object because we don't want
> it to be visible from the user interface.
>
> Introduce a helper iommufd_device_init to initialize IOMMUFDDevice.

+  iommufd_device_get_info helper
>
> Originally-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  MAINTAINERS                     |  4 +--
>  include/sysemu/iommufd_device.h | 31 ++++++++++++++++++++
>  backends/iommufd_device.c       | 50 +++++++++++++++++++++++++++++++++
>  backends/meson.build            |  2 +-
>  4 files changed, 84 insertions(+), 3 deletions(-)
>  create mode 100644 include/sysemu/iommufd_device.h
>  create mode 100644 backends/iommufd_device.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 00ec1f7eca..606dfeb2b1 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2171,8 +2171,8 @@ M: Yi Liu <yi.l.liu@intel.com>
>  M: Eric Auger <eric.auger@redhat.com>
>  M: Zhenzhong Duan <zhenzhong.duan@intel.com>
>  S: Supported
> -F: backends/iommufd.c
> -F: include/sysemu/iommufd.h
> +F: backends/iommufd*.c
> +F: include/sysemu/iommufd*.h
>  F: include/qemu/chardev_open.h
>  F: util/chardev_open.c
>  F: docs/devel/vfio-iommufd.rst
> diff --git a/include/sysemu/iommufd_device.h b/include/sysemu/iommufd_device.h
> new file mode 100644
> index 0000000000..795630324b
> --- /dev/null
> +++ b/include/sysemu/iommufd_device.h
> @@ -0,0 +1,31 @@
> +/*
> + * IOMMUFD Device
> + *
> + * Copyright (C) 2024 Intel Corporation.
> + *
> + * Authors: Yi Liu <yi.l.liu@intel.com>
> + *          Zhenzhong Duan <zhenzhong.duan@intel.com>
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#ifndef SYSEMU_IOMMUFD_DEVICE_H
> +#define SYSEMU_IOMMUFD_DEVICE_H
> +
> +#include <linux/iommufd.h>
> +#include "sysemu/iommufd.h"
> +
> +typedef struct IOMMUFDDevice IOMMUFDDevice;
> +
> +/* This is an abstraction of host IOMMUFD device */
> +struct IOMMUFDDevice {
> +    IOMMUFDBackend *iommufd;
> +    uint32_t dev_id;
> +};
> +
> +int iommufd_device_get_info(IOMMUFDDevice *idev,
> +                            enum iommu_hw_info_type *type,
> +                            uint32_t len, void *data);
> +void iommufd_device_init(void *_idev, size_t instance_size,
> +                         IOMMUFDBackend *iommufd, uint32_t dev_id);
> +#endif
> diff --git a/backends/iommufd_device.c b/backends/iommufd_device.c
> new file mode 100644
> index 0000000000..f6e7ca1dbf
> --- /dev/null
> +++ b/backends/iommufd_device.c
> @@ -0,0 +1,50 @@
> +/*
> + * QEMU abstract of Host IOMMU
Is it the abstraction of the IOMMU or of any assigned device?
> + *
> + * Copyright (C) 2024 Intel Corporation.
> + *
> + * Authors: Yi Liu <yi.l.liu@intel.com>
> + *          Zhenzhong Duan <zhenzhong.duan@intel.com>
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#include <sys/ioctl.h>
> +#include "qemu/osdep.h"
> +#include "qemu/error-report.h"
> +#include "sysemu/iommufd_device.h"
> +
> +int iommufd_device_get_info(IOMMUFDDevice *idev,
> +                            enum iommu_hw_info_type *type,
> +                            uint32_t len, void *data)
> +{
> +    struct iommu_hw_info info = {
> +        .size = sizeof(info),
> +        .flags = 0,
> +        .dev_id = idev->dev_id,
> +        .data_len = len,
> +        .__reserved = 0,
> +        .data_uptr = (uintptr_t)data,
> +    };
> +    int ret;
> +
> +    ret = ioctl(idev->iommufd->fd, IOMMU_GET_HW_INFO, &info);
> +    if (ret) {
> +        error_report("Failed to get info %m");
you may prefer using errp instead of hard traces.
> +    } else {
> +        *type = info.out_data_type;
> +    }
> +
> +    return ret;
> +}
> +
> +void iommufd_device_init(void *_idev, size_t instance_size,
nit: why the "_"
> +                         IOMMUFDBackend *iommufd, uint32_t dev_id)
> +{
> +    IOMMUFDDevice *idev = (IOMMUFDDevice *)_idev;
> +
> +    g_assert(sizeof(IOMMUFDDevice) <= instance_size);
At this stage of the reading it is not clear why you pass
instance_size. Worth clarifying/documenting.
> +
> +    idev->iommufd = iommufd;
> +    idev->dev_id = dev_id;
> +}
> diff --git a/backends/meson.build b/backends/meson.build
> index 8b2b111497..c437cdb363 100644
> --- a/backends/meson.build
> +++ b/backends/meson.build
> @@ -24,7 +24,7 @@ if have_vhost_user
>    system_ss.add(when: 'CONFIG_VIRTIO', if_true: files('vhost-user.c'))
>  endif
>  system_ss.add(when: 'CONFIG_VIRTIO_CRYPTO', if_true: files('cryptodev-vhost.c'))
> -system_ss.add(when: 'CONFIG_IOMMUFD', if_true: files('iommufd.c'))
> +system_ss.add(when: 'CONFIG_IOMMUFD', if_true: files('iommufd.c', 'iommufd_device.c'))
>  if have_vhost_user_crypto
>    system_ss.add(when: 'CONFIG_VIRTIO_CRYPTO', if_true: files('cryptodev-vhost-user.c'))
>  endif
Thanks

Eric




* Re: [PATCH rfcv1 4/6] vfio: initialize IOMMUFDDevice and pass to vIOMMU
  2024-01-15 10:13 ` [PATCH rfcv1 4/6] vfio: initialize IOMMUFDDevice and pass to vIOMMU Zhenzhong Duan
@ 2024-01-17 15:37   ` Joao Martins
  2024-01-18  8:17     ` Duan, Zhenzhong
  2024-01-17 17:30   ` Eric Auger
  2024-01-22 17:15   ` Cédric Le Goater
  2 siblings, 1 reply; 46+ messages in thread
From: Joao Martins @ 2024-01-17 15:37 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng, Yi Sun,
	qemu-devel

On 15/01/2024 10:13, Zhenzhong Duan wrote:
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 9bfddc1360..cbd035f148 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -309,6 +309,7 @@ static int iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>      VFIOContainerBase *bcontainer;
>      VFIOIOMMUFDContainer *container;
>      VFIOAddressSpace *space;
> +    IOMMUFDDevice *idev = &vbasedev->idev;
>      struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>      int ret, devfd;
>      uint32_t ioas_id;
> @@ -428,6 +429,7 @@ found_container:
>      QLIST_INSERT_HEAD(&bcontainer->device_list, vbasedev, container_next);
>      QLIST_INSERT_HEAD(&vfio_device_list, vbasedev, global_next);
>  
> +    iommufd_device_init(idev, sizeof(*idev), container->be, vbasedev->devid);
>      trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev->num_irqs,
>                                     vbasedev->num_regions, vbasedev->flags);
>      return 0;

In the dirty tracking series, I'll need to fetch out_capabilities from device
and do a bunch of stuff that is used when allocating hwpt to ask for dirty
tracking. And this means having iommufd_device_init() be called before we call
iommufd_cdev_attach_container().

Here's what it looks like, based on an earlier version of your patch:

https://github.com/jpemartins/qemu/commit/433f97a05e0cdd8e3b8563aa20e4f22d107219b5

I can move the call earlier in my series, unless there's a specific reason
you call it here?

	Joao



* Re: [PATCH rfcv1 3/6] intel_iommu: add set/unset_iommu_device callback
  2024-01-15 10:13 ` [PATCH rfcv1 3/6] intel_iommu: add set/unset_iommu_device callback Zhenzhong Duan
@ 2024-01-17 15:44   ` Eric Auger
  2024-01-18  8:43     ` Duan, Zhenzhong
  2024-01-22 17:09   ` Cédric Le Goater
  1 sibling, 1 reply; 46+ messages in thread
From: Eric Auger @ 2024-01-17 15:44 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, peterx, jasowang, mst, jgg, nicolinc,
	joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Yi Sun, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost

Hi Zhenzhong,

On 1/15/24 11:13, Zhenzhong Duan wrote:
> From: Yi Liu <yi.l.liu@intel.com>
>
> This adds set/unset_iommu_device() implementation in Intel vIOMMU.
> In set call, IOMMUFDDevice is recorded in hash table indexed by
> PCI BDF.
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  include/hw/i386/intel_iommu.h | 10 +++++
>  hw/i386/intel_iommu.c         | 79 +++++++++++++++++++++++++++++++++++
>  2 files changed, 89 insertions(+)
>
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 7fa0a695c8..c65fdde56f 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -62,6 +62,7 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
>  typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
>  typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
>  typedef struct VTDPASIDEntry VTDPASIDEntry;
> +typedef struct VTDIOMMUFDDevice VTDIOMMUFDDevice;
>  
>  /* Context-Entry */
>  struct VTDContextEntry {
> @@ -148,6 +149,13 @@ struct VTDAddressSpace {
>      IOVATree *iova_tree;
>  };
>  
> +struct VTDIOMMUFDDevice {
> +    PCIBus *bus;
> +    uint8_t devfn;
> +    IOMMUFDDevice *idev;
> +    IntelIOMMUState *iommu_state;
> +};
> +
Just wondering whether we shouldn't reuse the VTDAddressSpace to store
the idev, if any. How did you make your choice? What will it become
when PASID gets added?
>  struct VTDIOTLBEntry {
>      uint64_t gfn;
>      uint16_t domain_id;
> @@ -292,6 +300,8 @@ struct IntelIOMMUState {
>      /* list of registered notifiers */
>      QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
>  
> +    GHashTable *vtd_iommufd_dev;             /* VTDIOMMUFDDevice */
> +
>      /* interrupt remapping */
>      bool intr_enabled;              /* Whether guest enabled IR */
>      dma_addr_t intr_root;           /* Interrupt remapping table pointer */
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index ed5677c0ae..95faf697eb 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -237,6 +237,13 @@ static gboolean vtd_as_equal(gconstpointer v1, gconstpointer v2)
>             (key1->pasid == key2->pasid);
>  }
>  
> +static gboolean vtd_as_idev_equal(gconstpointer v1, gconstpointer v2)
> +{
> +    const struct vtd_as_key *key1 = v1;
> +    const struct vtd_as_key *key2 = v2;
> +
> +    return (key1->bus == key2->bus) && (key1->devfn == key2->devfn);
> +}
>  /*
>   * Note that we use pointer to PCIBus as the key, so hashing/shifting
>   * based on the pointer value is intended. Note that we deal with
> @@ -3812,6 +3819,74 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
>      return vtd_dev_as;
>  }
>  
> +static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int32_t devfn,
> +                                    IOMMUFDDevice *idev, Error **errp)
> +{
> +    IntelIOMMUState *s = opaque;
> +    VTDIOMMUFDDevice *vtd_idev;
> +    struct vtd_as_key key = {
> +        .bus = bus,
> +        .devfn = devfn,
> +    };
> +    struct vtd_as_key *new_key;
> +
> +    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
> +
> +    /* None IOMMUFD case */
> +    if (!idev) {
> +        return 0;
> +    }
> +
> +    vtd_iommu_lock(s);
> +
> +    vtd_idev = g_hash_table_lookup(s->vtd_iommufd_dev, &key);
> +
> +    if (vtd_idev) {
> +        error_setg(errp, "IOMMUFD device already exist");
> +        return -1;
> +    }
> +
> +    new_key = g_malloc(sizeof(*new_key));
> +    new_key->bus = bus;
> +    new_key->devfn = devfn;
> +
> +    vtd_idev = g_malloc0(sizeof(VTDIOMMUFDDevice));
> +    vtd_idev->bus = bus;
> +    vtd_idev->devfn = (uint8_t)devfn;
> +    vtd_idev->iommu_state = s;
> +    vtd_idev->idev = idev;
> +
> +    g_hash_table_insert(s->vtd_iommufd_dev, new_key, vtd_idev);
> +
> +    vtd_iommu_unlock(s);
> +
> +    return 0;
> +}
> +
> +static void vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int32_t devfn)
> +{
> +    IntelIOMMUState *s = opaque;
> +    VTDIOMMUFDDevice *vtd_idev;
> +    struct vtd_as_key key = {
> +        .bus = bus,
> +        .devfn = devfn,
> +    };
> +
> +    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
> +
> +    vtd_iommu_lock(s);
> +
> +    vtd_idev = g_hash_table_lookup(s->vtd_iommufd_dev, &key);
> +    if (!vtd_idev) {
> +        vtd_iommu_unlock(s);
> +        return;
> +    }
> +
> +    g_hash_table_remove(s->vtd_iommufd_dev, &key);
> +
> +    vtd_iommu_unlock(s);
> +}
> +
>  /* Unmap the whole range in the notifier's scope. */
>  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>  {
> @@ -4107,6 +4182,8 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>  
>  static PCIIOMMUOps vtd_iommu_ops = {
>      .get_address_space = vtd_host_dma_iommu,
> +    .set_iommu_device = vtd_dev_set_iommu_device,
> +    .unset_iommu_device = vtd_dev_unset_iommu_device,
>  };
>  
>  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> @@ -4230,6 +4307,8 @@ static void vtd_realize(DeviceState *dev, Error **errp)
>                                       g_free, g_free);
>      s->vtd_address_spaces = g_hash_table_new_full(vtd_as_hash, vtd_as_equal,
>                                        g_free, g_free);
> +    s->vtd_iommufd_dev = g_hash_table_new_full(vtd_as_hash, vtd_as_idev_equal,
> +                                               g_free, g_free);
>      vtd_init(s);
>      pci_setup_iommu(bus, &vtd_iommu_ops, dev);
>      /* Pseudo address space under root PCI bus. */
Thanks

Eric




* Re: [PATCH rfcv1 4/6] vfio: initialize IOMMUFDDevice and pass to vIOMMU
  2024-01-15 10:13 ` [PATCH rfcv1 4/6] vfio: initialize IOMMUFDDevice and pass to vIOMMU Zhenzhong Duan
  2024-01-17 15:37   ` Joao Martins
@ 2024-01-17 17:30   ` Eric Auger
  2024-01-18  9:23     ` Duan, Zhenzhong
  2024-01-22 17:15   ` Cédric Le Goater
  2 siblings, 1 reply; 46+ messages in thread
From: Eric Auger @ 2024-01-17 17:30 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, peterx, jasowang, mst, jgg, nicolinc,
	joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Yi Sun

Hi Zhenzhong,

On 1/15/24 11:13, Zhenzhong Duan wrote:
> Initialize IOMMUFDDevice in vfio and pass to vIOMMU, so that vIOMMU
> could get hw IOMMU information.
>
> In VFIO legacy backend mode, we still pass a NULL IOMMUFDDevice to vIOMMU,
> in case vIOMMU needs some processing for VFIO legacy backend mode.
>
> Originally-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  include/hw/vfio/vfio-common.h |  2 ++
>  hw/vfio/iommufd.c             |  2 ++
>  hw/vfio/pci.c                 | 24 +++++++++++++++++++-----
>  3 files changed, 23 insertions(+), 5 deletions(-)
>
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 9b7ef7d02b..fde0d0ca60 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -31,6 +31,7 @@
>  #endif
>  #include "sysemu/sysemu.h"
>  #include "hw/vfio/vfio-container-base.h"
> +#include "sysemu/iommufd_device.h"
>  
>  #define VFIO_MSG_PREFIX "vfio %s: "
>  
> @@ -126,6 +127,7 @@ typedef struct VFIODevice {
>      bool dirty_tracking;
>      int devid;
>      IOMMUFDBackend *iommufd;
> +    IOMMUFDDevice idev;
This looks like a duplicate of existing fields:
idev.dev_id is the same as devid above. By the way, let's try to use the
same name, devid, everywhere.
idev.iommufd is the same as iommufd above if != NULL.
So we should at least rationalize.
>  } VFIODevice;
>  
>  struct VFIODeviceOps {
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 9bfddc1360..cbd035f148 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -309,6 +309,7 @@ static int iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>      VFIOContainerBase *bcontainer;
>      VFIOIOMMUFDContainer *container;
>      VFIOAddressSpace *space;
> +    IOMMUFDDevice *idev = &vbasedev->idev;
>      struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>      int ret, devfd;
>      uint32_t ioas_id;
> @@ -428,6 +429,7 @@ found_container:
>      QLIST_INSERT_HEAD(&bcontainer->device_list, vbasedev, container_next);
>      QLIST_INSERT_HEAD(&vfio_device_list, vbasedev, global_next);
>  
> +    iommufd_device_init(idev, sizeof(*idev), container->be, vbasedev->devid);
>      trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev->num_irqs,
>                                     vbasedev->num_regions, vbasedev->flags);
>      return 0;
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index d7fe06715c..2c3a5d267b 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3107,11 +3107,21 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>  
>      vfio_bars_register(vdev);
>  
> -    ret = vfio_add_capabilities(vdev, errp);
> +    if (vbasedev->iommufd) {
> +        ret = pci_device_set_iommu_device(pdev, &vbasedev->idev, errp);
> +    } else {
> +        ret = pci_device_set_iommu_device(pdev, 0, errp);
> +    }
>      if (ret) {
> +        error_prepend(errp, "Failed to set iommu_device: ");
at the moment it is rather an IOMMUFD device.
>          goto out_teardown;
>      }
>  
> +    ret = vfio_add_capabilities(vdev, errp);
> +    if (ret) {
> +        goto out_unset_idev;
> +    }
> +
>      if (vdev->vga) {
>          vfio_vga_quirk_setup(vdev);
>      }
> @@ -3128,7 +3138,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>              error_setg(errp,
>                         "cannot support IGD OpRegion feature on hotplugged "
>                         "device");
> -            goto out_teardown;
> +            goto out_unset_idev;
>          }
>  
>          ret = vfio_get_dev_region_info(vbasedev,
> @@ -3137,13 +3147,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>          if (ret) {
>              error_setg_errno(errp, -ret,
>                               "does not support requested IGD OpRegion feature");
> -            goto out_teardown;
> +            goto out_unset_idev;
>          }
>  
>          ret = vfio_pci_igd_opregion_init(vdev, opregion, errp);
>          g_free(opregion);
>          if (ret) {
> -            goto out_teardown;
> +            goto out_unset_idev;
>          }
>      }
>  
> @@ -3229,6 +3239,8 @@ out_deregister:
>      if (vdev->intx.mmap_timer) {
>          timer_free(vdev->intx.mmap_timer);
>      }
> +out_unset_idev:
> +    pci_device_unset_iommu_device(pdev);
>  out_teardown:
>      vfio_teardown_msi(vdev);
>      vfio_bars_exit(vdev);
> @@ -3257,6 +3269,7 @@ static void vfio_instance_finalize(Object *obj)
>  static void vfio_exitfn(PCIDevice *pdev)
>  {
>      VFIOPCIDevice *vdev = VFIO_PCI(pdev);
> +    VFIODevice *vbasedev = &vdev->vbasedev;
>  
>      vfio_unregister_req_notifier(vdev);
>      vfio_unregister_err_notifier(vdev);
> @@ -3271,7 +3284,8 @@ static void vfio_exitfn(PCIDevice *pdev)
>      vfio_teardown_msi(vdev);
>      vfio_pci_disable_rp_atomics(vdev);
>      vfio_bars_exit(vdev);
> -    vfio_migration_exit(&vdev->vbasedev);
> +    vfio_migration_exit(vbasedev);
> +    pci_device_unset_iommu_device(pdev);
>  }
>  
>  static void vfio_pci_reset(DeviceState *dev)
Thanks

Eric



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH rfcv1 5/6] intel_iommu: extract out vtd_cap_init to initialize cap/ecap
  2024-01-15 10:13 ` [PATCH rfcv1 5/6] intel_iommu: extract out vtd_cap_init to initialize cap/ecap Zhenzhong Duan
@ 2024-01-17 17:36   ` Eric Auger
  0 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2024-01-17 17:36 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, peterx, jasowang, mst, jgg, nicolinc,
	joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum



On 1/15/24 11:13, Zhenzhong Duan wrote:
> This is a prerequisite for host cap/ecap sync.
>
> No functional change intended.
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Looks good to me
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric
> ---
>  hw/i386/intel_iommu.c | 92 +++++++++++++++++++++++--------------------
>  1 file changed, 50 insertions(+), 42 deletions(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 95faf697eb..4c1d058ebd 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -4009,30 +4009,10 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>      return;
>  }
>  
> -/* Do the initialization. It will also be called when reset, so pay
> - * attention when adding new initialization stuff.
> - */
> -static void vtd_init(IntelIOMMUState *s)
> +static void vtd_cap_init(IntelIOMMUState *s)
>  {
>      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
>  
> -    memset(s->csr, 0, DMAR_REG_SIZE);
> -    memset(s->wmask, 0, DMAR_REG_SIZE);
> -    memset(s->w1cmask, 0, DMAR_REG_SIZE);
> -    memset(s->womask, 0, DMAR_REG_SIZE);
> -
> -    s->root = 0;
> -    s->root_scalable = false;
> -    s->dmar_enabled = false;
> -    s->intr_enabled = false;
> -    s->iq_head = 0;
> -    s->iq_tail = 0;
> -    s->iq = 0;
> -    s->iq_size = 0;
> -    s->qi_enabled = false;
> -    s->iq_last_desc_type = VTD_INV_DESC_NONE;
> -    s->iq_dw = false;
> -    s->next_frcd_reg = 0;
>      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
>               VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
>               VTD_CAP_MGAW(s->aw_bits);
> @@ -4049,27 +4029,6 @@ static void vtd_init(IntelIOMMUState *s)
>      }
>      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
>  
> -    /*
> -     * Rsvd field masks for spte
> -     */
> -    vtd_spte_rsvd[0] = ~0ULL;
> -    vtd_spte_rsvd[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->aw_bits,
> -                                                  x86_iommu->dt_supported);
> -    vtd_spte_rsvd[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->aw_bits);
> -    vtd_spte_rsvd[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->aw_bits);
> -    vtd_spte_rsvd[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->aw_bits);
> -
> -    vtd_spte_rsvd_large[2] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->aw_bits,
> -                                                         x86_iommu->dt_supported);
> -    vtd_spte_rsvd_large[3] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->aw_bits,
> -                                                         x86_iommu->dt_supported);
> -
> -    if (s->scalable_mode || s->snoop_control) {
> -        vtd_spte_rsvd[1] &= ~VTD_SPTE_SNP;
> -        vtd_spte_rsvd_large[2] &= ~VTD_SPTE_SNP;
> -        vtd_spte_rsvd_large[3] &= ~VTD_SPTE_SNP;
> -    }
> -
>      if (x86_iommu_ir_supported(x86_iommu)) {
>          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
>          if (s->intr_eim == ON_OFF_AUTO_ON) {
> @@ -4102,7 +4061,56 @@ static void vtd_init(IntelIOMMUState *s)
>      if (s->pasid) {
>          s->ecap |= VTD_ECAP_PASID;
>      }
> +}
> +
> +/*
> + * Do the initialization. It will also be called when reset, so pay
> + * attention when adding new initialization stuff.
> + */
> +static void vtd_init(IntelIOMMUState *s)
> +{
> +    X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> +
> +    memset(s->csr, 0, DMAR_REG_SIZE);
> +    memset(s->wmask, 0, DMAR_REG_SIZE);
> +    memset(s->w1cmask, 0, DMAR_REG_SIZE);
> +    memset(s->womask, 0, DMAR_REG_SIZE);
> +
> +    s->root = 0;
> +    s->root_scalable = false;
> +    s->dmar_enabled = false;
> +    s->intr_enabled = false;
> +    s->iq_head = 0;
> +    s->iq_tail = 0;
> +    s->iq = 0;
> +    s->iq_size = 0;
> +    s->qi_enabled = false;
> +    s->iq_last_desc_type = VTD_INV_DESC_NONE;
> +    s->iq_dw = false;
> +    s->next_frcd_reg = 0;
> +
> +    /*
> +     * Rsvd field masks for spte
> +     */
> +    vtd_spte_rsvd[0] = ~0ULL;
> +    vtd_spte_rsvd[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->aw_bits,
> +                                                  x86_iommu->dt_supported);
> +    vtd_spte_rsvd[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->aw_bits);
> +    vtd_spte_rsvd[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->aw_bits);
> +    vtd_spte_rsvd[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->aw_bits);
> +
> +    vtd_spte_rsvd_large[2] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->aw_bits,
> +                                                    x86_iommu->dt_supported);
> +    vtd_spte_rsvd_large[3] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->aw_bits,
> +                                                    x86_iommu->dt_supported);
> +
> +    if (s->scalable_mode || s->snoop_control) {
> +        vtd_spte_rsvd[1] &= ~VTD_SPTE_SNP;
> +        vtd_spte_rsvd_large[2] &= ~VTD_SPTE_SNP;
> +        vtd_spte_rsvd_large[3] &= ~VTD_SPTE_SNP;
> +    }
>  
> +    vtd_cap_init(s);
>      vtd_reset_caches(s);
>  
>      /* Define registers with default values and bit semantics */



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH rfcv1 6/6] intel_iommu: add a framework to check and sync host IOMMU cap/ecap
  2024-01-15 10:13 ` [PATCH rfcv1 6/6] intel_iommu: add a framework to check and sync host IOMMU cap/ecap Zhenzhong Duan
@ 2024-01-17 17:56   ` Eric Auger
  2024-01-18  9:30     ` Duan, Zhenzhong
  2024-01-23  8:39   ` Cédric Le Goater
  1 sibling, 1 reply; 46+ messages in thread
From: Eric Auger @ 2024-01-17 17:56 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, peterx, jasowang, mst, jgg, nicolinc,
	joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Yi Sun, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum

Hi Zhenzhong,

On 1/15/24 11:13, Zhenzhong Duan wrote:
> From: Yi Liu <yi.l.liu@intel.com>
>
> Add a framework to check and synchronize host IOMMU cap/ecap with
> vIOMMU cap/ecap.
>
> Currently only stage-2 translation is supported which is backed by
> shadow page table on host side. So we don't need exact matching of
> each bit of cap/ecap between vIOMMU and host. However, we can still
> utilize this framework to ensure compatibility of host and vIOMMU's
> address width at least, i.e., vIOMMU's aw_bits <= host aw_bits,
> which is missed before.
>
> When stage-1 translation is supported in future, a.k.a. scalable
> modern mode, we need to ensure compatibility of each bits. Some
> bits are user controllable, they should be checked with host side
> to ensure compatibility. Other bits are not, they should be synced
> into vIOMMU cap/ecap for compatibility.
>
> The sequence will be:
>
> vtd_cap_init() initializes iommu->cap/ecap. ---- vtd_cap_init()
> iommu->host_cap/ecap is initialized as iommu->cap/ecap.  ---- vtd_init()
> iommu->host_cap/ecap is checked and updated some bits with host cap/ecap. ---- vtd_sync_hw_info()
> iommu->cap/ecap is finalized as iommu->host_cap/ecap.  ---- vtd_machine_done_hook()
>
> iommu->host_cap/ecap is a temporary storage to hold intermediate value
> when synthesize host cap/ecap and vIOMMU's initial configured cap/ecap.
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  include/hw/i386/intel_iommu.h |  4 ++
>  hw/i386/intel_iommu.c         | 78 +++++++++++++++++++++++++++++++----
>  2 files changed, 75 insertions(+), 7 deletions(-)
>
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index c65fdde56f..b8abbcce12 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -292,6 +292,9 @@ struct IntelIOMMUState {
>      uint64_t cap;                   /* The value of capability reg */
>      uint64_t ecap;                  /* The value of extended capability reg */
>  
> +    uint64_t host_cap;              /* The value of host capability reg */
> +    uint64_t host_ecap;             /* The value of host ext-capability reg */
> +
>      uint32_t context_cache_gen;     /* Should be in [1,MAX] */
>      GHashTable *iotlb;              /* IOTLB */
>  
> @@ -314,6 +317,7 @@ struct IntelIOMMUState {
>      bool dma_translation;           /* Whether DMA translation supported */
>      bool pasid;                     /* Whether to support PASID */
>  
> +    bool cap_finalized;             /* Whether VTD capability finalized */
>      /*
>       * Protects IOMMU states in general.  Currently it protects the
>       * per-IOMMU IOTLB cache, and context entry cache in VTDAddressSpace.
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 4c1d058ebd..be03fcbf52 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -3819,6 +3819,47 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
>      return vtd_dev_as;
>  }
>  
> +static bool vtd_sync_hw_info(IntelIOMMUState *s, struct iommu_hw_info_vtd *vtd,
> +                             Error **errp)
> +{
> +    uint64_t addr_width;
> +
> +    addr_width = (vtd->cap_reg >> 16) & 0x3fULL;
> +    if (s->aw_bits > addr_width) {
> +        error_setg(errp, "User aw-bits: %u > host address width: %lu",
> +                   s->aw_bits, addr_width);
> +        return false;
> +    }
> +
> +    /* TODO: check and sync host cap/ecap into vIOMMU cap/ecap */
> +
> +    return true;
> +}
> +
> +/*
> + * virtual VT-d which wants nested needs to check the host IOMMU
> + * nesting cap info behind the assigned devices. Thus that vIOMMU
> + * could bind guest page table to host.
> + */
> +static bool vtd_check_idev(IntelIOMMUState *s, IOMMUFDDevice *idev,
> +                           Error **errp)
> +{
> +    struct iommu_hw_info_vtd vtd;
> +    enum iommu_hw_info_type type = IOMMU_HW_INFO_TYPE_INTEL_VTD;
> +
> +    if (iommufd_device_get_info(idev, &type, sizeof(vtd), &vtd)) {
> +        error_setg(errp, "Failed to get IOMMU capability!!!");
> +        return false;
> +    }
> +
> +    if (type != IOMMU_HW_INFO_TYPE_INTEL_VTD) {
> +        error_setg(errp, "IOMMU hardware is not compatible!!!");
> +        return false;
> +    }
> +
> +    return vtd_sync_hw_info(s, &vtd, errp);
> +}
> +
>  static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int32_t devfn,
>                                      IOMMUFDDevice *idev, Error **errp)
>  {
> @@ -3837,6 +3878,10 @@ static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int32_t devfn,
>          return 0;
>      }
>  
> +    if (!vtd_check_idev(s, idev, errp)) {
In
[RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for
hotplugged devices
https://lore.kernel.org/all/20240117080414.316890-1-eric.auger@redhat.com/

I also attempt to pass host IOMMU info to the virtio-iommu, but with the
legacy backend. In my case I want to pass the reserved memory regions, which
also model the address width (aw).
So this is a pretty similar use case.

Why don't we pass the pointer to an opaque iommu_hw_info instead,
through the PCIIOMMUOps?



> +        return -1;
> +    }
> +
>      vtd_iommu_lock(s);
>  
>      vtd_idev = g_hash_table_lookup(s->vtd_iommufd_dev, &key);
> @@ -4071,10 +4116,11 @@ static void vtd_init(IntelIOMMUState *s)
>  {
>      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
>  
> -    memset(s->csr, 0, DMAR_REG_SIZE);
> -    memset(s->wmask, 0, DMAR_REG_SIZE);
> -    memset(s->w1cmask, 0, DMAR_REG_SIZE);
> -    memset(s->womask, 0, DMAR_REG_SIZE);
> +    /* CAP/ECAP are initialized in machine create done stage */
> +    memset(s->csr + DMAR_GCMD_REG, 0, DMAR_REG_SIZE - DMAR_GCMD_REG);
> +    memset(s->wmask + DMAR_GCMD_REG, 0, DMAR_REG_SIZE - DMAR_GCMD_REG);
> +    memset(s->w1cmask + DMAR_GCMD_REG, 0, DMAR_REG_SIZE - DMAR_GCMD_REG);
> +    memset(s->womask + DMAR_GCMD_REG, 0, DMAR_REG_SIZE - DMAR_GCMD_REG);
This change is not documented in the commit msg.
Sorry, I don't get why this is needed.
>  
>      s->root = 0;
>      s->root_scalable = false;
> @@ -4110,13 +4156,16 @@ static void vtd_init(IntelIOMMUState *s)
>          vtd_spte_rsvd_large[3] &= ~VTD_SPTE_SNP;
>      }
>  
> -    vtd_cap_init(s);
> +    if (!s->cap_finalized) {
> +        vtd_cap_init(s);
> +        s->host_cap = s->cap;
> +        s->host_ecap = s->ecap;
> +    }
> +
>      vtd_reset_caches(s);
>  
>      /* Define registers with default values and bit semantics */
>      vtd_define_long(s, DMAR_VER_REG, 0x10UL, 0, 0);
> -    vtd_define_quad(s, DMAR_CAP_REG, s->cap, 0, 0);
> -    vtd_define_quad(s, DMAR_ECAP_REG, s->ecap, 0, 0);
>      vtd_define_long(s, DMAR_GCMD_REG, 0, 0xff800000UL, 0);
>      vtd_define_long_wo(s, DMAR_GCMD_REG, 0xff800000UL);
>      vtd_define_long(s, DMAR_GSTS_REG, 0, 0, 0);
> @@ -4241,6 +4290,12 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
>      return true;
>  }
>  
> +static void vtd_setup_capability_reg(IntelIOMMUState *s)
> +{
> +    vtd_define_quad(s, DMAR_CAP_REG, s->cap, 0, 0);
> +    vtd_define_quad(s, DMAR_ECAP_REG, s->ecap, 0, 0);
> +}
> +
>  static int vtd_machine_done_notify_one(Object *child, void *unused)
>  {
>      IntelIOMMUState *iommu = INTEL_IOMMU_DEVICE(x86_iommu_get_default());
> @@ -4259,6 +4314,14 @@ static int vtd_machine_done_notify_one(Object *child, void *unused)
>  
>  static void vtd_machine_done_hook(Notifier *notifier, void *unused)
>  {
> +    IntelIOMMUState *iommu = INTEL_IOMMU_DEVICE(x86_iommu_get_default());
> +
> +    iommu->cap = iommu->host_cap;
> +    iommu->ecap = iommu->host_ecap;
> +    iommu->cap_finalized = true;
I don't think you can change the defaults like this without taking care
of compats (migration).

Thanks

Eric
> +
> +    vtd_setup_capability_reg(iommu);
> +
>      object_child_foreach_recursive(object_get_root(),
>                                     vtd_machine_done_notify_one, NULL);
>  }
> @@ -4292,6 +4355,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
>  
>      QLIST_INIT(&s->vtd_as_with_notifiers);
>      qemu_mutex_init(&s->iommu_lock);
> +    s->cap_finalized = false;
>      memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
>                            "intel_iommu", DMAR_REG_SIZE);
>      memory_region_add_subregion(get_system_memory(),
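
For reference, a minimal sketch of the address-width check discussed in this
patch. It assumes the VT-d CAP_REG layout in which MGAW occupies bits 21:16
and encodes the width minus one (QEMU's own VTD_CAP_MGAW() stores aw - 1), so
the decoded field needs + 1 before it is compared with aw_bits; the quoted
hunk compares the raw field. The helper and macro names below are
hypothetical and not part of the series.

```c
#include <stdbool.h>
#include <stdint.h>

/* MGAW lives in CAP_REG bits 21:16 and is encoded as (width - 1). */
#define CAP_MGAW_SHIFT 16
#define CAP_MGAW_MASK  0x3fULL

/* Hypothetical helper: decode the host address width in bits. */
static inline uint8_t host_aw_from_cap(uint64_t cap_reg)
{
    return (uint8_t)(((cap_reg >> CAP_MGAW_SHIFT) & CAP_MGAW_MASK) + 1);
}

/* Hypothetical helper: the vIOMMU width must not exceed the host width. */
static inline bool aw_bits_compatible(uint8_t viommu_aw_bits,
                                      uint64_t host_cap_reg)
{
    return viommu_aw_bits <= host_aw_from_cap(host_cap_reg);
}
```

If the decoded width is printed in an error message as a uint64_t, PRIu64
from <inttypes.h> is the portable format specifier rather than %lu.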



^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [PATCH rfcv1 1/6] backends/iommufd_device: introduce IOMMUFDDevice
  2024-01-17 14:11   ` Eric Auger
@ 2024-01-18  2:58     ` Duan, Zhenzhong
  0 siblings, 0 replies; 46+ messages in thread
From: Duan, Zhenzhong @ 2024-01-18  2:58 UTC (permalink / raw)
  To: eric.auger, qemu-devel
  Cc: alex.williamson, clg, peterx, jasowang, mst, jgg, nicolinc,
	joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P,
	Yi Sun



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH rfcv1 1/6] backends/iommufd_device: introduce
>IOMMUFDDevice
>
>Hi Zhenzhong,
>
>On 1/15/24 11:13, Zhenzhong Duan wrote:
>> IOMMUFDDevice represents a device in iommufd and can be used as
>> a communication interface between devices (i.e., VFIO, VDPA) and
>> vIOMMU.
>>
>> Currently it includes iommufd handler and device id information
>iommufd handle
>> which could be used by vIOMMU to get hw IOMMU information.
>>
>> In future nested translation support, vIOMMU is going to have
>> more iommufd related operations like allocate hwpt for a device,
>> attach/detach hwpt, etc. So IOMMUFDDevice will be further expanded.
>>
>> IOMMUFDDevice is willingly not a QOM object because we don't want
>> it to be visible from the user interface.
>>
>> Introduce a helper iommufd_device_init to initialize IOMMUFDDevice.
>
>+  iommufd_device_get_info helper

Will do.

>>
>> Originally-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  MAINTAINERS                     |  4 +--
>>  include/sysemu/iommufd_device.h | 31 ++++++++++++++++++++
>>  backends/iommufd_device.c       | 50
>+++++++++++++++++++++++++++++++++
>>  backends/meson.build            |  2 +-
>>  4 files changed, 84 insertions(+), 3 deletions(-)
>>  create mode 100644 include/sysemu/iommufd_device.h
>>  create mode 100644 backends/iommufd_device.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 00ec1f7eca..606dfeb2b1 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -2171,8 +2171,8 @@ M: Yi Liu <yi.l.liu@intel.com>
>>  M: Eric Auger <eric.auger@redhat.com>
>>  M: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>  S: Supported
>> -F: backends/iommufd.c
>> -F: include/sysemu/iommufd.h
>> +F: backends/iommufd*.c
>> +F: include/sysemu/iommufd*.h
>>  F: include/qemu/chardev_open.h
>>  F: util/chardev_open.c
>>  F: docs/devel/vfio-iommufd.rst
>> diff --git a/include/sysemu/iommufd_device.h
>b/include/sysemu/iommufd_device.h
>> new file mode 100644
>> index 0000000000..795630324b
>> --- /dev/null
>> +++ b/include/sysemu/iommufd_device.h
>> @@ -0,0 +1,31 @@
>> +/*
>> + * IOMMUFD Device
>> + *
>> + * Copyright (C) 2024 Intel Corporation.
>> + *
>> + * Authors: Yi Liu <yi.l.liu@intel.com>
>> + *          Zhenzhong Duan <zhenzhong.duan@intel.com>
>> + *
>> + * SPDX-License-Identifier: GPL-2.0-or-later
>> + */
>> +
>> +#ifndef SYSEMU_IOMMUFD_DEVICE_H
>> +#define SYSEMU_IOMMUFD_DEVICE_H
>> +
>> +#include <linux/iommufd.h>
>> +#include "sysemu/iommufd.h"
>> +
>> +typedef struct IOMMUFDDevice IOMMUFDDevice;
>> +
>> +/* This is an abstraction of host IOMMUFD device */
>> +struct IOMMUFDDevice {
>> +    IOMMUFDBackend *iommufd;
>> +    uint32_t dev_id;
>> +};
>> +
>> +int iommufd_device_get_info(IOMMUFDDevice *idev,
>> +                            enum iommu_hw_info_type *type,
>> +                            uint32_t len, void *data);
>> +void iommufd_device_init(void *_idev, size_t instance_size,
>> +                         IOMMUFDBackend *iommufd, uint32_t dev_id);
>> +#endif
>> diff --git a/backends/iommufd_device.c b/backends/iommufd_device.c
>> new file mode 100644
>> index 0000000000..f6e7ca1dbf
>> --- /dev/null
>> +++ b/backends/iommufd_device.c
>> @@ -0,0 +1,50 @@
>> +/*
>> + * QEMU abstract of Host IOMMU
>it is the abstraction of the IOMMU or of any assigned device?

'QEMU abstract of Host IOMMUFD device' may be better.

>> + *
>> + * Copyright (C) 2024 Intel Corporation.
>> + *
>> + * Authors: Yi Liu <yi.l.liu@intel.com>
>> + *          Zhenzhong Duan <zhenzhong.duan@intel.com>
>> + *
>> + * SPDX-License-Identifier: GPL-2.0-or-later
>> + */
>> +
>> +#include <sys/ioctl.h>
>> +#include "qemu/osdep.h"
>> +#include "qemu/error-report.h"
>> +#include "sysemu/iommufd_device.h"
>> +
>> +int iommufd_device_get_info(IOMMUFDDevice *idev,
>> +                            enum iommu_hw_info_type *type,
>> +                            uint32_t len, void *data)
>> +{
>> +    struct iommu_hw_info info = {
>> +        .size = sizeof(info),
>> +        .flags = 0,
>> +        .dev_id = idev->dev_id,
>> +        .data_len = len,
>> +        .__reserved = 0,
>> +        .data_uptr = (uintptr_t)data,
>> +    };
>> +    int ret;
>> +
>> +    ret = ioctl(idev->iommufd->fd, IOMMU_GET_HW_INFO, &info);
>> +    if (ret) {
>> +        error_report("Failed to get info %m");
>you may prefer using errp instead of hard traces.

Good suggestion, will do.

>> +    } else {
>> +        *type = info.out_data_type;
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +void iommufd_device_init(void *_idev, size_t instance_size,
>nit: why the "_"

To distinguish it from the local idev.

>> +                         IOMMUFDBackend *iommufd, uint32_t dev_id)
>> +{
>> +    IOMMUFDDevice *idev = (IOMMUFDDevice *)_idev;
>> +
>> +    g_assert(sizeof(IOMMUFDDevice) <= instance_size);
>at this stage of the reading it is not clear why you input the
>instance_size. worth to be clarified/documented.

VFIO or VDPA may have IOMMUFD-related attributes for their own use.
It looks like VFIO doesn't need this for now. I'll remove it, and then _idev
can be removed too.

Thanks
Zhenzhong
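
A minimal sketch of what the simplified helper could look like once
instance_size and the void * parameter are dropped, as agreed above. This is
only an illustration: IOMMUFDBackend is reduced to a stand-in struct so the
fragment compiles on its own, whereas the real type lives in
sysemu/iommufd.h.

```c
#include <stdint.h>

/* Stand-in for QEMU's IOMMUFDBackend; only the fd matters for this sketch. */
typedef struct IOMMUFDBackend {
    int fd;
} IOMMUFDBackend;

/* Same two fields as the IOMMUFDDevice introduced by this patch. */
typedef struct IOMMUFDDevice {
    IOMMUFDBackend *iommufd;
    uint32_t dev_id;
} IOMMUFDDevice;

/* Typed pointer, no instance_size: nothing to cast and nothing to assert. */
static inline void iommufd_device_init(IOMMUFDDevice *idev,
                                       IOMMUFDBackend *iommufd,
                                       uint32_t dev_id)
{
    idev->iommufd = iommufd;
    idev->dev_id = dev_id;
}
```

With a typed first argument, the local alias that motivated the "_" prefix
is no longer needed.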

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [PATCH rfcv1 2/6] hw/pci: introduce pci_device_set/unset_iommu_device()
  2024-01-17 14:11   ` Eric Auger
@ 2024-01-18  7:58     ` Duan, Zhenzhong
  0 siblings, 0 replies; 46+ messages in thread
From: Duan, Zhenzhong @ 2024-01-18  7:58 UTC (permalink / raw)
  To: eric.auger, qemu-devel
  Cc: alex.williamson, clg, peterx, jasowang, mst, jgg, nicolinc,
	joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P,
	Yi Sun, Marcel Apfelbaum



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH rfcv1 2/6] hw/pci: introduce
>pci_device_set/unset_iommu_device()
>
>Hi Zhenzhong,
>
>On 1/15/24 11:13, Zhenzhong Duan wrote:
>> From: Yi Liu <yi.l.liu@intel.com>
>>
>> This adds pci_device_set/unset_iommu_device() to set/unset
>> IOMMUFDDevice for a given PCIe device. Caller of set
>> should fail if set operation fails.
>>
>> Extract out pci_device_get_iommu_bus_devfn() to facilitate
>> implementation of pci_device_set/unset_iommu_device().
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  include/hw/pci/pci.h | 39 ++++++++++++++++++++++++++++++++++-
>>  hw/pci/pci.c         | 49
>+++++++++++++++++++++++++++++++++++++++++++-
>>  2 files changed, 86 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>> index fa6313aabc..a810c0ec74 100644
>> --- a/include/hw/pci/pci.h
>> +++ b/include/hw/pci/pci.h
>> @@ -7,6 +7,8 @@
>>  /* PCI includes legacy ISA access.  */
>>  #include "hw/isa/isa.h"
>>
>> +#include "sysemu/iommufd_device.h"
>> +
>>  extern bool pci_available;
>>
>>  /* PCI bus */
>> @@ -384,10 +386,45 @@ typedef struct PCIIOMMUOps {
>>       *
>>       * @devfn: device and function number
>>       */
>> -   AddressSpace * (*get_address_space)(PCIBus *bus, void *opaque, int
>devfn);
>> +    AddressSpace * (*get_address_space)(PCIBus *bus, void *opaque, int
>devfn);
>> +    /**
>> +     * @set_iommu_device: set iommufd device for a PCI device to
>vIOMMU
>> +     *
>> +     * Optional callback, if not implemented in vIOMMU, then vIOMMU
>can't
>> +     * utilize iommufd specific features.
>> +     *
>> +     * Return true if iommufd device is accepted, or else return false with
>> +     * errp set.
>> +     *
>> +     * @bus: the #PCIBus of the PCI device.
>> +     *
>> +     * @opaque: the data passed to pci_setup_iommu().
>> +     *
>> +     * @devfn: device and function number of the PCI device.
>> +     *
>> +     * @idev: the data structure representing iommufd device.
>> +     *
>> +     */
>> +    int (*set_iommu_device)(PCIBus *bus, void *opaque, int32_t devfn,
>> +                            IOMMUFDDevice *idev, Error **errp);
>> +    /**
>> +     * @unset_iommu_device: unset iommufd device for a PCI device from
>vIOMMU
>> +     *
>> +     * Optional callback.
>> +     *
>> +     * @bus: the #PCIBus of the PCI device.
>> +     *
>> +     * @opaque: the data passed to pci_setup_iommu().
>> +     *
>> +     * @devfn: device and function number of the PCI device.
>> +     */
>> +    void (*unset_iommu_device)(PCIBus *bus, void *opaque, int32_t
>devfn);
>>  } PCIIOMMUOps;
>>
>>  AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
>> +int pci_device_set_iommu_device(PCIDevice *dev, IOMMUFDDevice
>*idev,
>> +                                Error **errp);
>> +void pci_device_unset_iommu_device(PCIDevice *dev);
>>
>>  /**
>>   * pci_setup_iommu: Initialize specific IOMMU handlers for a PCIBus
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index 76080af580..3848662f95 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -2672,7 +2672,10 @@ static void
>pci_device_class_base_init(ObjectClass *klass, void *data)
>>      }
>>  }
>>
>> -AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>> +static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
>> +                                           PCIBus **aliased_pbus,
>> +                                           PCIBus **piommu_bus,
>> +                                           uint8_t *aliased_pdevfn)
>nit: I would drop the p in aliased_pbus and aliased_pdevfn. Maybe you
>should allow the caller to pass NULL for aliased_pbus and aliased_pdevfn,
>as is the case for pci_device_set_iommu_device(). I may reuse that
>helper in [RFC 2/7] hw/pci: Introduce pci_device_iommu_bus

Good suggestion, will do.

Thanks
Zhenzhong

>>  {
>>      PCIBus *bus = pci_get_bus(dev);
>>      PCIBus *iommu_bus = bus;
>> @@ -2717,6 +2720,18 @@ AddressSpace
>*pci_device_iommu_address_space(PCIDevice *dev)
>>
>>          iommu_bus = parent_bus;
>>      }
>> +    *aliased_pbus = bus;
>> +    *piommu_bus = iommu_bus;
>> +    *aliased_pdevfn = devfn;
>> +}
>> +
>> +AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>> +{
>> +    PCIBus *bus;
>> +    PCIBus *iommu_bus;
>> +    uint8_t devfn;
>> +
>> +    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
>>      if (!pci_bus_bypass_iommu(bus) && iommu_bus->iommu_ops) {
>>          return iommu_bus->iommu_ops->get_address_space(bus,
>>                                   iommu_bus->iommu_opaque, devfn);
>> @@ -2724,6 +2739,38 @@ AddressSpace
>*pci_device_iommu_address_space(PCIDevice *dev)
>>      return &address_space_memory;
>>  }
>>
>> +int pci_device_set_iommu_device(PCIDevice *dev, IOMMUFDDevice
>*idev,
>> +                                Error **errp)
>> +{
>> +    PCIBus *bus;
>> +    PCIBus *iommu_bus;
>> +    uint8_t devfn;
>> +
>> +    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
>> +    if (!pci_bus_bypass_iommu(bus) && iommu_bus &&
>> +        iommu_bus->iommu_ops && iommu_bus->iommu_ops-
>>set_iommu_device) {
>> +        return iommu_bus->iommu_ops-
>>set_iommu_device(pci_get_bus(dev),
>> +                                                      iommu_bus->iommu_opaque,
>> +                                                      dev->devfn, idev, errp);
>> +    }
>> +    return 0;
>> +}
>> +
>> +void pci_device_unset_iommu_device(PCIDevice *dev)
>> +{
>> +    PCIBus *bus;
>> +    PCIBus *iommu_bus;
>> +    uint8_t devfn;
>> +
>> +    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
>> +    if (!pci_bus_bypass_iommu(bus) && iommu_bus &&
>> +        iommu_bus->iommu_ops && iommu_bus->iommu_ops-
>>unset_iommu_device) {
>> +        return iommu_bus->iommu_ops-
>>unset_iommu_device(pci_get_bus(dev),
>> +                                                        iommu_bus->iommu_opaque,
>> +                                                        dev->devfn);
>> +    }
>> +}
>> +
>>  void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *ops, void
>*opaque)
>>  {
>>      /*
>Thanks
>
>Eric
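
A minimal sketch of the NULL-tolerant out-parameter pattern suggested above.
It is not the actual helper: PCIBus is a stand-in type, the bus walk is left
out, and store_iommu_bus_devfn() is a hypothetical name that only shows how
each output can be made optional.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for QEMU's PCIBus; the real bus walk is out of scope here. */
typedef struct PCIBus {
    int id;
} PCIBus;

/*
 * Hypothetical final step of the helper: every out-parameter is optional,
 * so a caller that only needs iommu_bus can pass NULL for the others.
 * The already-resolved values are passed in to keep the sketch small.
 */
static inline void store_iommu_bus_devfn(PCIBus *resolved_bus,
                                         PCIBus *resolved_iommu_bus,
                                         uint8_t resolved_devfn,
                                         PCIBus **aliased_bus,
                                         PCIBus **iommu_bus,
                                         uint8_t *aliased_devfn)
{
    if (aliased_bus) {
        *aliased_bus = resolved_bus;
    }
    if (iommu_bus) {
        *iommu_bus = resolved_iommu_bus;
    }
    if (aliased_devfn) {
        *aliased_devfn = resolved_devfn;
    }
}
```

A caller such as pci_device_set_iommu_device(), which only consults the
IOMMU bus, would then pass NULL for the aliased bus and devfn.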


^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [PATCH rfcv1 4/6] vfio: initialize IOMMUFDDevice and pass to vIOMMU
  2024-01-17 15:37   ` Joao Martins
@ 2024-01-18  8:17     ` Duan, Zhenzhong
  2024-01-18 10:17       ` Yi Liu
  0 siblings, 1 reply; 46+ messages in thread
From: Duan, Zhenzhong @ 2024-01-18  8:17 UTC (permalink / raw)
  To: Joao Martins
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P,
	Yi Sun, qemu-devel



>-----Original Message-----
>From: Joao Martins <joao.m.martins@oracle.com>
>Subject: Re: [PATCH rfcv1 4/6] vfio: initialize IOMMUFDDevice and pass to
>vIOMMU
>
>On 15/01/2024 10:13, Zhenzhong Duan wrote:
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index 9bfddc1360..cbd035f148 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -309,6 +309,7 @@ static int iommufd_cdev_attach(const char *name,
>VFIODevice *vbasedev,
>>      VFIOContainerBase *bcontainer;
>>      VFIOIOMMUFDContainer *container;
>>      VFIOAddressSpace *space;
>> +    IOMMUFDDevice *idev = &vbasedev->idev;
>>      struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>>      int ret, devfd;
>>      uint32_t ioas_id;
>> @@ -428,6 +429,7 @@ found_container:
>>      QLIST_INSERT_HEAD(&bcontainer->device_list, vbasedev,
>container_next);
>>      QLIST_INSERT_HEAD(&vfio_device_list, vbasedev, global_next);
>>
>> +    iommufd_device_init(idev, sizeof(*idev), container->be, vbasedev-
>>devid);
>>      trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev-
>>num_irqs,
>>                                     vbasedev->num_regions, vbasedev->flags);
>>      return 0;
>
>In the dirty tracking series, I'll need to fetch out_capabilities from device
>and do a bunch of stuff that is used when allocating hwpt to ask for dirty
>tracking. And this means having iommufd_device_init() be called before we
>call
>iommufd_cdev_attach_container().
>
>Here's what it looks based on an earlier version of your patch:
>
>https://github.com/jpemartins/qemu/commit/433f97a05e0cdd8e3b8563a
>a20e4f22d107219b5
>
>I can move the call earlier in my series, unless there's something specifically
>when you call it here?

I think it's safe to move it earlier; just remember to do the same for the
existing-container case.

Thanks
Zhenzhong


* RE: [PATCH rfcv1 3/6] intel_iommu: add set/unset_iommu_device callback
  2024-01-17 15:44   ` Eric Auger
@ 2024-01-18  8:43     ` Duan, Zhenzhong
  2024-01-18 12:34       ` Eric Auger
  0 siblings, 1 reply; 46+ messages in thread
From: Duan, Zhenzhong @ 2024-01-18  8:43 UTC (permalink / raw)
  To: eric.auger, qemu-devel
  Cc: alex.williamson, clg, peterx, jasowang, mst, jgg, nicolinc,
	joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P,
	Yi Sun, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH rfcv1 3/6] intel_iommu: add set/unset_iommu_device
>callback
>
>Hi Zhenzhong,
>
>On 1/15/24 11:13, Zhenzhong Duan wrote:
>> From: Yi Liu <yi.l.liu@intel.com>
>>
>> This adds set/unset_iommu_device() implementation in Intel vIOMMU.
>> In set call, IOMMUFDDevice is recorded in hash table indexed by
>> PCI BDF.
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  include/hw/i386/intel_iommu.h | 10 +++++
>>  hw/i386/intel_iommu.c         | 79
>+++++++++++++++++++++++++++++++++++
>>  2 files changed, 89 insertions(+)
>>
>> diff --git a/include/hw/i386/intel_iommu.h
>b/include/hw/i386/intel_iommu.h
>> index 7fa0a695c8..c65fdde56f 100644
>> --- a/include/hw/i386/intel_iommu.h
>> +++ b/include/hw/i386/intel_iommu.h
>> @@ -62,6 +62,7 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
>>  typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
>>  typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
>>  typedef struct VTDPASIDEntry VTDPASIDEntry;
>> +typedef struct VTDIOMMUFDDevice VTDIOMMUFDDevice;
>>
>>  /* Context-Entry */
>>  struct VTDContextEntry {
>> @@ -148,6 +149,13 @@ struct VTDAddressSpace {
>>      IOVATree *iova_tree;
>>  };
>>
>> +struct VTDIOMMUFDDevice {
>> +    PCIBus *bus;
>> +    uint8_t devfn;
>> +    IOMMUFDDevice *idev;
>> +    IntelIOMMUState *iommu_state;
>> +};
>> +
>Just wondering whether we shouldn't reuse the VTDAddressSpace to store
>the idev, if any. How have you made your choice. What will it become
>when PASID gets added?

VTDAddressSpace is indexed by the aliased BDF, but VTDIOMMUFDDevice is indexed
by the device's own BDF. So we can't just store a single VTDIOMMUFDDevice
pointer in VTDAddressSpace; we would need a list in case more than one device
shares the same address space. Given that, a global VTDIOMMUFDDevice list (the
hash table in this patch) is better for lookup.
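
To illustrate the indexing point, here is a minimal standalone sketch of a
per-device table keyed by (bus pointer, devfn). It is not the QEMU code: the
patch uses a GHashTable, while this uses a fixed array to stay self-contained,
and every name in it (struct bus, idev_table, idev_set, idev_lookup,
idev_unset) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for PCIBus; only its address is used as a key. */
struct bus {
    int id;
};

struct idev_entry {
    const struct bus *bus;  /* key part 1: bus pointer */
    uint8_t devfn;          /* key part 2: the device's own devfn */
    void *idev;             /* payload (an IOMMUFDDevice in the patch) */
    int used;
};

#define MAX_IDEV 16         /* arbitrary capacity for this sketch */

struct idev_table {
    struct idev_entry slot[MAX_IDEV];
};

/* Find the entry for exactly this device, or NULL if none is registered. */
static struct idev_entry *idev_lookup(struct idev_table *t,
                                      const struct bus *bus, uint8_t devfn)
{
    assert(t != NULL);
    for (size_t i = 0; i < MAX_IDEV; i++) {
        struct idev_entry *e = &t->slot[i];
        if (e->used && e->bus == bus && e->devfn == devfn) {
            return e;
        }
    }
    return NULL;
}

/* Register a device; returns 0 on success, -1 if present or table full. */
static int idev_set(struct idev_table *t, const struct bus *bus,
                    uint8_t devfn, void *idev)
{
    if (idev_lookup(t, bus, devfn)) {
        return -1;
    }
    for (size_t i = 0; i < MAX_IDEV; i++) {
        struct idev_entry *e = &t->slot[i];
        if (!e->used) {
            e->bus = bus;
            e->devfn = devfn;
            e->idev = idev;
            e->used = 1;
            return 0;
        }
    }
    return -1;
}

/* Drop the entry for this device, if any. */
static void idev_unset(struct idev_table *t, const struct bus *bus,
                       uint8_t devfn)
{
    struct idev_entry *e = idev_lookup(t, bus, devfn);
    if (e) {
        e->used = 0;
    }
}
```

Two functions on the same bus get two separate entries here, even if they
would resolve to one aliased BDF (and so one address space) on the
VTDAddressSpace side.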

For PASID in modern mode, which supports stage-1 page tables, we have
VTDPASIDAddressSpace indexed by the device's BDF+PASID. We don't use
VTDAddressSpace there, since it is for stage-2 page tables.

Thanks
Zhenzhong

>>  struct VTDIOTLBEntry {
>>      uint64_t gfn;
>>      uint16_t domain_id;
>> @@ -292,6 +300,8 @@ struct IntelIOMMUState {
>>      /* list of registered notifiers */
>>      QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
>>
>> +    GHashTable *vtd_iommufd_dev;             /* VTDIOMMUFDDevice */
>> +
>>      /* interrupt remapping */
>>      bool intr_enabled;              /* Whether guest enabled IR */
>>      dma_addr_t intr_root;           /* Interrupt remapping table pointer */
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index ed5677c0ae..95faf697eb 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -237,6 +237,13 @@ static gboolean vtd_as_equal(gconstpointer v1,
>gconstpointer v2)
>>             (key1->pasid == key2->pasid);
>>  }
>>
>> +static gboolean vtd_as_idev_equal(gconstpointer v1, gconstpointer v2)
>> +{
>> +    const struct vtd_as_key *key1 = v1;
>> +    const struct vtd_as_key *key2 = v2;
>> +
>> +    return (key1->bus == key2->bus) && (key1->devfn == key2->devfn);
>> +}
>>  /*
>>   * Note that we use pointer to PCIBus as the key, so hashing/shifting
>>   * based on the pointer value is intended. Note that we deal with
>> @@ -3812,6 +3819,74 @@ VTDAddressSpace
>*vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
>>      return vtd_dev_as;
>>  }
>>
>> +static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque,
>int32_t devfn,
>> +                                    IOMMUFDDevice *idev, Error **errp)
>> +{
>> +    IntelIOMMUState *s = opaque;
>> +    VTDIOMMUFDDevice *vtd_idev;
>> +    struct vtd_as_key key = {
>> +        .bus = bus,
>> +        .devfn = devfn,
>> +    };
>> +    struct vtd_as_key *new_key;
>> +
>> +    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
>> +
>> +    /* None IOMMUFD case */
>> +    if (!idev) {
>> +        return 0;
>> +    }
>> +
>> +    vtd_iommu_lock(s);
>> +
>> +    vtd_idev = g_hash_table_lookup(s->vtd_iommufd_dev, &key);
>> +
>> +    if (vtd_idev) {
>> +        error_setg(errp, "IOMMUFD device already exist");
>> +        return -1;
>> +    }
>> +
>> +    new_key = g_malloc(sizeof(*new_key));
>> +    new_key->bus = bus;
>> +    new_key->devfn = devfn;
>> +
>> +    vtd_idev = g_malloc0(sizeof(VTDIOMMUFDDevice));
>> +    vtd_idev->bus = bus;
>> +    vtd_idev->devfn = (uint8_t)devfn;
>> +    vtd_idev->iommu_state = s;
>> +    vtd_idev->idev = idev;
>> +
>> +    g_hash_table_insert(s->vtd_iommufd_dev, new_key, vtd_idev);
>> +
>> +    vtd_iommu_unlock(s);
>> +
>> +    return 0;
>> +}
>> +
>> +static void vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque,
>int32_t devfn)
>> +{
>> +    IntelIOMMUState *s = opaque;
>> +    VTDIOMMUFDDevice *vtd_idev;
>> +    struct vtd_as_key key = {
>> +        .bus = bus,
>> +        .devfn = devfn,
>> +    };
>> +
>> +    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
>> +
>> +    vtd_iommu_lock(s);
>> +
>> +    vtd_idev = g_hash_table_lookup(s->vtd_iommufd_dev, &key);
>> +    if (!vtd_idev) {
>> +        vtd_iommu_unlock(s);
>> +        return;
>> +    }
>> +
>> +    g_hash_table_remove(s->vtd_iommufd_dev, &key);
>> +
>> +    vtd_iommu_unlock(s);
>> +}
>> +
>>  /* Unmap the whole range in the notifier's scope. */
>>  static void vtd_address_space_unmap(VTDAddressSpace *as,
>IOMMUNotifier *n)
>>  {
>> @@ -4107,6 +4182,8 @@ static AddressSpace
>*vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>>
>>  static PCIIOMMUOps vtd_iommu_ops = {
>>      .get_address_space = vtd_host_dma_iommu,
>> +    .set_iommu_device = vtd_dev_set_iommu_device,
>> +    .unset_iommu_device = vtd_dev_unset_iommu_device,
>>  };
>>
>>  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
>> @@ -4230,6 +4307,8 @@ static void vtd_realize(DeviceState *dev, Error
>**errp)
>>                                       g_free, g_free);
>>      s->vtd_address_spaces = g_hash_table_new_full(vtd_as_hash,
>vtd_as_equal,
>>                                        g_free, g_free);
>> +    s->vtd_iommufd_dev = g_hash_table_new_full(vtd_as_hash,
>vtd_as_idev_equal,
>> +                                               g_free, g_free);
>>      vtd_init(s);
>>      pci_setup_iommu(bus, &vtd_iommu_ops, dev);
>>      /* Pseudo address space under root PCI bus. */
>Thanks
>
>Eric



* RE: [PATCH rfcv1 4/6] vfio: initialize IOMMUFDDevice and pass to vIOMMU
  2024-01-17 17:30   ` Eric Auger
@ 2024-01-18  9:23     ` Duan, Zhenzhong
  0 siblings, 0 replies; 46+ messages in thread
From: Duan, Zhenzhong @ 2024-01-18  9:23 UTC (permalink / raw)
  To: eric.auger, qemu-devel
  Cc: alex.williamson, clg, peterx, jasowang, mst, jgg, nicolinc,
	joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P,
	Yi Sun



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH rfcv1 4/6] vfio: initialize IOMMUFDDevice and pass to
>vIOMMU
>
>Hi Zhenzhong,
>
>On 1/15/24 11:13, Zhenzhong Duan wrote:
>> Initialize IOMMUFDDevice in vfio and pass to vIOMMU, so that vIOMMU
>> could get hw IOMMU information.
>>
>> In VFIO legacy backend mode, we still pass a NULL IOMMUFDDevice to
>vIOMMU,
>> in case vIOMMU needs some processing for VFIO legacy backend mode.
>>
>> Originally-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  include/hw/vfio/vfio-common.h |  2 ++
>>  hw/vfio/iommufd.c             |  2 ++
>>  hw/vfio/pci.c                 | 24 +++++++++++++++++++-----
>>  3 files changed, 23 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>common.h
>> index 9b7ef7d02b..fde0d0ca60 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -31,6 +31,7 @@
>>  #endif
>>  #include "sysemu/sysemu.h"
>>  #include "hw/vfio/vfio-container-base.h"
>> +#include "sysemu/iommufd_device.h"
>>
>>  #define VFIO_MSG_PREFIX "vfio %s: "
>>
>> @@ -126,6 +127,7 @@ typedef struct VFIODevice {
>>      bool dirty_tracking;
>>      int devid;
>>      IOMMUFDBackend *iommufd;
>> +    IOMMUFDDevice idev;
>This looks duplicate of existing fields:
>idev.dev_id is same as above devid. by the way let's try to use the same
>devid everywhere.
>idev.iommufd is same as above iommufd if != NULL.
>So we should at least rationalize.

Indeed, I'll remove devid and *iommufd. Thanks for the suggestion.

>>  } VFIODevice;
>>
>>  struct VFIODeviceOps {
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index 9bfddc1360..cbd035f148 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -309,6 +309,7 @@ static int iommufd_cdev_attach(const char *name,
>VFIODevice *vbasedev,
>>      VFIOContainerBase *bcontainer;
>>      VFIOIOMMUFDContainer *container;
>>      VFIOAddressSpace *space;
>> +    IOMMUFDDevice *idev = &vbasedev->idev;
>>      struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>>      int ret, devfd;
>>      uint32_t ioas_id;
>> @@ -428,6 +429,7 @@ found_container:
>>      QLIST_INSERT_HEAD(&bcontainer->device_list, vbasedev,
>container_next);
>>      QLIST_INSERT_HEAD(&vfio_device_list, vbasedev, global_next);
>>
>> +    iommufd_device_init(idev, sizeof(*idev), container->be, vbasedev-
>>devid);
>>      trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev-
>>num_irqs,
>>                                     vbasedev->num_regions, vbasedev->flags);
>>      return 0;
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index d7fe06715c..2c3a5d267b 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -3107,11 +3107,21 @@ static void vfio_realize(PCIDevice *pdev,
>Error **errp)
>>
>>      vfio_bars_register(vdev);
>>
>> -    ret = vfio_add_capabilities(vdev, errp);
>> +    if (vbasedev->iommufd) {
>> +        ret = pci_device_set_iommu_device(pdev, &vbasedev->idev, errp);
>> +    } else {
>> +        ret = pci_device_set_iommu_device(pdev, 0, errp);
>> +    }
>>      if (ret) {
>> +        error_prepend(errp, "Failed to set iommu_device: ");
>at the moment it is rather an IOMMUFD device.

Will use "Failed to set IOMMUFD device: "

Thanks
Zhenzhong

>>          goto out_teardown;
>>      }
>>
>> +    ret = vfio_add_capabilities(vdev, errp);
>> +    if (ret) {
>> +        goto out_unset_idev;
>> +    }
>> +
>>      if (vdev->vga) {
>>          vfio_vga_quirk_setup(vdev);
>>      }
>> @@ -3128,7 +3138,7 @@ static void vfio_realize(PCIDevice *pdev, Error
>**errp)
>>              error_setg(errp,
>>                         "cannot support IGD OpRegion feature on hotplugged "
>>                         "device");
>> -            goto out_teardown;
>> +            goto out_unset_idev;
>>          }
>>
>>          ret = vfio_get_dev_region_info(vbasedev,
>> @@ -3137,13 +3147,13 @@ static void vfio_realize(PCIDevice *pdev,
>Error **errp)
>>          if (ret) {
>>              error_setg_errno(errp, -ret,
>>                               "does not support requested IGD OpRegion feature");
>> -            goto out_teardown;
>> +            goto out_unset_idev;
>>          }
>>
>>          ret = vfio_pci_igd_opregion_init(vdev, opregion, errp);
>>          g_free(opregion);
>>          if (ret) {
>> -            goto out_teardown;
>> +            goto out_unset_idev;
>>          }
>>      }
>>
>> @@ -3229,6 +3239,8 @@ out_deregister:
>>      if (vdev->intx.mmap_timer) {
>>          timer_free(vdev->intx.mmap_timer);
>>      }
>> +out_unset_idev:
>> +    pci_device_unset_iommu_device(pdev);
>>  out_teardown:
>>      vfio_teardown_msi(vdev);
>>      vfio_bars_exit(vdev);
>> @@ -3257,6 +3269,7 @@ static void vfio_instance_finalize(Object *obj)
>>  static void vfio_exitfn(PCIDevice *pdev)
>>  {
>>      VFIOPCIDevice *vdev = VFIO_PCI(pdev);
>> +    VFIODevice *vbasedev = &vdev->vbasedev;
>>
>>      vfio_unregister_req_notifier(vdev);
>>      vfio_unregister_err_notifier(vdev);
>> @@ -3271,7 +3284,8 @@ static void vfio_exitfn(PCIDevice *pdev)
>>      vfio_teardown_msi(vdev);
>>      vfio_pci_disable_rp_atomics(vdev);
>>      vfio_bars_exit(vdev);
>> -    vfio_migration_exit(&vdev->vbasedev);
>> +    vfio_migration_exit(vbasedev);
>> +    pci_device_unset_iommu_device(pdev);
>>  }
>>
>>  static void vfio_pci_reset(DeviceState *dev)
>Thanks
>
>Eric



* RE: [PATCH rfcv1 6/6] intel_iommu: add a framework to check and sync host IOMMU cap/ecap
  2024-01-17 17:56   ` Eric Auger
@ 2024-01-18  9:30     ` Duan, Zhenzhong
  2024-01-18 12:40       ` Eric Auger
  0 siblings, 1 reply; 46+ messages in thread
From: Duan, Zhenzhong @ 2024-01-18  9:30 UTC (permalink / raw)
  To: eric.auger, qemu-devel
  Cc: alex.williamson, clg, peterx, jasowang, mst, jgg, nicolinc,
	joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P,
	Yi Sun, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum

Hi Eric,

>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH rfcv1 6/6] intel_iommu: add a framework to check and
>sync host IOMMU cap/ecap
>
>Hi Zhenzhong,
>
>On 1/15/24 11:13, Zhenzhong Duan wrote:
>> From: Yi Liu <yi.l.liu@intel.com>
>>
>> Add a framework to check and synchronize host IOMMU cap/ecap with
>> vIOMMU cap/ecap.
>>
>> Currently only stage-2 translation is supported which is backed by
>> shadow page table on host side. So we don't need exact matching of
>> each bit of cap/ecap between vIOMMU and host. However, we can still
>> utilize this framework to ensure compatibility of host and vIOMMU's
>> address width at least, i.e., vIOMMU's aw_bits <= host aw_bits,
>> which is missed before.
>>
>> When stage-1 translation is supported in future, a.k.a. scalable
>> modern mode, we need to ensure compatibility of each bits. Some
>> bits are user controllable, they should be checked with host side
>> to ensure compatibility. Other bits are not, they should be synced
>> into vIOMMU cap/ecap for compatibility.
>>
>> The sequence will be:
>>
>> vtd_cap_init() initializes iommu->cap/ecap. ---- vtd_cap_init()
>> iommu->host_cap/ecap is initialized as iommu->cap/ecap.  ---- vtd_init()
>> iommu->host_cap/ecap is checked and updated some bits with host
>cap/ecap. ---- vtd_sync_hw_info()
>> iommu->cap/ecap is finalized as iommu->host_cap/ecap.  ----
>vtd_machine_done_hook()
>>
>> iommu->host_cap/ecap is a temporary storage to hold intermediate value
>> when synthesize host cap/ecap and vIOMMU's initial configured cap/ecap.
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  include/hw/i386/intel_iommu.h |  4 ++
>>  hw/i386/intel_iommu.c         | 78
>+++++++++++++++++++++++++++++++----
>>  2 files changed, 75 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/hw/i386/intel_iommu.h
>b/include/hw/i386/intel_iommu.h
>> index c65fdde56f..b8abbcce12 100644
>> --- a/include/hw/i386/intel_iommu.h
>> +++ b/include/hw/i386/intel_iommu.h
>> @@ -292,6 +292,9 @@ struct IntelIOMMUState {
>>      uint64_t cap;                   /* The value of capability reg */
>>      uint64_t ecap;                  /* The value of extended capability reg */
>>
>> +    uint64_t host_cap;              /* The value of host capability reg */
>> +    uint64_t host_ecap;             /* The value of host ext-capability reg */
>> +
>>      uint32_t context_cache_gen;     /* Should be in [1,MAX] */
>>      GHashTable *iotlb;              /* IOTLB */
>>
>> @@ -314,6 +317,7 @@ struct IntelIOMMUState {
>>      bool dma_translation;           /* Whether DMA translation supported */
>>      bool pasid;                     /* Whether to support PASID */
>>
>> +    bool cap_finalized;             /* Whether VTD capability finalized */
>>      /*
>>       * Protects IOMMU states in general.  Currently it protects the
>>       * per-IOMMU IOTLB cache, and context entry cache in
>VTDAddressSpace.
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 4c1d058ebd..be03fcbf52 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -3819,6 +3819,47 @@ VTDAddressSpace
>*vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
>>      return vtd_dev_as;
>>  }
>>
>> +static bool vtd_sync_hw_info(IntelIOMMUState *s, struct
>iommu_hw_info_vtd *vtd,
>> +                             Error **errp)
>> +{
>> +    uint64_t addr_width;
>> +
>> +    addr_width = (vtd->cap_reg >> 16) & 0x3fULL;
>> +    if (s->aw_bits > addr_width) {
>> +        error_setg(errp, "User aw-bits: %u > host address width: %lu",
>> +                   s->aw_bits, addr_width);
>> +        return false;
>> +    }
>> +
>> +    /* TODO: check and sync host cap/ecap into vIOMMU cap/ecap */
>> +
>> +    return true;
>> +}
>> +
>> +/*
>> + * virtual VT-d which wants nested needs to check the host IOMMU
>> + * nesting cap info behind the assigned devices. Thus that vIOMMU
>> + * could bind guest page table to host.
>> + */
>> +static bool vtd_check_idev(IntelIOMMUState *s, IOMMUFDDevice *idev,
>> +                           Error **errp)
>> +{
>> +    struct iommu_hw_info_vtd vtd;
>> +    enum iommu_hw_info_type type =
>IOMMU_HW_INFO_TYPE_INTEL_VTD;
>> +
>> +    if (iommufd_device_get_info(idev, &type, sizeof(vtd), &vtd)) {
>> +        error_setg(errp, "Failed to get IOMMU capability!!!");
>> +        return false;
>> +    }
>> +
>> +    if (type != IOMMU_HW_INFO_TYPE_INTEL_VTD) {
>> +        error_setg(errp, "IOMMU hardware is not compatible!!!");
>> +        return false;
>> +    }
>> +
>> +    return vtd_sync_hw_info(s, &vtd, errp);
>> +}
>> +
>>  static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int32_t
>devfn,
>>                                      IOMMUFDDevice *idev, Error **errp)
>>  {
>> @@ -3837,6 +3878,10 @@ static int vtd_dev_set_iommu_device(PCIBus
>*bus, void *opaque, int32_t devfn,
>>          return 0;
>>      }
>>
>> +    if (!vtd_check_idev(s, idev, errp)) {
>In
>[RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for
>hotplugged devices
>https://lore.kernel.org/all/20240117080414.316890-1-eric.auger@redhat.com/
>
>I also attempt to pass host iommu info to the virtio-iommu but with
>legacy BE.

I think your patch works with the iommufd BE too 😊, because the iommufd BE
also fills bcontainer->iova_ranges in iommufd_cdev_get_info_iova_range().

> In my case I want to pass the reserved memory regions which
>also model the aw.
>So this is a pretty similar use case.

Yes.

>
>Why don't we pass the pointer to an opaque iommu_hw_info instead,
>through the PCIIOMMUOps?

Passing iommu_hw_info is OK for this series, but we want more from
IOMMUFDDevice in the nesting series, e.g., allocating/freeing ioas/hwpt,
attaching to/detaching from a hwpt, getting the dirty bitmap, etc. It's more
flexible to let the vIOMMU fetch what it wants itself.

>
>
>
>> +        return -1;
>> +    }
>> +
>>      vtd_iommu_lock(s);
>>
>>      vtd_idev = g_hash_table_lookup(s->vtd_iommufd_dev, &key);
>> @@ -4071,10 +4116,11 @@ static void vtd_init(IntelIOMMUState *s)
>>  {
>>      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
>>
>> -    memset(s->csr, 0, DMAR_REG_SIZE);
>> -    memset(s->wmask, 0, DMAR_REG_SIZE);
>> -    memset(s->w1cmask, 0, DMAR_REG_SIZE);
>> -    memset(s->womask, 0, DMAR_REG_SIZE);
>> +    /* CAP/ECAP are initialized in machine create done stage */
>> +    memset(s->csr + DMAR_GCMD_REG, 0, DMAR_REG_SIZE -
>DMAR_GCMD_REG);
>> +    memset(s->wmask + DMAR_GCMD_REG, 0, DMAR_REG_SIZE -
>DMAR_GCMD_REG);
>> +    memset(s->w1cmask + DMAR_GCMD_REG, 0, DMAR_REG_SIZE -
>DMAR_GCMD_REG);
>> +    memset(s->womask + DMAR_GCMD_REG, 0, DMAR_REG_SIZE -
>DMAR_GCMD_REG);
>This change is not documented in the commit msg.
>Sorry I don't get why this is needed?

I'll document it. The one-line comment above explains when cap/ecap are
initialized. vtd_init() is called both at QEMU init and on guest reset. Once
cap/ecap are finalized during QEMU init, we don't want them to change, so
vtd_init() skips the CAP/ECAP registers here.
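
A minimal standalone sketch of that reset behaviour, for illustration only.
The register offsets follow the VT-d layout (CAP at 0x08, ECAP at 0x10, GCMD
at 0x18); the struct, the register-file size and the helper names
(struct viommu, viommu_reset, viommu_finalize, reg_read64, reg_write64) are
hypothetical and are not the QEMU definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    REG_CAP  = 0x08,   /* capability register */
    REG_ECAP = 0x10,   /* extended capability register */
    REG_GCMD = 0x18,   /* first register that reset is allowed to clear */
    REG_SIZE = 0x230,  /* illustrative register-file size */
};

struct viommu {
    uint8_t csr[REG_SIZE];
    uint64_t cap;
    uint64_t ecap;
    bool cap_finalized;
};

static void reg_write64(struct viommu *s, size_t off, uint64_t val)
{
    assert(off + sizeof(val) <= REG_SIZE);
    memcpy(s->csr + off, &val, sizeof(val));
}

static uint64_t reg_read64(const struct viommu *s, size_t off)
{
    uint64_t val;

    assert(off + sizeof(val) <= REG_SIZE);
    memcpy(&val, s->csr + off, sizeof(val));
    return val;
}

/* Runs at init and on every guest reset. */
static void viommu_reset(struct viommu *s, uint64_t default_cap,
                         uint64_t default_ecap)
{
    /* Clear from GCMD onward only, so CAP/ECAP bytes are never touched. */
    memset(s->csr + REG_GCMD, 0, REG_SIZE - REG_GCMD);

    /* Defaults are (re)computed only until the values are finalized. */
    if (!s->cap_finalized) {
        s->cap = default_cap;
        s->ecap = default_ecap;
    }
}

/* Runs once, after all devices have had a chance to adjust the values. */
static void viommu_finalize(struct viommu *s, uint64_t cap, uint64_t ecap)
{
    s->cap = cap;
    s->ecap = ecap;
    reg_write64(s, REG_CAP, s->cap);
    reg_write64(s, REG_ECAP, s->ecap);
    s->cap_finalized = true;
}
```

After viommu_finalize(), any number of viommu_reset() calls leave the CAP and
ECAP register contents as they were, while GCMD and everything after it is
cleared again.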

>>
>>      s->root = 0;
>>      s->root_scalable = false;
>> @@ -4110,13 +4156,16 @@ static void vtd_init(IntelIOMMUState *s)
>>          vtd_spte_rsvd_large[3] &= ~VTD_SPTE_SNP;
>>      }
>>
>> -    vtd_cap_init(s);
>> +    if (!s->cap_finalized) {
>> +        vtd_cap_init(s);
>> +        s->host_cap = s->cap;
>> +        s->host_ecap = s->ecap;
>> +    }
>> +
>>      vtd_reset_caches(s);
>>
>>      /* Define registers with default values and bit semantics */
>>      vtd_define_long(s, DMAR_VER_REG, 0x10UL, 0, 0);
>> -    vtd_define_quad(s, DMAR_CAP_REG, s->cap, 0, 0);
>> -    vtd_define_quad(s, DMAR_ECAP_REG, s->ecap, 0, 0);
>>      vtd_define_long(s, DMAR_GCMD_REG, 0, 0xff800000UL, 0);
>>      vtd_define_long_wo(s, DMAR_GCMD_REG, 0xff800000UL);
>>      vtd_define_long(s, DMAR_GSTS_REG, 0, 0, 0);
>> @@ -4241,6 +4290,12 @@ static bool
>vtd_decide_config(IntelIOMMUState *s, Error **errp)
>>      return true;
>>  }
>>
>> +static void vtd_setup_capability_reg(IntelIOMMUState *s)
>> +{
>> +    vtd_define_quad(s, DMAR_CAP_REG, s->cap, 0, 0);
>> +    vtd_define_quad(s, DMAR_ECAP_REG, s->ecap, 0, 0);
>> +}
>> +
>>  static int vtd_machine_done_notify_one(Object *child, void *unused)
>>  {
>>      IntelIOMMUState *iommu =
>INTEL_IOMMU_DEVICE(x86_iommu_get_default());
>> @@ -4259,6 +4314,14 @@ static int
>vtd_machine_done_notify_one(Object *child, void *unused)
>>
>>  static void vtd_machine_done_hook(Notifier *notifier, void *unused)
>>  {
>> +    IntelIOMMUState *iommu =
>INTEL_IOMMU_DEVICE(x86_iommu_get_default());
>> +
>> +    iommu->cap = iommu->host_cap;
>> +    iommu->ecap = iommu->host_ecap;
>> +    iommu->cap_finalized = true;
>I don't think you can change the defaults like this without taking care
>of compats (migration).

Would bumping the .version_id of vtd_vmstate work?

Thanks
Zhenzhong

>
>Thanks
>
>Eric
>> +
>> +    vtd_setup_capability_reg(iommu);
>> +
>>      object_child_foreach_recursive(object_get_root(),
>>                                     vtd_machine_done_notify_one, NULL);
>>  }
>> @@ -4292,6 +4355,7 @@ static void vtd_realize(DeviceState *dev, Error
>**errp)
>>
>>      QLIST_INIT(&s->vtd_as_with_notifiers);
>>      qemu_mutex_init(&s->iommu_lock);
>> +    s->cap_finalized = false;
>>      memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
>>                            "intel_iommu", DMAR_REG_SIZE);
>>      memory_region_add_subregion(get_system_memory(),



* Re: [PATCH rfcv1 4/6] vfio: initialize IOMMUFDDevice and pass to vIOMMU
  2024-01-18  8:17     ` Duan, Zhenzhong
@ 2024-01-18 10:17       ` Yi Liu
  2024-01-18 10:20         ` Joao Martins
  0 siblings, 1 reply; 46+ messages in thread
From: Yi Liu @ 2024-01-18 10:17 UTC (permalink / raw)
  To: Duan, Zhenzhong, Joao Martins
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, Tian, Kevin, Sun, Yi Y, Peng, Chao P, Yi Sun,
	qemu-devel

On 2024/1/18 16:17, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Subject: Re: [PATCH rfcv1 4/6] vfio: initialize IOMMUFDDevice and pass to
>> vIOMMU
>>
>> On 15/01/2024 10:13, Zhenzhong Duan wrote:
>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>> index 9bfddc1360..cbd035f148 100644
>>> --- a/hw/vfio/iommufd.c
>>> +++ b/hw/vfio/iommufd.c
>>> @@ -309,6 +309,7 @@ static int iommufd_cdev_attach(const char *name,
>> VFIODevice *vbasedev,
>>>       VFIOContainerBase *bcontainer;
>>>       VFIOIOMMUFDContainer *container;
>>>       VFIOAddressSpace *space;
>>> +    IOMMUFDDevice *idev = &vbasedev->idev;
>>>       struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>>>       int ret, devfd;
>>>       uint32_t ioas_id;
>>> @@ -428,6 +429,7 @@ found_container:
>>>       QLIST_INSERT_HEAD(&bcontainer->device_list, vbasedev,
>> container_next);
>>>       QLIST_INSERT_HEAD(&vfio_device_list, vbasedev, global_next);
>>>
>>> +    iommufd_device_init(idev, sizeof(*idev), container->be, vbasedev-
>>> devid);
>>>       trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev-
>>> num_irqs,
>>>                                      vbasedev->num_regions, vbasedev->flags);
>>>       return 0;
>>
>> In the dirty tracking series, I'll need to fetch out_capabilities from device
>> and do a bunch of stuff that is used when allocating hwpt to ask for dirty
>> tracking. And this means having iommufd_device_init() be called before we
>> call
>> iommufd_cdev_attach_container().
>>
>> Here's what it looks based on an earlier version of your patch:
>>
>> https://github.com/jpemartins/qemu/commit/433f97a05e0cdd8e3b8563aa20e4f22d107219b5
>>
>> I can move the call earlier in my series, unless there's something specifically
>> when you call it here?
> 
> I think it's safe to move it earlier, just remember to do the same for existing
> container.

Yes, as long as the inputs of iommufd_device_init() are available in the new
place. And remember to destroy it if the code fails after initializing the
iommufd_device.
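
A minimal sketch of the unwinding meant here, assuming the init step ever
acquires something that has to be released. All names (struct dev_state,
idev_init, idev_destroy, attach_container, device_attach) are hypothetical
stand-ins, not the QEMU functions, and whether a real destroy step is needed
depends on what the init actually allocates.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-device state; flags stand in for real resources. */
struct dev_state {
    bool idev_ready;
    bool attached;
};

static void idev_init(struct dev_state *d)
{
    d->idev_ready = true;
}

static void idev_destroy(struct dev_state *d)
{
    d->idev_ready = false;
}

/* Stand-in for the later attach step; 'fail' injects an error. */
static int attach_container(struct dev_state *d, bool fail)
{
    if (fail) {
        return -1;
    }
    d->attached = true;
    return 0;
}

/* Init first, attach second, and undo the init if the attach fails. */
static int device_attach(struct dev_state *d, bool inject_failure)
{
    int ret;

    idev_init(d);

    ret = attach_container(d, inject_failure);
    if (ret) {
        goto err_destroy_idev;
    }
    return 0;

err_destroy_idev:
    idev_destroy(d);
    return ret;
}
```

If the init only records a few values and allocates nothing, the
err_destroy_idev label can simply go away.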

-- 
Regards,
Yi Liu



* Re: [PATCH rfcv1 4/6] vfio: initialize IOMMUFDDevice and pass to vIOMMU
  2024-01-18 10:17       ` Yi Liu
@ 2024-01-18 10:20         ` Joao Martins
  0 siblings, 0 replies; 46+ messages in thread
From: Joao Martins @ 2024-01-18 10:20 UTC (permalink / raw)
  To: Yi Liu, Duan, Zhenzhong
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, Tian, Kevin, Sun, Yi Y, Peng, Chao P, Yi Sun,
	qemu-devel

On 18/01/2024 10:17, Yi Liu wrote:
> On 2024/1/18 16:17, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Joao Martins <joao.m.martins@oracle.com>
>>> Subject: Re: [PATCH rfcv1 4/6] vfio: initialize IOMMUFDDevice and pass to
>>> vIOMMU
>>>
>>> On 15/01/2024 10:13, Zhenzhong Duan wrote:
>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>>> index 9bfddc1360..cbd035f148 100644
>>>> --- a/hw/vfio/iommufd.c
>>>> +++ b/hw/vfio/iommufd.c
>>>> @@ -309,6 +309,7 @@ static int iommufd_cdev_attach(const char *name,
>>> VFIODevice *vbasedev,
>>>>       VFIOContainerBase *bcontainer;
>>>>       VFIOIOMMUFDContainer *container;
>>>>       VFIOAddressSpace *space;
>>>> +    IOMMUFDDevice *idev = &vbasedev->idev;
>>>>       struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>>>>       int ret, devfd;
>>>>       uint32_t ioas_id;
>>>> @@ -428,6 +429,7 @@ found_container:
>>>>       QLIST_INSERT_HEAD(&bcontainer->device_list, vbasedev,
>>> container_next);
>>>>       QLIST_INSERT_HEAD(&vfio_device_list, vbasedev, global_next);
>>>>
>>>> +    iommufd_device_init(idev, sizeof(*idev), container->be, vbasedev-
>>>> devid);
>>>>       trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev-
>>>> num_irqs,
>>>>                                      vbasedev->num_regions, vbasedev->flags);
>>>>       return 0;
>>>
>>> In the dirty tracking series, I'll need to fetch out_capabilities from device
>>> and do a bunch of stuff that is used when allocating hwpt to ask for dirty
>>> tracking. And this means having iommufd_device_init() be called before we
>>> call
>>> iommufd_cdev_attach_container().
>>>
>>> Here's what it looks based on an earlier version of your patch:
>>>
>>> https://github.com/jpemartins/qemu/commit/433f97a05e0cdd8e3b8563aa20e4f22d107219b5
>>>
>>> I can move the call earlier in my series, unless there's something specifically
>>> when you call it here?
>>
>> I think it's safe to move it earlier, just remember to do the same for existing
>> container.
> 
> yes, as long as the input of iommufd_device_init() are available in the new
> place. And remember to destroy it if the code failed after initializing
> iommufd_device.
> 
The way I am using it, I don't think any teardown is needed, as no new
resources are allocated. We essentially just initialize @idev and fetch some
capabilities, so nothing needs teardown. But I am not sure what nesting
needs; perhaps the destruction of resources is related to that?



* Re: [PATCH rfcv1 3/6] intel_iommu: add set/unset_iommu_device callback
  2024-01-18  8:43     ` Duan, Zhenzhong
@ 2024-01-18 12:34       ` Eric Auger
  2024-01-19  7:27         ` Duan, Zhenzhong
  0 siblings, 1 reply; 46+ messages in thread
From: Eric Auger @ 2024-01-18 12:34 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel
  Cc: alex.williamson, clg, peterx, jasowang, mst, jgg, nicolinc,
	joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P,
	Yi Sun, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost



On 1/18/24 09:43, Duan, Zhenzhong wrote:
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Subject: Re: [PATCH rfcv1 3/6] intel_iommu: add set/unset_iommu_device
>> callback
>>
>> Hi Zhenzhong,
>>
>> On 1/15/24 11:13, Zhenzhong Duan wrote:
>>> From: Yi Liu <yi.l.liu@intel.com>
>>>
>>> This adds set/unset_iommu_device() implementation in Intel vIOMMU.
>>> In set call, IOMMUFDDevice is recorded in hash table indexed by
>>> PCI BDF.
>>>
>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> ---
>>>  include/hw/i386/intel_iommu.h | 10 +++++
>>>  hw/i386/intel_iommu.c         | 79
>> +++++++++++++++++++++++++++++++++++
>>>  2 files changed, 89 insertions(+)
>>>
>>> diff --git a/include/hw/i386/intel_iommu.h
>> b/include/hw/i386/intel_iommu.h
>>> index 7fa0a695c8..c65fdde56f 100644
>>> --- a/include/hw/i386/intel_iommu.h
>>> +++ b/include/hw/i386/intel_iommu.h
>>> @@ -62,6 +62,7 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
>>>  typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
>>>  typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
>>>  typedef struct VTDPASIDEntry VTDPASIDEntry;
>>> +typedef struct VTDIOMMUFDDevice VTDIOMMUFDDevice;
>>>
>>>  /* Context-Entry */
>>>  struct VTDContextEntry {
>>> @@ -148,6 +149,13 @@ struct VTDAddressSpace {
>>>      IOVATree *iova_tree;
>>>  };
>>>
>>> +struct VTDIOMMUFDDevice {
>>> +    PCIBus *bus;
>>> +    uint8_t devfn;
>>> +    IOMMUFDDevice *idev;
>>> +    IntelIOMMUState *iommu_state;
>>> +};
>>> +
>> Just wondering whether we shouldn't reuse the VTDAddressSpace to store
>> the idev, if any. How have you made your choice. What will it become
>> when PASID gets added?
> VTDAddressSpace is indexed by aliased BDF, but VTDIOMMUFDDevice is indexed
> by device's BDF. So we can't just store VTDIOMMUFDDevice as a pointer in
> VTDAddressSpace, may need a list in case more than one device in same address
> space. Then a global VTDIOMMUFDDevice list is better for lookup.

OK but if several devices are hidden under an aliased BDF, can't they
share the host properties (DMAR ecap/cap)?
>
> For PASID in modern mode which support stage-1 page table, we have
> VTDPASIDAddressSpace indexed by device's BDF+PASID, We didn't use
> VTDAddressSpace which is for stage-2 page table.

OK

Thanks

Eric
>
> Thanks
> Zhenzhong
>
>>>  struct VTDIOTLBEntry {
>>>      uint64_t gfn;
>>>      uint16_t domain_id;
>>> @@ -292,6 +300,8 @@ struct IntelIOMMUState {
>>>      /* list of registered notifiers */
>>>      QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
>>>
>>> +    GHashTable *vtd_iommufd_dev;             /* VTDIOMMUFDDevice */
>>> +
>>>      /* interrupt remapping */
>>>      bool intr_enabled;              /* Whether guest enabled IR */
>>>      dma_addr_t intr_root;           /* Interrupt remapping table pointer */
>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>> index ed5677c0ae..95faf697eb 100644
>>> --- a/hw/i386/intel_iommu.c
>>> +++ b/hw/i386/intel_iommu.c
>>> @@ -237,6 +237,13 @@ static gboolean vtd_as_equal(gconstpointer v1,
>> gconstpointer v2)
>>>             (key1->pasid == key2->pasid);
>>>  }
>>>
>>> +static gboolean vtd_as_idev_equal(gconstpointer v1, gconstpointer v2)
>>> +{
>>> +    const struct vtd_as_key *key1 = v1;
>>> +    const struct vtd_as_key *key2 = v2;
>>> +
>>> +    return (key1->bus == key2->bus) && (key1->devfn == key2->devfn);
>>> +}
>>>  /*
>>>   * Note that we use pointer to PCIBus as the key, so hashing/shifting
>>>   * based on the pointer value is intended. Note that we deal with
>>> @@ -3812,6 +3819,74 @@ VTDAddressSpace
>> *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
>>>      return vtd_dev_as;
>>>  }
>>>
>>> +static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque,
>> int32_t devfn,
>>> +                                    IOMMUFDDevice *idev, Error **errp)
>>> +{
>>> +    IntelIOMMUState *s = opaque;
>>> +    VTDIOMMUFDDevice *vtd_idev;
>>> +    struct vtd_as_key key = {
>>> +        .bus = bus,
>>> +        .devfn = devfn,
>>> +    };
>>> +    struct vtd_as_key *new_key;
>>> +
>>> +    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
>>> +
>>> +    /* None IOMMUFD case */
>>> +    if (!idev) {
>>> +        return 0;
>>> +    }
>>> +
>>> +    vtd_iommu_lock(s);
>>> +
>>> +    vtd_idev = g_hash_table_lookup(s->vtd_iommufd_dev, &key);
>>> +
>>> +    if (vtd_idev) {
>>> +        error_setg(errp, "IOMMUFD device already exist");
>>> +        return -1;
>>> +    }
>>> +
>>> +    new_key = g_malloc(sizeof(*new_key));
>>> +    new_key->bus = bus;
>>> +    new_key->devfn = devfn;
>>> +
>>> +    vtd_idev = g_malloc0(sizeof(VTDIOMMUFDDevice));
>>> +    vtd_idev->bus = bus;
>>> +    vtd_idev->devfn = (uint8_t)devfn;
>>> +    vtd_idev->iommu_state = s;
>>> +    vtd_idev->idev = idev;
>>> +
>>> +    g_hash_table_insert(s->vtd_iommufd_dev, new_key, vtd_idev);
>>> +
>>> +    vtd_iommu_unlock(s);
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static void vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque,
>> int32_t devfn)
>>> +{
>>> +    IntelIOMMUState *s = opaque;
>>> +    VTDIOMMUFDDevice *vtd_idev;
>>> +    struct vtd_as_key key = {
>>> +        .bus = bus,
>>> +        .devfn = devfn,
>>> +    };
>>> +
>>> +    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
>>> +
>>> +    vtd_iommu_lock(s);
>>> +
>>> +    vtd_idev = g_hash_table_lookup(s->vtd_iommufd_dev, &key);
>>> +    if (!vtd_idev) {
>>> +        vtd_iommu_unlock(s);
>>> +        return;
>>> +    }
>>> +
>>> +    g_hash_table_remove(s->vtd_iommufd_dev, &key);
>>> +
>>> +    vtd_iommu_unlock(s);
>>> +}
>>> +
>>>  /* Unmap the whole range in the notifier's scope. */
>>>  static void vtd_address_space_unmap(VTDAddressSpace *as,
>> IOMMUNotifier *n)
>>>  {
>>> @@ -4107,6 +4182,8 @@ static AddressSpace
>> *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>>>  static PCIIOMMUOps vtd_iommu_ops = {
>>>      .get_address_space = vtd_host_dma_iommu,
>>> +    .set_iommu_device = vtd_dev_set_iommu_device,
>>> +    .unset_iommu_device = vtd_dev_unset_iommu_device,
>>>  };
>>>
>>>  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
>>> @@ -4230,6 +4307,8 @@ static void vtd_realize(DeviceState *dev, Error
>> **errp)
>>>                                       g_free, g_free);
>>>      s->vtd_address_spaces = g_hash_table_new_full(vtd_as_hash,
>> vtd_as_equal,
>>>                                        g_free, g_free);
>>> +    s->vtd_iommufd_dev = g_hash_table_new_full(vtd_as_hash,
>> vtd_as_idev_equal,
>>> +                                               g_free, g_free);
>>>      vtd_init(s);
>>>      pci_setup_iommu(bus, &vtd_iommu_ops, dev);
>>>      /* Pseudo address space under root PCI bus. */
>> Thanks
>>
>> Eric




* Re: [PATCH rfcv1 6/6] intel_iommu: add a framework to check and sync host IOMMU cap/ecap
  2024-01-18  9:30     ` Duan, Zhenzhong
@ 2024-01-18 12:40       ` Eric Auger
  2024-01-19 11:55         ` Duan, Zhenzhong
  0 siblings, 1 reply; 46+ messages in thread
From: Eric Auger @ 2024-01-18 12:40 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel
  Cc: alex.williamson, clg, peterx, jasowang, mst, jgg, nicolinc,
	joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P,
	Yi Sun, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum



On 1/18/24 10:30, Duan, Zhenzhong wrote:
> Hi Eric,
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Subject: Re: [PATCH rfcv1 6/6] intel_iommu: add a framework to check and
>> sync host IOMMU cap/ecap
>>
>> Hi Zhenzhong,
>>
>> On 1/15/24 11:13, Zhenzhong Duan wrote:
>>> From: Yi Liu <yi.l.liu@intel.com>
>>>
>>> Add a framework to check and synchronize host IOMMU cap/ecap with
>>> vIOMMU cap/ecap.
>>>
>>> Currently only stage-2 translation is supported, which is backed by a
>>> shadow page table on the host side. So we don't need an exact match of
>>> each cap/ecap bit between the vIOMMU and the host. However, we can still
>>> use this framework to ensure at least that the host and vIOMMU address
>>> widths are compatible, i.e., vIOMMU's aw_bits <= host aw_bits,
>>> a check that was missing before.
>>>
>>> When stage-1 translation, a.k.a. scalable modern mode, is supported in
>>> the future, we will need to ensure compatibility of each bit. Some
>>> bits are user controllable; they should be checked against the host side
>>> to ensure compatibility. Other bits are not; they should be synced
>>> into the vIOMMU cap/ecap for compatibility.
>>>
>>> The sequence will be:
>>>
>>> vtd_cap_init() initializes iommu->cap/ecap. ---- vtd_cap_init()
>>> iommu->host_cap/ecap is initialized as iommu->cap/ecap. ---- vtd_init()
>>> iommu->host_cap/ecap is checked, and some bits are updated from the host
>>> cap/ecap. ---- vtd_sync_hw_info()
>>> iommu->cap/ecap is finalized as iommu->host_cap/ecap. ----
>>> vtd_machine_done_hook()
>>>
>>> iommu->host_cap/ecap is temporary storage that holds the intermediate
>>> value while the host cap/ecap and the vIOMMU's initially configured
>>> cap/ecap are combined.
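The four-step sequence above can be sketched in isolation as follows. This
is only an illustration, not the patch's code: CapState and the cap_stage_*
names are hypothetical, the "clear bits the host lacks" policy stands in for
the TODO in vtd_sync_hw_info(), and refusing to sync after finalization is
an assumption about how late (e.g. hotplugged) devices might be handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant IntelIOMMUState fields. */
typedef struct CapState {
    uint64_t cap, ecap;            /* guest-visible values */
    uint64_t host_cap, host_ecap;  /* scratch copy adjusted per host device */
    bool cap_finalized;
} CapState;

/* Steps 1 and 2: compute defaults, then seed the scratch copy. */
static void cap_stage_init(CapState *s, uint64_t def_cap, uint64_t def_ecap)
{
    if (s->cap_finalized) {
        return;  /* a guest reset must not change finalized values */
    }
    s->cap = def_cap;
    s->ecap = def_ecap;
    s->host_cap = s->cap;
    s->host_ecap = s->ecap;
}

/*
 * Step 3: fold one host device's limits into the scratch copy.
 * Assumed policy: keep only the bits the host also advertises.
 */
static bool cap_stage_sync(CapState *s, uint64_t hw_cap, uint64_t hw_ecap)
{
    if (s->cap_finalized) {
        return false;  /* assumed: too late to renegotiate */
    }
    s->host_cap &= hw_cap;
    s->host_ecap &= hw_ecap;
    return true;
}

/* Step 4: publish the negotiated values and freeze them. */
static void cap_stage_finalize(CapState *s)
{
    s->cap = s->host_cap;
    s->ecap = s->host_ecap;
    s->cap_finalized = true;
}
```

The point of the scratch copy is that the guest-visible cap/ecap stay at
their defaults until every device present at startup has been checked, and
a later reset (a second cap_stage_init() call) leaves them untouched.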
>>>
>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> ---
>>>  include/hw/i386/intel_iommu.h |  4 ++
>>>  hw/i386/intel_iommu.c         | 78
>> +++++++++++++++++++++++++++++++----
>>>  2 files changed, 75 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/include/hw/i386/intel_iommu.h
>> b/include/hw/i386/intel_iommu.h
>>> index c65fdde56f..b8abbcce12 100644
>>> --- a/include/hw/i386/intel_iommu.h
>>> +++ b/include/hw/i386/intel_iommu.h
>>> @@ -292,6 +292,9 @@ struct IntelIOMMUState {
>>>      uint64_t cap;                   /* The value of capability reg */
>>>      uint64_t ecap;                  /* The value of extended capability reg */
>>>
>>> +    uint64_t host_cap;              /* The value of host capability reg */
>>> +    uint64_t host_ecap;             /* The value of host ext-capability reg */
>>> +
>>>      uint32_t context_cache_gen;     /* Should be in [1,MAX] */
>>>      GHashTable *iotlb;              /* IOTLB */
>>>
>>> @@ -314,6 +317,7 @@ struct IntelIOMMUState {
>>>      bool dma_translation;           /* Whether DMA translation supported */
>>>      bool pasid;                     /* Whether to support PASID */
>>>
>>> +    bool cap_finalized;             /* Whether VTD capability finalized */
>>>      /*
>>>       * Protects IOMMU states in general.  Currently it protects the
>>>       * per-IOMMU IOTLB cache, and context entry cache in
>> VTDAddressSpace.
>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>> index 4c1d058ebd..be03fcbf52 100644
>>> --- a/hw/i386/intel_iommu.c
>>> +++ b/hw/i386/intel_iommu.c
>>> @@ -3819,6 +3819,47 @@ VTDAddressSpace
>> *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
>>>      return vtd_dev_as;
>>>  }
>>>
>>> +static bool vtd_sync_hw_info(IntelIOMMUState *s, struct
>> iommu_hw_info_vtd *vtd,
>>> +                             Error **errp)
>>> +{
>>> +    uint64_t addr_width;
>>> +
>>> +    addr_width = (vtd->cap_reg >> 16) & 0x3fULL;
>>> +    if (s->aw_bits > addr_width) {
>>> +        error_setg(errp, "User aw-bits: %u > host address width: %lu",
>>> +                   s->aw_bits, addr_width);
>>> +        return false;
>>> +    }
>>> +
>>> +    /* TODO: check and sync host cap/ecap into vIOMMU cap/ecap */
>>> +
>>> +    return true;
>>> +}
>>> +
>>> +/*
>>> + * virtual VT-d which wants nested needs to check the host IOMMU
>>> + * nesting cap info behind the assigned devices. Thus that vIOMMU
>>> + * could bind guest page table to host.
>>> + */
>>> +static bool vtd_check_idev(IntelIOMMUState *s, IOMMUFDDevice *idev,
>>> +                           Error **errp)
>>> +{
>>> +    struct iommu_hw_info_vtd vtd;
>>> +    enum iommu_hw_info_type type =
>> IOMMU_HW_INFO_TYPE_INTEL_VTD;
>>> +
>>> +    if (iommufd_device_get_info(idev, &type, sizeof(vtd), &vtd)) {
>>> +        error_setg(errp, "Failed to get IOMMU capability!!!");
>>> +        return false;
>>> +    }
>>> +
>>> +    if (type != IOMMU_HW_INFO_TYPE_INTEL_VTD) {
>>> +        error_setg(errp, "IOMMU hardware is not compatible!!!");
>>> +        return false;
>>> +    }
>>> +
>>> +    return vtd_sync_hw_info(s, &vtd, errp);
>>> +}
>>> +
>>>  static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int32_t
>> devfn,
>>>                                      IOMMUFDDevice *idev, Error **errp)
>>>  {
>>> @@ -3837,6 +3878,10 @@ static int vtd_dev_set_iommu_device(PCIBus
>> *bus, void *opaque, int32_t devfn,
>>>          return 0;
>>>      }
>>>
>>> +    if (!vtd_check_idev(s, idev, errp)) {
>> In
>> [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for
>> hotplugged devices
>> https://lore.kernel.org/all/20240117080414.316890-1-eric.auger@redhat.com/
>>
>> I also attempt to pass host iommu info to the virtio-iommu, but with
>> the legacy BE.
> I think your patch works with iommufd BE too😊 Because iommufd BE
> also fills bcontainer->iova_ranges in iommufd_cdev_get_info_iova_range().
Correct. I wanted to emphasize that we also need to pass host
iommu info in legacy mode, for instance. In this series you introduce an
object that works with the iommufd backend, but I think if we go this way
we would need another one for the legacy device. So maybe introducing a
base object with two derived objects would be most appropriate? Given
the assumption that we will use iommufd for new use cases, this
legacy object will probably implement far fewer interfaces, but still.
>
>> In my case I want to pass the reserved memory regions which
>> also model the aw.
>> So this is a pretty similar use case.
> Yes.
>
>> Why don't we pass the pointer to an opaque iommu_hw_info instead,
>> through the PCIIOMMUOps?
> Passing iommu_hw_info is ok for this series, but we want more from
> IOMMUFDDevice in nesting series. I.e., allocate/free ioas/hwpt,
> attach/detach from hwpt, get dirty bitmap, etc. It's more flexible to
> let vIOMMU get what it want itself.
OK, it would be interesting to define the class for this object. It is
worth introducing either in the cover letter or in the 1st patch.

Eric
>
>>
>>
>>> +        return -1;
>>> +    }
>>> +
>>>      vtd_iommu_lock(s);
>>>
>>>      vtd_idev = g_hash_table_lookup(s->vtd_iommufd_dev, &key);
>>> @@ -4071,10 +4116,11 @@ static void vtd_init(IntelIOMMUState *s)
>>>  {
>>>      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
>>>
>>> -    memset(s->csr, 0, DMAR_REG_SIZE);
>>> -    memset(s->wmask, 0, DMAR_REG_SIZE);
>>> -    memset(s->w1cmask, 0, DMAR_REG_SIZE);
>>> -    memset(s->womask, 0, DMAR_REG_SIZE);
>>> +    /* CAP/ECAP are initialized in machine create done stage */
>>> +    memset(s->csr + DMAR_GCMD_REG, 0, DMAR_REG_SIZE -
>> DMAR_GCMD_REG);
>>> +    memset(s->wmask + DMAR_GCMD_REG, 0, DMAR_REG_SIZE -
>> DMAR_GCMD_REG);
>>> +    memset(s->w1cmask + DMAR_GCMD_REG, 0, DMAR_REG_SIZE -
>> DMAR_GCMD_REG);
>>> +    memset(s->womask + DMAR_GCMD_REG, 0, DMAR_REG_SIZE -
>> DMAR_GCMD_REG);
>> This change is not documented in the commit msg.
>> Sorry I don't get why this is needed?
> I'll document it. The comment above explains when cap/ecap are initialized.
> vtd_init() is called both at QEMU init and at guest reset. Cap/ecap are
> finalized during QEMU init, and after that we don't want them to change.
> So we skip the cap/ecap registers here.
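A minimal sketch of the partial clear described above, outside of QEMU. The
register offsets are assumed from the VT-d register layout (CAP at 0x08,
ECAP at 0x10, GCMD at 0x18) and REG_SIZE is illustrative; the macro and
function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Offsets assumed from the VT-d register layout; the size is illustrative. */
#define REG_CAP   0x08
#define REG_ECAP  0x10
#define REG_GCMD  0x18
#define REG_SIZE  0x230

/* Clear everything from GCMD onward, leaving the VER/CAP/ECAP bytes intact. */
static void regs_reset_keep_caps(uint8_t *csr)
{
    memset(csr + REG_GCMD, 0, REG_SIZE - REG_GCMD);
}
```

Starting the memset at the GCMD offset also leaves the version register
bytes alone, which is harmless as long as that register is redefined on
every reset, as vtd_init() does with DMAR_VER_REG.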
>
>>>      s->root = 0;
>>>      s->root_scalable = false;
>>> @@ -4110,13 +4156,16 @@ static void vtd_init(IntelIOMMUState *s)
>>>          vtd_spte_rsvd_large[3] &= ~VTD_SPTE_SNP;
>>>      }
>>>
>>> -    vtd_cap_init(s);
>>> +    if (!s->cap_finalized) {
>>> +        vtd_cap_init(s);
>>> +        s->host_cap = s->cap;
>>> +        s->host_ecap = s->ecap;
>>> +    }
>>> +
>>>      vtd_reset_caches(s);
>>>
>>>      /* Define registers with default values and bit semantics */
>>>      vtd_define_long(s, DMAR_VER_REG, 0x10UL, 0, 0);
>>> -    vtd_define_quad(s, DMAR_CAP_REG, s->cap, 0, 0);
>>> -    vtd_define_quad(s, DMAR_ECAP_REG, s->ecap, 0, 0);
>>>      vtd_define_long(s, DMAR_GCMD_REG, 0, 0xff800000UL, 0);
>>>      vtd_define_long_wo(s, DMAR_GCMD_REG, 0xff800000UL);
>>>      vtd_define_long(s, DMAR_GSTS_REG, 0, 0, 0);
>>> @@ -4241,6 +4290,12 @@ static bool
>> vtd_decide_config(IntelIOMMUState *s, Error **errp)
>>>      return true;
>>>  }
>>>
>>> +static void vtd_setup_capability_reg(IntelIOMMUState *s)
>>> +{
>>> +    vtd_define_quad(s, DMAR_CAP_REG, s->cap, 0, 0);
>>> +    vtd_define_quad(s, DMAR_ECAP_REG, s->ecap, 0, 0);
>>> +}
>>> +
>>>  static int vtd_machine_done_notify_one(Object *child, void *unused)
>>>  {
>>>      IntelIOMMUState *iommu =
>> INTEL_IOMMU_DEVICE(x86_iommu_get_default());
>>> @@ -4259,6 +4314,14 @@ static int
>> vtd_machine_done_notify_one(Object *child, void *unused)
>>>  static void vtd_machine_done_hook(Notifier *notifier, void *unused)
>>>  {
>>> +    IntelIOMMUState *iommu =
>> INTEL_IOMMU_DEVICE(x86_iommu_get_default());
>>> +
>>> +    iommu->cap = iommu->host_cap;
>>> +    iommu->ecap = iommu->host_ecap;
>>> +    iommu->cap_finalized = true;
>> I don't think you can change the defaults like this without taking care
>> of compats (migration).
> Would bumping vtd_vmstate .version_id work?
>
> Thanks
> Zhenzhong
>
>> Thanks
>>
>> Eric
>>> +
>>> +    vtd_setup_capability_reg(iommu);
>>> +
>>>      object_child_foreach_recursive(object_get_root(),
>>>                                     vtd_machine_done_notify_one, NULL);
>>>  }
>>> @@ -4292,6 +4355,7 @@ static void vtd_realize(DeviceState *dev, Error
>> **errp)
>>>      QLIST_INIT(&s->vtd_as_with_notifiers);
>>>      qemu_mutex_init(&s->iommu_lock);
>>> +    s->cap_finalized = false;
>>>      memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
>>>                            "intel_iommu", DMAR_REG_SIZE);
>>>      memory_region_add_subregion(get_system_memory(),




* Re: [PATCH rfcv1 1/6] backends/iommufd_device: introduce IOMMUFDDevice
  2024-01-15 10:13 ` [PATCH rfcv1 1/6] backends/iommufd_device: introduce IOMMUFDDevice Zhenzhong Duan
  2024-01-17 14:11   ` Eric Auger
@ 2024-01-18 12:42   ` Eric Auger
  2024-01-19  7:31     ` Duan, Zhenzhong
  1 sibling, 1 reply; 46+ messages in thread
From: Eric Auger @ 2024-01-18 12:42 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, peterx, jasowang, mst, jgg, nicolinc,
	joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Yi Sun



On 1/15/24 11:13, Zhenzhong Duan wrote:
> IOMMUFDDevice represents a device in iommufd and can be used as
> a communication interface between devices (e.g., VFIO, VDPA) and
> vIOMMU.
>
> Currently it includes the iommufd handle and device id information,
> which the vIOMMU can use to get hw IOMMU information.
>
> With future nested translation support, the vIOMMU is going to have
> more iommufd-related operations, like allocating a hwpt for a device,
> attaching/detaching a hwpt, etc. So IOMMUFDDevice will be expanded further.
>
> IOMMUFDDevice is deliberately not a QOM object because we don't want
> it to be visible from the user interface.
>
> Introduce a helper iommufd_device_init to initialize IOMMUFDDevice.
>
> Originally-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  MAINTAINERS                     |  4 +--
>  include/sysemu/iommufd_device.h | 31 ++++++++++++++++++++
>  backends/iommufd_device.c       | 50 +++++++++++++++++++++++++++++++++
Maybe there is still time to move the iommufd files into a separate dir,
under hw at the same level as vfio.

Thoughts?

Eric
>  backends/meson.build            |  2 +-
>  4 files changed, 84 insertions(+), 3 deletions(-)
>  create mode 100644 include/sysemu/iommufd_device.h
>  create mode 100644 backends/iommufd_device.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 00ec1f7eca..606dfeb2b1 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2171,8 +2171,8 @@ M: Yi Liu <yi.l.liu@intel.com>
>  M: Eric Auger <eric.auger@redhat.com>
>  M: Zhenzhong Duan <zhenzhong.duan@intel.com>
>  S: Supported
> -F: backends/iommufd.c
> -F: include/sysemu/iommufd.h
> +F: backends/iommufd*.c
> +F: include/sysemu/iommufd*.h
>  F: include/qemu/chardev_open.h
>  F: util/chardev_open.c
>  F: docs/devel/vfio-iommufd.rst
> diff --git a/include/sysemu/iommufd_device.h b/include/sysemu/iommufd_device.h
> new file mode 100644
> index 0000000000..795630324b
> --- /dev/null
> +++ b/include/sysemu/iommufd_device.h
> @@ -0,0 +1,31 @@
> +/*
> + * IOMMUFD Device
> + *
> + * Copyright (C) 2024 Intel Corporation.
> + *
> + * Authors: Yi Liu <yi.l.liu@intel.com>
> + *          Zhenzhong Duan <zhenzhong.duan@intel.com>
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#ifndef SYSEMU_IOMMUFD_DEVICE_H
> +#define SYSEMU_IOMMUFD_DEVICE_H
> +
> +#include <linux/iommufd.h>
> +#include "sysemu/iommufd.h"
> +
> +typedef struct IOMMUFDDevice IOMMUFDDevice;
> +
> +/* This is an abstraction of host IOMMUFD device */
> +struct IOMMUFDDevice {
> +    IOMMUFDBackend *iommufd;
> +    uint32_t dev_id;
> +};
> +
> +int iommufd_device_get_info(IOMMUFDDevice *idev,
> +                            enum iommu_hw_info_type *type,
> +                            uint32_t len, void *data);
> +void iommufd_device_init(void *_idev, size_t instance_size,
> +                         IOMMUFDBackend *iommufd, uint32_t dev_id);
> +#endif
> diff --git a/backends/iommufd_device.c b/backends/iommufd_device.c
> new file mode 100644
> index 0000000000..f6e7ca1dbf
> --- /dev/null
> +++ b/backends/iommufd_device.c
> @@ -0,0 +1,50 @@
> +/*
> + * QEMU abstract of Host IOMMU
> + *
> + * Copyright (C) 2024 Intel Corporation.
> + *
> + * Authors: Yi Liu <yi.l.liu@intel.com>
> + *          Zhenzhong Duan <zhenzhong.duan@intel.com>
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#include <sys/ioctl.h>
> +#include "qemu/osdep.h"
> +#include "qemu/error-report.h"
> +#include "sysemu/iommufd_device.h"
> +
> +int iommufd_device_get_info(IOMMUFDDevice *idev,
> +                            enum iommu_hw_info_type *type,
> +                            uint32_t len, void *data)
> +{
> +    struct iommu_hw_info info = {
> +        .size = sizeof(info),
> +        .flags = 0,
> +        .dev_id = idev->dev_id,
> +        .data_len = len,
> +        .__reserved = 0,
> +        .data_uptr = (uintptr_t)data,
> +    };
> +    int ret;
> +
> +    ret = ioctl(idev->iommufd->fd, IOMMU_GET_HW_INFO, &info);
> +    if (ret) {
> +        error_report("Failed to get info %m");
> +    } else {
> +        *type = info.out_data_type;
> +    }
> +
> +    return ret;
> +}
> +
> +void iommufd_device_init(void *_idev, size_t instance_size,
> +                         IOMMUFDBackend *iommufd, uint32_t dev_id)
> +{
> +    IOMMUFDDevice *idev = (IOMMUFDDevice *)_idev;
> +
> +    g_assert(sizeof(IOMMUFDDevice) <= instance_size);
> +
> +    idev->iommufd = iommufd;
> +    idev->dev_id = dev_id;
> +}
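The void pointer plus instance_size signature suggests that callers embed
the object at the start of a larger structure. A minimal sketch of that
pattern, assuming this is the intent; BaseDev, WrappedDev and base_dev_init
are hypothetical stand-ins, not names from the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins; the backend is left opaque here. */
typedef struct Backend Backend;

typedef struct BaseDev {
    Backend *backend;
    uint32_t dev_id;
} BaseDev;

/* Hypothetical wrapper that embeds the base object as its first member. */
typedef struct WrappedDev {
    BaseDev base;      /* must stay first so the cast in the init is valid */
    int extra_state;
} WrappedDev;

static void base_dev_init(void *dev, size_t instance_size,
                          Backend *backend, uint32_t dev_id)
{
    BaseDev *base = dev;

    /* Catch callers that pass storage too small to hold the base part. */
    assert(instance_size >= sizeof(BaseDev));
    base->backend = backend;
    base->dev_id = dev_id;
}
```

A caller would pass the address and size of its own wrapper, for example
base_dev_init(&w, sizeof(w), backend, id), and the wrapper's extra fields
are left untouched.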
> diff --git a/backends/meson.build b/backends/meson.build
> index 8b2b111497..c437cdb363 100644
> --- a/backends/meson.build
> +++ b/backends/meson.build
> @@ -24,7 +24,7 @@ if have_vhost_user
>    system_ss.add(when: 'CONFIG_VIRTIO', if_true: files('vhost-user.c'))
>  endif
>  system_ss.add(when: 'CONFIG_VIRTIO_CRYPTO', if_true: files('cryptodev-vhost.c'))
> -system_ss.add(when: 'CONFIG_IOMMUFD', if_true: files('iommufd.c'))
> +system_ss.add(when: 'CONFIG_IOMMUFD', if_true: files('iommufd.c', 'iommufd_device.c'))
>  if have_vhost_user_crypto
>    system_ss.add(when: 'CONFIG_VIRTIO_CRYPTO', if_true: files('cryptodev-vhost-user.c'))
>  endif




* RE: [PATCH rfcv1 3/6] intel_iommu: add set/unset_iommu_device callback
  2024-01-18 12:34       ` Eric Auger
@ 2024-01-19  7:27         ` Duan, Zhenzhong
  0 siblings, 0 replies; 46+ messages in thread
From: Duan, Zhenzhong @ 2024-01-19  7:27 UTC (permalink / raw)
  To: eric.auger, qemu-devel
  Cc: alex.williamson, clg, peterx, jasowang, mst, jgg, nicolinc,
	joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P,
	Yi Sun, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH rfcv1 3/6] intel_iommu: add set/unset_iommu_device
>callback
>
>
>
>On 1/18/24 09:43, Duan, Zhenzhong wrote:
>>
>>> -----Original Message-----
>>> From: Eric Auger <eric.auger@redhat.com>
>>> Subject: Re: [PATCH rfcv1 3/6] intel_iommu: add
>set/unset_iommu_device
>>> callback
>>>
>>> Hi Zhenzhong,
>>>
>>> On 1/15/24 11:13, Zhenzhong Duan wrote:
>>>> From: Yi Liu <yi.l.liu@intel.com>
>>>>
>>>> This adds set/unset_iommu_device() implementation in Intel vIOMMU.
>>>> In set call, IOMMUFDDevice is recorded in hash table indexed by
>>>> PCI BDF.
>>>>
>>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>> ---
>>>>  include/hw/i386/intel_iommu.h | 10 +++++
>>>>  hw/i386/intel_iommu.c         | 79
>>> +++++++++++++++++++++++++++++++++++
>>>>  2 files changed, 89 insertions(+)
>>>>
>>>> diff --git a/include/hw/i386/intel_iommu.h
>>> b/include/hw/i386/intel_iommu.h
>>>> index 7fa0a695c8..c65fdde56f 100644
>>>> --- a/include/hw/i386/intel_iommu.h
>>>> +++ b/include/hw/i386/intel_iommu.h
>>>> @@ -62,6 +62,7 @@ typedef union VTD_IR_TableEntry
>VTD_IR_TableEntry;
>>>>  typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
>>>>  typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
>>>>  typedef struct VTDPASIDEntry VTDPASIDEntry;
>>>> +typedef struct VTDIOMMUFDDevice VTDIOMMUFDDevice;
>>>>
>>>>  /* Context-Entry */
>>>>  struct VTDContextEntry {
>>>> @@ -148,6 +149,13 @@ struct VTDAddressSpace {
>>>>      IOVATree *iova_tree;
>>>>  };
>>>>
>>>> +struct VTDIOMMUFDDevice {
>>>> +    PCIBus *bus;
>>>> +    uint8_t devfn;
>>>> +    IOMMUFDDevice *idev;
>>>> +    IntelIOMMUState *iommu_state;
>>>> +};
>>>> +
>>> Just wondering whether we shouldn't reuse the VTDAddressSpace to store
>>> the idev, if any. How did you make your choice? What will it become
>>> when PASID gets added?
>> VTDAddressSpace is indexed by aliased BDF, but VTDIOMMUFDDevice is indexed
>> by the device's own BDF. So we can't just store VTDIOMMUFDDevice as a
>> pointer in VTDAddressSpace; we would need a list in case more than one
>> device is in the same address space. A global VTDIOMMUFDDevice list is
>> then better for lookup.
>
>OK but if several devices are hidden under an aliased BDF, can't they
>share the host properties (DMAR ecap/cap)?

It depends on whether the vfio devices are under the same aliased BDF on the
host. If vfio devices are configured under the same aliased BDF in QEMU but
are not under the same aliased BDF on the host, their host cap/ecap may differ.
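To illustrate why the table is keyed by each device's own BDF, here is a
minimal sketch with a fixed-size array instead of a GHashTable. DevEntry,
dev_lookup and dev_register are hypothetical names, and the host_cap field
is only an assumption about what a per-device record might carry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DEVS 8

/* Hypothetical record: one entry per assigned device, keyed by its own BDF. */
typedef struct DevEntry {
    const void *bus;    /* stands in for PCIBus * */
    uint8_t devfn;
    uint64_t host_cap;  /* capability of the host IOMMU behind this device */
    int used;
} DevEntry;

static DevEntry dev_table[MAX_DEVS];

/* Find the entry for (bus, devfn), or NULL if the device is unknown. */
static DevEntry *dev_lookup(const void *bus, uint8_t devfn)
{
    for (size_t i = 0; i < MAX_DEVS; i++) {
        if (dev_table[i].used && dev_table[i].bus == bus &&
            dev_table[i].devfn == devfn) {
            return &dev_table[i];
        }
    }
    return NULL;
}

/* Returns 0 on success, -1 if the device exists already or the table is full. */
static int dev_register(const void *bus, uint8_t devfn, uint64_t host_cap)
{
    if (dev_lookup(bus, devfn)) {
        return -1;
    }
    for (size_t i = 0; i < MAX_DEVS; i++) {
        if (!dev_table[i].used) {
            dev_table[i] = (DevEntry){ bus, devfn, host_cap, 1 };
            return 0;
        }
    }
    return -1;
}
```

Two devices that share one aliased BDF in the guest still get separate
entries here, so each can carry the capabilities of whichever host IOMMU
it actually sits behind.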

Thanks
Zhenzhong


* RE: [PATCH rfcv1 1/6] backends/iommufd_device: introduce IOMMUFDDevice
  2024-01-18 12:42   ` Eric Auger
@ 2024-01-19  7:31     ` Duan, Zhenzhong
  2024-01-22 16:25       ` Cédric Le Goater
  2024-01-23 10:10       ` Eric Auger
  0 siblings, 2 replies; 46+ messages in thread
From: Duan, Zhenzhong @ 2024-01-19  7:31 UTC (permalink / raw)
  To: eric.auger, qemu-devel
  Cc: alex.williamson, clg, peterx, jasowang, mst, jgg, nicolinc,
	joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P,
	Yi Sun



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH rfcv1 1/6] backends/iommufd_device: introduce
>IOMMUFDDevice
>
>
>
>On 1/15/24 11:13, Zhenzhong Duan wrote:
>> IOMMUFDDevice represents a device in iommufd and can be used as
>> a communication interface between devices (e.g., VFIO, VDPA) and
>> vIOMMU.
>>
>> Currently it includes the iommufd handle and device id information,
>> which the vIOMMU can use to get hw IOMMU information.
>>
>> With future nested translation support, the vIOMMU is going to have
>> more iommufd-related operations, like allocating a hwpt for a device,
>> attaching/detaching a hwpt, etc. So IOMMUFDDevice will be expanded further.
>>
>> IOMMUFDDevice is deliberately not a QOM object because we don't want
>> it to be visible from the user interface.
>>
>> Introduce a helper iommufd_device_init to initialize IOMMUFDDevice.
>>
>> Originally-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  MAINTAINERS                     |  4 +--
>>  include/sysemu/iommufd_device.h | 31 ++++++++++++++++++++
>>  backends/iommufd_device.c       | 50
>+++++++++++++++++++++++++++++++++
>Maybe there is still time to move the iommufd files into a separate dir,
>under hw at the same level as vfio.
>
>Thoughts?

Any reason for the move? The hw dir contains entries that emulate different
devices. Iommufd is not a real device; it's more of a backend.

Thanks
Zhenzhong


* RE: [PATCH rfcv1 6/6] intel_iommu: add a framework to check and sync host IOMMU cap/ecap
  2024-01-18 12:40       ` Eric Auger
@ 2024-01-19 11:55         ` Duan, Zhenzhong
  2024-01-23 13:10           ` Eric Auger
  0 siblings, 1 reply; 46+ messages in thread
From: Duan, Zhenzhong @ 2024-01-19 11:55 UTC (permalink / raw)
  To: eric.auger, qemu-devel
  Cc: alex.williamson, clg, peterx, jasowang, mst, jgg, nicolinc,
	joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P,
	Yi Sun, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH rfcv1 6/6] intel_iommu: add a framework to check and
>sync host IOMMU cap/ecap
>
>
>
>On 1/18/24 10:30, Duan, Zhenzhong wrote:
>> Hi Eric,
>>
>>> -----Original Message-----
>>> From: Eric Auger <eric.auger@redhat.com>
>>> Subject: Re: [PATCH rfcv1 6/6] intel_iommu: add a framework to check
>and
>>> sync host IOMMU cap/ecap
>>>
>>> Hi Zhenzhong,
>>>
>>> On 1/15/24 11:13, Zhenzhong Duan wrote:
>>>> From: Yi Liu <yi.l.liu@intel.com>
>>>>
>>>> Add a framework to check and synchronize host IOMMU cap/ecap with
>>>> vIOMMU cap/ecap.
>>>>
>>>> Currently only stage-2 translation is supported, which is backed by a
>>>> shadow page table on the host side. So we don't need an exact match of
>>>> each cap/ecap bit between the vIOMMU and the host. However, we can still
>>>> use this framework to ensure at least that the host and vIOMMU address
>>>> widths are compatible, i.e., vIOMMU's aw_bits <= host aw_bits,
>>>> a check that was missing before.
>>>>
>>>> When stage-1 translation, a.k.a. scalable modern mode, is supported in
>>>> the future, we will need to ensure compatibility of each bit. Some
>>>> bits are user controllable; they should be checked against the host side
>>>> to ensure compatibility. Other bits are not; they should be synced
>>>> into the vIOMMU cap/ecap for compatibility.
>>>>
>>>> The sequence will be:
>>>>
>>>> vtd_cap_init() initializes iommu->cap/ecap. ---- vtd_cap_init()
>>>> iommu->host_cap/ecap is initialized as iommu->cap/ecap. ---- vtd_init()
>>>> iommu->host_cap/ecap is checked, and some bits are updated from the host
>>>> cap/ecap. ---- vtd_sync_hw_info()
>>>> iommu->cap/ecap is finalized as iommu->host_cap/ecap. ----
>>>> vtd_machine_done_hook()
>>>>
>>>> iommu->host_cap/ecap is temporary storage that holds the intermediate
>>>> value while the host cap/ecap and the vIOMMU's initially configured
>>>> cap/ecap are combined.
>>>>
>>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>> ---
>>>>  include/hw/i386/intel_iommu.h |  4 ++
>>>>  hw/i386/intel_iommu.c         | 78
>>> +++++++++++++++++++++++++++++++----
>>>>  2 files changed, 75 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/include/hw/i386/intel_iommu.h
>>> b/include/hw/i386/intel_iommu.h
>>>> index c65fdde56f..b8abbcce12 100644
>>>> --- a/include/hw/i386/intel_iommu.h
>>>> +++ b/include/hw/i386/intel_iommu.h
>>>> @@ -292,6 +292,9 @@ struct IntelIOMMUState {
>>>>      uint64_t cap;                   /* The value of capability reg */
>>>>      uint64_t ecap;                  /* The value of extended capability reg */
>>>>
>>>> +    uint64_t host_cap;              /* The value of host capability reg */
>>>> +    uint64_t host_ecap;             /* The value of host ext-capability reg */
>>>> +
>>>>      uint32_t context_cache_gen;     /* Should be in [1,MAX] */
>>>>      GHashTable *iotlb;              /* IOTLB */
>>>>
>>>> @@ -314,6 +317,7 @@ struct IntelIOMMUState {
>>>>      bool dma_translation;           /* Whether DMA translation supported
>*/
>>>>      bool pasid;                     /* Whether to support PASID */
>>>>
>>>> +    bool cap_finalized;             /* Whether VTD capability finalized */
>>>>      /*
>>>>       * Protects IOMMU states in general.  Currently it protects the
>>>>       * per-IOMMU IOTLB cache, and context entry cache in
>>> VTDAddressSpace.
>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>> index 4c1d058ebd..be03fcbf52 100644
>>>> --- a/hw/i386/intel_iommu.c
>>>> +++ b/hw/i386/intel_iommu.c
>>>> @@ -3819,6 +3819,47 @@ VTDAddressSpace
>>> *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
>>>>      return vtd_dev_as;
>>>>  }
>>>>
>>>> +static bool vtd_sync_hw_info(IntelIOMMUState *s, struct
>>> iommu_hw_info_vtd *vtd,
>>>> +                             Error **errp)
>>>> +{
>>>> +    uint64_t addr_width;
>>>> +
>>>> +    addr_width = (vtd->cap_reg >> 16) & 0x3fULL;
>>>> +    if (s->aw_bits > addr_width) {
>>>> +        error_setg(errp, "User aw-bits: %u > host address width: %lu",
>>>> +                   s->aw_bits, addr_width);
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    /* TODO: check and sync host cap/ecap into vIOMMU cap/ecap */
>>>> +
>>>> +    return true;
>>>> +}
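A note on the address-width extraction above, as a minimal sketch rather
than a proposed fix. It assumes that the MGAW field in CAP_REG bits 21:16
stores the width minus one, as the VT-d specification appears to define it
(a field value of 47 would mean 48 bits). If that holds, comparing
s->aw_bits against the raw field would reject a 48-bit guest on a 48-bit
host. The helper names are hypothetical; only the shift and mask come from
the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical macro names; the shift and mask values come from the patch. */
#define CAP_MGAW_SHIFT 16
#define CAP_MGAW_MASK  0x3fULL

/* Raw 6-bit MGAW field, exactly as the patch extracts it. */
static inline uint8_t cap_mgaw_field(uint64_t cap_reg)
{
    return (uint8_t)((cap_reg >> CAP_MGAW_SHIFT) & CAP_MGAW_MASK);
}

/* Width in bits, assuming the field stores (width - 1). */
static inline uint8_t cap_mgaw_bits(uint64_t cap_reg)
{
    return (uint8_t)(cap_mgaw_field(cap_reg) + 1);
}

/* True when the guest-visible width fits within the host width. */
static inline bool aw_bits_compatible(uint8_t guest_aw_bits,
                                      uint64_t host_cap_reg)
{
    return guest_aw_bits <= cap_mgaw_bits(host_cap_reg);
}
```

With this decoding, a guest configured with aw-bits=48 passes against a
host whose MGAW field reads 47, while a 57-bit guest is still rejected.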
>>>> +
>>>> +/*
>>>> + * virtual VT-d which wants nested needs to check the host IOMMU
>>>> + * nesting cap info behind the assigned devices. Thus that vIOMMU
>>>> + * could bind guest page table to host.
>>>> + */
>>>> +static bool vtd_check_idev(IntelIOMMUState *s, IOMMUFDDevice
>*idev,
>>>> +                           Error **errp)
>>>> +{
>>>> +    struct iommu_hw_info_vtd vtd;
>>>> +    enum iommu_hw_info_type type =
>>> IOMMU_HW_INFO_TYPE_INTEL_VTD;
>>>> +
>>>> +    if (iommufd_device_get_info(idev, &type, sizeof(vtd), &vtd)) {
>>>> +        error_setg(errp, "Failed to get IOMMU capability!!!");
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    if (type != IOMMU_HW_INFO_TYPE_INTEL_VTD) {
>>>> +        error_setg(errp, "IOMMU hardware is not compatible!!!");
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    return vtd_sync_hw_info(s, &vtd, errp);
>>>> +}
>>>> +
>>>>  static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int32_t devfn,
>>>>                                      IOMMUFDDevice *idev, Error **errp)
>>>>  {
>>>> @@ -3837,6 +3878,10 @@ static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int32_t devfn,
>>>>          return 0;
>>>>      }
>>>>
>>>> +    if (!vtd_check_idev(s, idev, errp)) {
>>> In
>>> [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for
>>> hotplugged devices
>>> https://lore.kernel.org/all/20240117080414.316890-1-eric.auger@redhat.com/
>>>
>>> I also attempt to pass host iommu info to the virtio-iommu but with
>>> legacy BE.
>> I think your patch works with iommufd BE too😊 Because iommufd BE
>> also fills bcontainer->iova_ranges in iommufd_cdev_get_info_iova_range().
>correct. I wanted to emphasize that we also need to pass host iommu
>info in legacy mode, for instance. In this series you introduce an
>object that works with the iommufd backend, but I think if we go this way
>we would need another one for the legacy device. So maybe introducing a
>base object with two derived ones would be most appropriate? Granted,
>assuming we will use iommufd for new use cases, the legacy object
>will implement far fewer interfaces, but still.

How about this:

enum IOMMU_LEGACY_DEVICE_TYPE {
    IOMMU_LEGACY_VFIO_DEVICE,
    IOMMU_LEGACY_VDPA_DEVICE,
};

typedef struct IOMMULegacyDevice {
    enum IOMMU_LEGACY_DEVICE_TYPE type;

    /* common field */

    union {
        ....
    };

} IOMMULegacyDevice;

typedef struct IOMMUFDDevice {
    IOMMUFDBackend *iommufd;
    uint32_t dev_id;
    uint32_t ioas_id;
} IOMMUFDDevice;

enum IOMMU_DEVICE_TYPE {
    IOMMUFD_DEVICE,
    IOMMU_LEGACY_DEVICE,
};

struct IOMMUDevice {
    enum IOMMU_DEVICE_TYPE type;

    /* common field */
    GList *iova_ranges;

    union {
        IOMMULegacyDevice legacy_dev;
        IOMMUFDDevice idev;
    };
};
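
A consumer of such a tagged structure would presumably branch on the type
field before touching the union. A minimal sketch of that, assuming stub
types in place of the real QEMU ones; the IOMMUDeviceType typedef and the
helper iommu_device_get_iommufd() are hypothetical names used only for
illustration, and the legacy struct is reduced to a placeholder:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the real backend type; its contents are not needed here. */
typedef struct IOMMUFDBackend IOMMUFDBackend;

/* Placeholder only: the legacy layout is left open in the proposal. */
typedef struct IOMMULegacyDevice {
    int type;
} IOMMULegacyDevice;

typedef struct IOMMUFDDevice {
    IOMMUFDBackend *iommufd;
    uint32_t dev_id;
    uint32_t ioas_id;
} IOMMUFDDevice;

typedef enum IOMMUDeviceType {
    IOMMUFD_DEVICE,
    IOMMU_LEGACY_DEVICE,
} IOMMUDeviceType;

typedef struct IOMMUDevice {
    IOMMUDeviceType type;       /* selects the valid union member */
    union {
        IOMMULegacyDevice legacy_dev;
        IOMMUFDDevice idev;
    };
} IOMMUDevice;

/*
 * Hypothetical accessor: return the iommufd view of the device, or NULL
 * when the device sits behind the legacy backend.
 */
static IOMMUFDDevice *iommu_device_get_iommufd(IOMMUDevice *dev)
{
    assert(dev != NULL);
    if (dev->type != IOMMUFD_DEVICE) {
        return NULL;
    }
    return &dev->idev;
}
```

With this shape a vIOMMU could accept both kinds of device through one
callback and only use iommufd-specific operations when the accessor
returns non-NULL.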

>>
>>> In my case I want to pass the reserved memory regions which
>>> also model the aw.
>>> So this is a pretty similar use case.
>> Yes.
>>
>>> Why don't we pass the pointer to an opaque iommu_hw_info instead,
>>> through the PCIIOMMUOps?
>> Passing iommu_hw_info is ok for this series, but we want more from
>> IOMMUFDDevice in the nesting series, e.g., allocate/free ioas/hwpt,
>> attach/detach from hwpt, get dirty bitmap, etc. It's more flexible to
>> let vIOMMU get what it wants itself.
>OK, it would be interesting to define the class for this object. It would
>be worth introducing either in the cover letter or in the 1st patch.

It is not a QOM class because we don't want it exposed through
query-qmp-schema.

Thanks
Zhenzhong

>
>Eric
>>
>>>
>>>
>>>> +        return -1;
>>>> +    }
>>>> +
>>>>      vtd_iommu_lock(s);
>>>>
>>>>      vtd_idev = g_hash_table_lookup(s->vtd_iommufd_dev, &key);
>>>> @@ -4071,10 +4116,11 @@ static void vtd_init(IntelIOMMUState *s)
>>>>  {
>>>>      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
>>>>
>>>> -    memset(s->csr, 0, DMAR_REG_SIZE);
>>>> -    memset(s->wmask, 0, DMAR_REG_SIZE);
>>>> -    memset(s->w1cmask, 0, DMAR_REG_SIZE);
>>>> -    memset(s->womask, 0, DMAR_REG_SIZE);
>>>> +    /* CAP/ECAP are initialized in machine create done stage */
>>>> +    memset(s->csr + DMAR_GCMD_REG, 0, DMAR_REG_SIZE - DMAR_GCMD_REG);
>>>> +    memset(s->wmask + DMAR_GCMD_REG, 0, DMAR_REG_SIZE - DMAR_GCMD_REG);
>>>> +    memset(s->w1cmask + DMAR_GCMD_REG, 0, DMAR_REG_SIZE - DMAR_GCMD_REG);
>>>> +    memset(s->womask + DMAR_GCMD_REG, 0, DMAR_REG_SIZE - DMAR_GCMD_REG);
>>> This change is not documented in the commit msg.
>>> Sorry I don't get why this is needed?
>> I'll document it. The comment line above explains when cap/ecap are
>> initialized. vtd_init() is called both at qemu init and at guest reset.
>> Cap/ecap are finalized during qemu init, and after that we don't want
>> them to change, so the memsets here skip the cap/ecap registers.
>>
>>>>      s->root = 0;
>>>>      s->root_scalable = false;
>>>> @@ -4110,13 +4156,16 @@ static void vtd_init(IntelIOMMUState *s)
>>>>          vtd_spte_rsvd_large[3] &= ~VTD_SPTE_SNP;
>>>>      }
>>>>
>>>> -    vtd_cap_init(s);
>>>> +    if (!s->cap_finalized) {
>>>> +        vtd_cap_init(s);
>>>> +        s->host_cap = s->cap;
>>>> +        s->host_ecap = s->ecap;
>>>> +    }
>>>> +
>>>>      vtd_reset_caches(s);
>>>>
>>>>      /* Define registers with default values and bit semantics */
>>>>      vtd_define_long(s, DMAR_VER_REG, 0x10UL, 0, 0);
>>>> -    vtd_define_quad(s, DMAR_CAP_REG, s->cap, 0, 0);
>>>> -    vtd_define_quad(s, DMAR_ECAP_REG, s->ecap, 0, 0);
>>>>      vtd_define_long(s, DMAR_GCMD_REG, 0, 0xff800000UL, 0);
>>>>      vtd_define_long_wo(s, DMAR_GCMD_REG, 0xff800000UL);
>>>>      vtd_define_long(s, DMAR_GSTS_REG, 0, 0, 0);
>>>> @@ -4241,6 +4290,12 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
>>>>      return true;
>>>>  }
>>>>
>>>> +static void vtd_setup_capability_reg(IntelIOMMUState *s)
>>>> +{
>>>> +    vtd_define_quad(s, DMAR_CAP_REG, s->cap, 0, 0);
>>>> +    vtd_define_quad(s, DMAR_ECAP_REG, s->ecap, 0, 0);
>>>> +}
>>>> +
>>>>  static int vtd_machine_done_notify_one(Object *child, void *unused)
>>>>  {
>>>>      IntelIOMMUState *iommu = INTEL_IOMMU_DEVICE(x86_iommu_get_default());
>>>> @@ -4259,6 +4314,14 @@ static int vtd_machine_done_notify_one(Object *child, void *unused)
>>>>  static void vtd_machine_done_hook(Notifier *notifier, void *unused)
>>>>  {
>>>> +    IntelIOMMUState *iommu = INTEL_IOMMU_DEVICE(x86_iommu_get_default());
>>>> +
>>>> +    iommu->cap = iommu->host_cap;
>>>> +    iommu->ecap = iommu->host_ecap;
>>>> +    iommu->cap_finalized = true;
>>> I don't think you can change the defaults like this without taking care
>>> of compats (migration).
>> Would bumping vtd_vmstate's .version_id work?
>>
>> Thanks
>> Zhenzhong
>>
>>> Thanks
>>>
>>> Eric
>>>> +
>>>> +    vtd_setup_capability_reg(iommu);
>>>> +
>>>>      object_child_foreach_recursive(object_get_root(),
>>>>                                     vtd_machine_done_notify_one, NULL);
>>>>  }
>>>> @@ -4292,6 +4355,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
>>>>      QLIST_INIT(&s->vtd_as_with_notifiers);
>>>>      qemu_mutex_init(&s->iommu_lock);
>>>> +    s->cap_finalized = false;
>>>>      memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
>>>>                            "intel_iommu", DMAR_REG_SIZE);
>>>>      memory_region_add_subregion(get_system_memory(),



* Re: [PATCH rfcv1 1/6] backends/iommufd_device: introduce IOMMUFDDevice
  2024-01-19  7:31     ` Duan, Zhenzhong
@ 2024-01-22 16:25       ` Cédric Le Goater
  2024-01-23  5:51         ` Duan, Zhenzhong
  2024-01-23 10:10       ` Eric Auger
  1 sibling, 1 reply; 46+ messages in thread
From: Cédric Le Goater @ 2024-01-22 16:25 UTC (permalink / raw)
  To: Duan, Zhenzhong, eric.auger, qemu-devel
  Cc: alex.williamson, peterx, jasowang, mst, jgg, nicolinc,
	joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P,
	Yi Sun

On 1/19/24 08:31, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Subject: Re: [PATCH rfcv1 1/6] backends/iommufd_device: introduce
>> IOMMUFDDevice
>>
>>
>>
>> On 1/15/24 11:13, Zhenzhong Duan wrote:
>>> IOMMUFDDevice represents a device in iommufd and can be used as
>>> a communication interface between devices (i.e., VFIO, VDPA) and
>>> vIOMMU.
>>>
>>> Currently it includes iommufd handler and device id information
>>> which could be used by vIOMMU to get hw IOMMU information.
>>>
>>> In future nested translation support, vIOMMU is going to have
>>> more iommufd related operations like allocate hwpt for a device,
>>> attach/detach hwpt, etc. So IOMMUFDDevice will be further expanded.
>>>
>>> IOMMUFDDevice is willingly not a QOM object because we don't want
>>> it to be visible from the user interface.
>>>
>>> Introduce a helper iommufd_device_init to initialize IOMMUFDDevice.
>>>
>>> Originally-by: Yi Liu <yi.l.liu@intel.com>
>>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> ---
>>>   MAINTAINERS                     |  4 +--
>>>   include/sysemu/iommufd_device.h | 31 ++++++++++++++++++++
>>>   backends/iommufd_device.c       | 50 +++++++++++++++++++++++++++++++++
>> Maybe there is still time to move the iommufd files into a separate dir,
>> under hw at the same level as vfio.
>>
>> Thoughts?
> 
> Any reason for the move? The hw dir contains entries that emulate different
> devices. Iommufd is not a real device; it's more of a backend.

I would include the new services in the existing iommufd .[ch] files.
No need for a new file since the changes are all related to the IOMMUFD
device usage.

Thanks,

C.






* Re: [PATCH rfcv1 2/6] hw/pci: introduce pci_device_set/unset_iommu_device()
  2024-01-15 10:13 ` [PATCH rfcv1 2/6] hw/pci: introduce pci_device_set/unset_iommu_device() Zhenzhong Duan
  2024-01-17 14:11   ` Eric Auger
@ 2024-01-22 16:55   ` Cédric Le Goater
  2024-01-23  6:37     ` Duan, Zhenzhong
  1 sibling, 1 reply; 46+ messages in thread
From: Cédric Le Goater @ 2024-01-22 16:55 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Yi Sun, Marcel Apfelbaum

On 1/15/24 11:13, Zhenzhong Duan wrote:
> From: Yi Liu <yi.l.liu@intel.com>
> 
> This adds pci_device_set/unset_iommu_device() to set/unset
> IOMMUFDDevice for a given PCIe device. Caller of set
> should fail if set operation fails.
> 
> Extract out pci_device_get_iommu_bus_devfn() to facilitate
> implementation of pci_device_set/unset_iommu_device().
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   include/hw/pci/pci.h | 39 ++++++++++++++++++++++++++++++++++-
>   hw/pci/pci.c         | 49 +++++++++++++++++++++++++++++++++++++++++++-
>   2 files changed, 86 insertions(+), 2 deletions(-)
> 
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index fa6313aabc..a810c0ec74 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -7,6 +7,8 @@
>   /* PCI includes legacy ISA access.  */
>   #include "hw/isa/isa.h"
>   
> +#include "sysemu/iommufd_device.h"
> +
>   extern bool pci_available;
>   
>   /* PCI bus */
> @@ -384,10 +386,45 @@ typedef struct PCIIOMMUOps {
>        *
>        * @devfn: device and function number
>        */
> -   AddressSpace * (*get_address_space)(PCIBus *bus, void *opaque, int devfn);
> +    AddressSpace * (*get_address_space)(PCIBus *bus, void *opaque, int devfn);
> +    /**
> +     * @set_iommu_device: set iommufd device for a PCI device to vIOMMU
> +     *
> +     * Optional callback, if not implemented in vIOMMU, then vIOMMU can't
> +     * utilize iommufd specific features.
> +     *
> +     * Return true if iommufd device is accepted, or else return false with
> +     * errp set.
> +     *
> +     * @bus: the #PCIBus of the PCI device.
> +     *
> +     * @opaque: the data passed to pci_setup_iommu().
> +     *
> +     * @devfn: device and function number of the PCI device.
> +     *
> +     * @idev: the data structure representing iommufd device.
> +     *
> +     */
> +    int (*set_iommu_device)(PCIBus *bus, void *opaque, int32_t devfn,
> +                            IOMMUFDDevice *idev, Error **errp);
> +    /**
> +     * @unset_iommu_device: unset iommufd device for a PCI device from vIOMMU
> +     *
> +     * Optional callback.
> +     *
> +     * @bus: the #PCIBus of the PCI device.
> +     *
> +     * @opaque: the data passed to pci_setup_iommu().
> +     *
> +     * @devfn: device and function number of the PCI device.
> +     */
> +    void (*unset_iommu_device)(PCIBus *bus, void *opaque, int32_t devfn);
>   } PCIIOMMUOps;
>   
>   AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
> +int pci_device_set_iommu_device(PCIDevice *dev, IOMMUFDDevice *idev,
> +                                Error **errp);
> +void pci_device_unset_iommu_device(PCIDevice *dev);
>   
>   /**
>    * pci_setup_iommu: Initialize specific IOMMU handlers for a PCIBus
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 76080af580..3848662f95 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2672,7 +2672,10 @@ static void pci_device_class_base_init(ObjectClass *klass, void *data)
>       }
>   }
>   
> -AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> +static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
> +                                           PCIBus **aliased_pbus,
> +                                           PCIBus **piommu_bus,
> +                                           uint8_t *aliased_pdevfn)
>   {
>       PCIBus *bus = pci_get_bus(dev);
>       PCIBus *iommu_bus = bus;
> @@ -2717,6 +2720,18 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>   
>           iommu_bus = parent_bus;
>       }
> +    *aliased_pbus = bus;
> +    *piommu_bus = iommu_bus;
> +    *aliased_pdevfn = devfn;
> +}
> +
> +AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> +{
> +    PCIBus *bus;
> +    PCIBus *iommu_bus;
> +    uint8_t devfn;
> +
> +    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
>       if (!pci_bus_bypass_iommu(bus) && iommu_bus->iommu_ops) {
>           return iommu_bus->iommu_ops->get_address_space(bus,
>                                    iommu_bus->iommu_opaque, devfn);
> @@ -2724,6 +2739,38 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>       return &address_space_memory;
>   }
>   
> +int pci_device_set_iommu_device(PCIDevice *dev, IOMMUFDDevice *idev,
> +                                Error **errp)
> +{
> +    PCIBus *bus;
> +    PCIBus *iommu_bus;
> +    uint8_t devfn;
> +
> +    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
> +    if (!pci_bus_bypass_iommu(bus) && iommu_bus &&

Why do we test iommu_bus in pci_device_un/set_iommu_device routines and
not in pci_device_iommu_address_space() ?
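
For reference, the defensive form used in the new routines can be sketched
with stub types. This is only an illustration under assumptions: the
structs are cut down to the fields involved, set_iommu_device_checked()
and reject_all() are hypothetical names, and whether iommu_bus can really
be NULL at that point is exactly the open question.

```c
#include <assert.h>
#include <stddef.h>

/* Stub types: only the fields used by the guard chain are kept. */
typedef struct IOMMUFDDevice IOMMUFDDevice;

typedef struct PCIIOMMUOps {
    /* Optional callback; may be left NULL by a vIOMMU. */
    int (*set_iommu_device)(void *opaque, int devfn, IOMMUFDDevice *idev);
} PCIIOMMUOps;

typedef struct PCIBus {
    const PCIIOMMUOps *iommu_ops;
    void *iommu_opaque;
    int bypass_iommu;
} PCIBus;

/*
 * Hypothetical dispatcher: every pointer on the way to the optional
 * callback is checked, and a missing callback is treated as success.
 */
static int set_iommu_device_checked(PCIBus *bus, PCIBus *iommu_bus,
                                    int devfn, IOMMUFDDevice *idev)
{
    assert(bus != NULL);
    if (bus->bypass_iommu) {
        return 0;
    }
    if (iommu_bus == NULL || iommu_bus->iommu_ops == NULL ||
        iommu_bus->iommu_ops->set_iommu_device == NULL) {
        return 0;
    }
    return iommu_bus->iommu_ops->set_iommu_device(iommu_bus->iommu_opaque,
                                                  devfn, idev);
}

/* Example callback for a vIOMMU that refuses every device. */
static int reject_all(void *opaque, int devfn, IOMMUFDDevice *idev)
{
    (void)opaque;
    (void)devfn;
    (void)idev;
    return -1;
}
```

If the NULL check on iommu_bus is needed here, the same reasoning would
seem to apply to pci_device_iommu_address_space(); if it is not, it could
be dropped from both new routines.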


Thanks,

C.


> +        iommu_bus->iommu_ops && iommu_bus->iommu_ops->set_iommu_device) {
> +        return iommu_bus->iommu_ops->set_iommu_device(pci_get_bus(dev),
> +                                                      iommu_bus->iommu_opaque,
> +                                                      dev->devfn, idev, errp);
> +    }
> +    return 0;
> +}
> +
> +void pci_device_unset_iommu_device(PCIDevice *dev)
> +{
> +    PCIBus *bus;
> +    PCIBus *iommu_bus;
> +    uint8_t devfn;
> +
> +    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
> +    if (!pci_bus_bypass_iommu(bus) && iommu_bus &&
> +        iommu_bus->iommu_ops && iommu_bus->iommu_ops->unset_iommu_device) {
> +        return iommu_bus->iommu_ops->unset_iommu_device(pci_get_bus(dev),
> +                                                        iommu_bus->iommu_opaque,
> +                                                        dev->devfn);
> +    }
> +}
> +
>   void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *ops, void *opaque)
>   {
>       /*




* Re: [PATCH rfcv1 3/6] intel_iommu: add set/unset_iommu_device callback
  2024-01-15 10:13 ` [PATCH rfcv1 3/6] intel_iommu: add set/unset_iommu_device callback Zhenzhong Duan
  2024-01-17 15:44   ` Eric Auger
@ 2024-01-22 17:09   ` Cédric Le Goater
  2024-01-23  9:46     ` Duan, Zhenzhong
  1 sibling, 1 reply; 46+ messages in thread
From: Cédric Le Goater @ 2024-01-22 17:09 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Yi Sun, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

On 1/15/24 11:13, Zhenzhong Duan wrote:
> From: Yi Liu <yi.l.liu@intel.com>
> 
> This adds set/unset_iommu_device() implementation in Intel vIOMMU.
> In set call, IOMMUFDDevice is recorded in hash table indexed by
> PCI BDF.
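
As an illustration of the bookkeeping described here, a rough sketch of a
table keyed by (bus, devfn) with the lock released on every exit path,
including the duplicate-entry error. It deliberately swaps in a fixed
array for the GHashTable and a plain counter for the IOMMU mutex so that
it stays self-contained; Table, Entry and the table_*() helpers are
hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DEVS 8

typedef struct Entry {
    const void *bus;    /* stands in for PCIBus * */
    uint8_t devfn;
    void *idev;         /* stands in for IOMMUFDDevice * */
    int used;
} Entry;

typedef struct Table {
    Entry slots[MAX_DEVS];
    int lock_depth;     /* stand-in for the IOMMU mutex */
} Table;

static void table_lock(Table *t)   { t->lock_depth++; }
static void table_unlock(Table *t) { t->lock_depth--; }

/* Lookup by (bus, devfn); the caller must hold the lock. */
static Entry *table_find(Table *t, const void *bus, uint8_t devfn)
{
    assert(t->lock_depth > 0);
    for (int i = 0; i < MAX_DEVS; i++) {
        Entry *e = &t->slots[i];
        if (e->used && e->bus == bus && e->devfn == devfn) {
            return e;
        }
    }
    return NULL;
}

/* Returns 0 on success, -1 if the key already exists or the table is full. */
static int table_set(Table *t, const void *bus, uint8_t devfn, void *idev)
{
    int ret = -1;

    table_lock(t);
    if (table_find(t, bus, devfn) == NULL) {
        for (int i = 0; i < MAX_DEVS; i++) {
            if (!t->slots[i].used) {
                t->slots[i] = (Entry){ bus, devfn, idev, 1 };
                ret = 0;
                break;
            }
        }
    }
    table_unlock(t);    /* single exit: error paths unlock too */
    return ret;
}

/* Removing an unknown key is a no-op. */
static void table_unset(Table *t, const void *bus, uint8_t devfn)
{
    table_lock(t);
    Entry *e = table_find(t, bus, devfn);
    if (e != NULL) {
        e->used = 0;
    }
    table_unlock(t);
}
```

Funnelling every return through one unlock keeps the error path and the
success path symmetric, which is easier to audit than early returns
inside the locked region.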
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   include/hw/i386/intel_iommu.h | 10 +++++
>   hw/i386/intel_iommu.c         | 79 +++++++++++++++++++++++++++++++++++
>   2 files changed, 89 insertions(+)
> 
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 7fa0a695c8..c65fdde56f 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -62,6 +62,7 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
>   typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
>   typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
>   typedef struct VTDPASIDEntry VTDPASIDEntry;
> +typedef struct VTDIOMMUFDDevice VTDIOMMUFDDevice;
>   
>   /* Context-Entry */
>   struct VTDContextEntry {
> @@ -148,6 +149,13 @@ struct VTDAddressSpace {
>       IOVATree *iova_tree;
>   };
>   
> +struct VTDIOMMUFDDevice {
> +    PCIBus *bus;
> +    uint8_t devfn;
> +    IOMMUFDDevice *idev;
> +    IntelIOMMUState *iommu_state;
> +};

Does the VTDIOMMUFDDevice definition need to be public ?

>   struct VTDIOTLBEntry {
>       uint64_t gfn;
>       uint16_t domain_id;
> @@ -292,6 +300,8 @@ struct IntelIOMMUState {
>       /* list of registered notifiers */
>       QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
>   
> +    GHashTable *vtd_iommufd_dev;             /* VTDIOMMUFDDevice */
> +
>       /* interrupt remapping */
>       bool intr_enabled;              /* Whether guest enabled IR */
>       dma_addr_t intr_root;           /* Interrupt remapping table pointer */
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index ed5677c0ae..95faf697eb 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -237,6 +237,13 @@ static gboolean vtd_as_equal(gconstpointer v1, gconstpointer v2)
>              (key1->pasid == key2->pasid);
>   }
>   
> +static gboolean vtd_as_idev_equal(gconstpointer v1, gconstpointer v2)
> +{
> +    const struct vtd_as_key *key1 = v1;
> +    const struct vtd_as_key *key2 = v2;
> +
> +    return (key1->bus == key2->bus) && (key1->devfn == key2->devfn);
> +}
>   /*
>    * Note that we use pointer to PCIBus as the key, so hashing/shifting
>    * based on the pointer value is intended. Note that we deal with
> @@ -3812,6 +3819,74 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
>       return vtd_dev_as;
>   }
>   
> +static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int32_t devfn,
> +                                    IOMMUFDDevice *idev, Error **errp)
> +{
> +    IntelIOMMUState *s = opaque;
> +    VTDIOMMUFDDevice *vtd_idev;
> +    struct vtd_as_key key = {
> +        .bus = bus,
> +        .devfn = devfn,
> +    };
> +    struct vtd_as_key *new_key;
> +
> +    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);

Can we move the assert earlier in the call stack ?
pci_device_get_iommu_bus_devfn() looks like a good place.
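
Roughly what I have in mind, as a sketch only: validate the value once in
the common helper and hand a narrowed uint8_t to the callbacks. The macro
values mirror the ones in pci.h (32 slots, 8 functions); checked_devfn(),
devfn_slot() and devfn_func() are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_SLOT_MAX   32
#define PCI_FUNC_MAX   8
#define PCI_DEVFN_MAX  (PCI_SLOT_MAX * PCI_FUNC_MAX)    /* 256 */

/* Hypothetical helper: range-check once, then return the narrowed value. */
static uint8_t checked_devfn(int32_t devfn)
{
    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
    return (uint8_t)devfn;
}

/* devfn packs the slot in bits 7:3 and the function in bits 2:0. */
static unsigned devfn_slot(uint8_t devfn) { return (devfn >> 3) & 0x1f; }
static unsigned devfn_func(uint8_t devfn) { return devfn & 0x07; }
```

With the check done in one place, the per-vIOMMU callbacks would not each
need their own assert.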

> +
> +    /* None IOMMUFD case */
> +    if (!idev) {
> +        return 0;
> +    }

Can we move this test into the helper? (Looks like an error to me.)


Thanks,

C.


> +
> +    vtd_iommu_lock(s);
> +
> +    vtd_idev = g_hash_table_lookup(s->vtd_iommufd_dev, &key);
> +
> +    if (vtd_idev) {
> +        error_setg(errp, "IOMMUFD device already exist");
> +        return -1;
> +    }
> +
> +    new_key = g_malloc(sizeof(*new_key));
> +    new_key->bus = bus;
> +    new_key->devfn = devfn;
> +
> +    vtd_idev = g_malloc0(sizeof(VTDIOMMUFDDevice));
> +    vtd_idev->bus = bus;
> +    vtd_idev->devfn = (uint8_t)devfn;
> +    vtd_idev->iommu_state = s;
> +    vtd_idev->idev = idev;
> +
> +    g_hash_table_insert(s->vtd_iommufd_dev, new_key, vtd_idev);
> +
> +    vtd_iommu_unlock(s);
> +
> +    return 0;
> +}
> +
> +static void vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int32_t devfn)
> +{
> +    IntelIOMMUState *s = opaque;
> +    VTDIOMMUFDDevice *vtd_idev;
> +    struct vtd_as_key key = {
> +        .bus = bus,
> +        .devfn = devfn,
> +    };
> +
> +    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
> +
> +    vtd_iommu_lock(s);
> +
> +    vtd_idev = g_hash_table_lookup(s->vtd_iommufd_dev, &key);
> +    if (!vtd_idev) {
> +        vtd_iommu_unlock(s);
> +        return;
> +    }
> +
> +    g_hash_table_remove(s->vtd_iommufd_dev, &key);
> +
> +    vtd_iommu_unlock(s);
> +}
> +
>   /* Unmap the whole range in the notifier's scope. */
>   static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>   {
> @@ -4107,6 +4182,8 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>   
>   static PCIIOMMUOps vtd_iommu_ops = {
>       .get_address_space = vtd_host_dma_iommu,
> +    .set_iommu_device = vtd_dev_set_iommu_device,
> +    .unset_iommu_device = vtd_dev_unset_iommu_device,
>   };
>   
>   static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> @@ -4230,6 +4307,8 @@ static void vtd_realize(DeviceState *dev, Error **errp)
>                                        g_free, g_free);
>       s->vtd_address_spaces = g_hash_table_new_full(vtd_as_hash, vtd_as_equal,
>                                         g_free, g_free);
> +    s->vtd_iommufd_dev = g_hash_table_new_full(vtd_as_hash, vtd_as_idev_equal,
> +                                               g_free, g_free);
>       vtd_init(s);
>       pci_setup_iommu(bus, &vtd_iommu_ops, dev);
>       /* Pseudo address space under root PCI bus. */




* Re: [PATCH rfcv1 4/6] vfio: initialize IOMMUFDDevice and pass to vIOMMU
  2024-01-15 10:13 ` [PATCH rfcv1 4/6] vfio: initialize IOMMUFDDevice and pass to vIOMMU Zhenzhong Duan
  2024-01-17 15:37   ` Joao Martins
  2024-01-17 17:30   ` Eric Auger
@ 2024-01-22 17:15   ` Cédric Le Goater
  2024-01-23  9:46     ` Duan, Zhenzhong
  2 siblings, 1 reply; 46+ messages in thread
From: Cédric Le Goater @ 2024-01-22 17:15 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Yi Sun

On 1/15/24 11:13, Zhenzhong Duan wrote:
> Initialize IOMMUFDDevice in vfio and pass to vIOMMU, so that vIOMMU
> could get hw IOMMU information.
> 
> In VFIO legacy backend mode, we still pass a NULL IOMMUFDDevice to vIOMMU,
> in case vIOMMU needs some processing for VFIO legacy backend mode.
> 
> Originally-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   include/hw/vfio/vfio-common.h |  2 ++
>   hw/vfio/iommufd.c             |  2 ++
>   hw/vfio/pci.c                 | 24 +++++++++++++++++++-----
>   3 files changed, 23 insertions(+), 5 deletions(-)
> 
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 9b7ef7d02b..fde0d0ca60 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -31,6 +31,7 @@
>   #endif
>   #include "sysemu/sysemu.h"
>   #include "hw/vfio/vfio-container-base.h"
> +#include "sysemu/iommufd_device.h"
>   
>   #define VFIO_MSG_PREFIX "vfio %s: "
>   
> @@ -126,6 +127,7 @@ typedef struct VFIODevice {
>       bool dirty_tracking;
>       int devid;
>       IOMMUFDBackend *iommufd;
> +    IOMMUFDDevice idev;
>   } VFIODevice;
>   
>   struct VFIODeviceOps {
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 9bfddc1360..cbd035f148 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -309,6 +309,7 @@ static int iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>       VFIOContainerBase *bcontainer;
>       VFIOIOMMUFDContainer *container;
>       VFIOAddressSpace *space;
> +    IOMMUFDDevice *idev = &vbasedev->idev;
>       struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>       int ret, devfd;
>       uint32_t ioas_id;
> @@ -428,6 +429,7 @@ found_container:
>       QLIST_INSERT_HEAD(&bcontainer->device_list, vbasedev, container_next);
>       QLIST_INSERT_HEAD(&vfio_device_list, vbasedev, global_next);
>   
> +    iommufd_device_init(idev, sizeof(*idev), container->be, vbasedev->devid);
>       trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev->num_irqs,
>                                      vbasedev->num_regions, vbasedev->flags);
>       return 0;
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index d7fe06715c..2c3a5d267b 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3107,11 +3107,21 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>   
>       vfio_bars_register(vdev);
>   
> -    ret = vfio_add_capabilities(vdev, errp);
> +    if (vbasedev->iommufd) {
> +        ret = pci_device_set_iommu_device(pdev, &vbasedev->idev, errp);
> +    } else {
> +        ret = pci_device_set_iommu_device(pdev, 0, errp);


AFAICT, pci_device_set_iommu_device() with a NULL IOMMUFDDevice will do
nothing. Why call it ?
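
An alternative at the call site, sketched with stubs and assuming no
vIOMMU actually needs to be told about legacy-backend devices:
VFIODeviceStub, stub_set_iommu_device() and register_with_viommu() are
hypothetical stand-ins, and the counter exists only to make the behaviour
visible.

```c
#include <assert.h>
#include <stddef.h>

typedef struct IOMMUFDDevice {
    unsigned dev_id;
} IOMMUFDDevice;

typedef struct VFIODeviceStub {
    int has_iommufd;    /* stands in for vbasedev->iommufd != NULL */
    IOMMUFDDevice idev;
} VFIODeviceStub;

static int set_calls;   /* counts registrations, for illustration only */

/* Stub for pci_device_set_iommu_device(); always succeeds here. */
static int stub_set_iommu_device(IOMMUFDDevice *idev)
{
    assert(idev != NULL);
    set_calls++;
    return 0;
}

/* Only hand a device to the vIOMMU when there is an iommufd object. */
static int register_with_viommu(VFIODeviceStub *vdev)
{
    assert(vdev != NULL);
    if (!vdev->has_iommufd) {
        return 0;       /* legacy backend: nothing to pass */
    }
    return stub_set_iommu_device(&vdev->idev);
}
```

Since unset is a no-op for a device that was never registered, the exit
path would not need a matching condition.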


Thanks,

C.



> +    }
>       if (ret) {
> +        error_prepend(errp, "Failed to set iommu_device: ");
>           goto out_teardown;
>       }
>   
> +    ret = vfio_add_capabilities(vdev, errp);
> +    if (ret) {
> +        goto out_unset_idev;
> +    }
> +
>       if (vdev->vga) {
>           vfio_vga_quirk_setup(vdev);
>       }
> @@ -3128,7 +3138,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>               error_setg(errp,
>                          "cannot support IGD OpRegion feature on hotplugged "
>                          "device");
> -            goto out_teardown;
> +            goto out_unset_idev;
>           }
>   
>           ret = vfio_get_dev_region_info(vbasedev,
> @@ -3137,13 +3147,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>           if (ret) {
>               error_setg_errno(errp, -ret,
>                                "does not support requested IGD OpRegion feature");
> -            goto out_teardown;
> +            goto out_unset_idev;
>           }
>   
>           ret = vfio_pci_igd_opregion_init(vdev, opregion, errp);
>           g_free(opregion);
>           if (ret) {
> -            goto out_teardown;
> +            goto out_unset_idev;
>           }
>       }
>   
> @@ -3229,6 +3239,8 @@ out_deregister:
>       if (vdev->intx.mmap_timer) {
>           timer_free(vdev->intx.mmap_timer);
>       }
> +out_unset_idev:
> +    pci_device_unset_iommu_device(pdev);
>   out_teardown:
>       vfio_teardown_msi(vdev);
>       vfio_bars_exit(vdev);
> @@ -3257,6 +3269,7 @@ static void vfio_instance_finalize(Object *obj)
>   static void vfio_exitfn(PCIDevice *pdev)
>   {
>       VFIOPCIDevice *vdev = VFIO_PCI(pdev);
> +    VFIODevice *vbasedev = &vdev->vbasedev;
>   
>       vfio_unregister_req_notifier(vdev);
>       vfio_unregister_err_notifier(vdev);
> @@ -3271,7 +3284,8 @@ static void vfio_exitfn(PCIDevice *pdev)
>       vfio_teardown_msi(vdev);
>       vfio_pci_disable_rp_atomics(vdev);
>       vfio_bars_exit(vdev);
> -    vfio_migration_exit(&vdev->vbasedev);
> +    vfio_migration_exit(vbasedev);
> +    pci_device_unset_iommu_device(pdev);
>   }
>   
>   static void vfio_pci_reset(DeviceState *dev)




* RE: [PATCH rfcv1 1/6] backends/iommufd_device: introduce IOMMUFDDevice
  2024-01-22 16:25       ` Cédric Le Goater
@ 2024-01-23  5:51         ` Duan, Zhenzhong
  0 siblings, 0 replies; 46+ messages in thread
From: Duan, Zhenzhong @ 2024-01-23  5:51 UTC (permalink / raw)
  To: Cédric Le Goater, eric.auger, qemu-devel
  Cc: alex.williamson, peterx, jasowang, mst, jgg, nicolinc,
	joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P,
	Yi Sun



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH rfcv1 1/6] backends/iommufd_device: introduce
>IOMMUFDDevice
>
>On 1/19/24 08:31, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Eric Auger <eric.auger@redhat.com>
>>> Subject: Re: [PATCH rfcv1 1/6] backends/iommufd_device: introduce
>>> IOMMUFDDevice
>>>
>>>
>>>
>>> On 1/15/24 11:13, Zhenzhong Duan wrote:
>>>> IOMMUFDDevice represents a device in iommufd and can be used as
>>>> a communication interface between devices (i.e., VFIO, VDPA) and
>>>> vIOMMU.
>>>>
>>>> Currently it includes iommufd handler and device id information
>>>> which could be used by vIOMMU to get hw IOMMU information.
>>>>
>>>> In future nested translation support, vIOMMU is going to have
>>>> more iommufd related operations like allocate hwpt for a device,
>>>> attach/detach hwpt, etc. So IOMMUFDDevice will be further expanded.
>>>>
>>>> IOMMUFDDevice is willingly not a QOM object because we don't want
>>>> it to be visible from the user interface.
>>>>
>>>> Introduce a helper iommufd_device_init to initialize IOMMUFDDevice.
>>>>
>>>> Originally-by: Yi Liu <yi.l.liu@intel.com>
>>>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>> ---
>>>>   MAINTAINERS                     |  4 +--
>>>>   include/sysemu/iommufd_device.h | 31 ++++++++++++++++++++
>>>>   backends/iommufd_device.c       | 50 +++++++++++++++++++++++++++++++++
>>> Maybe there is still time to move the iommufd files into a separate dir,
>>> under hw at the same level as vfio.
>>>
>>> Thoughts?
>>
>> Any reason for the move? The hw dir contains entries that emulate different
>> devices. Iommufd is not a real device; it's more of a backend.
>
>I would include the new services in the existing iommufd .[ch] files.
>No need for a new file since the changes are all related to the IOMMUFD
>device usage.

Makes sense, will do.

Thanks
Zhenzhong


* RE: [PATCH rfcv1 2/6] hw/pci: introduce pci_device_set/unset_iommu_device()
  2024-01-22 16:55   ` Cédric Le Goater
@ 2024-01-23  6:37     ` Duan, Zhenzhong
  2024-01-23  7:40       ` Cédric Le Goater
  0 siblings, 1 reply; 46+ messages in thread
From: Duan, Zhenzhong @ 2024-01-23  6:37 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: alex.williamson, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y,
	Peng, Chao P, Yi Sun, Marcel Apfelbaum



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH rfcv1 2/6] hw/pci: introduce
>pci_device_set/unset_iommu_device()
>
>On 1/15/24 11:13, Zhenzhong Duan wrote:
>> From: Yi Liu <yi.l.liu@intel.com>
>>
>> This adds pci_device_set/unset_iommu_device() to set/unset
>> IOMMUFDDevice for a given PCIe device. Caller of set
>> should fail if set operation fails.
>>
>> Extract out pci_device_get_iommu_bus_devfn() to facilitate
>> implementation of pci_device_set/unset_iommu_device().
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>   include/hw/pci/pci.h | 39 ++++++++++++++++++++++++++++++++++-
>>   hw/pci/pci.c         | 49 +++++++++++++++++++++++++++++++++++++++++++-
>>   2 files changed, 86 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>> index fa6313aabc..a810c0ec74 100644
>> --- a/include/hw/pci/pci.h
>> +++ b/include/hw/pci/pci.h
>> @@ -7,6 +7,8 @@
>>   /* PCI includes legacy ISA access.  */
>>   #include "hw/isa/isa.h"
>>
>> +#include "sysemu/iommufd_device.h"
>> +
>>   extern bool pci_available;
>>
>>   /* PCI bus */
>> @@ -384,10 +386,45 @@ typedef struct PCIIOMMUOps {
>>        *
>>        * @devfn: device and function number
>>        */
>> -   AddressSpace * (*get_address_space)(PCIBus *bus, void *opaque, int
>devfn);
>> +    AddressSpace * (*get_address_space)(PCIBus *bus, void *opaque, int
>devfn);
>> +    /**
>> +     * @set_iommu_device: set iommufd device for a PCI device to
>vIOMMU
>> +     *
>> +     * Optional callback, if not implemented in vIOMMU, then vIOMMU
>can't
>> +     * utilize iommufd specific features.
>> +     *
>> +     * Return true if iommufd device is accepted, or else return false with
>> +     * errp set.
>> +     *
>> +     * @bus: the #PCIBus of the PCI device.
>> +     *
>> +     * @opaque: the data passed to pci_setup_iommu().
>> +     *
>> +     * @devfn: device and function number of the PCI device.
>> +     *
>> +     * @idev: the data structure representing iommufd device.
>> +     *
>> +     */
>> +    int (*set_iommu_device)(PCIBus *bus, void *opaque, int32_t devfn,
>> +                            IOMMUFDDevice *idev, Error **errp);
>> +    /**
>> +     * @unset_iommu_device: unset iommufd device for a PCI device from
>vIOMMU
>> +     *
>> +     * Optional callback.
>> +     *
>> +     * @bus: the #PCIBus of the PCI device.
>> +     *
>> +     * @opaque: the data passed to pci_setup_iommu().
>> +     *
>> +     * @devfn: device and function number of the PCI device.
>> +     */
>> +    void (*unset_iommu_device)(PCIBus *bus, void *opaque, int32_t
>devfn);
>>   } PCIIOMMUOps;
>>
>>   AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
>> +int pci_device_set_iommu_device(PCIDevice *dev, IOMMUFDDevice
>*idev,
>> +                                Error **errp);
>> +void pci_device_unset_iommu_device(PCIDevice *dev);
>>
>>   /**
>>    * pci_setup_iommu: Initialize specific IOMMU handlers for a PCIBus
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index 76080af580..3848662f95 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -2672,7 +2672,10 @@ static void
>pci_device_class_base_init(ObjectClass *klass, void *data)
>>       }
>>   }
>>
>> -AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>> +static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
>> +                                           PCIBus **aliased_pbus,
>> +                                           PCIBus **piommu_bus,
>> +                                           uint8_t *aliased_pdevfn)
>>   {
>>       PCIBus *bus = pci_get_bus(dev);
>>       PCIBus *iommu_bus = bus;
>> @@ -2717,6 +2720,18 @@ AddressSpace
>*pci_device_iommu_address_space(PCIDevice *dev)
>>
>>           iommu_bus = parent_bus;
>>       }
>> +    *aliased_pbus = bus;
>> +    *piommu_bus = iommu_bus;
>> +    *aliased_pdevfn = devfn;
>> +}
>> +
>> +AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>> +{
>> +    PCIBus *bus;
>> +    PCIBus *iommu_bus;
>> +    uint8_t devfn;
>> +
>> +    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
>>       if (!pci_bus_bypass_iommu(bus) && iommu_bus->iommu_ops) {
>>           return iommu_bus->iommu_ops->get_address_space(bus,
>>                                    iommu_bus->iommu_opaque, devfn);
>> @@ -2724,6 +2739,38 @@ AddressSpace
>*pci_device_iommu_address_space(PCIDevice *dev)
>>       return &address_space_memory;
>>   }
>>
>> +int pci_device_set_iommu_device(PCIDevice *dev, IOMMUFDDevice
>*idev,
>> +                                Error **errp)
>> +{
>> +    PCIBus *bus;
>> +    PCIBus *iommu_bus;
>> +    uint8_t devfn;
>> +
>> +    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
>> +    if (!pci_bus_bypass_iommu(bus) && iommu_bus &&
>
>Why do we test iommu_bus in pci_device_un/set_iommu_device routines
>and
>not in pci_device_iommu_address_space() ?

The iommu_bus check in pci_device_iommu_address_space() was dropped in the
commit below. I didn't find any related discussion in the mail history, so
maybe it was by accident? I can add it back if it wasn't intentional.

ba7d12eb8c  hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps

Thanks
Zhenzhong


* Re: [PATCH rfcv1 2/6] hw/pci: introduce pci_device_set/unset_iommu_device()
  2024-01-23  6:37     ` Duan, Zhenzhong
@ 2024-01-23  7:40       ` Cédric Le Goater
  2024-01-23  9:25         ` Duan, Zhenzhong
  0 siblings, 1 reply; 46+ messages in thread
From: Cédric Le Goater @ 2024-01-23  7:40 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel
  Cc: alex.williamson, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y,
	Peng, Chao P, Yi Sun, Marcel Apfelbaum

On 1/23/24 07:37, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Cédric Le Goater <clg@redhat.com>
>> Subject: Re: [PATCH rfcv1 2/6] hw/pci: introduce
>> pci_device_set/unset_iommu_device()
>>
>> On 1/15/24 11:13, Zhenzhong Duan wrote:
>>> From: Yi Liu <yi.l.liu@intel.com>
>>>
>>> This adds pci_device_set/unset_iommu_device() to set/unset
>>> IOMMUFDDevice for a given PCIe device. Caller of set
>>> should fail if set operation fails.
>>>
>>> Extract out pci_device_get_iommu_bus_devfn() to facilitate
>>> implementation of pci_device_set/unset_iommu_device().
>>>
>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>>> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> ---
>>>    include/hw/pci/pci.h | 39 ++++++++++++++++++++++++++++++++++-
>>>    hw/pci/pci.c         | 49
>> +++++++++++++++++++++++++++++++++++++++++++-
>>>    2 files changed, 86 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>>> index fa6313aabc..a810c0ec74 100644
>>> --- a/include/hw/pci/pci.h
>>> +++ b/include/hw/pci/pci.h
>>> @@ -7,6 +7,8 @@
>>>    /* PCI includes legacy ISA access.  */
>>>    #include "hw/isa/isa.h"
>>>
>>> +#include "sysemu/iommufd_device.h"
>>> +
>>>    extern bool pci_available;
>>>
>>>    /* PCI bus */
>>> @@ -384,10 +386,45 @@ typedef struct PCIIOMMUOps {
>>>         *
>>>         * @devfn: device and function number
>>>         */
>>> -   AddressSpace * (*get_address_space)(PCIBus *bus, void *opaque, int
>> devfn);
>>> +    AddressSpace * (*get_address_space)(PCIBus *bus, void *opaque, int
>> devfn);
>>> +    /**
>>> +     * @set_iommu_device: set iommufd device for a PCI device to
>> vIOMMU
>>> +     *
>>> +     * Optional callback, if not implemented in vIOMMU, then vIOMMU
>> can't
>>> +     * utilize iommufd specific features.
>>> +     *
>>> +     * Return true if iommufd device is accepted, or else return false with
>>> +     * errp set.
>>> +     *
>>> +     * @bus: the #PCIBus of the PCI device.
>>> +     *
>>> +     * @opaque: the data passed to pci_setup_iommu().
>>> +     *
>>> +     * @devfn: device and function number of the PCI device.
>>> +     *
>>> +     * @idev: the data structure representing iommufd device.
>>> +     *
>>> +     */
>>> +    int (*set_iommu_device)(PCIBus *bus, void *opaque, int32_t devfn,
>>> +                            IOMMUFDDevice *idev, Error **errp);
>>> +    /**
>>> +     * @unset_iommu_device: unset iommufd device for a PCI device from
>> vIOMMU
>>> +     *
>>> +     * Optional callback.
>>> +     *
>>> +     * @bus: the #PCIBus of the PCI device.
>>> +     *
>>> +     * @opaque: the data passed to pci_setup_iommu().
>>> +     *
>>> +     * @devfn: device and function number of the PCI device.
>>> +     */
>>> +    void (*unset_iommu_device)(PCIBus *bus, void *opaque, int32_t
>> devfn);
>>>    } PCIIOMMUOps;
>>>
>>>    AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
>>> +int pci_device_set_iommu_device(PCIDevice *dev, IOMMUFDDevice
>> *idev,
>>> +                                Error **errp);
>>> +void pci_device_unset_iommu_device(PCIDevice *dev);
>>>
>>>    /**
>>>     * pci_setup_iommu: Initialize specific IOMMU handlers for a PCIBus
>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>> index 76080af580..3848662f95 100644
>>> --- a/hw/pci/pci.c
>>> +++ b/hw/pci/pci.c
>>> @@ -2672,7 +2672,10 @@ static void
>> pci_device_class_base_init(ObjectClass *klass, void *data)
>>>        }
>>>    }
>>>
>>> -AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>>> +static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
>>> +                                           PCIBus **aliased_pbus,
>>> +                                           PCIBus **piommu_bus,
>>> +                                           uint8_t *aliased_pdevfn)
>>>    {
>>>        PCIBus *bus = pci_get_bus(dev);
>>>        PCIBus *iommu_bus = bus;
>>> @@ -2717,6 +2720,18 @@ AddressSpace
>> *pci_device_iommu_address_space(PCIDevice *dev)
>>>
>>>            iommu_bus = parent_bus;
>>>        }
>>> +    *aliased_pbus = bus;
>>> +    *piommu_bus = iommu_bus;
>>> +    *aliased_pdevfn = devfn;
>>> +}
>>> +
>>> +AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>>> +{
>>> +    PCIBus *bus;
>>> +    PCIBus *iommu_bus;
>>> +    uint8_t devfn;
>>> +
>>> +    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
>>>        if (!pci_bus_bypass_iommu(bus) && iommu_bus->iommu_ops) {
>>>            return iommu_bus->iommu_ops->get_address_space(bus,
>>>                                     iommu_bus->iommu_opaque, devfn);
>>> @@ -2724,6 +2739,38 @@ AddressSpace
>> *pci_device_iommu_address_space(PCIDevice *dev)
>>>        return &address_space_memory;
>>>    }
>>>
>>> +int pci_device_set_iommu_device(PCIDevice *dev, IOMMUFDDevice
>> *idev,
>>> +                                Error **errp)
>>> +{
>>> +    PCIBus *bus;
>>> +    PCIBus *iommu_bus;
>>> +    uint8_t devfn;
>>> +
>>> +    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
>>> +    if (!pci_bus_bypass_iommu(bus) && iommu_bus &&
>>
>> Why do we test iommu_bus in pci_device_un/set_iommu_device routines
>> and
>> not in pci_device_iommu_address_space() ?
> 
> iommu_bus check in pci_device_iommu_address_space() is dropped in
> below commit, I didn't find related discussion in mail history, maybe
> by accident? I can add it back if it's not intentional.

Can iommu_bus be NULL or should we add an assert ?

C.

> 
> ba7d12eb8c  hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps
> 
> Thanks
> Zhenzhong




* Re: [PATCH rfcv1 6/6] intel_iommu: add a framework to check and sync host IOMMU cap/ecap
  2024-01-15 10:13 ` [PATCH rfcv1 6/6] intel_iommu: add a framework to check and sync host IOMMU cap/ecap Zhenzhong Duan
  2024-01-17 17:56   ` Eric Auger
@ 2024-01-23  8:39   ` Cédric Le Goater
  2024-01-23 10:01     ` Duan, Zhenzhong
  1 sibling, 1 reply; 46+ messages in thread
From: Cédric Le Goater @ 2024-01-23  8:39 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Yi Sun, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, Marcel Apfelbaum, Vivek Kasireddy

On 1/15/24 11:13, Zhenzhong Duan wrote:
> From: Yi Liu <yi.l.liu@intel.com>
> 
> Add a framework to check and synchronize host IOMMU cap/ecap with
> vIOMMU cap/ecap.
> 
> Currently only stage-2 translation is supported which is backed by
> shadow page table on host side. So we don't need exact matching of
> each bit of cap/ecap between vIOMMU and host. However, we can still
> utilize this framework to ensure compatibility of host and vIOMMU's
> address width at least, i.e., vIOMMU's aw_bits <= host aw_bits,
> which is missed before.
> 
> When stage-1 translation is supported in future, a.k.a. scalable
> modern mode, we need to ensure compatibility of each bits. Some
> bits are user controllable, they should be checked with host side
> to ensure compatibility. Other bits are not, they should be synced
> into vIOMMU cap/ecap for compatibility.
> 
> The sequence will be:
> 
> vtd_cap_init() initializes iommu->cap/ecap. ---- vtd_cap_init()
> iommu->host_cap/ecap is initialized as iommu->cap/ecap.  ---- vtd_init()
> iommu->host_cap/ecap is checked and updated some bits with host cap/ecap. ---- vtd_sync_hw_info()
> iommu->cap/ecap is finalized as iommu->host_cap/ecap.  ---- vtd_machine_done_hook()
> 
> iommu->host_cap/ecap is a temporary storage to hold intermediate value
> when synthesize host cap/ecap and vIOMMU's initial configured cap/ecap.


The above "sequence" paragraph is not very clear. The patch may need to
be split further.
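
To make the intended ordering easier to follow, here is a rough, standalone
sketch of the four steps as I read them. CapState and the cap_* helpers are
hypothetical stand-ins for IntelIOMMUState and the vtd_* functions, and the
AND in cap_sync_host() is only a placeholder policy, since the real sync is
still a TODO in this patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for IntelIOMMUState; only the fields discussed here. */
typedef struct {
    uint64_t cap, ecap;            /* values exposed to the guest */
    uint64_t host_cap, host_ecap;  /* staging copy merged with host values */
    bool cap_finalized;
} CapState;

/* Steps 1 and 2: compute the defaults and seed the staging copy. */
static void cap_reset(CapState *s, uint64_t def_cap, uint64_t def_ecap)
{
    if (s->cap_finalized) {
        return;                    /* later resets keep the final values */
    }
    s->cap = def_cap;
    s->ecap = def_ecap;
    s->host_cap = s->cap;
    s->host_ecap = s->ecap;
}

/* Step 3: called per passthrough device; placeholder policy (assumption). */
static void cap_sync_host(CapState *s, uint64_t hw_cap, uint64_t hw_ecap)
{
    s->host_cap &= hw_cap;
    s->host_ecap &= hw_ecap;
}

/* Step 4: machine-done; publish the staging copy and freeze it. */
static void cap_finalize(CapState *s)
{
    s->cap = s->host_cap;
    s->ecap = s->host_ecap;
    s->cap_finalized = true;
}
```

One patch per step, in this order, would probably read better.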



> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   include/hw/i386/intel_iommu.h |  4 ++
>   hw/i386/intel_iommu.c         | 78 +++++++++++++++++++++++++++++++----
>   2 files changed, 75 insertions(+), 7 deletions(-)
> 
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index c65fdde56f..b8abbcce12 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -292,6 +292,9 @@ struct IntelIOMMUState {
>       uint64_t cap;                   /* The value of capability reg */
>       uint64_t ecap;                  /* The value of extended capability reg */
>   
> +    uint64_t host_cap;              /* The value of host capability reg */
> +    uint64_t host_ecap;             /* The value of host ext-capability reg */
> +
>       uint32_t context_cache_gen;     /* Should be in [1,MAX] */
>       GHashTable *iotlb;              /* IOTLB */
>   
> @@ -314,6 +317,7 @@ struct IntelIOMMUState {
>       bool dma_translation;           /* Whether DMA translation supported */
>       bool pasid;                     /* Whether to support PASID */
>   
> +    bool cap_finalized;             /* Whether VTD capability finalized */
>       /*
>        * Protects IOMMU states in general.  Currently it protects the
>        * per-IOMMU IOTLB cache, and context entry cache in VTDAddressSpace.
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 4c1d058ebd..be03fcbf52 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -3819,6 +3819,47 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
>       return vtd_dev_as;
>   }
>   
> +static bool vtd_sync_hw_info(IntelIOMMUState *s, struct iommu_hw_info_vtd *vtd,
> +                             Error **errp)
> +{
> +    uint64_t addr_width;
> +
> +    addr_width = (vtd->cap_reg >> 16) & 0x3fULL;

Vivek uses the same kind of macro in:

   https://lore.kernel.org/qemu-devel/20240118192049.1796763-1-vivek.kasireddy@intel.com/

What about the + 1 ? Looks like it's missing here, according to 11.4.2
Capability Register.

Could we introduce a common macro in intel_iommu_internal.h ?
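
Something along these lines, as a rough sketch only. The macro and function
names are made up for illustration and are not taken from the tree; the field
position (bits 21:16) and the "value + 1" encoding follow my reading of the
MGAW description in the spec:

```c
#include <stdint.h>

/* Hypothetical names, not existing intel_iommu_internal.h definitions. */
#define VTD_CAP_MGAW_SHIFT  16
#define VTD_CAP_MGAW_MASK   0x3fULL

/* MGAW is encoded as (width - 1), so add 1 to get the width in bits. */
static inline uint8_t vtd_cap_to_aw_bits(uint64_t cap)
{
    return (uint8_t)(((cap >> VTD_CAP_MGAW_SHIFT) & VTD_CAP_MGAW_MASK) + 1);
}

/* Returns nonzero when the user-configured width fits in the host width. */
static inline int vtd_aw_bits_fits(uint8_t user_aw_bits, uint64_t host_cap)
{
    return user_aw_bits <= vtd_cap_to_aw_bits(host_cap);
}
```

With a host reporting 47 in that field, a 48-bit vIOMMU would be accepted,
whereas the current code would compare 48 against 47 and reject it.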


> +    if (s->aw_bits > addr_width) {
> +        error_setg(errp, "User aw-bits: %u > host address width: %lu",

I think %lu should be PRIu64, since addr_width is a uint64_t. This is a
general comment: you should avoid %llx, %lx, etc. in the code.
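
For reference, a minimal standalone sketch of the portable form. It uses
snprintf() instead of error_setg() so that it builds outside QEMU, and
format_aw_error() is a hypothetical helper. For an unsigned 64-bit value the
<inttypes.h> macro is PRIu64 (PRId64 is the signed counterpart):

```c
#include <inttypes.h>
#include <stdio.h>

/* Hypothetical helper: formats the message into buf like snprintf(). */
static int format_aw_error(char *buf, size_t len,
                           uint8_t aw_bits, uint64_t addr_width)
{
    /* PRIu64 expands to the right length modifier on every host. */
    return snprintf(buf, len,
                    "User aw-bits: %u > host address width: %" PRIu64,
                    aw_bits, addr_width);
}
```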

> +                   s->aw_bits, addr_width);
> +        return false;
> +    }
> +
> +    /* TODO: check and sync host cap/ecap into vIOMMU cap/ecap */
> +
> +    return true;
> +}
> +
> +/*
> + * virtual VT-d which wants nested needs to check the host IOMMU
> + * nesting cap info behind the assigned devices. Thus that vIOMMU
> + * could bind guest page table to host.
> + */
> +static bool vtd_check_idev(IntelIOMMUState *s, IOMMUFDDevice *idev,
> +                           Error **errp)
> +{
> +    struct iommu_hw_info_vtd vtd;
> +    enum iommu_hw_info_type type = IOMMU_HW_INFO_TYPE_INTEL_VTD;
> +
> +    if (iommufd_device_get_info(idev, &type, sizeof(vtd), &vtd)) {
> +        error_setg(errp, "Failed to get IOMMU capability!!!");
> +        return false;
> +    }
> +
> +    if (type != IOMMU_HW_INFO_TYPE_INTEL_VTD) {
> +        error_setg(errp, "IOMMU hardware is not compatible!!!");
> +        return false;
> +    }
> +
> +    return vtd_sync_hw_info(s, &vtd, errp);
> +}
> +
>   static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int32_t devfn,
>                                       IOMMUFDDevice *idev, Error **errp)
>   {
> @@ -3837,6 +3878,10 @@ static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int32_t devfn,
>           return 0;
>       }
>   
> +    if (!vtd_check_idev(s, idev, errp)) {
> +        return -1;
> +    }
> +
>       vtd_iommu_lock(s);
>   
>       vtd_idev = g_hash_table_lookup(s->vtd_iommufd_dev, &key);
> @@ -4071,10 +4116,11 @@ static void vtd_init(IntelIOMMUState *s)
>   {
>       X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
>   
> -    memset(s->csr, 0, DMAR_REG_SIZE);
> -    memset(s->wmask, 0, DMAR_REG_SIZE);
> -    memset(s->w1cmask, 0, DMAR_REG_SIZE);
> -    memset(s->womask, 0, DMAR_REG_SIZE);
> +    /* CAP/ECAP are initialized in machine create done stage */
> +    memset(s->csr + DMAR_GCMD_REG, 0, DMAR_REG_SIZE - DMAR_GCMD_REG);
> +    memset(s->wmask + DMAR_GCMD_REG, 0, DMAR_REG_SIZE - DMAR_GCMD_REG);
> +    memset(s->w1cmask + DMAR_GCMD_REG, 0, DMAR_REG_SIZE - DMAR_GCMD_REG);
> +    memset(s->womask + DMAR_GCMD_REG, 0, DMAR_REG_SIZE - DMAR_GCMD_REG);
>   
>       s->root = 0;
>       s->root_scalable = false;
> @@ -4110,13 +4156,16 @@ static void vtd_init(IntelIOMMUState *s)

vtd_init() is called from reset and from realize. This is redundant.
reset should be enough.


>           vtd_spte_rsvd_large[3] &= ~VTD_SPTE_SNP;
>       }
>   
> -    vtd_cap_init(s);
> +    if (!s->cap_finalized) {

ok. so this can only be done in reset.

> +        vtd_cap_init(s);
> +        s->host_cap = s->cap;
> +        s->host_ecap = s->ecap;
> +    }
> +
>       vtd_reset_caches(s);
>   
>       /* Define registers with default values and bit semantics */
>       vtd_define_long(s, DMAR_VER_REG, 0x10UL, 0, 0);
> -    vtd_define_quad(s, DMAR_CAP_REG, s->cap, 0, 0);
> -    vtd_define_quad(s, DMAR_ECAP_REG, s->ecap, 0, 0);
>       vtd_define_long(s, DMAR_GCMD_REG, 0, 0xff800000UL, 0);
>       vtd_define_long_wo(s, DMAR_GCMD_REG, 0xff800000UL);
>       vtd_define_long(s, DMAR_GSTS_REG, 0, 0, 0);
> @@ -4241,6 +4290,12 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
>       return true;
>   }
>   
> +static void vtd_setup_capability_reg(IntelIOMMUState *s)
> +{
> +    vtd_define_quad(s, DMAR_CAP_REG, s->cap, 0, 0);
> +    vtd_define_quad(s, DMAR_ECAP_REG, s->ecap, 0, 0);
> +}
> +
>   static int vtd_machine_done_notify_one(Object *child, void *unused)
>   {
>       IntelIOMMUState *iommu = INTEL_IOMMU_DEVICE(x86_iommu_get_default());
> @@ -4259,6 +4314,14 @@ static int vtd_machine_done_notify_one(Object *child, void *unused)
>   
>   static void vtd_machine_done_hook(Notifier *notifier, void *unused)
>   {
> +    IntelIOMMUState *iommu = INTEL_IOMMU_DEVICE(x86_iommu_get_default());
> +
> +    iommu->cap = iommu->host_cap;
> +    iommu->ecap = iommu->host_ecap;
> +    iommu->cap_finalized = true;
> +
> +    vtd_setup_capability_reg(iommu);
> +

This is confusing. Please split the patch better to reflect the ordering of
the e/cap register settings.


Thanks,

C.




>       object_child_foreach_recursive(object_get_root(),
>                                      vtd_machine_done_notify_one, NULL);
>   }
> @@ -4292,6 +4355,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
>   
>       QLIST_INIT(&s->vtd_as_with_notifiers);
>       qemu_mutex_init(&s->iommu_lock);
> +    s->cap_finalized = false;
>       memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
>                             "intel_iommu", DMAR_REG_SIZE);
>       memory_region_add_subregion(get_system_memory(),




* RE: [PATCH rfcv1 2/6] hw/pci: introduce pci_device_set/unset_iommu_device()
  2024-01-23  7:40       ` Cédric Le Goater
@ 2024-01-23  9:25         ` Duan, Zhenzhong
  2024-01-23 10:18           ` Eric Auger
  0 siblings, 1 reply; 46+ messages in thread
From: Duan, Zhenzhong @ 2024-01-23  9:25 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: alex.williamson, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y,
	Peng, Chao P, Yi Sun, Marcel Apfelbaum



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH rfcv1 2/6] hw/pci: introduce
>pci_device_set/unset_iommu_device()
>
>On 1/23/24 07:37, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Cédric Le Goater <clg@redhat.com>
>>> Subject: Re: [PATCH rfcv1 2/6] hw/pci: introduce
>>> pci_device_set/unset_iommu_device()
>>>
>>> On 1/15/24 11:13, Zhenzhong Duan wrote:
>>>> From: Yi Liu <yi.l.liu@intel.com>
>>>>
>>>> This adds pci_device_set/unset_iommu_device() to set/unset
>>>> IOMMUFDDevice for a given PCIe device. Caller of set
>>>> should fail if set operation fails.
>>>>
>>>> Extract out pci_device_get_iommu_bus_devfn() to facilitate
>>>> implementation of pci_device_set/unset_iommu_device().
>>>>
>>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>>>> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>> ---
>>>>    include/hw/pci/pci.h | 39
>++++++++++++++++++++++++++++++++++-
>>>>    hw/pci/pci.c         | 49
>>> +++++++++++++++++++++++++++++++++++++++++++-
>>>>    2 files changed, 86 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>>>> index fa6313aabc..a810c0ec74 100644
>>>> --- a/include/hw/pci/pci.h
>>>> +++ b/include/hw/pci/pci.h
>>>> @@ -7,6 +7,8 @@
>>>>    /* PCI includes legacy ISA access.  */
>>>>    #include "hw/isa/isa.h"
>>>>
>>>> +#include "sysemu/iommufd_device.h"
>>>> +
>>>>    extern bool pci_available;
>>>>
>>>>    /* PCI bus */
>>>> @@ -384,10 +386,45 @@ typedef struct PCIIOMMUOps {
>>>>         *
>>>>         * @devfn: device and function number
>>>>         */
>>>> -   AddressSpace * (*get_address_space)(PCIBus *bus, void *opaque, int
>>> devfn);
>>>> +    AddressSpace * (*get_address_space)(PCIBus *bus, void *opaque,
>int
>>> devfn);
>>>> +    /**
>>>> +     * @set_iommu_device: set iommufd device for a PCI device to
>>> vIOMMU
>>>> +     *
>>>> +     * Optional callback, if not implemented in vIOMMU, then vIOMMU
>>> can't
>>>> +     * utilize iommufd specific features.
>>>> +     *
>>>> +     * Return true if iommufd device is accepted, or else return false with
>>>> +     * errp set.
>>>> +     *
>>>> +     * @bus: the #PCIBus of the PCI device.
>>>> +     *
>>>> +     * @opaque: the data passed to pci_setup_iommu().
>>>> +     *
>>>> +     * @devfn: device and function number of the PCI device.
>>>> +     *
>>>> +     * @idev: the data structure representing iommufd device.
>>>> +     *
>>>> +     */
>>>> +    int (*set_iommu_device)(PCIBus *bus, void *opaque, int32_t devfn,
>>>> +                            IOMMUFDDevice *idev, Error **errp);
>>>> +    /**
>>>> +     * @unset_iommu_device: unset iommufd device for a PCI device
>from
>>> vIOMMU
>>>> +     *
>>>> +     * Optional callback.
>>>> +     *
>>>> +     * @bus: the #PCIBus of the PCI device.
>>>> +     *
>>>> +     * @opaque: the data passed to pci_setup_iommu().
>>>> +     *
>>>> +     * @devfn: device and function number of the PCI device.
>>>> +     */
>>>> +    void (*unset_iommu_device)(PCIBus *bus, void *opaque, int32_t
>>> devfn);
>>>>    } PCIIOMMUOps;
>>>>
>>>>    AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
>>>> +int pci_device_set_iommu_device(PCIDevice *dev, IOMMUFDDevice
>>> *idev,
>>>> +                                Error **errp);
>>>> +void pci_device_unset_iommu_device(PCIDevice *dev);
>>>>
>>>>    /**
>>>>     * pci_setup_iommu: Initialize specific IOMMU handlers for a PCIBus
>>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>>> index 76080af580..3848662f95 100644
>>>> --- a/hw/pci/pci.c
>>>> +++ b/hw/pci/pci.c
>>>> @@ -2672,7 +2672,10 @@ static void
>>> pci_device_class_base_init(ObjectClass *klass, void *data)
>>>>        }
>>>>    }
>>>>
>>>> -AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>>>> +static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
>>>> +                                           PCIBus **aliased_pbus,
>>>> +                                           PCIBus **piommu_bus,
>>>> +                                           uint8_t *aliased_pdevfn)
>>>>    {
>>>>        PCIBus *bus = pci_get_bus(dev);
>>>>        PCIBus *iommu_bus = bus;
>>>> @@ -2717,6 +2720,18 @@ AddressSpace
>>> *pci_device_iommu_address_space(PCIDevice *dev)
>>>>
>>>>            iommu_bus = parent_bus;
>>>>        }
>>>> +    *aliased_pbus = bus;
>>>> +    *piommu_bus = iommu_bus;
>>>> +    *aliased_pdevfn = devfn;
>>>> +}
>>>> +
>>>> +AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>>>> +{
>>>> +    PCIBus *bus;
>>>> +    PCIBus *iommu_bus;
>>>> +    uint8_t devfn;
>>>> +
>>>> +    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus,
>&devfn);
>>>>        if (!pci_bus_bypass_iommu(bus) && iommu_bus->iommu_ops) {
>>>>            return iommu_bus->iommu_ops->get_address_space(bus,
>>>>                                     iommu_bus->iommu_opaque, devfn);
>>>> @@ -2724,6 +2739,38 @@ AddressSpace
>>> *pci_device_iommu_address_space(PCIDevice *dev)
>>>>        return &address_space_memory;
>>>>    }
>>>>
>>>> +int pci_device_set_iommu_device(PCIDevice *dev, IOMMUFDDevice
>>> *idev,
>>>> +                                Error **errp)
>>>> +{
>>>> +    PCIBus *bus;
>>>> +    PCIBus *iommu_bus;
>>>> +    uint8_t devfn;
>>>> +
>>>> +    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus,
>&devfn);
>>>> +    if (!pci_bus_bypass_iommu(bus) && iommu_bus &&
>>>
>>> Why do we test iommu_bus in pci_device_un/set_iommu_device
>routines
>>> and
>>> not in pci_device_iommu_address_space() ?
>>
>> iommu_bus check in pci_device_iommu_address_space() is dropped in
>> below commit, I didn't find related discussion in mail history, maybe
>> by accident? I can add it back if it's not intentional.
>
>Can iommu_bus be NULL or should we add an assert ?

I dug into the history of pci_device_iommu_address_space(); the commit
below added the iommu_bus check.

5af2ae230514  pci: Fix pci_device_iommu_address_space() bus propagation

In theory, the !iommu_bus->parent_dev check takes precedence over the
!iommu_bus check, so we never see a NULL iommu_bus. An assert may be better.
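
A simplified, standalone sketch of that argument. Bus and find_iommu_bus()
are hypothetical stand-ins, and the parent link is collapsed into a direct
pointer rather than going through parent_dev as the real code does:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal model of a bus: optional parent and optional IOMMU ops. */
typedef struct Bus Bus;
struct Bus {
    Bus *parent;            /* NULL on the root bus */
    const void *iommu_ops;  /* non-NULL once an IOMMU is set up */
};

static Bus *find_iommu_bus(Bus *bus)
{
    Bus *iommu_bus = bus;

    /*
     * The walk only moves to a parent that exists, so it stops on the
     * root bus at the latest and never yields NULL for a non-NULL start.
     */
    while (iommu_bus && !iommu_bus->iommu_ops && iommu_bus->parent) {
        iommu_bus = iommu_bus->parent;
    }

    assert(iommu_bus);      /* documents the invariant instead of testing it */
    return iommu_bus;
}
```

In this model the callers only need to test iommu_ops afterwards, which is
what pci_device_iommu_address_space() already does.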

Thanks
Zhenzhong



* RE: [PATCH rfcv1 3/6] intel_iommu: add set/unset_iommu_device callback
  2024-01-22 17:09   ` Cédric Le Goater
@ 2024-01-23  9:46     ` Duan, Zhenzhong
  0 siblings, 0 replies; 46+ messages in thread
From: Duan, Zhenzhong @ 2024-01-23  9:46 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: alex.williamson, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y,
	Peng, Chao P, Yi Sun, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH rfcv1 3/6] intel_iommu: add set/unset_iommu_device
>callback
>
>On 1/15/24 11:13, Zhenzhong Duan wrote:
>> From: Yi Liu <yi.l.liu@intel.com>
>>
>> This adds set/unset_iommu_device() implementation in Intel vIOMMU.
>> In set call, IOMMUFDDevice is recorded in hash table indexed by
>> PCI BDF.
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>   include/hw/i386/intel_iommu.h | 10 +++++
>>   hw/i386/intel_iommu.c         | 79
>+++++++++++++++++++++++++++++++++++
>>   2 files changed, 89 insertions(+)
>>
>> diff --git a/include/hw/i386/intel_iommu.h
>b/include/hw/i386/intel_iommu.h
>> index 7fa0a695c8..c65fdde56f 100644
>> --- a/include/hw/i386/intel_iommu.h
>> +++ b/include/hw/i386/intel_iommu.h
>> @@ -62,6 +62,7 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
>>   typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
>>   typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
>>   typedef struct VTDPASIDEntry VTDPASIDEntry;
>> +typedef struct VTDIOMMUFDDevice VTDIOMMUFDDevice;
>>
>>   /* Context-Entry */
>>   struct VTDContextEntry {
>> @@ -148,6 +149,13 @@ struct VTDAddressSpace {
>>       IOVATree *iova_tree;
>>   };
>>
>> +struct VTDIOMMUFDDevice {
>> +    PCIBus *bus;
>> +    uint8_t devfn;
>> +    IOMMUFDDevice *idev;
>> +    IntelIOMMUState *iommu_state;
>> +};
>
>Does the VTDIOMMUFDDevice definition need to be public ?

No need, I will move it to hw/i386/intel_iommu_internal.h.
It looks like I need to do the same for other definitions in the nesting series.

>
>>   struct VTDIOTLBEntry {
>>       uint64_t gfn;
>>       uint16_t domain_id;
>> @@ -292,6 +300,8 @@ struct IntelIOMMUState {
>>       /* list of registered notifiers */
>>       QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
>>
>> +    GHashTable *vtd_iommufd_dev;             /* VTDIOMMUFDDevice */
>> +
>>       /* interrupt remapping */
>>       bool intr_enabled;              /* Whether guest enabled IR */
>>       dma_addr_t intr_root;           /* Interrupt remapping table pointer */
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index ed5677c0ae..95faf697eb 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -237,6 +237,13 @@ static gboolean vtd_as_equal(gconstpointer v1,
>gconstpointer v2)
>>              (key1->pasid == key2->pasid);
>>   }
>>
>> +static gboolean vtd_as_idev_equal(gconstpointer v1, gconstpointer v2)
>> +{
>> +    const struct vtd_as_key *key1 = v1;
>> +    const struct vtd_as_key *key2 = v2;
>> +
>> +    return (key1->bus == key2->bus) && (key1->devfn == key2->devfn);
>> +}
>>   /*
>>    * Note that we use pointer to PCIBus as the key, so hashing/shifting
>>    * based on the pointer value is intended. Note that we deal with
>> @@ -3812,6 +3819,74 @@ VTDAddressSpace
>*vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
>>       return vtd_dev_as;
>>   }
>>
>> +static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque,
>int32_t devfn,
>> +                                    IOMMUFDDevice *idev, Error **errp)
>> +{
>> +    IntelIOMMUState *s = opaque;
>> +    VTDIOMMUFDDevice *vtd_idev;
>> +    struct vtd_as_key key = {
>> +        .bus = bus,
>> +        .devfn = devfn,
>> +    };
>> +    struct vtd_as_key *new_key;
>> +
>> +    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
>
>Can we move the assert earlier in the call stack ?
>pci_device_get_iommu_bus_devfn() looks like a good place.

Sure.

>
>> +
>> +    /* None IOMMUFD case */
>> +    if (!idev) {
>> +        return 0;
>> +    }
>
>Can we move this test in the helper ? (Looks like an error to me).

We need to pass in a NULL idev to do a further check in the nesting series.
See https://github.com/yiliu1765/qemu/commit/7f0bb59575bb5cf38618ae950f68a8661676e881

Thanks
Zhenzhong


* RE: [PATCH rfcv1 4/6] vfio: initialize IOMMUFDDevice and pass to vIOMMU
  2024-01-22 17:15   ` Cédric Le Goater
@ 2024-01-23  9:46     ` Duan, Zhenzhong
  2024-01-23 12:54       ` Cédric Le Goater
  0 siblings, 1 reply; 46+ messages in thread
From: Duan, Zhenzhong @ 2024-01-23  9:46 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: alex.williamson, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y,
	Peng, Chao P, Yi Sun



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH rfcv1 4/6] vfio: initialize IOMMUFDDevice and pass to
>vIOMMU
>
>On 1/15/24 11:13, Zhenzhong Duan wrote:
>> Initialize IOMMUFDDevice in vfio and pass to vIOMMU, so that vIOMMU
>> could get hw IOMMU information.
>>
>> In VFIO legacy backend mode, we still pass a NULL IOMMUFDDevice to
>vIOMMU,
>> in case vIOMMU needs some processing for VFIO legacy backend mode.
>>
>> Originally-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>   include/hw/vfio/vfio-common.h |  2 ++
>>   hw/vfio/iommufd.c             |  2 ++
>>   hw/vfio/pci.c                 | 24 +++++++++++++++++++-----
>>   3 files changed, 23 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>common.h
>> index 9b7ef7d02b..fde0d0ca60 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -31,6 +31,7 @@
>>   #endif
>>   #include "sysemu/sysemu.h"
>>   #include "hw/vfio/vfio-container-base.h"
>> +#include "sysemu/iommufd_device.h"
>>
>>   #define VFIO_MSG_PREFIX "vfio %s: "
>>
>> @@ -126,6 +127,7 @@ typedef struct VFIODevice {
>>       bool dirty_tracking;
>>       int devid;
>>       IOMMUFDBackend *iommufd;
>> +    IOMMUFDDevice idev;
>>   } VFIODevice;
>>
>>   struct VFIODeviceOps {
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index 9bfddc1360..cbd035f148 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -309,6 +309,7 @@ static int iommufd_cdev_attach(const char *name,
>VFIODevice *vbasedev,
>>       VFIOContainerBase *bcontainer;
>>       VFIOIOMMUFDContainer *container;
>>       VFIOAddressSpace *space;
>> +    IOMMUFDDevice *idev = &vbasedev->idev;
>>       struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>>       int ret, devfd;
>>       uint32_t ioas_id;
>> @@ -428,6 +429,7 @@ found_container:
>>       QLIST_INSERT_HEAD(&bcontainer->device_list, vbasedev,
>container_next);
>>       QLIST_INSERT_HEAD(&vfio_device_list, vbasedev, global_next);
>>
>> +    iommufd_device_init(idev, sizeof(*idev), container->be, vbasedev-
>>devid);
>>       trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev-
>>num_irqs,
>>                                      vbasedev->num_regions, vbasedev->flags);
>>       return 0;
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index d7fe06715c..2c3a5d267b 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -3107,11 +3107,21 @@ static void vfio_realize(PCIDevice *pdev,
>Error **errp)
>>
>>       vfio_bars_register(vdev);
>>
>> -    ret = vfio_add_capabilities(vdev, errp);
>> +    if (vbasedev->iommufd) {
>> +        ret = pci_device_set_iommu_device(pdev, &vbasedev->idev, errp);
>> +    } else {
>> +        ret = pci_device_set_iommu_device(pdev, 0, errp);
>
>
>AFAICT, pci_device_set_iommu_device() with a NULL IOMMUFDDevice will
>do
>nothing. Why call it ?

We will do something in nesting series, see https://github.com/yiliu1765/qemu/commit/7f0bb59575bb5cf38618ae950f68a8661676e881

Another choice is to call pci_device_set_iommu_device() no matter which backend
is used and check idev->iommufd in vtd_dev_set_iommu_device(). Is this better
for you?
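
For illustration, a minimal sketch of that second option. The types below are
stand-ins, not the real QEMU definitions, and sketch_set_iommu_device() is a
hypothetical name; the only thing taken from this thread is that the callback
is always invoked and decides on idev->iommufd:

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in types for illustration only; not the real QEMU definitions. */
typedef struct IOMMUFDBackend {
    int fd;
} IOMMUFDBackend;

typedef struct IOMMUFDDevice {
    IOMMUFDBackend *iommufd;   /* NULL when the legacy VFIO backend is used */
    unsigned int devid;
} IOMMUFDDevice;

/*
 * Hypothetical vIOMMU-side callback, always called with a non-NULL idev.
 * Returns 0 on success. *checked_host reports whether the host IOMMU
 * would have been queried.
 */
static int sketch_set_iommu_device(IOMMUFDDevice *idev, bool *checked_host)
{
    *checked_host = false;

    if (!idev->iommufd) {
        /* Legacy backend: no iommufd handle, so nothing to query. */
        return 0;
    }

    /* iommufd backend: this is where host cap/ecap would be checked. */
    *checked_host = true;
    return 0;
}
```

With this shape the VFIO side no longer needs the if/else around the call.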

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [PATCH rfcv1 6/6] intel_iommu: add a framework to check and sync host IOMMU cap/ecap
  2024-01-23  8:39   ` Cédric Le Goater
@ 2024-01-23 10:01     ` Duan, Zhenzhong
  0 siblings, 0 replies; 46+ messages in thread
From: Duan, Zhenzhong @ 2024-01-23 10:01 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: alex.williamson, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y,
	Peng, Chao P, Yi Sun, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, Marcel Apfelbaum, Kasireddy, Vivek



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH rfcv1 6/6] intel_iommu: add a framework to check and
>sync host IOMMU cap/ecap
>
>On 1/15/24 11:13, Zhenzhong Duan wrote:
>> From: Yi Liu <yi.l.liu@intel.com>
>>
>> Add a framework to check and synchronize host IOMMU cap/ecap with
>> vIOMMU cap/ecap.
>>
>> Currently only stage-2 translation is supported which is backed by
>> shadow page table on host side. So we don't need exact matching of
>> each bit of cap/ecap between vIOMMU and host. However, we can still
>> utilize this framework to ensure compatibility of host and vIOMMU's
>> address width at least, i.e., vIOMMU's aw_bits <= host aw_bits,
>> which is missed before.
>>
>> When stage-1 translation is supported in future, a.k.a. scalable
>> modern mode, we need to ensure compatibility of each bits. Some
>> bits are user controllable, they should be checked with host side
>> to ensure compatibility. Other bits are not, they should be synced
>> into vIOMMU cap/ecap for compatibility.
>>
>> The sequence will be:
>>
>> vtd_cap_init() initializes iommu->cap/ecap. ---- vtd_cap_init()
>> iommu->host_cap/ecap is initialized as iommu->cap/ecap.  ---- vtd_init()
>> iommu->host_cap/ecap is checked and updated some bits with host
>cap/ecap. ---- vtd_sync_hw_info()
>> iommu->cap/ecap is finalized as iommu->host_cap/ecap.  ----
>vtd_machine_done_hook()
>>
>> iommu->host_cap/ecap is a temporary storage to hold intermediate value
>> when synthesize host cap/ecap and vIOMMU's initial configured cap/ecap.
>
>
>The above "sequence" paragraph is not very clear. The patch may need to
>be split further.

OK, will do.
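
To make the intended ordering easier to follow, here is a rough stand-alone
model of the four steps. All names are hypothetical and only cap is modelled.
The mask-based merge in the sync step is an assumption for illustration, since
the real sync logic is still a TODO in this patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of the state involved; not the real IntelIOMMUState. */
typedef struct SketchIOMMU {
    uint64_t cap;          /* value exposed to the guest */
    uint64_t host_cap;     /* temporary storage until machine done */
    bool cap_finalized;
} SketchIOMMU;

/* Steps 1 and 2: compute cap from the config and seed host_cap with it. */
static void sketch_init(SketchIOMMU *s, uint64_t configured_cap)
{
    if (!s->cap_finalized) {
        s->cap = configured_cap;
        s->host_cap = s->cap;
    }
}

/* Step 3: fold selected host bits into host_cap (assumed merge rule). */
static void sketch_sync(SketchIOMMU *s, uint64_t hw_cap, uint64_t sync_mask)
{
    s->host_cap = (s->host_cap & ~sync_mask) | (hw_cap & sync_mask);
}

/* Step 4: at machine done, the intermediate value becomes the final cap. */
static void sketch_machine_done(SketchIOMMU *s)
{
    s->cap = s->host_cap;
    s->cap_finalized = true;
}
```

Once cap_finalized is set, a later reset calling sketch_init() leaves cap
untouched, which is the point of the flag.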

>
>
>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>   include/hw/i386/intel_iommu.h |  4 ++
>>   hw/i386/intel_iommu.c         | 78
>+++++++++++++++++++++++++++++++----
>>   2 files changed, 75 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/hw/i386/intel_iommu.h
>b/include/hw/i386/intel_iommu.h
>> index c65fdde56f..b8abbcce12 100644
>> --- a/include/hw/i386/intel_iommu.h
>> +++ b/include/hw/i386/intel_iommu.h
>> @@ -292,6 +292,9 @@ struct IntelIOMMUState {
>>       uint64_t cap;                   /* The value of capability reg */
>>       uint64_t ecap;                  /* The value of extended capability reg */
>>
>> +    uint64_t host_cap;              /* The value of host capability reg */
>> +    uint64_t host_ecap;             /* The value of host ext-capability reg */
>> +
>>       uint32_t context_cache_gen;     /* Should be in [1,MAX] */
>>       GHashTable *iotlb;              /* IOTLB */
>>
>> @@ -314,6 +317,7 @@ struct IntelIOMMUState {
>>       bool dma_translation;           /* Whether DMA translation supported */
>>       bool pasid;                     /* Whether to support PASID */
>>
>> +    bool cap_finalized;             /* Whether VTD capability finalized */
>>       /*
>>        * Protects IOMMU states in general.  Currently it protects the
>>        * per-IOMMU IOTLB cache, and context entry cache in
>VTDAddressSpace.
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 4c1d058ebd..be03fcbf52 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -3819,6 +3819,47 @@ VTDAddressSpace
>*vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
>>       return vtd_dev_as;
>>   }
>>
>> +static bool vtd_sync_hw_info(IntelIOMMUState *s, struct
>iommu_hw_info_vtd *vtd,
>> +                             Error **errp)
>> +{
>> +    uint64_t addr_width;
>> +
>> +    addr_width = (vtd->cap_reg >> 16) & 0x3fULL;
>
>Vivek uses the same kind of macro in:
>
>   https://lore.kernel.org/qemu-devel/20240118192049.1796763-1-
>vivek.kasireddy@intel.com/
>
>What about the + 1 ? Looks like it's missing here, according to 11.4.2
>Capability Register.
>
>Could we introduce a common macro in intel_iommu_internal.h ?

Sure.
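
Something along these lines, as a sketch only. The macro and helper names are
hypothetical, and it assumes the MGAW field (bits 21:16 of the capability
register) is encoded as width minus one, which is what the +1 remark above
points at:

```c
#include <stdint.h>

/* Hypothetical names; the real macro would live in intel_iommu_internal.h. */
#define SKETCH_CAP_MGAW_SHIFT 16
#define SKETCH_CAP_MGAW_MASK  0x3fULL

/* Return the address width in bits, assuming the field stores width - 1. */
static inline unsigned int sketch_cap_to_addr_width(uint64_t cap_reg)
{
    uint64_t field = (cap_reg >> SKETCH_CAP_MGAW_SHIFT) & SKETCH_CAP_MGAW_MASK;

    return (unsigned int)field + 1;
}
```

Under that assumption, a host reporting 47 in the field has a 48-bit width,
and aw_bits would be compared against 48.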

>
>
>> +    if (s->aw_bits > addr_width) {
>> +        error_setg(errp, "User aw-bits: %u > host address width: %lu",
>
>I think %lu should be PRId64. This is a general comment. You should avoid
>%llx, %lx, etc. in the code.

Got it.
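
As a small stand-alone illustration of the portable form (helper name is
hypothetical; snprintf stands in for error_setg here). Since addr_width is
unsigned, PRIu64 is used rather than PRId64, which is a choice made in this
sketch and not something settled in the review:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format the address-width error without %lu, so the same format string
 * is correct on both 32-bit and 64-bit hosts.
 */
static int sketch_format_aw_error(char *buf, size_t len,
                                  uint8_t aw_bits, uint64_t host_aw)
{
    return snprintf(buf, len,
                    "User aw-bits: %u > host address width: %" PRIu64,
                    (unsigned int)aw_bits, host_aw);
}
```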

>
>> +                   s->aw_bits, addr_width);
>> +        return false;
>> +    }
>> +
>> +    /* TODO: check and sync host cap/ecap into vIOMMU cap/ecap */
>> +
>> +    return true;
>> +}
>> +
>> +/*
>> + * virtual VT-d which wants nested needs to check the host IOMMU
>> + * nesting cap info behind the assigned devices. Thus that vIOMMU
>> + * could bind guest page table to host.
>> + */
>> +static bool vtd_check_idev(IntelIOMMUState *s, IOMMUFDDevice *idev,
>> +                           Error **errp)
>> +{
>> +    struct iommu_hw_info_vtd vtd;
>> +    enum iommu_hw_info_type type =
>IOMMU_HW_INFO_TYPE_INTEL_VTD;
>> +
>> +    if (iommufd_device_get_info(idev, &type, sizeof(vtd), &vtd)) {
>> +        error_setg(errp, "Failed to get IOMMU capability!!!");
>> +        return false;
>> +    }
>> +
>> +    if (type != IOMMU_HW_INFO_TYPE_INTEL_VTD) {
>> +        error_setg(errp, "IOMMU hardware is not compatible!!!");
>> +        return false;
>> +    }
>> +
>> +    return vtd_sync_hw_info(s, &vtd, errp);
>> +}
>> +
>>   static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque,
>int32_t devfn,
>>                                       IOMMUFDDevice *idev, Error **errp)
>>   {
>> @@ -3837,6 +3878,10 @@ static int vtd_dev_set_iommu_device(PCIBus
>*bus, void *opaque, int32_t devfn,
>>           return 0;
>>       }
>>
>> +    if (!vtd_check_idev(s, idev, errp)) {
>> +        return -1;
>> +    }
>> +
>>       vtd_iommu_lock(s);
>>
>>       vtd_idev = g_hash_table_lookup(s->vtd_iommufd_dev, &key);
>> @@ -4071,10 +4116,11 @@ static void vtd_init(IntelIOMMUState *s)
>>   {
>>       X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
>>
>> -    memset(s->csr, 0, DMAR_REG_SIZE);
>> -    memset(s->wmask, 0, DMAR_REG_SIZE);
>> -    memset(s->w1cmask, 0, DMAR_REG_SIZE);
>> -    memset(s->womask, 0, DMAR_REG_SIZE);
>> +    /* CAP/ECAP are initialized in machine create done stage */
>> +    memset(s->csr + DMAR_GCMD_REG, 0, DMAR_REG_SIZE -
>DMAR_GCMD_REG);
>> +    memset(s->wmask + DMAR_GCMD_REG, 0, DMAR_REG_SIZE -
>DMAR_GCMD_REG);
>> +    memset(s->w1cmask + DMAR_GCMD_REG, 0, DMAR_REG_SIZE -
>DMAR_GCMD_REG);
>> +    memset(s->womask + DMAR_GCMD_REG, 0, DMAR_REG_SIZE -
>DMAR_GCMD_REG);
>>
>>       s->root = 0;
>>       s->root_scalable = false;
>> @@ -4110,13 +4156,16 @@ static void vtd_init(IntelIOMMUState *s)
>
>vtd_init() is called from reset and from realize. This is redundant.
>reset should be enough.

Looks so. I'll check whether it breaks anything.

>
>
>>           vtd_spte_rsvd_large[3] &= ~VTD_SPTE_SNP;
>>       }
>>
>> -    vtd_cap_init(s);
>> +    if (!s->cap_finalized) {
>
>ok. so this can only be done in reset.

I don't quite follow. vtd_init() can be called multiple times harmlessly before machine creation is done,
but once is enough, i.e., in reset. The call in realize looks redundant.

>
>> +        vtd_cap_init(s);
>> +        s->host_cap = s->cap;
>> +        s->host_ecap = s->ecap;
>> +    }
>> +
>>       vtd_reset_caches(s);
>>
>>       /* Define registers with default values and bit semantics */
>>       vtd_define_long(s, DMAR_VER_REG, 0x10UL, 0, 0);
>> -    vtd_define_quad(s, DMAR_CAP_REG, s->cap, 0, 0);
>> -    vtd_define_quad(s, DMAR_ECAP_REG, s->ecap, 0, 0);
>>       vtd_define_long(s, DMAR_GCMD_REG, 0, 0xff800000UL, 0);
>>       vtd_define_long_wo(s, DMAR_GCMD_REG, 0xff800000UL);
>>       vtd_define_long(s, DMAR_GSTS_REG, 0, 0, 0);
>> @@ -4241,6 +4290,12 @@ static bool
>vtd_decide_config(IntelIOMMUState *s, Error **errp)
>>       return true;
>>   }
>>
>> +static void vtd_setup_capability_reg(IntelIOMMUState *s)
>> +{
>> +    vtd_define_quad(s, DMAR_CAP_REG, s->cap, 0, 0);
>> +    vtd_define_quad(s, DMAR_ECAP_REG, s->ecap, 0, 0);
>> +}
>> +
>>   static int vtd_machine_done_notify_one(Object *child, void *unused)
>>   {
>>       IntelIOMMUState *iommu =
>INTEL_IOMMU_DEVICE(x86_iommu_get_default());
>> @@ -4259,6 +4314,14 @@ static int
>vtd_machine_done_notify_one(Object *child, void *unused)
>>
>>   static void vtd_machine_done_hook(Notifier *notifier, void *unused)
>>   {
>> +    IntelIOMMUState *iommu =
>INTEL_IOMMU_DEVICE(x86_iommu_get_default());
>> +
>> +    iommu->cap = iommu->host_cap;
>> +    iommu->ecap = iommu->host_ecap;
>> +    iommu->cap_finalized = true;
>> +
>> +    vtd_setup_capability_reg(iommu);
>> +
>
>This is confusing. Please split the patch better to reflect the ordering of
>the e/cap register settings.

Will do.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH rfcv1 1/6] backends/iommufd_device: introduce IOMMUFDDevice
  2024-01-19  7:31     ` Duan, Zhenzhong
  2024-01-22 16:25       ` Cédric Le Goater
@ 2024-01-23 10:10       ` Eric Auger
  1 sibling, 0 replies; 46+ messages in thread
From: Eric Auger @ 2024-01-23 10:10 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel
  Cc: alex.williamson, clg, peterx, jasowang, mst, jgg, nicolinc,
	joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P,
	Yi Sun



On 1/19/24 08:31, Duan, Zhenzhong wrote:
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Subject: Re: [PATCH rfcv1 1/6] backends/iommufd_device: introduce
>> IOMMUFDDevice
>>
>>
>>
>> On 1/15/24 11:13, Zhenzhong Duan wrote:
>>> IOMMUFDDevice represents a device in iommufd and can be used as
>>> a communication interface between devices (i.e., VFIO, VDPA) and
>>> vIOMMU.
>>>
>>> Currently it includes iommufd handler and device id information
>>> which could be used by vIOMMU to get hw IOMMU information.
>>>
>>> In future nested translation support, vIOMMU is going to have
>>> more iommufd related operations like allocate hwpt for a device,
>>> attach/detach hwpt, etc. So IOMMUFDDevice will be further expanded.
>>>
>>> IOMMUFDDevice is willingly not a QOM object because we don't want
>>> it to be visible from the user interface.
>>>
>>> Introduce a helper iommufd_device_init to initialize IOMMUFDDevice.
>>>
>>> Originally-by: Yi Liu <yi.l.liu@intel.com>
>>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> ---
>>>  MAINTAINERS                     |  4 +--
>>>  include/sysemu/iommufd_device.h | 31 ++++++++++++++++++++
>>>  backends/iommufd_device.c       | 50
>> +++++++++++++++++++++++++++++++++
>> Maybe it is still time to move the iommufd files in a separate dir, under
>> hw at the same level as vfio.
>>
>> Thoughts?
> Any reason for the movement? Hw dir contains entries to emulate different
> Devices. Iommufd is not a real device. It's more a backend.
Well I was thinking it would become bigger and bigger and since you
created a new .c file with new abstraction (devices) the backend dir
flat layout may not be well adapted. But as suggested by Cédric we may
use the existing files.

Thanks

Eric
>
> Thanks
> Zhenzhong



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH rfcv1 2/6] hw/pci: introduce pci_device_set/unset_iommu_device()
  2024-01-23  9:25         ` Duan, Zhenzhong
@ 2024-01-23 10:18           ` Eric Auger
  2024-01-24  9:23             ` Duan, Zhenzhong
  0 siblings, 1 reply; 46+ messages in thread
From: Eric Auger @ 2024-01-23 10:18 UTC (permalink / raw)
  To: Duan, Zhenzhong, Cédric Le Goater, qemu-devel
  Cc: alex.williamson, peterx, jasowang, mst, jgg, nicolinc,
	joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P,
	Yi Sun, Marcel Apfelbaum



On 1/23/24 10:25, Duan, Zhenzhong wrote:
>
>> -----Original Message-----
>> From: Cédric Le Goater <clg@redhat.com>
>> Subject: Re: [PATCH rfcv1 2/6] hw/pci: introduce
>> pci_device_set/unset_iommu_device()
>>
>> On 1/23/24 07:37, Duan, Zhenzhong wrote:
>>>
>>>> -----Original Message-----
>>>> From: Cédric Le Goater <clg@redhat.com>
>>>> Subject: Re: [PATCH rfcv1 2/6] hw/pci: introduce
>>>> pci_device_set/unset_iommu_device()
>>>>
>>>> On 1/15/24 11:13, Zhenzhong Duan wrote:
>>>>> From: Yi Liu <yi.l.liu@intel.com>
>>>>>
>>>>> This adds pci_device_set/unset_iommu_device() to set/unset
>>>>> IOMMUFDDevice for a given PCIe device. Caller of set
>>>>> should fail if set operation fails.
>>>>>
>>>>> Extract out pci_device_get_iommu_bus_devfn() to facilitate
>>>>> implementation of pci_device_set/unset_iommu_device().
>>>>>
>>>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>>>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>>>>> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
>>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>>> ---
>>>>>    include/hw/pci/pci.h | 39
>> ++++++++++++++++++++++++++++++++++-
>>>>>    hw/pci/pci.c         | 49
>>>> +++++++++++++++++++++++++++++++++++++++++++-
>>>>>    2 files changed, 86 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>>>>> index fa6313aabc..a810c0ec74 100644
>>>>> --- a/include/hw/pci/pci.h
>>>>> +++ b/include/hw/pci/pci.h
>>>>> @@ -7,6 +7,8 @@
>>>>>    /* PCI includes legacy ISA access.  */
>>>>>    #include "hw/isa/isa.h"
>>>>>
>>>>> +#include "sysemu/iommufd_device.h"
>>>>> +
>>>>>    extern bool pci_available;
>>>>>
>>>>>    /* PCI bus */
>>>>> @@ -384,10 +386,45 @@ typedef struct PCIIOMMUOps {
>>>>>         *
>>>>>         * @devfn: device and function number
>>>>>         */
>>>>> -   AddressSpace * (*get_address_space)(PCIBus *bus, void *opaque, int
>>>> devfn);
>>>>> +    AddressSpace * (*get_address_space)(PCIBus *bus, void *opaque,
>> int
>>>> devfn);
>>>>> +    /**
>>>>> +     * @set_iommu_device: set iommufd device for a PCI device to
>>>> vIOMMU
>>>>> +     *
>>>>> +     * Optional callback, if not implemented in vIOMMU, then vIOMMU
>>>> can't
>>>>> +     * utilize iommufd specific features.
>>>>> +     *
>>>>> +     * Return true if iommufd device is accepted, or else return false with
>>>>> +     * errp set.
>>>>> +     *
>>>>> +     * @bus: the #PCIBus of the PCI device.
>>>>> +     *
>>>>> +     * @opaque: the data passed to pci_setup_iommu().
>>>>> +     *
>>>>> +     * @devfn: device and function number of the PCI device.
>>>>> +     *
>>>>> +     * @idev: the data structure representing iommufd device.
>>>>> +     *
>>>>> +     */
>>>>> +    int (*set_iommu_device)(PCIBus *bus, void *opaque, int32_t devfn,
>>>>> +                            IOMMUFDDevice *idev, Error **errp);
>>>>> +    /**
>>>>> +     * @unset_iommu_device: unset iommufd device for a PCI device
>> from
>>>> vIOMMU
>>>>> +     *
>>>>> +     * Optional callback.
>>>>> +     *
>>>>> +     * @bus: the #PCIBus of the PCI device.
>>>>> +     *
>>>>> +     * @opaque: the data passed to pci_setup_iommu().
>>>>> +     *
>>>>> +     * @devfn: device and function number of the PCI device.
>>>>> +     */
>>>>> +    void (*unset_iommu_device)(PCIBus *bus, void *opaque, int32_t
>>>> devfn);
>>>>>    } PCIIOMMUOps;
>>>>>
>>>>>    AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
>>>>> +int pci_device_set_iommu_device(PCIDevice *dev, IOMMUFDDevice
>>>> *idev,
>>>>> +                                Error **errp);
>>>>> +void pci_device_unset_iommu_device(PCIDevice *dev);
>>>>>
>>>>>    /**
>>>>>     * pci_setup_iommu: Initialize specific IOMMU handlers for a PCIBus
>>>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>>>> index 76080af580..3848662f95 100644
>>>>> --- a/hw/pci/pci.c
>>>>> +++ b/hw/pci/pci.c
>>>>> @@ -2672,7 +2672,10 @@ static void
>>>> pci_device_class_base_init(ObjectClass *klass, void *data)
>>>>>        }
>>>>>    }
>>>>>
>>>>> -AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>>>>> +static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
>>>>> +                                           PCIBus **aliased_pbus,
>>>>> +                                           PCIBus **piommu_bus,
>>>>> +                                           uint8_t *aliased_pdevfn)
>>>>>    {
>>>>>        PCIBus *bus = pci_get_bus(dev);
>>>>>        PCIBus *iommu_bus = bus;
>>>>> @@ -2717,6 +2720,18 @@ AddressSpace
>>>> *pci_device_iommu_address_space(PCIDevice *dev)
>>>>>            iommu_bus = parent_bus;
>>>>>        }
>>>>> +    *aliased_pbus = bus;
>>>>> +    *piommu_bus = iommu_bus;
>>>>> +    *aliased_pdevfn = devfn;
>>>>> +}
>>>>> +
>>>>> +AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>>>>> +{
>>>>> +    PCIBus *bus;
>>>>> +    PCIBus *iommu_bus;
>>>>> +    uint8_t devfn;
>>>>> +
>>>>> +    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus,
>> &devfn);
>>>>>        if (!pci_bus_bypass_iommu(bus) && iommu_bus->iommu_ops) {
>>>>>            return iommu_bus->iommu_ops->get_address_space(bus,
>>>>>                                     iommu_bus->iommu_opaque, devfn);
>>>>> @@ -2724,6 +2739,38 @@ AddressSpace
>>>> *pci_device_iommu_address_space(PCIDevice *dev)
>>>>>        return &address_space_memory;
>>>>>    }
>>>>>
>>>>> +int pci_device_set_iommu_device(PCIDevice *dev, IOMMUFDDevice
>>>> *idev,
>>>>> +                                Error **errp)
>>>>> +{
>>>>> +    PCIBus *bus;
>>>>> +    PCIBus *iommu_bus;
>>>>> +    uint8_t devfn;
>>>>> +
>>>>> +    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus,
>> &devfn);
>>>>> +    if (!pci_bus_bypass_iommu(bus) && iommu_bus &&
>>>> Why do we test iommu_bus in pci_device_un/set_iommu_device
>> routines
>>>> and
>>>> not in pci_device_iommu_address_space() ?
>>> iommu_bus check in pci_device_iommu_address_space() is dropped in
>>> below commit, I didn't find related discussion in mail history, maybe
>>> by accident? I can add it back if it's not intentional.
>> Can iommu_bus be NULL or should we add an assert ?
> I dug into the change history of pci_device_iommu_address_space() and
> the commit below added the iommu_bus check.
>
> 5af2ae230514  pci: Fix pci_device_iommu_address_space() bus propagation
>
> In theory, the !iommu_bus->parent_dev check takes precedence over !iommu_bus,
> so we never see a NULL iommu_bus; an assert may be better.

I think we had such a discussion in
https://www.mail-archive.com/qemu-devel@nongnu.org/msg994766.html
But maybe this was related to a different call place. I remember I
challenged the check at some point.

Eric
>
> Thanks
> Zhenzhong
>



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH rfcv1 4/6] vfio: initialize IOMMUFDDevice and pass to vIOMMU
  2024-01-23  9:46     ` Duan, Zhenzhong
@ 2024-01-23 12:54       ` Cédric Le Goater
  2024-01-24  9:26         ` Duan, Zhenzhong
  0 siblings, 1 reply; 46+ messages in thread
From: Cédric Le Goater @ 2024-01-23 12:54 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel
  Cc: alex.williamson, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y,
	Peng, Chao P, Yi Sun

On 1/23/24 10:46, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Cédric Le Goater <clg@redhat.com>
>> Subject: Re: [PATCH rfcv1 4/6] vfio: initialize IOMMUFDDevice and pass to
>> vIOMMU
>>
>> On 1/15/24 11:13, Zhenzhong Duan wrote:
>>> Initialize IOMMUFDDevice in vfio and pass to vIOMMU, so that vIOMMU
>>> could get hw IOMMU information.
>>>
>>> In VFIO legacy backend mode, we still pass a NULL IOMMUFDDevice to
>> vIOMMU,
>>> in case vIOMMU needs some processing for VFIO legacy backend mode.
>>>
>>> Originally-by: Yi Liu <yi.l.liu@intel.com>
>>> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
>>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> ---
>>>    include/hw/vfio/vfio-common.h |  2 ++
>>>    hw/vfio/iommufd.c             |  2 ++
>>>    hw/vfio/pci.c                 | 24 +++++++++++++++++++-----
>>>    3 files changed, 23 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>> common.h
>>> index 9b7ef7d02b..fde0d0ca60 100644
>>> --- a/include/hw/vfio/vfio-common.h
>>> +++ b/include/hw/vfio/vfio-common.h
>>> @@ -31,6 +31,7 @@
>>>    #endif
>>>    #include "sysemu/sysemu.h"
>>>    #include "hw/vfio/vfio-container-base.h"
>>> +#include "sysemu/iommufd_device.h"
>>>
>>>    #define VFIO_MSG_PREFIX "vfio %s: "
>>>
>>> @@ -126,6 +127,7 @@ typedef struct VFIODevice {
>>>        bool dirty_tracking;
>>>        int devid;
>>>        IOMMUFDBackend *iommufd;
>>> +    IOMMUFDDevice idev;
>>>    } VFIODevice;
>>>
>>>    struct VFIODeviceOps {
>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>> index 9bfddc1360..cbd035f148 100644
>>> --- a/hw/vfio/iommufd.c
>>> +++ b/hw/vfio/iommufd.c
>>> @@ -309,6 +309,7 @@ static int iommufd_cdev_attach(const char *name,
>> VFIODevice *vbasedev,
>>>        VFIOContainerBase *bcontainer;
>>>        VFIOIOMMUFDContainer *container;
>>>        VFIOAddressSpace *space;
>>> +    IOMMUFDDevice *idev = &vbasedev->idev;
>>>        struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>>>        int ret, devfd;
>>>        uint32_t ioas_id;
>>> @@ -428,6 +429,7 @@ found_container:
>>>        QLIST_INSERT_HEAD(&bcontainer->device_list, vbasedev,
>> container_next);
>>>        QLIST_INSERT_HEAD(&vfio_device_list, vbasedev, global_next);
>>>
>>> +    iommufd_device_init(idev, sizeof(*idev), container->be, vbasedev-
>>> devid);
>>>        trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev-
>>> num_irqs,
>>>                                       vbasedev->num_regions, vbasedev->flags);
>>>        return 0;
>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>> index d7fe06715c..2c3a5d267b 100644
>>> --- a/hw/vfio/pci.c
>>> +++ b/hw/vfio/pci.c
>>> @@ -3107,11 +3107,21 @@ static void vfio_realize(PCIDevice *pdev,
>> Error **errp)
>>>
>>>        vfio_bars_register(vdev);
>>>
>>> -    ret = vfio_add_capabilities(vdev, errp);
>>> +    if (vbasedev->iommufd) {
>>> +        ret = pci_device_set_iommu_device(pdev, &vbasedev->idev, errp);
>>> +    } else {
>>> +        ret = pci_device_set_iommu_device(pdev, 0, errp);
>>
>>
>> AFAICT, pci_device_set_iommu_device() with a NULL IOMMUFDDevice will
>> do
>> nothing. Why call it ?
> 
> We will do something in nesting series, see https://github.com/yiliu1765/qemu/commit/7f0bb59575bb5cf38618ae950f68a8661676e881

ok, that's not much. idev is used as a capability bool and later on
to pass the /dev/iommu fd.  We don't need to support the legacy mode ?

> Another choice is to call pci_device_set_iommu_device() no matter which backend
> is used and check idev->iommufd in vtd_dev_set_iommu_device(). Is this better
> for you?

yes. Should be fine. There is more to it though.

IIUC, what will determine most of the requirements, is the legacy
mode. We also need the host iommu info in that case. As Eric said,
ideally, we should introduce a common abstract "host-iommu-info" struct
and sub structs associated with the iommu backends (iommufd + legacy)
which would be allocated accordingly.

So, IOMMUFDDevice usage should be limited to the iommufd files. All PCI
files should use the common abstract type. We should define these data
structures first. They could be simple C struct for now. We will see if
QOM applies after.
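
As a strawman only, the plain C layout could look like the following. Every
name here is hypothetical and the fields are placeholders; the only idea taken
from the discussion is a common base type with one sub struct per backend:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical backend tag stored in the common base. */
typedef enum {
    SKETCH_HOST_IOMMU_LEGACY,
    SKETCH_HOST_IOMMU_IOMMUFD,
} SketchHostIOMMUType;

/* Common abstract type: the only one PCI code would see. */
typedef struct SketchHostIOMMUInfo {
    SketchHostIOMMUType type;
} SketchHostIOMMUInfo;

/* iommufd flavour; base must stay the first member for the cast below. */
typedef struct SketchIOMMUFDInfo {
    SketchHostIOMMUInfo base;
    int iommufd;
    uint32_t devid;
} SketchIOMMUFDInfo;

/* Legacy flavour, with a placeholder field. */
typedef struct SketchLegacyInfo {
    SketchHostIOMMUInfo base;
    int container_fd;
} SketchLegacyInfo;

/* Checked downcast, usable only from the iommufd files. */
static SketchIOMMUFDInfo *sketch_to_iommufd(SketchHostIOMMUInfo *info)
{
    if (!info || info->type != SKETCH_HOST_IOMMU_IOMMUFD) {
        return NULL;
    }
    return (SketchIOMMUFDInfo *)info;
}
```

The vIOMMU callbacks would then take the base pointer, and each backend would
allocate its own sub struct.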

Will take a look at Eric's patchset next.

Thanks,

C.



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH rfcv1 6/6] intel_iommu: add a framework to check and sync host IOMMU cap/ecap
  2024-01-19 11:55         ` Duan, Zhenzhong
@ 2024-01-23 13:10           ` Eric Auger
  0 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2024-01-23 13:10 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel
  Cc: alex.williamson, clg, peterx, jasowang, mst, jgg, nicolinc,
	joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P,
	Yi Sun, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum



On 1/19/24 12:55, Duan, Zhenzhong wrote:
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Subject: Re: [PATCH rfcv1 6/6] intel_iommu: add a framework to check and
>> sync host IOMMU cap/ecap
>>
>>
>>
>> On 1/18/24 10:30, Duan, Zhenzhong wrote:
>>> Hi Eric,
>>>
>>>> -----Original Message-----
>>>> From: Eric Auger <eric.auger@redhat.com>
>>>> Subject: Re: [PATCH rfcv1 6/6] intel_iommu: add a framework to check
>> and
>>>> sync host IOMMU cap/ecap
>>>>
>>>> Hi Zhenzhong,
>>>>
>>>> On 1/15/24 11:13, Zhenzhong Duan wrote:
>>>>> From: Yi Liu <yi.l.liu@intel.com>
>>>>>
>>>>> Add a framework to check and synchronize host IOMMU cap/ecap with
>>>>> vIOMMU cap/ecap.
>>>>>
>>>>> Currently only stage-2 translation is supported which is backed by
>>>>> shadow page table on host side. So we don't need exact matching of
>>>>> each bit of cap/ecap between vIOMMU and host. However, we can still
>>>>> utilize this framework to ensure compatibility of host and vIOMMU's
>>>>> address width at least, i.e., vIOMMU's aw_bits <= host aw_bits,
>>>>> which is missed before.
>>>>>
>>>>> When stage-1 translation is supported in future, a.k.a. scalable
>>>>> modern mode, we need to ensure compatibility of each bits. Some
>>>>> bits are user controllable, they should be checked with host side
>>>>> to ensure compatibility. Other bits are not, they should be synced
>>>>> into vIOMMU cap/ecap for compatibility.
>>>>>
>>>>> The sequence will be:
>>>>>
>>>>> vtd_cap_init() initializes iommu->cap/ecap. ---- vtd_cap_init()
>>>>> iommu->host_cap/ecap is initialized as iommu->cap/ecap.  ---- vtd_init()
>>>>> iommu->host_cap/ecap is checked and updated some bits with host
>>>> cap/ecap. ---- vtd_sync_hw_info()
>>>>> iommu->cap/ecap is finalized as iommu->host_cap/ecap.  ----
>>>> vtd_machine_done_hook()
>>>>> iommu->host_cap/ecap is a temporary storage to hold intermediate
>> value
>>>>> when synthesize host cap/ecap and vIOMMU's initial configured
>> cap/ecap.
>>>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>>>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>>> ---
>>>>>  include/hw/i386/intel_iommu.h |  4 ++
>>>>>  hw/i386/intel_iommu.c         | 78
>>>> +++++++++++++++++++++++++++++++----
>>>>>  2 files changed, 75 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/include/hw/i386/intel_iommu.h
>>>> b/include/hw/i386/intel_iommu.h
>>>>> index c65fdde56f..b8abbcce12 100644
>>>>> --- a/include/hw/i386/intel_iommu.h
>>>>> +++ b/include/hw/i386/intel_iommu.h
>>>>> @@ -292,6 +292,9 @@ struct IntelIOMMUState {
>>>>>      uint64_t cap;                   /* The value of capability reg */
>>>>>      uint64_t ecap;                  /* The value of extended capability reg */
>>>>>
>>>>> +    uint64_t host_cap;              /* The value of host capability reg */
>>>>> +    uint64_t host_ecap;             /* The value of host ext-capability reg */
>>>>> +
>>>>>      uint32_t context_cache_gen;     /* Should be in [1,MAX] */
>>>>>      GHashTable *iotlb;              /* IOTLB */
>>>>>
>>>>> @@ -314,6 +317,7 @@ struct IntelIOMMUState {
>>>>>      bool dma_translation;           /* Whether DMA translation supported
>> */
>>>>>      bool pasid;                     /* Whether to support PASID */
>>>>>
>>>>> +    bool cap_finalized;             /* Whether VTD capability finalized */
>>>>>      /*
>>>>>       * Protects IOMMU states in general.  Currently it protects the
>>>>>       * per-IOMMU IOTLB cache, and context entry cache in
>>>> VTDAddressSpace.
>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>>> index 4c1d058ebd..be03fcbf52 100644
>>>>> --- a/hw/i386/intel_iommu.c
>>>>> +++ b/hw/i386/intel_iommu.c
>>>>> @@ -3819,6 +3819,47 @@ VTDAddressSpace
>>>> *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
>>>>>      return vtd_dev_as;
>>>>>  }
>>>>>
>>>>> +static bool vtd_sync_hw_info(IntelIOMMUState *s, struct
>>>> iommu_hw_info_vtd *vtd,
>>>>> +                             Error **errp)
>>>>> +{
>>>>> +    uint64_t addr_width;
>>>>> +
>>>>> +    addr_width = (vtd->cap_reg >> 16) & 0x3fULL;
>>>>> +    if (s->aw_bits > addr_width) {
>>>>> +        error_setg(errp, "User aw-bits: %u > host address width: %lu",
>>>>> +                   s->aw_bits, addr_width);
>>>>> +        return false;
>>>>> +    }
>>>>> +
>>>>> +    /* TODO: check and sync host cap/ecap into vIOMMU cap/ecap */
>>>>> +
>>>>> +    return true;
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * virtual VT-d which wants nested needs to check the host IOMMU
>>>>> + * nesting cap info behind the assigned devices. Thus that vIOMMU
>>>>> + * could bind guest page table to host.
>>>>> + */
>>>>> +static bool vtd_check_idev(IntelIOMMUState *s, IOMMUFDDevice *idev,
>>>>> +                           Error **errp)
>>>>> +{
>>>>> +    struct iommu_hw_info_vtd vtd;
>>>>> +    enum iommu_hw_info_type type = IOMMU_HW_INFO_TYPE_INTEL_VTD;
>>>>> +
>>>>> +    if (iommufd_device_get_info(idev, &type, sizeof(vtd), &vtd)) {
>>>>> +        error_setg(errp, "Failed to get IOMMU capability!!!");
>>>>> +        return false;
>>>>> +    }
>>>>> +
>>>>> +    if (type != IOMMU_HW_INFO_TYPE_INTEL_VTD) {
>>>>> +        error_setg(errp, "IOMMU hardware is not compatible!!!");
>>>>> +        return false;
>>>>> +    }
>>>>> +
>>>>> +    return vtd_sync_hw_info(s, &vtd, errp);
>>>>> +}
>>>>> +
>>>>>  static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int32_t devfn,
>>>>>                                      IOMMUFDDevice *idev, Error **errp)
>>>>>  {
>>>>> @@ -3837,6 +3878,10 @@ static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int32_t devfn,
>>>>>          return 0;
>>>>>      }
>>>>>
>>>>> +    if (!vtd_check_idev(s, idev, errp)) {
>>>> In
>>>> [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for
>>>> hotplugged devices
>>>> https://lore.kernel.org/all/20240117080414.316890-1-eric.auger@redhat.com/
>>>>
>>>> I also attempt to pass host iommu info to the virtio-iommu but with
>>>> legacy BE.
>>> I think your patch works with iommufd BE too😊 Because iommufd BE
>>> also fills bcontainer->iova_ranges in iommufd_cdev_get_info_iova_range().
>> Correct. I wanted to emphasize that we also need to pass host iommu
>> info in legacy mode, for instance. In this series you introduce an
>> object that works with the iommufd backend, but I think if we go this way
>> we would need another one for the legacy device. So maybe introducing a
>> base object with two derived ones would be most appropriate? Given the
>> assumption that we will use iommufd for new use cases, the legacy
>> object would implement far fewer interfaces, but still.
> How about this:
>
> enum IOMMU_LEGACY_DEVICE_TYPE {
>     IOMMU_LEGACY_VFIO_DEVICE,
>     IOMMU_LEGACY_VDPA_DEVICE,
> }
>
> typedef struct IOMMULegacyDevice {
>     enum IOMMU_LEGACY_DEVICE_TYPE type;
>
>     /* common field */
>
>     union {
>         ....
>     }
>
> } IOMMULegacyDevice;
>
> typedef struct IOMMUFDDevice {
>     IOMMUFDBackend *iommufd;
>     uint32_t dev_id;
>     uint32_t ioas_id;
> } IOMMUFDDevice;
>
> enum IOMMU_DEVICE_TYPE {
>     IOMMUFD_DEVICE,
>     IOMMU_LEGACY_DEVICE,
> };
>
> struct IOMMUDevice {
>     enum IOMMU_DEVICE_TYPE type;
>
>     /* common field */
>     GList *iova_ranges;
>
>     union {
>         IOMMULegacyDevice legacy_dev;
yeah but that's not very nice to have this LegacyDevice def in an
iommufd.c file

Either we define an abstract HostAssignedDevice and derived objects for
both legacy and IOMMUFD or we consider using a different API for legacy
use cases (retrieving resv regions, page size mask, ...).

Eric



>         IOMMUFDDevice idev;
>     }
> }
>
>>>> In my case I want to pass the reserved memory regions which
>>>> also model the aw.
>>>> So this is a pretty similar use case.
>>> Yes.
>>>
>>>> Why don't we pass the pointer to an opaque iommu_hw_info instead,
>>>> through the PCIIOMMUOps?
>>> Passing iommu_hw_info is ok for this series, but we want more from
>>> IOMMUFDDevice in nesting series. I.e., allocate/free ioas/hwpt,
>>> attach/detach from hwpt, get dirty bitmap, etc. It's more flexible to
>>> let vIOMMU get what it wants itself.
>> OK, it would be interesting to define the class for this object. It is
>> worth introducing either in the cover letter or in the 1st patch.
> Not a QOM class, because we don't want it exposed through
> query-qmp-schema.
>
> Thanks
> Zhenzhong
>
>> Eric
>>>>
>>>>> +        return -1;
>>>>> +    }
>>>>> +
>>>>>      vtd_iommu_lock(s);
>>>>>
>>>>>      vtd_idev = g_hash_table_lookup(s->vtd_iommufd_dev, &key);
>>>>> @@ -4071,10 +4116,11 @@ static void vtd_init(IntelIOMMUState *s)
>>>>>  {
>>>>>      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
>>>>>
>>>>> -    memset(s->csr, 0, DMAR_REG_SIZE);
>>>>> -    memset(s->wmask, 0, DMAR_REG_SIZE);
>>>>> -    memset(s->w1cmask, 0, DMAR_REG_SIZE);
>>>>> -    memset(s->womask, 0, DMAR_REG_SIZE);
>>>>> +    /* CAP/ECAP are initialized in machine create done stage */
>>>>> +    memset(s->csr + DMAR_GCMD_REG, 0, DMAR_REG_SIZE - DMAR_GCMD_REG);
>>>>> +    memset(s->wmask + DMAR_GCMD_REG, 0, DMAR_REG_SIZE - DMAR_GCMD_REG);
>>>>> +    memset(s->w1cmask + DMAR_GCMD_REG, 0, DMAR_REG_SIZE - DMAR_GCMD_REG);
>>>>> +    memset(s->womask + DMAR_GCMD_REG, 0, DMAR_REG_SIZE - DMAR_GCMD_REG);
>>>> This change is not documented in the commit msg.
>>>> Sorry I don't get why this is needed?
>>> I'll document it. The comment above explains when cap/ecap are initialized.
>>> vtd_init() is called both at QEMU init and on guest reset. Cap/ecap are
>>> finalized during QEMU init, and after that we don't want them to change,
>>> so the memset here skips the cap/ecap registers.
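
[Editor's note] The partial reset described above can be illustrated with a small standalone sketch. The offsets and size below are assumptions based on my reading of QEMU's intel_iommu_internal.h (CAP at 0x08, ECAP at 0x10, GCMD at 0x18, register file size 0x230), and reset_regs_keep_caps() is a hypothetical name, not a function from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed register layout; not taken from the patch itself. */
#define REG_SIZE  0x230
#define CAP_REG   0x08
#define ECAP_REG  0x10
#define GCMD_REG  0x18

/*
 * Clear every register from GCMD onward. Bytes below GCMD_REG, which
 * include the CAP and ECAP registers, are deliberately left untouched so
 * that values finalized earlier survive a guest reset.
 */
static void reset_regs_keep_caps(uint8_t *csr)
{
    assert(csr != NULL);
    memset(csr + GCMD_REG, 0, REG_SIZE - GCMD_REG);
}
```

Note that the version register also sits below GCMD and is likewise skipped by the memset; in the patch it is redefined by vtd_define_long() afterwards.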
>>>
>>>>>      s->root = 0;
>>>>>      s->root_scalable = false;
>>>>> @@ -4110,13 +4156,16 @@ static void vtd_init(IntelIOMMUState *s)
>>>>>          vtd_spte_rsvd_large[3] &= ~VTD_SPTE_SNP;
>>>>>      }
>>>>>
>>>>> -    vtd_cap_init(s);
>>>>> +    if (!s->cap_finalized) {
>>>>> +        vtd_cap_init(s);
>>>>> +        s->host_cap = s->cap;
>>>>> +        s->host_ecap = s->ecap;
>>>>> +    }
>>>>> +
>>>>>      vtd_reset_caches(s);
>>>>>
>>>>>      /* Define registers with default values and bit semantics */
>>>>>      vtd_define_long(s, DMAR_VER_REG, 0x10UL, 0, 0);
>>>>> -    vtd_define_quad(s, DMAR_CAP_REG, s->cap, 0, 0);
>>>>> -    vtd_define_quad(s, DMAR_ECAP_REG, s->ecap, 0, 0);
>>>>>      vtd_define_long(s, DMAR_GCMD_REG, 0, 0xff800000UL, 0);
>>>>>      vtd_define_long_wo(s, DMAR_GCMD_REG, 0xff800000UL);
>>>>>      vtd_define_long(s, DMAR_GSTS_REG, 0, 0, 0);
>>>>> @@ -4241,6 +4290,12 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
>>>>>      return true;
>>>>>  }
>>>>>
>>>>> +static void vtd_setup_capability_reg(IntelIOMMUState *s)
>>>>> +{
>>>>> +    vtd_define_quad(s, DMAR_CAP_REG, s->cap, 0, 0);
>>>>> +    vtd_define_quad(s, DMAR_ECAP_REG, s->ecap, 0, 0);
>>>>> +}
>>>>> +
>>>>>  static int vtd_machine_done_notify_one(Object *child, void *unused)
>>>>>  {
>>>>>      IntelIOMMUState *iommu = INTEL_IOMMU_DEVICE(x86_iommu_get_default());
>>>>> @@ -4259,6 +4314,14 @@ static int vtd_machine_done_notify_one(Object *child, void *unused)
>>>>>  static void vtd_machine_done_hook(Notifier *notifier, void *unused)
>>>>>  {
>>>>> +    IntelIOMMUState *iommu = INTEL_IOMMU_DEVICE(x86_iommu_get_default());
>>>>> +
>>>>> +    iommu->cap = iommu->host_cap;
>>>>> +    iommu->ecap = iommu->host_ecap;
>>>>> +    iommu->cap_finalized = true;
>>>> I don't think you can change the defaults like this without taking care
>>>> of compats (migration).
>>> Would bumping vtd_vmstate's .version_id work?
>>>
>>> Thanks
>>> Zhenzhong
>>>
>>>> Thanks
>>>>
>>>> Eric
>>>>> +
>>>>> +    vtd_setup_capability_reg(iommu);
>>>>> +
>>>>>      object_child_foreach_recursive(object_get_root(),
>>>>>                                     vtd_machine_done_notify_one, NULL);
>>>>>  }
>>>>> @@ -4292,6 +4355,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
>>>>>      QLIST_INIT(&s->vtd_as_with_notifiers);
>>>>>      qemu_mutex_init(&s->iommu_lock);
>>>>> +    s->cap_finalized = false;
>>>>>      memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
>>>>>                            "intel_iommu", DMAR_REG_SIZE);
>>>>>      memory_region_add_subregion(get_system_memory(),
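
[Editor's note] On the address-width check in vtd_sync_hw_info() quoted above: as far as I understand the VT-d specification, the MGAW field in CAP_REG bits 21:16 is encoded as the address width minus one, so comparing s->aw_bits directly against the raw field may reject a host whose width exactly matches. The following is a minimal sketch of a decode-then-compare check under that assumption; host_aw_bits() and aw_bits_supported() are hypothetical helpers, not part of the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: CAP_REG bits 21:16 hold MGAW, encoded as (width - 1). */
#define CAP_MGAW_SHIFT 16
#define CAP_MGAW_MASK  0x3fULL

/* Decode the host address width in bits from a raw capability value. */
static inline unsigned host_aw_bits(uint64_t cap_reg)
{
    return (unsigned)((cap_reg >> CAP_MGAW_SHIFT) & CAP_MGAW_MASK) + 1;
}

/* True if the user-configured width fits within the host's width. */
static inline bool aw_bits_supported(uint8_t user_aw_bits, uint64_t cap_reg)
{
    assert(user_aw_bits > 0);
    return user_aw_bits <= host_aw_bits(cap_reg);
}
```

With a raw field value of 47, host_aw_bits() yields 48, so a user setting of aw-bits=48 is accepted while 57 is refused.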




* RE: [PATCH rfcv1 2/6] hw/pci: introduce pci_device_set/unset_iommu_device()
  2024-01-23 10:18           ` Eric Auger
@ 2024-01-24  9:23             ` Duan, Zhenzhong
  0 siblings, 0 replies; 46+ messages in thread
From: Duan, Zhenzhong @ 2024-01-24  9:23 UTC (permalink / raw)
  To: eric.auger, Cédric Le Goater, qemu-devel
  Cc: alex.williamson, peterx, jasowang, mst, jgg, nicolinc,
	joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P,
	Yi Sun, Marcel Apfelbaum



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH rfcv1 2/6] hw/pci: introduce
>pci_device_set/unset_iommu_device()
>
>
>
>On 1/23/24 10:25, Duan, Zhenzhong wrote:
>>
>>> -----Original Message-----
>>> From: Cédric Le Goater <clg@redhat.com>
>>> Subject: Re: [PATCH rfcv1 2/6] hw/pci: introduce
>>> pci_device_set/unset_iommu_device()
>>>
>>> On 1/23/24 07:37, Duan, Zhenzhong wrote:
>>>>
>>>>> -----Original Message-----
>>>>> From: Cédric Le Goater <clg@redhat.com>
>>>>> Subject: Re: [PATCH rfcv1 2/6] hw/pci: introduce
>>>>> pci_device_set/unset_iommu_device()
>>>>>
>>>>> On 1/15/24 11:13, Zhenzhong Duan wrote:
>>>>>> From: Yi Liu <yi.l.liu@intel.com>
>>>>>>
>>>>>> This adds pci_device_set/unset_iommu_device() to set/unset
>>>>>> IOMMUFDDevice for a given PCIe device. Caller of set
>>>>>> should fail if set operation fails.
>>>>>>
>>>>>> Extract out pci_device_get_iommu_bus_devfn() to facilitate
>>>>>> implementation of pci_device_set/unset_iommu_device().
>>>>>>
>>>>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>>>>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>>>>>> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
>>>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>>>> ---
>>>>>>    include/hw/pci/pci.h | 39 ++++++++++++++++++++++++++++++++++-
>>>>>>    hw/pci/pci.c         | 49 +++++++++++++++++++++++++++++++++++++++++++-
>>>>>>    2 files changed, 86 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>>>>>> index fa6313aabc..a810c0ec74 100644
>>>>>> --- a/include/hw/pci/pci.h
>>>>>> +++ b/include/hw/pci/pci.h
>>>>>> @@ -7,6 +7,8 @@
>>>>>>    /* PCI includes legacy ISA access.  */
>>>>>>    #include "hw/isa/isa.h"
>>>>>>
>>>>>> +#include "sysemu/iommufd_device.h"
>>>>>> +
>>>>>>    extern bool pci_available;
>>>>>>
>>>>>>    /* PCI bus */
>>>>>> @@ -384,10 +386,45 @@ typedef struct PCIIOMMUOps {
>>>>>>         *
>>>>>>         * @devfn: device and function number
>>>>>>         */
>>>>>> -   AddressSpace * (*get_address_space)(PCIBus *bus, void *opaque, int devfn);
>>>>>> +    AddressSpace * (*get_address_space)(PCIBus *bus, void *opaque, int devfn);
>>>>>> +    /**
>>>>>> +     * @set_iommu_device: set iommufd device for a PCI device to vIOMMU
>>>>>> +     *
>>>>>> +     * Optional callback, if not implemented in vIOMMU, then vIOMMU can't
>>>>>> +     * utilize iommufd specific features.
>>>>>> +     *
>>>>>> +     * Return true if iommufd device is accepted, or else return false with
>>>>>> +     * errp set.
>>>>>> +     *
>>>>>> +     * @bus: the #PCIBus of the PCI device.
>>>>>> +     *
>>>>>> +     * @opaque: the data passed to pci_setup_iommu().
>>>>>> +     *
>>>>>> +     * @devfn: device and function number of the PCI device.
>>>>>> +     *
>>>>>> +     * @idev: the data structure representing iommufd device.
>>>>>> +     *
>>>>>> +     */
>>>>>> +    int (*set_iommu_device)(PCIBus *bus, void *opaque, int32_t devfn,
>>>>>> +                            IOMMUFDDevice *idev, Error **errp);
>>>>>> +    /**
>>>>>> +     * @unset_iommu_device: unset iommufd device for a PCI device from vIOMMU
>>>>>> +     *
>>>>>> +     * Optional callback.
>>>>>> +     *
>>>>>> +     * @bus: the #PCIBus of the PCI device.
>>>>>> +     *
>>>>>> +     * @opaque: the data passed to pci_setup_iommu().
>>>>>> +     *
>>>>>> +     * @devfn: device and function number of the PCI device.
>>>>>> +     */
>>>>>> +    void (*unset_iommu_device)(PCIBus *bus, void *opaque, int32_t devfn);
>>>>>>    } PCIIOMMUOps;
>>>>>>
>>>>>>    AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
>>>>>> +int pci_device_set_iommu_device(PCIDevice *dev, IOMMUFDDevice *idev,
>>>>>> +                                Error **errp);
>>>>>> +void pci_device_unset_iommu_device(PCIDevice *dev);
>>>>>>
>>>>>>    /**
>>>>>>     * pci_setup_iommu: Initialize specific IOMMU handlers for a PCIBus
>>>>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>>>>> index 76080af580..3848662f95 100644
>>>>>> --- a/hw/pci/pci.c
>>>>>> +++ b/hw/pci/pci.c
>>>>>> @@ -2672,7 +2672,10 @@ static void pci_device_class_base_init(ObjectClass *klass, void *data)
>>>>>>        }
>>>>>>    }
>>>>>>
>>>>>> -AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>>>>>> +static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
>>>>>> +                                           PCIBus **aliased_pbus,
>>>>>> +                                           PCIBus **piommu_bus,
>>>>>> +                                           uint8_t *aliased_pdevfn)
>>>>>>    {
>>>>>>        PCIBus *bus = pci_get_bus(dev);
>>>>>>        PCIBus *iommu_bus = bus;
>>>>>> @@ -2717,6 +2720,18 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>>>>>>            iommu_bus = parent_bus;
>>>>>>        }
>>>>>> +    *aliased_pbus = bus;
>>>>>> +    *piommu_bus = iommu_bus;
>>>>>> +    *aliased_pdevfn = devfn;
>>>>>> +}
>>>>>> +
>>>>>> +AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>>>>>> +{
>>>>>> +    PCIBus *bus;
>>>>>> +    PCIBus *iommu_bus;
>>>>>> +    uint8_t devfn;
>>>>>> +
>>>>>> +    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
>>>>>>        if (!pci_bus_bypass_iommu(bus) && iommu_bus->iommu_ops) {
>>>>>>            return iommu_bus->iommu_ops->get_address_space(bus,
>>>>>>                                     iommu_bus->iommu_opaque, devfn);
>>>>>> @@ -2724,6 +2739,38 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>>>>>>        return &address_space_memory;
>>>>>>    }
>>>>>>
>>>>>> +int pci_device_set_iommu_device(PCIDevice *dev, IOMMUFDDevice *idev,
>>>>>> +                                Error **errp)
>>>>>> +{
>>>>>> +    PCIBus *bus;
>>>>>> +    PCIBus *iommu_bus;
>>>>>> +    uint8_t devfn;
>>>>>> +
>>>>>> +    pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
>>>>>> +    if (!pci_bus_bypass_iommu(bus) && iommu_bus &&
>>>>> Why do we test iommu_bus in the pci_device_un/set_iommu_device routines
>>>>> and not in pci_device_iommu_address_space()?
>>>> iommu_bus check in pci_device_iommu_address_space() is dropped in
>>>> below commit, I didn't find related discussion in mail history, maybe
>>>> by accident? I can add it back if it's not intentional.
>>> Can iommu_bus be NULL or should we add an assert ?
>> I dug into the history of pci_device_iommu_address_space(); the commit
>> below added the iommu_bus check.
>>
>> 5af2ae230514  pci: Fix pci_device_iommu_address_space() bus propagation
>>
>> In theory, !iommu_bus->parent_dev takes precedence over !iommu_bus,
>> so we never see a NULL iommu_bus; an assert may be better.
>
>I think we had such a discussion in
>https://www.mail-archive.com/qemu-devel@nongnu.org/msg994766.html
>But maybe this was related to a different call place. I remember I
>challenged the check at some point

It seems this question was not discussed further in that thread.
Per my code inspection, the PCI root bus's parent_dev should be NULL, so we
always end up with either the root bus or a sub bus, never NULL.
I also tested with a PXB bridge, which looked like the suspicious scenario,
and got the same result.

Thanks
Zhenzhong
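
[Editor's note] The reasoning above can be checked against a toy model of the upward bus walk. This is only a sketch: ToyBus, ToyDev and toy_find_iommu_bus() are hypothetical stand-ins for the QEMU types and loop, and it assumes every bridge device itself sits on a bus.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyBus ToyBus;

typedef struct ToyDev {
    ToyBus *bus;             /* bus this (bridge) device sits on */
} ToyDev;

struct ToyBus {
    ToyDev *parent_dev;      /* NULL on the root bus */
    const void *iommu_ops;   /* non-NULL if IOMMU ops are registered here */
};

/*
 * Walk towards the root until a bus with IOMMU ops is found. The walk
 * stops at the root bus because its parent_dev is NULL, so the result is
 * always a valid bus (possibly one without ops), never NULL.
 */
static ToyBus *toy_find_iommu_bus(ToyBus *bus)
{
    assert(bus != NULL);
    while (!bus->iommu_ops && bus->parent_dev) {
        bus = bus->parent_dev->bus;
    }
    return bus;
}
```

Callers still have to test the returned bus's iommu_ops, since the walk can end on a root bus that has none; only the NULL test on the bus pointer itself is redundant in this model.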


* RE: [PATCH rfcv1 4/6] vfio: initialize IOMMUFDDevice and pass to vIOMMU
  2024-01-23 12:54       ` Cédric Le Goater
@ 2024-01-24  9:26         ` Duan, Zhenzhong
  0 siblings, 0 replies; 46+ messages in thread
From: Duan, Zhenzhong @ 2024-01-24  9:26 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: alex.williamson, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y,
	Peng, Chao P, Yi Sun



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH rfcv1 4/6] vfio: initialize IOMMUFDDevice and pass to
>vIOMMU
>
>On 1/23/24 10:46, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Cédric Le Goater <clg@redhat.com>
>>> Subject: Re: [PATCH rfcv1 4/6] vfio: initialize IOMMUFDDevice and pass to vIOMMU
>>>
>>> On 1/15/24 11:13, Zhenzhong Duan wrote:
>>>> Initialize IOMMUFDDevice in vfio and pass to vIOMMU, so that vIOMMU
>>>> could get hw IOMMU information.
>>>>
>>>> In VFIO legacy backend mode, we still pass a NULL IOMMUFDDevice to vIOMMU,
>>>> in case vIOMMU needs some processing for VFIO legacy backend mode.
>>>>
>>>> Originally-by: Yi Liu <yi.l.liu@intel.com>
>>>> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
>>>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>> ---
>>>>    include/hw/vfio/vfio-common.h |  2 ++
>>>>    hw/vfio/iommufd.c             |  2 ++
>>>>    hw/vfio/pci.c                 | 24 +++++++++++++++++++-----
>>>>    3 files changed, 23 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>>> index 9b7ef7d02b..fde0d0ca60 100644
>>>> --- a/include/hw/vfio/vfio-common.h
>>>> +++ b/include/hw/vfio/vfio-common.h
>>>> @@ -31,6 +31,7 @@
>>>>    #endif
>>>>    #include "sysemu/sysemu.h"
>>>>    #include "hw/vfio/vfio-container-base.h"
>>>> +#include "sysemu/iommufd_device.h"
>>>>
>>>>    #define VFIO_MSG_PREFIX "vfio %s: "
>>>>
>>>> @@ -126,6 +127,7 @@ typedef struct VFIODevice {
>>>>        bool dirty_tracking;
>>>>        int devid;
>>>>        IOMMUFDBackend *iommufd;
>>>> +    IOMMUFDDevice idev;
>>>>    } VFIODevice;
>>>>
>>>>    struct VFIODeviceOps {
>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>>> index 9bfddc1360..cbd035f148 100644
>>>> --- a/hw/vfio/iommufd.c
>>>> +++ b/hw/vfio/iommufd.c
>>>> @@ -309,6 +309,7 @@ static int iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>>>>        VFIOContainerBase *bcontainer;
>>>>        VFIOIOMMUFDContainer *container;
>>>>        VFIOAddressSpace *space;
>>>> +    IOMMUFDDevice *idev = &vbasedev->idev;
>>>>        struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>>>>        int ret, devfd;
>>>>        uint32_t ioas_id;
>>>> @@ -428,6 +429,7 @@ found_container:
>>>>        QLIST_INSERT_HEAD(&bcontainer->device_list, vbasedev, container_next);
>>>>        QLIST_INSERT_HEAD(&vfio_device_list, vbasedev, global_next);
>>>>
>>>> +    iommufd_device_init(idev, sizeof(*idev), container->be, vbasedev->devid);
>>>>        trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev->num_irqs,
>>>>                                       vbasedev->num_regions, vbasedev->flags);
>>>>        return 0;
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index d7fe06715c..2c3a5d267b 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -3107,11 +3107,21 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>>>>
>>>>        vfio_bars_register(vdev);
>>>>
>>>> -    ret = vfio_add_capabilities(vdev, errp);
>>>> +    if (vbasedev->iommufd) {
>>>> +        ret = pci_device_set_iommu_device(pdev, &vbasedev->idev, errp);
>>>> +    } else {
>>>> +        ret = pci_device_set_iommu_device(pdev, 0, errp);
>>>
>>>
>>> AFAICT, pci_device_set_iommu_device() with a NULL IOMMUFDDevice will do
>>> nothing. Why call it?
>>
>> We will do something in nesting series, see
>> https://github.com/yiliu1765/qemu/commit/7f0bb59575bb5cf38618ae950f68a8661676e881
>
>ok, that's not much. idev is used as a capability bool and later on
>to pass the /dev/iommu fd.  We don't need to support the legacy mode ?

It's better to have it for legacy mode too, especially once we support
address width 57 in the QEMU intel_iommu in the future.

>
>> Another choice is to call pci_device_set_iommu_device() no matter which
>> backend is used and check idev->iommufd in vtd_dev_set_iommu_device().
>> Is this better for you?
>
>yes. Should be fine. There is more to it though.
>
>IIUC, what will determine most of the requirements is the legacy
>mode. We also need the host iommu info in that case. As Eric said,
>ideally, we should introduce a common abstract "host-iommu-info" struct
>and sub structs associated with the iommu backends (iommufd + legacy)
>which would be allocated accordingly.

I see, I'll make an rfcv2 as you and Eric suggested and discuss further
with Eric what elements he needs in the legacy sub structs.

>
>So, IOMMUFDDevice usage should be limited to the iommufd files. All PCI
>files should use the common abstract type. We should define these data
>structures first. They could be simple C struct for now. We will see if
>QOM applies after.

Got it.

Thanks
Zhenzhong
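
[Editor's note] The layering suggested above (a common abstract "host-iommu-info" struct with per-backend sub structs, as plain C for now) could look roughly like the sketch below. All type, field and function names are hypothetical and chosen only for illustration; nothing here is taken from a later revision of the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum HostIOMMUKind {
    HOST_IOMMU_LEGACY,
    HOST_IOMMU_IOMMUFD,
} HostIOMMUKind;

/* Backend-agnostic part: the only type PCI and vIOMMU code would see. */
typedef struct HostIOMMUInfo {
    HostIOMMUKind kind;
    uint8_t aw_bits;         /* host address width, common to all backends */
} HostIOMMUInfo;

/* iommufd-specific sub struct; intended to stay private to iommufd files. */
typedef struct HostIOMMUInfoIOMMUFD {
    HostIOMMUInfo base;
    int iommufd;             /* placeholder for the backend handle */
    uint32_t dev_id;
} HostIOMMUInfoIOMMUFD;

/* Legacy (VFIO container) sub struct. */
typedef struct HostIOMMUInfoLegacy {
    HostIOMMUInfo base;
    int container_fd;        /* placeholder for the legacy container */
} HostIOMMUInfoLegacy;

/* Common accessor that works for either backend. */
static uint8_t host_iommu_aw_bits(const HostIOMMUInfo *info)
{
    assert(info != NULL);
    return info->aw_bits;
}

/* Checked downcast: returns NULL unless the object is iommufd-backed. */
static HostIOMMUInfoIOMMUFD *host_iommu_as_iommufd(HostIOMMUInfo *info)
{
    if (!info || info->kind != HOST_IOMMU_IOMMUFD) {
        return NULL;
    }
    return (HostIOMMUInfoIOMMUFD *)((char *)info -
                                    offsetof(HostIOMMUInfoIOMMUFD, base));
}
```

In this arrangement a vIOMMU callback would take a HostIOMMUInfo pointer and only downcast when it needs iommufd-specific operations, which keeps IOMMUFDDevice-style details out of the PCI headers.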


end of thread (newest message: 2024-01-24  9:26 UTC)

Thread overview: 46+ messages
2024-01-15 10:13 [PATCH rfcv1 0/6] Check and sync host IOMMU cap/ecap with vIOMMU Zhenzhong Duan
2024-01-15 10:13 ` [PATCH rfcv1 1/6] backends/iommufd_device: introduce IOMMUFDDevice Zhenzhong Duan
2024-01-17 14:11   ` Eric Auger
2024-01-18  2:58     ` Duan, Zhenzhong
2024-01-18 12:42   ` Eric Auger
2024-01-19  7:31     ` Duan, Zhenzhong
2024-01-22 16:25       ` Cédric Le Goater
2024-01-23  5:51         ` Duan, Zhenzhong
2024-01-23 10:10       ` Eric Auger
2024-01-15 10:13 ` [PATCH rfcv1 2/6] hw/pci: introduce pci_device_set/unset_iommu_device() Zhenzhong Duan
2024-01-17 14:11   ` Eric Auger
2024-01-18  7:58     ` Duan, Zhenzhong
2024-01-22 16:55   ` Cédric Le Goater
2024-01-23  6:37     ` Duan, Zhenzhong
2024-01-23  7:40       ` Cédric Le Goater
2024-01-23  9:25         ` Duan, Zhenzhong
2024-01-23 10:18           ` Eric Auger
2024-01-24  9:23             ` Duan, Zhenzhong
2024-01-15 10:13 ` [PATCH rfcv1 3/6] intel_iommu: add set/unset_iommu_device callback Zhenzhong Duan
2024-01-17 15:44   ` Eric Auger
2024-01-18  8:43     ` Duan, Zhenzhong
2024-01-18 12:34       ` Eric Auger
2024-01-19  7:27         ` Duan, Zhenzhong
2024-01-22 17:09   ` Cédric Le Goater
2024-01-23  9:46     ` Duan, Zhenzhong
2024-01-15 10:13 ` [PATCH rfcv1 4/6] vfio: initialize IOMMUFDDevice and pass to vIOMMU Zhenzhong Duan
2024-01-17 15:37   ` Joao Martins
2024-01-18  8:17     ` Duan, Zhenzhong
2024-01-18 10:17       ` Yi Liu
2024-01-18 10:20         ` Joao Martins
2024-01-17 17:30   ` Eric Auger
2024-01-18  9:23     ` Duan, Zhenzhong
2024-01-22 17:15   ` Cédric Le Goater
2024-01-23  9:46     ` Duan, Zhenzhong
2024-01-23 12:54       ` Cédric Le Goater
2024-01-24  9:26         ` Duan, Zhenzhong
2024-01-15 10:13 ` [PATCH rfcv1 5/6] intel_iommu: extract out vtd_cap_init to initialize cap/ecap Zhenzhong Duan
2024-01-17 17:36   ` Eric Auger
2024-01-15 10:13 ` [PATCH rfcv1 6/6] intel_iommu: add a framework to check and sync host IOMMU cap/ecap Zhenzhong Duan
2024-01-17 17:56   ` Eric Auger
2024-01-18  9:30     ` Duan, Zhenzhong
2024-01-18 12:40       ` Eric Auger
2024-01-19 11:55         ` Duan, Zhenzhong
2024-01-23 13:10           ` Eric Auger
2024-01-23  8:39   ` Cédric Le Goater
2024-01-23 10:01     ` Duan, Zhenzhong
