All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v1 0/6] Check and sync host IOMMU cap/ecap with vIOMMU
@ 2024-02-28  9:44 Zhenzhong Duan
  2024-02-28  9:44 ` [PATCH v1 1/6] intel_iommu: Add set/unset_iommu_device callback Zhenzhong Duan
                   ` (6 more replies)
  0 siblings, 7 replies; 22+ messages in thread
From: Zhenzhong Duan @ 2024-02-28  9:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Zhenzhong Duan

Hi,

Based on Joao's suggestion, the iommufd nesting prerequisite series [1]
is further splitted to host IOMMU device abstract part [2] and vIOMMU
check/sync part. This series implements the 2nd part.

This enables vIOMMU to get host IOMMU cap/ecap information by implementing
a new set/unset_iommu_device interface, then vIOMMU could check or sync
with vIOMMU's own cap/ecap config.

It works by having device side, i.e. VFIO, register either an IOMMULegacyDevice
or IOMMUFDDevice to vIOMMU, which includes necessary data to archive that.
Currently only VFIO device is supported, but it could also be used for other
devices, i.e., VDPA.

For coldplugged device, we can get its host IOMMU cap/ecap during qemu init,
then check and sync into vIOMMU cap/ecap.
For hotplugged device, vIOMMU cap/ecap is frozen, we could only check with
vIOMMU cap/ecap, not allowed to update. If check fails, hotplugged will fail.

This is also a prerequisite for incoming iommufd nesting series:
'intel_iommu: Enable stage-1 translation'.

I didn't implement cap/ecap sync for legacy VFIO backend, would like to see
what Eric want to put in IOMMULegacyDevice for virtio-iommu and if I can
utilize some of them.

Because it's becoming clear on community's suggestion, I'd like to remove
rfc tag from this version.

Qemu code can be found at:
https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting_preq_part2_v1

[1] https://lore.kernel.org/qemu-devel/20240201072818.327930-1-zhenzhong.duan@intel.com
[2] https://lists.gnu.org/archive/html/qemu-devel/2024-02/msg06314.html

Thanks
Zhenzhong

Changelog:
v1:
- convert HostIOMMUDevice to sub object pointer in vtd_check_hdev

rfcv2:
- introduce common abstract HostIOMMUDevice and sub struct for different BEs (Eric, Cédric)
- remove iommufd_device.[ch] (Cédric)
- remove duplicate iommufd/devid define from VFIODevice (Eric)
- drop the p in aliased_pbus and aliased_pdevfn (Eric)
- assert devfn and iommu_bus in pci_device_get_iommu_bus_devfn (Cédric, Eric)
- use errp in iommufd_device_get_info (Eric)
- split and simplify cap/ecap check/sync code in intel_iommu.c (Cédric)
- move VTDHostIOMMUDevice declaration to intel_iommu_internal.h (Cédric)
- make '(vtd->cap_reg >> 16) & 0x3fULL' a MACRO and add missed '+1' (Cédric)
- block migration if vIOMMU cap/ecap updated based on host IOMMU cap/ecap
- add R-B


Yi Liu (2):
  intel_iommu: Add set/unset_iommu_device callback
  intel_iommu: Add a framework to check and sync host IOMMU cap/ecap

Zhenzhong Duan (4):
  intel_iommu: Extract out vtd_cap_init to initialize cap/ecap
  intel_iommu: Implement check and sync mechanism in iommufd mode
  intel_iommu: Use mgaw instead of s->aw_bits
  intel_iommu: Block migration if cap is updated

 hw/i386/intel_iommu_internal.h |   9 ++
 include/hw/i386/intel_iommu.h  |   4 +
 hw/i386/acpi-build.c           |   3 +-
 hw/i386/intel_iommu.c          | 287 ++++++++++++++++++++++++++-------
 4 files changed, 245 insertions(+), 58 deletions(-)

-- 
2.34.1



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v1 1/6] intel_iommu: Add set/unset_iommu_device callback
  2024-02-28  9:44 [PATCH v1 0/6] Check and sync host IOMMU cap/ecap with vIOMMU Zhenzhong Duan
@ 2024-02-28  9:44 ` Zhenzhong Duan
  2024-02-28  9:44 ` [PATCH v1 2/6] intel_iommu: Extract out vtd_cap_init to initialize cap/ecap Zhenzhong Duan
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: Zhenzhong Duan @ 2024-02-28  9:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Yi Sun, Zhenzhong Duan, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost

From: Yi Liu <yi.l.liu@intel.com>

This adds set/unset_iommu_device() implementation in Intel vIOMMU.
In set call, a pointer to host IOMMU device info is stored in hash
table indexed by PCI BDF.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  8 ++++
 include/hw/i386/intel_iommu.h  |  2 +
 hw/i386/intel_iommu.c          | 74 ++++++++++++++++++++++++++++++++++
 3 files changed, 84 insertions(+)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index f8cf99bddf..becafd03c1 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -537,4 +537,12 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_SL_IGN_COM              0xbff0000000000000ULL
 #define VTD_SL_TM                   (1ULL << 62)
 
+
+typedef struct VTDHostIOMMUDevice {
+    IntelIOMMUState *iommu_state;
+    PCIBus *bus;
+    uint8_t devfn;
+    HostIOMMUDevice *dev;
+    QLIST_ENTRY(VTDHostIOMMUDevice) next;
+} VTDHostIOMMUDevice;
 #endif
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 7fa0a695c8..bbc7b96add 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -292,6 +292,8 @@ struct IntelIOMMUState {
     /* list of registered notifiers */
     QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
 
+    GHashTable *vtd_host_iommu_dev;             /* VTDHostIOMMUDevice */
+
     /* interrupt remapping */
     bool intr_enabled;              /* Whether guest enabled IR */
     dma_addr_t intr_root;           /* Interrupt remapping table pointer */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 1a07faddb4..9b62441439 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -237,6 +237,13 @@ static gboolean vtd_as_equal(gconstpointer v1, gconstpointer v2)
            (key1->pasid == key2->pasid);
 }
 
+static gboolean vtd_as_idev_equal(gconstpointer v1, gconstpointer v2)
+{
+    const struct vtd_as_key *key1 = v1;
+    const struct vtd_as_key *key2 = v2;
+
+    return (key1->bus == key2->bus) && (key1->devfn == key2->devfn);
+}
 /*
  * Note that we use pointer to PCIBus as the key, so hashing/shifting
  * based on the pointer value is intended. Note that we deal with
@@ -3812,6 +3819,68 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
     return vtd_dev_as;
 }
 
+static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
+                                    HostIOMMUDevice *base_dev, Error **errp)
+{
+    IntelIOMMUState *s = opaque;
+    VTDHostIOMMUDevice *vtd_hdev;
+    struct vtd_as_key key = {
+        .bus = bus,
+        .devfn = devfn,
+    };
+    struct vtd_as_key *new_key;
+
+    assert(base_dev);
+
+    vtd_iommu_lock(s);
+
+    vtd_hdev = g_hash_table_lookup(s->vtd_host_iommu_dev, &key);
+
+    if (vtd_hdev) {
+        error_setg(errp, "IOMMUFD device already exist");
+        vtd_iommu_unlock(s);
+        return -EEXIST;
+    }
+
+    vtd_hdev = g_malloc0(sizeof(VTDHostIOMMUDevice));
+    vtd_hdev->bus = bus;
+    vtd_hdev->devfn = (uint8_t)devfn;
+    vtd_hdev->iommu_state = s;
+    vtd_hdev->dev = base_dev;
+
+    new_key = g_malloc(sizeof(*new_key));
+    new_key->bus = bus;
+    new_key->devfn = devfn;
+
+    g_hash_table_insert(s->vtd_host_iommu_dev, new_key, vtd_hdev);
+
+    vtd_iommu_unlock(s);
+
+    return 0;
+}
+
+static void vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
+{
+    IntelIOMMUState *s = opaque;
+    VTDHostIOMMUDevice *vtd_hdev;
+    struct vtd_as_key key = {
+        .bus = bus,
+        .devfn = devfn,
+    };
+
+    vtd_iommu_lock(s);
+
+    vtd_hdev = g_hash_table_lookup(s->vtd_host_iommu_dev, &key);
+    if (!vtd_hdev) {
+        vtd_iommu_unlock(s);
+        return;
+    }
+
+    g_hash_table_remove(s->vtd_host_iommu_dev, &key);
+
+    vtd_iommu_unlock(s);
+}
+
 /* Unmap the whole range in the notifier's scope. */
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
 {
@@ -4107,6 +4176,8 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
 
 static PCIIOMMUOps vtd_iommu_ops = {
     .get_address_space = vtd_host_dma_iommu,
+    .set_iommu_device = vtd_dev_set_iommu_device,
+    .unset_iommu_device = vtd_dev_unset_iommu_device,
 };
 
 static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
@@ -4230,6 +4301,9 @@ static void vtd_realize(DeviceState *dev, Error **errp)
                                      g_free, g_free);
     s->vtd_address_spaces = g_hash_table_new_full(vtd_as_hash, vtd_as_equal,
                                       g_free, g_free);
+    s->vtd_host_iommu_dev = g_hash_table_new_full(vtd_as_hash,
+                                                  vtd_as_idev_equal,
+                                                  g_free, g_free);
     vtd_init(s);
     pci_setup_iommu(bus, &vtd_iommu_ops, dev);
     /* Pseudo address space under root PCI bus. */
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v1 2/6] intel_iommu: Extract out vtd_cap_init to initialize cap/ecap
  2024-02-28  9:44 [PATCH v1 0/6] Check and sync host IOMMU cap/ecap with vIOMMU Zhenzhong Duan
  2024-02-28  9:44 ` [PATCH v1 1/6] intel_iommu: Add set/unset_iommu_device callback Zhenzhong Duan
@ 2024-02-28  9:44 ` Zhenzhong Duan
  2024-02-28  9:44 ` [PATCH v1 3/6] intel_iommu: Add a framework to check and sync host IOMMU cap/ecap Zhenzhong Duan
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: Zhenzhong Duan @ 2024-02-28  9:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

This is a prerequisite for host cap/ecap sync.

No functional change intended.

Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu.c | 93 ++++++++++++++++++++++++-------------------
 1 file changed, 51 insertions(+), 42 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 9b62441439..ffa1ad6429 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -4003,30 +4003,10 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
     return;
 }
 
-/* Do the initialization. It will also be called when reset, so pay
- * attention when adding new initialization stuff.
- */
-static void vtd_init(IntelIOMMUState *s)
+static void vtd_cap_init(IntelIOMMUState *s)
 {
     X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
 
-    memset(s->csr, 0, DMAR_REG_SIZE);
-    memset(s->wmask, 0, DMAR_REG_SIZE);
-    memset(s->w1cmask, 0, DMAR_REG_SIZE);
-    memset(s->womask, 0, DMAR_REG_SIZE);
-
-    s->root = 0;
-    s->root_scalable = false;
-    s->dmar_enabled = false;
-    s->intr_enabled = false;
-    s->iq_head = 0;
-    s->iq_tail = 0;
-    s->iq = 0;
-    s->iq_size = 0;
-    s->qi_enabled = false;
-    s->iq_last_desc_type = VTD_INV_DESC_NONE;
-    s->iq_dw = false;
-    s->next_frcd_reg = 0;
     s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
              VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
              VTD_CAP_MGAW(s->aw_bits);
@@ -4043,27 +4023,6 @@ static void vtd_init(IntelIOMMUState *s)
     }
     s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
 
-    /*
-     * Rsvd field masks for spte
-     */
-    vtd_spte_rsvd[0] = ~0ULL;
-    vtd_spte_rsvd[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->aw_bits,
-                                                  x86_iommu->dt_supported);
-    vtd_spte_rsvd[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->aw_bits);
-    vtd_spte_rsvd[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->aw_bits);
-    vtd_spte_rsvd[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->aw_bits);
-
-    vtd_spte_rsvd_large[2] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->aw_bits,
-                                                         x86_iommu->dt_supported);
-    vtd_spte_rsvd_large[3] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->aw_bits,
-                                                         x86_iommu->dt_supported);
-
-    if (s->scalable_mode || s->snoop_control) {
-        vtd_spte_rsvd[1] &= ~VTD_SPTE_SNP;
-        vtd_spte_rsvd_large[2] &= ~VTD_SPTE_SNP;
-        vtd_spte_rsvd_large[3] &= ~VTD_SPTE_SNP;
-    }
-
     if (x86_iommu_ir_supported(x86_iommu)) {
         s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
         if (s->intr_eim == ON_OFF_AUTO_ON) {
@@ -4096,6 +4055,56 @@ static void vtd_init(IntelIOMMUState *s)
     if (s->pasid) {
         s->ecap |= VTD_ECAP_PASID;
     }
+}
+
+/*
+ * Do the initialization. It will also be called when reset, so pay
+ * attention when adding new initialization stuff.
+ */
+static void vtd_init(IntelIOMMUState *s)
+{
+    X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
+
+    memset(s->csr, 0, DMAR_REG_SIZE);
+    memset(s->wmask, 0, DMAR_REG_SIZE);
+    memset(s->w1cmask, 0, DMAR_REG_SIZE);
+    memset(s->womask, 0, DMAR_REG_SIZE);
+
+    s->root = 0;
+    s->root_scalable = false;
+    s->dmar_enabled = false;
+    s->intr_enabled = false;
+    s->iq_head = 0;
+    s->iq_tail = 0;
+    s->iq = 0;
+    s->iq_size = 0;
+    s->qi_enabled = false;
+    s->iq_last_desc_type = VTD_INV_DESC_NONE;
+    s->iq_dw = false;
+    s->next_frcd_reg = 0;
+
+    vtd_cap_init(s);
+
+    /*
+     * Rsvd field masks for spte
+     */
+    vtd_spte_rsvd[0] = ~0ULL;
+    vtd_spte_rsvd[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->aw_bits,
+                                                  x86_iommu->dt_supported);
+    vtd_spte_rsvd[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->aw_bits);
+    vtd_spte_rsvd[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->aw_bits);
+    vtd_spte_rsvd[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->aw_bits);
+
+    vtd_spte_rsvd_large[2] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->aw_bits,
+                                                    x86_iommu->dt_supported);
+    vtd_spte_rsvd_large[3] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->aw_bits,
+                                                    x86_iommu->dt_supported);
+
+    if (s->scalable_mode || s->snoop_control) {
+        vtd_spte_rsvd[1] &= ~VTD_SPTE_SNP;
+        vtd_spte_rsvd_large[2] &= ~VTD_SPTE_SNP;
+        vtd_spte_rsvd_large[3] &= ~VTD_SPTE_SNP;
+    }
 
     vtd_reset_caches(s);
 
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v1 3/6] intel_iommu: Add a framework to check and sync host IOMMU cap/ecap
  2024-02-28  9:44 [PATCH v1 0/6] Check and sync host IOMMU cap/ecap with vIOMMU Zhenzhong Duan
  2024-02-28  9:44 ` [PATCH v1 1/6] intel_iommu: Add set/unset_iommu_device callback Zhenzhong Duan
  2024-02-28  9:44 ` [PATCH v1 2/6] intel_iommu: Extract out vtd_cap_init to initialize cap/ecap Zhenzhong Duan
@ 2024-02-28  9:44 ` Zhenzhong Duan
  2024-03-12 17:03   ` Michael S. Tsirkin
  2024-02-28  9:44 ` [PATCH v1 4/6] intel_iommu: Implement check and sync mechanism in iommufd mode Zhenzhong Duan
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Zhenzhong Duan @ 2024-02-28  9:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Yi Sun, Zhenzhong Duan, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost, Marcel Apfelbaum

From: Yi Liu <yi.l.liu@intel.com>

Add a framework to check and synchronize host IOMMU cap/ecap with
vIOMMU cap/ecap.

The sequence will be:

vtd_cap_init() initializes iommu->cap/ecap.
vtd_check_hdev() update iommu->cap/ecap based on host cap/ecap.
iommu->cap_frozen set when machine create done, iommu->cap/ecap become readonly.

Implementation details for different backends will be in following patches.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/hw/i386/intel_iommu.h |  1 +
 hw/i386/intel_iommu.c         | 50 ++++++++++++++++++++++++++++++++++-
 2 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index bbc7b96add..c71a133820 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -283,6 +283,7 @@ struct IntelIOMMUState {
 
     uint64_t cap;                   /* The value of capability reg */
     uint64_t ecap;                  /* The value of extended capability reg */
+    bool cap_frozen;                /* cap/ecap become read-only after frozen */
 
     uint32_t context_cache_gen;     /* Should be in [1,MAX] */
     GHashTable *iotlb;              /* IOTLB */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index ffa1ad6429..a9f9dfd6a7 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -35,6 +35,8 @@
 #include "sysemu/kvm.h"
 #include "sysemu/dma.h"
 #include "sysemu/sysemu.h"
+#include "hw/vfio/vfio-common.h"
+#include "sysemu/iommufd.h"
 #include "hw/i386/apic_internal.h"
 #include "kvm/kvm_i386.h"
 #include "migration/vmstate.h"
@@ -3819,6 +3821,38 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
     return vtd_dev_as;
 }
 
+static int vtd_check_legacy_hdev(IntelIOMMUState *s,
+                                 IOMMULegacyDevice *ldev,
+                                 Error **errp)
+{
+    return 0;
+}
+
+static int vtd_check_iommufd_hdev(IntelIOMMUState *s,
+                                  IOMMUFDDevice *idev,
+                                  Error **errp)
+{
+    return 0;
+}
+
+static int vtd_check_hdev(IntelIOMMUState *s, VTDHostIOMMUDevice *vtd_hdev,
+                          Error **errp)
+{
+    HostIOMMUDevice *base_dev = vtd_hdev->dev;
+    IOMMUFDDevice *idev;
+
+    if (base_dev->type == HID_LEGACY) {
+        IOMMULegacyDevice *ldev = container_of(base_dev,
+                                               IOMMULegacyDevice, base);
+
+        return vtd_check_legacy_hdev(s, ldev, errp);
+    }
+
+    idev = container_of(base_dev, IOMMUFDDevice, base);
+
+    return vtd_check_iommufd_hdev(s, idev, errp);
+}
+
 static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
                                     HostIOMMUDevice *base_dev, Error **errp)
 {
@@ -3829,6 +3863,7 @@ static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
         .devfn = devfn,
     };
     struct vtd_as_key *new_key;
+    int ret;
 
     assert(base_dev);
 
@@ -3848,6 +3883,13 @@ static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
     vtd_hdev->iommu_state = s;
     vtd_hdev->dev = base_dev;
 
+    ret = vtd_check_hdev(s, vtd_hdev, errp);
+    if (ret) {
+        g_free(vtd_hdev);
+        vtd_iommu_unlock(s);
+        return ret;
+    }
+
     new_key = g_malloc(sizeof(*new_key));
     new_key->bus = bus;
     new_key->devfn = devfn;
@@ -4083,7 +4125,9 @@ static void vtd_init(IntelIOMMUState *s)
     s->iq_dw = false;
     s->next_frcd_reg = 0;
 
-    vtd_cap_init(s);
+    if (!s->cap_frozen) {
+        vtd_cap_init(s);
+    }
 
     /*
      * Rsvd field masks for spte
@@ -4254,6 +4298,10 @@ static int vtd_machine_done_notify_one(Object *child, void *unused)
 
 static void vtd_machine_done_hook(Notifier *notifier, void *unused)
 {
+    IntelIOMMUState *iommu = INTEL_IOMMU_DEVICE(x86_iommu_get_default());
+
+    iommu->cap_frozen = true;
+
     object_child_foreach_recursive(object_get_root(),
                                    vtd_machine_done_notify_one, NULL);
 }
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v1 4/6] intel_iommu: Implement check and sync mechanism in iommufd mode
  2024-02-28  9:44 [PATCH v1 0/6] Check and sync host IOMMU cap/ecap with vIOMMU Zhenzhong Duan
                   ` (2 preceding siblings ...)
  2024-02-28  9:44 ` [PATCH v1 3/6] intel_iommu: Add a framework to check and sync host IOMMU cap/ecap Zhenzhong Duan
@ 2024-02-28  9:44 ` Zhenzhong Duan
  2024-02-28  9:44 ` [PATCH v1 5/6] intel_iommu: Use mgaw instead of s->aw_bits Zhenzhong Duan
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: Zhenzhong Duan @ 2024-02-28  9:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Zhenzhong Duan, Yi Sun, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost, Marcel Apfelbaum

We use cap_frozen to mark cap/ecap read/writable or read-only,
At init stage, we allow to update cap/ecap based on host IOMMU
cap/ecap, but when machine create done, cap_frozen is set and
we only allow checking cap/ecap for compatibility.

Currently only stage-2 translation is supported which is backed by
shadow page table on host side. So we don't need exact matching of
each bit of cap/ecap between vIOMMU and host. However, we can still
ensure compatibility of host and vIOMMU's address width at least,
i.e., vIOMMU's mgaw <= host IOMMU mgaw, which is missed before.

When stage-1 translation is supported in future, a.k.a. scalable
modern mode, this mechanism will be further extended to check more
bits.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  1 +
 include/hw/i386/intel_iommu.h  |  1 +
 hw/i386/intel_iommu.c          | 28 ++++++++++++++++++++++++++++
 3 files changed, 30 insertions(+)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index becafd03c1..72a5cb0859 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -204,6 +204,7 @@
 #define VTD_DOMAIN_ID_MASK          ((1UL << VTD_DOMAIN_ID_SHIFT) - 1)
 #define VTD_CAP_ND                  (((VTD_DOMAIN_ID_SHIFT - 4) / 2) & 7ULL)
 #define VTD_ADDRESS_SIZE(aw)        (1ULL << (aw))
+#define VTD_CAP_MGAW_MASK           (0x3fULL << 16)
 #define VTD_CAP_MGAW(aw)            ((((aw) - 1) & 0x3fULL) << 16)
 #define VTD_MAMV                    18ULL
 #define VTD_CAP_MAMV                (VTD_MAMV << 48)
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index c71a133820..a0b530ebc6 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -47,6 +47,7 @@ OBJECT_DECLARE_SIMPLE_TYPE(IntelIOMMUState, INTEL_IOMMU_DEVICE)
 #define VTD_HOST_AW_48BIT           48
 #define VTD_HOST_ADDRESS_WIDTH      VTD_HOST_AW_39BIT
 #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
+#define VTD_MGAW_FROM_CAP(cap)      (((cap >> 16) & 0x3fULL) + 1)
 
 #define DMAR_REPORT_F_INTR          (1)
 
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index a9f9dfd6a7..2a55268538 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3832,6 +3832,34 @@ static int vtd_check_iommufd_hdev(IntelIOMMUState *s,
                                   IOMMUFDDevice *idev,
                                   Error **errp)
 {
+    struct iommu_hw_info_vtd vtd;
+    enum iommu_hw_info_type type = IOMMU_HW_INFO_TYPE_INTEL_VTD;
+    long host_mgaw, viommu_mgaw = VTD_MGAW_FROM_CAP(s->cap);
+    uint64_t tmp_cap = s->cap;
+    int ret;
+
+    ret = iommufd_device_get_info(idev, &type, sizeof(vtd), &vtd, errp);
+    if (ret) {
+        return ret;
+    }
+
+    if (type != IOMMU_HW_INFO_TYPE_INTEL_VTD) {
+        error_setg(errp, "IOMMU hardware is not compatible");
+        return -EINVAL;
+    }
+
+    host_mgaw = VTD_MGAW_FROM_CAP(vtd.cap_reg);
+    if (viommu_mgaw > host_mgaw) {
+        if (s->cap_frozen) {
+            error_setg(errp, "mgaw %" PRId64 " > host mgaw %" PRId64,
+                       viommu_mgaw, host_mgaw);
+            return -EINVAL;
+        }
+        tmp_cap &= ~VTD_CAP_MGAW_MASK;
+        tmp_cap |= VTD_CAP_MGAW(host_mgaw + 1);
+    }
+
+    s->cap = tmp_cap;
     return 0;
 }
 
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v1 5/6] intel_iommu: Use mgaw instead of s->aw_bits
  2024-02-28  9:44 [PATCH v1 0/6] Check and sync host IOMMU cap/ecap with vIOMMU Zhenzhong Duan
                   ` (3 preceding siblings ...)
  2024-02-28  9:44 ` [PATCH v1 4/6] intel_iommu: Implement check and sync mechanism in iommufd mode Zhenzhong Duan
@ 2024-02-28  9:44 ` Zhenzhong Duan
  2024-02-28  9:44 ` [PATCH v1 6/6] intel_iommu: Block migration if cap is updated Zhenzhong Duan
  2024-03-04  4:17 ` [PATCH v1 0/6] Check and sync host IOMMU cap/ecap with vIOMMU Jason Wang
  6 siblings, 0 replies; 22+ messages in thread
From: Zhenzhong Duan @ 2024-02-28  9:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Zhenzhong Duan, Igor Mammedov, Ani Sinha,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost

Because vIOMMU mgaw can be updated based on host IOMMU mgaw, s->aw_bits
does't necessarily represent the final mgaw now but the mgaw field in
s->cap does.

Replace reference to s->aw_bits with a MACRO S_AW_BITS to fetch mgaw
from s->cap. There are two exceptions on this, aw_bits value sanity
check and s->cap initialization.

ACPI DMAR table is also updated with right mgaw value.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/acpi-build.c  |  3 ++-
 hw/i386/intel_iommu.c | 44 ++++++++++++++++++++++---------------------
 2 files changed, 25 insertions(+), 22 deletions(-)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index edc979379c..6467157686 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2159,7 +2159,8 @@ build_dmar_q35(GArray *table_data, BIOSLinker *linker, const char *oem_id,
 
     acpi_table_begin(&table, table_data);
     /* Host Address Width */
-    build_append_int_noprefix(table_data, intel_iommu->aw_bits - 1, 1);
+    build_append_int_noprefix(table_data,
+                              VTD_MGAW_FROM_CAP(intel_iommu->cap), 1);
     build_append_int_noprefix(table_data, dmar_flags, 1); /* Flags */
     g_array_append_vals(table_data, rsvd10, sizeof(rsvd10)); /* Reserved */
 
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 2a55268538..e474284e43 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -42,6 +42,8 @@
 #include "migration/vmstate.h"
 #include "trace.h"
 
+#define S_AW_BITS (VTD_MGAW_FROM_CAP(s->cap) + 1)
+
 /* context entry operations */
 #define VTD_CE_GET_RID2PASID(ce) \
     ((ce)->val[1] & VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK)
@@ -1410,13 +1412,13 @@ static int vtd_root_entry_rsvd_bits_check(IntelIOMMUState *s,
 {
     /* Legacy Mode reserved bits check */
     if (!s->root_scalable &&
-        (re->hi || (re->lo & VTD_ROOT_ENTRY_RSVD(s->aw_bits))))
+        (re->hi || (re->lo & VTD_ROOT_ENTRY_RSVD(S_AW_BITS))))
         goto rsvd_err;
 
     /* Scalable Mode reserved bits check */
     if (s->root_scalable &&
-        ((re->lo & VTD_ROOT_ENTRY_RSVD(s->aw_bits)) ||
-         (re->hi & VTD_ROOT_ENTRY_RSVD(s->aw_bits))))
+        ((re->lo & VTD_ROOT_ENTRY_RSVD(S_AW_BITS)) ||
+         (re->hi & VTD_ROOT_ENTRY_RSVD(S_AW_BITS))))
         goto rsvd_err;
 
     return 0;
@@ -1433,7 +1435,7 @@ static inline int vtd_context_entry_rsvd_bits_check(IntelIOMMUState *s,
 {
     if (!s->root_scalable &&
         (ce->hi & VTD_CONTEXT_ENTRY_RSVD_HI ||
-         ce->lo & VTD_CONTEXT_ENTRY_RSVD_LO(s->aw_bits))) {
+         ce->lo & VTD_CONTEXT_ENTRY_RSVD_LO(S_AW_BITS))) {
         error_report_once("%s: invalid context entry: hi=%"PRIx64
                           ", lo=%"PRIx64" (reserved nonzero)",
                           __func__, ce->hi, ce->lo);
@@ -1441,7 +1443,7 @@ static inline int vtd_context_entry_rsvd_bits_check(IntelIOMMUState *s,
     }
 
     if (s->root_scalable &&
-        (ce->val[0] & VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(s->aw_bits) ||
+        (ce->val[0] & VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(S_AW_BITS) ||
          ce->val[1] & VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 ||
          ce->val[2] ||
          ce->val[3])) {
@@ -1572,7 +1574,7 @@ static int vtd_sync_shadow_page_table_range(VTDAddressSpace *vtd_as,
         .hook_fn = vtd_sync_shadow_page_hook,
         .private = (void *)&vtd_as->iommu,
         .notify_unmap = true,
-        .aw = s->aw_bits,
+        .aw = S_AW_BITS,
         .as = vtd_as,
         .domain_id = vtd_get_domain_id(s, ce, vtd_as->pasid),
     };
@@ -1991,7 +1993,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     }
 
     ret_fr = vtd_iova_to_slpte(s, &ce, addr, is_write, &slpte, &level,
-                               &reads, &writes, s->aw_bits, pasid);
+                               &reads, &writes, S_AW_BITS, pasid);
     if (ret_fr) {
         vtd_report_fault(s, -ret_fr, is_fpd_set, source_id,
                          addr, is_write, pasid != PCI_NO_PASID, pasid);
@@ -2005,7 +2007,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
 out:
     vtd_iommu_unlock(s);
     entry->iova = addr & page_mask;
-    entry->translated_addr = vtd_get_slpte_addr(slpte, s->aw_bits) & page_mask;
+    entry->translated_addr = vtd_get_slpte_addr(slpte, S_AW_BITS) & page_mask;
     entry->addr_mask = ~page_mask;
     entry->perm = access_flags;
     return true;
@@ -2022,7 +2024,7 @@ error:
 static void vtd_root_table_setup(IntelIOMMUState *s)
 {
     s->root = vtd_get_quad_raw(s, DMAR_RTADDR_REG);
-    s->root &= VTD_RTADDR_ADDR_MASK(s->aw_bits);
+    s->root &= VTD_RTADDR_ADDR_MASK(S_AW_BITS);
 
     vtd_update_scalable_state(s);
 
@@ -2040,7 +2042,7 @@ static void vtd_interrupt_remap_table_setup(IntelIOMMUState *s)
     uint64_t value = 0;
     value = vtd_get_quad_raw(s, DMAR_IRTA_REG);
     s->intr_size = 1UL << ((value & VTD_IRTA_SIZE_MASK) + 1);
-    s->intr_root = value & VTD_IRTA_ADDR_MASK(s->aw_bits);
+    s->intr_root = value & VTD_IRTA_ADDR_MASK(S_AW_BITS);
     s->intr_eime = value & VTD_IRTA_EIME;
 
     /* Notify global invalidation */
@@ -2323,7 +2325,7 @@ static void vtd_handle_gcmd_qie(IntelIOMMUState *s, bool en)
     trace_vtd_inv_qi_enable(en);
 
     if (en) {
-        s->iq = iqa_val & VTD_IQA_IQA_MASK(s->aw_bits);
+        s->iq = iqa_val & VTD_IQA_IQA_MASK(S_AW_BITS);
         /* 2^(x+8) entries */
         s->iq_size = 1UL << ((iqa_val & VTD_IQA_QS) + 8 - (s->iq_dw ? 1 : 0));
         s->qi_enabled = true;
@@ -3966,12 +3968,12 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
      * VT-d spec), otherwise we need to consider overflow of 64 bits.
      */
 
-    if (end > VTD_ADDRESS_SIZE(s->aw_bits) - 1) {
+    if (end > VTD_ADDRESS_SIZE(S_AW_BITS) - 1) {
         /*
          * Don't need to unmap regions that is bigger than the whole
          * VT-d supported address space size
          */
-        end = VTD_ADDRESS_SIZE(s->aw_bits) - 1;
+        end = VTD_ADDRESS_SIZE(S_AW_BITS) - 1;
     }
 
     assert(start <= end);
@@ -3979,7 +3981,7 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
 
     while (remain >= VTD_PAGE_SIZE) {
         IOMMUTLBEvent event;
-        uint64_t mask = dma_aligned_pow2_mask(start, end, s->aw_bits);
+        uint64_t mask = dma_aligned_pow2_mask(start, end, S_AW_BITS);
         uint64_t size = mask + 1;
 
         assert(size);
@@ -4058,7 +4060,7 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
                 .hook_fn = vtd_replay_hook,
                 .private = (void *)n,
                 .notify_unmap = false,
-                .aw = s->aw_bits,
+                .aw = S_AW_BITS,
                 .as = vtd_as,
                 .domain_id = vtd_get_domain_id(s, &ce, vtd_as->pasid),
             };
@@ -4161,15 +4163,15 @@ static void vtd_init(IntelIOMMUState *s)
      * Rsvd field masks for spte
      */
     vtd_spte_rsvd[0] = ~0ULL;
-    vtd_spte_rsvd[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->aw_bits,
+    vtd_spte_rsvd[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(S_AW_BITS,
                                                   x86_iommu->dt_supported);
-    vtd_spte_rsvd[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->aw_bits);
-    vtd_spte_rsvd[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->aw_bits);
-    vtd_spte_rsvd[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->aw_bits);
+    vtd_spte_rsvd[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(S_AW_BITS);
+    vtd_spte_rsvd[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(S_AW_BITS);
+    vtd_spte_rsvd[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(S_AW_BITS);
 
-    vtd_spte_rsvd_large[2] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->aw_bits,
+    vtd_spte_rsvd_large[2] = VTD_SPTE_LPAGE_L2_RSVD_MASK(S_AW_BITS,
                                                     x86_iommu->dt_supported);
-    vtd_spte_rsvd_large[3] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->aw_bits,
+    vtd_spte_rsvd_large[3] = VTD_SPTE_LPAGE_L3_RSVD_MASK(S_AW_BITS,
                                                     x86_iommu->dt_supported);
 
     if (s->scalable_mode || s->snoop_control) {
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v1 6/6] intel_iommu: Block migration if cap is updated
  2024-02-28  9:44 [PATCH v1 0/6] Check and sync host IOMMU cap/ecap with vIOMMU Zhenzhong Duan
                   ` (4 preceding siblings ...)
  2024-02-28  9:44 ` [PATCH v1 5/6] intel_iommu: Use mgaw instead of s->aw_bits Zhenzhong Duan
@ 2024-02-28  9:44 ` Zhenzhong Duan
  2024-03-04  6:35   ` Prasad Pandit
  2024-03-04  4:17 ` [PATCH v1 0/6] Check and sync host IOMMU cap/ecap with vIOMMU Jason Wang
  6 siblings, 1 reply; 22+ messages in thread
From: Zhenzhong Duan @ 2024-02-28  9:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

When there is VFIO device and vIOMMU cap/ecap is updated based on host
IOMMU cap/ecap, migration should be blocked.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index e474284e43..9ca47dbf9a 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -40,6 +40,7 @@
 #include "hw/i386/apic_internal.h"
 #include "kvm/kvm_i386.h"
 #include "migration/vmstate.h"
+#include "migration/blocker.h"
 #include "trace.h"
 
 #define S_AW_BITS (VTD_MGAW_FROM_CAP(s->cap) + 1)
@@ -3830,6 +3831,8 @@ static int vtd_check_legacy_hdev(IntelIOMMUState *s,
     return 0;
 }
 
+static Error *vtd_mig_blocker;
+
 static int vtd_check_iommufd_hdev(IntelIOMMUState *s,
                                   IOMMUFDDevice *idev,
                                   Error **errp)
@@ -3861,8 +3864,17 @@ static int vtd_check_iommufd_hdev(IntelIOMMUState *s,
         tmp_cap |= VTD_CAP_MGAW(host_mgaw + 1);
     }
 
-    s->cap = tmp_cap;
-    return 0;
+    if (s->cap != tmp_cap) {
+        if (vtd_mig_blocker == NULL) {
+            error_setg(&vtd_mig_blocker,
+                       "cap/ecap update from host IOMMU block migration");
+            ret = migrate_add_blocker(&vtd_mig_blocker, errp);
+        }
+        if (!ret) {
+            s->cap = tmp_cap;
+        }
+    }
+    return ret;
 }
 
 static int vtd_check_hdev(IntelIOMMUState *s, VTDHostIOMMUDevice *vtd_hdev,
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 0/6] Check and sync host IOMMU cap/ecap with vIOMMU
  2024-02-28  9:44 [PATCH v1 0/6] Check and sync host IOMMU cap/ecap with vIOMMU Zhenzhong Duan
                   ` (5 preceding siblings ...)
  2024-02-28  9:44 ` [PATCH v1 6/6] intel_iommu: Block migration if cap is updated Zhenzhong Duan
@ 2024-03-04  4:17 ` Jason Wang
  2024-03-04  6:13   ` Duan, Zhenzhong
  6 siblings, 1 reply; 22+ messages in thread
From: Jason Wang @ 2024-03-04  4:17 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: qemu-devel, alex.williamson, clg, eric.auger, peterx, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng

On Wed, Feb 28, 2024 at 5:46 PM Zhenzhong Duan <zhenzhong.duan@intel.com> wrote:
>
> Hi,
>
> Based on Joao's suggestion, the iommufd nesting prerequisite series [1]
> is further splitted to host IOMMU device abstract part [2] and vIOMMU
> check/sync part. This series implements the 2nd part.
>
> This enables vIOMMU to get host IOMMU cap/ecap information by implementing
> a new set/unset_iommu_device interface, then vIOMMU could check or sync
> with vIOMMU's own cap/ecap config.

Does it mean that it would supress the cap/ecap config from the qemu
command line? If yes, I wonder how to maintain the migration
compatibility.

Thanks

>
> It works by having device side, i.e. VFIO, register either an IOMMULegacyDevice
> or IOMMUFDDevice to vIOMMU, which includes necessary data to archive that.
> Currently only VFIO device is supported, but it could also be used for other
> devices, i.e., VDPA.
>
> For coldplugged device, we can get its host IOMMU cap/ecap during qemu init,
> then check and sync into vIOMMU cap/ecap.
> For hotplugged device, vIOMMU cap/ecap is frozen, we could only check with
> vIOMMU cap/ecap, not allowed to update. If check fails, hotplugged will fail.
>
> This is also a prerequisite for incoming iommufd nesting series:
> 'intel_iommu: Enable stage-1 translation'.
>
> I didn't implement cap/ecap sync for legacy VFIO backend, would like to see
> what Eric want to put in IOMMULegacyDevice for virtio-iommu and if I can
> utilize some of them.
>
> Because it's becoming clear on community's suggestion, I'd like to remove
> rfc tag from this version.
>
> Qemu code can be found at:
> https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting_preq_part2_v1
>
> [1] https://lore.kernel.org/qemu-devel/20240201072818.327930-1-zhenzhong.duan@intel.com
> [2] https://lists.gnu.org/archive/html/qemu-devel/2024-02/msg06314.html
>
> Thanks
> Zhenzhong
>
> Changelog:
> v1:
> - convert HostIOMMUDevice to sub object pointer in vtd_check_hdev
>
> rfcv2:
> - introduce common abstract HostIOMMUDevice and sub struct for different BEs (Eric, Cédric)
> - remove iommufd_device.[ch] (Cédric)
> - remove duplicate iommufd/devid define from VFIODevice (Eric)
> - drop the p in aliased_pbus and aliased_pdevfn (Eric)
> - assert devfn and iommu_bus in pci_device_get_iommu_bus_devfn (Cédric, Eric)
> - use errp in iommufd_device_get_info (Eric)
> - split and simplify cap/ecap check/sync code in intel_iommu.c (Cédric)
> - move VTDHostIOMMUDevice declaration to intel_iommu_internal.h (Cédric)
> - make '(vtd->cap_reg >> 16) & 0x3fULL' a MACRO and add missed '+1' (Cédric)
> - block migration if vIOMMU cap/ecap updated based on host IOMMU cap/ecap
> - add R-B
>
>
> Yi Liu (2):
>   intel_iommu: Add set/unset_iommu_device callback
>   intel_iommu: Add a framework to check and sync host IOMMU cap/ecap
>
> Zhenzhong Duan (4):
>   intel_iommu: Extract out vtd_cap_init to initialize cap/ecap
>   intel_iommu: Implement check and sync mechanism in iommufd mode
>   intel_iommu: Use mgaw instead of s->aw_bits
>   intel_iommu: Block migration if cap is updated
>
>  hw/i386/intel_iommu_internal.h |   9 ++
>  include/hw/i386/intel_iommu.h  |   4 +
>  hw/i386/acpi-build.c           |   3 +-
>  hw/i386/intel_iommu.c          | 287 ++++++++++++++++++++++++++-------
>  4 files changed, 245 insertions(+), 58 deletions(-)
>
> --
> 2.34.1
>



^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH v1 0/6] Check and sync host IOMMU cap/ecap with vIOMMU
  2024-03-04  4:17 ` [PATCH v1 0/6] Check and sync host IOMMU cap/ecap with vIOMMU Jason Wang
@ 2024-03-04  6:13   ` Duan, Zhenzhong
  0 siblings, 0 replies; 22+ messages in thread
From: Duan, Zhenzhong @ 2024-03-04  6:13 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, alex.williamson, clg, eric.auger, peterx, mst, jgg,
	nicolinc, joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y,
	Peng, Chao P



>-----Original Message-----
>From: Jason Wang <jasowang@redhat.com>
>Subject: Re: [PATCH v1 0/6] Check and sync host IOMMU cap/ecap with
>vIOMMU
>
>On Wed, Feb 28, 2024 at 5:46 PM Zhenzhong Duan
><zhenzhong.duan@intel.com> wrote:
>>
>> Hi,
>>
>> Based on Joao's suggestion, the iommufd nesting prerequisite series [1]
>> is further splitted to host IOMMU device abstract part [2] and vIOMMU
>> check/sync part. This series implements the 2nd part.
>>
>> This enables vIOMMU to get host IOMMU cap/ecap information by
>implementing
>> a new set/unset_iommu_device interface, then vIOMMU could check or
>sync
>> with vIOMMU's own cap/ecap config.
>
>Does it mean that it would supress the cap/ecap config from the qemu
>command line?

No, cap/ecap have two kinds of bits, one is not controlled by cmdline, e.g., MGAW;
the other is, e.g., we initialize SAGAW through aw_bits.

I treat qemu cmdline with higher priority. We only allow update those bits not related
to any cmdline, e.g., update vIOMMU MGAW with host MGAW.

If there is cmdline controlled cap/ecap bits incompatibility between host and vIOMMU,
vfio device hotplug should fail.

> If yes, I wonder how to maintain the migration compatibility.

If cap/ecap is updated due to above reason, I have below patch to block migration.

[PATCH v1 6/6] intel_iommu: Block migration if cap is updated

Thanks
Zhenzhong

>
>Thanks
>
>>
>> It works by having device side, i.e. VFIO, register either an
>IOMMULegacyDevice
>> or IOMMUFDDevice to vIOMMU, which includes necessary data to archive
>that.
>> Currently only VFIO device is supported, but it could also be used for other
>> devices, i.e., VDPA.
>>
>> For coldplugged device, we can get its host IOMMU cap/ecap during qemu
>init,
>> then check and sync into vIOMMU cap/ecap.
>> For hotplugged device, vIOMMU cap/ecap is frozen, we could only check
>with
>> vIOMMU cap/ecap, not allowed to update. If check fails, hotplugged will
>fail.
>>
>> This is also a prerequisite for incoming iommufd nesting series:
>> 'intel_iommu: Enable stage-1 translation'.
>>
>> I didn't implement cap/ecap sync for legacy VFIO backend, would like to
>see
>> what Eric want to put in IOMMULegacyDevice for virtio-iommu and if I can
>> utilize some of them.
>>
>> Because it's becoming clear on community's suggestion, I'd like to remove
>> rfc tag from this version.
>>
>> Qemu code can be found at:
>>
>https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting_pre
>q_part2_v1
>>
>> [1] https://lore.kernel.org/qemu-devel/20240201072818.327930-1-
>zhenzhong.duan@intel.com
>> [2] https://lists.gnu.org/archive/html/qemu-devel/2024-
>02/msg06314.html
>>
>> Thanks
>> Zhenzhong
>>
>> Changelog:
>> v1:
>> - convert HostIOMMUDevice to sub object pointer in vtd_check_hdev
>>
>> rfcv2:
>> - introduce common abstract HostIOMMUDevice and sub struct for
>different BEs (Eric, Cédric)
>> - remove iommufd_device.[ch] (Cédric)
>> - remove duplicate iommufd/devid define from VFIODevice (Eric)
>> - drop the p in aliased_pbus and aliased_pdevfn (Eric)
>> - assert devfn and iommu_bus in pci_device_get_iommu_bus_devfn
>(Cédric, Eric)
>> - use errp in iommufd_device_get_info (Eric)
>> - split and simplify cap/ecap check/sync code in intel_iommu.c (Cédric)
>> - move VTDHostIOMMUDevice declaration to intel_iommu_internal.h
>(Cédric)
>> - make '(vtd->cap_reg >> 16) & 0x3fULL' a MACRO and add missed '+1'
>(Cédric)
>> - block migration if vIOMMU cap/ecap updated based on host IOMMU
>cap/ecap
>> - add R-B
>>
>>
>> Yi Liu (2):
>>   intel_iommu: Add set/unset_iommu_device callback
>>   intel_iommu: Add a framework to check and sync host IOMMU cap/ecap
>>
>> Zhenzhong Duan (4):
>>   intel_iommu: Extract out vtd_cap_init to initialize cap/ecap
>>   intel_iommu: Implement check and sync mechanism in iommufd mode
>>   intel_iommu: Use mgaw instead of s->aw_bits
>>   intel_iommu: Block migration if cap is updated
>>
>>  hw/i386/intel_iommu_internal.h |   9 ++
>>  include/hw/i386/intel_iommu.h  |   4 +
>>  hw/i386/acpi-build.c           |   3 +-
>>  hw/i386/intel_iommu.c          | 287 ++++++++++++++++++++++++++-------
>>  4 files changed, 245 insertions(+), 58 deletions(-)
>>
>> --
>> 2.34.1
>>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 6/6] intel_iommu: Block migration if cap is updated
  2024-02-28  9:44 ` [PATCH v1 6/6] intel_iommu: Block migration if cap is updated Zhenzhong Duan
@ 2024-03-04  6:35   ` Prasad Pandit
  2024-03-04  8:11     ` Duan, Zhenzhong
  0 siblings, 1 reply; 22+ messages in thread
From: Prasad Pandit @ 2024-03-04  6:35 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: qemu-devel, alex.williamson, clg, eric.auger, peterx, jasowang,
	mst, jgg, nicolinc, joao.m.martins, kevin.tian, yi.l.liu,
	yi.y.sun, chao.p.peng, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

On Wed, 28 Feb 2024 at 15:17, Zhenzhong Duan <zhenzhong.duan@intel.com> wrote:
> When there is VFIO device and vIOMMU cap/ecap is updated based on host

* cap/ecap -> capability/extended capability registers are updated ...

> IOMMU cap/ecap, migration should be blocked.

* It'll help to mention why migration should be blocked in this case?

> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/i386/intel_iommu.c | 16 ++++++++++++++--
> +static Error *vtd_mig_blocker;
> +
>  static int vtd_check_iommufd_hdev(IntelIOMMUState *s,
>                                    IOMMUFDDevice *idev,
>                                    Error **errp)
> @@ -3861,8 +3864,17 @@ static int vtd_check_iommufd_hdev(IntelIOMMUState *s,
>          tmp_cap |= VTD_CAP_MGAW(host_mgaw + 1);
>      }
>
> -    s->cap = tmp_cap;
> -    return 0;
> +    if (s->cap != tmp_cap) {
> +        if (vtd_mig_blocker == NULL) {
> +            error_setg(&vtd_mig_blocker,
> +                       "cap/ecap update from host IOMMU block migration");
> +            ret = migrate_add_blocker(&vtd_mig_blocker, errp);
> +        }
> +        if (!ret) {
> +            s->cap = tmp_cap;
> +        }
> +    }
> +    return ret;

* I couldn't find vtd_check_* function in the tree, but what happens
if vtd_mig_blocker != NULL? What will be 'ret' then?

Thank you.
---
  - Prasad



^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH v1 6/6] intel_iommu: Block migration if cap is updated
  2024-03-04  6:35   ` Prasad Pandit
@ 2024-03-04  8:11     ` Duan, Zhenzhong
  2024-03-04  9:43       ` Prasad Pandit
  0 siblings, 1 reply; 22+ messages in thread
From: Duan, Zhenzhong @ 2024-03-04  8:11 UTC (permalink / raw)
  To: Prasad Pandit
  Cc: qemu-devel, alex.williamson, clg, eric.auger, peterx, jasowang,
	mst, jgg, nicolinc, joao.m.martins, Tian, Kevin, Liu, Yi L, Sun,
	Yi Y, Peng, Chao P, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost



>-----Original Message-----
>From: Prasad Pandit <ppandit@redhat.com>
>Subject: Re: [PATCH v1 6/6] intel_iommu: Block migration if cap is updated
>
>On Wed, 28 Feb 2024 at 15:17, Zhenzhong Duan
><zhenzhong.duan@intel.com> wrote:
>> When there is VFIO device and vIOMMU cap/ecap is updated based on
>host
>
>* cap/ecap -> capability/extended capability registers are updated ...

Will do.

>
>> IOMMU cap/ecap, migration should be blocked.
>
>* It'll help to mention why migration should be blocked in this case?

Refine the patch description as below:

This is to deal with a special case when cold plugged vfio device is unplugged
at runtime, then migration triggers.

When qemu on source side starts with cold plugged vfio device, vIOMMU
capability/extended capability registers(cap/ecap) can be updated based
on host cap/ecap, then vIOMMU cap/ecap is frozen after machine creation
done, vIOMMU cap/ecap isn’t restored to default after vfio device is unplugged.
In this case source and dest(default cap/ecap) will have incompatible cap/ecap
value. So migration is blocked if there is cap/ecap update on source side.

If vfio device isn't unplugged at runtime, vfio device's own vIOMMU migration blocker
is redundant with blocker here, but that's harmless.

If vfio devices are all hot plugged after qemu on source side starts, then vIOMMU
cap/ecap is frozen with the default value, we don't make a blocker in this case.

>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  hw/i386/intel_iommu.c | 16 ++++++++++++++--
>> +static Error *vtd_mig_blocker;
>> +
>>  static int vtd_check_iommufd_hdev(IntelIOMMUState *s,
>>                                    IOMMUFDDevice *idev,
>>                                    Error **errp)
>> @@ -3861,8 +3864,17 @@ static int
>vtd_check_iommufd_hdev(IntelIOMMUState *s,
>>          tmp_cap |= VTD_CAP_MGAW(host_mgaw + 1);
>>      }
>>
>> -    s->cap = tmp_cap;
>> -    return 0;
>> +    if (s->cap != tmp_cap) {
>> +        if (vtd_mig_blocker == NULL) {
>> +            error_setg(&vtd_mig_blocker,
>> +                       "cap/ecap update from host IOMMU block migration");
>> +            ret = migrate_add_blocker(&vtd_mig_blocker, errp);
>> +        }
>> +        if (!ret) {
>> +            s->cap = tmp_cap;
>> +        }
>> +    }
>> +    return ret;
>
>* I couldn't find vtd_check_* function in the tree, but what happens
>if vtd_mig_blocker != NULL? What will be 'ret' then?

vtd_mig_blocker != NULL means cap is already updated once at least.
In this case, ret is return value 0 from iommufd_device_get_info().

    ret = iommufd_device_get_info(idev, &type, sizeof(vtd), &vtd, errp);
    if (ret) {
        return ret;
    }

Then cap is updated with host cap value tmp_cap. This update only happen
before machine creation done.

Vtd_check_* is in patch3 and patch4:
https://lists.gnu.org/archive/html/qemu-devel/2024-02/msg06418.html
https://lists.gnu.org/archive/html/qemu-devel/2024-02/msg06419.html

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 6/6] intel_iommu: Block migration if cap is updated
  2024-03-04  8:11     ` Duan, Zhenzhong
@ 2024-03-04  9:43       ` Prasad Pandit
  2024-03-04 10:10         ` Duan, Zhenzhong
  0 siblings, 1 reply; 22+ messages in thread
From: Prasad Pandit @ 2024-03-04  9:43 UTC (permalink / raw)
  To: Duan, Zhenzhong
  Cc: qemu-devel, alex.williamson, clg, eric.auger, peterx, jasowang,
	mst, jgg, nicolinc, joao.m.martins, Tian, Kevin, Liu, Yi L, Sun,
	Yi Y, Peng, Chao P, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

On Mon, 4 Mar 2024 at 13:41, Duan, Zhenzhong <zhenzhong.duan@intel.com> wrote:
> This is to deal with a special case when cold plugged vfio device is unplugged
> at runtime, then migration triggers.
>
> When qemu on source side starts with cold plugged vfio device, vIOMMU

qemu -> QEMU

> capability/extended capability registers(cap/ecap) can be updated based
> on host cap/ecap, then vIOMMU cap/ecap is frozen after machine creation
> done, vIOMMU cap/ecap isn’t restored to default after vfio device is unplugged.
> In this case source and dest(default cap/ecap) will have incompatible cap/ecap
> value. So migration is blocked if there is cap/ecap update on source side.
>
> If vfio device isn't unplugged at runtime, vfio device's own vIOMMU migration blocker
> is redundant with blocker here, but that's harmless.
>
> If vfio devices are all hot plugged after qemu on source side starts, then vIOMMU
> cap/ecap is frozen with the default value, we don't make a blocker in this case.
>

Nice +1

> >> @@ -3861,8 +3864,17 @@ static int
> >vtd_check_iommufd_hdev(IntelIOMMUState *s,
> >>          tmp_cap |= VTD_CAP_MGAW(host_mgaw + 1);
> >>      }
> >>
> >> -    s->cap = tmp_cap;
> >> -    return 0;
> >> +    if (s->cap != tmp_cap) {
> >> +        if (vtd_mig_blocker == NULL) {
> >> +            error_setg(&vtd_mig_blocker,
> >> +                       "cap/ecap update from host IOMMU block migration");
> >> +            ret = migrate_add_blocker(&vtd_mig_blocker, errp);
> >> +        }
> >> +        if (!ret) {
> >> +            s->cap = tmp_cap;
> >> +        }
> >> +    }
> >> +    return ret;
>
> vtd_mig_blocker != NULL means cap is already updated once at least.
> In this case, ret is return value 0 from iommufd_device_get_info().
>
>     ret = iommufd_device_get_info(idev, &type, sizeof(vtd), &vtd, errp);
>     if (ret) {
>         return ret;
>     }
>
> Then cap is updated with host cap value tmp_cap. This update only happen
> before machine creation done.

* After iommufd_device_get_info() ret is != 0. So s->cap won't be
updated then. Hope that is intended.

* With the above tweaks included:
    Reviewed-by: Prasad Pandit <pjp@fedoraproject.org>

Thank you.
---
  - Prasad



^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH v1 6/6] intel_iommu: Block migration if cap is updated
  2024-03-04  9:43       ` Prasad Pandit
@ 2024-03-04 10:10         ` Duan, Zhenzhong
  0 siblings, 0 replies; 22+ messages in thread
From: Duan, Zhenzhong @ 2024-03-04 10:10 UTC (permalink / raw)
  To: Prasad Pandit
  Cc: qemu-devel, alex.williamson, clg, eric.auger, peterx, jasowang,
	mst, jgg, nicolinc, joao.m.martins, Tian, Kevin, Liu, Yi L, Sun,
	Yi Y, Peng, Chao P, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost



>-----Original Message-----
>From: Prasad Pandit <ppandit@redhat.com>
>Subject: Re: [PATCH v1 6/6] intel_iommu: Block migration if cap is updated
>
>On Mon, 4 Mar 2024 at 13:41, Duan, Zhenzhong
><zhenzhong.duan@intel.com> wrote:
>> This is to deal with a special case when cold plugged vfio device is
>unplugged
>> at runtime, then migration triggers.
>>
>> When qemu on source side starts with cold plugged vfio device, vIOMMU
>
>qemu -> QEMU
>
>> capability/extended capability registers(cap/ecap) can be updated based
>> on host cap/ecap, then vIOMMU cap/ecap is frozen after machine creation
>> done, vIOMMU cap/ecap isn’t restored to default after vfio device is
>unplugged.
>> In this case source and dest(default cap/ecap) will have incompatible
>cap/ecap
>> value. So migration is blocked if there is cap/ecap update on source side.
>>
>> If vfio device isn't unplugged at runtime, vfio device's own vIOMMU
>migration blocker
>> is redundant with blocker here, but that's harmless.
>>
>> If vfio devices are all hot plugged after qemu on source side starts, then
>vIOMMU
>> cap/ecap is frozen with the default value, we don't make a blocker in this
>case.
>>
>
>Nice +1
>
>> >> @@ -3861,8 +3864,17 @@ static int
>> >vtd_check_iommufd_hdev(IntelIOMMUState *s,
>> >>          tmp_cap |= VTD_CAP_MGAW(host_mgaw + 1);
>> >>      }
>> >>
>> >> -    s->cap = tmp_cap;
>> >> -    return 0;
>> >> +    if (s->cap != tmp_cap) {
>> >> +        if (vtd_mig_blocker == NULL) {
>> >> +            error_setg(&vtd_mig_blocker,
>> >> +                       "cap/ecap update from host IOMMU block migration");
>> >> +            ret = migrate_add_blocker(&vtd_mig_blocker, errp);
>> >> +        }
>> >> +        if (!ret) {
>> >> +            s->cap = tmp_cap;
>> >> +        }
>> >> +    }
>> >> +    return ret;
>>
>> vtd_mig_blocker != NULL means cap is already updated once at least.
>> In this case, ret is return value 0 from iommufd_device_get_info().
>>
>>     ret = iommufd_device_get_info(idev, &type, sizeof(vtd), &vtd, errp);
>>     if (ret) {
>>         return ret;
>>     }
>>
>> Then cap is updated with host cap value tmp_cap. This update only happen
>> before machine creation done.
>
>* After iommufd_device_get_info() ret is != 0. So s->cap won't be
>updated then. Hope that is intended.

Yes, iommufd_device_get_info() return ret !=0 means error happen,
we should not update s->cap definitely.

Normally if there are multiple vfio devices backed by different host IOMMUs,
It's possible s->cap updated more than once, e.g., MGAW update: 57->52->48.

>
>* With the above tweaks included:
>    Reviewed-by: Prasad Pandit <pjp@fedoraproject.org>

Thanks for your review.

BRs.
Zhenzhong

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 3/6] intel_iommu: Add a framework to check and sync host IOMMU cap/ecap
  2024-02-28  9:44 ` [PATCH v1 3/6] intel_iommu: Add a framework to check and sync host IOMMU cap/ecap Zhenzhong Duan
@ 2024-03-12 17:03   ` Michael S. Tsirkin
  2024-03-13  2:52     ` Duan, Zhenzhong
  0 siblings, 1 reply; 22+ messages in thread
From: Michael S. Tsirkin @ 2024-03-12 17:03 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: qemu-devel, alex.williamson, clg, eric.auger, peterx, jasowang,
	jgg, nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Yi Sun, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, Marcel Apfelbaum

On Wed, Feb 28, 2024 at 05:44:29PM +0800, Zhenzhong Duan wrote:
> From: Yi Liu <yi.l.liu@intel.com>
> 
> Add a framework to check and synchronize host IOMMU cap/ecap with
> vIOMMU cap/ecap.
> 
> The sequence will be:
> 
> vtd_cap_init() initializes iommu->cap/ecap.
> vtd_check_hdev() update iommu->cap/ecap based on host cap/ecap.
> iommu->cap_frozen set when machine create done, iommu->cap/ecap become readonly.
> 
> Implementation details for different backends will be in following patches.
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  include/hw/i386/intel_iommu.h |  1 +
>  hw/i386/intel_iommu.c         | 50 ++++++++++++++++++++++++++++++++++-
>  2 files changed, 50 insertions(+), 1 deletion(-)
> 
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index bbc7b96add..c71a133820 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -283,6 +283,7 @@ struct IntelIOMMUState {
>  
>      uint64_t cap;                   /* The value of capability reg */
>      uint64_t ecap;                  /* The value of extended capability reg */
> +    bool cap_frozen;                /* cap/ecap become read-only after frozen */
>  
>      uint32_t context_cache_gen;     /* Should be in [1,MAX] */
>      GHashTable *iotlb;              /* IOTLB */
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index ffa1ad6429..a9f9dfd6a7 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -35,6 +35,8 @@
>  #include "sysemu/kvm.h"
>  #include "sysemu/dma.h"
>  #include "sysemu/sysemu.h"
> +#include "hw/vfio/vfio-common.h"
> +#include "sysemu/iommufd.h"
>  #include "hw/i386/apic_internal.h"
>  #include "kvm/kvm_i386.h"
>  #include "migration/vmstate.h"
> @@ -3819,6 +3821,38 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
>      return vtd_dev_as;
>  }
>  
> +static int vtd_check_legacy_hdev(IntelIOMMUState *s,
> +                                 IOMMULegacyDevice *ldev,
> +                                 Error **errp)
> +{
> +    return 0;
> +}
> +
> +static int vtd_check_iommufd_hdev(IntelIOMMUState *s,
> +                                  IOMMUFDDevice *idev,
> +                                  Error **errp)
> +{
> +    return 0;
> +}
> +
> +static int vtd_check_hdev(IntelIOMMUState *s, VTDHostIOMMUDevice *vtd_hdev,
> +                          Error **errp)
> +{
> +    HostIOMMUDevice *base_dev = vtd_hdev->dev;
> +    IOMMUFDDevice *idev;
> +
> +    if (base_dev->type == HID_LEGACY) {
> +        IOMMULegacyDevice *ldev = container_of(base_dev,
> +                                               IOMMULegacyDevice, base);
> +
> +        return vtd_check_legacy_hdev(s, ldev, errp);
> +    }
> +
> +    idev = container_of(base_dev, IOMMUFDDevice, base);
> +
> +    return vtd_check_iommufd_hdev(s, idev, errp);
> +}
> +
>  static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
>                                      HostIOMMUDevice *base_dev, Error **errp)
>  {
> @@ -3829,6 +3863,7 @@ static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
>          .devfn = devfn,
>      };
>      struct vtd_as_key *new_key;
> +    int ret;
>  
>      assert(base_dev);
>  
> @@ -3848,6 +3883,13 @@ static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
>      vtd_hdev->iommu_state = s;
>      vtd_hdev->dev = base_dev;
>  
> +    ret = vtd_check_hdev(s, vtd_hdev, errp);
> +    if (ret) {
> +        g_free(vtd_hdev);
> +        vtd_iommu_unlock(s);
> +        return ret;
> +    }
> +
>      new_key = g_malloc(sizeof(*new_key));
>      new_key->bus = bus;
>      new_key->devfn = devfn;


Okay. So when VFIO device is created, it will call vtd_dev_set_iommu_device
and that in turn will update caps.




> @@ -4083,7 +4125,9 @@ static void vtd_init(IntelIOMMUState *s)
>      s->iq_dw = false;
>      s->next_frcd_reg = 0;
>  
> -    vtd_cap_init(s);
> +    if (!s->cap_frozen) {
> +        vtd_cap_init(s);
> +    }
>  

If it's fronzen it's because VFIO was added after machine done.
And then what? I think caps are just wrong?


I think the way to approach this is just by specifying this
as an option on command line.

So if one wants VFIO one has to sync caps with host.
No?



>      /*
>       * Rsvd field masks for spte
> @@ -4254,6 +4298,10 @@ static int vtd_machine_done_notify_one(Object *child, void *unused)
>  
>  static void vtd_machine_done_hook(Notifier *notifier, void *unused)
>  {
> +    IntelIOMMUState *iommu = INTEL_IOMMU_DEVICE(x86_iommu_get_default());
> +
> +    iommu->cap_frozen = true;
> +
>      object_child_foreach_recursive(object_get_root(),
>                                     vtd_machine_done_notify_one, NULL);
>  }
> -- 
> 2.34.1



^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH v1 3/6] intel_iommu: Add a framework to check and sync host IOMMU cap/ecap
  2024-03-12 17:03   ` Michael S. Tsirkin
@ 2024-03-13  2:52     ` Duan, Zhenzhong
  2024-03-13  7:07       ` Michael S. Tsirkin
  0 siblings, 1 reply; 22+ messages in thread
From: Duan, Zhenzhong @ 2024-03-13  2:52 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, alex.williamson, clg, eric.auger, peterx, jasowang,
	jgg, nicolinc, joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y,
	Peng, Chao P, Yi Sun, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, Marcel Apfelbaum

Hi Michael,

>-----Original Message-----
>From: Michael S. Tsirkin <mst@redhat.com>
>Subject: Re: [PATCH v1 3/6] intel_iommu: Add a framework to check and
>sync host IOMMU cap/ecap
>
>On Wed, Feb 28, 2024 at 05:44:29PM +0800, Zhenzhong Duan wrote:
>> From: Yi Liu <yi.l.liu@intel.com>
>>
>> Add a framework to check and synchronize host IOMMU cap/ecap with
>> vIOMMU cap/ecap.
>>
>> The sequence will be:
>>
>> vtd_cap_init() initializes iommu->cap/ecap.
>> vtd_check_hdev() update iommu->cap/ecap based on host cap/ecap.
>> iommu->cap_frozen set when machine create done, iommu->cap/ecap
>become readonly.
>>
>> Implementation details for different backends will be in following patches.
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  include/hw/i386/intel_iommu.h |  1 +
>>  hw/i386/intel_iommu.c         | 50
>++++++++++++++++++++++++++++++++++-
>>  2 files changed, 50 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/hw/i386/intel_iommu.h
>b/include/hw/i386/intel_iommu.h
>> index bbc7b96add..c71a133820 100644
>> --- a/include/hw/i386/intel_iommu.h
>> +++ b/include/hw/i386/intel_iommu.h
>> @@ -283,6 +283,7 @@ struct IntelIOMMUState {
>>
>>      uint64_t cap;                   /* The value of capability reg */
>>      uint64_t ecap;                  /* The value of extended capability reg */
>> +    bool cap_frozen;                /* cap/ecap become read-only after frozen */
>>
>>      uint32_t context_cache_gen;     /* Should be in [1,MAX] */
>>      GHashTable *iotlb;              /* IOTLB */
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index ffa1ad6429..a9f9dfd6a7 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -35,6 +35,8 @@
>>  #include "sysemu/kvm.h"
>>  #include "sysemu/dma.h"
>>  #include "sysemu/sysemu.h"
>> +#include "hw/vfio/vfio-common.h"
>> +#include "sysemu/iommufd.h"
>>  #include "hw/i386/apic_internal.h"
>>  #include "kvm/kvm_i386.h"
>>  #include "migration/vmstate.h"
>> @@ -3819,6 +3821,38 @@ VTDAddressSpace
>*vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
>>      return vtd_dev_as;
>>  }
>>
>> +static int vtd_check_legacy_hdev(IntelIOMMUState *s,
>> +                                 IOMMULegacyDevice *ldev,
>> +                                 Error **errp)
>> +{
>> +    return 0;
>> +}
>> +
>> +static int vtd_check_iommufd_hdev(IntelIOMMUState *s,
>> +                                  IOMMUFDDevice *idev,
>> +                                  Error **errp)
>> +{
>> +    return 0;
>> +}
>> +
>> +static int vtd_check_hdev(IntelIOMMUState *s, VTDHostIOMMUDevice
>*vtd_hdev,
>> +                          Error **errp)
>> +{
>> +    HostIOMMUDevice *base_dev = vtd_hdev->dev;
>> +    IOMMUFDDevice *idev;
>> +
>> +    if (base_dev->type == HID_LEGACY) {
>> +        IOMMULegacyDevice *ldev = container_of(base_dev,
>> +                                               IOMMULegacyDevice, base);
>> +
>> +        return vtd_check_legacy_hdev(s, ldev, errp);
>> +    }
>> +
>> +    idev = container_of(base_dev, IOMMUFDDevice, base);
>> +
>> +    return vtd_check_iommufd_hdev(s, idev, errp);
>> +}
>> +
>>  static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int
>devfn,
>>                                      HostIOMMUDevice *base_dev, Error **errp)
>>  {
>> @@ -3829,6 +3863,7 @@ static int vtd_dev_set_iommu_device(PCIBus
>*bus, void *opaque, int devfn,
>>          .devfn = devfn,
>>      };
>>      struct vtd_as_key *new_key;
>> +    int ret;
>>
>>      assert(base_dev);
>>
>> @@ -3848,6 +3883,13 @@ static int vtd_dev_set_iommu_device(PCIBus
>*bus, void *opaque, int devfn,
>>      vtd_hdev->iommu_state = s;
>>      vtd_hdev->dev = base_dev;
>>
>> +    ret = vtd_check_hdev(s, vtd_hdev, errp);
>> +    if (ret) {
>> +        g_free(vtd_hdev);
>> +        vtd_iommu_unlock(s);
>> +        return ret;
>> +    }
>> +
>>      new_key = g_malloc(sizeof(*new_key));
>>      new_key->bus = bus;
>>      new_key->devfn = devfn;
>
>
>Okay. So when VFIO device is created, it will call vtd_dev_set_iommu_device
>and that in turn will update caps.
>
>
>
>
>> @@ -4083,7 +4125,9 @@ static void vtd_init(IntelIOMMUState *s)
>>      s->iq_dw = false;
>>      s->next_frcd_reg = 0;
>>
>> -    vtd_cap_init(s);
>> +    if (!s->cap_frozen) {
>> +        vtd_cap_init(s);
>> +    }
>>
>
>If it's fronzen it's because VFIO was added after machine done.
>And then what? I think caps are just wrong?

Not quite get your question on caps being wrong. But try to explains:

When a hot plugged vfio device's host iommu cap isn't compatible with
vIOMMU's, hotplug should fail. Currently there is no check for this and
allow hotplug to succeed, but then some issue will reveal later,
e.g., vIOMMU's MGAW > host IOMMU's MGAW, guest can setup iova
mapping beyond host supported iova range, then DMA will fail.

In fact, before this series, cap is not impacted by VFIO, so it's same effect of
frozen after machine done.

>
>
>I think the way to approach this is just by specifying this
>as an option on command line.

Do you mean add a cap_frozen property to intel_iommu?
Vtd_init() is called in realize() and system reset(), so I utilize realize() to init cap
and froze cap before system reset(). If cap_frozen is an option, when it's set to
false, cap could be updated every system reset and it's not a fix value any more.
This may break migration.

>
>So if one wants VFIO one has to sync caps with host.
>No?

Yes, check for compatibility. But it's not preventing the usage of VFIO
with vIOMMU, it finds the incompatible issue earlier and fail hotplug instead of
surprising guest driver failure.

Thanks
Zhenzhong

>
>
>
>>      /*
>>       * Rsvd field masks for spte
>> @@ -4254,6 +4298,10 @@ static int
>vtd_machine_done_notify_one(Object *child, void *unused)
>>
>>  static void vtd_machine_done_hook(Notifier *notifier, void *unused)
>>  {
>> +    IntelIOMMUState *iommu =
>INTEL_IOMMU_DEVICE(x86_iommu_get_default());
>> +
>> +    iommu->cap_frozen = true;
>> +
>>      object_child_foreach_recursive(object_get_root(),
>>                                     vtd_machine_done_notify_one, NULL);
>>  }
>> --
>> 2.34.1



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 3/6] intel_iommu: Add a framework to check and sync host IOMMU cap/ecap
  2024-03-13  2:52     ` Duan, Zhenzhong
@ 2024-03-13  7:07       ` Michael S. Tsirkin
  2024-03-13  7:54         ` Duan, Zhenzhong
  0 siblings, 1 reply; 22+ messages in thread
From: Michael S. Tsirkin @ 2024-03-13  7:07 UTC (permalink / raw)
  To: Duan, Zhenzhong
  Cc: qemu-devel, alex.williamson, clg, eric.auger, peterx, jasowang,
	jgg, nicolinc, joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y,
	Peng, Chao P, Yi Sun, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, Marcel Apfelbaum

On Wed, Mar 13, 2024 at 02:52:39AM +0000, Duan, Zhenzhong wrote:
> Hi Michael,
> 
> >-----Original Message-----
> >From: Michael S. Tsirkin <mst@redhat.com>
> >Subject: Re: [PATCH v1 3/6] intel_iommu: Add a framework to check and
> >sync host IOMMU cap/ecap
> >
> >On Wed, Feb 28, 2024 at 05:44:29PM +0800, Zhenzhong Duan wrote:
> >> From: Yi Liu <yi.l.liu@intel.com>
> >>
> >> Add a framework to check and synchronize host IOMMU cap/ecap with
> >> vIOMMU cap/ecap.
> >>
> >> The sequence will be:
> >>
> >> vtd_cap_init() initializes iommu->cap/ecap.
> >> vtd_check_hdev() update iommu->cap/ecap based on host cap/ecap.
> >> iommu->cap_frozen set when machine create done, iommu->cap/ecap
> >become readonly.
> >>
> >> Implementation details for different backends will be in following patches.
> >>
> >> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> >> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> >> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> >> ---
> >>  include/hw/i386/intel_iommu.h |  1 +
> >>  hw/i386/intel_iommu.c         | 50
> >++++++++++++++++++++++++++++++++++-
> >>  2 files changed, 50 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/include/hw/i386/intel_iommu.h
> >b/include/hw/i386/intel_iommu.h
> >> index bbc7b96add..c71a133820 100644
> >> --- a/include/hw/i386/intel_iommu.h
> >> +++ b/include/hw/i386/intel_iommu.h
> >> @@ -283,6 +283,7 @@ struct IntelIOMMUState {
> >>
> >>      uint64_t cap;                   /* The value of capability reg */
> >>      uint64_t ecap;                  /* The value of extended capability reg */
> >> +    bool cap_frozen;                /* cap/ecap become read-only after frozen */
> >>
> >>      uint32_t context_cache_gen;     /* Should be in [1,MAX] */
> >>      GHashTable *iotlb;              /* IOTLB */
> >> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> >> index ffa1ad6429..a9f9dfd6a7 100644
> >> --- a/hw/i386/intel_iommu.c
> >> +++ b/hw/i386/intel_iommu.c
> >> @@ -35,6 +35,8 @@
> >>  #include "sysemu/kvm.h"
> >>  #include "sysemu/dma.h"
> >>  #include "sysemu/sysemu.h"
> >> +#include "hw/vfio/vfio-common.h"
> >> +#include "sysemu/iommufd.h"
> >>  #include "hw/i386/apic_internal.h"
> >>  #include "kvm/kvm_i386.h"
> >>  #include "migration/vmstate.h"
> >> @@ -3819,6 +3821,38 @@ VTDAddressSpace
> >*vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
> >>      return vtd_dev_as;
> >>  }
> >>
> >> +static int vtd_check_legacy_hdev(IntelIOMMUState *s,
> >> +                                 IOMMULegacyDevice *ldev,
> >> +                                 Error **errp)
> >> +{
> >> +    return 0;
> >> +}
> >> +
> >> +static int vtd_check_iommufd_hdev(IntelIOMMUState *s,
> >> +                                  IOMMUFDDevice *idev,
> >> +                                  Error **errp)
> >> +{
> >> +    return 0;
> >> +}
> >> +
> >> +static int vtd_check_hdev(IntelIOMMUState *s, VTDHostIOMMUDevice
> >*vtd_hdev,
> >> +                          Error **errp)
> >> +{
> >> +    HostIOMMUDevice *base_dev = vtd_hdev->dev;
> >> +    IOMMUFDDevice *idev;
> >> +
> >> +    if (base_dev->type == HID_LEGACY) {
> >> +        IOMMULegacyDevice *ldev = container_of(base_dev,
> >> +                                               IOMMULegacyDevice, base);
> >> +
> >> +        return vtd_check_legacy_hdev(s, ldev, errp);
> >> +    }
> >> +
> >> +    idev = container_of(base_dev, IOMMUFDDevice, base);
> >> +
> >> +    return vtd_check_iommufd_hdev(s, idev, errp);
> >> +}
> >> +
> >>  static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int
> >devfn,
> >>                                      HostIOMMUDevice *base_dev, Error **errp)
> >>  {
> >> @@ -3829,6 +3863,7 @@ static int vtd_dev_set_iommu_device(PCIBus
> >*bus, void *opaque, int devfn,
> >>          .devfn = devfn,
> >>      };
> >>      struct vtd_as_key *new_key;
> >> +    int ret;
> >>
> >>      assert(base_dev);
> >>
> >> @@ -3848,6 +3883,13 @@ static int vtd_dev_set_iommu_device(PCIBus
> >*bus, void *opaque, int devfn,
> >>      vtd_hdev->iommu_state = s;
> >>      vtd_hdev->dev = base_dev;
> >>
> >> +    ret = vtd_check_hdev(s, vtd_hdev, errp);
> >> +    if (ret) {
> >> +        g_free(vtd_hdev);
> >> +        vtd_iommu_unlock(s);
> >> +        return ret;
> >> +    }
> >> +
> >>      new_key = g_malloc(sizeof(*new_key));
> >>      new_key->bus = bus;
> >>      new_key->devfn = devfn;
> >
> >
> >Okay. So when VFIO device is created, it will call vtd_dev_set_iommu_device
> >and that in turn will update caps.
> >
> >
> >
> >
> >> @@ -4083,7 +4125,9 @@ static void vtd_init(IntelIOMMUState *s)
> >>      s->iq_dw = false;
> >>      s->next_frcd_reg = 0;
> >>
> >> -    vtd_cap_init(s);
> >> +    if (!s->cap_frozen) {
> >> +        vtd_cap_init(s);
> >> +    }
> >>
> >
> >If it's fronzen it's because VFIO was added after machine done.
> >And then what? I think caps are just wrong?
> 
> Not quite get your question on caps being wrong. But try to explains:
> 
> When a hot plugged vfio device's host iommu cap isn't compatible with
> vIOMMU's, hotplug should fail. Currently there is no check for this and
> allow hotplug to succeed, but then some issue will reveal later,
> e.g., vIOMMU's MGAW > host IOMMU's MGAW, guest can setup iova
> mapping beyond host supported iova range, then DMA will fail.
> 
> In fact, before this series, cap is not impacted by VFIO, so it's same effect of
> frozen after machine done.
> 
> >
> >
> >I think the way to approach this is just by specifying this
> >as an option on command line.
> 
> Do you mean add a cap_frozen property to intel_iommu?
> Vtd_init() is called in realize() and system reset(), so I utilize realize() to init cap
> and froze cap before system reset(). If cap_frozen is an option, when it's set to
> false, cap could be updated every system reset and it's not a fix value any more.
> This may break migration.

No, I mean either
1. add some kind of vfio-iommu device that is not exposed to guest
   but is not hot pluggable

or

2. add a property to specify ecap, rely on management to set it correctly


> >
> >So if one wants VFIO one has to sync caps with host.
> >No?
> 
> Yes, check for compatibility. But it's not preventing the usage of VFIO
> with vIOMMU, it finds the incompatible issue earlier and fail hotplug instead of
> surprising guest driver failure.
> 
> Thanks
> Zhenzhong


I don't see where the check for compatibility and hotplug failure are.
Did I miss it?

> >
> >
> >
> >>      /*
> >>       * Rsvd field masks for spte
> >> @@ -4254,6 +4298,10 @@ static int
> >vtd_machine_done_notify_one(Object *child, void *unused)
> >>
> >>  static void vtd_machine_done_hook(Notifier *notifier, void *unused)
> >>  {
> >> +    IntelIOMMUState *iommu =
> >INTEL_IOMMU_DEVICE(x86_iommu_get_default());
> >> +
> >> +    iommu->cap_frozen = true;
> >> +
> >>      object_child_foreach_recursive(object_get_root(),
> >>                                     vtd_machine_done_notify_one, NULL);
> >>  }
> >> --
> >> 2.34.1



^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH v1 3/6] intel_iommu: Add a framework to check and sync host IOMMU cap/ecap
  2024-03-13  7:07       ` Michael S. Tsirkin
@ 2024-03-13  7:54         ` Duan, Zhenzhong
  2024-03-13 11:17           ` Michael S. Tsirkin
  0 siblings, 1 reply; 22+ messages in thread
From: Duan, Zhenzhong @ 2024-03-13  7:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, alex.williamson, clg, eric.auger, peterx, jasowang,
	jgg, nicolinc, joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y,
	Peng, Chao P, Yi Sun, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, Marcel Apfelbaum



>-----Original Message-----
>From: Michael S. Tsirkin <mst@redhat.com>
>Subject: Re: [PATCH v1 3/6] intel_iommu: Add a framework to check and
>sync host IOMMU cap/ecap
>
>On Wed, Mar 13, 2024 at 02:52:39AM +0000, Duan, Zhenzhong wrote:
>> Hi Michael,
>>
>> >-----Original Message-----
>> >From: Michael S. Tsirkin <mst@redhat.com>
>> >Subject: Re: [PATCH v1 3/6] intel_iommu: Add a framework to check and
>> >sync host IOMMU cap/ecap
>> >
>> >On Wed, Feb 28, 2024 at 05:44:29PM +0800, Zhenzhong Duan wrote:
>> >> From: Yi Liu <yi.l.liu@intel.com>
>> >>
>> >> Add a framework to check and synchronize host IOMMU cap/ecap with
>> >> vIOMMU cap/ecap.
>> >>
>> >> The sequence will be:
>> >>
>> >> vtd_cap_init() initializes iommu->cap/ecap.
>> >> vtd_check_hdev() update iommu->cap/ecap based on host cap/ecap.
>> >> iommu->cap_frozen set when machine create done, iommu->cap/ecap
>> >become readonly.
>> >>
>> >> Implementation details for different backends will be in following
>patches.
>> >>
>> >> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> >> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> >> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> >> ---
>> >>  include/hw/i386/intel_iommu.h |  1 +
>> >>  hw/i386/intel_iommu.c         | 50
>> >++++++++++++++++++++++++++++++++++-
>> >>  2 files changed, 50 insertions(+), 1 deletion(-)
>> >>
>> >> diff --git a/include/hw/i386/intel_iommu.h
>> >b/include/hw/i386/intel_iommu.h
>> >> index bbc7b96add..c71a133820 100644
>> >> --- a/include/hw/i386/intel_iommu.h
>> >> +++ b/include/hw/i386/intel_iommu.h
>> >> @@ -283,6 +283,7 @@ struct IntelIOMMUState {
>> >>
>> >>      uint64_t cap;                   /* The value of capability reg */
>> >>      uint64_t ecap;                  /* The value of extended capability reg */
>> >> +    bool cap_frozen;                /* cap/ecap become read-only after
>frozen */
>> >>
>> >>      uint32_t context_cache_gen;     /* Should be in [1,MAX] */
>> >>      GHashTable *iotlb;              /* IOTLB */
>> >> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> >> index ffa1ad6429..a9f9dfd6a7 100644
>> >> --- a/hw/i386/intel_iommu.c
>> >> +++ b/hw/i386/intel_iommu.c
>> >> @@ -35,6 +35,8 @@
>> >>  #include "sysemu/kvm.h"
>> >>  #include "sysemu/dma.h"
>> >>  #include "sysemu/sysemu.h"
>> >> +#include "hw/vfio/vfio-common.h"
>> >> +#include "sysemu/iommufd.h"
>> >>  #include "hw/i386/apic_internal.h"
>> >>  #include "kvm/kvm_i386.h"
>> >>  #include "migration/vmstate.h"
>> >> @@ -3819,6 +3821,38 @@ VTDAddressSpace
>> >*vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
>> >>      return vtd_dev_as;
>> >>  }
>> >>
>> >> +static int vtd_check_legacy_hdev(IntelIOMMUState *s,
>> >> +                                 IOMMULegacyDevice *ldev,
>> >> +                                 Error **errp)
>> >> +{
>> >> +    return 0;
>> >> +}
>> >> +
>> >> +static int vtd_check_iommufd_hdev(IntelIOMMUState *s,
>> >> +                                  IOMMUFDDevice *idev,
>> >> +                                  Error **errp)
>> >> +{
>> >> +    return 0;
>> >> +}
>> >> +
>> >> +static int vtd_check_hdev(IntelIOMMUState *s,
>VTDHostIOMMUDevice
>> >*vtd_hdev,
>> >> +                          Error **errp)
>> >> +{
>> >> +    HostIOMMUDevice *base_dev = vtd_hdev->dev;
>> >> +    IOMMUFDDevice *idev;
>> >> +
>> >> +    if (base_dev->type == HID_LEGACY) {
>> >> +        IOMMULegacyDevice *ldev = container_of(base_dev,
>> >> +                                               IOMMULegacyDevice, base);
>> >> +
>> >> +        return vtd_check_legacy_hdev(s, ldev, errp);
>> >> +    }
>> >> +
>> >> +    idev = container_of(base_dev, IOMMUFDDevice, base);
>> >> +
>> >> +    return vtd_check_iommufd_hdev(s, idev, errp);
>> >> +}
>> >> +
>> >>  static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int
>> >devfn,
>> >>                                      HostIOMMUDevice *base_dev, Error **errp)
>> >>  {
>> >> @@ -3829,6 +3863,7 @@ static int
>vtd_dev_set_iommu_device(PCIBus
>> >*bus, void *opaque, int devfn,
>> >>          .devfn = devfn,
>> >>      };
>> >>      struct vtd_as_key *new_key;
>> >> +    int ret;
>> >>
>> >>      assert(base_dev);
>> >>
>> >> @@ -3848,6 +3883,13 @@ static int
>vtd_dev_set_iommu_device(PCIBus
>> >*bus, void *opaque, int devfn,
>> >>      vtd_hdev->iommu_state = s;
>> >>      vtd_hdev->dev = base_dev;
>> >>
>> >> +    ret = vtd_check_hdev(s, vtd_hdev, errp);
>> >> +    if (ret) {
>> >> +        g_free(vtd_hdev);
>> >> +        vtd_iommu_unlock(s);
>> >> +        return ret;
>> >> +    }
>> >> +
>> >>      new_key = g_malloc(sizeof(*new_key));
>> >>      new_key->bus = bus;
>> >>      new_key->devfn = devfn;
>> >
>> >
>> >Okay. So when VFIO device is created, it will call
>vtd_dev_set_iommu_device
>> >and that in turn will update caps.
>> >
>> >
>> >
>> >
>> >> @@ -4083,7 +4125,9 @@ static void vtd_init(IntelIOMMUState *s)
>> >>      s->iq_dw = false;
>> >>      s->next_frcd_reg = 0;
>> >>
>> >> -    vtd_cap_init(s);
>> >> +    if (!s->cap_frozen) {
>> >> +        vtd_cap_init(s);
>> >> +    }
>> >>
>> >
>> >If it's fronzen it's because VFIO was added after machine done.
>> >And then what? I think caps are just wrong?
>>
>> Not quite get your question on caps being wrong. But try to explains:
>>
>> When a hot plugged vfio device's host iommu cap isn't compatible with
>> vIOMMU's, hotplug should fail. Currently there is no check for this and
>> allow hotplug to succeed, but then some issue will reveal later,
>> e.g., vIOMMU's MGAW > host IOMMU's MGAW, guest can setup iova
>> mapping beyond host supported iova range, then DMA will fail.
>>
>> In fact, before this series, cap is not impacted by VFIO, so it's same effect of
>> frozen after machine done.
>>
>> >
>> >
>> >I think the way to approach this is just by specifying this
>> >as an option on command line.
>>
>> Do you mean add a cap_frozen property to intel_iommu?
>> Vtd_init() is called in realize() and system reset(), so I utilize realize() to init
>cap
>> and froze cap before system reset(). If cap_frozen is an option, when it's
>set to
>> false, cap could be updated every system reset and it's not a fix value any
>more.
>> This may break migration.
>
>No, I mean either
>1. add some kind of vfio-iommu device that is not exposed to guest
>   but is not hot pluggable

Not quite get, what will such vfio-iommu device be used for if not exposed to guest.
If we just want to avoid vfio device hotplug, we can use
TYPE_VFIO_PCI_NOHOTPLUG device. But even if we use coldplugged
vfio device, we will face same compatibility issue of vIOMMU caps vs. host caps.

>
>or
>
>2. add a property to specify ecap, rely on management to set it correctly

Yes. This deep customization could works, but I think it's difficult to teach user
about how to set each bit in cap/ecaps.

>
>
>> >
>> >So if one wants VFIO one has to sync caps with host.
>> >No?
>>
>> Yes, check for compatibility. But it's not preventing the usage of VFIO
>> with vIOMMU, it finds the incompatible issue earlier and fail hotplug
>instead of
>> surprising guest driver failure.
>>
>> Thanks
>> Zhenzhong
>
>
>I don't see where the check for compatibility and hotplug failure are.
>Did I miss it?

There is some check code in patch4, see https://lists.gnu.org/archive/html/qemu-devel/2024-02/msg06419.html

And more widely used in nesting series's patch6, see https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02746.html
I have not sent a new version of nesting series yet, but the code related to caps check/sync will be
similar.

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 3/6] intel_iommu: Add a framework to check and sync host IOMMU cap/ecap
  2024-03-13  7:54         ` Duan, Zhenzhong
@ 2024-03-13 11:17           ` Michael S. Tsirkin
  2024-03-14  4:05             ` Duan, Zhenzhong
  2024-03-18 13:20             ` Eric Auger
  0 siblings, 2 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2024-03-13 11:17 UTC (permalink / raw)
  To: Duan, Zhenzhong
  Cc: qemu-devel, alex.williamson, clg, eric.auger, peterx, jasowang,
	jgg, nicolinc, joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y,
	Peng, Chao P, Yi Sun, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, Marcel Apfelbaum

On Wed, Mar 13, 2024 at 07:54:11AM +0000, Duan, Zhenzhong wrote:
> 
> 
> >-----Original Message-----
> >From: Michael S. Tsirkin <mst@redhat.com>
> >Subject: Re: [PATCH v1 3/6] intel_iommu: Add a framework to check and
> >sync host IOMMU cap/ecap
> >
> >On Wed, Mar 13, 2024 at 02:52:39AM +0000, Duan, Zhenzhong wrote:
> >> Hi Michael,
> >>
> >> >-----Original Message-----
> >> >From: Michael S. Tsirkin <mst@redhat.com>
> >> >Subject: Re: [PATCH v1 3/6] intel_iommu: Add a framework to check and
> >> >sync host IOMMU cap/ecap
> >> >
> >> >On Wed, Feb 28, 2024 at 05:44:29PM +0800, Zhenzhong Duan wrote:
> >> >> From: Yi Liu <yi.l.liu@intel.com>
> >> >>
> >> >> Add a framework to check and synchronize host IOMMU cap/ecap with
> >> >> vIOMMU cap/ecap.
> >> >>
> >> >> The sequence will be:
> >> >>
> >> >> vtd_cap_init() initializes iommu->cap/ecap.
> >> >> vtd_check_hdev() update iommu->cap/ecap based on host cap/ecap.
> >> >> iommu->cap_frozen set when machine create done, iommu->cap/ecap
> >> >become readonly.
> >> >>
> >> >> Implementation details for different backends will be in following
> >patches.
> >> >>
> >> >> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> >> >> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> >> >> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> >> >> ---
> >> >>  include/hw/i386/intel_iommu.h |  1 +
> >> >>  hw/i386/intel_iommu.c         | 50
> >> >++++++++++++++++++++++++++++++++++-
> >> >>  2 files changed, 50 insertions(+), 1 deletion(-)
> >> >>
> >> >> diff --git a/include/hw/i386/intel_iommu.h
> >> >b/include/hw/i386/intel_iommu.h
> >> >> index bbc7b96add..c71a133820 100644
> >> >> --- a/include/hw/i386/intel_iommu.h
> >> >> +++ b/include/hw/i386/intel_iommu.h
> >> >> @@ -283,6 +283,7 @@ struct IntelIOMMUState {
> >> >>
> >> >>      uint64_t cap;                   /* The value of capability reg */
> >> >>      uint64_t ecap;                  /* The value of extended capability reg */
> >> >> +    bool cap_frozen;                /* cap/ecap become read-only after
> >frozen */
> >> >>
> >> >>      uint32_t context_cache_gen;     /* Should be in [1,MAX] */
> >> >>      GHashTable *iotlb;              /* IOTLB */
> >> >> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> >> >> index ffa1ad6429..a9f9dfd6a7 100644
> >> >> --- a/hw/i386/intel_iommu.c
> >> >> +++ b/hw/i386/intel_iommu.c
> >> >> @@ -35,6 +35,8 @@
> >> >>  #include "sysemu/kvm.h"
> >> >>  #include "sysemu/dma.h"
> >> >>  #include "sysemu/sysemu.h"
> >> >> +#include "hw/vfio/vfio-common.h"
> >> >> +#include "sysemu/iommufd.h"
> >> >>  #include "hw/i386/apic_internal.h"
> >> >>  #include "kvm/kvm_i386.h"
> >> >>  #include "migration/vmstate.h"
> >> >> @@ -3819,6 +3821,38 @@ VTDAddressSpace
> >> >*vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
> >> >>      return vtd_dev_as;
> >> >>  }
> >> >>
> >> >> +static int vtd_check_legacy_hdev(IntelIOMMUState *s,
> >> >> +                                 IOMMULegacyDevice *ldev,
> >> >> +                                 Error **errp)
> >> >> +{
> >> >> +    return 0;
> >> >> +}
> >> >> +
> >> >> +static int vtd_check_iommufd_hdev(IntelIOMMUState *s,
> >> >> +                                  IOMMUFDDevice *idev,
> >> >> +                                  Error **errp)
> >> >> +{
> >> >> +    return 0;
> >> >> +}
> >> >> +
> >> >> +static int vtd_check_hdev(IntelIOMMUState *s,
> >VTDHostIOMMUDevice
> >> >*vtd_hdev,
> >> >> +                          Error **errp)
> >> >> +{
> >> >> +    HostIOMMUDevice *base_dev = vtd_hdev->dev;
> >> >> +    IOMMUFDDevice *idev;
> >> >> +
> >> >> +    if (base_dev->type == HID_LEGACY) {
> >> >> +        IOMMULegacyDevice *ldev = container_of(base_dev,
> >> >> +                                               IOMMULegacyDevice, base);
> >> >> +
> >> >> +        return vtd_check_legacy_hdev(s, ldev, errp);
> >> >> +    }
> >> >> +
> >> >> +    idev = container_of(base_dev, IOMMUFDDevice, base);
> >> >> +
> >> >> +    return vtd_check_iommufd_hdev(s, idev, errp);
> >> >> +}
> >> >> +
> >> >>  static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int
> >> >devfn,
> >> >>                                      HostIOMMUDevice *base_dev, Error **errp)
> >> >>  {
> >> >> @@ -3829,6 +3863,7 @@ static int
> >vtd_dev_set_iommu_device(PCIBus
> >> >*bus, void *opaque, int devfn,
> >> >>          .devfn = devfn,
> >> >>      };
> >> >>      struct vtd_as_key *new_key;
> >> >> +    int ret;
> >> >>
> >> >>      assert(base_dev);
> >> >>
> >> >> @@ -3848,6 +3883,13 @@ static int
> >vtd_dev_set_iommu_device(PCIBus
> >> >*bus, void *opaque, int devfn,
> >> >>      vtd_hdev->iommu_state = s;
> >> >>      vtd_hdev->dev = base_dev;
> >> >>
> >> >> +    ret = vtd_check_hdev(s, vtd_hdev, errp);
> >> >> +    if (ret) {
> >> >> +        g_free(vtd_hdev);
> >> >> +        vtd_iommu_unlock(s);
> >> >> +        return ret;
> >> >> +    }
> >> >> +
> >> >>      new_key = g_malloc(sizeof(*new_key));
> >> >>      new_key->bus = bus;
> >> >>      new_key->devfn = devfn;
> >> >
> >> >
> >> >Okay. So when VFIO device is created, it will call
> >vtd_dev_set_iommu_device
> >> >and that in turn will update caps.
> >> >
> >> >
> >> >
> >> >
> >> >> @@ -4083,7 +4125,9 @@ static void vtd_init(IntelIOMMUState *s)
> >> >>      s->iq_dw = false;
> >> >>      s->next_frcd_reg = 0;
> >> >>
> >> >> -    vtd_cap_init(s);
> >> >> +    if (!s->cap_frozen) {
> >> >> +        vtd_cap_init(s);
> >> >> +    }
> >> >>
> >> >
> >> >If it's fronzen it's because VFIO was added after machine done.
> >> >And then what? I think caps are just wrong?
> >>
> >> Not quite get your question on caps being wrong. But try to explains:
> >>
> >> When a hot plugged vfio device's host iommu cap isn't compatible with
> >> vIOMMU's, hotplug should fail. Currently there is no check for this and
> >> allow hotplug to succeed, but then some issue will reveal later,
> >> e.g., vIOMMU's MGAW > host IOMMU's MGAW, guest can setup iova
> >> mapping beyond host supported iova range, then DMA will fail.
> >>
> >> In fact, before this series, cap is not impacted by VFIO, so it's same effect of
> >> frozen after machine done.
> >>
> >> >
> >> >
> >> >I think the way to approach this is just by specifying this
> >> >as an option on command line.
> >>
> >> Do you mean add a cap_frozen property to intel_iommu?
> >> Vtd_init() is called in realize() and system reset(), so I utilize realize() to init
> >cap
> >> and froze cap before system reset(). If cap_frozen is an option, when it's
> >set to
> >> false, cap could be updated every system reset and it's not a fix value any
> >more.
> >> This may break migration.
> >
> >No, I mean either
> >1. add some kind of vfio-iommu device that is not exposed to guest
> >   but is not hot pluggable
> 
> Not quite get, what will such vfio-iommu device be used for if not exposed to guest.

It will update the IOMMU.
And do so without need for tricky callbacks.

-- 
MST



^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH v1 3/6] intel_iommu: Add a framework to check and sync host IOMMU cap/ecap
  2024-03-13 11:17           ` Michael S. Tsirkin
@ 2024-03-14  4:05             ` Duan, Zhenzhong
  2024-03-18 13:20             ` Eric Auger
  1 sibling, 0 replies; 22+ messages in thread
From: Duan, Zhenzhong @ 2024-03-14  4:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, alex.williamson, clg, eric.auger, peterx, jasowang,
	jgg, nicolinc, joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y,
	Peng, Chao P, Yi Sun, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, Marcel Apfelbaum



>-----Original Message-----
>From: Michael S. Tsirkin <mst@redhat.com>
>Subject: Re: [PATCH v1 3/6] intel_iommu: Add a framework to check and
>sync host IOMMU cap/ecap
>
>On Wed, Mar 13, 2024 at 07:54:11AM +0000, Duan, Zhenzhong wrote:
>>
>>
>> >-----Original Message-----
>> >From: Michael S. Tsirkin <mst@redhat.com>
>> >Subject: Re: [PATCH v1 3/6] intel_iommu: Add a framework to check and
>> >sync host IOMMU cap/ecap
>> >
>> >On Wed, Mar 13, 2024 at 02:52:39AM +0000, Duan, Zhenzhong wrote:
>> >> Hi Michael,
>> >>
>> >> >-----Original Message-----
>> >> >From: Michael S. Tsirkin <mst@redhat.com>
>> >> >Subject: Re: [PATCH v1 3/6] intel_iommu: Add a framework to check
>and
>> >> >sync host IOMMU cap/ecap
>> >> >
>> >> >On Wed, Feb 28, 2024 at 05:44:29PM +0800, Zhenzhong Duan wrote:
>> >> >> From: Yi Liu <yi.l.liu@intel.com>
>> >> >>
>> >> >> Add a framework to check and synchronize host IOMMU cap/ecap
>with
>> >> >> vIOMMU cap/ecap.
>> >> >>
>> >> >> The sequence will be:
>> >> >>
>> >> >> vtd_cap_init() initializes iommu->cap/ecap.
>> >> >> vtd_check_hdev() update iommu->cap/ecap based on host cap/ecap.
>> >> >> iommu->cap_frozen set when machine create done, iommu-
>>cap/ecap
>> >> >become readonly.
>> >> >>
>> >> >> Implementation details for different backends will be in following
>> >patches.
>> >> >>
>> >> >> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> >> >> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> >> >> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> >> >> ---
>> >> >>  include/hw/i386/intel_iommu.h |  1 +
>> >> >>  hw/i386/intel_iommu.c         | 50
>> >> >++++++++++++++++++++++++++++++++++-
>> >> >>  2 files changed, 50 insertions(+), 1 deletion(-)
>> >> >>
>> >> >> diff --git a/include/hw/i386/intel_iommu.h
>> >> >b/include/hw/i386/intel_iommu.h
>> >> >> index bbc7b96add..c71a133820 100644
>> >> >> --- a/include/hw/i386/intel_iommu.h
>> >> >> +++ b/include/hw/i386/intel_iommu.h
>> >> >> @@ -283,6 +283,7 @@ struct IntelIOMMUState {
>> >> >>
>> >> >>      uint64_t cap;                   /* The value of capability reg */
>> >> >>      uint64_t ecap;                  /* The value of extended capability reg
>*/
>> >> >> +    bool cap_frozen;                /* cap/ecap become read-only after
>> >frozen */
>> >> >>
>> >> >>      uint32_t context_cache_gen;     /* Should be in [1,MAX] */
>> >> >>      GHashTable *iotlb;              /* IOTLB */
>> >> >> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> >> >> index ffa1ad6429..a9f9dfd6a7 100644
>> >> >> --- a/hw/i386/intel_iommu.c
>> >> >> +++ b/hw/i386/intel_iommu.c
>> >> >> @@ -35,6 +35,8 @@
>> >> >>  #include "sysemu/kvm.h"
>> >> >>  #include "sysemu/dma.h"
>> >> >>  #include "sysemu/sysemu.h"
>> >> >> +#include "hw/vfio/vfio-common.h"
>> >> >> +#include "sysemu/iommufd.h"
>> >> >>  #include "hw/i386/apic_internal.h"
>> >> >>  #include "kvm/kvm_i386.h"
>> >> >>  #include "migration/vmstate.h"
>> >> >> @@ -3819,6 +3821,38 @@ VTDAddressSpace
>> >> >*vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
>> >> >>      return vtd_dev_as;
>> >> >>  }
>> >> >>
>> >> >> +static int vtd_check_legacy_hdev(IntelIOMMUState *s,
>> >> >> +                                 IOMMULegacyDevice *ldev,
>> >> >> +                                 Error **errp)
>> >> >> +{
>> >> >> +    return 0;
>> >> >> +}
>> >> >> +
>> >> >> +static int vtd_check_iommufd_hdev(IntelIOMMUState *s,
>> >> >> +                                  IOMMUFDDevice *idev,
>> >> >> +                                  Error **errp)
>> >> >> +{
>> >> >> +    return 0;
>> >> >> +}
>> >> >> +
>> >> >> +static int vtd_check_hdev(IntelIOMMUState *s,
>> >VTDHostIOMMUDevice
>> >> >*vtd_hdev,
>> >> >> +                          Error **errp)
>> >> >> +{
>> >> >> +    HostIOMMUDevice *base_dev = vtd_hdev->dev;
>> >> >> +    IOMMUFDDevice *idev;
>> >> >> +
>> >> >> +    if (base_dev->type == HID_LEGACY) {
>> >> >> +        IOMMULegacyDevice *ldev = container_of(base_dev,
>> >> >> +                                               IOMMULegacyDevice, base);
>> >> >> +
>> >> >> +        return vtd_check_legacy_hdev(s, ldev, errp);
>> >> >> +    }
>> >> >> +
>> >> >> +    idev = container_of(base_dev, IOMMUFDDevice, base);
>> >> >> +
>> >> >> +    return vtd_check_iommufd_hdev(s, idev, errp);
>> >> >> +}
>> >> >> +
>> >> >>  static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque,
>int
>> >> >devfn,
>> >> >>                                      HostIOMMUDevice *base_dev, Error **errp)
>> >> >>  {
>> >> >> @@ -3829,6 +3863,7 @@ static int
>> >vtd_dev_set_iommu_device(PCIBus
>> >> >*bus, void *opaque, int devfn,
>> >> >>          .devfn = devfn,
>> >> >>      };
>> >> >>      struct vtd_as_key *new_key;
>> >> >> +    int ret;
>> >> >>
>> >> >>      assert(base_dev);
>> >> >>
>> >> >> @@ -3848,6 +3883,13 @@ static int
>> >vtd_dev_set_iommu_device(PCIBus
>> >> >*bus, void *opaque, int devfn,
>> >> >>      vtd_hdev->iommu_state = s;
>> >> >>      vtd_hdev->dev = base_dev;
>> >> >>
>> >> >> +    ret = vtd_check_hdev(s, vtd_hdev, errp);
>> >> >> +    if (ret) {
>> >> >> +        g_free(vtd_hdev);
>> >> >> +        vtd_iommu_unlock(s);
>> >> >> +        return ret;
>> >> >> +    }
>> >> >> +
>> >> >>      new_key = g_malloc(sizeof(*new_key));
>> >> >>      new_key->bus = bus;
>> >> >>      new_key->devfn = devfn;
>> >> >
>> >> >
>> >> >Okay. So when VFIO device is created, it will call
>> >vtd_dev_set_iommu_device
>> >> >and that in turn will update caps.
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >> @@ -4083,7 +4125,9 @@ static void vtd_init(IntelIOMMUState *s)
>> >> >>      s->iq_dw = false;
>> >> >>      s->next_frcd_reg = 0;
>> >> >>
>> >> >> -    vtd_cap_init(s);
>> >> >> +    if (!s->cap_frozen) {
>> >> >> +        vtd_cap_init(s);
>> >> >> +    }
>> >> >>
>> >> >
>> >> >If it's fronzen it's because VFIO was added after machine done.
>> >> >And then what? I think caps are just wrong?
>> >>
>> >> Not quite get your question on caps being wrong. But try to explains:
>> >>
>> >> When a hot plugged vfio device's host iommu cap isn't compatible with
>> >> vIOMMU's, hotplug should fail. Currently there is no check for this and
>> >> allow hotplug to succeed, but then some issue will reveal later,
>> >> e.g., vIOMMU's MGAW > host IOMMU's MGAW, guest can setup iova
>> >> mapping beyond host supported iova range, then DMA will fail.
>> >>
>> >> In fact, before this series, cap is not impacted by VFIO, so it's same
>effect of
>> >> frozen after machine done.
>> >>
>> >> >
>> >> >
>> >> >I think the way to approach this is just by specifying this
>> >> >as an option on command line.
>> >>
>> >> Do you mean add a cap_frozen property to intel_iommu?
>> >> Vtd_init() is called in realize() and system reset(), so I utilize realize() to
>init
>> >cap
>> >> and froze cap before system reset(). If cap_frozen is an option, when it's
>> >set to
>> >> false, cap could be updated every system reset and it's not a fix value
>any
>> >more.
>> >> This may break migration.
>> >
>> >No, I mean either
>> >1. add some kind of vfio-iommu device that is not exposed to guest
>> >   but is not hot pluggable
>>
>> Not quite get, what will such vfio-iommu device be used for if not exposed
>to guest.
>
>It will update the IOMMU.
>And do so without need for tricky callbacks.

Sorry to bother you again, just want to get clear understanding to your suggestions.

1. Is vfio-iommu device type inherited from TYPE_VFIO_PCI_NOHOTPLUG?

2. How to avoid tricky set/unset_iommu_device callbacks with vfio-iommu device,
    if we want to pass vfio-iommu device to vIOMMU to update caps?

3. Do you mean we can loop on vfio_device_list to get vfio-iommu device?
    We want to support vdpa device in future, also need to loop vdpa device list;
    also need to bypass vfio device whose host bridge configured IOMMU bypassed.

4. It looks, with vfio-iommu device hotplug is not supported, or I misunderstand?

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 3/6] intel_iommu: Add a framework to check and sync host IOMMU cap/ecap
  2024-03-13 11:17           ` Michael S. Tsirkin
  2024-03-14  4:05             ` Duan, Zhenzhong
@ 2024-03-18 13:20             ` Eric Auger
  2024-03-28  7:20               ` Michael S. Tsirkin
  1 sibling, 1 reply; 22+ messages in thread
From: Eric Auger @ 2024-03-18 13:20 UTC (permalink / raw)
  To: Michael S. Tsirkin, Duan, Zhenzhong
  Cc: qemu-devel, alex.williamson, clg, peterx, jasowang, jgg,
	nicolinc, joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y,
	Peng, Chao P, Yi Sun, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, Marcel Apfelbaum

Hi Michael,

On 3/13/24 12:17, Michael S. Tsirkin wrote:
> On Wed, Mar 13, 2024 at 07:54:11AM +0000, Duan, Zhenzhong wrote:
>>
>>> -----Original Message-----
>>> From: Michael S. Tsirkin <mst@redhat.com>
>>> Subject: Re: [PATCH v1 3/6] intel_iommu: Add a framework to check and
>>> sync host IOMMU cap/ecap
>>>
>>> On Wed, Mar 13, 2024 at 02:52:39AM +0000, Duan, Zhenzhong wrote:
>>>> Hi Michael,
>>>>
>>>>> -----Original Message-----
>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>> Subject: Re: [PATCH v1 3/6] intel_iommu: Add a framework to check and
>>>>> sync host IOMMU cap/ecap
>>>>>
>>>>> On Wed, Feb 28, 2024 at 05:44:29PM +0800, Zhenzhong Duan wrote:
>>>>>> From: Yi Liu <yi.l.liu@intel.com>
>>>>>>
>>>>>> Add a framework to check and synchronize host IOMMU cap/ecap with
>>>>>> vIOMMU cap/ecap.
>>>>>>
>>>>>> The sequence will be:
>>>>>>
>>>>>> vtd_cap_init() initializes iommu->cap/ecap.
>>>>>> vtd_check_hdev() update iommu->cap/ecap based on host cap/ecap.
>>>>>> iommu->cap_frozen set when machine create done, iommu->cap/ecap
>>>>> become readonly.
>>>>>> Implementation details for different backends will be in following
>>> patches.
>>>>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>>>>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>>>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>>>> ---
>>>>>>  include/hw/i386/intel_iommu.h |  1 +
>>>>>>  hw/i386/intel_iommu.c         | 50
>>>>> ++++++++++++++++++++++++++++++++++-
>>>>>>  2 files changed, 50 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/include/hw/i386/intel_iommu.h
>>>>> b/include/hw/i386/intel_iommu.h
>>>>>> index bbc7b96add..c71a133820 100644
>>>>>> --- a/include/hw/i386/intel_iommu.h
>>>>>> +++ b/include/hw/i386/intel_iommu.h
>>>>>> @@ -283,6 +283,7 @@ struct IntelIOMMUState {
>>>>>>
>>>>>>      uint64_t cap;                   /* The value of capability reg */
>>>>>>      uint64_t ecap;                  /* The value of extended capability reg */
>>>>>> +    bool cap_frozen;                /* cap/ecap become read-only after
>>> frozen */
>>>>>>      uint32_t context_cache_gen;     /* Should be in [1,MAX] */
>>>>>>      GHashTable *iotlb;              /* IOTLB */
>>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>>>> index ffa1ad6429..a9f9dfd6a7 100644
>>>>>> --- a/hw/i386/intel_iommu.c
>>>>>> +++ b/hw/i386/intel_iommu.c
>>>>>> @@ -35,6 +35,8 @@
>>>>>>  #include "sysemu/kvm.h"
>>>>>>  #include "sysemu/dma.h"
>>>>>>  #include "sysemu/sysemu.h"
>>>>>> +#include "hw/vfio/vfio-common.h"
>>>>>> +#include "sysemu/iommufd.h"
>>>>>>  #include "hw/i386/apic_internal.h"
>>>>>>  #include "kvm/kvm_i386.h"
>>>>>>  #include "migration/vmstate.h"
>>>>>> @@ -3819,6 +3821,38 @@ VTDAddressSpace
>>>>> *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
>>>>>>      return vtd_dev_as;
>>>>>>  }
>>>>>>
>>>>>> +static int vtd_check_legacy_hdev(IntelIOMMUState *s,
>>>>>> +                                 IOMMULegacyDevice *ldev,
>>>>>> +                                 Error **errp)
>>>>>> +{
>>>>>> +    return 0;
>>>>>> +}
>>>>>> +
>>>>>> +static int vtd_check_iommufd_hdev(IntelIOMMUState *s,
>>>>>> +                                  IOMMUFDDevice *idev,
>>>>>> +                                  Error **errp)
>>>>>> +{
>>>>>> +    return 0;
>>>>>> +}
>>>>>> +
>>>>>> +static int vtd_check_hdev(IntelIOMMUState *s,
>>> VTDHostIOMMUDevice
>>>>> *vtd_hdev,
>>>>>> +                          Error **errp)
>>>>>> +{
>>>>>> +    HostIOMMUDevice *base_dev = vtd_hdev->dev;
>>>>>> +    IOMMUFDDevice *idev;
>>>>>> +
>>>>>> +    if (base_dev->type == HID_LEGACY) {
>>>>>> +        IOMMULegacyDevice *ldev = container_of(base_dev,
>>>>>> +                                               IOMMULegacyDevice, base);
>>>>>> +
>>>>>> +        return vtd_check_legacy_hdev(s, ldev, errp);
>>>>>> +    }
>>>>>> +
>>>>>> +    idev = container_of(base_dev, IOMMUFDDevice, base);
>>>>>> +
>>>>>> +    return vtd_check_iommufd_hdev(s, idev, errp);
>>>>>> +}
>>>>>> +
>>>>>>  static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int
>>>>> devfn,
>>>>>>                                      HostIOMMUDevice *base_dev, Error **errp)
>>>>>>  {
>>>>>> @@ -3829,6 +3863,7 @@ static int
>>> vtd_dev_set_iommu_device(PCIBus
>>>>> *bus, void *opaque, int devfn,
>>>>>>          .devfn = devfn,
>>>>>>      };
>>>>>>      struct vtd_as_key *new_key;
>>>>>> +    int ret;
>>>>>>
>>>>>>      assert(base_dev);
>>>>>>
>>>>>> @@ -3848,6 +3883,13 @@ static int
>>> vtd_dev_set_iommu_device(PCIBus
>>>>> *bus, void *opaque, int devfn,
>>>>>>      vtd_hdev->iommu_state = s;
>>>>>>      vtd_hdev->dev = base_dev;
>>>>>>
>>>>>> +    ret = vtd_check_hdev(s, vtd_hdev, errp);
>>>>>> +    if (ret) {
>>>>>> +        g_free(vtd_hdev);
>>>>>> +        vtd_iommu_unlock(s);
>>>>>> +        return ret;
>>>>>> +    }
>>>>>> +
>>>>>>      new_key = g_malloc(sizeof(*new_key));
>>>>>>      new_key->bus = bus;
>>>>>>      new_key->devfn = devfn;
>>>>>
>>>>> Okay. So when VFIO device is created, it will call
>>> vtd_dev_set_iommu_device
>>>>> and that in turn will update caps.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> @@ -4083,7 +4125,9 @@ static void vtd_init(IntelIOMMUState *s)
>>>>>>      s->iq_dw = false;
>>>>>>      s->next_frcd_reg = 0;
>>>>>>
>>>>>> -    vtd_cap_init(s);
>>>>>> +    if (!s->cap_frozen) {
>>>>>> +        vtd_cap_init(s);
>>>>>> +    }
>>>>>>
>>>>> If it's fronzen it's because VFIO was added after machine done.
>>>>> And then what? I think caps are just wrong?
>>>> Not quite get your question on caps being wrong. But try to explains:
>>>>
>>>> When a hot plugged vfio device's host iommu cap isn't compatible with
>>>> vIOMMU's, hotplug should fail. Currently there is no check for this and
>>>> allow hotplug to succeed, but then some issue will reveal later,
>>>> e.g., vIOMMU's MGAW > host IOMMU's MGAW, guest can setup iova
>>>> mapping beyond host supported iova range, then DMA will fail.
>>>>
>>>> In fact, before this series, cap is not impacted by VFIO, so it's same effect of
>>>> frozen after machine done.
>>>>
>>>>>
>>>>> I think the way to approach this is just by specifying this
>>>>> as an option on command line.
>>>> Do you mean add a cap_frozen property to intel_iommu?
>>>> Vtd_init() is called in realize() and system reset(), so I utilize realize() to init
>>> cap
>>>> and froze cap before system reset(). If cap_frozen is an option, when it's
>>> set to
>>>> false, cap could be updated every system reset and it's not a fix value any
>>> more.
>>>> This may break migration.
>>> No, I mean either
>>> 1. add some kind of vfio-iommu device that is not exposed to guest
>>>   but is not hot pluggable
>> Not quite get, what will such vfio-iommu device be used for if not exposed to guest.
> It will update the IOMMU.
> And do so without need for tricky callbacks.
>

At the moment the only way VFIO can pass info to vIOMMU is through the
IOMMU MR API (IOMMUMemoryRegionClass).
Unfortunately this API is not fully adapted to new use cases because it
relies on the IOMMU MR to be enabled which causes the VFIO

vfio_listener_region_add() to be called and call the relevant IOMMU MR callback. Before the IOMMU MR enablement there is no way to convey info from VFIO to the vIOMMU. That why, for several years and since the beginning of discussions related to nested IOMMU, Peter Xu, Yi Liu, myself headed towards the usage of PCIIOMMUOps instead.


You will find in 
[RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices
https://lore.kernel.org/all/20240117080414.316890-1-eric.auger@redhat.com/
yet another example of such kind of PCIIOMMUOps. In that series instead of relying on a HostIOMMUDevice object as proposed by Zhenzhong it uses a direct callback but it is still the same principle and the HostIOMMUDevice looks better to accomodate both iommufd and VFIO backend.


host reserved memory regions is not something we can easiliy describe at vIOMMU level. The info are fetched from host and propagated to the vIOMMU

I think nested stage setup would also need this PCIIOMMUOps trick to setup stage 1 info. So the introduction of the so called HostIOMMUDevice object has a broader scope than a gaw option or ecap option set I think.

If you don't like this approach, please can you elaborate on the "vfio-iommu device" proposal above so that we can explore it.

Thank you in advance

Eric



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 3/6] intel_iommu: Add a framework to check and sync host IOMMU cap/ecap
  2024-03-18 13:20             ` Eric Auger
@ 2024-03-28  7:20               ` Michael S. Tsirkin
  2024-03-29  3:22                 ` Duan, Zhenzhong
  0 siblings, 1 reply; 22+ messages in thread
From: Michael S. Tsirkin @ 2024-03-28  7:20 UTC (permalink / raw)
  To: Eric Auger
  Cc: Duan, Zhenzhong, qemu-devel, alex.williamson, clg, peterx,
	jasowang, jgg, nicolinc, joao.m.martins, Tian, Kevin, Liu, Yi L,
	Sun, Yi Y, Peng, Chao P, Yi Sun, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost, Marcel Apfelbaum

On Mon, Mar 18, 2024 at 02:20:50PM +0100, Eric Auger wrote:
> Hi Michael,
> 
> On 3/13/24 12:17, Michael S. Tsirkin wrote:
> > On Wed, Mar 13, 2024 at 07:54:11AM +0000, Duan, Zhenzhong wrote:
> >>
> >>> -----Original Message-----
> >>> From: Michael S. Tsirkin <mst@redhat.com>
> >>> Subject: Re: [PATCH v1 3/6] intel_iommu: Add a framework to check and
> >>> sync host IOMMU cap/ecap
> >>>
> >>> On Wed, Mar 13, 2024 at 02:52:39AM +0000, Duan, Zhenzhong wrote:
> >>>> Hi Michael,
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Michael S. Tsirkin <mst@redhat.com>
> >>>>> Subject: Re: [PATCH v1 3/6] intel_iommu: Add a framework to check and
> >>>>> sync host IOMMU cap/ecap
> >>>>>
> >>>>> On Wed, Feb 28, 2024 at 05:44:29PM +0800, Zhenzhong Duan wrote:
> >>>>>> From: Yi Liu <yi.l.liu@intel.com>
> >>>>>>
> >>>>>> Add a framework to check and synchronize host IOMMU cap/ecap with
> >>>>>> vIOMMU cap/ecap.
> >>>>>>
> >>>>>> The sequence will be:
> >>>>>>
> >>>>>> vtd_cap_init() initializes iommu->cap/ecap.
> >>>>>> vtd_check_hdev() update iommu->cap/ecap based on host cap/ecap.
> >>>>>> iommu->cap_frozen set when machine create done, iommu->cap/ecap
> >>>>> become readonly.
> >>>>>> Implementation details for different backends will be in following
> >>> patches.
> >>>>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> >>>>>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> >>>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> >>>>>> ---
> >>>>>>  include/hw/i386/intel_iommu.h |  1 +
> >>>>>>  hw/i386/intel_iommu.c         | 50
> >>>>> ++++++++++++++++++++++++++++++++++-
> >>>>>>  2 files changed, 50 insertions(+), 1 deletion(-)
> >>>>>>
> >>>>>> diff --git a/include/hw/i386/intel_iommu.h
> >>>>> b/include/hw/i386/intel_iommu.h
> >>>>>> index bbc7b96add..c71a133820 100644
> >>>>>> --- a/include/hw/i386/intel_iommu.h
> >>>>>> +++ b/include/hw/i386/intel_iommu.h
> >>>>>> @@ -283,6 +283,7 @@ struct IntelIOMMUState {
> >>>>>>
> >>>>>>      uint64_t cap;                   /* The value of capability reg */
> >>>>>>      uint64_t ecap;                  /* The value of extended capability reg */
> >>>>>> +    bool cap_frozen;                /* cap/ecap become read-only after
> >>> frozen */
> >>>>>>      uint32_t context_cache_gen;     /* Should be in [1,MAX] */
> >>>>>>      GHashTable *iotlb;              /* IOTLB */
> >>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> >>>>>> index ffa1ad6429..a9f9dfd6a7 100644
> >>>>>> --- a/hw/i386/intel_iommu.c
> >>>>>> +++ b/hw/i386/intel_iommu.c
> >>>>>> @@ -35,6 +35,8 @@
> >>>>>>  #include "sysemu/kvm.h"
> >>>>>>  #include "sysemu/dma.h"
> >>>>>>  #include "sysemu/sysemu.h"
> >>>>>> +#include "hw/vfio/vfio-common.h"
> >>>>>> +#include "sysemu/iommufd.h"
> >>>>>>  #include "hw/i386/apic_internal.h"
> >>>>>>  #include "kvm/kvm_i386.h"
> >>>>>>  #include "migration/vmstate.h"
> >>>>>> @@ -3819,6 +3821,38 @@ VTDAddressSpace
> >>>>> *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
> >>>>>>      return vtd_dev_as;
> >>>>>>  }
> >>>>>>
> >>>>>> +static int vtd_check_legacy_hdev(IntelIOMMUState *s,
> >>>>>> +                                 IOMMULegacyDevice *ldev,
> >>>>>> +                                 Error **errp)
> >>>>>> +{
> >>>>>> +    return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static int vtd_check_iommufd_hdev(IntelIOMMUState *s,
> >>>>>> +                                  IOMMUFDDevice *idev,
> >>>>>> +                                  Error **errp)
> >>>>>> +{
> >>>>>> +    return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static int vtd_check_hdev(IntelIOMMUState *s,
> >>> VTDHostIOMMUDevice
> >>>>> *vtd_hdev,
> >>>>>> +                          Error **errp)
> >>>>>> +{
> >>>>>> +    HostIOMMUDevice *base_dev = vtd_hdev->dev;
> >>>>>> +    IOMMUFDDevice *idev;
> >>>>>> +
> >>>>>> +    if (base_dev->type == HID_LEGACY) {
> >>>>>> +        IOMMULegacyDevice *ldev = container_of(base_dev,
> >>>>>> +                                               IOMMULegacyDevice, base);
> >>>>>> +
> >>>>>> +        return vtd_check_legacy_hdev(s, ldev, errp);
> >>>>>> +    }
> >>>>>> +
> >>>>>> +    idev = container_of(base_dev, IOMMUFDDevice, base);
> >>>>>> +
> >>>>>> +    return vtd_check_iommufd_hdev(s, idev, errp);
> >>>>>> +}
> >>>>>> +
> >>>>>>  static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int
> >>>>> devfn,
> >>>>>>                                      HostIOMMUDevice *base_dev, Error **errp)
> >>>>>>  {
> >>>>>> @@ -3829,6 +3863,7 @@ static int
> >>> vtd_dev_set_iommu_device(PCIBus
> >>>>> *bus, void *opaque, int devfn,
> >>>>>>          .devfn = devfn,
> >>>>>>      };
> >>>>>>      struct vtd_as_key *new_key;
> >>>>>> +    int ret;
> >>>>>>
> >>>>>>      assert(base_dev);
> >>>>>>
> >>>>>> @@ -3848,6 +3883,13 @@ static int
> >>> vtd_dev_set_iommu_device(PCIBus
> >>>>> *bus, void *opaque, int devfn,
> >>>>>>      vtd_hdev->iommu_state = s;
> >>>>>>      vtd_hdev->dev = base_dev;
> >>>>>>
> >>>>>> +    ret = vtd_check_hdev(s, vtd_hdev, errp);
> >>>>>> +    if (ret) {
> >>>>>> +        g_free(vtd_hdev);
> >>>>>> +        vtd_iommu_unlock(s);
> >>>>>> +        return ret;
> >>>>>> +    }
> >>>>>> +
> >>>>>>      new_key = g_malloc(sizeof(*new_key));
> >>>>>>      new_key->bus = bus;
> >>>>>>      new_key->devfn = devfn;
> >>>>>
> >>>>> Okay. So when VFIO device is created, it will call
> >>> vtd_dev_set_iommu_device
> >>>>> and that in turn will update caps.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>> @@ -4083,7 +4125,9 @@ static void vtd_init(IntelIOMMUState *s)
> >>>>>>      s->iq_dw = false;
> >>>>>>      s->next_frcd_reg = 0;
> >>>>>>
> >>>>>> -    vtd_cap_init(s);
> >>>>>> +    if (!s->cap_frozen) {
> >>>>>> +        vtd_cap_init(s);
> >>>>>> +    }
> >>>>>>
> >>>>> If it's fronzen it's because VFIO was added after machine done.
> >>>>> And then what? I think caps are just wrong?
> >>>> Not quite get your question on caps being wrong. But try to explains:
> >>>>
> >>>> When a hot plugged vfio device's host iommu cap isn't compatible with
> >>>> vIOMMU's, hotplug should fail. Currently there is no check for this and
> >>>> allow hotplug to succeed, but then some issue will reveal later,
> >>>> e.g., vIOMMU's MGAW > host IOMMU's MGAW, guest can setup iova
> >>>> mapping beyond host supported iova range, then DMA will fail.
> >>>>
> >>>> In fact, before this series, cap is not impacted by VFIO, so it's same effect of
> >>>> frozen after machine done.
> >>>>
> >>>>>
> >>>>> I think the way to approach this is just by specifying this
> >>>>> as an option on command line.
> >>>> Do you mean add a cap_frozen property to intel_iommu?
> >>>> Vtd_init() is called in realize() and system reset(), so I utilize realize() to init
> >>> cap
> >>>> and froze cap before system reset(). If cap_frozen is an option, when it's
> >>> set to
> >>>> false, cap could be updated every system reset and it's not a fix value any
> >>> more.
> >>>> This may break migration.
> >>> No, I mean either
> >>> 1. add some kind of vfio-iommu device that is not exposed to guest
> >>>   but is not hot pluggable
> >> Not quite get, what will such vfio-iommu device be used for if not exposed to guest.
> > It will update the IOMMU.
> > And do so without need for tricky callbacks.
> >
> 
> At the moment the only way VFIO can pass info to vIOMMU is through the
> IOMMU MR API (IOMMUMemoryRegionClass).
> Unfortunately this API is not fully adapted to new use cases because it
> relies on the IOMMU MR to be enabled which causes the VFIO
> 
> vfio_listener_region_add() to be called and call the relevant IOMMU MR callback. Before the IOMMU MR enablement there is no way to convey info from VFIO to the vIOMMU. That why, for several years and since the beginning of discussions related to nested IOMMU, Peter Xu, Yi Liu, myself headed towards the usage of PCIIOMMUOps instead.
> 
> 
> You will find in 
> [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices
> https://lore.kernel.org/all/20240117080414.316890-1-eric.auger@redhat.com/
> yet another example of such kind of PCIIOMMUOps. In that series instead of relying on a HostIOMMUDevice object as proposed by Zhenzhong it uses a direct callback but it is still the same principle and the HostIOMMUDevice looks better to accomodate both iommufd and VFIO backend.
> 
> 
> host reserved memory regions is not something we can easiliy describe at vIOMMU level. The info are fetched from host and propagated to the vIOMMU
> 
> I think nested stage setup would also need this PCIIOMMUOps trick to setup stage 1 info. So the introduction of the so called HostIOMMUDevice object has a broader scope than a gaw option or ecap option set I think.
> 
> If you don't like this approach, please can you elaborate on the "vfio-iommu device" proposal above so that we can explore it.
> 
> Thank you in advance
> 
> Eric


Hi Eric,
Sorry about the delay in answering - was on vacation.

My concern is with user interface not with the internal API used.
The concern is simple - assymetry and complex rules in handling devices.
E.g. a non hotplugged vfio device adjusts the vIOMMU, then you can
remove it and hotplug another one and it works, but if you start
without vfio then hotplug then it might fail because it's too late
to adjust the vIOMMU.

And what I am saying, is that we either want something on the qemu command line
that will tell the vIOMMU "please use info from the host" or
have management take info from the host and supply that to the vIOMMU.
The former seems more user friendly, for sure. The disadvantage of it
is that one can't e.g. take the least common denominator between two
hosts and make things migrateable in this way.
If we limit our ambition to VFIO that might be acceptable, after all
VFIO already assumes very similar hardware on src and dst of migration.
The later is what we did for aw-bits.

I see lots of acceptable options, but yes it seems nice to have
something single that will work for all IOMMUs and also maybe live under
VFIO? Thus I came up with the idea of a stub device living under vfio
who's job is just to be present on command line and adjust the guest
machine (ATM the viommu but we might see other uses too) to match host.


-- 
MST



^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH v1 3/6] intel_iommu: Add a framework to check and sync host IOMMU cap/ecap
  2024-03-28  7:20               ` Michael S. Tsirkin
@ 2024-03-29  3:22                 ` Duan, Zhenzhong
  0 siblings, 0 replies; 22+ messages in thread
From: Duan, Zhenzhong @ 2024-03-29  3:22 UTC (permalink / raw)
  To: Michael S. Tsirkin, Eric Auger
  Cc: qemu-devel, alex.williamson, clg, peterx, jasowang, jgg,
	nicolinc, joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y,
	Peng, Chao P, Yi Sun, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, Marcel Apfelbaum

Hi Michael,

>-----Original Message-----
>From: Michael S. Tsirkin <mst@redhat.com>
>Subject: Re: [PATCH v1 3/6] intel_iommu: Add a framework to check and
>sync host IOMMU cap/ecap
>
>On Mon, Mar 18, 2024 at 02:20:50PM +0100, Eric Auger wrote:
>> Hi Michael,
>>
>> On 3/13/24 12:17, Michael S. Tsirkin wrote:
>> > On Wed, Mar 13, 2024 at 07:54:11AM +0000, Duan, Zhenzhong wrote:
>> >>
>> >>> -----Original Message-----
>> >>> From: Michael S. Tsirkin <mst@redhat.com>
>> >>> Subject: Re: [PATCH v1 3/6] intel_iommu: Add a framework to check
>and
>> >>> sync host IOMMU cap/ecap
>> >>>
>> >>> On Wed, Mar 13, 2024 at 02:52:39AM +0000, Duan, Zhenzhong wrote:
>> >>>> Hi Michael,
>> >>>>
>> >>>>> -----Original Message-----
>> >>>>> From: Michael S. Tsirkin <mst@redhat.com>
>> >>>>> Subject: Re: [PATCH v1 3/6] intel_iommu: Add a framework to
>check and
>> >>>>> sync host IOMMU cap/ecap
>> >>>>>
>> >>>>> On Wed, Feb 28, 2024 at 05:44:29PM +0800, Zhenzhong Duan
>wrote:
>> >>>>>> From: Yi Liu <yi.l.liu@intel.com>
>> >>>>>>
>> >>>>>> Add a framework to check and synchronize host IOMMU cap/ecap
>with
>> >>>>>> vIOMMU cap/ecap.
>> >>>>>>
>> >>>>>> The sequence will be:
>> >>>>>>
>> >>>>>> vtd_cap_init() initializes iommu->cap/ecap.
>> >>>>>> vtd_check_hdev() update iommu->cap/ecap based on host
>cap/ecap.
>> >>>>>> iommu->cap_frozen set when machine create done, iommu-
>>cap/ecap
>> >>>>> become readonly.
>> >>>>>> Implementation details for different backends will be in following
>> >>> patches.
>> >>>>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> >>>>>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> >>>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> >>>>>> ---
>> >>>>>>  include/hw/i386/intel_iommu.h |  1 +
>> >>>>>>  hw/i386/intel_iommu.c         | 50
>> >>>>> ++++++++++++++++++++++++++++++++++-
>> >>>>>>  2 files changed, 50 insertions(+), 1 deletion(-)
>> >>>>>>
>> >>>>>> diff --git a/include/hw/i386/intel_iommu.h
>> >>>>> b/include/hw/i386/intel_iommu.h
>> >>>>>> index bbc7b96add..c71a133820 100644
>> >>>>>> --- a/include/hw/i386/intel_iommu.h
>> >>>>>> +++ b/include/hw/i386/intel_iommu.h
>> >>>>>> @@ -283,6 +283,7 @@ struct IntelIOMMUState {
>> >>>>>>
>> >>>>>>      uint64_t cap;                   /* The value of capability reg */
>> >>>>>>      uint64_t ecap;                  /* The value of extended capability reg
>*/
>> >>>>>> +    bool cap_frozen;                /* cap/ecap become read-only after
>> >>> frozen */
>> >>>>>>      uint32_t context_cache_gen;     /* Should be in [1,MAX] */
>> >>>>>>      GHashTable *iotlb;              /* IOTLB */
>> >>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> >>>>>> index ffa1ad6429..a9f9dfd6a7 100644
>> >>>>>> --- a/hw/i386/intel_iommu.c
>> >>>>>> +++ b/hw/i386/intel_iommu.c
>> >>>>>> @@ -35,6 +35,8 @@
>> >>>>>>  #include "sysemu/kvm.h"
>> >>>>>>  #include "sysemu/dma.h"
>> >>>>>>  #include "sysemu/sysemu.h"
>> >>>>>> +#include "hw/vfio/vfio-common.h"
>> >>>>>> +#include "sysemu/iommufd.h"
>> >>>>>>  #include "hw/i386/apic_internal.h"
>> >>>>>>  #include "kvm/kvm_i386.h"
>> >>>>>>  #include "migration/vmstate.h"
>> >>>>>> @@ -3819,6 +3821,38 @@ VTDAddressSpace
>> >>>>> *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
>> >>>>>>      return vtd_dev_as;
>> >>>>>>  }
>> >>>>>>
>> >>>>>> +static int vtd_check_legacy_hdev(IntelIOMMUState *s,
>> >>>>>> +                                 IOMMULegacyDevice *ldev,
>> >>>>>> +                                 Error **errp)
>> >>>>>> +{
>> >>>>>> +    return 0;
>> >>>>>> +}
>> >>>>>> +
>> >>>>>> +static int vtd_check_iommufd_hdev(IntelIOMMUState *s,
>> >>>>>> +                                  IOMMUFDDevice *idev,
>> >>>>>> +                                  Error **errp)
>> >>>>>> +{
>> >>>>>> +    return 0;
>> >>>>>> +}
>> >>>>>> +
>> >>>>>> +static int vtd_check_hdev(IntelIOMMUState *s,
>> >>> VTDHostIOMMUDevice
>> >>>>> *vtd_hdev,
>> >>>>>> +                          Error **errp)
>> >>>>>> +{
>> >>>>>> +    HostIOMMUDevice *base_dev = vtd_hdev->dev;
>> >>>>>> +    IOMMUFDDevice *idev;
>> >>>>>> +
>> >>>>>> +    if (base_dev->type == HID_LEGACY) {
>> >>>>>> +        IOMMULegacyDevice *ldev = container_of(base_dev,
>> >>>>>> +                                               IOMMULegacyDevice, base);
>> >>>>>> +
>> >>>>>> +        return vtd_check_legacy_hdev(s, ldev, errp);
>> >>>>>> +    }
>> >>>>>> +
>> >>>>>> +    idev = container_of(base_dev, IOMMUFDDevice, base);
>> >>>>>> +
>> >>>>>> +    return vtd_check_iommufd_hdev(s, idev, errp);
>> >>>>>> +}
>> >>>>>> +
>> >>>>>>  static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque,
>int
>> >>>>> devfn,
>> >>>>>>                                      HostIOMMUDevice *base_dev, Error **errp)
>> >>>>>>  {
>> >>>>>> @@ -3829,6 +3863,7 @@ static int
>> >>> vtd_dev_set_iommu_device(PCIBus
>> >>>>> *bus, void *opaque, int devfn,
>> >>>>>>          .devfn = devfn,
>> >>>>>>      };
>> >>>>>>      struct vtd_as_key *new_key;
>> >>>>>> +    int ret;
>> >>>>>>
>> >>>>>>      assert(base_dev);
>> >>>>>>
>> >>>>>> @@ -3848,6 +3883,13 @@ static int
>> >>> vtd_dev_set_iommu_device(PCIBus
>> >>>>> *bus, void *opaque, int devfn,
>> >>>>>>      vtd_hdev->iommu_state = s;
>> >>>>>>      vtd_hdev->dev = base_dev;
>> >>>>>>
>> >>>>>> +    ret = vtd_check_hdev(s, vtd_hdev, errp);
>> >>>>>> +    if (ret) {
>> >>>>>> +        g_free(vtd_hdev);
>> >>>>>> +        vtd_iommu_unlock(s);
>> >>>>>> +        return ret;
>> >>>>>> +    }
>> >>>>>> +
>> >>>>>>      new_key = g_malloc(sizeof(*new_key));
>> >>>>>>      new_key->bus = bus;
>> >>>>>>      new_key->devfn = devfn;
>> >>>>>
>> >>>>> Okay. So when VFIO device is created, it will call
>> >>> vtd_dev_set_iommu_device
>> >>>>> and that in turn will update caps.
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>> @@ -4083,7 +4125,9 @@ static void vtd_init(IntelIOMMUState
>*s)
>> >>>>>>      s->iq_dw = false;
>> >>>>>>      s->next_frcd_reg = 0;
>> >>>>>>
>> >>>>>> -    vtd_cap_init(s);
>> >>>>>> +    if (!s->cap_frozen) {
>> >>>>>> +        vtd_cap_init(s);
>> >>>>>> +    }
>> >>>>>>
>> >>>>> If it's fronzen it's because VFIO was added after machine done.
>> >>>>> And then what? I think caps are just wrong?
>> >>>> Not quite get your question on caps being wrong. But try to explains:
>> >>>>
>> >>>> When a hot plugged vfio device's host iommu cap isn't compatible
>with
>> >>>> vIOMMU's, hotplug should fail. Currently there is no check for this
>and
>> >>>> allow hotplug to succeed, but then some issue will reveal later,
>> >>>> e.g., vIOMMU's MGAW > host IOMMU's MGAW, guest can setup
>iova
>> >>>> mapping beyond host supported iova range, then DMA will fail.
>> >>>>
>> >>>> In fact, before this series, cap is not impacted by VFIO, so it's same
>effect of
>> >>>> frozen after machine done.
>> >>>>
>> >>>>>
>> >>>>> I think the way to approach this is just by specifying this
>> >>>>> as an option on command line.
>> >>>> Do you mean add a cap_frozen property to intel_iommu?
>> >>>> Vtd_init() is called in realize() and system reset(), so I utilize realize()
>to init
>> >>> cap
>> >>>> and froze cap before system reset(). If cap_frozen is an option, when
>it's
>> >>> set to
>> >>>> false, cap could be updated every system reset and it's not a fix value
>any
>> >>> more.
>> >>>> This may break migration.
>> >>> No, I mean either
>> >>> 1. add some kind of vfio-iommu device that is not exposed to guest
>> >>>   but is not hot pluggable
>> >> Not quite get, what will such vfio-iommu device be used for if not
>exposed to guest.
>> > It will update the IOMMU.
>> > And do so without need for tricky callbacks.
>> >
>>
>> At the moment the only way VFIO can pass info to vIOMMU is through the
>> IOMMU MR API (IOMMUMemoryRegionClass).
>> Unfortunately this API is not fully adapted to new use cases because it
>> relies on the IOMMU MR to be enabled which causes the VFIO
>>
>> vfio_listener_region_add() to be called and call the relevant IOMMU MR
>callback. Before the IOMMU MR enablement there is no way to convey info
>from VFIO to the vIOMMU. That why, for several years and since the
>beginning of discussions related to nested IOMMU, Peter Xu, Yi Liu, myself
>headed towards the usage of PCIIOMMUOps instead.
>>
>>
>> You will find in
>> [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for
>hotplugged devices
>> https://lore.kernel.org/all/20240117080414.316890-1-
>eric.auger@redhat.com/
>> yet another example of such kind of PCIIOMMUOps. In that series instead
>of relying on a HostIOMMUDevice object as proposed by Zhenzhong it uses a
>direct callback but it is still the same principle and the HostIOMMUDevice
>looks better to accomodate both iommufd and VFIO backend.
>>
>>
>> host reserved memory regions is not something we can easiliy describe at
>vIOMMU level. The info are fetched from host and propagated to the
>vIOMMU
>>
>> I think nested stage setup would also need this PCIIOMMUOps trick to
>setup stage 1 info. So the introduction of the so called HostIOMMUDevice
>object has a broader scope than a gaw option or ecap option set I think.
>>
>> If you don't like this approach, please can you elaborate on the "vfio-
>iommu device" proposal above so that we can explore it.
>>
>> Thank you in advance
>>
>> Eric
>
>
>Hi Eric,
>Sorry about the delay in answering - was on vacation.
>
>My concern is with user interface not with the internal API used.
>The concern is simple - assymetry and complex rules in handling devices.
>E.g. a non hotplugged vfio device adjusts the vIOMMU, then you can
>remove it and hotplug another one and it works, but if you start
>without vfio then hotplug then it might fail because it's too late
>to adjust the vIOMMU.

Just clarify the adjustment from vfio is only zeroing the cap/ecap's 1 bits,
never set an original 0 bits. So if hotplug fails, then coldplug should
also fail for the same device.

>
>And what I am saying, is that we either want something on the qemu
>command line
>that will tell the vIOMMU "please use info from the host" or
>have management take info from the host and supply that to the vIOMMU.
>The former seems more user friendly, for sure. The disadvantage of it
>is that one can't e.g. take the least common denominator between two
>hosts and make things migrateable in this way.
>If we limit our ambition to VFIO that might be acceptable, after all
>VFIO already assumes very similar hardware on src and dst of migration.
>The later is what we did for aw-bits.

We can get host IOMMU's hw cap/ecap by reading below file:

/sys/devices/virtual/iommu/dmar[*]/intel-iommu/[cap|ecap]

But it's not accurate as kernel command line can limit the support from
software aspect, e.g., intel_iommu=off/sm_off.

So I'd like to follow up the 2nd way, add two options which defaults 0,

+    DEFINE_PROP_UINT64("host_cap", IntelIOMMUState, host_cap, 0),
+    DEFINE_PROP_UINT64("host_ecap", IntelIOMMUState, host_ecap, 0),

When any is set by management, it overrides vIOMMU default value,
Or else use default.

With cap/ecap set by management, we still need to have the cap/ecap
check logic between host and vIOMMU, but no update, no frozen.

The suggestion for management is to set a value based on
/sys/devices/virtual/iommu/dmar[*]/intel-iommu/[cap|ecap]
But do not assign a random value.

Let me know if I misunderstand anything😊

>
>I see lots of acceptable options, but yes it seems nice to have
>something single that will work for all IOMMUs and also maybe live under
>VFIO? Thus I came up with the idea of a stub device living under vfio
>who's job is just to be present on command line and adjust the guest
>machine (ATM the viommu but we might see other uses too) to match host.

IIUC, kernel doesn’t provide any ioctl() for a stub device to get
the least common denominator of cap/ecap of all the host IOMMU units.
So I'd like to try your suggestion as above to add options.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2024-03-29  3:23 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-28  9:44 [PATCH v1 0/6] Check and sync host IOMMU cap/ecap with vIOMMU Zhenzhong Duan
2024-02-28  9:44 ` [PATCH v1 1/6] intel_iommu: Add set/unset_iommu_device callback Zhenzhong Duan
2024-02-28  9:44 ` [PATCH v1 2/6] intel_iommu: Extract out vtd_cap_init to initialize cap/ecap Zhenzhong Duan
2024-02-28  9:44 ` [PATCH v1 3/6] intel_iommu: Add a framework to check and sync host IOMMU cap/ecap Zhenzhong Duan
2024-03-12 17:03   ` Michael S. Tsirkin
2024-03-13  2:52     ` Duan, Zhenzhong
2024-03-13  7:07       ` Michael S. Tsirkin
2024-03-13  7:54         ` Duan, Zhenzhong
2024-03-13 11:17           ` Michael S. Tsirkin
2024-03-14  4:05             ` Duan, Zhenzhong
2024-03-18 13:20             ` Eric Auger
2024-03-28  7:20               ` Michael S. Tsirkin
2024-03-29  3:22                 ` Duan, Zhenzhong
2024-02-28  9:44 ` [PATCH v1 4/6] intel_iommu: Implement check and sync mechanism in iommufd mode Zhenzhong Duan
2024-02-28  9:44 ` [PATCH v1 5/6] intel_iommu: Use mgaw instead of s->aw_bits Zhenzhong Duan
2024-02-28  9:44 ` [PATCH v1 6/6] intel_iommu: Block migration if cap is updated Zhenzhong Duan
2024-03-04  6:35   ` Prasad Pandit
2024-03-04  8:11     ` Duan, Zhenzhong
2024-03-04  9:43       ` Prasad Pandit
2024-03-04 10:10         ` Duan, Zhenzhong
2024-03-04  4:17 ` [PATCH v1 0/6] Check and sync host IOMMU cap/ecap with vIOMMU Jason Wang
2024-03-04  6:13   ` Duan, Zhenzhong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.