* [PATCH V2 0/4] PASID support for Intel IOMMU
@ 2022-03-21  5:54 Jason Wang
  2022-03-21  5:54 ` [PATCH V2 1/4] intel-iommu: don't warn guest errors when getting rid2pasid entry Jason Wang
                   ` (3 more replies)
  0 siblings, 4 replies; 43+ messages in thread
From: Jason Wang @ 2022-03-21  5:54 UTC (permalink / raw)
  To: mst, peterx; +Cc: Jason Wang, yi.l.liu, yi.y.sun, qemu-devel

Hi All:

This series tries to introduce PASID support for the Intel IOMMU. The
work is based on the previous scalable mode support and implements
ECAP_PASID. A new "x-pasid-mode" option is introduced to enable this
mode. All internal vIOMMU code has been extended to support PASID
instead of the current RID2PASID method. The code is also capable of
provisioning address spaces with PASID. Note that no device can issue
PASID DMA right now; this needs future work.

This will be used for prototyping PASID based devices like virtio or
the future vPASID support for the Intel IOMMU.

Tests have been done with a Linux guest with scalable mode enabled
and disabled. A virtio prototype[1][2] that can issue PASID based DMA
requests was also tested; different PASIDs were used for TX and RX in
the test drivers.

This series depends on the intel-iommu fixes [3][4].

Changes since V1:

- speed up IOMMU translation when RID2PASID is not used
- remove the unnecessary L1 PASID invalidation descriptor support
- add support for catching translations to the interrupt range in the
  case of PT and scalable mode
- refine the comments to explain the hash algorithm used in IOTLB
  lookups

Please review.

[1] https://github.com/jasowang/qemu.git virtio-pasid
[2] https://github.com/jasowang/linux.git virtio-pasid
[3] https://lists.gnu.org/archive/html/qemu-devel/2022-02/msg02173.html
[4] https://lists.gnu.org/archive/html/qemu-devel/2022-03/msg04441.html

Jason Wang (4):
  intel-iommu: don't warn guest errors when getting rid2pasid entry
  intel-iommu: drop VTDBus
  intel-iommu: convert VTD_PE_GET_FPD_ERR() to be a function
  intel-iommu: PASID support

 hw/i386/intel_iommu.c          | 641 ++++++++++++++++++++++-----------
 hw/i386/intel_iommu_internal.h |  14 +-
 hw/i386/trace-events           |   2 +
 include/hw/i386/intel_iommu.h  |  18 +-
 include/hw/pci/pci_bus.h       |   2 +
 5 files changed, 450 insertions(+), 227 deletions(-)

-- 
2.25.1



^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH V2 1/4] intel-iommu: don't warn guest errors when getting rid2pasid entry
  2022-03-21  5:54 [PATCH V2 0/4] PASID support for Intel IOMMU Jason Wang
@ 2022-03-21  5:54 ` Jason Wang
  2022-03-24  8:21   ` Tian, Kevin
  2022-03-21  5:54 ` [PATCH V2 2/4] intel-iommu: drop VTDBus Jason Wang
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 43+ messages in thread
From: Jason Wang @ 2022-03-21  5:54 UTC (permalink / raw)
  To: mst, peterx; +Cc: Jason Wang, yi.l.liu, yi.y.sun, qemu-devel

We used to warn on a wrong rid2pasid entry. But this error can be
triggered by the guest and can happen during initialization. So
let's not warn in this case.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 hw/i386/intel_iommu.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 874d01c162..90964b201c 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1554,8 +1554,10 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce)
     if (s->root_scalable) {
         ret = vtd_ce_get_rid2pasid_entry(s, ce, &pe);
         if (ret) {
-            error_report_once("%s: vtd_ce_get_rid2pasid_entry error: %"PRId32,
-                              __func__, ret);
+            /*
+             * This error is guest triggerable. We should assume PT
+             * is not enabled for safety.
+             */
             return false;
         }
         return (VTD_PE_GET_TYPE(&pe) == VTD_SM_PASID_ENTRY_PT);
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH V2 2/4] intel-iommu: drop VTDBus
  2022-03-21  5:54 [PATCH V2 0/4] PASID support for Intel IOMMU Jason Wang
  2022-03-21  5:54 ` [PATCH V2 1/4] intel-iommu: don't warn guest errors when getting rid2pasid entry Jason Wang
@ 2022-03-21  5:54 ` Jason Wang
  2022-04-22  1:17   ` Peter Xu
  2022-03-21  5:54 ` [PATCH V2 3/4] intel-iommu: convert VTD_PE_GET_FPD_ERR() to be a function Jason Wang
  2022-03-21  5:54 ` [PATCH V2 4/4] intel-iommu: PASID support Jason Wang
  3 siblings, 1 reply; 43+ messages in thread
From: Jason Wang @ 2022-03-21  5:54 UTC (permalink / raw)
  To: mst, peterx; +Cc: Jason Wang, yi.l.liu, yi.y.sun, qemu-devel

We introduced the VTDBus structure as an intermediate step for
searching the address space. This works well with SID based
matching/lookup, but when we want to support SID plus PASID based
address space lookup, this intermediate step turns out to be a burden.
So this patch simply drops the VTDBus structure and uses PCIBus and
devfn as the key for the g_hash_table(). This simplifies the code and
the future PASID extension.

To avoid slowing down the existing vtd_find_as_from_bus_num() callers,
a vtd_as cache indexed by bus number is introduced to store the most
recent search result for a vtd_as belonging to a specific bus.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 hw/i386/intel_iommu.c         | 238 +++++++++++++++++-----------------
 include/hw/i386/intel_iommu.h |  11 +-
 2 files changed, 123 insertions(+), 126 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 90964b201c..5851a17d0e 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -61,6 +61,16 @@
     }                                                                         \
 }
 
+/*
+ * PCI bus number (or SID) is not reliable since the device is usually
+ * initialized before the guest can configure the PCI bridge
+ * (SECONDARY_BUS_NUMBER).
+ */
+struct vtd_as_key {
+    PCIBus *bus;
+    uint8_t devfn;
+};
+
 static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
 
@@ -210,6 +220,31 @@ static guint vtd_uint64_hash(gconstpointer v)
     return (guint)*(const uint64_t *)v;
 }
 
+static gboolean vtd_as_equal(gconstpointer v1, gconstpointer v2)
+{
+    const struct vtd_as_key *key1 = v1;
+    const struct vtd_as_key *key2 = v2;
+
+    return (key1->bus == key2->bus) && (key1->devfn == key2->devfn);
+}
+
+static inline uint16_t vtd_make_source_id(uint8_t bus_num, uint8_t devfn)
+{
+    return ((bus_num & 0xffUL) << 8) | (devfn & 0xffUL);
+}
+
+/*
+ * Note that we use pointer to PCIBus as the key, so hashing/shifting
+ * based on the pointer value is intended.
+ */
+static guint vtd_as_hash(gconstpointer v)
+{
+    const struct vtd_as_key *key = v;
+    guint value = (guint)(uintptr_t)key->bus;
+
+    return (guint)(value << 8 | key->devfn);
+}
+
 static gboolean vtd_hash_remove_by_domain(gpointer key, gpointer value,
                                           gpointer user_data)
 {
@@ -248,22 +283,14 @@ static gboolean vtd_hash_remove_by_page(gpointer key, gpointer value,
 static void vtd_reset_context_cache_locked(IntelIOMMUState *s)
 {
     VTDAddressSpace *vtd_as;
-    VTDBus *vtd_bus;
-    GHashTableIter bus_it;
-    uint32_t devfn_it;
+    GHashTableIter as_it;
 
     trace_vtd_context_cache_reset();
 
-    g_hash_table_iter_init(&bus_it, s->vtd_as_by_busptr);
+    g_hash_table_iter_init(&as_it, s->vtd_as);
 
-    while (g_hash_table_iter_next (&bus_it, NULL, (void**)&vtd_bus)) {
-        for (devfn_it = 0; devfn_it < PCI_DEVFN_MAX; ++devfn_it) {
-            vtd_as = vtd_bus->dev_as[devfn_it];
-            if (!vtd_as) {
-                continue;
-            }
-            vtd_as->context_cache_entry.context_cache_gen = 0;
-        }
+    while (g_hash_table_iter_next (&as_it, NULL, (void**)&vtd_as)) {
+        vtd_as->context_cache_entry.context_cache_gen = 0;
     }
     s->context_cache_gen = 1;
 }
@@ -993,32 +1020,6 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
     return slpte & rsvd_mask;
 }
 
-/* Find the VTD address space associated with a given bus number */
-static VTDBus *vtd_find_as_from_bus_num(IntelIOMMUState *s, uint8_t bus_num)
-{
-    VTDBus *vtd_bus = s->vtd_as_by_bus_num[bus_num];
-    GHashTableIter iter;
-
-    if (vtd_bus) {
-        return vtd_bus;
-    }
-
-    /*
-     * Iterate over the registered buses to find the one which
-     * currently holds this bus number and update the bus_num
-     * lookup table.
-     */
-    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
-    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
-        if (pci_bus_num(vtd_bus->bus) == bus_num) {
-            s->vtd_as_by_bus_num[bus_num] = vtd_bus;
-            return vtd_bus;
-        }
-    }
-
-    return NULL;
-}
-
 /* Given the @iova, get relevant @slptep. @slpte_level will be the last level
  * of the translation, can be used for deciding the size of large page.
  */
@@ -1634,24 +1635,13 @@ static bool vtd_switch_address_space(VTDAddressSpace *as)
 
 static void vtd_switch_address_space_all(IntelIOMMUState *s)
 {
+    VTDAddressSpace *vtd_as;
     GHashTableIter iter;
-    VTDBus *vtd_bus;
-    int i;
-
-    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
-    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
-        for (i = 0; i < PCI_DEVFN_MAX; i++) {
-            if (!vtd_bus->dev_as[i]) {
-                continue;
-            }
-            vtd_switch_address_space(vtd_bus->dev_as[i]);
-        }
-    }
-}
 
-static inline uint16_t vtd_make_source_id(uint8_t bus_num, uint8_t devfn)
-{
-    return ((bus_num & 0xffUL) << 8) | (devfn & 0xffUL);
+    g_hash_table_iter_init(&iter, s->vtd_as);
+    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_as)) {
+        vtd_switch_address_space(vtd_as);
+    }
 }
 
 static const bool vtd_qualified_faults[] = {
@@ -1688,18 +1678,39 @@ static inline bool vtd_is_interrupt_addr(hwaddr addr)
     return VTD_INTERRUPT_ADDR_FIRST <= addr && addr <= VTD_INTERRUPT_ADDR_LAST;
 }
 
+static gboolean vtd_find_as_by_sid(gpointer key, gpointer value,
+                                   gpointer user_data)
+{
+    struct vtd_as_key *as_key = (struct vtd_as_key *)key;
+    uint16_t target_sid = *(uint16_t *)user_data;
+    uint16_t sid = vtd_make_source_id(pci_bus_num(as_key->bus),
+                                      as_key->devfn);
+    return sid == target_sid;
+}
+
+static VTDAddressSpace *vtd_get_as_by_sid(IntelIOMMUState *s, uint16_t sid)
+{
+    uint8_t bus_num = sid >> 8;
+    VTDAddressSpace *vtd_as = s->vtd_as_cache[bus_num];
+
+    if (vtd_as &&
+        (sid == vtd_make_source_id(pci_bus_num(vtd_as->bus),
+                                   vtd_as->devfn))) {
+        return vtd_as;
+    }
+
+    vtd_as = g_hash_table_find(s->vtd_as, vtd_find_as_by_sid, &sid);
+    s->vtd_as_cache[bus_num] = vtd_as;
+
+    return vtd_as;
+}
+
 static void vtd_pt_enable_fast_path(IntelIOMMUState *s, uint16_t source_id)
 {
-    VTDBus *vtd_bus;
     VTDAddressSpace *vtd_as;
     bool success = false;
 
-    vtd_bus = vtd_find_as_from_bus_num(s, VTD_SID_TO_BUS(source_id));
-    if (!vtd_bus) {
-        goto out;
-    }
-
-    vtd_as = vtd_bus->dev_as[VTD_SID_TO_DEVFN(source_id)];
+    vtd_as = vtd_get_as_by_sid(s, source_id);
     if (!vtd_as) {
         goto out;
     }
@@ -1907,11 +1918,10 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
                                           uint16_t source_id,
                                           uint16_t func_mask)
 {
+    GHashTableIter as_it;
     uint16_t mask;
-    VTDBus *vtd_bus;
     VTDAddressSpace *vtd_as;
     uint8_t bus_n, devfn;
-    uint16_t devfn_it;
 
     trace_vtd_inv_desc_cc_devices(source_id, func_mask);
 
@@ -1934,32 +1944,31 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
     mask = ~mask;
 
     bus_n = VTD_SID_TO_BUS(source_id);
-    vtd_bus = vtd_find_as_from_bus_num(s, bus_n);
-    if (vtd_bus) {
-        devfn = VTD_SID_TO_DEVFN(source_id);
-        for (devfn_it = 0; devfn_it < PCI_DEVFN_MAX; ++devfn_it) {
-            vtd_as = vtd_bus->dev_as[devfn_it];
-            if (vtd_as && ((devfn_it & mask) == (devfn & mask))) {
-                trace_vtd_inv_desc_cc_device(bus_n, VTD_PCI_SLOT(devfn_it),
-                                             VTD_PCI_FUNC(devfn_it));
-                vtd_iommu_lock(s);
-                vtd_as->context_cache_entry.context_cache_gen = 0;
-                vtd_iommu_unlock(s);
-                /*
-                 * Do switch address space when needed, in case if the
-                 * device passthrough bit is switched.
-                 */
-                vtd_switch_address_space(vtd_as);
-                /*
-                 * So a device is moving out of (or moving into) a
-                 * domain, resync the shadow page table.
-                 * This won't bring bad even if we have no such
-                 * notifier registered - the IOMMU notification
-                 * framework will skip MAP notifications if that
-                 * happened.
-                 */
-                vtd_sync_shadow_page_table(vtd_as);
-            }
+    devfn = VTD_SID_TO_DEVFN(source_id);
+
+    g_hash_table_iter_init(&as_it, s->vtd_as);
+    while (g_hash_table_iter_next(&as_it, NULL, (void**)&vtd_as)) {
+        if ((pci_bus_num(vtd_as->bus) == bus_n) &&
+            (vtd_as->devfn & mask) == (devfn & mask)) {
+            trace_vtd_inv_desc_cc_device(bus_n, VTD_PCI_SLOT(vtd_as->devfn),
+                                         VTD_PCI_FUNC(vtd_as->devfn));
+            vtd_iommu_lock(s);
+            vtd_as->context_cache_entry.context_cache_gen = 0;
+            vtd_iommu_unlock(s);
+            /*
+             * Do switch address space when needed, in case if the
+             * device passthrough bit is switched.
+             */
+            vtd_switch_address_space(vtd_as);
+            /*
+             * So a device is moving out of (or moving into) a
+             * domain, resync the shadow page table.
+             * This won't bring bad even if we have no such
+             * notifier registered - the IOMMU notification
+             * framework will skip MAP notifications if that
+             * happened.
+             */
+            vtd_sync_shadow_page_table(vtd_as);
         }
     }
 }
@@ -2473,18 +2482,13 @@ static bool vtd_process_device_iotlb_desc(IntelIOMMUState *s,
 {
     VTDAddressSpace *vtd_dev_as;
     IOMMUTLBEvent event;
-    struct VTDBus *vtd_bus;
     hwaddr addr;
     uint64_t sz;
     uint16_t sid;
-    uint8_t devfn;
     bool size;
-    uint8_t bus_num;
 
     addr = VTD_INV_DESC_DEVICE_IOTLB_ADDR(inv_desc->hi);
     sid = VTD_INV_DESC_DEVICE_IOTLB_SID(inv_desc->lo);
-    devfn = sid & 0xff;
-    bus_num = sid >> 8;
     size = VTD_INV_DESC_DEVICE_IOTLB_SIZE(inv_desc->hi);
 
     if ((inv_desc->lo & VTD_INV_DESC_DEVICE_IOTLB_RSVD_LO) ||
@@ -2495,12 +2499,11 @@ static bool vtd_process_device_iotlb_desc(IntelIOMMUState *s,
         return false;
     }
 
-    vtd_bus = vtd_find_as_from_bus_num(s, bus_num);
-    if (!vtd_bus) {
-        goto done;
-    }
-
-    vtd_dev_as = vtd_bus->dev_as[devfn];
+    /*
+     * Using sid is OK since the guest should have finished the
+     * initialization of both the bus and device.
+     */
+    vtd_dev_as = vtd_get_as_by_sid(s, sid);
     if (!vtd_dev_as) {
         goto done;
     }
@@ -3426,27 +3429,27 @@ static const MemoryRegionOps vtd_mem_ir_ops = {
 
 VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
 {
-    uintptr_t key = (uintptr_t)bus;
-    VTDBus *vtd_bus = g_hash_table_lookup(s->vtd_as_by_busptr, &key);
+    /*
+     * We can't simply use sid here since the bus number might not be
+     * initialized by the guest.
+     */
+    struct vtd_as_key key = {
+        .bus = bus,
+        .devfn = devfn,
+    };
     VTDAddressSpace *vtd_dev_as;
     char name[128];
 
-    if (!vtd_bus) {
-        uintptr_t *new_key = g_malloc(sizeof(*new_key));
-        *new_key = (uintptr_t)bus;
-        /* No corresponding free() */
-        vtd_bus = g_malloc0(sizeof(VTDBus) + sizeof(VTDAddressSpace *) * \
-                            PCI_DEVFN_MAX);
-        vtd_bus->bus = bus;
-        g_hash_table_insert(s->vtd_as_by_busptr, new_key, vtd_bus);
-    }
+    vtd_dev_as = g_hash_table_lookup(s->vtd_as, &key);
+    if (!vtd_dev_as) {
+        struct vtd_as_key *new_key = g_malloc(sizeof(*new_key));
 
-    vtd_dev_as = vtd_bus->dev_as[devfn];
+        new_key->bus = bus;
+        new_key->devfn = devfn;
 
-    if (!vtd_dev_as) {
         snprintf(name, sizeof(name), "vtd-%02x.%x", PCI_SLOT(devfn),
                  PCI_FUNC(devfn));
-        vtd_bus->dev_as[devfn] = vtd_dev_as = g_malloc0(sizeof(VTDAddressSpace));
+        vtd_dev_as = g_malloc0(sizeof(VTDAddressSpace));
 
         vtd_dev_as->bus = bus;
         vtd_dev_as->devfn = (uint8_t)devfn;
@@ -3502,6 +3505,8 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
                                             &vtd_dev_as->nodmar, 0);
 
         vtd_switch_address_space(vtd_dev_as);
+
+        g_hash_table_insert(s->vtd_as, new_key, vtd_dev_as);
     }
     return vtd_dev_as;
 }
@@ -3875,7 +3880,6 @@ static void vtd_realize(DeviceState *dev, Error **errp)
 
     QLIST_INIT(&s->vtd_as_with_notifiers);
     qemu_mutex_init(&s->iommu_lock);
-    memset(s->vtd_as_by_bus_num, 0, sizeof(s->vtd_as_by_bus_num));
     memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
                           "intel_iommu", DMAR_REG_SIZE);
 
@@ -3897,8 +3901,8 @@ static void vtd_realize(DeviceState *dev, Error **errp)
     /* No corresponding destroy */
     s->iotlb = g_hash_table_new_full(vtd_uint64_hash, vtd_uint64_equal,
                                      g_free, g_free);
-    s->vtd_as_by_busptr = g_hash_table_new_full(vtd_uint64_hash, vtd_uint64_equal,
-                                              g_free, g_free);
+    s->vtd_as = g_hash_table_new_full(vtd_as_hash, vtd_as_equal,
+                                      g_free, g_free);
     vtd_init(s);
     sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, Q35_HOST_BRIDGE_IOMMU_ADDR);
     pci_setup_iommu(bus, vtd_host_dma_iommu, dev);
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 3b5ac869db..fa1bed353c 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -58,7 +58,6 @@ typedef struct VTDContextEntry VTDContextEntry;
 typedef struct VTDContextCacheEntry VTDContextCacheEntry;
 typedef struct VTDAddressSpace VTDAddressSpace;
 typedef struct VTDIOTLBEntry VTDIOTLBEntry;
-typedef struct VTDBus VTDBus;
 typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
 typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
 typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
@@ -111,12 +110,6 @@ struct VTDAddressSpace {
     IOVATree *iova_tree;          /* Traces mapped IOVA ranges */
 };
 
-struct VTDBus {
-    PCIBus* bus;		/* A reference to the bus to provide translation for */
-    /* A table of VTDAddressSpace objects indexed by devfn */
-    VTDAddressSpace *dev_as[];
-};
-
 struct VTDIOTLBEntry {
     uint64_t gfn;
     uint16_t domain_id;
@@ -253,8 +246,8 @@ struct IntelIOMMUState {
     uint32_t context_cache_gen;     /* Should be in [1,MAX] */
     GHashTable *iotlb;              /* IOTLB */
 
-    GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
-    VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
+    GHashTable *vtd_as;             /* VTD address space indexed by source id */
+    VTDAddressSpace *vtd_as_cache[VTD_PCI_BUS_MAX]; /* VTD address space cache */
     /* list of registered notifiers */
     QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
 
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH V2 3/4] intel-iommu: convert VTD_PE_GET_FPD_ERR() to be a function
  2022-03-21  5:54 [PATCH V2 0/4] PASID support for Intel IOMMU Jason Wang
  2022-03-21  5:54 ` [PATCH V2 1/4] intel-iommu: don't warn guest errors when getting rid2pasid entry Jason Wang
  2022-03-21  5:54 ` [PATCH V2 2/4] intel-iommu: drop VTDBus Jason Wang
@ 2022-03-21  5:54 ` Jason Wang
  2022-03-24  8:26   ` Tian, Kevin
  2022-04-22 13:08   ` Peter Xu
  2022-03-21  5:54 ` [PATCH V2 4/4] intel-iommu: PASID support Jason Wang
  3 siblings, 2 replies; 43+ messages in thread
From: Jason Wang @ 2022-03-21  5:54 UTC (permalink / raw)
  To: mst, peterx; +Cc: Jason Wang, yi.l.liu, yi.y.sun, qemu-devel

We used to have the macro VTD_PE_GET_FPD_ERR(), but it has an
internal goto which prevents it from being reused. This patch converts
that macro to a dedicated function and lets the caller decide what to
do (e.g. using goto or not). This makes sure it can be re-used by
other functions that require fault reporting.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 hw/i386/intel_iommu.c | 42 ++++++++++++++++++++++++++++--------------
 1 file changed, 28 insertions(+), 14 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 5851a17d0e..82787f9850 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -49,17 +49,6 @@
 /* pe operations */
 #define VTD_PE_GET_TYPE(pe) ((pe)->val[0] & VTD_SM_PASID_ENTRY_PGTT)
 #define VTD_PE_GET_LEVEL(pe) (2 + (((pe)->val[0] >> 2) & VTD_SM_PASID_ENTRY_AW))
-#define VTD_PE_GET_FPD_ERR(ret_fr, is_fpd_set, s, source_id, addr, is_write) {\
-    if (ret_fr) {                                                             \
-        ret_fr = -ret_fr;                                                     \
-        if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {                   \
-            trace_vtd_fault_disabled();                                       \
-        } else {                                                              \
-            vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);      \
-        }                                                                     \
-        goto error;                                                           \
-    }                                                                         \
-}
 
 /*
  * PCI bus number (or SID) is not reliable since the device is usually
@@ -1724,6 +1713,19 @@ out:
     trace_vtd_pt_enable_fast_path(source_id, success);
 }
 
+static void vtd_qualify_report_fault(IntelIOMMUState *s,
+                                     int err, bool is_fpd_set,
+                                     uint16_t source_id,
+                                     hwaddr addr,
+                                     bool is_write)
+{
+    if (is_fpd_set && vtd_is_qualified_fault(err)) {
+        trace_vtd_fault_disabled();
+    } else {
+        vtd_report_dmar_fault(s, source_id, addr, err, is_write);
+    }
+}
+
 /* Map dev to context-entry then do a paging-structures walk to do a iommu
  * translation.
  *
@@ -1784,7 +1786,11 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
         if (!is_fpd_set && s->root_scalable) {
             ret_fr = vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set);
-            VTD_PE_GET_FPD_ERR(ret_fr, is_fpd_set, s, source_id, addr, is_write);
+            if (ret_fr) {
+                vtd_qualify_report_fault(s, -ret_fr, is_fpd_set,
+                                         source_id, addr, is_write);
+                goto error;
+            }
         }
     } else {
         ret_fr = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
@@ -1792,7 +1798,11 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         if (!ret_fr && !is_fpd_set && s->root_scalable) {
             ret_fr = vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set);
         }
-        VTD_PE_GET_FPD_ERR(ret_fr, is_fpd_set, s, source_id, addr, is_write);
+        if (ret_fr) {
+            vtd_qualify_report_fault(s, -ret_fr, is_fpd_set,
+                                     source_id, addr, is_write);
+            goto error;
+        }
         /* Update context-cache */
         trace_vtd_iotlb_cc_update(bus_num, devfn, ce.hi, ce.lo,
                                   cc_entry->context_cache_gen,
@@ -1828,7 +1838,11 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
 
     ret_fr = vtd_iova_to_slpte(s, &ce, addr, is_write, &slpte, &level,
                                &reads, &writes, s->aw_bits);
-    VTD_PE_GET_FPD_ERR(ret_fr, is_fpd_set, s, source_id, addr, is_write);
+    if (ret_fr) {
+        vtd_qualify_report_fault(s, -ret_fr, is_fpd_set, source_id,
+                                 addr, is_write);
+        goto error;
+    }
 
     page_mask = vtd_slpt_level_page_mask(level);
     access_flags = IOMMU_ACCESS_FLAG(reads, writes);
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH V2 4/4] intel-iommu: PASID support
  2022-03-21  5:54 [PATCH V2 0/4] PASID support for Intel IOMMU Jason Wang
                   ` (2 preceding siblings ...)
  2022-03-21  5:54 ` [PATCH V2 3/4] intel-iommu: convert VTD_PE_GET_FPD_ERR() to be a function Jason Wang
@ 2022-03-21  5:54 ` Jason Wang
  2022-03-24  8:53   ` Tian, Kevin
                     ` (2 more replies)
  3 siblings, 3 replies; 43+ messages in thread
From: Jason Wang @ 2022-03-21  5:54 UTC (permalink / raw)
  To: mst, peterx; +Cc: Jason Wang, yi.l.liu, yi.y.sun, qemu-devel

This patch introduces ECAP_PASID via "x-pasid-mode". Based on the
existing support for scalable mode, we need to implement the following
missing parts:

1) tag VTDAddressSpace with PASID and support IOMMU/DMA translation
   with PASID
2) tag IOTLB with PASID
3) PASID cache and its flush
4) Fault recording with PASID

For simplicity:

1) the PASID cache is not implemented, so the PASID cache flush is
simply a nop.
2) fault recording with PASID is not supported; the NFR is not changed.

None of the above is mandatory, and all of it could be implemented in
the future.

Note that although PASID based IOMMU translation is ready, no device
can issue PASID DMA right now. In this case, PCI_NO_PASID is used as
the PASID to identify an address space without PASID. vtd_find_add_as()
has been extended to provision address spaces with PASID, which could
be utilized by a future extension of the PCI core to allow device
models to use PASID based DMA translation.

This feature would be useful for:

1) prototyping PASID support for devices like virtio
2) future vPASID work
3) future PRS and vSVA work

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 hw/i386/intel_iommu.c          | 357 +++++++++++++++++++++++++--------
 hw/i386/intel_iommu_internal.h |  14 +-
 hw/i386/trace-events           |   2 +
 include/hw/i386/intel_iommu.h  |   7 +-
 include/hw/pci/pci_bus.h       |   2 +
 5 files changed, 296 insertions(+), 86 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 82787f9850..13447fda16 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -58,6 +58,14 @@
 struct vtd_as_key {
     PCIBus *bus;
     uint8_t devfn;
+    uint32_t pasid;
+};
+
+struct vtd_iotlb_key {
+    uint16_t sid;
+    uint32_t pasid;
+    uint64_t gfn;
+    uint32_t level;
 };
 
 static void vtd_address_space_refresh_all(IntelIOMMUState *s);
@@ -199,14 +207,24 @@ static inline gboolean vtd_as_has_map_notifier(VTDAddressSpace *as)
 }
 
 /* GHashTable functions */
-static gboolean vtd_uint64_equal(gconstpointer v1, gconstpointer v2)
+static gboolean vtd_iotlb_equal(gconstpointer v1, gconstpointer v2)
 {
-    return *((const uint64_t *)v1) == *((const uint64_t *)v2);
+    const struct vtd_iotlb_key *key1 = v1;
+    const struct vtd_iotlb_key *key2 = v2;
+
+    return key1->sid == key2->sid &&
+           key1->pasid == key2->pasid &&
+           key1->level == key2->level &&
+           key1->gfn == key2->gfn;
 }
 
-static guint vtd_uint64_hash(gconstpointer v)
+static guint vtd_iotlb_hash(gconstpointer v)
 {
-    return (guint)*(const uint64_t *)v;
+    const struct vtd_iotlb_key *key = v;
+
+    return key->gfn | ((key->sid) << VTD_IOTLB_SID_SHIFT) |
+           (key->level) << VTD_IOTLB_LVL_SHIFT |
+           (key->pasid) << VTD_IOTLB_PASID_SHIFT;
 }
 
 static gboolean vtd_as_equal(gconstpointer v1, gconstpointer v2)
@@ -214,7 +232,8 @@ static gboolean vtd_as_equal(gconstpointer v1, gconstpointer v2)
     const struct vtd_as_key *key1 = v1;
     const struct vtd_as_key *key2 = v2;
 
-    return (key1->bus == key2->bus) && (key1->devfn == key2->devfn);
+    return (key1->bus == key2->bus) && (key1->devfn == key2->devfn) &&
+           (key1->pasid == key2->pasid);
 }
 
 static inline uint16_t vtd_make_source_id(uint8_t bus_num, uint8_t devfn)
@@ -306,13 +325,6 @@ static void vtd_reset_caches(IntelIOMMUState *s)
     vtd_iommu_unlock(s);
 }
 
-static uint64_t vtd_get_iotlb_key(uint64_t gfn, uint16_t source_id,
-                                  uint32_t level)
-{
-    return gfn | ((uint64_t)(source_id) << VTD_IOTLB_SID_SHIFT) |
-           ((uint64_t)(level) << VTD_IOTLB_LVL_SHIFT);
-}
-
 static uint64_t vtd_get_iotlb_gfn(hwaddr addr, uint32_t level)
 {
     return (addr & vtd_slpt_level_page_mask(level)) >> VTD_PAGE_SHIFT_4K;
@@ -320,15 +332,17 @@ static uint64_t vtd_get_iotlb_gfn(hwaddr addr, uint32_t level)
 
 /* Must be called with IOMMU lock held */
 static VTDIOTLBEntry *vtd_lookup_iotlb(IntelIOMMUState *s, uint16_t source_id,
-                                       hwaddr addr)
+                                       hwaddr addr, uint32_t pasid)
 {
+    struct vtd_iotlb_key key;
     VTDIOTLBEntry *entry;
-    uint64_t key;
     int level;
 
     for (level = VTD_SL_PT_LEVEL; level < VTD_SL_PML4_LEVEL; level++) {
-        key = vtd_get_iotlb_key(vtd_get_iotlb_gfn(addr, level),
-                                source_id, level);
+        key.gfn = vtd_get_iotlb_gfn(addr, level);
+        key.level = level;
+        key.sid = source_id;
+        key.pasid = pasid;
         entry = g_hash_table_lookup(s->iotlb, &key);
         if (entry) {
             goto out;
@@ -342,10 +356,11 @@ out:
 /* Must be with IOMMU lock held */
 static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
                              uint16_t domain_id, hwaddr addr, uint64_t slpte,
-                             uint8_t access_flags, uint32_t level)
+                             uint8_t access_flags, uint32_t level,
+                             uint32_t pasid)
 {
     VTDIOTLBEntry *entry = g_malloc(sizeof(*entry));
-    uint64_t *key = g_malloc(sizeof(*key));
+    struct vtd_iotlb_key *key = g_malloc(sizeof(*key));
     uint64_t gfn = vtd_get_iotlb_gfn(addr, level);
 
     trace_vtd_iotlb_page_update(source_id, addr, slpte, domain_id);
@@ -359,7 +374,13 @@ static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
     entry->slpte = slpte;
     entry->access_flags = access_flags;
     entry->mask = vtd_slpt_level_page_mask(level);
-    *key = vtd_get_iotlb_key(gfn, source_id, level);
+    entry->pasid = pasid;
+
+    key->gfn = gfn;
+    key->sid = source_id;
+    key->level = level;
+    key->pasid = pasid;
+
     g_hash_table_replace(s->iotlb, key, entry);
 }
 
@@ -823,13 +844,15 @@ static int vtd_get_pe_from_pasid_table(IntelIOMMUState *s,
 
 static int vtd_ce_get_rid2pasid_entry(IntelIOMMUState *s,
                                       VTDContextEntry *ce,
-                                      VTDPASIDEntry *pe)
+                                      VTDPASIDEntry *pe,
+                                      uint32_t pasid)
 {
-    uint32_t pasid;
     dma_addr_t pasid_dir_base;
     int ret = 0;
 
-    pasid = VTD_CE_GET_RID2PASID(ce);
+    if (pasid == PCI_NO_PASID) {
+        pasid = VTD_CE_GET_RID2PASID(ce);
+    }
     pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(ce);
     ret = vtd_get_pe_from_pasid_table(s, pasid_dir_base, pasid, pe);
 
@@ -838,15 +861,17 @@ static int vtd_ce_get_rid2pasid_entry(IntelIOMMUState *s,
 
 static int vtd_ce_get_pasid_fpd(IntelIOMMUState *s,
                                 VTDContextEntry *ce,
-                                bool *pe_fpd_set)
+                                bool *pe_fpd_set,
+                                uint32_t pasid)
 {
     int ret;
-    uint32_t pasid;
     dma_addr_t pasid_dir_base;
     VTDPASIDDirEntry pdire;
     VTDPASIDEntry pe;
 
-    pasid = VTD_CE_GET_RID2PASID(ce);
+    if (pasid == PCI_NO_PASID) {
+        pasid = VTD_CE_GET_RID2PASID(ce);
+    }
     pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(ce);
 
     /*
@@ -892,12 +917,13 @@ static inline uint32_t vtd_ce_get_level(VTDContextEntry *ce)
 }
 
 static uint32_t vtd_get_iova_level(IntelIOMMUState *s,
-                                   VTDContextEntry *ce)
+                                   VTDContextEntry *ce,
+                                   uint32_t pasid)
 {
     VTDPASIDEntry pe;
 
     if (s->root_scalable) {
-        vtd_ce_get_rid2pasid_entry(s, ce, &pe);
+        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
         return VTD_PE_GET_LEVEL(&pe);
     }
 
@@ -910,12 +936,13 @@ static inline uint32_t vtd_ce_get_agaw(VTDContextEntry *ce)
 }
 
 static uint32_t vtd_get_iova_agaw(IntelIOMMUState *s,
-                                  VTDContextEntry *ce)
+                                  VTDContextEntry *ce,
+                                  uint32_t pasid)
 {
     VTDPASIDEntry pe;
 
     if (s->root_scalable) {
-        vtd_ce_get_rid2pasid_entry(s, ce, &pe);
+        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
         return 30 + ((pe.val[0] >> 2) & VTD_SM_PASID_ENTRY_AW) * 9;
     }
 
@@ -957,31 +984,33 @@ static inline bool vtd_ce_type_check(X86IOMMUState *x86_iommu,
 }
 
 static inline uint64_t vtd_iova_limit(IntelIOMMUState *s,
-                                      VTDContextEntry *ce, uint8_t aw)
+                                      VTDContextEntry *ce, uint8_t aw,
+                                      uint32_t pasid)
 {
-    uint32_t ce_agaw = vtd_get_iova_agaw(s, ce);
+    uint32_t ce_agaw = vtd_get_iova_agaw(s, ce, pasid);
     return 1ULL << MIN(ce_agaw, aw);
 }
 
 /* Return true if IOVA passes range check, otherwise false. */
 static inline bool vtd_iova_range_check(IntelIOMMUState *s,
                                         uint64_t iova, VTDContextEntry *ce,
-                                        uint8_t aw)
+                                        uint8_t aw, uint32_t pasid)
 {
     /*
      * Check if @iova is above 2^X-1, where X is the minimum of MGAW
      * in CAP_REG and AW in context-entry.
      */
-    return !(iova & ~(vtd_iova_limit(s, ce, aw) - 1));
+    return !(iova & ~(vtd_iova_limit(s, ce, aw, pasid) - 1));
 }
 
 static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
-                                          VTDContextEntry *ce)
+                                          VTDContextEntry *ce,
+                                          uint32_t pasid)
 {
     VTDPASIDEntry pe;
 
     if (s->root_scalable) {
-        vtd_ce_get_rid2pasid_entry(s, ce, &pe);
+        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
         return pe.val[0] & VTD_SM_PASID_ENTRY_SLPTPTR;
     }
 
@@ -1015,16 +1044,17 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
 static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
                              uint64_t iova, bool is_write,
                              uint64_t *slptep, uint32_t *slpte_level,
-                             bool *reads, bool *writes, uint8_t aw_bits)
+                             bool *reads, bool *writes, uint8_t aw_bits,
+                             uint32_t pasid)
 {
-    dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce);
-    uint32_t level = vtd_get_iova_level(s, ce);
+    dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce, pasid);
+    uint32_t level = vtd_get_iova_level(s, ce, pasid);
     uint32_t offset;
     uint64_t slpte;
     uint64_t access_right_check;
     uint64_t xlat, size;
 
-    if (!vtd_iova_range_check(s, iova, ce, aw_bits)) {
+    if (!vtd_iova_range_check(s, iova, ce, aw_bits, pasid)) {
         error_report_once("%s: detected IOVA overflow (iova=0x%" PRIx64 ")",
                           __func__, iova);
         return -VTD_FR_ADDR_BEYOND_MGAW;
@@ -1040,7 +1070,7 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
         if (slpte == (uint64_t)-1) {
             error_report_once("%s: detected read error on DMAR slpte "
                               "(iova=0x%" PRIx64 ")", __func__, iova);
-            if (level == vtd_get_iova_level(s, ce)) {
+            if (level == vtd_get_iova_level(s, ce, pasid)) {
                 /* Invalid programming of context-entry */
                 return -VTD_FR_CONTEXT_ENTRY_INV;
             } else {
@@ -1304,18 +1334,19 @@ next:
  */
 static int vtd_page_walk(IntelIOMMUState *s, VTDContextEntry *ce,
                          uint64_t start, uint64_t end,
-                         vtd_page_walk_info *info)
+                         vtd_page_walk_info *info,
+                         uint32_t pasid)
 {
-    dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce);
-    uint32_t level = vtd_get_iova_level(s, ce);
+    dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce, pasid);
+    uint32_t level = vtd_get_iova_level(s, ce, pasid);
 
-    if (!vtd_iova_range_check(s, start, ce, info->aw)) {
+    if (!vtd_iova_range_check(s, start, ce, info->aw, pasid)) {
         return -VTD_FR_ADDR_BEYOND_MGAW;
     }
 
-    if (!vtd_iova_range_check(s, end, ce, info->aw)) {
+    if (!vtd_iova_range_check(s, end, ce, info->aw, pasid)) {
         /* Fix end so that it reaches the maximum */
-        end = vtd_iova_limit(s, ce, info->aw);
+        end = vtd_iova_limit(s, ce, info->aw, pasid);
     }
 
     return vtd_page_walk_level(addr, start, end, level, true, true, info);
@@ -1383,7 +1414,7 @@ static int vtd_ce_rid2pasid_check(IntelIOMMUState *s,
      * has valid rid2pasid setting, which includes valid
      * rid2pasid field and corresponding pasid entry setting
      */
-    return vtd_ce_get_rid2pasid_entry(s, ce, &pe);
+    return vtd_ce_get_rid2pasid_entry(s, ce, &pe, PCI_NO_PASID);
 }
 
 /* Map a device to its corresponding domain (context-entry) */
@@ -1466,12 +1497,13 @@ static int vtd_sync_shadow_page_hook(IOMMUTLBEvent *event,
 }
 
 static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
-                                  VTDContextEntry *ce)
+                                  VTDContextEntry *ce,
+                                  uint32_t pasid)
 {
     VTDPASIDEntry pe;
 
     if (s->root_scalable) {
-        vtd_ce_get_rid2pasid_entry(s, ce, &pe);
+        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
         return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
     }
 
@@ -1489,10 +1521,10 @@ static int vtd_sync_shadow_page_table_range(VTDAddressSpace *vtd_as,
         .notify_unmap = true,
         .aw = s->aw_bits,
         .as = vtd_as,
-        .domain_id = vtd_get_domain_id(s, ce),
+        .domain_id = vtd_get_domain_id(s, ce, vtd_as->pasid),
     };
 
-    return vtd_page_walk(s, ce, addr, addr + size, &info);
+    return vtd_page_walk(s, ce, addr, addr + size, &info, vtd_as->pasid);
 }
 
 static int vtd_sync_shadow_page_table(VTDAddressSpace *vtd_as)
@@ -1536,13 +1568,14 @@ static int vtd_sync_shadow_page_table(VTDAddressSpace *vtd_as)
  * 1st-level translation or 2nd-level translation, it depends
  * on PGTT setting.
  */
-static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce)
+static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
+                               uint32_t pasid)
 {
     VTDPASIDEntry pe;
     int ret;
 
     if (s->root_scalable) {
-        ret = vtd_ce_get_rid2pasid_entry(s, ce, &pe);
+        ret = vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
         if (ret) {
             /*
+             * This error is guest triggerable. We should assume PT
@@ -1578,19 +1611,20 @@ static bool vtd_as_pt_enabled(VTDAddressSpace *as)
         return false;
     }
 
-    return vtd_dev_pt_enabled(s, &ce);
+    return vtd_dev_pt_enabled(s, &ce, as->pasid);
 }
 
 /* Return whether the device is using IOMMU translation. */
 static bool vtd_switch_address_space(VTDAddressSpace *as)
 {
-    bool use_iommu;
+    bool use_iommu, pt;
     /* Whether we need to take the BQL on our own */
     bool take_bql = !qemu_mutex_iothread_locked();
 
     assert(as);
 
     use_iommu = as->iommu_state->dmar_enabled && !vtd_as_pt_enabled(as);
+    pt = as->iommu_state->dmar_enabled && vtd_as_pt_enabled(as);
 
     trace_vtd_switch_address_space(pci_bus_num(as->bus),
                                    VTD_PCI_SLOT(as->devfn),
@@ -1610,11 +1644,53 @@ static bool vtd_switch_address_space(VTDAddressSpace *as)
     if (use_iommu) {
         memory_region_set_enabled(&as->nodmar, false);
         memory_region_set_enabled(MEMORY_REGION(&as->iommu), true);
+        /*
+         * vt-d spec v3.4 3.14:
+         *
+         * """
+         * Requests-with-PASID with input address in range 0xFEEx_xxxx
+         * are translated normally like any other request-with-PASID
+         * through DMA-remapping hardware.
+         * """
+         *
+         * Need to disable IR for the AS with PASID.
+         */
+        if (as->pasid != PCI_NO_PASID) {
+            memory_region_set_enabled(&as->iommu_ir, false);
+        } else {
+            memory_region_set_enabled(&as->iommu_ir, true);
+        }
     } else {
         memory_region_set_enabled(MEMORY_REGION(&as->iommu), false);
         memory_region_set_enabled(&as->nodmar, true);
     }
 
+    /*
+     * vtd-spec v3.4 3.14:
+     *
+     * """
+     * Requests-with-PASID with input address in range 0xFEEx_xxxx are
+     * translated normally like any other request-with-PASID through
+     * DMA-remapping hardware. However, if such a request is processed
+     * using pass-through translation, it will be blocked as described
+     * in the paragraph below.
+     *
+     * Software must not program paging-structure entries to remap any
+     * address to the interrupt address range. Untranslated requests
+     * and translation requests that result in an address in the
+     * interrupt range will be blocked with condition code LGN.4 or
+     * SGN.8.
+     * """
+     *
+     * We enable a per-AS memory region (iommu_ir_fault) for catching
+     * translations to the interrupt range through PASID + PT.
+     */
+    if (pt && as->pasid != PCI_NO_PASID) {
+        memory_region_set_enabled(&as->iommu_ir_fault, true);
+    } else {
+        memory_region_set_enabled(&as->iommu_ir_fault, false);
+    }
+
     if (take_bql) {
         qemu_mutex_unlock_iothread();
     }
@@ -1747,13 +1823,14 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     uint8_t bus_num = pci_bus_num(bus);
     VTDContextCacheEntry *cc_entry;
     uint64_t slpte, page_mask;
-    uint32_t level;
+    uint32_t level, pasid = vtd_as->pasid;
     uint16_t source_id = vtd_make_source_id(bus_num, devfn);
     int ret_fr;
     bool is_fpd_set = false;
     bool reads = true;
     bool writes = true;
     uint8_t access_flags;
+    bool rid2pasid = (pasid == PCI_NO_PASID) && s->root_scalable;
     VTDIOTLBEntry *iotlb_entry;
 
     /*
@@ -1766,15 +1843,17 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
 
     cc_entry = &vtd_as->context_cache_entry;
 
-    /* Try to fetch slpte form IOTLB */
-    iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
-    if (iotlb_entry) {
-        trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->slpte,
-                                 iotlb_entry->domain_id);
-        slpte = iotlb_entry->slpte;
-        access_flags = iotlb_entry->access_flags;
-        page_mask = iotlb_entry->mask;
-        goto out;
+    /* Try to fetch slpte from IOTLB; we don't need RID2PASID logic */
+    if (!rid2pasid) {
+        iotlb_entry = vtd_lookup_iotlb(s, source_id, addr, pasid);
+        if (iotlb_entry) {
+            trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->slpte,
+                                     iotlb_entry->domain_id);
+            slpte = iotlb_entry->slpte;
+            access_flags = iotlb_entry->access_flags;
+            page_mask = iotlb_entry->mask;
+            goto out;
+        }
     }
 
     /* Try to fetch context-entry from cache first */
@@ -1785,7 +1864,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         ce = cc_entry->context_entry;
         is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
         if (!is_fpd_set && s->root_scalable) {
-            ret_fr = vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set);
+            ret_fr = vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set, pasid);
             if (ret_fr) {
                 vtd_qualify_report_fault(s, -ret_fr, is_fpd_set,
                                          source_id, addr, is_write);
@@ -1796,7 +1875,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         ret_fr = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
         is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
         if (!ret_fr && !is_fpd_set && s->root_scalable) {
-            ret_fr = vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set);
+            ret_fr = vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set, pasid);
         }
         if (ret_fr) {
             vtd_qualify_report_fault(s, -ret_fr, is_fpd_set,
@@ -1811,11 +1890,15 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         cc_entry->context_cache_gen = s->context_cache_gen;
     }
 
+    if (rid2pasid) {
+        pasid = VTD_CE_GET_RID2PASID(&ce);
+    }
+
     /*
      * We don't need to translate for pass-through context entries.
      * Also, let's ignore IOTLB caching as well for PT devices.
      */
-    if (vtd_dev_pt_enabled(s, &ce)) {
+    if (vtd_dev_pt_enabled(s, &ce, pasid)) {
         entry->iova = addr & VTD_PAGE_MASK_4K;
         entry->translated_addr = entry->iova;
         entry->addr_mask = ~VTD_PAGE_MASK_4K;
@@ -1836,8 +1919,21 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         return true;
     }
 
+    /* Try to fetch slpte from IOTLB for RID2PASID slow path */
+    if (rid2pasid) {
+        iotlb_entry = vtd_lookup_iotlb(s, source_id, addr, pasid);
+        if (iotlb_entry) {
+            trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->slpte,
+                                     iotlb_entry->domain_id);
+            slpte = iotlb_entry->slpte;
+            access_flags = iotlb_entry->access_flags;
+            page_mask = iotlb_entry->mask;
+            goto out;
+        }
+    }
+
     ret_fr = vtd_iova_to_slpte(s, &ce, addr, is_write, &slpte, &level,
-                               &reads, &writes, s->aw_bits);
+                               &reads, &writes, s->aw_bits, pasid);
     if (ret_fr) {
         vtd_qualify_report_fault(s, -ret_fr, is_fpd_set, source_id,
                                  addr, is_write);
@@ -1846,8 +1942,8 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
 
     page_mask = vtd_slpt_level_page_mask(level);
     access_flags = IOMMU_ACCESS_FLAG(reads, writes);
-    vtd_update_iotlb(s, source_id, vtd_get_domain_id(s, &ce), addr, slpte,
-                     access_flags, level);
+    vtd_update_iotlb(s, source_id, vtd_get_domain_id(s, &ce, pasid),
+                     addr, slpte, access_flags, level, pasid);
 out:
     vtd_iommu_unlock(s);
     entry->iova = addr & page_mask;
@@ -2039,7 +2135,7 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
     QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
         if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
                                       vtd_as->devfn, &ce) &&
-            domain_id == vtd_get_domain_id(s, &ce)) {
+            domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
             vtd_sync_shadow_page_table(vtd_as);
         }
     }
@@ -2047,7 +2143,7 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
 
 static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
                                            uint16_t domain_id, hwaddr addr,
-                                           uint8_t am)
+                                           uint8_t am, uint32_t pasid)
 {
     VTDAddressSpace *vtd_as;
     VTDContextEntry ce;
@@ -2055,9 +2151,11 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
     hwaddr size = (1 << am) * VTD_PAGE_SIZE;
 
     QLIST_FOREACH(vtd_as, &(s->vtd_as_with_notifiers), next) {
+        if (pasid != PCI_NO_PASID && pasid != vtd_as->pasid) {
+            continue;
+        }
         ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
                                        vtd_as->devfn, &ce);
-        if (!ret && domain_id == vtd_get_domain_id(s, &ce)) {
+        if (!ret && domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
             if (vtd_as_has_map_notifier(vtd_as)) {
                 /*
                  * As long as we have MAP notifications registered in
@@ -2101,7 +2199,7 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
     vtd_iommu_lock(s);
     g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
     vtd_iommu_unlock(s);
-    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
+    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am, PCI_NO_PASID);
 }
 
 /* Flush IOTLB
@@ -3168,6 +3266,7 @@ static Property vtd_properties[] = {
     DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
     DEFINE_PROP_BOOL("x-scalable-mode", IntelIOMMUState, scalable_mode, FALSE),
     DEFINE_PROP_BOOL("snoop-control", IntelIOMMUState, snoop_control, false),
+    DEFINE_PROP_BOOL("x-pasid-mode", IntelIOMMUState, pasid, false),
     DEFINE_PROP_BOOL("dma-drain", IntelIOMMUState, dma_drain, true),
     DEFINE_PROP_END_OF_LIST(),
 };
@@ -3441,7 +3540,63 @@ static const MemoryRegionOps vtd_mem_ir_ops = {
     },
 };
 
-VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
+static void vtd_report_ir_illegal_access(VTDAddressSpace *vtd_as,
+                                         hwaddr addr, bool is_write)
+{
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    uint8_t bus_n = pci_bus_num(vtd_as->bus);
+    uint16_t sid = vtd_make_source_id(bus_n, vtd_as->devfn);
+    bool is_fpd_set = false;
+    VTDContextEntry ce;
+
+    assert(vtd_as->pasid != PCI_NO_PASID);
+
+    /* Try our best to fetch FPD; we can't do anything more */
+    if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
+        is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
+        if (!is_fpd_set && s->root_scalable) {
+            vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set, vtd_as->pasid);
+        }
+    }
+
+    vtd_qualify_report_fault(s, VTD_FR_SM_INTERRUPT_ADDR,
+                             is_fpd_set, sid, addr, is_write);
+}
+
+static MemTxResult vtd_mem_ir_fault_read(void *opaque, hwaddr addr,
+                                         uint64_t *data, unsigned size,
+                                         MemTxAttrs attrs)
+{
+    vtd_report_ir_illegal_access(opaque, addr, false);
+
+    return MEMTX_ERROR;
+}
+
+static MemTxResult vtd_mem_ir_fault_write(void *opaque, hwaddr addr,
+                                          uint64_t value, unsigned size,
+                                          MemTxAttrs attrs)
+{
+    vtd_report_ir_illegal_access(opaque, addr, true);
+
+    return MEMTX_ERROR;
+}
+
+static const MemoryRegionOps vtd_mem_ir_fault_ops = {
+    .read_with_attrs = vtd_mem_ir_fault_read,
+    .write_with_attrs = vtd_mem_ir_fault_write,
+    .endianness = DEVICE_LITTLE_ENDIAN,
+    .impl = {
+        .min_access_size = 4,
+        .max_access_size = 4,
+    },
+    .valid = {
+        .min_access_size = 4,
+        .max_access_size = 4,
+    },
+};
+
+VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
+                                 int devfn, unsigned int pasid)
 {
     /*
      * We can't simply use sid here since the bus number might not be
@@ -3450,6 +3605,7 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
     struct vtd_as_key key = {
         .bus = bus,
         .devfn = devfn,
+        .pasid = pasid,
     };
     VTDAddressSpace *vtd_dev_as;
     char name[128];
@@ -3460,13 +3616,21 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
 
         new_key->bus = bus;
         new_key->devfn = devfn;
+        new_key->pasid = pasid;
+
+        if (pasid == PCI_NO_PASID) {
+            snprintf(name, sizeof(name), "vtd-%02x.%x", PCI_SLOT(devfn),
+                     PCI_FUNC(devfn));
+        } else {
+            snprintf(name, sizeof(name), "vtd-%02x.%x-pasid-%x", PCI_SLOT(devfn),
+                     PCI_FUNC(devfn), pasid);
+        }
 
-        snprintf(name, sizeof(name), "vtd-%02x.%x", PCI_SLOT(devfn),
-                 PCI_FUNC(devfn));
         vtd_dev_as = g_malloc0(sizeof(VTDAddressSpace));
 
         vtd_dev_as->bus = bus;
         vtd_dev_as->devfn = (uint8_t)devfn;
+        vtd_dev_as->pasid = pasid;
         vtd_dev_as->iommu_state = s;
         vtd_dev_as->context_cache_entry.context_cache_gen = 0;
         vtd_dev_as->iova_tree = iova_tree_new();
@@ -3507,6 +3671,24 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
                                             VTD_INTERRUPT_ADDR_FIRST,
                                             &vtd_dev_as->iommu_ir, 1);
 
+        /*
+         * This region is used for catching faults on accesses to the
+         * interrupt range via passthrough + PASID. See also
+         * vtd_switch_address_space(). We can't use an alias since we
+         * need to know the SID, which is valid for MSIs that use
+         * bus_master_as (see msi_send_message()).
+         */
+        memory_region_init_io(&vtd_dev_as->iommu_ir_fault, OBJECT(s),
+                              &vtd_mem_ir_fault_ops, vtd_dev_as, "vtd-no-ir",
+                              VTD_INTERRUPT_ADDR_SIZE);
+        /*
+         * Hook this under the root since vtd_dev_as->iommu will be
+         * disabled when PT is enabled.
+         */
+        memory_region_add_subregion_overlap(MEMORY_REGION(&vtd_dev_as->root),
+                                            VTD_INTERRUPT_ADDR_FIRST,
+                                            &vtd_dev_as->iommu_ir_fault, 2);
+
         /*
          * Hook both the containers under the root container, we
          * switch between DMAR & noDMAR by enable/disable
@@ -3627,7 +3809,7 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
                                   "legacy mode",
                                   bus_n, PCI_SLOT(vtd_as->devfn),
                                   PCI_FUNC(vtd_as->devfn),
-                                  vtd_get_domain_id(s, &ce),
+                                  vtd_get_domain_id(s, &ce, vtd_as->pasid),
                                   ce.hi, ce.lo);
         if (vtd_as_has_map_notifier(vtd_as)) {
             /* This is required only for MAP typed notifiers */
@@ -3637,10 +3819,10 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
                 .notify_unmap = false,
                 .aw = s->aw_bits,
                 .as = vtd_as,
-                .domain_id = vtd_get_domain_id(s, &ce),
+                .domain_id = vtd_get_domain_id(s, &ce, vtd_as->pasid),
             };
 
-            vtd_page_walk(s, &ce, 0, ~0ULL, &info);
+            vtd_page_walk(s, &ce, 0, ~0ULL, &info, vtd_as->pasid);
         }
     } else {
         trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
@@ -3735,6 +3917,10 @@ static void vtd_init(IntelIOMMUState *s)
         s->ecap |= VTD_ECAP_SC;
     }
 
+    if (s->pasid) {
+        s->ecap |= VTD_ECAP_PASID;
+    }
+
     vtd_reset_caches(s);
 
     /* Define registers with default values and bit semantics */
@@ -3808,7 +3994,7 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
 
     assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
 
-    vtd_as = vtd_find_add_as(s, bus, devfn);
+    vtd_as = vtd_find_add_as(s, bus, devfn, PCI_NO_PASID);
     return &vtd_as->as;
 }
 
@@ -3851,6 +4037,11 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
         return false;
     }
 
+    if (s->pasid && !s->scalable_mode) {
        error_setg(errp, "Need to set scalable mode for PASID mode");
+        return false;
+    }
+
     return true;
 }
 
@@ -3913,7 +4104,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
 
     sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->csrmem);
     /* No corresponding destroy */
-    s->iotlb = g_hash_table_new_full(vtd_uint64_hash, vtd_uint64_equal,
+    s->iotlb = g_hash_table_new_full(vtd_iotlb_hash, vtd_iotlb_equal,
                                      g_free, g_free);
     s->vtd_as = g_hash_table_new_full(vtd_as_hash, vtd_as_equal,
                                       g_free, g_free);
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 930ce61feb..f6d1fae79b 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -114,8 +114,9 @@
                                      VTD_INTERRUPT_ADDR_FIRST + 1)
 
 /* The shift of source_id in the key of IOTLB hash table */
-#define VTD_IOTLB_SID_SHIFT         36
-#define VTD_IOTLB_LVL_SHIFT         52
+#define VTD_IOTLB_SID_SHIFT         20
+#define VTD_IOTLB_LVL_SHIFT         28
+#define VTD_IOTLB_PASID_SHIFT       30
 #define VTD_IOTLB_MAX_SIZE          1024    /* Max size of the hash table */
 
 /* IOTLB_REG */
@@ -191,6 +192,7 @@
 #define VTD_ECAP_SC                 (1ULL << 7)
 #define VTD_ECAP_MHMV               (15ULL << 20)
 #define VTD_ECAP_SRS                (1ULL << 31)
+#define VTD_ECAP_PASID              (1ULL << 40)
 #define VTD_ECAP_SMTS               (1ULL << 43)
 #define VTD_ECAP_SLTS               (1ULL << 46)
 
@@ -211,6 +213,8 @@
 #define VTD_CAP_DRAIN_READ          (1ULL << 55)
 #define VTD_CAP_DRAIN               (VTD_CAP_DRAIN_READ | VTD_CAP_DRAIN_WRITE)
 #define VTD_CAP_CM                  (1ULL << 7)
+#define VTD_PASID_ID_SHIFT          20
+#define VTD_PASID_ID_MASK           ((1ULL << VTD_PASID_ID_SHIFT) - 1)
 
 /* Supported Adjusted Guest Address Widths */
 #define VTD_CAP_SAGAW_SHIFT         8
@@ -379,6 +383,11 @@ typedef union VTDInvDesc VTDInvDesc;
 #define VTD_INV_DESC_IOTLB_AM(val)      ((val) & 0x3fULL)
 #define VTD_INV_DESC_IOTLB_RSVD_LO      0xffffffff0000ff00ULL
 #define VTD_INV_DESC_IOTLB_RSVD_HI      0xf80ULL
+#define VTD_INV_DESC_IOTLB_PASID_PASID  (2ULL << 4)
+#define VTD_INV_DESC_IOTLB_PASID_PAGE   (3ULL << 4)
+#define VTD_INV_DESC_IOTLB_PASID(val)   (((val) >> 32) & VTD_PASID_ID_MASK)
+#define VTD_INV_DESC_IOTLB_PASID_RSVD_LO      0xfff00000000001c0ULL
+#define VTD_INV_DESC_IOTLB_PASID_RSVD_HI      0xf80ULL
 
 /* Mask for Device IOTLB Invalidate Descriptor */
 #define VTD_INV_DESC_DEVICE_IOTLB_ADDR(val) ((val) & 0xfffffffffffff000ULL)
@@ -413,6 +422,7 @@ typedef union VTDInvDesc VTDInvDesc;
 /* Information about page-selective IOTLB invalidate */
 struct VTDIOTLBPageInvInfo {
     uint16_t domain_id;
+    uint32_t pasid;
     uint64_t addr;
     uint8_t mask;
 };
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 5bf7e52bf5..57beff0c17 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -12,6 +12,8 @@ vtd_inv_desc_cc_devices(uint16_t sid, uint16_t fmask) "context invalidate device
 vtd_inv_desc_iotlb_global(void) "iotlb invalidate global"
 vtd_inv_desc_iotlb_domain(uint16_t domain) "iotlb invalidate whole domain 0x%"PRIx16
 vtd_inv_desc_iotlb_pages(uint16_t domain, uint64_t addr, uint8_t mask) "iotlb invalidate domain 0x%"PRIx16" addr 0x%"PRIx64" mask 0x%"PRIx8
+vtd_inv_desc_iotlb_pasid_pages(uint16_t domain, uint64_t addr, uint8_t mask, uint32_t pasid) "iotlb invalidate domain 0x%"PRIx16" addr 0x%"PRIx64" mask 0x%"PRIx8" pasid 0x%"PRIx32
+vtd_inv_desc_iotlb_pasid(uint16_t domain, uint32_t pasid) "iotlb invalidate domain 0x%"PRIx16" pasid 0x%"PRIx32
 vtd_inv_desc_wait_sw(uint64_t addr, uint32_t data) "wait invalidate status write addr 0x%"PRIx64" data 0x%"PRIx32
 vtd_inv_desc_wait_irq(const char *msg) "%s"
 vtd_inv_desc_wait_write_fail(uint64_t hi, uint64_t lo) "write fail for wait desc hi 0x%"PRIx64" lo 0x%"PRIx64
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index fa1bed353c..0d1029f366 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -97,11 +97,13 @@ struct VTDPASIDEntry {
 struct VTDAddressSpace {
     PCIBus *bus;
     uint8_t devfn;
+    uint32_t pasid;
     AddressSpace as;
     IOMMUMemoryRegion iommu;
     MemoryRegion root;          /* The root container of the device */
     MemoryRegion nodmar;        /* The alias of shared nodmar MR */
     MemoryRegion iommu_ir;      /* Interrupt region: 0xfeeXXXXX */
+    MemoryRegion iommu_ir_fault; /* Interrupt region for catching fault */
     IntelIOMMUState *iommu_state;
     VTDContextCacheEntry context_cache_entry;
     QLIST_ENTRY(VTDAddressSpace) next;
@@ -113,6 +115,7 @@ struct VTDAddressSpace {
 struct VTDIOTLBEntry {
     uint64_t gfn;
     uint16_t domain_id;
+    uint32_t pasid;
     uint64_t slpte;
     uint64_t mask;
     uint8_t access_flags;
@@ -260,6 +263,7 @@ struct IntelIOMMUState {
     bool buggy_eim;                 /* Force buggy EIM unless eim=off */
     uint8_t aw_bits;                /* Host/IOVA address width (in bits) */
     bool dma_drain;                 /* Whether DMA r/w draining enabled */
+    bool pasid;                     /* Whether to support PASID */
 
     /*
      * Protects IOMMU states in general.  Currently it protects the
@@ -271,6 +275,7 @@ struct IntelIOMMUState {
 /* Find the VTD Address space associated with the given bus pointer,
  * create a new one if none exists
  */
-VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn);
+VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
+                                 int devfn, unsigned int pasid);
 
 #endif
diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
index 347440d42c..cbfcf0b770 100644
--- a/include/hw/pci/pci_bus.h
+++ b/include/hw/pci/pci_bus.h
@@ -26,6 +26,8 @@ enum PCIBusFlags {
     PCI_BUS_EXTENDED_CONFIG_SPACE                           = 0x0002,
 };
 
+#define PCI_NO_PASID UINT32_MAX
+
 struct PCIBus {
     BusState qbus;
     enum PCIBusFlags flags;
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 43+ messages in thread
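
[Editorial illustration, not part of the original mail.] The IOTLB rework in the hunks above replaces the packed uint64_t hash key with a struct keyed on (gfn, sid, level, pasid). A self-contained sketch of how such a key can compare and hash — this is a toy model, not QEMU code; only the shift values deliberately mirror the VTD_IOTLB_SID_SHIFT (20), VTD_IOTLB_LVL_SHIFT (28) and VTD_IOTLB_PASID_SHIFT (30) defines in the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the reworked IOTLB key: a struct carrying gfn,
 * source-id, level and pasid, so a lookup must compare every field
 * instead of a single packed integer. */
struct vtd_iotlb_key {
    uint64_t gfn;
    uint32_t pasid;
    uint32_t level;
    uint16_t sid;
};

static int vtd_iotlb_key_equal(const struct vtd_iotlb_key *a,
                               const struct vtd_iotlb_key *b)
{
    return a->gfn == b->gfn && a->sid == b->sid &&
           a->level == b->level && a->pasid == b->pasid;
}

/* Fold the fields into one word for hashing, using the patch's shift
 * values. Overlapping bits may collide, which is fine because the
 * equality callback above is the final arbiter. */
static uint64_t vtd_iotlb_key_hash(const struct vtd_iotlb_key *k)
{
    return k->gfn | ((uint64_t)k->sid << 20) |
           ((uint64_t)k->level << 28) | ((uint64_t)k->pasid << 30);
}
```

Two keys that differ only in pasid now miss in the cache, which is what allows distinct cached translations per (device, PASID) pair.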

* RE: [PATCH V2 1/4] intel-iommu: don't warn guest errors when getting rid2pasid entry
  2022-03-21  5:54 ` [PATCH V2 1/4] intel-iommu: don't warn guest errors when getting rid2pasid entry Jason Wang
@ 2022-03-24  8:21   ` Tian, Kevin
  2022-03-28  2:27     ` Jason Wang
  0 siblings, 1 reply; 43+ messages in thread
From: Tian, Kevin @ 2022-03-24  8:21 UTC (permalink / raw)
  To: Jason Wang, mst, peterx; +Cc: Liu, Yi L, yi.y.sun, qemu-devel

> From: Jason Wang
> Sent: Monday, March 21, 2022 1:54 PM
> 
> We used to warn on a wrong rid2pasid entry. But this error can be
> triggered by the guest and can happen during initialization. So
> let's not warn in this case.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  hw/i386/intel_iommu.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 874d01c162..90964b201c 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1554,8 +1554,10 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState
> *s, VTDContextEntry *ce)
>      if (s->root_scalable) {
>          ret = vtd_ce_get_rid2pasid_entry(s, ce, &pe);
>          if (ret) {
> -            error_report_once("%s: vtd_ce_get_rid2pasid_entry error: %"PRId32,
> -                              __func__, ret);
> +            /*
> +             * This error is guest triggerable. We should assume PT
> +             * is not enabled for safety.
> +             */

I suppose a VT-d fault should be queued in this case, besides returning false:

SPD.1: A hardware attempt to access the scalable-mode PASID-directory 
entry referenced through the PASIDDIRPTR field in scalable-mode 
context-entry resulted in an error

SPT.1: A hardware attempt to access a scalable-mode PASID-table entry
referenced through the SMPTBLPTR field in a scalable-mode PASID-directory
entry resulted in an error.

Currently the implementation of vtd_ce_get_rid2pasid_entry() is also
problematic. According to the VT-d spec, the RID2PASID field is effective
only when ecap.rps is true; otherwise PASID#0 is used for RID2PASID. I
didn't see ecap.rps being set, nor is it checked in that function. It
possibly works just because Linux currently programs 0 into RID2PASID...

>              return false;
>          }
>          return (VTD_PE_GET_TYPE(&pe) == VTD_SM_PASID_ENTRY_PT);
> --
> 2.25.1
> 



^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH V2 3/4] intel-iommu: convert VTD_PE_GET_FPD_ERR() to be a function
  2022-03-21  5:54 ` [PATCH V2 3/4] intel-iommu: convert VTD_PE_GET_FPD_ERR() to be a function Jason Wang
@ 2022-03-24  8:26   ` Tian, Kevin
  2022-03-28  2:27     ` Jason Wang
  2022-04-22 13:08   ` Peter Xu
  1 sibling, 1 reply; 43+ messages in thread
From: Tian, Kevin @ 2022-03-24  8:26 UTC (permalink / raw)
  To: Jason Wang, mst, peterx; +Cc: Liu, Yi L, yi.y.sun, qemu-devel

> From: Jason Wang
> Sent: Monday, March 21, 2022 1:54 PM
> @@ -1724,6 +1713,19 @@ out:
>      trace_vtd_pt_enable_fast_path(source_id, success);
>  }
> 
> +static void vtd_qualify_report_fault(IntelIOMMUState *s,
> +                                     int err, bool is_fpd_set,
> +                                     uint16_t source_id,
> +                                     hwaddr addr,
> +                                     bool is_write)

vtd_report_qualified_fault() is clearer.

> +{
> +    if (is_fpd_set && vtd_is_qualified_fault(err)) {
> +        trace_vtd_fault_disabled();
> +    } else {
> +        vtd_report_dmar_fault(s, source_id, addr, err, is_write);
> +    }
> +}
> +
>  /* Map dev to context-entry then do a paging-structures walk to do a iommu
>   * translation.
>   *


^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH V2 4/4] intel-iommu: PASID support
  2022-03-21  5:54 ` [PATCH V2 4/4] intel-iommu: PASID support Jason Wang
@ 2022-03-24  8:53   ` Tian, Kevin
  2022-03-28  2:31     ` Jason Wang
  2022-03-28  7:03   ` Tian, Kevin
  2022-03-28  8:45   ` Yi Liu
  2 siblings, 1 reply; 43+ messages in thread
From: Tian, Kevin @ 2022-03-24  8:53 UTC (permalink / raw)
  To: Jason Wang, mst, peterx; +Cc: Liu, Yi L, yi.y.sun, qemu-devel

> From: Jason Wang
> Sent: Monday, March 21, 2022 1:54 PM
> 
> This patch introduces ECAP_PASID via "x-pasid-mode". Based on the
> existing support for scalable mode, we need to implement the following
> missing parts:
> 
> 1) tag VTDAddressSpace with PASID and support IOMMU/DMA translation
>    with PASID
> 2) tag IOTLB with PASID

and invalidate desc to flush PASID iotlb, which seems missing in this patch.

> 3) PASID cache and its flush
> 4) Fault recording with PASID
> 
> For simplicity:
> 
> 1) PASID cache is not implemented so we can simply implement the PASID
> cache flush as a nop.
> 2) Fault recording with PASID is not supported, NFR is not changed.
> 
> All of the above is not mandatory and could be implemented in the
> future.

PASID cache is optional, but fault recording with PASID is required.
I'm fine with adding it incrementally but want to clarify the concept first.

> 
> Note that though PASID-based IOMMU translation is ready, no device
> can issue PASID DMA right now. In this case, PCI_NO_PASID is used as
> PASID to identify the address w/ PASID. vtd_find_add_as() has been
> extended to provision address space with PASID which could be utilized
> by the future extension of PCI core to allow device model to use PASID
> based DMA translation.

I didn't get the point of PCI_NO_PASID. How is it different from RID_PASID?
Can you enlighten?

> 
> This feature would be useful for:
> 
> 1) prototyping PASID support for devices like virtio
> 2) future vPASID work
> 3) future PRS and vSVA work

Haven't got time to look at the code in detail. stop here. 

Thanks
Kevin


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 1/4] intel-iommu: don't warn guest errors when getting rid2pasid entry
  2022-03-24  8:21   ` Tian, Kevin
@ 2022-03-28  2:27     ` Jason Wang
  2022-03-28  8:53       ` Yi Liu
  0 siblings, 1 reply; 43+ messages in thread
From: Jason Wang @ 2022-03-28  2:27 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Liu, Yi L, yi.y.sun, qemu-devel, peterx, mst

On Thu, Mar 24, 2022 at 4:21 PM Tian, Kevin <kevin.tian@intel.com> wrote:
>
> > From: Jason Wang
> > Sent: Monday, March 21, 2022 1:54 PM
> >
> > We used to warn on a wrong rid2pasid entry. But this error could be
> > triggered by the guest and could happen during initialization. So
> > let's not warn in this case.
> >
> > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > ---
> >  hw/i386/intel_iommu.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index 874d01c162..90964b201c 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -1554,8 +1554,10 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState
> > *s, VTDContextEntry *ce)
> >      if (s->root_scalable) {
> >          ret = vtd_ce_get_rid2pasid_entry(s, ce, &pe);
> >          if (ret) {
> > -            error_report_once("%s: vtd_ce_get_rid2pasid_entry error: %"PRId32,
> > -                              __func__, ret);
> > +            /*
> > +             * This error is guest triggerable. We should assume PT
> > +             * is not enabled for safety.
> > +             */
>
> I suppose a VT-d fault should be queued in this case, besides returning false:
>
> SPD.1: A hardware attempt to access the scalable-mode PASID-directory
> entry referenced through the PASIDDIRPTR field in scalable-mode
> context-entry resulted in an error
>
> SPT.1: A hardware attempt to access a scalable-mode PASID-table entry
> referenced through the SMPTBLPTR field in a scalable-mode PASID-directory
> entry resulted in an error.

Probably, but this issue is not introduced in this patch. We can fix
it on top if necessary.

>
> Currently the implementation of vtd_ce_get_rid2pasid_entry() is also
> problematic. According to the VT-d spec, the RID2PASID field is effective
> only when ecap.rps is true; otherwise PASID#0 is used for RID2PASID. I
> didn't see ecap.rps being set, nor is it checked in that function. It
> possibly works just because Linux currently programs 0 into RID2PASID...

This seems to be another issue since the introduction of scalable mode.

Thanks

>
> >              return false;
> >          }
> >          return (VTD_PE_GET_TYPE(&pe) == VTD_SM_PASID_ENTRY_PT);
> > --
> > 2.25.1
> >
>



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 3/4] intel-iommu: convert VTD_PE_GET_FPD_ERR() to be a function
  2022-03-24  8:26   ` Tian, Kevin
@ 2022-03-28  2:27     ` Jason Wang
  0 siblings, 0 replies; 43+ messages in thread
From: Jason Wang @ 2022-03-28  2:27 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Liu, Yi L, yi.y.sun, qemu-devel, peterx, mst

On Thu, Mar 24, 2022 at 4:27 PM Tian, Kevin <kevin.tian@intel.com> wrote:
>
> > From: Jason Wang
> > Sent: Monday, March 21, 2022 1:54 PM
> > @@ -1724,6 +1713,19 @@ out:
> >      trace_vtd_pt_enable_fast_path(source_id, success);
> >  }
> >
> > +static void vtd_qualify_report_fault(IntelIOMMUState *s,
> > +                                     int err, bool is_fpd_set,
> > +                                     uint16_t source_id,
> > +                                     hwaddr addr,
> > +                                     bool is_write)
>
> vtd_report_qualified_fault() is clearer.

Fine.

Thanks

>
> > +{
> > +    if (is_fpd_set && vtd_is_qualified_fault(err)) {
> > +        trace_vtd_fault_disabled();
> > +    } else {
> > +        vtd_report_dmar_fault(s, source_id, addr, err, is_write);
> > +    }
> > +}
> > +
> >  /* Map dev to context-entry then do a paging-structures walk to do a iommu
> >   * translation.
> >   *
>



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 4/4] intel-iommu: PASID support
  2022-03-24  8:53   ` Tian, Kevin
@ 2022-03-28  2:31     ` Jason Wang
  2022-03-28  6:47       ` Tian, Kevin
  0 siblings, 1 reply; 43+ messages in thread
From: Jason Wang @ 2022-03-28  2:31 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Liu, Yi L, yi.y.sun, qemu-devel, peterx, mst

On Thu, Mar 24, 2022 at 4:54 PM Tian, Kevin <kevin.tian@intel.com> wrote:
>
> > From: Jason Wang
> > Sent: Monday, March 21, 2022 1:54 PM
> >
> > This patch introduces ECAP_PASID via "x-pasid-mode". Based on the
> > existing support for scalable mode, we need to implement the following
> > missing parts:
> >
> > 1) tag VTDAddressSpace with PASID and support IOMMU/DMA translation
> >    with PASID
> > 2) tag IOTLB with PASID
>
> and invalidate desc to flush PASID iotlb, which seems missing in this patch.

It existed in the previous version, but it looks like it will be used
only for the first-level page table, which is not supported right now.
So I deleted the code.

>
> > 3) PASID cache and its flush
> > 4) Fault recording with PASID
> >
> > For simplicity:
> >
> > 1) PASID cache is not implemented so we can simply implement the PASID
> > cache flush as a nop.
> > 2) Fault recording with PASID is not supported, NFR is not changed.
> >
> > All of the above is not mandatory and could be implemented in the
> > future.
>
> PASID cache is optional, but fault recording with PASID is required.

Any pointer in the spec to say something like this? I think sticking
to the NFR would be sufficient.

> I'm fine with adding it incrementally but want to clarify the concept first.

Yes, that's the plan.

>
> >
> > Note that though PASID-based IOMMU translation is ready, no device
> > can issue PASID DMA right now. In this case, PCI_NO_PASID is used as
> > PASID to identify the address w/ PASID. vtd_find_add_as() has been
> > extended to provision address space with PASID which could be utilized
> > by the future extension of PCI core to allow device model to use PASID
> > based DMA translation.
>
> I didn't get the point of PCI_NO_PASID. How is it different from RID_PASID?
> Can you enlighten?
>
> >
> > This feature would be useful for:
> >
> > 1) prototyping PASID support for devices like virtio
> > 2) future vPASID work
> > 3) future PRS and vSVA work
>
> Haven't got time to look at the code in detail. stop here.

Fine.

Thanks

>
> Thanks
> Kevin
>



^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH V2 4/4] intel-iommu: PASID support
  2022-03-28  2:31     ` Jason Wang
@ 2022-03-28  6:47       ` Tian, Kevin
  2022-03-29  4:46         ` Jason Wang
  0 siblings, 1 reply; 43+ messages in thread
From: Tian, Kevin @ 2022-03-28  6:47 UTC (permalink / raw)
  To: Jason Wang; +Cc: Liu, Yi L, yi.y.sun, qemu-devel, peterx, mst

> From: Jason Wang <jasowang@redhat.com>
> Sent: Monday, March 28, 2022 10:31 AM
> 
> On Thu, Mar 24, 2022 at 4:54 PM Tian, Kevin <kevin.tian@intel.com> wrote:
> >
> > > From: Jason Wang
> > > Sent: Monday, March 21, 2022 1:54 PM
> > >
> > > This patch introduces ECAP_PASID via "x-pasid-mode". Based on the
> > > existing support for scalable mode, we need to implement the following
> > > missing parts:
> > >
> > > 1) tag VTDAddressSpace with PASID and support IOMMU/DMA
> translation
> > >    with PASID
> > > 2) tag IOTLB with PASID
> >
> > and invalidate desc to flush PASID iotlb, which seems missing in this patch.
> 
> It existed in the previous version, but it looks like it will be used
> only for the first-level page table, which is not supported right now.
> So I deleted the code.

You are right. But there is also the PASID-based device-TLB invalidation
descriptor, which is orthogonal to the 1st- vs. 2nd-level thing. If we
don't want to break the spec with this series, there will need to be a
way to prevent the user from setting both "device-iotlb" and
"x-pasid-mode" together.
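The mutual exclusion could be enforced at realize time; a minimal sketch, with hypothetical flag names standing in for the real QEMU properties:

```c
#include <stdbool.h>

/*
 * Hypothetical sketch of the realize-time guard suggested above: until
 * PASID-based device-TLB invalidation descriptors are implemented,
 * reject the combination of device-IOTLB and PASID mode.
 */
static inline bool vtd_pasid_options_valid(bool device_iotlb,
                                           bool pasid_mode)
{
    return !(device_iotlb && pasid_mode);
}
```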

> 
> >
> > > 3) PASID cache and its flush
> > > 4) Fault recording with PASID
> > >
> > > For simplicity:
> > >
> > > 1) PASID cache is not implemented so we can simply implement the PASID
> > > cache flush as a nop.
> > > 2) Fault recording with PASID is not supported, NFR is not changed.
> > >
> > > All of the above is not mandatory and could be implemented in the
> > > future.
> >
> > PASID cache is optional, but fault recording with PASID is required.
> 
> Any pointer in the spec to say something like this? I think sticking
> to the NFR would be sufficient.

I don't remember any place in the spec saying that fault recording with
PASID is not required when the PASID capability is exposed. If a fault
is triggered by a request with PASID, we do want to report this
information upward.

btw can you elaborate why NFR matters to PASID? It is just about the
number of fault recording registers...

> 
> > I'm fine with adding it incrementally but want to clarify the concept first.
> 
> Yes, that's the plan.
> 

I have one open which requires your input.

While enabling things incrementally is indeed a common practice, one
worry is whether we want to create too many control knobs in the staging
process and cause confusion for the end user.

Earlier when Yi proposed Qemu changes for guest SVA [1] he aimed for a
coarse-grained knob design:
--
  Intel VT-d 3.0 introduces scalable mode, and it has a bunch of capabilities
  related to scalable mode translation, thus there are multiple combinations.
  While this vIOMMU implementation wants to simplify it for the user by providing
  typical combinations. User could config it by "x-scalable-mode" option. The
  usage is as below:
    "-device intel-iommu,x-scalable-mode=["legacy"|"modern"]"

    - "legacy": gives support for SL page table
    - "modern": gives support for FL page table, pasid, virtual command
    -  if not configured, means no scalable mode support, if not proper
       configured, will throw error
--

Which way do you prefer?

[1] https://lists.gnu.org/archive/html/qemu-devel/2020-02/msg02805.html

Thanks
Kevin

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH V2 4/4] intel-iommu: PASID support
  2022-03-21  5:54 ` [PATCH V2 4/4] intel-iommu: PASID support Jason Wang
  2022-03-24  8:53   ` Tian, Kevin
@ 2022-03-28  7:03   ` Tian, Kevin
  2022-03-29  4:48     ` Jason Wang
  2022-03-28  8:45   ` Yi Liu
  2 siblings, 1 reply; 43+ messages in thread
From: Tian, Kevin @ 2022-03-28  7:03 UTC (permalink / raw)
  To: Jason Wang, mst, peterx; +Cc: Liu, Yi L, yi.y.sun, qemu-devel

> From: Jason Wang
> Sent: Monday, March 21, 2022 1:54 PM
> 
> +    /*
> +     * vtd-spec v3.4 3.14:
> +     *
> +     * """
> +     * Requests-with-PASID with input address in range 0xFEEx_xxxx are
> +     * translated normally like any other request-with-PASID through
> +     * DMA-remapping hardware. However, if such a request is processed
> +     * using pass-through translation, it will be blocked as described
> +     * in the paragraph below.

While PASID+PT is blocked as described in the paragraph below, the
paragraph itself applies to all situations:

  1) PT + noPASID
  2) translation + noPASID
  3) PT + PASID
  4) translation + PASID

because...

> +     *
> +     * Software must not program paging-structure entries to remap any
> +     * address to the interrupt address range. Untranslated requests
> +     * and translation requests that result in an address in the
> +     * interrupt range will be blocked with condition code LGN.4 or
> +     * SGN.8.

... if you look at the definition of LGN.4 or SGN.8:

LGN.4:	When legacy mode (RTADDR_REG.TTM=00b) is enabled, hardware
	detected an output address (i.e. address after remapping) in the
	interrupt address range (0xFEEx_xxxx). For Translated requests and
	requests with pass-through translation type (TT=10), the output
	address is the same as the address in the request

The last sentence in the first paragraph above just highlights the fact
that when the input address of a PT request is in the interrupt range,
it is blocked by LGN.4 or SGN.8 because the output address is also in
the interrupt range.
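The 0xFEEx_xxxx check itself is a simple mask compare; a minimal sketch (function name is illustrative only):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * The interrupt address range quoted above is 0xFEE0_0000-0xFEEF_FFFF
 * (0xFEEx_xxxx). For PT and Translated requests the output address
 * equals the input address, so the same mask compare covers both.
 */
static inline bool addr_in_interrupt_range(uint64_t addr)
{
    return (addr & 0xFFF00000ULL) == 0xFEE00000ULL;
}
```

An output address matching this mask is what raises LGN.4 (legacy mode) or SGN.8 (scalable mode) in the fault record.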

> +     * """
> +     *
> +     * We enable the per-AS memory region (iommu_ir_fault) for catching
> +     * the translation to the interrupt range through PASID + PT.
> +     */
> +    if (pt && as->pasid != PCI_NO_PASID) {
> +        memory_region_set_enabled(&as->iommu_ir_fault, true);
> +    } else {
> +        memory_region_set_enabled(&as->iommu_ir_fault, false);
> +    }
> +

Given the above, this should be a bug fix for the no-PASID case first,
and then applied to the PASID path too.

Thanks
Kevin


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 4/4] intel-iommu: PASID support
  2022-03-21  5:54 ` [PATCH V2 4/4] intel-iommu: PASID support Jason Wang
  2022-03-24  8:53   ` Tian, Kevin
  2022-03-28  7:03   ` Tian, Kevin
@ 2022-03-28  8:45   ` Yi Liu
  2022-03-29  4:54     ` Jason Wang
  2 siblings, 1 reply; 43+ messages in thread
From: Yi Liu @ 2022-03-28  8:45 UTC (permalink / raw)
  To: Jason Wang, mst, peterx; +Cc: yi.y.sun, qemu-devel



On 2022/3/21 13:54, Jason Wang wrote:
> This patch introduces ECAP_PASID via "x-pasid-mode". Based on the
> existing support for scalable mode, we need to implement the following
> missing parts:
> 
> 1) tag VTDAddressSpace with PASID and support IOMMU/DMA translation
>     with PASID

should it be tagged with bdf+pasid?

> 2) tag IOTLB with PASID
> 3) PASID cache and its flush
> 4) Fault recording with PASID
> 
> For simplicity:
> 
> 1) PASID cache is not implemented so we can simply implement the PASID
> cache flush as a nop.
> 2) Fault recording with PASID is not supported, NFR is not changed.

I think this doesn't work for a passthrough device. So we need to fail
QEMU if the user tries to expose such a vIOMMU together with a
passthrough device.

> All of the above is not mandatory and could be implemented in the
> future.
> 
> Note that though PASID-based IOMMU translation is ready, no device
> can issue PASID DMA right now. In this case, PCI_NO_PASID is used as
> PASID to identify the address w/ PASID. vtd_find_add_as() has been
> extended to provision address space with PASID which could be utilized
> by the future extension of PCI core to allow device model to use PASID
> based DMA translation.
> 
> This feature would be useful for:
> 
> 1) prototyping PASID support for devices like virtio
> 2) future vPASID work
> 3) future PRS and vSVA work
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>   hw/i386/intel_iommu.c          | 357 +++++++++++++++++++++++++--------
>   hw/i386/intel_iommu_internal.h |  14 +-
>   hw/i386/trace-events           |   2 +
>   include/hw/i386/intel_iommu.h  |   7 +-
>   include/hw/pci/pci_bus.h       |   2 +
>   5 files changed, 296 insertions(+), 86 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 82787f9850..13447fda16 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -58,6 +58,14 @@
>   struct vtd_as_key {
>       PCIBus *bus;
>       uint8_t devfn;
> +    uint32_t pasid;
> +};
> +
> +struct vtd_iotlb_key {
> +    uint16_t sid;
> +    uint32_t pasid;
> +    uint64_t gfn;
> +    uint32_t level;
>   };
>   
>   static void vtd_address_space_refresh_all(IntelIOMMUState *s);
> @@ -199,14 +207,24 @@ static inline gboolean vtd_as_has_map_notifier(VTDAddressSpace *as)
>   }
>   
>   /* GHashTable functions */
> -static gboolean vtd_uint64_equal(gconstpointer v1, gconstpointer v2)
> +static gboolean vtd_iotlb_equal(gconstpointer v1, gconstpointer v2)
>   {
> -    return *((const uint64_t *)v1) == *((const uint64_t *)v2);
> +    const struct vtd_iotlb_key *key1 = v1;
> +    const struct vtd_iotlb_key *key2 = v2;
> +
> +    return key1->sid == key2->sid &&
> +           key1->pasid == key2->pasid &&
> +           key1->level == key2->level &&
> +           key1->gfn == key2->gfn;
>   }
>   
> -static guint vtd_uint64_hash(gconstpointer v)
> +static guint vtd_iotlb_hash(gconstpointer v)
>   {
> -    return (guint)*(const uint64_t *)v;
> +    const struct vtd_iotlb_key *key = v;
> +
> +    return key->gfn | ((key->sid) << VTD_IOTLB_SID_SHIFT) |
> +           (key->level) << VTD_IOTLB_LVL_SHIFT |
> +           (key->pasid) << VTD_IOTLB_PASID_SHIFT;
>   }
>   
>   static gboolean vtd_as_equal(gconstpointer v1, gconstpointer v2)
> @@ -214,7 +232,8 @@ static gboolean vtd_as_equal(gconstpointer v1, gconstpointer v2)
>       const struct vtd_as_key *key1 = v1;
>       const struct vtd_as_key *key2 = v2;
>   
> -    return (key1->bus == key2->bus) && (key1->devfn == key2->devfn);
> +    return (key1->bus == key2->bus) && (key1->devfn == key2->devfn) &&
> +           (key1->pasid == key2->pasid);
>   }
>   
>   static inline uint16_t vtd_make_source_id(uint8_t bus_num, uint8_t devfn)
> @@ -306,13 +325,6 @@ static void vtd_reset_caches(IntelIOMMUState *s)
>       vtd_iommu_unlock(s);
>   }
>   
> -static uint64_t vtd_get_iotlb_key(uint64_t gfn, uint16_t source_id,
> -                                  uint32_t level)
> -{
> -    return gfn | ((uint64_t)(source_id) << VTD_IOTLB_SID_SHIFT) |
> -           ((uint64_t)(level) << VTD_IOTLB_LVL_SHIFT);
> -}
> -
>   static uint64_t vtd_get_iotlb_gfn(hwaddr addr, uint32_t level)
>   {
>       return (addr & vtd_slpt_level_page_mask(level)) >> VTD_PAGE_SHIFT_4K;
> @@ -320,15 +332,17 @@ static uint64_t vtd_get_iotlb_gfn(hwaddr addr, uint32_t level)
>   
>   /* Must be called with IOMMU lock held */
>   static VTDIOTLBEntry *vtd_lookup_iotlb(IntelIOMMUState *s, uint16_t source_id,
> -                                       hwaddr addr)
> +                                       hwaddr addr, uint32_t pasid)
>   {
> +    struct vtd_iotlb_key key;
>       VTDIOTLBEntry *entry;
> -    uint64_t key;
>       int level;
>   
>       for (level = VTD_SL_PT_LEVEL; level < VTD_SL_PML4_LEVEL; level++) {
> -        key = vtd_get_iotlb_key(vtd_get_iotlb_gfn(addr, level),
> -                                source_id, level);
> +        key.gfn = vtd_get_iotlb_gfn(addr, level);
> +        key.level = level;
> +        key.sid = source_id;
> +        key.pasid = pasid;
>           entry = g_hash_table_lookup(s->iotlb, &key);
>           if (entry) {
>               goto out;
> @@ -342,10 +356,11 @@ out:
>   /* Must be with IOMMU lock held */
>   static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
>                                uint16_t domain_id, hwaddr addr, uint64_t slpte,
> -                             uint8_t access_flags, uint32_t level)
> +                             uint8_t access_flags, uint32_t level,
> +                             uint32_t pasid)
>   {
>       VTDIOTLBEntry *entry = g_malloc(sizeof(*entry));
> -    uint64_t *key = g_malloc(sizeof(*key));
> +    struct vtd_iotlb_key *key = g_malloc(sizeof(*key));
>       uint64_t gfn = vtd_get_iotlb_gfn(addr, level);
>   
>       trace_vtd_iotlb_page_update(source_id, addr, slpte, domain_id);
> @@ -359,7 +374,13 @@ static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
>       entry->slpte = slpte;
>       entry->access_flags = access_flags;
>       entry->mask = vtd_slpt_level_page_mask(level);
> -    *key = vtd_get_iotlb_key(gfn, source_id, level);
> +    entry->pasid = pasid;
> +
> +    key->gfn = gfn;
> +    key->sid = source_id;
> +    key->level = level;
> +    key->pasid = pasid;
> +
>       g_hash_table_replace(s->iotlb, key, entry);
>   }
>   
> @@ -823,13 +844,15 @@ static int vtd_get_pe_from_pasid_table(IntelIOMMUState *s,
>   
>   static int vtd_ce_get_rid2pasid_entry(IntelIOMMUState *s,
>                                         VTDContextEntry *ce,
> -                                      VTDPASIDEntry *pe)
> +                                      VTDPASIDEntry *pe,
> +                                      uint32_t pasid)
>   {
> -    uint32_t pasid;
>       dma_addr_t pasid_dir_base;
>       int ret = 0;
>   
> -    pasid = VTD_CE_GET_RID2PASID(ce);
> +    if (pasid == PCI_NO_PASID) {
> +        pasid = VTD_CE_GET_RID2PASID(ce);
> +    }
>       pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(ce);
>       ret = vtd_get_pe_from_pasid_table(s, pasid_dir_base, pasid, pe);
>   
> @@ -838,15 +861,17 @@ static int vtd_ce_get_rid2pasid_entry(IntelIOMMUState *s,
>   
>   static int vtd_ce_get_pasid_fpd(IntelIOMMUState *s,
>                                   VTDContextEntry *ce,
> -                                bool *pe_fpd_set)
> +                                bool *pe_fpd_set,
> +                                uint32_t pasid)
>   {
>       int ret;
> -    uint32_t pasid;
>       dma_addr_t pasid_dir_base;
>       VTDPASIDDirEntry pdire;
>       VTDPASIDEntry pe;
>   
> -    pasid = VTD_CE_GET_RID2PASID(ce);
> +    if (pasid == PCI_NO_PASID) {
> +        pasid = VTD_CE_GET_RID2PASID(ce);
> +    }
>       pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(ce);
>   
>       /*
> @@ -892,12 +917,13 @@ static inline uint32_t vtd_ce_get_level(VTDContextEntry *ce)
>   }
>   
>   static uint32_t vtd_get_iova_level(IntelIOMMUState *s,
> -                                   VTDContextEntry *ce)
> +                                   VTDContextEntry *ce,
> +                                   uint32_t pasid)
>   {
>       VTDPASIDEntry pe;
>   
>       if (s->root_scalable) {
> -        vtd_ce_get_rid2pasid_entry(s, ce, &pe);
> +        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
>           return VTD_PE_GET_LEVEL(&pe);
>       }
>   
> @@ -910,12 +936,13 @@ static inline uint32_t vtd_ce_get_agaw(VTDContextEntry *ce)
>   }
>   
>   static uint32_t vtd_get_iova_agaw(IntelIOMMUState *s,
> -                                  VTDContextEntry *ce)
> +                                  VTDContextEntry *ce,
> +                                  uint32_t pasid)
>   {
>       VTDPASIDEntry pe;
>   
>       if (s->root_scalable) {
> -        vtd_ce_get_rid2pasid_entry(s, ce, &pe);
> +        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
>           return 30 + ((pe.val[0] >> 2) & VTD_SM_PASID_ENTRY_AW) * 9;
>       }
>   
> @@ -957,31 +984,33 @@ static inline bool vtd_ce_type_check(X86IOMMUState *x86_iommu,
>   }
>   
>   static inline uint64_t vtd_iova_limit(IntelIOMMUState *s,
> -                                      VTDContextEntry *ce, uint8_t aw)
> +                                      VTDContextEntry *ce, uint8_t aw,
> +                                      uint32_t pasid)
>   {
> -    uint32_t ce_agaw = vtd_get_iova_agaw(s, ce);
> +    uint32_t ce_agaw = vtd_get_iova_agaw(s, ce, pasid);
>       return 1ULL << MIN(ce_agaw, aw);
>   }
>   
>   /* Return true if IOVA passes range check, otherwise false. */
>   static inline bool vtd_iova_range_check(IntelIOMMUState *s,
>                                           uint64_t iova, VTDContextEntry *ce,
> -                                        uint8_t aw)
> +                                        uint8_t aw, uint32_t pasid)
>   {
>       /*
>        * Check if @iova is above 2^X-1, where X is the minimum of MGAW
>        * in CAP_REG and AW in context-entry.
>        */
> -    return !(iova & ~(vtd_iova_limit(s, ce, aw) - 1));
> +    return !(iova & ~(vtd_iova_limit(s, ce, aw, pasid) - 1));
>   }
>   
>   static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
> -                                          VTDContextEntry *ce)
> +                                          VTDContextEntry *ce,
> +                                          uint32_t pasid)
>   {
>       VTDPASIDEntry pe;
>   
>       if (s->root_scalable) {
> -        vtd_ce_get_rid2pasid_entry(s, ce, &pe);
> +        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
>           return pe.val[0] & VTD_SM_PASID_ENTRY_SLPTPTR;
>       }
>   
> @@ -1015,16 +1044,17 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
>   static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
>                                uint64_t iova, bool is_write,
>                                uint64_t *slptep, uint32_t *slpte_level,
> -                             bool *reads, bool *writes, uint8_t aw_bits)
> +                             bool *reads, bool *writes, uint8_t aw_bits,
> +                             uint32_t pasid)
>   {
> -    dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce);
> -    uint32_t level = vtd_get_iova_level(s, ce);
> +    dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce, pasid);
> +    uint32_t level = vtd_get_iova_level(s, ce, pasid);
>       uint32_t offset;
>       uint64_t slpte;
>       uint64_t access_right_check;
>       uint64_t xlat, size;
>   
> -    if (!vtd_iova_range_check(s, iova, ce, aw_bits)) {
> +    if (!vtd_iova_range_check(s, iova, ce, aw_bits, pasid)) {
>           error_report_once("%s: detected IOVA overflow (iova=0x%" PRIx64 ")",
>                             __func__, iova);
>           return -VTD_FR_ADDR_BEYOND_MGAW;
> @@ -1040,7 +1070,7 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
>           if (slpte == (uint64_t)-1) {
>               error_report_once("%s: detected read error on DMAR slpte "
>                                 "(iova=0x%" PRIx64 ")", __func__, iova);
> -            if (level == vtd_get_iova_level(s, ce)) {
> +            if (level == vtd_get_iova_level(s, ce, pasid)) {
>                   /* Invalid programming of context-entry */
>                   return -VTD_FR_CONTEXT_ENTRY_INV;
>               } else {
> @@ -1304,18 +1334,19 @@ next:
>    */
>   static int vtd_page_walk(IntelIOMMUState *s, VTDContextEntry *ce,
>                            uint64_t start, uint64_t end,
> -                         vtd_page_walk_info *info)
> +                         vtd_page_walk_info *info,
> +                         uint32_t pasid)
>   {
> -    dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce);
> -    uint32_t level = vtd_get_iova_level(s, ce);
> +    dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce, pasid);
> +    uint32_t level = vtd_get_iova_level(s, ce, pasid);
>   
> -    if (!vtd_iova_range_check(s, start, ce, info->aw)) {
> +    if (!vtd_iova_range_check(s, start, ce, info->aw, pasid)) {
>           return -VTD_FR_ADDR_BEYOND_MGAW;
>       }
>   
> -    if (!vtd_iova_range_check(s, end, ce, info->aw)) {
> +    if (!vtd_iova_range_check(s, end, ce, info->aw, pasid)) {
>           /* Fix end so that it reaches the maximum */
> -        end = vtd_iova_limit(s, ce, info->aw);
> +        end = vtd_iova_limit(s, ce, info->aw, pasid);
>       }
>   
>       return vtd_page_walk_level(addr, start, end, level, true, true, info);
> @@ -1383,7 +1414,7 @@ static int vtd_ce_rid2pasid_check(IntelIOMMUState *s,
>        * has valid rid2pasid setting, which includes valid
>        * rid2pasid field and corresponding pasid entry setting
>        */
> -    return vtd_ce_get_rid2pasid_entry(s, ce, &pe);
> +    return vtd_ce_get_rid2pasid_entry(s, ce, &pe, PCI_NO_PASID);
>   }
>   
>   /* Map a device to its corresponding domain (context-entry) */
> @@ -1466,12 +1497,13 @@ static int vtd_sync_shadow_page_hook(IOMMUTLBEvent *event,
>   }
>   
>   static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
> -                                  VTDContextEntry *ce)
> +                                  VTDContextEntry *ce,
> +                                  uint32_t pasid)
>   {
>       VTDPASIDEntry pe;
>   
>       if (s->root_scalable) {
> -        vtd_ce_get_rid2pasid_entry(s, ce, &pe);
> +        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
>           return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
>       }
>   
> @@ -1489,10 +1521,10 @@ static int vtd_sync_shadow_page_table_range(VTDAddressSpace *vtd_as,
>           .notify_unmap = true,
>           .aw = s->aw_bits,
>           .as = vtd_as,
> -        .domain_id = vtd_get_domain_id(s, ce),
> +        .domain_id = vtd_get_domain_id(s, ce, vtd_as->pasid),
>       };
>   
> -    return vtd_page_walk(s, ce, addr, addr + size, &info);
> +    return vtd_page_walk(s, ce, addr, addr + size, &info, vtd_as->pasid);
>   }
>   
>   static int vtd_sync_shadow_page_table(VTDAddressSpace *vtd_as)
> @@ -1536,13 +1568,14 @@ static int vtd_sync_shadow_page_table(VTDAddressSpace *vtd_as)
>    * 1st-level translation or 2nd-level translation, it depends
>    * on PGTT setting.
>    */
> -static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce)
> +static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
> +                               uint32_t pasid)
>   {
>       VTDPASIDEntry pe;
>       int ret;
>   
>       if (s->root_scalable) {
> -        ret = vtd_ce_get_rid2pasid_entry(s, ce, &pe);
> +        ret = vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
>           if (ret) {
>               /*
>                * This error is guest triggerable. We should assume PT
> @@ -1578,19 +1611,20 @@ static bool vtd_as_pt_enabled(VTDAddressSpace *as)
>           return false;
>       }
>   
> -    return vtd_dev_pt_enabled(s, &ce);
> +    return vtd_dev_pt_enabled(s, &ce, as->pasid);
>   }
>   
>   /* Return whether the device is using IOMMU translation. */
>   static bool vtd_switch_address_space(VTDAddressSpace *as)
>   {
> -    bool use_iommu;
> +    bool use_iommu, pt;
>       /* Whether we need to take the BQL on our own */
>       bool take_bql = !qemu_mutex_iothread_locked();
>   
>       assert(as);
>   
>       use_iommu = as->iommu_state->dmar_enabled && !vtd_as_pt_enabled(as);
> +    pt = as->iommu_state->dmar_enabled && vtd_as_pt_enabled(as);
>   
>       trace_vtd_switch_address_space(pci_bus_num(as->bus),
>                                      VTD_PCI_SLOT(as->devfn),
> @@ -1610,11 +1644,53 @@ static bool vtd_switch_address_space(VTDAddressSpace *as)
>       if (use_iommu) {
>           memory_region_set_enabled(&as->nodmar, false);
>           memory_region_set_enabled(MEMORY_REGION(&as->iommu), true);
> +        /*
> +         * vt-d spec v3.4 3.14:
> +         *
> +         * """
> +         * Requests-with-PASID with input address in range 0xFEEx_xxxx
> +         * are translated normally like any other request-with-PASID
> +         * through DMA-remapping hardware.
> +         * """
> +         *
> +         * We need to disable ir for the as with PASID.
> +         */
> +        if (as->pasid != PCI_NO_PASID) {
> +            memory_region_set_enabled(&as->iommu_ir, false);
> +        } else {
> +            memory_region_set_enabled(&as->iommu_ir, true);
> +        }
>       } else {
>           memory_region_set_enabled(MEMORY_REGION(&as->iommu), false);
>           memory_region_set_enabled(&as->nodmar, true);
>       }
>   
> +    /*
> +     * vtd-spec v3.4 3.14:
> +     *
> +     * """
> +     * Requests-with-PASID with input address in range 0xFEEx_xxxx are
> +     * translated normally like any other request-with-PASID through
> +     * DMA-remapping hardware. However, if such a request is processed
> +     * using pass-through translation, it will be blocked as described
> +     * in the paragraph below.
> +     *
> +     * Software must not program paging-structure entries to remap any
> +     * address to the interrupt address range. Untranslated requests
> +     * and translation requests that result in an address in the
> +     * interrupt range will be blocked with condition code LGN.4 or
> +     * SGN.8.
> +     * """
> +     *
> +     * We enable a per-as memory region (iommu_ir_fault) for catching
> +     * the translation of the interrupt range through PASID + PT.
> +     */
> +    if (pt && as->pasid != PCI_NO_PASID) {
> +        memory_region_set_enabled(&as->iommu_ir_fault, true);
> +    } else {
> +        memory_region_set_enabled(&as->iommu_ir_fault, false);
> +    }
> +
>       if (take_bql) {
>           qemu_mutex_unlock_iothread();
>       }
> @@ -1747,13 +1823,14 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>       uint8_t bus_num = pci_bus_num(bus);
>       VTDContextCacheEntry *cc_entry;
>       uint64_t slpte, page_mask;
> -    uint32_t level;
> +    uint32_t level, pasid = vtd_as->pasid;
>       uint16_t source_id = vtd_make_source_id(bus_num, devfn);
>       int ret_fr;
>       bool is_fpd_set = false;
>       bool reads = true;
>       bool writes = true;
>       uint8_t access_flags;
> +    bool rid2pasid = (pasid == PCI_NO_PASID) && s->root_scalable;
>       VTDIOTLBEntry *iotlb_entry;
>   
>       /*
> @@ -1766,15 +1843,17 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>   
>       cc_entry = &vtd_as->context_cache_entry;
>   
> -    /* Try to fetch slpte form IOTLB */
> -    iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
> -    if (iotlb_entry) {
> -        trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->slpte,
> -                                 iotlb_entry->domain_id);
> -        slpte = iotlb_entry->slpte;
> -        access_flags = iotlb_entry->access_flags;
> -        page_mask = iotlb_entry->mask;
> -        goto out;
> +    /* Try to fetch slpte from IOTLB; we don't need RID2PASID logic here */
> +    if (!rid2pasid) {
> +        iotlb_entry = vtd_lookup_iotlb(s, source_id, addr, pasid);
> +        if (iotlb_entry) {
> +            trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->slpte,
> +                                     iotlb_entry->domain_id);
> +            slpte = iotlb_entry->slpte;
> +            access_flags = iotlb_entry->access_flags;
> +            page_mask = iotlb_entry->mask;
> +            goto out;
> +        }
>       }
>   
>       /* Try to fetch context-entry from cache first */
> @@ -1785,7 +1864,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>           ce = cc_entry->context_entry;
>           is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
>           if (!is_fpd_set && s->root_scalable) {
> -            ret_fr = vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set);
> +            ret_fr = vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set, pasid);
>               if (ret_fr) {
>                   vtd_qualify_report_fault(s, -ret_fr, is_fpd_set,
>                                            source_id, addr, is_write);
> @@ -1796,7 +1875,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>           ret_fr = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
>           is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
>           if (!ret_fr && !is_fpd_set && s->root_scalable) {
> -            ret_fr = vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set);
> +            ret_fr = vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set, pasid);
>           }
>           if (ret_fr) {
>               vtd_qualify_report_fault(s, -ret_fr, is_fpd_set,
> @@ -1811,11 +1890,15 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>           cc_entry->context_cache_gen = s->context_cache_gen;
>       }
>   
> +    if (rid2pasid) {
> +        pasid = VTD_CE_GET_RID2PASID(&ce);
> +    }
> +
>       /*
>        * We don't need to translate for pass-through context entries.
>        * Also, let's ignore IOTLB caching as well for PT devices.
>        */
> -    if (vtd_dev_pt_enabled(s, &ce)) {
> +    if (vtd_dev_pt_enabled(s, &ce, pasid)) {
>           entry->iova = addr & VTD_PAGE_MASK_4K;
>           entry->translated_addr = entry->iova;
>           entry->addr_mask = ~VTD_PAGE_MASK_4K;
> @@ -1836,8 +1919,21 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>           return true;
>       }
>   
> +    /* Try to fetch slpte from IOTLB for the RID2PASID slow path */
> +    if (rid2pasid) {
> +        iotlb_entry = vtd_lookup_iotlb(s, source_id, addr, pasid);
> +        if (iotlb_entry) {
> +            trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->slpte,
> +                                     iotlb_entry->domain_id);
> +            slpte = iotlb_entry->slpte;
> +            access_flags = iotlb_entry->access_flags;
> +            page_mask = iotlb_entry->mask;
> +            goto out;
> +        }
> +    }
> +
>       ret_fr = vtd_iova_to_slpte(s, &ce, addr, is_write, &slpte, &level,
> -                               &reads, &writes, s->aw_bits);
> +                               &reads, &writes, s->aw_bits, pasid);
>       if (ret_fr) {
>           vtd_qualify_report_fault(s, -ret_fr, is_fpd_set, source_id,
>                                    addr, is_write);
> @@ -1846,8 +1942,8 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>   
>       page_mask = vtd_slpt_level_page_mask(level);
>       access_flags = IOMMU_ACCESS_FLAG(reads, writes);
> -    vtd_update_iotlb(s, source_id, vtd_get_domain_id(s, &ce), addr, slpte,
> -                     access_flags, level);
> +    vtd_update_iotlb(s, source_id, vtd_get_domain_id(s, &ce, pasid),
> +                     addr, slpte, access_flags, level, pasid);
>   out:
>       vtd_iommu_unlock(s);
>       entry->iova = addr & page_mask;
> @@ -2039,7 +2135,7 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
>       QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
>           if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>                                         vtd_as->devfn, &ce) &&
> -            domain_id == vtd_get_domain_id(s, &ce)) {
> +            domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
>               vtd_sync_shadow_page_table(vtd_as);
>           }
>       }
> @@ -2047,7 +2143,7 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
>   
>   static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
>                                              uint16_t domain_id, hwaddr addr,
> -                                           uint8_t am)
> +                                             uint8_t am, uint32_t pasid)
>   {
>       VTDAddressSpace *vtd_as;
>       VTDContextEntry ce;
> @@ -2055,9 +2151,11 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
>       hwaddr size = (1 << am) * VTD_PAGE_SIZE;
>   
>       QLIST_FOREACH(vtd_as, &(s->vtd_as_with_notifiers), next) {
> +        if (pasid != PCI_NO_PASID && pasid != vtd_as->pasid)
> +            continue;
>           ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>                                          vtd_as->devfn, &ce);
> -        if (!ret && domain_id == vtd_get_domain_id(s, &ce)) {
> +        if (!ret && domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
>               if (vtd_as_has_map_notifier(vtd_as)) {
>                   /*
>                    * As long as we have MAP notifications registered in
> @@ -2101,7 +2199,7 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
>       vtd_iommu_lock(s);
>       g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
>       vtd_iommu_unlock(s);
> -    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
> +    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am, PCI_NO_PASID);
>   }
>   
>   /* Flush IOTLB
> @@ -3168,6 +3266,7 @@ static Property vtd_properties[] = {
>       DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
>       DEFINE_PROP_BOOL("x-scalable-mode", IntelIOMMUState, scalable_mode, FALSE),
>       DEFINE_PROP_BOOL("snoop-control", IntelIOMMUState, snoop_control, false),
> +    DEFINE_PROP_BOOL("x-pasid-mode", IntelIOMMUState, pasid, false),
>       DEFINE_PROP_BOOL("dma-drain", IntelIOMMUState, dma_drain, true),
>       DEFINE_PROP_END_OF_LIST(),
>   };
> @@ -3441,7 +3540,63 @@ static const MemoryRegionOps vtd_mem_ir_ops = {
>       },
>   };
>   
> -VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
> +static void vtd_report_ir_illegal_access(VTDAddressSpace *vtd_as,
> +                                         hwaddr addr, bool is_write)
> +{
> +    IntelIOMMUState *s = vtd_as->iommu_state;
> +    uint8_t bus_n = pci_bus_num(vtd_as->bus);
> +    uint16_t sid = vtd_make_source_id(bus_n, vtd_as->devfn);
> +    bool is_fpd_set = false;
> +    VTDContextEntry ce;
> +
> +    assert(vtd_as->pasid != PCI_NO_PASID);
> +
> +    /* Try our best to fetch the FPD; we can't do anything more */
> +    if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
> +        is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
> +        if (!is_fpd_set && s->root_scalable) {
> +            vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set, vtd_as->pasid);
> +        }
> +    }
> +
> +    vtd_qualify_report_fault(s, VTD_FR_SM_INTERRUPT_ADDR,
> +                             is_fpd_set, sid, addr, is_write);
> +}
> +
> +static MemTxResult vtd_mem_ir_fault_read(void *opaque, hwaddr addr,
> +                                         uint64_t *data, unsigned size,
> +                                         MemTxAttrs attrs)
> +{
> +    vtd_report_ir_illegal_access(opaque, addr, false);
> +
> +    return MEMTX_ERROR;
> +}
> +
> +static MemTxResult vtd_mem_ir_fault_write(void *opaque, hwaddr addr,
> +                                          uint64_t value, unsigned size,
> +                                          MemTxAttrs attrs)
> +{
> +    vtd_report_ir_illegal_access(opaque, addr, true);
> +
> +    return MEMTX_ERROR;
> +}
> +
> +static const MemoryRegionOps vtd_mem_ir_fault_ops = {
> +    .read_with_attrs = vtd_mem_ir_fault_read,
> +    .write_with_attrs = vtd_mem_ir_fault_write,
> +    .endianness = DEVICE_LITTLE_ENDIAN,
> +    .impl = {
> +        .min_access_size = 4,
> +        .max_access_size = 4,
> +    },
> +    .valid = {
> +        .min_access_size = 4,
> +        .max_access_size = 4,
> +    },
> +};
> +
> +VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
> +                                 int devfn, unsigned int pasid)
>   {
>       /*
>        * We can't simply use sid here since the bus number might not be
> @@ -3450,6 +3605,7 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>       struct vtd_as_key key = {
>           .bus = bus,
>           .devfn = devfn,
> +        .pasid = pasid,
>       };
>       VTDAddressSpace *vtd_dev_as;
>       char name[128];
> @@ -3460,13 +3616,21 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>   
>           new_key->bus = bus;
>           new_key->devfn = devfn;
> +        new_key->pasid = pasid;
> +
> +        if (pasid == PCI_NO_PASID) {
> +            snprintf(name, sizeof(name), "vtd-%02x.%x", PCI_SLOT(devfn),
> +                     PCI_FUNC(devfn));
> +        } else {
> +            snprintf(name, sizeof(name), "vtd-%02x.%x-pasid-%x", PCI_SLOT(devfn),
> +                     PCI_FUNC(devfn), pasid);
> +        }
>   
> -        snprintf(name, sizeof(name), "vtd-%02x.%x", PCI_SLOT(devfn),
> -                 PCI_FUNC(devfn));
>           vtd_dev_as = g_malloc0(sizeof(VTDAddressSpace));
>   
>           vtd_dev_as->bus = bus;
>           vtd_dev_as->devfn = (uint8_t)devfn;
> +        vtd_dev_as->pasid = pasid;
>           vtd_dev_as->iommu_state = s;
>           vtd_dev_as->context_cache_entry.context_cache_gen = 0;
>           vtd_dev_as->iova_tree = iova_tree_new();
> @@ -3507,6 +3671,24 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>                                               VTD_INTERRUPT_ADDR_FIRST,
>                                               &vtd_dev_as->iommu_ir, 1);
>   
> +        /*
> +         * This region is used to catch faults on accesses to the
> +         * interrupt range via passthrough + PASID. See also
> +         * vtd_switch_address_space(). We can't use an alias since we
> +         * need to know the sid, which is valid for MSIs that use
> +         * bus_master_as (see msi_send_message()).
> +         */
> +        memory_region_init_io(&vtd_dev_as->iommu_ir_fault, OBJECT(s),
> +                              &vtd_mem_ir_fault_ops, vtd_dev_as, "vtd-no-ir",
> +                              VTD_INTERRUPT_ADDR_SIZE);
> +        /*
> +         * Hook this to the root region, since vtd_dev_as->iommu
> +         * will be disabled when PT is enabled.
> +         */
> +        memory_region_add_subregion_overlap(MEMORY_REGION(&vtd_dev_as->root),
> +                                            VTD_INTERRUPT_ADDR_FIRST,
> +                                            &vtd_dev_as->iommu_ir_fault, 2);
> +
>           /*
>            * Hook both the containers under the root container, we
>            * switch between DMAR & noDMAR by enable/disable
> @@ -3627,7 +3809,7 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>                                     "legacy mode",
>                                     bus_n, PCI_SLOT(vtd_as->devfn),
>                                     PCI_FUNC(vtd_as->devfn),
> -                                  vtd_get_domain_id(s, &ce),
> +                                  vtd_get_domain_id(s, &ce, vtd_as->pasid),
>                                     ce.hi, ce.lo);
>           if (vtd_as_has_map_notifier(vtd_as)) {
>               /* This is required only for MAP typed notifiers */
> @@ -3637,10 +3819,10 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>                   .notify_unmap = false,
>                   .aw = s->aw_bits,
>                   .as = vtd_as,
> -                .domain_id = vtd_get_domain_id(s, &ce),
> +                .domain_id = vtd_get_domain_id(s, &ce, vtd_as->pasid),
>               };
>   
> -            vtd_page_walk(s, &ce, 0, ~0ULL, &info);
> +            vtd_page_walk(s, &ce, 0, ~0ULL, &info, vtd_as->pasid);
>           }
>       } else {
>           trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
> @@ -3735,6 +3917,10 @@ static void vtd_init(IntelIOMMUState *s)
>           s->ecap |= VTD_ECAP_SC;
>       }
>   
> +    if (s->pasid) {
> +        s->ecap |= VTD_ECAP_PASID;
> +    }
> +
>       vtd_reset_caches(s);
>   
>       /* Define registers with default values and bit semantics */
> @@ -3808,7 +3994,7 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>   
>       assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
>   
> -    vtd_as = vtd_find_add_as(s, bus, devfn);
> +    vtd_as = vtd_find_add_as(s, bus, devfn, PCI_NO_PASID);
>       return &vtd_as->as;
>   }
>   
> @@ -3851,6 +4037,11 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
>           return false;
>       }
>   
> +    if (s->pasid && !s->scalable_mode) {
> +        error_setg(errp, "Need to set scalable mode for PASID");
> +        return false;
I guess your point is that if the pasid capability is set, scalable mode
is required, right? You also need to set the pasid size in the ecap
register when exposing the pasid capability to the guest.

39:35 RO X PSS: PASID Size Supported


> +    }
> +
>       return true;
>   }
>   
> @@ -3913,7 +4104,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
>   
>       sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->csrmem);
>       /* No corresponding destroy */
> -    s->iotlb = g_hash_table_new_full(vtd_uint64_hash, vtd_uint64_equal,
> +    s->iotlb = g_hash_table_new_full(vtd_iotlb_hash, vtd_iotlb_equal,
>                                        g_free, g_free);
>       s->vtd_as = g_hash_table_new_full(vtd_as_hash, vtd_as_equal,
>                                         g_free, g_free);
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 930ce61feb..f6d1fae79b 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -114,8 +114,9 @@
>                                        VTD_INTERRUPT_ADDR_FIRST + 1)
>   
>   /* The shift of source_id in the key of IOTLB hash table */
> -#define VTD_IOTLB_SID_SHIFT         36
> -#define VTD_IOTLB_LVL_SHIFT         52
> +#define VTD_IOTLB_SID_SHIFT         20
> +#define VTD_IOTLB_LVL_SHIFT         28
> +#define VTD_IOTLB_PASID_SHIFT       30
>   #define VTD_IOTLB_MAX_SIZE          1024    /* Max size of the hash table */
>   
>   /* IOTLB_REG */
> @@ -191,6 +192,7 @@
>   #define VTD_ECAP_SC                 (1ULL << 7)
>   #define VTD_ECAP_MHMV               (15ULL << 20)
>   #define VTD_ECAP_SRS                (1ULL << 31)
> +#define VTD_ECAP_PASID              (1ULL << 40)
>   #define VTD_ECAP_SMTS               (1ULL << 43)
>   #define VTD_ECAP_SLTS               (1ULL << 46)
>   
> @@ -211,6 +213,8 @@
>   #define VTD_CAP_DRAIN_READ          (1ULL << 55)
>   #define VTD_CAP_DRAIN               (VTD_CAP_DRAIN_READ | VTD_CAP_DRAIN_WRITE)
>   #define VTD_CAP_CM                  (1ULL << 7)
> +#define VTD_PASID_ID_SHIFT          20
> +#define VTD_PASID_ID_MASK           ((1ULL << VTD_PASID_ID_SHIFT) - 1)
>   
>   /* Supported Adjusted Guest Address Widths */
>   #define VTD_CAP_SAGAW_SHIFT         8
> @@ -379,6 +383,11 @@ typedef union VTDInvDesc VTDInvDesc;
>   #define VTD_INV_DESC_IOTLB_AM(val)      ((val) & 0x3fULL)
>   #define VTD_INV_DESC_IOTLB_RSVD_LO      0xffffffff0000ff00ULL
>   #define VTD_INV_DESC_IOTLB_RSVD_HI      0xf80ULL
> +#define VTD_INV_DESC_IOTLB_PASID_PASID  (2ULL << 4)
> +#define VTD_INV_DESC_IOTLB_PASID_PAGE   (3ULL << 4)
> +#define VTD_INV_DESC_IOTLB_PASID(val)   (((val) >> 32) & VTD_PASID_ID_MASK)
> +#define VTD_INV_DESC_IOTLB_PASID_RSVD_LO      0xfff00000000001c0ULL
> +#define VTD_INV_DESC_IOTLB_PASID_RSVD_HI      0xf80ULL
>   
>   /* Mask for Device IOTLB Invalidate Descriptor */
>   #define VTD_INV_DESC_DEVICE_IOTLB_ADDR(val) ((val) & 0xfffffffffffff000ULL)
> @@ -413,6 +422,7 @@ typedef union VTDInvDesc VTDInvDesc;
>   /* Information about page-selective IOTLB invalidate */
>   struct VTDIOTLBPageInvInfo {
>       uint16_t domain_id;
> +    uint32_t pasid;
>       uint64_t addr;
>       uint8_t mask;
>   };
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index 5bf7e52bf5..57beff0c17 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -12,6 +12,8 @@ vtd_inv_desc_cc_devices(uint16_t sid, uint16_t fmask) "context invalidate device
>   vtd_inv_desc_iotlb_global(void) "iotlb invalidate global"
>   vtd_inv_desc_iotlb_domain(uint16_t domain) "iotlb invalidate whole domain 0x%"PRIx16
>   vtd_inv_desc_iotlb_pages(uint16_t domain, uint64_t addr, uint8_t mask) "iotlb invalidate domain 0x%"PRIx16" addr 0x%"PRIx64" mask 0x%"PRIx8
> +vtd_inv_desc_iotlb_pasid_pages(uint16_t domain, uint64_t addr, uint8_t mask, uint32_t pasid) "iotlb invalidate domain 0x%"PRIx16" addr 0x%"PRIx64" mask 0x%"PRIx8" pasid 0x%"PRIx32
> +vtd_inv_desc_iotlb_pasid(uint16_t domain, uint32_t pasid) "iotlb invalidate domain 0x%"PRIx16" pasid 0x%"PRIx32
>   vtd_inv_desc_wait_sw(uint64_t addr, uint32_t data) "wait invalidate status write addr 0x%"PRIx64" data 0x%"PRIx32
>   vtd_inv_desc_wait_irq(const char *msg) "%s"
>   vtd_inv_desc_wait_write_fail(uint64_t hi, uint64_t lo) "write fail for wait desc hi 0x%"PRIx64" lo 0x%"PRIx64
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index fa1bed353c..0d1029f366 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -97,11 +97,13 @@ struct VTDPASIDEntry {
>   struct VTDAddressSpace {
>       PCIBus *bus;
>       uint8_t devfn;
> +    uint32_t pasid;
>       AddressSpace as;
>       IOMMUMemoryRegion iommu;
>       MemoryRegion root;          /* The root container of the device */
>       MemoryRegion nodmar;        /* The alias of shared nodmar MR */
>       MemoryRegion iommu_ir;      /* Interrupt region: 0xfeeXXXXX */
> +    MemoryRegion iommu_ir_fault; /* Interrupt region for catching fault */
>       IntelIOMMUState *iommu_state;
>       VTDContextCacheEntry context_cache_entry;
>       QLIST_ENTRY(VTDAddressSpace) next;
> @@ -113,6 +115,7 @@ struct VTDAddressSpace {
>   struct VTDIOTLBEntry {
>       uint64_t gfn;
>       uint16_t domain_id;
> +    uint32_t pasid;
>       uint64_t slpte;
>       uint64_t mask;
>       uint8_t access_flags;
> @@ -260,6 +263,7 @@ struct IntelIOMMUState {
>       bool buggy_eim;                 /* Force buggy EIM unless eim=off */
>       uint8_t aw_bits;                /* Host/IOVA address width (in bits) */
>       bool dma_drain;                 /* Whether DMA r/w draining enabled */
> +    bool pasid;                     /* Whether to support PASID */
>   
>       /*
>        * Protects IOMMU states in general.  Currently it protects the
> @@ -271,6 +275,7 @@ struct IntelIOMMUState {
>   /* Find the VTD Address space associated with the given bus pointer,
>    * create a new one if none exists
>    */
> -VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn);
> +VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
> +                                 int devfn, unsigned int pasid);
>   
>   #endif
> diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
> index 347440d42c..cbfcf0b770 100644
> --- a/include/hw/pci/pci_bus.h
> +++ b/include/hw/pci/pci_bus.h
> @@ -26,6 +26,8 @@ enum PCIBusFlags {
>       PCI_BUS_EXTENDED_CONFIG_SPACE                           = 0x0002,
>   };
>   
> +#define PCI_NO_PASID UINT32_MAX
> +
>   struct PCIBus {
>       BusState qbus;
>       enum PCIBusFlags flags;

-- 
Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 1/4] intel-iommu: don't warn guest errors when getting rid2pasid entry
  2022-03-28  2:27     ` Jason Wang
@ 2022-03-28  8:53       ` Yi Liu
  2022-03-29  4:52         ` Jason Wang
  0 siblings, 1 reply; 43+ messages in thread
From: Yi Liu @ 2022-03-28  8:53 UTC (permalink / raw)
  To: Jason Wang, Tian, Kevin; +Cc: yi.y.sun, qemu-devel, peterx, mst



On 2022/3/28 10:27, Jason Wang wrote:
> On Thu, Mar 24, 2022 at 4:21 PM Tian, Kevin <kevin.tian@intel.com> wrote:
>>
>>> From: Jason Wang
>>> Sent: Monday, March 21, 2022 1:54 PM
>>>
>>> We used to warn on a wrong rid2pasid entry. But this error can be
>>> triggered by the guest and can happen during initialization. So
>>> let's not warn in this case.
>>>
>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>> ---
>>>   hw/i386/intel_iommu.c | 6 ++++--
>>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>> index 874d01c162..90964b201c 100644
>>> --- a/hw/i386/intel_iommu.c
>>> +++ b/hw/i386/intel_iommu.c
>>> @@ -1554,8 +1554,10 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState
>>> *s, VTDContextEntry *ce)
>>>       if (s->root_scalable) {
>>>           ret = vtd_ce_get_rid2pasid_entry(s, ce, &pe);
>>>           if (ret) {
>>> -            error_report_once("%s: vtd_ce_get_rid2pasid_entry error: %"PRId32,
>>> -                              __func__, ret);
>>> +            /*
>>> +             * This error is guest triggerable. We should assume PT
>>> +             * not enabled for safety.
>>> +             */
>>
>> suppose a VT-d fault should be queued in this case besides returning false:
>>
>> SPD.1: A hardware attempt to access the scalable-mode PASID-directory
>> entry referenced through the PASIDDIRPTR field in scalable-mode
>> context-entry resulted in an error
>>
>> SPT.1: A hardware attempt to access a scalable-mode PASID-table entry
>> referenced through the SMPTBLPTR field in a scalable-mode PASID-directory
>> entry resulted in an error.
> 
> Probably, but this issue is not introduced in this patch. We can fix
> it on top if necessary.

agreed.

>>
>> Currently the implementation of vtd_ce_get_rid2pasid_entry() is also
>> problematic. According to VT-d spec, RID2PASID field is effective only
>> when ecap.rps is true otherwise PASID#0 is used for RID2PASID. I didn't
>> see ecap.rps is set, neither is it checked in that function. It works possibly
>> just because Linux currently programs 0 to RID2PASID...
> 
> This seems to be another issue since the introduction of scalable mode.

Yes, this is not introduced in this series. The current scalable-mode
vIOMMU support followed the 3.0 spec, while RPS was added in 3.1. It
needs to be fixed.

> Thanks
> 
>>
>>>               return false;
>>>           }
>>>           return (VTD_PE_GET_TYPE(&pe) == VTD_SM_PASID_ENTRY_PT);
>>> --
>>> 2.25.1
>>>
>>
> 

-- 
Regards,
Yi Liu



* Re: [PATCH V2 4/4] intel-iommu: PASID support
  2022-03-28  6:47       ` Tian, Kevin
@ 2022-03-29  4:46         ` Jason Wang
  2022-03-30  8:00           ` Tian, Kevin
  0 siblings, 1 reply; 43+ messages in thread
From: Jason Wang @ 2022-03-29  4:46 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Liu, Yi L, yi.y.sun, qemu-devel, peterx, mst

On Mon, Mar 28, 2022 at 2:47 PM Tian, Kevin <kevin.tian@intel.com> wrote:
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Monday, March 28, 2022 10:31 AM
> >
> > On Thu, Mar 24, 2022 at 4:54 PM Tian, Kevin <kevin.tian@intel.com> wrote:
> > >
> > > > From: Jason Wang
> > > > Sent: Monday, March 21, 2022 1:54 PM
> > > >
> > > > This patch introduce ECAP_PASID via "x-pasid-mode". Based on the
> > > > existing support for scalable mode, we need to implement the following
> > > > missing parts:
> > > >
> > > > 1) tag VTDAddressSpace with PASID and support IOMMU/DMA
> > translation
> > > >    with PASID
> > > > 2) tag IOTLB with PASID
> > >
> > > and invalidate desc to flush PASID iotlb, which seems missing in this patch.
> >
> > It existed in the previous version, but it looks like it will be used
> > only for the first-level page table, which is not supported right now.
> > So I deleted the code.
>
> You are right. But there is also PASID-based device TLB invalidate descriptor
> which is orthogonal to 1st vs. 2nd level thing. If we don't want to break the
> spec with this series then there will need a way to prevent the user from
> setting both "device-iotlb" and "x-pasid-mode" together.

Right, let me do it in the next version.


>
> >
> > >
> > > > 3) PASID cache and its flush
> > > > 4) Fault recording with PASID
> > > >
> > > > For simplicity:
> > > >
> > > > 1) PASID cache is not implemented so we can simply implement the PASID
> > > > cache flush as a nop.
> > > > 2) Fault recording with PASID is not supported, NFR is not changed.
> > > >
> > > > All of the above is not mandatory and could be implemented in the
> > > > future.
> > >
> > > PASID cache is optional, but fault recording with PASID is required.
> >
> > Any pointer in the spec to say something like this? I think sticking
> > to the NFR would be sufficient.
>
> I don't remember any place in the spec saying that fault recording with
> PASID is not required when the PASID capability is exposed.

Ok, but as a spec it needs to clarify what is required for each capability.

> If there is certain fault
> triggered by a request with PASID, we do want to report this information
> upward.

I tend to do it incrementally on top of this series (anyhow, at least
RID2PASID was introduced before this series).

>
> btw can you elaborate why NFR matters to PASID? It is just about the
> number of fault recording register...

I might be wrong, but I thought without increasing NFR we may lack
sufficient room for reporting PASID.

>
> >
> > > I'm fine with adding it incrementally but want to clarify the concept first.
> >
> > Yes, that's the plan.
> >
>
> I have one open which requires your input.
>
> While incrementally enabling things is a common practice, one worry is
> that creating too many control knobs during the staging process may
> confuse the end user.

It should be fine as long as we use the "x-" prefix, which will
eventually be removed.

>
> Earlier when Yi proposed Qemu changes for guest SVA [1] he aimed for a
> coarse-grained knob design:
> --
>   Intel VT-d 3.0 introduces scalable mode, and it has a bunch of capabilities
>   related to scalable mode translation, thus there are multiple combinations.
>   While this vIOMMU implementation wants to simplify things for the user by
>   providing typical combinations. The user can configure it via the
>   "x-scalable-mode" option. The usage is as below:
>     "-device intel-iommu,x-scalable-mode=["legacy"|"modern"]"
>
>     - "legacy": gives support for SL page table
>     - "modern": gives support for FL page table, pasid, virtual command
>     -  if not configured, scalable mode is not supported; if improperly
>        configured, an error is thrown
> --
>
> Which way do you prefer to?
>
> [1] https://lists.gnu.org/archive/html/qemu-devel/2020-02/msg02805.html

My understanding is that, if we want to deploy QEMU in a production
environment, we can't use the "x-" prefix. We need a full
implementation of each capability.

E.g.:
-device intel-iommu,first-level=on,scalable-mode=on etc.

Thanks

>
> Thanks
> Kevin



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 4/4] intel-iommu: PASID support
  2022-03-28  7:03   ` Tian, Kevin
@ 2022-03-29  4:48     ` Jason Wang
  2022-03-30  8:02       ` Tian, Kevin
  0 siblings, 1 reply; 43+ messages in thread
From: Jason Wang @ 2022-03-29  4:48 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Liu, Yi L, yi.y.sun, qemu-devel, peterx, mst

On Mon, Mar 28, 2022 at 3:03 PM Tian, Kevin <kevin.tian@intel.com> wrote:
>
> > From: Jason Wang
> > Sent: Monday, March 21, 2022 1:54 PM
> >
> > +    /*
> > +     * vtd-spec v3.4 3.14:
> > +     *
> > +     * """
> > +     * Requests-with-PASID with input address in range 0xFEEx_xxxx are
> > +     * translated normally like any other request-with-PASID through
> > +     * DMA-remapping hardware. However, if such a request is processed
> > +     * using pass-through translation, it will be blocked as described
> > +     * in the paragraph below.
>
> While PASID+PT is blocked as described in the below paragraph, the
> paragraph itself applies to all situations:
>
>   1) PT + noPASID
>   2) translation + noPASID
>   3) PT + PASID
>   4) translation + PASID
>
> because...
>
> > +     *
> > +     * Software must not program paging-structure entries to remap any
> > +     * address to the interrupt address range. Untranslated requests
> > +     * and translation requests that result in an address in the
> > +     * interrupt range will be blocked with condition code LGN.4 or
> > +     * SGN.8.
>
> ... if you look at the definition of LGN.4 or SGN.8:
>
> LGN.4:  When legacy mode (RTADDR_REG.TTM=00b) is enabled, hardware
>         detected an output address (i.e. address after remapping) in the
>         interrupt address range (0xFEEx_xxxx). For Translated requests and
>         requests with pass-through translation type (TT=10), the output
>         address is the same as the address in the request
>
> The last sentence in the first paragraph above just highlights the fact that
> when input address of PT is in interrupt range then it is blocked by LGN.4
> or SGN.8 due to output address also in interrupt range.
>
> > +     * """
> > +     *
> > +     * We enable a per-AS memory region (iommu_ir_fault) for catching
> > +     * the translation to the interrupt range through PASID + PT.
> > +     */
> > +    if (pt && as->pasid != PCI_NO_PASID) {
> > +        memory_region_set_enabled(&as->iommu_ir_fault, true);
> > +    } else {
> > +        memory_region_set_enabled(&as->iommu_ir_fault, false);
> > +    }
> > +
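The spec passage quoted above boils down to a simple range test. A minimal illustrative sketch in Python (constant names are mine; the 0xFEEx_xxxx range and the LGN.4/SGN.8 blocking behavior are from the spec text quoted above):

```python
# VT-d interrupt address range: 0xFEE0_0000 .. 0xFEEF_FFFF (0xFEEx_xxxx).
VTD_INTERRUPT_ADDR_FIRST = 0xFEE00000
VTD_INTERRUPT_ADDR_LAST = 0xFEEFFFFF

def in_interrupt_range(addr: int) -> bool:
    return VTD_INTERRUPT_ADDR_FIRST <= addr <= VTD_INTERRUPT_ADDR_LAST

def pt_request_with_pasid_blocked(addr: int) -> bool:
    # With pass-through translation the output address equals the input
    # address, so a request-with-PASID whose input falls in the interrupt
    # range has its output in that range too and is blocked (LGN.4/SGN.8).
    return in_interrupt_range(addr)
```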
>
> Given above this should be a bug fix for nopasid first and then apply it
> to pasid path too.

Actually, the no-PASID path patches were posted here.

https://www.mail-archive.com/qemu-devel@nongnu.org/msg867878.html

Thanks

>
> Thanks
> Kevin
>




* Re: [PATCH V2 1/4] intel-iommu: don't warn guest errors when getting rid2pasid entry
  2022-03-28  8:53       ` Yi Liu
@ 2022-03-29  4:52         ` Jason Wang
  2022-03-30  8:16           ` Tian, Kevin
  2022-04-22  7:57           ` Michael S. Tsirkin
  0 siblings, 2 replies; 43+ messages in thread
From: Jason Wang @ 2022-03-29  4:52 UTC (permalink / raw)
  To: Yi Liu, Tian, Kevin; +Cc: yi.y.sun, qemu-devel, peterx, mst


On 2022/3/28 4:53 PM, Yi Liu wrote:
>
>
> On 2022/3/28 10:27, Jason Wang wrote:
>> On Thu, Mar 24, 2022 at 4:21 PM Tian, Kevin <kevin.tian@intel.com> wrote:
>>>
>>>> From: Jason Wang
>>>> Sent: Monday, March 21, 2022 1:54 PM
>>>>
>>>> We used to warn on a wrong rid2pasid entry. But this error can be
>>>> triggered by the guest and can happen during initialization. So
>>>> let's not warn in this case.
>>>>
>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>>> ---
>>>>   hw/i386/intel_iommu.c | 6 ++++--
>>>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>> index 874d01c162..90964b201c 100644
>>>> --- a/hw/i386/intel_iommu.c
>>>> +++ b/hw/i386/intel_iommu.c
>>>> @@ -1554,8 +1554,10 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState
>>>> *s, VTDContextEntry *ce)
>>>>       if (s->root_scalable) {
>>>>           ret = vtd_ce_get_rid2pasid_entry(s, ce, &pe);
>>>>           if (ret) {
>>>> -            error_report_once("%s: vtd_ce_get_rid2pasid_entry error: %"PRId32,
>>>> -                              __func__, ret);
>>>> +            /*
>>>> +             * This error is guest triggerable. We should assume PT
>>>> +             * is not enabled, for safety.
>>>> +             */
>>>
>>> I suppose a VT-d fault should be queued in this case besides returning
>>> false:
>>>
>>> SPD.1: A hardware attempt to access the scalable-mode PASID-directory
>>> entry referenced through the PASIDDIRPTR field in scalable-mode
>>> context-entry resulted in an error
>>>
>>> SPT.1: A hardware attempt to access a scalable-mode PASID-table entry
>>> referenced through the SMPTBLPTR field in a scalable-mode PASID-directory
>>> entry resulted in an error.
>>
>> Probably, but this issue is not introduced in this patch. We can fix
>> it on top if necessary.
>
> agreed.
>
>>>
>>> Currently the implementation of vtd_ce_get_rid2pasid_entry() is also
>>> problematic. According to the VT-d spec, the RID2PASID field is effective
>>> only when ecap.rps is true; otherwise PASID#0 is used for RID2PASID. I
>>> didn't see ecap.rps being set, nor is it checked in that function. It
>>> possibly works just because Linux currently programs 0 to RID2PASID...
>>
>> This seems to be another issue since the introduction of scalable mode.
>
> yes, this is not introduced by this series. The current scalable-mode
> vIOMMU support followed the 3.0 spec, while RPS was added in 3.1. It
> needs to be fixed.


Interesting, so this is more complicated when dealing with migration 
compatibility. So what I suggest is probably something like:

-device intel-iommu,version=$version

Then we can maintain migration compatibility correctly. For 3.0 we can
go without RPS; for 3.1 and above we need to implement RPS.

Since most of the advanced features have not been implemented, we could
probably start just from 3.4 (assuming it's the latest version). All of
the following effort should then be done against 3.4 in order to
productize it.

Thanks


>
>> Thanks
>>
>>>
>>>>               return false;
>>>>           }
>>>>           return (VTD_PE_GET_TYPE(&pe) == VTD_SM_PASID_ENTRY_PT);
>>>> -- 
>>>> 2.25.1
>>>>
>>>
>>
>




* Re: [PATCH V2 4/4] intel-iommu: PASID support
  2022-03-28  8:45   ` Yi Liu
@ 2022-03-29  4:54     ` Jason Wang
  2022-04-01 13:42       ` Yi Liu
  0 siblings, 1 reply; 43+ messages in thread
From: Jason Wang @ 2022-03-29  4:54 UTC (permalink / raw)
  To: Yi Liu, mst, peterx; +Cc: yi.y.sun, qemu-devel


On 2022/3/28 4:45 PM, Yi Liu wrote:
>
>
> On 2022/3/21 13:54, Jason Wang wrote:
>> This patch introduce ECAP_PASID via "x-pasid-mode". Based on the
>> existing support for scalable mode, we need to implement the following
>> missing parts:
>>
>> 1) tag VTDAddressSpace with PASID and support IOMMU/DMA translation
>>     with PASID
>
> should it be tagging with bdf+pasid?


The problem is that the BDF is programmable by the guest, so we may end
up with duplicated BDFs. That's why the code keys on the struct PCIBus
pointer.
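This can be illustrated with a small model (Python; a simplified sketch of struct vtd_as_key, not the QEMU code, and the PCI_NO_PASID sentinel value here is an assumption): keying on the bus object's identity rather than its guest-visible number keeps the key stable across guest reprogramming, while adding the PASID gives each (bus, devfn, pasid) tuple its own address space.

```python
PCI_NO_PASID = -1  # illustrative sentinel for "no PASID"

class PCIBus:
    """Stand-in for QEMU's PCIBus: the object's identity stays stable
    even if the guest reprograms the guest-visible bus number."""

address_spaces = {}

def vtd_find_add_as(bus, devfn, pasid=PCI_NO_PASID):
    # Key on the bus object itself (not its guest-programmable number),
    # plus devfn and PASID -- modeling struct vtd_as_key {bus, devfn, pasid}.
    key = (id(bus), devfn, pasid)
    return address_spaces.setdefault(key, object())
```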


>
>> 2) tag IOTLB with PASID
>> 3) PASID cache and its flush
>> 4) Fault recording with PASID
>>
>> For simplicity:
>>
>> 1) PASID cache is not implemented so we can simply implement the PASID
>> cache flush as a nop.
>> 2) Fault recording with PASID is not supported, NFR is not changed.
>
> I think this doesn't work for passthrough devices. So we need to fail
> QEMU if the user tries to expose such a vIOMMU together with a
> passthrough device.


OK, I think I can simply fail the vIOMMU notifier registration to block
both vhost and VFIO.

Thanks


>
>> All of the above is not mandatory and could be implemented in the
>> future.
>>
>> Note that although PASID-based IOMMU translation is ready, no device
>> can issue PASID DMA right now. In this case, PCI_NO_PASID is used as
>> the PASID to identify the address space w/o PASID. vtd_find_add_as() has been
>> extended to provision address space with PASID which could be utilized
>> by the future extension of PCI core to allow device model to use PASID
>> based DMA translation.
>>
>> This feature would be useful for:
>>
>> 1) prototyping PASID support for devices like virtio
>> 2) future vPASID work
>> 3) future PRS and vSVA work
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> ---
>>   hw/i386/intel_iommu.c          | 357 +++++++++++++++++++++++++--------
>>   hw/i386/intel_iommu_internal.h |  14 +-
>>   hw/i386/trace-events           |   2 +
>>   include/hw/i386/intel_iommu.h  |   7 +-
>>   include/hw/pci/pci_bus.h       |   2 +
>>   5 files changed, 296 insertions(+), 86 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 82787f9850..13447fda16 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -58,6 +58,14 @@
>>   struct vtd_as_key {
>>       PCIBus *bus;
>>       uint8_t devfn;
>> +    uint32_t pasid;
>> +};
>> +
>> +struct vtd_iotlb_key {
>> +    uint16_t sid;
>> +    uint32_t pasid;
>> +    uint64_t gfn;
>> +    uint32_t level;
>>   };
>>     static void vtd_address_space_refresh_all(IntelIOMMUState *s);
>> @@ -199,14 +207,24 @@ static inline gboolean vtd_as_has_map_notifier(VTDAddressSpace *as)
>>   }
>>     /* GHashTable functions */
>> -static gboolean vtd_uint64_equal(gconstpointer v1, gconstpointer v2)
>> +static gboolean vtd_iotlb_equal(gconstpointer v1, gconstpointer v2)
>>   {
>> -    return *((const uint64_t *)v1) == *((const uint64_t *)v2);
>> +    const struct vtd_iotlb_key *key1 = v1;
>> +    const struct vtd_iotlb_key *key2 = v2;
>> +
>> +    return key1->sid == key2->sid &&
>> +           key1->pasid == key2->pasid &&
>> +           key1->level == key2->level &&
>> +           key1->gfn == key2->gfn;
>>   }
>>   -static guint vtd_uint64_hash(gconstpointer v)
>> +static guint vtd_iotlb_hash(gconstpointer v)
>>   {
>> -    return (guint)*(const uint64_t *)v;
>> +    const struct vtd_iotlb_key *key = v;
>> +
>> +    return key->gfn | ((key->sid) << VTD_IOTLB_SID_SHIFT) |
>> +           (key->level) << VTD_IOTLB_LVL_SHIFT |
>> +           (key->pasid) << VTD_IOTLB_PASID_SHIFT;
>>   }
>>     static gboolean vtd_as_equal(gconstpointer v1, gconstpointer v2)
>> @@ -214,7 +232,8 @@ static gboolean vtd_as_equal(gconstpointer v1, gconstpointer v2)
>>       const struct vtd_as_key *key1 = v1;
>>       const struct vtd_as_key *key2 = v2;
>>   -    return (key1->bus == key2->bus) && (key1->devfn == key2->devfn);
>> +    return (key1->bus == key2->bus) && (key1->devfn == key2->devfn) &&
>> +           (key1->pasid == key2->pasid);
>>   }
>>     static inline uint16_t vtd_make_source_id(uint8_t bus_num, uint8_t devfn)
>> @@ -306,13 +325,6 @@ static void vtd_reset_caches(IntelIOMMUState *s)
>>       vtd_iommu_unlock(s);
>>   }
>>   -static uint64_t vtd_get_iotlb_key(uint64_t gfn, uint16_t source_id,
>> -                                  uint32_t level)
>> -{
>> -    return gfn | ((uint64_t)(source_id) << VTD_IOTLB_SID_SHIFT) |
>> -           ((uint64_t)(level) << VTD_IOTLB_LVL_SHIFT);
>> -}
>> -
>>   static uint64_t vtd_get_iotlb_gfn(hwaddr addr, uint32_t level)
>>   {
>>       return (addr & vtd_slpt_level_page_mask(level)) >> VTD_PAGE_SHIFT_4K;
>> @@ -320,15 +332,17 @@ static uint64_t vtd_get_iotlb_gfn(hwaddr addr, uint32_t level)
>>     /* Must be called with IOMMU lock held */
>>   static VTDIOTLBEntry *vtd_lookup_iotlb(IntelIOMMUState *s, uint16_t source_id,
>> -                                       hwaddr addr)
>> +                                       hwaddr addr, uint32_t pasid)
>>   {
>> +    struct vtd_iotlb_key key;
>>       VTDIOTLBEntry *entry;
>> -    uint64_t key;
>>       int level;
>>         for (level = VTD_SL_PT_LEVEL; level < VTD_SL_PML4_LEVEL; level++) {
>> -        key = vtd_get_iotlb_key(vtd_get_iotlb_gfn(addr, level),
>> -                                source_id, level);
>> +        key.gfn = vtd_get_iotlb_gfn(addr, level);
>> +        key.level = level;
>> +        key.sid = source_id;
>> +        key.pasid = pasid;
>>           entry = g_hash_table_lookup(s->iotlb, &key);
>>           if (entry) {
>>               goto out;
>> @@ -342,10 +356,11 @@ out:
>>   /* Must be with IOMMU lock held */
>>   static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
>>                                uint16_t domain_id, hwaddr addr, uint64_t slpte,
>> -                             uint8_t access_flags, uint32_t level)
>> +                             uint8_t access_flags, uint32_t level,
>> +                             uint32_t pasid)
>>   {
>>       VTDIOTLBEntry *entry = g_malloc(sizeof(*entry));
>> -    uint64_t *key = g_malloc(sizeof(*key));
>> +    struct vtd_iotlb_key *key = g_malloc(sizeof(*key));
>>       uint64_t gfn = vtd_get_iotlb_gfn(addr, level);
>>         trace_vtd_iotlb_page_update(source_id, addr, slpte, domain_id);
>> @@ -359,7 +374,13 @@ static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
>>       entry->slpte = slpte;
>>       entry->access_flags = access_flags;
>>       entry->mask = vtd_slpt_level_page_mask(level);
>> -    *key = vtd_get_iotlb_key(gfn, source_id, level);
>> +    entry->pasid = pasid;
>> +
>> +    key->gfn = gfn;
>> +    key->sid = source_id;
>> +    key->level = level;
>> +    key->pasid = pasid;
>> +
>>       g_hash_table_replace(s->iotlb, key, entry);
>>   }
>>   @@ -823,13 +844,15 @@ static int vtd_get_pe_from_pasid_table(IntelIOMMUState *s,
>>     static int vtd_ce_get_rid2pasid_entry(IntelIOMMUState *s,
>>                                         VTDContextEntry *ce,
>> -                                      VTDPASIDEntry *pe)
>> +                                      VTDPASIDEntry *pe,
>> +                                      uint32_t pasid)
>>   {
>> -    uint32_t pasid;
>>       dma_addr_t pasid_dir_base;
>>       int ret = 0;
>>   -    pasid = VTD_CE_GET_RID2PASID(ce);
>> +    if (pasid == PCI_NO_PASID) {
>> +        pasid = VTD_CE_GET_RID2PASID(ce);
>> +    }
>>       pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(ce);
>>       ret = vtd_get_pe_from_pasid_table(s, pasid_dir_base, pasid, pe);
>>   @@ -838,15 +861,17 @@ static int vtd_ce_get_rid2pasid_entry(IntelIOMMUState *s,
>>     static int vtd_ce_get_pasid_fpd(IntelIOMMUState *s,
>>                                   VTDContextEntry *ce,
>> -                                bool *pe_fpd_set)
>> +                                bool *pe_fpd_set,
>> +                                uint32_t pasid)
>>   {
>>       int ret;
>> -    uint32_t pasid;
>>       dma_addr_t pasid_dir_base;
>>       VTDPASIDDirEntry pdire;
>>       VTDPASIDEntry pe;
>>   -    pasid = VTD_CE_GET_RID2PASID(ce);
>> +    if (pasid == PCI_NO_PASID) {
>> +        pasid = VTD_CE_GET_RID2PASID(ce);
>> +    }
>>       pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(ce);
>>         /*
>> @@ -892,12 +917,13 @@ static inline uint32_t vtd_ce_get_level(VTDContextEntry *ce)
>>   }
>>     static uint32_t vtd_get_iova_level(IntelIOMMUState *s,
>> -                                   VTDContextEntry *ce)
>> +                                   VTDContextEntry *ce,
>> +                                   uint32_t pasid)
>>   {
>>       VTDPASIDEntry pe;
>>         if (s->root_scalable) {
>> -        vtd_ce_get_rid2pasid_entry(s, ce, &pe);
>> +        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
>>           return VTD_PE_GET_LEVEL(&pe);
>>       }
>>   @@ -910,12 +936,13 @@ static inline uint32_t vtd_ce_get_agaw(VTDContextEntry *ce)
>>   }
>>     static uint32_t vtd_get_iova_agaw(IntelIOMMUState *s,
>> -                                  VTDContextEntry *ce)
>> +                                  VTDContextEntry *ce,
>> +                                  uint32_t pasid)
>>   {
>>       VTDPASIDEntry pe;
>>         if (s->root_scalable) {
>> -        vtd_ce_get_rid2pasid_entry(s, ce, &pe);
>> +        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
>>           return 30 + ((pe.val[0] >> 2) & VTD_SM_PASID_ENTRY_AW) * 9;
>>       }
>>   @@ -957,31 +984,33 @@ static inline bool vtd_ce_type_check(X86IOMMUState *x86_iommu,
>>   }
>>     static inline uint64_t vtd_iova_limit(IntelIOMMUState *s,
>> -                                      VTDContextEntry *ce, uint8_t aw)
>> +                                      VTDContextEntry *ce, uint8_t aw,
>> +                                      uint32_t pasid)
>>   {
>> -    uint32_t ce_agaw = vtd_get_iova_agaw(s, ce);
>> +    uint32_t ce_agaw = vtd_get_iova_agaw(s, ce, pasid);
>>       return 1ULL << MIN(ce_agaw, aw);
>>   }
>>     /* Return true if IOVA passes range check, otherwise false. */
>>   static inline bool vtd_iova_range_check(IntelIOMMUState *s,
>>                                           uint64_t iova, VTDContextEntry *ce,
>> -                                        uint8_t aw)
>> +                                        uint8_t aw, uint32_t pasid)
>>   {
>>       /*
>>        * Check if @iova is above 2^X-1, where X is the minimum of MGAW
>>        * in CAP_REG and AW in context-entry.
>>        */
>> -    return !(iova & ~(vtd_iova_limit(s, ce, aw) - 1));
>> +    return !(iova & ~(vtd_iova_limit(s, ce, aw, pasid) - 1));
>>   }
>>     static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
>> -                                          VTDContextEntry *ce)
>> +                                          VTDContextEntry *ce,
>> +                                          uint32_t pasid)
>>   {
>>       VTDPASIDEntry pe;
>>         if (s->root_scalable) {
>> -        vtd_ce_get_rid2pasid_entry(s, ce, &pe);
>> +        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
>>           return pe.val[0] & VTD_SM_PASID_ENTRY_SLPTPTR;
>>       }
>>   @@ -1015,16 +1044,17 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
>>   static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
>>                                uint64_t iova, bool is_write,
>>                                uint64_t *slptep, uint32_t *slpte_level,
>> -                             bool *reads, bool *writes, uint8_t aw_bits)
>> +                             bool *reads, bool *writes, uint8_t aw_bits,
>> +                             uint32_t pasid)
>>   {
>> -    dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce);
>> -    uint32_t level = vtd_get_iova_level(s, ce);
>> +    dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce, pasid);
>> +    uint32_t level = vtd_get_iova_level(s, ce, pasid);
>>       uint32_t offset;
>>       uint64_t slpte;
>>       uint64_t access_right_check;
>>       uint64_t xlat, size;
>>   -    if (!vtd_iova_range_check(s, iova, ce, aw_bits)) {
>> +    if (!vtd_iova_range_check(s, iova, ce, aw_bits, pasid)) {
>>           error_report_once("%s: detected IOVA overflow (iova=0x%" PRIx64 ")",
>>                             __func__, iova);
>>           return -VTD_FR_ADDR_BEYOND_MGAW;
>> @@ -1040,7 +1070,7 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
>>           if (slpte == (uint64_t)-1) {
>>               error_report_once("%s: detected read error on DMAR slpte "
>>                                 "(iova=0x%" PRIx64 ")", __func__, iova);
>> -            if (level == vtd_get_iova_level(s, ce)) {
>> +            if (level == vtd_get_iova_level(s, ce, pasid)) {
>>                   /* Invalid programming of context-entry */
>>                   return -VTD_FR_CONTEXT_ENTRY_INV;
>>               } else {
>> @@ -1304,18 +1334,19 @@ next:
>>    */
>>   static int vtd_page_walk(IntelIOMMUState *s, VTDContextEntry *ce,
>>                            uint64_t start, uint64_t end,
>> -                         vtd_page_walk_info *info)
>> +                         vtd_page_walk_info *info,
>> +                         uint32_t pasid)
>>   {
>> -    dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce);
>> -    uint32_t level = vtd_get_iova_level(s, ce);
>> +    dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce, pasid);
>> +    uint32_t level = vtd_get_iova_level(s, ce, pasid);
>>   -    if (!vtd_iova_range_check(s, start, ce, info->aw)) {
>> +    if (!vtd_iova_range_check(s, start, ce, info->aw, pasid)) {
>>           return -VTD_FR_ADDR_BEYOND_MGAW;
>>       }
>>   -    if (!vtd_iova_range_check(s, end, ce, info->aw)) {
>> +    if (!vtd_iova_range_check(s, end, ce, info->aw, pasid)) {
>>           /* Fix end so that it reaches the maximum */
>> -        end = vtd_iova_limit(s, ce, info->aw);
>> +        end = vtd_iova_limit(s, ce, info->aw, pasid);
>>       }
>>         return vtd_page_walk_level(addr, start, end, level, true, true, info);
>> @@ -1383,7 +1414,7 @@ static int vtd_ce_rid2pasid_check(IntelIOMMUState *s,
>>        * has valid rid2pasid setting, which includes valid
>>        * rid2pasid field and corresponding pasid entry setting
>>        */
>> -    return vtd_ce_get_rid2pasid_entry(s, ce, &pe);
>> +    return vtd_ce_get_rid2pasid_entry(s, ce, &pe, PCI_NO_PASID);
>>   }
>>     /* Map a device to its corresponding domain (context-entry) */
>> @@ -1466,12 +1497,13 @@ static int vtd_sync_shadow_page_hook(IOMMUTLBEvent *event,
>>   }
>>     static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
>> -                                  VTDContextEntry *ce)
>> +                                  VTDContextEntry *ce,
>> +                                  uint32_t pasid)
>>   {
>>       VTDPASIDEntry pe;
>>         if (s->root_scalable) {
>> -        vtd_ce_get_rid2pasid_entry(s, ce, &pe);
>> +        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
>>           return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
>>       }
>>   @@ -1489,10 +1521,10 @@ static int vtd_sync_shadow_page_table_range(VTDAddressSpace *vtd_as,
>>           .notify_unmap = true,
>>           .aw = s->aw_bits,
>>           .as = vtd_as,
>> -        .domain_id = vtd_get_domain_id(s, ce),
>> +        .domain_id = vtd_get_domain_id(s, ce, vtd_as->pasid),
>>       };
>>   -    return vtd_page_walk(s, ce, addr, addr + size, &info);
>> +    return vtd_page_walk(s, ce, addr, addr + size, &info, vtd_as->pasid);
>>   }
>>     static int vtd_sync_shadow_page_table(VTDAddressSpace *vtd_as)
>> @@ -1536,13 +1568,14 @@ static int vtd_sync_shadow_page_table(VTDAddressSpace *vtd_as)
>>    * 1st-level translation or 2nd-level translation, it depends
>>    * on PGTT setting.
>>    */
>> -static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce)
>> +static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
>> +                               uint32_t pasid)
>>   {
>>       VTDPASIDEntry pe;
>>       int ret;
>>         if (s->root_scalable) {
>> -        ret = vtd_ce_get_rid2pasid_entry(s, ce, &pe);
>> +        ret = vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
>>           if (ret) {
>>               /*
>>                * This error is guest triggerable. We should assume PT
>> @@ -1578,19 +1611,20 @@ static bool vtd_as_pt_enabled(VTDAddressSpace *as)
>>           return false;
>>       }
>>   -    return vtd_dev_pt_enabled(s, &ce);
>> +    return vtd_dev_pt_enabled(s, &ce, as->pasid);
>>   }
>>     /* Return whether the device is using IOMMU translation. */
>>   static bool vtd_switch_address_space(VTDAddressSpace *as)
>>   {
>> -    bool use_iommu;
>> +    bool use_iommu, pt;
>>       /* Whether we need to take the BQL on our own */
>>       bool take_bql = !qemu_mutex_iothread_locked();
>>         assert(as);
>>         use_iommu = as->iommu_state->dmar_enabled && !vtd_as_pt_enabled(as);
>> +    pt = as->iommu_state->dmar_enabled && vtd_as_pt_enabled(as);
>>         trace_vtd_switch_address_space(pci_bus_num(as->bus),
>>                                      VTD_PCI_SLOT(as->devfn),
>> @@ -1610,11 +1644,53 @@ static bool vtd_switch_address_space(VTDAddressSpace *as)
>>       if (use_iommu) {
>>           memory_region_set_enabled(&as->nodmar, false);
>>           memory_region_set_enabled(MEMORY_REGION(&as->iommu), true);
>> +        /*
>> +         * vt-d spec v3.4 3.14:
>> +         *
>> +         * """
>> +         * Requests-with-PASID with input address in range 0xFEEx_xxxx
>> +         * are translated normally like any other request-with-PASID
>> +         * through DMA-remapping hardware.
>> +         * """
>> +         *
>> +         * Need to disable ir for as with PASID.
>> +         */
>> +        if (as->pasid != PCI_NO_PASID) {
>> +            memory_region_set_enabled(&as->iommu_ir, false);
>> +        } else {
>> +            memory_region_set_enabled(&as->iommu_ir, true);
>> +        }
>>       } else {
>>           memory_region_set_enabled(MEMORY_REGION(&as->iommu), false);
>>           memory_region_set_enabled(&as->nodmar, true);
>>       }
>>   +    /*
>> +     * vtd-spec v3.4 3.14:
>> +     *
>> +     * """
>> +     * Requests-with-PASID with input address in range 0xFEEx_xxxx are
>> +     * translated normally like any other request-with-PASID through
>> +     * DMA-remapping hardware. However, if such a request is processed
>> +     * using pass-through translation, it will be blocked as described
>> +     * in the paragraph below.
>> +     *
>> +     * Software must not program paging-structure entries to remap any
>> +     * address to the interrupt address range. Untranslated requests
>> +     * and translation requests that result in an address in the
>> +     * interrupt range will be blocked with condition code LGN.4 or
>> +     * SGN.8.
>> +     * """
>> +     *
>> +     * We enable a per-AS memory region (iommu_ir_fault) for catching
>> +     * the translation to the interrupt range through PASID + PT.
>> +     */
>> +    if (pt && as->pasid != PCI_NO_PASID) {
>> +        memory_region_set_enabled(&as->iommu_ir_fault, true);
>> +    } else {
>> +        memory_region_set_enabled(&as->iommu_ir_fault, false);
>> +    }
>> +
>>       if (take_bql) {
>>           qemu_mutex_unlock_iothread();
>>       }
>> @@ -1747,13 +1823,14 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>>       uint8_t bus_num = pci_bus_num(bus);
>>       VTDContextCacheEntry *cc_entry;
>>       uint64_t slpte, page_mask;
>> -    uint32_t level;
>> +    uint32_t level, pasid = vtd_as->pasid;
>>       uint16_t source_id = vtd_make_source_id(bus_num, devfn);
>>       int ret_fr;
>>       bool is_fpd_set = false;
>>       bool reads = true;
>>       bool writes = true;
>>       uint8_t access_flags;
>> +    bool rid2pasid = (pasid == PCI_NO_PASID) && s->root_scalable;
>>       VTDIOTLBEntry *iotlb_entry;
>>         /*
>> @@ -1766,15 +1843,17 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>>         cc_entry = &vtd_as->context_cache_entry;
>>   -    /* Try to fetch slpte form IOTLB */
>> -    iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
>> -    if (iotlb_entry) {
>> -        trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->slpte,
>> -                                 iotlb_entry->domain_id);
>> -        slpte = iotlb_entry->slpte;
>> -        access_flags = iotlb_entry->access_flags;
>> -        page_mask = iotlb_entry->mask;
>> -        goto out;
>> +    /* Try to fetch slpte form IOTLB, we don't need RID2PASID logic */
>> +    if (!rid2pasid) {
>> +        iotlb_entry = vtd_lookup_iotlb(s, source_id, addr, pasid);
>> +        if (iotlb_entry) {
>> +            trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->slpte,
>> +                                     iotlb_entry->domain_id);
>> +            slpte = iotlb_entry->slpte;
>> +            access_flags = iotlb_entry->access_flags;
>> +            page_mask = iotlb_entry->mask;
>> +            goto out;
>> +        }
>>       }
>>         /* Try to fetch context-entry from cache first */
>> @@ -1785,7 +1864,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>>           ce = cc_entry->context_entry;
>>           is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
>>           if (!is_fpd_set && s->root_scalable) {
>> -            ret_fr = vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set);
>> +            ret_fr = vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set, pasid);
>>               if (ret_fr) {
>>                   vtd_qualify_report_fault(s, -ret_fr, is_fpd_set,
>>                                            source_id, addr, is_write);
>> @@ -1796,7 +1875,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>>           ret_fr = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
>>           is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
>>           if (!ret_fr && !is_fpd_set && s->root_scalable) {
>> -            ret_fr = vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set);
>> +            ret_fr = vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set, pasid);
>>           }
>>           if (ret_fr) {
>>               vtd_qualify_report_fault(s, -ret_fr, is_fpd_set,
>> @@ -1811,11 +1890,15 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>>           cc_entry->context_cache_gen = s->context_cache_gen;
>>       }
>>   +    if (rid2pasid) {
>> +        pasid = VTD_CE_GET_RID2PASID(&ce);
>> +    }
>> +
>>       /*
>>        * We don't need to translate for pass-through context entries.
>>        * Also, let's ignore IOTLB caching as well for PT devices.
>>        */
>> -    if (vtd_dev_pt_enabled(s, &ce)) {
>> +    if (vtd_dev_pt_enabled(s, &ce, pasid)) {
>>           entry->iova = addr & VTD_PAGE_MASK_4K;
>>           entry->translated_addr = entry->iova;
>>           entry->addr_mask = ~VTD_PAGE_MASK_4K;
>> @@ -1836,8 +1919,21 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>>           return true;
>>       }
>>   +    /* Try to fetch slpte form IOTLB for RID2PASID slow path */
>> +    if (rid2pasid) {
>> +        iotlb_entry = vtd_lookup_iotlb(s, source_id, addr, pasid);
>> +        if (iotlb_entry) {
>> +            trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->slpte,
>> +                                     iotlb_entry->domain_id);
>> +            slpte = iotlb_entry->slpte;
>> +            access_flags = iotlb_entry->access_flags;
>> +            page_mask = iotlb_entry->mask;
>> +            goto out;
>> +        }
>> +    }
>> +
>>       ret_fr = vtd_iova_to_slpte(s, &ce, addr, is_write, &slpte, &level,
>> -                               &reads, &writes, s->aw_bits);
>> +                               &reads, &writes, s->aw_bits, pasid);
>>       if (ret_fr) {
>>           vtd_qualify_report_fault(s, -ret_fr, is_fpd_set, source_id,
>>                                    addr, is_write);
>> @@ -1846,8 +1942,8 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>>         page_mask = vtd_slpt_level_page_mask(level);
>>       access_flags = IOMMU_ACCESS_FLAG(reads, writes);
>> -    vtd_update_iotlb(s, source_id, vtd_get_domain_id(s, &ce), addr, slpte,
>> -                     access_flags, level);
>> +    vtd_update_iotlb(s, source_id, vtd_get_domain_id(s, &ce, pasid),
>> +                     addr, slpte, access_flags, level, pasid);
>>   out:
>>       vtd_iommu_unlock(s);
>>       entry->iova = addr & page_mask;
>> @@ -2039,7 +2135,7 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
>>       QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
>>           if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>>                                         vtd_as->devfn, &ce) &&
>> -            domain_id == vtd_get_domain_id(s, &ce)) {
>> +            domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
>>               vtd_sync_shadow_page_table(vtd_as);
>>           }
>>       }
>> @@ -2047,7 +2143,7 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
>>     static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
>>                                              uint16_t domain_id, hwaddr addr,
>> -                                           uint8_t am)
>> +                                             uint8_t am, uint32_t pasid)
>>   {
>>       VTDAddressSpace *vtd_as;
>>       VTDContextEntry ce;
>> @@ -2055,9 +2151,11 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
>>       hwaddr size = (1 << am) * VTD_PAGE_SIZE;
>>         QLIST_FOREACH(vtd_as, &(s->vtd_as_with_notifiers), next) {
>> +        if (pasid != PCI_NO_PASID && pasid != vtd_as->pasid)
>> +            continue;
>>           ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>>                                          vtd_as->devfn, &ce);
>> -        if (!ret && domain_id == vtd_get_domain_id(s, &ce)) {
>> -        if (!ret && domain_id == vtd_get_domain_id(s, &ce)) {
>> +        if (!ret && domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
>>               if (vtd_as_has_map_notifier(vtd_as)) {
>>                   /*
>>                    * As long as we have MAP notifications registered in
>> @@ -2101,7 +2199,7 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
>>       vtd_iommu_lock(s);
>>       g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
>>       vtd_iommu_unlock(s);
>> -    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
>> +    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am, PCI_NO_PASID);
>>   }
>>     /* Flush IOTLB
>> @@ -3168,6 +3266,7 @@ static Property vtd_properties[] = {
>>       DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
>>       DEFINE_PROP_BOOL("x-scalable-mode", IntelIOMMUState, scalable_mode, FALSE),
>>       DEFINE_PROP_BOOL("snoop-control", IntelIOMMUState, snoop_control, false),
>> +    DEFINE_PROP_BOOL("x-pasid-mode", IntelIOMMUState, pasid, false),
>>       DEFINE_PROP_BOOL("dma-drain", IntelIOMMUState, dma_drain, true),
>>       DEFINE_PROP_END_OF_LIST(),
>>   };
>> @@ -3441,7 +3540,63 @@ static const MemoryRegionOps vtd_mem_ir_ops = {
>>       },
>>   };
>>   -VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>> +static void vtd_report_ir_illegal_access(VTDAddressSpace *vtd_as,
>> +                                         hwaddr addr, bool is_write)
>> +{
>> +    IntelIOMMUState *s = vtd_as->iommu_state;
>> +    uint8_t bus_n = pci_bus_num(vtd_as->bus);
>> +    uint16_t sid = vtd_make_source_id(bus_n, vtd_as->devfn);
>> +    bool is_fpd_set = false;
>> +    VTDContextEntry ce;
>> +
>> +    assert(vtd_as->pasid != PCI_NO_PASID);
>> +
>> +    /* Try our best to fetch FPD, we can't do anything more */
>> +    if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
>> +        is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
>> +        if (!is_fpd_set && s->root_scalable) {
>> +            vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set, vtd_as->pasid);
>> +        }
>> +    }
>> +
>> +    vtd_qualify_report_fault(s, VTD_FR_SM_INTERRUPT_ADDR,
>> +                             is_fpd_set, sid, addr, is_write);
>> +}
>> +
>> +static MemTxResult vtd_mem_ir_fault_read(void *opaque, hwaddr addr,
>> +                                         uint64_t *data, unsigned size,
>> +                                         MemTxAttrs attrs)
>> +{
>> +    vtd_report_ir_illegal_access(opaque, addr, false);
>> +
>> +    return MEMTX_ERROR;
>> +}
>> +
>> +static MemTxResult vtd_mem_ir_fault_write(void *opaque, hwaddr addr,
>> +                                          uint64_t value, unsigned size,
>> +                                          MemTxAttrs attrs)
>> +{
>> +    vtd_report_ir_illegal_access(opaque, addr, true);
>> +
>> +    return MEMTX_ERROR;
>> +}
>> +
>> +static const MemoryRegionOps vtd_mem_ir_fault_ops = {
>> +    .read_with_attrs = vtd_mem_ir_fault_read,
>> +    .write_with_attrs = vtd_mem_ir_fault_write,
>> +    .endianness = DEVICE_LITTLE_ENDIAN,
>> +    .impl = {
>> +        .min_access_size = 4,
>> +        .max_access_size = 4,
>> +    },
>> +    .valid = {
>> +        .min_access_size = 4,
>> +        .max_access_size = 4,
>> +    },
>> +};
>> +
>> +VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
>> +                                 int devfn, unsigned int pasid)
>>   {
>>       /*
>>        * We can't simply use sid here since the bus number might not be
>> @@ -3450,6 +3605,7 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>>       struct vtd_as_key key = {
>>           .bus = bus,
>>           .devfn = devfn,
>> +        .pasid = pasid,
>>       };
>>       VTDAddressSpace *vtd_dev_as;
>>       char name[128];
>> @@ -3460,13 +3616,21 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>>             new_key->bus = bus;
>>           new_key->devfn = devfn;
>> +        new_key->pasid = pasid;
>> +
>> +        if (pasid == PCI_NO_PASID) {
>> +            snprintf(name, sizeof(name), "vtd-%02x.%x", PCI_SLOT(devfn),
>> +                     PCI_FUNC(devfn));
>> +        } else {
>> +            snprintf(name, sizeof(name), "vtd-%02x.%x-pasid-%x", PCI_SLOT(devfn),
>> +                     PCI_FUNC(devfn), pasid);
>> +        }
>>   -        snprintf(name, sizeof(name), "vtd-%02x.%x", PCI_SLOT(devfn),
>> -                 PCI_FUNC(devfn));
>>           vtd_dev_as = g_malloc0(sizeof(VTDAddressSpace));
>>             vtd_dev_as->bus = bus;
>>           vtd_dev_as->devfn = (uint8_t)devfn;
>> +        vtd_dev_as->pasid = pasid;
>>           vtd_dev_as->iommu_state = s;
>>           vtd_dev_as->context_cache_entry.context_cache_gen = 0;
>>           vtd_dev_as->iova_tree = iova_tree_new();
>> @@ -3507,6 +3671,24 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>>                                              VTD_INTERRUPT_ADDR_FIRST,
>>                                              &vtd_dev_as->iommu_ir, 1);
>>   +        /*
>> +         * This region is used for catching faults on accesses to the
>> +         * interrupt range via passthrough + PASID. See also
>> +         * vtd_switch_address_space(). We can't use an alias since we
>> +         * need to know the sid, which is valid for MSI that uses
>> +         * bus_master_as (see msi_send_message()).
>> +         */
>> +        memory_region_init_io(&vtd_dev_as->iommu_ir_fault, OBJECT(s),
>> +                              &vtd_mem_ir_fault_ops, vtd_dev_as, "vtd-no-ir",
>> +                              VTD_INTERRUPT_ADDR_SIZE);
>> +        /*
>> +         * Hook to root since when PT is enabled vtd_dev_as->iommu
>> +         * will be disabled.
>> +         */
>> +        memory_region_add_subregion_overlap(MEMORY_REGION(&vtd_dev_as->root),
>> +                                            VTD_INTERRUPT_ADDR_FIRST,
>> +                                            &vtd_dev_as->iommu_ir_fault, 2);
>> +
>>           /*
>>            * Hook both the containers under the root container, we
>>            * switch between DMAR & noDMAR by enable/disable
>> @@ -3627,7 +3809,7 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>>                                     "legacy mode",
>>                                     bus_n, PCI_SLOT(vtd_as->devfn),
>>                                     PCI_FUNC(vtd_as->devfn),
>> -                                  vtd_get_domain_id(s, &ce),
>> +                                  vtd_get_domain_id(s, &ce, vtd_as->pasid),
>>                                     ce.hi, ce.lo);
>>           if (vtd_as_has_map_notifier(vtd_as)) {
>>               /* This is required only for MAP typed notifiers */
>> @@ -3637,10 +3819,10 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>>                   .notify_unmap = false,
>>                   .aw = s->aw_bits,
>>                   .as = vtd_as,
>> -                .domain_id = vtd_get_domain_id(s, &ce),
>> +                .domain_id = vtd_get_domain_id(s, &ce, vtd_as->pasid),
>>               };
>>   -            vtd_page_walk(s, &ce, 0, ~0ULL, &info);
>> +            vtd_page_walk(s, &ce, 0, ~0ULL, &info, vtd_as->pasid);
>>           }
>>       } else {
>>           trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
>> @@ -3735,6 +3917,10 @@ static void vtd_init(IntelIOMMUState *s)
>>           s->ecap |= VTD_ECAP_SC;
>>       }
>>   +    if (s->pasid) {
>> +        s->ecap |= VTD_ECAP_PASID;
>> +    }
>> +
>>       vtd_reset_caches(s);
>>         /* Define registers with default values and bit semantics */
>> @@ -3808,7 +3994,7 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>>         assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
>>   -    vtd_as = vtd_find_add_as(s, bus, devfn);
>> +    vtd_as = vtd_find_add_as(s, bus, devfn, PCI_NO_PASID);
>>       return &vtd_as->as;
>>   }
>>   @@ -3851,6 +4037,11 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
>>           return false;
>>       }
>>   +    if (s->pasid && !s->scalable_mode) {
>> +        error_setg(errp, "Need to set PASID for scalable mode");
>> +        return false;
> I guess your point is that scalable mode is required when setting the
> pasid capability, right? You also need to set the PASID size in the ecap
> register when exposing the pasid capability to the guest.
>
> 39:35 RO X PSS: PASID Size Supported
>
>
>> +    }
>> +
>>       return true;
>>   }
>>   @@ -3913,7 +4104,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
>>         sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->csrmem);
>>       /* No corresponding destroy */
>> -    s->iotlb = g_hash_table_new_full(vtd_uint64_hash, vtd_uint64_equal,
>> +    s->iotlb = g_hash_table_new_full(vtd_iotlb_hash, vtd_iotlb_equal,
>>                                        g_free, g_free);
>>       s->vtd_as = g_hash_table_new_full(vtd_as_hash, vtd_as_equal,
>>                                         g_free, g_free);
>> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
>> index 930ce61feb..f6d1fae79b 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -114,8 +114,9 @@
>>                                        VTD_INTERRUPT_ADDR_FIRST + 1)
>>     /* The shift of source_id in the key of IOTLB hash table */
>> -#define VTD_IOTLB_SID_SHIFT         36
>> -#define VTD_IOTLB_LVL_SHIFT         52
>> +#define VTD_IOTLB_SID_SHIFT         20
>> +#define VTD_IOTLB_LVL_SHIFT         28
>> +#define VTD_IOTLB_PASID_SHIFT       30
>>   #define VTD_IOTLB_MAX_SIZE          1024    /* Max size of the hash table */
>>     /* IOTLB_REG */
>> @@ -191,6 +192,7 @@
>>   #define VTD_ECAP_SC                 (1ULL << 7)
>>   #define VTD_ECAP_MHMV               (15ULL << 20)
>>   #define VTD_ECAP_SRS                (1ULL << 31)
>> +#define VTD_ECAP_PASID              (1ULL << 40)
>>   #define VTD_ECAP_SMTS               (1ULL << 43)
>>   #define VTD_ECAP_SLTS               (1ULL << 46)
>>   @@ -211,6 +213,8 @@
>>   #define VTD_CAP_DRAIN_READ          (1ULL << 55)
>>   #define VTD_CAP_DRAIN               (VTD_CAP_DRAIN_READ | VTD_CAP_DRAIN_WRITE)
>>   #define VTD_CAP_CM                  (1ULL << 7)
>> +#define VTD_PASID_ID_SHIFT          20
>> +#define VTD_PASID_ID_MASK           ((1ULL << VTD_PASID_ID_SHIFT) - 1)
>>     /* Supported Adjusted Guest Address Widths */
>>   #define VTD_CAP_SAGAW_SHIFT         8
>> @@ -379,6 +383,11 @@ typedef union VTDInvDesc VTDInvDesc;
>>   #define VTD_INV_DESC_IOTLB_AM(val)      ((val) & 0x3fULL)
>>   #define VTD_INV_DESC_IOTLB_RSVD_LO      0xffffffff0000ff00ULL
>>   #define VTD_INV_DESC_IOTLB_RSVD_HI      0xf80ULL
>> +#define VTD_INV_DESC_IOTLB_PASID_PASID  (2ULL << 4)
>> +#define VTD_INV_DESC_IOTLB_PASID_PAGE   (3ULL << 4)
>> +#define VTD_INV_DESC_IOTLB_PASID(val)   (((val) >> 32) & VTD_PASID_ID_MASK)
>> +#define VTD_INV_DESC_IOTLB_PASID_RSVD_LO 0xfff00000000001c0ULL
>> +#define VTD_INV_DESC_IOTLB_PASID_RSVD_HI      0xf80ULL
>>     /* Mask for Device IOTLB Invalidate Descriptor */
>>   #define VTD_INV_DESC_DEVICE_IOTLB_ADDR(val) ((val) & 0xfffffffffffff000ULL)
>> @@ -413,6 +422,7 @@ typedef union VTDInvDesc VTDInvDesc;
>>   /* Information about page-selective IOTLB invalidate */
>>   struct VTDIOTLBPageInvInfo {
>>       uint16_t domain_id;
>> +    uint32_t pasid;
>>       uint64_t addr;
>>       uint8_t mask;
>>   };
>> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
>> index 5bf7e52bf5..57beff0c17 100644
>> --- a/hw/i386/trace-events
>> +++ b/hw/i386/trace-events
>> @@ -12,6 +12,8 @@ vtd_inv_desc_cc_devices(uint16_t sid, uint16_t fmask) "context invalidate device
>>   vtd_inv_desc_iotlb_global(void) "iotlb invalidate global"
>>   vtd_inv_desc_iotlb_domain(uint16_t domain) "iotlb invalidate whole domain 0x%"PRIx16
>>   vtd_inv_desc_iotlb_pages(uint16_t domain, uint64_t addr, uint8_t mask) "iotlb invalidate domain 0x%"PRIx16" addr 0x%"PRIx64" mask 0x%"PRIx8
>> +vtd_inv_desc_iotlb_pasid_pages(uint16_t domain, uint64_t addr, uint8_t mask, uint32_t pasid) "iotlb invalidate domain 0x%"PRIx16" addr 0x%"PRIx64" mask 0x%"PRIx8" pasid 0x%"PRIx32
>> +vtd_inv_desc_iotlb_pasid(uint16_t domain, uint32_t pasid) "iotlb invalidate domain 0x%"PRIx16" pasid 0x%"PRIx32
>>   vtd_inv_desc_wait_sw(uint64_t addr, uint32_t data) "wait invalidate status write addr 0x%"PRIx64" data 0x%"PRIx32
>>   vtd_inv_desc_wait_irq(const char *msg) "%s"
>>   vtd_inv_desc_wait_write_fail(uint64_t hi, uint64_t lo) "write fail for wait desc hi 0x%"PRIx64" lo 0x%"PRIx64
>> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
>> index fa1bed353c..0d1029f366 100644
>> --- a/include/hw/i386/intel_iommu.h
>> +++ b/include/hw/i386/intel_iommu.h
>> @@ -97,11 +97,13 @@ struct VTDPASIDEntry {
>>   struct VTDAddressSpace {
>>       PCIBus *bus;
>>       uint8_t devfn;
>> +    uint32_t pasid;
>>       AddressSpace as;
>>       IOMMUMemoryRegion iommu;
>>       MemoryRegion root;          /* The root container of the device */
>>       MemoryRegion nodmar;        /* The alias of shared nodmar MR */
>>       MemoryRegion iommu_ir;      /* Interrupt region: 0xfeeXXXXX */
>> +    MemoryRegion iommu_ir_fault; /* Interrupt region for catching fault */
>>       IntelIOMMUState *iommu_state;
>>       VTDContextCacheEntry context_cache_entry;
>>       QLIST_ENTRY(VTDAddressSpace) next;
>> @@ -113,6 +115,7 @@ struct VTDAddressSpace {
>>   struct VTDIOTLBEntry {
>>       uint64_t gfn;
>>       uint16_t domain_id;
>> +    uint32_t pasid;
>>       uint64_t slpte;
>>       uint64_t mask;
>>       uint8_t access_flags;
>> @@ -260,6 +263,7 @@ struct IntelIOMMUState {
>>       bool buggy_eim;                 /* Force buggy EIM unless eim=off */
>>       uint8_t aw_bits;                /* Host/IOVA address width (in bits) */
>>       bool dma_drain;                 /* Whether DMA r/w draining enabled */
>> +    bool pasid;                     /* Whether to support PASID */
>>         /*
>>        * Protects IOMMU states in general.  Currently it protects the
>> @@ -271,6 +275,7 @@ struct IntelIOMMUState {
>>   /* Find the VTD Address space associated with the given bus pointer,
>>    * create a new one if none exists
>>    */
>> -VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn);
>> +VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
>> +                                 int devfn, unsigned int pasid);
>>     #endif
>> diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
>> index 347440d42c..cbfcf0b770 100644
>> --- a/include/hw/pci/pci_bus.h
>> +++ b/include/hw/pci/pci_bus.h
>> @@ -26,6 +26,8 @@ enum PCIBusFlags {
>>       PCI_BUS_EXTENDED_CONFIG_SPACE                           = 0x0002,
>>   };
>>   +#define PCI_NO_PASID UINT32_MAX
>> +
>>   struct PCIBus {
>>       BusState qbus;
>>       enum PCIBusFlags flags;
>


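To illustrate the IOTLB change in the patch above: the new VTD_IOTLB_*_SHIFT values fold source-id, level, and PASID into a single 64-bit hash input, while exact matching must compare all fields (including the PASID tag) so a lookup with one PASID never hits an entry cached for another. A minimal sketch of the idea, with hypothetical helper and type names rather than the exact QEMU code:

```c
#include <assert.h>
#include <stdint.h>

/* Shifts as in the patch above */
#define VTD_IOTLB_SID_SHIFT   20
#define VTD_IOTLB_LVL_SHIFT   28
#define VTD_IOTLB_PASID_SHIFT 30

/* Fold gfn, source-id, level and PASID into one 64-bit value used as
 * the hash input.  The fields may overlap: this is a hash, not a
 * unique key, so collisions are acceptable as long as lookup also
 * compares the full fields for equality. */
static uint64_t vtd_iotlb_hash_key(uint64_t gfn, uint16_t sid,
                                   uint8_t level, uint32_t pasid)
{
    return gfn | ((uint64_t)sid << VTD_IOTLB_SID_SHIFT) |
           ((uint64_t)level << VTD_IOTLB_LVL_SHIFT) |
           ((uint64_t)pasid << VTD_IOTLB_PASID_SHIFT);
}

/* Hypothetical key struct; exact match must include the PASID tag. */
typedef struct {
    uint64_t gfn;
    uint16_t sid;
    uint8_t  level;
    uint32_t pasid;
} VTDIOTLBKeySketch;

static int vtd_iotlb_key_equal(const VTDIOTLBKeySketch *a,
                               const VTDIOTLBKeySketch *b)
{
    return a->gfn == b->gfn && a->sid == b->sid &&
           a->level == b->level && a->pasid == b->pasid;
}
```

Two entries that differ only in PASID hash near each other but never compare equal, which is the property the RID2PASID slow path relies on.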

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH V2 4/4] intel-iommu: PASID support
  2022-03-29  4:46         ` Jason Wang
@ 2022-03-30  8:00           ` Tian, Kevin
  2022-03-30  8:32             ` Jason Wang
  0 siblings, 1 reply; 43+ messages in thread
From: Tian, Kevin @ 2022-03-30  8:00 UTC (permalink / raw)
  To: Jason Wang; +Cc: Liu, Yi L, yi.y.sun, qemu-devel, peterx, mst

> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, March 29, 2022 12:47 PM
> 
> On Mon, Mar 28, 2022 at 2:47 PM Tian, Kevin <kevin.tian@intel.com> wrote:
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Monday, March 28, 2022 10:31 AM
> > >
> > > > On Thu, Mar 24, 2022 at 4:54 PM Tian, Kevin <kevin.tian@intel.com> wrote:
> > > >
> > > > > From: Jason Wang
> > > > > Sent: Monday, March 21, 2022 1:54 PM
> > > > >
> > > > > This patch introduce ECAP_PASID via "x-pasid-mode". Based on the
> > > > > existing support for scalable mode, we need to implement the following
> > > > > missing parts:
> > > > >
> > > > > 1) tag VTDAddressSpace with PASID and support IOMMU/DMA
> > > translation
> > > > >    with PASID
> > > > > 2) tag IOTLB with PASID
> > > >
> > > > and invalidate desc to flush PASID iotlb, which seems missing in this patch.
> > >
> > > It existed in the previous version, but it looks like it will be used
> > > only for the first level page table which is not supported right now.
> > > So I deleted the codes.
> >
> > You are right. But there is also PASID-based device TLB invalidate descriptor
> > which is orthogonal to 1st vs. 2nd level thing. If we don't want to break the
> > spec with this series then there will need a way to prevent the user from
> > setting both "device-iotlb" and "x-pasid-mode" together.
> 
> Right, let me do it in the next version.
> 
> 
> >
> > >
> > > >
> > > > > 3) PASID cache and its flush
> > > > > 4) Fault recording with PASID
> > > > >
> > > > > For simplicity:
> > > > >
> > > > > 1) PASID cache is not implemented so we can simply implement the PASID
> > > > > cache flush as a nop.
> > > > > 2) Fault recording with PASID is not supported, NFR is not changed.
> > > > >
> > > > > All of the above is not mandatory and could be implemented in the
> > > > > future.
> > > >
> > > > PASID cache is optional, but fault recording with PASID is required.
> > >
> > > Any pointer in the spec to say something like this? I think sticking
> > > to the NFR would be sufficient.
> >
> > I didn't remember any place in spec saying that fault recording with PASID is
> > not required when PASID capability is exposed.
> 
> Ok, but as a spec it needs to clarify what is required for each capability.

It is clarified in 10.4.14 Fault Recording Registers:

  "PV: PASID Value": PASID value used by the faulted request.
  For requests with PASID, this field reports the PASID value that
  came with the request. Hardware implementations not supporting
  PASID (PASID field Clear in Extended Capability register) and not
  supporting RID_PASID (RPS field Clear in Extended Capability
  Register) implement this field as RsvdZ.

Above reflects that when PASID capability is enabled the PV field
should include PASID value for the faulted request.

Similar description can be found in another field "PP: PASID Present"

> 
> > If there is certain fault
> > triggered by a request with PASID, we do want to report this information
> > upward.
> 
> I tend to do it increasingly on top of this series (anyhow at least
> RID2PASID is introduced before this series)

Yes, RID2PASID should have been recorded too but it's not done correctly.

If you do it in a separate series, it implies that you will introduce another
"x-pasid-fault" to guard the new logic related to PASID fault recording?

> 
> >
> > btw can you elaborate why NFR matters to PASID? It is just about the
> > number of fault recording register...
> 
> I might be wrong, but I thought without increasing NFR we may lack
> sufficient room for reporting PASID.

I think they are orthogonal things.

> 
> >
> > >
> > > > I'm fine with adding it incrementally but want to clarify the concept first.
> > >
> > > Yes, that's the plan.
> > >
> >
> > I have one open which requires your input.
> >
> > While incrementally enabling things does be a common practice, one worry
> > is whether we want to create too many control knobs in the staging process
> > to cause confusion to the end user.
> 
> It should be fine as long as we use the "x-" prefix which will be
> finally removed.

Good to learn.

> 
> >
> > Earlier when Yi proposed Qemu changes for guest SVA [1] he aimed for a
> > coarse-grained knob design:
> > --
> >   Intel VT-d 3.0 introduces scalable mode, and it has a bunch of capabilities
> >   related to scalable mode translation, thus there are multiple combinations.
> >   While this vIOMMU implementation wants simplify it for user by providing
> >   typical combinations. User could config it by "x-scalable-mode" option. The
> >   usage is as below:
> >     "-device intel-iommu,x-scalable-mode=["legacy"|"modern"]"
> >
> >     - "legacy": gives support for SL page table
> >     - "modern": gives support for FL page table, pasid, virtual command
> >     -  if not configured, means no scalable mode support, if not proper
> >        configured, will throw error
> > --
> >
> > Which way do you prefer to?
> >
> > [1] https://lists.gnu.org/archive/html/qemu-devel/2020-02/msg02805.html
> 
> My understanding is that, if we want to deploy Qemu in a production
> environment, we can't use the "x-" prefix. We need a full
> implementation of each cap.
> 
> E.g
> -device intel-iommu,first-level=on,scalable-mode=on etc.
> 

You meant each cap will get a separate control option?

But that way requires the management stack or admin to have deep
knowledge about how combinations of different capabilities work, e.g.
if just turning on scalable mode w/o first-level cannot support vSVA
on assigned devices. Is this a common practice when defining Qemu
parameters? 

Thanks
Kevin

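The two constraints discussed in this message can be summed up as a small validation rule: PASID support depends on scalable mode, and (until PASID-based device TLB invalidation is implemented) it would also conflict with device-iotlb. A sketch under those assumptions, with hypothetical function name and error strings, not the actual QEMU vtd_decide_config() code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return NULL if the option combination is acceptable, otherwise a
 * static description of the conflict.  Mirrors the dependencies
 * discussed in the review:
 *   - x-pasid-mode requires x-scalable-mode
 *   - x-pasid-mode conflicts with device-iotlb until PASID-based
 *     device TLB invalidation exists */
static const char *vtd_check_pasid_config(bool pasid, bool scalable_mode,
                                          bool device_iotlb)
{
    if (pasid && !scalable_mode) {
        return "x-pasid-mode requires x-scalable-mode";
    }
    if (pasid && device_iotlb) {
        return "x-pasid-mode is incompatible with device-iotlb";
    }
    return NULL; /* OK */
}
```

A coarse-grained knob (as in the earlier "x-scalable-mode=legacy|modern" proposal) would instead encode such dependency checks once per mode name rather than per capability pair.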
^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH V2 4/4] intel-iommu: PASID support
  2022-03-29  4:48     ` Jason Wang
@ 2022-03-30  8:02       ` Tian, Kevin
  2022-03-30  8:31         ` Jason Wang
  0 siblings, 1 reply; 43+ messages in thread
From: Tian, Kevin @ 2022-03-30  8:02 UTC (permalink / raw)
  To: Jason Wang; +Cc: Liu, Yi L, yi.y.sun, qemu-devel, peterx, mst

> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, March 29, 2022 12:49 PM
> 
> On Mon, Mar 28, 2022 at 3:03 PM Tian, Kevin <kevin.tian@intel.com> wrote:
> >
> > > From: Jason Wang
> > > Sent: Monday, March 21, 2022 1:54 PM
> > >
> > > +    /*
> > > +     * vtd-spec v3.4 3.14:
> > > +     *
> > > +     * """
> > > +     * Requests-with-PASID with input address in range 0xFEEx_xxxx are
> > > +     * translated normally like any other request-with-PASID through
> > > +     * DMA-remapping hardware. However, if such a request is processed
> > > +     * using pass-through translation, it will be blocked as described
> > > +     * in the paragraph below.
> >
> > While PASID+PT is blocked as described in the below paragraph, the
> > paragraph itself applies to all situations:
> >
> >   1) PT + noPASID
> >   2) translation + noPASID
> >   3) PT + PASID
> >   4) translation + PASID
> >
> > because...
> >
> > > +     *
> > > +     * Software must not program paging-structure entries to remap any
> > > +     * address to the interrupt address range. Untranslated requests
> > > +     * and translation requests that result in an address in the
> > > +     * interrupt range will be blocked with condition code LGN.4 or
> > > +     * SGN.8.
> >
> > ... if you look at the definition of LGN.4 or SGN.8:
> >
> > LGN.4:  When legacy mode (RTADDR_REG.TTM=00b) is enabled, hardware
> >         detected an output address (i.e. address after remapping) in the
> >         interrupt address range (0xFEEx_xxxx). For Translated requests and
> >         requests with pass-through translation type (TT=10), the output
> >         address is the same as the address in the request
> >
> > The last sentence in the first paragraph above just highlights the fact that
> > when input address of PT is in interrupt range then it is blocked by LGN.4
> > or SGN.8 due to output address also in interrupt range.
> >
> > > +     * """
> > > +     *
> > > +     * We enable a per-AS memory region (iommu_ir_fault) for catching
> > > +     * the translation for interrupt range through PASID + PT.
> > > +     */
> > > +    if (pt && as->pasid != PCI_NO_PASID) {
> > > +        memory_region_set_enabled(&as->iommu_ir_fault, true);
> > > +    } else {
> > > +        memory_region_set_enabled(&as->iommu_ir_fault, false);
> > > +    }
> > > +
> >
> > Given above this should be a bug fix for nopasid first and then apply it
> > to pasid path too.
> 
> Actually, nopasid path patches were posted here.
> 
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg867878.html
> 
> Thanks
> 

Can you elaborate why they are handled differently?

Thanks
Kevin

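For readers following the LGN.4/SGN.8 discussion above: the point is that under pass-through (with or without PASID) the output address equals the input address, so an input in the interrupt window is necessarily an output in the interrupt window and must be blocked. A sketch of the range check, with a hypothetical helper name (the VTD_INTERRUPT_ADDR_* constants match the ones in the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VTD_INTERRUPT_ADDR_FIRST 0xfee00000ULL
#define VTD_INTERRUPT_ADDR_LAST  0xfeefffffULL

/* Would a DMA output address land in the 0xFEEx_xxxx interrupt
 * range?  Under PT (and PT + PASID) the output address is the input
 * address, which is why such requests are blocked rather than
 * remapped. */
static bool vtd_addr_in_interrupt_range(uint64_t addr)
{
    return addr >= VTD_INTERRUPT_ADDR_FIRST &&
           addr <= VTD_INTERRUPT_ADDR_LAST;
}
```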
^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH V2 1/4] intel-iommu: don't warn guest errors when getting rid2pasid entry
  2022-03-29  4:52         ` Jason Wang
@ 2022-03-30  8:16           ` Tian, Kevin
  2022-03-30  8:36             ` Jason Wang
  2022-04-22  7:57           ` Michael S. Tsirkin
  1 sibling, 1 reply; 43+ messages in thread
From: Tian, Kevin @ 2022-03-30  8:16 UTC (permalink / raw)
  To: Jason Wang, Liu, Yi L; +Cc: yi.y.sun, qemu-devel, peterx, mst

> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, March 29, 2022 12:52 PM
> >
> >>>
> >>> Currently the implementation of vtd_ce_get_rid2pasid_entry() is also
> >>> problematic. According to VT-d spec, RID2PASID field is effective only
> >>> when ecap.rps is true otherwise PASID#0 is used for RID2PASID. I didn't
> >>> see ecap.rps is set, neither is it checked in that function. It
> >>> works possibly
> >>> just because Linux currently programs 0 to RID2PASID...
> >>
> >> This seems to be another issue since the introduction of scalable mode.
> >
> > yes. this is not introduced in this series. The current scalable mode
> > vIOMMU support was following 3.0 spec, while RPS is added in 3.1. Needs
> > to be fixed.
> 
> 
> Interesting, so this is more complicated when dealing with migration
> compatibility. So what I suggest is probably something like:
> 
> -device intel-iommu,version=$version
> 
> Then we can maintain migration compatibility correctly. For 3.0 we can
> go without RPS and 3.1 and above we need to implement RPS.

This is sensible. Probably a new version number is created only when
it breaks compatibility with an old version, i.e. not necessarily to follow
every release from VT-d spec. In this case we definitely need one from
3.0 to 3.1+ given RID2PASID working on a 3.0 implementation will 
trigger a reserved fault due to RPS not set on a 3.1 implementation.

> 
> Since most of the advanced features has not been implemented, we may
> probably start just from 3.4 (assuming it's the latest version). And all
> of the following effort should be done for 3.4 in order to productize it.
> 

Agree. btw in your understanding is intel-iommu in a production quality
now? If not, do we want to apply this version scheme only when it
reaches the production quality or also in the experimental phase?

Thanks
Kevin

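To make the RPS point above concrete: when ecap.RPS is clear, the RID2PASID field of the scalable-mode context entry must be ignored and PASID#0 used for requests-without-PASID. A sketch of that selection, with a hypothetical helper name and assuming the RID2PASID value sits in the low 20 bits of the context-entry word passed in:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VTD_RID2PASID_MASK 0xfffffULL  /* 20-bit PASID */

/* Pick the PASID used for requests-without-PASID: only when the
 * emulated ecap.RPS is set may the RID2PASID field of the scalable
 * mode context entry be honoured; otherwise PASID#0 is implied. */
static uint32_t vtd_effective_rid2pasid(bool ecap_rps, uint64_t ce_word)
{
    return ecap_rps ? (uint32_t)(ce_word & VTD_RID2PASID_MASK) : 0;
}
```

This also shows why the current code "works": Linux programs RID2PASID to 0, so ignoring RPS and reading the field happen to agree.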
^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 4/4] intel-iommu: PASID support
  2022-03-30  8:02       ` Tian, Kevin
@ 2022-03-30  8:31         ` Jason Wang
  2022-04-02  7:24           ` Tian, Kevin
  0 siblings, 1 reply; 43+ messages in thread
From: Jason Wang @ 2022-03-30  8:31 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Liu, Yi L, yi.y.sun, qemu-devel, peterx, mst

On Wed, Mar 30, 2022 at 4:02 PM Tian, Kevin <kevin.tian@intel.com> wrote:
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, March 29, 2022 12:49 PM
> >
> > On Mon, Mar 28, 2022 at 3:03 PM Tian, Kevin <kevin.tian@intel.com> wrote:
> > >
> > > > From: Jason Wang
> > > > Sent: Monday, March 21, 2022 1:54 PM
> > > >
> > > > +    /*
> > > > +     * vtd-spec v3.4 3.14:
> > > > +     *
> > > > +     * """
> > > > +     * Requests-with-PASID with input address in range 0xFEEx_xxxx are
> > > > +     * translated normally like any other request-with-PASID through
> > > > +     * DMA-remapping hardware. However, if such a request is processed
> > > > +     * using pass-through translation, it will be blocked as described
> > > > +     * in the paragraph below.
> > >
> > > While PASID+PT is blocked as described in the below paragraph, the
> > > paragraph itself applies to all situations:
> > >
> > >   1) PT + noPASID
> > >   2) translation + noPASID
> > >   3) PT + PASID
> > >   4) translation + PASID
> > >
> > > because...
> > >
> > > > +     *
> > > > +     * Software must not program paging-structure entries to remap any
> > > > +     * address to the interrupt address range. Untranslated requests
> > > > +     * and translation requests that result in an address in the
> > > > +     * interrupt range will be blocked with condition code LGN.4 or
> > > > +     * SGN.8.
> > >
> > > ... if you look at the definition of LGN.4 or SGN.8:
> > >
> > > LGN.4:  When legacy mode (RTADDR_REG.TTM=00b) is enabled, hardware
> > >         detected an output address (i.e. address after remapping) in the
> > >         interrupt address range (0xFEEx_xxxx). For Translated requests and
> > >         requests with pass-through translation type (TT=10), the output
> > >         address is the same as the address in the request
> > >
> > > The last sentence in the first paragraph above just highlights the fact that
> > > when input address of PT is in interrupt range then it is blocked by LGN.4
> > > or SGN.8 due to output address also in interrupt range.
> > >
> > > > +     * """
> > > > +     *
> > > > +     * We enable a per-AS memory region (iommu_ir_fault) for catching
> > > > +     * the translation for interrupt range through PASID + PT.
> > > > +     */
> > > > +    if (pt && as->pasid != PCI_NO_PASID) {
> > > > +        memory_region_set_enabled(&as->iommu_ir_fault, true);
> > > > +    } else {
> > > > +        memory_region_set_enabled(&as->iommu_ir_fault, false);
> > > > +    }
> > > > +
> > >
> > > Given above this should be a bug fix for nopasid first and then apply it
> > > to pasid path too.
> >
> > Actually, nopasid path patches were posted here.
> >
> > https://www.mail-archive.com/qemu-devel@nongnu.org/msg867878.html
> >
> > Thanks
> >
>
> Can you elaborate why they are handled differently?

It's because that patch is for the case where pasid mode is not
implemented. We might need it for -stable.

Thanks

>
> Thanks
> Kevin



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 4/4] intel-iommu: PASID support
  2022-03-30  8:00           ` Tian, Kevin
@ 2022-03-30  8:32             ` Jason Wang
  2022-04-02  7:27               ` Tian, Kevin
  0 siblings, 1 reply; 43+ messages in thread
From: Jason Wang @ 2022-03-30  8:32 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Liu, Yi L, yi.y.sun, qemu-devel, peterx, mst

On Wed, Mar 30, 2022 at 4:00 PM Tian, Kevin <kevin.tian@intel.com> wrote:
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, March 29, 2022 12:47 PM
> >
> > On Mon, Mar 28, 2022 at 2:47 PM Tian, Kevin <kevin.tian@intel.com> wrote:
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Monday, March 28, 2022 10:31 AM
> > > >
> > > > On Thu, Mar 24, 2022 at 4:54 PM Tian, Kevin <kevin.tian@intel.com>
> > wrote:
> > > > >
> > > > > > From: Jason Wang
> > > > > > Sent: Monday, March 21, 2022 1:54 PM
> > > > > >
> > > > > > This patch introduces ECAP_PASID via "x-pasid-mode". Based on the
> > > > > > existing support for scalable mode, we need to implement the
> > following
> > > > > > missing parts:
> > > > > >
> > > > > > 1) tag VTDAddressSpace with PASID and support IOMMU/DMA
> > > > translation
> > > > > >    with PASID
> > > > > > 2) tag IOTLB with PASID
> > > > >
> > > > > and invalidate desc to flush PASID iotlb, which seems missing in this
> > patch.
> > > >
> > > > It existed in the previous version, but it looks like it will be used
> > > > only for the first level page table which is not supported right now.
> > > > So I deleted the codes.
> > >
> > > You are right. But there is also PASID-based device TLB invalidate descriptor
> > > which is orthogonal to 1st vs. 2nd level thing. If we don't want to break the
> > > spec with this series then there will need a way to prevent the user from
> > > setting both "device-iotlb" and "x-pasid-mode" together.
> >
> > Right, let me do it in the next version.
> >
> >
> > >
> > > >
> > > > >
> > > > > > 3) PASID cache and its flush
> > > > > > 4) Fault recording with PASID
> > > > > >
> > > > > > For simplicity:
> > > > > >
> > > > > > 1) PASID cache is not implemented so we can simply implement the
> > PASID
> > > > > > cache flush as a nop.
> > > > > > 2) Fault recording with PASID is not supported, NFR is not changed.
> > > > > >
> > > > > > All of the above is not mandatory and could be implemented in the
> > > > > > future.
> > > > >
> > > > > PASID cache is optional, but fault recording with PASID is required.
> > > >
> > > > Any pointer in the spec to say something like this? I think sticking
> > > > to the NFR would be sufficient.
> > >
> > > I didn't remember any place in spec saying that fault recording with PASID
> > is
> > > not required when PASID capability is exposed.
> >
> > Ok, but as a spec it needs to clarify what is required for each capability.
>
> It is clarified in 10.4.14 Fault Recording Registers:
>
>   "PV: PASID Value": PASID value used by the faulted request.
>   For requests with PASID, this field reports the PASID value that
>   came with the request. Hardware implementations not supporting
>   PASID (PASID field Clear in Extended Capability register) and not
>   supporting RID_PASID (RPS field Clear in Extended Capability
>   Register) implement this field as RsvdZ.
>
> Above reflects that when PASID capability is enabled the PV field
> should include PASID value for the faulted request.
>
> Similar description can be found in another field "PP: PASID Present"

Ok.

>
> >
> > > If there is certain fault
> > > triggered by a request with PASID, we do want to report this information
> > > upward.
> >
> > I tend to do it incrementally on top of this series (anyhow at least
> > RID2PASID is introduced before this series)
>
> Yes, RID2PASID should have been recorded too but it's not done correctly.
>
> If you do it in separate series, it implies that you will introduce another
> "x-pasid-fault' to guard the new logic related to PASID fault recording?

Something like this. As said previously, if it's a real problem, it
has existed since the introduction of rid2pasid; it's not specific to
this patch.

But I can add the fault recording if you insist.

>
> >
> > >
> > > btw can you elaborate why NFR matters to PASID? It is just about the
> > > number of fault recording register...
> >
> > I might be wrong, but I thought without increasing NFR we may lack
> > sufficient room for reporting PASID.
>
> I think they are orthogonal things.

Ok.

>
> >
> > >
> > > >
> > > > > I'm fine with adding it incrementally but want to clarify the concept first.
> > > >
> > > > Yes, that's the plan.
> > > >
> > >
> > > I have one open which requires your input.
> > >
> > > While incrementally enabling things does be a common practice, one worry
> > > is whether we want to create too many control knobs in the staging process
> > > to cause confusion to the end user.
> >
> > It should be fine as long as we use the "x-" prefix which will be
> > finally removed.
>
> Good to learn.
>
> >
> > >
> > > Earlier when Yi proposed Qemu changes for guest SVA [1] he aimed for a
> > > coarse-grained knob design:
> > > --
> > >   Intel VT-d 3.0 introduces scalable mode, and it has a bunch of capabilities
> > >   related to scalable mode translation, thus there are multiple combinations.
> > >   While this vIOMMU implementation wants simplify it for user by providing
> > >   typical combinations. User could config it by "x-scalable-mode" option.
> > The
> > >   usage is as below:
> > >     "-device intel-iommu,x-scalable-mode=["legacy"|"modern"]"
> > >
> > >     - "legacy": gives support for SL page table
> > >     - "modern": gives support for FL page table, pasid, virtual command
> > >     -  if not configured, means no scalable mode support, if not proper
> > >        configured, will throw error
> > > --
> > >
> > > Which way do you prefer to?
> > >
> > > [1] https://lists.gnu.org/archive/html/qemu-devel/2020-02/msg02805.html
> >
> > My understanding is that, if we want to deploy Qemu in a production
> > environment, we can't use the "x-" prefix. We need a full
> > implementation of each cap.
> >
> > E.g
> > -device intel-iommu,first-level=on,scalable-mode=on etc.
> >
>
> You meant each cap will get a separate control option?
>
> But that way requires the management stack or admin to have deep
> knowledge about how combinations of different capabilities work, e.g.
> if just turning on scalable mode w/o first-level cannot support vSVA
> on assigned devices. Is this a common practice when defining Qemu
> parameters?

We can have a safe and sensible default value for each cap. E.g.:

in QEMU 8.0, once we think scalable mode is mature, we can enable it
by default;
in QEMU 8.1, once we think first-level is mature, we can enable it by
default as well.
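That staging idea can be sketched as a table of version-gated defaults. The field names and version cutoffs below are hypothetical examples only, not committed plans:

```c
#include <stdbool.h>

/* Illustrative sketch: each QEMU release can flip a capability's default
 * once it is judged mature. Nothing here is from the series itself. */
struct vtd_caps {
    bool scalable_mode;
    bool first_level;
};

static struct vtd_caps vtd_default_caps(int major, int minor)
{
    struct vtd_caps caps = { false, false };
    if (major > 8 || (major == 8 && minor >= 0)) {
        caps.scalable_mode = true;   /* e.g. judged mature as of 8.0 */
    }
    if (major > 8 || (major == 8 && minor >= 1)) {
        caps.first_level = true;     /* e.g. judged mature as of 8.1 */
    }
    return caps;
}
```

Older machine types would keep the old defaults for migration compatibility; only the default flips per release, while the explicit per-cap option always wins.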

Thanks

>
> Thanks
> Kevin




* Re: [PATCH V2 1/4] intel-iommu: don't warn guest errors when getting rid2pasid entry
  2022-03-30  8:16           ` Tian, Kevin
@ 2022-03-30  8:36             ` Jason Wang
  2022-04-02  7:33               ` Tian, Kevin
  2022-04-22  0:13               ` Peter Xu
  0 siblings, 2 replies; 43+ messages in thread
From: Jason Wang @ 2022-03-30  8:36 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Liu, Yi L, yi.y.sun, qemu-devel, peterx, mst

On Wed, Mar 30, 2022 at 4:16 PM Tian, Kevin <kevin.tian@intel.com> wrote:
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, March 29, 2022 12:52 PM
> > >
> > >>>
> > >>> Currently the implementation of vtd_ce_get_rid2pasid_entry() is also
> > >>> problematic. According to VT-d spec, RID2PASID field is effective only
> > >>> when ecap.rps is true otherwise PASID#0 is used for RID2PASID. I didn't
> > >>> see ecap.rps is set, neither is it checked in that function. It
> > >>> works possibly
> > >>> just because Linux currently programs 0 to RID2PASID...
> > >>
> > >> This seems to be another issue since the introduction of scalable mode.
> > >
> > > yes. this is not introduced in this series. The current scalable mode
> > > vIOMMU support was following 3.0 spec, while RPS is added in 3.1. Needs
> > > to be fixed.
> >
> >
> > Interesting, so this is more complicated when dealing with migration
> > compatibility. So what I suggest is probably something like:
> >
> > -device intel-iommu,version=$version
> >
> > Then we can maintain migration compatibility correctly. For 3.0 we can
> > go without RPS and 3.1 and above we need to implement RPS.
>
> This is sensible. Probably a new version number is created only when
> it breaks compatibility with an old version, i.e. not necessarily to follow
> every release from VT-d spec. In this case we definitely need one from
> 3.0 to 3.1+ given RID2PASID working on a 3.0 implementation will
> trigger a reserved fault due to RPS not set on a 3.1 implementation.

3.0 should be fine, but I need to check whether there's another
difference for PASID mode.

It would be helpful if there were a chapter in the spec describing the
differences in behaviour between versions.
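The RPS rule at the root of this compatibility question (quoted earlier in the thread) can be sketched as follows. The helper name and the bit position are illustrative only; consult the spec for the real ECAP_REG layout:

```c
#include <stdint.h>

/* Sketch of the VT-d 3.1 RPS semantics under discussion: the RID2PASID
 * field in the scalable-mode context entry is effective only when
 * ECAP_REG.RPS is set; otherwise PASID 0 is used. A 3.0-style guest
 * programming RID2PASID without RPS would therefore fall back to 0. */
#define VTD_ECAP_RPS_DEMO (1ULL << 49)   /* illustrative bit position */

static uint32_t vtd_effective_rid2pasid(uint64_t ecap, uint32_t ce_rid2pasid)
{
    return (ecap & VTD_ECAP_RPS_DEMO) ? ce_rid2pasid : 0;
}
```

This also shows why the current code "works" today: Linux programs 0 to RID2PASID, so both branches return the same value.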

>
> >
> > Since most of the advanced features has not been implemented, we may
> > probably start just from 3.4 (assuming it's the latest version). And all
> > of the following effort should be done for 3.4 in order to productize it.
> >
>
> Agree. btw in your understanding is intel-iommu in a production quality
> now?

Red Hat supports vIOMMU for the guest DPDK path now.

For scalable mode we need to see some use cases before we can evaluate
it. virtio SVA could be a possible use case, but it requires more
work, e.g. a PRS queue.

> If not, do we want to apply this version scheme only when it
> reaches the production quality or also in the experimental phase?

Yes. E.g if we think scalable mode is mature, we can enable 3.0.

Thanks

>
> Thanks
> Kevin




* Re: [PATCH V2 4/4] intel-iommu: PASID support
  2022-03-29  4:54     ` Jason Wang
@ 2022-04-01 13:42       ` Yi Liu
  2022-04-02  1:52         ` Jason Wang
  0 siblings, 1 reply; 43+ messages in thread
From: Yi Liu @ 2022-04-01 13:42 UTC (permalink / raw)
  To: Jason Wang, mst, peterx; +Cc: yi.y.sun, qemu-devel



On 2022/3/29 12:54, Jason Wang wrote:
> 
> 在 2022/3/28 下午4:45, Yi Liu 写道:
>>
>>
>> On 2022/3/21 13:54, Jason Wang wrote:
>>> This patch introduces ECAP_PASID via "x-pasid-mode". Based on the
>>> existing support for scalable mode, we need to implement the following
>>> missing parts:
>>>
>>> 1) tag VTDAddressSpace with PASID and support IOMMU/DMA translation
>>>     with PASID
>>
>> should it be tagging with bdf+pasid?
> 
> 
> The problem is BDF is programmable by the guest. So we may end up 
> duplicated BDFs. That's why the code uses struct PCIBus.

How about the devfn? Will it also change? Tagging the address space
with BDF+PASID mostly suits the spec, since the PASID table is per-BDF.
If the bus number may change, using PCIBus is fine.

Regards,
Yi Liu



* Re: [PATCH V2 4/4] intel-iommu: PASID support
  2022-04-01 13:42       ` Yi Liu
@ 2022-04-02  1:52         ` Jason Wang
  0 siblings, 0 replies; 43+ messages in thread
From: Jason Wang @ 2022-04-02  1:52 UTC (permalink / raw)
  To: Yi Liu; +Cc: yi.y.sun, qemu-devel, Peter Xu, mst

On Fri, Apr 1, 2022 at 9:43 PM Yi Liu <yi.l.liu@intel.com> wrote:
>
>
>
> On 2022/3/29 12:54, Jason Wang wrote:
> >
> > 在 2022/3/28 下午4:45, Yi Liu 写道:
> >>
> >>
> >> On 2022/3/21 13:54, Jason Wang wrote:
> >>> This patch introduces ECAP_PASID via "x-pasid-mode". Based on the
> >>> existing support for scalable mode, we need to implement the following
> >>> missing parts:
> >>>
> >>> 1) tag VTDAddressSpace with PASID and support IOMMU/DMA translation
> >>>     with PASID
> >>
> >> should it be tagging with bdf+pasid?
> >
> >
> > The problem is BDF is programmable by the guest. So we may end up
> > duplicated BDFs. That's why the code uses struct PCIBus.
>
> how about the devfn? will it also change?

The code already uses devfn, doesn't it?

struct vtd_as_key {
    PCIBus *bus;
    uint8_t devfn;
    uint32_t pasid;
};
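For illustration, such a key might feed a pointer-identity hash along these lines. The helpers below are a hypothetical sketch, not code from the series, and PCIBus is kept opaque so the example stays self-contained:

```c
#include <stdint.h>

/* Hypothetical sketch: a (bus, devfn, pasid) address-space key. Pointer
 * identity of the bus stays stable even if the guest reprograms bus
 * numbers, which is why PCIBus * is used rather than the bus number. */
struct vtd_as_key {
    void *bus;        /* PCIBus * in QEMU; opaque here */
    uint8_t devfn;
    uint32_t pasid;
};

static unsigned vtd_as_key_hash(const struct vtd_as_key *key)
{
    uintptr_t h = (uintptr_t)key->bus;
    h = h * 31 + key->devfn;
    h = h * 31 + key->pasid;
    return (unsigned)h;
}

static int vtd_as_key_equal(const struct vtd_as_key *a,
                            const struct vtd_as_key *b)
{
    return a->bus == b->bus && a->devfn == b->devfn && a->pasid == b->pasid;
}

/* Tiny self-check: identical keys hash and compare equal; a different
 * PASID on the same device yields a distinct address-space key. */
static int vtd_as_key_demo(void)
{
    int dummy;
    struct vtd_as_key a = { &dummy, 0x10, 1 };
    struct vtd_as_key b = { &dummy, 0x10, 1 };
    struct vtd_as_key c = { &dummy, 0x10, 2 };
    return vtd_as_key_equal(&a, &b) &&
           vtd_as_key_hash(&a) == vtd_as_key_hash(&b) &&
           !vtd_as_key_equal(&a, &c);
}
```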

Thanks

>  taggiing addressspace with
> BDF+PASID mostly suits the spec since the PASID table is per-bdf. If
> bus number may change, using PCIBus is fine.
>
> Regards,
> Yi Liu
>




* RE: [PATCH V2 4/4] intel-iommu: PASID support
  2022-03-30  8:31         ` Jason Wang
@ 2022-04-02  7:24           ` Tian, Kevin
  2022-04-06  3:31             ` Jason Wang
  0 siblings, 1 reply; 43+ messages in thread
From: Tian, Kevin @ 2022-04-02  7:24 UTC (permalink / raw)
  To: Jason Wang; +Cc: Liu, Yi L, yi.y.sun, qemu-devel, peterx, mst

> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, March 30, 2022 4:32 PM
> 
> On Wed, Mar 30, 2022 at 4:02 PM Tian, Kevin <kevin.tian@intel.com> wrote:
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Tuesday, March 29, 2022 12:49 PM
> > >
> > > On Mon, Mar 28, 2022 at 3:03 PM Tian, Kevin <kevin.tian@intel.com>
> wrote:
> > > >
> > > > > From: Jason Wang
> > > > > Sent: Monday, March 21, 2022 1:54 PM
> > > > >
> > > > > +    /*
> > > > > +     * vtd-spec v3.4 3.14:
> > > > > +     *
> > > > > +     * """
> > > > > +     * Requests-with-PASID with input address in range 0xFEEx_xxxx
> are
> > > > > +     * translated normally like any other request-with-PASID through
> > > > > +     * DMA-remapping hardware. However, if such a request is
> processed
> > > > > +     * using pass-through translation, it will be blocked as described
> > > > > +     * in the paragraph below.
> > > >
> > > > While PASID+PT is blocked as described in the below paragraph, the
> > > > paragraph itself applies to all situations:
> > > >
> > > >   1) PT + noPASID
> > > >   2) translation + noPASID
> > > >   3) PT + PASID
> > > >   4) translation + PASID
> > > >
> > > > because...
> > > >
> > > > > +     *
> > > > > +     * Software must not program paging-structure entries to remap
> any
> > > > > +     * address to the interrupt address range. Untranslated requests
> > > > > +     * and translation requests that result in an address in the
> > > > > +     * interrupt range will be blocked with condition code LGN.4 or
> > > > > +     * SGN.8.
> > > >
> > > > ... if you look at the definition of LGN.4 or SGN.8:
> > > >
> > > > LGN.4:  When legacy mode (RTADDR_REG.TTM=00b) is enabled,
> hardware
> > > >         detected an output address (i.e. address after remapping) in the
> > > >         interrupt address range (0xFEEx_xxxx). For Translated requests and
> > > >         requests with pass-through translation type (TT=10), the output
> > > >         address is the same as the address in the request
> > > >
> > > > The last sentence in the first paragraph above just highlights the fact
> that
> > > > when input address of PT is in interrupt range then it is blocked by
> LGN.4
> > > > or SGN.8 due to output address also in interrupt range.
> > > >
> > > > > +     * """
> > > > > +     *
> > > > > +     * We enable per as memory region (iommu_ir_fault) for catching
> > > > > +     * the translation for interrupt range through PASID + PT.
> > > > > +     */
> > > > > +    if (pt && as->pasid != PCI_NO_PASID) {
> > > > > +        memory_region_set_enabled(&as->iommu_ir_fault, true);
> > > > > +    } else {
> > > > > +        memory_region_set_enabled(&as->iommu_ir_fault, false);
> > > > > +    }
> > > > > +
> > > >
> > > > Given above this should be a bug fix for nopasid first and then apply it
> > > > to pasid path too.
> > >
> > > Actually, nopasid path patches were posted here.
> > >
> > > https://www.mail-archive.com/qemu-
> devel@nongnu.org/msg867878.html
> > >
> > > Thanks
> > >
> >
> > Can you elaborate why they are handled differently?
> 
> It's because that patch is for the case where pasid mode is not
> implemented. We might need it for -stable.
> 

So will that patch be replaced after this one goes in? In any case
the new iommu_ir_fault region could be applied to both nopasid
and pasid, i.e. there is no need to toggle it when the address space
is switched.

Thanks
Kevin


* RE: [PATCH V2 4/4] intel-iommu: PASID support
  2022-03-30  8:32             ` Jason Wang
@ 2022-04-02  7:27               ` Tian, Kevin
  2022-04-06  3:31                 ` Jason Wang
  2022-04-22 15:03                 ` Peter Xu
  0 siblings, 2 replies; 43+ messages in thread
From: Tian, Kevin @ 2022-04-02  7:27 UTC (permalink / raw)
  To: Jason Wang; +Cc: Liu, Yi L, yi.y.sun, qemu-devel, peterx, mst

> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, March 30, 2022 4:32 PM
> 
> >
> > >
> > > > If there is certain fault
> > > > triggered by a request with PASID, we do want to report this
> information
> > > > upward.
> > >
> > > > I tend to do it incrementally on top of this series (anyhow at least
> > > RID2PASID is introduced before this series)
> >
> > Yes, RID2PASID should have been recorded too but it's not done correctly.
> >
> > If you do it in separate series, it implies that you will introduce another
> > "x-pasid-fault' to guard the new logic related to PASID fault recording?
> 
> Something like this, as said previously, if it's a real problem, it
> exists since the introduction of rid2pasid, not specific to this
> patch.
> 
> But I can add the fault recording if you insist.

I prefer including the fault recording, given it's simple and makes this
change more complete conceptually. 😊

> > > >
> > > > Earlier when Yi proposed Qemu changes for guest SVA [1] he aimed for
> a
> > > > coarse-grained knob design:
> > > > --
> > > >   Intel VT-d 3.0 introduces scalable mode, and it has a bunch of
> capabilities
> > > >   related to scalable mode translation, thus there are multiple
> combinations.
> > > >   While this vIOMMU implementation wants simplify it for user by
> providing
> > > >   typical combinations. User could config it by "x-scalable-mode" option.
> > > The
> > > >   usage is as below:
> > > >     "-device intel-iommu,x-scalable-mode=["legacy"|"modern"]"
> > > >
> > > >     - "legacy": gives support for SL page table
> > > >     - "modern": gives support for FL page table, pasid, virtual command
> > > >     -  if not configured, means no scalable mode support, if not proper
> > > >        configured, will throw error
> > > > --
> > > >
> > > > Which way do you prefer to?
> > > >
> > > > [1] https://lists.gnu.org/archive/html/qemu-devel/2020-
> 02/msg02805.html
> > >
> > > My understanding is that, if we want to deploy Qemu in a production
> > > environment, we can't use the "x-" prefix. We need a full
> > > implementation of each cap.
> > >
> > > E.g
> > > -device intel-iommu,first-level=on,scalable-mode=on etc.
> > >
> >
> > You meant each cap will get a separate control option?
> >
> > But that way requires the management stack or admin to have deep
> > knowledge about how combinations of different capabilities work, e.g.
> > if just turning on scalable mode w/o first-level cannot support vSVA
> > on assigned devices. Is this a common practice when defining Qemu
> > parameters?
> 
> We can have a safe and good default value for each cap. E.g
> 
> In qemu 8.0 we think scalable is mature, we can make scalable to be
> enabled by default
> in qemu 8.1 we think first-level is mature, we can make first level to
> be enabled by default.
> 

OK, that is a workable way.

Thanks
Kevin


* RE: [PATCH V2 1/4] intel-iommu: don't warn guest errors when getting rid2pasid entry
  2022-03-30  8:36             ` Jason Wang
@ 2022-04-02  7:33               ` Tian, Kevin
  2022-04-06  3:33                 ` Jason Wang
  2022-04-22  0:13               ` Peter Xu
  1 sibling, 1 reply; 43+ messages in thread
From: Tian, Kevin @ 2022-04-02  7:33 UTC (permalink / raw)
  To: Jason Wang; +Cc: Liu, Yi L, yi.y.sun, qemu-devel, peterx, mst

> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, March 30, 2022 4:37 PM
> On Wed, Mar 30, 2022 at 4:16 PM Tian, Kevin <kevin.tian@intel.com> wrote:
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Tuesday, March 29, 2022 12:52 PM
> > > >
> > > >>>
> > > >>> Currently the implementation of vtd_ce_get_rid2pasid_entry() is also
> > > >>> problematic. According to VT-d spec, RID2PASID field is effective only
> > > >>> when ecap.rps is true otherwise PASID#0 is used for RID2PASID. I
> didn't
> > > >>> see ecap.rps is set, neither is it checked in that function. It
> > > >>> works possibly
> > > >>> just because Linux currently programs 0 to RID2PASID...
> > > >>
> > > >> This seems to be another issue since the introduction of scalable mode.
> > > >
> > > > yes. this is not introduced in this series. The current scalable mode
> > > > vIOMMU support was following 3.0 spec, while RPS is added in 3.1.
> Needs
> > > > to be fixed.
> > >
> > >
> > > Interesting, so this is more complicated when dealing with migration
> > > compatibility. So what I suggest is probably something like:
> > >
> > > -device intel-iommu,version=$version
> > >
> > > Then we can maintain migration compatibility correctly. For 3.0 we can
> > > go without RPS and 3.1 and above we need to implement RPS.
> >
> > This is sensible. Probably a new version number is created only when
> > it breaks compatibility with an old version, i.e. not necessarily to follow
> > every release from VT-d spec. In this case we definitely need one from
> > 3.0 to 3.1+ given RID2PASID working on a 3.0 implementation will
> > trigger a reserved fault due to RPS not set on a 3.1 implementation.
> 
> 3.0 should be fine, but I need to check whether there's another
> difference for PASID mode.
> 
> It would be helpful if there's a chapter in the spec to describe the
> difference of behaviours.

There is a section called 'Revision History' at the start of the VT-d spec.
It describes the changes in each revision, e.g.:
--
  June 2019, 3.1:

  Added support for RID-PASID capability (RPS field in ECAP_REG).
--

> 
> >
> > >
> > > Since most of the advanced features has not been implemented, we may
> > > probably start just from 3.4 (assuming it's the latest version). And all
> > > of the following effort should be done for 3.4 in order to productize it.
> > >
> >
> > Agree. btw in your understanding is intel-iommu in a production quality
> > now?
> 
> Red Hat supports vIOMMU for the guest DPDK path now.
> 
> For scalable-mode we need to see some use cases then we can evaluate.
> virtio SVA could be a possible use case, but it requires more work e.g
> PRS queue.

Yes, it's not ready for a full evaluation yet.

The current state before your change is exactly feature-on-par with
legacy mode, except for using the scalable format in certain
structures. That alone is not worthy of a formal evaluation.

> 
> > If not, do we want to apply this version scheme only when it
> > reaches the production quality or also in the experimental phase?
> 
> Yes. E.g if we think scalable mode is mature, we can enable 3.0.
> 

Nice to know.

Thanks
Kevin


* Re: [PATCH V2 4/4] intel-iommu: PASID support
  2022-04-02  7:24           ` Tian, Kevin
@ 2022-04-06  3:31             ` Jason Wang
  0 siblings, 0 replies; 43+ messages in thread
From: Jason Wang @ 2022-04-06  3:31 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Liu, Yi L, yi.y.sun, qemu-devel, peterx, mst

On Sat, Apr 2, 2022 at 3:24 PM Tian, Kevin <kevin.tian@intel.com> wrote:
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Wednesday, March 30, 2022 4:32 PM
> >
> > On Wed, Mar 30, 2022 at 4:02 PM Tian, Kevin <kevin.tian@intel.com> wrote:
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Tuesday, March 29, 2022 12:49 PM
> > > >
> > > > On Mon, Mar 28, 2022 at 3:03 PM Tian, Kevin <kevin.tian@intel.com>
> > wrote:
> > > > >
> > > > > > From: Jason Wang
> > > > > > Sent: Monday, March 21, 2022 1:54 PM
> > > > > >
> > > > > > +    /*
> > > > > > +     * vtd-spec v3.4 3.14:
> > > > > > +     *
> > > > > > +     * """
> > > > > > +     * Requests-with-PASID with input address in range 0xFEEx_xxxx
> > are
> > > > > > +     * translated normally like any other request-with-PASID through
> > > > > > +     * DMA-remapping hardware. However, if such a request is
> > processed
> > > > > > +     * using pass-through translation, it will be blocked as described
> > > > > > +     * in the paragraph below.
> > > > >
> > > > > While PASID+PT is blocked as described in the below paragraph, the
> > > > > paragraph itself applies to all situations:
> > > > >
> > > > >   1) PT + noPASID
> > > > >   2) translation + noPASID
> > > > >   3) PT + PASID
> > > > >   4) translation + PASID
> > > > >
> > > > > because...
> > > > >
> > > > > > +     *
> > > > > > +     * Software must not program paging-structure entries to remap
> > any
> > > > > > +     * address to the interrupt address range. Untranslated requests
> > > > > > +     * and translation requests that result in an address in the
> > > > > > +     * interrupt range will be blocked with condition code LGN.4 or
> > > > > > +     * SGN.8.
> > > > >
> > > > > ... if you look at the definition of LGN.4 or SGN.8:
> > > > >
> > > > > LGN.4:  When legacy mode (RTADDR_REG.TTM=00b) is enabled,
> > hardware
> > > > >         detected an output address (i.e. address after remapping) in the
> > > > >         interrupt address range (0xFEEx_xxxx). For Translated requests and
> > > > >         requests with pass-through translation type (TT=10), the output
> > > > >         address is the same as the address in the request
> > > > >
> > > > > The last sentence in the first paragraph above just highlights the fact
> > that
> > > > > when input address of PT is in interrupt range then it is blocked by
> > LGN.4
> > > > > or SGN.8 due to output address also in interrupt range.
> > > > >
> > > > > > +     * """
> > > > > > +     *
> > > > > > +     * We enable per as memory region (iommu_ir_fault) for catching
> > > > > > +     * the translation for interrupt range through PASID + PT.
> > > > > > +     */
> > > > > > +    if (pt && as->pasid != PCI_NO_PASID) {
> > > > > > +        memory_region_set_enabled(&as->iommu_ir_fault, true);
> > > > > > +    } else {
> > > > > > +        memory_region_set_enabled(&as->iommu_ir_fault, false);
> > > > > > +    }
> > > > > > +
> > > > >
> > > > > Given above this should be a bug fix for nopasid first and then apply it
> > > > > to pasid path too.
> > > >
> > > > Actually, nopasid path patches were posted here.
> > > >
> > > > https://www.mail-archive.com/qemu-
> > devel@nongnu.org/msg867878.html
> > > >
> > > > Thanks
> > > >
> > >
> > > Can you elaborate why they are handled differently?
> >
> > It's because that patch is for the case where pasid mode is not
> > implemented. We might need it for -stable.
> >
>
> So will that patch be replaced after this one goes in?

That patch will be merged first if I understand correctly. Then this
patch could be applied on top.

> By any means
> the new iommu_ir_fault region could be applied to both nopasid
> and pasid i.e. no need toggle it when address space is switched.

Actually it's needed only when PT is enabled. When PT is disabled, the
translation is done via iommu_translate.

Considering the previous patch will be merged first, I will fix the
!PT case in the next version.
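The toggle being settled here boils down to a small predicate; a hedged sketch follows (the helper name is hypothetical, while the PCI_NO_PASID sentinel and the condition mirror the hunk quoted above):

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch: the per-address-space iommu_ir_fault region only needs to be
 * live for pass-through requests-with-PASID; for non-PT, the interrupt
 * range check can happen inside the translate callback instead.
 * PCI_NO_PASID is the "no PASID tagged" sentinel from the quoted patch. */
#define PCI_NO_PASID ((uint32_t)-1)

static bool vtd_need_ir_fault_region(bool pt, uint32_t pasid)
{
    return pt && pasid != PCI_NO_PASID;
}
```

The region is then enabled or disabled with memory_region_set_enabled() whenever the address space is switched, exactly as in the quoted hunk.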

Thanks

>
> Thanks
> Kevin




* Re: [PATCH V2 4/4] intel-iommu: PASID support
  2022-04-02  7:27               ` Tian, Kevin
@ 2022-04-06  3:31                 ` Jason Wang
  2022-04-22 15:03                 ` Peter Xu
  1 sibling, 0 replies; 43+ messages in thread
From: Jason Wang @ 2022-04-06  3:31 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Liu, Yi L, yi.y.sun, qemu-devel, peterx, mst

On Sat, Apr 2, 2022 at 3:27 PM Tian, Kevin <kevin.tian@intel.com> wrote:
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Wednesday, March 30, 2022 4:32 PM
> >
> > >
> > > >
> > > > > If there is certain fault
> > > > > triggered by a request with PASID, we do want to report this
> > information
> > > > > upward.
> > > >
> > > > I tend to do it incrementally on top of this series (anyhow at least
> > > > RID2PASID is introduced before this series)
> > >
> > > Yes, RID2PASID should have been recorded too but it's not done correctly.
> > >
> > > If you do it in separate series, it implies that you will introduce another
> > > "x-pasid-fault' to guard the new logic related to PASID fault recording?
> >
> > Something like this, as said previously, if it's a real problem, it
> > exists since the introduction of rid2pasid, not specific to this
> > patch.
> >
> > But I can add the fault recording if you insist.
>
> I prefer to including the fault recording given it's simple and makes this
> change more complete in concept. 😊

That's fine.

Thanks

>
> > > > >
> > > > > Earlier when Yi proposed Qemu changes for guest SVA [1] he aimed for
> > a
> > > > > coarse-grained knob design:
> > > > > --
> > > > >   Intel VT-d 3.0 introduces scalable mode, and it has a bunch of
> > capabilities
> > > > >   related to scalable mode translation, thus there are multiple
> > combinations.
> > > > >   While this vIOMMU implementation wants simplify it for user by
> > providing
> > > > >   typical combinations. User could config it by "x-scalable-mode" option.
> > > > The
> > > > >   usage is as below:
> > > > >     "-device intel-iommu,x-scalable-mode=["legacy"|"modern"]"
> > > > >
> > > > >     - "legacy": gives support for SL page table
> > > > >     - "modern": gives support for FL page table, pasid, virtual command
> > > > >     -  if not configured, means no scalable mode support, if not proper
> > > > >        configured, will throw error
> > > > > --
> > > > >
> > > > > Which way do you prefer to?
> > > > >
> > > > > [1] https://lists.gnu.org/archive/html/qemu-devel/2020-
> > 02/msg02805.html
> > > >
> > > > My understanding is that, if we want to deploy Qemu in a production
> > > > environment, we can't use the "x-" prefix. We need a full
> > > > implementation of each cap.
> > > >
> > > > E.g
> > > > -device intel-iommu,first-level=on,scalable-mode=on etc.
> > > >
> > >
> > > You meant each cap will get a separate control option?
> > >
> > > But that way requires the management stack or admin to have deep
> > > knowledge about how combinations of different capabilities work, e.g.
> > > if just turning on scalable mode w/o first-level cannot support vSVA
> > > on assigned devices. Is this a common practice when defining Qemu
> > > parameters?
> >
> > We can have a safe and good default value for each cap. E.g
> >
> > In qemu 8.0 we think scalable is mature, we can make scalable to be
> > enabled by default
> > in qemu 8.1 we think first-level is mature, we can make first level to
> > be enabled by default.
> >
>
> OK, that is a workable way.
>
> Thanks
> Kevin




* Re: [PATCH V2 1/4] intel-iommu: don't warn guest errors when getting rid2pasid entry
  2022-04-02  7:33               ` Tian, Kevin
@ 2022-04-06  3:33                 ` Jason Wang
  2022-04-06  3:41                   ` Tian, Kevin
  0 siblings, 1 reply; 43+ messages in thread
From: Jason Wang @ 2022-04-06  3:33 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Liu, Yi L, yi.y.sun, qemu-devel, peterx, mst

On Sat, Apr 2, 2022 at 3:34 PM Tian, Kevin <kevin.tian@intel.com> wrote:
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Wednesday, March 30, 2022 4:37 PM
> > On Wed, Mar 30, 2022 at 4:16 PM Tian, Kevin <kevin.tian@intel.com> wrote:
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Tuesday, March 29, 2022 12:52 PM
> > > > >
> > > > >>>
> > > > >>> Currently the implementation of vtd_ce_get_rid2pasid_entry() is also
> > > > >>> problematic. According to VT-d spec, RID2PASID field is effective only
> > > > >>> when ecap.rps is true otherwise PASID#0 is used for RID2PASID. I
> > didn't
> > > > >>> see ecap.rps is set, neither is it checked in that function. It
> > > > >>> works possibly
> > > > >>> just because Linux currently programs 0 to RID2PASID...
> > > > >>
> > > > >> This seems to be another issue since the introduction of scalable mode.
> > > > >
> > > > > yes. this is not introduced in this series. The current scalable mode
> > > > > vIOMMU support was following 3.0 spec, while RPS is added in 3.1.
> > Needs
> > > > > to be fixed.
> > > >
> > > >
> > > > Interesting, so this is more complicated when dealing with migration
> > > > compatibility. So what I suggest is probably something like:
> > > >
> > > > -device intel-iommu,version=$version
> > > >
> > > > Then we can maintain migration compatibility correctly. For 3.0 we can
> > > > go without RPS and 3.1 and above we need to implement RPS.
> > >
> > > This is sensible. Probably a new version number is created only when
> > > it breaks compatibility with an old version, i.e. not necessarily to follow
> > > every release from VT-d spec. In this case we definitely need one from
> > > 3.0 to 3.1+ given RID2PASID working on a 3.0 implementation will
> > > trigger a reserved fault due to RPS not set on a 3.1 implementation.
> >
> > 3.0 should be fine, but I need to check whether there's another
> > difference for PASID mode.
> >
> > It would be helpful if there's a chapter in the spec to describe the
> > difference of behaviours.
>
> There is a section called 'Revision History' in the start of the VT-d spec.
> It talks about changes in each revision, e.g.:
> --
>   June 2019, 3.1:
>
>   Added support for RID-PASID capability (RPS field in ECAP_REG).

Good to know. Does it mean that, except for the changes listed in this
revision history, all the other semantics stay backward compatible
across versions?

> --
>
> >
> > >
> > > >
> > > > Since most of the advanced features have not been implemented, we
> > > > should probably start just from 3.4 (assuming it's the latest version).
> > > > All of the following effort should then be done against 3.4 in order
> > > > to productize it.
> > > >
> > >
> > > Agree. btw in your understanding is intel-iommu in a production quality
> > > now?
> >
> > Red Hat supports vIOMMU for the guest DPDK path now.
> >
> > For scalable-mode we need to see some use cases then we can evaluate.
> > virtio SVA could be a possible use case, but it requires more work e.g
> > PRS queue.
>
> Yes it's not ready for full evaluation yet.
>
> The current state before your change is exactly feature-on-par with the
> legacy mode, except using scalable format in certain structures. That alone
> is not worthy of a formal evaluation.

Right.

Thanks

>
> >
> > > If not, do we want to apply this version scheme only when it
> > > reaches the production quality or also in the experimental phase?
> >
> > Yes. E.g if we think scalable mode is mature, we can enable 3.0.
> >
>
> Nice to know.
>
> Thanks
> Kevin




* RE: [PATCH V2 1/4] intel-iommu: don't warn guest errors when getting rid2pasid entry
  2022-04-06  3:33                 ` Jason Wang
@ 2022-04-06  3:41                   ` Tian, Kevin
  0 siblings, 0 replies; 43+ messages in thread
From: Tian, Kevin @ 2022-04-06  3:41 UTC (permalink / raw)
  To: Jason Wang; +Cc: Liu, Yi L, yi.y.sun, qemu-devel, peterx, mst

> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, April 6, 2022 11:33 AM
> To: Tian, Kevin <kevin.tian@intel.com>
> Cc: Liu, Yi L <yi.l.liu@intel.com>; mst@redhat.com; peterx@redhat.com;
> yi.y.sun@linux.intel.com; qemu-devel@nongnu.org
> Subject: Re: [PATCH V2 1/4] intel-iommu: don't warn guest errors when
> getting rid2pasid entry
> 
> On Sat, Apr 2, 2022 at 3:34 PM Tian, Kevin <kevin.tian@intel.com> wrote:
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Wednesday, March 30, 2022 4:37 PM
> > > On Wed, Mar 30, 2022 at 4:16 PM Tian, Kevin <kevin.tian@intel.com>
> wrote:
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Tuesday, March 29, 2022 12:52 PM
> > > > > >
> > > > > >>>
> > > > > >>> Currently the implementation of vtd_ce_get_rid2pasid_entry() is
> also
> > > > > >>> problematic. According to VT-d spec, RID2PASID field is effective
> only
> > > > > >>> when ecap.rps is true otherwise PASID#0 is used for RID2PASID. I
> > > didn't
> > > > > >>> see ecap.rps is set, neither is it checked in that function. It
> > > > > >>> works possibly
> > > > > >>> just because Linux currently programs 0 to RID2PASID...
> > > > > >>
> > > > > >> This seems to be another issue since the introduction of scalable
> mode.
> > > > > >
> > > > > > yes. this is not introduced in this series. The current scalable mode
> > > > > > vIOMMU support was following 3.0 spec, while RPS is added in 3.1.
> > > Needs
> > > > > > to be fixed.
> > > > >
> > > > >
> > > > > Interesting, so this is more complicated when dealing with migration
> > > > > compatibility. So what I suggest is probably something like:
> > > > >
> > > > > -device intel-iommu,version=$version
> > > > >
> > > > > Then we can maintain migration compatibility correctly. For 3.0 we
> can
> > > > > go without RPS and 3.1 and above we need to implement RPS.
> > > >
> > > > This is sensible. Probably a new version number is created only when
> > > > it breaks compatibility with an old version, i.e. not necessarily to follow
> > > > every release from VT-d spec. In this case we definitely need one from
> > > > 3.0 to 3.1+ given RID2PASID working on a 3.0 implementation will
> > > > trigger a reserved fault due to RPS not set on a 3.1 implementation.
> > >
> > > 3.0 should be fine, but I need to check whether there's another
> > > difference for PASID mode.
> > >
> > > It would be helpful if there's a chapter in the spec to describe the
> > > difference of behaviours.
> >
> > There is a section called 'Revision History' in the start of the VT-d spec.
> > It talks about changes in each revision, e.g.:
> > --
> >   June 2019, 3.1:
> >
> >   Added support for RID-PASID capability (RPS field in ECAP_REG).
> 
> Good to know. Does it mean that, except for the changes listed in this
> revision history, all the other semantics stay backward compatible
> across versions?

Yes, and if you find anything not clarified properly I can help forward
it to the spec owner.

Thanks
Kevin


* Re: [PATCH V2 1/4] intel-iommu: don't warn guest errors when getting rid2pasid entry
  2022-03-30  8:36             ` Jason Wang
  2022-04-02  7:33               ` Tian, Kevin
@ 2022-04-22  0:13               ` Peter Xu
  2022-04-22  6:24                 ` Jason Wang
  1 sibling, 1 reply; 43+ messages in thread
From: Peter Xu @ 2022-04-22  0:13 UTC (permalink / raw)
  To: Jason Wang; +Cc: Tian, Kevin, Liu, Yi L, yi.y.sun, qemu-devel, mst

On Wed, Mar 30, 2022 at 04:36:36PM +0800, Jason Wang wrote:
> > If not, do we want to apply this version scheme only when it
> > reaches the production quality or also in the experimental phase?
> 
> Yes. E.g if we think scalable mode is mature, we can enable 3.0.

Sorry to come back to the discussion late..

I'd say we hold off on versioning until someone (or some organization)
strongly asks for a stable interface for scalable mode (ideally with a
developer from that organization looking after it); only then should we
start with versioning.

Otherwise I hope we can be free to break the interface assuming things are
still evolving, just like the spec.

Thanks,

-- 
Peter Xu




* Re: [PATCH V2 2/4] intel-iommu: drop VTDBus
  2022-03-21  5:54 ` [PATCH V2 2/4] intel-iommu: drop VTDBus Jason Wang
@ 2022-04-22  1:17   ` Peter Xu
  2022-04-22  6:26     ` Jason Wang
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Xu @ 2022-04-22  1:17 UTC (permalink / raw)
  To: Jason Wang; +Cc: yi.l.liu, yi.y.sun, qemu-devel, mst

Hi, Jason,

Mostly good to me, just a few nitpicks below.

On Mon, Mar 21, 2022 at 01:54:27PM +0800, Jason Wang wrote:
> We introduce the VTDBus structure as an intermediate step for searching
> the address space. This works well with SID based matching/lookup. But
> when we want to support SID plus PASID based address space lookup,
> this intermediate step turns out to be a burden. So this patch simply
> drops the VTDBus structure and uses the PCIBus and devfn as the key
> for the g_hash_table(). This simplifies the code and the future PASID
> extension.
> 
> To avoid slowing down the existing vtd_find_as_from_bus_num() callers,
> a vtd_as cache indexed by the bus number is introduced to store the
> most recent search result for a vtd_as belonging to a specific bus.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  hw/i386/intel_iommu.c         | 238 +++++++++++++++++-----------------
>  include/hw/i386/intel_iommu.h |  11 +-
>  2 files changed, 123 insertions(+), 126 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 90964b201c..5851a17d0e 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -61,6 +61,16 @@
>      }                                                                         \
>  }
>  
> +/*
> + * PCI bus number (or SID) is not reliable since the device is usually
> + * initialized before the guest can configure the PCI bridge
> + * (SECONDARY_BUS_NUMBER).
> + */
> +struct vtd_as_key {
> +    PCIBus *bus;
> +    uint8_t devfn;
> +};
> +
>  static void vtd_address_space_refresh_all(IntelIOMMUState *s);
>  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
>  
> @@ -210,6 +220,31 @@ static guint vtd_uint64_hash(gconstpointer v)
>      return (guint)*(const uint64_t *)v;
>  }
>  
> +static gboolean vtd_as_equal(gconstpointer v1, gconstpointer v2)
> +{
> +    const struct vtd_as_key *key1 = v1;
> +    const struct vtd_as_key *key2 = v2;
> +
> +    return (key1->bus == key2->bus) && (key1->devfn == key2->devfn);
> +}
> +
> +static inline uint16_t vtd_make_source_id(uint8_t bus_num, uint8_t devfn)
> +{
> +    return ((bus_num & 0xffUL) << 8) | (devfn & 0xffUL);
> +}

Nit: we could directly drop this one and use PCI_BUILD_BDF().

> +
> +/*
> + * Note that we use pointer to PCIBus as the key, so hashing/shifting
> + * based on the pointer value is intended.

Thanks for the comment; that helps.

Should we also mention that this hash is not the only interface to identify
two vtd_as*, say, even if on a 32bit system we got last 24 bits collapsed
on two vtd_as* pointers, we can still have vtd_as_equal() to guard us?

> + */
> +static guint vtd_as_hash(gconstpointer v)
> +{
> +    const struct vtd_as_key *key = v;
> +    guint value = (guint)(uintptr_t)key->bus;
> +
> +    return (guint)(value << 8 | key->devfn);
> +}
> +
>  static gboolean vtd_hash_remove_by_domain(gpointer key, gpointer value,
>                                            gpointer user_data)
>  {
> @@ -248,22 +283,14 @@ static gboolean vtd_hash_remove_by_page(gpointer key, gpointer value,
>  static void vtd_reset_context_cache_locked(IntelIOMMUState *s)
>  {
>      VTDAddressSpace *vtd_as;
> -    VTDBus *vtd_bus;
> -    GHashTableIter bus_it;
> -    uint32_t devfn_it;
> +    GHashTableIter as_it;
>  
>      trace_vtd_context_cache_reset();
>  
> -    g_hash_table_iter_init(&bus_it, s->vtd_as_by_busptr);
> +    g_hash_table_iter_init(&as_it, s->vtd_as);
>  
> -    while (g_hash_table_iter_next (&bus_it, NULL, (void**)&vtd_bus)) {
> -        for (devfn_it = 0; devfn_it < PCI_DEVFN_MAX; ++devfn_it) {
> -            vtd_as = vtd_bus->dev_as[devfn_it];
> -            if (!vtd_as) {
> -                continue;
> -            }
> -            vtd_as->context_cache_entry.context_cache_gen = 0;
> -        }
> +    while (g_hash_table_iter_next (&as_it, NULL, (void**)&vtd_as)) {
> +        vtd_as->context_cache_entry.context_cache_gen = 0;
>      }
>      s->context_cache_gen = 1;
>  }
> @@ -993,32 +1020,6 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
>      return slpte & rsvd_mask;
>  }
>  
> -/* Find the VTD address space associated with a given bus number */
> -static VTDBus *vtd_find_as_from_bus_num(IntelIOMMUState *s, uint8_t bus_num)
> -{
> -    VTDBus *vtd_bus = s->vtd_as_by_bus_num[bus_num];
> -    GHashTableIter iter;
> -
> -    if (vtd_bus) {
> -        return vtd_bus;
> -    }
> -
> -    /*
> -     * Iterate over the registered buses to find the one which
> -     * currently holds this bus number and update the bus_num
> -     * lookup table.
> -     */
> -    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
> -    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
> -        if (pci_bus_num(vtd_bus->bus) == bus_num) {
> -            s->vtd_as_by_bus_num[bus_num] = vtd_bus;
> -            return vtd_bus;
> -        }
> -    }
> -
> -    return NULL;
> -}
> -
>  /* Given the @iova, get relevant @slptep. @slpte_level will be the last level
>   * of the translation, can be used for deciding the size of large page.
>   */
> @@ -1634,24 +1635,13 @@ static bool vtd_switch_address_space(VTDAddressSpace *as)
>  
>  static void vtd_switch_address_space_all(IntelIOMMUState *s)
>  {
> +    VTDAddressSpace *vtd_as;
>      GHashTableIter iter;
> -    VTDBus *vtd_bus;
> -    int i;
> -
> -    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
> -    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
> -        for (i = 0; i < PCI_DEVFN_MAX; i++) {
> -            if (!vtd_bus->dev_as[i]) {
> -                continue;
> -            }
> -            vtd_switch_address_space(vtd_bus->dev_as[i]);
> -        }
> -    }
> -}
>  
> -static inline uint16_t vtd_make_source_id(uint8_t bus_num, uint8_t devfn)
> -{
> -    return ((bus_num & 0xffUL) << 8) | (devfn & 0xffUL);
> +    g_hash_table_iter_init(&iter, s->vtd_as);
> +    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_as)) {
> +        vtd_switch_address_space(vtd_as);
> +    }
>  }
>  
>  static const bool vtd_qualified_faults[] = {
> @@ -1688,18 +1678,39 @@ static inline bool vtd_is_interrupt_addr(hwaddr addr)
>      return VTD_INTERRUPT_ADDR_FIRST <= addr && addr <= VTD_INTERRUPT_ADDR_LAST;
>  }
>  
> +static gboolean vtd_find_as_by_sid(gpointer key, gpointer value,
> +                                   gpointer user_data)
> +{
> +    struct vtd_as_key *as_key = (struct vtd_as_key *)key;
> +    uint16_t target_sid = *(uint16_t *)user_data;
> +    uint16_t sid = vtd_make_source_id(pci_bus_num(as_key->bus),
> +                                      as_key->devfn);
> +    return sid == target_sid;
> +}
> +
> +static VTDAddressSpace *vtd_get_as_by_sid(IntelIOMMUState *s, uint16_t sid)
> +{
> +    uint8_t bus_num = sid >> 8;

PCI_BUS_NUM(sid)? (Keeps the bus/devfn extraction consistent with the
existing PCI helpers.)

> +    VTDAddressSpace *vtd_as = s->vtd_as_cache[bus_num];
> +
> +    if (vtd_as &&
> +        (sid == vtd_make_source_id(pci_bus_num(vtd_as->bus),
> +                                   vtd_as->devfn))) {
> +        return vtd_as;
> +    }
> +
> +    vtd_as = g_hash_table_find(s->vtd_as, vtd_find_as_by_sid, &sid);
> +    s->vtd_as_cache[bus_num] = vtd_as;
> +
> +    return vtd_as;
> +}
> +
>  static void vtd_pt_enable_fast_path(IntelIOMMUState *s, uint16_t source_id)
>  {
> -    VTDBus *vtd_bus;
>      VTDAddressSpace *vtd_as;
>      bool success = false;
>  
> -    vtd_bus = vtd_find_as_from_bus_num(s, VTD_SID_TO_BUS(source_id));
> -    if (!vtd_bus) {
> -        goto out;
> -    }
> -
> -    vtd_as = vtd_bus->dev_as[VTD_SID_TO_DEVFN(source_id)];
> +    vtd_as = vtd_get_as_by_sid(s, source_id);
>      if (!vtd_as) {
>          goto out;
>      }
> @@ -1907,11 +1918,10 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
>                                            uint16_t source_id,
>                                            uint16_t func_mask)
>  {
> +    GHashTableIter as_it;
>      uint16_t mask;
> -    VTDBus *vtd_bus;
>      VTDAddressSpace *vtd_as;
>      uint8_t bus_n, devfn;
> -    uint16_t devfn_it;
>  
>      trace_vtd_inv_desc_cc_devices(source_id, func_mask);
>  
> @@ -1934,32 +1944,31 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
>      mask = ~mask;
>  
>      bus_n = VTD_SID_TO_BUS(source_id);
> -    vtd_bus = vtd_find_as_from_bus_num(s, bus_n);
> -    if (vtd_bus) {
> -        devfn = VTD_SID_TO_DEVFN(source_id);
> -        for (devfn_it = 0; devfn_it < PCI_DEVFN_MAX; ++devfn_it) {
> -            vtd_as = vtd_bus->dev_as[devfn_it];
> -            if (vtd_as && ((devfn_it & mask) == (devfn & mask))) {
> -                trace_vtd_inv_desc_cc_device(bus_n, VTD_PCI_SLOT(devfn_it),
> -                                             VTD_PCI_FUNC(devfn_it));
> -                vtd_iommu_lock(s);
> -                vtd_as->context_cache_entry.context_cache_gen = 0;
> -                vtd_iommu_unlock(s);
> -                /*
> -                 * Do switch address space when needed, in case if the
> -                 * device passthrough bit is switched.
> -                 */
> -                vtd_switch_address_space(vtd_as);
> -                /*
> -                 * So a device is moving out of (or moving into) a
> -                 * domain, resync the shadow page table.
> -                 * This won't bring bad even if we have no such
> -                 * notifier registered - the IOMMU notification
> -                 * framework will skip MAP notifications if that
> -                 * happened.
> -                 */
> -                vtd_sync_shadow_page_table(vtd_as);
> -            }
> +    devfn = VTD_SID_TO_DEVFN(source_id);
> +
> +    g_hash_table_iter_init(&as_it, s->vtd_as);
> +    while (g_hash_table_iter_next(&as_it, NULL, (void**)&vtd_as)) {
> +        if ((pci_bus_num(vtd_as->bus) == bus_n) &&
> +            (vtd_as->devfn & mask) == (devfn & mask)) {
> +            trace_vtd_inv_desc_cc_device(bus_n, VTD_PCI_SLOT(vtd_as->devfn),
> +                                         VTD_PCI_FUNC(vtd_as->devfn));
> +            vtd_iommu_lock(s);
> +            vtd_as->context_cache_entry.context_cache_gen = 0;
> +            vtd_iommu_unlock(s);
> +            /*
> +             * Do switch address space when needed, in case if the
> +             * device passthrough bit is switched.
> +             */
> +            vtd_switch_address_space(vtd_as);
> +            /*
> +             * So a device is moving out of (or moving into) a
> +             * domain, resync the shadow page table.
> +             * This won't bring bad even if we have no such
> +             * notifier registered - the IOMMU notification
> +             * framework will skip MAP notifications if that
> +             * happened.
> +             */
> +            vtd_sync_shadow_page_table(vtd_as);
>          }
>      }
>  }
> @@ -2473,18 +2482,13 @@ static bool vtd_process_device_iotlb_desc(IntelIOMMUState *s,
>  {
>      VTDAddressSpace *vtd_dev_as;
>      IOMMUTLBEvent event;
> -    struct VTDBus *vtd_bus;
>      hwaddr addr;
>      uint64_t sz;
>      uint16_t sid;
> -    uint8_t devfn;
>      bool size;
> -    uint8_t bus_num;
>  
>      addr = VTD_INV_DESC_DEVICE_IOTLB_ADDR(inv_desc->hi);
>      sid = VTD_INV_DESC_DEVICE_IOTLB_SID(inv_desc->lo);
> -    devfn = sid & 0xff;
> -    bus_num = sid >> 8;
>      size = VTD_INV_DESC_DEVICE_IOTLB_SIZE(inv_desc->hi);
>  
>      if ((inv_desc->lo & VTD_INV_DESC_DEVICE_IOTLB_RSVD_LO) ||
> @@ -2495,12 +2499,11 @@ static bool vtd_process_device_iotlb_desc(IntelIOMMUState *s,
>          return false;
>      }
>  
> -    vtd_bus = vtd_find_as_from_bus_num(s, bus_num);
> -    if (!vtd_bus) {
> -        goto done;
> -    }
> -
> -    vtd_dev_as = vtd_bus->dev_as[devfn];
> +    /*
> +     * Using sid is OK since the guest should have finished the
> +     * initialization of both the bus and device.
> +     */
> +    vtd_dev_as = vtd_get_as_by_sid(s, sid);
>      if (!vtd_dev_as) {
>          goto done;
>      }
> @@ -3426,27 +3429,27 @@ static const MemoryRegionOps vtd_mem_ir_ops = {
>  
>  VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>  {
> -    uintptr_t key = (uintptr_t)bus;
> -    VTDBus *vtd_bus = g_hash_table_lookup(s->vtd_as_by_busptr, &key);
> +    /*
> +     * We can't simply use sid here since the bus number might not be
> +     * initialized by the guest.
> +     */
> +    struct vtd_as_key key = {
> +        .bus = bus,
> +        .devfn = devfn,
> +    };
>      VTDAddressSpace *vtd_dev_as;
>      char name[128];
>  
> -    if (!vtd_bus) {
> -        uintptr_t *new_key = g_malloc(sizeof(*new_key));
> -        *new_key = (uintptr_t)bus;
> -        /* No corresponding free() */
> -        vtd_bus = g_malloc0(sizeof(VTDBus) + sizeof(VTDAddressSpace *) * \
> -                            PCI_DEVFN_MAX);
> -        vtd_bus->bus = bus;
> -        g_hash_table_insert(s->vtd_as_by_busptr, new_key, vtd_bus);
> -    }
> +    vtd_dev_as = g_hash_table_lookup(s->vtd_as, &key);
> +    if (!vtd_dev_as) {
> +        struct vtd_as_key *new_key = g_malloc(sizeof(*new_key));
>  
> -    vtd_dev_as = vtd_bus->dev_as[devfn];
> +        new_key->bus = bus;
> +        new_key->devfn = devfn;
>  
> -    if (!vtd_dev_as) {
>          snprintf(name, sizeof(name), "vtd-%02x.%x", PCI_SLOT(devfn),
>                   PCI_FUNC(devfn));
> -        vtd_bus->dev_as[devfn] = vtd_dev_as = g_malloc0(sizeof(VTDAddressSpace));
> +        vtd_dev_as = g_malloc0(sizeof(VTDAddressSpace));
>  
>          vtd_dev_as->bus = bus;
>          vtd_dev_as->devfn = (uint8_t)devfn;
> @@ -3502,6 +3505,8 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>                                              &vtd_dev_as->nodmar, 0);
>  
>          vtd_switch_address_space(vtd_dev_as);
> +
> +        g_hash_table_insert(s->vtd_as, new_key, vtd_dev_as);
>      }
>      return vtd_dev_as;
>  }
> @@ -3875,7 +3880,6 @@ static void vtd_realize(DeviceState *dev, Error **errp)
>  
>      QLIST_INIT(&s->vtd_as_with_notifiers);
>      qemu_mutex_init(&s->iommu_lock);
> -    memset(s->vtd_as_by_bus_num, 0, sizeof(s->vtd_as_by_bus_num));
>      memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
>                            "intel_iommu", DMAR_REG_SIZE);
>  
> @@ -3897,8 +3901,8 @@ static void vtd_realize(DeviceState *dev, Error **errp)
>      /* No corresponding destroy */
>      s->iotlb = g_hash_table_new_full(vtd_uint64_hash, vtd_uint64_equal,
>                                       g_free, g_free);
> -    s->vtd_as_by_busptr = g_hash_table_new_full(vtd_uint64_hash, vtd_uint64_equal,
> -                                              g_free, g_free);
> +    s->vtd_as = g_hash_table_new_full(vtd_as_hash, vtd_as_equal,
> +                                      g_free, g_free);
>      vtd_init(s);
>      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, Q35_HOST_BRIDGE_IOMMU_ADDR);
>      pci_setup_iommu(bus, vtd_host_dma_iommu, dev);
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 3b5ac869db..fa1bed353c 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -58,7 +58,6 @@ typedef struct VTDContextEntry VTDContextEntry;
>  typedef struct VTDContextCacheEntry VTDContextCacheEntry;
>  typedef struct VTDAddressSpace VTDAddressSpace;
>  typedef struct VTDIOTLBEntry VTDIOTLBEntry;
> -typedef struct VTDBus VTDBus;
>  typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
>  typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
>  typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
> @@ -111,12 +110,6 @@ struct VTDAddressSpace {
>      IOVATree *iova_tree;          /* Traces mapped IOVA ranges */
>  };
>  
> -struct VTDBus {
> -    PCIBus* bus;		/* A reference to the bus to provide translation for */
> -    /* A table of VTDAddressSpace objects indexed by devfn */
> -    VTDAddressSpace *dev_as[];
> -};
> -
>  struct VTDIOTLBEntry {
>      uint64_t gfn;
>      uint16_t domain_id;
> @@ -253,8 +246,8 @@ struct IntelIOMMUState {
>      uint32_t context_cache_gen;     /* Should be in [1,MAX] */
>      GHashTable *iotlb;              /* IOTLB */
>  
> -    GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
> -    VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
> +    GHashTable *vtd_as;             /* VTD address space indexed by source id*/

It's not indexed by source ID but vtd_as_key?

Meanwhile how about renaming it to vtd_address_spaces?  Since we use vtd_as
as the name for VTDAddressSpace* in most code paths, so imho it'll be
easier to identify the two.

> +    VTDAddressSpace *vtd_as_cache[VTD_PCI_BUS_MAX]; /* VTD address space cache */
>      /* list of registered notifiers */
>      QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
>  
> -- 
> 2.25.1
> 

-- 
Peter Xu




* Re: [PATCH V2 1/4] intel-iommu: don't warn guest errors when getting rid2pasid entry
  2022-04-22  0:13               ` Peter Xu
@ 2022-04-22  6:24                 ` Jason Wang
  0 siblings, 0 replies; 43+ messages in thread
From: Jason Wang @ 2022-04-22  6:24 UTC (permalink / raw)
  To: Peter Xu; +Cc: Tian, Kevin, Liu, Yi L, yi.y.sun, qemu-devel, mst

On Fri, Apr 22, 2022 at 8:13 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Mar 30, 2022 at 04:36:36PM +0800, Jason Wang wrote:
> > > If not, do we want to apply this version scheme only when it
> > > reaches the production quality or also in the experimental phase?
> >
> > Yes. E.g if we think scalable mode is mature, we can enable 3.0.
>
> Sorry to come back to the discussion late..
>
> I'd say we hold off on versioning until someone (or some organization)
> strongly asks for a stable interface for scalable mode (ideally with a
> developer from that organization looking after it); only then should we
> start with versioning.
>
> Otherwise I hope we can be free to break the interface assuming things are
> still evolving, just like the spec.

Right. Per the discussion, as long as we don't think it's mature enough
to claim compliance with version X, we won't introduce that version.

Thanks

>
> Thanks,
>
> --
> Peter Xu
>




* Re: [PATCH V2 2/4] intel-iommu: drop VTDBus
  2022-04-22  1:17   ` Peter Xu
@ 2022-04-22  6:26     ` Jason Wang
  2022-04-22 12:55       ` Peter Xu
  0 siblings, 1 reply; 43+ messages in thread
From: Jason Wang @ 2022-04-22  6:26 UTC (permalink / raw)
  To: Peter Xu; +Cc: Liu, Yi L, yi.y.sun, qemu-devel, mst

On Fri, Apr 22, 2022 at 9:17 AM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, Jason,
>
> Mostly good to me, just a few nitpicks below.
>
> On Mon, Mar 21, 2022 at 01:54:27PM +0800, Jason Wang wrote:
> > We introduce the VTDBus structure as an intermediate step for searching
> > the address space. This works well with SID based matching/lookup. But
> > when we want to support SID plus PASID based address space lookup,
> > this intermediate step turns out to be a burden. So this patch simply
> > drops the VTDBus structure and uses the PCIBus and devfn as the key
> > for the g_hash_table(). This simplifies the code and the future PASID
> > extension.
> >
> > To avoid slowing down the existing vtd_find_as_from_bus_num() callers,
> > a vtd_as cache indexed by the bus number is introduced to store the
> > most recent search result for a vtd_as belonging to a specific bus.
> >
> > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > ---
> >  hw/i386/intel_iommu.c         | 238 +++++++++++++++++-----------------
> >  include/hw/i386/intel_iommu.h |  11 +-
> >  2 files changed, 123 insertions(+), 126 deletions(-)
> >
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index 90964b201c..5851a17d0e 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -61,6 +61,16 @@
> >      }                                                                         \
> >  }
> >
> > +/*
> > + * PCI bus number (or SID) is not reliable since the device is usually
> > + * initialized before the guest can configure the PCI bridge
> > + * (SECONDARY_BUS_NUMBER).
> > + */
> > +struct vtd_as_key {
> > +    PCIBus *bus;
> > +    uint8_t devfn;
> > +};
> > +
> >  static void vtd_address_space_refresh_all(IntelIOMMUState *s);
> >  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
> >
> > @@ -210,6 +220,31 @@ static guint vtd_uint64_hash(gconstpointer v)
> >      return (guint)*(const uint64_t *)v;
> >  }
> >
> > +static gboolean vtd_as_equal(gconstpointer v1, gconstpointer v2)
> > +{
> > +    const struct vtd_as_key *key1 = v1;
> > +    const struct vtd_as_key *key2 = v2;
> > +
> > +    return (key1->bus == key2->bus) && (key1->devfn == key2->devfn);
> > +}
> > +
> > +static inline uint16_t vtd_make_source_id(uint8_t bus_num, uint8_t devfn)
> > +{
> > +    return ((bus_num & 0xffUL) << 8) | (devfn & 0xffUL);
> > +}
>
> Nit: we could directly drop this one and use PCI_BUILD_BDF().

Will fix.

>
> > +
> > +/*
> > + * Note that we use pointer to PCIBus as the key, so hashing/shifting
> > + * based on the pointer value is intended.
>
> Thanks for the comment; that helps.
>
> Should we also mention that this hash is not the only interface to identify
> two vtd_as*, say, even if on a 32bit system we got last 24 bits collapsed
> on two vtd_as* pointers, we can still have vtd_as_equal() to guard us?

Ok. let me add that in the next version.

>
> > + */
> > +static guint vtd_as_hash(gconstpointer v)
> > +{
> > +    const struct vtd_as_key *key = v;
> > +    guint value = (guint)(uintptr_t)key->bus;
> > +
> > +    return (guint)(value << 8 | key->devfn);
> > +}
> > +
> >  static gboolean vtd_hash_remove_by_domain(gpointer key, gpointer value,
> >                                            gpointer user_data)
> >  {
> > @@ -248,22 +283,14 @@ static gboolean vtd_hash_remove_by_page(gpointer key, gpointer value,
> >  static void vtd_reset_context_cache_locked(IntelIOMMUState *s)
> >  {
> >      VTDAddressSpace *vtd_as;
> > -    VTDBus *vtd_bus;
> > -    GHashTableIter bus_it;
> > -    uint32_t devfn_it;
> > +    GHashTableIter as_it;
> >
> >      trace_vtd_context_cache_reset();
> >
> > -    g_hash_table_iter_init(&bus_it, s->vtd_as_by_busptr);
> > +    g_hash_table_iter_init(&as_it, s->vtd_as);
> >
> > -    while (g_hash_table_iter_next (&bus_it, NULL, (void**)&vtd_bus)) {
> > -        for (devfn_it = 0; devfn_it < PCI_DEVFN_MAX; ++devfn_it) {
> > -            vtd_as = vtd_bus->dev_as[devfn_it];
> > -            if (!vtd_as) {
> > -                continue;
> > -            }
> > -            vtd_as->context_cache_entry.context_cache_gen = 0;
> > -        }
> > +    while (g_hash_table_iter_next (&as_it, NULL, (void**)&vtd_as)) {
> > +        vtd_as->context_cache_entry.context_cache_gen = 0;
> >      }
> >      s->context_cache_gen = 1;
> >  }
> > @@ -993,32 +1020,6 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> >      return slpte & rsvd_mask;
> >  }
> >
> > -/* Find the VTD address space associated with a given bus number */
> > -static VTDBus *vtd_find_as_from_bus_num(IntelIOMMUState *s, uint8_t bus_num)
> > -{
> > -    VTDBus *vtd_bus = s->vtd_as_by_bus_num[bus_num];
> > -    GHashTableIter iter;
> > -
> > -    if (vtd_bus) {
> > -        return vtd_bus;
> > -    }
> > -
> > -    /*
> > -     * Iterate over the registered buses to find the one which
> > -     * currently holds this bus number and update the bus_num
> > -     * lookup table.
> > -     */
> > -    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
> > -    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
> > -        if (pci_bus_num(vtd_bus->bus) == bus_num) {
> > -            s->vtd_as_by_bus_num[bus_num] = vtd_bus;
> > -            return vtd_bus;
> > -        }
> > -    }
> > -
> > -    return NULL;
> > -}
> > -
> >  /* Given the @iova, get relevant @slptep. @slpte_level will be the last level
> >   * of the translation, can be used for deciding the size of large page.
> >   */
> > @@ -1634,24 +1635,13 @@ static bool vtd_switch_address_space(VTDAddressSpace *as)
> >
> >  static void vtd_switch_address_space_all(IntelIOMMUState *s)
> >  {
> > +    VTDAddressSpace *vtd_as;
> >      GHashTableIter iter;
> > -    VTDBus *vtd_bus;
> > -    int i;
> > -
> > -    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
> > -    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
> > -        for (i = 0; i < PCI_DEVFN_MAX; i++) {
> > -            if (!vtd_bus->dev_as[i]) {
> > -                continue;
> > -            }
> > -            vtd_switch_address_space(vtd_bus->dev_as[i]);
> > -        }
> > -    }
> > -}
> >
> > -static inline uint16_t vtd_make_source_id(uint8_t bus_num, uint8_t devfn)
> > -{
> > -    return ((bus_num & 0xffUL) << 8) | (devfn & 0xffUL);
> > +    g_hash_table_iter_init(&iter, s->vtd_as);
> > +    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_as)) {
> > +        vtd_switch_address_space(vtd_as);
> > +    }
> >  }
> >
> >  static const bool vtd_qualified_faults[] = {
> > @@ -1688,18 +1678,39 @@ static inline bool vtd_is_interrupt_addr(hwaddr addr)
> >      return VTD_INTERRUPT_ADDR_FIRST <= addr && addr <= VTD_INTERRUPT_ADDR_LAST;
> >  }
> >
> > +static gboolean vtd_find_as_by_sid(gpointer key, gpointer value,
> > +                                   gpointer user_data)
> > +{
> > +    struct vtd_as_key *as_key = (struct vtd_as_key *)key;
> > +    uint16_t target_sid = *(uint16_t *)user_data;
> > +    uint16_t sid = vtd_make_source_id(pci_bus_num(as_key->bus),
> > +                                      as_key->devfn);
> > +    return sid == target_sid;
> > +}
> > +
> > +static VTDAddressSpace *vtd_get_as_by_sid(IntelIOMMUState *s, uint16_t sid)
> > +{
> > +    uint8_t bus_num = sid >> 8;
>
> PCI_BUS_NUM(sid)?

Will do.

>
> > +    VTDAddressSpace *vtd_as = s->vtd_as_cache[bus_num];
> > +
> > +    if (vtd_as &&
> > +        (sid == vtd_make_source_id(pci_bus_num(vtd_as->bus),
> > +                                   vtd_as->devfn))) {
> > +        return vtd_as;
> > +    }
> > +
> > +    vtd_as = g_hash_table_find(s->vtd_as, vtd_find_as_by_sid, &sid);
> > +    s->vtd_as_cache[bus_num] = vtd_as;
> > +
> > +    return vtd_as;
> > +}
> > +
> >  static void vtd_pt_enable_fast_path(IntelIOMMUState *s, uint16_t source_id)
> >  {
> > -    VTDBus *vtd_bus;
> >      VTDAddressSpace *vtd_as;
> >      bool success = false;
> >
> > -    vtd_bus = vtd_find_as_from_bus_num(s, VTD_SID_TO_BUS(source_id));
> > -    if (!vtd_bus) {
> > -        goto out;
> > -    }
> > -
> > -    vtd_as = vtd_bus->dev_as[VTD_SID_TO_DEVFN(source_id)];
> > +    vtd_as = vtd_get_as_by_sid(s, source_id);
> >      if (!vtd_as) {
> >          goto out;
> >      }
> > @@ -1907,11 +1918,10 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
> >                                            uint16_t source_id,
> >                                            uint16_t func_mask)
> >  {
> > +    GHashTableIter as_it;
> >      uint16_t mask;
> > -    VTDBus *vtd_bus;
> >      VTDAddressSpace *vtd_as;
> >      uint8_t bus_n, devfn;
> > -    uint16_t devfn_it;
> >
> >      trace_vtd_inv_desc_cc_devices(source_id, func_mask);
> >
> > @@ -1934,32 +1944,31 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
> >      mask = ~mask;
> >
> >      bus_n = VTD_SID_TO_BUS(source_id);
> > -    vtd_bus = vtd_find_as_from_bus_num(s, bus_n);
> > -    if (vtd_bus) {
> > -        devfn = VTD_SID_TO_DEVFN(source_id);
> > -        for (devfn_it = 0; devfn_it < PCI_DEVFN_MAX; ++devfn_it) {
> > -            vtd_as = vtd_bus->dev_as[devfn_it];
> > -            if (vtd_as && ((devfn_it & mask) == (devfn & mask))) {
> > -                trace_vtd_inv_desc_cc_device(bus_n, VTD_PCI_SLOT(devfn_it),
> > -                                             VTD_PCI_FUNC(devfn_it));
> > -                vtd_iommu_lock(s);
> > -                vtd_as->context_cache_entry.context_cache_gen = 0;
> > -                vtd_iommu_unlock(s);
> > -                /*
> > -                 * Do switch address space when needed, in case if the
> > -                 * device passthrough bit is switched.
> > -                 */
> > -                vtd_switch_address_space(vtd_as);
> > -                /*
> > -                 * So a device is moving out of (or moving into) a
> > -                 * domain, resync the shadow page table.
> > -                 * This won't bring bad even if we have no such
> > -                 * notifier registered - the IOMMU notification
> > -                 * framework will skip MAP notifications if that
> > -                 * happened.
> > -                 */
> > -                vtd_sync_shadow_page_table(vtd_as);
> > -            }
> > +    devfn = VTD_SID_TO_DEVFN(source_id);
> > +
> > +    g_hash_table_iter_init(&as_it, s->vtd_as);
> > +    while (g_hash_table_iter_next(&as_it, NULL, (void**)&vtd_as)) {
> > +        if ((pci_bus_num(vtd_as->bus) == bus_n) &&
> > +            (vtd_as->devfn & mask) == (devfn & mask)) {
> > +            trace_vtd_inv_desc_cc_device(bus_n, VTD_PCI_SLOT(vtd_as->devfn),
> > +                                         VTD_PCI_FUNC(vtd_as->devfn));
> > +            vtd_iommu_lock(s);
> > +            vtd_as->context_cache_entry.context_cache_gen = 0;
> > +            vtd_iommu_unlock(s);
> > +            /*
> > +             * Do switch address space when needed, in case if the
> > +             * device passthrough bit is switched.
> > +             */
> > +            vtd_switch_address_space(vtd_as);
> > +            /*
> > +             * So a device is moving out of (or moving into) a
> > +             * domain, resync the shadow page table.
> > +             * This won't bring bad even if we have no such
> > +             * notifier registered - the IOMMU notification
> > +             * framework will skip MAP notifications if that
> > +             * happened.
> > +             */
> > +            vtd_sync_shadow_page_table(vtd_as);
> >          }
> >      }
> >  }
> > @@ -2473,18 +2482,13 @@ static bool vtd_process_device_iotlb_desc(IntelIOMMUState *s,
> >  {
> >      VTDAddressSpace *vtd_dev_as;
> >      IOMMUTLBEvent event;
> > -    struct VTDBus *vtd_bus;
> >      hwaddr addr;
> >      uint64_t sz;
> >      uint16_t sid;
> > -    uint8_t devfn;
> >      bool size;
> > -    uint8_t bus_num;
> >
> >      addr = VTD_INV_DESC_DEVICE_IOTLB_ADDR(inv_desc->hi);
> >      sid = VTD_INV_DESC_DEVICE_IOTLB_SID(inv_desc->lo);
> > -    devfn = sid & 0xff;
> > -    bus_num = sid >> 8;
> >      size = VTD_INV_DESC_DEVICE_IOTLB_SIZE(inv_desc->hi);
> >
> >      if ((inv_desc->lo & VTD_INV_DESC_DEVICE_IOTLB_RSVD_LO) ||
> > @@ -2495,12 +2499,11 @@ static bool vtd_process_device_iotlb_desc(IntelIOMMUState *s,
> >          return false;
> >      }
> >
> > -    vtd_bus = vtd_find_as_from_bus_num(s, bus_num);
> > -    if (!vtd_bus) {
> > -        goto done;
> > -    }
> > -
> > -    vtd_dev_as = vtd_bus->dev_as[devfn];
> > +    /*
> > +     * Using sid is OK since the guest should have finished the
> > +     * initialization of both the bus and device.
> > +     */
> > +    vtd_dev_as = vtd_get_as_by_sid(s, sid);
> >      if (!vtd_dev_as) {
> >          goto done;
> >      }
> > @@ -3426,27 +3429,27 @@ static const MemoryRegionOps vtd_mem_ir_ops = {
> >
> >  VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
> >  {
> > -    uintptr_t key = (uintptr_t)bus;
> > -    VTDBus *vtd_bus = g_hash_table_lookup(s->vtd_as_by_busptr, &key);
> > +    /*
> > +     * We can't simply use sid here since the bus number might not be
> > +     * initialized by the guest.
> > +     */
> > +    struct vtd_as_key key = {
> > +        .bus = bus,
> > +        .devfn = devfn,
> > +    };
> >      VTDAddressSpace *vtd_dev_as;
> >      char name[128];
> >
> > -    if (!vtd_bus) {
> > -        uintptr_t *new_key = g_malloc(sizeof(*new_key));
> > -        *new_key = (uintptr_t)bus;
> > -        /* No corresponding free() */
> > -        vtd_bus = g_malloc0(sizeof(VTDBus) + sizeof(VTDAddressSpace *) * \
> > -                            PCI_DEVFN_MAX);
> > -        vtd_bus->bus = bus;
> > -        g_hash_table_insert(s->vtd_as_by_busptr, new_key, vtd_bus);
> > -    }
> > +    vtd_dev_as = g_hash_table_lookup(s->vtd_as, &key);
> > +    if (!vtd_dev_as) {
> > +        struct vtd_as_key *new_key = g_malloc(sizeof(*new_key));
> >
> > -    vtd_dev_as = vtd_bus->dev_as[devfn];
> > +        new_key->bus = bus;
> > +        new_key->devfn = devfn;
> >
> > -    if (!vtd_dev_as) {
> >          snprintf(name, sizeof(name), "vtd-%02x.%x", PCI_SLOT(devfn),
> >                   PCI_FUNC(devfn));
> > -        vtd_bus->dev_as[devfn] = vtd_dev_as = g_malloc0(sizeof(VTDAddressSpace));
> > +        vtd_dev_as = g_malloc0(sizeof(VTDAddressSpace));
> >
> >          vtd_dev_as->bus = bus;
> >          vtd_dev_as->devfn = (uint8_t)devfn;
> > @@ -3502,6 +3505,8 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
> >                                              &vtd_dev_as->nodmar, 0);
> >
> >          vtd_switch_address_space(vtd_dev_as);
> > +
> > +        g_hash_table_insert(s->vtd_as, new_key, vtd_dev_as);
> >      }
> >      return vtd_dev_as;
> >  }
> > @@ -3875,7 +3880,6 @@ static void vtd_realize(DeviceState *dev, Error **errp)
> >
> >      QLIST_INIT(&s->vtd_as_with_notifiers);
> >      qemu_mutex_init(&s->iommu_lock);
> > -    memset(s->vtd_as_by_bus_num, 0, sizeof(s->vtd_as_by_bus_num));
> >      memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
> >                            "intel_iommu", DMAR_REG_SIZE);
> >
> > @@ -3897,8 +3901,8 @@ static void vtd_realize(DeviceState *dev, Error **errp)
> >      /* No corresponding destroy */
> >      s->iotlb = g_hash_table_new_full(vtd_uint64_hash, vtd_uint64_equal,
> >                                       g_free, g_free);
> > -    s->vtd_as_by_busptr = g_hash_table_new_full(vtd_uint64_hash, vtd_uint64_equal,
> > -                                              g_free, g_free);
> > +    s->vtd_as = g_hash_table_new_full(vtd_as_hash, vtd_as_equal,
> > +                                      g_free, g_free);
> >      vtd_init(s);
> >      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, Q35_HOST_BRIDGE_IOMMU_ADDR);
> >      pci_setup_iommu(bus, vtd_host_dma_iommu, dev);
> > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > index 3b5ac869db..fa1bed353c 100644
> > --- a/include/hw/i386/intel_iommu.h
> > +++ b/include/hw/i386/intel_iommu.h
> > @@ -58,7 +58,6 @@ typedef struct VTDContextEntry VTDContextEntry;
> >  typedef struct VTDContextCacheEntry VTDContextCacheEntry;
> >  typedef struct VTDAddressSpace VTDAddressSpace;
> >  typedef struct VTDIOTLBEntry VTDIOTLBEntry;
> > -typedef struct VTDBus VTDBus;
> >  typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
> >  typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
> >  typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
> > @@ -111,12 +110,6 @@ struct VTDAddressSpace {
> >      IOVATree *iova_tree;          /* Traces mapped IOVA ranges */
> >  };
> >
> > -struct VTDBus {
> > -    PCIBus* bus;             /* A reference to the bus to provide translation for */
> > -    /* A table of VTDAddressSpace objects indexed by devfn */
> > -    VTDAddressSpace *dev_as[];
> > -};
> > -
> >  struct VTDIOTLBEntry {
> >      uint64_t gfn;
> >      uint16_t domain_id;
> > @@ -253,8 +246,8 @@ struct IntelIOMMUState {
> >      uint32_t context_cache_gen;     /* Should be in [1,MAX] */
> >      GHashTable *iotlb;              /* IOTLB */
> >
> > -    GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
> > -    VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
> > +    GHashTable *vtd_as;             /* VTD address space indexed by source id*/
>
> It's not indexed by source ID but vtd_as_key?

Right, let me fix that.

>
> Meanwhile, how about renaming it to vtd_address_spaces?  Since we use vtd_as
> as the name for VTDAddressSpace* in most code paths, imho it'll be
> easier to identify the two.

Ok.

Thanks

>
> > +    VTDAddressSpace *vtd_as_cache[VTD_PCI_BUS_MAX]; /* VTD address space cache */
> >      /* list of registered notifiers */
> >      QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
> >
> > --
> > 2.25.1
> >
>
> --
> Peter Xu
>



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 1/4] intel-iommu: don't warn guest errors when getting rid2pasid entry
  2022-03-29  4:52         ` Jason Wang
  2022-03-30  8:16           ` Tian, Kevin
@ 2022-04-22  7:57           ` Michael S. Tsirkin
  1 sibling, 0 replies; 43+ messages in thread
From: Michael S. Tsirkin @ 2022-04-22  7:57 UTC (permalink / raw)
  To: Jason Wang; +Cc: Tian, Kevin, Yi Liu, yi.y.sun, qemu-devel, peterx

On Tue, Mar 29, 2022 at 12:52:08PM +0800, Jason Wang wrote:
> 
> On 2022/3/28 4:53 PM, Yi Liu wrote:
> > 
> > 
> > On 2022/3/28 10:27, Jason Wang wrote:
> > > On Thu, Mar 24, 2022 at 4:21 PM Tian, Kevin <kevin.tian@intel.com>
> > > wrote:
> > > > 
> > > > > From: Jason Wang
> > > > > Sent: Monday, March 21, 2022 1:54 PM
> > > > > 
> > > > > We used to warn on a wrong rid2pasid entry. But this error can be
> > > > > triggered by the guest and can happen during initialization. So
> > > > > let's not warn in this case.
> > > > > 
> > > > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > > > ---
> > > > >   hw/i386/intel_iommu.c | 6 ++++--
> > > > >   1 file changed, 4 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > > index 874d01c162..90964b201c 100644
> > > > > --- a/hw/i386/intel_iommu.c
> > > > > +++ b/hw/i386/intel_iommu.c
> > > > > @@ -1554,8 +1554,10 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState
> > > > > *s, VTDContextEntry *ce)
> > > > >       if (s->root_scalable) {
> > > > >           ret = vtd_ce_get_rid2pasid_entry(s, ce, &pe);
> > > > >           if (ret) {
> > > > > -            error_report_once("%s:
> > > > > vtd_ce_get_rid2pasid_entry error: %"PRId32,
> > > > > -                              __func__, ret);
> > > > > +            /*
> > > > > +             * This error is guest triggerable. We should assume PT
> > > > > +             * is not enabled for safety.
> > > > > +             */
> > > > 
> > > > suppose a VT-d fault should be queued in this case besides
> > > > returning false:
> > > > 
> > > > SPD.1: A hardware attempt to access the scalable-mode PASID-directory
> > > > entry referenced through the PASIDDIRPTR field in scalable-mode
> > > > context-entry resulted in an error
> > > > 
> > > > SPT.1: A hardware attempt to access a scalable-mode PASID-table entry
> > > > referenced through the SMPTBLPTR field in a scalable-mode
> > > > PASID-directory
> > > > entry resulted in an error.
> > > 
> > > Probably, but this issue is not introduced in this patch. We can fix
> > > it on top if necessary.
> > 
> > agreed.
> > 
> > > > 
> > > > Currently the implementation of vtd_ce_get_rid2pasid_entry() is also
> > > > problematic. According to VT-d spec, RID2PASID field is effective only
> > > > when ecap.rps is true; otherwise PASID#0 is used for RID2PASID. I didn't
> > > > see ecap.rps being set, nor is it checked in that function. It possibly
> > > > works just because Linux currently programs 0 into RID2PASID...
> > > 
> > > This seems to be another issue since the introduction of scalable mode.
> > 
> > Yes, this is not introduced in this series. The current scalable mode
> > vIOMMU support followed the 3.0 spec, while RPS was added in 3.1. It needs
> > to be fixed.
> 
> 
> Interesting, so this is more complicated when dealing with migration
> compatibility. So what I suggest is probably something like:
> 
> -device intel-iommu,version=$version
> 
> Then we can maintain migration compatibility correctly. For 3.0 we can go
> without RPS; for 3.1 and above we need to implement RPS.
> 
> Since most of the advanced features have not been implemented, we could
> probably start just from 3.4 (assuming that's the latest version). And all of
> the subsequent effort should be done against 3.4 in order to productize it.
> 
> Thanks

I would advise calling it x-version and declaring it unstable for now. I
don't think we want, at this point, to support users tweaking the version
to arbitrary values.


> 
> > 
> > > Thanks
> > > 
> > > > 
> > > > >               return false;
> > > > >           }
> > > > >           return (VTD_PE_GET_TYPE(&pe) == VTD_SM_PASID_ENTRY_PT);
> > > > > -- 
> > > > > 2.25.1
> > > > > 
> > > > 
> > > 
> > 



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/4] intel-iommu: drop VTDBus
  2022-04-22  6:26     ` Jason Wang
@ 2022-04-22 12:55       ` Peter Xu
  0 siblings, 0 replies; 43+ messages in thread
From: Peter Xu @ 2022-04-22 12:55 UTC (permalink / raw)
  To: Jason Wang; +Cc: Liu, Yi L, yi.y.sun, qemu-devel, mst

On Fri, Apr 22, 2022 at 02:26:11PM +0800, Jason Wang wrote:
> On Fri, Apr 22, 2022 at 9:17 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > Hi, Jason,
> >
> > Mostly good to me, just a few nitpicks below.
> >
> > On Mon, Mar 21, 2022 at 01:54:27PM +0800, Jason Wang wrote:
> > > We introduce VTDBus structure as an intermediate step for searching
> > > the address space. This works well with SID based matching/lookup. But
> > > when we want to support SID plus PASID based address space lookup,
> > > this intermediate step turns out to be a burden. So the patch simply
> > > drops the VTDBus structure and use the PCIBus and devfn as the key for
> > > the g_hash_table(). This simplifies the codes and the future PASID
> > > extension.
> > >
> > > To avoid slowing down past vtd_find_as_from_bus_num() callers, a
> > > vtd_as cache indexed by the bus number is introduced to store the most
> > > recent search result of a vtd_as belonging to a specific bus.
> > >
> > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > ---
> > >  hw/i386/intel_iommu.c         | 238 +++++++++++++++++-----------------
> > >  include/hw/i386/intel_iommu.h |  11 +-
> > >  2 files changed, 123 insertions(+), 126 deletions(-)
> > >
> > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > index 90964b201c..5851a17d0e 100644
> > > --- a/hw/i386/intel_iommu.c
> > > +++ b/hw/i386/intel_iommu.c
> > > @@ -61,6 +61,16 @@
> > >      }                                                                         \
> > >  }
> > >
> > > +/*
> > > + * PCI bus number (or SID) is not reliable since the device is usually
> > > + * initialized before the guest can configure the PCI bridge
> > > + * (SECONDARY_BUS_NUMBER).
> > > + */
> > > +struct vtd_as_key {
> > > +    PCIBus *bus;
> > > +    uint8_t devfn;
> > > +};
> > > +
> > >  static void vtd_address_space_refresh_all(IntelIOMMUState *s);
> > >  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
> > >
> > > @@ -210,6 +220,31 @@ static guint vtd_uint64_hash(gconstpointer v)
> > >      return (guint)*(const uint64_t *)v;
> > >  }
> > >
> > > +static gboolean vtd_as_equal(gconstpointer v1, gconstpointer v2)
> > > +{
> > > +    const struct vtd_as_key *key1 = v1;
> > > +    const struct vtd_as_key *key2 = v2;
> > > +
> > > +    return (key1->bus == key2->bus) && (key1->devfn == key2->devfn);
> > > +}
> > > +
> > > +static inline uint16_t vtd_make_source_id(uint8_t bus_num, uint8_t devfn)
> > > +{
> > > +    return ((bus_num & 0xffUL) << 8) | (devfn & 0xffUL);
> > > +}
> >
> > Nit: we could directly drop this one and use PCI_BUILD_BDF().
> 
> Will fix.
> 
> >
> > > +
> > > +/*
> > > + * Note that we use pointer to PCIBus as the key, so hashing/shifting
> > > + * based on the pointer value is intended.
> >
> > Thanks for the comment; that helps.
> >
> > Should we also mention that this hash is not the only means of identifying
> > two vtd_as*?  Say, even if on a 32-bit system the last 24 bits of two
> > vtd_as* pointers collide, we still have vtd_as_equal() to guard us.
> 
> Ok. let me add that in the next version.
> 
> >
> > > + */
> > > +static guint vtd_as_hash(gconstpointer v)
> > > +{
> > > +    const struct vtd_as_key *key = v;
> > > +    guint value = (guint)(uintptr_t)key->bus;
> > > +
> > > +    return (guint)(value << 8 | key->devfn);
> > > +}
> > > +
> > >  static gboolean vtd_hash_remove_by_domain(gpointer key, gpointer value,
> > >                                            gpointer user_data)
> > >  {
> > > @@ -248,22 +283,14 @@ static gboolean vtd_hash_remove_by_page(gpointer key, gpointer value,
> > >  static void vtd_reset_context_cache_locked(IntelIOMMUState *s)
> > >  {
> > >      VTDAddressSpace *vtd_as;
> > > -    VTDBus *vtd_bus;
> > > -    GHashTableIter bus_it;
> > > -    uint32_t devfn_it;
> > > +    GHashTableIter as_it;
> > >
> > >      trace_vtd_context_cache_reset();
> > >
> > > -    g_hash_table_iter_init(&bus_it, s->vtd_as_by_busptr);
> > > +    g_hash_table_iter_init(&as_it, s->vtd_as);
> > >
> > > -    while (g_hash_table_iter_next (&bus_it, NULL, (void**)&vtd_bus)) {
> > > -        for (devfn_it = 0; devfn_it < PCI_DEVFN_MAX; ++devfn_it) {
> > > -            vtd_as = vtd_bus->dev_as[devfn_it];
> > > -            if (!vtd_as) {
> > > -                continue;
> > > -            }
> > > -            vtd_as->context_cache_entry.context_cache_gen = 0;
> > > -        }
> > > +    while (g_hash_table_iter_next (&as_it, NULL, (void**)&vtd_as)) {
> > > +        vtd_as->context_cache_entry.context_cache_gen = 0;
> > >      }
> > >      s->context_cache_gen = 1;
> > >  }
> > > @@ -993,32 +1020,6 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > >      return slpte & rsvd_mask;
> > >  }
> > >
> > > -/* Find the VTD address space associated with a given bus number */
> > > -static VTDBus *vtd_find_as_from_bus_num(IntelIOMMUState *s, uint8_t bus_num)
> > > -{
> > > -    VTDBus *vtd_bus = s->vtd_as_by_bus_num[bus_num];
> > > -    GHashTableIter iter;
> > > -
> > > -    if (vtd_bus) {
> > > -        return vtd_bus;
> > > -    }
> > > -
> > > -    /*
> > > -     * Iterate over the registered buses to find the one which
> > > -     * currently holds this bus number and update the bus_num
> > > -     * lookup table.
> > > -     */
> > > -    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
> > > -    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
> > > -        if (pci_bus_num(vtd_bus->bus) == bus_num) {
> > > -            s->vtd_as_by_bus_num[bus_num] = vtd_bus;
> > > -            return vtd_bus;
> > > -        }
> > > -    }
> > > -
> > > -    return NULL;
> > > -}
> > > -
> > >  /* Given the @iova, get relevant @slptep. @slpte_level will be the last level
> > >   * of the translation, can be used for deciding the size of large page.
> > >   */
> > > @@ -1634,24 +1635,13 @@ static bool vtd_switch_address_space(VTDAddressSpace *as)
> > >
> > >  static void vtd_switch_address_space_all(IntelIOMMUState *s)
> > >  {
> > > +    VTDAddressSpace *vtd_as;
> > >      GHashTableIter iter;
> > > -    VTDBus *vtd_bus;
> > > -    int i;
> > > -
> > > -    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
> > > -    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
> > > -        for (i = 0; i < PCI_DEVFN_MAX; i++) {
> > > -            if (!vtd_bus->dev_as[i]) {
> > > -                continue;
> > > -            }
> > > -            vtd_switch_address_space(vtd_bus->dev_as[i]);
> > > -        }
> > > -    }
> > > -}
> > >
> > > -static inline uint16_t vtd_make_source_id(uint8_t bus_num, uint8_t devfn)
> > > -{
> > > -    return ((bus_num & 0xffUL) << 8) | (devfn & 0xffUL);
> > > +    g_hash_table_iter_init(&iter, s->vtd_as);
> > > +    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_as)) {
> > > +        vtd_switch_address_space(vtd_as);
> > > +    }
> > >  }
> > >
> > >  static const bool vtd_qualified_faults[] = {
> > > @@ -1688,18 +1678,39 @@ static inline bool vtd_is_interrupt_addr(hwaddr addr)
> > >      return VTD_INTERRUPT_ADDR_FIRST <= addr && addr <= VTD_INTERRUPT_ADDR_LAST;
> > >  }
> > >
> > > +static gboolean vtd_find_as_by_sid(gpointer key, gpointer value,
> > > +                                   gpointer user_data)
> > > +{
> > > +    struct vtd_as_key *as_key = (struct vtd_as_key *)key;
> > > +    uint16_t target_sid = *(uint16_t *)user_data;
> > > +    uint16_t sid = vtd_make_source_id(pci_bus_num(as_key->bus),
> > > +                                      as_key->devfn);
> > > +    return sid == target_sid;
> > > +}
> > > +
> > > +static VTDAddressSpace *vtd_get_as_by_sid(IntelIOMMUState *s, uint16_t sid)
> > > +{
> > > +    uint8_t bus_num = sid >> 8;
> >
> > PCI_BUS_NUM(sid)?
> 
> Will do.
> 
> >
> > > +    VTDAddressSpace *vtd_as = s->vtd_as_cache[bus_num];
> > > +
> > > +    if (vtd_as &&
> > > +        (sid == vtd_make_source_id(pci_bus_num(vtd_as->bus),
> > > +                                   vtd_as->devfn))) {
> > > +        return vtd_as;
> > > +    }
> > > +
> > > +    vtd_as = g_hash_table_find(s->vtd_as, vtd_find_as_by_sid, &sid);
> > > +    s->vtd_as_cache[bus_num] = vtd_as;
> > > +
> > > +    return vtd_as;
> > > +}
> > > +
> > >  static void vtd_pt_enable_fast_path(IntelIOMMUState *s, uint16_t source_id)
> > >  {
> > > -    VTDBus *vtd_bus;
> > >      VTDAddressSpace *vtd_as;
> > >      bool success = false;
> > >
> > > -    vtd_bus = vtd_find_as_from_bus_num(s, VTD_SID_TO_BUS(source_id));
> > > -    if (!vtd_bus) {
> > > -        goto out;
> > > -    }
> > > -
> > > -    vtd_as = vtd_bus->dev_as[VTD_SID_TO_DEVFN(source_id)];
> > > +    vtd_as = vtd_get_as_by_sid(s, source_id);
> > >      if (!vtd_as) {
> > >          goto out;
> > >      }
> > > @@ -1907,11 +1918,10 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
> > >                                            uint16_t source_id,
> > >                                            uint16_t func_mask)
> > >  {
> > > +    GHashTableIter as_it;
> > >      uint16_t mask;
> > > -    VTDBus *vtd_bus;
> > >      VTDAddressSpace *vtd_as;
> > >      uint8_t bus_n, devfn;
> > > -    uint16_t devfn_it;
> > >
> > >      trace_vtd_inv_desc_cc_devices(source_id, func_mask);
> > >
> > > @@ -1934,32 +1944,31 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
> > >      mask = ~mask;
> > >
> > >      bus_n = VTD_SID_TO_BUS(source_id);
> > > -    vtd_bus = vtd_find_as_from_bus_num(s, bus_n);
> > > -    if (vtd_bus) {
> > > -        devfn = VTD_SID_TO_DEVFN(source_id);
> > > -        for (devfn_it = 0; devfn_it < PCI_DEVFN_MAX; ++devfn_it) {
> > > -            vtd_as = vtd_bus->dev_as[devfn_it];
> > > -            if (vtd_as && ((devfn_it & mask) == (devfn & mask))) {
> > > -                trace_vtd_inv_desc_cc_device(bus_n, VTD_PCI_SLOT(devfn_it),
> > > -                                             VTD_PCI_FUNC(devfn_it));
> > > -                vtd_iommu_lock(s);
> > > -                vtd_as->context_cache_entry.context_cache_gen = 0;
> > > -                vtd_iommu_unlock(s);
> > > -                /*
> > > -                 * Do switch address space when needed, in case if the
> > > -                 * device passthrough bit is switched.
> > > -                 */
> > > -                vtd_switch_address_space(vtd_as);
> > > -                /*
> > > -                 * So a device is moving out of (or moving into) a
> > > -                 * domain, resync the shadow page table.
> > > -                 * This won't bring bad even if we have no such
> > > -                 * notifier registered - the IOMMU notification
> > > -                 * framework will skip MAP notifications if that
> > > -                 * happened.
> > > -                 */
> > > -                vtd_sync_shadow_page_table(vtd_as);
> > > -            }
> > > +    devfn = VTD_SID_TO_DEVFN(source_id);
> > > +
> > > +    g_hash_table_iter_init(&as_it, s->vtd_as);
> > > +    while (g_hash_table_iter_next(&as_it, NULL, (void**)&vtd_as)) {
> > > +        if ((pci_bus_num(vtd_as->bus) == bus_n) &&
> > > +            (vtd_as->devfn & mask) == (devfn & mask)) {
> > > +            trace_vtd_inv_desc_cc_device(bus_n, VTD_PCI_SLOT(vtd_as->devfn),
> > > +                                         VTD_PCI_FUNC(vtd_as->devfn));
> > > +            vtd_iommu_lock(s);
> > > +            vtd_as->context_cache_entry.context_cache_gen = 0;
> > > +            vtd_iommu_unlock(s);
> > > +            /*
> > > +             * Do switch address space when needed, in case if the
> > > +             * device passthrough bit is switched.
> > > +             */
> > > +            vtd_switch_address_space(vtd_as);
> > > +            /*
> > > +             * So a device is moving out of (or moving into) a
> > > +             * domain, resync the shadow page table.
> > > +             * This won't bring bad even if we have no such
> > > +             * notifier registered - the IOMMU notification
> > > +             * framework will skip MAP notifications if that
> > > +             * happened.
> > > +             */
> > > +            vtd_sync_shadow_page_table(vtd_as);
> > >          }
> > >      }
> > >  }
> > > @@ -2473,18 +2482,13 @@ static bool vtd_process_device_iotlb_desc(IntelIOMMUState *s,
> > >  {
> > >      VTDAddressSpace *vtd_dev_as;
> > >      IOMMUTLBEvent event;
> > > -    struct VTDBus *vtd_bus;
> > >      hwaddr addr;
> > >      uint64_t sz;
> > >      uint16_t sid;
> > > -    uint8_t devfn;
> > >      bool size;
> > > -    uint8_t bus_num;
> > >
> > >      addr = VTD_INV_DESC_DEVICE_IOTLB_ADDR(inv_desc->hi);
> > >      sid = VTD_INV_DESC_DEVICE_IOTLB_SID(inv_desc->lo);
> > > -    devfn = sid & 0xff;
> > > -    bus_num = sid >> 8;
> > >      size = VTD_INV_DESC_DEVICE_IOTLB_SIZE(inv_desc->hi);
> > >
> > >      if ((inv_desc->lo & VTD_INV_DESC_DEVICE_IOTLB_RSVD_LO) ||
> > > @@ -2495,12 +2499,11 @@ static bool vtd_process_device_iotlb_desc(IntelIOMMUState *s,
> > >          return false;
> > >      }
> > >
> > > -    vtd_bus = vtd_find_as_from_bus_num(s, bus_num);
> > > -    if (!vtd_bus) {
> > > -        goto done;
> > > -    }
> > > -
> > > -    vtd_dev_as = vtd_bus->dev_as[devfn];
> > > +    /*
> > > +     * Using sid is OK since the guest should have finished the
> > > +     * initialization of both the bus and device.
> > > +     */
> > > +    vtd_dev_as = vtd_get_as_by_sid(s, sid);
> > >      if (!vtd_dev_as) {
> > >          goto done;
> > >      }
> > > @@ -3426,27 +3429,27 @@ static const MemoryRegionOps vtd_mem_ir_ops = {
> > >
> > >  VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
> > >  {
> > > -    uintptr_t key = (uintptr_t)bus;
> > > -    VTDBus *vtd_bus = g_hash_table_lookup(s->vtd_as_by_busptr, &key);
> > > +    /*
> > > +     * We can't simply use sid here since the bus number might not be
> > > +     * initialized by the guest.
> > > +     */
> > > +    struct vtd_as_key key = {
> > > +        .bus = bus,
> > > +        .devfn = devfn,
> > > +    };
> > >      VTDAddressSpace *vtd_dev_as;
> > >      char name[128];
> > >
> > > -    if (!vtd_bus) {
> > > -        uintptr_t *new_key = g_malloc(sizeof(*new_key));
> > > -        *new_key = (uintptr_t)bus;
> > > -        /* No corresponding free() */
> > > -        vtd_bus = g_malloc0(sizeof(VTDBus) + sizeof(VTDAddressSpace *) * \
> > > -                            PCI_DEVFN_MAX);
> > > -        vtd_bus->bus = bus;
> > > -        g_hash_table_insert(s->vtd_as_by_busptr, new_key, vtd_bus);
> > > -    }
> > > +    vtd_dev_as = g_hash_table_lookup(s->vtd_as, &key);
> > > +    if (!vtd_dev_as) {
> > > +        struct vtd_as_key *new_key = g_malloc(sizeof(*new_key));
> > >
> > > -    vtd_dev_as = vtd_bus->dev_as[devfn];
> > > +        new_key->bus = bus;
> > > +        new_key->devfn = devfn;
> > >
> > > -    if (!vtd_dev_as) {
> > >          snprintf(name, sizeof(name), "vtd-%02x.%x", PCI_SLOT(devfn),
> > >                   PCI_FUNC(devfn));
> > > -        vtd_bus->dev_as[devfn] = vtd_dev_as = g_malloc0(sizeof(VTDAddressSpace));
> > > +        vtd_dev_as = g_malloc0(sizeof(VTDAddressSpace));
> > >
> > >          vtd_dev_as->bus = bus;
> > >          vtd_dev_as->devfn = (uint8_t)devfn;
> > > @@ -3502,6 +3505,8 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
> > >                                              &vtd_dev_as->nodmar, 0);
> > >
> > >          vtd_switch_address_space(vtd_dev_as);
> > > +
> > > +        g_hash_table_insert(s->vtd_as, new_key, vtd_dev_as);
> > >      }
> > >      return vtd_dev_as;
> > >  }
> > > @@ -3875,7 +3880,6 @@ static void vtd_realize(DeviceState *dev, Error **errp)
> > >
> > >      QLIST_INIT(&s->vtd_as_with_notifiers);
> > >      qemu_mutex_init(&s->iommu_lock);
> > > -    memset(s->vtd_as_by_bus_num, 0, sizeof(s->vtd_as_by_bus_num));
> > >      memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
> > >                            "intel_iommu", DMAR_REG_SIZE);
> > >
> > > @@ -3897,8 +3901,8 @@ static void vtd_realize(DeviceState *dev, Error **errp)
> > >      /* No corresponding destroy */
> > >      s->iotlb = g_hash_table_new_full(vtd_uint64_hash, vtd_uint64_equal,
> > >                                       g_free, g_free);
> > > -    s->vtd_as_by_busptr = g_hash_table_new_full(vtd_uint64_hash, vtd_uint64_equal,
> > > -                                              g_free, g_free);
> > > +    s->vtd_as = g_hash_table_new_full(vtd_as_hash, vtd_as_equal,
> > > +                                      g_free, g_free);
> > >      vtd_init(s);
> > >      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, Q35_HOST_BRIDGE_IOMMU_ADDR);
> > >      pci_setup_iommu(bus, vtd_host_dma_iommu, dev);
> > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > index 3b5ac869db..fa1bed353c 100644
> > > --- a/include/hw/i386/intel_iommu.h
> > > +++ b/include/hw/i386/intel_iommu.h
> > > @@ -58,7 +58,6 @@ typedef struct VTDContextEntry VTDContextEntry;
> > >  typedef struct VTDContextCacheEntry VTDContextCacheEntry;
> > >  typedef struct VTDAddressSpace VTDAddressSpace;
> > >  typedef struct VTDIOTLBEntry VTDIOTLBEntry;
> > > -typedef struct VTDBus VTDBus;
> > >  typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
> > >  typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
> > >  typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
> > > @@ -111,12 +110,6 @@ struct VTDAddressSpace {
> > >      IOVATree *iova_tree;          /* Traces mapped IOVA ranges */
> > >  };
> > >
> > > -struct VTDBus {
> > > -    PCIBus* bus;             /* A reference to the bus to provide translation for */
> > > -    /* A table of VTDAddressSpace objects indexed by devfn */
> > > -    VTDAddressSpace *dev_as[];
> > > -};
> > > -
> > >  struct VTDIOTLBEntry {
> > >      uint64_t gfn;
> > >      uint16_t domain_id;
> > > @@ -253,8 +246,8 @@ struct IntelIOMMUState {
> > >      uint32_t context_cache_gen;     /* Should be in [1,MAX] */
> > >      GHashTable *iotlb;              /* IOTLB */
> > >
> > > -    GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
> > > -    VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
> > > +    GHashTable *vtd_as;             /* VTD address space indexed by source id*/
> >
> > It's not indexed by source ID but vtd_as_key?
> 
> Right, let me fix that.
> 
> >
> > Meanwhile how about renaming it to vtd_address_spaces?  Since we use vtd_as
> > as the name for VTDAddressSpace* in most code paths, so imho it'll be
> > easier to identify the two.
> 
> Ok.

With all the above nitpicks fixed, please feel free to add:

Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks!

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 43+ messages in thread
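The vtd_as_hash()/vtd_as_equal() callbacks referenced in the hunk above are not shown in this excerpt. As an illustrative sketch only (the real QEMU code uses GLib's GHashFunc/GEqualFunc signatures over gconstpointer, and these names are stand-ins), hashing and comparing a (bus, devfn) key might look like:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the vtd_as_key introduced by the patch; in
 * QEMU the bus member is a PCIBus *, treated here as opaque. */
struct vtd_as_key {
    void *bus;
    uint8_t devfn;
};

/* Mix the bus pointer into the hash and fold in devfn, so two keys on
 * the same bus but different devfn land in different buckets. */
static unsigned int vtd_as_hash(const struct vtd_as_key *key)
{
    uintptr_t bus = (uintptr_t)key->bus;
    return (unsigned int)(bus ^ (bus >> 16)) * 31u + key->devfn;
}

/* Keys match only when both the bus pointer and devfn match. */
static int vtd_as_equal(const struct vtd_as_key *a,
                        const struct vtd_as_key *b)
{
    return a->bus == b->bus && a->devfn == b->devfn;
}
```

This is why the patch can drop the per-bus VTDBus table: the hash table keyed on (bus, devfn) works before the guest has programmed bus numbers, whereas a source-id key would not.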

* Re: [PATCH V2 3/4] intel-iommu: convert VTD_PE_GET_FPD_ERR() to be a function
  2022-03-21  5:54 ` [PATCH V2 3/4] intel-iommu: convert VTD_PE_GET_FPD_ERR() to be a function Jason Wang
  2022-03-24  8:26   ` Tian, Kevin
@ 2022-04-22 13:08   ` Peter Xu
  1 sibling, 0 replies; 43+ messages in thread
From: Peter Xu @ 2022-04-22 13:08 UTC (permalink / raw)
  To: Jason Wang; +Cc: yi.l.liu, yi.y.sun, qemu-devel, mst

On Mon, Mar 21, 2022 at 01:54:28PM +0800, Jason Wang wrote:
> We used to have a macro for VTD_PE_GET_FPD_ERR() but it has an
> internal goto which prevents it from being reused. This patch converts
> that macro to a dedicated function and lets the caller decide what
> to do (e.g. using goto or not). This makes sure it can be re-used by
> other functions that require fault reporting.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 43+ messages in thread
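The conversion described in the commit message above can be illustrated with a generic sketch (all names here are hypothetical stand-ins, not the actual intel-iommu code): a macro embedding a goto forces every expansion site to provide an error label, while a plain predicate function reports the condition and leaves the control flow to the caller.

```c
#include <assert.h>
#include <stdbool.h>

struct pasid_entry {
    bool fpd;   /* fault-processing-disable bit (illustrative) */
};

/* Macro style: the embedded goto means this can only be expanded
 * inside a function that defines an 'error' label, so it cannot be
 * reused from arbitrary call sites. */
#define PE_GET_FPD_ERR(pe)    \
    do {                      \
        if ((pe)->fpd) {      \
            goto error;       \
        }                     \
    } while (0)

/* Function style: just report the condition; each caller chooses
 * whether to goto, return early, or log. */
static bool pe_fpd_set(const struct pasid_entry *pe)
{
    return pe->fpd;
}

static int translate(const struct pasid_entry *pe)
{
    if (pe_fpd_set(pe)) {
        return -1;   /* caller-side decision, no hidden goto */
    }
    return 0;
}
```

The function form composes with any error-handling convention, which is what "let the caller decide" buys.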

* Re: [PATCH V2 4/4] intel-iommu: PASID support
  2022-04-02  7:27               ` Tian, Kevin
  2022-04-06  3:31                 ` Jason Wang
@ 2022-04-22 15:03                 ` Peter Xu
  2022-04-23 16:51                   ` Michael S. Tsirkin
  1 sibling, 1 reply; 43+ messages in thread
From: Peter Xu @ 2022-04-22 15:03 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Jason Wang, Liu, Yi L, yi.y.sun, qemu-devel, mst

On Sat, Apr 02, 2022 at 07:27:15AM +0000, Tian, Kevin wrote:
> > > > > Earlier when Yi proposed Qemu changes for guest SVA [1] he aimed for
> > a
> > > > > coarse-grained knob design:
> > > > > --
> > > > >   Intel VT-d 3.0 introduces scalable mode, and it has a bunch of
> > capabilities
> > > > >   related to scalable mode translation, thus there are multiple
> > combinations.
> > > > >   While this vIOMMU implementation wants simplify it for user by
> > providing
> > > > >   typical combinations. User could config it by "x-scalable-mode" option.
> > > > The
> > > > >   usage is as below:
> > > > >     "-device intel-iommu,x-scalable-mode=["legacy"|"modern"]"
> > > > >
> > > > >     - "legacy": gives support for SL page table
> > > > >     - "modern": gives support for FL page table, pasid, virtual command
> > > > >     -  if not configured, means no scalable mode support, if not proper
> > > > >        configured, will throw error
> > > > > --
> > > > >
> > > > > Which way do you prefer to?
> > > > >
> > > > > [1] https://lists.gnu.org/archive/html/qemu-devel/2020-
> > 02/msg02805.html
> > > >
> > > > My understanding is that, if we want to deploy Qemu in a production
> > > > environment, we can't use the "x-" prefix. We need a full
> > > > implementation of each cap.
> > > >
> > > > E.g
> > > > -device intel-iommu,first-level=on,scalable-mode=on etc.
> > > >
> > >
> > > You meant each cap will get a separate control option?
> > >
> > > But that way requires the management stack or admin to have deep
> > > knowledge about how combinations of different capabilities work, e.g.
> > > if just turning on scalable mode w/o first-level cannot support vSVA
> > > on assigned devices. Is this a common practice when defining Qemu
> > > parameters?
> > 
> > We can have a safe and good default value for each cap. E.g
> > 
> > In qemu 8.0 we think scalable is mature, we can make scalable to be
> > enabled by default
> > in qemu 8.1 we think first-level is mature, we can make first level to
> > be enabled by default.
> > 
> 
> OK, that is a workable way.

Sorry again for a very late comment, I think I agree with both of you. :-)

For debugging-purpose parameters like pasid-mode, I'd suggest the
default value always depend on the parameters that we'll ultimately
expose to the user in the coarse-grained way.

Assuming it's scalable-mode that we plan to export (which sounds
reasonable), then by default we have pasid mode on if scalable-mode is
modern, otherwise off.  IMHO we don't even need to bother with turning it
on/off in machine types since we don't even declare these debugging
parameters supported, do we? :)

But these debugging parameters could be useful for debugging and triaging
for sure, so let's keep them, always prefixed with x-.  It makes sense to
have them as intermediate steps while the whole feature is being built.

Then at some point, when it's all stable, we export scalable-mode to
replace x-scalable-mode, with an initial versioning alongside (and it
doesn't need to start with VT-d 3.0, maybe rev3.3 or even later).

How's that sound?

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 43+ messages in thread
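The default policy proposed in the message above (pasid support follows scalable-mode unless the debugging knob is set explicitly) can be modeled as a small sketch; the option names and types below are hypothetical, not the actual QEMU properties:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

typedef enum {
    PASID_AUTO,   /* x-pasid-mode not set: follow scalable-mode */
    PASID_ON,
    PASID_OFF,
} PasidOpt;

/* pasid defaults to on exactly when scalable-mode is "modern";
 * an explicit x-pasid-mode value overrides the default. */
static bool resolve_pasid(const char *scalable_mode, PasidOpt opt)
{
    if (opt != PASID_AUTO) {
        return opt == PASID_ON;
    }
    return scalable_mode && strcmp(scalable_mode, "modern") == 0;
}
```

Keeping the derivation in one place like this means the debugging knob never needs machine-type compatibility handling, matching the reasoning above.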

* Re: [PATCH V2 4/4] intel-iommu: PASID support
  2022-04-22 15:03                 ` Peter Xu
@ 2022-04-23 16:51                   ` Michael S. Tsirkin
  0 siblings, 0 replies; 43+ messages in thread
From: Michael S. Tsirkin @ 2022-04-23 16:51 UTC (permalink / raw)
  To: Peter Xu; +Cc: Tian, Kevin, Liu, Yi L, yi.y.sun, Jason Wang, qemu-devel

On Fri, Apr 22, 2022 at 11:03:51AM -0400, Peter Xu wrote:
> [...]
> 
> How's that sound?
> 
> Thanks,
> 
> -- 
> Peter Xu

Sounds good.

-- 
MST



^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2022-04-23 16:53 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-21  5:54 [PATCH V2 0/4] PASID support for Intel IOMMU Jason Wang
2022-03-21  5:54 ` [PATCH V2 1/4] intel-iommu: don't warn guest errors when getting rid2pasid entry Jason Wang
2022-03-24  8:21   ` Tian, Kevin
2022-03-28  2:27     ` Jason Wang
2022-03-28  8:53       ` Yi Liu
2022-03-29  4:52         ` Jason Wang
2022-03-30  8:16           ` Tian, Kevin
2022-03-30  8:36             ` Jason Wang
2022-04-02  7:33               ` Tian, Kevin
2022-04-06  3:33                 ` Jason Wang
2022-04-06  3:41                   ` Tian, Kevin
2022-04-22  0:13               ` Peter Xu
2022-04-22  6:24                 ` Jason Wang
2022-04-22  7:57           ` Michael S. Tsirkin
2022-03-21  5:54 ` [PATCH V2 2/4] intel-iommu: drop VTDBus Jason Wang
2022-04-22  1:17   ` Peter Xu
2022-04-22  6:26     ` Jason Wang
2022-04-22 12:55       ` Peter Xu
2022-03-21  5:54 ` [PATCH V2 3/4] intel-iommu: convert VTD_PE_GET_FPD_ERR() to be a function Jason Wang
2022-03-24  8:26   ` Tian, Kevin
2022-03-28  2:27     ` Jason Wang
2022-04-22 13:08   ` Peter Xu
2022-03-21  5:54 ` [PATCH V2 4/4] intel-iommu: PASID support Jason Wang
2022-03-24  8:53   ` Tian, Kevin
2022-03-28  2:31     ` Jason Wang
2022-03-28  6:47       ` Tian, Kevin
2022-03-29  4:46         ` Jason Wang
2022-03-30  8:00           ` Tian, Kevin
2022-03-30  8:32             ` Jason Wang
2022-04-02  7:27               ` Tian, Kevin
2022-04-06  3:31                 ` Jason Wang
2022-04-22 15:03                 ` Peter Xu
2022-04-23 16:51                   ` Michael S. Tsirkin
2022-03-28  7:03   ` Tian, Kevin
2022-03-29  4:48     ` Jason Wang
2022-03-30  8:02       ` Tian, Kevin
2022-03-30  8:31         ` Jason Wang
2022-04-02  7:24           ` Tian, Kevin
2022-04-06  3:31             ` Jason Wang
2022-03-28  8:45   ` Yi Liu
2022-03-29  4:54     ` Jason Wang
2022-04-01 13:42       ` Yi Liu
2022-04-02  1:52         ` Jason Wang
