* [Qemu-devel] [RFC v1 0/3] intel_iommu: support scalable mode
@ 2019-01-30  5:09 Yi Sun
  2019-01-30  5:09 ` [Qemu-devel] [RFC v1 1/3] intel_iommu: scalable mode emulation Yi Sun
                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Yi Sun @ 2019-01-30  5:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, rth, ehabkost, mst, marcel.apfelbaum, peterx,
	kevin.tian, yi.l.liu, yi.y.sun, Yi Sun

Intel vt-d rev3.0 [1] introduces a new translation mode called
'scalable mode', which enables PASID-granular translations for
first level, second level, nested and pass-through modes. The
vt-d scalable mode is the key ingredient to enable Scalable I/O
Virtualization (Scalable IOV) [2] [3], which allows sharing a
device at the minimal possible granularity (ADI - Assignable Device
Interface). As a result, the previous Extended Context (ECS) mode
is deprecated (no production hardware ever implemented ECS).

This patch set emulates a minimal capability set of VT-d scalable
mode, equivalent to what is available in VT-d legacy mode today:
    1. Scalable mode root entry, context entry and PASID table
       (a short illustrative sketch follows this list)
    2. Second level translation under scalable mode
    3. Queued invalidation (with 256-bit descriptors)
    4. Pass-through mode
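
As a rough orientation, here is a minimal sketch of how a scalable-mode
root entry selects a 32-byte context entry. The types and helper names
are illustrative assumptions, not the emulator code:

    #include <stdint.h>

    /* Sketch only: scalable-mode root entry -> context entry lookup. */
    typedef struct { uint64_t lo, hi; } SMRootEntry;      /* 16-byte root entry */
    typedef struct { uint64_t val[4]; } SMContextEntry;   /* 32-byte context entry */

    static uint64_t sm_context_entry_addr(const SMRootEntry *re, uint8_t devfn)
    {
        /* devfn bit 7 selects the lower (lo) or upper (hi) context-table pointer */
        uint64_t ctp = (devfn & 0x80) ? (re->hi & ~0xfffULL) : (re->lo & ~0xfffULL);

        /* the remaining 7 bits index a table of 32-byte context entries */
        return ctp + (uint64_t)(devfn & 0x7f) * sizeof(SMContextEntry);
    }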

Corresponding intel-iommu driver support will be included in
kernel 5.0:
    https://www.spinics.net/lists/kernel/msg2985279.html

We will add emulation of the full scalable mode capability later,
following guest iommu driver progress, e.g.:
    1. First level translation
    2. Nested translation
    3. Per-PASID invalidation descriptors
    4. Page request services for handling recoverable faults

References:
[1] https://software.intel.com/en-us/download/intel-virtualization-technology-for-directed-io-architecture-specification
[2] https://software.intel.com/en-us/download/intel-scalable-io-virtualization-technical-specification
[3] https://schd.ws/hosted_files/lc32018/00/LC3-SIOV-final.pdf

Liu, Yi L (2):
  intel_iommu: scalable mode emulation
  intel_iommu: add 256 bits qi_desc support

Yi Sun (1):
  intel_iommu: add scalable-mode option to make scalable mode work

 hw/i386/intel_iommu.c          | 732 ++++++++++++++++++++++++++++++++---------
 hw/i386/intel_iommu_internal.h |  57 +++-
 hw/i386/trace-events           |   2 +-
 include/hw/i386/intel_iommu.h  |  20 +-
 4 files changed, 644 insertions(+), 167 deletions(-)

-- 
1.9.1


* [Qemu-devel] [RFC v1 1/3] intel_iommu: scalable mode emulation
  2019-01-30  5:09 [Qemu-devel] [RFC v1 0/3] intel_iommu: support scalable mode Yi Sun
@ 2019-01-30  5:09 ` Yi Sun
  2019-02-11 10:12   ` Peter Xu
  2019-01-30  5:09 ` [Qemu-devel] [RFC v1 2/3] intel_iommu: add 256 bits qi_desc support Yi Sun
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 26+ messages in thread
From: Yi Sun @ 2019-01-30  5:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, rth, ehabkost, mst, marcel.apfelbaum, peterx,
	kevin.tian, yi.l.liu, yi.y.sun, Yi Sun

From: "Liu, Yi L" <yi.l.liu@intel.com>

The Intel(R) VT-d 3.0 spec introduces scalable mode address translation
to replace extended context mode. This patch extends the current emulator
to support Scalable Mode, which includes the root table, context table and
new PASID table format changes. With this, intel_iommu emulates both legacy
mode and scalable mode (with a legacy-equivalent capability set).

The key points are:
1. Extend root table operations to support both legacy mode and scalable
   mode.
2. Extend context table operations to support both legacy mode and
   scalable mode.
3. Add pasid table operations to support scalable mode (a small sketch
   follows below).
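
For point 3, here is a minimal sketch of the two-level PASID directory/table
walk used to locate the RID2PASID entry. The helpers are standalone
illustrations with assumed names, not the functions added by this patch:

    #include <stdint.h>

    /* Sketch only: PASID directory (8-byte entries), PASID table (64-byte entries). */
    static uint64_t pasid_dir_entry_addr(uint64_t pasid_dir_base, uint32_t pasid)
    {
        uint64_t dir_index = (pasid >> 6) & 0x3fffULL;   /* upper PASID bits */

        return (pasid_dir_base & ~0xfffULL) + dir_index * 8;
    }

    static uint64_t pasid_entry_addr(uint64_t pasid_table_base, uint32_t pasid)
    {
        uint64_t table_index = pasid & 0x3fULL;          /* lower 6 PASID bits */

        return (pasid_table_base & ~0xfffULL) + table_index * 64;
    }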

[Yi Sun is a co-developer and contributed much to refining this commit.]
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c          | 528 ++++++++++++++++++++++++++++++++++-------
 hw/i386/intel_iommu_internal.h |  43 +++-
 hw/i386/trace-events           |   2 +-
 include/hw/i386/intel_iommu.h  |  16 +-
 4 files changed, 498 insertions(+), 91 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 8b72735..396ac8e 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -37,6 +37,34 @@
 #include "kvm_i386.h"
 #include "trace.h"
 
+#define vtd_devfn_check(devfn) ((devfn & VTD_DEVFN_CHECK_MASK) ? true : false)
+
+/* context entry operations */
+#define vtd_get_ce_size(s, ce) \
+    (((s)->root_scalable) ? \
+     VTD_CTX_ENTRY_SM_SIZE : VTD_CTX_ENTRY_LECY_SIZE)
+#define vtd_ce_get_domain_id(ce) VTD_CONTEXT_ENTRY_DID((ce)->val[1])
+#define vtd_ce_get_rid2pasid(ce) \
+    ((ce)->val[1] & VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK)
+#define vtd_ce_get_pasid_dir_table(ce) \
+    ((ce)->val[0] & VTD_PASID_DIR_BASE_ADDR_MASK)
+
+/* pasid operations */
+#define vtd_pdire_get_pasidt_base(pdire) \
+    ((pdire)->val & VTD_PASID_TABLE_BASE_ADDR_MASK)
+#define vtd_get_pasid_dir_entry_size() VTD_PASID_DIR_ENTRY_SIZE
+#define vtd_get_pasid_entry_size() VTD_PASID_ENTRY_SIZE
+#define vtd_get_pasid_dir_index(pasid) VTD_PASID_DIR_INDEX(pasid)
+#define vtd_get_pasid_table_index(pasid) VTD_PASID_TABLE_INDEX(pasid)
+
+/* pe operations */
+#define vtd_pe_get_type(pe) ((pe)->val[0] & VTD_SM_PASID_ENTRY_PGTT)
+#define vtd_pe_get_level(pe) (2 + (((pe)->val[0] >> 2) & VTD_SM_PASID_ENTRY_AW))
+#define vtd_pe_get_agaw(pe) \
+    (30 + (((pe)->val[0] >> 2) & VTD_SM_PASID_ENTRY_AW) * 9)
+#define vtd_pe_get_slpt_base(pe) ((pe)->val[0] & VTD_SM_PASID_ENTRY_SLPTPTR)
+#define vtd_pe_get_domain_id(pe) VTD_SM_PASID_ENTRY_DID((pe)->val[1])
+
 static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
 
@@ -512,9 +540,15 @@ static void vtd_generate_completion_event(IntelIOMMUState *s)
     }
 }
 
-static inline bool vtd_root_entry_present(VTDRootEntry *root)
+static inline bool vtd_root_entry_present(IntelIOMMUState *s,
+                                          VTDRootEntry *re,
+                                          uint8_t devfn)
 {
-    return root->val & VTD_ROOT_ENTRY_P;
+    if (s->root_scalable && vtd_devfn_check(devfn)) {
+        return re->hi & VTD_ROOT_ENTRY_P;
+    }
+
+    return re->lo & VTD_ROOT_ENTRY_P;
 }
 
 static int vtd_get_root_entry(IntelIOMMUState *s, uint8_t index,
@@ -524,36 +558,64 @@ static int vtd_get_root_entry(IntelIOMMUState *s, uint8_t index,
 
     addr = s->root + index * sizeof(*re);
     if (dma_memory_read(&address_space_memory, addr, re, sizeof(*re))) {
-        re->val = 0;
+        re->lo = 0;
         return -VTD_FR_ROOT_TABLE_INV;
     }
-    re->val = le64_to_cpu(re->val);
+    re->lo = le64_to_cpu(re->lo);
+    if (s->root_scalable) {
+        re->hi = le64_to_cpu(re->hi);
+    }
     return 0;
 }
 
-static inline bool vtd_ce_present(VTDContextEntry *context)
+static inline bool vtd_ce_present(VTDContextEntry *ce)
+{
+    return ce->val[0] & VTD_CONTEXT_ENTRY_P;
+}
+
+static inline dma_addr_t vtd_get_context_base(IntelIOMMUState *s,
+                                              VTDRootEntry *re,
+                                              uint8_t *index)
 {
-    return context->lo & VTD_CONTEXT_ENTRY_P;
+    if (s->root_scalable && vtd_devfn_check(*index)) {
+        *index = *index & (~VTD_DEVFN_CHECK_MASK);
+        return re->hi & VTD_ROOT_ENTRY_CTP;
+    }
+
+    return re->lo & VTD_ROOT_ENTRY_CTP;
 }
 
-static int vtd_get_context_entry_from_root(VTDRootEntry *root, uint8_t index,
+static void vtd_context_entry_format(IntelIOMMUState *s,
+                                     VTDContextEntry *ce)
+{
+    ce->val[0] = le64_to_cpu(ce->val[0]);
+    ce->val[1] = le64_to_cpu(ce->val[1]);
+    if (s->root_scalable) {
+        ce->val[2] = le64_to_cpu(ce->val[2]);
+        ce->val[3] = le64_to_cpu(ce->val[3]);
+    }
+}
+
+static int vtd_get_context_entry_from_root(IntelIOMMUState *s,
+                                           VTDRootEntry *re,
+                                           uint8_t index,
                                            VTDContextEntry *ce)
 {
-    dma_addr_t addr;
+    dma_addr_t addr, ce_size;
 
     /* we have checked that root entry is present */
-    addr = (root->val & VTD_ROOT_ENTRY_CTP) + index * sizeof(*ce);
-    if (dma_memory_read(&address_space_memory, addr, ce, sizeof(*ce))) {
+    ce_size = vtd_get_ce_size(s, ce);
+    addr = vtd_get_context_base(s, re, &index) + index * ce_size;
+    if (dma_memory_read(&address_space_memory, addr, ce, ce_size)) {
         return -VTD_FR_CONTEXT_TABLE_INV;
     }
-    ce->lo = le64_to_cpu(ce->lo);
-    ce->hi = le64_to_cpu(ce->hi);
+    vtd_context_entry_format(s, ce);
     return 0;
 }
 
 static inline dma_addr_t vtd_ce_get_slpt_base(VTDContextEntry *ce)
 {
-    return ce->lo & VTD_CONTEXT_ENTRY_SLPTPTR;
+    return ce->val[0] & VTD_CONTEXT_ENTRY_SLPTPTR;
 }
 
 static inline uint64_t vtd_get_slpte_addr(uint64_t slpte, uint8_t aw)
@@ -600,28 +662,203 @@ static inline bool vtd_is_level_supported(IntelIOMMUState *s, uint32_t level)
            (1ULL << (level - 2 + VTD_CAP_SAGAW_SHIFT));
 }
 
+/* Return true if check passed, otherwise false */
+static inline bool vtd_pe_type_check(X86IOMMUState *x86_iommu,
+                                     VTDPASIDEntry *pe)
+{
+    switch (vtd_pe_get_type(pe)) {
+    case VTD_SM_PASID_ENTRY_FLT:
+    case VTD_SM_PASID_ENTRY_SLT:
+    case VTD_SM_PASID_ENTRY_NESTED:
+        break;
+    case VTD_SM_PASID_ENTRY_PT:
+        if (!x86_iommu->pt_supported) {
+            return false;
+        }
+        break;
+    default:
+        /* Unknown type */
+        return false;
+    }
+    return true;
+}
+
+static inline int vtd_get_pasid_dire(dma_addr_t pasid_dir_base,
+                                     uint32_t pasid,
+                                     VTDPASIDDirEntry *pdire)
+{
+    uint32_t index;
+    dma_addr_t addr, entry_size;
+
+    index = vtd_get_pasid_dir_index(pasid);
+    entry_size = vtd_get_pasid_dir_entry_size();
+    addr = pasid_dir_base + index * entry_size;
+    if (dma_memory_read(&address_space_memory, addr, pdire, entry_size)) {
+        return -VTD_FR_PASID_TABLE_INV;
+    }
+
+    return 0;
+}
+
+static inline int vtd_get_pasid_entry(IntelIOMMUState *s,
+                                      uint32_t pasid,
+                                      VTDPASIDDirEntry *pdire,
+                                      VTDPASIDEntry *pe)
+{
+    uint32_t index;
+    dma_addr_t addr, entry_size;
+    X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
+
+    index = vtd_get_pasid_table_index(pasid);
+    entry_size = vtd_get_pasid_entry_size();
+    addr = vtd_pdire_get_pasidt_base(pdire);
+    addr = addr + index * entry_size;
+    if (dma_memory_read(&address_space_memory, addr, pe, entry_size)) {
+        memset(pe->val, 0, sizeof(pe->val));
+        return -VTD_FR_PASID_TABLE_INV;
+    }
+
+    /* Do translation type check */
+    if (!vtd_pe_type_check(x86_iommu, pe)) {
+        return -VTD_FR_PASID_TABLE_INV;
+    }
+
+    if (!vtd_is_level_supported(s, vtd_pe_get_level(pe))) {
+        return -VTD_FR_PASID_TABLE_INV;
+    }
+
+    return 0;
+}
+
+static int vtd_get_pasid_entry_from_pasid(IntelIOMMUState *s,
+                                          dma_addr_t pasid_dir_base,
+                                          uint32_t pasid,
+                                          VTDPASIDEntry *pe)
+{
+    int ret;
+    VTDPASIDDirEntry pdire;
+
+    ret = vtd_get_pasid_dire(pasid_dir_base, pasid, &pdire);
+    if (ret) {
+        return ret;
+    }
+
+    ret = vtd_get_pasid_entry(s, pasid, &pdire, pe);
+    if (ret) {
+        return ret;
+    }
+
+    return ret;
+}
+
+static inline int vtd_ce_get_rid2pasid_entry(IntelIOMMUState *s,
+                                             VTDContextEntry *ce,
+                                             VTDPASIDEntry *pe)
+{
+    uint32_t pasid;
+    dma_addr_t pasid_dir_base;
+    int ret = 0;
+
+    pasid = vtd_ce_get_rid2pasid(ce);
+    pasid_dir_base = vtd_ce_get_pasid_dir_table(ce);
+    ret = vtd_get_pasid_entry_from_pasid(s, pasid_dir_base, pasid, pe);
+
+    return ret;
+}
+
+static inline int vtd_ce_get_pasid_fpd(IntelIOMMUState *s,
+                                       VTDContextEntry *ce,
+                                       bool *pe_fpd_set)
+{
+    int ret;
+    uint32_t pasid;
+    dma_addr_t pasid_dir_base;
+    VTDPASIDDirEntry pdire;
+    VTDPASIDEntry pe;
+
+    pasid = vtd_ce_get_rid2pasid(ce);
+    pasid_dir_base = vtd_ce_get_pasid_dir_table(ce);
+
+    ret = vtd_get_pasid_dire(pasid_dir_base, pasid, &pdire);
+    if (ret) {
+        return ret;
+    }
+
+    if (pdire.val & VTD_PASID_DIR_FPD) {
+        *pe_fpd_set = true;
+        return 0;
+    }
+
+    ret = vtd_get_pasid_entry(s, pasid, &pdire, &pe);
+    if (ret) {
+        return ret;
+    }
+
+    if (pe.val[0] & VTD_PASID_ENTRY_FPD) {
+        *pe_fpd_set = true;
+    }
+
+    return 0;
+}
+
 /* Get the page-table level that hardware should use for the second-level
  * page-table walk from the Address Width field of context-entry.
  */
 static inline uint32_t vtd_ce_get_level(VTDContextEntry *ce)
 {
-    return 2 + (ce->hi & VTD_CONTEXT_ENTRY_AW);
+    return 2 + (ce->val[1] & VTD_CONTEXT_ENTRY_AW);
+}
+
+static inline uint32_t vtd_get_iova_level(IntelIOMMUState *s,
+                                          VTDContextEntry *ce)
+{
+    VTDPASIDEntry pe;
+
+    if (s->root_scalable) {
+        vtd_ce_get_rid2pasid_entry(s, ce, &pe);
+        return vtd_pe_get_level(&pe);
+    }
+
+    return vtd_ce_get_level(ce);
 }
 
 static inline uint32_t vtd_ce_get_agaw(VTDContextEntry *ce)
 {
-    return 30 + (ce->hi & VTD_CONTEXT_ENTRY_AW) * 9;
+    return 30 + (ce->val[1] & VTD_CONTEXT_ENTRY_AW) * 9;
+}
+
+static inline uint32_t vtd_get_iova_agaw(IntelIOMMUState *s,
+                                         VTDContextEntry *ce)
+{
+    VTDPASIDEntry pe;
+
+    if (s->root_scalable) {
+        vtd_ce_get_rid2pasid_entry(s, ce, &pe);
+        return vtd_pe_get_agaw(&pe);
+    }
+
+    return vtd_ce_get_agaw(ce);
 }
 
 static inline uint32_t vtd_ce_get_type(VTDContextEntry *ce)
 {
-    return ce->lo & VTD_CONTEXT_ENTRY_TT;
+    return ce->val[0] & VTD_CONTEXT_ENTRY_TT;
 }
 
 /* Return true if check passed, otherwise false */
-static inline bool vtd_ce_type_check(X86IOMMUState *x86_iommu,
+static inline bool vtd_ce_type_check(IntelIOMMUState *s,
+                                     X86IOMMUState *x86_iommu,
                                      VTDContextEntry *ce)
 {
+    if (s->root_scalable) {
+        /*
+         * Translation Type locates in context entry only when VTD is in
+         * legacy mode. For scalable mode, need to return true to avoid
+         * unnecessary fault.
+         */
+        return true;
+    }
+
     switch (vtd_ce_get_type(ce)) {
     case VTD_CONTEXT_TT_MULTI_LEVEL:
         /* Always supported */
@@ -639,7 +876,7 @@ static inline bool vtd_ce_type_check(X86IOMMUState *x86_iommu,
         }
         break;
     default:
-        /* Unknwon type */
+        /* Unknown type */
         error_report_once("%s: unknown ce type: %"PRIu32, __func__,
                           vtd_ce_get_type(ce));
         return false;
@@ -647,21 +884,36 @@ static inline bool vtd_ce_type_check(X86IOMMUState *x86_iommu,
     return true;
 }
 
-static inline uint64_t vtd_iova_limit(VTDContextEntry *ce, uint8_t aw)
+static inline uint64_t vtd_iova_limit(IntelIOMMUState *s,
+                                      VTDContextEntry *ce, uint8_t aw)
 {
-    uint32_t ce_agaw = vtd_ce_get_agaw(ce);
+    uint32_t ce_agaw = vtd_get_iova_agaw(s, ce);
     return 1ULL << MIN(ce_agaw, aw);
 }
 
 /* Return true if IOVA passes range check, otherwise false. */
-static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce,
+static inline bool vtd_iova_range_check(IntelIOMMUState *s,
+                                        uint64_t iova, VTDContextEntry *ce,
                                         uint8_t aw)
 {
     /*
      * Check if @iova is above 2^X-1, where X is the minimum of MGAW
      * in CAP_REG and AW in context-entry.
      */
-    return !(iova & ~(vtd_iova_limit(ce, aw) - 1));
+    return !(iova & ~(vtd_iova_limit(s, ce, aw) - 1));
+}
+
+static inline dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
+                                                 VTDContextEntry *ce)
+{
+    VTDPASIDEntry pe;
+
+    if (s->root_scalable) {
+        vtd_ce_get_rid2pasid_entry(s, ce, &pe);
+        return vtd_pe_get_slpt_base(&pe);
+    }
+
+    return vtd_ce_get_slpt_base(ce);
 }
 
 /*
@@ -707,17 +959,18 @@ static VTDBus *vtd_find_as_from_bus_num(IntelIOMMUState *s, uint8_t bus_num)
 /* Given the @iova, get relevant @slptep. @slpte_level will be the last level
  * of the translation, can be used for deciding the size of large page.
  */
-static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
+static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
+                             uint64_t iova, bool is_write,
                              uint64_t *slptep, uint32_t *slpte_level,
                              bool *reads, bool *writes, uint8_t aw_bits)
 {
-    dma_addr_t addr = vtd_ce_get_slpt_base(ce);
-    uint32_t level = vtd_ce_get_level(ce);
+    dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce);
+    uint32_t level = vtd_get_iova_level(s, ce);
     uint32_t offset;
     uint64_t slpte;
     uint64_t access_right_check;
 
-    if (!vtd_iova_range_check(iova, ce, aw_bits)) {
+    if (!vtd_iova_range_check(s, iova, ce, aw_bits)) {
         error_report_once("%s: detected IOVA overflow (iova=0x%" PRIx64 ")",
                           __func__, iova);
         return -VTD_FR_ADDR_BEYOND_MGAW;
@@ -733,7 +986,7 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
         if (slpte == (uint64_t)-1) {
             error_report_once("%s: detected read error on DMAR slpte "
                               "(iova=0x%" PRIx64 ")", __func__, iova);
-            if (level == vtd_ce_get_level(ce)) {
+            if (level == vtd_get_iova_level(s, ce)) {
                 /* Invalid programming of context-entry */
                 return -VTD_FR_CONTEXT_ENTRY_INV;
             } else {
@@ -962,29 +1215,96 @@ next:
 /**
  * vtd_page_walk - walk specific IOVA range, and call the hook
  *
+ * @s: intel iommu state
  * @ce: context entry to walk upon
  * @start: IOVA address to start the walk
  * @end: IOVA range end address (start <= addr < end)
  * @info: page walking information struct
  */
-static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
+static int vtd_page_walk(IntelIOMMUState *s, VTDContextEntry *ce,
+                         uint64_t start, uint64_t end,
                          vtd_page_walk_info *info)
 {
-    dma_addr_t addr = vtd_ce_get_slpt_base(ce);
-    uint32_t level = vtd_ce_get_level(ce);
+    dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce);
+    uint32_t level = vtd_get_iova_level(s, ce);
 
-    if (!vtd_iova_range_check(start, ce, info->aw)) {
+    if (!vtd_iova_range_check(s, start, ce, info->aw)) {
         return -VTD_FR_ADDR_BEYOND_MGAW;
     }
 
-    if (!vtd_iova_range_check(end, ce, info->aw)) {
+    if (!vtd_iova_range_check(s, end, ce, info->aw)) {
         /* Fix end so that it reaches the maximum */
-        end = vtd_iova_limit(ce, info->aw);
+        end = vtd_iova_limit(s, ce, info->aw);
     }
 
     return vtd_page_walk_level(addr, start, end, level, true, true, info);
 }
 
+static int vtd_root_entry_rsvd_bits_check(IntelIOMMUState *s,
+                                          VTDRootEntry *re)
+{
+    /* Legacy Mode reserved bits check */
+    if (!s->root_scalable &&
+        (re->hi || (re->lo & VTD_ROOT_ENTRY_RSVD(s->aw_bits))))
+        goto rsvd_err;
+
+    /* Scalable Mode reserved bits check */
+    if (s->root_scalable &&
+        ((re->lo & VTD_ROOT_ENTRY_RSVD(s->aw_bits)) ||
+         (re->hi & VTD_ROOT_ENTRY_RSVD(s->aw_bits))))
+        goto rsvd_err;
+
+    return 0;
+
+rsvd_err:
+    error_report_once("%s: invalid root entry: hi=0x%"PRIx64
+                      ", lo=0x%"PRIx64,
+                      __func__, re->hi, re->lo);
+    return -VTD_FR_ROOT_ENTRY_RSVD;
+}
+
+static inline int vtd_context_entry_rsvd_bits_check(IntelIOMMUState *s,
+                                                    VTDContextEntry *ce)
+{
+    if (!s->root_scalable &&
+        (ce->val[1] & VTD_CONTEXT_ENTRY_RSVD_HI ||
+         ce->val[0] & VTD_CONTEXT_ENTRY_RSVD_LO(s->aw_bits))) {
+        error_report_once("%s: invalid context entry: hi=%"PRIx64
+                          ", lo=%"PRIx64" (reserved nonzero)",
+                          __func__, ce->val[1], ce->val[0]);
+        return -VTD_FR_CONTEXT_ENTRY_RSVD;
+    }
+
+    if (s->root_scalable &&
+        (ce->val[0] & VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(s->aw_bits) ||
+         ce->val[1] & VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 ||
+         ce->val[2] ||
+         ce->val[3])) {
+        error_report_once("%s: invalid context entry: val[3]=%"PRIx64
+                          ", val[2]=%"PRIx64
+                          ", val[1]=%"PRIx64
+                          ", val[0]=%"PRIx64" (reserved nonzero)",
+                          __func__, ce->val[3], ce->val[2],
+                          ce->val[1], ce->val[0]);
+        return -VTD_FR_CONTEXT_ENTRY_RSVD;
+    }
+
+    return 0;
+}
+
+static inline int vtd_ce_rid2pasid_check(IntelIOMMUState *s,
+                                         VTDContextEntry *ce)
+{
+    VTDPASIDEntry pe;
+
+    /*
+     * Make sure that in Scalable Mode a present context entry
+     * has a valid rid2pasid setting, i.e. a valid rid2pasid
+     * field and a corresponding pasid entry.
+     */
+    return vtd_ce_get_rid2pasid_entry(s, ce, &pe);
+}
+
 /* Map a device to its corresponding domain (context-entry) */
 static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
                                     uint8_t devfn, VTDContextEntry *ce)
@@ -998,20 +1318,18 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
         return ret_fr;
     }
 
-    if (!vtd_root_entry_present(&re)) {
+    if (!vtd_root_entry_present(s, &re, devfn)) {
         /* Not error - it's okay we don't have root entry. */
         trace_vtd_re_not_present(bus_num);
         return -VTD_FR_ROOT_ENTRY_P;
     }
 
-    if (re.rsvd || (re.val & VTD_ROOT_ENTRY_RSVD(s->aw_bits))) {
-        error_report_once("%s: invalid root entry: rsvd=0x%"PRIx64
-                          ", val=0x%"PRIx64" (reserved nonzero)",
-                          __func__, re.rsvd, re.val);
-        return -VTD_FR_ROOT_ENTRY_RSVD;
+    ret_fr = vtd_root_entry_rsvd_bits_check(s, &re);
+    if (ret_fr) {
+        return ret_fr;
     }
 
-    ret_fr = vtd_get_context_entry_from_root(&re, devfn, ce);
+    ret_fr = vtd_get_context_entry_from_root(s, &re, devfn, ce);
     if (ret_fr) {
         return ret_fr;
     }
@@ -1022,28 +1340,40 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
         return -VTD_FR_CONTEXT_ENTRY_P;
     }
 
-    if ((ce->hi & VTD_CONTEXT_ENTRY_RSVD_HI) ||
-               (ce->lo & VTD_CONTEXT_ENTRY_RSVD_LO(s->aw_bits))) {
-        error_report_once("%s: invalid context entry: hi=%"PRIx64
-                          ", lo=%"PRIx64" (reserved nonzero)",
-                          __func__, ce->hi, ce->lo);
-        return -VTD_FR_CONTEXT_ENTRY_RSVD;
+    ret_fr = vtd_context_entry_rsvd_bits_check(s, ce);
+    if (ret_fr) {
+        return ret_fr;
     }
 
     /* Check if the programming of context-entry is valid */
-    if (!vtd_is_level_supported(s, vtd_ce_get_level(ce))) {
+    if (!s->root_scalable &&
+        !vtd_is_level_supported(s, vtd_ce_get_level(ce))) {
         error_report_once("%s: invalid context entry: hi=%"PRIx64
                           ", lo=%"PRIx64" (level %d not supported)",
-                          __func__, ce->hi, ce->lo, vtd_ce_get_level(ce));
+                          __func__, ce->val[1], ce->val[0],
+                          vtd_ce_get_level(ce));
         return -VTD_FR_CONTEXT_ENTRY_INV;
     }
 
     /* Do translation type check */
-    if (!vtd_ce_type_check(x86_iommu, ce)) {
+    if (!vtd_ce_type_check(s, x86_iommu, ce)) {
         /* Errors dumped in vtd_ce_type_check() */
         return -VTD_FR_CONTEXT_ENTRY_INV;
     }
 
+    /*
+     * Check if the programming of context-entry.rid2pasid
+     * and the corresponding pasid setting is valid; this
+     * avoids having to check the pasid entry fetch result
+     * in later helper function calls.
+     */
+    if (s->root_scalable) {
+        ret_fr = vtd_ce_rid2pasid_check(s, ce);
+        if (ret_fr) {
+            return ret_fr;
+        }
+    }
+
     return 0;
 }
 
@@ -1065,10 +1395,10 @@ static int vtd_sync_shadow_page_table_range(VTDAddressSpace *vtd_as,
         .notify_unmap = true,
         .aw = s->aw_bits,
         .as = vtd_as,
-        .domain_id = VTD_CONTEXT_ENTRY_DID(ce->hi),
+        .domain_id = VTD_CONTEXT_ENTRY_DID(ce->val[1]),
     };
 
-    return vtd_page_walk(ce, addr, addr + size, &info);
+    return vtd_page_walk(s, ce, addr, addr + size, &info);
 }
 
 static int vtd_sync_shadow_page_table(VTDAddressSpace *vtd_as)
@@ -1103,35 +1433,24 @@ static int vtd_sync_shadow_page_table(VTDAddressSpace *vtd_as)
 }
 
 /*
- * Fetch translation type for specific device. Returns <0 if error
- * happens, otherwise return the shifted type to check against
- * VTD_CONTEXT_TT_*.
+ * Check if a specific device is configured to bypass address
+ * translation for DMA requests. In Scalable Mode, whether
+ * 1st-level or 2nd-level translation is bypassed depends on
+ * the PGTT setting.
  */
-static int vtd_dev_get_trans_type(VTDAddressSpace *as)
+static bool vtd_dev_pt_enabled(VTDAddressSpace *as)
 {
     IntelIOMMUState *s;
     VTDContextEntry ce;
+    VTDPASIDEntry pe;
     int ret;
 
-    s = as->iommu_state;
+    assert(as);
 
+    s = as->iommu_state;
     ret = vtd_dev_to_context_entry(s, pci_bus_num(as->bus),
                                    as->devfn, &ce);
     if (ret) {
-        return ret;
-    }
-
-    return vtd_ce_get_type(&ce);
-}
-
-static bool vtd_dev_pt_enabled(VTDAddressSpace *as)
-{
-    int ret;
-
-    assert(as);
-
-    ret = vtd_dev_get_trans_type(as);
-    if (ret < 0) {
         /*
          * Possibly failed to parse the context entry for some reason
          * (e.g., during init, or any guest configuration errors on
@@ -1141,7 +1460,25 @@ static bool vtd_dev_pt_enabled(VTDAddressSpace *as)
         return false;
     }
 
-    return ret == VTD_CONTEXT_TT_PASS_THROUGH;
+    if (s->root_scalable) {
+        vtd_ce_get_rid2pasid_entry(s, &ce, &pe);
+        return (vtd_pe_get_type(&pe) == VTD_SM_PASID_ENTRY_PT);
+    }
+
+    return (vtd_ce_get_type(&ce) == VTD_CONTEXT_TT_PASS_THROUGH);
+}
+
+static inline uint16_t vtd_get_domain_id(IntelIOMMUState *s,
+                                         VTDContextEntry *ce)
+{
+    VTDPASIDEntry pe;
+
+    if (s->root_scalable) {
+        vtd_ce_get_rid2pasid_entry(s, ce, &pe);
+        return vtd_pe_get_domain_id(&pe);
+    }
+
+    return vtd_ce_get_domain_id(ce);
 }
 
 /* Return whether the device is using IOMMU translation. */
@@ -1317,14 +1654,29 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
 
     /* Try to fetch context-entry from cache first */
     if (cc_entry->context_cache_gen == s->context_cache_gen) {
-        trace_vtd_iotlb_cc_hit(bus_num, devfn, cc_entry->context_entry.hi,
-                               cc_entry->context_entry.lo,
+        trace_vtd_iotlb_cc_hit(bus_num, devfn, cc_entry->context_entry.val[1],
+                               cc_entry->context_entry.val[0],
                                cc_entry->context_cache_gen);
         ce = cc_entry->context_entry;
-        is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
+        is_fpd_set = ce.val[0] & VTD_CONTEXT_ENTRY_FPD;
+        if (s->root_scalable) {
+            ret_fr = vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set);
+            if (ret_fr) {
+                ret_fr = -ret_fr;
+                if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
+                    trace_vtd_fault_disabled();
+                } else {
+                    vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
+                }
+                goto error;
+            }
+        }
     } else {
         ret_fr = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
-        is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
+        is_fpd_set = ce.val[0] & VTD_CONTEXT_ENTRY_FPD;
+        if (!ret_fr && s->root_scalable) {
+            ret_fr = vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set);
+        }
         if (ret_fr) {
             ret_fr = -ret_fr;
             if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
@@ -1335,7 +1687,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
             goto error;
         }
         /* Update context-cache */
-        trace_vtd_iotlb_cc_update(bus_num, devfn, ce.hi, ce.lo,
+        trace_vtd_iotlb_cc_update(bus_num, devfn, ce.val[1], ce.val[0],
                                   cc_entry->context_cache_gen,
                                   s->context_cache_gen);
         cc_entry->context_entry = ce;
@@ -1367,7 +1719,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         return true;
     }
 
-    ret_fr = vtd_iova_to_slpte(&ce, addr, is_write, &slpte, &level,
+    ret_fr = vtd_iova_to_slpte(s, &ce, addr, is_write, &slpte, &level,
                                &reads, &writes, s->aw_bits);
     if (ret_fr) {
         ret_fr = -ret_fr;
@@ -1381,7 +1733,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
 
     page_mask = vtd_slpt_level_page_mask(level);
     access_flags = IOMMU_ACCESS_FLAG(reads, writes);
-    vtd_update_iotlb(s, source_id, VTD_CONTEXT_ENTRY_DID(ce.hi), addr, slpte,
+    vtd_update_iotlb(s, source_id, vtd_get_domain_id(s, &ce), addr, slpte,
                      access_flags, level);
 out:
     vtd_iommu_unlock(s);
@@ -1404,6 +1756,7 @@ static void vtd_root_table_setup(IntelIOMMUState *s)
 {
     s->root = vtd_get_quad_raw(s, DMAR_RTADDR_REG);
     s->root_extended = s->root & VTD_RTADDR_RTT;
+    s->root_scalable = s->root & VTD_RTADDR_SMT;
     s->root &= VTD_RTADDR_ADDR_MASK(s->aw_bits);
 
     trace_vtd_reg_dmar_root(s->root, s->root_extended);
@@ -1573,7 +1926,7 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
     QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
         if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
                                       vtd_as->devfn, &ce) &&
-            domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
+            domain_id == vtd_get_domain_id(s, &ce)) {
             vtd_sync_shadow_page_table(vtd_as);
         }
     }
@@ -1591,7 +1944,7 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
     QLIST_FOREACH(vtd_as, &(s->vtd_as_with_notifiers), next) {
         ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
                                        vtd_as->devfn, &ce);
-        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
+        if (!ret && domain_id == vtd_get_domain_id(s, &ce)) {
             if (vtd_as_has_map_notifier(vtd_as)) {
                 /*
                  * As long as we have MAP notifications registered in
@@ -2629,6 +2982,7 @@ static const VMStateDescription vtd_vmstate = {
         VMSTATE_UINT8_ARRAY(csr, IntelIOMMUState, DMAR_REG_SIZE),
         VMSTATE_UINT8(iq_last_desc_type, IntelIOMMUState),
         VMSTATE_BOOL(root_extended, IntelIOMMUState),
+        VMSTATE_BOOL(root_scalable, IntelIOMMUState),
         VMSTATE_BOOL(dmar_enabled, IntelIOMMUState),
         VMSTATE_BOOL(qi_enabled, IntelIOMMUState),
         VMSTATE_BOOL(intr_enabled, IntelIOMMUState),
@@ -3098,10 +3452,11 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
     vtd_address_space_unmap(vtd_as, n);
 
     if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
-        trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
+        trace_vtd_replay_ce_valid(s->root_scalable ? "scalable mode" : "",
+                                  bus_n, PCI_SLOT(vtd_as->devfn),
                                   PCI_FUNC(vtd_as->devfn),
-                                  VTD_CONTEXT_ENTRY_DID(ce.hi),
-                                  ce.hi, ce.lo);
+                                  vtd_get_domain_id(s, &ce),
+                                  ce.val[1], ce.val[0]);
         if (vtd_as_has_map_notifier(vtd_as)) {
             /* This is required only for MAP typed notifiers */
             vtd_page_walk_info info = {
@@ -3110,10 +3465,10 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
                 .notify_unmap = false,
                 .aw = s->aw_bits,
                 .as = vtd_as,
-                .domain_id = VTD_CONTEXT_ENTRY_DID(ce.hi),
+                .domain_id = VTD_CONTEXT_ENTRY_DID(ce.val[1]),
             };
 
-            vtd_page_walk(&ce, 0, ~0ULL, &info);
+            vtd_page_walk(s, &ce, 0, ~0ULL, &info);
         }
     } else {
         trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
@@ -3137,6 +3492,7 @@ static void vtd_init(IntelIOMMUState *s)
 
     s->root = 0;
     s->root_extended = false;
+    s->root_scalable = false;
     s->dmar_enabled = false;
     s->iq_head = 0;
     s->iq_tail = 0;
@@ -3198,7 +3554,7 @@ static void vtd_init(IntelIOMMUState *s)
     vtd_define_long(s, DMAR_GCMD_REG, 0, 0xff800000UL, 0);
     vtd_define_long_wo(s, DMAR_GCMD_REG, 0xff800000UL);
     vtd_define_long(s, DMAR_GSTS_REG, 0, 0, 0);
-    vtd_define_quad(s, DMAR_RTADDR_REG, 0, 0xfffffffffffff000ULL, 0);
+    vtd_define_quad(s, DMAR_RTADDR_REG, 0, 0xfffffffffffffc00ULL, 0);
     vtd_define_quad(s, DMAR_CCMD_REG, 0, 0xe0000003ffffffffULL, 0);
     vtd_define_quad_wo(s, DMAR_CCMD_REG, 0x3ffff0000ULL);
 
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 00e9edb..02674f9 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -172,6 +172,7 @@
 
 /* RTADDR_REG */
 #define VTD_RTADDR_RTT              (1ULL << 11)
+#define VTD_RTADDR_SMT              (1ULL << 10)
 #define VTD_RTADDR_ADDR_MASK(aw)    (VTD_HAW_MASK(aw) ^ 0xfffULL)
 
 /* IRTA_REG */
@@ -294,6 +295,8 @@ typedef enum VTDFaultReason {
                                   * request while disabled */
     VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
 
+    VTD_FR_PASID_TABLE_INV = 0x58,
+
     /* This is not a normal fault reason. We use this to indicate some faults
      * that are not referenced by the VT-d specification.
      * Fault event with such reason should not be recorded.
@@ -411,8 +414,8 @@ typedef struct VTDIOTLBPageInvInfo VTDIOTLBPageInvInfo;
 #define VTD_PAGE_MASK_1G            (~((1ULL << VTD_PAGE_SHIFT_1G) - 1))
 
 struct VTDRootEntry {
-    uint64_t val;
-    uint64_t rsvd;
+    uint64_t lo;
+    uint64_t hi;
 };
 typedef struct VTDRootEntry VTDRootEntry;
 
@@ -423,6 +426,10 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_ROOT_ENTRY_NR           (VTD_PAGE_SIZE / sizeof(VTDRootEntry))
 #define VTD_ROOT_ENTRY_RSVD(aw)     (0xffeULL | ~VTD_HAW_MASK(aw))
 
+#define VTD_ROOT_ENTRY_SIZE         16
+
+#define VTD_DEVFN_CHECK_MASK        0x80
+
 /* Masks for struct VTDContextEntry */
 /* lo */
 #define VTD_CONTEXT_ENTRY_P         (1ULL << 0)
@@ -441,6 +448,38 @@ typedef struct VTDRootEntry VTDRootEntry;
 
 #define VTD_CONTEXT_ENTRY_NR        (VTD_PAGE_SIZE / sizeof(VTDContextEntry))
 
+#define VTD_CTX_ENTRY_LECY_SIZE     16
+#define VTD_CTX_ENTRY_SM_SIZE       32
+
+#define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
+#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
+#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
+
+/* PASID Table Related Definitions */
+#define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
+#define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
+#define VTD_PASID_DIR_ENTRY_SIZE      8
+#define VTD_PASID_ENTRY_SIZE          64
+#define VTD_PASID_DIR_BITS_MASK       (0x3fffULL)
+#define VTD_PASID_DIR_INDEX(pasid)    (((pasid) >> 6) & VTD_PASID_DIR_BITS_MASK)
+#define VTD_PASID_DIR_FPD             (1ULL << 1) /* Fault Processing Disable */
+#define VTD_PASID_TABLE_BITS_MASK     (0x3fULL)
+#define VTD_PASID_TABLE_INDEX(pasid)  ((pasid) & VTD_PASID_TABLE_BITS_MASK)
+#define VTD_PASID_ENTRY_FPD           (1ULL << 1) /* Fault Processing Disable */
+
+/* PASID Granular Translation Type Mask */
+#define VTD_SM_PASID_ENTRY_PGTT        (7ULL << 6)
+#define VTD_SM_PASID_ENTRY_FLT         (1ULL << 6)
+#define VTD_SM_PASID_ENTRY_SLT         (2ULL << 6)
+#define VTD_SM_PASID_ENTRY_NESTED      (3ULL << 6)
+#define VTD_SM_PASID_ENTRY_PT          (4ULL << 6)
+
+#define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
+#define VTD_SM_PASID_ENTRY_DID(val)    ((val) & VTD_DOMAIN_ID_MASK)
+
+/* Second Level Page Translation Pointer*/
+#define VTD_SM_PASID_ENTRY_SLPTPTR     (~0xfffULL)
+
 /* Paging Structure common */
 #define VTD_SL_PT_PAGE_SIZE_MASK    (1ULL << 7)
 /* Bits to decide the offset for each level */
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 77244fc..cae1b76 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -30,7 +30,7 @@ vtd_iotlb_cc_hit(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32
 vtd_iotlb_cc_update(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen1, uint32_t gen2) "IOTLB context update bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32" -> gen %"PRIu32
 vtd_iotlb_reset(const char *reason) "IOTLB reset (reason: %s)"
 vtd_fault_disabled(void) "Fault processing disabled for context entry"
-vtd_replay_ce_valid(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t domain, uint64_t hi, uint64_t lo) "replay valid context device %02"PRIx8":%02"PRIx8".%02"PRIx8" domain 0x%"PRIx16" hi 0x%"PRIx64" lo 0x%"PRIx64
+vtd_replay_ce_valid(const char *mode, uint8_t bus, uint8_t dev, uint8_t fn, uint16_t domain, uint64_t hi, uint64_t lo) "%s: replay valid context device %02"PRIx8":%02"PRIx8".%02"PRIx8" domain 0x%"PRIx16" hi 0x%"PRIx64" lo 0x%"PRIx64
 vtd_replay_ce_invalid(uint8_t bus, uint8_t dev, uint8_t fn) "replay invalid context device %02"PRIx8":%02"PRIx8".%02"PRIx8
 vtd_page_walk_level(uint64_t addr, uint32_t level, uint64_t start, uint64_t end) "walk (base=0x%"PRIx64", level=%"PRIu32") iova range 0x%"PRIx64" - 0x%"PRIx64
 vtd_page_walk_one(uint16_t domain, uint64_t iova, uint64_t gpa, uint64_t mask, int perm) "domain 0x%"PRIu16" iova 0x%"PRIx64" -> gpa 0x%"PRIx64" mask 0x%"PRIx64" perm %d"
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index a321cc9..ff13ff27 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -66,11 +66,12 @@ typedef struct VTDIOTLBEntry VTDIOTLBEntry;
 typedef struct VTDBus VTDBus;
 typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
 typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
+typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
+typedef struct VTDPASIDEntry VTDPASIDEntry;
 
 /* Context-Entry */
 struct VTDContextEntry {
-    uint64_t lo;
-    uint64_t hi;
+    uint64_t val[4];
 };
 
 struct VTDContextCacheEntry {
@@ -81,6 +82,16 @@ struct VTDContextCacheEntry {
     struct VTDContextEntry context_entry;
 };
 
+/* PASID Directory Entry */
+struct VTDPASIDDirEntry {
+    uint64_t val;
+};
+
+/* PASID Table Entry */
+struct VTDPASIDEntry {
+    uint64_t val[8];
+};
+
 struct VTDAddressSpace {
     PCIBus *bus;
     uint8_t devfn;
@@ -212,6 +223,7 @@ struct IntelIOMMUState {
 
     dma_addr_t root;                /* Current root table pointer */
     bool root_extended;             /* Type of root table (extended or not) */
+    bool root_scalable;             /* Type of root table (scalable or not) */
     bool dmar_enabled;              /* Set if DMA remapping is enabled */
 
     uint16_t iq_head;               /* Current invalidation queue head */
-- 
1.9.1


* [Qemu-devel] [RFC v1 2/3] intel_iommu: add 256 bits qi_desc support
  2019-01-30  5:09 [Qemu-devel] [RFC v1 0/3] intel_iommu: support scalable mode Yi Sun
  2019-01-30  5:09 ` [Qemu-devel] [RFC v1 1/3] intel_iommu: scalable mode emulation Yi Sun
@ 2019-01-30  5:09 ` Yi Sun
  2019-02-12  6:27   ` Peter Xu
  2019-01-30  5:09 ` [Qemu-devel] [RFC v1 3/3] intel_iommu: add scalable-mode option to make scalable mode work Yi Sun
  2019-02-11 10:37 ` [Qemu-devel] [RFC v1 0/3] intel_iommu: support scalable mode Peter Xu
  3 siblings, 1 reply; 26+ messages in thread
From: Yi Sun @ 2019-01-30  5:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, rth, ehabkost, mst, marcel.apfelbaum, peterx,
	kevin.tian, yi.l.liu, yi.y.sun, Yi Sun

From: "Liu, Yi L" <yi.l.liu@intel.com>

Per the Intel(R) VT-d 3.0 spec, the qi_desc is 256 bits wide in Scalable
Mode. This patch adds emulation of the 256-bit qi_desc, as sketched below.
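
As a rough sketch of the arithmetic this changes (standalone helpers with
assumed names, not the emulator code), the descriptor width selects the
queue stride and the tail-register shift:

    #include <stdint.h>
    #include <stdbool.h>

    /* Sketch only: 128-bit vs 256-bit invalidation descriptor handling. */
    static unsigned inv_desc_width(bool iq_dw)           /* DW bit from IQA_REG */
    {
        return iq_dw ? 32 : 16;                          /* bytes per descriptor */
    }

    static uint64_t inv_desc_addr(uint64_t iq_base, uint32_t head, bool iq_dw)
    {
        return iq_base + (uint64_t)head * inv_desc_width(iq_dw);
    }

    static uint32_t iq_tail_from_reg(uint64_t iqt_reg, bool iq_dw)
    {
        /* with 256-bit descriptors the tail index starts at bit 5, not bit 4 */
        return iq_dw ? ((iqt_reg >> 5) & 0x3fff) : ((iqt_reg >> 4) & 0x7fff);
    }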

[Yi Sun is a co-developer who rebased and refined the patch.]
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c          | 182 +++++++++++++++++++++++++----------------
 hw/i386/intel_iommu_internal.h |   8 +-
 include/hw/i386/intel_iommu.h  |   1 +
 3 files changed, 116 insertions(+), 75 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 396ac8e..3664a00 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -38,6 +38,7 @@
 #include "trace.h"
 
 #define vtd_devfn_check(devfn) ((devfn & VTD_DEVFN_CHECK_MASK) ? true : false)
+#define vtd_ecap_smts(s) ((s)->ecap & VTD_ECAP_SMTS)
 
 /* context entry operations */
 #define vtd_get_ce_size(s, ce) \
@@ -65,6 +66,9 @@
 #define vtd_pe_get_slpt_base(pe) ((pe)->val[0] & VTD_SM_PASID_ENTRY_SLPTPTR)
 #define vtd_pe_get_domain_id(pe) VTD_SM_PASID_ENTRY_DID((pe)->val[1])
 
+/* invalidation desc */
+#define vtd_get_inv_desc_width(s) ((s)->iq_dw ? 32 : 16)
+
 static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
 
@@ -1759,6 +1763,11 @@ static void vtd_root_table_setup(IntelIOMMUState *s)
     s->root_scalable = s->root & VTD_RTADDR_SMT;
     s->root &= VTD_RTADDR_ADDR_MASK(s->aw_bits);
 
+    /* if Scalable mode is not enabled, enforce iq_dw to be 16 bytes */
+    if (!s->root_scalable) {
+        s->iq_dw = 0;
+    }
+
     trace_vtd_reg_dmar_root(s->root, s->root_extended);
 }
 
@@ -2052,7 +2061,7 @@ static void vtd_handle_gcmd_qie(IntelIOMMUState *s, bool en)
     if (en) {
         s->iq = iqa_val & VTD_IQA_IQA_MASK(s->aw_bits);
         /* 2^(x+8) entries */
-        s->iq_size = 1UL << ((iqa_val & VTD_IQA_QS) + 8);
+        s->iq_size = 1UL << ((iqa_val & VTD_IQA_QS) + 8 - (s->iq_dw ? 1 : 0));
         s->qi_enabled = true;
         trace_vtd_inv_qi_setup(s->iq, s->iq_size);
         /* Ok - report back to driver */
@@ -2219,54 +2228,66 @@ static void vtd_handle_iotlb_write(IntelIOMMUState *s)
 }
 
 /* Fetch an Invalidation Descriptor from the Invalidation Queue */
-static bool vtd_get_inv_desc(dma_addr_t base_addr, uint32_t offset,
+static bool vtd_get_inv_desc(IntelIOMMUState *s,
                              VTDInvDesc *inv_desc)
 {
-    dma_addr_t addr = base_addr + offset * sizeof(*inv_desc);
-    if (dma_memory_read(&address_space_memory, addr, inv_desc,
-        sizeof(*inv_desc))) {
-        error_report_once("Read INV DESC failed");
-        inv_desc->lo = 0;
-        inv_desc->hi = 0;
+    dma_addr_t base_addr = s->iq;
+    uint32_t offset = s->iq_head;
+    uint32_t dw = vtd_get_inv_desc_width(s);
+    dma_addr_t addr = base_addr + offset * dw;
+
+    /* init */
+    inv_desc->val[0] = 0;
+    inv_desc->val[1] = 0;
+    inv_desc->val[2] = 0;
+    inv_desc->val[3] = 0;
+
+    if (dma_memory_read(&address_space_memory, addr, inv_desc, dw)) {
+        error_report_once("Read INV DESC failed.");
         return false;
     }
-    inv_desc->lo = le64_to_cpu(inv_desc->lo);
-    inv_desc->hi = le64_to_cpu(inv_desc->hi);
+    inv_desc->val[0] = le64_to_cpu(inv_desc->val[0]);
+    inv_desc->val[1] = le64_to_cpu(inv_desc->val[1]);
+    inv_desc->val[2] = le64_to_cpu(inv_desc->val[2]);
+    inv_desc->val[3] = le64_to_cpu(inv_desc->val[3]);
     return true;
 }
 
 static bool vtd_process_wait_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
 {
-    if ((inv_desc->hi & VTD_INV_DESC_WAIT_RSVD_HI) ||
-        (inv_desc->lo & VTD_INV_DESC_WAIT_RSVD_LO)) {
-        error_report_once("%s: invalid wait desc: hi=%"PRIx64", lo=%"PRIx64
-                          " (reserved nonzero)", __func__, inv_desc->hi,
-                          inv_desc->lo);
+    if ((inv_desc->val[1] & VTD_INV_DESC_WAIT_RSVD_HI) ||
+        (inv_desc->val[0] & VTD_INV_DESC_WAIT_RSVD_LO)) {
+        error_report_once("%s: invalid wait desc: val[1]=%"PRIx64
+                          ", val[0]=%"PRIx64
+                          " (reserved nonzero)", __func__, inv_desc->val[1],
+                          inv_desc->val[0]);
         return false;
     }
-    if (inv_desc->lo & VTD_INV_DESC_WAIT_SW) {
+    if (inv_desc->val[0] & VTD_INV_DESC_WAIT_SW) {
         /* Status Write */
-        uint32_t status_data = (uint32_t)(inv_desc->lo >>
+        uint32_t status_data = (uint32_t)(inv_desc->val[0] >>
                                VTD_INV_DESC_WAIT_DATA_SHIFT);
 
-        assert(!(inv_desc->lo & VTD_INV_DESC_WAIT_IF));
+        assert(!(inv_desc->val[0] & VTD_INV_DESC_WAIT_IF));
 
         /* FIXME: need to be masked with HAW? */
-        dma_addr_t status_addr = inv_desc->hi;
+        dma_addr_t status_addr = inv_desc->val[1];
         trace_vtd_inv_desc_wait_sw(status_addr, status_data);
         status_data = cpu_to_le32(status_data);
         if (dma_memory_write(&address_space_memory, status_addr, &status_data,
                              sizeof(status_data))) {
-            trace_vtd_inv_desc_wait_write_fail(inv_desc->hi, inv_desc->lo);
+            trace_vtd_inv_desc_wait_write_fail(inv_desc->val[1],
+                                               inv_desc->val[0]);
             return false;
         }
-    } else if (inv_desc->lo & VTD_INV_DESC_WAIT_IF) {
+    } else if (inv_desc->val[0] & VTD_INV_DESC_WAIT_IF) {
         /* Interrupt flag */
         vtd_generate_completion_event(s);
     } else {
-        error_report_once("%s: invalid wait desc: hi=%"PRIx64", lo=%"PRIx64
-                          " (unknown type)", __func__, inv_desc->hi,
-                          inv_desc->lo);
+        error_report_once("%s: invalid wait desc: val[1]=%"PRIx64
+                          ", val[0]=%"PRIx64
+                          " (unknown type)", __func__, inv_desc->val[1],
+                          inv_desc->val[0]);
         return false;
     }
     return true;
@@ -2277,31 +2298,33 @@ static bool vtd_process_context_cache_desc(IntelIOMMUState *s,
 {
     uint16_t sid, fmask;
 
-    if ((inv_desc->lo & VTD_INV_DESC_CC_RSVD) || inv_desc->hi) {
-        error_report_once("%s: invalid cc inv desc: hi=%"PRIx64", lo=%"PRIx64
-                          " (reserved nonzero)", __func__, inv_desc->hi,
-                          inv_desc->lo);
+    if ((inv_desc->val[0] & VTD_INV_DESC_CC_RSVD) || inv_desc->val[1]) {
+        error_report_once("%s: invalid cc inv desc: val[1]=%"PRIx64
+                          ", val[0]=%"PRIx64
+                          " (reserved nonzero)", __func__, inv_desc->val[1],
+                          inv_desc->val[0]);
         return false;
     }
-    switch (inv_desc->lo & VTD_INV_DESC_CC_G) {
+    switch (inv_desc->val[0] & VTD_INV_DESC_CC_G) {
     case VTD_INV_DESC_CC_DOMAIN:
         trace_vtd_inv_desc_cc_domain(
-            (uint16_t)VTD_INV_DESC_CC_DID(inv_desc->lo));
+            (uint16_t)VTD_INV_DESC_CC_DID(inv_desc->val[0]));
         /* Fall through */
     case VTD_INV_DESC_CC_GLOBAL:
         vtd_context_global_invalidate(s);
         break;
 
     case VTD_INV_DESC_CC_DEVICE:
-        sid = VTD_INV_DESC_CC_SID(inv_desc->lo);
-        fmask = VTD_INV_DESC_CC_FM(inv_desc->lo);
+        sid = VTD_INV_DESC_CC_SID(inv_desc->val[0]);
+        fmask = VTD_INV_DESC_CC_FM(inv_desc->val[0]);
         vtd_context_device_invalidate(s, sid, fmask);
         break;
 
     default:
-        error_report_once("%s: invalid cc inv desc: hi=%"PRIx64", lo=%"PRIx64
-                          " (invalid type)", __func__, inv_desc->hi,
-                          inv_desc->lo);
+        error_report_once("%s: invalid cc inv desc: val[1]=%"PRIx64
+                          ", val[0]=%"PRIx64
+                          " (invalid type)", __func__, inv_desc->val[1],
+                          inv_desc->val[0]);
         return false;
     }
     return true;
@@ -2313,32 +2336,32 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
     uint8_t am;
     hwaddr addr;
 
-    if ((inv_desc->lo & VTD_INV_DESC_IOTLB_RSVD_LO) ||
-        (inv_desc->hi & VTD_INV_DESC_IOTLB_RSVD_HI)) {
-        error_report_once("%s: invalid iotlb inv desc: hi=0x%"PRIx64
-                          ", lo=0x%"PRIx64" (reserved bits unzero)\n",
-                          __func__, inv_desc->hi, inv_desc->lo);
+    if ((inv_desc->val[0] & VTD_INV_DESC_IOTLB_RSVD_LO) ||
+        (inv_desc->val[1] & VTD_INV_DESC_IOTLB_RSVD_HI)) {
+        error_report_once("%s: invalid iotlb inv desc: val[1]=0x%"PRIx64
+                          ", val[0]=0x%"PRIx64" (reserved bits unzero)\n",
+                          __func__, inv_desc->val[1], inv_desc->val[0]);
         return false;
     }
 
-    switch (inv_desc->lo & VTD_INV_DESC_IOTLB_G) {
+    switch (inv_desc->val[0] & VTD_INV_DESC_IOTLB_G) {
     case VTD_INV_DESC_IOTLB_GLOBAL:
         vtd_iotlb_global_invalidate(s);
         break;
 
     case VTD_INV_DESC_IOTLB_DOMAIN:
-        domain_id = VTD_INV_DESC_IOTLB_DID(inv_desc->lo);
+        domain_id = VTD_INV_DESC_IOTLB_DID(inv_desc->val[0]);
         vtd_iotlb_domain_invalidate(s, domain_id);
         break;
 
     case VTD_INV_DESC_IOTLB_PAGE:
-        domain_id = VTD_INV_DESC_IOTLB_DID(inv_desc->lo);
-        addr = VTD_INV_DESC_IOTLB_ADDR(inv_desc->hi);
-        am = VTD_INV_DESC_IOTLB_AM(inv_desc->hi);
+        domain_id = VTD_INV_DESC_IOTLB_DID(inv_desc->val[0]);
+        addr = VTD_INV_DESC_IOTLB_ADDR(inv_desc->val[1]);
+        am = VTD_INV_DESC_IOTLB_AM(inv_desc->val[1]);
         if (am > VTD_MAMV) {
-            error_report_once("%s: invalid iotlb inv desc: hi=0x%"PRIx64
-                              ", lo=0x%"PRIx64" (am=%u > VTD_MAMV=%u)\n",
-                              __func__, inv_desc->hi, inv_desc->lo,
+            error_report_once("%s: invalid iotlb inv desc: val[1]=0x%"PRIx64
+                              ", val[0]=0x%"PRIx64" (am=%u > VTD_MAMV=%u)\n",
+                              __func__, inv_desc->val[1], inv_desc->val[0],
                               am, (unsigned)VTD_MAMV);
             return false;
         }
@@ -2346,10 +2369,10 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
         break;
 
     default:
-        error_report_once("%s: invalid iotlb inv desc: hi=0x%"PRIx64
-                          ", lo=0x%"PRIx64" (type mismatch: 0x%llx)\n",
-                          __func__, inv_desc->hi, inv_desc->lo,
-                          inv_desc->lo & VTD_INV_DESC_IOTLB_G);
+        error_report_once("%s: invalid iotlb inv desc: val[1]=0x%"PRIx64
+                          ", val[0]=0x%"PRIx64" (type mismatch: 0x%llx)\n",
+                          __func__, inv_desc->val[1], inv_desc->val[0],
+                          inv_desc->val[0] & VTD_INV_DESC_IOTLB_G);
         return false;
     }
     return true;
@@ -2381,17 +2404,17 @@ static bool vtd_process_device_iotlb_desc(IntelIOMMUState *s,
     bool size;
     uint8_t bus_num;
 
-    addr = VTD_INV_DESC_DEVICE_IOTLB_ADDR(inv_desc->hi);
-    sid = VTD_INV_DESC_DEVICE_IOTLB_SID(inv_desc->lo);
+    addr = VTD_INV_DESC_DEVICE_IOTLB_ADDR(inv_desc->val[1]);
+    sid = VTD_INV_DESC_DEVICE_IOTLB_SID(inv_desc->val[0]);
     devfn = sid & 0xff;
     bus_num = sid >> 8;
-    size = VTD_INV_DESC_DEVICE_IOTLB_SIZE(inv_desc->hi);
+    size = VTD_INV_DESC_DEVICE_IOTLB_SIZE(inv_desc->val[1]);
 
-    if ((inv_desc->lo & VTD_INV_DESC_DEVICE_IOTLB_RSVD_LO) ||
-        (inv_desc->hi & VTD_INV_DESC_DEVICE_IOTLB_RSVD_HI)) {
-        error_report_once("%s: invalid dev-iotlb inv desc: hi=%"PRIx64
-                          ", lo=%"PRIx64" (reserved nonzero)", __func__,
-                          inv_desc->hi, inv_desc->lo);
+    if ((inv_desc->val[0] & VTD_INV_DESC_DEVICE_IOTLB_RSVD_LO) ||
+        (inv_desc->val[1] & VTD_INV_DESC_DEVICE_IOTLB_RSVD_HI)) {
+        error_report_once("%s: invalid dev-iotlb inv desc: val[1]=%"PRIx64
+                          ", val[0]=%"PRIx64" (reserved nonzero)", __func__,
+                          inv_desc->val[1], inv_desc->val[0]);
         return false;
     }
 
@@ -2437,54 +2460,64 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
     uint8_t desc_type;
 
     trace_vtd_inv_qi_head(s->iq_head);
-    if (!vtd_get_inv_desc(s->iq, s->iq_head, &inv_desc)) {
+    if (!vtd_get_inv_desc(s, &inv_desc)) {
         s->iq_last_desc_type = VTD_INV_DESC_NONE;
         return false;
     }
-    desc_type = inv_desc.lo & VTD_INV_DESC_TYPE;
+    if (inv_desc.val[3] || inv_desc.val[2]) {
+        error_report_once("%s: invalid inv desc: val[3]=%"PRIx64
+                          ", val[2]=%"PRIx64
+                          " (reserved nonzero)", __func__,
+                          inv_desc.val[3],
+                          inv_desc.val[2]);
+        return false;
+    }
+
+    desc_type = inv_desc.val[0] & VTD_INV_DESC_TYPE;
     /* FIXME: should update at first or at last? */
     s->iq_last_desc_type = desc_type;
 
     switch (desc_type) {
     case VTD_INV_DESC_CC:
-        trace_vtd_inv_desc("context-cache", inv_desc.hi, inv_desc.lo);
+        trace_vtd_inv_desc("context-cache", inv_desc.val[1], inv_desc.val[0]);
         if (!vtd_process_context_cache_desc(s, &inv_desc)) {
             return false;
         }
         break;
 
     case VTD_INV_DESC_IOTLB:
-        trace_vtd_inv_desc("iotlb", inv_desc.hi, inv_desc.lo);
+        trace_vtd_inv_desc("iotlb", inv_desc.val[1], inv_desc.val[0]);
         if (!vtd_process_iotlb_desc(s, &inv_desc)) {
             return false;
         }
         break;
 
     case VTD_INV_DESC_WAIT:
-        trace_vtd_inv_desc("wait", inv_desc.hi, inv_desc.lo);
+        trace_vtd_inv_desc("wait", inv_desc.val[1], inv_desc.val[0]);
         if (!vtd_process_wait_desc(s, &inv_desc)) {
             return false;
         }
         break;
 
     case VTD_INV_DESC_IEC:
-        trace_vtd_inv_desc("iec", inv_desc.hi, inv_desc.lo);
+        trace_vtd_inv_desc("iec", inv_desc.val[1], inv_desc.val[0]);
         if (!vtd_process_inv_iec_desc(s, &inv_desc)) {
             return false;
         }
         break;
 
     case VTD_INV_DESC_DEVICE:
-        trace_vtd_inv_desc("device", inv_desc.hi, inv_desc.lo);
+        trace_vtd_inv_desc("device", inv_desc.val[1], inv_desc.val[0]);
         if (!vtd_process_device_iotlb_desc(s, &inv_desc)) {
             return false;
         }
         break;
 
     default:
-        error_report_once("%s: invalid inv desc: hi=%"PRIx64", lo=%"PRIx64
-                          " (unknown type)", __func__, inv_desc.hi,
-                          inv_desc.lo);
+        error_report_once("%s: invalid inv desc: val[1]=%"PRIx64
+                          ", val[0]=%"PRIx64
+                          " (unknown type)", __func__, inv_desc.val[1],
+                          inv_desc.val[0]);
         return false;
     }
     s->iq_head++;
@@ -2525,7 +2558,7 @@ static void vtd_handle_iqt_write(IntelIOMMUState *s)
 {
     uint64_t val = vtd_get_quad_raw(s, DMAR_IQT_REG);
 
-    s->iq_tail = VTD_IQT_QT(val);
+    s->iq_tail = VTD_IQT_QT(s->iq_dw, val);
     trace_vtd_inv_qi_tail(s->iq_tail);
 
     if (s->qi_enabled && !(vtd_get_long_raw(s, DMAR_FSTS_REG) & VTD_FSTS_IQE)) {
@@ -2794,6 +2827,11 @@ static void vtd_mem_write(void *opaque, hwaddr addr,
         } else {
             vtd_set_quad(s, addr, val);
         }
+        if (vtd_ecap_smts(s)) {
+            s->iq_dw = val & VTD_IQA_DW_MASK;
+        } else {
+            s->iq_dw = 0;
+        }
         break;
 
     case DMAR_IQA_REG_HI:
@@ -3577,7 +3615,7 @@ static void vtd_init(IntelIOMMUState *s)
 
     vtd_define_quad(s, DMAR_IQH_REG, 0, 0, 0);
     vtd_define_quad(s, DMAR_IQT_REG, 0, 0x7fff0ULL, 0);
-    vtd_define_quad(s, DMAR_IQA_REG, 0, 0xfffffffffffff007ULL, 0);
+    vtd_define_quad(s, DMAR_IQA_REG, 0, 0xfffffffffffff807ULL, 0);
     vtd_define_long(s, DMAR_ICS_REG, 0, 0, 0x1UL);
     vtd_define_long(s, DMAR_IECTL_REG, 0x80000000UL, 0x80000000UL, 0);
     vtd_define_long(s, DMAR_IEDATA_REG, 0, 0xffffffffUL, 0);
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 02674f9..2a753c5 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -190,6 +190,7 @@
 #define VTD_ECAP_EIM                (1ULL << 4)
 #define VTD_ECAP_PT                 (1ULL << 6)
 #define VTD_ECAP_MHMV               (15ULL << 20)
+#define VTD_ECAP_SMTS               (1ULL << 43)
 
 /* CAP_REG */
 /* (offset >> 4) << 24 */
@@ -218,11 +219,13 @@
 #define VTD_CAP_SAGAW_48bit         (0x4ULL << VTD_CAP_SAGAW_SHIFT)
 
 /* IQT_REG */
-#define VTD_IQT_QT(val)             (((val) >> 4) & 0x7fffULL)
+#define VTD_IQT_QT(dw_bit, val)     (dw_bit ? (((val) >> 5) & 0x3fffULL) : \
+                                     (((val) >> 4) & 0x7fffULL))
 
 /* IQA_REG */
 #define VTD_IQA_IQA_MASK(aw)        (VTD_HAW_MASK(aw) ^ 0xfffULL)
 #define VTD_IQA_QS                  0x7ULL
+#define VTD_IQA_DW_MASK             0x800
 
 /* IQH_REG */
 #define VTD_IQH_QH_SHIFT            4
@@ -321,8 +324,7 @@ typedef struct VTDInvDescIEC VTDInvDescIEC;
 /* Queued Invalidation Descriptor */
 union VTDInvDesc {
     struct {
-        uint64_t lo;
-        uint64_t hi;
+        uint64_t val[4];
     };
     union {
         VTDInvDescIEC iec;
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index ff13ff27..a5da139 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -230,6 +230,7 @@ struct IntelIOMMUState {
     uint16_t iq_tail;               /* Current invalidation queue tail */
     dma_addr_t iq;                  /* Current invalidation queue pointer */
     uint16_t iq_size;               /* IQ Size in number of entries */
+    uint16_t iq_dw;                 /* IQ descriptor width */
     bool qi_enabled;                /* Set if the QI is enabled */
     uint8_t iq_last_desc_type;      /* The type of last completed descriptor */
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Qemu-devel] [RFC v1 3/3] intel_iommu: add scalable-mode option to make scalable mode work
  2019-01-30  5:09 [Qemu-devel] [RFC v1 0/3] intel_iommu: support scalable mode Yi Sun
  2019-01-30  5:09 ` [Qemu-devel] [RFC v1 1/3] intel_iommu: scalable mode emulation Yi Sun
  2019-01-30  5:09 ` [Qemu-devel] [RFC v1 2/3] intel_iommu: add 256 bits qi_desc support Yi Sun
@ 2019-01-30  5:09 ` Yi Sun
  2019-02-12  6:46   ` Peter Xu
  2019-02-11 10:37 ` [Qemu-devel] [RFC v1 0/3] intel_iommu: support scalable mode Peter Xu
  3 siblings, 1 reply; 26+ messages in thread
From: Yi Sun @ 2019-01-30  5:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, rth, ehabkost, mst, marcel.apfelbaum, peterx,
	kevin.tian, yi.l.liu, yi.y.sun, Yi Sun

This patch adds an option that gives users the flexibility to expose
Scalable Mode to the guest. Scalable Mode can be exposed to the guest
with the configuration below:

"-device intel-iommu,caching-mode=on,scalable-mode=on"

The Linux iommu driver already supports scalable mode. Please refer to
the patch set below:

    https://www.spinics.net/lists/kernel/msg2985279.html
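
For reference, a complete invocation could look like the sketch below
(the machine type and accelerator options are only illustrative):

    qemu-system-x86_64 -machine q35,accel=kvm \
        -device intel-iommu,caching-mode=on,scalable-mode=on ...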

Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
---
 hw/i386/intel_iommu.c          | 22 ++++++++++++++++++++++
 hw/i386/intel_iommu_internal.h |  6 ++++++
 include/hw/i386/intel_iommu.h  |  3 ++-
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 3664a00..447fdf3 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2492,6 +2492,17 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
         }
         break;
 
+    /*
+     * TODO: the handling of the two cases below will be added in a future
+     * series. To make a guest whose iommu driver includes the scalable mode
+     * support patch set work, returning true is enough for now.
+     */
+    case VTD_INV_DESC_PC:
+        break;
+
+    case VTD_INV_DESC_PIOTLB:
+        break;
+
     case VTD_INV_DESC_WAIT:
         trace_vtd_inv_desc("wait", inv_desc.val[1], inv_desc.val[0]);
         if (!vtd_process_wait_desc(s, &inv_desc)) {
@@ -3051,6 +3062,7 @@ static Property vtd_properties[] = {
     DEFINE_PROP_UINT8("aw-bits", IntelIOMMUState, aw_bits,
                       VTD_HOST_ADDRESS_WIDTH),
     DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
+    DEFINE_PROP_BOOL("scalable-mode", IntelIOMMUState, scalable_mode, FALSE),
     DEFINE_PROP_BOOL("dma-drain", IntelIOMMUState, dma_drain, true),
     DEFINE_PROP_END_OF_LIST(),
 };
@@ -3583,6 +3595,16 @@ static void vtd_init(IntelIOMMUState *s)
         s->cap |= VTD_CAP_CM;
     }
 
+    /* TODO: read cap/ecap from the host to decide which caps to expose. */
+    if (s->scalable_mode) {
+        if (!s->caching_mode) {
+            error_report("Need to set caching-mode for scalable mode");
+            exit(1);
+        }
+        s->cap |= VTD_CAP_DWD | VTD_CAP_DRD;
+        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
+    }
+
     vtd_reset_caches(s);
 
     /* Define registers with default values and bit semantics */
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 2a753c5..b01953a 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -190,7 +190,9 @@
 #define VTD_ECAP_EIM                (1ULL << 4)
 #define VTD_ECAP_PT                 (1ULL << 6)
 #define VTD_ECAP_MHMV               (15ULL << 20)
+#define VTD_ECAP_SRS                (1ULL << 31)
 #define VTD_ECAP_SMTS               (1ULL << 43)
+#define VTD_ECAP_SLTS               (1ULL << 46)
 
 /* CAP_REG */
 /* (offset >> 4) << 24 */
@@ -209,6 +211,8 @@
 #define VTD_CAP_DRAIN_READ          (1ULL << 55)
 #define VTD_CAP_DRAIN               (VTD_CAP_DRAIN_READ | VTD_CAP_DRAIN_WRITE)
 #define VTD_CAP_CM                  (1ULL << 7)
+#define VTD_CAP_DWD                 (1ULL << 54)
+#define VTD_CAP_DRD                 (1ULL << 55)
 
 /* Supported Adjusted Guest Address Widths */
 #define VTD_CAP_SAGAW_SHIFT         8
@@ -340,6 +344,8 @@ typedef union VTDInvDesc VTDInvDesc;
 #define VTD_INV_DESC_IEC                0x4 /* Interrupt Entry Cache
                                                Invalidate Descriptor */
 #define VTD_INV_DESC_WAIT               0x5 /* Invalidation Wait Descriptor */
+#define VTD_INV_DESC_PIOTLB             0x6 /* PASID-IOTLB Invalidate Desc */
+#define VTD_INV_DESC_PC                 0x7 /* PASID-cache Invalidate Desc */
 #define VTD_INV_DESC_NONE               0   /* Not an Invalidate Descriptor */
 
 /* Masks for Invalidation Wait Descriptor*/
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index a5da139..a04fad6 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -219,7 +219,8 @@ struct IntelIOMMUState {
     uint8_t womask[DMAR_REG_SIZE];  /* WO (write only - read returns 0) */
     uint32_t version;
 
-    bool caching_mode;          /* RO - is cap CM enabled? */
+    bool caching_mode;              /* RO - is cap CM enabled? */
+    bool scalable_mode;             /* RO - is Scalable Mode supported? */
 
     dma_addr_t root;                /* Current root table pointer */
     bool root_extended;             /* Type of root table (extended or not) */
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [RFC v1 1/3] intel_iommu: scalable mode emulation
  2019-01-30  5:09 ` [Qemu-devel] [RFC v1 1/3] intel_iommu: scalable mode emulation Yi Sun
@ 2019-02-11 10:12   ` Peter Xu
  2019-02-13  7:38     ` Yi Sun
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Xu @ 2019-02-11 10:12 UTC (permalink / raw)
  To: Yi Sun
  Cc: qemu-devel, pbonzini, rth, ehabkost, mst, marcel.apfelbaum,
	kevin.tian, yi.l.liu, yi.y.sun

On Wed, Jan 30, 2019 at 01:09:11PM +0800, Yi Sun wrote:
> From: "Liu, Yi L" <yi.l.liu@intel.com>
> 
> Intel(R) VT-d 3.0 spec introduces scalable mode address translation to
> replace extended context mode. This patch extends current emulator to
> support Scalable Mode which includes root table, context table and new
> pasid table format change. Now intel_iommu emulates both legacy mode
> and scalable mode (with legacy-equivalent capability set).
> 
> The key points are below:
> 1. Extend root table operations to support both legacy mode and scalable
>    mode.
> 2. Extend context table operations to support both legacy mode and
>    scalable mode.
> 3. Add pasid tabled operations to support scalable mode.

(this patch looks generally good to me, but I've got some trivial
 comments below...)

> 
> [Yi Sun is co-developer to contribute much to refine the whole commit.]
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>

I think your signed-off-by should be the latter one since you are the
one who processed the patch last (and who posted it).

> ---
>  hw/i386/intel_iommu.c          | 528 ++++++++++++++++++++++++++++++++++-------
>  hw/i386/intel_iommu_internal.h |  43 +++-
>  hw/i386/trace-events           |   2 +-
>  include/hw/i386/intel_iommu.h  |  16 +-
>  4 files changed, 498 insertions(+), 91 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 8b72735..396ac8e 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -37,6 +37,34 @@
>  #include "kvm_i386.h"
>  #include "trace.h"
>  
> +#define vtd_devfn_check(devfn) ((devfn & VTD_DEVFN_CHECK_MASK) ? true : false)

"vtd_devfn_check(devfn)" is merely as long as "devfn &
VTD_DEVFN_CHECK_MASK", isn't it? :)

I would just drop the macro.

> +
> +/* context entry operations */
> +#define vtd_get_ce_size(s, ce) \
> +    (((s)->root_scalable) ? \
> +     VTD_CTX_ENTRY_SM_SIZE : VTD_CTX_ENTRY_LECY_SIZE)

"ce" is not used.  Also, if a macro is only used once, I'd just embed
it in the function.  This one is only used in
vtd_get_context_entry_from_root().
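
I.e. just open-code it there, something like (a sketch using the size
macros from this patch):

    ce_size = s->root_scalable ? VTD_CTX_ENTRY_SM_SIZE :
                                 VTD_CTX_ENTRY_LECY_SIZE;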

> +#define vtd_ce_get_domain_id(ce) VTD_CONTEXT_ENTRY_DID((ce)->val[1])

Is this correct for scalable mode?  Section 9.4, Figure 9-34, it says
ce->val[1] has RID_PASID in bits 64-83 rather than domain ID.

> +#define vtd_ce_get_rid2pasid(ce) \
> +    ((ce)->val[1] & VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK)
> +#define vtd_ce_get_pasid_dir_table(ce) \
> +    ((ce)->val[0] & VTD_PASID_DIR_BASE_ADDR_MASK)
> +
> +/* pasid operations */
> +#define vtd_pdire_get_pasidt_base(pdire) \
> +    ((pdire)->val & VTD_PASID_TABLE_BASE_ADDR_MASK)
> +#define vtd_get_pasid_dir_entry_size() VTD_PASID_DIR_ENTRY_SIZE
> +#define vtd_get_pasid_entry_size() VTD_PASID_ENTRY_SIZE
> +#define vtd_get_pasid_dir_index(pasid) VTD_PASID_DIR_INDEX(pasid)
> +#define vtd_get_pasid_table_index(pasid) VTD_PASID_TABLE_INDEX(pasid)

These macros seem useless.  Please use the existing ones, they are
good enough AFAICT.  Also, please use capital letters for macro
definitions so that the format matches the existing code.  The
capitalization issue applies to the whole series, please adjust
accordingly.  I'll stop commenting on the macros here...

> +
> +/* pe operations */
> +#define vtd_pe_get_type(pe) ((pe)->val[0] & VTD_SM_PASID_ENTRY_PGTT)
> +#define vtd_pe_get_level(pe) (2 + (((pe)->val[0] >> 2) & VTD_SM_PASID_ENTRY_AW))
> +#define vtd_pe_get_agaw(pe) \
> +    (30 + (((pe)->val[0] >> 2) & VTD_SM_PASID_ENTRY_AW) * 9)
> +#define vtd_pe_get_slpt_base(pe) ((pe)->val[0] & VTD_SM_PASID_ENTRY_SLPTPTR)
> +#define vtd_pe_get_domain_id(pe) VTD_SM_PASID_ENTRY_DID((pe)->val[1])
> +
>  static void vtd_address_space_refresh_all(IntelIOMMUState *s);
>  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
>  
> @@ -512,9 +540,15 @@ static void vtd_generate_completion_event(IntelIOMMUState *s)
>      }
>  }
>  
> -static inline bool vtd_root_entry_present(VTDRootEntry *root)
> +static inline bool vtd_root_entry_present(IntelIOMMUState *s,
> +                                          VTDRootEntry *re,
> +                                          uint8_t devfn)
>  {
> -    return root->val & VTD_ROOT_ENTRY_P;
> +    if (s->root_scalable && vtd_devfn_check(devfn)) {
> +        return re->hi & VTD_ROOT_ENTRY_P;
> +    }
> +
> +    return re->lo & VTD_ROOT_ENTRY_P;
>  }
>  
>  static int vtd_get_root_entry(IntelIOMMUState *s, uint8_t index,
> @@ -524,36 +558,64 @@ static int vtd_get_root_entry(IntelIOMMUState *s, uint8_t index,
>  
>      addr = s->root + index * sizeof(*re);
>      if (dma_memory_read(&address_space_memory, addr, re, sizeof(*re))) {
> -        re->val = 0;
> +        re->lo = 0;
>          return -VTD_FR_ROOT_TABLE_INV;
>      }
> -    re->val = le64_to_cpu(re->val);
> +    re->lo = le64_to_cpu(re->lo);
> +    if (s->root_scalable) {
> +        re->hi = le64_to_cpu(re->hi);

Maybe simply make it unconditional - legacy mode has re->hi defined
too, though they are all zeros.
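
I.e. simply (sketch):

    re->lo = le64_to_cpu(re->lo);
    re->hi = le64_to_cpu(re->hi);
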

> +    }
>      return 0;
>  }
>  
> -static inline bool vtd_ce_present(VTDContextEntry *context)
> +static inline bool vtd_ce_present(VTDContextEntry *ce)
> +{
> +    return ce->val[0] & VTD_CONTEXT_ENTRY_P;
> +}
> +
> +static inline dma_addr_t vtd_get_context_base(IntelIOMMUState *s,
> +                                              VTDRootEntry *re,
> +                                              uint8_t *index)
>  {
> -    return context->lo & VTD_CONTEXT_ENTRY_P;
> +    if (s->root_scalable && vtd_devfn_check(*index)) {
> +        *index = *index & (~VTD_DEVFN_CHECK_MASK);

Operating on *index is a bit tricky... if this function is only used
once in vtd_get_context_entry_from_root() then how about squashing it
in there?

> +        return re->hi & VTD_ROOT_ENTRY_CTP;
> +    }
> +
> +    return re->lo & VTD_ROOT_ENTRY_CTP;
>  }
>  
> -static int vtd_get_context_entry_from_root(VTDRootEntry *root, uint8_t index,
> +static void vtd_context_entry_format(IntelIOMMUState *s,
> +                                     VTDContextEntry *ce)
> +{
> +    ce->val[0] = le64_to_cpu(ce->val[0]);
> +    ce->val[1] = le64_to_cpu(ce->val[1]);
> +    if (s->root_scalable) {
> +        ce->val[2] = le64_to_cpu(ce->val[2]);
> +        ce->val[3] = le64_to_cpu(ce->val[3]);
> +    }
> +}

Only used once.  Squash it into the caller?  On its own it does not make much sense.

> +
> +static int vtd_get_context_entry_from_root(IntelIOMMUState *s,
> +                                           VTDRootEntry *re,
> +                                           uint8_t index,
>                                             VTDContextEntry *ce)
>  {
> -    dma_addr_t addr;
> +    dma_addr_t addr, ce_size;
>  
>      /* we have checked that root entry is present */
> -    addr = (root->val & VTD_ROOT_ENTRY_CTP) + index * sizeof(*ce);
> -    if (dma_memory_read(&address_space_memory, addr, ce, sizeof(*ce))) {
> +    ce_size = vtd_get_ce_size(s, ce);
> +    addr = vtd_get_context_base(s, re, &index) + index * ce_size;
> +    if (dma_memory_read(&address_space_memory, addr, ce, ce_size)) {
>          return -VTD_FR_CONTEXT_TABLE_INV;
>      }
> -    ce->lo = le64_to_cpu(ce->lo);
> -    ce->hi = le64_to_cpu(ce->hi);
> +    vtd_context_entry_format(s, ce);
>      return 0;
>  }
>  
>  static inline dma_addr_t vtd_ce_get_slpt_base(VTDContextEntry *ce)
>  {
> -    return ce->lo & VTD_CONTEXT_ENTRY_SLPTPTR;
> +    return ce->val[0] & VTD_CONTEXT_ENTRY_SLPTPTR;
>  }
>  
>  static inline uint64_t vtd_get_slpte_addr(uint64_t slpte, uint8_t aw)
> @@ -600,28 +662,203 @@ static inline bool vtd_is_level_supported(IntelIOMMUState *s, uint32_t level)
>             (1ULL << (level - 2 + VTD_CAP_SAGAW_SHIFT));
>  }
>  
> +/* Return true if check passed, otherwise false */
> +static inline bool vtd_pe_type_check(X86IOMMUState *x86_iommu,
> +                                     VTDPASIDEntry *pe)
> +{
> +    switch (vtd_pe_get_type(pe)) {
> +    case VTD_SM_PASID_ENTRY_FLT:
> +    case VTD_SM_PASID_ENTRY_SLT:
> +    case VTD_SM_PASID_ENTRY_NESTED:
> +        break;
> +    case VTD_SM_PASID_ENTRY_PT:
> +        if (!x86_iommu->pt_supported) {
> +            return false;
> +        }
> +        break;
> +    default:
> +        /* Unknwon type */
> +        return false;
> +    }
> +    return true;
> +}
> +
> +static inline int vtd_get_pasid_dire(dma_addr_t pasid_dir_base,
> +                                     uint32_t pasid,
> +                                     VTDPASIDDirEntry *pdire)
> +{
> +    uint32_t index;
> +    dma_addr_t addr, entry_size;
> +
> +    index = vtd_get_pasid_dir_index(pasid);
> +    entry_size = vtd_get_pasid_dir_entry_size();
> +    addr = pasid_dir_base + index * entry_size;
> +    if (dma_memory_read(&address_space_memory, addr, pdire, entry_size)) {
> +        return -VTD_FR_PASID_TABLE_INV;
> +    }
> +
> +    return 0;
> +}
> +
> +static inline int vtd_get_pasid_entry(IntelIOMMUState *s,
> +                                      uint32_t pasid,
> +                                      VTDPASIDDirEntry *pdire,
> +                                      VTDPASIDEntry *pe)
> +{
> +    uint32_t index;
> +    dma_addr_t addr, entry_size;
> +    X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> +
> +    index = vtd_get_pasid_table_index(pasid);
> +    entry_size = vtd_get_pasid_entry_size();
> +    addr = vtd_pdire_get_pasidt_base(pdire);
> +    addr = addr + index * entry_size;
> +    if (dma_memory_read(&address_space_memory, addr, pe, entry_size)) {
> +        memset(pe->val, 0, sizeof(pe->val));

No need (like all the rest of the places)?

> +        return -VTD_FR_PASID_TABLE_INV;
> +    }
> +
> +    /* Do translation type check */
> +    if (!vtd_pe_type_check(x86_iommu, pe)) {
> +        return -VTD_FR_PASID_TABLE_INV;
> +    }
> +
> +    if (!vtd_is_level_supported(s, vtd_pe_get_level(pe))) {
> +        return -VTD_FR_PASID_TABLE_INV;
> +    }
> +
> +    return 0;
> +}
> +
> +static int vtd_get_pasid_entry_from_pasid(IntelIOMMUState *s,
> +                                          dma_addr_t pasid_dir_base,
> +                                          uint32_t pasid,
> +                                          VTDPASIDEntry *pe)
> +{
> +    int ret;
> +    VTDPASIDDirEntry pdire;
> +
> +    ret = vtd_get_pasid_dire(pasid_dir_base, pasid, &pdire);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vtd_get_pasid_entry(s, pasid, &pdire, pe);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return ret;
> +}
> +
> +static inline int vtd_ce_get_rid2pasid_entry(IntelIOMMUState *s,
> +                                             VTDContextEntry *ce,
> +                                             VTDPASIDEntry *pe)
> +{
> +    uint32_t pasid;
> +    dma_addr_t pasid_dir_base;
> +    int ret = 0;
> +
> +    pasid = vtd_ce_get_rid2pasid(ce);
> +    pasid_dir_base = vtd_ce_get_pasid_dir_table(ce);
> +    ret = vtd_get_pasid_entry_from_pasid(s, pasid_dir_base, pasid, pe);
> +
> +    return ret;
> +}
> +
> +static inline int vtd_ce_get_pasid_fpd(IntelIOMMUState *s,
> +                                       VTDContextEntry *ce,
> +                                       bool *pe_fpd_set)
> +{
> +    int ret;
> +    uint32_t pasid;
> +    dma_addr_t pasid_dir_base;
> +    VTDPASIDDirEntry pdire;
> +    VTDPASIDEntry pe;
> +
> +    pasid = vtd_ce_get_rid2pasid(ce);
> +    pasid_dir_base = vtd_ce_get_pasid_dir_table(ce);
> +
> +    ret = vtd_get_pasid_dire(pasid_dir_base, pasid, &pdire);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    if (pdire.val & VTD_PASID_DIR_FPD) {
> +        *pe_fpd_set = true;
> +        return 0;
> +    }
> +
> +    ret = vtd_get_pasid_entry(s, pasid, &pdire, &pe);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    if (pe.val[0] & VTD_PASID_ENTRY_FPD) {
> +        *pe_fpd_set = true;
> +    }
> +
> +    return 0;
> +}
> +
>  /* Get the page-table level that hardware should use for the second-level
>   * page-table walk from the Address Width field of context-entry.
>   */
>  static inline uint32_t vtd_ce_get_level(VTDContextEntry *ce)
>  {
> -    return 2 + (ce->hi & VTD_CONTEXT_ENTRY_AW);
> +    return 2 + (ce->val[1] & VTD_CONTEXT_ENTRY_AW);
> +}
> +
> +static inline uint32_t vtd_get_iova_level(IntelIOMMUState *s,
> +                                          VTDContextEntry *ce)
> +{
> +    VTDPASIDEntry pe;
> +
> +    if (s->root_scalable) {
> +        vtd_ce_get_rid2pasid_entry(s, ce, &pe);
> +        return vtd_pe_get_level(&pe);
> +    }
> +
> +    return vtd_ce_get_level(ce);
>  }
>  
>  static inline uint32_t vtd_ce_get_agaw(VTDContextEntry *ce)
>  {
> -    return 30 + (ce->hi & VTD_CONTEXT_ENTRY_AW) * 9;
> +    return 30 + (ce->val[1] & VTD_CONTEXT_ENTRY_AW) * 9;
> +}
> +
> +static inline uint32_t vtd_get_iova_agaw(IntelIOMMUState *s,
> +                                         VTDContextEntry *ce)
> +{
> +    VTDPASIDEntry pe;
> +
> +    if (s->root_scalable) {
> +        vtd_ce_get_rid2pasid_entry(s, ce, &pe);
> +        return vtd_pe_get_agaw(&pe);
> +    }
> +
> +    return vtd_ce_get_agaw(ce);
>  }
>  
>  static inline uint32_t vtd_ce_get_type(VTDContextEntry *ce)
>  {
> -    return ce->lo & VTD_CONTEXT_ENTRY_TT;
> +    return ce->val[0] & VTD_CONTEXT_ENTRY_TT;
>  }
>  
>  /* Return true if check passed, otherwise false */
> -static inline bool vtd_ce_type_check(X86IOMMUState *x86_iommu,
> +static inline bool vtd_ce_type_check(IntelIOMMUState *s,
> +                                     X86IOMMUState *x86_iommu,
>                                       VTDContextEntry *ce)

No need to pass it again.  Simply:

    X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);

Or use INTEL_IOMMU_DEVICE() for the reverse.

>  {
> +    if (s->root_scalable) {
> +        /*
> +         * Translation Type locates in context entry only when VTD is in
> +         * legacy mode. For scalable mode, need to return true to avoid
> +         * unnecessary fault.
> +         */
> +        return true;
> +    }
> +
>      switch (vtd_ce_get_type(ce)) {
>      case VTD_CONTEXT_TT_MULTI_LEVEL:
>          /* Always supported */
> @@ -639,7 +876,7 @@ static inline bool vtd_ce_type_check(X86IOMMUState *x86_iommu,
>          }
>          break;
>      default:
> -        /* Unknwon type */
> +        /* Unknown type */
>          error_report_once("%s: unknown ce type: %"PRIu32, __func__,
>                            vtd_ce_get_type(ce));
>          return false;
> @@ -647,21 +884,36 @@ static inline bool vtd_ce_type_check(X86IOMMUState *x86_iommu,
>      return true;
>  }
>  
> -static inline uint64_t vtd_iova_limit(VTDContextEntry *ce, uint8_t aw)
> +static inline uint64_t vtd_iova_limit(IntelIOMMUState *s,
> +                                      VTDContextEntry *ce, uint8_t aw)
>  {
> -    uint32_t ce_agaw = vtd_ce_get_agaw(ce);
> +    uint32_t ce_agaw = vtd_get_iova_agaw(s, ce);
>      return 1ULL << MIN(ce_agaw, aw);
>  }
>  
>  /* Return true if IOVA passes range check, otherwise false. */
> -static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce,
> +static inline bool vtd_iova_range_check(IntelIOMMUState *s,
> +                                        uint64_t iova, VTDContextEntry *ce,
>                                          uint8_t aw)
>  {
>      /*
>       * Check if @iova is above 2^X-1, where X is the minimum of MGAW
>       * in CAP_REG and AW in context-entry.
>       */
> -    return !(iova & ~(vtd_iova_limit(ce, aw) - 1));
> +    return !(iova & ~(vtd_iova_limit(s, ce, aw) - 1));
> +}
> +
> +static inline dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
> +                                                 VTDContextEntry *ce)
> +{
> +    VTDPASIDEntry pe;
> +
> +    if (s->root_scalable) {
> +        vtd_ce_get_rid2pasid_entry(s, ce, &pe);
> +        return vtd_pe_get_slpt_base(&pe);
> +    }
> +
> +    return vtd_ce_get_slpt_base(ce);
>  }
>  
>  /*
> @@ -707,17 +959,18 @@ static VTDBus *vtd_find_as_from_bus_num(IntelIOMMUState *s, uint8_t bus_num)
>  /* Given the @iova, get relevant @slptep. @slpte_level will be the last level
>   * of the translation, can be used for deciding the size of large page.
>   */
> -static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
> +static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
> +                             uint64_t iova, bool is_write,
>                               uint64_t *slptep, uint32_t *slpte_level,
>                               bool *reads, bool *writes, uint8_t aw_bits)
>  {
> -    dma_addr_t addr = vtd_ce_get_slpt_base(ce);
> -    uint32_t level = vtd_ce_get_level(ce);
> +    dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce);
> +    uint32_t level = vtd_get_iova_level(s, ce);
>      uint32_t offset;
>      uint64_t slpte;
>      uint64_t access_right_check;
>  
> -    if (!vtd_iova_range_check(iova, ce, aw_bits)) {
> +    if (!vtd_iova_range_check(s, iova, ce, aw_bits)) {
>          error_report_once("%s: detected IOVA overflow (iova=0x%" PRIx64 ")",
>                            __func__, iova);
>          return -VTD_FR_ADDR_BEYOND_MGAW;
> @@ -733,7 +986,7 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
>          if (slpte == (uint64_t)-1) {
>              error_report_once("%s: detected read error on DMAR slpte "
>                                "(iova=0x%" PRIx64 ")", __func__, iova);
> -            if (level == vtd_ce_get_level(ce)) {
> +            if (level == vtd_get_iova_level(s, ce)) {
>                  /* Invalid programming of context-entry */
>                  return -VTD_FR_CONTEXT_ENTRY_INV;
>              } else {
> @@ -962,29 +1215,96 @@ next:
>  /**
>   * vtd_page_walk - walk specific IOVA range, and call the hook
>   *
> + * @s: intel iommu state
>   * @ce: context entry to walk upon
>   * @start: IOVA address to start the walk
>   * @end: IOVA range end address (start <= addr < end)
>   * @info: page walking information struct
>   */
> -static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
> +static int vtd_page_walk(IntelIOMMUState *s, VTDContextEntry *ce,
> +                         uint64_t start, uint64_t end,
>                           vtd_page_walk_info *info)
>  {
> -    dma_addr_t addr = vtd_ce_get_slpt_base(ce);
> -    uint32_t level = vtd_ce_get_level(ce);
> +    dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce);
> +    uint32_t level = vtd_get_iova_level(s, ce);
>  
> -    if (!vtd_iova_range_check(start, ce, info->aw)) {
> +    if (!vtd_iova_range_check(s, start, ce, info->aw)) {
>          return -VTD_FR_ADDR_BEYOND_MGAW;
>      }
>  
> -    if (!vtd_iova_range_check(end, ce, info->aw)) {
> +    if (!vtd_iova_range_check(s, end, ce, info->aw)) {
>          /* Fix end so that it reaches the maximum */
> -        end = vtd_iova_limit(ce, info->aw);
> +        end = vtd_iova_limit(s, ce, info->aw);
>      }
>  
>      return vtd_page_walk_level(addr, start, end, level, true, true, info);
>  }
>  
> +static int vtd_root_entry_rsvd_bits_check(IntelIOMMUState *s,
> +                                          VTDRootEntry *re)
> +{
> +    /* Legacy Mode reserved bits check */
> +    if (!s->root_scalable &&
> +        (re->hi || (re->lo & VTD_ROOT_ENTRY_RSVD(s->aw_bits))))
> +        goto rsvd_err;
> +
> +    /* Scalable Mode reserved bits check */
> +    if (s->root_scalable &&
> +        ((re->lo & VTD_ROOT_ENTRY_RSVD(s->aw_bits)) ||
> +         (re->hi & VTD_ROOT_ENTRY_RSVD(s->aw_bits))))
> +        goto rsvd_err;
> +
> +    return 0;
> +
> +rsvd_err:
> +    error_report_once("%s: invalid root entry: hi=0x%"PRIx64
> +                      ", lo=0x%"PRIx64,
> +                      __func__, re->hi, re->lo);
> +    return -VTD_FR_ROOT_ENTRY_RSVD;
> +}
> +
> +static inline int vtd_context_entry_rsvd_bits_check(IntelIOMMUState *s,
> +                                                    VTDContextEntry *ce)
> +{
> +    if (!s->root_scalable &&
> +        (ce->val[1] & VTD_CONTEXT_ENTRY_RSVD_HI ||
> +         ce->val[0] & VTD_CONTEXT_ENTRY_RSVD_LO(s->aw_bits))) {
> +        error_report_once("%s: invalid context entry: hi=%"PRIx64
> +                          ", lo=%"PRIx64" (reserved nonzero)",
> +                          __func__, ce->val[1], ce->val[0]);
> +        return -VTD_FR_CONTEXT_ENTRY_RSVD;
> +    }
> +
> +    if (s->root_scalable &&
> +        (ce->val[0] & VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(s->aw_bits) ||
> +         ce->val[1] & VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 ||
> +         ce->val[2] ||
> +         ce->val[3])) {
> +        error_report_once("%s: invalid context entry: val[3]=%"PRIx64
> +                          ", val[2]=%"PRIx64
> +                          ", val[1]=%"PRIx64
> +                          ", val[0]=%"PRIx64" (reserved nonzero)",
> +                          __func__, ce->val[3], ce->val[2],
> +                          ce->val[1], ce->val[0]);
> +        return -VTD_FR_CONTEXT_ENTRY_RSVD;
> +    }
> +
> +    return 0;
> +}
> +
> +static inline int vtd_ce_rid2pasid_check(IntelIOMMUState *s,
> +                                         VTDContextEntry *ce)
> +{
> +    VTDPASIDEntry pe;
> +
> +    /*
> +     * Make sure in Scalable Mode, a present context entry
> +     * has valid rid2pasid setting, which includes valid
> +     * rid2pasid field and corresponding pasid entry setting
> +     */
> +    return vtd_ce_get_rid2pasid_entry(s, ce, &pe);
> +}
> +
>  /* Map a device to its corresponding domain (context-entry) */
>  static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>                                      uint8_t devfn, VTDContextEntry *ce)
> @@ -998,20 +1318,18 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>          return ret_fr;
>      }
>  
> -    if (!vtd_root_entry_present(&re)) {
> +    if (!vtd_root_entry_present(s, &re, devfn)) {
>          /* Not error - it's okay we don't have root entry. */
>          trace_vtd_re_not_present(bus_num);
>          return -VTD_FR_ROOT_ENTRY_P;
>      }
>  
> -    if (re.rsvd || (re.val & VTD_ROOT_ENTRY_RSVD(s->aw_bits))) {
> -        error_report_once("%s: invalid root entry: rsvd=0x%"PRIx64
> -                          ", val=0x%"PRIx64" (reserved nonzero)",
> -                          __func__, re.rsvd, re.val);
> -        return -VTD_FR_ROOT_ENTRY_RSVD;
> +    ret_fr = vtd_root_entry_rsvd_bits_check(s, &re);
> +    if (ret_fr) {
> +        return ret_fr;
>      }
>  
> -    ret_fr = vtd_get_context_entry_from_root(&re, devfn, ce);
> +    ret_fr = vtd_get_context_entry_from_root(s, &re, devfn, ce);
>      if (ret_fr) {
>          return ret_fr;
>      }
> @@ -1022,28 +1340,40 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>          return -VTD_FR_CONTEXT_ENTRY_P;
>      }
>  
> -    if ((ce->hi & VTD_CONTEXT_ENTRY_RSVD_HI) ||
> -               (ce->lo & VTD_CONTEXT_ENTRY_RSVD_LO(s->aw_bits))) {
> -        error_report_once("%s: invalid context entry: hi=%"PRIx64
> -                          ", lo=%"PRIx64" (reserved nonzero)",
> -                          __func__, ce->hi, ce->lo);
> -        return -VTD_FR_CONTEXT_ENTRY_RSVD;
> +    ret_fr = vtd_context_entry_rsvd_bits_check(s, ce);
> +    if (ret_fr) {
> +        return ret_fr;
>      }
>  
>      /* Check if the programming of context-entry is valid */
> -    if (!vtd_is_level_supported(s, vtd_ce_get_level(ce))) {
> +    if (!s->root_scalable &&
> +        !vtd_is_level_supported(s, vtd_ce_get_level(ce))) {
>          error_report_once("%s: invalid context entry: hi=%"PRIx64
>                            ", lo=%"PRIx64" (level %d not supported)",
> -                          __func__, ce->hi, ce->lo, vtd_ce_get_level(ce));
> +                          __func__, ce->val[1], ce->val[0],
> +                          vtd_ce_get_level(ce));
>          return -VTD_FR_CONTEXT_ENTRY_INV;
>      }
>  
>      /* Do translation type check */
> -    if (!vtd_ce_type_check(x86_iommu, ce)) {
> +    if (!vtd_ce_type_check(s, x86_iommu, ce)) {
>          /* Errors dumped in vtd_ce_type_check() */
>          return -VTD_FR_CONTEXT_ENTRY_INV;
>      }
>  
> +    /*
> +     * Check if the programming of context-entry.rid2pasid
> +     * and corresponding pasid setting is valid, and thus
> +     * avoids to check pasid entry fetching result in future
> +     * helper function calling.
> +     */
> +    if (s->root_scalable) {
> +        ret_fr = vtd_ce_rid2pasid_check(s, ce);
> +        if (ret_fr) {
> +            return ret_fr;
> +        }
> +    }
> +
>      return 0;
>  }
>  
> @@ -1065,10 +1395,10 @@ static int vtd_sync_shadow_page_table_range(VTDAddressSpace *vtd_as,
>          .notify_unmap = true,
>          .aw = s->aw_bits,
>          .as = vtd_as,
> -        .domain_id = VTD_CONTEXT_ENTRY_DID(ce->hi),
> +        .domain_id = VTD_CONTEXT_ENTRY_DID(ce->val[1]),

So here for scalable mode the domain ID will be in the pasid table
entries rather than the context entries, so probably more changes are
required.
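
E.g. (a sketch, assuming the vtd_get_domain_id() helper added later in
this patch is made available - moved or forward-declared - before this
function):

    .domain_id = vtd_get_domain_id(s, ce),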

>      };
>  
> -    return vtd_page_walk(ce, addr, addr + size, &info);
> +    return vtd_page_walk(s, ce, addr, addr + size, &info);
>  }
>  
>  static int vtd_sync_shadow_page_table(VTDAddressSpace *vtd_as)
> @@ -1103,35 +1433,24 @@ static int vtd_sync_shadow_page_table(VTDAddressSpace *vtd_as)
>  }
>  
>  /*
> - * Fetch translation type for specific device. Returns <0 if error
> - * happens, otherwise return the shifted type to check against
> - * VTD_CONTEXT_TT_*.
> + * Check if specific device is configed to bypass address
> + * translation for DMA requests. In Scalable Mode, bypass
> + * 1st-level translation or 2nd-level translation, it depends
> + * on PGTT setting.
>   */
> -static int vtd_dev_get_trans_type(VTDAddressSpace *as)
> +static bool vtd_dev_pt_enabled(VTDAddressSpace *as)
>  {
>      IntelIOMMUState *s;
>      VTDContextEntry ce;
> +    VTDPASIDEntry pe;
>      int ret;
>  
> -    s = as->iommu_state;
> +    assert(as);
>  
> +    s = as->iommu_state;
>      ret = vtd_dev_to_context_entry(s, pci_bus_num(as->bus),
>                                     as->devfn, &ce);
>      if (ret) {
> -        return ret;
> -    }
> -
> -    return vtd_ce_get_type(&ce);
> -}
> -
> -static bool vtd_dev_pt_enabled(VTDAddressSpace *as)
> -{
> -    int ret;
> -
> -    assert(as);
> -
> -    ret = vtd_dev_get_trans_type(as);
> -    if (ret < 0) {
>          /*
>           * Possibly failed to parse the context entry for some reason
>           * (e.g., during init, or any guest configuration errors on
> @@ -1141,7 +1460,25 @@ static bool vtd_dev_pt_enabled(VTDAddressSpace *as)
>          return false;
>      }
>  
> -    return ret == VTD_CONTEXT_TT_PASS_THROUGH;
> +    if (s->root_scalable) {
> +        vtd_ce_get_rid2pasid_entry(s, &ce, &pe);

Better to check the error code too, and return false if an error is detected.
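
Something like (just a sketch, reusing the helpers from this patch):

    if (s->root_scalable) {
        if (vtd_ce_get_rid2pasid_entry(s, &ce, &pe)) {
            /* broken rid2pasid entry: treat as not pass-through */
            return false;
        }
        return (vtd_pe_get_type(&pe) == VTD_SM_PASID_ENTRY_PT);
    }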

> +        return (vtd_pe_get_type(&pe) == VTD_SM_PASID_ENTRY_PT);
> +    }
> +
> +    return (vtd_ce_get_type(&ce) == VTD_CONTEXT_TT_PASS_THROUGH);
> +}
> +
> +static inline uint16_t vtd_get_domain_id(IntelIOMMUState *s,
> +                                         VTDContextEntry *ce)
> +{
> +    VTDPASIDEntry pe;
> +
> +    if (s->root_scalable) {
> +        vtd_ce_get_rid2pasid_entry(s, ce, &pe);
> +        return vtd_pe_get_domain_id(&pe);
> +    }
> +
> +    return vtd_ce_get_domain_id(ce);
>  }
>  
>  /* Return whether the device is using IOMMU translation. */
> @@ -1317,14 +1654,29 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>  
>      /* Try to fetch context-entry from cache first */
>      if (cc_entry->context_cache_gen == s->context_cache_gen) {
> -        trace_vtd_iotlb_cc_hit(bus_num, devfn, cc_entry->context_entry.hi,
> -                               cc_entry->context_entry.lo,
> +        trace_vtd_iotlb_cc_hit(bus_num, devfn, cc_entry->context_entry.val[1],
> +                               cc_entry->context_entry.val[0],
>                                 cc_entry->context_cache_gen);
>          ce = cc_entry->context_entry;
> -        is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
> +        is_fpd_set = ce.val[0] & VTD_CONTEXT_ENTRY_FPD;

The spec describes this bit as: "Enables or disables recording/reporting
of non-recoverable faults found in this Scalable-Mode context-entry", so
should I assume that this bit has higher priority than the PASID table
FPD bits?  If so, below you'll also need to change this:

> +        if (s->root_scalable) {

to:

  if (!is_fpd_set && s->root_scalable) {
    // explicitly clear is_fpd_set again
    is_fpd_set = false;
    ...

Otherwise the per-PASID FPD can overwrite the per-context-entry FPD,
or you could end up using the old per-context value as the per-PASID
value.

> +            ret_fr = vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set);
> +            if (ret_fr) {
> +                ret_fr = -ret_fr;
> +                if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
> +                    trace_vtd_fault_disabled();
> +                } else {
> +                    vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
> +                }
> +                goto error;
> +            }
> +        }
>      } else {
>          ret_fr = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
> -        is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
> +        is_fpd_set = ce.val[0] & VTD_CONTEXT_ENTRY_FPD;
> +        if (!ret_fr && s->root_scalable) {

Similar question here like above.

> +            ret_fr = vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set);
> +        }
>          if (ret_fr) {
>              ret_fr = -ret_fr;
>              if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
> @@ -1335,7 +1687,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>              goto error;
>          }
>          /* Update context-cache */
> -        trace_vtd_iotlb_cc_update(bus_num, devfn, ce.hi, ce.lo,
> +        trace_vtd_iotlb_cc_update(bus_num, devfn, ce.val[1], ce.val[0],
>                                    cc_entry->context_cache_gen,
>                                    s->context_cache_gen);
>          cc_entry->context_entry = ce;
> @@ -1367,7 +1719,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>          return true;
>      }
>  
> -    ret_fr = vtd_iova_to_slpte(&ce, addr, is_write, &slpte, &level,
> +    ret_fr = vtd_iova_to_slpte(s, &ce, addr, is_write, &slpte, &level,
>                                 &reads, &writes, s->aw_bits);
>      if (ret_fr) {
>          ret_fr = -ret_fr;
> @@ -1381,7 +1733,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>  
>      page_mask = vtd_slpt_level_page_mask(level);
>      access_flags = IOMMU_ACCESS_FLAG(reads, writes);
> -    vtd_update_iotlb(s, source_id, VTD_CONTEXT_ENTRY_DID(ce.hi), addr, slpte,
> +    vtd_update_iotlb(s, source_id, vtd_get_domain_id(s, &ce), addr, slpte,
>                       access_flags, level);
>  out:
>      vtd_iommu_unlock(s);
> @@ -1404,6 +1756,7 @@ static void vtd_root_table_setup(IntelIOMMUState *s)
>  {
>      s->root = vtd_get_quad_raw(s, DMAR_RTADDR_REG);
>      s->root_extended = s->root & VTD_RTADDR_RTT;
> +    s->root_scalable = s->root & VTD_RTADDR_SMT;

Note that although you haven't declared support for scalable mode in
the device capabilities, a customized guest OS could already set
root_scalable=true here if it wants and start to play with this code
path.  I think it's probably fine if the code is robust enough; I just
want to make sure this is what you want.
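
If you want to reject that explicitly, a rough (untested) sketch could
look like the below, using the VTD_ECAP_SMTS bit added later in this
series:

    if (!(s->ecap & VTD_ECAP_SMTS) && (s->root & VTD_RTADDR_SMT)) {
        error_report_once("%s: SMT set in RTADDR_REG but ECAP.SMTS is 0",
                          __func__);
    }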

>      s->root &= VTD_RTADDR_ADDR_MASK(s->aw_bits);
>  
>      trace_vtd_reg_dmar_root(s->root, s->root_extended);
> @@ -1573,7 +1926,7 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
>      QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
>          if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>                                        vtd_as->devfn, &ce) &&
> -            domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> +            domain_id == vtd_get_domain_id(s, &ce)) {
>              vtd_sync_shadow_page_table(vtd_as);
>          }
>      }
> @@ -1591,7 +1944,7 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
>      QLIST_FOREACH(vtd_as, &(s->vtd_as_with_notifiers), next) {
>          ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>                                         vtd_as->devfn, &ce);
> -        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> +        if (!ret && domain_id == vtd_get_domain_id(s, &ce)) {
>              if (vtd_as_has_map_notifier(vtd_as)) {
>                  /*
>                   * As long as we have MAP notifications registered in
> @@ -2629,6 +2982,7 @@ static const VMStateDescription vtd_vmstate = {
>          VMSTATE_UINT8_ARRAY(csr, IntelIOMMUState, DMAR_REG_SIZE),
>          VMSTATE_UINT8(iq_last_desc_type, IntelIOMMUState),
>          VMSTATE_BOOL(root_extended, IntelIOMMUState),
> +        VMSTATE_BOOL(root_scalable, IntelIOMMUState),
>          VMSTATE_BOOL(dmar_enabled, IntelIOMMUState),
>          VMSTATE_BOOL(qi_enabled, IntelIOMMUState),
>          VMSTATE_BOOL(intr_enabled, IntelIOMMUState),
> @@ -3098,10 +3452,11 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>      vtd_address_space_unmap(vtd_as, n);
>  
>      if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
> -        trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
> +        trace_vtd_replay_ce_valid(s->root_scalable ? "scalable mode" : "",
> +                                  bus_n, PCI_SLOT(vtd_as->devfn),
>                                    PCI_FUNC(vtd_as->devfn),
> -                                  VTD_CONTEXT_ENTRY_DID(ce.hi),
> -                                  ce.hi, ce.lo);
> +                                  vtd_get_domain_id(s, &ce),
> +                                  ce.val[1], ce.val[0]);
>          if (vtd_as_has_map_notifier(vtd_as)) {
>              /* This is required only for MAP typed notifiers */
>              vtd_page_walk_info info = {
> @@ -3110,10 +3465,10 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>                  .notify_unmap = false,
>                  .aw = s->aw_bits,
>                  .as = vtd_as,
> -                .domain_id = VTD_CONTEXT_ENTRY_DID(ce.hi),
> +                .domain_id = VTD_CONTEXT_ENTRY_DID(ce.val[1]),
>              };
>  
> -            vtd_page_walk(&ce, 0, ~0ULL, &info);
> +            vtd_page_walk(s, &ce, 0, ~0ULL, &info);
>          }
>      } else {
>          trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
> @@ -3137,6 +3492,7 @@ static void vtd_init(IntelIOMMUState *s)
>  
>      s->root = 0;
>      s->root_extended = false;
> +    s->root_scalable = false;
>      s->dmar_enabled = false;
>      s->iq_head = 0;
>      s->iq_tail = 0;
> @@ -3198,7 +3554,7 @@ static void vtd_init(IntelIOMMUState *s)
>      vtd_define_long(s, DMAR_GCMD_REG, 0, 0xff800000UL, 0);
>      vtd_define_long_wo(s, DMAR_GCMD_REG, 0xff800000UL);
>      vtd_define_long(s, DMAR_GSTS_REG, 0, 0, 0);
> -    vtd_define_quad(s, DMAR_RTADDR_REG, 0, 0xfffffffffffff000ULL, 0);
> +    vtd_define_quad(s, DMAR_RTADDR_REG, 0, 0xfffffffffffffc00ULL, 0);
>      vtd_define_quad(s, DMAR_CCMD_REG, 0, 0xe0000003ffffffffULL, 0);
>      vtd_define_quad_wo(s, DMAR_CCMD_REG, 0x3ffff0000ULL);
>  
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 00e9edb..02674f9 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -172,6 +172,7 @@
>  
>  /* RTADDR_REG */
>  #define VTD_RTADDR_RTT              (1ULL << 11)
> +#define VTD_RTADDR_SMT              (1ULL << 10)
>  #define VTD_RTADDR_ADDR_MASK(aw)    (VTD_HAW_MASK(aw) ^ 0xfffULL)
>  
>  /* IRTA_REG */
> @@ -294,6 +295,8 @@ typedef enum VTDFaultReason {
>                                    * request while disabled */
>      VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
>  
> +    VTD_FR_PASID_TABLE_INV = 0x58,

A simple comment is welcome; all the rest of the error definitions
have comments.
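
E.g. (the wording is only a suggestion):

    VTD_FR_PASID_TABLE_INV = 0x58,  /* Invalid PASID table entry */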

> +
>      /* This is not a normal fault reason. We use this to indicate some faults
>       * that are not referenced by the VT-d specification.
>       * Fault event with such reason should not be recorded.
> @@ -411,8 +414,8 @@ typedef struct VTDIOTLBPageInvInfo VTDIOTLBPageInvInfo;
>  #define VTD_PAGE_MASK_1G            (~((1ULL << VTD_PAGE_SHIFT_1G) - 1))
>  
>  struct VTDRootEntry {
> -    uint64_t val;
> -    uint64_t rsvd;
> +    uint64_t lo;
> +    uint64_t hi;
>  };
>  typedef struct VTDRootEntry VTDRootEntry;
>  
> @@ -423,6 +426,10 @@ typedef struct VTDRootEntry VTDRootEntry;
>  #define VTD_ROOT_ENTRY_NR           (VTD_PAGE_SIZE / sizeof(VTDRootEntry))
>  #define VTD_ROOT_ENTRY_RSVD(aw)     (0xffeULL | ~VTD_HAW_MASK(aw))
>  
> +#define VTD_ROOT_ENTRY_SIZE         16

This is never used?

> +
> +#define VTD_DEVFN_CHECK_MASK        0x80
> +
>  /* Masks for struct VTDContextEntry */
>  /* lo */
>  #define VTD_CONTEXT_ENTRY_P         (1ULL << 0)
> @@ -441,6 +448,38 @@ typedef struct VTDRootEntry VTDRootEntry;
>  
>  #define VTD_CONTEXT_ENTRY_NR        (VTD_PAGE_SIZE / sizeof(VTDContextEntry))
>  
> +#define VTD_CTX_ENTRY_LECY_SIZE     16

LEGACY?  Then the next can spell out SCALABLE too.

> +#define VTD_CTX_ENTRY_SM_SIZE       32
> +
> +#define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
> +#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
> +#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
> +
> +/* PASID Table Related Definitions */
> +#define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
> +#define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
> +#define VTD_PASID_DIR_ENTRY_SIZE      8
> +#define VTD_PASID_ENTRY_SIZE          64
> +#define VTD_PASID_DIR_BITS_MASK       (0x3fffULL)
> +#define VTD_PASID_DIR_INDEX(pasid)    (((pasid) >> 6) & VTD_PASID_DIR_BITS_MASK)
> +#define VTD_PASID_DIR_FPD             (1ULL << 1) /* Fault Processing Disable */
> +#define VTD_PASID_TABLE_BITS_MASK     (0x3fULL)
> +#define VTD_PASID_TABLE_INDEX(pasid)  ((pasid) & VTD_PASID_TABLE_BITS_MASK)
> +#define VTD_PASID_ENTRY_FPD           (1ULL << 1) /* Fault Processing Disable */
> +
> +/* PASID Granular Translation Type Mask */
> +#define VTD_SM_PASID_ENTRY_PGTT        (7ULL << 6)
> +#define VTD_SM_PASID_ENTRY_FLT         (1ULL << 6)
> +#define VTD_SM_PASID_ENTRY_SLT         (2ULL << 6)
> +#define VTD_SM_PASID_ENTRY_NESTED      (3ULL << 6)
> +#define VTD_SM_PASID_ENTRY_PT          (4ULL << 6)
> +
> +#define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
> +#define VTD_SM_PASID_ENTRY_DID(val)    ((val) & VTD_DOMAIN_ID_MASK)
> +
> +/* Second Level Page Translation Pointer*/
> +#define VTD_SM_PASID_ENTRY_SLPTPTR     (~0xfffULL)
> +
>  /* Paging Structure common */
>  #define VTD_SL_PT_PAGE_SIZE_MASK    (1ULL << 7)
>  /* Bits to decide the offset for each level */
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index 77244fc..cae1b76 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -30,7 +30,7 @@ vtd_iotlb_cc_hit(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32
>  vtd_iotlb_cc_update(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen1, uint32_t gen2) "IOTLB context update bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32" -> gen %"PRIu32
>  vtd_iotlb_reset(const char *reason) "IOTLB reset (reason: %s)"
>  vtd_fault_disabled(void) "Fault processing disabled for context entry"
> -vtd_replay_ce_valid(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t domain, uint64_t hi, uint64_t lo) "replay valid context device %02"PRIx8":%02"PRIx8".%02"PRIx8" domain 0x%"PRIx16" hi 0x%"PRIx64" lo 0x%"PRIx64
> +vtd_replay_ce_valid(const char *mode, uint8_t bus, uint8_t dev, uint8_t fn, uint16_t domain, uint64_t hi, uint64_t lo) "%s: replay valid context device %02"PRIx8":%02"PRIx8".%02"PRIx8" domain 0x%"PRIx16" hi 0x%"PRIx64" lo 0x%"PRIx64
>  vtd_replay_ce_invalid(uint8_t bus, uint8_t dev, uint8_t fn) "replay invalid context device %02"PRIx8":%02"PRIx8".%02"PRIx8
>  vtd_page_walk_level(uint64_t addr, uint32_t level, uint64_t start, uint64_t end) "walk (base=0x%"PRIx64", level=%"PRIu32") iova range 0x%"PRIx64" - 0x%"PRIx64
>  vtd_page_walk_one(uint16_t domain, uint64_t iova, uint64_t gpa, uint64_t mask, int perm) "domain 0x%"PRIu16" iova 0x%"PRIx64" -> gpa 0x%"PRIx64" mask 0x%"PRIx64" perm %d"
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index a321cc9..ff13ff27 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -66,11 +66,12 @@ typedef struct VTDIOTLBEntry VTDIOTLBEntry;
>  typedef struct VTDBus VTDBus;
>  typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
>  typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
> +typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
> +typedef struct VTDPASIDEntry VTDPASIDEntry;
>  
>  /* Context-Entry */
>  struct VTDContextEntry {
> -    uint64_t lo;
> -    uint64_t hi;
> +    uint64_t val[4];

You can actually make it a union, with two benefits:

- you don't ever need to touch any existing valid usages of the lo/hi
  vars (though you've already done it...), and

- people won't get confused when they only see the legacy definition
  of the context entry (which is only 128 bits long, so this 256-bit
  definition could be confusing)
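
For instance (just a sketch):

    struct VTDContextEntry {
        union {
            struct {
                uint64_t lo;     /* legacy format keeps its names */
                uint64_t hi;
            };
            uint64_t val[4];     /* scalable mode format, 256 bits */
        };
    };

Then the legacy code paths can keep using ce->lo/ce->hi untouched.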

>  };
>  
>  struct VTDContextCacheEntry {
> @@ -81,6 +82,16 @@ struct VTDContextCacheEntry {
>      struct VTDContextEntry context_entry;
>  };
>  
> +/* PASID Directory Entry */
> +struct VTDPASIDDirEntry {
> +    uint64_t val;
> +};
> +
> +/* PASID Table Entry */
> +struct VTDPASIDEntry {
> +    uint64_t val[8];
> +};
> +
>  struct VTDAddressSpace {
>      PCIBus *bus;
>      uint8_t devfn;
> @@ -212,6 +223,7 @@ struct IntelIOMMUState {
>  
>      dma_addr_t root;                /* Current root table pointer */
>      bool root_extended;             /* Type of root table (extended or not) */
> +    bool root_scalable;             /* Type of root table (scalable or not) */
>      bool dmar_enabled;              /* Set if DMA remapping is enabled */
>  
>      uint16_t iq_head;               /* Current invalidation queue head */
> -- 
> 1.9.1
> 

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [RFC v1 0/3] intel_iommu: support scalable mode
  2019-01-30  5:09 [Qemu-devel] [RFC v1 0/3] intel_iommu: support scalable mode Yi Sun
                   ` (2 preceding siblings ...)
  2019-01-30  5:09 ` [Qemu-devel] [RFC v1 3/3] intel_iommu: add scalable-mode option to make scalable mode work Yi Sun
@ 2019-02-11 10:37 ` Peter Xu
  2019-02-13  5:46   ` Yi Sun
  3 siblings, 1 reply; 26+ messages in thread
From: Peter Xu @ 2019-02-11 10:37 UTC (permalink / raw)
  To: Yi Sun
  Cc: qemu-devel, pbonzini, rth, ehabkost, mst, marcel.apfelbaum,
	kevin.tian, yi.l.liu, yi.y.sun

On Wed, Jan 30, 2019 at 01:09:10PM +0800, Yi Sun wrote:
> Intel vt-d rev3.0 [1] introduces a new translation mode called
> 'scalable mode', which enables PASID-granular translations for
> first level, second level, nested and pass-through modes. The
> vt-d scalable mode is the key ingredient to enable Scalable I/O
> Virtualization (Scalable IOV) [2] [3], which allows sharing a
> device in minimal possible granularity (ADI - Assignable Device
> Interface). As a result, previous Extended Context (ECS) mode
> is deprecated (no production ever implements ECS).
> 
> This patch set emulates a minimal capability set of VT-d scalable
> mode, equivalent to what is available in VT-d legacy mode today:
>     1. Scalable mode root entry, context entry and PASID table
>     2. Seconds level translation under scalable mode
>     3. Queued invalidation (with 256 bits descriptor)
>     4. Pass-through mode
> 
> Corresponding intel-iommu driver support will be included in
> kernel 5.0:
>     https://www.spinics.net/lists/kernel/msg2985279.html
> 
> We will add emulation of full scalable mode capability along with
> guest iommu driver progress later, e.g.:
>     1. First level translation
>     2. Nested translation
>     3. Per-PASID invalidation descriptors
>     4. Page request services for handling recoverable faults

Hi, YiSun/YiLiu,

Have you tested against any existing usages of VT-d with this series
applied?

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [RFC v1 2/3] intel_iommu: add 256 bits qi_desc support
  2019-01-30  5:09 ` [Qemu-devel] [RFC v1 2/3] intel_iommu: add 256 bits qi_desc support Yi Sun
@ 2019-02-12  6:27   ` Peter Xu
  2019-02-13  9:00     ` Yi Sun
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Xu @ 2019-02-12  6:27 UTC (permalink / raw)
  To: Yi Sun
  Cc: qemu-devel, pbonzini, rth, ehabkost, mst, marcel.apfelbaum,
	kevin.tian, yi.l.liu, yi.y.sun

On Wed, Jan 30, 2019 at 01:09:12PM +0800, Yi Sun wrote:
> From: "Liu, Yi L" <yi.l.liu@intel.com>
> 
> Per Intel(R) VT-d 3.0, the qi_desc is 256 bits in Scalable
> Mode. This patch adds emulation of 256bits qi_desc.
> 
> [Yi Sun is co-developer to rebase and refine the patch.]
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>

Same here; I think you should have your s-o-b last. ;)

> ---
>  hw/i386/intel_iommu.c          | 182 +++++++++++++++++++++++++----------------
>  hw/i386/intel_iommu_internal.h |   8 +-
>  include/hw/i386/intel_iommu.h  |   1 +
>  3 files changed, 116 insertions(+), 75 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 396ac8e..3664a00 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -38,6 +38,7 @@
>  #include "trace.h"
>  
>  #define vtd_devfn_check(devfn) ((devfn & VTD_DEVFN_CHECK_MASK) ? true : false)
> +#define vtd_ecap_smts(s) ((s)->ecap & VTD_ECAP_SMTS)

I'd prefer capital letters for macros.  Your call.

>  
>  /* context entry operations */
>  #define vtd_get_ce_size(s, ce) \
> @@ -65,6 +66,9 @@
>  #define vtd_pe_get_slpt_base(pe) ((pe)->val[0] & VTD_SM_PASID_ENTRY_SLPTPTR)
>  #define vtd_pe_get_domain_id(pe) VTD_SM_PASID_ENTRY_DID((pe)->val[1])
>  
> +/* invalidation desc */
> +#define vtd_get_inv_desc_width(s) ((s)->iq_dw ? 32 : 16)

Nit: I'd prefer dropping the "get" wording in these macros, e.g.
"vtd_inv_desc_width", since the "get" doesn't help much with
understanding the meaning.  But it's personal preference too.

And since you already have the iq_dw variable - why not store the
width directly in it?  A uint8_t would suffice.

> +
>  static void vtd_address_space_refresh_all(IntelIOMMUState *s);
>  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
>  
> @@ -1759,6 +1763,11 @@ static void vtd_root_table_setup(IntelIOMMUState *s)
>      s->root_scalable = s->root & VTD_RTADDR_SMT;
>      s->root &= VTD_RTADDR_ADDR_MASK(s->aw_bits);
>  
> +    /* if Scalable mode is not enabled, enforce iq_dw to be 16 byte */
> +    if (!s->root_scalable) {
> +        s->iq_dw = 0;

This is ok but it feels a bit awkward to change the IQ setup
specifically in root table setup.  You can simply do this:

- in vtd_init(), always set iq_dw=16.  This will cover system resets
  or IOMMU device reset, then

- only update iq_dw to 32 in vtd_mem_write() when the guest asks for it
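
A rough (untested) sketch, reusing the vtd_ecap_smts()/VTD_IQA_DW_MASK
helpers from this patch and storing the width in bytes as suggested above:

    /* in vtd_init() */
    s->iq_dw = 16;

    /* in vtd_mem_write(), on a write to DMAR_IQA_REG */
    if (vtd_ecap_smts(s) && (val & VTD_IQA_DW_MASK)) {
        s->iq_dw = 32;
    } else {
        s->iq_dw = 16;
    }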

> +    }
> +
>      trace_vtd_reg_dmar_root(s->root, s->root_extended);
>  }
>  
> @@ -2052,7 +2061,7 @@ static void vtd_handle_gcmd_qie(IntelIOMMUState *s, bool en)
>      if (en) {
>          s->iq = iqa_val & VTD_IQA_IQA_MASK(s->aw_bits);
>          /* 2^(x+8) entries */
> -        s->iq_size = 1UL << ((iqa_val & VTD_IQA_QS) + 8);
> +        s->iq_size = 1UL << ((iqa_val & VTD_IQA_QS) + 8 - (s->iq_dw ? 1 : 0));
>          s->qi_enabled = true;
>          trace_vtd_inv_qi_setup(s->iq, s->iq_size);
>          /* Ok - report back to driver */
> @@ -2219,54 +2228,66 @@ static void vtd_handle_iotlb_write(IntelIOMMUState *s)
>  }
>  
>  /* Fetch an Invalidation Descriptor from the Invalidation Queue */
> -static bool vtd_get_inv_desc(dma_addr_t base_addr, uint32_t offset,
> +static bool vtd_get_inv_desc(IntelIOMMUState *s,
>                               VTDInvDesc *inv_desc)
>  {
> -    dma_addr_t addr = base_addr + offset * sizeof(*inv_desc);
> -    if (dma_memory_read(&address_space_memory, addr, inv_desc,
> -        sizeof(*inv_desc))) {
> -        error_report_once("Read INV DESC failed");
> -        inv_desc->lo = 0;
> -        inv_desc->hi = 0;
> +    dma_addr_t base_addr = s->iq;
> +    uint32_t offset = s->iq_head;
> +    uint32_t dw = vtd_get_inv_desc_width(s);
> +    dma_addr_t addr = base_addr + offset * dw;
> +
> +    /* init */
> +    inv_desc->val[0] = 0;
> +    inv_desc->val[1] = 0;
> +    inv_desc->val[2] = 0;
> +    inv_desc->val[3] = 0;

No need?

> +
> +    if (dma_memory_read(&address_space_memory, addr, inv_desc, dw)) {
> +        error_report_once("Read INV DESC failed.");
>          return false;
>      }
> -    inv_desc->lo = le64_to_cpu(inv_desc->lo);
> -    inv_desc->hi = le64_to_cpu(inv_desc->hi);
> +    inv_desc->val[0] = le64_to_cpu(inv_desc->val[0]);
> +    inv_desc->val[1] = le64_to_cpu(inv_desc->val[1]);
> +    inv_desc->val[2] = le64_to_cpu(inv_desc->val[2]);
> +    inv_desc->val[3] = le64_to_cpu(inv_desc->val[3]);

Similar comment here on VTDInvDesc - you can keep the hi/lo fields
and also define val[4] in the union.  Then:

  if (dw == 16) {
    inv_desc->lo = ...
    inv_desc->hi = ...
  } else {
    inv_desc->val[0] = ...
    inv_desc->val[1] = ...
    inv_desc->val[2] = ...
    inv_desc->val[3] = ...
  }

Then this patch can be greatly simplified too since you don't need to
switch VTDInvDesc.{lo|hi} to val[0|1].
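
I.e. something along these lines (sketch only; keep any other
existing members of the union as they are):

  union VTDInvDesc {
      struct {
          uint64_t lo;
          uint64_t hi;
      };
      uint64_t val[4];
      ...
  };

so legacy code keeps using lo/hi and only the 256-bit path needs val[].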

[...]

> @@ -2525,7 +2558,7 @@ static void vtd_handle_iqt_write(IntelIOMMUState *s)
>  {
>      uint64_t val = vtd_get_quad_raw(s, DMAR_IQT_REG);
>  
> -    s->iq_tail = VTD_IQT_QT(val);
> +    s->iq_tail = VTD_IQT_QT(s->iq_dw, val);

You can check against bit 4 when iq_dw==32 and report error otherwise.
That's explicitly mentioned by the spec (chap 10.4.22).
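
Something like this (rough sketch; pick whatever name you like for
the reserved bit):

    uint64_t val = vtd_get_quad_raw(s, DMAR_IQT_REG);

    if (s->iq_dw == 32 && (val & (1ULL << 4))) {
        /* bit 4 of IQT_REG is reserved when 256-bit descriptors are used */
        error_report_once("%s: RSV bit of DMAR_IQT_REG is set: 0x%"PRIx64,
                          __func__, val);
        return;
    }
    s->iq_tail = VTD_IQT_QT(s->iq_dw, val);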

>      trace_vtd_inv_qi_tail(s->iq_tail);
>  
>      if (s->qi_enabled && !(vtd_get_long_raw(s, DMAR_FSTS_REG) & VTD_FSTS_IQE)) {

[...]


Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [RFC v1 3/3] intel_iommu: add scalable-mode option to make scalable mode work
  2019-01-30  5:09 ` [Qemu-devel] [RFC v1 3/3] intel_iommu: add scalable-mode option to make scalable mode work Yi Sun
@ 2019-02-12  6:46   ` Peter Xu
  2019-02-15  5:22     ` Yi Sun
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Xu @ 2019-02-12  6:46 UTC (permalink / raw)
  To: Yi Sun
  Cc: qemu-devel, pbonzini, rth, ehabkost, mst, marcel.apfelbaum,
	kevin.tian, yi.l.liu, yi.y.sun

On Wed, Jan 30, 2019 at 01:09:13PM +0800, Yi Sun wrote:
> This patch adds an option to provide flexibility for user to expose
> Scalable Mode to guest. User could expose Scalable Mode to guest by
> the config as below:
> 
> "-device intel-iommu,caching-mode=on,scalable-mode=on"
> 
> The Linux iommu driver has supported scalable mode. Please refer below
> patch set:
> 
>     https://www.spinics.net/lists/kernel/msg2985279.html
> 
> Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> ---
>  hw/i386/intel_iommu.c          | 22 ++++++++++++++++++++++
>  hw/i386/intel_iommu_internal.h |  6 ++++++
>  include/hw/i386/intel_iommu.h  |  3 ++-
>  3 files changed, 30 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 3664a00..447fdf3 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -2492,6 +2492,17 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
>          }
>          break;
>  
> +    /*
> +     * TODO: the entity of below two cases will be implemented in future series.
> +     * To make guest (which integrates scalable mode support patch set in
> +     * iommu driver) work, just return true is enough so far.

When you repost, could you mention what tests you have done?  I
can think of some general usages:

Device matrix:

- virtio-net, vhost=on|off
- device assignment

Configuration matrix:

- legacy/scalable
- passthrough=on|off

I believe a smoke test (like netperf for a few seconds, or even ping)
would be enough, covering some (or all, which would be very nice to
have :) of the above scenarios.
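
For instance (illustrative command lines only, adjust to your own
setup; the host BDF below is made up):

    # scalable mode + vhost
    -device intel-iommu,caching-mode=on,scalable-mode=on \
    -device virtio-net-pci,netdev=n0,iommu_platform=on,ats=on \
    -netdev tap,id=n0,vhost=on

    # legacy mode + assigned device
    -device intel-iommu,caching-mode=on \
    -device vfio-pci,host=03:00.0

plus toggling "iommu=pt" on the guest kernel command line for the
passthrough dimension.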

> +     */
> +    case VTD_INV_DESC_PC:
> +        break;
> +
> +    case VTD_INV_DESC_PIOTLB:
> +        break;
> +
>      case VTD_INV_DESC_WAIT:
>          trace_vtd_inv_desc("wait", inv_desc.val[1], inv_desc.val[0]);
>          if (!vtd_process_wait_desc(s, &inv_desc)) {
> @@ -3051,6 +3062,7 @@ static Property vtd_properties[] = {
>      DEFINE_PROP_UINT8("aw-bits", IntelIOMMUState, aw_bits,
>                        VTD_HOST_ADDRESS_WIDTH),
>      DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
> +    DEFINE_PROP_BOOL("scalable-mode", IntelIOMMUState, scalable_mode, FALSE),

Let's start with x-scalable-mode?  Less burden for all.

>      DEFINE_PROP_BOOL("dma-drain", IntelIOMMUState, dma_drain, true),
>      DEFINE_PROP_END_OF_LIST(),
>  };
> @@ -3583,6 +3595,16 @@ static void vtd_init(IntelIOMMUState *s)
>          s->cap |= VTD_CAP_CM;
>      }
>  
> +    /* TODO: read cap/ecap from host to decide which cap to be exposed. */
> +    if (s->scalable_mode) {
> +        if (!s->caching_mode) {
> +            error_report("Need to set caching-mode for scalable mode");

Could I ask why?

> +            exit(1);
> +        }
> +        s->cap |= VTD_CAP_DWD | VTD_CAP_DRD;

Draining is supported now so this line is not needed.

> +        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
> +    }
> +
>      vtd_reset_caches(s);
>  
>      /* Define registers with default values and bit semantics */
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 2a753c5..b01953a 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -190,7 +190,9 @@
>  #define VTD_ECAP_EIM                (1ULL << 4)
>  #define VTD_ECAP_PT                 (1ULL << 6)
>  #define VTD_ECAP_MHMV               (15ULL << 20)
> +#define VTD_ECAP_SRS                (1ULL << 31)
>  #define VTD_ECAP_SMTS               (1ULL << 43)
> +#define VTD_ECAP_SLTS               (1ULL << 46)
>  
>  /* CAP_REG */
>  /* (offset >> 4) << 24 */
> @@ -209,6 +211,8 @@
>  #define VTD_CAP_DRAIN_READ          (1ULL << 55)
>  #define VTD_CAP_DRAIN               (VTD_CAP_DRAIN_READ | VTD_CAP_DRAIN_WRITE)
>  #define VTD_CAP_CM                  (1ULL << 7)
> +#define VTD_CAP_DWD                 (1ULL << 54)
> +#define VTD_CAP_DRD                 (1ULL << 55)

These can be dropped too (see VTD_CAP_DRAIN above).

>  
>  /* Supported Adjusted Guest Address Widths */
>  #define VTD_CAP_SAGAW_SHIFT         8
> @@ -340,6 +344,8 @@ typedef union VTDInvDesc VTDInvDesc;
>  #define VTD_INV_DESC_IEC                0x4 /* Interrupt Entry Cache
>                                                 Invalidate Descriptor */
>  #define VTD_INV_DESC_WAIT               0x5 /* Invalidation Wait Descriptor */
> +#define VTD_INV_DESC_PIOTLB             0x6 /* PASID-IOTLB Invalidate Desc */
> +#define VTD_INV_DESC_PC                 0x7 /* PASID-cache Invalidate Desc */
>  #define VTD_INV_DESC_NONE               0   /* Not an Invalidate Descriptor */
>  
>  /* Masks for Invalidation Wait Descriptor*/
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index a5da139..a04fad6 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -219,7 +219,8 @@ struct IntelIOMMUState {
>      uint8_t womask[DMAR_REG_SIZE];  /* WO (write only - read returns 0) */
>      uint32_t version;
>  
> -    bool caching_mode;          /* RO - is cap CM enabled? */
> +    bool caching_mode;              /* RO - is cap CM enabled? */
> +    bool scalable_mode;             /* RO - is Scalable Mode supported? */
>  
>      dma_addr_t root;                /* Current root table pointer */
>      bool root_extended;             /* Type of root table (extended or not) */
> -- 
> 1.9.1
> 

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [RFC v1 0/3] intel_iommu: support scalable mode
  2019-02-11 10:37 ` [Qemu-devel] [RFC v1 0/3] intel_iommu: support scalable mode Peter Xu
@ 2019-02-13  5:46   ` Yi Sun
  0 siblings, 0 replies; 26+ messages in thread
From: Yi Sun @ 2019-02-13  5:46 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, pbonzini, rth, ehabkost, mst, marcel.apfelbaum,
	kevin.tian, yi.l.liu, yi.y.sun

On 19-02-11 18:37:41, Peter Xu wrote:
> On Wed, Jan 30, 2019 at 01:09:10PM +0800, Yi Sun wrote:
> > Intel vt-d rev3.0 [1] introduces a new translation mode called
> > 'scalable mode', which enables PASID-granular translations for
> > first level, second level, nested and pass-through modes. The
> > vt-d scalable mode is the key ingredient to enable Scalable I/O
> > Virtualization (Scalable IOV) [2] [3], which allows sharing a
> > device in minimal possible granularity (ADI - Assignable Device
> > Interface). As a result, previous Extended Context (ECS) mode
> > is deprecated (no production ever implements ECS).
> > 
> > This patch set emulates a minimal capability set of VT-d scalable
> > mode, equivalent to what is available in VT-d legacy mode today:
> >     1. Scalable mode root entry, context entry and PASID table
> >     2. Seconds level translation under scalable mode
> >     3. Queued invalidation (with 256 bits descriptor)
> >     4. Pass-through mode
> > 
> > Corresponding intel-iommu driver support will be included in
> > kernel 5.0:
> >     https://www.spinics.net/lists/kernel/msg2985279.html
> > 
> > We will add emulation of full scalable mode capability along with
> > guest iommu driver progress later, e.g.:
> >     1. First level translation
> >     2. Nested translation
> >     3. Per-PASID invalidation descriptors
> >     4. Page request services for handling recoverable faults
> 
> Hi, YiSun/YiLiu,
> 
> Have you tested against any existing usages of VT-d with this series
> applied?
> 
Thanks for the review!

With the kernel/QEMU scalable-mode enabling patch sets applied, I
tested kernel build/data copy/netperf in the guest under both
"scalable-mode" enabled and "scalable-mode" disabled scenarios.

> Thanks,
> 
> -- 
> Peter Xu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [RFC v1 1/3] intel_iommu: scalable mode emulation
  2019-02-11 10:12   ` Peter Xu
@ 2019-02-13  7:38     ` Yi Sun
  2019-02-13  8:03       ` Peter Xu
  0 siblings, 1 reply; 26+ messages in thread
From: Yi Sun @ 2019-02-13  7:38 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, pbonzini, rth, ehabkost, mst, marcel.apfelbaum,
	kevin.tian, yi.l.liu, yi.y.sun

On 19-02-11 18:12:13, Peter Xu wrote:
> On Wed, Jan 30, 2019 at 01:09:11PM +0800, Yi Sun wrote:
> > From: "Liu, Yi L" <yi.l.liu@intel.com>
> > 
> > Intel(R) VT-d 3.0 spec introduces scalable mode address translation to
> > replace extended context mode. This patch extends current emulator to
> > support Scalable Mode which includes root table, context table and new
> > pasid table format change. Now intel_iommu emulates both legacy mode
> > and scalable mode (with legacy-equivalent capability set).
> > 
> > The key points are below:
> > 1. Extend root table operations to support both legacy mode and scalable
> >    mode.
> > 2. Extend context table operations to support both legacy mode and
> >    scalable mode.
> > 3. Add pasid tabled operations to support scalable mode.
> 
> (this patch looks generally good to me, but I've got some trivial
>  comments below...)
> 
Thank you!

> > 
> > [Yi Sun is co-developer to contribute much to refine the whole commit.]
> > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> > Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> 
> I think you should have your signed-off-by to be the latter one since
> you are the one who processed the patch last (and who posted it).
> 
Got it, thanks!

> > ---
> >  hw/i386/intel_iommu.c          | 528 ++++++++++++++++++++++++++++++++++-------
> >  hw/i386/intel_iommu_internal.h |  43 +++-
> >  hw/i386/trace-events           |   2 +-
> >  include/hw/i386/intel_iommu.h  |  16 +-
> >  4 files changed, 498 insertions(+), 91 deletions(-)
> > 
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index 8b72735..396ac8e 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -37,6 +37,34 @@
> >  #include "kvm_i386.h"
> >  #include "trace.h"
> >  
> > +#define vtd_devfn_check(devfn) ((devfn & VTD_DEVFN_CHECK_MASK) ? true : false)
> 
> "vtd_devfn_check(devfn)" is merely as long as "devfn &
> VTD_DEVFN_CHECK_MASK", isn't it? :)
> 
> I would just drop the macro.
> 
There are two places that call this macro. Is it still worth keeping?

> > +
> > +/* context entry operations */
> > +#define vtd_get_ce_size(s, ce) \
> > +    (((s)->root_scalable) ? \
> > +     VTD_CTX_ENTRY_SM_SIZE : VTD_CTX_ENTRY_LECY_SIZE)
> 
> "ce" is not used.  Also, if a macro is only used once, I'd just embed
> it in the function.  This one is only used in
> vtd_get_context_entry_from_root().
> 
Yes, I will drop this.

> > +#define vtd_ce_get_domain_id(ce) VTD_CONTEXT_ENTRY_DID((ce)->val[1])
> 
> Is this correct for scalable mode?  Section 9.4, Figure 9-34, it says
> ce->val[1] has RID_PASID in bits 64-83 rather than domain ID.
> 
This is for legacy context entry but not scalable-mode context entry.

> > +#define vtd_ce_get_rid2pasid(ce) \
> > +    ((ce)->val[1] & VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK)
> > +#define vtd_ce_get_pasid_dir_table(ce) \
> > +    ((ce)->val[0] & VTD_PASID_DIR_BASE_ADDR_MASK)
> > +
> > +/* pasid operations */
> > +#define vtd_pdire_get_pasidt_base(pdire) \
> > +    ((pdire)->val & VTD_PASID_TABLE_BASE_ADDR_MASK)
> > +#define vtd_get_pasid_dir_entry_size() VTD_PASID_DIR_ENTRY_SIZE
> > +#define vtd_get_pasid_entry_size() VTD_PASID_ENTRY_SIZE
> > +#define vtd_get_pasid_dir_index(pasid) VTD_PASID_DIR_INDEX(pasid)
> > +#define vtd_get_pasid_table_index(pasid) VTD_PASID_TABLE_INDEX(pasid)
> 
> These macros seem useless.  Please use the existing ones, they are
> good enough AFAICT.  Also, please use capital letters for macro
> definitions so that format will be matched with existing codes.  The
> capital issue is there for the whole series, please adjust them
> accordingly.  I'll stop here on commenting anything about macros...
> 
Ok, I will adjust macro names.

> > +
> > +/* pe operations */
> > +#define vtd_pe_get_type(pe) ((pe)->val[0] & VTD_SM_PASID_ENTRY_PGTT)
> > +#define vtd_pe_get_level(pe) (2 + (((pe)->val[0] >> 2) & VTD_SM_PASID_ENTRY_AW))
> > +#define vtd_pe_get_agaw(pe) \
> > +    (30 + (((pe)->val[0] >> 2) & VTD_SM_PASID_ENTRY_AW) * 9)
> > +#define vtd_pe_get_slpt_base(pe) ((pe)->val[0] & VTD_SM_PASID_ENTRY_SLPTPTR)
> > +#define vtd_pe_get_domain_id(pe) VTD_SM_PASID_ENTRY_DID((pe)->val[1])
> > +
> >  static void vtd_address_space_refresh_all(IntelIOMMUState *s);
> >  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
> >  
> > @@ -512,9 +540,15 @@ static void vtd_generate_completion_event(IntelIOMMUState *s)
> >      }
> >  }
> >  
> > -static inline bool vtd_root_entry_present(VTDRootEntry *root)
> > +static inline bool vtd_root_entry_present(IntelIOMMUState *s,
> > +                                          VTDRootEntry *re,
> > +                                          uint8_t devfn)
> >  {
> > -    return root->val & VTD_ROOT_ENTRY_P;
> > +    if (s->root_scalable && vtd_devfn_check(devfn)) {
> > +        return re->hi & VTD_ROOT_ENTRY_P;
> > +    }
> > +
> > +    return re->lo & VTD_ROOT_ENTRY_P;
> >  }
> >  
> >  static int vtd_get_root_entry(IntelIOMMUState *s, uint8_t index,
> > @@ -524,36 +558,64 @@ static int vtd_get_root_entry(IntelIOMMUState *s, uint8_t index,
> >  
> >      addr = s->root + index * sizeof(*re);
> >      if (dma_memory_read(&address_space_memory, addr, re, sizeof(*re))) {
> > -        re->val = 0;
> > +        re->lo = 0;
> >          return -VTD_FR_ROOT_TABLE_INV;
> >      }
> > -    re->val = le64_to_cpu(re->val);
> > +    re->lo = le64_to_cpu(re->lo);
> > +    if (s->root_scalable) {
> > +        re->hi = le64_to_cpu(re->hi);
> 
> Maybe simply make it unconditional - legacy mode has re->hi defined
> too, though they are all zeros.
> 
That is good.

> > +    }
> >      return 0;
> >  }
> >  
> > -static inline bool vtd_ce_present(VTDContextEntry *context)
> > +static inline bool vtd_ce_present(VTDContextEntry *ce)
> > +{
> > +    return ce->val[0] & VTD_CONTEXT_ENTRY_P;
> > +}
> > +
> > +static inline dma_addr_t vtd_get_context_base(IntelIOMMUState *s,
> > +                                              VTDRootEntry *re,
> > +                                              uint8_t *index)
> >  {
> > -    return context->lo & VTD_CONTEXT_ENTRY_P;
> > +    if (s->root_scalable && vtd_devfn_check(*index)) {
> > +        *index = *index & (~VTD_DEVFN_CHECK_MASK);
> 
> Operating on *index is a bit tricky... if this function is only used
> once in vtd_get_context_entry_from_root() then how about squash it
> there?
> 
Ok, I think that would be fine.

> > +        return re->hi & VTD_ROOT_ENTRY_CTP;
> > +    }
> > +
> > +    return re->lo & VTD_ROOT_ENTRY_CTP;
> >  }
> >  
> > -static int vtd_get_context_entry_from_root(VTDRootEntry *root, uint8_t index,
> > +static void vtd_context_entry_format(IntelIOMMUState *s,
> > +                                     VTDContextEntry *ce)
> > +{
> > +    ce->val[0] = le64_to_cpu(ce->val[0]);
> > +    ce->val[1] = le64_to_cpu(ce->val[1]);
> > +    if (s->root_scalable) {
> > +        ce->val[2] = le64_to_cpu(ce->val[2]);
> > +        ce->val[3] = le64_to_cpu(ce->val[3]);
> > +    }
> > +}
> 
> Only used once.  Squash into caller?  Itself does not make much sense.
> 
Sure.

[...]

> > +static inline int vtd_get_pasid_entry(IntelIOMMUState *s,
> > +                                      uint32_t pasid,
> > +                                      VTDPASIDDirEntry *pdire,
> > +                                      VTDPASIDEntry *pe)
> > +{
> > +    uint32_t index;
> > +    dma_addr_t addr, entry_size;
> > +    X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > +
> > +    index = vtd_get_pasid_table_index(pasid);
> > +    entry_size = vtd_get_pasid_entry_size();
> > +    addr = vtd_pdire_get_pasidt_base(pdire);
> > +    addr = addr + index * entry_size;
> > +    if (dma_memory_read(&address_space_memory, addr, pe, entry_size)) {
> > +        memset(pe->val, 0, sizeof(pe->val));
> 
> No need (like all the rest of the places)?
> 
After reading deeper into the code, pe will not be contaminated, so I
will remove the memset.

> > +        return -VTD_FR_PASID_TABLE_INV;
> > +    }
> > +
> > +    /* Do translation type check */
> > +    if (!vtd_pe_type_check(x86_iommu, pe)) {
> > +        return -VTD_FR_PASID_TABLE_INV;
> > +    }
> > +
> > +    if (!vtd_is_level_supported(s, vtd_pe_get_level(pe))) {
> > +        return -VTD_FR_PASID_TABLE_INV;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
[...]

> >  /* Return true if check passed, otherwise false */
> > -static inline bool vtd_ce_type_check(X86IOMMUState *x86_iommu,
> > +static inline bool vtd_ce_type_check(IntelIOMMUState *s,
> > +                                     X86IOMMUState *x86_iommu,
> >                                       VTDContextEntry *ce)
> 
> No need to pass it again.  Simply:
> 
>     X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> 
> Or use INTEL_IOMMU_DEVICE() for the reverse.
> 
That is good. Then I don't need to add the IntelIOMMUState parameter.

> >  {
> > +    if (s->root_scalable) {
> > +        /*
> > +         * Translation Type locates in context entry only when VTD is in
> > +         * legacy mode. For scalable mode, need to return true to avoid
> > +         * unnecessary fault.
> > +         */
> > +        return true;
> > +    }
> > +
> >      switch (vtd_ce_get_type(ce)) {
> >      case VTD_CONTEXT_TT_MULTI_LEVEL:
> >          /* Always supported */
> > @@ -639,7 +876,7 @@ static inline bool vtd_ce_type_check(X86IOMMUState *x86_iommu,
> >          }
> >          break;
> >      default:
> > -        /* Unknwon type */
> > +        /* Unknown type */
> >          error_report_once("%s: unknown ce type: %"PRIu32, __func__,
> >                            vtd_ce_get_type(ce));
> >          return false;
> > @@ -647,21 +884,36 @@ static inline bool vtd_ce_type_check(X86IOMMUState *x86_iommu,
> >      return true;
> >  }
> >  
[...]

> > @@ -1065,10 +1395,10 @@ static int vtd_sync_shadow_page_table_range(VTDAddressSpace *vtd_as,
> >          .notify_unmap = true,
> >          .aw = s->aw_bits,
> >          .as = vtd_as,
> > -        .domain_id = VTD_CONTEXT_ENTRY_DID(ce->hi),
> > +        .domain_id = VTD_CONTEXT_ENTRY_DID(ce->val[1]),
> 
> So here for scalable mode the domain ID will be in the pasid table
> entries rather than context entries, so probably more changes
> required.
> 
Yes, I should call vtd_get_domain_id(). Thanks!

> >      };
> >  
> > -    return vtd_page_walk(ce, addr, addr + size, &info);
> > +    return vtd_page_walk(s, ce, addr, addr + size, &info);
> >  }
> >  
> >  static int vtd_sync_shadow_page_table(VTDAddressSpace *vtd_as)
> > @@ -1103,35 +1433,24 @@ static int vtd_sync_shadow_page_table(VTDAddressSpace *vtd_as)
> >  }
> >  
> >  /*
> > - * Fetch translation type for specific device. Returns <0 if error
> > - * happens, otherwise return the shifted type to check against
> > - * VTD_CONTEXT_TT_*.
> > + * Check if specific device is configed to bypass address
> > + * translation for DMA requests. In Scalable Mode, bypass
> > + * 1st-level translation or 2nd-level translation, it depends
> > + * on PGTT setting.
> >   */
> > -static int vtd_dev_get_trans_type(VTDAddressSpace *as)
> > +static bool vtd_dev_pt_enabled(VTDAddressSpace *as)
> >  {
> >      IntelIOMMUState *s;
> >      VTDContextEntry ce;
> > +    VTDPASIDEntry pe;
> >      int ret;
> >  
> > -    s = as->iommu_state;
> > +    assert(as);
> >  
> > +    s = as->iommu_state;
> >      ret = vtd_dev_to_context_entry(s, pci_bus_num(as->bus),
> >                                     as->devfn, &ce);
> >      if (ret) {
> > -        return ret;
> > -    }
> > -
> > -    return vtd_ce_get_type(&ce);
> > -}
> > -
> > -static bool vtd_dev_pt_enabled(VTDAddressSpace *as)
> > -{
> > -    int ret;
> > -
> > -    assert(as);
> > -
> > -    ret = vtd_dev_get_trans_type(as);
> > -    if (ret < 0) {
> >          /*
> >           * Possibly failed to parse the context entry for some reason
> >           * (e.g., during init, or any guest configuration errors on
> > @@ -1141,7 +1460,25 @@ static bool vtd_dev_pt_enabled(VTDAddressSpace *as)
> >          return false;
> >      }
> >  
> > -    return ret == VTD_CONTEXT_TT_PASS_THROUGH;
> > +    if (s->root_scalable) {
> > +        vtd_ce_get_rid2pasid_entry(s, &ce, &pe);
> 
> Better check error code too, then return false if error detected.
> 
Ok, thanks!

> > +        return (vtd_pe_get_type(&pe) == VTD_SM_PASID_ENTRY_PT);
> > +    }
> > +
> > +    return (vtd_ce_get_type(&ce) == VTD_CONTEXT_TT_PASS_THROUGH);
> > +}
> > +
> > +static inline uint16_t vtd_get_domain_id(IntelIOMMUState *s,
> > +                                         VTDContextEntry *ce)
> > +{
> > +    VTDPASIDEntry pe;
> > +
> > +    if (s->root_scalable) {
> > +        vtd_ce_get_rid2pasid_entry(s, ce, &pe);
> > +        return vtd_pe_get_domain_id(&pe);
> > +    }
> > +
> > +    return vtd_ce_get_domain_id(ce);
> >  }
> >  
> >  /* Return whether the device is using IOMMU translation. */
> > @@ -1317,14 +1654,29 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
> >  
> >      /* Try to fetch context-entry from cache first */
> >      if (cc_entry->context_cache_gen == s->context_cache_gen) {
> > -        trace_vtd_iotlb_cc_hit(bus_num, devfn, cc_entry->context_entry.hi,
> > -                               cc_entry->context_entry.lo,
> > +        trace_vtd_iotlb_cc_hit(bus_num, devfn, cc_entry->context_entry.val[1],
> > +                               cc_entry->context_entry.val[0],
> >                                 cc_entry->context_cache_gen);
> >          ce = cc_entry->context_entry;
> > -        is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
> > +        is_fpd_set = ce.val[0] & VTD_CONTEXT_ENTRY_FPD;
> 
> The spec says this bit as: "Enables or disables recording/reporting of
> non-recoverable faults found in this Scalable-Mode context-entry",
> then should I assume that this bit has higher priority than the PASID
> table FPD bits?  If so, below you'll also need to change this:
> 
> > +        if (s->root_scalable) {
> 
> to:
> 
>   if (!is_fpd_set && s->root_scalable) {
>     // explicitly clear is_fpd_set again
>     is_fpd_set = false;
>     ...
> 
> Otherwise the per-PASID FPD can overwrite the per context entry SPD,
> or you could also be using the old per context value as per pasid
> value?
> 
Oh, yes, thanks for the finding!
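
Will rework it roughly like below (just a sketch of the idea):

    is_fpd_set = ce.val[0] & VTD_CONTEXT_ENTRY_FPD;
    if (!is_fpd_set && s->root_scalable) {
        ret_fr = vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set);
        if (ret_fr) {
            ...
        }
    }

so the per-PASID FPD is only consulted when the context entry itself
does not have FPD set.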

> > +            ret_fr = vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set);
> > +            if (ret_fr) {
> > +                ret_fr = -ret_fr;
> > +                if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
> > +                    trace_vtd_fault_disabled();
> > +                } else {
> > +                    vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
> > +                }
> > +                goto error;
> > +            }
> > +        }
> >      } else {
> >          ret_fr = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
> > -        is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
> > +        is_fpd_set = ce.val[0] & VTD_CONTEXT_ENTRY_FPD;
> > +        if (!ret_fr && s->root_scalable) {
> 
> Similar question here like above.
> 
Thanks!

> > +            ret_fr = vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set);
> > +        }
> >          if (ret_fr) {
> >              ret_fr = -ret_fr;
> >              if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
> > @@ -1335,7 +1687,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
> >              goto error;
> >          }
> >          /* Update context-cache */
> > -        trace_vtd_iotlb_cc_update(bus_num, devfn, ce.hi, ce.lo,
> > +        trace_vtd_iotlb_cc_update(bus_num, devfn, ce.val[1], ce.val[0],
> >                                    cc_entry->context_cache_gen,
> >                                    s->context_cache_gen);
> >          cc_entry->context_entry = ce;
> > @@ -1367,7 +1719,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
> >          return true;
> >      }
> >  
> > -    ret_fr = vtd_iova_to_slpte(&ce, addr, is_write, &slpte, &level,
> > +    ret_fr = vtd_iova_to_slpte(s, &ce, addr, is_write, &slpte, &level,
> >                                 &reads, &writes, s->aw_bits);
> >      if (ret_fr) {
> >          ret_fr = -ret_fr;
> > @@ -1381,7 +1733,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
> >  
> >      page_mask = vtd_slpt_level_page_mask(level);
> >      access_flags = IOMMU_ACCESS_FLAG(reads, writes);
> > -    vtd_update_iotlb(s, source_id, VTD_CONTEXT_ENTRY_DID(ce.hi), addr, slpte,
> > +    vtd_update_iotlb(s, source_id, vtd_get_domain_id(s, &ce), addr, slpte,
> >                       access_flags, level);
> >  out:
> >      vtd_iommu_unlock(s);
> > @@ -1404,6 +1756,7 @@ static void vtd_root_table_setup(IntelIOMMUState *s)
> >  {
> >      s->root = vtd_get_quad_raw(s, DMAR_RTADDR_REG);
> >      s->root_extended = s->root & VTD_RTADDR_RTT;
> > +    s->root_scalable = s->root & VTD_RTADDR_SMT;
> 
> Note that although you haven't declared support for scalable mode in
> device capabilities, a customized guest OS could have already set
> root_scalable=true here if it wants and it can start to play with
> these codes.  I think it's probably fine if the code is strong enough,
> just want to make sure whether this is what you want.
> 
A good point. Then I would like to move it into patch 3 and even add
protection by checking whether "scalable-mode" is on.

> >      s->root &= VTD_RTADDR_ADDR_MASK(s->aw_bits);
> >  
> >      trace_vtd_reg_dmar_root(s->root, s->root_extended);
> > @@ -1573,7 +1926,7 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
> >      QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
> >          if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> >                                        vtd_as->devfn, &ce) &&
> > -            domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> > +            domain_id == vtd_get_domain_id(s, &ce)) {
> >              vtd_sync_shadow_page_table(vtd_as);
> >          }
> >      }
[...]

> > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> > index 00e9edb..02674f9 100644
> > --- a/hw/i386/intel_iommu_internal.h
> > +++ b/hw/i386/intel_iommu_internal.h
> > @@ -172,6 +172,7 @@
> >  
> >  /* RTADDR_REG */
> >  #define VTD_RTADDR_RTT              (1ULL << 11)
> > +#define VTD_RTADDR_SMT              (1ULL << 10)
> >  #define VTD_RTADDR_ADDR_MASK(aw)    (VTD_HAW_MASK(aw) ^ 0xfffULL)
> >  
> >  /* IRTA_REG */
> > @@ -294,6 +295,8 @@ typedef enum VTDFaultReason {
> >                                    * request while disabled */
> >      VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
> >  
> > +    VTD_FR_PASID_TABLE_INV = 0x58,
> 
> Simple comment is welcomed; all the rest error definitions have
> comments.
> 
Ok, will add it.

> > +
> >      /* This is not a normal fault reason. We use this to indicate some faults
> >       * that are not referenced by the VT-d specification.
> >       * Fault event with such reason should not be recorded.
> > @@ -411,8 +414,8 @@ typedef struct VTDIOTLBPageInvInfo VTDIOTLBPageInvInfo;
> >  #define VTD_PAGE_MASK_1G            (~((1ULL << VTD_PAGE_SHIFT_1G) - 1))
> >  
> >  struct VTDRootEntry {
> > -    uint64_t val;
> > -    uint64_t rsvd;
> > +    uint64_t lo;
> > +    uint64_t hi;
> >  };
> >  typedef struct VTDRootEntry VTDRootEntry;
> >  
> > @@ -423,6 +426,10 @@ typedef struct VTDRootEntry VTDRootEntry;
> >  #define VTD_ROOT_ENTRY_NR           (VTD_PAGE_SIZE / sizeof(VTDRootEntry))
> >  #define VTD_ROOT_ENTRY_RSVD(aw)     (0xffeULL | ~VTD_HAW_MASK(aw))
> >  
> > +#define VTD_ROOT_ENTRY_SIZE         16
> 
> This is never used?
> 
Will remove it.

> > +
> > +#define VTD_DEVFN_CHECK_MASK        0x80
> > +
> >  /* Masks for struct VTDContextEntry */
> >  /* lo */
> >  #define VTD_CONTEXT_ENTRY_P         (1ULL << 0)
> > @@ -441,6 +448,38 @@ typedef struct VTDRootEntry VTDRootEntry;
> >  
> >  #define VTD_CONTEXT_ENTRY_NR        (VTD_PAGE_SIZE / sizeof(VTDContextEntry))
> >  
> > +#define VTD_CTX_ENTRY_LECY_SIZE     16
> 
> LEGACY?  Then the next can spell out SCALABLE too.
> 
Ok, thanks!

[...]

> > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > index a321cc9..ff13ff27 100644
> > --- a/include/hw/i386/intel_iommu.h
> > +++ b/include/hw/i386/intel_iommu.h
> > @@ -66,11 +66,12 @@ typedef struct VTDIOTLBEntry VTDIOTLBEntry;
> >  typedef struct VTDBus VTDBus;
> >  typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
> >  typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
> > +typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
> > +typedef struct VTDPASIDEntry VTDPASIDEntry;
> >  
> >  /* Context-Entry */
> >  struct VTDContextEntry {
> > -    uint64_t lo;
> > -    uint64_t hi;
> > +    uint64_t val[4];
> 
> You can actually make it an enum, two benefits:
> 
Thanks for the suggestion! DYM 'union'?

> - you don't ever need to touch any existing valid usages of lo/hi
>   vars (though you've already done it...), and
> 
> - people won't get confused when they only see the legacy definition
>   of context entry (which is only 128bits long, so this 256bits
>   defintion could be confusing)
> 
> >  };
> >  
> >  struct VTDContextCacheEntry {
> > @@ -81,6 +82,16 @@ struct VTDContextCacheEntry {
> >      struct VTDContextEntry context_entry;
> >  };
> >  
> > +/* PASID Directory Entry */
> > +struct VTDPASIDDirEntry {
> > +    uint64_t val;
> > +};
> > +
> > +/* PASID Table Entry */
> > +struct VTDPASIDEntry {
> > +    uint64_t val[8];
> > +};
> > +
> >  struct VTDAddressSpace {
> >      PCIBus *bus;
> >      uint8_t devfn;
> > @@ -212,6 +223,7 @@ struct IntelIOMMUState {
> >  
> >      dma_addr_t root;                /* Current root table pointer */
> >      bool root_extended;             /* Type of root table (extended or not) */
> > +    bool root_scalable;             /* Type of root table (scalable or not) */
> >      bool dmar_enabled;              /* Set if DMA remapping is enabled */
> >  
> >      uint16_t iq_head;               /* Current invalidation queue head */
> > -- 
> > 1.9.1
> > 
> 
> Regards,
> 
> -- 
> Peter Xu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [RFC v1 1/3] intel_iommu: scalable mode emulation
  2019-02-13  7:38     ` Yi Sun
@ 2019-02-13  8:03       ` Peter Xu
  2019-02-13  8:28         ` Peter Xu
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Xu @ 2019-02-13  8:03 UTC (permalink / raw)
  To: Yi Sun
  Cc: qemu-devel, pbonzini, rth, ehabkost, mst, marcel.apfelbaum,
	kevin.tian, yi.l.liu, yi.y.sun

On Wed, Feb 13, 2019 at 03:38:55PM +0800, Yi Sun wrote:

[...]

> > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > index 8b72735..396ac8e 100644
> > > --- a/hw/i386/intel_iommu.c
> > > +++ b/hw/i386/intel_iommu.c
> > > @@ -37,6 +37,34 @@
> > >  #include "kvm_i386.h"
> > >  #include "trace.h"
> > >  
> > > +#define vtd_devfn_check(devfn) ((devfn & VTD_DEVFN_CHECK_MASK) ? true : false)
> > 
> > "vtd_devfn_check(devfn)" is merely as long as "devfn &
> > VTD_DEVFN_CHECK_MASK", isn't it? :)
> > 
> > I would just drop the macro.
> > 
> There are two places to call this macro. Is that valuable to keep it?

The point here is "A ? true : false" is exactly "A" when used in
condition checks.  So IMHO it's clean enough to write:

  if (devfn & VTD_DEVFN_CHECK_MASK) {
    ...
  }

Comparing to:

  if (vtd_devfn_check(devfn)) {
    ...
  }

And imho actually the name "check"/"mask" is confusing itself
already...  So maybe even dropping both vtd_devfn_check() and
VTD_DEVFN_CHECK_MASK (note: there's nothing about any validity checks,
and it's not a mask at all!) and simply:

  if (devfn >= UINT8_MAX / 2)

That is even clearer to me, since it shows devfn being split into two halves.

[...]

> > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > index a321cc9..ff13ff27 100644
> > > --- a/include/hw/i386/intel_iommu.h
> > > +++ b/include/hw/i386/intel_iommu.h
> > > @@ -66,11 +66,12 @@ typedef struct VTDIOTLBEntry VTDIOTLBEntry;
> > >  typedef struct VTDBus VTDBus;
> > >  typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
> > >  typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
> > > +typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
> > > +typedef struct VTDPASIDEntry VTDPASIDEntry;
> > >  
> > >  /* Context-Entry */
> > >  struct VTDContextEntry {
> > > -    uint64_t lo;
> > > -    uint64_t hi;
> > > +    uint64_t val[4];
> > 
> > You can actually make it an enum, two benefits:
> > 
> Thanks for the suggestion! DYM 'union'?

Yes. :)

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [RFC v1 1/3] intel_iommu: scalable mode emulation
  2019-02-13  8:03       ` Peter Xu
@ 2019-02-13  8:28         ` Peter Xu
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Xu @ 2019-02-13  8:28 UTC (permalink / raw)
  To: Yi Sun
  Cc: qemu-devel, pbonzini, rth, ehabkost, mst, marcel.apfelbaum,
	kevin.tian, yi.l.liu, yi.y.sun

On Wed, Feb 13, 2019 at 04:03:05PM +0800, Peter Xu wrote:
> On Wed, Feb 13, 2019 at 03:38:55PM +0800, Yi Sun wrote:
> 
> [...]
> 
> > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > index 8b72735..396ac8e 100644
> > > > --- a/hw/i386/intel_iommu.c
> > > > +++ b/hw/i386/intel_iommu.c
> > > > @@ -37,6 +37,34 @@
> > > >  #include "kvm_i386.h"
> > > >  #include "trace.h"
> > > >  
> > > > +#define vtd_devfn_check(devfn) ((devfn & VTD_DEVFN_CHECK_MASK) ? true : false)
> > > 
> > > "vtd_devfn_check(devfn)" is merely as long as "devfn &
> > > VTD_DEVFN_CHECK_MASK", isn't it? :)
> > > 
> > > I would just drop the macro.
> > > 
> > There are two places to call this macro. Is that valuable to keep it?
> 
> The point here is "A ? true : false" is exactly "A" when used in
> condition checks.  So IMHO it's clean enough to write:
> 
>   if (devfn & VTD_DEVFN_CHECK_MASK) {
>     ...
>   }
> 
> Comparing to:
> 
>   if (vtd_devfn_check(devfn)) {
>     ...
>   }
> 
> And imho actually the name "check"/"mask" is confusing itself
> already...  So maybe even dropping both vtd_devfn_check() and
> VTD_DEVFN_CHECK_MASK (note: there's nothing about any validity checks,
> and it's not a mask at all!) and simply:
> 
>   if (devfn >= UINT8_MAX / 2)
> 
> That's even clearer to me that it's splitted into two halves.

Sorry, I didn't notice UINT8_MAX is 255 rather than 256, so it should
be ">"...  Anyway, this is a bit nitpicky; feel free to use your own
preference in the end.
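
In other words, either of these should do:

  if (devfn > UINT8_MAX / 2)
  if (devfn & 0x80)

both simply pick the upper half of the devfn space.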

> 
> [...]
> 
> > > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > > index a321cc9..ff13ff27 100644
> > > > --- a/include/hw/i386/intel_iommu.h
> > > > +++ b/include/hw/i386/intel_iommu.h
> > > > @@ -66,11 +66,12 @@ typedef struct VTDIOTLBEntry VTDIOTLBEntry;
> > > >  typedef struct VTDBus VTDBus;
> > > >  typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
> > > >  typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
> > > > +typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
> > > > +typedef struct VTDPASIDEntry VTDPASIDEntry;
> > > >  
> > > >  /* Context-Entry */
> > > >  struct VTDContextEntry {
> > > > -    uint64_t lo;
> > > > -    uint64_t hi;
> > > > +    uint64_t val[4];
> > > 
> > > You can actually make it an enum, two benefits:
> > > 
> > Thanks for the suggestion! DYM 'union'?
> 
> Yes. :)
> 
> Regards,
> 
> -- 
> Peter Xu

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [RFC v1 2/3] intel_iommu: add 256 bits qi_desc support
  2019-02-12  6:27   ` Peter Xu
@ 2019-02-13  9:00     ` Yi Sun
  2019-02-13 10:42       ` Peter Xu
  0 siblings, 1 reply; 26+ messages in thread
From: Yi Sun @ 2019-02-13  9:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, pbonzini, rth, ehabkost, mst, marcel.apfelbaum,
	kevin.tian, yi.l.liu, yi.y.sun

On 19-02-12 14:27:28, Peter Xu wrote:
> On Wed, Jan 30, 2019 at 01:09:12PM +0800, Yi Sun wrote:
> > From: "Liu, Yi L" <yi.l.liu@intel.com>
> > 
> > Per Intel(R) VT-d 3.0, the qi_desc is 256 bits in Scalable
> > Mode. This patch adds emulation of 256bits qi_desc.
> > 
> > [Yi Sun is co-developer to rebase and refine the patch.]
> > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> > Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> 
> Same here; I think you should have your s-o-b last. ;)
> 
Sure.

> > ---
> >  hw/i386/intel_iommu.c          | 182 +++++++++++++++++++++++++----------------
> >  hw/i386/intel_iommu_internal.h |   8 +-
> >  include/hw/i386/intel_iommu.h  |   1 +
> >  3 files changed, 116 insertions(+), 75 deletions(-)
> > 
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index 396ac8e..3664a00 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -38,6 +38,7 @@
> >  #include "trace.h"
> >  
> >  #define vtd_devfn_check(devfn) ((devfn & VTD_DEVFN_CHECK_MASK) ? true : false)
> > +#define vtd_ecap_smts(s) ((s)->ecap & VTD_ECAP_SMTS)
> 
> I'd prefer capital letters for macros.  Your call.
> 
Will change them.

> >  
> >  /* context entry operations */
> >  #define vtd_get_ce_size(s, ce) \
> > @@ -65,6 +66,9 @@
> >  #define vtd_pe_get_slpt_base(pe) ((pe)->val[0] & VTD_SM_PASID_ENTRY_SLPTPTR)
> >  #define vtd_pe_get_domain_id(pe) VTD_SM_PASID_ENTRY_DID((pe)->val[1])
> >  
> > +/* invalidation desc */
> > +#define vtd_get_inv_desc_width(s) ((s)->iq_dw ? 32 : 16)
> 
> Nit: I'll prefer dropping all the "get" wordings in these macros to be
> "vtd_inv_desc_width" since that "get" doesn't help much on
> understanding its meanings.  But it's personal preference too.
> 
That is fine.

> And since you've already have the iq_dw variable - why not store the
> width directly into it?  An uint8_t would suffice.
> 
iq_dw corresponds to VTD_IQA_DW_MASK (Descriptor Width defined in IQA
register). 1 means 256-bit descriptor, 0 means 128-bit descriptor.

It is also used in vtd_handle_gcmd_qie() and VTD_IQT_QT() by checking if
its value is 1.

So, I would prefer to keep the original design.

> > +
> >  static void vtd_address_space_refresh_all(IntelIOMMUState *s);
> >  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
> >  
> > @@ -1759,6 +1763,11 @@ static void vtd_root_table_setup(IntelIOMMUState *s)
> >      s->root_scalable = s->root & VTD_RTADDR_SMT;
> >      s->root &= VTD_RTADDR_ADDR_MASK(s->aw_bits);
> >  
> > +    /* if Scalable mode is not enabled, enforce iq_dw to be 16 byte */
> > +    if (!s->root_scalable) {
> > +        s->iq_dw = 0;
> 
> This is ok but I feel it a bit awkward to change IQ setup specifically
> in root table setup.  You can simply do this:
> 
> - in vtd_init(), always set iq_dw=16.  This will cover system resets
>   or IOMMU device reset, then
> 
Per above explanation, I can make iq_dw=0 in vtd_init().

> - only update iq_dw to 32 in vtd_mem_write() where guest specified
> 
I may add a check of s->root_scalable in vtd_mem_write().

> > +    }
> > +
> >      trace_vtd_reg_dmar_root(s->root, s->root_extended);
> >  }
> >  
> > @@ -2052,7 +2061,7 @@ static void vtd_handle_gcmd_qie(IntelIOMMUState *s, bool en)
> >      if (en) {
> >          s->iq = iqa_val & VTD_IQA_IQA_MASK(s->aw_bits);
> >          /* 2^(x+8) entries */
> > -        s->iq_size = 1UL << ((iqa_val & VTD_IQA_QS) + 8);
> > +        s->iq_size = 1UL << ((iqa_val & VTD_IQA_QS) + 8 - (s->iq_dw ? 1 : 0));
> >          s->qi_enabled = true;
> >          trace_vtd_inv_qi_setup(s->iq, s->iq_size);
> >          /* Ok - report back to driver */
> > @@ -2219,54 +2228,66 @@ static void vtd_handle_iotlb_write(IntelIOMMUState *s)
> >  }
> >  
> >  /* Fetch an Invalidation Descriptor from the Invalidation Queue */
> > -static bool vtd_get_inv_desc(dma_addr_t base_addr, uint32_t offset,
> > +static bool vtd_get_inv_desc(IntelIOMMUState *s,
> >                               VTDInvDesc *inv_desc)
> >  {
> > -    dma_addr_t addr = base_addr + offset * sizeof(*inv_desc);
> > -    if (dma_memory_read(&address_space_memory, addr, inv_desc,
> > -        sizeof(*inv_desc))) {
> > -        error_report_once("Read INV DESC failed");
> > -        inv_desc->lo = 0;
> > -        inv_desc->hi = 0;
> > +    dma_addr_t base_addr = s->iq;
> > +    uint32_t offset = s->iq_head;
> > +    uint32_t dw = vtd_get_inv_desc_width(s);
> > +    dma_addr_t addr = base_addr + offset * dw;
> > +
> > +    /* init */
> > +    inv_desc->val[0] = 0;
> > +    inv_desc->val[1] = 0;
> > +    inv_desc->val[2] = 0;
> > +    inv_desc->val[3] = 0;
> 
> No need?
> 
This is necessary. Per my test, the val[] are not 0 by default, which
makes a bug happen.

> > +
> > +    if (dma_memory_read(&address_space_memory, addr, inv_desc, dw)) {
> > +        error_report_once("Read INV DESC failed.");
> >          return false;
> >      }
> > -    inv_desc->lo = le64_to_cpu(inv_desc->lo);
> > -    inv_desc->hi = le64_to_cpu(inv_desc->hi);
> > +    inv_desc->val[0] = le64_to_cpu(inv_desc->val[0]);
> > +    inv_desc->val[1] = le64_to_cpu(inv_desc->val[1]);
> > +    inv_desc->val[2] = le64_to_cpu(inv_desc->val[2]);
> > +    inv_desc->val[3] = le64_to_cpu(inv_desc->val[3]);
> 
> Similar comments here on VTDInvDesc - you can keep the hi/lo fields,
> but instead you can define the val[4] into the union too.  Then:
> 
>   if (dw == 16) {
>     inv_desc->lo = ...
>     inv_desc->hi = ...
>   } else {
>     inv_desc->val[0] = ...
>     inv_desc->val[1] = ...
>     inv_desc->val[2] = ...
>     inv_desc->val[3] = ...
>   }
> 
> Then this patch can be greatly simplified too since you don't need to
> switch VTDInvDesc.{lo|hi} to val[0|1].
> 
Ok, I will consider it.

> [...]
> 
> > @@ -2525,7 +2558,7 @@ static void vtd_handle_iqt_write(IntelIOMMUState *s)
> >  {
> >      uint64_t val = vtd_get_quad_raw(s, DMAR_IQT_REG);
> >  
> > -    s->iq_tail = VTD_IQT_QT(val);
> > +    s->iq_tail = VTD_IQT_QT(s->iq_dw, val);
> 
> You can check against bit 4 when iq_dw==32 and report error otherwise.
> That's explicitly mentioned by the spec (chap 10.4.22).
> 
OK, will add this check. Thanks!

> >      trace_vtd_inv_qi_tail(s->iq_tail);
> >  
> >      if (s->qi_enabled && !(vtd_get_long_raw(s, DMAR_FSTS_REG) & VTD_FSTS_IQE)) {
> 
> [...]
> 
> 
> Regards,
> 
> -- 
> Peter Xu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [RFC v1 2/3] intel_iommu: add 256 bits qi_desc support
  2019-02-13  9:00     ` Yi Sun
@ 2019-02-13 10:42       ` Peter Xu
  2019-02-14  1:52         ` Yi Sun
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Xu @ 2019-02-13 10:42 UTC (permalink / raw)
  To: Yi Sun
  Cc: qemu-devel, pbonzini, rth, ehabkost, mst, marcel.apfelbaum,
	kevin.tian, yi.l.liu, yi.y.sun

On Wed, Feb 13, 2019 at 05:00:41PM +0800, Yi Sun wrote:

[...]

> > >  
> > >  /* context entry operations */
> > >  #define vtd_get_ce_size(s, ce) \
> > > @@ -65,6 +66,9 @@
> > >  #define vtd_pe_get_slpt_base(pe) ((pe)->val[0] & VTD_SM_PASID_ENTRY_SLPTPTR)
> > >  #define vtd_pe_get_domain_id(pe) VTD_SM_PASID_ENTRY_DID((pe)->val[1])
> > >  
> > > +/* invalidation desc */
> > > +#define vtd_get_inv_desc_width(s) ((s)->iq_dw ? 32 : 16)
> > 
> > Nit: I'll prefer dropping all the "get" wordings in these macros to be
> > "vtd_inv_desc_width" since that "get" doesn't help much on
> > understanding its meanings.  But it's personal preference too.
> > 
> That is fine.
> 
> > And since you've already have the iq_dw variable - why not store the
> > width directly into it?  An uint8_t would suffice.
> > 
> iq_dw corresponds to VTD_IQA_DW_MASK (Descriptor Width defined in IQA
> register). 1 means 256-bit descriptor, 0 means 128-bit descriptor.
> 
> It is also used in vtd_handle_gcmd_qie() and VTD_IQT_QT() by checking if
> its value is 1.
> 
> So, I would prefer to keep the original design.

It's ok.   But please make it a boolean.  Now iq_dw can be 0x800.
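
I.e. something like (sketch):

    /* in IntelIOMMUState */
    bool iq_dw;             /* whether 256-bit descriptors are in use */

    /* when the guest programs the IQA register */
    s->iq_dw = !!(iqa_val & VTD_IQA_DW_MASK);

The places that test it keep working, but it can never end up as 0x800.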

[...]

> > >  /* Fetch an Invalidation Descriptor from the Invalidation Queue */
> > > -static bool vtd_get_inv_desc(dma_addr_t base_addr, uint32_t offset,
> > > +static bool vtd_get_inv_desc(IntelIOMMUState *s,
> > >                               VTDInvDesc *inv_desc)
> > >  {
> > > -    dma_addr_t addr = base_addr + offset * sizeof(*inv_desc);
> > > -    if (dma_memory_read(&address_space_memory, addr, inv_desc,
> > > -        sizeof(*inv_desc))) {
> > > -        error_report_once("Read INV DESC failed");
> > > -        inv_desc->lo = 0;
> > > -        inv_desc->hi = 0;
> > > +    dma_addr_t base_addr = s->iq;
> > > +    uint32_t offset = s->iq_head;
> > > +    uint32_t dw = vtd_get_inv_desc_width(s);
> > > +    dma_addr_t addr = base_addr + offset * dw;
> > > +
> > > +    /* init */
> > > +    inv_desc->val[0] = 0;
> > > +    inv_desc->val[1] = 0;
> > > +    inv_desc->val[2] = 0;
> > > +    inv_desc->val[3] = 0;
> > 
> > No need?
> > 
> This is necessary. Per my test, the val[] are not 0 by default.

I agree, it's a stack variable. However...

> That makes bug happen.

... could you explain the bug?

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [RFC v1 2/3] intel_iommu: add 256 bits qi_desc support
  2019-02-13 10:42       ` Peter Xu
@ 2019-02-14  1:52         ` Yi Sun
  2019-02-14  3:24           ` Peter Xu
  0 siblings, 1 reply; 26+ messages in thread
From: Yi Sun @ 2019-02-14  1:52 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, pbonzini, rth, ehabkost, mst, marcel.apfelbaum,
	kevin.tian, yi.l.liu, yi.y.sun

On 19-02-13 18:42:24, Peter Xu wrote:
> On Wed, Feb 13, 2019 at 05:00:41PM +0800, Yi Sun wrote:
> 
> [...]
> 
> > > >  
> > > >  /* context entry operations */
> > > >  #define vtd_get_ce_size(s, ce) \
> > > > @@ -65,6 +66,9 @@
> > > >  #define vtd_pe_get_slpt_base(pe) ((pe)->val[0] & VTD_SM_PASID_ENTRY_SLPTPTR)
> > > >  #define vtd_pe_get_domain_id(pe) VTD_SM_PASID_ENTRY_DID((pe)->val[1])
> > > >  
> > > > +/* invalidation desc */
> > > > +#define vtd_get_inv_desc_width(s) ((s)->iq_dw ? 32 : 16)
> > > 
> > > Nit: I'll prefer dropping all the "get" wordings in these macros to be
> > > "vtd_inv_desc_width" since that "get" doesn't help much on
> > > understanding its meanings.  But it's personal preference too.
> > > 
> > That is fine.
> > 
> > > And since you've already have the iq_dw variable - why not store the
> > > width directly into it?  An uint8_t would suffice.
> > > 
> > iq_dw corresponds to VTD_IQA_DW_MASK (Descriptor Width defined in IQA
> > register). 1 means 256-bit descriptor, 0 means 128-bit descriptor.
> > 
> > It is also used in vtd_handle_gcmd_qie() and VTD_IQT_QT() by checking if
> > its value is 1.
> > 
> > So, I would prefer to keep the original design.
> 
> It's ok.   But please make it a boolean.  Now iq_dw can be 0x800.
> 
Sure.

> [...]
> 
> > > >  /* Fetch an Invalidation Descriptor from the Invalidation Queue */
> > > > -static bool vtd_get_inv_desc(dma_addr_t base_addr, uint32_t offset,
> > > > +static bool vtd_get_inv_desc(IntelIOMMUState *s,
> > > >                               VTDInvDesc *inv_desc)
> > > >  {
> > > > -    dma_addr_t addr = base_addr + offset * sizeof(*inv_desc);
> > > > -    if (dma_memory_read(&address_space_memory, addr, inv_desc,
> > > > -        sizeof(*inv_desc))) {
> > > > -        error_report_once("Read INV DESC failed");
> > > > -        inv_desc->lo = 0;
> > > > -        inv_desc->hi = 0;
> > > > +    dma_addr_t base_addr = s->iq;
> > > > +    uint32_t offset = s->iq_head;
> > > > +    uint32_t dw = vtd_get_inv_desc_width(s);
> > > > +    dma_addr_t addr = base_addr + offset * dw;
> > > > +
> > > > +    /* init */
> > > > +    inv_desc->val[0] = 0;
> > > > +    inv_desc->val[1] = 0;
> > > > +    inv_desc->val[2] = 0;
> > > > +    inv_desc->val[3] = 0;
> > > 
> > > No need?
> > > 
> > This is necessary. Per my test, the val[] are not 0 by default.
> 
> I agree, it's a stack variable. However...
> 
> > That makes bug happen.
> 
> ... could you explain the bug?
> 
Below error can be observed.

qemu-system-x86_64: vtd_process_inv_desc: invalid inv desc: val[3]=10, val[2]=0 (detect reserve non-zero)

> Regards,
> 
> -- 
> Peter Xu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [RFC v1 2/3] intel_iommu: add 256 bits qi_desc support
  2019-02-14  1:52         ` Yi Sun
@ 2019-02-14  3:24           ` Peter Xu
  2019-02-14  6:27             ` Yi Sun
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Xu @ 2019-02-14  3:24 UTC (permalink / raw)
  To: Yi Sun
  Cc: qemu-devel, pbonzini, rth, ehabkost, mst, marcel.apfelbaum,
	kevin.tian, yi.l.liu, yi.y.sun

On Thu, Feb 14, 2019 at 09:52:04AM +0800, Yi Sun wrote:

[...]

> > > > >  /* Fetch an Invalidation Descriptor from the Invalidation Queue */
> > > > > -static bool vtd_get_inv_desc(dma_addr_t base_addr, uint32_t offset,
> > > > > +static bool vtd_get_inv_desc(IntelIOMMUState *s,
> > > > >                               VTDInvDesc *inv_desc)
> > > > >  {
> > > > > -    dma_addr_t addr = base_addr + offset * sizeof(*inv_desc);
> > > > > -    if (dma_memory_read(&address_space_memory, addr, inv_desc,
> > > > > -        sizeof(*inv_desc))) {
> > > > > -        error_report_once("Read INV DESC failed");
> > > > > -        inv_desc->lo = 0;
> > > > > -        inv_desc->hi = 0;
> > > > > +    dma_addr_t base_addr = s->iq;
> > > > > +    uint32_t offset = s->iq_head;
> > > > > +    uint32_t dw = vtd_get_inv_desc_width(s);
> > > > > +    dma_addr_t addr = base_addr + offset * dw;
> > > > > +
> > > > > +    /* init */
> > > > > +    inv_desc->val[0] = 0;
> > > > > +    inv_desc->val[1] = 0;
> > > > > +    inv_desc->val[2] = 0;
> > > > > +    inv_desc->val[3] = 0;
> > > > 
> > > > No need?
> > > > 
> > > This is necessary. Per my test, the val[] are not 0 by default.
> > 
> > I agree, it's a stack variable. However...
> > 
> > > That makes bug happen.
> > 
> > ... could you explain the bug?
> > 
> Below error can be observed.
> 
> qemu-system-x86_64: vtd_process_inv_desc: invalid inv desc: val[3]=10, val[2]=0 (detect reserve non-zero)

Ok so you're checking val[2] & val[3] unconditionally:

    if (inv_desc.val[3] || inv_desc.val[2]) {
        error_report_once("%s: invalid inv desc: val[3]=%"PRIx64
                          ", val[2]=%"PRIx64
                          " (detect reserve non-zero)", __func__,
                          inv_desc.val[3],
                          inv_desc.val[2]);
        return false;
    }

Why?  Shouldn't they be invalid if the inv desc width is 128 bits?

When a 256-bit invalidation descriptor is used, the guest driver
should be responsible for filling in zeros in the reserved fields.

Another question: are val[2] & val[3] used anywhere even in 256-bit
mode?  From what I see in the spec (chap 6.5.2), all of them seem to
be reserved as zeros, so I don't understand why bother extending this
to 256 bits...  Did I miss something?

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [RFC v1 2/3] intel_iommu: add 256 bits qi_desc support
  2019-02-14  3:24           ` Peter Xu
@ 2019-02-14  6:27             ` Yi Sun
  2019-02-14  7:13               ` Peter Xu
  0 siblings, 1 reply; 26+ messages in thread
From: Yi Sun @ 2019-02-14  6:27 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, pbonzini, rth, ehabkost, mst, marcel.apfelbaum,
	kevin.tian, yi.l.liu, yi.y.sun

On 19-02-14 11:24:35, Peter Xu wrote:
> On Thu, Feb 14, 2019 at 09:52:04AM +0800, Yi Sun wrote:
> 
> [...]
> 
> > > > > >  /* Fetch an Invalidation Descriptor from the Invalidation Queue */
> > > > > > -static bool vtd_get_inv_desc(dma_addr_t base_addr, uint32_t offset,
> > > > > > +static bool vtd_get_inv_desc(IntelIOMMUState *s,
> > > > > >                               VTDInvDesc *inv_desc)
> > > > > >  {
> > > > > > -    dma_addr_t addr = base_addr + offset * sizeof(*inv_desc);
> > > > > > -    if (dma_memory_read(&address_space_memory, addr, inv_desc,
> > > > > > -        sizeof(*inv_desc))) {
> > > > > > -        error_report_once("Read INV DESC failed");
> > > > > > -        inv_desc->lo = 0;
> > > > > > -        inv_desc->hi = 0;
> > > > > > +    dma_addr_t base_addr = s->iq;
> > > > > > +    uint32_t offset = s->iq_head;
> > > > > > +    uint32_t dw = vtd_get_inv_desc_width(s);
> > > > > > +    dma_addr_t addr = base_addr + offset * dw;
> > > > > > +
> > > > > > +    /* init */
> > > > > > +    inv_desc->val[0] = 0;
> > > > > > +    inv_desc->val[1] = 0;
> > > > > > +    inv_desc->val[2] = 0;
> > > > > > +    inv_desc->val[3] = 0;
> > > > > 
> > > > > No need?
> > > > > 
> > > > This is necessary. Per my test, the val[] are not 0 by default.
> > > 
> > > I agree, it's a stack variable. However...
> > > 
> > > > That makes bug happen.
> > > 
> > > ... could you explain the bug?
> > > 
> > Below error can be observed.
> > 
> > qemu-system-x86_64: vtd_process_inv_desc: invalid inv desc: val[3]=10, val[2]=0 (detect reserve non-zero)
> 
> Ok so you're checking val[2] & val[3] unconditionally:
> 
>     if (inv_desc.val[3] || inv_desc.val[2]) {
>         error_report_once("%s: invalid inv desc: val[3]=%"PRIx64
>                           ", val[2]=%"PRIx64
>                           " (detect reserve non-zero)", __func__,
>                           inv_desc.val[3],
>                           inv_desc.val[2]);
>         return false;
>     }
> 
> Why?  Shouldn't they invalid if inv desc width is 128bits?
> 
Here, my intention is to simplify the code. If inv_desc.val[] has been
initialized to 0, the error will not happen no matter whether the
descriptor is 128 bits or 256 bits.

> When 256 bits invalidation descriptor is used, the guest driver
> should be responsible to fill in zeros into reserved fields.
> 
> Another question: is val[2] & val[3] used in any place even with
> 256bits mode?  From what I see from the spec (chap 6.5.2), all of them
> seems to be reserved as zeros, then I don't understand why bother
> extending this to 256bits...  Did I miss something?
> 
Although the high 128 bits are not used now, they are defined by the
spec, so we should handle them. At least we can know if there is an
error by checking that the high 128 bits are 0. Furthermore, handling
them in the code now can avoid some changes in the future.

> Regards,
> 
> -- 
> Peter Xu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [RFC v1 2/3] intel_iommu: add 256 bits qi_desc support
  2019-02-14  6:27             ` Yi Sun
@ 2019-02-14  7:13               ` Peter Xu
  2019-02-14  7:35                 ` Tian, Kevin
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Xu @ 2019-02-14  7:13 UTC (permalink / raw)
  To: Yi Sun
  Cc: qemu-devel, pbonzini, rth, ehabkost, mst, marcel.apfelbaum,
	kevin.tian, yi.l.liu, yi.y.sun

On Thu, Feb 14, 2019 at 02:27:30PM +0800, Yi Sun wrote:
> On 19-02-14 11:24:35, Peter Xu wrote:
> > On Thu, Feb 14, 2019 at 09:52:04AM +0800, Yi Sun wrote:
> > 
> > [...]
> > 
> > > > > > >  /* Fetch an Invalidation Descriptor from the Invalidation Queue */
> > > > > > > -static bool vtd_get_inv_desc(dma_addr_t base_addr, uint32_t offset,
> > > > > > > +static bool vtd_get_inv_desc(IntelIOMMUState *s,
> > > > > > >                               VTDInvDesc *inv_desc)
> > > > > > >  {
> > > > > > > -    dma_addr_t addr = base_addr + offset * sizeof(*inv_desc);
> > > > > > > -    if (dma_memory_read(&address_space_memory, addr, inv_desc,
> > > > > > > -        sizeof(*inv_desc))) {
> > > > > > > -        error_report_once("Read INV DESC failed");
> > > > > > > -        inv_desc->lo = 0;
> > > > > > > -        inv_desc->hi = 0;
> > > > > > > +    dma_addr_t base_addr = s->iq;
> > > > > > > +    uint32_t offset = s->iq_head;
> > > > > > > +    uint32_t dw = vtd_get_inv_desc_width(s);
> > > > > > > +    dma_addr_t addr = base_addr + offset * dw;
> > > > > > > +
> > > > > > > +    /* init */
> > > > > > > +    inv_desc->val[0] = 0;
> > > > > > > +    inv_desc->val[1] = 0;
> > > > > > > +    inv_desc->val[2] = 0;
> > > > > > > +    inv_desc->val[3] = 0;
> > > > > > 
> > > > > > No need?
> > > > > > 
> > > > > This is necessary. Per my test, the val[] are not 0 by default.
> > > > 
> > > > I agree, it's a stack variable. However...
> > > > 
> > > > > That causes a bug.
> > > > 
> > > > ... could you explain the bug?
> > > > 
> > > The error below can be observed.
> > > 
> > > qemu-system-x86_64: vtd_process_inv_desc: invalid inv desc: val[3]=10, val[2]=0 (detect reserve non-zero)
> > 
> > Ok so you're checking val[2] & val[3] unconditionally:
> > 
> >     if (inv_desc.val[3] || inv_desc.val[2]) {
> >         error_report_once("%s: invalid inv desc: val[3]=%"PRIx64
> >                           ", val[2]=%"PRIx64
> >                           " (detect reserve non-zero)", __func__,
> >                           inv_desc.val[3],
> >                           inv_desc.val[2]);
> >         return false;
> >     }
> > 
> > Why?  Shouldn't they be invalid if the inv desc width is 128 bits?
> > 
> Here, my intention is to simplify the code. If inv_desc.val[] has been
> initialized to 0, the error will not happen regardless of whether the
> descriptors are 128 bits or 256 bits wide.

You set something to zero and then you check it against zero... IMHO
it doesn't simplify the code but instead it is more confusing.  I
would prefer to only check when necessary, then drop the inits of the
stack variables.
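
To illustrate, a minimal sketch of that alternative (not the posted patch):
keep the reserved-bit check but gate it on the descriptor width, so the
stack-variable inits can simply go away.  Type and helper names are taken
from the quoted diff; treating vtd_get_inv_desc_width() as returning a byte
count (16 vs 32) is my reading of the "offset * dw" arithmetic above.

    /* Sketch only, not the posted patch */
    static bool vtd_inv_desc_reserved_ok(IntelIOMMUState *s,
                                         const VTDInvDesc *inv_desc)
    {
        /* 128-bit descriptors have no val[2]/val[3] to check */
        if (vtd_get_inv_desc_width(s) != 32) {
            return true;
        }
        if (inv_desc->val[3] || inv_desc->val[2]) {
            error_report_once("%s: invalid inv desc: val[3]=%"PRIx64
                              ", val[2]=%"PRIx64
                              " (detect reserve non-zero)", __func__,
                              inv_desc->val[3], inv_desc->val[2]);
            return false;
        }
        return true;
    }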

> 
> > When 256 bits invalidation descriptor is used, the guest driver
> > should be responsible to fill in zeros into reserved fields.
> > 
> > Another question: is val[2] & val[3] used in any place even with
> > 256bits mode?  From what I see from the spec (chap 6.5.2), all of them
> > seems to be reserved as zeros, then I don't understand why bother
> > extending this to 256bits...  Did I miss something?
> > 
> Although the high 128 bits are not used now, they are defined by the spec,
> so we should handle them. At least we can tell whether there is an error
> by checking that the high 128 bits are 0. Furthermore, handling them in the
> code now avoids further changes in the future.

Yes, I understand that you should implement the code by following the spec.
My question was about the spec, not the code.  Please feel free to skip
this question if you don't know the answer.

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [RFC v1 2/3] intel_iommu: add 256 bits qi_desc support
  2019-02-14  7:13               ` Peter Xu
@ 2019-02-14  7:35                 ` Tian, Kevin
  2019-02-14  8:13                   ` Peter Xu
  0 siblings, 1 reply; 26+ messages in thread
From: Tian, Kevin @ 2019-02-14  7:35 UTC (permalink / raw)
  To: Peter Xu, Yi Sun
  Cc: qemu-devel, pbonzini, rth, ehabkost, mst, marcel.apfelbaum, Liu,
	Yi L, Sun, Yi Y

> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Thursday, February 14, 2019 3:14 PM
> 
> >
> > > When 256 bits invalidation descriptor is used, the guest driver
> > > should be responsible to fill in zeros into reserved fields.
> > >
> > > Another question: is val[2] & val[3] used in any place even with
> > > 256bits mode?  From what I see from the spec (chap 6.5.2), all of them
> > > seems to be reserved as zeros, then I don't understand why bother
> > > extending this to 256bits...  Did I miss something?
> > >

PRQ is extended to carry larger private data which requires 256bits
mode. You can take a look at 7.5.1.1 Page Request Descriptor.

Thanks
Kevin


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [RFC v1 2/3] intel_iommu: add 256 bits qi_desc support
  2019-02-14  7:35                 ` Tian, Kevin
@ 2019-02-14  8:13                   ` Peter Xu
  2019-02-14  8:22                     ` Tian, Kevin
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Xu @ 2019-02-14  8:13 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Yi Sun, qemu-devel, pbonzini, rth, ehabkost, mst,
	marcel.apfelbaum, Liu, Yi L, Sun, Yi Y

On Thu, Feb 14, 2019 at 07:35:20AM +0000, Tian, Kevin wrote:
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Thursday, February 14, 2019 3:14 PM
> > 
> > >
> > > > When 256 bits invalidation descriptor is used, the guest driver
> > > > should be responsible to fill in zeros into reserved fields.
> > > >
> > > > Another question: is val[2] & val[3] used in any place even with
> > > > 256bits mode?  From what I see from the spec (chap 6.5.2), all of them
> > > > seems to be reserved as zeros, then I don't understand why bother
> > > > extending this to 256bits...  Did I miss something?
> > > >
> 
> PRQ is extended to carry larger private data which requires 256bits
> mode. You can take a look at 7.5.1.1 Page Request Descriptor.

But we are talking about IQ (Invalidation Queue), not PRQ, right?

I see that Page Request Queue seems to always have 256bits descriptors
(chap 10.4.32, there is no Descriptor Width field, and I think it
means DW==1 always), however the Invalidation Queue seems to support
both modes (chap 10.4.23, there is Descriptor Width field, DW==0
should be the legacy mode, and DW==1 should be the new mode).  However,
none of the invalidation descriptors described in chap 6.5.2 uses
the upper 128 bits even when 256-bit mode is supported.
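
As a side note, the DW plumbing itself can be tiny.  A rough sketch (not the
posted patch) of how the descriptor width could be latched from IQA_REG,
assuming an iq_dw flag cached in IntelIOMMUState; the DW bit position
(bit 11) is my reading of chap 10.4.23, so please double-check it against
the spec:

    /* Sketch only */
    #define VTD_IQA_DW_MASK     (1ULL << 11)    /* assumed DW bit, 10.4.23 */

    /* Called when the guest programs IQA_REG */
    static void vtd_update_iq_dw(IntelIOMMUState *s, uint64_t iqa_val)
    {
        s->iq_dw = !!(iqa_val & VTD_IQA_DW_MASK);  /* 0: 128-bit, 1: 256-bit */
    }

    static uint32_t vtd_get_inv_desc_width(IntelIOMMUState *s)
    {
        return s->iq_dw ? 32 : 16;      /* bytes per descriptor */
    }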

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [RFC v1 2/3] intel_iommu: add 256 bits qi_desc support
  2019-02-14  8:13                   ` Peter Xu
@ 2019-02-14  8:22                     ` Tian, Kevin
  2019-02-14  8:43                       ` Peter Xu
  0 siblings, 1 reply; 26+ messages in thread
From: Tian, Kevin @ 2019-02-14  8:22 UTC (permalink / raw)
  To: Peter Xu
  Cc: Yi Sun, qemu-devel, pbonzini, rth, ehabkost, mst,
	marcel.apfelbaum, Liu, Yi L, Sun, Yi Y

> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Thursday, February 14, 2019 4:13 PM
> 
> On Thu, Feb 14, 2019 at 07:35:20AM +0000, Tian, Kevin wrote:
> > > From: Peter Xu [mailto:peterx@redhat.com]
> > > Sent: Thursday, February 14, 2019 3:14 PM
> > >
> > > >
> > > > > When 256 bits invalidation descriptor is used, the guest driver
> > > > > should be responsible to fill in zeros into reserved fields.
> > > > >
> > > > > Another question: is val[2] & val[3] used in any place even with
> > > > > 256bits mode?  From what I see from the spec (chap 6.5.2), all of
> them
> > > > > seems to be reserved as zeros, then I don't understand why bother
> > > > > extending this to 256bits...  Did I miss something?
> > > > >
> >
> > PRQ is extended to carry larger private data which requires 256bits
> > mode. You can take a look at 7.5.1.1 Page Request Descriptor.
> 
> But we are talking about IQ (Invalidation Queue), not PRQ, right?
> 
> I see that Page Request Queue seems to always have 256bits descriptors
> (chap 10.4.32, there is no Descriptor Width field, and I think it
> means DW==1 always), however the Invalidation Queue seems to support
> both modes (chap 10.4.23, there is Descriptor Width field, DW==0
> should be the legacy mode, and DW==1 should be the new mode).  However,
> none of the invalidation descriptors described in chap 6.5.2 uses
> the upper 128 bits even when 256-bit mode is supported.
> 

The Page Group Response descriptor is composed by software through the
invalidation queue, and it needs to carry the same private data back
(as in the page request descriptor). It is described in 7.7.1.
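
Schematically (layout simplified, not the spec's exact bit positions - see
7.7.1 for those), that is why the full 256-bit descriptor matters for the
IQ: the upper quadwords are where the private data from the page request
gets echoed back:

    /* Illustration only; mirrors the val[4] layout from the patch */
    union VTDInvDesc256 {
        struct {
            uint64_t lo;        /* type, PASID, response code, ...        */
            uint64_t hi;
            uint64_t prv[2];    /* private data echoed from the page req. */
        };
        uint64_t val[4];
    };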

Thanks
Kevin

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [RFC v1 2/3] intel_iommu: add 256 bits qi_desc support
  2019-02-14  8:22                     ` Tian, Kevin
@ 2019-02-14  8:43                       ` Peter Xu
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Xu @ 2019-02-14  8:43 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Yi Sun, qemu-devel, pbonzini, rth, ehabkost, mst,
	marcel.apfelbaum, Liu, Yi L, Sun, Yi Y

On Thu, Feb 14, 2019 at 08:22:05AM +0000, Tian, Kevin wrote:
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Thursday, February 14, 2019 4:13 PM
> > 
> > On Thu, Feb 14, 2019 at 07:35:20AM +0000, Tian, Kevin wrote:
> > > > From: Peter Xu [mailto:peterx@redhat.com]
> > > > Sent: Thursday, February 14, 2019 3:14 PM
> > > >
> > > > >
> > > > > > When 256 bits invalidation descriptor is used, the guest driver
> > > > > > should be responsible to fill in zeros into reserved fields.
> > > > > >
> > > > > > Another question: is val[2] & val[3] used in any place even with
> > > > > > 256bits mode?  From what I see from the spec (chap 6.5.2), all of
> > them
> > > > > > seems to be reserved as zeros, then I don't understand why bother
> > > > > > extending this to 256bits...  Did I miss something?
> > > > > >
> > >
> > > PRQ is extended to carry larger private data which requires 256bits
> > > mode. You can take a look at 7.5.1.1 Page Request Descriptor.
> > 
> > But we are talking about IQ (Invalidation Queue), not PRQ, right?
> > 
> > I see that Page Request Queue seems to always have 256bits descriptors
> > (chap 10.4.32, there is no Descriptor Width field, and I think it
> > means DW==1 always), however the Invalidation Queue seems to support
> > both modes (chap 10.4.23, there is Descriptor Width field, DW==0
> > should be the legacy mode, and DW==1 should be the new mode).  However,
> > none of the invalidation descriptors described in chap 6.5.2 uses
> > the upper 128 bits even when 256-bit mode is supported.
> > 
> 
> The Page Group Response descriptor is composed by software through the
> invalidation queue, and it needs to carry the same private data back
> (as in the page request descriptor). It is described in 7.7.1.

I see the point.  Thanks, Kevin.

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [RFC v1 3/3] intel_iommu: add scalable-mode option to make scalable mode work
  2019-02-12  6:46   ` Peter Xu
@ 2019-02-15  5:22     ` Yi Sun
  2019-02-15  5:39       ` Peter Xu
  0 siblings, 1 reply; 26+ messages in thread
From: Yi Sun @ 2019-02-15  5:22 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, pbonzini, rth, ehabkost, mst, marcel.apfelbaum,
	kevin.tian, yi.l.liu, yi.y.sun

On 19-02-12 14:46:29, Peter Xu wrote:

[...]

> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index 3664a00..447fdf3 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -2492,6 +2492,17 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
> >          }
> >          break;
> >  
> > +    /*
> > +     * TODO: the entity of below two cases will be implemented in future series.
> > +     * To make guest (which integrates scalable mode support patch set in
> > +     * iommu driver) work, just return true is enough so far.
> 
> When you repost, could you mention about what tests have you done?  I
> can think of some general usages:
> 
> Device matrix:
> 
> - virtio-net, vhost=on|off
> - device assignment
> 
> Configuration matrix:
> 
> - legacy/scalable
> - passthrough=on|off
> 
> I believe smoke test (like netperf a few seconds, or even ping) would
> be enough, cover some (or all, which would be very nice to have :) of
> above scenarios.
> 
Thanks for the test cases! I have covered some of them but I think I
can do more tests.

> > +     */
> > +    case VTD_INV_DESC_PC:
> > +        break;
> > +
> > +    case VTD_INV_DESC_PIOTLB:
> > +        break;
> > +
> >      case VTD_INV_DESC_WAIT:
> >          trace_vtd_inv_desc("wait", inv_desc.val[1], inv_desc.val[0]);
> >          if (!vtd_process_wait_desc(s, &inv_desc)) {
> > @@ -3051,6 +3062,7 @@ static Property vtd_properties[] = {
> >      DEFINE_PROP_UINT8("aw-bits", IntelIOMMUState, aw_bits,
> >                        VTD_HOST_ADDRESS_WIDTH),
> >      DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
> > +    DEFINE_PROP_BOOL("scalable-mode", IntelIOMMUState, scalable_mode, FALSE),
> 
> Let's start with x-scalable-mode?  Less burden for all.
> 
Sure.

> >      DEFINE_PROP_BOOL("dma-drain", IntelIOMMUState, dma_drain, true),
> >      DEFINE_PROP_END_OF_LIST(),
> >  };
> > @@ -3583,6 +3595,16 @@ static void vtd_init(IntelIOMMUState *s)
> >          s->cap |= VTD_CAP_CM;
> >      }
> >  
> > +    /* TODO: read cap/ecap from host to decide which cap to be exposed. */
> > +    if (s->scalable_mode) {
> > +        if (!s->caching_mode) {
> > +            error_report("Need to set caching-mode for scalable mode");
> 
> Could I ask why?
> 
My intention is to make the guest explicitly send invalidations so that
the SLT (second level table) shadowing is kept correct.

On this point, I also have a question. Why does legacy mode not check CM?
If CM is not set, may the DMA remapping go wrong because the SLT cannot
match the guest's latest changes?

> > +            exit(1);
> > +        }
> > +        s->cap |= VTD_CAP_DWD | VTD_CAP_DRD;
> 
> Draining is supported now so this line is not needed.
> 
Got it, thanks!

> > +        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
> > +    }
> > +
> >      vtd_reset_caches(s);
> >  
> >      /* Define registers with default values and bit semantics */
> > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> > index 2a753c5..b01953a 100644
> > --- a/hw/i386/intel_iommu_internal.h
> > +++ b/hw/i386/intel_iommu_internal.h
> > @@ -190,7 +190,9 @@
> >  #define VTD_ECAP_EIM                (1ULL << 4)
> >  #define VTD_ECAP_PT                 (1ULL << 6)
> >  #define VTD_ECAP_MHMV               (15ULL << 20)
> > +#define VTD_ECAP_SRS                (1ULL << 31)
> >  #define VTD_ECAP_SMTS               (1ULL << 43)
> > +#define VTD_ECAP_SLTS               (1ULL << 46)
> >  
> >  /* CAP_REG */
> >  /* (offset >> 4) << 24 */
> > @@ -209,6 +211,8 @@
> >  #define VTD_CAP_DRAIN_READ          (1ULL << 55)
> >  #define VTD_CAP_DRAIN               (VTD_CAP_DRAIN_READ | VTD_CAP_DRAIN_WRITE)
> >  #define VTD_CAP_CM                  (1ULL << 7)
> > +#define VTD_CAP_DWD                 (1ULL << 54)
> > +#define VTD_CAP_DRD                 (1ULL << 55)
> 
> These can be dropped too (see VTD_CAP_DRAIN above).
> 
Thanks!

[...]

> > -- 
> > 1.9.1
> > 
> 
> Regards,
> 
> -- 
> Peter Xu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [RFC v1 3/3] intel_iommu: add scalable-mode option to make scalable mode work
  2019-02-15  5:22     ` Yi Sun
@ 2019-02-15  5:39       ` Peter Xu
  2019-02-15  7:44         ` Yi Sun
  2019-02-15  8:22         ` Jason Wang
  0 siblings, 2 replies; 26+ messages in thread
From: Peter Xu @ 2019-02-15  5:39 UTC (permalink / raw)
  To: Yi Sun
  Cc: qemu-devel, pbonzini, rth, ehabkost, mst, marcel.apfelbaum,
	kevin.tian, yi.l.liu, yi.y.sun

On Fri, Feb 15, 2019 at 01:22:34PM +0800, Yi Sun wrote:

[...]

> > > +    /* TODO: read cap/ecap from host to decide which cap to be exposed. */
> > > +    if (s->scalable_mode) {
> > > +        if (!s->caching_mode) {
> > > +            error_report("Need to set caching-mode for scalable mode");
> > 
> > Could I ask why?
> > 
> My intention is to make the guest explicitly send invalidations so that
> the SLT (second level table) shadowing is kept correct.
> 
> On this point, I also have a question. Why does legacy mode not check CM?
> If CM is not set, may the DMA remapping go wrong because the SLT cannot
> match the guest's latest changes?

Because CM is currently only required for device assignment.  For
example, if you only have an emulated device like virtio-net-pci in
the guest, then you don't need the CM capability to let it work with
the vIOMMU.  That's because we'll walk the 2nd level IOMMU page table
only when the virtio-net-pci device does DMA, and QEMU can easily do
that (QEMU is emulating the DMA of the virtio-net-pci device, and QEMU
has the knowledge of the guest 2nd level IOMMU page tables).  Assigned
devices are special because the host hardware knows nothing about
guest 2nd level page tables, so QEMU needs to shadow them before DMA
starts.

That's also why I listed "device assignment" as a special case in the
test device matrix, because that's the only case where we can torture
the IOMMU page shadowing code a bit.

For the scalable mode I would suppose you will still allow it to work
without caching mode.  The example is the same as above - when there's
only emulated devices in the guest.
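
If it helps, here is a toy model (deliberately not QEMU code) of the
difference.  With an emulated device QEMU walks the guest table at DMA time;
with an assigned device only the shadow copy is visible to real hardware, so
the guest must send an invalidation on every update - which is what CM
guarantees, including for not-present-to-present changes:

    /* Toy model, illustration only */
    #include <stdint.h>

    #define ENTRIES 16
    static uint64_t guest_slt[ENTRIES];   /* guest second-level table   */
    static uint64_t shadow_slt[ENTRIES];  /* copy used by host hardware */

    /* Emulated device: lazy walk of the guest table at access time */
    static uint64_t emulated_translate(unsigned iova_pfn)
    {
        return guest_slt[iova_pfn];
    }

    /* Assigned device: hardware only sees the shadow, which is refreshed
     * when the guest sends an invalidation (forced by CM even when a
     * not-present entry becomes present) */
    static void on_guest_invalidation(unsigned iova_pfn)
    {
        shadow_slt[iova_pfn] = guest_slt[iova_pfn];
    }

    static uint64_t assigned_translate(unsigned iova_pfn)
    {
        return shadow_slt[iova_pfn];
    }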

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [RFC v1 3/3] intel_iommu: add scalable-mode option to make scalable mode work
  2019-02-15  5:39       ` Peter Xu
@ 2019-02-15  7:44         ` Yi Sun
  2019-02-15  8:22         ` Jason Wang
  1 sibling, 0 replies; 26+ messages in thread
From: Yi Sun @ 2019-02-15  7:44 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, pbonzini, rth, ehabkost, mst, marcel.apfelbaum,
	kevin.tian, yi.l.liu, yi.y.sun

On 19-02-15 13:39:05, Peter Xu wrote:
> On Fri, Feb 15, 2019 at 01:22:34PM +0800, Yi Sun wrote:
> 
> [...]
> 
> > > > +    /* TODO: read cap/ecap from host to decide which cap to be exposed. */
> > > > +    if (s->scalable_mode) {
> > > > +        if (!s->caching_mode) {
> > > > +            error_report("Need to set caching-mode for scalable mode");
> > > 
> > > Could I ask why?
> > > 
> > My intention is to make the guest explicitly send invalidations so that
> > the SLT (second level table) shadowing is kept correct.
> > 
> > On this point, I also have a question. Why does legacy mode not check CM?
> > If CM is not set, may the DMA remapping go wrong because the SLT cannot
> > match the guest's latest changes?
> 
> Because CM is currently only required for device assignment.  For
> example, if you only have an emulated device like virtio-net-pci in
> the guest, then you don't need the CM capability to let it work with
> the vIOMMU.  That's because we'll walk the 2nd level IOMMU page table
> only when the virtio-net-pci device does DMA, and QEMU can easily do
> that (QEMU is emulating the DMA of the virtio-net-pci device, and QEMU
> has the knowledge of the guest 2nd level IOMMU page tables).  Assigned
> devices are special because the host hardware knows nothing about
> guest 2nd level page tables, so QEMU needs to shadow them before DMA
> starts.
> 
Thanks for the explanation! As CM is only a must for device assignment, I may
remove the check in the current code. In fact, CM is optional for SM.
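
So the vtd_init() hunk would shrink to roughly the following (based on the
quoted hunk, with the caching-mode requirement dropped and without the
DWD/DRD bits that are already covered by VTD_CAP_DRAIN):

    /* Sketch of the revised hunk */
    /* TODO: read cap/ecap from host to decide which cap to be exposed. */
    if (s->scalable_mode) {
        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
    }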

> That's also why I listed "device assignment" as a special case in the
> test device matrix, because that's the only case where we can torture
> the IOMMU page shadowing code a bit.
> 
> For the scalable mode I would suppose you will still allow it to work
> without caching mode.  The example is the same as above - when there's
> only emulated devices in the guest.
> 
Yes, I verified it by using virtio-net-pci without device assignment.
A simple netperf case runs well.

> Regards,
> 
> -- 
> Peter Xu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [RFC v1 3/3] intel_iommu: add scalable-mode option to make scalable mode work
  2019-02-15  5:39       ` Peter Xu
  2019-02-15  7:44         ` Yi Sun
@ 2019-02-15  8:22         ` Jason Wang
  1 sibling, 0 replies; 26+ messages in thread
From: Jason Wang @ 2019-02-15  8:22 UTC (permalink / raw)
  To: Peter Xu, Yi Sun
  Cc: kevin.tian, yi.l.liu, ehabkost, mst, qemu-devel, yi.y.sun, pbonzini, rth


On 2019/2/15 下午1:39, Peter Xu wrote:
> On Fri, Feb 15, 2019 at 01:22:34PM +0800, Yi Sun wrote:
>
> [...]
>
>>>> +    /* TODO: read cap/ecap from host to decide which cap to be exposed. */
>>>> +    if (s->scalable_mode) {
>>>> +        if (!s->caching_mode) {
>>>> +            error_report("Need to set caching-mode for scalable mode");
>>> Could I ask why?
>>>
>> My intention is to make the guest explicitly send invalidations so that
>> the SLT (second level table) shadowing is kept correct.
>>
>> On this point, I also have a question. Why does legacy mode not check CM?
>> If CM is not set, may the DMA remapping go wrong because the SLT cannot
>> match the guest's latest changes?
> Because CM is currently only required for device assignment.  For
> example, if you only have an emulated device like virtio-net-pci in
> the guest, then you don't need the CM capability to let it work with
> the vIOMMU.  That's because we'll walk the 2nd level IOMMU page table
> only when the virtio-net-pci device does DMA, and QEMU can easily do
> that (QEMU is emulating the DMA of the virtio-net-pci device, and QEMU
> has the knowledge of the guest 2nd level IOMMU page tables).  Assigned
> devices are special because the host hardware knows nothing about
> guest 2nd level page tables, so QEMU needs to shadow them before DMA
> starts.


+1

And there's the dataplane (vhost), which implements a device IOTLB and so
doesn't require a shadow page table: it can ask for the translation of the
addresses it needs by itself.
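
Roughly, as a toy model (illustration only, not the vhost API), the backend
pulls translations on demand instead of QEMU pushing a full shadow table:

    /* Toy model, illustration only */
    #include <stdint.h>
    #include <stdbool.h>

    static uint64_t viommu_table[16];      /* guest IOVA -> GPA (toy)      */
    static uint64_t dev_iotlb[16];         /* backend's cached translations */
    static bool     dev_iotlb_valid[16];

    /* Backend side: pull a translation on demand */
    static uint64_t backend_translate(unsigned iova_pfn)
    {
        if (!dev_iotlb_valid[iova_pfn]) {
            /* device-IOTLB miss: ask the vIOMMU emulation for this entry */
            dev_iotlb[iova_pfn] = viommu_table[iova_pfn];
            dev_iotlb_valid[iova_pfn] = true;
        }
        return dev_iotlb[iova_pfn];
    }

    /* A guest invalidation of an IOVA simply drops the cached entry */
    static void backend_invalidate(unsigned iova_pfn)
    {
        dev_iotlb_valid[iova_pfn] = false;
    }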


>
> That's also why I listed "device assignment" as a special case in the
> test device matrix, because that's the only case where we can torture
> the IOMMU page shadowing code a bit.
>
> For the scalable mode I would suppose you will still allow it to work
> without caching mode.  The example is the same as above - when there's
> only emulated devices in the guest.


Agree.

Thanks


>
> Regards,
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2019-02-15  8:23 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-30  5:09 [Qemu-devel] [RFC v1 0/3] intel_iommu: support scalable mode Yi Sun
2019-01-30  5:09 ` [Qemu-devel] [RFC v1 1/3] intel_iommu: scalable mode emulation Yi Sun
2019-02-11 10:12   ` Peter Xu
2019-02-13  7:38     ` Yi Sun
2019-02-13  8:03       ` Peter Xu
2019-02-13  8:28         ` Peter Xu
2019-01-30  5:09 ` [Qemu-devel] [RFC v1 2/3] intel_iommu: add 256 bits qi_desc support Yi Sun
2019-02-12  6:27   ` Peter Xu
2019-02-13  9:00     ` Yi Sun
2019-02-13 10:42       ` Peter Xu
2019-02-14  1:52         ` Yi Sun
2019-02-14  3:24           ` Peter Xu
2019-02-14  6:27             ` Yi Sun
2019-02-14  7:13               ` Peter Xu
2019-02-14  7:35                 ` Tian, Kevin
2019-02-14  8:13                   ` Peter Xu
2019-02-14  8:22                     ` Tian, Kevin
2019-02-14  8:43                       ` Peter Xu
2019-01-30  5:09 ` [Qemu-devel] [RFC v1 3/3] intel_iommu: add scalable-mode option to make scalable mode work Yi Sun
2019-02-12  6:46   ` Peter Xu
2019-02-15  5:22     ` Yi Sun
2019-02-15  5:39       ` Peter Xu
2019-02-15  7:44         ` Yi Sun
2019-02-15  8:22         ` Jason Wang
2019-02-11 10:37 ` [Qemu-devel] [RFC v1 0/3] intel_iommu: support scalable mode Peter Xu
2019-02-13  5:46   ` Yi Sun
