* [Qemu-devel] [PATCH v3 0/2] intel-iommu: add support for 5-level virtual IOMMU.
@ 2018-12-12 13:05 Yu Zhang
  2018-12-12 13:05 ` [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width Yu Zhang
                   ` (3 more replies)
  0 siblings, 4 replies; 57+ messages in thread
From: Yu Zhang @ 2018-12-12 13:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Igor Mammedov, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost, Peter Xu

Intel's upcoming processors will extend the maximum linear address width to
57 bits and introduce 5-level paging for the CPU. Meanwhile, the platform
will also extend the maximum guest address width for the IOMMU to 57 bits,
thus introducing 5-level paging for second-level translation (see chapter
3 of the Intel Virtualization Technology for Directed I/O specification).

This patch series extends the current logic to support a wider address width.
A 5-level paging capable IOMMU (for second-level translation) can be rendered
with the configuration "-device intel-iommu,x-aw-bits=57".

Also, kvm-unit-tests were updated to verify this patch series. Patch for
the test was sent out at: https://www.spinics.net/lists/kvm/msg177425.html.

Note: this patch series checks for the existence of 5-level paging in both
the host and the guest, and rejects configurations for 57-bit IOVA if either
check fails (VT-d hardware shall not support 57-bit IOVA on platforms without
CPU 5-level paging). However, the current vIOMMU implementation still lacks
logic to check against the physical IOMMU capability; future enhancements are
expected to add this.

Changes in V3: 
- Address comments from Peter Xu: squash the 3rd patch in v2 into the 2nd
  patch in this version.
- Added "Reviewed-by: Peter Xu <peterx@redhat.com>"

Changes in V2: 
- Address comments from Peter Xu: add haw member in vtd_page_walk_info.
- Address comments from Peter Xu: only search for 4K/2M/1G mappings in the
iotlb, since only these are meaningful.
- Address comments from Peter Xu: cover letter changes (e.g. mention the test
patch in kvm-unit-tests).
- Coding style changes.
---
Cc: "Michael S. Tsirkin" <mst@redhat.com> 
Cc: Igor Mammedov <imammedo@redhat.com> 
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com> 
Cc: Richard Henderson <rth@twiddle.net> 
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
---

Yu Zhang (2):
  intel-iommu: differentiate host address width from IOVA address width.
  intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.

 hw/i386/acpi-build.c           |  2 +-
 hw/i386/intel_iommu.c          | 96 +++++++++++++++++++++++++++++-------------
 hw/i386/intel_iommu_internal.h | 10 ++++-
 include/hw/i386/intel_iommu.h  | 10 +++--
 4 files changed, 81 insertions(+), 37 deletions(-)

-- 
1.9.1


* [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-12 13:05 [Qemu-devel] [PATCH v3 0/2] intel-iommu: add support for 5-level virtual IOMMU Yu Zhang
@ 2018-12-12 13:05 ` Yu Zhang
  2018-12-17 13:17   ` Igor Mammedov
  2018-12-12 13:05 ` [Qemu-devel] [PATCH v3 2/2] intel-iommu: extend VTD emulation to allow 57-bit " Yu Zhang
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 57+ messages in thread
From: Yu Zhang @ 2018-12-12 13:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Igor Mammedov, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost, Peter Xu

Currently, vIOMMU uses the value of the IOVA address width, instead of
the host address width (HAW), to calculate the number of reserved bits in
data structures such as root entries, context entries, and entries of the
DMA paging structures.

However, the IOVA address width and the HAW may not be equal. For
example, a 48-bit IOVA can only be mapped to host addresses no wider than
46 bits. Using 48 instead of 46 to calculate the reserved bits may result
in an invalid IOVA being accepted.

To fix this, a new field, haw_bits, is introduced in struct IntelIOMMUState,
whose value is initialized based on the maximum physical address configured
for the guest CPU. Also, definitions such as VTD_HOST_AW_39/48BIT etc. are
renamed for clarity.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
Cc: "Michael S. Tsirkin" <mst@redhat.com> 
Cc: Igor Mammedov <imammedo@redhat.com> 
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com> 
Cc: Richard Henderson <rth@twiddle.net> 
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
---
 hw/i386/acpi-build.c          |  2 +-
 hw/i386/intel_iommu.c         | 55 ++++++++++++++++++++++++-------------------
 include/hw/i386/intel_iommu.h |  9 +++----
 3 files changed, 37 insertions(+), 29 deletions(-)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 236a20e..b989523 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2431,7 +2431,7 @@ build_dmar_q35(GArray *table_data, BIOSLinker *linker)
     }
 
     dmar = acpi_data_push(table_data, sizeof(*dmar));
-    dmar->host_address_width = intel_iommu->aw_bits - 1;
+    dmar->host_address_width = intel_iommu->haw_bits - 1;
     dmar->flags = dmar_flags;
 
     /* DMAR Remapping Hardware Unit Definition structure */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index d97bcbc..0e88c63 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -707,7 +707,8 @@ static VTDBus *vtd_find_as_from_bus_num(IntelIOMMUState *s, uint8_t bus_num)
  */
 static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
                              uint64_t *slptep, uint32_t *slpte_level,
-                             bool *reads, bool *writes, uint8_t aw_bits)
+                             bool *reads, bool *writes, uint8_t aw_bits,
+                             uint8_t haw_bits)
 {
     dma_addr_t addr = vtd_ce_get_slpt_base(ce);
     uint32_t level = vtd_ce_get_level(ce);
@@ -760,7 +761,7 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
             *slpte_level = level;
             return 0;
         }
-        addr = vtd_get_slpte_addr(slpte, aw_bits);
+        addr = vtd_get_slpte_addr(slpte, haw_bits);
         level--;
     }
 }
@@ -783,6 +784,7 @@ typedef struct {
     void *private;
     bool notify_unmap;
     uint8_t aw;
+    uint8_t haw;
     uint16_t domain_id;
 } vtd_page_walk_info;
 
@@ -925,7 +927,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
              * This is a valid PDE (or even bigger than PDE).  We need
              * to walk one further level.
              */
-            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, info->aw),
+            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, info->haw),
                                       iova, MIN(iova_next, end), level - 1,
                                       read_cur, write_cur, info);
         } else {
@@ -942,7 +944,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
             entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
             entry.addr_mask = ~subpage_mask;
             /* NOTE: this is only meaningful if entry_valid == true */
-            entry.translated_addr = vtd_get_slpte_addr(slpte, info->aw);
+            entry.translated_addr = vtd_get_slpte_addr(slpte, info->haw);
             ret = vtd_page_walk_one(&entry, info);
         }
 
@@ -1002,7 +1004,7 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
         return -VTD_FR_ROOT_ENTRY_P;
     }
 
-    if (re.rsvd || (re.val & VTD_ROOT_ENTRY_RSVD(s->aw_bits))) {
+    if (re.rsvd || (re.val & VTD_ROOT_ENTRY_RSVD(s->haw_bits))) {
         trace_vtd_re_invalid(re.rsvd, re.val);
         return -VTD_FR_ROOT_ENTRY_RSVD;
     }
@@ -1019,7 +1021,7 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
     }
 
     if ((ce->hi & VTD_CONTEXT_ENTRY_RSVD_HI) ||
-               (ce->lo & VTD_CONTEXT_ENTRY_RSVD_LO(s->aw_bits))) {
+               (ce->lo & VTD_CONTEXT_ENTRY_RSVD_LO(s->haw_bits))) {
         trace_vtd_ce_invalid(ce->hi, ce->lo);
         return -VTD_FR_CONTEXT_ENTRY_RSVD;
     }
@@ -1056,6 +1058,7 @@ static int vtd_sync_shadow_page_table_range(VTDAddressSpace *vtd_as,
         .private = (void *)&vtd_as->iommu,
         .notify_unmap = true,
         .aw = s->aw_bits,
+        .haw = s->haw_bits,
         .as = vtd_as,
         .domain_id = VTD_CONTEXT_ENTRY_DID(ce->hi),
     };
@@ -1360,7 +1363,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     }
 
     ret_fr = vtd_iova_to_slpte(&ce, addr, is_write, &slpte, &level,
-                               &reads, &writes, s->aw_bits);
+                               &reads, &writes, s->aw_bits, s->haw_bits);
     if (ret_fr) {
         ret_fr = -ret_fr;
         if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
@@ -1378,7 +1381,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
 out:
     vtd_iommu_unlock(s);
     entry->iova = addr & page_mask;
-    entry->translated_addr = vtd_get_slpte_addr(slpte, s->aw_bits) & page_mask;
+    entry->translated_addr = vtd_get_slpte_addr(slpte, s->haw_bits) & page_mask;
     entry->addr_mask = ~page_mask;
     entry->perm = access_flags;
     return true;
@@ -1396,7 +1399,7 @@ static void vtd_root_table_setup(IntelIOMMUState *s)
 {
     s->root = vtd_get_quad_raw(s, DMAR_RTADDR_REG);
     s->root_extended = s->root & VTD_RTADDR_RTT;
-    s->root &= VTD_RTADDR_ADDR_MASK(s->aw_bits);
+    s->root &= VTD_RTADDR_ADDR_MASK(s->haw_bits);
 
     trace_vtd_reg_dmar_root(s->root, s->root_extended);
 }
@@ -1412,7 +1415,7 @@ static void vtd_interrupt_remap_table_setup(IntelIOMMUState *s)
     uint64_t value = 0;
     value = vtd_get_quad_raw(s, DMAR_IRTA_REG);
     s->intr_size = 1UL << ((value & VTD_IRTA_SIZE_MASK) + 1);
-    s->intr_root = value & VTD_IRTA_ADDR_MASK(s->aw_bits);
+    s->intr_root = value & VTD_IRTA_ADDR_MASK(s->haw_bits);
     s->intr_eime = value & VTD_IRTA_EIME;
 
     /* Notify global invalidation */
@@ -1689,7 +1692,7 @@ static void vtd_handle_gcmd_qie(IntelIOMMUState *s, bool en)
     trace_vtd_inv_qi_enable(en);
 
     if (en) {
-        s->iq = iqa_val & VTD_IQA_IQA_MASK(s->aw_bits);
+        s->iq = iqa_val & VTD_IQA_IQA_MASK(s->haw_bits);
         /* 2^(x+8) entries */
         s->iq_size = 1UL << ((iqa_val & VTD_IQA_QS) + 8);
         s->qi_enabled = true;
@@ -2629,7 +2632,7 @@ static Property vtd_properties[] = {
                             ON_OFF_AUTO_AUTO),
     DEFINE_PROP_BOOL("x-buggy-eim", IntelIOMMUState, buggy_eim, false),
     DEFINE_PROP_UINT8("x-aw-bits", IntelIOMMUState, aw_bits,
-                      VTD_HOST_ADDRESS_WIDTH),
+                      VTD_ADDRESS_WIDTH),
     DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
     DEFINE_PROP_END_OF_LIST(),
 };
@@ -3080,6 +3083,7 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
                 .private = (void *)n,
                 .notify_unmap = false,
                 .aw = s->aw_bits,
+                .haw = s->haw_bits,
                 .as = vtd_as,
                 .domain_id = VTD_CONTEXT_ENTRY_DID(ce.hi),
             };
@@ -3100,6 +3104,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
 static void vtd_init(IntelIOMMUState *s)
 {
     X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
+    CPUState *cs = first_cpu;
+    X86CPU *cpu = X86_CPU(cs);
 
     memset(s->csr, 0, DMAR_REG_SIZE);
     memset(s->wmask, 0, DMAR_REG_SIZE);
@@ -3119,23 +3125,24 @@ static void vtd_init(IntelIOMMUState *s)
     s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
              VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
              VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
-    if (s->aw_bits == VTD_HOST_AW_48BIT) {
+    if (s->aw_bits == VTD_AW_48BIT) {
         s->cap |= VTD_CAP_SAGAW_48bit;
     }
     s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
+    s->haw_bits = cpu->phys_bits;
 
     /*
      * Rsvd field masks for spte
      */
     vtd_paging_entry_rsvd_field[0] = ~0ULL;
-    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->aw_bits);
-    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->aw_bits);
-    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->aw_bits);
-    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->aw_bits);
-    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->aw_bits);
-    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->aw_bits);
-    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->aw_bits);
-    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->aw_bits);
+    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->haw_bits);
+    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
+    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
+    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
+    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
+    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
+    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
+    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
 
     if (x86_iommu->intr_supported) {
         s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
@@ -3261,10 +3268,10 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
     }
 
     /* Currently only address widths supported are 39 and 48 bits */
-    if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
-        (s->aw_bits != VTD_HOST_AW_48BIT)) {
+    if ((s->aw_bits != VTD_AW_39BIT) &&
+        (s->aw_bits != VTD_AW_48BIT)) {
         error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
-                   VTD_HOST_AW_39BIT, VTD_HOST_AW_48BIT);
+                   VTD_AW_39BIT, VTD_AW_48BIT);
         return false;
     }
 
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index ed4e758..820451c 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -47,9 +47,9 @@
 #define VTD_SID_TO_DEVFN(sid)       ((sid) & 0xff)
 
 #define DMAR_REG_SIZE               0x230
-#define VTD_HOST_AW_39BIT           39
-#define VTD_HOST_AW_48BIT           48
-#define VTD_HOST_ADDRESS_WIDTH      VTD_HOST_AW_39BIT
+#define VTD_AW_39BIT                39
+#define VTD_AW_48BIT                48
+#define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
 #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
 
 #define DMAR_REPORT_F_INTR          (1)
@@ -244,7 +244,8 @@ struct IntelIOMMUState {
     bool intr_eime;                 /* Extended interrupt mode enabled */
     OnOffAuto intr_eim;             /* Toggle for EIM cabability */
     bool buggy_eim;                 /* Force buggy EIM unless eim=off */
-    uint8_t aw_bits;                /* Host/IOVA address width (in bits) */
+    uint8_t aw_bits;                /* IOVA address width (in bits) */
+    uint8_t haw_bits;               /* Hardware address width (in bits) */
 
     /*
      * Protects IOMMU states in general.  Currently it protects the
-- 
1.9.1


* [Qemu-devel] [PATCH v3 2/2] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.
  2018-12-12 13:05 [Qemu-devel] [PATCH v3 0/2] intel-iommu: add support for 5-level virtual IOMMU Yu Zhang
  2018-12-12 13:05 ` [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width Yu Zhang
@ 2018-12-12 13:05 ` Yu Zhang
  2018-12-17 13:29   ` Igor Mammedov
  2018-12-14  9:17 ` [Qemu-devel] [PATCH v3 0/2] intel-iommu: add support for 5-level virtual IOMMU Yu Zhang
  2019-01-15  4:02 ` Michael S. Tsirkin
  3 siblings, 1 reply; 57+ messages in thread
From: Yu Zhang @ 2018-12-12 13:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost, Peter Xu

A 5-level paging capable VM may choose to use a 57-bit IOVA address width.
E.g. guest applications may prefer to use their virtual addresses as IOVAs
when performing VFIO map/unmap operations, to avoid the burden of managing
the IOVA space.

This patch extends the current vIOMMU logic to cover the extended address
width. When creating a VM with the 5-level paging feature, one can choose to
create a virtual VT-d with 5-level paging capability, with a configuration
like "-device intel-iommu,x-aw-bits=57".

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c          | 53 ++++++++++++++++++++++++++++++++----------
 hw/i386/intel_iommu_internal.h | 10 ++++++--
 include/hw/i386/intel_iommu.h  |  1 +
 3 files changed, 50 insertions(+), 14 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 0e88c63..871110c 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -664,16 +664,16 @@ static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce,
 
 /*
  * Rsvd field masks for spte:
- *     Index [1] to [4] 4k pages
- *     Index [5] to [8] large pages
+ *     Index [1] to [5] 4k pages
+ *     Index [6] to [10] large pages
  */
-static uint64_t vtd_paging_entry_rsvd_field[9];
+static uint64_t vtd_paging_entry_rsvd_field[11];
 
 static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
 {
     if (slpte & VTD_SL_PT_PAGE_SIZE_MASK) {
         /* Maybe large page */
-        return slpte & vtd_paging_entry_rsvd_field[level + 4];
+        return slpte & vtd_paging_entry_rsvd_field[level + 5];
     } else {
         return slpte & vtd_paging_entry_rsvd_field[level];
     }
@@ -3127,6 +3127,8 @@ static void vtd_init(IntelIOMMUState *s)
              VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
     if (s->aw_bits == VTD_AW_48BIT) {
         s->cap |= VTD_CAP_SAGAW_48bit;
+    } else if (s->aw_bits == VTD_AW_57BIT) {
+        s->cap |= VTD_CAP_SAGAW_57bit | VTD_CAP_SAGAW_48bit;
     }
     s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
     s->haw_bits = cpu->phys_bits;
@@ -3139,10 +3141,12 @@ static void vtd_init(IntelIOMMUState *s)
     vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
     vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
     vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
-    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
-    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
-    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
-    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
+    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_PAGE_L5_RSVD_MASK(s->haw_bits);
+    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
+    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
+    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
+    vtd_paging_entry_rsvd_field[9] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
+    vtd_paging_entry_rsvd_field[10] = VTD_SPTE_LPAGE_L5_RSVD_MASK(s->haw_bits);
 
     if (x86_iommu->intr_supported) {
         s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
@@ -3241,6 +3245,23 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &vtd_as->as;
 }
 
+static bool host_has_la57(void)
+{
+    uint32_t ecx, unused;
+
+    host_cpuid(7, 0, &unused, &unused, &ecx, &unused);
+    return ecx & CPUID_7_0_ECX_LA57;
+}
+
+static bool guest_has_la57(void)
+{
+    CPUState *cs = first_cpu;
+    X86CPU *cpu = X86_CPU(cs);
+    CPUX86State *env = &cpu->env;
+
+    return env->features[FEAT_7_0_ECX] & CPUID_7_0_ECX_LA57;
+}
+
 static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
 {
     X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
@@ -3267,11 +3288,19 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
         }
     }
 
-    /* Currently only address widths supported are 39 and 48 bits */
+    /* Currently address widths supported are 39, 48, and 57 bits */
     if ((s->aw_bits != VTD_AW_39BIT) &&
-        (s->aw_bits != VTD_AW_48BIT)) {
-        error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
-                   VTD_AW_39BIT, VTD_AW_48BIT);
+        (s->aw_bits != VTD_AW_48BIT) &&
+        (s->aw_bits != VTD_AW_57BIT)) {
+        error_setg(errp, "Supported values for x-aw-bits are: %d, %d, %d",
+                   VTD_AW_39BIT, VTD_AW_48BIT, VTD_AW_57BIT);
+        return false;
+    }
+
+    if ((s->aw_bits == VTD_AW_57BIT) &&
+        !(host_has_la57() && guest_has_la57())) {
+        error_setg(errp, "Do not support 57-bit DMA address, unless both "
+                         "host and guest are capable of 5-level paging");
         return false;
     }
 
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index d084099..2b29b6f 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -114,8 +114,8 @@
                                      VTD_INTERRUPT_ADDR_FIRST + 1)
 
 /* The shift of source_id in the key of IOTLB hash table */
-#define VTD_IOTLB_SID_SHIFT         36
-#define VTD_IOTLB_LVL_SHIFT         52
+#define VTD_IOTLB_SID_SHIFT         45
+#define VTD_IOTLB_LVL_SHIFT         61
 #define VTD_IOTLB_MAX_SIZE          1024    /* Max size of the hash table */
 
 /* IOTLB_REG */
@@ -212,6 +212,8 @@
 #define VTD_CAP_SAGAW_39bit         (0x2ULL << VTD_CAP_SAGAW_SHIFT)
  /* 48-bit AGAW, 4-level page-table */
 #define VTD_CAP_SAGAW_48bit         (0x4ULL << VTD_CAP_SAGAW_SHIFT)
+ /* 57-bit AGAW, 5-level page-table */
+#define VTD_CAP_SAGAW_57bit         (0x8ULL << VTD_CAP_SAGAW_SHIFT)
 
 /* IQT_REG */
 #define VTD_IQT_QT(val)             (((val) >> 4) & 0x7fffULL)
@@ -379,6 +381,8 @@ typedef union VTDInvDesc VTDInvDesc;
         (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
 #define VTD_SPTE_PAGE_L4_RSVD_MASK(aw) \
         (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
+#define VTD_SPTE_PAGE_L5_RSVD_MASK(aw) \
+        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
 #define VTD_SPTE_LPAGE_L1_RSVD_MASK(aw) \
         (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
 #define VTD_SPTE_LPAGE_L2_RSVD_MASK(aw) \
@@ -387,6 +391,8 @@ typedef union VTDInvDesc VTDInvDesc;
         (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
 #define VTD_SPTE_LPAGE_L4_RSVD_MASK(aw) \
         (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
+#define VTD_SPTE_LPAGE_L5_RSVD_MASK(aw) \
+        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
 
 /* Information about page-selective IOTLB invalidate */
 struct VTDIOTLBPageInvInfo {
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 820451c..7474c4f 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -49,6 +49,7 @@
 #define DMAR_REG_SIZE               0x230
 #define VTD_AW_39BIT                39
 #define VTD_AW_48BIT                48
+#define VTD_AW_57BIT                57
 #define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
 #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
 
-- 
1.9.1


* Re: [Qemu-devel] [PATCH v3 0/2] intel-iommu: add support for 5-level virtual IOMMU.
  2018-12-12 13:05 [Qemu-devel] [PATCH v3 0/2] intel-iommu: add support for 5-level virtual IOMMU Yu Zhang
  2018-12-12 13:05 ` [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width Yu Zhang
  2018-12-12 13:05 ` [Qemu-devel] [PATCH v3 2/2] intel-iommu: extend VTD emulation to allow 57-bit " Yu Zhang
@ 2018-12-14  9:17 ` Yu Zhang
  2019-01-15  4:02 ` Michael S. Tsirkin
  3 siblings, 0 replies; 57+ messages in thread
From: Yu Zhang @ 2018-12-14  9:17 UTC (permalink / raw)
  To: qemu-devel, Eduardo Habkost, Michael S. Tsirkin, Peter Xu,
	Paolo Bonzini, Igor Mammedov, Richard Henderson

Sorry, any comments for this series? Thanks. :)


B.R.

Yu

On 12/12/2018 9:05 PM, Yu Zhang wrote:
> Intel's upcoming processors will extend maximum linear address width to
> 57 bits, and introduce 5-level paging for CPU. Meanwhile, the platform
> will also extend the maximum guest address width for IOMMU to 57 bits,
> thus introducing the 5-level paging for 2nd level translation(See chapter
> 3 in Intel Virtualization Technology for Directed I/O).
>
> This patch series extends the current logic to support a wider address width.
> A 5-level paging capable IOMMU(for 2nd level translation) can be rendered
> with configuration "device intel-iommu,x-aw-bits=57".
>
> Also, kvm-unit-tests were updated to verify this patch series. Patch for
> the test was sent out at: https://www.spinics.net/lists/kvm/msg177425.html.
>
> Note: this patch series checks for the existence of 5-level paging in both
> the host and the guest, and rejects configurations for 57-bit IOVA if either
> check fails (VT-d hardware shall not support 57-bit IOVA on platforms without
> CPU 5-level paging). However, the current vIOMMU implementation still lacks
> logic to check against the physical IOMMU capability; future enhancements are
> expected to add this.
>
> Changes in V3:
> - Address comments from Peter Xu: squash the 3rd patch in v2 into the 2nd
>    patch in this version.
> - Added "Reviewed-by: Peter Xu <peterx@redhat.com>"
>
> Changes in V2:
> - Address comments from Peter Xu: add haw member in vtd_page_walk_info.
> - Address comments from Peter Xu: only searches for 4K/2M/1G mappings in
> iotlb are meaningful.
> - Address comments from Peter Xu: cover letter changes(e.g. mention the test
> patch in kvm-unit-tests).
> - Coding style changes.
> ---
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Cc: Peter Xu <peterx@redhat.com>
> ---
>
> Yu Zhang (2):
>    intel-iommu: differentiate host address width from IOVA address width.
>    intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.
>
>   hw/i386/acpi-build.c           |  2 +-
>   hw/i386/intel_iommu.c          | 96 +++++++++++++++++++++++++++++-------------
>   hw/i386/intel_iommu_internal.h | 10 ++++-
>   include/hw/i386/intel_iommu.h  | 10 +++--
>   4 files changed, 81 insertions(+), 37 deletions(-)
>


* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-12 13:05 ` [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width Yu Zhang
@ 2018-12-17 13:17   ` Igor Mammedov
  2018-12-18  9:27     ` Yu Zhang
  0 siblings, 1 reply; 57+ messages in thread
From: Igor Mammedov @ 2018-12-17 13:17 UTC (permalink / raw)
  To: Yu Zhang
  Cc: qemu-devel, Michael S. Tsirkin, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost, Peter Xu

On Wed, 12 Dec 2018 21:05:38 +0800
Yu Zhang <yu.c.zhang@linux.intel.com> wrote:

> Currently, vIOMMU is using the value of IOVA address width, instead of
> the host address width(HAW) to calculate the number of reserved bits in
> data structures such as root entries, context entries, and entries of
> DMA paging structures etc.
> 
> However values of IOVA address width and of the HAW may not equal. For
> example, a 48-bit IOVA can only be mapped to host addresses no wider than
> 46 bits. Using 48, instead of 46 to calculate the reserved bit may result
> in an invalid IOVA being accepted.
> 
> To fix this, a new field - haw_bits is introduced in struct IntelIOMMUState,
> whose value is initialized based on the maximum physical address set to
> guest CPU.

> Also, definitions such as VTD_HOST_AW_39/48BIT etc. are renamed
> to clarify.
> 
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> ---
[...]

> @@ -3100,6 +3104,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>  static void vtd_init(IntelIOMMUState *s)
>  {
>      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> +    CPUState *cs = first_cpu;
> +    X86CPU *cpu = X86_CPU(cs);
>  
>      memset(s->csr, 0, DMAR_REG_SIZE);
>      memset(s->wmask, 0, DMAR_REG_SIZE);
> @@ -3119,23 +3125,24 @@ static void vtd_init(IntelIOMMUState *s)
>      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
>               VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
>               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> -    if (s->aw_bits == VTD_HOST_AW_48BIT) {
> +    if (s->aw_bits == VTD_AW_48BIT) {
>          s->cap |= VTD_CAP_SAGAW_48bit;
>      }
>      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> +    s->haw_bits = cpu->phys_bits;
Is it possible to avoid accessing CPU fields directly or cpu altogether
and set phys_bits when iommu is created?

Perhaps Eduardo can suggest a better approach, since he's more familiar with
the phys_bits topic.

>      /*
>       * Rsvd field masks for spte
>       */
>      vtd_paging_entry_rsvd_field[0] = ~0ULL;
> -    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->aw_bits);
> -    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->aw_bits);
> -    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->aw_bits);
> -    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->aw_bits);
> -    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->aw_bits);
> -    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->aw_bits);
> -    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->aw_bits);
> -    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->aw_bits);
> +    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->haw_bits);
> +    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
> +    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
> +    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
> +    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> +    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> +    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> +    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
>  
>      if (x86_iommu->intr_supported) {
>          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
> @@ -3261,10 +3268,10 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
>      }
>  
>      /* Currently only address widths supported are 39 and 48 bits */
> -    if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
> -        (s->aw_bits != VTD_HOST_AW_48BIT)) {
> +    if ((s->aw_bits != VTD_AW_39BIT) &&
> +        (s->aw_bits != VTD_AW_48BIT)) {
>          error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> -                   VTD_HOST_AW_39BIT, VTD_HOST_AW_48BIT);
> +                   VTD_AW_39BIT, VTD_AW_48BIT);
>          return false;
>      }
>  
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index ed4e758..820451c 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -47,9 +47,9 @@
>  #define VTD_SID_TO_DEVFN(sid)       ((sid) & 0xff)
>  
>  #define DMAR_REG_SIZE               0x230
> -#define VTD_HOST_AW_39BIT           39
> -#define VTD_HOST_AW_48BIT           48
> -#define VTD_HOST_ADDRESS_WIDTH      VTD_HOST_AW_39BIT
> +#define VTD_AW_39BIT                39
> +#define VTD_AW_48BIT                48
> +#define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
>  #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
>  
>  #define DMAR_REPORT_F_INTR          (1)
> @@ -244,7 +244,8 @@ struct IntelIOMMUState {
>      bool intr_eime;                 /* Extended interrupt mode enabled */
>      OnOffAuto intr_eim;             /* Toggle for EIM cabability */
>      bool buggy_eim;                 /* Force buggy EIM unless eim=off */
> -    uint8_t aw_bits;                /* Host/IOVA address width (in bits) */
> +    uint8_t aw_bits;                /* IOVA address width (in bits) */
> +    uint8_t haw_bits;               /* Hardware address width (in bits) */
>  
>      /*
>       * Protects IOMMU states in general.  Currently it protects the

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 2/2] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.
  2018-12-12 13:05 ` [Qemu-devel] [PATCH v3 2/2] intel-iommu: extend VTD emulation to allow 57-bit " Yu Zhang
@ 2018-12-17 13:29   ` Igor Mammedov
  2018-12-18  9:47     ` Yu Zhang
  0 siblings, 1 reply; 57+ messages in thread
From: Igor Mammedov @ 2018-12-17 13:29 UTC (permalink / raw)
  To: Yu Zhang
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin, Peter Xu,
	Paolo Bonzini, Richard Henderson

On Wed, 12 Dec 2018 21:05:39 +0800
Yu Zhang <yu.c.zhang@linux.intel.com> wrote:

> A 5-level paging capable VM may choose to use 57-bit IOVA address width.
> E.g. guest applications may prefer to use its VA as IOVA when performing
> VFIO map/unmap operations, to avoid the burden of managing the IOVA space.
> 
> This patch extends the current vIOMMU logic to cover the extended address
> width. When creating a VM with 5-level paging feature, one can choose to
> create a virtual VTD with 5-level paging capability, with configurations
> like "-device intel-iommu,x-aw-bits=57".
> 
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> ---
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Cc: Peter Xu <peterx@redhat.com>
> ---
>  hw/i386/intel_iommu.c          | 53 ++++++++++++++++++++++++++++++++----------
>  hw/i386/intel_iommu_internal.h | 10 ++++++--
>  include/hw/i386/intel_iommu.h  |  1 +
>  3 files changed, 50 insertions(+), 14 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 0e88c63..871110c 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -664,16 +664,16 @@ static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce,
>  
>  /*
>   * Rsvd field masks for spte:
> - *     Index [1] to [4] 4k pages
> - *     Index [5] to [8] large pages
> + *     Index [1] to [5] 4k pages
> + *     Index [6] to [10] large pages
>   */
> -static uint64_t vtd_paging_entry_rsvd_field[9];
> +static uint64_t vtd_paging_entry_rsvd_field[11];
>  
>  static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
>  {
>      if (slpte & VTD_SL_PT_PAGE_SIZE_MASK) {
>          /* Maybe large page */
> -        return slpte & vtd_paging_entry_rsvd_field[level + 4];
> +        return slpte & vtd_paging_entry_rsvd_field[level + 5];
>      } else {
>          return slpte & vtd_paging_entry_rsvd_field[level];
>      }
> @@ -3127,6 +3127,8 @@ static void vtd_init(IntelIOMMUState *s)
>               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
>      if (s->aw_bits == VTD_AW_48BIT) {
>          s->cap |= VTD_CAP_SAGAW_48bit;
> +    } else if (s->aw_bits == VTD_AW_57BIT) {
> +        s->cap |= VTD_CAP_SAGAW_57bit | VTD_CAP_SAGAW_48bit;
>      }
>      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
>      s->haw_bits = cpu->phys_bits;
> @@ -3139,10 +3141,12 @@ static void vtd_init(IntelIOMMUState *s)
>      vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
>      vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
>      vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
> -    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> -    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> -    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> -    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> +    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_PAGE_L5_RSVD_MASK(s->haw_bits);
> +    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> +    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> +    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> +    vtd_paging_entry_rsvd_field[9] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> +    vtd_paging_entry_rsvd_field[10] = VTD_SPTE_LPAGE_L5_RSVD_MASK(s->haw_bits);
>  
>      if (x86_iommu->intr_supported) {
>          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
> @@ -3241,6 +3245,23 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>      return &vtd_as->as;
>  }
>  
> +static bool host_has_la57(void)
> +{
> +    uint32_t ecx, unused;
> +
> +    host_cpuid(7, 0, &unused, &unused, &ecx, &unused);
> +    return ecx & CPUID_7_0_ECX_LA57;
> +}
> +
> +static bool guest_has_la57(void)
> +{
> +    CPUState *cs = first_cpu;
> +    X86CPU *cpu = X86_CPU(cs);
> +    CPUX86State *env = &cpu->env;
> +
> +    return env->features[FEAT_7_0_ECX] & CPUID_7_0_ECX_LA57;
> +}
This is another direct access to CPU fields. I'd suggest setting this
value when the IOMMU is created, i.e. add an 'la57' property and set it
from the IOMMU's owner.

>  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
>  {
>      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> @@ -3267,11 +3288,19 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
>          }
>      }
>  
> -    /* Currently only address widths supported are 39 and 48 bits */
> +    /* Currently address widths supported are 39, 48, and 57 bits */
>      if ((s->aw_bits != VTD_AW_39BIT) &&
> -        (s->aw_bits != VTD_AW_48BIT)) {
> -        error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> -                   VTD_AW_39BIT, VTD_AW_48BIT);
> +        (s->aw_bits != VTD_AW_48BIT) &&
> +        (s->aw_bits != VTD_AW_57BIT)) {
> +        error_setg(errp, "Supported values for x-aw-bits are: %d, %d, %d",
> +                   VTD_AW_39BIT, VTD_AW_48BIT, VTD_AW_57BIT);
> +        return false;
> +    }
> +
> +    if ((s->aw_bits == VTD_AW_57BIT) &&
> +        !(host_has_la57() && guest_has_la57())) {
Is the IOMMU supposed to work in TCG mode?
If so, why should it care about host_has_la57()?

> +        error_setg(errp, "Do not support 57-bit DMA address, unless both "
> +                         "host and guest are capable of 5-level paging");
>          return false;
>      }
>  
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index d084099..2b29b6f 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -114,8 +114,8 @@
>                                       VTD_INTERRUPT_ADDR_FIRST + 1)
>  
>  /* The shift of source_id in the key of IOTLB hash table */
> -#define VTD_IOTLB_SID_SHIFT         36
> -#define VTD_IOTLB_LVL_SHIFT         52
> +#define VTD_IOTLB_SID_SHIFT         45
> +#define VTD_IOTLB_LVL_SHIFT         61
>  #define VTD_IOTLB_MAX_SIZE          1024    /* Max size of the hash table */
>  
>  /* IOTLB_REG */
> @@ -212,6 +212,8 @@
>  #define VTD_CAP_SAGAW_39bit         (0x2ULL << VTD_CAP_SAGAW_SHIFT)
>   /* 48-bit AGAW, 4-level page-table */
>  #define VTD_CAP_SAGAW_48bit         (0x4ULL << VTD_CAP_SAGAW_SHIFT)
> + /* 57-bit AGAW, 5-level page-table */
> +#define VTD_CAP_SAGAW_57bit         (0x8ULL << VTD_CAP_SAGAW_SHIFT)
>  
>  /* IQT_REG */
>  #define VTD_IQT_QT(val)             (((val) >> 4) & 0x7fffULL)
> @@ -379,6 +381,8 @@ typedef union VTDInvDesc VTDInvDesc;
>          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
>  #define VTD_SPTE_PAGE_L4_RSVD_MASK(aw) \
>          (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> +#define VTD_SPTE_PAGE_L5_RSVD_MASK(aw) \
> +        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
>  #define VTD_SPTE_LPAGE_L1_RSVD_MASK(aw) \
>          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
>  #define VTD_SPTE_LPAGE_L2_RSVD_MASK(aw) \
> @@ -387,6 +391,8 @@ typedef union VTDInvDesc VTDInvDesc;
>          (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
>  #define VTD_SPTE_LPAGE_L4_RSVD_MASK(aw) \
>          (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> +#define VTD_SPTE_LPAGE_L5_RSVD_MASK(aw) \
> +        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
>  
>  /* Information about page-selective IOTLB invalidate */
>  struct VTDIOTLBPageInvInfo {
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 820451c..7474c4f 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -49,6 +49,7 @@
>  #define DMAR_REG_SIZE               0x230
>  #define VTD_AW_39BIT                39
>  #define VTD_AW_48BIT                48
> +#define VTD_AW_57BIT                57
>  #define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
>  #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
>  


* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-17 13:17   ` Igor Mammedov
@ 2018-12-18  9:27     ` Yu Zhang
  2018-12-18 14:23       ` Michael S. Tsirkin
                         ` (2 more replies)
  0 siblings, 3 replies; 57+ messages in thread
From: Yu Zhang @ 2018-12-18  9:27 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Eduardo Habkost, Michael S. Tsirkin, qemu-devel, Peter Xu,
	Paolo Bonzini, Richard Henderson

On Mon, Dec 17, 2018 at 02:17:40PM +0100, Igor Mammedov wrote:
> On Wed, 12 Dec 2018 21:05:38 +0800
> Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> 
> > Currently, vIOMMU is using the value of IOVA address width, instead of
> > the host address width(HAW) to calculate the number of reserved bits in
> > data structures such as root entries, context entries, and entries of
> > DMA paging structures etc.
> > 
> > However values of IOVA address width and of the HAW may not equal. For
> > example, a 48-bit IOVA can only be mapped to host addresses no wider than
> > 46 bits. Using 48, instead of 46 to calculate the reserved bit may result
> > in an invalid IOVA being accepted.
> > 
> > To fix this, a new field - haw_bits is introduced in struct IntelIOMMUState,
> > whose value is initialized based on the maximum physical address set to
> > guest CPU.
> 
> > Also, definitions such as VTD_HOST_AW_39/48BIT etc. are renamed
> > to clarify.
> > 
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Reviewed-by: Peter Xu <peterx@redhat.com>
> > ---
> [...]
> 
> > @@ -3100,6 +3104,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> >  static void vtd_init(IntelIOMMUState *s)
> >  {
> >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > +    CPUState *cs = first_cpu;
> > +    X86CPU *cpu = X86_CPU(cs);
> >  
> >      memset(s->csr, 0, DMAR_REG_SIZE);
> >      memset(s->wmask, 0, DMAR_REG_SIZE);
> > @@ -3119,23 +3125,24 @@ static void vtd_init(IntelIOMMUState *s)
> >      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
> >               VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > -    if (s->aw_bits == VTD_HOST_AW_48BIT) {
> > +    if (s->aw_bits == VTD_AW_48BIT) {
> >          s->cap |= VTD_CAP_SAGAW_48bit;
> >      }
> >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > +    s->haw_bits = cpu->phys_bits;
> Is it possible to avoid accessing CPU fields directly or cpu altogether
> and set phys_bits when iommu is created?

Thanks for your comments, Igor.

Well, I guess you would prefer not to query the CPU capabilities while
deciding the vIOMMU features. But to me, they are not that unrelated. :)

Here the hardware address width in VT-d and the one in CPUID.MAXPHYADDR
refer to the same concept: in a VM, both are the maximum guest physical
address width. If we do not check the CPU field here, we will still have
to check it in other places, such as build_dmar_q35(), and reset
s->haw_bits again.

Is this explanation convincing enough? :)

> 
> Perhaps Eduardo
>  can suggest better approach, since he's more familiar with phys_bits topic

@Eduardo, any comments? Thanks!

> 
> >      /*
> >       * Rsvd field masks for spte
> >       */
> >      vtd_paging_entry_rsvd_field[0] = ~0ULL;
> > -    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->aw_bits);
> > -    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->aw_bits);
> > -    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->aw_bits);
> > -    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->aw_bits);
> > -    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->aw_bits);
> > -    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->aw_bits);
> > -    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->aw_bits);
> > -    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->aw_bits);
> > +    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->haw_bits);
> > +    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
> > +    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
> > +    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
> > +    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > +    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > +    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > +    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> >  
> >      if (x86_iommu->intr_supported) {
> >          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
> > @@ -3261,10 +3268,10 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> >      }
> >  
> >      /* Currently only address widths supported are 39 and 48 bits */
> > -    if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
> > -        (s->aw_bits != VTD_HOST_AW_48BIT)) {
> > +    if ((s->aw_bits != VTD_AW_39BIT) &&
> > +        (s->aw_bits != VTD_AW_48BIT)) {
> >          error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > -                   VTD_HOST_AW_39BIT, VTD_HOST_AW_48BIT);
> > +                   VTD_AW_39BIT, VTD_AW_48BIT);
> >          return false;
> >      }
> >  
> > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > index ed4e758..820451c 100644
> > --- a/include/hw/i386/intel_iommu.h
> > +++ b/include/hw/i386/intel_iommu.h
> > @@ -47,9 +47,9 @@
> >  #define VTD_SID_TO_DEVFN(sid)       ((sid) & 0xff)
> >  
> >  #define DMAR_REG_SIZE               0x230
> > -#define VTD_HOST_AW_39BIT           39
> > -#define VTD_HOST_AW_48BIT           48
> > -#define VTD_HOST_ADDRESS_WIDTH      VTD_HOST_AW_39BIT
> > +#define VTD_AW_39BIT                39
> > +#define VTD_AW_48BIT                48
> > +#define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
> >  #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
> >  
> >  #define DMAR_REPORT_F_INTR          (1)
> > @@ -244,7 +244,8 @@ struct IntelIOMMUState {
> >      bool intr_eime;                 /* Extended interrupt mode enabled */
> >      OnOffAuto intr_eim;             /* Toggle for EIM cabability */
> >      bool buggy_eim;                 /* Force buggy EIM unless eim=off */
> > -    uint8_t aw_bits;                /* Host/IOVA address width (in bits) */
> > +    uint8_t aw_bits;                /* IOVA address width (in bits) */
> > +    uint8_t haw_bits;               /* Hardware address width (in bits) */
> >  
> >      /*
> >       * Protects IOMMU states in general.  Currently it protects the
> 
> 

B.R.
Yu


* Re: [Qemu-devel] [PATCH v3 2/2] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.
  2018-12-17 13:29   ` Igor Mammedov
@ 2018-12-18  9:47     ` Yu Zhang
  2018-12-18 10:01       ` Yu Zhang
  0 siblings, 1 reply; 57+ messages in thread
From: Yu Zhang @ 2018-12-18  9:47 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Eduardo Habkost, Michael S. Tsirkin, qemu-devel, Peter Xu,
	Paolo Bonzini, Richard Henderson

On Mon, Dec 17, 2018 at 02:29:02PM +0100, Igor Mammedov wrote:
> On Wed, 12 Dec 2018 21:05:39 +0800
> Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> 
> > A 5-level paging capable VM may choose to use 57-bit IOVA address width.
> > E.g. guest applications may prefer to use its VA as IOVA when performing
> > VFIO map/unmap operations, to avoid the burden of managing the IOVA space.
> > 
> > This patch extends the current vIOMMU logic to cover the extended address
> > width. When creating a VM with 5-level paging feature, one can choose to
> > create a virtual VTD with 5-level paging capability, with configurations
> > like "-device intel-iommu,x-aw-bits=57".
> > 
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Reviewed-by: Peter Xu <peterx@redhat.com>
> > ---
> > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Richard Henderson <rth@twiddle.net>
> > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > ---
> >  hw/i386/intel_iommu.c          | 53 ++++++++++++++++++++++++++++++++----------
> >  hw/i386/intel_iommu_internal.h | 10 ++++++--
> >  include/hw/i386/intel_iommu.h  |  1 +
> >  3 files changed, 50 insertions(+), 14 deletions(-)
> > 
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index 0e88c63..871110c 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -664,16 +664,16 @@ static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce,
> >  
> >  /*
> >   * Rsvd field masks for spte:
> > - *     Index [1] to [4] 4k pages
> > - *     Index [5] to [8] large pages
> > + *     Index [1] to [5] 4k pages
> > + *     Index [6] to [10] large pages
> >   */
> > -static uint64_t vtd_paging_entry_rsvd_field[9];
> > +static uint64_t vtd_paging_entry_rsvd_field[11];
> >  
> >  static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> >  {
> >      if (slpte & VTD_SL_PT_PAGE_SIZE_MASK) {
> >          /* Maybe large page */
> > -        return slpte & vtd_paging_entry_rsvd_field[level + 4];
> > +        return slpte & vtd_paging_entry_rsvd_field[level + 5];
> >      } else {
> >          return slpte & vtd_paging_entry_rsvd_field[level];
> >      }
> > @@ -3127,6 +3127,8 @@ static void vtd_init(IntelIOMMUState *s)
> >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> >      if (s->aw_bits == VTD_AW_48BIT) {
> >          s->cap |= VTD_CAP_SAGAW_48bit;
> > +    } else if (s->aw_bits == VTD_AW_57BIT) {
> > +        s->cap |= VTD_CAP_SAGAW_57bit | VTD_CAP_SAGAW_48bit;
> >      }
> >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> >      s->haw_bits = cpu->phys_bits;
> > @@ -3139,10 +3141,12 @@ static void vtd_init(IntelIOMMUState *s)
> >      vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
> >      vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
> >      vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
> > -    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > -    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > -    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > -    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > +    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_PAGE_L5_RSVD_MASK(s->haw_bits);
> > +    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > +    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > +    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > +    vtd_paging_entry_rsvd_field[9] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > +    vtd_paging_entry_rsvd_field[10] = VTD_SPTE_LPAGE_L5_RSVD_MASK(s->haw_bits);
> >  
> >      if (x86_iommu->intr_supported) {
> >          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
> > @@ -3241,6 +3245,23 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
> >      return &vtd_as->as;
> >  }
> >  
> > +static bool host_has_la57(void)
> > +{
> > +    uint32_t ecx, unused;
> > +
> > +    host_cpuid(7, 0, &unused, &unused, &ecx, &unused);
> > +    return ecx & CPUID_7_0_ECX_LA57;
> > +}
> > +
> > +static bool guest_has_la57(void)
> > +{
> > +    CPUState *cs = first_cpu;
> > +    X86CPU *cpu = X86_CPU(cs);
> > +    CPUX86State *env = &cpu->env;
> > +
> > +    return env->features[FEAT_7_0_ECX] & CPUID_7_0_ECX_LA57;
> > +}
> This is another direct access to CPU fields. I'd suggest setting this
> value when the IOMMU is created, i.e. add an 'la57' property and set it
> from the IOMMU's owner.
> 

Sorry, do you mean "-device intel-iommu,la57"? I think we do not need
that, because a 5-level capable vIOMMU can already be created with
"-device intel-iommu,x-aw-bits=57".

The guest CPU fields are checked to make sure the VM has the LA57 CPU
feature, because I believe there should be no 5-level IOMMU on platforms
whose CPUs lack LA57.

> >  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> >  {
> >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > @@ -3267,11 +3288,19 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> >          }
> >      }
> >  
> > -    /* Currently only address widths supported are 39 and 48 bits */
> > +    /* Currently address widths supported are 39, 48, and 57 bits */
> >      if ((s->aw_bits != VTD_AW_39BIT) &&
> > -        (s->aw_bits != VTD_AW_48BIT)) {
> > -        error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > -                   VTD_AW_39BIT, VTD_AW_48BIT);
> > +        (s->aw_bits != VTD_AW_48BIT) &&
> > +        (s->aw_bits != VTD_AW_57BIT)) {
> > +        error_setg(errp, "Supported values for x-aw-bits are: %d, %d, %d",
> > +                   VTD_AW_39BIT, VTD_AW_48BIT, VTD_AW_57BIT);
> > +        return false;
> > +    }
> > +
> > +    if ((s->aw_bits == VTD_AW_57BIT) &&
> > +        !(host_has_la57() && guest_has_la57())) {
> Is the IOMMU supposed to work in TCG mode?
> If so, why should it care about host_has_la57()?
> 

Hmm... I did not take TCG mode into consideration. host_has_la57() is
used to guarantee that the host has the LA57 feature, so that IOMMU
shadowing works for device assignment.

I guess the IOMMU should work in TCG mode (though I am not quite sure
about this), but I do not have any use case for a 5-level vIOMMU in TCG
in mind. So maybe we can:
1> check 'ms->accel' in vtd_decide_config() and skip the host capability
check when running under TCG; or
2> keep it as it is, and add the check once a 5-level paging vIOMMU does
have a use case in TCG?

As for the guest capability check, I still believe it is necessary: as
said above, a VM without the LA57 feature should not see a VT-d with a
5-level IOMMU.

> > +        error_setg(errp, "Do not support 57-bit DMA address, unless both "
> > +                         "host and guest are capable of 5-level paging");
> >          return false;
> >      }
> >  
> > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> > index d084099..2b29b6f 100644
> > --- a/hw/i386/intel_iommu_internal.h
> > +++ b/hw/i386/intel_iommu_internal.h
> > @@ -114,8 +114,8 @@
> >                                       VTD_INTERRUPT_ADDR_FIRST + 1)
> >  
> >  /* The shift of source_id in the key of IOTLB hash table */
> > -#define VTD_IOTLB_SID_SHIFT         36
> > -#define VTD_IOTLB_LVL_SHIFT         52
> > +#define VTD_IOTLB_SID_SHIFT         45
> > +#define VTD_IOTLB_LVL_SHIFT         61
> >  #define VTD_IOTLB_MAX_SIZE          1024    /* Max size of the hash table */
> >  
> >  /* IOTLB_REG */
> > @@ -212,6 +212,8 @@
> >  #define VTD_CAP_SAGAW_39bit         (0x2ULL << VTD_CAP_SAGAW_SHIFT)
> >   /* 48-bit AGAW, 4-level page-table */
> >  #define VTD_CAP_SAGAW_48bit         (0x4ULL << VTD_CAP_SAGAW_SHIFT)
> > + /* 57-bit AGAW, 5-level page-table */
> > +#define VTD_CAP_SAGAW_57bit         (0x8ULL << VTD_CAP_SAGAW_SHIFT)
> >  
> >  /* IQT_REG */
> >  #define VTD_IQT_QT(val)             (((val) >> 4) & 0x7fffULL)
> > @@ -379,6 +381,8 @@ typedef union VTDInvDesc VTDInvDesc;
> >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> >  #define VTD_SPTE_PAGE_L4_RSVD_MASK(aw) \
> >          (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > +#define VTD_SPTE_PAGE_L5_RSVD_MASK(aw) \
> > +        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> >  #define VTD_SPTE_LPAGE_L1_RSVD_MASK(aw) \
> >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> >  #define VTD_SPTE_LPAGE_L2_RSVD_MASK(aw) \
> > @@ -387,6 +391,8 @@ typedef union VTDInvDesc VTDInvDesc;
> >          (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> >  #define VTD_SPTE_LPAGE_L4_RSVD_MASK(aw) \
> >          (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > +#define VTD_SPTE_LPAGE_L5_RSVD_MASK(aw) \
> > +        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> >  
> >  /* Information about page-selective IOTLB invalidate */
> >  struct VTDIOTLBPageInvInfo {
> > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > index 820451c..7474c4f 100644
> > --- a/include/hw/i386/intel_iommu.h
> > +++ b/include/hw/i386/intel_iommu.h
> > @@ -49,6 +49,7 @@
> >  #define DMAR_REG_SIZE               0x230
> >  #define VTD_AW_39BIT                39
> >  #define VTD_AW_48BIT                48
> > +#define VTD_AW_57BIT                57
> >  #define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
> >  #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
> >  
> 
> 

B.R.
Yu


* Re: [Qemu-devel] [PATCH v3 2/2] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.
  2018-12-18  9:47     ` Yu Zhang
@ 2018-12-18 10:01       ` Yu Zhang
  2018-12-18 12:43         ` Michael S. Tsirkin
  0 siblings, 1 reply; 57+ messages in thread
From: Yu Zhang @ 2018-12-18 10:01 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Eduardo Habkost, Michael S. Tsirkin, qemu-devel, Peter Xu,
	Paolo Bonzini, Richard Henderson

On Tue, Dec 18, 2018 at 05:47:14PM +0800, Yu Zhang wrote:
> On Mon, Dec 17, 2018 at 02:29:02PM +0100, Igor Mammedov wrote:
> > On Wed, 12 Dec 2018 21:05:39 +0800
> > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > 
> > > A 5-level paging capable VM may choose to use 57-bit IOVA address width.
> > > E.g. guest applications may prefer to use its VA as IOVA when performing
> > > VFIO map/unmap operations, to avoid the burden of managing the IOVA space.
> > > 
> > > This patch extends the current vIOMMU logic to cover the extended address
> > > width. When creating a VM with 5-level paging feature, one can choose to
> > > create a virtual VTD with 5-level paging capability, with configurations
> > > like "-device intel-iommu,x-aw-bits=57".
> > > 
> > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > ---
> > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> > > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > > Cc: Richard Henderson <rth@twiddle.net>
> > > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > > Cc: Peter Xu <peterx@redhat.com>
> > > ---
> > >  hw/i386/intel_iommu.c          | 53 ++++++++++++++++++++++++++++++++----------
> > >  hw/i386/intel_iommu_internal.h | 10 ++++++--
> > >  include/hw/i386/intel_iommu.h  |  1 +
> > >  3 files changed, 50 insertions(+), 14 deletions(-)
> > > 
> > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > index 0e88c63..871110c 100644
> > > --- a/hw/i386/intel_iommu.c
> > > +++ b/hw/i386/intel_iommu.c
> > > @@ -664,16 +664,16 @@ static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce,
> > >  
> > >  /*
> > >   * Rsvd field masks for spte:
> > > - *     Index [1] to [4] 4k pages
> > > - *     Index [5] to [8] large pages
> > > + *     Index [1] to [5] 4k pages
> > > + *     Index [6] to [10] large pages
> > >   */
> > > -static uint64_t vtd_paging_entry_rsvd_field[9];
> > > +static uint64_t vtd_paging_entry_rsvd_field[11];
> > >  
> > >  static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > >  {
> > >      if (slpte & VTD_SL_PT_PAGE_SIZE_MASK) {
> > >          /* Maybe large page */
> > > -        return slpte & vtd_paging_entry_rsvd_field[level + 4];
> > > +        return slpte & vtd_paging_entry_rsvd_field[level + 5];
> > >      } else {
> > >          return slpte & vtd_paging_entry_rsvd_field[level];
> > >      }
> > > @@ -3127,6 +3127,8 @@ static void vtd_init(IntelIOMMUState *s)
> > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > >      if (s->aw_bits == VTD_AW_48BIT) {
> > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > > +    } else if (s->aw_bits == VTD_AW_57BIT) {
> > > +        s->cap |= VTD_CAP_SAGAW_57bit | VTD_CAP_SAGAW_48bit;
> > >      }
> > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > >      s->haw_bits = cpu->phys_bits;
> > > @@ -3139,10 +3141,12 @@ static void vtd_init(IntelIOMMUState *s)
> > >      vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
> > >      vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
> > >      vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
> > > -    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > -    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > -    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > -    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > +    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_PAGE_L5_RSVD_MASK(s->haw_bits);
> > > +    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > +    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > +    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > +    vtd_paging_entry_rsvd_field[9] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > +    vtd_paging_entry_rsvd_field[10] = VTD_SPTE_LPAGE_L5_RSVD_MASK(s->haw_bits);
> > >  
> > >      if (x86_iommu->intr_supported) {
> > >          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
> > > @@ -3241,6 +3245,23 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
> > >      return &vtd_as->as;
> > >  }
> > >  
> > > +static bool host_has_la57(void)
> > > +{
> > > +    uint32_t ecx, unused;
> > > +
> > > +    host_cpuid(7, 0, &unused, &unused, &ecx, &unused);
> > > +    return ecx & CPUID_7_0_ECX_LA57;
> > > +}
> > > +
> > > +static bool guest_has_la57(void)
> > > +{
> > > +    CPUState *cs = first_cpu;
> > > +    X86CPU *cpu = X86_CPU(cs);
> > > +    CPUX86State *env = &cpu->env;
> > > +
> > > +    return env->features[FEAT_7_0_ECX] & CPUID_7_0_ECX_LA57;
> > > +}
> > This is another direct access to CPU fields. I'd suggest setting this
> > value when the IOMMU is created, i.e. add an 'la57' property and set
> > it from the IOMMU's owner.
> > 
> 
> Sorry, do you mean "-device intel-iommu,la57"? I think we do not need
> that, because a 5-level capable vIOMMU can be created with properties
> like "-device intel-iommu,x-aw-bits=57". 
> 
> The guest CPU fields are checked to make sure the VM has LA57 CPU feature,
> because I believe there shall be no 5-level IOMMU on platforms without LA57
> CPUs. 
> 
> > >  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > >  {
> > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > @@ -3267,11 +3288,19 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > >          }
> > >      }
> > >  
> > > -    /* Currently only address widths supported are 39 and 48 bits */
> > > +    /* Currently address widths supported are 39, 48, and 57 bits */
> > >      if ((s->aw_bits != VTD_AW_39BIT) &&
> > > -        (s->aw_bits != VTD_AW_48BIT)) {
> > > -        error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > > -                   VTD_AW_39BIT, VTD_AW_48BIT);
> > > +        (s->aw_bits != VTD_AW_48BIT) &&
> > > +        (s->aw_bits != VTD_AW_57BIT)) {
> > > +        error_setg(errp, "Supported values for x-aw-bits are: %d, %d, %d",
> > > +                   VTD_AW_39BIT, VTD_AW_48BIT, VTD_AW_57BIT);
> > > +        return false;
> > > +    }
> > > +
> > > +    if ((s->aw_bits == VTD_AW_57BIT) &&
> > > +        !(host_has_la57() && guest_has_la57())) {
> > Does iommu supposed to work in TCG mode?
> > If yes then why it should care about host_has_la57()?
> > 
> 
> Hmm... I did not take TCG mode into consideration. And host_has_la57() is
> used to guarantee the host have la57 feature so that iommu shadowing works
> for device assignment.
> 
> I guess iommu shall work in TCG mode(though I am not quite sure about this).
> But I do not have any usage case of a 5-level vIOMMU in TCG in mind. So maybe
> we can:
> 1> check the 'ms->accel' in vtd_decide_config() and do not care about host
> capability if it is TCG.

For choice 1, kvm_enabled() might be used instead of ms->accel. Thanks Peter
for the reminder. :)

> 2> Or, we can choose to keep as it is, and add the check when 5-level paging
> vIOMMU does have usage in TCG?
> 
> But as to the check of guest capability, I still believe it is necessary. As
> said, a VM without LA57 feature shall not see a VT-d with 5-level IOMMU.
> 
> > > +        error_setg(errp, "Do not support 57-bit DMA address, unless both "
> > > +                         "host and guest are capable of 5-level paging");
> > >          return false;
> > >      }
> > >  
> > > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> > > index d084099..2b29b6f 100644
> > > --- a/hw/i386/intel_iommu_internal.h
> > > +++ b/hw/i386/intel_iommu_internal.h
> > > @@ -114,8 +114,8 @@
> > >                                       VTD_INTERRUPT_ADDR_FIRST + 1)
> > >  
> > >  /* The shift of source_id in the key of IOTLB hash table */
> > > -#define VTD_IOTLB_SID_SHIFT         36
> > > -#define VTD_IOTLB_LVL_SHIFT         52
> > > +#define VTD_IOTLB_SID_SHIFT         45
> > > +#define VTD_IOTLB_LVL_SHIFT         61
> > >  #define VTD_IOTLB_MAX_SIZE          1024    /* Max size of the hash table */
> > >  
> > >  /* IOTLB_REG */
> > > @@ -212,6 +212,8 @@
> > >  #define VTD_CAP_SAGAW_39bit         (0x2ULL << VTD_CAP_SAGAW_SHIFT)
> > >   /* 48-bit AGAW, 4-level page-table */
> > >  #define VTD_CAP_SAGAW_48bit         (0x4ULL << VTD_CAP_SAGAW_SHIFT)
> > > + /* 57-bit AGAW, 5-level page-table */
> > > +#define VTD_CAP_SAGAW_57bit         (0x8ULL << VTD_CAP_SAGAW_SHIFT)
> > >  
> > >  /* IQT_REG */
> > >  #define VTD_IQT_QT(val)             (((val) >> 4) & 0x7fffULL)
> > > @@ -379,6 +381,8 @@ typedef union VTDInvDesc VTDInvDesc;
> > >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > >  #define VTD_SPTE_PAGE_L4_RSVD_MASK(aw) \
> > >          (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > +#define VTD_SPTE_PAGE_L5_RSVD_MASK(aw) \
> > > +        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > >  #define VTD_SPTE_LPAGE_L1_RSVD_MASK(aw) \
> > >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > >  #define VTD_SPTE_LPAGE_L2_RSVD_MASK(aw) \
> > > @@ -387,6 +391,8 @@ typedef union VTDInvDesc VTDInvDesc;
> > >          (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > >  #define VTD_SPTE_LPAGE_L4_RSVD_MASK(aw) \
> > >          (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > +#define VTD_SPTE_LPAGE_L5_RSVD_MASK(aw) \
> > > +        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > >  
> > >  /* Information about page-selective IOTLB invalidate */
> > >  struct VTDIOTLBPageInvInfo {
> > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > index 820451c..7474c4f 100644
> > > --- a/include/hw/i386/intel_iommu.h
> > > +++ b/include/hw/i386/intel_iommu.h
> > > @@ -49,6 +49,7 @@
> > >  #define DMAR_REG_SIZE               0x230
> > >  #define VTD_AW_39BIT                39
> > >  #define VTD_AW_48BIT                48
> > > +#define VTD_AW_57BIT                57
> > >  #define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
> > >  #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
> > >  
> > 
> > 
> 
> B.R.
> Yu
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 2/2] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.
  2018-12-18 10:01       ` Yu Zhang
@ 2018-12-18 12:43         ` Michael S. Tsirkin
  2018-12-18 13:45           ` Yu Zhang
  0 siblings, 1 reply; 57+ messages in thread
From: Michael S. Tsirkin @ 2018-12-18 12:43 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Igor Mammedov, Eduardo Habkost, qemu-devel, Peter Xu,
	Paolo Bonzini, Richard Henderson

On Tue, Dec 18, 2018 at 06:01:16PM +0800, Yu Zhang wrote:
> On Tue, Dec 18, 2018 at 05:47:14PM +0800, Yu Zhang wrote:
> > On Mon, Dec 17, 2018 at 02:29:02PM +0100, Igor Mammedov wrote:
> > > On Wed, 12 Dec 2018 21:05:39 +0800
> > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > 
> > > > A 5-level paging capable VM may choose to use 57-bit IOVA address width.
> > > > E.g. guest applications may prefer to use its VA as IOVA when performing
> > > > VFIO map/unmap operations, to avoid the burden of managing the IOVA space.
> > > > 
> > > > This patch extends the current vIOMMU logic to cover the extended address
> > > > width. When creating a VM with 5-level paging feature, one can choose to
> > > > create a virtual VTD with 5-level paging capability, with configurations
> > > > like "-device intel-iommu,x-aw-bits=57".
> > > > 
> > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > > ---
> > > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > > Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> > > > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > > > Cc: Richard Henderson <rth@twiddle.net>
> > > > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > > > Cc: Peter Xu <peterx@redhat.com>
> > > > ---
> > > >  hw/i386/intel_iommu.c          | 53 ++++++++++++++++++++++++++++++++----------
> > > >  hw/i386/intel_iommu_internal.h | 10 ++++++--
> > > >  include/hw/i386/intel_iommu.h  |  1 +
> > > >  3 files changed, 50 insertions(+), 14 deletions(-)
> > > > 
> > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > index 0e88c63..871110c 100644
> > > > --- a/hw/i386/intel_iommu.c
> > > > +++ b/hw/i386/intel_iommu.c
> > > > @@ -664,16 +664,16 @@ static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce,
> > > >  
> > > >  /*
> > > >   * Rsvd field masks for spte:
> > > > - *     Index [1] to [4] 4k pages
> > > > - *     Index [5] to [8] large pages
> > > > + *     Index [1] to [5] 4k pages
> > > > + *     Index [6] to [10] large pages
> > > >   */
> > > > -static uint64_t vtd_paging_entry_rsvd_field[9];
> > > > +static uint64_t vtd_paging_entry_rsvd_field[11];
> > > >  
> > > >  static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > > >  {
> > > >      if (slpte & VTD_SL_PT_PAGE_SIZE_MASK) {
> > > >          /* Maybe large page */
> > > > -        return slpte & vtd_paging_entry_rsvd_field[level + 4];
> > > > +        return slpte & vtd_paging_entry_rsvd_field[level + 5];
> > > >      } else {
> > > >          return slpte & vtd_paging_entry_rsvd_field[level];
> > > >      }
> > > > @@ -3127,6 +3127,8 @@ static void vtd_init(IntelIOMMUState *s)
> > > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > >      if (s->aw_bits == VTD_AW_48BIT) {
> > > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > > > +    } else if (s->aw_bits == VTD_AW_57BIT) {
> > > > +        s->cap |= VTD_CAP_SAGAW_57bit | VTD_CAP_SAGAW_48bit;
> > > >      }
> > > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > >      s->haw_bits = cpu->phys_bits;
> > > > @@ -3139,10 +3141,12 @@ static void vtd_init(IntelIOMMUState *s)
> > > >      vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
> > > >      vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
> > > >      vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
> > > > -    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > -    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > -    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > -    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > +    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_PAGE_L5_RSVD_MASK(s->haw_bits);
> > > > +    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > +    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > +    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > +    vtd_paging_entry_rsvd_field[9] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > +    vtd_paging_entry_rsvd_field[10] = VTD_SPTE_LPAGE_L5_RSVD_MASK(s->haw_bits);
> > > >  
> > > >      if (x86_iommu->intr_supported) {
> > > >          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
> > > > @@ -3241,6 +3245,23 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
> > > >      return &vtd_as->as;
> > > >  }
> > > >  
> > > > +static bool host_has_la57(void)
> > > > +{
> > > > +    uint32_t ecx, unused;
> > > > +
> > > > +    host_cpuid(7, 0, &unused, &unused, &ecx, &unused);
> > > > +    return ecx & CPUID_7_0_ECX_LA57;
> > > > +}
> > > > +
> > > > +static bool guest_has_la57(void)
> > > > +{
> > > > +    CPUState *cs = first_cpu;
> > > > +    X86CPU *cpu = X86_CPU(cs);
> > > > +    CPUX86State *env = &cpu->env;
> > > > +
> > > > +    return env->features[FEAT_7_0_ECX] & CPUID_7_0_ECX_LA57;
> > > > +}
> > > another direct access to CPU fields,
> > > I'd suggest to set this value when iommu is created
> > > i.e. add 'la57' property and set from iommu owner.
> > > 
> > 
> > Sorry, do you mean "-device intel-iommu,la57"? I think we do not need
> > that, because a 5-level capable vIOMMU can be created with properties
> > like "-device intel-iommu,x-aw-bits=57". 
> > 
> > The guest CPU fields are checked to make sure the VM has LA57 CPU feature,
> > because I believe there shall be no 5-level IOMMU on platforms without LA57
> > CPUs. 

I don't necessarily see why these need to be connected.
If they do, please add a code comment to explain.



> > > >  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > >  {
> > > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > > @@ -3267,11 +3288,19 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > >          }
> > > >      }
> > > >  
> > > > -    /* Currently only address widths supported are 39 and 48 bits */
> > > > +    /* Currently address widths supported are 39, 48, and 57 bits */
> > > >      if ((s->aw_bits != VTD_AW_39BIT) &&
> > > > -        (s->aw_bits != VTD_AW_48BIT)) {
> > > > -        error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > > > -                   VTD_AW_39BIT, VTD_AW_48BIT);
> > > > +        (s->aw_bits != VTD_AW_48BIT) &&
> > > > +        (s->aw_bits != VTD_AW_57BIT)) {
> > > > +        error_setg(errp, "Supported values for x-aw-bits are: %d, %d, %d",
> > > > +                   VTD_AW_39BIT, VTD_AW_48BIT, VTD_AW_57BIT);
> > > > +        return false;
> > > > +    }
> > > > +
> > > > +    if ((s->aw_bits == VTD_AW_57BIT) &&
> > > > +        !(host_has_la57() && guest_has_la57())) {
> > > Does iommu supposed to work in TCG mode?
> > > If yes then why it should care about host_has_la57()?
> > > 
> > 
> > Hmm... I did not take TCG mode into consideration. And host_has_la57() is
> > used to guarantee the host have la57 feature so that iommu shadowing works
> > for device assignment.
> > 
> > I guess iommu shall work in TCG mode(though I am not quite sure about this).
> > But I do not have any usage case of a 5-level vIOMMU in TCG in mind. So maybe
> > we can:
> > 1> check the 'ms->accel' in vtd_decide_config() and do not care about host
> > capability if it is TCG.
> 
> For choice 1, kvm_enabled() might be used instead of ms->accel. Thanks Peter
> for the remind. :)


This needs a big comment with an explanation though.
And probably a TODO to make it work under TCG ...

> > 2> Or, we can choose to keep as it is, and add the check when 5-level paging
> > vIOMMU does have usage in TCG?
> > 
> > But as to the check of guest capability, I still believe it is necessary. As
> > said, a VM without LA57 feature shall not see a VT-d with 5-level IOMMU.
> > 
> > > > +        error_setg(errp, "Do not support 57-bit DMA address, unless both "
> > > > +                         "host and guest are capable of 5-level paging");
> > > >          return false;
> > > >      }
> > > >  
> > > > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> > > > index d084099..2b29b6f 100644
> > > > --- a/hw/i386/intel_iommu_internal.h
> > > > +++ b/hw/i386/intel_iommu_internal.h
> > > > @@ -114,8 +114,8 @@
> > > >                                       VTD_INTERRUPT_ADDR_FIRST + 1)
> > > >  
> > > >  /* The shift of source_id in the key of IOTLB hash table */
> > > > -#define VTD_IOTLB_SID_SHIFT         36
> > > > -#define VTD_IOTLB_LVL_SHIFT         52
> > > > +#define VTD_IOTLB_SID_SHIFT         45
> > > > +#define VTD_IOTLB_LVL_SHIFT         61
> > > >  #define VTD_IOTLB_MAX_SIZE          1024    /* Max size of the hash table */
> > > >  
> > > >  /* IOTLB_REG */
> > > > @@ -212,6 +212,8 @@
> > > >  #define VTD_CAP_SAGAW_39bit         (0x2ULL << VTD_CAP_SAGAW_SHIFT)
> > > >   /* 48-bit AGAW, 4-level page-table */
> > > >  #define VTD_CAP_SAGAW_48bit         (0x4ULL << VTD_CAP_SAGAW_SHIFT)
> > > > + /* 57-bit AGAW, 5-level page-table */
> > > > +#define VTD_CAP_SAGAW_57bit         (0x8ULL << VTD_CAP_SAGAW_SHIFT)
> > > >  
> > > >  /* IQT_REG */
> > > >  #define VTD_IQT_QT(val)             (((val) >> 4) & 0x7fffULL)
> > > > @@ -379,6 +381,8 @@ typedef union VTDInvDesc VTDInvDesc;
> > > >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > >  #define VTD_SPTE_PAGE_L4_RSVD_MASK(aw) \
> > > >          (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > +#define VTD_SPTE_PAGE_L5_RSVD_MASK(aw) \
> > > > +        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > >  #define VTD_SPTE_LPAGE_L1_RSVD_MASK(aw) \
> > > >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > >  #define VTD_SPTE_LPAGE_L2_RSVD_MASK(aw) \
> > > > @@ -387,6 +391,8 @@ typedef union VTDInvDesc VTDInvDesc;
> > > >          (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > >  #define VTD_SPTE_LPAGE_L4_RSVD_MASK(aw) \
> > > >          (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > +#define VTD_SPTE_LPAGE_L5_RSVD_MASK(aw) \
> > > > +        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > >  
> > > >  /* Information about page-selective IOTLB invalidate */
> > > >  struct VTDIOTLBPageInvInfo {
> > > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > > index 820451c..7474c4f 100644
> > > > --- a/include/hw/i386/intel_iommu.h
> > > > +++ b/include/hw/i386/intel_iommu.h
> > > > @@ -49,6 +49,7 @@
> > > >  #define DMAR_REG_SIZE               0x230
> > > >  #define VTD_AW_39BIT                39
> > > >  #define VTD_AW_48BIT                48
> > > > +#define VTD_AW_57BIT                57
> > > >  #define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
> > > >  #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
> > > >  
> > > 
> > > 
> > 
> > B.R.
> > Yu
> > 


* Re: [Qemu-devel] [PATCH v3 2/2] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.
  2018-12-18 12:43         ` Michael S. Tsirkin
@ 2018-12-18 13:45           ` Yu Zhang
  2018-12-18 14:49             ` Michael S. Tsirkin
  0 siblings, 1 reply; 57+ messages in thread
From: Yu Zhang @ 2018-12-18 13:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Igor Mammedov, Eduardo Habkost, qemu-devel, Peter Xu,
	Paolo Bonzini, Richard Henderson

On Tue, Dec 18, 2018 at 07:43:28AM -0500, Michael S. Tsirkin wrote:
> On Tue, Dec 18, 2018 at 06:01:16PM +0800, Yu Zhang wrote:
> > On Tue, Dec 18, 2018 at 05:47:14PM +0800, Yu Zhang wrote:
> > > On Mon, Dec 17, 2018 at 02:29:02PM +0100, Igor Mammedov wrote:
> > > > On Wed, 12 Dec 2018 21:05:39 +0800
> > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > > 
> > > > > A 5-level paging capable VM may choose to use 57-bit IOVA address width.
> > > > > E.g. guest applications may prefer to use its VA as IOVA when performing
> > > > > VFIO map/unmap operations, to avoid the burden of managing the IOVA space.
> > > > > 
> > > > > This patch extends the current vIOMMU logic to cover the extended address
> > > > > width. When creating a VM with 5-level paging feature, one can choose to
> > > > > create a virtual VTD with 5-level paging capability, with configurations
> > > > > like "-device intel-iommu,x-aw-bits=57".
> > > > > 
> > > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > > > ---
> > > > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > > > Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> > > > > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > > > > Cc: Richard Henderson <rth@twiddle.net>
> > > > > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > > > > Cc: Peter Xu <peterx@redhat.com>
> > > > > ---
> > > > >  hw/i386/intel_iommu.c          | 53 ++++++++++++++++++++++++++++++++----------
> > > > >  hw/i386/intel_iommu_internal.h | 10 ++++++--
> > > > >  include/hw/i386/intel_iommu.h  |  1 +
> > > > >  3 files changed, 50 insertions(+), 14 deletions(-)
> > > > > 
> > > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > > index 0e88c63..871110c 100644
> > > > > --- a/hw/i386/intel_iommu.c
> > > > > +++ b/hw/i386/intel_iommu.c
> > > > > @@ -664,16 +664,16 @@ static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce,
> > > > >  
> > > > >  /*
> > > > >   * Rsvd field masks for spte:
> > > > > - *     Index [1] to [4] 4k pages
> > > > > - *     Index [5] to [8] large pages
> > > > > + *     Index [1] to [5] 4k pages
> > > > > + *     Index [6] to [10] large pages
> > > > >   */
> > > > > -static uint64_t vtd_paging_entry_rsvd_field[9];
> > > > > +static uint64_t vtd_paging_entry_rsvd_field[11];
> > > > >  
> > > > >  static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > > > >  {
> > > > >      if (slpte & VTD_SL_PT_PAGE_SIZE_MASK) {
> > > > >          /* Maybe large page */
> > > > > -        return slpte & vtd_paging_entry_rsvd_field[level + 4];
> > > > > +        return slpte & vtd_paging_entry_rsvd_field[level + 5];
> > > > >      } else {
> > > > >          return slpte & vtd_paging_entry_rsvd_field[level];
> > > > >      }
> > > > > @@ -3127,6 +3127,8 @@ static void vtd_init(IntelIOMMUState *s)
> > > > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > > >      if (s->aw_bits == VTD_AW_48BIT) {
> > > > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > > > > +    } else if (s->aw_bits == VTD_AW_57BIT) {
> > > > > +        s->cap |= VTD_CAP_SAGAW_57bit | VTD_CAP_SAGAW_48bit;
> > > > >      }
> > > > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > > >      s->haw_bits = cpu->phys_bits;
> > > > > @@ -3139,10 +3141,12 @@ static void vtd_init(IntelIOMMUState *s)
> > > > >      vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
> > > > >      vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
> > > > >      vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > -    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > -    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > -    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > -    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > +    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_PAGE_L5_RSVD_MASK(s->haw_bits);
> > > > > +    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > +    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > +    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > +    vtd_paging_entry_rsvd_field[9] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > +    vtd_paging_entry_rsvd_field[10] = VTD_SPTE_LPAGE_L5_RSVD_MASK(s->haw_bits);
> > > > >  
> > > > >      if (x86_iommu->intr_supported) {
> > > > >          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
> > > > > @@ -3241,6 +3245,23 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
> > > > >      return &vtd_as->as;
> > > > >  }
> > > > >  
> > > > > +static bool host_has_la57(void)
> > > > > +{
> > > > > +    uint32_t ecx, unused;
> > > > > +
> > > > > +    host_cpuid(7, 0, &unused, &unused, &ecx, &unused);
> > > > > +    return ecx & CPUID_7_0_ECX_LA57;
> > > > > +}
> > > > > +
> > > > > +static bool guest_has_la57(void)
> > > > > +{
> > > > > +    CPUState *cs = first_cpu;
> > > > > +    X86CPU *cpu = X86_CPU(cs);
> > > > > +    CPUX86State *env = &cpu->env;
> > > > > +
> > > > > +    return env->features[FEAT_7_0_ECX] & CPUID_7_0_ECX_LA57;
> > > > > +}
> > > > another direct access to CPU fields,
> > > > I'd suggest to set this value when iommu is created
> > > > i.e. add 'la57' property and set from iommu owner.
> > > > 
> > > 
> > > Sorry, do you mean "-device intel-iommu,la57"? I think we do not need
> > > that, because a 5-level capable vIOMMU can be created with properties
> > > like "-device intel-iommu,x-aw-bits=57". 
> > > 
> > > The guest CPU fields are checked to make sure the VM has LA57 CPU feature,
> > > because I believe there shall be no 5-level IOMMU on platforms without LA57
> > > CPUs. 
> 
> I don't necessarily see why these need to be connected.
> If yes pls add code to explain.

Sorry, do you mean the VM shall be able to see a 5-level IOMMU even if it does
not have the LA57 feature? At first I did not see any direct connection when
asked to enable a 5-level vIOMMU, but I was told (and verified) that DPDK in the
VM may choose a VA value as an IOVA. So if the guest has LA57, we should create
a 5-level vIOMMU for the VM. But if the VM does not even have LA57, is there any
specific reason we should give it a 5-level vIOMMU?

> 
> 
> > > > >  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > > >  {
> > > > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > > > @@ -3267,11 +3288,19 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > > >          }
> > > > >      }
> > > > >  
> > > > > -    /* Currently only address widths supported are 39 and 48 bits */
> > > > > +    /* Currently address widths supported are 39, 48, and 57 bits */
> > > > >      if ((s->aw_bits != VTD_AW_39BIT) &&
> > > > > -        (s->aw_bits != VTD_AW_48BIT)) {
> > > > > -        error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > > > > -                   VTD_AW_39BIT, VTD_AW_48BIT);
> > > > > +        (s->aw_bits != VTD_AW_48BIT) &&
> > > > > +        (s->aw_bits != VTD_AW_57BIT)) {
> > > > > +        error_setg(errp, "Supported values for x-aw-bits are: %d, %d, %d",
> > > > > +                   VTD_AW_39BIT, VTD_AW_48BIT, VTD_AW_57BIT);
> > > > > +        return false;
> > > > > +    }
> > > > > +
> > > > > +    if ((s->aw_bits == VTD_AW_57BIT) &&
> > > > > +        !(host_has_la57() && guest_has_la57())) {
> > > > Does iommu supposed to work in TCG mode?
> > > > If yes then why it should care about host_has_la57()?
> > > > 
> > > 
> > > Hmm... I did not take TCG mode into consideration. And host_has_la57() is
> > > used to guarantee the host have la57 feature so that iommu shadowing works
> > > for device assignment.
> > > 
> > > I guess iommu shall work in TCG mode(though I am not quite sure about this).
> > > But I do not have any usage case of a 5-level vIOMMU in TCG in mind. So maybe
> > > we can:
> > > 1> check the 'ms->accel' in vtd_decide_config() and do not care about host
> > > capability if it is TCG.
> > 
> > For choice 1, kvm_enabled() might be used instead of ms->accel. Thanks Peter
> > for the remind. :)
> 
> 
> This needs a big comment with an explanation though.
> And probably a TODO to make it work under TCG ...
> 

Thanks, Michael. For choice 1, I believe it should work for TCG (though it will
need testing), and the condition would be something like:

    if ((s->aw_bits == VTD_AW_57BIT) &&
        kvm_enabled() &&
        !host_has_la57()) {

As you can see, though I removed the check of guest_has_la57(), I still kept the
check against the host when KVM is enabled. I'm still open to being convinced of
any reason why we do not need the guest check. :)

> > > 2> Or, we can choose to keep as it is, and add the check when 5-level paging
> > > vIOMMU does have usage in TCG?
> > > 
> > > But as to the check of guest capability, I still believe it is necessary. As
> > > said, a VM without LA57 feature shall not see a VT-d with 5-level IOMMU.
> > > 
> > > > > +        error_setg(errp, "Do not support 57-bit DMA address, unless both "
> > > > > +                         "host and guest are capable of 5-level paging");
> > > > >          return false;
> > > > >      }
> > > > >  
> > > > > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> > > > > index d084099..2b29b6f 100644
> > > > > --- a/hw/i386/intel_iommu_internal.h
> > > > > +++ b/hw/i386/intel_iommu_internal.h
> > > > > @@ -114,8 +114,8 @@
> > > > >                                       VTD_INTERRUPT_ADDR_FIRST + 1)
> > > > >  
> > > > >  /* The shift of source_id in the key of IOTLB hash table */
> > > > > -#define VTD_IOTLB_SID_SHIFT         36
> > > > > -#define VTD_IOTLB_LVL_SHIFT         52
> > > > > +#define VTD_IOTLB_SID_SHIFT         45
> > > > > +#define VTD_IOTLB_LVL_SHIFT         61
> > > > >  #define VTD_IOTLB_MAX_SIZE          1024    /* Max size of the hash table */
> > > > >  
> > > > >  /* IOTLB_REG */
> > > > > @@ -212,6 +212,8 @@
> > > > >  #define VTD_CAP_SAGAW_39bit         (0x2ULL << VTD_CAP_SAGAW_SHIFT)
> > > > >   /* 48-bit AGAW, 4-level page-table */
> > > > >  #define VTD_CAP_SAGAW_48bit         (0x4ULL << VTD_CAP_SAGAW_SHIFT)
> > > > > + /* 57-bit AGAW, 5-level page-table */
> > > > > +#define VTD_CAP_SAGAW_57bit         (0x8ULL << VTD_CAP_SAGAW_SHIFT)
> > > > >  
> > > > >  /* IQT_REG */
> > > > >  #define VTD_IQT_QT(val)             (((val) >> 4) & 0x7fffULL)
> > > > > @@ -379,6 +381,8 @@ typedef union VTDInvDesc VTDInvDesc;
> > > > >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > >  #define VTD_SPTE_PAGE_L4_RSVD_MASK(aw) \
> > > > >          (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > +#define VTD_SPTE_PAGE_L5_RSVD_MASK(aw) \
> > > > > +        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > >  #define VTD_SPTE_LPAGE_L1_RSVD_MASK(aw) \
> > > > >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > >  #define VTD_SPTE_LPAGE_L2_RSVD_MASK(aw) \
> > > > > @@ -387,6 +391,8 @@ typedef union VTDInvDesc VTDInvDesc;
> > > > >          (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > >  #define VTD_SPTE_LPAGE_L4_RSVD_MASK(aw) \
> > > > >          (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > +#define VTD_SPTE_LPAGE_L5_RSVD_MASK(aw) \
> > > > > +        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > >  
> > > > >  /* Information about page-selective IOTLB invalidate */
> > > > >  struct VTDIOTLBPageInvInfo {
> > > > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > > > index 820451c..7474c4f 100644
> > > > > --- a/include/hw/i386/intel_iommu.h
> > > > > +++ b/include/hw/i386/intel_iommu.h
> > > > > @@ -49,6 +49,7 @@
> > > > >  #define DMAR_REG_SIZE               0x230
> > > > >  #define VTD_AW_39BIT                39
> > > > >  #define VTD_AW_48BIT                48
> > > > > +#define VTD_AW_57BIT                57
> > > > >  #define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
> > > > >  #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
> > > > >  
> > > > 
> > > > 
> > > 
> > > B.R.
> > > Yu
> > > 

B.R.
Yu


* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-18  9:27     ` Yu Zhang
@ 2018-12-18 14:23       ` Michael S. Tsirkin
  2018-12-18 14:55       ` Igor Mammedov
  2018-12-20 20:58       ` Eduardo Habkost
  2 siblings, 0 replies; 57+ messages in thread
From: Michael S. Tsirkin @ 2018-12-18 14:23 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Igor Mammedov, Eduardo Habkost, qemu-devel, Peter Xu,
	Paolo Bonzini, Richard Henderson

On Tue, Dec 18, 2018 at 05:27:23PM +0800, Yu Zhang wrote:
> On Mon, Dec 17, 2018 at 02:17:40PM +0100, Igor Mammedov wrote:
> > On Wed, 12 Dec 2018 21:05:38 +0800
> > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > 
> > > Currently, vIOMMU is using the value of IOVA address width, instead of
> > > the host address width(HAW) to calculate the number of reserved bits in
> > > data structures such as root entries, context entries, and entries of
> > > DMA paging structures etc.
> > > 
> > > However values of IOVA address width and of the HAW may not equal. For
> > > example, a 48-bit IOVA can only be mapped to host addresses no wider than
> > > 46 bits. Using 48, instead of 46 to calculate the reserved bit may result
> > > in an invalid IOVA being accepted.
> > > 
> > > To fix this, a new field - haw_bits is introduced in struct IntelIOMMUState,
> > > whose value is initialized based on the maximum physical address set to
> > > guest CPU.
> > 
> > > Also, definitions such as VTD_HOST_AW_39/48BIT etc. are renamed
> > > to clarify.
> > > 
> > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > ---
> > [...]
> > 
> > > @@ -3100,6 +3104,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> > >  static void vtd_init(IntelIOMMUState *s)
> > >  {
> > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > +    CPUState *cs = first_cpu;
> > > +    X86CPU *cpu = X86_CPU(cs);
> > >  
> > >      memset(s->csr, 0, DMAR_REG_SIZE);
> > >      memset(s->wmask, 0, DMAR_REG_SIZE);
> > > @@ -3119,23 +3125,24 @@ static void vtd_init(IntelIOMMUState *s)
> > >      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
> > >               VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > -    if (s->aw_bits == VTD_HOST_AW_48BIT) {
> > > +    if (s->aw_bits == VTD_AW_48BIT) {
> > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > >      }
> > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > +    s->haw_bits = cpu->phys_bits;
> > Is it possible to avoid accessing CPU fields directly or cpu altogether
> > and set phys_bits when iommu is created?
> 
> Thanks for your comments, Igor.
> 
> Well, I guess you prefer not to query the CPU capabilities while deciding
> the vIOMMU features. But to me, they are not that irrelevant.:)
> 
> Here the hardware address width in vt-d, and the one in cpuid.MAXPHYSADDR
> are referring to the same concept. In VM, both are the maximum guest physical
> address width. If we do not check the CPU field here, we will still have to
> check the CPU field in other places such as build_dmar_q35(), and reset the
> s->haw_bits again.
> 
> Is this explanation convincing enough? :)

So what happens if these don't match? I guess the guest can configure the
vtd to put data into some memory which isn't then accessible to
the cpu, or the cpu can use some memory not accessible to devices.

I guess some guests might be confused - is this what you
observe? If so, some comments that tell people which
guests get confused would be nice. Is Windows happy? Is Linux happy?



> > 
> > Perhaps Eduardo
> >  can suggest better approach, since he's more familiar with phys_bits topic
> 
> @Eduardo, any comments? Thanks!
> 
> > 
> > >      /*
> > >       * Rsvd field masks for spte
> > >       */
> > >      vtd_paging_entry_rsvd_field[0] = ~0ULL;
> > > -    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->aw_bits);
> > > -    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->aw_bits);
> > > -    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->aw_bits);
> > > -    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->aw_bits);
> > > -    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->aw_bits);
> > > -    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->aw_bits);
> > > -    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->aw_bits);
> > > -    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->aw_bits);
> > > +    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->haw_bits);
> > > +    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
> > > +    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
> > > +    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
> > > +    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > +    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > +    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > +    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > >  
> > >      if (x86_iommu->intr_supported) {
> > >          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
> > > @@ -3261,10 +3268,10 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > >      }
> > >  
> > >      /* Currently only address widths supported are 39 and 48 bits */
> > > -    if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
> > > -        (s->aw_bits != VTD_HOST_AW_48BIT)) {
> > > +    if ((s->aw_bits != VTD_AW_39BIT) &&
> > > +        (s->aw_bits != VTD_AW_48BIT)) {
> > >          error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > > -                   VTD_HOST_AW_39BIT, VTD_HOST_AW_48BIT);
> > > +                   VTD_AW_39BIT, VTD_AW_48BIT);
> > >          return false;
> > >      }
> > >  
> > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > index ed4e758..820451c 100644
> > > --- a/include/hw/i386/intel_iommu.h
> > > +++ b/include/hw/i386/intel_iommu.h
> > > @@ -47,9 +47,9 @@
> > >  #define VTD_SID_TO_DEVFN(sid)       ((sid) & 0xff)
> > >  
> > >  #define DMAR_REG_SIZE               0x230
> > > -#define VTD_HOST_AW_39BIT           39
> > > -#define VTD_HOST_AW_48BIT           48
> > > -#define VTD_HOST_ADDRESS_WIDTH      VTD_HOST_AW_39BIT
> > > +#define VTD_AW_39BIT                39
> > > +#define VTD_AW_48BIT                48
> > > +#define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
> > >  #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
> > >  
> > >  #define DMAR_REPORT_F_INTR          (1)
> > > @@ -244,7 +244,8 @@ struct IntelIOMMUState {
> > >      bool intr_eime;                 /* Extended interrupt mode enabled */
> > >      OnOffAuto intr_eim;             /* Toggle for EIM cabability */
> > >      bool buggy_eim;                 /* Force buggy EIM unless eim=off */
> > > -    uint8_t aw_bits;                /* Host/IOVA address width (in bits) */
> > > +    uint8_t aw_bits;                /* IOVA address width (in bits) */
> > > +    uint8_t haw_bits;               /* Hardware address width (in bits) */
> > >  
> > >      /*
> > >       * Protects IOMMU states in general.  Currently it protects the
> > 
> > 
> 
> B.R.
> Yu

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 2/2] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.
  2018-12-18 13:45           ` Yu Zhang
@ 2018-12-18 14:49             ` Michael S. Tsirkin
  2018-12-19  3:40               ` Yu Zhang
  0 siblings, 1 reply; 57+ messages in thread
From: Michael S. Tsirkin @ 2018-12-18 14:49 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Igor Mammedov, Eduardo Habkost, qemu-devel, Peter Xu,
	Paolo Bonzini, Richard Henderson

On Tue, Dec 18, 2018 at 09:45:41PM +0800, Yu Zhang wrote:
> On Tue, Dec 18, 2018 at 07:43:28AM -0500, Michael S. Tsirkin wrote:
> > On Tue, Dec 18, 2018 at 06:01:16PM +0800, Yu Zhang wrote:
> > > On Tue, Dec 18, 2018 at 05:47:14PM +0800, Yu Zhang wrote:
> > > > On Mon, Dec 17, 2018 at 02:29:02PM +0100, Igor Mammedov wrote:
> > > > > On Wed, 12 Dec 2018 21:05:39 +0800
> > > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > > > 
> > > > > > A 5-level paging capable VM may choose to use 57-bit IOVA address width.
> > > > > > E.g. guest applications may prefer to use its VA as IOVA when performing
> > > > > > VFIO map/unmap operations, to avoid the burden of managing the IOVA space.
> > > > > > 
> > > > > > This patch extends the current vIOMMU logic to cover the extended address
> > > > > > width. When creating a VM with 5-level paging feature, one can choose to
> > > > > > create a virtual VTD with 5-level paging capability, with configurations
> > > > > > like "-device intel-iommu,x-aw-bits=57".
> > > > > > 
> > > > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > > > > ---
> > > > > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > > > > Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> > > > > > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > Cc: Richard Henderson <rth@twiddle.net>
> > > > > > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > > > > > Cc: Peter Xu <peterx@redhat.com>
> > > > > > ---
> > > > > >  hw/i386/intel_iommu.c          | 53 ++++++++++++++++++++++++++++++++----------
> > > > > >  hw/i386/intel_iommu_internal.h | 10 ++++++--
> > > > > >  include/hw/i386/intel_iommu.h  |  1 +
> > > > > >  3 files changed, 50 insertions(+), 14 deletions(-)
> > > > > > 
> > > > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > > > index 0e88c63..871110c 100644
> > > > > > --- a/hw/i386/intel_iommu.c
> > > > > > +++ b/hw/i386/intel_iommu.c
> > > > > > @@ -664,16 +664,16 @@ static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce,
> > > > > >  
> > > > > >  /*
> > > > > >   * Rsvd field masks for spte:
> > > > > > - *     Index [1] to [4] 4k pages
> > > > > > - *     Index [5] to [8] large pages
> > > > > > + *     Index [1] to [5] 4k pages
> > > > > > + *     Index [6] to [10] large pages
> > > > > >   */
> > > > > > -static uint64_t vtd_paging_entry_rsvd_field[9];
> > > > > > +static uint64_t vtd_paging_entry_rsvd_field[11];
> > > > > >  
> > > > > >  static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > > > > >  {
> > > > > >      if (slpte & VTD_SL_PT_PAGE_SIZE_MASK) {
> > > > > >          /* Maybe large page */
> > > > > > -        return slpte & vtd_paging_entry_rsvd_field[level + 4];
> > > > > > +        return slpte & vtd_paging_entry_rsvd_field[level + 5];
> > > > > >      } else {
> > > > > >          return slpte & vtd_paging_entry_rsvd_field[level];
> > > > > >      }
> > > > > > @@ -3127,6 +3127,8 @@ static void vtd_init(IntelIOMMUState *s)
> > > > > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > > > >      if (s->aw_bits == VTD_AW_48BIT) {
> > > > > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > > > > > +    } else if (s->aw_bits == VTD_AW_57BIT) {
> > > > > > +        s->cap |= VTD_CAP_SAGAW_57bit | VTD_CAP_SAGAW_48bit;
> > > > > >      }
> > > > > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > > > >      s->haw_bits = cpu->phys_bits;
> > > > > > @@ -3139,10 +3141,12 @@ static void vtd_init(IntelIOMMUState *s)
> > > > > >      vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > >      vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > >      vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > -    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > > -    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > -    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > -    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > +    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_PAGE_L5_RSVD_MASK(s->haw_bits);
> > > > > > +    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > > +    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > +    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > +    vtd_paging_entry_rsvd_field[9] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > +    vtd_paging_entry_rsvd_field[10] = VTD_SPTE_LPAGE_L5_RSVD_MASK(s->haw_bits);
> > > > > >  
> > > > > >      if (x86_iommu->intr_supported) {
> > > > > >          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
> > > > > > @@ -3241,6 +3245,23 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
> > > > > >      return &vtd_as->as;
> > > > > >  }
> > > > > >  
> > > > > > +static bool host_has_la57(void)
> > > > > > +{
> > > > > > +    uint32_t ecx, unused;
> > > > > > +
> > > > > > +    host_cpuid(7, 0, &unused, &unused, &ecx, &unused);
> > > > > > +    return ecx & CPUID_7_0_ECX_LA57;
> > > > > > +}
> > > > > > +
> > > > > > +static bool guest_has_la57(void)
> > > > > > +{
> > > > > > +    CPUState *cs = first_cpu;
> > > > > > +    X86CPU *cpu = X86_CPU(cs);
> > > > > > +    CPUX86State *env = &cpu->env;
> > > > > > +
> > > > > > +    return env->features[FEAT_7_0_ECX] & CPUID_7_0_ECX_LA57;
> > > > > > +}
> > > > > another direct access to CPU fields,
> > > > > I'd suggest to set this value when iommu is created
> > > > > i.e. add 'la57' property and set from iommu owner.
> > > > > 
> > > > 
> > > > Sorry, do you mean "-device intel-iommu,la57"? I think we do not need
> > > > that, because a 5-level capable vIOMMU can be created with properties
> > > > like "-device intel-iommu,x-aw-bits=57". 
> > > > 
> > > > The guest CPU fields are checked to make sure the VM has LA57 CPU feature,
> > > > because I believe there shall be no 5-level IOMMU on platforms without LA57
> > > > CPUs. 
> > 
> > I don't necessarily see why these need to be connected.
> > If yes pls add code to explain.
> 
> Sorry, do you mean the VM shall be able to see a 5-level IOMMU even it does not
> have LA57 feature? I do not see any direct connection when asked to enable a 5-level
> vIOMMU at first, but I was told(and checked) that DPDK in the VM may choose a VA
> value as an IOVA.

Right but then that doesn't work on all hosts either.

> And if guest has LA57, we should create a 5-level vIOMMU to the VM.
> But if the VM even does not have LA57, any specific reason we should give it a 5-level
> vIOMMU?

So the example you give is VTD address width < CPU aw. That is known
to be problematic for dpdk but not for other software, and maybe dpdk
will learn how to cope. Given that such hosts exist, it might be
useful to support this at least for debugging.

Are there reasons to worry about VTD > CPU?


> > 
> > 
> > > > > >  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > > > >  {
> > > > > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > > > > @@ -3267,11 +3288,19 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > > > >          }
> > > > > >      }
> > > > > >  
> > > > > > -    /* Currently only address widths supported are 39 and 48 bits */
> > > > > > +    /* Currently address widths supported are 39, 48, and 57 bits */
> > > > > >      if ((s->aw_bits != VTD_AW_39BIT) &&
> > > > > > -        (s->aw_bits != VTD_AW_48BIT)) {
> > > > > > -        error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > > > > > -                   VTD_AW_39BIT, VTD_AW_48BIT);
> > > > > > +        (s->aw_bits != VTD_AW_48BIT) &&
> > > > > > +        (s->aw_bits != VTD_AW_57BIT)) {
> > > > > > +        error_setg(errp, "Supported values for x-aw-bits are: %d, %d, %d",
> > > > > > +                   VTD_AW_39BIT, VTD_AW_48BIT, VTD_AW_57BIT);
> > > > > > +        return false;
> > > > > > +    }
> > > > > > +
> > > > > > +    if ((s->aw_bits == VTD_AW_57BIT) &&
> > > > > > +        !(host_has_la57() && guest_has_la57())) {
> > > > > Does iommu supposed to work in TCG mode?
> > > > > If yes then why it should care about host_has_la57()?
> > > > > 
> > > > 
> > > > Hmm... I did not take TCG mode into consideration. And host_has_la57() is
> > > > used to guarantee the host have la57 feature so that iommu shadowing works
> > > > for device assignment.
> > > > 
> > > > I guess iommu shall work in TCG mode(though I am not quite sure about this).
> > > > But I do not have any usage case of a 5-level vIOMMU in TCG in mind. So maybe
> > > > we can:
> > > > 1> check the 'ms->accel' in vtd_decide_config() and do not care about host
> > > > capability if it is TCG.
> > > 
> > > For choice 1, kvm_enabled() might be used instead of ms->accel. Thanks Peter
> > > for the remind. :)
> > 
> > 
> > This needs a big comment with an explanation though.
> > And probably a TODO to make it work under TCG ...
> > 
> 
> Thanks, Michael. For choice 1, I believe it should work for TCG(will need test
> though), and the condition would be sth. like:
> 
>     if ((s->aw_bits == VTD_AW_57BIT) &&
>         kvm_enabled() &&
>         !host_has_la57())  {
> 
> As you can see, though I remove the check of guest_has_la57(), I still kept the
> check against host when KVM is enabled. I'm still ready to be convinced for any
> requirement why we do not need the guest check. :) 


okay but then (repeating myself, sorry) pls add a comment that explains
what happens if you do not add this limitation.


> > > > 2> Or, we can choose to keep as it is, and add the check when 5-level paging
> > > > vIOMMU does have usage in TCG?
> > > > 
> > > > But as to the check of guest capability, I still believe it is necessary. As
> > > > said, a VM without LA57 feature shall not see a VT-d with 5-level IOMMU.
> > > > 
> > > > > > +        error_setg(errp, "Do not support 57-bit DMA address, unless both "
> > > > > > +                         "host and guest are capable of 5-level paging");
> > > > > >          return false;
> > > > > >      }
> > > > > >  
> > > > > > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> > > > > > index d084099..2b29b6f 100644
> > > > > > --- a/hw/i386/intel_iommu_internal.h
> > > > > > +++ b/hw/i386/intel_iommu_internal.h
> > > > > > @@ -114,8 +114,8 @@
> > > > > >                                       VTD_INTERRUPT_ADDR_FIRST + 1)
> > > > > >  
> > > > > >  /* The shift of source_id in the key of IOTLB hash table */
> > > > > > -#define VTD_IOTLB_SID_SHIFT         36
> > > > > > -#define VTD_IOTLB_LVL_SHIFT         52
> > > > > > +#define VTD_IOTLB_SID_SHIFT         45
> > > > > > +#define VTD_IOTLB_LVL_SHIFT         61
> > > > > >  #define VTD_IOTLB_MAX_SIZE          1024    /* Max size of the hash table */
> > > > > >  
> > > > > >  /* IOTLB_REG */
> > > > > > @@ -212,6 +212,8 @@
> > > > > >  #define VTD_CAP_SAGAW_39bit         (0x2ULL << VTD_CAP_SAGAW_SHIFT)
> > > > > >   /* 48-bit AGAW, 4-level page-table */
> > > > > >  #define VTD_CAP_SAGAW_48bit         (0x4ULL << VTD_CAP_SAGAW_SHIFT)
> > > > > > + /* 57-bit AGAW, 5-level page-table */
> > > > > > +#define VTD_CAP_SAGAW_57bit         (0x8ULL << VTD_CAP_SAGAW_SHIFT)
> > > > > >  
> > > > > >  /* IQT_REG */
> > > > > >  #define VTD_IQT_QT(val)             (((val) >> 4) & 0x7fffULL)
> > > > > > @@ -379,6 +381,8 @@ typedef union VTDInvDesc VTDInvDesc;
> > > > > >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > >  #define VTD_SPTE_PAGE_L4_RSVD_MASK(aw) \
> > > > > >          (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > +#define VTD_SPTE_PAGE_L5_RSVD_MASK(aw) \
> > > > > > +        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > >  #define VTD_SPTE_LPAGE_L1_RSVD_MASK(aw) \
> > > > > >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > >  #define VTD_SPTE_LPAGE_L2_RSVD_MASK(aw) \
> > > > > > @@ -387,6 +391,8 @@ typedef union VTDInvDesc VTDInvDesc;
> > > > > >          (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > >  #define VTD_SPTE_LPAGE_L4_RSVD_MASK(aw) \
> > > > > >          (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > +#define VTD_SPTE_LPAGE_L5_RSVD_MASK(aw) \
> > > > > > +        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > >  
> > > > > >  /* Information about page-selective IOTLB invalidate */
> > > > > >  struct VTDIOTLBPageInvInfo {
> > > > > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > > > > index 820451c..7474c4f 100644
> > > > > > --- a/include/hw/i386/intel_iommu.h
> > > > > > +++ b/include/hw/i386/intel_iommu.h
> > > > > > @@ -49,6 +49,7 @@
> > > > > >  #define DMAR_REG_SIZE               0x230
> > > > > >  #define VTD_AW_39BIT                39
> > > > > >  #define VTD_AW_48BIT                48
> > > > > > +#define VTD_AW_57BIT                57
> > > > > >  #define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
> > > > > >  #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
> > > > > >  
> > > > > 
> > > > > 
> > > > 
> > > > B.R.
> > > > Yu
> > > > 
> 
> B.R.
> Yu

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-18  9:27     ` Yu Zhang
  2018-12-18 14:23       ` Michael S. Tsirkin
@ 2018-12-18 14:55       ` Igor Mammedov
  2018-12-18 14:58         ` Michael S. Tsirkin
  2018-12-19  2:57         ` Yu Zhang
  2018-12-20 20:58       ` Eduardo Habkost
  2 siblings, 2 replies; 57+ messages in thread
From: Igor Mammedov @ 2018-12-18 14:55 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Eduardo Habkost, Michael S. Tsirkin, qemu-devel, Peter Xu,
	Paolo Bonzini, Richard Henderson

On Tue, 18 Dec 2018 17:27:23 +0800
Yu Zhang <yu.c.zhang@linux.intel.com> wrote:

> On Mon, Dec 17, 2018 at 02:17:40PM +0100, Igor Mammedov wrote:
> > On Wed, 12 Dec 2018 21:05:38 +0800
> > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > 
> > > Currently, vIOMMU is using the value of IOVA address width, instead of
> > > the host address width(HAW) to calculate the number of reserved bits in
> > > data structures such as root entries, context entries, and entries of
> > > DMA paging structures etc.
> > > 
> > > However values of IOVA address width and of the HAW may not equal. For
> > > example, a 48-bit IOVA can only be mapped to host addresses no wider than
> > > 46 bits. Using 48, instead of 46 to calculate the reserved bit may result
> > > in an invalid IOVA being accepted.
> > > 
> > > To fix this, a new field - haw_bits is introduced in struct IntelIOMMUState,
> > > whose value is initialized based on the maximum physical address set to
> > > guest CPU.
> > 
> > > Also, definitions such as VTD_HOST_AW_39/48BIT etc. are renamed
> > > to clarify.
> > > 
> > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > ---
> > [...]
> > 
> > > @@ -3100,6 +3104,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> > >  static void vtd_init(IntelIOMMUState *s)
> > >  {
> > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > +    CPUState *cs = first_cpu;
> > > +    X86CPU *cpu = X86_CPU(cs);
> > >  
> > >      memset(s->csr, 0, DMAR_REG_SIZE);
> > >      memset(s->wmask, 0, DMAR_REG_SIZE);
> > > @@ -3119,23 +3125,24 @@ static void vtd_init(IntelIOMMUState *s)
> > >      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
> > >               VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > -    if (s->aw_bits == VTD_HOST_AW_48BIT) {
> > > +    if (s->aw_bits == VTD_AW_48BIT) {
> > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > >      }
> > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > +    s->haw_bits = cpu->phys_bits;
> > Is it possible to avoid accessing CPU fields directly or cpu altogether
> > and set phys_bits when iommu is created?
> 
> Thanks for your comments, Igor.
> 
> Well, I guess you prefer not to query the CPU capabilities while deciding
> the vIOMMU features. But to me, they are not that irrelevant.:)
> 
> Here the hardware address width in vt-d, and the one in cpuid.MAXPHYSADDR
> are referring to the same concept. In VM, both are the maximum guest physical
> address width. If we do not check the CPU field here, we will still have to
> check the CPU field in other places such as build_dmar_q35(), and reset the
> s->haw_bits again.
> 
> Is this explanation convincing enough? :)
The current build_dmar_q35() doesn't do it; it's all new code in this series that
contains unacceptable direct access from one device (iommu) to another (cpu).
The proper way would be for the owner of the iommu to fetch the limits from
somewhere and set the values during iommu creation.

> > 
> > Perhaps Eduardo
> >  can suggest better approach, since he's more familiar with phys_bits topic
> 
> @Eduardo, any comments? Thanks!
> 
> > 
> > >      /*
> > >       * Rsvd field masks for spte
> > >       */
> > >      vtd_paging_entry_rsvd_field[0] = ~0ULL;
> > > -    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->aw_bits);
> > > -    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->aw_bits);
> > > -    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->aw_bits);
> > > -    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->aw_bits);
> > > -    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->aw_bits);
> > > -    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->aw_bits);
> > > -    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->aw_bits);
> > > -    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->aw_bits);
> > > +    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->haw_bits);
> > > +    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
> > > +    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
> > > +    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
> > > +    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > +    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > +    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > +    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > >  
> > >      if (x86_iommu->intr_supported) {
> > >          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
> > > @@ -3261,10 +3268,10 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > >      }
> > >  
> > >      /* Currently only address widths supported are 39 and 48 bits */
> > > -    if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
> > > -        (s->aw_bits != VTD_HOST_AW_48BIT)) {
> > > +    if ((s->aw_bits != VTD_AW_39BIT) &&
> > > +        (s->aw_bits != VTD_AW_48BIT)) {
> > >          error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > > -                   VTD_HOST_AW_39BIT, VTD_HOST_AW_48BIT);
> > > +                   VTD_AW_39BIT, VTD_AW_48BIT);
> > >          return false;
> > >      }
> > >  
> > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > index ed4e758..820451c 100644
> > > --- a/include/hw/i386/intel_iommu.h
> > > +++ b/include/hw/i386/intel_iommu.h
> > > @@ -47,9 +47,9 @@
> > >  #define VTD_SID_TO_DEVFN(sid)       ((sid) & 0xff)
> > >  
> > >  #define DMAR_REG_SIZE               0x230
> > > -#define VTD_HOST_AW_39BIT           39
> > > -#define VTD_HOST_AW_48BIT           48
> > > -#define VTD_HOST_ADDRESS_WIDTH      VTD_HOST_AW_39BIT
> > > +#define VTD_AW_39BIT                39
> > > +#define VTD_AW_48BIT                48
> > > +#define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
> > >  #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
> > >  
> > >  #define DMAR_REPORT_F_INTR          (1)
> > > @@ -244,7 +244,8 @@ struct IntelIOMMUState {
> > >      bool intr_eime;                 /* Extended interrupt mode enabled */
> > >      OnOffAuto intr_eim;             /* Toggle for EIM cabability */
> > >      bool buggy_eim;                 /* Force buggy EIM unless eim=off */
> > > -    uint8_t aw_bits;                /* Host/IOVA address width (in bits) */
> > > +    uint8_t aw_bits;                /* IOVA address width (in bits) */
> > > +    uint8_t haw_bits;               /* Hardware address width (in bits) */
> > >  
> > >      /*
> > >       * Protects IOMMU states in general.  Currently it protects the
> > 
> > 
> 
> B.R.
> Yu

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-18 14:55       ` Igor Mammedov
@ 2018-12-18 14:58         ` Michael S. Tsirkin
  2018-12-19  3:03           ` Yu Zhang
  2018-12-19  2:57         ` Yu Zhang
  1 sibling, 1 reply; 57+ messages in thread
From: Michael S. Tsirkin @ 2018-12-18 14:58 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Yu Zhang, Eduardo Habkost, qemu-devel, Peter Xu, Paolo Bonzini,
	Richard Henderson

On Tue, Dec 18, 2018 at 03:55:36PM +0100, Igor Mammedov wrote:
> On Tue, 18 Dec 2018 17:27:23 +0800
> Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> 
> > On Mon, Dec 17, 2018 at 02:17:40PM +0100, Igor Mammedov wrote:
> > > On Wed, 12 Dec 2018 21:05:38 +0800
> > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > 
> > > > Currently, vIOMMU is using the value of IOVA address width, instead of
> > > > the host address width(HAW) to calculate the number of reserved bits in
> > > > data structures such as root entries, context entries, and entries of
> > > > DMA paging structures etc.
> > > > 
> > > > However values of IOVA address width and of the HAW may not equal. For
> > > > example, a 48-bit IOVA can only be mapped to host addresses no wider than
> > > > 46 bits. Using 48, instead of 46 to calculate the reserved bit may result
> > > > in an invalid IOVA being accepted.
> > > > 
> > > > To fix this, a new field - haw_bits is introduced in struct IntelIOMMUState,
> > > > whose value is initialized based on the maximum physical address set to
> > > > guest CPU.
> > > 
> > > > Also, definitions such as VTD_HOST_AW_39/48BIT etc. are renamed
> > > > to clarify.
> > > > 
> > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > > ---
> > > [...]
> > > 
> > > > @@ -3100,6 +3104,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> > > >  static void vtd_init(IntelIOMMUState *s)
> > > >  {
> > > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > > +    CPUState *cs = first_cpu;
> > > > +    X86CPU *cpu = X86_CPU(cs);
> > > >  
> > > >      memset(s->csr, 0, DMAR_REG_SIZE);
> > > >      memset(s->wmask, 0, DMAR_REG_SIZE);
> > > > @@ -3119,23 +3125,24 @@ static void vtd_init(IntelIOMMUState *s)
> > > >      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
> > > >               VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> > > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > > -    if (s->aw_bits == VTD_HOST_AW_48BIT) {
> > > > +    if (s->aw_bits == VTD_AW_48BIT) {
> > > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > > >      }
> > > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > > +    s->haw_bits = cpu->phys_bits;
> > > Is it possible to avoid accessing CPU fields directly or cpu altogether
> > > and set phys_bits when iommu is created?
> > 
> > Thanks for your comments, Igor.
> > 
> > Well, I guess you prefer not to query the CPU capabilities while deciding
> > the vIOMMU features. But to me, they are not that irrelevant.:)
> > 
> > Here the hardware address width in vt-d, and the one in cpuid.MAXPHYSADDR
> > are referring to the same concept. In VM, both are the maximum guest physical
> > address width. If we do not check the CPU field here, we will still have to
> > check the CPU field in other places such as build_dmar_q35(), and reset the
> > s->haw_bits again.
> > 
> > Is this explanation convincing enough? :)
> current build_dmar_q35() doesn't do it, it's all new code in this series that
> contains not acceptable direct access from one device (iommu) to another (cpu).   
> Proper way would be for the owner of iommu to fish limits from somewhere and set
> values during iommu creation.

Maybe it's a good idea to add documentation for now.

It would be nice not to push this stuff up the stack;
it's unfortunate that our internal APIs make it hard.


> > > 
> > > Perhaps Eduardo
> > >  can suggest better approach, since he's more familiar with phys_bits topic
> > 
> > @Eduardo, any comments? Thanks!
> > 
> > > 
> > > >      /*
> > > >       * Rsvd field masks for spte
> > > >       */
> > > >      vtd_paging_entry_rsvd_field[0] = ~0ULL;
> > > > -    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->aw_bits);
> > > > -    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->aw_bits);
> > > > -    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->aw_bits);
> > > > -    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->aw_bits);
> > > > -    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->aw_bits);
> > > > -    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->aw_bits);
> > > > -    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->aw_bits);
> > > > -    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->aw_bits);
> > > > +    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->haw_bits);
> > > > +    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
> > > > +    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
> > > > +    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
> > > > +    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > +    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > +    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > +    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > >  
> > > >      if (x86_iommu->intr_supported) {
> > > >          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
> > > > @@ -3261,10 +3268,10 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > >      }
> > > >  
> > > >      /* Currently only address widths supported are 39 and 48 bits */
> > > > -    if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
> > > > -        (s->aw_bits != VTD_HOST_AW_48BIT)) {
> > > > +    if ((s->aw_bits != VTD_AW_39BIT) &&
> > > > +        (s->aw_bits != VTD_AW_48BIT)) {
> > > >          error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > > > -                   VTD_HOST_AW_39BIT, VTD_HOST_AW_48BIT);
> > > > +                   VTD_AW_39BIT, VTD_AW_48BIT);
> > > >          return false;
> > > >      }
> > > >  
> > > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > > index ed4e758..820451c 100644
> > > > --- a/include/hw/i386/intel_iommu.h
> > > > +++ b/include/hw/i386/intel_iommu.h
> > > > @@ -47,9 +47,9 @@
> > > >  #define VTD_SID_TO_DEVFN(sid)       ((sid) & 0xff)
> > > >  
> > > >  #define DMAR_REG_SIZE               0x230
> > > > -#define VTD_HOST_AW_39BIT           39
> > > > -#define VTD_HOST_AW_48BIT           48
> > > > -#define VTD_HOST_ADDRESS_WIDTH      VTD_HOST_AW_39BIT
> > > > +#define VTD_AW_39BIT                39
> > > > +#define VTD_AW_48BIT                48
> > > > +#define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
> > > >  #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
> > > >  
> > > >  #define DMAR_REPORT_F_INTR          (1)
> > > > @@ -244,7 +244,8 @@ struct IntelIOMMUState {
> > > >      bool intr_eime;                 /* Extended interrupt mode enabled */
> > > >      OnOffAuto intr_eim;             /* Toggle for EIM cabability */
> > > >      bool buggy_eim;                 /* Force buggy EIM unless eim=off */
> > > > -    uint8_t aw_bits;                /* Host/IOVA address width (in bits) */
> > > > +    uint8_t aw_bits;                /* IOVA address width (in bits) */
> > > > +    uint8_t haw_bits;               /* Hardware address width (in bits) */
> > > >  
> > > >      /*
> > > >       * Protects IOMMU states in general.  Currently it protects the
> > > 
> > > 
> > 
> > B.R.
> > Yu

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-18 14:55       ` Igor Mammedov
  2018-12-18 14:58         ` Michael S. Tsirkin
@ 2018-12-19  2:57         ` Yu Zhang
  2018-12-19 10:40           ` Igor Mammedov
  1 sibling, 1 reply; 57+ messages in thread
From: Yu Zhang @ 2018-12-19  2:57 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Eduardo Habkost, Michael S. Tsirkin, qemu-devel, Peter Xu,
	Paolo Bonzini, Richard Henderson

On Tue, Dec 18, 2018 at 03:55:36PM +0100, Igor Mammedov wrote:
> On Tue, 18 Dec 2018 17:27:23 +0800
> Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> 
> > On Mon, Dec 17, 2018 at 02:17:40PM +0100, Igor Mammedov wrote:
> > > On Wed, 12 Dec 2018 21:05:38 +0800
> > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > 
> > > > Currently, vIOMMU is using the value of IOVA address width, instead of
> > > > the host address width(HAW) to calculate the number of reserved bits in
> > > > data structures such as root entries, context entries, and entries of
> > > > DMA paging structures etc.
> > > > 
> > > > However values of IOVA address width and of the HAW may not equal. For
> > > > example, a 48-bit IOVA can only be mapped to host addresses no wider than
> > > > 46 bits. Using 48, instead of 46 to calculate the reserved bit may result
> > > > in an invalid IOVA being accepted.
> > > > 
> > > > To fix this, a new field - haw_bits is introduced in struct IntelIOMMUState,
> > > > whose value is initialized based on the maximum physical address set to
> > > > guest CPU.
> > > 
> > > > Also, definitions such as VTD_HOST_AW_39/48BIT etc. are renamed
> > > > to clarify.
> > > > 
> > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > > ---
> > > [...]
> > > 
> > > > @@ -3100,6 +3104,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> > > >  static void vtd_init(IntelIOMMUState *s)
> > > >  {
> > > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > > +    CPUState *cs = first_cpu;
> > > > +    X86CPU *cpu = X86_CPU(cs);
> > > >  
> > > >      memset(s->csr, 0, DMAR_REG_SIZE);
> > > >      memset(s->wmask, 0, DMAR_REG_SIZE);
> > > > @@ -3119,23 +3125,24 @@ static void vtd_init(IntelIOMMUState *s)
> > > >      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
> > > >               VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> > > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > > -    if (s->aw_bits == VTD_HOST_AW_48BIT) {
> > > > +    if (s->aw_bits == VTD_AW_48BIT) {
> > > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > > >      }
> > > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > > +    s->haw_bits = cpu->phys_bits;
> > > Is it possible to avoid accessing CPU fields directly or cpu altogether
> > > and set phys_bits when iommu is created?
> > 
> > Thanks for your comments, Igor.
> > 
> > Well, I guess you prefer not to query the CPU capabilities while deciding
> > the vIOMMU features. But to me, they are not that irrelevant.:)
> > 
> > Here the hardware address width in vt-d, and the one in cpuid.MAXPHYSADDR
> > are referring to the same concept. In VM, both are the maximum guest physical
> > address width. If we do not check the CPU field here, we will still have to
> > check the CPU field in other places such as build_dmar_q35(), and reset the
> > s->haw_bits again.
> > 
> > Is this explanation convincing enough? :)
> current build_dmar_q35() doesn't do it, it's all new code in this series that
> contains not acceptable direct access from one device (iommu) to another (cpu).   
> Proper way would be for the owner of iommu to fish limits from somewhere and set
> values during iommu creation.

Well, the current build_dmar_q35() doesn't do it because it is using the incorrect value. :)
According to the spec, the host address width is the maximum physical address width,
yet the current implementation is using the DMA address width. To me, this is not only
wrong, but also insecure. On this point, I think we all agree it needs to be fixed.

As to how to fix it - i.e., whether we should query the cpu fields - I still do not
understand why this is not acceptable. :)

I had thought of other approaches before, yet I did not choose them:

1> Introduce a new parameter, say, "x-haw-bits", which is used by the iommu to limit its
physical address width (similar to "x-aw-bits" for the IOVA). But should we check this
parameter or not? And what if it is set to something different from "phys-bits"?

2> Another choice I had thought of is to query the physical iommu. I abandoned this idea
because my understanding is that the vIOMMU is not a passed-through device; it is emulated.

So Igor, may I ask why you think checking against the cpu fields is so unacceptable? :)

> 
> > > 
> > > Perhaps Eduardo
> > >  can suggest better approach, since he's more familiar with phys_bits topic
> > 
> > @Eduardo, any comments? Thanks!
> > 
> > > 
> > > >      /*
> > > >       * Rsvd field masks for spte
> > > >       */
> > > >      vtd_paging_entry_rsvd_field[0] = ~0ULL;
> > > > -    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->aw_bits);
> > > > -    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->aw_bits);
> > > > -    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->aw_bits);
> > > > -    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->aw_bits);
> > > > -    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->aw_bits);
> > > > -    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->aw_bits);
> > > > -    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->aw_bits);
> > > > -    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->aw_bits);
> > > > +    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->haw_bits);
> > > > +    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
> > > > +    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
> > > > +    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
> > > > +    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > +    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > +    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > +    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > >  
> > > >      if (x86_iommu->intr_supported) {
> > > >          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
> > > > @@ -3261,10 +3268,10 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > >      }
> > > >  
> > > >      /* Currently only address widths supported are 39 and 48 bits */
> > > > -    if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
> > > > -        (s->aw_bits != VTD_HOST_AW_48BIT)) {
> > > > +    if ((s->aw_bits != VTD_AW_39BIT) &&
> > > > +        (s->aw_bits != VTD_AW_48BIT)) {
> > > >          error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > > > -                   VTD_HOST_AW_39BIT, VTD_HOST_AW_48BIT);
> > > > +                   VTD_AW_39BIT, VTD_AW_48BIT);
> > > >          return false;
> > > >      }
> > > >  
> > > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > > index ed4e758..820451c 100644
> > > > --- a/include/hw/i386/intel_iommu.h
> > > > +++ b/include/hw/i386/intel_iommu.h
> > > > @@ -47,9 +47,9 @@
> > > >  #define VTD_SID_TO_DEVFN(sid)       ((sid) & 0xff)
> > > >  
> > > >  #define DMAR_REG_SIZE               0x230
> > > > -#define VTD_HOST_AW_39BIT           39
> > > > -#define VTD_HOST_AW_48BIT           48
> > > > -#define VTD_HOST_ADDRESS_WIDTH      VTD_HOST_AW_39BIT
> > > > +#define VTD_AW_39BIT                39
> > > > +#define VTD_AW_48BIT                48
> > > > +#define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
> > > >  #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
> > > >  
> > > >  #define DMAR_REPORT_F_INTR          (1)
> > > > @@ -244,7 +244,8 @@ struct IntelIOMMUState {
> > > >      bool intr_eime;                 /* Extended interrupt mode enabled */
> > > >      OnOffAuto intr_eim;             /* Toggle for EIM cabability */
> > > >      bool buggy_eim;                 /* Force buggy EIM unless eim=off */
> > > > -    uint8_t aw_bits;                /* Host/IOVA address width (in bits) */
> > > > +    uint8_t aw_bits;                /* IOVA address width (in bits) */
> > > > +    uint8_t haw_bits;               /* Hardware address width (in bits) */
> > > >  
> > > >      /*
> > > >       * Protects IOMMU states in general.  Currently it protects the
> > > 
> > > 
> > 
> > B.R.
> > Yu
> 
> 

B.R.
Yu


* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-18 14:58         ` Michael S. Tsirkin
@ 2018-12-19  3:03           ` Yu Zhang
  2018-12-19  3:12             ` Michael S. Tsirkin
  0 siblings, 1 reply; 57+ messages in thread
From: Yu Zhang @ 2018-12-19  3:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Igor Mammedov, Eduardo Habkost, qemu-devel, Peter Xu,
	Paolo Bonzini, Richard Henderson

On Tue, Dec 18, 2018 at 09:58:35AM -0500, Michael S. Tsirkin wrote:
> On Tue, Dec 18, 2018 at 03:55:36PM +0100, Igor Mammedov wrote:
> > On Tue, 18 Dec 2018 17:27:23 +0800
> > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > 
> > > On Mon, Dec 17, 2018 at 02:17:40PM +0100, Igor Mammedov wrote:
> > > > On Wed, 12 Dec 2018 21:05:38 +0800
> > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > > 
> > > > > Currently, vIOMMU is using the value of IOVA address width, instead of
> > > > > the host address width(HAW) to calculate the number of reserved bits in
> > > > > data structures such as root entries, context entries, and entries of
> > > > > DMA paging structures etc.
> > > > > 
> > > > > However values of IOVA address width and of the HAW may not equal. For
> > > > > example, a 48-bit IOVA can only be mapped to host addresses no wider than
> > > > > 46 bits. Using 48, instead of 46 to calculate the reserved bit may result
> > > > > in an invalid IOVA being accepted.
> > > > > 
> > > > > To fix this, a new field - haw_bits is introduced in struct IntelIOMMUState,
> > > > > whose value is initialized based on the maximum physical address set to
> > > > > guest CPU.
> > > > 
> > > > > Also, definitions such as VTD_HOST_AW_39/48BIT etc. are renamed
> > > > > to clarify.
> > > > > 
> > > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > > > ---
> > > > [...]
> > > > 
> > > > > @@ -3100,6 +3104,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> > > > >  static void vtd_init(IntelIOMMUState *s)
> > > > >  {
> > > > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > > > +    CPUState *cs = first_cpu;
> > > > > +    X86CPU *cpu = X86_CPU(cs);
> > > > >  
> > > > >      memset(s->csr, 0, DMAR_REG_SIZE);
> > > > >      memset(s->wmask, 0, DMAR_REG_SIZE);
> > > > > @@ -3119,23 +3125,24 @@ static void vtd_init(IntelIOMMUState *s)
> > > > >      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
> > > > >               VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> > > > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > > > -    if (s->aw_bits == VTD_HOST_AW_48BIT) {
> > > > > +    if (s->aw_bits == VTD_AW_48BIT) {
> > > > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > > > >      }
> > > > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > > > +    s->haw_bits = cpu->phys_bits;
> > > > Is it possible to avoid accessing CPU fields directly or cpu altogether
> > > > and set phys_bits when iommu is created?
> > > 
> > > Thanks for your comments, Igor.
> > > 
> > > Well, I guess you prefer not to query the CPU capabilities while deciding
> > > the vIOMMU features. But to me, they are not that irrelevant.:)
> > > 
> > > Here the hardware address width in vt-d, and the one in cpuid.MAXPHYSADDR
> > > are referring to the same concept. In VM, both are the maximum guest physical
> > > address width. If we do not check the CPU field here, we will still have to
> > > check the CPU field in other places such as build_dmar_q35(), and reset the
> > > s->haw_bits again.
> > > 
> > > Is this explanation convincing enough? :)
> > current build_dmar_q35() doesn't do it, it's all new code in this series that
> > contains not acceptable direct access from one device (iommu) to another (cpu).   
> > Proper way would be for the owner of iommu to fish limits from somewhere and set
> > values during iommu creation.
> 
> Maybe it's a good idea to add documentation for now.

Thanks, Michael. So what kind of documentation do you refer to?

> 
> It would be nice not to push this stuff up the stack,
> it's unfortunate that our internal APIs make it hard.

Sorry, I do not quite get it. What do you mean by "internal APIs make it hard"? :)

> 
> 
> > > > 
> > > > Perhaps Eduardo
> > > >  can suggest better approach, since he's more familiar with phys_bits topic
> > > 
> > > @Eduardo, any comments? Thanks!
> > > 
> > > > 
> > > > >      /*
> > > > >       * Rsvd field masks for spte
> > > > >       */
> > > > >      vtd_paging_entry_rsvd_field[0] = ~0ULL;
> > > > > -    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->aw_bits);
> > > > > -    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->aw_bits);
> > > > > -    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->aw_bits);
> > > > > -    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->aw_bits);
> > > > > -    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->aw_bits);
> > > > > -    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->aw_bits);
> > > > > -    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->aw_bits);
> > > > > -    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->aw_bits);
> > > > > +    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > +    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > +    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > +    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > +    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > +    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > +    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > +    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > >  
> > > > >      if (x86_iommu->intr_supported) {
> > > > >          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
> > > > > @@ -3261,10 +3268,10 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > > >      }
> > > > >  
> > > > >      /* Currently only address widths supported are 39 and 48 bits */
> > > > > -    if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
> > > > > -        (s->aw_bits != VTD_HOST_AW_48BIT)) {
> > > > > +    if ((s->aw_bits != VTD_AW_39BIT) &&
> > > > > +        (s->aw_bits != VTD_AW_48BIT)) {
> > > > >          error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > > > > -                   VTD_HOST_AW_39BIT, VTD_HOST_AW_48BIT);
> > > > > +                   VTD_AW_39BIT, VTD_AW_48BIT);
> > > > >          return false;
> > > > >      }
> > > > >  
> > > > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > > > index ed4e758..820451c 100644
> > > > > --- a/include/hw/i386/intel_iommu.h
> > > > > +++ b/include/hw/i386/intel_iommu.h
> > > > > @@ -47,9 +47,9 @@
> > > > >  #define VTD_SID_TO_DEVFN(sid)       ((sid) & 0xff)
> > > > >  
> > > > >  #define DMAR_REG_SIZE               0x230
> > > > > -#define VTD_HOST_AW_39BIT           39
> > > > > -#define VTD_HOST_AW_48BIT           48
> > > > > -#define VTD_HOST_ADDRESS_WIDTH      VTD_HOST_AW_39BIT
> > > > > +#define VTD_AW_39BIT                39
> > > > > +#define VTD_AW_48BIT                48
> > > > > +#define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
> > > > >  #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
> > > > >  
> > > > >  #define DMAR_REPORT_F_INTR          (1)
> > > > > @@ -244,7 +244,8 @@ struct IntelIOMMUState {
> > > > >      bool intr_eime;                 /* Extended interrupt mode enabled */
> > > > >      OnOffAuto intr_eim;             /* Toggle for EIM cabability */
> > > > >      bool buggy_eim;                 /* Force buggy EIM unless eim=off */
> > > > > -    uint8_t aw_bits;                /* Host/IOVA address width (in bits) */
> > > > > +    uint8_t aw_bits;                /* IOVA address width (in bits) */
> > > > > +    uint8_t haw_bits;               /* Hardware address width (in bits) */
> > > > >  
> > > > >      /*
> > > > >       * Protects IOMMU states in general.  Currently it protects the
> > > > 
> > > > 
> > > 
> > > B.R.
> > > Yu
> 

B.R.
Yu


* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-19  3:03           ` Yu Zhang
@ 2018-12-19  3:12             ` Michael S. Tsirkin
  2018-12-19  6:28               ` Yu Zhang
  0 siblings, 1 reply; 57+ messages in thread
From: Michael S. Tsirkin @ 2018-12-19  3:12 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Igor Mammedov, Eduardo Habkost, qemu-devel, Peter Xu,
	Paolo Bonzini, Richard Henderson

On Wed, Dec 19, 2018 at 11:03:58AM +0800, Yu Zhang wrote:
> On Tue, Dec 18, 2018 at 09:58:35AM -0500, Michael S. Tsirkin wrote:
> > On Tue, Dec 18, 2018 at 03:55:36PM +0100, Igor Mammedov wrote:
> > > On Tue, 18 Dec 2018 17:27:23 +0800
> > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > 
> > > > On Mon, Dec 17, 2018 at 02:17:40PM +0100, Igor Mammedov wrote:
> > > > > On Wed, 12 Dec 2018 21:05:38 +0800
> > > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > > > 
> > > > > > Currently, vIOMMU is using the value of IOVA address width, instead of
> > > > > > the host address width(HAW) to calculate the number of reserved bits in
> > > > > > data structures such as root entries, context entries, and entries of
> > > > > > DMA paging structures etc.
> > > > > > 
> > > > > > However values of IOVA address width and of the HAW may not equal. For
> > > > > > example, a 48-bit IOVA can only be mapped to host addresses no wider than
> > > > > > 46 bits. Using 48, instead of 46 to calculate the reserved bit may result
> > > > > > in an invalid IOVA being accepted.
> > > > > > 
> > > > > > To fix this, a new field - haw_bits is introduced in struct IntelIOMMUState,
> > > > > > whose value is initialized based on the maximum physical address set to
> > > > > > guest CPU.
> > > > > 
> > > > > > Also, definitions such as VTD_HOST_AW_39/48BIT etc. are renamed
> > > > > > to clarify.
> > > > > > 
> > > > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > > > > ---
> > > > > [...]
> > > > > 
> > > > > > @@ -3100,6 +3104,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> > > > > >  static void vtd_init(IntelIOMMUState *s)
> > > > > >  {
> > > > > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > > > > +    CPUState *cs = first_cpu;
> > > > > > +    X86CPU *cpu = X86_CPU(cs);
> > > > > >  
> > > > > >      memset(s->csr, 0, DMAR_REG_SIZE);
> > > > > >      memset(s->wmask, 0, DMAR_REG_SIZE);
> > > > > > @@ -3119,23 +3125,24 @@ static void vtd_init(IntelIOMMUState *s)
> > > > > >      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
> > > > > >               VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> > > > > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > > > > -    if (s->aw_bits == VTD_HOST_AW_48BIT) {
> > > > > > +    if (s->aw_bits == VTD_AW_48BIT) {
> > > > > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > > > > >      }
> > > > > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > > > > +    s->haw_bits = cpu->phys_bits;
> > > > > Is it possible to avoid accessing CPU fields directly or cpu altogether
> > > > > and set phys_bits when iommu is created?
> > > > 
> > > > Thanks for your comments, Igor.
> > > > 
> > > > Well, I guess you prefer not to query the CPU capabilities while deciding
> > > > the vIOMMU features. But to me, they are not that irrelevant.:)
> > > > 
> > > > Here the hardware address width in vt-d, and the one in cpuid.MAXPHYSADDR
> > > > are referring to the same concept. In VM, both are the maximum guest physical
> > > > address width. If we do not check the CPU field here, we will still have to
> > > > check the CPU field in other places such as build_dmar_q35(), and reset the
> > > > s->haw_bits again.
> > > > 
> > > > Is this explanation convincing enough? :)
> > > current build_dmar_q35() doesn't do it, it's all new code in this series that
> > > contains not acceptable direct access from one device (iommu) to another (cpu).   
> > > Proper way would be for the owner of iommu to fish limits from somewhere and set
> > > values during iommu creation.
> > 
> > Maybe it's a good idea to add documentation for now.
> 
> Thanks Michael. So what kind of documentation do you refer? 

The idea would be to have two properties, an address width for the CPU and
one for the IOMMU, and explain in the documentation that they
should normally be set to the same value.

> > 
> > It would be nice not to push this stuff up the stack,
> > it's unfortunate that our internal APIs make it hard.
> 
> Sorry, I do not quite get it. What do you mean "internal APIs make it hard"? :)

The API doesn't actually guarantee any initialization order.
The CPU happens to be initialized first, but I do not
think there's a guarantee that it will keep being the case.
This makes it hard to get properties from one device
and use them in another.

> > 
> > 
> > > > > 
> > > > > Perhaps Eduardo
> > > > >  can suggest better approach, since he's more familiar with phys_bits topic
> > > > 
> > > > @Eduardo, any comments? Thanks!
> > > > 
> > > > > 
> > > > > >      /*
> > > > > >       * Rsvd field masks for spte
> > > > > >       */
> > > > > >      vtd_paging_entry_rsvd_field[0] = ~0ULL;
> > > > > > -    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->aw_bits);
> > > > > > -    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->aw_bits);
> > > > > > -    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->aw_bits);
> > > > > > -    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->aw_bits);
> > > > > > -    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->aw_bits);
> > > > > > -    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->aw_bits);
> > > > > > -    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->aw_bits);
> > > > > > -    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->aw_bits);
> > > > > > +    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > > +    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > +    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > +    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > +    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > > +    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > +    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > +    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > >  
> > > > > >      if (x86_iommu->intr_supported) {
> > > > > >          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
> > > > > > @@ -3261,10 +3268,10 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > > > >      }
> > > > > >  
> > > > > >      /* Currently only address widths supported are 39 and 48 bits */
> > > > > > -    if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
> > > > > > -        (s->aw_bits != VTD_HOST_AW_48BIT)) {
> > > > > > +    if ((s->aw_bits != VTD_AW_39BIT) &&
> > > > > > +        (s->aw_bits != VTD_AW_48BIT)) {
> > > > > >          error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > > > > > -                   VTD_HOST_AW_39BIT, VTD_HOST_AW_48BIT);
> > > > > > +                   VTD_AW_39BIT, VTD_AW_48BIT);
> > > > > >          return false;
> > > > > >      }
> > > > > >  
> > > > > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > > > > index ed4e758..820451c 100644
> > > > > > --- a/include/hw/i386/intel_iommu.h
> > > > > > +++ b/include/hw/i386/intel_iommu.h
> > > > > > @@ -47,9 +47,9 @@
> > > > > >  #define VTD_SID_TO_DEVFN(sid)       ((sid) & 0xff)
> > > > > >  
> > > > > >  #define DMAR_REG_SIZE               0x230
> > > > > > -#define VTD_HOST_AW_39BIT           39
> > > > > > -#define VTD_HOST_AW_48BIT           48
> > > > > > -#define VTD_HOST_ADDRESS_WIDTH      VTD_HOST_AW_39BIT
> > > > > > +#define VTD_AW_39BIT                39
> > > > > > +#define VTD_AW_48BIT                48
> > > > > > +#define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
> > > > > >  #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
> > > > > >  
> > > > > >  #define DMAR_REPORT_F_INTR          (1)
> > > > > > @@ -244,7 +244,8 @@ struct IntelIOMMUState {
> > > > > >      bool intr_eime;                 /* Extended interrupt mode enabled */
> > > > > >      OnOffAuto intr_eim;             /* Toggle for EIM cabability */
> > > > > >      bool buggy_eim;                 /* Force buggy EIM unless eim=off */
> > > > > > -    uint8_t aw_bits;                /* Host/IOVA address width (in bits) */
> > > > > > +    uint8_t aw_bits;                /* IOVA address width (in bits) */
> > > > > > +    uint8_t haw_bits;               /* Hardware address width (in bits) */
> > > > > >  
> > > > > >      /*
> > > > > >       * Protects IOMMU states in general.  Currently it protects the
> > > > > 
> > > > > 
> > > > 
> > > > B.R.
> > > > Yu
> > 
> 
> B.R.
> Yu


* Re: [Qemu-devel] [PATCH v3 2/2] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.
  2018-12-18 14:49             ` Michael S. Tsirkin
@ 2018-12-19  3:40               ` Yu Zhang
  2018-12-19  4:35                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 57+ messages in thread
From: Yu Zhang @ 2018-12-19  3:40 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eduardo Habkost, qemu-devel, Peter Xu, Paolo Bonzini,
	Igor Mammedov, Richard Henderson

On Tue, Dec 18, 2018 at 09:49:02AM -0500, Michael S. Tsirkin wrote:
> On Tue, Dec 18, 2018 at 09:45:41PM +0800, Yu Zhang wrote:
> > On Tue, Dec 18, 2018 at 07:43:28AM -0500, Michael S. Tsirkin wrote:
> > > On Tue, Dec 18, 2018 at 06:01:16PM +0800, Yu Zhang wrote:
> > > > On Tue, Dec 18, 2018 at 05:47:14PM +0800, Yu Zhang wrote:
> > > > > On Mon, Dec 17, 2018 at 02:29:02PM +0100, Igor Mammedov wrote:
> > > > > > On Wed, 12 Dec 2018 21:05:39 +0800
> > > > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > > > > 
> > > > > > > A 5-level paging capable VM may choose to use 57-bit IOVA address width.
> > > > > > > E.g. guest applications may prefer to use its VA as IOVA when performing
> > > > > > > VFIO map/unmap operations, to avoid the burden of managing the IOVA space.
> > > > > > > 
> > > > > > > This patch extends the current vIOMMU logic to cover the extended address
> > > > > > > width. When creating a VM with 5-level paging feature, one can choose to
> > > > > > > create a virtual VTD with 5-level paging capability, with configurations
> > > > > > > like "-device intel-iommu,x-aw-bits=57".
> > > > > > > 
> > > > > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > > > > > ---
> > > > > > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > > > > > Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> > > > > > > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > Cc: Richard Henderson <rth@twiddle.net>
> > > > > > > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > > > > > > Cc: Peter Xu <peterx@redhat.com>
> > > > > > > ---
> > > > > > >  hw/i386/intel_iommu.c          | 53 ++++++++++++++++++++++++++++++++----------
> > > > > > >  hw/i386/intel_iommu_internal.h | 10 ++++++--
> > > > > > >  include/hw/i386/intel_iommu.h  |  1 +
> > > > > > >  3 files changed, 50 insertions(+), 14 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > > > > index 0e88c63..871110c 100644
> > > > > > > --- a/hw/i386/intel_iommu.c
> > > > > > > +++ b/hw/i386/intel_iommu.c
> > > > > > > @@ -664,16 +664,16 @@ static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce,
> > > > > > >  
> > > > > > >  /*
> > > > > > >   * Rsvd field masks for spte:
> > > > > > > - *     Index [1] to [4] 4k pages
> > > > > > > - *     Index [5] to [8] large pages
> > > > > > > + *     Index [1] to [5] 4k pages
> > > > > > > + *     Index [6] to [10] large pages
> > > > > > >   */
> > > > > > > -static uint64_t vtd_paging_entry_rsvd_field[9];
> > > > > > > +static uint64_t vtd_paging_entry_rsvd_field[11];
> > > > > > >  
> > > > > > >  static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > > > > > >  {
> > > > > > >      if (slpte & VTD_SL_PT_PAGE_SIZE_MASK) {
> > > > > > >          /* Maybe large page */
> > > > > > > -        return slpte & vtd_paging_entry_rsvd_field[level + 4];
> > > > > > > +        return slpte & vtd_paging_entry_rsvd_field[level + 5];
> > > > > > >      } else {
> > > > > > >          return slpte & vtd_paging_entry_rsvd_field[level];
> > > > > > >      }
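As an aside for reviewers, the indexing scheme above can be sketched as follows (a hedged illustration; `rsvd_index` is my name for it, not a QEMU symbol): 4K mappings at paging levels 1-5 use slots [1]..[5], large-page mappings use [level + 5], i.e. slots [6]..[10], and slot [0] stays unused.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper mirroring the indexing in vtd_slpte_nonzero_rsvd()
 * after this patch: an 11-slot table, [1..5] for 4K pages and [6..10]
 * for large pages. The function name is illustrative only.
 */
int rsvd_index(int level, bool large_page)
{
    return large_page ? level + 5 : level;
}
```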
> > > > > > > @@ -3127,6 +3127,8 @@ static void vtd_init(IntelIOMMUState *s)
> > > > > > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > > > > >      if (s->aw_bits == VTD_AW_48BIT) {
> > > > > > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > > > > > > +    } else if (s->aw_bits == VTD_AW_57BIT) {
> > > > > > > +        s->cap |= VTD_CAP_SAGAW_57bit | VTD_CAP_SAGAW_48bit;
> > > > > > >      }
> > > > > > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > > > > >      s->haw_bits = cpu->phys_bits;
> > > > > > > @@ -3139,10 +3141,12 @@ static void vtd_init(IntelIOMMUState *s)
> > > > > > >      vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > >      vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > >      vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > > -    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > > > -    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > > -    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > > -    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > > +    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_PAGE_L5_RSVD_MASK(s->haw_bits);
> > > > > > > +    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > > > +    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > > +    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > > +    vtd_paging_entry_rsvd_field[9] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > > +    vtd_paging_entry_rsvd_field[10] = VTD_SPTE_LPAGE_L5_RSVD_MASK(s->haw_bits);
> > > > > > >  
> > > > > > >      if (x86_iommu->intr_supported) {
> > > > > > >          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
> > > > > > > @@ -3241,6 +3245,23 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
> > > > > > >      return &vtd_as->as;
> > > > > > >  }
> > > > > > >  
> > > > > > > +static bool host_has_la57(void)
> > > > > > > +{
> > > > > > > +    uint32_t ecx, unused;
> > > > > > > +
> > > > > > > +    host_cpuid(7, 0, &unused, &unused, &ecx, &unused);
> > > > > > > +    return ecx & CPUID_7_0_ECX_LA57;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static bool guest_has_la57(void)
> > > > > > > +{
> > > > > > > +    CPUState *cs = first_cpu;
> > > > > > > +    X86CPU *cpu = X86_CPU(cs);
> > > > > > > +    CPUX86State *env = &cpu->env;
> > > > > > > +
> > > > > > > +    return env->features[FEAT_7_0_ECX] & CPUID_7_0_ECX_LA57;
> > > > > > > +}
> > > > > > another direct access to CPU fields,
> > > > > > I'd suggest setting this value when the iommu is created,
> > > > > > i.e. add a 'la57' property and set it from the iommu owner.
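For reference, both helpers above test the same architectural bit: CPUID.(EAX=07H,ECX=0):ECX bit 16 reports LA57 (5-level paging / 57-bit linear addresses). A minimal sketch of the bit test on a raw ECX value (the helper name is illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 7, subleaf 0: LA57 is reported in ECX bit 16. This is the
 * bit host_has_la57()/guest_has_la57() check; the helper name is mine. */
#define LA57_ECX_BIT (1u << 16)

bool ecx_has_la57(uint32_t leaf7_ecx)
{
    return (leaf7_ecx & LA57_ECX_BIT) != 0;
}
```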
> > > > > > 
> > > > > 
> > > > > Sorry, do you mean "-device intel-iommu,la57"? I think we do not need
> > > > > that, because a 5-level capable vIOMMU can be created with properties
> > > > > like "-device intel-iommu,x-aw-bits=57". 
> > > > > 
> > > > > The guest CPU fields are checked to make sure the VM has LA57 CPU feature,
> > > > > because I believe there shall be no 5-level IOMMU on platforms without LA57
> > > > > CPUs. 
> > > 
> > > I don't necessarily see why these need to be connected.
> > > If yes, please add a comment in the code to explain.
> > 
> > Sorry, do you mean the VM shall be able to see a 5-level IOMMU even if it does not
> > have the LA57 feature? I did not see any direct connection when first asked to enable
> > a 5-level vIOMMU, but I was told (and checked) that DPDK in the VM may choose a VA
> > value as an IOVA.
> 
> Right but then that doesn't work on all hosts either.

Oh, the host already has a 5-level IOMMU now. So I think DPDK running natively shall work with that.

> 
> > And if the guest has LA57, we should create a 5-level vIOMMU for the VM.
> > But if the VM does not even have LA57, is there any specific reason we should give
> > it a 5-level vIOMMU?
> 
> So the example you give is VTD address width < CPU aw. That is known
> to be problematic for dpdk but not for other software, and maybe dpdk
> will learn how to cope. Given that such hosts exist, it might be
> useful to support this at least for debugging.
> 
> Are there reasons to worry about VTD > CPU?

Well, I am not that worried (the lack of a usage case is one concern). I am OK to drop the guest check. :)

> 
> 
> > > 
> > > 
> > > > > > >  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > > > > >  {
> > > > > > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > > > > > @@ -3267,11 +3288,19 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > > > > >          }
> > > > > > >      }
> > > > > > >  
> > > > > > > -    /* Currently only address widths supported are 39 and 48 bits */
> > > > > > > +    /* Currently address widths supported are 39, 48, and 57 bits */
> > > > > > >      if ((s->aw_bits != VTD_AW_39BIT) &&
> > > > > > > -        (s->aw_bits != VTD_AW_48BIT)) {
> > > > > > > -        error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > > > > > > -                   VTD_AW_39BIT, VTD_AW_48BIT);
> > > > > > > +        (s->aw_bits != VTD_AW_48BIT) &&
> > > > > > > +        (s->aw_bits != VTD_AW_57BIT)) {
> > > > > > > +        error_setg(errp, "Supported values for x-aw-bits are: %d, %d, %d",
> > > > > > > +                   VTD_AW_39BIT, VTD_AW_48BIT, VTD_AW_57BIT);
> > > > > > > +        return false;
> > > > > > > +    }
> > > > > > > +
> > > > > > > +    if ((s->aw_bits == VTD_AW_57BIT) &&
> > > > > > > +        !(host_has_la57() && guest_has_la57())) {
> > > > > > Is the iommu supposed to work in TCG mode?
> > > > > > If yes, then why should it care about host_has_la57()?
> > > > > > 
> > > > > 
> > > > > Hmm... I did not take TCG mode into consideration. And host_has_la57() is
> > > > > used to guarantee the host has the LA57 feature, so that iommu shadowing works
> > > > > for device assignment.
> > > > > 
> > > > > I guess the iommu shall work in TCG mode (though I am not quite sure about this).
> > > > > But I do not have any usage case for a 5-level vIOMMU in TCG in mind. So maybe
> > > > > we can:
> > > > > 1> check 'ms->accel' in vtd_decide_config() and ignore the host
> > > > > capability if it is TCG.
> > > > 
> > > > For choice 1, kvm_enabled() might be used instead of ms->accel. Thanks Peter
> > > > for the reminder. :)
> > > 
> > > 
> > > This needs a big comment with an explanation though.
> > > And probably a TODO to make it work under TCG ...
> > > 
> > 
> > Thanks, Michael. For choice 1, I believe it should work for TCG (will need testing
> > though), and the condition would be something like:
> > 
> >     if ((s->aw_bits == VTD_AW_57BIT) &&
> >         kvm_enabled() &&
> >         !host_has_la57())  {
> > 
> > As you can see, though I removed the check of guest_has_la57(), I still kept the
> > check against the host when KVM is enabled. I'm still open to being convinced that
> > we do not need the guest check. :)
> 
> 
> okay but then (repeating myself, sorry) please add a comment that explains
> what happens if you do not add this limitation.

How about the comment below?
    /*
     * For KVM guests, the host capability of LA57 shall be available, so
     * that iommu shadowing works for the device assignment scenario. But
     * for TCG mode, we do not need such a restriction.
     */

BTW, I just tested TCG mode; it works (with the host-capability restriction removed).
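To summarize the outcome being discussed, the 57-bit gate in vtd_decide_config() would reduce to a predicate like this sketch (assuming the guest check is dropped and the host check applies only under KVM; the helper name is mine, not QEMU's):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the proposed gate for x-aw-bits=57: reject only when
 * running under KVM on a host without LA57 (iommu shadowing for
 * device assignment needs host support); TCG is unrestricted. */
bool aw57_config_ok(bool kvm_enabled, bool host_la57)
{
    return !kvm_enabled || host_la57;
}
```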

> 
> 
> > > > > 2> Or, we can choose to keep it as it is, and add the check once a 5-level
> > > > > paging vIOMMU does have a usage in TCG?
> > > > > 
> > > > > But as to the check of guest capability, I still believe it is necessary. As
> > > > > said, a VM without the LA57 feature shall not see a VT-d with a 5-level IOMMU.
> > > > > 
> > > > > > > +        error_setg(errp, "Do not support 57-bit DMA address, unless both "
> > > > > > > +                         "host and guest are capable of 5-level paging");
> > > > > > >          return false;
> > > > > > >      }
> > > > > > >  
> > > > > > > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> > > > > > > index d084099..2b29b6f 100644
> > > > > > > --- a/hw/i386/intel_iommu_internal.h
> > > > > > > +++ b/hw/i386/intel_iommu_internal.h
> > > > > > > @@ -114,8 +114,8 @@
> > > > > > >                                       VTD_INTERRUPT_ADDR_FIRST + 1)
> > > > > > >  
> > > > > > >  /* The shift of source_id in the key of IOTLB hash table */
> > > > > > > -#define VTD_IOTLB_SID_SHIFT         36
> > > > > > > -#define VTD_IOTLB_LVL_SHIFT         52
> > > > > > > +#define VTD_IOTLB_SID_SHIFT         45
> > > > > > > +#define VTD_IOTLB_LVL_SHIFT         61
> > > > > > >  #define VTD_IOTLB_MAX_SIZE          1024    /* Max size of the hash table */
> > > > > > >  
> > > > > > >  /* IOTLB_REG */
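A note on the shift bump above: with 57-bit IOVAs the gfn portion of the IOTLB hash key needs 57 - 12 = 45 bits, so the source-id field moves from bit 36 to bit 45 and the level field from bit 52 to bit 61. A sketch of the key packing this implies (a hedged illustration, under the assumption that the key is composed as gfn | sid << 45 | level << 61, as in vtd_get_iotlb_key()):

```c
#include <assert.h>
#include <stdint.h>

#define VTD_IOTLB_SID_SHIFT 45  /* was 36 before 57-bit IOVA support */
#define VTD_IOTLB_LVL_SHIFT 61

/* Assumed IOTLB hash-key layout: gfn in the low 45 bits (57-bit IOVA
 * minus the 12-bit page offset), then the source id, then the level. */
uint64_t iotlb_key(uint64_t gfn, uint16_t sid, uint32_t level)
{
    return gfn | ((uint64_t)sid << VTD_IOTLB_SID_SHIFT) |
                 ((uint64_t)level << VTD_IOTLB_LVL_SHIFT);
}
```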
> > > > > > > @@ -212,6 +212,8 @@
> > > > > > >  #define VTD_CAP_SAGAW_39bit         (0x2ULL << VTD_CAP_SAGAW_SHIFT)
> > > > > > >   /* 48-bit AGAW, 4-level page-table */
> > > > > > >  #define VTD_CAP_SAGAW_48bit         (0x4ULL << VTD_CAP_SAGAW_SHIFT)
> > > > > > > + /* 57-bit AGAW, 5-level page-table */
> > > > > > > +#define VTD_CAP_SAGAW_57bit         (0x8ULL << VTD_CAP_SAGAW_SHIFT)
> > > > > > >  
> > > > > > >  /* IQT_REG */
> > > > > > >  #define VTD_IQT_QT(val)             (((val) >> 4) & 0x7fffULL)
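For the SAGAW definitions above, the capability wiring in vtd_init() can be sketched as below (a hedged illustration assuming VTD_CAP_SAGAW_SHIFT is 8, i.e. SAGAW occupies CAP_REG bits 12:8; the helper name is mine): x-aw-bits=57 advertises both the 4-level and 5-level widths, while 48 advertises only the 4-level one.

```c
#include <assert.h>
#include <stdint.h>

#define VTD_CAP_SAGAW_SHIFT 8   /* assumed: SAGAW at CAP_REG bits 12:8 */
#define VTD_CAP_SAGAW_39bit (0x2ULL << VTD_CAP_SAGAW_SHIFT)
#define VTD_CAP_SAGAW_48bit (0x4ULL << VTD_CAP_SAGAW_SHIFT)
#define VTD_CAP_SAGAW_57bit (0x8ULL << VTD_CAP_SAGAW_SHIFT)

/* Sketch of the SAGAW bits advertised for a given x-aw-bits value. */
uint64_t sagaw_for_aw(int aw_bits)
{
    uint64_t cap = VTD_CAP_SAGAW_39bit;    /* 3-level always offered */
    if (aw_bits == 48) {
        cap |= VTD_CAP_SAGAW_48bit;
    } else if (aw_bits == 57) {
        cap |= VTD_CAP_SAGAW_48bit | VTD_CAP_SAGAW_57bit;
    }
    return cap;
}
```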
> > > > > > > @@ -379,6 +381,8 @@ typedef union VTDInvDesc VTDInvDesc;
> > > > > > >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > >  #define VTD_SPTE_PAGE_L4_RSVD_MASK(aw) \
> > > > > > >          (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > +#define VTD_SPTE_PAGE_L5_RSVD_MASK(aw) \
> > > > > > > +        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > >  #define VTD_SPTE_LPAGE_L1_RSVD_MASK(aw) \
> > > > > > >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > >  #define VTD_SPTE_LPAGE_L2_RSVD_MASK(aw) \
> > > > > > > @@ -387,6 +391,8 @@ typedef union VTDInvDesc VTDInvDesc;
> > > > > > >          (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > >  #define VTD_SPTE_LPAGE_L4_RSVD_MASK(aw) \
> > > > > > >          (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > +#define VTD_SPTE_LPAGE_L5_RSVD_MASK(aw) \
> > > > > > > +        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > >  
> > > > > > >  /* Information about page-selective IOTLB invalidate */
> > > > > > >  struct VTDIOTLBPageInvInfo {
> > > > > > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > > > > > index 820451c..7474c4f 100644
> > > > > > > --- a/include/hw/i386/intel_iommu.h
> > > > > > > +++ b/include/hw/i386/intel_iommu.h
> > > > > > > @@ -49,6 +49,7 @@
> > > > > > >  #define DMAR_REG_SIZE               0x230
> > > > > > >  #define VTD_AW_39BIT                39
> > > > > > >  #define VTD_AW_48BIT                48
> > > > > > > +#define VTD_AW_57BIT                57
> > > > > > >  #define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
> > > > > > >  #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
> > > > > > >  
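As a closing sanity check on VTD_HAW_MASK(), the sketch below (macro copied here so the example stands alone; `iova_fits` is an illustrative name) shows why a guest VA above 2^48 — e.g. one handed to VFIO as an IOVA by DPDK — requires x-aw-bits=57:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VTD_HAW_MASK(aw) ((1ULL << (aw)) - 1)

/* An IOVA is representable iff it has no bits above the address width. */
bool iova_fits(uint64_t iova, int aw_bits)
{
    return (iova & ~VTD_HAW_MASK(aw_bits)) == 0;
}
```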
> > > > > > 
> > > > > > 
> > > > > 
> > > > > B.R.
> > > > > Yu
> > > > > 
> > 
> > B.R.
> > Yu
> 

B.R.
Yu


* Re: [Qemu-devel] [PATCH v3 2/2] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.
  2018-12-19  3:40               ` Yu Zhang
@ 2018-12-19  4:35                 ` Michael S. Tsirkin
  2018-12-19  5:57                   ` Yu Zhang
  0 siblings, 1 reply; 57+ messages in thread
From: Michael S. Tsirkin @ 2018-12-19  4:35 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Eduardo Habkost, qemu-devel, Peter Xu, Paolo Bonzini,
	Igor Mammedov, Richard Henderson

On Wed, Dec 19, 2018 at 11:40:06AM +0800, Yu Zhang wrote:
> On Tue, Dec 18, 2018 at 09:49:02AM -0500, Michael S. Tsirkin wrote:
> > On Tue, Dec 18, 2018 at 09:45:41PM +0800, Yu Zhang wrote:
> > > On Tue, Dec 18, 2018 at 07:43:28AM -0500, Michael S. Tsirkin wrote:
> > > > On Tue, Dec 18, 2018 at 06:01:16PM +0800, Yu Zhang wrote:
> > > > > On Tue, Dec 18, 2018 at 05:47:14PM +0800, Yu Zhang wrote:
> > > > > > On Mon, Dec 17, 2018 at 02:29:02PM +0100, Igor Mammedov wrote:
> > > > > > > On Wed, 12 Dec 2018 21:05:39 +0800
> > > > > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > > > > > 
> > > > > > > > A 5-level paging capable VM may choose to use 57-bit IOVA address width.
> > > > > > > > E.g. guest applications may prefer to use their VA as the IOVA when performing
> > > > > > > > VFIO map/unmap operations, to avoid the burden of managing the IOVA space.
> > > > > > > > 
> > > > > > > > This patch extends the current vIOMMU logic to cover the extended address
> > > > > > > > width. When creating a VM with 5-level paging feature, one can choose to
> > > > > > > > create a virtual VTD with 5-level paging capability, with configurations
> > > > > > > > like "-device intel-iommu,x-aw-bits=57".
> > > > > > > > 
> > > > > > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > > > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > > > > > > ---
> > > > > > > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > > > > > > Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> > > > > > > > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > > Cc: Richard Henderson <rth@twiddle.net>
> > > > > > > > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > > > > > > > Cc: Peter Xu <peterx@redhat.com>
> > > > > > > > ---
> > > > > > > >  hw/i386/intel_iommu.c          | 53 ++++++++++++++++++++++++++++++++----------
> > > > > > > >  hw/i386/intel_iommu_internal.h | 10 ++++++--
> > > > > > > >  include/hw/i386/intel_iommu.h  |  1 +
> > > > > > > >  3 files changed, 50 insertions(+), 14 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > > > > > index 0e88c63..871110c 100644
> > > > > > > > --- a/hw/i386/intel_iommu.c
> > > > > > > > +++ b/hw/i386/intel_iommu.c
> > > > > > > > @@ -664,16 +664,16 @@ static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce,
> > > > > > > >  
> > > > > > > >  /*
> > > > > > > >   * Rsvd field masks for spte:
> > > > > > > > - *     Index [1] to [4] 4k pages
> > > > > > > > - *     Index [5] to [8] large pages
> > > > > > > > + *     Index [1] to [5] 4k pages
> > > > > > > > + *     Index [6] to [10] large pages
> > > > > > > >   */
> > > > > > > > -static uint64_t vtd_paging_entry_rsvd_field[9];
> > > > > > > > +static uint64_t vtd_paging_entry_rsvd_field[11];
> > > > > > > >  
> > > > > > > >  static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > > > > > > >  {
> > > > > > > >      if (slpte & VTD_SL_PT_PAGE_SIZE_MASK) {
> > > > > > > >          /* Maybe large page */
> > > > > > > > -        return slpte & vtd_paging_entry_rsvd_field[level + 4];
> > > > > > > > +        return slpte & vtd_paging_entry_rsvd_field[level + 5];
> > > > > > > >      } else {
> > > > > > > >          return slpte & vtd_paging_entry_rsvd_field[level];
> > > > > > > >      }
> > > > > > > > @@ -3127,6 +3127,8 @@ static void vtd_init(IntelIOMMUState *s)
> > > > > > > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > > > > > >      if (s->aw_bits == VTD_AW_48BIT) {
> > > > > > > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > > > > > > > +    } else if (s->aw_bits == VTD_AW_57BIT) {
> > > > > > > > +        s->cap |= VTD_CAP_SAGAW_57bit | VTD_CAP_SAGAW_48bit;
> > > > > > > >      }
> > > > > > > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > > > > > >      s->haw_bits = cpu->phys_bits;
> > > > > > > > @@ -3139,10 +3141,12 @@ static void vtd_init(IntelIOMMUState *s)
> > > > > > > >      vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > > >      vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > > >      vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > > > -    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > > > > -    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > > > -    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > > > -    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > > > +    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_PAGE_L5_RSVD_MASK(s->haw_bits);
> > > > > > > > +    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > > > > +    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > > > +    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > > > +    vtd_paging_entry_rsvd_field[9] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > > > +    vtd_paging_entry_rsvd_field[10] = VTD_SPTE_LPAGE_L5_RSVD_MASK(s->haw_bits);
> > > > > > > >  
> > > > > > > >      if (x86_iommu->intr_supported) {
> > > > > > > >          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
> > > > > > > > @@ -3241,6 +3245,23 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
> > > > > > > >      return &vtd_as->as;
> > > > > > > >  }
> > > > > > > >  
> > > > > > > > +static bool host_has_la57(void)
> > > > > > > > +{
> > > > > > > > +    uint32_t ecx, unused;
> > > > > > > > +
> > > > > > > > +    host_cpuid(7, 0, &unused, &unused, &ecx, &unused);
> > > > > > > > +    return ecx & CPUID_7_0_ECX_LA57;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +static bool guest_has_la57(void)
> > > > > > > > +{
> > > > > > > > +    CPUState *cs = first_cpu;
> > > > > > > > +    X86CPU *cpu = X86_CPU(cs);
> > > > > > > > +    CPUX86State *env = &cpu->env;
> > > > > > > > +
> > > > > > > > +    return env->features[FEAT_7_0_ECX] & CPUID_7_0_ECX_LA57;
> > > > > > > > +}
> > > > > > > another direct access to CPU fields,
> > > > > > > I'd suggest setting this value when the iommu is created,
> > > > > > > i.e. add a 'la57' property and set it from the iommu owner.
> > > > > > > 
> > > > > > 
> > > > > > Sorry, do you mean "-device intel-iommu,la57"? I think we do not need
> > > > > > that, because a 5-level capable vIOMMU can be created with properties
> > > > > > like "-device intel-iommu,x-aw-bits=57". 
> > > > > > 
> > > > > > The guest CPU fields are checked to make sure the VM has LA57 CPU feature,
> > > > > > because I believe there shall be no 5-level IOMMU on platforms without LA57
> > > > > > CPUs. 
> > > > 
> > > > I don't necessarily see why these need to be connected.
> > > > If yes, please add a comment in the code to explain.
> > > 
> > > Sorry, do you mean the VM shall be able to see a 5-level IOMMU even if it does not
> > > have the LA57 feature? I did not see any direct connection when first asked to enable
> > > a 5-level vIOMMU, but I was told (and checked) that DPDK in the VM may choose a VA
> > > value as an IOVA.
> > 
> > Right but then that doesn't work on all hosts either.
> 
> Oh, the host already has a 5-level IOMMU now. So I think DPDK running natively shall work with that.
> 
> > 
> > > And if the guest has LA57, we should create a 5-level vIOMMU for the VM.
> > > But if the VM does not even have LA57, is there any specific reason we should give
> > > it a 5-level vIOMMU?
> > 
> > So the example you give is VTD address width < CPU aw. That is known
> > to be problematic for dpdk but not for other software, and maybe dpdk
> > will learn how to cope. Given that such hosts exist, it might be
> > useful to support this at least for debugging.
> > 
> > Are there reasons to worry about VTD > CPU?
> 
> Well, I am not that worried (the lack of a usage case is one concern). I am OK to drop the guest check. :)
> 
> > 
> > 
> > > > 
> > > > 
> > > > > > > >  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > > > > > >  {
> > > > > > > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > > > > > > @@ -3267,11 +3288,19 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > > > > > >          }
> > > > > > > >      }
> > > > > > > >  
> > > > > > > > -    /* Currently only address widths supported are 39 and 48 bits */
> > > > > > > > +    /* Currently address widths supported are 39, 48, and 57 bits */
> > > > > > > >      if ((s->aw_bits != VTD_AW_39BIT) &&
> > > > > > > > -        (s->aw_bits != VTD_AW_48BIT)) {
> > > > > > > > -        error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > > > > > > > -                   VTD_AW_39BIT, VTD_AW_48BIT);
> > > > > > > > +        (s->aw_bits != VTD_AW_48BIT) &&
> > > > > > > > +        (s->aw_bits != VTD_AW_57BIT)) {
> > > > > > > > +        error_setg(errp, "Supported values for x-aw-bits are: %d, %d, %d",
> > > > > > > > +                   VTD_AW_39BIT, VTD_AW_48BIT, VTD_AW_57BIT);
> > > > > > > > +        return false;
> > > > > > > > +    }
> > > > > > > > +
> > > > > > > > +    if ((s->aw_bits == VTD_AW_57BIT) &&
> > > > > > > > +        !(host_has_la57() && guest_has_la57())) {
> > > > > > > Is the iommu supposed to work in TCG mode?
> > > > > > > If yes, then why should it care about host_has_la57()?
> > > > > > > 
> > > > > > 
> > > > > > Hmm... I did not take TCG mode into consideration. And host_has_la57() is
> > > > > > used to guarantee the host has the LA57 feature, so that iommu shadowing works
> > > > > > for device assignment.
> > > > > > 
> > > > > > I guess the iommu shall work in TCG mode (though I am not quite sure about this).
> > > > > > But I do not have any usage case for a 5-level vIOMMU in TCG in mind. So maybe
> > > > > > we can:
> > > > > > 1> check 'ms->accel' in vtd_decide_config() and ignore the host
> > > > > > capability if it is TCG.
> > > > > 
> > > > > For choice 1, kvm_enabled() might be used instead of ms->accel. Thanks Peter
> > > > > for the reminder. :)
> > > > 
> > > > 
> > > > This needs a big comment with an explanation though.
> > > > And probably a TODO to make it work under TCG ...
> > > > 
> > > 
> > > Thanks, Michael. For choice 1, I believe it should work for TCG (will need testing
> > > though), and the condition would be something like:
> > > 
> > >     if ((s->aw_bits == VTD_AW_57BIT) &&
> > >         kvm_enabled() &&
> > >         !host_has_la57())  {
> > > 
> > > As you can see, though I removed the check of guest_has_la57(), I still kept the
> > > check against the host when KVM is enabled. I'm still open to being convinced that
> > > we do not need the guest check. :)
> > 
> > 
> > okay but then (repeating myself, sorry) please add a comment that explains
> > what happens if you do not add this limitation.
> 
> How about the comment below?
>     /*
>      * For KVM guests, the host capability of LA57 shall be available,

So why is host CPU LA57 necessary for shadowing? Could you explain, please?

> so
>      * that iommu shadowing works for the device assignment scenario. But for
>      * TCG mode, we do not need such a restriction.
>      */
> 
> BTW, I just tested TCG mode; it works (with the host-capability restriction removed).
> 
> > 
> > 
> > > > > > 2> Or, we can choose to keep it as it is, and add the check once a 5-level
> > > > > > paging vIOMMU does have a usage in TCG?
> > > > > > 
> > > > > > But as to the check of guest capability, I still believe it is necessary. As
> > > > > > said, a VM without the LA57 feature shall not see a VT-d with a 5-level IOMMU.
> > > > > > 
> > > > > > > > +        error_setg(errp, "Do not support 57-bit DMA address, unless both "
> > > > > > > > +                         "host and guest are capable of 5-level paging");
> > > > > > > >          return false;
> > > > > > > >      }
> > > > > > > >  
> > > > > > > > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> > > > > > > > index d084099..2b29b6f 100644
> > > > > > > > --- a/hw/i386/intel_iommu_internal.h
> > > > > > > > +++ b/hw/i386/intel_iommu_internal.h
> > > > > > > > @@ -114,8 +114,8 @@
> > > > > > > >                                       VTD_INTERRUPT_ADDR_FIRST + 1)
> > > > > > > >  
> > > > > > > >  /* The shift of source_id in the key of IOTLB hash table */
> > > > > > > > -#define VTD_IOTLB_SID_SHIFT         36
> > > > > > > > -#define VTD_IOTLB_LVL_SHIFT         52
> > > > > > > > +#define VTD_IOTLB_SID_SHIFT         45
> > > > > > > > +#define VTD_IOTLB_LVL_SHIFT         61
> > > > > > > >  #define VTD_IOTLB_MAX_SIZE          1024    /* Max size of the hash table */
> > > > > > > >  
> > > > > > > >  /* IOTLB_REG */
> > > > > > > > @@ -212,6 +212,8 @@
> > > > > > > >  #define VTD_CAP_SAGAW_39bit         (0x2ULL << VTD_CAP_SAGAW_SHIFT)
> > > > > > > >   /* 48-bit AGAW, 4-level page-table */
> > > > > > > >  #define VTD_CAP_SAGAW_48bit         (0x4ULL << VTD_CAP_SAGAW_SHIFT)
> > > > > > > > + /* 57-bit AGAW, 5-level page-table */
> > > > > > > > +#define VTD_CAP_SAGAW_57bit         (0x8ULL << VTD_CAP_SAGAW_SHIFT)
> > > > > > > >  
> > > > > > > >  /* IQT_REG */
> > > > > > > >  #define VTD_IQT_QT(val)             (((val) >> 4) & 0x7fffULL)
> > > > > > > > @@ -379,6 +381,8 @@ typedef union VTDInvDesc VTDInvDesc;
> > > > > > > >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > >  #define VTD_SPTE_PAGE_L4_RSVD_MASK(aw) \
> > > > > > > >          (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > +#define VTD_SPTE_PAGE_L5_RSVD_MASK(aw) \
> > > > > > > > +        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > >  #define VTD_SPTE_LPAGE_L1_RSVD_MASK(aw) \
> > > > > > > >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > >  #define VTD_SPTE_LPAGE_L2_RSVD_MASK(aw) \
> > > > > > > > @@ -387,6 +391,8 @@ typedef union VTDInvDesc VTDInvDesc;
> > > > > > > >          (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > >  #define VTD_SPTE_LPAGE_L4_RSVD_MASK(aw) \
> > > > > > > >          (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > +#define VTD_SPTE_LPAGE_L5_RSVD_MASK(aw) \
> > > > > > > > +        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > >  
> > > > > > > >  /* Information about page-selective IOTLB invalidate */
> > > > > > > >  struct VTDIOTLBPageInvInfo {
> > > > > > > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > > > > > > index 820451c..7474c4f 100644
> > > > > > > > --- a/include/hw/i386/intel_iommu.h
> > > > > > > > +++ b/include/hw/i386/intel_iommu.h
> > > > > > > > @@ -49,6 +49,7 @@
> > > > > > > >  #define DMAR_REG_SIZE               0x230
> > > > > > > >  #define VTD_AW_39BIT                39
> > > > > > > >  #define VTD_AW_48BIT                48
> > > > > > > > +#define VTD_AW_57BIT                57
> > > > > > > >  #define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
> > > > > > > >  #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
> > > > > > > >  
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > B.R.
> > > > > > Yu
> > > > > > 
> > > 
> > > B.R.
> > > Yu
> > 
> 
> B.R.
> Yu


* Re: [Qemu-devel] [PATCH v3 2/2] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.
  2018-12-19  4:35                 ` Michael S. Tsirkin
@ 2018-12-19  5:57                   ` Yu Zhang
  2018-12-19 15:23                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 57+ messages in thread
From: Yu Zhang @ 2018-12-19  5:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eduardo Habkost, qemu-devel, Peter Xu, Paolo Bonzini,
	Igor Mammedov, Richard Henderson

On Tue, Dec 18, 2018 at 11:35:34PM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 19, 2018 at 11:40:06AM +0800, Yu Zhang wrote:
> > On Tue, Dec 18, 2018 at 09:49:02AM -0500, Michael S. Tsirkin wrote:
> > > On Tue, Dec 18, 2018 at 09:45:41PM +0800, Yu Zhang wrote:
> > > > On Tue, Dec 18, 2018 at 07:43:28AM -0500, Michael S. Tsirkin wrote:
> > > > > On Tue, Dec 18, 2018 at 06:01:16PM +0800, Yu Zhang wrote:
> > > > > > On Tue, Dec 18, 2018 at 05:47:14PM +0800, Yu Zhang wrote:
> > > > > > > On Mon, Dec 17, 2018 at 02:29:02PM +0100, Igor Mammedov wrote:
> > > > > > > > On Wed, 12 Dec 2018 21:05:39 +0800
> > > > > > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > > > > > > 
> > > > > > > > > A 5-level paging capable VM may choose to use 57-bit IOVA address width.
> > > > > > > > > E.g. guest applications may prefer to use their VA as the IOVA when performing
> > > > > > > > > VFIO map/unmap operations, to avoid the burden of managing the IOVA space.
> > > > > > > > > 
> > > > > > > > > This patch extends the current vIOMMU logic to cover the extended address
> > > > > > > > > width. When creating a VM with 5-level paging feature, one can choose to
> > > > > > > > > create a virtual VTD with 5-level paging capability, with configurations
> > > > > > > > > like "-device intel-iommu,x-aw-bits=57".
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > > > > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > > > > > > > ---
> > > > > > > > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > > > > > > > Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> > > > > > > > > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > > > Cc: Richard Henderson <rth@twiddle.net>
> > > > > > > > > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > > > > > > > > Cc: Peter Xu <peterx@redhat.com>
> > > > > > > > > ---
> > > > > > > > >  hw/i386/intel_iommu.c          | 53 ++++++++++++++++++++++++++++++++----------
> > > > > > > > >  hw/i386/intel_iommu_internal.h | 10 ++++++--
> > > > > > > > >  include/hw/i386/intel_iommu.h  |  1 +
> > > > > > > > >  3 files changed, 50 insertions(+), 14 deletions(-)
> > > > > > > > > 
> > > > > > > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > > > > > > index 0e88c63..871110c 100644
> > > > > > > > > --- a/hw/i386/intel_iommu.c
> > > > > > > > > +++ b/hw/i386/intel_iommu.c
> > > > > > > > > @@ -664,16 +664,16 @@ static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce,
> > > > > > > > >  
> > > > > > > > >  /*
> > > > > > > > >   * Rsvd field masks for spte:
> > > > > > > > > - *     Index [1] to [4] 4k pages
> > > > > > > > > - *     Index [5] to [8] large pages
> > > > > > > > > + *     Index [1] to [5] 4k pages
> > > > > > > > > + *     Index [6] to [10] large pages
> > > > > > > > >   */
> > > > > > > > > -static uint64_t vtd_paging_entry_rsvd_field[9];
> > > > > > > > > +static uint64_t vtd_paging_entry_rsvd_field[11];
> > > > > > > > >  
> > > > > > > > >  static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > > > > > > > >  {
> > > > > > > > >      if (slpte & VTD_SL_PT_PAGE_SIZE_MASK) {
> > > > > > > > >          /* Maybe large page */
> > > > > > > > > -        return slpte & vtd_paging_entry_rsvd_field[level + 4];
> > > > > > > > > +        return slpte & vtd_paging_entry_rsvd_field[level + 5];
> > > > > > > > >      } else {
> > > > > > > > >          return slpte & vtd_paging_entry_rsvd_field[level];
> > > > > > > > >      }
> > > > > > > > > @@ -3127,6 +3127,8 @@ static void vtd_init(IntelIOMMUState *s)
> > > > > > > > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > > > > > > >      if (s->aw_bits == VTD_AW_48BIT) {
> > > > > > > > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > > > > > > > > +    } else if (s->aw_bits == VTD_AW_57BIT) {
> > > > > > > > > +        s->cap |= VTD_CAP_SAGAW_57bit | VTD_CAP_SAGAW_48bit;
> > > > > > > > >      }
> > > > > > > > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > > > > > > >      s->haw_bits = cpu->phys_bits;
> > > > > > > > > @@ -3139,10 +3141,12 @@ static void vtd_init(IntelIOMMUState *s)
> > > > > > > > >      vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > > > >      vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > > > >      vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > > > > -    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > > > > > -    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > > > > -    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > > > > -    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > > > > +    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_PAGE_L5_RSVD_MASK(s->haw_bits);
> > > > > > > > > +    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > > > > > +    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > > > > +    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > > > > +    vtd_paging_entry_rsvd_field[9] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > > > > +    vtd_paging_entry_rsvd_field[10] = VTD_SPTE_LPAGE_L5_RSVD_MASK(s->haw_bits);
> > > > > > > > >  
> > > > > > > > >      if (x86_iommu->intr_supported) {
> > > > > > > > >          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
> > > > > > > > > @@ -3241,6 +3245,23 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
> > > > > > > > >      return &vtd_as->as;
> > > > > > > > >  }
> > > > > > > > >  
> > > > > > > > > +static bool host_has_la57(void)
> > > > > > > > > +{
> > > > > > > > > +    uint32_t ecx, unused;
> > > > > > > > > +
> > > > > > > > > +    host_cpuid(7, 0, &unused, &unused, &ecx, &unused);
> > > > > > > > > +    return ecx & CPUID_7_0_ECX_LA57;
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +static bool guest_has_la57(void)
> > > > > > > > > +{
> > > > > > > > > +    CPUState *cs = first_cpu;
> > > > > > > > > +    X86CPU *cpu = X86_CPU(cs);
> > > > > > > > > +    CPUX86State *env = &cpu->env;
> > > > > > > > > +
> > > > > > > > > +    return env->features[FEAT_7_0_ECX] & CPUID_7_0_ECX_LA57;
> > > > > > > > > +}
> > > > > > > > Another direct access to CPU fields;
> > > > > > > > I'd suggest setting this value when the iommu is created,
> > > > > > > > i.e. add a 'la57' property and set it from the iommu owner.
> > > > > > > > 
> > > > > > > 
> > > > > > > Sorry, do you mean "-device intel-iommu,la57"? I think we do not need
> > > > > > > that, because a 5-level capable vIOMMU can be created with properties
> > > > > > > like "-device intel-iommu,x-aw-bits=57". 
> > > > > > > 
> > > > > > > The guest CPU fields are checked to make sure the VM has LA57 CPU feature,
> > > > > > > because I believe there shall be no 5-level IOMMU on platforms without LA57
> > > > > > > CPUs. 
> > > > > 
> > > > > I don't necessarily see why these need to be connected.
> > > > > If yes pls add code to explain.
> > > > 
> > > > Sorry, do you mean the VM shall be able to see a 5-level IOMMU even if it does
> > > > not have the LA57 feature? I do not see any direct connection when asked to enable
> > > > a 5-level vIOMMU at first, but I was told (and checked) that DPDK in the VM may
> > > > choose a VA value as an IOVA.
> > > 
> > > Right but then that doesn't work on all hosts either.
> > 
> > Oh, the host already has a 5-level IOMMU now. So I think DPDK running natively should work with that.
> > 
> > > 
> > > > And if the guest has LA57, we should create a 5-level vIOMMU for the VM.
> > > > But if the VM does not even have LA57, is there any specific reason we should
> > > > give it a 5-level vIOMMU?
> > > 
> > > So the example you give is VTD address width < CPU aw. That is known
> > > to be problematic for dpdk but not for other software, and maybe dpdk
> > > will learn how to cope. Given such hosts exist it might be
> > > useful to support this at least for debugging.
> > > 
> > > Are there reasons to worry about VTD > CPU?
> > 
> > Well, I am not that worried (having no use case is one concern). I am OK to drop the guest check. :)
> > 
> > > 
> > > 
> > > > > 
> > > > > 
> > > > > > > > >  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > > > > > > >  {
> > > > > > > > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > > > > > > > @@ -3267,11 +3288,19 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > > > > > > >          }
> > > > > > > > >      }
> > > > > > > > >  
> > > > > > > > > -    /* Currently only address widths supported are 39 and 48 bits */
> > > > > > > > > +    /* Currently address widths supported are 39, 48, and 57 bits */
> > > > > > > > >      if ((s->aw_bits != VTD_AW_39BIT) &&
> > > > > > > > > -        (s->aw_bits != VTD_AW_48BIT)) {
> > > > > > > > > -        error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > > > > > > > > -                   VTD_AW_39BIT, VTD_AW_48BIT);
> > > > > > > > > +        (s->aw_bits != VTD_AW_48BIT) &&
> > > > > > > > > +        (s->aw_bits != VTD_AW_57BIT)) {
> > > > > > > > > +        error_setg(errp, "Supported values for x-aw-bits are: %d, %d, %d",
> > > > > > > > > +                   VTD_AW_39BIT, VTD_AW_48BIT, VTD_AW_57BIT);
> > > > > > > > > +        return false;
> > > > > > > > > +    }
> > > > > > > > > +
> > > > > > > > > +    if ((s->aw_bits == VTD_AW_57BIT) &&
> > > > > > > > > +        !(host_has_la57() && guest_has_la57())) {
> > > > > > > > Is the iommu supposed to work in TCG mode?
> > > > > > > > If yes, then why should it care about host_has_la57()?
> > > > > > > > 
> > > > > > > 
> > > > > > > Hmm... I did not take TCG mode into consideration. And host_has_la57() is
> > > > > > > used to guarantee that the host has the la57 feature, so that iommu
> > > > > > > shadowing works for device assignment.
> > > > > > > 
> > > > > > > I guess the iommu should work in TCG mode (though I am not quite sure about
> > > > > > > this). But I do not have any use case for a 5-level vIOMMU in TCG in mind.
> > > > > > > So maybe we can:
> > > > > > > 1> check the 'ms->accel' in vtd_decide_config() and do not care about host
> > > > > > > capability if it is TCG.
> > > > > > 
> > > > > > For choice 1, kvm_enabled() might be used instead of ms->accel. Thanks Peter
> > > > > > for the reminder. :)
> > > > > 
> > > > > 
> > > > > This needs a big comment with an explanation though.
> > > > > And probably a TODO to make it work under TCG ...
> > > > > 
> > > > 
> > > > Thanks, Michael. For choice 1, I believe it should work for TCG (will need
> > > > testing though), and the condition would be sth. like:
> > > > 
> > > >     if ((s->aw_bits == VTD_AW_57BIT) &&
> > > >         kvm_enabled() &&
> > > >         !host_has_la57())  {
> > > > 
> > > > As you can see, though I removed the check of guest_has_la57(), I still kept the
> > > > check against the host when KVM is enabled. I'm still ready to be convinced by any
> > > > requirement for why we do not need the guest check. :)
> > > 
> > > 
> > > okay but then (repeating myself, sorry) pls add a comment that explains
> > > what happens if you do not add this limitation.
> > 
> > How about below comments?
> >     /*
> >      * For KVM guests, the host capability of LA57 shall be available,
> 
> So why is host CPU LA57 necessary for shadowing? Could you explain pls?

Oh, let me try to explain the background here. :)

Currently, vIOMMU in qemu does not have logic to check against the hardware
IOMMU capability. E.g. when we create a vIOMMU with a 48-bit DMA address width,
qemu does not check whether any physical IOMMU has such support, and the shadow
IOMMU logic will have problems if the host IOMMU only supports 39-bit IOVA. We
will have the same problem when it comes to 57-bit IOVA.

My previous discussion with Peter Xu reached an agreement that for now, we
just use the host CPU capability as a reference when trying to create a 5-level
vIOMMU, because 57-bit IOMMU hardware will not come until the ICX platform
(which includes LA57).

The final correct solution should be to enumerate the capabilities of the
hardware IOMMUs used by the assigned device, and reject the configuration if
any mismatch is found.

Maybe I should add a TODO in the above comments, giving the background explanation.

> 
> > so
> >      * that iommu shadowing works for the device assignment scenario. But for
> >      * TCG mode, we do not need such a restriction.
> >      */
> > 
> > BTW, I just tested TCG mode, and it works (with the restriction on host capability removed).
> > 
> > > 
> > > 
> > > > > > > 2> Or, we can choose to keep it as it is, and add the check when a 5-level
> > > > > > > paging vIOMMU does have a use case in TCG?
> > > > > > > 
> > > > > > > But as to the check of guest capability, I still believe it is necessary. As
> > > > > > > said, a VM without LA57 feature shall not see a VT-d with 5-level IOMMU.
> > > > > > > 
> > > > > > > > > +        error_setg(errp, "Do not support 57-bit DMA address, unless both "
> > > > > > > > > +                         "host and guest are capable of 5-level paging");
> > > > > > > > >          return false;
> > > > > > > > >      }
> > > > > > > > >  
> > > > > > > > > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> > > > > > > > > index d084099..2b29b6f 100644
> > > > > > > > > --- a/hw/i386/intel_iommu_internal.h
> > > > > > > > > +++ b/hw/i386/intel_iommu_internal.h
> > > > > > > > > @@ -114,8 +114,8 @@
> > > > > > > > >                                       VTD_INTERRUPT_ADDR_FIRST + 1)
> > > > > > > > >  
> > > > > > > > >  /* The shift of source_id in the key of IOTLB hash table */
> > > > > > > > > -#define VTD_IOTLB_SID_SHIFT         36
> > > > > > > > > -#define VTD_IOTLB_LVL_SHIFT         52
> > > > > > > > > +#define VTD_IOTLB_SID_SHIFT         45
> > > > > > > > > +#define VTD_IOTLB_LVL_SHIFT         61
> > > > > > > > >  #define VTD_IOTLB_MAX_SIZE          1024    /* Max size of the hash table */
> > > > > > > > >  
> > > > > > > > >  /* IOTLB_REG */
> > > > > > > > > @@ -212,6 +212,8 @@
> > > > > > > > >  #define VTD_CAP_SAGAW_39bit         (0x2ULL << VTD_CAP_SAGAW_SHIFT)
> > > > > > > > >   /* 48-bit AGAW, 4-level page-table */
> > > > > > > > >  #define VTD_CAP_SAGAW_48bit         (0x4ULL << VTD_CAP_SAGAW_SHIFT)
> > > > > > > > > + /* 57-bit AGAW, 5-level page-table */
> > > > > > > > > +#define VTD_CAP_SAGAW_57bit         (0x8ULL << VTD_CAP_SAGAW_SHIFT)
> > > > > > > > >  
> > > > > > > > >  /* IQT_REG */
> > > > > > > > >  #define VTD_IQT_QT(val)             (((val) >> 4) & 0x7fffULL)
> > > > > > > > > @@ -379,6 +381,8 @@ typedef union VTDInvDesc VTDInvDesc;
> > > > > > > > >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > >  #define VTD_SPTE_PAGE_L4_RSVD_MASK(aw) \
> > > > > > > > >          (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > > +#define VTD_SPTE_PAGE_L5_RSVD_MASK(aw) \
> > > > > > > > > +        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > >  #define VTD_SPTE_LPAGE_L1_RSVD_MASK(aw) \
> > > > > > > > >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > >  #define VTD_SPTE_LPAGE_L2_RSVD_MASK(aw) \
> > > > > > > > > @@ -387,6 +391,8 @@ typedef union VTDInvDesc VTDInvDesc;
> > > > > > > > >          (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > >  #define VTD_SPTE_LPAGE_L4_RSVD_MASK(aw) \
> > > > > > > > >          (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > > +#define VTD_SPTE_LPAGE_L5_RSVD_MASK(aw) \
> > > > > > > > > +        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > >  
> > > > > > > > >  /* Information about page-selective IOTLB invalidate */
> > > > > > > > >  struct VTDIOTLBPageInvInfo {
> > > > > > > > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > > > > > > > index 820451c..7474c4f 100644
> > > > > > > > > --- a/include/hw/i386/intel_iommu.h
> > > > > > > > > +++ b/include/hw/i386/intel_iommu.h
> > > > > > > > > @@ -49,6 +49,7 @@
> > > > > > > > >  #define DMAR_REG_SIZE               0x230
> > > > > > > > >  #define VTD_AW_39BIT                39
> > > > > > > > >  #define VTD_AW_48BIT                48
> > > > > > > > > +#define VTD_AW_57BIT                57
> > > > > > > > >  #define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
> > > > > > > > >  #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
> > > > > > > > >  
> > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > > B.R.
> > > > > > > Yu
> > > > > > > 
> > > > 
> > > > B.R.
> > > > Yu
> > > 
> > 
> > B.R.
> > Yu

B.R.
Yu

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-19  3:12             ` Michael S. Tsirkin
@ 2018-12-19  6:28               ` Yu Zhang
  2018-12-19 15:30                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 57+ messages in thread
From: Yu Zhang @ 2018-12-19  6:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Igor Mammedov, Eduardo Habkost, qemu-devel, Peter Xu,
	Paolo Bonzini, Richard Henderson

On Tue, Dec 18, 2018 at 10:12:45PM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 19, 2018 at 11:03:58AM +0800, Yu Zhang wrote:
> > On Tue, Dec 18, 2018 at 09:58:35AM -0500, Michael S. Tsirkin wrote:
> > > On Tue, Dec 18, 2018 at 03:55:36PM +0100, Igor Mammedov wrote:
> > > > On Tue, 18 Dec 2018 17:27:23 +0800
> > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > > 
> > > > > On Mon, Dec 17, 2018 at 02:17:40PM +0100, Igor Mammedov wrote:
> > > > > > On Wed, 12 Dec 2018 21:05:38 +0800
> > > > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > > > > 
> > > > > > > Currently, vIOMMU is using the value of IOVA address width, instead of
> > > > > > > the host address width(HAW) to calculate the number of reserved bits in
> > > > > > > data structures such as root entries, context entries, and entries of
> > > > > > > DMA paging structures etc.
> > > > > > > 
> > > > > > > However values of IOVA address width and of the HAW may not equal. For
> > > > > > > example, a 48-bit IOVA can only be mapped to host addresses no wider than
> > > > > > > 46 bits. Using 48, instead of 46 to calculate the reserved bit may result
> > > > > > > in an invalid IOVA being accepted.
> > > > > > > 
> > > > > > > To fix this, a new field - haw_bits is introduced in struct IntelIOMMUState,
> > > > > > > whose value is initialized based on the maximum physical address set to
> > > > > > > guest CPU.
> > > > > > 
> > > > > > > Also, definitions such as VTD_HOST_AW_39/48BIT etc. are renamed
> > > > > > > to clarify.
> > > > > > > 
> > > > > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > > > > > ---
> > > > > > [...]
> > > > > > 
> > > > > > > @@ -3100,6 +3104,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> > > > > > >  static void vtd_init(IntelIOMMUState *s)
> > > > > > >  {
> > > > > > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > > > > > +    CPUState *cs = first_cpu;
> > > > > > > +    X86CPU *cpu = X86_CPU(cs);
> > > > > > >  
> > > > > > >      memset(s->csr, 0, DMAR_REG_SIZE);
> > > > > > >      memset(s->wmask, 0, DMAR_REG_SIZE);
> > > > > > > @@ -3119,23 +3125,24 @@ static void vtd_init(IntelIOMMUState *s)
> > > > > > >      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
> > > > > > >               VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> > > > > > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > > > > > -    if (s->aw_bits == VTD_HOST_AW_48BIT) {
> > > > > > > +    if (s->aw_bits == VTD_AW_48BIT) {
> > > > > > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > > > > > >      }
> > > > > > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > > > > > +    s->haw_bits = cpu->phys_bits;
> > > > > > Is it possible to avoid accessing CPU fields directly or cpu altogether
> > > > > > and set phys_bits when iommu is created?
> > > > > 
> > > > > Thanks for your comments, Igor.
> > > > > 
> > > > > Well, I guess you prefer not to query the CPU capabilities while deciding
> > > > > the vIOMMU features. But to me, they are not that irrelevant. :)
> > > > > 
> > > > > Here the hardware address width in vt-d, and the one in cpuid.MAXPHYSADDR
> > > > > are referring to the same concept. In VM, both are the maximum guest physical
> > > > > address width. If we do not check the CPU field here, we will still have to
> > > > > check the CPU field in other places such as build_dmar_q35(), and reset the
> > > > > s->haw_bits again.
> > > > > 
> > > > > Is this explanation convincing enough? :)
> > > > current build_dmar_q35() doesn't do it; it's all new code in this series that
> > > > contains unacceptable direct access from one device (iommu) to another (cpu).
> > > > The proper way would be for the owner of the iommu to fish limits from somewhere
> > > > and set values during iommu creation.
> > > 
> > > Maybe it's a good idea to add documentation for now.
> > 
> > Thanks Michael. So what kind of documentation do you refer to?
> 
> The idea would be to have two properties, an AW for the CPU and
> one for the IOMMU. In the documentation, explain that they
> should normally be set to the same value.
> 
> > > 
> > > It would be nice not to push this stuff up the stack,
> > > it's unfortunate that our internal APIs make it hard.
> > 
> > Sorry, I do not quite get it. What do you mean by "internal APIs make it hard"? :)
> 
> The API doesn't actually guarantee any initialization order.
> CPU happens to be initialized first but I do not
> think there's a guarantee that it will keep being the case.
> This makes it hard to get properties from one device
> and use them in another.
> 

Oops...
Then there can be no easy way at runtime to guarantee this. BTW, could we
initialize the CPU before other components? Is it hard to do, or not reasonable
to do so?

I have a plan to draft a doc in qemu on the 5-level paging topic (maybe after
all the enabling is done). But I don't think this is the proper place to put
it - as you can see, this fix is not relevant to 5-level paging. So any
suggestion about the documentation?

> > > 
> > > 
> > > > > > 
> > > > > > Perhaps Eduardo
> > > > > >  can suggest better approach, since he's more familiar with phys_bits topic
> > > > > 
> > > > > @Eduardo, any comments? Thanks!
> > > > > 
> > > > > > 
> > > > > > >      /*
> > > > > > >       * Rsvd field masks for spte
> > > > > > >       */
> > > > > > >      vtd_paging_entry_rsvd_field[0] = ~0ULL;
> > > > > > > -    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->aw_bits);
> > > > > > > -    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->aw_bits);
> > > > > > > -    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->aw_bits);
> > > > > > > -    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->aw_bits);
> > > > > > > -    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->aw_bits);
> > > > > > > -    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->aw_bits);
> > > > > > > -    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->aw_bits);
> > > > > > > -    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->aw_bits);
> > > > > > > +    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > > > +    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > > +    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > > +    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > > +    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > > > +    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > > +    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > > +    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > >  
> > > > > > >      if (x86_iommu->intr_supported) {
> > > > > > >          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
> > > > > > > @@ -3261,10 +3268,10 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > > > > >      }
> > > > > > >  
> > > > > > >      /* Currently only address widths supported are 39 and 48 bits */
> > > > > > > -    if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
> > > > > > > -        (s->aw_bits != VTD_HOST_AW_48BIT)) {
> > > > > > > +    if ((s->aw_bits != VTD_AW_39BIT) &&
> > > > > > > +        (s->aw_bits != VTD_AW_48BIT)) {
> > > > > > >          error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > > > > > > -                   VTD_HOST_AW_39BIT, VTD_HOST_AW_48BIT);
> > > > > > > +                   VTD_AW_39BIT, VTD_AW_48BIT);
> > > > > > >          return false;
> > > > > > >      }
> > > > > > >  
> > > > > > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > > > > > index ed4e758..820451c 100644
> > > > > > > --- a/include/hw/i386/intel_iommu.h
> > > > > > > +++ b/include/hw/i386/intel_iommu.h
> > > > > > > @@ -47,9 +47,9 @@
> > > > > > >  #define VTD_SID_TO_DEVFN(sid)       ((sid) & 0xff)
> > > > > > >  
> > > > > > >  #define DMAR_REG_SIZE               0x230
> > > > > > > -#define VTD_HOST_AW_39BIT           39
> > > > > > > -#define VTD_HOST_AW_48BIT           48
> > > > > > > -#define VTD_HOST_ADDRESS_WIDTH      VTD_HOST_AW_39BIT
> > > > > > > +#define VTD_AW_39BIT                39
> > > > > > > +#define VTD_AW_48BIT                48
> > > > > > > +#define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
> > > > > > >  #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
> > > > > > >  
> > > > > > >  #define DMAR_REPORT_F_INTR          (1)
> > > > > > > @@ -244,7 +244,8 @@ struct IntelIOMMUState {
> > > > > > >      bool intr_eime;                 /* Extended interrupt mode enabled */
> > > > > > >      OnOffAuto intr_eim;             /* Toggle for EIM cabability */
> > > > > > >      bool buggy_eim;                 /* Force buggy EIM unless eim=off */
> > > > > > > -    uint8_t aw_bits;                /* Host/IOVA address width (in bits) */
> > > > > > > +    uint8_t aw_bits;                /* IOVA address width (in bits) */
> > > > > > > +    uint8_t haw_bits;               /* Hardware address width (in bits) */
> > > > > > >  
> > > > > > >      /*
> > > > > > >       * Protects IOMMU states in general.  Currently it protects the
> > > > > > 
> > > > > > 
> > > > > 
> > > > > B.R.
> > > > > Yu
> > > 
> > 
> > B.R.
> > Yu

B.R.
Yu


* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-19  2:57         ` Yu Zhang
@ 2018-12-19 10:40           ` Igor Mammedov
  2018-12-19 16:47             ` Michael S. Tsirkin
  2018-12-20 21:18             ` Eduardo Habkost
  0 siblings, 2 replies; 57+ messages in thread
From: Igor Mammedov @ 2018-12-19 10:40 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Eduardo Habkost, Michael S. Tsirkin, qemu-devel, Peter Xu,
	Paolo Bonzini, Richard Henderson

On Wed, 19 Dec 2018 10:57:17 +0800
Yu Zhang <yu.c.zhang@linux.intel.com> wrote:

> On Tue, Dec 18, 2018 at 03:55:36PM +0100, Igor Mammedov wrote:
> > On Tue, 18 Dec 2018 17:27:23 +0800
> > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> >   
> > > On Mon, Dec 17, 2018 at 02:17:40PM +0100, Igor Mammedov wrote:  
> > > > On Wed, 12 Dec 2018 21:05:38 +0800
> > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > >   
> > > > > Currently, vIOMMU is using the value of IOVA address width, instead of
> > > > > the host address width(HAW) to calculate the number of reserved bits in
> > > > > data structures such as root entries, context entries, and entries of
> > > > > DMA paging structures etc.
> > > > > 
> > > > > However values of IOVA address width and of the HAW may not equal. For
> > > > > example, a 48-bit IOVA can only be mapped to host addresses no wider than
> > > > > 46 bits. Using 48, instead of 46 to calculate the reserved bit may result
> > > > > in an invalid IOVA being accepted.
> > > > > 
> > > > > To fix this, a new field - haw_bits is introduced in struct IntelIOMMUState,
> > > > > whose value is initialized based on the maximum physical address set to
> > > > > guest CPU.  
> > > >   
> > > > > Also, definitions such as VTD_HOST_AW_39/48BIT etc. are renamed
> > > > > to clarify.
> > > > > 
> > > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > > > ---  
> > > > [...]
> > > >   
> > > > > @@ -3100,6 +3104,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> > > > >  static void vtd_init(IntelIOMMUState *s)
> > > > >  {
> > > > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > > > +    CPUState *cs = first_cpu;
> > > > > +    X86CPU *cpu = X86_CPU(cs);
> > > > >  
> > > > >      memset(s->csr, 0, DMAR_REG_SIZE);
> > > > >      memset(s->wmask, 0, DMAR_REG_SIZE);
> > > > > @@ -3119,23 +3125,24 @@ static void vtd_init(IntelIOMMUState *s)
> > > > >      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
> > > > >               VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> > > > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > > > -    if (s->aw_bits == VTD_HOST_AW_48BIT) {
> > > > > +    if (s->aw_bits == VTD_AW_48BIT) {
> > > > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > > > >      }
> > > > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > > > +    s->haw_bits = cpu->phys_bits;  
> > > > Is it possible to avoid accessing CPU fields directly or cpu altogether
> > > > and set phys_bits when iommu is created?  
> > > 
> > > Thanks for your comments, Igor.
> > > 
> > > Well, I guess you prefer not to query the CPU capabilities while deciding
> > > the vIOMMU features. But to me, they are not that irrelevant. :)  
> > > 
> > > Here the hardware address width in vt-d, and the one in cpuid.MAXPHYSADDR
> > > are referring to the same concept. In VM, both are the maximum guest physical
> > > address width. If we do not check the CPU field here, we will still have to
> > > check the CPU field in other places such as build_dmar_q35(), and reset the
> > > s->haw_bits again.
> > > 
> > > Is this explanation convincing enough? :)  
> > current build_dmar_q35() doesn't do it; it's all new code in this series that
> > contains unacceptable direct access from one device (iommu) to another (cpu).
> > The proper way would be for the owner of the iommu to fish limits from somewhere
> > and set values during iommu creation.  
> 
> Well, current build_dmar_q35() doesn't do it because it is using the incorrect value. :)
> According to the spec, the host address width is the maximum physical address width,
> yet the current implementation is using the DMA address width. To me, this is not only
> wrong, but also insecure. On this point, I think we all agree it needs to be fixed.
> 
> As to how to fix it - i.e., should we query the cpu fields - I still do not
> understand why this is not acceptable. :)
> 
> I had thought of other approaches before, yet did not choose them:
> 
> 1> Introduce a new parameter, say, "x-haw-bits", which is used by the iommu to limit
> its physical address width (similar to "x-aw-bits" for IOVA). But should we check
> this parameter or not? And what if it is set to something different from
> "phys-bits"?
> 
> 2> Another choice I had thought of is to query the physical iommu. I abandoned this
> idea because my understanding is that vIOMMU is not a passed-through device; it is emulated.

> So Igor, may I ask why you think checking against the cpu fields is not acceptable? :)
Because accessing private fields of one device from another random device is not
robust and is subject to breaking in unpredictable ways when the field's meaning
or the initialization order changes. (Analogy to bare metal: one does not solder
a wire onto a CPU die to let some random device access a piece of its data.)

I've looked at the intel-iommu code and how it's created, so here is a way to do the
thing you need using proper interfaces:

1. add an x-haw_bits property
2. include in your series the patch
    '[Qemu-devel] [PATCH] qdev: let machine hotplug handler to override bus hotplug handler'
3. add your iommu to pc_get_hotplug_handler() to redirect the plug flow to the
   machine, and let the machine-level _pre_plug handler check and set x-haw_bits
4. you can probably use the phys-bits/host-phys-bits properties to get the data you
   need; also see ms->possible_cpus - that's how you can get access to the CPUs from
   the machine layer.

> >   
> > > > 
> > > > Perhaps Eduardo
> > > >  can suggest better approach, since he's more familiar with phys_bits topic  
> > > 
> > > @Eduardo, any comments? Thanks!
> > >   
> > > >   
> > > > >      /*
> > > > >       * Rsvd field masks for spte
> > > > >       */
> > > > >      vtd_paging_entry_rsvd_field[0] = ~0ULL;
> > > > > -    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->aw_bits);
> > > > > -    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->aw_bits);
> > > > > -    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->aw_bits);
> > > > > -    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->aw_bits);
> > > > > -    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->aw_bits);
> > > > > -    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->aw_bits);
> > > > > -    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->aw_bits);
> > > > > -    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->aw_bits);
> > > > > +    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > +    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > +    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > +    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > +    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > +    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > +    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > +    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > >  
> > > > >      if (x86_iommu->intr_supported) {
> > > > >          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
> > > > > @@ -3261,10 +3268,10 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > > >      }
> > > > >  
> > > > >      /* Currently only address widths supported are 39 and 48 bits */
> > > > > -    if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
> > > > > -        (s->aw_bits != VTD_HOST_AW_48BIT)) {
> > > > > +    if ((s->aw_bits != VTD_AW_39BIT) &&
> > > > > +        (s->aw_bits != VTD_AW_48BIT)) {
> > > > >          error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > > > > -                   VTD_HOST_AW_39BIT, VTD_HOST_AW_48BIT);
> > > > > +                   VTD_AW_39BIT, VTD_AW_48BIT);
> > > > >          return false;
> > > > >      }
> > > > >  
> > > > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > > > index ed4e758..820451c 100644
> > > > > --- a/include/hw/i386/intel_iommu.h
> > > > > +++ b/include/hw/i386/intel_iommu.h
> > > > > @@ -47,9 +47,9 @@
> > > > >  #define VTD_SID_TO_DEVFN(sid)       ((sid) & 0xff)
> > > > >  
> > > > >  #define DMAR_REG_SIZE               0x230
> > > > > -#define VTD_HOST_AW_39BIT           39
> > > > > -#define VTD_HOST_AW_48BIT           48
> > > > > -#define VTD_HOST_ADDRESS_WIDTH      VTD_HOST_AW_39BIT
> > > > > +#define VTD_AW_39BIT                39
> > > > > +#define VTD_AW_48BIT                48
> > > > > +#define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
> > > > >  #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
> > > > >  
> > > > >  #define DMAR_REPORT_F_INTR          (1)
> > > > > @@ -244,7 +244,8 @@ struct IntelIOMMUState {
> > > > >      bool intr_eime;                 /* Extended interrupt mode enabled */
> > > > >      OnOffAuto intr_eim;             /* Toggle for EIM cabability */
> > > > >      bool buggy_eim;                 /* Force buggy EIM unless eim=off */
> > > > > -    uint8_t aw_bits;                /* Host/IOVA address width (in bits) */
> > > > > +    uint8_t aw_bits;                /* IOVA address width (in bits) */
> > > > > +    uint8_t haw_bits;               /* Hardware address width (in bits) */
> > > > >  
> > > > >      /*
> > > > >       * Protects IOMMU states in general.  Currently it protects the  
> > > > 
> > > >   
> > > 
> > > B.R.
> > > Yu  
> > 
> >   
> 
> B.R.
> Yu

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 2/2] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.
  2018-12-19  5:57                   ` Yu Zhang
@ 2018-12-19 15:23                     ` Michael S. Tsirkin
  2018-12-20  5:49                       ` Yu Zhang
  0 siblings, 1 reply; 57+ messages in thread
From: Michael S. Tsirkin @ 2018-12-19 15:23 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Eduardo Habkost, qemu-devel, Peter Xu, Paolo Bonzini,
	Igor Mammedov, Richard Henderson

On Wed, Dec 19, 2018 at 01:57:43PM +0800, Yu Zhang wrote:
> On Tue, Dec 18, 2018 at 11:35:34PM -0500, Michael S. Tsirkin wrote:
> > On Wed, Dec 19, 2018 at 11:40:06AM +0800, Yu Zhang wrote:
> > > On Tue, Dec 18, 2018 at 09:49:02AM -0500, Michael S. Tsirkin wrote:
> > > > On Tue, Dec 18, 2018 at 09:45:41PM +0800, Yu Zhang wrote:
> > > > > On Tue, Dec 18, 2018 at 07:43:28AM -0500, Michael S. Tsirkin wrote:
> > > > > > On Tue, Dec 18, 2018 at 06:01:16PM +0800, Yu Zhang wrote:
> > > > > > > On Tue, Dec 18, 2018 at 05:47:14PM +0800, Yu Zhang wrote:
> > > > > > > > On Mon, Dec 17, 2018 at 02:29:02PM +0100, Igor Mammedov wrote:
> > > > > > > > > On Wed, 12 Dec 2018 21:05:39 +0800
> > > > > > > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > > > > > > > 
> > > > > > > > > > A 5-level paging capable VM may choose to use 57-bit IOVA address width.
> > > > > > > > > > E.g. guest applications may prefer to use its VA as IOVA when performing
> > > > > > > > > > VFIO map/unmap operations, to avoid the burden of managing the IOVA space.
> > > > > > > > > > 
> > > > > > > > > > This patch extends the current vIOMMU logic to cover the extended address
> > > > > > > > > > width. When creating a VM with 5-level paging feature, one can choose to
> > > > > > > > > > create a virtual VTD with 5-level paging capability, with configurations
> > > > > > > > > > like "-device intel-iommu,x-aw-bits=57".
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > > > > > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > > > > > > > > ---
> > > > > > > > > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > > > > > > > > Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> > > > > > > > > > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > > > > Cc: Richard Henderson <rth@twiddle.net>
> > > > > > > > > > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > > > > > > > > > Cc: Peter Xu <peterx@redhat.com>
> > > > > > > > > > ---
> > > > > > > > > >  hw/i386/intel_iommu.c          | 53 ++++++++++++++++++++++++++++++++----------
> > > > > > > > > >  hw/i386/intel_iommu_internal.h | 10 ++++++--
> > > > > > > > > >  include/hw/i386/intel_iommu.h  |  1 +
> > > > > > > > > >  3 files changed, 50 insertions(+), 14 deletions(-)
> > > > > > > > > > 
> > > > > > > > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > > > > > > > index 0e88c63..871110c 100644
> > > > > > > > > > --- a/hw/i386/intel_iommu.c
> > > > > > > > > > +++ b/hw/i386/intel_iommu.c
> > > > > > > > > > @@ -664,16 +664,16 @@ static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce,
> > > > > > > > > >  
> > > > > > > > > >  /*
> > > > > > > > > >   * Rsvd field masks for spte:
> > > > > > > > > > - *     Index [1] to [4] 4k pages
> > > > > > > > > > - *     Index [5] to [8] large pages
> > > > > > > > > > + *     Index [1] to [5] 4k pages
> > > > > > > > > > + *     Index [6] to [10] large pages
> > > > > > > > > >   */
> > > > > > > > > > -static uint64_t vtd_paging_entry_rsvd_field[9];
> > > > > > > > > > +static uint64_t vtd_paging_entry_rsvd_field[11];
> > > > > > > > > >  
> > > > > > > > > >  static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > > > > > > > > >  {
> > > > > > > > > >      if (slpte & VTD_SL_PT_PAGE_SIZE_MASK) {
> > > > > > > > > >          /* Maybe large page */
> > > > > > > > > > -        return slpte & vtd_paging_entry_rsvd_field[level + 4];
> > > > > > > > > > +        return slpte & vtd_paging_entry_rsvd_field[level + 5];
> > > > > > > > > >      } else {
> > > > > > > > > >          return slpte & vtd_paging_entry_rsvd_field[level];
> > > > > > > > > >      }
> > > > > > > > > > @@ -3127,6 +3127,8 @@ static void vtd_init(IntelIOMMUState *s)
> > > > > > > > > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > > > > > > > >      if (s->aw_bits == VTD_AW_48BIT) {
> > > > > > > > > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > > > > > > > > > +    } else if (s->aw_bits == VTD_AW_57BIT) {
> > > > > > > > > > +        s->cap |= VTD_CAP_SAGAW_57bit | VTD_CAP_SAGAW_48bit;
> > > > > > > > > >      }
> > > > > > > > > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > > > > > > > >      s->haw_bits = cpu->phys_bits;
> > > > > > > > > > @@ -3139,10 +3141,12 @@ static void vtd_init(IntelIOMMUState *s)
> > > > > > > > > >      vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > > > > >      vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > > > > >      vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > > > > > -    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > > > > > > -    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > > > > > -    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > > > > > -    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > > > > > +    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_PAGE_L5_RSVD_MASK(s->haw_bits);
> > > > > > > > > > +    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > > > > > > +    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > > > > > +    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > > > > > +    vtd_paging_entry_rsvd_field[9] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > > > > > +    vtd_paging_entry_rsvd_field[10] = VTD_SPTE_LPAGE_L5_RSVD_MASK(s->haw_bits);
> > > > > > > > > >  
> > > > > > > > > >      if (x86_iommu->intr_supported) {
> > > > > > > > > >          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
> > > > > > > > > > @@ -3241,6 +3245,23 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
> > > > > > > > > >      return &vtd_as->as;
> > > > > > > > > >  }
> > > > > > > > > >  
> > > > > > > > > > +static bool host_has_la57(void)
> > > > > > > > > > +{
> > > > > > > > > > +    uint32_t ecx, unused;
> > > > > > > > > > +
> > > > > > > > > > +    host_cpuid(7, 0, &unused, &unused, &ecx, &unused);
> > > > > > > > > > +    return ecx & CPUID_7_0_ECX_LA57;
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > > +static bool guest_has_la57(void)
> > > > > > > > > > +{
> > > > > > > > > > +    CPUState *cs = first_cpu;
> > > > > > > > > > +    X86CPU *cpu = X86_CPU(cs);
> > > > > > > > > > +    CPUX86State *env = &cpu->env;
> > > > > > > > > > +
> > > > > > > > > > +    return env->features[FEAT_7_0_ECX] & CPUID_7_0_ECX_LA57;
> > > > > > > > > > +}
> > > > > > > > > another direct access to CPU fields,
> > > > > > > > > I'd suggest to set this value when iommu is created
> > > > > > > > > i.e. add 'la57' property and set from iommu owner.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Sorry, do you mean "-device intel-iommu,la57"? I think we do not need
> > > > > > > > that, because a 5-level capable vIOMMU can be created with properties
> > > > > > > > like "-device intel-iommu,x-aw-bits=57". 
> > > > > > > > 
> > > > > > > > The guest CPU fields are checked to make sure the VM has LA57 CPU feature,
> > > > > > > > because I believe there shall be no 5-level IOMMU on platforms without LA57
> > > > > > > > CPUs. 
> > > > > > 
> > > > > > I don't necessarily see why these need to be connected.
> > > > > > If yes pls add code to explain.
> > > > > 
> > > > > Sorry, do you mean the VM shall be able to see a 5-level IOMMU even if it does not
> > > > > have LA57 feature? I do not see any direct connection when asked to enable a 5-level
> > > > > vIOMMU at first, but I was told(and checked) that DPDK in the VM may choose a VA
> > > > > value as an IOVA.
> > > > 
> > > > Right but then that doesn't work on all hosts either.
> > > 
> > > Oh, the host already has a 5-level IOMMU now. So I think DPDK running natively shall work with that.
> > > 
> > > > 
> > > > > And if guest has LA57, we should create a 5-level vIOMMU to the VM.
> > > > > But if the VM does not even have LA57, any specific reason we should give it a 5-level
> > > > > vIOMMU?
> > > > 
> > > > So the example you give is VTD address width < CPU aw. That is known
> > > > to be problematic for dpdk but not for other software and maybe dpdk
> > > > will learns how to cope. Given such hosts exist it might be
> > > > useful to support this at least for debugging.
> > > > 
> > > > Are there reasons to worry about VTD > CPU?
> > > 
> > > Well, I am not that worried(no usage case is one concern). I am OK to drop the guest check. :)
> > > 
> > > > 
> > > > 
> > > > > > 
> > > > > > 
> > > > > > > > > >  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > > > > > > > >  {
> > > > > > > > > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > > > > > > > > @@ -3267,11 +3288,19 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > > > > > > > >          }
> > > > > > > > > >      }
> > > > > > > > > >  
> > > > > > > > > > -    /* Currently only address widths supported are 39 and 48 bits */
> > > > > > > > > > +    /* Currently address widths supported are 39, 48, and 57 bits */
> > > > > > > > > >      if ((s->aw_bits != VTD_AW_39BIT) &&
> > > > > > > > > > -        (s->aw_bits != VTD_AW_48BIT)) {
> > > > > > > > > > -        error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > > > > > > > > > -                   VTD_AW_39BIT, VTD_AW_48BIT);
> > > > > > > > > > +        (s->aw_bits != VTD_AW_48BIT) &&
> > > > > > > > > > +        (s->aw_bits != VTD_AW_57BIT)) {
> > > > > > > > > > +        error_setg(errp, "Supported values for x-aw-bits are: %d, %d, %d",
> > > > > > > > > > +                   VTD_AW_39BIT, VTD_AW_48BIT, VTD_AW_57BIT);
> > > > > > > > > > +        return false;
> > > > > > > > > > +    }
> > > > > > > > > > +
> > > > > > > > > > +    if ((s->aw_bits == VTD_AW_57BIT) &&
> > > > > > > > > > +        !(host_has_la57() && guest_has_la57())) {
> > > > > > > > > Does iommu supposed to work in TCG mode?
> > > > > > > > > If yes then why it should care about host_has_la57()?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Hmm... I did not take TCG mode into consideration. And host_has_la57() is
> > > > > > > > used to guarantee the host has the la57 feature so that iommu shadowing works
> > > > > > > > for device assignment.
> > > > > > > > 
> > > > > > > > I guess iommu shall work in TCG mode(though I am not quite sure about this).
> > > > > > > > But I do not have any usage case of a 5-level vIOMMU in TCG in mind. So maybe
> > > > > > > > we can:
> > > > > > > > 1> check the 'ms->accel' in vtd_decide_config() and do not care about host
> > > > > > > > capability if it is TCG.
> > > > > > > 
> > > > > > > For choice 1, kvm_enabled() might be used instead of ms->accel. Thanks Peter
> > > > > > > for the remind. :)
> > > > > > 
> > > > > > 
> > > > > > This needs a big comment with an explanation though.
> > > > > > And probably a TODO to make it work under TCG ...
> > > > > > 
> > > > > 
> > > > > Thanks, Michael. For choice 1, I believe it should work for TCG(will need test
> > > > > though), and the condition would be sth. like:
> > > > > 
> > > > >     if ((s->aw_bits == VTD_AW_57BIT) &&
> > > > >         kvm_enabled() &&
> > > > >         !host_has_la57())  {
> > > > > 
> > > > > As you can see, though I remove the check of guest_has_la57(), I still kept the
> > > > > check against host when KVM is enabled. I'm still ready to be convinced for any
> > > > > requirement why we do not need the guest check. :) 
> > > > 
> > > > 
> > > > okay but then (repeating myself, sorry) pls add a comment that explains
> > > > what happens if you do not add this limitation.
> > > 
> > > How about below comments?
> > >     /*
> > >      * For KVM guests, the host capability of LA57 shall be available,
> > 
> > So why is host CPU LA57 necessary for shadowing? Could you explain pls?
> 
> Oh, let me try to explain the background here. :)
> 
> Currently, vIOMMU in qemu does not have logic to check against the hardware
> IOMMU capability. E.g. when we create an IOMMU with 48 bit DMA address width,
> qemu does not check if any physical IOMMU has such support. And the shadow
> IOMMU logic will have problem if host IOMMU only supports 39 bit IOVA. And
> we will have the same problem when it comes to 57 bit IOVA.
> 
> My previous discussion with Peter Xu reached an agreement that for now, we
> just use the host cpu capability as a reference when trying to create a 5-level
> vIOMMU, because 57-bit IOMMU hardware will not come until the ICX platform (which
> includes LA57). 
> 
> And the final correct solution should be to enumerate the capabilities of
> hardware IOMMUs used by the assigned device, and reject if any mismatch is
> found.

Right. And it's a hack because
1. CPU AW doesn't always match VTD AW
2. The limitation only applies to hardware devices, software ones are fine
So we need a patch for the host sysfs to expose the actual IOMMU AW to userspace.
QEMU could then look at the actual hardware features.
I'd like to see the actual patch doing that, even if we
add a hack based on CPU AW for existing systems.


But how is it working for TCG? It would seem that
VFIO with TCG would be just as broken as with KVM...

> Maybe I should add a TODO in the above comments, giving the background explanation.
>
> > 
> > > so
> > >      * that iommu shadowing works for device assignment scenario. But for
> > >      * TCG mode, we do not need such restriction.
> > >      */
> > > 
> > > BTW, I just tested the TCG mode, it works(with restriction of host capability removed).
> > > 
> > > > 
> > > > 
> > > > > > > > 2> Or, we can choose to keep as it is, and add the check when 5-level paging
> > > > > > > > vIOMMU does have usage in TCG?
> > > > > > > > 
> > > > > > > > But as to the check of guest capability, I still believe it is necessary. As
> > > > > > > > said, a VM without LA57 feature shall not see a VT-d with 5-level IOMMU.
> > > > > > > > 
> > > > > > > > > > +        error_setg(errp, "Do not support 57-bit DMA address, unless both "
> > > > > > > > > > +                         "host and guest are capable of 5-level paging");
> > > > > > > > > >          return false;
> > > > > > > > > >      }
> > > > > > > > > >  
> > > > > > > > > > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> > > > > > > > > > index d084099..2b29b6f 100644
> > > > > > > > > > --- a/hw/i386/intel_iommu_internal.h
> > > > > > > > > > +++ b/hw/i386/intel_iommu_internal.h
> > > > > > > > > > @@ -114,8 +114,8 @@
> > > > > > > > > >                                       VTD_INTERRUPT_ADDR_FIRST + 1)
> > > > > > > > > >  
> > > > > > > > > >  /* The shift of source_id in the key of IOTLB hash table */
> > > > > > > > > > -#define VTD_IOTLB_SID_SHIFT         36
> > > > > > > > > > -#define VTD_IOTLB_LVL_SHIFT         52
> > > > > > > > > > +#define VTD_IOTLB_SID_SHIFT         45
> > > > > > > > > > +#define VTD_IOTLB_LVL_SHIFT         61
> > > > > > > > > >  #define VTD_IOTLB_MAX_SIZE          1024    /* Max size of the hash table */
> > > > > > > > > >  
> > > > > > > > > >  /* IOTLB_REG */
> > > > > > > > > > @@ -212,6 +212,8 @@
> > > > > > > > > >  #define VTD_CAP_SAGAW_39bit         (0x2ULL << VTD_CAP_SAGAW_SHIFT)
> > > > > > > > > >   /* 48-bit AGAW, 4-level page-table */
> > > > > > > > > >  #define VTD_CAP_SAGAW_48bit         (0x4ULL << VTD_CAP_SAGAW_SHIFT)
> > > > > > > > > > + /* 57-bit AGAW, 5-level page-table */
> > > > > > > > > > +#define VTD_CAP_SAGAW_57bit         (0x8ULL << VTD_CAP_SAGAW_SHIFT)
> > > > > > > > > >  
> > > > > > > > > >  /* IQT_REG */
> > > > > > > > > >  #define VTD_IQT_QT(val)             (((val) >> 4) & 0x7fffULL)
> > > > > > > > > > @@ -379,6 +381,8 @@ typedef union VTDInvDesc VTDInvDesc;
> > > > > > > > > >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > > >  #define VTD_SPTE_PAGE_L4_RSVD_MASK(aw) \
> > > > > > > > > >          (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > > > +#define VTD_SPTE_PAGE_L5_RSVD_MASK(aw) \
> > > > > > > > > > +        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > > >  #define VTD_SPTE_LPAGE_L1_RSVD_MASK(aw) \
> > > > > > > > > >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > > >  #define VTD_SPTE_LPAGE_L2_RSVD_MASK(aw) \
> > > > > > > > > > @@ -387,6 +391,8 @@ typedef union VTDInvDesc VTDInvDesc;
> > > > > > > > > >          (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > > >  #define VTD_SPTE_LPAGE_L4_RSVD_MASK(aw) \
> > > > > > > > > >          (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > > > +#define VTD_SPTE_LPAGE_L5_RSVD_MASK(aw) \
> > > > > > > > > > +        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > > >  
> > > > > > > > > >  /* Information about page-selective IOTLB invalidate */
> > > > > > > > > >  struct VTDIOTLBPageInvInfo {
> > > > > > > > > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > > > > > > > > index 820451c..7474c4f 100644
> > > > > > > > > > --- a/include/hw/i386/intel_iommu.h
> > > > > > > > > > +++ b/include/hw/i386/intel_iommu.h
> > > > > > > > > > @@ -49,6 +49,7 @@
> > > > > > > > > >  #define DMAR_REG_SIZE               0x230
> > > > > > > > > >  #define VTD_AW_39BIT                39
> > > > > > > > > >  #define VTD_AW_48BIT                48
> > > > > > > > > > +#define VTD_AW_57BIT                57
> > > > > > > > > >  #define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
> > > > > > > > > >  #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
> > > > > > > > > >  
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > B.R.
> > > > > > > > Yu
> > > > > > > > 
> > > > > 
> > > > > B.R.
> > > > > Yu
> > > > 
> > > 
> > > B.R.
> > > Yu
> 
> B.R.
> Yu

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-19  6:28               ` Yu Zhang
@ 2018-12-19 15:30                 ` Michael S. Tsirkin
  0 siblings, 0 replies; 57+ messages in thread
From: Michael S. Tsirkin @ 2018-12-19 15:30 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Igor Mammedov, Eduardo Habkost, qemu-devel, Peter Xu,
	Paolo Bonzini, Richard Henderson

On Wed, Dec 19, 2018 at 02:28:10PM +0800, Yu Zhang wrote:
> On Tue, Dec 18, 2018 at 10:12:45PM -0500, Michael S. Tsirkin wrote:
> > On Wed, Dec 19, 2018 at 11:03:58AM +0800, Yu Zhang wrote:
> > > On Tue, Dec 18, 2018 at 09:58:35AM -0500, Michael S. Tsirkin wrote:
> > > > On Tue, Dec 18, 2018 at 03:55:36PM +0100, Igor Mammedov wrote:
> > > > > On Tue, 18 Dec 2018 17:27:23 +0800
> > > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > > > 
> > > > > > On Mon, Dec 17, 2018 at 02:17:40PM +0100, Igor Mammedov wrote:
> > > > > > > On Wed, 12 Dec 2018 21:05:38 +0800
> > > > > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > > > > > 
> > > > > > > > Currently, vIOMMU is using the value of IOVA address width, instead of
> > > > > > > > the host address width(HAW) to calculate the number of reserved bits in
> > > > > > > > data structures such as root entries, context entries, and entries of
> > > > > > > > DMA paging structures etc.
> > > > > > > > 
> > > > > > > > However, values of the IOVA address width and of the HAW may not be equal. For
> > > > > > > > example, a 48-bit IOVA can only be mapped to host addresses no wider than
> > > > > > > > 46 bits. Using 48, instead of 46 to calculate the reserved bit may result
> > > > > > > > in an invalid IOVA being accepted.
> > > > > > > > 
> > > > > > > > To fix this, a new field - haw_bits is introduced in struct IntelIOMMUState,
> > > > > > > > whose value is initialized based on the maximum physical address set to
> > > > > > > > guest CPU.
> > > > > > > 
> > > > > > > > Also, definitions such as VTD_HOST_AW_39/48BIT etc. are renamed
> > > > > > > > to clarify.
> > > > > > > > 
> > > > > > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > > > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > > > > > > ---
> > > > > > > [...]
> > > > > > > 
> > > > > > > > @@ -3100,6 +3104,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> > > > > > > >  static void vtd_init(IntelIOMMUState *s)
> > > > > > > >  {
> > > > > > > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > > > > > > +    CPUState *cs = first_cpu;
> > > > > > > > +    X86CPU *cpu = X86_CPU(cs);
> > > > > > > >  
> > > > > > > >      memset(s->csr, 0, DMAR_REG_SIZE);
> > > > > > > >      memset(s->wmask, 0, DMAR_REG_SIZE);
> > > > > > > > @@ -3119,23 +3125,24 @@ static void vtd_init(IntelIOMMUState *s)
> > > > > > > >      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
> > > > > > > >               VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> > > > > > > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > > > > > > -    if (s->aw_bits == VTD_HOST_AW_48BIT) {
> > > > > > > > +    if (s->aw_bits == VTD_AW_48BIT) {
> > > > > > > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > > > > > > >      }
> > > > > > > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > > > > > > +    s->haw_bits = cpu->phys_bits;
> > > > > > > Is it possible to avoid accessing CPU fields directly or cpu altogether
> > > > > > > and set phys_bits when iommu is created?
> > > > > > 
> > > > > > Thanks for your comments, Igor.
> > > > > > 
> > > > > > Well, I guess you prefer not to query the CPU capabilities while deciding
> > > > > > the vIOMMU features. But to me, they are not that irrelevant.:)
> > > > > > 
> > > > > > Here the hardware address width in vt-d, and the one in cpuid.MAXPHYSADDR
> > > > > > are referring to the same concept. In VM, both are the maximum guest physical
> > > > > > address width. If we do not check the CPU field here, we will still have to
> > > > > > check the CPU field in other places such as build_dmar_q35(), and reset the
> > > > > > s->haw_bits again.
> > > > > > 
> > > > > > Is this explanation convincing enough? :)
> > > > > current build_dmar_q35() doesn't do it, it's all new code in this series that
> contains an unacceptable direct access from one device (iommu) to another (cpu).
> The proper way would be for the owner of the iommu to fish limits from somewhere and set
> > > > > values during iommu creation.
> > > > 
> > > > Maybe it's a good idea to add documentation for now.
> > > 
> > > Thanks Michael. So what kind of documentation do you refer? 
> > 
> > The idea would be to have two properties, AW for the CPU and
> > the IOMMU. In the documentation explain that they
> > should normally be set to the same value.
> > 
> > > > 
> > > > It would be nice not to push this stuff up the stack,
> > > > it's unfortunate that our internal APIs make it hard.
> > > 
> > > Sorry, I do not quite get it. What do you mean "internal APIs make it hard"? :)
> > 
> > The API doesn't actually guarantee any initialization order.
> > CPU happens to be initialized first but I do not
> > think there's a guarantee that it will keep being the case.
> > This makes it hard to get properties from one device
> > and use in another one.
> > 
> 
> Oops...
> Then there can be no easy way at runtime to guarantee this. BTW, could we
> initialize CPU before other components? Is it hard to do, or not reasonable
> to do so?

I think we already happen to do it, but we lack a generic way to
describe the order of initialization at the QOM level. Instead for a
while now we've been trying to remove dependencies between devices.
Thus the general reluctance to add another dependency.
Given this one is more of a hack I'm not sure it qualifies
as a good reason to change that.


> I have plan to draft a doc in qemu on 5-level paging topic(maybe after all the
> enabling is done). But I don't think this is the proper place to put it - as you
> can see, this fix is not relevant to 5-level paging. So any suggestion about
> the documentation?

Documentation for user-visible features generally belongs in the man page.


> > > > 
> > > > 
> > > > > > > 
> > > > > > > Perhaps Eduardo
> > > > > > >  can suggest better approach, since he's more familiar with phys_bits topic
> > > > > > 
> > > > > > @Eduardo, any comments? Thanks!
> > > > > > 
> > > > > > > 
> > > > > > > >      /*
> > > > > > > >       * Rsvd field masks for spte
> > > > > > > >       */
> > > > > > > >      vtd_paging_entry_rsvd_field[0] = ~0ULL;
> > > > > > > > -    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->aw_bits);
> > > > > > > > -    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->aw_bits);
> > > > > > > > -    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->aw_bits);
> > > > > > > > -    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->aw_bits);
> > > > > > > > -    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->aw_bits);
> > > > > > > > -    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->aw_bits);
> > > > > > > > -    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->aw_bits);
> > > > > > > > -    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->aw_bits);
> > > > > > > > +    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > > > > +    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > > > +    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > > > +    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > > > +    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > > > > +    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > > > +    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > > > +    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > > >  
> > > > > > > >      if (x86_iommu->intr_supported) {
> > > > > > > >          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
> > > > > > > > @@ -3261,10 +3268,10 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > > > > > >      }
> > > > > > > >  
> > > > > > > >      /* Currently only address widths supported are 39 and 48 bits */
> > > > > > > > -    if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
> > > > > > > > -        (s->aw_bits != VTD_HOST_AW_48BIT)) {
> > > > > > > > +    if ((s->aw_bits != VTD_AW_39BIT) &&
> > > > > > > > +        (s->aw_bits != VTD_AW_48BIT)) {
> > > > > > > >          error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > > > > > > > -                   VTD_HOST_AW_39BIT, VTD_HOST_AW_48BIT);
> > > > > > > > +                   VTD_AW_39BIT, VTD_AW_48BIT);
> > > > > > > >          return false;
> > > > > > > >      }
> > > > > > > >  
> > > > > > > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > > > > > > index ed4e758..820451c 100644
> > > > > > > > --- a/include/hw/i386/intel_iommu.h
> > > > > > > > +++ b/include/hw/i386/intel_iommu.h
> > > > > > > > @@ -47,9 +47,9 @@
> > > > > > > >  #define VTD_SID_TO_DEVFN(sid)       ((sid) & 0xff)
> > > > > > > >  
> > > > > > > >  #define DMAR_REG_SIZE               0x230
> > > > > > > > -#define VTD_HOST_AW_39BIT           39
> > > > > > > > -#define VTD_HOST_AW_48BIT           48
> > > > > > > > -#define VTD_HOST_ADDRESS_WIDTH      VTD_HOST_AW_39BIT
> > > > > > > > +#define VTD_AW_39BIT                39
> > > > > > > > +#define VTD_AW_48BIT                48
> > > > > > > > +#define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
> > > > > > > >  #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
> > > > > > > >  
> > > > > > > >  #define DMAR_REPORT_F_INTR          (1)
> > > > > > > > @@ -244,7 +244,8 @@ struct IntelIOMMUState {
> > > > > > > >      bool intr_eime;                 /* Extended interrupt mode enabled */
> > > > > > > >      OnOffAuto intr_eim;             /* Toggle for EIM cabability */
> > > > > > > >      bool buggy_eim;                 /* Force buggy EIM unless eim=off */
> > > > > > > > -    uint8_t aw_bits;                /* Host/IOVA address width (in bits) */
> > > > > > > > +    uint8_t aw_bits;                /* IOVA address width (in bits) */
> > > > > > > > +    uint8_t haw_bits;               /* Hardware address width (in bits) */
> > > > > > > >  
> > > > > > > >      /*
> > > > > > > >       * Protects IOMMU states in general.  Currently it protects the
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > B.R.
> > > > > > Yu
> > > > 
> > > 
> > > B.R.
> > > Yu
> 
> B.R.
> Yu

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-19 10:40           ` Igor Mammedov
@ 2018-12-19 16:47             ` Michael S. Tsirkin
  2018-12-20  5:59               ` Yu Zhang
  2018-12-20 21:18             ` Eduardo Habkost
  1 sibling, 1 reply; 57+ messages in thread
From: Michael S. Tsirkin @ 2018-12-19 16:47 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Yu Zhang, Eduardo Habkost, qemu-devel, Peter Xu, Paolo Bonzini,
	Richard Henderson

On Wed, Dec 19, 2018 at 11:40:37AM +0100, Igor Mammedov wrote:
> On Wed, 19 Dec 2018 10:57:17 +0800
> Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> 
> > On Tue, Dec 18, 2018 at 03:55:36PM +0100, Igor Mammedov wrote:
> > > On Tue, 18 Dec 2018 17:27:23 +0800
> > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > >   
> > > > On Mon, Dec 17, 2018 at 02:17:40PM +0100, Igor Mammedov wrote:  
> > > > > On Wed, 12 Dec 2018 21:05:38 +0800
> > > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > > >   
> > > > > > Currently, vIOMMU is using the value of IOVA address width, instead of
> > > > > > the host address width(HAW) to calculate the number of reserved bits in
> > > > > > data structures such as root entries, context entries, and entries of
> > > > > > DMA paging structures etc.
> > > > > > 
> > > > > > However, values of the IOVA address width and of the HAW may not be equal. For
> > > > > > example, a 48-bit IOVA can only be mapped to host addresses no wider than
> > > > > > 46 bits. Using 48, instead of 46 to calculate the reserved bit may result
> > > > > > in an invalid IOVA being accepted.
> > > > > > 
> > > > > > To fix this, a new field, haw_bits, is introduced in struct IntelIOMMUState,
> > > > > > whose value is initialized based on the maximum physical address width set
> > > > > > for the guest CPU.  
> > > > >   
> > > > > > Also, definitions such as VTD_HOST_AW_39/48BIT etc. are renamed
> > > > > > to clarify.
> > > > > > 
> > > > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > > > > ---  
> > > > > [...]
> > > > >   
> > > > > > @@ -3100,6 +3104,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> > > > > >  static void vtd_init(IntelIOMMUState *s)
> > > > > >  {
> > > > > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > > > > +    CPUState *cs = first_cpu;
> > > > > > +    X86CPU *cpu = X86_CPU(cs);
> > > > > >  
> > > > > >      memset(s->csr, 0, DMAR_REG_SIZE);
> > > > > >      memset(s->wmask, 0, DMAR_REG_SIZE);
> > > > > > @@ -3119,23 +3125,24 @@ static void vtd_init(IntelIOMMUState *s)
> > > > > >      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
> > > > > >               VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> > > > > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > > > > -    if (s->aw_bits == VTD_HOST_AW_48BIT) {
> > > > > > +    if (s->aw_bits == VTD_AW_48BIT) {
> > > > > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > > > > >      }
> > > > > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > > > > +    s->haw_bits = cpu->phys_bits;  
> > > > > Is it possible to avoid accessing CPU fields directly or cpu altogether
> > > > > and set phys_bits when iommu is created?  
> > > > 
> > > > Thanks for your comments, Igor.
> > > > 
> > > > Well, I guess you prefer not to query the CPU capabilities while deciding
> > > > the vIOMMU features. But to me, they are not that irrelevant. :)
> > > > 
> > > > Here, the hardware address width in vt-d and the one in cpuid.MAXPHYSADDR
> > > > refer to the same concept. In a VM, both are the maximum guest physical
> > > > address width. If we do not check the CPU field here, we will still have to
> > > > check the CPU field in other places such as build_dmar_q35(), and reset
> > > > s->haw_bits again.
> > > > 
> > > > Is this explanation convincing enough? :)  
> > > current build_dmar_q35() doesn't do it; it's all new code in this series that
> > > contains an unacceptable direct access from one device (iommu) to another (cpu).
> > > The proper way would be for the owner of the iommu to fish the limits from
> > > somewhere and set the values during iommu creation.  
> > 
> > Well, current build_dmar_q35() doesn't do it because it is using the incorrect value. :)
> > According to the spec, the host address width is the maximum physical address width,
> > yet the current implementation is using the DMA address width. To me, this is not
> > only wrong, but also insecure. On this point, I think we all agree this needs to
> > be fixed.
> > 
> > As to how to fix it - whether we should query the cpu fields - I still do not
> > understand why this is not acceptable. :)
> > 
> > I had thought of other approaches before, yet I did not choose:
> > 
> > 1> Introduce a new parameter, say, "x-haw-bits", which is used for the iommu to limit
> > its physical address width (similar to the "x-aw-bits" for IOVA). But should we check
> > this parameter or not? And what if it is set to something different from "phys-bits"?
> > 
> > 2> Another choice I had thought of is to query the physical iommu. I abandoned this  
> > idea because my understanding is that the vIOMMU is not a passed-through device; it is emulated.
> 
> > So Igor, may I ask why you think checking against the cpu fields is not acceptable? :)
> Because accessing private fields of a device from another random device is not robust
> and is subject to breaking in unpredictable ways when a field's meaning or the
> initialization order changes. (An analogy to bare metal: one does not solder a wire
> to a CPU die to let some random device access a piece of its data.)
> 
> I've looked at intel-iommu code and how it's created so here is a way to do the thing
> you need using proper interfaces:
> 
> 1. add an x-haw_bits property
> 2. include in your series the patch
>     '[Qemu-devel] [PATCH] qdev: let machine hotplug handler to override bus hotplug handler'
> 3. add your iommu to pc_get_hotplug_handler() to redirect the plug flow to the
>    machine and let the _pre_plug handler check and set x-haw_bits at the machine level
> 4. you can probably use the phys-bits/host-phys-bits properties to get the data you need;
>    also see ms->possible_cpus - that's how you can get access to the CPUs from the
>    machine layer.


But given that it's all actually a hack trying to guess host CPU capabilities,
I would rather say:
1. add a host kernel interface to get it from VFIO
2. on a host where it's not there, and assuming we want to support old kernels,
   write a function returning these (do we? why? is the 5-level hardware already
   so widespread?), and call it at any time. No need to poke at the VCPU.



> > >   
> > > > > 
> > > > > Perhaps Eduardo
> > > > >  can suggest better approach, since he's more familiar with phys_bits topic  
> > > > 
> > > > @Eduardo, any comments? Thanks!
> > > >   
> > > > >   
> > > > > >      /*
> > > > > >       * Rsvd field masks for spte
> > > > > >       */
> > > > > >      vtd_paging_entry_rsvd_field[0] = ~0ULL;
> > > > > > -    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->aw_bits);
> > > > > > -    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->aw_bits);
> > > > > > -    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->aw_bits);
> > > > > > -    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->aw_bits);
> > > > > > -    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->aw_bits);
> > > > > > -    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->aw_bits);
> > > > > > -    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->aw_bits);
> > > > > > -    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->aw_bits);
> > > > > > +    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > > +    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > +    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > +    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > +    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > > +    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > +    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > +    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > >  
> > > > > >      if (x86_iommu->intr_supported) {
> > > > > >          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
> > > > > > @@ -3261,10 +3268,10 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > > > >      }
> > > > > >  
> > > > > >      /* Currently only address widths supported are 39 and 48 bits */
> > > > > > -    if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
> > > > > > -        (s->aw_bits != VTD_HOST_AW_48BIT)) {
> > > > > > +    if ((s->aw_bits != VTD_AW_39BIT) &&
> > > > > > +        (s->aw_bits != VTD_AW_48BIT)) {
> > > > > >          error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > > > > > -                   VTD_HOST_AW_39BIT, VTD_HOST_AW_48BIT);
> > > > > > +                   VTD_AW_39BIT, VTD_AW_48BIT);
> > > > > >          return false;
> > > > > >      }
> > > > > >  
> > > > > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > > > > index ed4e758..820451c 100644
> > > > > > --- a/include/hw/i386/intel_iommu.h
> > > > > > +++ b/include/hw/i386/intel_iommu.h
> > > > > > @@ -47,9 +47,9 @@
> > > > > >  #define VTD_SID_TO_DEVFN(sid)       ((sid) & 0xff)
> > > > > >  
> > > > > >  #define DMAR_REG_SIZE               0x230
> > > > > > -#define VTD_HOST_AW_39BIT           39
> > > > > > -#define VTD_HOST_AW_48BIT           48
> > > > > > -#define VTD_HOST_ADDRESS_WIDTH      VTD_HOST_AW_39BIT
> > > > > > +#define VTD_AW_39BIT                39
> > > > > > +#define VTD_AW_48BIT                48
> > > > > > +#define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
> > > > > >  #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
> > > > > >  
> > > > > >  #define DMAR_REPORT_F_INTR          (1)
> > > > > > @@ -244,7 +244,8 @@ struct IntelIOMMUState {
> > > > > >      bool intr_eime;                 /* Extended interrupt mode enabled */
> > > > > >      OnOffAuto intr_eim;             /* Toggle for EIM cabability */
> > > > > >      bool buggy_eim;                 /* Force buggy EIM unless eim=off */
> > > > > > -    uint8_t aw_bits;                /* Host/IOVA address width (in bits) */
> > > > > > +    uint8_t aw_bits;                /* IOVA address width (in bits) */
> > > > > > +    uint8_t haw_bits;               /* Hardware address width (in bits) */
> > > > > >  
> > > > > >      /*
> > > > > >       * Protects IOMMU states in general.  Currently it protects the  
> > > > > 
> > > > >   
> > > > 
> > > > B.R.
> > > > Yu  
> > > 
> > >   
> > 
> > B.R.
> > Yu

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 2/2] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.
  2018-12-19 15:23                     ` Michael S. Tsirkin
@ 2018-12-20  5:49                       ` Yu Zhang
  2018-12-20 18:28                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 57+ messages in thread
From: Yu Zhang @ 2018-12-20  5:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eduardo Habkost, qemu-devel, Peter Xu, Paolo Bonzini,
	Igor Mammedov, Richard Henderson

On Wed, Dec 19, 2018 at 10:23:44AM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 19, 2018 at 01:57:43PM +0800, Yu Zhang wrote:
> > On Tue, Dec 18, 2018 at 11:35:34PM -0500, Michael S. Tsirkin wrote:
> > > On Wed, Dec 19, 2018 at 11:40:06AM +0800, Yu Zhang wrote:
> > > > On Tue, Dec 18, 2018 at 09:49:02AM -0500, Michael S. Tsirkin wrote:
> > > > > On Tue, Dec 18, 2018 at 09:45:41PM +0800, Yu Zhang wrote:
> > > > > > On Tue, Dec 18, 2018 at 07:43:28AM -0500, Michael S. Tsirkin wrote:
> > > > > > > On Tue, Dec 18, 2018 at 06:01:16PM +0800, Yu Zhang wrote:
> > > > > > > > On Tue, Dec 18, 2018 at 05:47:14PM +0800, Yu Zhang wrote:
> > > > > > > > > On Mon, Dec 17, 2018 at 02:29:02PM +0100, Igor Mammedov wrote:
> > > > > > > > > > On Wed, 12 Dec 2018 21:05:39 +0800
> > > > > > > > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > > > > > > > > 
> > > > > > > > > > > A 5-level paging capable VM may choose to use 57-bit IOVA address width.
> > > > > > > > > > > E.g. guest applications may prefer to use its VA as IOVA when performing
> > > > > > > > > > > VFIO map/unmap operations, to avoid the burden of managing the IOVA space.
> > > > > > > > > > > 
> > > > > > > > > > > This patch extends the current vIOMMU logic to cover the extended address
> > > > > > > > > > > width. When creating a VM with 5-level paging feature, one can choose to
> > > > > > > > > > > create a virtual VTD with 5-level paging capability, with configurations
> > > > > > > > > > > like "-device intel-iommu,x-aw-bits=57".
> > > > > > > > > > > 
> > > > > > > > > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > > > > > > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > > > > > > > > > ---
> > > > > > > > > > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > > > > > > > > > Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> > > > > > > > > > > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > > > > > Cc: Richard Henderson <rth@twiddle.net>
> > > > > > > > > > > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > > > > > > > > > > Cc: Peter Xu <peterx@redhat.com>
> > > > > > > > > > > ---
> > > > > > > > > > >  hw/i386/intel_iommu.c          | 53 ++++++++++++++++++++++++++++++++----------
> > > > > > > > > > >  hw/i386/intel_iommu_internal.h | 10 ++++++--
> > > > > > > > > > >  include/hw/i386/intel_iommu.h  |  1 +
> > > > > > > > > > >  3 files changed, 50 insertions(+), 14 deletions(-)
> > > > > > > > > > > 
> > > > > > > > > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > > > > > > > > index 0e88c63..871110c 100644
> > > > > > > > > > > --- a/hw/i386/intel_iommu.c
> > > > > > > > > > > +++ b/hw/i386/intel_iommu.c
> > > > > > > > > > > @@ -664,16 +664,16 @@ static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce,
> > > > > > > > > > >  
> > > > > > > > > > >  /*
> > > > > > > > > > >   * Rsvd field masks for spte:
> > > > > > > > > > > - *     Index [1] to [4] 4k pages
> > > > > > > > > > > - *     Index [5] to [8] large pages
> > > > > > > > > > > + *     Index [1] to [5] 4k pages
> > > > > > > > > > > + *     Index [6] to [10] large pages
> > > > > > > > > > >   */
> > > > > > > > > > > -static uint64_t vtd_paging_entry_rsvd_field[9];
> > > > > > > > > > > +static uint64_t vtd_paging_entry_rsvd_field[11];
> > > > > > > > > > >  
> > > > > > > > > > >  static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > > > > > > > > > >  {
> > > > > > > > > > >      if (slpte & VTD_SL_PT_PAGE_SIZE_MASK) {
> > > > > > > > > > >          /* Maybe large page */
> > > > > > > > > > > -        return slpte & vtd_paging_entry_rsvd_field[level + 4];
> > > > > > > > > > > +        return slpte & vtd_paging_entry_rsvd_field[level + 5];
> > > > > > > > > > >      } else {
> > > > > > > > > > >          return slpte & vtd_paging_entry_rsvd_field[level];
> > > > > > > > > > >      }
> > > > > > > > > > > @@ -3127,6 +3127,8 @@ static void vtd_init(IntelIOMMUState *s)
> > > > > > > > > > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > > > > > > > > >      if (s->aw_bits == VTD_AW_48BIT) {
> > > > > > > > > > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > > > > > > > > > > +    } else if (s->aw_bits == VTD_AW_57BIT) {
> > > > > > > > > > > +        s->cap |= VTD_CAP_SAGAW_57bit | VTD_CAP_SAGAW_48bit;
> > > > > > > > > > >      }
> > > > > > > > > > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > > > > > > > > >      s->haw_bits = cpu->phys_bits;
> > > > > > > > > > > @@ -3139,10 +3141,12 @@ static void vtd_init(IntelIOMMUState *s)
> > > > > > > > > > >      vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > > > > > >      vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > > > > > >      vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > -    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > -    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > -    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > -    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > +    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_PAGE_L5_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > +    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > +    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > +    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > +    vtd_paging_entry_rsvd_field[9] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > +    vtd_paging_entry_rsvd_field[10] = VTD_SPTE_LPAGE_L5_RSVD_MASK(s->haw_bits);
> > > > > > > > > > >  
> > > > > > > > > > >      if (x86_iommu->intr_supported) {
> > > > > > > > > > >          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
> > > > > > > > > > > @@ -3241,6 +3245,23 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
> > > > > > > > > > >      return &vtd_as->as;
> > > > > > > > > > >  }
> > > > > > > > > > >  
> > > > > > > > > > > +static bool host_has_la57(void)
> > > > > > > > > > > +{
> > > > > > > > > > > +    uint32_t ecx, unused;
> > > > > > > > > > > +
> > > > > > > > > > > +    host_cpuid(7, 0, &unused, &unused, &ecx, &unused);
> > > > > > > > > > > +    return ecx & CPUID_7_0_ECX_LA57;
> > > > > > > > > > > +}
> > > > > > > > > > > +
> > > > > > > > > > > +static bool guest_has_la57(void)
> > > > > > > > > > > +{
> > > > > > > > > > > +    CPUState *cs = first_cpu;
> > > > > > > > > > > +    X86CPU *cpu = X86_CPU(cs);
> > > > > > > > > > > +    CPUX86State *env = &cpu->env;
> > > > > > > > > > > +
> > > > > > > > > > > +    return env->features[FEAT_7_0_ECX] & CPUID_7_0_ECX_LA57;
> > > > > > > > > > > +}
> > > > > > > > > > another direct access to CPU fields,
> > > > > > > > > > I'd suggest to set this value when iommu is created
> > > > > > > > > > i.e. add 'la57' property and set from iommu owner.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Sorry, do you mean "-device intel-iommu,la57"? I think we do not need
> > > > > > > > > that, because a 5-level capable vIOMMU can be created with properties
> > > > > > > > > like "-device intel-iommu,x-aw-bits=57". 
> > > > > > > > > 
> > > > > > > > > The guest CPU fields are checked to make sure the VM has the LA57 CPU feature,
> > > > > > > > > because I believe there shall be no 5-level IOMMU on platforms without LA57
> > > > > > > > > CPUs. 
> > > > > > > 
> > > > > > > I don't necessarily see why these need to be connected.
> > > > > > > If yes pls add code to explain.
> > > > > > 
> > > > > > Sorry, do you mean the VM shall be able to see a 5-level IOMMU even if it does not
> > > > > > have the LA57 feature? I did not see any direct connection when first asked to enable
> > > > > > a 5-level vIOMMU, but I was told (and checked) that DPDK in the VM may choose a VA
> > > > > > value as an IOVA.
> > > > > 
> > > > > Right but then that doesn't work on all hosts either.
> > > > 
> > > > Oh, the host already has a 5-level IOMMU now. So I think DPDK running natively shall work with that.
> > > > 
> > > > > 
> > > > > > And if guest has LA57, we should create a 5-level vIOMMU to the VM.
> > > > > > But if the VM even does not have LA57, any specific reason we should give it a 5-level
> > > > > > vIOMMU?
> > > > > 
> > > > > So the example you give is VTD address width < CPU aw. That is known
> > > > > to be problematic for dpdk but not for other software, and maybe dpdk
> > > > > will learn how to cope. Given that such hosts exist, it might be
> > > > > useful to support this at least for debugging.
> > > > > 
> > > > > Are there reasons to worry about VTD > CPU?
> > > > 
> > > > Well, I am not that worried (having no use case is one concern). I am OK with dropping the guest check. :)
> > > > 
> > > > > 
> > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > > > > >  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > > > > > > > > >  {
> > > > > > > > > > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > > > > > > > > > @@ -3267,11 +3288,19 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > > > > > > > > >          }
> > > > > > > > > > >      }
> > > > > > > > > > >  
> > > > > > > > > > > -    /* Currently only address widths supported are 39 and 48 bits */
> > > > > > > > > > > +    /* Currently address widths supported are 39, 48, and 57 bits */
> > > > > > > > > > >      if ((s->aw_bits != VTD_AW_39BIT) &&
> > > > > > > > > > > -        (s->aw_bits != VTD_AW_48BIT)) {
> > > > > > > > > > > -        error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > > > > > > > > > > -                   VTD_AW_39BIT, VTD_AW_48BIT);
> > > > > > > > > > > +        (s->aw_bits != VTD_AW_48BIT) &&
> > > > > > > > > > > +        (s->aw_bits != VTD_AW_57BIT)) {
> > > > > > > > > > > +        error_setg(errp, "Supported values for x-aw-bits are: %d, %d, %d",
> > > > > > > > > > > +                   VTD_AW_39BIT, VTD_AW_48BIT, VTD_AW_57BIT);
> > > > > > > > > > > +        return false;
> > > > > > > > > > > +    }
> > > > > > > > > > > +
> > > > > > > > > > > +    if ((s->aw_bits == VTD_AW_57BIT) &&
> > > > > > > > > > > +        !(host_has_la57() && guest_has_la57())) {
> > > > > > > > > > Does iommu supposed to work in TCG mode?
> > > > > > > > > > If yes then why it should care about host_has_la57()?
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Hmm... I did not take TCG mode into consideration. And host_has_la57() is
> > > > > > > > > used to guarantee that the host has the la57 feature, so that iommu
> > > > > > > > > shadowing works for device assignment.
> > > > > > > > > 
> > > > > > > > > I guess the iommu shall work in TCG mode (though I am not quite sure about this).
> > > > > > > > > But I do not have any use case of a 5-level vIOMMU in TCG in mind. So maybe
> > > > > > > > > we can:
> > > > > > > > > 1> check the 'ms->accel' in vtd_decide_config() and do not care about host
> > > > > > > > > capability if it is TCG.
> > > > > > > > 
> > > > > > > > For choice 1, kvm_enabled() might be used instead of ms->accel. Thanks Peter
> > > > > > > > for the reminder. :)
> > > > > > > 
> > > > > > > 
> > > > > > > This needs a big comment with an explanation though.
> > > > > > > And probably a TODO to make it work under TCG ...
> > > > > > > 
> > > > > > 
> > > > > > Thanks, Michael. For choice 1, I believe it should work for TCG (it will need
> > > > > > testing though), and the condition would be something like:
> > > > > > 
> > > > > >     if ((s->aw_bits == VTD_AW_57BIT) &&
> > > > > >         kvm_enabled() &&
> > > > > >         !host_has_la57())  {
> > > > > > 
> > > > > > As you can see, though I removed the check of guest_has_la57(), I still kept the
> > > > > > check against the host when KVM is enabled. I'm still ready to be convinced by any
> > > > > > argument for why we do not need the guest check. :) 
> > > > > 
> > > > > 
> > > > > okay but then (repeating myself, sorry) pls add a comment that explains
> > > > > what happens if you do not add this limitation.
> > > > 
> > > > How about below comments?
> > > >     /*
> > > >      * For KVM guests, the host capability of LA57 shall be available,
> > > 
> > > So why is host CPU LA57 necessary for shadowing? Could you explain pls?
> > 
> > Oh, let me try to explain the background here. :)
> > 
> > Currently, the vIOMMU in qemu has no logic to check against the hardware
> > IOMMU capability. E.g. when we create an IOMMU with a 48-bit DMA address width,
> > qemu does not check whether any physical IOMMU has such support, and the shadow
> > IOMMU logic will have problems if the host IOMMU only supports 39-bit IOVA. We
> > will have the same problem when it comes to 57-bit IOVA.
> > 
> > My previous discussion with Peter Xu reached an agreement that, for now, we
> > just use the host cpu capability as a reference when trying to create a 5-level
> > vIOMMU, because 57-bit IOMMU hardware will not come until the ICX platform
> > (which includes LA57).
> > 
> > And the final correct solution should be to enumerate the capabilities of the
> > hardware IOMMUs used by the assigned devices, and reject the configuration if
> > any mismatch is found.
> 
> Right. And it's a hack because
> 1. CPU AW doesn't always match VTD AW
> 2. The limitation only applies to hardware devices, software ones are fine
> So we need a patch for the host sysfs to expose the actual IOMMU AW to userspace.
> QEMU could then look at the actual hardware features.
> I'd like to see the actual patch doing that, even if we
> add a hack based on CPU AW for existing systems.
> 

Sure, I plan to do so. And I am wondering: is this a must for the current
patch set to be accepted? I mean, after all, we already have the same problem
on existing platforms. :)

> 
> But how is it working for TCG? It would seem that
> VFIO with TCG would be just as broken as with KVM...

Sorry, may I ask why TCG would be broken? I had thought TCG does not need IOMMU
shadowing...

> 
> > Maybe I should add a TODO in the above comments to give the background explanation.
> >
> > > 
> > > > so
> > > >      * that iommu shadowing works for the device assignment scenario. But for
> > > >      * TCG mode, we do not need such a restriction.
> > > >      */
> > > > 
> > > > BTW, I just tested TCG mode; it works (with the restriction on host capability removed).
> > > > 
> > > > > 
> > > > > 
> > > > > > > > > 2> Or, we can choose to keep it as it is, and add the check once a 5-level paging
> > > > > > > > > vIOMMU does have a use case in TCG?
> > > > > > > > > 
> > > > > > > > > But as to the check of the guest capability, I still believe it is necessary. As
> > > > > > > > > said, a VM without the LA57 feature shall not see a VT-d with a 5-level IOMMU.
> > > > > > > > > 
> > > > > > > > > > > +        error_setg(errp, "Do not support 57-bit DMA address, unless both "
> > > > > > > > > > > +                         "host and guest are capable of 5-level paging");
> > > > > > > > > > >          return false;
> > > > > > > > > > >      }
> > > > > > > > > > >  
> > > > > > > > > > > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> > > > > > > > > > > index d084099..2b29b6f 100644
> > > > > > > > > > > --- a/hw/i386/intel_iommu_internal.h
> > > > > > > > > > > +++ b/hw/i386/intel_iommu_internal.h
> > > > > > > > > > > @@ -114,8 +114,8 @@
> > > > > > > > > > >                                       VTD_INTERRUPT_ADDR_FIRST + 1)
> > > > > > > > > > >  
> > > > > > > > > > >  /* The shift of source_id in the key of IOTLB hash table */
> > > > > > > > > > > -#define VTD_IOTLB_SID_SHIFT         36
> > > > > > > > > > > -#define VTD_IOTLB_LVL_SHIFT         52
> > > > > > > > > > > +#define VTD_IOTLB_SID_SHIFT         45
> > > > > > > > > > > +#define VTD_IOTLB_LVL_SHIFT         61
> > > > > > > > > > >  #define VTD_IOTLB_MAX_SIZE          1024    /* Max size of the hash table */
> > > > > > > > > > >  
> > > > > > > > > > >  /* IOTLB_REG */
> > > > > > > > > > > @@ -212,6 +212,8 @@
> > > > > > > > > > >  #define VTD_CAP_SAGAW_39bit         (0x2ULL << VTD_CAP_SAGAW_SHIFT)
> > > > > > > > > > >   /* 48-bit AGAW, 4-level page-table */
> > > > > > > > > > >  #define VTD_CAP_SAGAW_48bit         (0x4ULL << VTD_CAP_SAGAW_SHIFT)
> > > > > > > > > > > + /* 57-bit AGAW, 5-level page-table */
> > > > > > > > > > > +#define VTD_CAP_SAGAW_57bit         (0x8ULL << VTD_CAP_SAGAW_SHIFT)
> > > > > > > > > > >  
> > > > > > > > > > >  /* IQT_REG */
> > > > > > > > > > >  #define VTD_IQT_QT(val)             (((val) >> 4) & 0x7fffULL)
> > > > > > > > > > > @@ -379,6 +381,8 @@ typedef union VTDInvDesc VTDInvDesc;
> > > > > > > > > > >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > > > >  #define VTD_SPTE_PAGE_L4_RSVD_MASK(aw) \
> > > > > > > > > > >          (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > > > > +#define VTD_SPTE_PAGE_L5_RSVD_MASK(aw) \
> > > > > > > > > > > +        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > > > >  #define VTD_SPTE_LPAGE_L1_RSVD_MASK(aw) \
> > > > > > > > > > >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > > > >  #define VTD_SPTE_LPAGE_L2_RSVD_MASK(aw) \
> > > > > > > > > > > @@ -387,6 +391,8 @@ typedef union VTDInvDesc VTDInvDesc;
> > > > > > > > > > >          (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > > > >  #define VTD_SPTE_LPAGE_L4_RSVD_MASK(aw) \
> > > > > > > > > > >          (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > > > > +#define VTD_SPTE_LPAGE_L5_RSVD_MASK(aw) \
> > > > > > > > > > > +        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > > > >  
> > > > > > > > > > >  /* Information about page-selective IOTLB invalidate */
> > > > > > > > > > >  struct VTDIOTLBPageInvInfo {
> > > > > > > > > > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > > > > > > > > > index 820451c..7474c4f 100644
> > > > > > > > > > > --- a/include/hw/i386/intel_iommu.h
> > > > > > > > > > > +++ b/include/hw/i386/intel_iommu.h
> > > > > > > > > > > @@ -49,6 +49,7 @@
> > > > > > > > > > >  #define DMAR_REG_SIZE               0x230
> > > > > > > > > > >  #define VTD_AW_39BIT                39
> > > > > > > > > > >  #define VTD_AW_48BIT                48
> > > > > > > > > > > +#define VTD_AW_57BIT                57
> > > > > > > > > > >  #define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
> > > > > > > > > > >  #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
> > > > > > > > > > >  
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > B.R.
> > > > > > > > > Yu
> > > > > > > > > 
> > > > > > 
> > > > > > B.R.
> > > > > > Yu
> > > > > 
> > > > 
> > > > B.R.
> > > > Yu
> > 
> > B.R.
> > Yu
B.R.
Yu

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-19 16:47             ` Michael S. Tsirkin
@ 2018-12-20  5:59               ` Yu Zhang
  0 siblings, 0 replies; 57+ messages in thread
From: Yu Zhang @ 2018-12-20  5:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Igor Mammedov, Eduardo Habkost, qemu-devel, Peter Xu,
	Paolo Bonzini, Richard Henderson

On Wed, Dec 19, 2018 at 11:47:23AM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 19, 2018 at 11:40:37AM +0100, Igor Mammedov wrote:
> > On Wed, 19 Dec 2018 10:57:17 +0800
> > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > 
> > > On Tue, Dec 18, 2018 at 03:55:36PM +0100, Igor Mammedov wrote:
> > > > On Tue, 18 Dec 2018 17:27:23 +0800
> > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > >   
> > > > > On Mon, Dec 17, 2018 at 02:17:40PM +0100, Igor Mammedov wrote:  
> > > > > > On Wed, 12 Dec 2018 21:05:38 +0800
> > > > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > > > >   
> > > > > > > Currently, vIOMMU is using the value of the IOVA address width, instead of
> > > > > > > the host address width (HAW), to calculate the number of reserved bits in
> > > > > > > data structures such as root entries, context entries, and entries of the
> > > > > > > DMA paging structures, etc.
> > > > > > > 
> > > > > > > However, the values of the IOVA address width and of the HAW may not be equal.
> > > > > > > For example, a 48-bit IOVA can only be mapped to host addresses no wider than
> > > > > > > 46 bits. Using 48, instead of 46, to calculate the reserved bits may result
> > > > > > > in an invalid IOVA being accepted.
> > > > > > > 
> > > > > > > To fix this, a new field, haw_bits, is introduced in struct IntelIOMMUState,
> > > > > > > whose value is initialized based on the maximum physical address width set
> > > > > > > for the guest CPU.  
> > > > > >   
> > > > > > > Also, definitions such as VTD_HOST_AW_39/48BIT etc. are renamed
> > > > > > > to clarify.
> > > > > > > 
> > > > > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > > > > > ---  
> > > > > > [...]
> > > > > >   
> > > > > > > @@ -3100,6 +3104,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> > > > > > >  static void vtd_init(IntelIOMMUState *s)
> > > > > > >  {
> > > > > > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > > > > > +    CPUState *cs = first_cpu;
> > > > > > > +    X86CPU *cpu = X86_CPU(cs);
> > > > > > >  
> > > > > > >      memset(s->csr, 0, DMAR_REG_SIZE);
> > > > > > >      memset(s->wmask, 0, DMAR_REG_SIZE);
> > > > > > > @@ -3119,23 +3125,24 @@ static void vtd_init(IntelIOMMUState *s)
> > > > > > >      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
> > > > > > >               VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> > > > > > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > > > > > -    if (s->aw_bits == VTD_HOST_AW_48BIT) {
> > > > > > > +    if (s->aw_bits == VTD_AW_48BIT) {
> > > > > > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > > > > > >      }
> > > > > > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > > > > > +    s->haw_bits = cpu->phys_bits;  
> > > > > > Is it possible to avoid accessing CPU fields directly or cpu altogether
> > > > > > and set phys_bits when iommu is created?  
> > > > > 
> > > > > Thanks for your comments, Igor.
> > > > > 
> > > > > Well, I guess you prefer not to query the CPU capabilities while deciding
> > > > > the vIOMMU features. But to me, they are not that irrelevant. :)
> > > > > 
> > > > > Here the hardware address width in VT-d and the one in cpuid.MAXPHYSADDR
> > > > > refer to the same concept. In a VM, both are the maximum guest physical
> > > > > address width. If we do not check the CPU field here, we will still have
> > > > > to check it in other places such as build_dmar_q35(), and reset
> > > > > s->haw_bits again.
> > > > > 
> > > > > Is this explanation convincing enough? :)  
> > > > The current build_dmar_q35() doesn't do it; it's all new code in this series
> > > > that contains an unacceptable direct access from one device (iommu) to
> > > > another (cpu). The proper way would be for the owner of the iommu to fetch
> > > > the limits from somewhere and set the values during iommu creation.
> > > 
> > > Well, the current build_dmar_q35() doesn't do it because it is using the
> > > incorrect value. :) According to the spec, the host address width is the
> > > maximum physical address width, yet the current implementation is using the
> > > DMA address width. To me, this is not only wrong, but also insecure. On this
> > > point, I think we all agree it needs to be fixed.
> > > 
> > > As to how to fix it - whether we should query the cpu fields - I still do
> > > not understand why this is not acceptable. :)
> > > 
> > > I had thought of other approaches before, yet I did not choose them:
> > > 
> > > 1> Introduce a new parameter, say, "x-haw-bits", for the iommu to limit its
> > > physical address width (similar to "x-aw-bits" for the IOVA). But should we
> > > validate this parameter? And what if it is set to something different from
> > > "phys-bits"?
> > > 
> > > 2> Another choice I had thought of is to query the physical iommu. I
> > > abandoned this idea because my understanding is that vIOMMU is not a
> > > passed-through device; it is emulated.
> > 
> > > So Igor, may I ask why you think checking against the cpu fields is so
> > > unacceptable? :)
> > Because accessing the private fields of one device from another random device
> > is not robust, and is subject to breaking in unpredictable ways when a field's
> > meaning or initialization order changes. (Analogy to baremetal: one does not
> > solder a wire to a CPU die to let some random device access a piece of its
> > data.)
> > 
> > I've looked at intel-iommu code and how it's created so here is a way to do the thing
> > you need using proper interfaces:
> > 
> > 1. add x-haw_bits property
> > 2. include in your series patch
> >     '[Qemu-devel] [PATCH] qdev: let machine hotplug handler to override  bus hotplug handler'
> > 3. add your iommu to pc_get_hotplug_handler() to redirect plug flow to
> >    machine and let _pre_plug handler to check and set x-haw_bits for machine level
> > 4. you can probably use the phys-bits/host-phys-bits properties to get the
> >    data you need; also see how ms->possible_cpus is used, since that is how
> >    you can get access to the CPUs from the machine layer.
> 
> 
> But given it's all actually a hack trying to guess host CPU capabilities,

Well, not exactly. :)

Unlike the 2nd patch in this series, in which I used the host CPU capability as a
reference (though not as the final solution), what this patch cares about is the
guest physical address width, which may well differ from the host's. E.g. we can
create a VM with a 39-bit physical address width on a host whose address width is
46 bits; in such a case, 39 shall be the address limit in the guest DMAR, instead of 46.

So I think Igor's proposal can meet all my requirements (I'll study the hotplug
handler interface to figure it out).

> I would rather say 
> 1. add a host kernel interface to get it from VFIO
> 2. on a host where it's not there, and assuming we want to support old kernels,
>    write a function returning these (do we? why? is the 5-level hardware already
>    so widespread?), and call it at any time. No need to poke at the VCPU.
> 
> 
> 
> > > >   
> > > > > > 
> > > > > > Perhaps Eduardo
> > > > > >  can suggest better approach, since he's more familiar with phys_bits topic  
> > > > > 
> > > > > @Eduardo, any comments? Thanks!
> > > > >   
> > > > > >   
> > > > > > >      /*
> > > > > > >       * Rsvd field masks for spte
> > > > > > >       */
> > > > > > >      vtd_paging_entry_rsvd_field[0] = ~0ULL;
> > > > > > > -    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->aw_bits);
> > > > > > > -    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->aw_bits);
> > > > > > > -    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->aw_bits);
> > > > > > > -    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->aw_bits);
> > > > > > > -    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->aw_bits);
> > > > > > > -    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->aw_bits);
> > > > > > > -    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->aw_bits);
> > > > > > > -    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->aw_bits);
> > > > > > > +    vtd_paging_entry_rsvd_field[1] = VTD_SPTE_PAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > > > +    vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > > +    vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > > +    vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > > +    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > > > +    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > > +    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > > +    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > >  
> > > > > > >      if (x86_iommu->intr_supported) {
> > > > > > >          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
> > > > > > > @@ -3261,10 +3268,10 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > > > > >      }
> > > > > > >  
> > > > > > >      /* Currently only address widths supported are 39 and 48 bits */
> > > > > > > -    if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
> > > > > > > -        (s->aw_bits != VTD_HOST_AW_48BIT)) {
> > > > > > > +    if ((s->aw_bits != VTD_AW_39BIT) &&
> > > > > > > +        (s->aw_bits != VTD_AW_48BIT)) {
> > > > > > >          error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > > > > > > -                   VTD_HOST_AW_39BIT, VTD_HOST_AW_48BIT);
> > > > > > > +                   VTD_AW_39BIT, VTD_AW_48BIT);
> > > > > > >          return false;
> > > > > > >      }
> > > > > > >  
> > > > > > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > > > > > index ed4e758..820451c 100644
> > > > > > > --- a/include/hw/i386/intel_iommu.h
> > > > > > > +++ b/include/hw/i386/intel_iommu.h
> > > > > > > @@ -47,9 +47,9 @@
> > > > > > >  #define VTD_SID_TO_DEVFN(sid)       ((sid) & 0xff)
> > > > > > >  
> > > > > > >  #define DMAR_REG_SIZE               0x230
> > > > > > > -#define VTD_HOST_AW_39BIT           39
> > > > > > > -#define VTD_HOST_AW_48BIT           48
> > > > > > > -#define VTD_HOST_ADDRESS_WIDTH      VTD_HOST_AW_39BIT
> > > > > > > +#define VTD_AW_39BIT                39
> > > > > > > +#define VTD_AW_48BIT                48
> > > > > > > +#define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
> > > > > > >  #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
> > > > > > >  
> > > > > > >  #define DMAR_REPORT_F_INTR          (1)
> > > > > > > @@ -244,7 +244,8 @@ struct IntelIOMMUState {
> > > > > > >      bool intr_eime;                 /* Extended interrupt mode enabled */
> > > > > > >      OnOffAuto intr_eim;             /* Toggle for EIM cabability */
> > > > > > >      bool buggy_eim;                 /* Force buggy EIM unless eim=off */
> > > > > > > -    uint8_t aw_bits;                /* Host/IOVA address width (in bits) */
> > > > > > > +    uint8_t aw_bits;                /* IOVA address width (in bits) */
> > > > > > > +    uint8_t haw_bits;               /* Hardware address width (in bits) */
> > > > > > >  
> > > > > > >      /*
> > > > > > >       * Protects IOMMU states in general.  Currently it protects the  
> > > > > > 
> > > > > >   
> > > > > 
> > > > > B.R.
> > > > > Yu  
> > > > 
> > > >   
> > > 
> > > B.R.
> > > Yu

B.R.
Yu

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 2/2] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.
  2018-12-20  5:49                       ` Yu Zhang
@ 2018-12-20 18:28                         ` Michael S. Tsirkin
  2018-12-21 16:19                           ` Yu Zhang
  0 siblings, 1 reply; 57+ messages in thread
From: Michael S. Tsirkin @ 2018-12-20 18:28 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Eduardo Habkost, qemu-devel, Peter Xu, Paolo Bonzini,
	Igor Mammedov, Richard Henderson

On Thu, Dec 20, 2018 at 01:49:21PM +0800, Yu Zhang wrote:
> On Wed, Dec 19, 2018 at 10:23:44AM -0500, Michael S. Tsirkin wrote:
> > On Wed, Dec 19, 2018 at 01:57:43PM +0800, Yu Zhang wrote:
> > > On Tue, Dec 18, 2018 at 11:35:34PM -0500, Michael S. Tsirkin wrote:
> > > > On Wed, Dec 19, 2018 at 11:40:06AM +0800, Yu Zhang wrote:
> > > > > On Tue, Dec 18, 2018 at 09:49:02AM -0500, Michael S. Tsirkin wrote:
> > > > > > On Tue, Dec 18, 2018 at 09:45:41PM +0800, Yu Zhang wrote:
> > > > > > > On Tue, Dec 18, 2018 at 07:43:28AM -0500, Michael S. Tsirkin wrote:
> > > > > > > > On Tue, Dec 18, 2018 at 06:01:16PM +0800, Yu Zhang wrote:
> > > > > > > > > On Tue, Dec 18, 2018 at 05:47:14PM +0800, Yu Zhang wrote:
> > > > > > > > > > On Mon, Dec 17, 2018 at 02:29:02PM +0100, Igor Mammedov wrote:
> > > > > > > > > > > On Wed, 12 Dec 2018 21:05:39 +0800
> > > > > > > > > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > > > > > > > > > 
> > > > > > > > > > > > A 5-level paging capable VM may choose to use 57-bit IOVA address width.
> > > > > > > > > > > > E.g. guest applications may prefer to use their VAs as IOVAs when performing
> > > > > > > > > > > > VFIO map/unmap operations, to avoid the burden of managing the IOVA space.
> > > > > > > > > > > > 
> > > > > > > > > > > > This patch extends the current vIOMMU logic to cover the extended address
> > > > > > > > > > > > width. When creating a VM with 5-level paging feature, one can choose to
> > > > > > > > > > > > create a virtual VTD with 5-level paging capability, with configurations
> > > > > > > > > > > > like "-device intel-iommu,x-aw-bits=57".
> > > > > > > > > > > > 
> > > > > > > > > > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > > > > > > > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > > > > > > > > > > ---
> > > > > > > > > > > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > > > > > > > > > > Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> > > > > > > > > > > > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > > > > > > Cc: Richard Henderson <rth@twiddle.net>
> > > > > > > > > > > > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > > > > > > > > > > > Cc: Peter Xu <peterx@redhat.com>
> > > > > > > > > > > > ---
> > > > > > > > > > > >  hw/i386/intel_iommu.c          | 53 ++++++++++++++++++++++++++++++++----------
> > > > > > > > > > > >  hw/i386/intel_iommu_internal.h | 10 ++++++--
> > > > > > > > > > > >  include/hw/i386/intel_iommu.h  |  1 +
> > > > > > > > > > > >  3 files changed, 50 insertions(+), 14 deletions(-)
> > > > > > > > > > > > 
> > > > > > > > > > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > > > > > > > > > index 0e88c63..871110c 100644
> > > > > > > > > > > > --- a/hw/i386/intel_iommu.c
> > > > > > > > > > > > +++ b/hw/i386/intel_iommu.c
> > > > > > > > > > > > @@ -664,16 +664,16 @@ static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce,
> > > > > > > > > > > >  
> > > > > > > > > > > >  /*
> > > > > > > > > > > >   * Rsvd field masks for spte:
> > > > > > > > > > > > - *     Index [1] to [4] 4k pages
> > > > > > > > > > > > - *     Index [5] to [8] large pages
> > > > > > > > > > > > + *     Index [1] to [5] 4k pages
> > > > > > > > > > > > + *     Index [6] to [10] large pages
> > > > > > > > > > > >   */
> > > > > > > > > > > > -static uint64_t vtd_paging_entry_rsvd_field[9];
> > > > > > > > > > > > +static uint64_t vtd_paging_entry_rsvd_field[11];
> > > > > > > > > > > >  
> > > > > > > > > > > >  static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > > > > > > > > > > >  {
> > > > > > > > > > > >      if (slpte & VTD_SL_PT_PAGE_SIZE_MASK) {
> > > > > > > > > > > >          /* Maybe large page */
> > > > > > > > > > > > -        return slpte & vtd_paging_entry_rsvd_field[level + 4];
> > > > > > > > > > > > +        return slpte & vtd_paging_entry_rsvd_field[level + 5];
> > > > > > > > > > > >      } else {
> > > > > > > > > > > >          return slpte & vtd_paging_entry_rsvd_field[level];
> > > > > > > > > > > >      }
> > > > > > > > > > > > @@ -3127,6 +3127,8 @@ static void vtd_init(IntelIOMMUState *s)
> > > > > > > > > > > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > > > > > > > > > >      if (s->aw_bits == VTD_AW_48BIT) {
> > > > > > > > > > > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > > > > > > > > > > > +    } else if (s->aw_bits == VTD_AW_57BIT) {
> > > > > > > > > > > > +        s->cap |= VTD_CAP_SAGAW_57bit | VTD_CAP_SAGAW_48bit;
> > > > > > > > > > > >      }
> > > > > > > > > > > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > > > > > > > > > >      s->haw_bits = cpu->phys_bits;
> > > > > > > > > > > > @@ -3139,10 +3141,12 @@ static void vtd_init(IntelIOMMUState *s)
> > > > > > > > > > > >      vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > >      vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > >      vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > > -    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > > -    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > > -    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > > -    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > > +    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_PAGE_L5_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > > +    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > > +    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > > +    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > > +    vtd_paging_entry_rsvd_field[9] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > > +    vtd_paging_entry_rsvd_field[10] = VTD_SPTE_LPAGE_L5_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > >  
> > > > > > > > > > > >      if (x86_iommu->intr_supported) {
> > > > > > > > > > > >          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
> > > > > > > > > > > > @@ -3241,6 +3245,23 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
> > > > > > > > > > > >      return &vtd_as->as;
> > > > > > > > > > > >  }
> > > > > > > > > > > >  
> > > > > > > > > > > > +static bool host_has_la57(void)
> > > > > > > > > > > > +{
> > > > > > > > > > > > +    uint32_t ecx, unused;
> > > > > > > > > > > > +
> > > > > > > > > > > > +    host_cpuid(7, 0, &unused, &unused, &ecx, &unused);
> > > > > > > > > > > > +    return ecx & CPUID_7_0_ECX_LA57;
> > > > > > > > > > > > +}
> > > > > > > > > > > > +
> > > > > > > > > > > > +static bool guest_has_la57(void)
> > > > > > > > > > > > +{
> > > > > > > > > > > > +    CPUState *cs = first_cpu;
> > > > > > > > > > > > +    X86CPU *cpu = X86_CPU(cs);
> > > > > > > > > > > > +    CPUX86State *env = &cpu->env;
> > > > > > > > > > > > +
> > > > > > > > > > > > +    return env->features[FEAT_7_0_ECX] & CPUID_7_0_ECX_LA57;
> > > > > > > > > > > > +}
> > > > > > > > > > > another direct access to CPU fields,
> > > > > > > > > > > I'd suggest to set this value when iommu is created
> > > > > > > > > > > i.e. add 'la57' property and set from iommu owner.
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Sorry, do you mean "-device intel-iommu,la57"? I think we do not need
> > > > > > > > > > that, because a 5-level capable vIOMMU can be created with properties
> > > > > > > > > > like "-device intel-iommu,x-aw-bits=57". 
> > > > > > > > > > 
> > > > > > > > > > The guest CPU fields are checked to make sure the VM has LA57 CPU feature,
> > > > > > > > > > because I believe there shall be no 5-level IOMMU on platforms without LA57
> > > > > > > > > > CPUs. 
> > > > > > > > 
> > > > > > > > I don't necessarily see why these need to be connected.
> > > > > > > > If yes pls add code to explain.
> > > > > > > 
> > > > > > > Sorry, do you mean the VM shall be able to see a 5-level IOMMU even if it
> > > > > > > does not have the LA57 feature? I did not see any direct connection when
> > > > > > > first asked to enable a 5-level vIOMMU, but I was told (and checked) that
> > > > > > > DPDK in the VM may choose a VA value as an IOVA.
> > > > > > 
> > > > > > Right but then that doesn't work on all hosts either.
> > > > > 
> > > > > Oh, the host already has a 5-level IOMMU now. So I think DPDK running
> > > > > natively shall work with that.
> > > > > 
> > > > > > 
> > > > > > > And if the guest has LA57, we should create a 5-level vIOMMU for the VM.
> > > > > > > But if the VM does not even have LA57, is there any specific reason we
> > > > > > > should give it a 5-level vIOMMU?
> > > > > > 
> > > > > > So the example you give is VTD address width < CPU aw. That is known
> > > > > > to be problematic for dpdk but not for other software, and maybe dpdk
> > > > > > will learn how to cope. Given that such hosts exist, it might be
> > > > > > useful to support this at least for debugging.
> > > > > > 
> > > > > > Are there reasons to worry about VTD > CPU?
> > > > > 
> > > > > Well, I am not that worried (having no use case is one reason). I am OK
> > > > > with dropping the guest check. :)
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > > > >  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > > > > > > > > > >  {
> > > > > > > > > > > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > > > > > > > > > > @@ -3267,11 +3288,19 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > > > > > > > > > >          }
> > > > > > > > > > > >      }
> > > > > > > > > > > >  
> > > > > > > > > > > > -    /* Currently only address widths supported are 39 and 48 bits */
> > > > > > > > > > > > +    /* Currently address widths supported are 39, 48, and 57 bits */
> > > > > > > > > > > >      if ((s->aw_bits != VTD_AW_39BIT) &&
> > > > > > > > > > > > -        (s->aw_bits != VTD_AW_48BIT)) {
> > > > > > > > > > > > -        error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > > > > > > > > > > > -                   VTD_AW_39BIT, VTD_AW_48BIT);
> > > > > > > > > > > > +        (s->aw_bits != VTD_AW_48BIT) &&
> > > > > > > > > > > > +        (s->aw_bits != VTD_AW_57BIT)) {
> > > > > > > > > > > > +        error_setg(errp, "Supported values for x-aw-bits are: %d, %d, %d",
> > > > > > > > > > > > +                   VTD_AW_39BIT, VTD_AW_48BIT, VTD_AW_57BIT);
> > > > > > > > > > > > +        return false;
> > > > > > > > > > > > +    }
> > > > > > > > > > > > +
> > > > > > > > > > > > +    if ((s->aw_bits == VTD_AW_57BIT) &&
> > > > > > > > > > > > +        !(host_has_la57() && guest_has_la57())) {
> > > > > > > > > > > Is the iommu supposed to work in TCG mode?
> > > > > > > > > > > If yes, then why should it care about host_has_la57()?
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Hmm... I did not take TCG mode into consideration. host_has_la57() is
> > > > > > > > > > used to guarantee that the host has the la57 feature, so that iommu
> > > > > > > > > > shadowing works for device assignment.
> > > > > > > > > > 
> > > > > > > > > > I guess the iommu shall work in TCG mode (though I am not quite sure
> > > > > > > > > > about this). But I do not have any use case of a 5-level vIOMMU in TCG
> > > > > > > > > > in mind. So maybe we can:
> > > > > > > > > > 1> check 'ms->accel' in vtd_decide_config() and not care about the host
> > > > > > > > > > capability if it is TCG.
> > > > > > > > > 
> > > > > > > > > For choice 1, kvm_enabled() might be used instead of ms->accel. Thanks
> > > > > > > > > Peter for the reminder. :)
> > > > > > > > 
> > > > > > > > 
> > > > > > > > This needs a big comment with an explanation though.
> > > > > > > > And probably a TODO to make it work under TCG ...
> > > > > > > > 
> > > > > > > 
> > > > > > > Thanks, Michael. For choice 1, I believe it should work for TCG (it will
> > > > > > > need testing though), and the condition would be something like:
> > > > > > > 
> > > > > > >     if ((s->aw_bits == VTD_AW_57BIT) &&
> > > > > > >         kvm_enabled() &&
> > > > > > >         !host_has_la57())  {
> > > > > > > 
> > > > > > > As you can see, though I removed the check of guest_has_la57(), I still
> > > > > > > kept the check against the host when KVM is enabled. I'm still open to
> > > > > > > being convinced of any reason why we do not need the guest check. :)
> > > > > > 
> > > > > > 
> > > > > > okay but then (repeating myself, sorry) pls add a comment that explains
> > > > > > what happens if you do not add this limitation.
> > > > > 
> > > > > How about the comments below?
> > > > >     /*
> > > > >      * For KVM guests, the host capability of LA57 shall be available,
> > > > 
> > > > So why is host CPU LA57 necessary for shadowing? Could you explain pls?
> > > 
> > > Oh, let me try to explain the background here. :)
> > > 
> > > Currently, vIOMMU in qemu does not have logic to check against the hardware
> > > IOMMU capability. E.g. when we create an IOMMU with a 48-bit DMA address
> > > width, qemu does not check whether any physical IOMMU has such support, and
> > > the shadow IOMMU logic will have problems if the host IOMMU only supports
> > > 39-bit IOVAs. We will have the same problem when it comes to 57-bit IOVAs.
> > > 
> > > My previous discussion with Peter Xu reached an agreement that, for now, we
> > > just use the host cpu capability as a reference when trying to create a
> > > 5-level vIOMMU, because 57-bit IOMMU hardware will not come until the ICX
> > > platform (which includes LA57).
> > > 
> > > And the final, correct solution should be to enumerate the capabilities of
> > > the hardware IOMMUs used by the assigned device, and reject the
> > > configuration if any mismatch is found.
> > 
> > Right. And it's a hack because
> > 1. CPU AW doesn't always match VTD AW
> > 2. The limitation only applies to hardware devices, software ones are fine
> > So we need a patch for the host sysfs to expose the actual IOMMU AW to userspace.
> > QEMU could then look at the actual hardware features.
> > I'd like to see the actual patch doing that, even if we
> > add a hack based on CPU AW for existing systems.
> > 
> 
> Sure, I plan to do so. And I am wondering whether this is a must for the current
> patch set to be accepted? I mean, after all, we already have the same problem on
> existing platforms. :)

I'd like to avoid poking at the CPU from VTD code. That's all.

> > 
> > But how is it working for TCG? It would seem that
> > VFIO with TCG would be just as broken as with KVM...
> 
> Sorry, may I ask why TCG shall be broken? I had thought TCG does not need IOMMU
> shadowing...

IOMMU shadowing is used for vfio. I do not think it matters whether it's
KVM or TCG.

> > 
> > > Maybe I should add a TODO in above comments, give the background explaination.
> > >
> > > > 
> > > > > so
> > > > >      * that iommu shadowing works for the device assignment scenario. But
> > > > >      * for TCG mode, we do not need such a restriction.
> > > > >      */
> > > > > 
> > > > > BTW, I just tested TCG mode, and it works (with the host capability
> > > > > restriction removed).
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > > > > > 2> Or, we can choose to keep it as it is, and add the check when a
> > > > > > > > > > 5-level paging vIOMMU does have a use in TCG?
> > > > > > > > > > 
> > > > > > > > > > But as to the check of the guest capability, I still believe it is
> > > > > > > > > > necessary. As said, a VM without the LA57 feature shall not see a
> > > > > > > > > > VT-d with a 5-level IOMMU.
> > > > > > > > > > 
> > > > > > > > > > > > +        error_setg(errp, "Do not support 57-bit DMA address, unless both "
> > > > > > > > > > > > +                         "host and guest are capable of 5-level paging");
> > > > > > > > > > > >          return false;
> > > > > > > > > > > >      }
> > > > > > > > > > > >  
> > > > > > > > > > > > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> > > > > > > > > > > > index d084099..2b29b6f 100644
> > > > > > > > > > > > --- a/hw/i386/intel_iommu_internal.h
> > > > > > > > > > > > +++ b/hw/i386/intel_iommu_internal.h
> > > > > > > > > > > > @@ -114,8 +114,8 @@
> > > > > > > > > > > >                                       VTD_INTERRUPT_ADDR_FIRST + 1)
> > > > > > > > > > > >  
> > > > > > > > > > > >  /* The shift of source_id in the key of IOTLB hash table */
> > > > > > > > > > > > -#define VTD_IOTLB_SID_SHIFT         36
> > > > > > > > > > > > -#define VTD_IOTLB_LVL_SHIFT         52
> > > > > > > > > > > > +#define VTD_IOTLB_SID_SHIFT         45
> > > > > > > > > > > > +#define VTD_IOTLB_LVL_SHIFT         61
> > > > > > > > > > > >  #define VTD_IOTLB_MAX_SIZE          1024    /* Max size of the hash table */
> > > > > > > > > > > >  
> > > > > > > > > > > >  /* IOTLB_REG */
> > > > > > > > > > > > @@ -212,6 +212,8 @@
> > > > > > > > > > > >  #define VTD_CAP_SAGAW_39bit         (0x2ULL << VTD_CAP_SAGAW_SHIFT)
> > > > > > > > > > > >   /* 48-bit AGAW, 4-level page-table */
> > > > > > > > > > > >  #define VTD_CAP_SAGAW_48bit         (0x4ULL << VTD_CAP_SAGAW_SHIFT)
> > > > > > > > > > > > + /* 57-bit AGAW, 5-level page-table */
> > > > > > > > > > > > +#define VTD_CAP_SAGAW_57bit         (0x8ULL << VTD_CAP_SAGAW_SHIFT)
> > > > > > > > > > > >  
> > > > > > > > > > > >  /* IQT_REG */
> > > > > > > > > > > >  #define VTD_IQT_QT(val)             (((val) >> 4) & 0x7fffULL)
> > > > > > > > > > > > @@ -379,6 +381,8 @@ typedef union VTDInvDesc VTDInvDesc;
> > > > > > > > > > > >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > > > > >  #define VTD_SPTE_PAGE_L4_RSVD_MASK(aw) \
> > > > > > > > > > > >          (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > > > > > +#define VTD_SPTE_PAGE_L5_RSVD_MASK(aw) \
> > > > > > > > > > > > +        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > > > > >  #define VTD_SPTE_LPAGE_L1_RSVD_MASK(aw) \
> > > > > > > > > > > >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > > > > >  #define VTD_SPTE_LPAGE_L2_RSVD_MASK(aw) \
> > > > > > > > > > > > @@ -387,6 +391,8 @@ typedef union VTDInvDesc VTDInvDesc;
> > > > > > > > > > > >          (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > > > > >  #define VTD_SPTE_LPAGE_L4_RSVD_MASK(aw) \
> > > > > > > > > > > >          (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > > > > > +#define VTD_SPTE_LPAGE_L5_RSVD_MASK(aw) \
> > > > > > > > > > > > +        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> > > > > > > > > > > >  
> > > > > > > > > > > >  /* Information about page-selective IOTLB invalidate */
> > > > > > > > > > > >  struct VTDIOTLBPageInvInfo {
> > > > > > > > > > > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > > > > > > > > > > index 820451c..7474c4f 100644
> > > > > > > > > > > > --- a/include/hw/i386/intel_iommu.h
> > > > > > > > > > > > +++ b/include/hw/i386/intel_iommu.h
> > > > > > > > > > > > @@ -49,6 +49,7 @@
> > > > > > > > > > > >  #define DMAR_REG_SIZE               0x230
> > > > > > > > > > > >  #define VTD_AW_39BIT                39
> > > > > > > > > > > >  #define VTD_AW_48BIT                48
> > > > > > > > > > > > +#define VTD_AW_57BIT                57
> > > > > > > > > > > >  #define VTD_ADDRESS_WIDTH           VTD_AW_39BIT
> > > > > > > > > > > >  #define VTD_HAW_MASK(aw)            ((1ULL << (aw)) - 1)
> > > > > > > > > > > >  
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > B.R.
> > > > > > > > > > Yu
> > > > > > > > > > 
> > > > > > > 
> > > > > > > B.R.
> > > > > > > Yu
> > > > > > 
> > > > > 
> > > > > B.R.
> > > > > Yu
> > > 
> > > B.R.
> > > Yu
> B.R.
> Yu


* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-18  9:27     ` Yu Zhang
  2018-12-18 14:23       ` Michael S. Tsirkin
  2018-12-18 14:55       ` Igor Mammedov
@ 2018-12-20 20:58       ` Eduardo Habkost
  2 siblings, 0 replies; 57+ messages in thread
From: Eduardo Habkost @ 2018-12-20 20:58 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Igor Mammedov, Michael S. Tsirkin, qemu-devel, Peter Xu,
	Paolo Bonzini, Richard Henderson

On Tue, Dec 18, 2018 at 05:27:23PM +0800, Yu Zhang wrote:
> On Mon, Dec 17, 2018 at 02:17:40PM +0100, Igor Mammedov wrote:
> > On Wed, 12 Dec 2018 21:05:38 +0800
> > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > 
> > > Currently, vIOMMU is using the value of the IOVA address width, instead of
> > > the host address width (HAW), to calculate the number of reserved bits in
> > > data structures such as root entries, context entries, and entries of DMA
> > > paging structures, etc.
> > > 
> > > However, the values of the IOVA address width and of the HAW may not be
> > > equal. For example, a 48-bit IOVA can only be mapped to host addresses no
> > > wider than 46 bits. Using 48, instead of 46, to calculate the reserved bits
> > > may result in an invalid IOVA being accepted.
> > > 
> > > To fix this, a new field - haw_bits - is introduced in struct
> > > IntelIOMMUState, whose value is initialized based on the maximum physical
> > > address width set for the guest CPU.
> > 
> > > Also, definitions such as VTD_HOST_AW_39/48BIT etc. are renamed
> > > for clarity.
> > > 
> > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > ---
> > [...]
> > 
> > > @@ -3100,6 +3104,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> > >  static void vtd_init(IntelIOMMUState *s)
> > >  {
> > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > +    CPUState *cs = first_cpu;
> > > +    X86CPU *cpu = X86_CPU(cs);
> > >  
> > >      memset(s->csr, 0, DMAR_REG_SIZE);
> > >      memset(s->wmask, 0, DMAR_REG_SIZE);
> > > @@ -3119,23 +3125,24 @@ static void vtd_init(IntelIOMMUState *s)
> > >      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
> > >               VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > -    if (s->aw_bits == VTD_HOST_AW_48BIT) {
> > > +    if (s->aw_bits == VTD_AW_48BIT) {
> > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > >      }
> > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > +    s->haw_bits = cpu->phys_bits;
> > Is it possible to avoid accessing CPU fields directly or cpu altogether
> > and set phys_bits when iommu is created?
> 
> Thanks for your comments, Igor.
> 
> Well, I guess you prefer not to query the CPU capabilities while deciding
> the vIOMMU features. But to me, they are not that irrelevant.:)
> 
> Here the hardware address width in vt-d, and the one in cpuid.MAXPHYSADDR
> are referring to the same concept. In VM, both are the maximum guest physical
> address width. If we do not check the CPU field here, we will still have to
> check the CPU field in other places such as build_dmar_q35(), and reset the
> s->haw_bits again.
> 
> Is this explanation convincing enough? :)
> 
> > 
> > Perhaps Eduardo
> >  can suggest better approach, since he's more familiar with phys_bits topic
> 
> @Eduardo, any comments? Thanks!

Configuring IOMMU phys-bits automatically depending on the
configured CPU is OK, but accessing first_cpu directly in iommu
code is not.  I suggest delegating this to the machine object, e.g.:

  uint32_t pc_max_phys_bits(PCMachineState *pcms)
  {
      return object_property_get_uint(OBJECT(first_cpu), "phys-bits", &error_abort);
  }

as the machine itself is responsible for creating the CPU
objects, and I believe there are other places in PC code where we
do physical address calculations that could be affected by the
physical address space size.

-- 
Eduardo

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-19 10:40           ` Igor Mammedov
  2018-12-19 16:47             ` Michael S. Tsirkin
@ 2018-12-20 21:18             ` Eduardo Habkost
  2018-12-21 14:13               ` Igor Mammedov
  1 sibling, 1 reply; 57+ messages in thread
From: Eduardo Habkost @ 2018-12-20 21:18 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Yu Zhang, Michael S. Tsirkin, qemu-devel, Peter Xu,
	Paolo Bonzini, Richard Henderson

On Wed, Dec 19, 2018 at 11:40:37AM +0100, Igor Mammedov wrote:
> On Wed, 19 Dec 2018 10:57:17 +0800
> Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> 
> > On Tue, Dec 18, 2018 at 03:55:36PM +0100, Igor Mammedov wrote:
> > > On Tue, 18 Dec 2018 17:27:23 +0800
> > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > >   
> > > > On Mon, Dec 17, 2018 at 02:17:40PM +0100, Igor Mammedov wrote:  
> > > > > On Wed, 12 Dec 2018 21:05:38 +0800
> > > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > > >   
> > > > > > Currently, vIOMMU is using the value of IOVA address width, instead of
> > > > > > the host address width(HAW) to calculate the number of reserved bits in
> > > > > > data structures such as root entries, context entries, and entries of
> > > > > > DMA paging structures etc.
> > > > > > 
> > > > > > However values of IOVA address width and of the HAW may not equal. For
> > > > > > example, a 48-bit IOVA can only be mapped to host addresses no wider than
> > > > > > 46 bits. Using 48, instead of 46 to calculate the reserved bit may result
> > > > > > in an invalid IOVA being accepted.
> > > > > > 
> > > > > > To fix this, a new field - haw_bits is introduced in struct IntelIOMMUState,
> > > > > > whose value is initialized based on the maximum physical address set to
> > > > > > guest CPU.  
> > > > >   
> > > > > > Also, definitions such as VTD_HOST_AW_39/48BIT etc. are renamed
> > > > > > to clarify.
> > > > > > 
> > > > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > > > > ---  
> > > > > [...]
> > > > >   
> > > > > > @@ -3100,6 +3104,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> > > > > >  static void vtd_init(IntelIOMMUState *s)
> > > > > >  {
> > > > > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > > > > +    CPUState *cs = first_cpu;
> > > > > > +    X86CPU *cpu = X86_CPU(cs);
> > > > > >  
> > > > > >      memset(s->csr, 0, DMAR_REG_SIZE);
> > > > > >      memset(s->wmask, 0, DMAR_REG_SIZE);
> > > > > > @@ -3119,23 +3125,24 @@ static void vtd_init(IntelIOMMUState *s)
> > > > > >      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
> > > > > >               VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> > > > > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > > > > -    if (s->aw_bits == VTD_HOST_AW_48BIT) {
> > > > > > +    if (s->aw_bits == VTD_AW_48BIT) {
> > > > > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > > > > >      }
> > > > > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > > > > +    s->haw_bits = cpu->phys_bits;  
> > > > > Is it possible to avoid accessing CPU fields directly or cpu altogether
> > > > > and set phys_bits when iommu is created?  
> > > > 
> > > > Thanks for your comments, Igor.
> > > > 
> > > > Well, I guess you prefer not to query the CPU capabilities while deciding
> > > > the vIOMMU features. But to me, they are not that irrelevant.:)
> > > > 
> > > > Here the hardware address width in vt-d, and the one in cpuid.MAXPHYSADDR
> > > > are referring to the same concept. In VM, both are the maximum guest physical
> > > > address width. If we do not check the CPU field here, we will still have to
> > > > check the CPU field in other places such as build_dmar_q35(), and reset the
> > > > s->haw_bits again.
> > > > 
> > > > Is this explanation convincing enough? :)  
> > > current build_dmar_q35() doesn't do it, it's all new code in this series that
> > > contains not acceptable direct access from one device (iommu) to another (cpu).   
> > > Proper way would be for the owner of iommu to fish limits from somewhere and set
> > > values during iommu creation.  
> > 
> > Well, current build_dmar_q35() doesn't do it, because it is using the incorrect value. :)
> > According to the spec, the host address width is the maximum physical address width,
> > yet current implementation is using the DMA address width. For me, this is not only
> > wrong, but also unsecure. For this point, I think we all agree this need to be fixed.
> > 
> > As to how to fix it - should we query the cpu fields, I still do not understand why
> > this is not acceptable. :)
> > 
> > I had thought of other approaches before, yet I did not choose:
> > 
> > 1> Introduce a new parameter, say, "x-haw-bits" which is used for iommu to limit its  
> > physical address width(similar to the "x-aw-bits" for IOVA). But what should we check
> > this parameter or not? What if this parameter is set to sth. different than the "phys-bits"
> > or not?
> > 
> > 2> Another choice I had thought of is, to query the physical iommu. I abandoned this  
> > idea because my understanding is that vIOMMU is not a passthrued device, it is emulated.
> 
> > So Igor, may I ask why you think checking against the cpu fields so not acceptable? :)
> Because accessing private fields of device from another random device is not robust
> and a subject to breaking in unpredictable manner when field meaning or initialization
> order changes. (analogy to baremetal: one does not solder wire to a CPU die to let
> access some piece of data from random device).
> 

With either the solution below or the one I proposed, we still
have an ordering problem: if we want "-cpu ...,phys-bits=..." to
affect the IOMMU device, we will need the CPU objects to be
created before IOMMU realize.

At least both proposals make the initialization ordering
explicitly a responsibility of the machine code.  In either case,
I don't think we will start creating all CPU objects after device
realize any time soon.


> I've looked at intel-iommu code and how it's created so here is a way to do the thing
> you need using proper interfaces:
> 
> 1. add x-haw_bits property
> 2. include in your series patch
>     '[Qemu-devel] [PATCH] qdev: let machine hotplug handler to override  bus hotplug handler'
> 3. add your iommu to pc_get_hotpug_handler() to redirect plug flow to
>    machine and let _pre_plug handler to check and set x-haw_bits for machine level

Wow, that's a very complex way to pass a single integer from
machine code to device code.  If this is the only way to do that,
we really need to take a step back and rethink our API design.

What's wrong with having a simple
  uint32_t pc_max_phys_bits(PCMachineState*)
function?

> 4. you probably can use phys-bits/host-phys-bits properties to get data that you need
>    also see how ms->possible_cpus, that's how you can get access to CPU from machine
>    layer.
> 
[...]

-- 
Eduardo


* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-20 21:18             ` Eduardo Habkost
@ 2018-12-21 14:13               ` Igor Mammedov
  2018-12-21 16:09                 ` Yu Zhang
  2018-12-27 14:54                 ` Eduardo Habkost
  0 siblings, 2 replies; 57+ messages in thread
From: Igor Mammedov @ 2018-12-21 14:13 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Michael S. Tsirkin, qemu-devel, Peter Xu, Yu Zhang,
	Paolo Bonzini, Richard Henderson

On Thu, 20 Dec 2018 19:18:01 -0200
Eduardo Habkost <ehabkost@redhat.com> wrote:

> On Wed, Dec 19, 2018 at 11:40:37AM +0100, Igor Mammedov wrote:
> > On Wed, 19 Dec 2018 10:57:17 +0800
> > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> >   
> > > On Tue, Dec 18, 2018 at 03:55:36PM +0100, Igor Mammedov wrote:  
> > > > On Tue, 18 Dec 2018 17:27:23 +0800
> > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > >     
> > > > > On Mon, Dec 17, 2018 at 02:17:40PM +0100, Igor Mammedov wrote:    
> > > > > > On Wed, 12 Dec 2018 21:05:38 +0800
> > > > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > > > >     
> > > > > > > Currently, vIOMMU is using the value of IOVA address width, instead of
> > > > > > > the host address width(HAW) to calculate the number of reserved bits in
> > > > > > > data structures such as root entries, context entries, and entries of
> > > > > > > DMA paging structures etc.
> > > > > > > 
> > > > > > > However values of IOVA address width and of the HAW may not equal. For
> > > > > > > example, a 48-bit IOVA can only be mapped to host addresses no wider than
> > > > > > > 46 bits. Using 48, instead of 46 to calculate the reserved bit may result
> > > > > > > in an invalid IOVA being accepted.
> > > > > > > 
> > > > > > > To fix this, a new field - haw_bits is introduced in struct IntelIOMMUState,
> > > > > > > whose value is initialized based on the maximum physical address set to
> > > > > > > guest CPU.    
> > > > > >     
> > > > > > > Also, definitions such as VTD_HOST_AW_39/48BIT etc. are renamed
> > > > > > > to clarify.
> > > > > > > 
> > > > > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > > > > > ---    
> > > > > > [...]
> > > > > >     
> > > > > > > @@ -3100,6 +3104,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> > > > > > >  static void vtd_init(IntelIOMMUState *s)
> > > > > > >  {
> > > > > > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > > > > > +    CPUState *cs = first_cpu;
> > > > > > > +    X86CPU *cpu = X86_CPU(cs);
> > > > > > >  
> > > > > > >      memset(s->csr, 0, DMAR_REG_SIZE);
> > > > > > >      memset(s->wmask, 0, DMAR_REG_SIZE);
> > > > > > > @@ -3119,23 +3125,24 @@ static void vtd_init(IntelIOMMUState *s)
> > > > > > >      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
> > > > > > >               VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> > > > > > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > > > > > -    if (s->aw_bits == VTD_HOST_AW_48BIT) {
> > > > > > > +    if (s->aw_bits == VTD_AW_48BIT) {
> > > > > > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > > > > > >      }
> > > > > > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > > > > > +    s->haw_bits = cpu->phys_bits;    
> > > > > > Is it possible to avoid accessing CPU fields directly or cpu altogether
> > > > > > and set phys_bits when iommu is created?    
> > > > > 
> > > > > Thanks for your comments, Igor.
> > > > > 
> > > > > Well, I guess you prefer not to query the CPU capabilities while deciding
> > > > > the vIOMMU features. But to me, they are not that irrelevant.:)
> > > > > 
> > > > > Here the hardware address width in vt-d, and the one in cpuid.MAXPHYSADDR
> > > > > are referring to the same concept. In VM, both are the maximum guest physical
> > > > > address width. If we do not check the CPU field here, we will still have to
> > > > > check the CPU field in other places such as build_dmar_q35(), and reset the
> > > > > s->haw_bits again.
> > > > > 
> > > > > Is this explanation convincing enough? :)    
> > > > current build_dmar_q35() doesn't do it, it's all new code in this series that
> > > > contains not acceptable direct access from one device (iommu) to another (cpu).   
> > > > Proper way would be for the owner of iommu to fish limits from somewhere and set
> > > > values during iommu creation.    
> > > 
> > > Well, current build_dmar_q35() doesn't do it, because it is using the incorrect value. :)
> > > According to the spec, the host address width is the maximum physical address width,
> > > yet current implementation is using the DMA address width. For me, this is not only
> > > wrong, but also unsecure. For this point, I think we all agree this need to be fixed.
> > > 
> > > As to how to fix it - should we query the cpu fields, I still do not understand why
> > > this is not acceptable. :)
> > > 
> > > I had thought of other approaches before, yet I did not choose:
> > >   
> > > 1> Introduce a new parameter, say, "x-haw-bits" which is used for iommu to limit its    
> > > physical address width(similar to the "x-aw-bits" for IOVA). But what should we check
> > > this parameter or not? What if this parameter is set to sth. different than the "phys-bits"
> > > or not?
> > >   
> > > 2> Another choice I had thought of is, to query the physical iommu. I abandoned this    
> > > idea because my understanding is that vIOMMU is not a passthrued device, it is emulated.  
> >   
> > > So Igor, may I ask why you think checking against the cpu fields so not acceptable? :)  
> > Because accessing private fields of device from another random device is not robust
> > and a subject to breaking in unpredictable manner when field meaning or initialization
> > order changes. (analogy to baremetal: one does not solder wire to a CPU die to let
> > access some piece of data from random device).
> >   
> 
> With either the solution below or the one I proposed, we still
> have an ordering problem: if we want "-cpu ...,phys-bits=..." to
As Michael said, it's questionable whether the iommu should rely on
the guest's phys-bits at all, but that aside, we should use proper
interfaces and hierarchy to initialize devices; see below for why I
dislike the simplistic pc_max_phys_bits().

> affect the IOMMU device, we will need the CPU objects to be
> created before IOMMU realize.
> 
> At least both proposals make the initialization ordering
> explicitly a responsibility of the machine code.  In either case,
> I don't think we will start creating all CPU objects after device
> realize any time soon.
> 
> 
> > I've looked at intel-iommu code and how it's created so here is a way to do the thing
> > you need using proper interfaces:
> > 
> > 1. add x-haw_bits property
> > 2. include in your series patch
> >     '[Qemu-devel] [PATCH] qdev: let machine hotplug handler to override  bus hotplug handler'
> > 3. add your iommu to pc_get_hotpug_handler() to redirect plug flow to
> >    machine and let _pre_plug handler to check and set x-haw_bits for machine level  
> 
> Wow, that's a very complex way to pass a single integer from
> machine code to device code.  If this is the only way to do that,
> we really need to take a step back and rethink our API design.
> 
> What's wrong with having a simple
>   uint32_t pc_max_phys_bits(PCMachineState*)
> function?
As suggested, it would be only an aesthetic change for accessing first_cpu from
a random device at a random time. IOMMU would still access the cpu instance directly
no matter how many wrappers one would use, so it's still the same hack.
If phys_bits were changing during VM lifecycle and iommu needed to use
updated value then using pc_max_phys_bits() might be justified as
we don't have interfaces to handle that but that's not the case here.

I suggested a typical way (albeit a bit complex) to handle device
initialization in cases where the bus plug handler is not sufficient.
It follows the proper hierarchy without any layer violations and can fail
gracefully even if we start creating CPUs later using only '-device cpufoo',
without needing to fix iommu code to handle that (it would fail creating the
iommu with a clear error that the CPU isn't available, and all the user has
to do is fix the CLI to make sure that the CPU is created before the iommu).

So I'd prefer if we used the existing pattern for device initialization
instead of hacks whenever it is possible.

> 
> > 4. you probably can use phys-bits/host-phys-bits properties to get data that you need
> >    also see how ms->possible_cpus, that's how you can get access to CPU from machine
> >    layer.
> >   
> [...]
> 
PS:
Another thing I'd like to draw your attention to (since you recently looked at
phys-bits) is host/guest phys_bits and whether it's safe from a migration point
of view between hosts with different limits.


* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-21 14:13               ` Igor Mammedov
@ 2018-12-21 16:09                 ` Yu Zhang
  2018-12-21 17:04                   ` Michael S. Tsirkin
  2018-12-27 14:54                 ` Eduardo Habkost
  1 sibling, 1 reply; 57+ messages in thread
From: Yu Zhang @ 2018-12-21 16:09 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Eduardo Habkost, Michael S. Tsirkin, qemu-devel, Peter Xu,
	Paolo Bonzini, Richard Henderson

On Fri, Dec 21, 2018 at 03:13:25PM +0100, Igor Mammedov wrote:
> On Thu, 20 Dec 2018 19:18:01 -0200
> Eduardo Habkost <ehabkost@redhat.com> wrote:
> 
> > On Wed, Dec 19, 2018 at 11:40:37AM +0100, Igor Mammedov wrote:
> > > On Wed, 19 Dec 2018 10:57:17 +0800
> > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > >   
> > > > On Tue, Dec 18, 2018 at 03:55:36PM +0100, Igor Mammedov wrote:  
> > > > > On Tue, 18 Dec 2018 17:27:23 +0800
> > > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > > >     
> > > > > > On Mon, Dec 17, 2018 at 02:17:40PM +0100, Igor Mammedov wrote:    
> > > > > > > On Wed, 12 Dec 2018 21:05:38 +0800
> > > > > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > > > > >     
> > > > > > > > Currently, vIOMMU is using the value of IOVA address width, instead of
> > > > > > > > the host address width(HAW) to calculate the number of reserved bits in
> > > > > > > > data structures such as root entries, context entries, and entries of
> > > > > > > > DMA paging structures etc.
> > > > > > > > 
> > > > > > > > However values of IOVA address width and of the HAW may not equal. For
> > > > > > > > example, a 48-bit IOVA can only be mapped to host addresses no wider than
> > > > > > > > 46 bits. Using 48, instead of 46 to calculate the reserved bit may result
> > > > > > > > in an invalid IOVA being accepted.
> > > > > > > > 
> > > > > > > > To fix this, a new field - haw_bits is introduced in struct IntelIOMMUState,
> > > > > > > > whose value is initialized based on the maximum physical address set to
> > > > > > > > guest CPU.    
> > > > > > >     
> > > > > > > > Also, definitions such as VTD_HOST_AW_39/48BIT etc. are renamed
> > > > > > > > to clarify.
> > > > > > > > 
> > > > > > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > > > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > > > > > > ---    
> > > > > > > [...]
> > > > > > >     
> > > > > > > > @@ -3100,6 +3104,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> > > > > > > >  static void vtd_init(IntelIOMMUState *s)
> > > > > > > >  {
> > > > > > > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > > > > > > +    CPUState *cs = first_cpu;
> > > > > > > > +    X86CPU *cpu = X86_CPU(cs);
> > > > > > > >  
> > > > > > > >      memset(s->csr, 0, DMAR_REG_SIZE);
> > > > > > > >      memset(s->wmask, 0, DMAR_REG_SIZE);
> > > > > > > > @@ -3119,23 +3125,24 @@ static void vtd_init(IntelIOMMUState *s)
> > > > > > > >      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
> > > > > > > >               VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> > > > > > > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > > > > > > -    if (s->aw_bits == VTD_HOST_AW_48BIT) {
> > > > > > > > +    if (s->aw_bits == VTD_AW_48BIT) {
> > > > > > > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > > > > > > >      }
> > > > > > > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > > > > > > +    s->haw_bits = cpu->phys_bits;    
> > > > > > > Is it possible to avoid accessing CPU fields directly or cpu altogether
> > > > > > > and set phys_bits when iommu is created?    
> > > > > > 
> > > > > > Thanks for your comments, Igor.
> > > > > > 
> > > > > > Well, I guess you prefer not to query the CPU capabilities while deciding
> > > > > > the vIOMMU features. But to me, they are not that irrelevant.:)
> > > > > > 
> > > > > > Here the hardware address width in vt-d, and the one in cpuid.MAXPHYSADDR
> > > > > > are referring to the same concept. In VM, both are the maximum guest physical
> > > > > > address width. If we do not check the CPU field here, we will still have to
> > > > > > check the CPU field in other places such as build_dmar_q35(), and reset the
> > > > > > s->haw_bits again.
> > > > > > 
> > > > > > Is this explanation convincing enough? :)    
> > > > > current build_dmar_q35() doesn't do it, it's all new code in this series that
> > > > > contains not acceptable direct access from one device (iommu) to another (cpu).   
> > > > > Proper way would be for the owner of iommu to fish limits from somewhere and set
> > > > > values during iommu creation.    
> > > > 
> > > > Well, current build_dmar_q35() doesn't do it, because it is using the incorrect value. :)
> > > > According to the spec, the host address width is the maximum physical address width,
> > > > yet current implementation is using the DMA address width. For me, this is not only
> > > > wrong, but also unsecure. For this point, I think we all agree this need to be fixed.
> > > > 
> > > > As to how to fix it - should we query the cpu fields, I still do not understand why
> > > > this is not acceptable. :)
> > > > 
> > > > I had thought of other approaches before, yet I did not choose:
> > > >   
> > > > 1> Introduce a new parameter, say, "x-haw-bits" which is used for iommu to limit its    
> > > > physical address width(similar to the "x-aw-bits" for IOVA). But what should we check
> > > > this parameter or not? What if this parameter is set to sth. different than the "phys-bits"
> > > > or not?
> > > >   
> > > > 2> Another choice I had thought of is, to query the physical iommu. I abandoned this    
> > > > idea because my understanding is that vIOMMU is not a passthrued device, it is emulated.  
> > >   
> > > > So Igor, may I ask why you think checking against the cpu fields so not acceptable? :)  
> > > Because accessing private fields of device from another random device is not robust
> > > and a subject to breaking in unpredictable manner when field meaning or initialization
> > > order changes. (analogy to baremetal: one does not solder wire to a CPU die to let
> > > access some piece of data from random device).
> > >   
> > 
> > With either the solution below or the one I proposed, we still
> > have an ordering problem: if we want "-cpu ...,phys-bits=..." to
> As Michael said, it's questionable if iommu should rely on guest's
> phys-bits at all, but that aside we should use proper interfaces
> and hierarchy to initialize devices, see below why I dislike
> simplistic pc_max_phys_bits().

Well, my understanding of the vt-d spec is that the address limitation in
DMAR refers to the same concept as CPUID.MAXPHYSADDR. I do not think
there's any difference in the native scenario. :)

> 
> > affect the IOMMU device, we will need the CPU objects to be
> > created before IOMMU realize.
> > 
> > At least both proposals make the initialization ordering
> > explicitly a responsibility of the machine code.  In either case,
> > I don't think we will start creating all CPU objects after device
> > realize any time soon.
> > 
> > 
> > > I've looked at intel-iommu code and how it's created so here is a way to do the thing
> > > you need using proper interfaces:
> > > 
> > > 1. add x-haw_bits property
> > > 2. include in your series patch
> > >     '[Qemu-devel] [PATCH] qdev: let machine hotplug handler to override  bus hotplug handler'
> > > 3. add your iommu to pc_get_hotpug_handler() to redirect plug flow to
> > >    machine and let _pre_plug handler to check and set x-haw_bits for machine level  
> > 
> > Wow, that's a very complex way to pass a single integer from
> > machine code to device code.  If this is the only way to do that,
> > we really need to take a step back and rethink our API design.
> > 
> > What's wrong with having a simple
> >   uint32_t pc_max_phys_bits(PCMachineState*)
> > function?
> As suggested, it would be only aesthetic change for accessing first_cpu from
> random device at random time. IOMMU would still access cpu instance directly
> no matter how much wrappers one would use so it's still the same hack.
> If phys_bits were changing during VM lifecycle and iommu needed to use
> updated value then using pc_max_phys_bits() might be justified as
> we don't have interfaces to handle that but that's not the case here.
> 
> I suggested a typical way (albeit a bit complex) to handle device
> initialization in cases where bus plug handler is not sufficient.
> It follows proper hierarchy without any layer violations and can fail
> gracefully even if we start creating CPUs later using only '-device cpufoo'
> without need to fix iommu code to handle that (it would fail creating
> iommu with clear error that CPU isn't available and all user have to
> do is to fix CLI to make sure that CPU is created before iommu).
> 
> So I'd prefer if we used existing pattern for device initialization
> instead of hacks whenever it is possible.

Thanks, Igor. I kind of understand your concern here. And I am wondering:
the phys-bits shall be a configuration used by the VM, not just the vCPU. So,
instead of trying to deduce this value from the 1st created vCPU, or trying
to guarantee the order of vCPU & vIOMMU creation, is there any possibility
we move a max-phys-bits into the MachineState, and derive the 'phys-bits'
of the vCPU and the 'haw-bits' of the vIOMMU from the MachineState later in
their respective creation processes?

> 
> > 
> > > 4. you probably can use phys-bits/host-phys-bits properties to get data that you need
> > >    also see how ms->possible_cpus, that's how you can get access to CPU from machine
> > >    layer.
> > >   
> > [...]
> > 
> PS:
> Another thing I'd like to draw your attention to (since you recently looked at
> phys-bits) is about host/guest phys_bits and if it's safe from migration pov
> between hosts with different limits.
> 

Good point, and thanks for the reminder. Eduardo, Paolo, and I discussed this
before. And indeed, it is a bit tricky... :)

B.R.
Yu


* Re: [Qemu-devel] [PATCH v3 2/2] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.
  2018-12-20 18:28                         ` Michael S. Tsirkin
@ 2018-12-21 16:19                           ` Yu Zhang
  2018-12-21 17:15                             ` Michael S. Tsirkin
  0 siblings, 1 reply; 57+ messages in thread
From: Yu Zhang @ 2018-12-21 16:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eduardo Habkost, qemu-devel, Peter Xu, Igor Mammedov,
	Paolo Bonzini, Richard Henderson

On Thu, Dec 20, 2018 at 01:28:21PM -0500, Michael S. Tsirkin wrote:
> On Thu, Dec 20, 2018 at 01:49:21PM +0800, Yu Zhang wrote:
> > On Wed, Dec 19, 2018 at 10:23:44AM -0500, Michael S. Tsirkin wrote:
> > > On Wed, Dec 19, 2018 at 01:57:43PM +0800, Yu Zhang wrote:
> > > > On Tue, Dec 18, 2018 at 11:35:34PM -0500, Michael S. Tsirkin wrote:
> > > > > On Wed, Dec 19, 2018 at 11:40:06AM +0800, Yu Zhang wrote:
> > > > > > On Tue, Dec 18, 2018 at 09:49:02AM -0500, Michael S. Tsirkin wrote:
> > > > > > > On Tue, Dec 18, 2018 at 09:45:41PM +0800, Yu Zhang wrote:
> > > > > > > > On Tue, Dec 18, 2018 at 07:43:28AM -0500, Michael S. Tsirkin wrote:
> > > > > > > > > On Tue, Dec 18, 2018 at 06:01:16PM +0800, Yu Zhang wrote:
> > > > > > > > > > On Tue, Dec 18, 2018 at 05:47:14PM +0800, Yu Zhang wrote:
> > > > > > > > > > > On Mon, Dec 17, 2018 at 02:29:02PM +0100, Igor Mammedov wrote:
> > > > > > > > > > > > On Wed, 12 Dec 2018 21:05:39 +0800
> > > > > > > > > > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > > > > > > > > > > 
> > > > > > > > > > > > > A 5-level paging capable VM may choose to use 57-bit IOVA address width.
> > > > > > > > > > > > > E.g. guest applications may prefer to use its VA as IOVA when performing
> > > > > > > > > > > > > VFIO map/unmap operations, to avoid the burden of managing the IOVA space.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > This patch extends the current vIOMMU logic to cover the extended address
> > > > > > > > > > > > > width. When creating a VM with 5-level paging feature, one can choose to
> > > > > > > > > > > > > create a virtual VTD with 5-level paging capability, with configurations
> > > > > > > > > > > > > like "-device intel-iommu,x-aw-bits=57".
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > > > > > > > > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > > > > > > > > > > > ---
> > > > > > > > > > > > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > > > > > > > > > > > Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> > > > > > > > > > > > > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > > > > > > > > > > > > Cc: Richard Henderson <rth@twiddle.net>
> > > > > > > > > > > > > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > > > > > > > > > > > > Cc: Peter Xu <peterx@redhat.com>
> > > > > > > > > > > > > ---
> > > > > > > > > > > > >  hw/i386/intel_iommu.c          | 53 ++++++++++++++++++++++++++++++++----------
> > > > > > > > > > > > >  hw/i386/intel_iommu_internal.h | 10 ++++++--
> > > > > > > > > > > > >  include/hw/i386/intel_iommu.h  |  1 +
> > > > > > > > > > > > >  3 files changed, 50 insertions(+), 14 deletions(-)
> > > > > > > > > > > > > 
> > > > > > > > > > > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > > > > > > > > > > index 0e88c63..871110c 100644
> > > > > > > > > > > > > --- a/hw/i386/intel_iommu.c
> > > > > > > > > > > > > +++ b/hw/i386/intel_iommu.c
> > > > > > > > > > > > > @@ -664,16 +664,16 @@ static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce,
> > > > > > > > > > > > >  
> > > > > > > > > > > > >  /*
> > > > > > > > > > > > >   * Rsvd field masks for spte:
> > > > > > > > > > > > > - *     Index [1] to [4] 4k pages
> > > > > > > > > > > > > - *     Index [5] to [8] large pages
> > > > > > > > > > > > > + *     Index [1] to [5] 4k pages
> > > > > > > > > > > > > + *     Index [6] to [10] large pages
> > > > > > > > > > > > >   */
> > > > > > > > > > > > > -static uint64_t vtd_paging_entry_rsvd_field[9];
> > > > > > > > > > > > > +static uint64_t vtd_paging_entry_rsvd_field[11];
> > > > > > > > > > > > >  
> > > > > > > > > > > > >  static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > > > > > > > > > > > >  {
> > > > > > > > > > > > >      if (slpte & VTD_SL_PT_PAGE_SIZE_MASK) {
> > > > > > > > > > > > >          /* Maybe large page */
> > > > > > > > > > > > > -        return slpte & vtd_paging_entry_rsvd_field[level + 4];
> > > > > > > > > > > > > +        return slpte & vtd_paging_entry_rsvd_field[level + 5];
> > > > > > > > > > > > >      } else {
> > > > > > > > > > > > >          return slpte & vtd_paging_entry_rsvd_field[level];
> > > > > > > > > > > > >      }
> > > > > > > > > > > > > @@ -3127,6 +3127,8 @@ static void vtd_init(IntelIOMMUState *s)
> > > > > > > > > > > > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > > > > > > > > > > >      if (s->aw_bits == VTD_AW_48BIT) {
> > > > > > > > > > > > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > > > > > > > > > > > > +    } else if (s->aw_bits == VTD_AW_57BIT) {
> > > > > > > > > > > > > +        s->cap |= VTD_CAP_SAGAW_57bit | VTD_CAP_SAGAW_48bit;
> > > > > > > > > > > > >      }
> > > > > > > > > > > > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > > > > > > > > > > >      s->haw_bits = cpu->phys_bits;
> > > > > > > > > > > > > @@ -3139,10 +3141,12 @@ static void vtd_init(IntelIOMMUState *s)
> > > > > > > > > > > > >      vtd_paging_entry_rsvd_field[2] = VTD_SPTE_PAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > > >      vtd_paging_entry_rsvd_field[3] = VTD_SPTE_PAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > > >      vtd_paging_entry_rsvd_field[4] = VTD_SPTE_PAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > > > -    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > > > -    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > > > -    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > > > -    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > > > +    vtd_paging_entry_rsvd_field[5] = VTD_SPTE_PAGE_L5_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > > > +    vtd_paging_entry_rsvd_field[6] = VTD_SPTE_LPAGE_L1_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > > > +    vtd_paging_entry_rsvd_field[7] = VTD_SPTE_LPAGE_L2_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > > > +    vtd_paging_entry_rsvd_field[8] = VTD_SPTE_LPAGE_L3_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > > > +    vtd_paging_entry_rsvd_field[9] = VTD_SPTE_LPAGE_L4_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > > > +    vtd_paging_entry_rsvd_field[10] = VTD_SPTE_LPAGE_L5_RSVD_MASK(s->haw_bits);
> > > > > > > > > > > > >  
> > > > > > > > > > > > >      if (x86_iommu->intr_supported) {
> > > > > > > > > > > > >          s->ecap |= VTD_ECAP_IR | VTD_ECAP_MHMV;
> > > > > > > > > > > > > @@ -3241,6 +3245,23 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
> > > > > > > > > > > > >      return &vtd_as->as;
> > > > > > > > > > > > >  }
> > > > > > > > > > > > >  
> > > > > > > > > > > > > +static bool host_has_la57(void)
> > > > > > > > > > > > > +{
> > > > > > > > > > > > > +    uint32_t ecx, unused;
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +    host_cpuid(7, 0, &unused, &unused, &ecx, &unused);
> > > > > > > > > > > > > +    return ecx & CPUID_7_0_ECX_LA57;
> > > > > > > > > > > > > +}
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +static bool guest_has_la57(void)
> > > > > > > > > > > > > +{
> > > > > > > > > > > > > +    CPUState *cs = first_cpu;
> > > > > > > > > > > > > +    X86CPU *cpu = X86_CPU(cs);
> > > > > > > > > > > > > +    CPUX86State *env = &cpu->env;
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +    return env->features[FEAT_7_0_ECX] & CPUID_7_0_ECX_LA57;
> > > > > > > > > > > > > +}
> > > > > > > > > > > > Another direct access to CPU fields.
> > > > > > > > > > > > I'd suggest setting this value when the iommu is created,
> > > > > > > > > > > > i.e. add a 'la57' property and set it from the iommu owner.
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Sorry, do you mean "-device intel-iommu,la57"? I think we do not need
> > > > > > > > > > > that, because a 5-level capable vIOMMU can be created with properties
> > > > > > > > > > > like "-device intel-iommu,x-aw-bits=57". 
> > > > > > > > > > > 
> > > > > > > > > > > The guest CPU fields are checked to make sure the VM has LA57 CPU feature,
> > > > > > > > > > > because I believe there shall be no 5-level IOMMU on platforms without LA57
> > > > > > > > > > > CPUs. 
> > > > > > > > > 
> > > > > > > > > I don't necessarily see why these need to be connected.
> > > > > > > > > If yes pls add code to explain.
> > > > > > > > 
> > > > > > > > Sorry, do you mean the VM shall be able to see a 5-level IOMMU even if it does not
> > > > > > > > have the LA57 feature? I do not see any direct connection when asked to enable a 5-level
> > > > > > > > vIOMMU at first, but I was told (and checked) that DPDK in the VM may choose a VA
> > > > > > > > value as an IOVA.
> > > > > > > 
> > > > > > > Right but then that doesn't work on all hosts either.
> > > > > > 
> > > > > > Oh, the host already has a 5-level IOMMU now. So I think DPDK running natively shall work with that.
> > > > > > 
> > > > > > > 
> > > > > > > > And if the guest has LA57, we should create a 5-level vIOMMU for the VM.
> > > > > > > > But if the VM does not even have LA57, is there any specific reason we should give it a 5-level
> > > > > > > > vIOMMU?
> > > > > > > 
> > > > > > > So the example you give is VTD address width < CPU aw. That is known
> > > > > > > to be problematic for dpdk but not for other software, and maybe dpdk
> > > > > > > will learn how to cope. Given that such hosts exist, it might be
> > > > > > > useful to support this at least for debugging.
> > > > > > > 
> > > > > > > Are there reasons to worry about VTD > CPU?
> > > > > > 
> > > > > > Well, I am not that worried(no usage case is one concern). I am OK to drop the guest check. :)
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > > > >  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > > > > > > > > > > >  {
> > > > > > > > > > > > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > > > > > > > > > > > @@ -3267,11 +3288,19 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > > > > > > > > > > >          }
> > > > > > > > > > > > >      }
> > > > > > > > > > > > >  
> > > > > > > > > > > > > -    /* Currently only address widths supported are 39 and 48 bits */
> > > > > > > > > > > > > +    /* Currently address widths supported are 39, 48, and 57 bits */
> > > > > > > > > > > > >      if ((s->aw_bits != VTD_AW_39BIT) &&
> > > > > > > > > > > > > -        (s->aw_bits != VTD_AW_48BIT)) {
> > > > > > > > > > > > > -        error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > > > > > > > > > > > > -                   VTD_AW_39BIT, VTD_AW_48BIT);
> > > > > > > > > > > > > +        (s->aw_bits != VTD_AW_48BIT) &&
> > > > > > > > > > > > > +        (s->aw_bits != VTD_AW_57BIT)) {
> > > > > > > > > > > > > +        error_setg(errp, "Supported values for x-aw-bits are: %d, %d, %d",
> > > > > > > > > > > > > +                   VTD_AW_39BIT, VTD_AW_48BIT, VTD_AW_57BIT);
> > > > > > > > > > > > > +        return false;
> > > > > > > > > > > > > +    }
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +    if ((s->aw_bits == VTD_AW_57BIT) &&
> > > > > > > > > > > > > +        !(host_has_la57() && guest_has_la57())) {
> > > > > > > > > > > > Is the iommu supposed to work in TCG mode?
> > > > > > > > > > > > If yes, then why should it care about host_has_la57()?
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Hmm... I did not take TCG mode into consideration. And host_has_la57() is
> > > > > > > > > > > used to guarantee the host has the la57 feature so that iommu shadowing works
> > > > > > > > > > > for device assignment.
> > > > > > > > > > > 
> > > > > > > > > > > I guess the iommu shall work in TCG mode (though I am not quite sure about this).
> > > > > > > > > > > But I do not have any use case of a 5-level vIOMMU in TCG in mind. So maybe
> > > > > > > > > > > we can:
> > > > > > > > > > > 1> check the 'ms->accel' in vtd_decide_config() and do not care about host
> > > > > > > > > > > capability if it is TCG.
> > > > > > > > > > 
> > > > > > > > > > For choice 1, kvm_enabled() might be used instead of ms->accel. Thanks Peter
> > > > > > > > > > for the remind. :)
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > This needs a big comment with an explanation though.
> > > > > > > > > And probably a TODO to make it work under TCG ...
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Thanks, Michael. For choice 1, I believe it should work for TCG (will need testing
> > > > > > > > though), and the condition would be something like:
> > > > > > > > 
> > > > > > > >     if ((s->aw_bits == VTD_AW_57BIT) &&
> > > > > > > >         kvm_enabled() &&
> > > > > > > >         !host_has_la57())  {
> > > > > > > > 
> > > > > > > > As you can see, though I removed the check of guest_has_la57(), I still kept the
> > > > > > > > check against the host when KVM is enabled. I'm still ready to be convinced of any
> > > > > > > > requirement why we do not need the guest check. :)
> > > > > > > 
> > > > > > > 
> > > > > > > okay but then (repeating myself, sorry) pls add a comment that explains
> > > > > > > what happens if you do not add this limitation.
> > > > > > 
> > > > > > How about below comments?
> > > > > >     /*
> > > > > >      * For KVM guests, the host capability of LA57 shall be available,
> > > > > 
> > > > > So why is host CPU LA57 necessary for shadowing? Could you explain pls?
> > > > 
> > > > Oh, let me try to explain the background here. :)
> > > > 
> > > > Currently, the vIOMMU in qemu does not have logic to check against the hardware
> > > > IOMMU capability. E.g. when we create an IOMMU with a 48-bit DMA address width,
> > > > qemu does not check if any physical IOMMU has such support. And the shadow
> > > > IOMMU logic will have problems if the host IOMMU only supports 39-bit IOVA. And
> > > > we will have the same problem when it comes to 57-bit IOVA.
> > > > 
> > > > My previous discussion with Peter Xu reached an agreement that for now, we
> > > > just use the host cpu capability as a reference when trying to create a 5-level
> > > > vIOMMU, because 57-bit IOMMU hardware will not come until the ICX platform (which
> > > > includes LA57). 
> > > > 
> > > > And the final correct solution should be to enumerate the capabilities of
> > > > hardware IOMMUs used by the assigned device, and reject if any mismatch is
> > > > found.
> > > 
> > > Right. And it's a hack because
> > > 1. CPU AW doesn't always match VTD AW
> > > 2. The limitation only applies to hardware devices, software ones are fine
> > > So we need a patch for the host sysfs to expose the actual IOMMU AW to userspace.
> > > QEMU could then look at the actual hardware features.
> > > I'd like to see the actual patch doing that, even if we
> > > add a hack based on CPU AW for existing systems.
> > > 
> > 
> > Sure, I have a plan to do so. And I am wondering, is this a must for the current
> > patchset to be accepted? I mean, after all, we already have the same problem
> > on existing platforms. :)
> 
> I'd like to avoid poking at the CPU from VTD code. That's all.

OK. So for the short term, how about I remove the check of the host CPU, and add a TODO
in the comments in vtd_decide_config()?

As to the check against hardware IOMMU, Peter once had a proposal in
http://lists.nongnu.org/archive/html/qemu-devel/2018-11/msg02281.html

Do you have any comment or suggestion on Peter's proposal? I still do not quite know
how to do it for now...

[...]


B.R.
Yu

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-21 16:09                 ` Yu Zhang
@ 2018-12-21 17:04                   ` Michael S. Tsirkin
  2018-12-21 17:37                     ` Yu Zhang
  0 siblings, 1 reply; 57+ messages in thread
From: Michael S. Tsirkin @ 2018-12-21 17:04 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Igor Mammedov, Eduardo Habkost, qemu-devel, Peter Xu,
	Paolo Bonzini, Richard Henderson

On Sat, Dec 22, 2018 at 12:09:44AM +0800, Yu Zhang wrote:
> Well, my understanding of the vt-d spec is that the address limitation in
> the DMAR refers to the same concept as CPUID.MAXPHYSADDR. I do not think
> there's any difference in the native scenario. :)

I think native machines exist on which the two values are different.
Is that true?


* Re: [Qemu-devel] [PATCH v3 2/2] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.
  2018-12-21 16:19                           ` Yu Zhang
@ 2018-12-21 17:15                             ` Michael S. Tsirkin
  2018-12-21 17:34                               ` Yu Zhang
  0 siblings, 1 reply; 57+ messages in thread
From: Michael S. Tsirkin @ 2018-12-21 17:15 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Eduardo Habkost, qemu-devel, Peter Xu, Igor Mammedov,
	Paolo Bonzini, Richard Henderson

On Sat, Dec 22, 2018 at 12:19:20AM +0800, Yu Zhang wrote:
> > I'd like to avoid poking at the CPU from VTD code. That's all.
> 
> OK. So for the short term, how about I remove the check of the host CPU, and add a TODO
> in the comments in vtd_decide_config()?

My question would be what happens on an incorrect use?
And how does user figure out which values to set?

> As to the check against hardware IOMMU, Peter once had a proposal in
> http://lists.nongnu.org/archive/html/qemu-devel/2018-11/msg02281.html
> 
> Do you have any comment or suggestion on Peter's proposal?

Sounds reasonable to me. Do we do it on vfio attach or unconditionally?


> I still do not quite know
> how to do it for now...
> 
> [...]
> 
> 
> B.R.
> Yu



-- 
MST


* Re: [Qemu-devel] [PATCH v3 2/2] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.
  2018-12-21 17:15                             ` Michael S. Tsirkin
@ 2018-12-21 17:34                               ` Yu Zhang
  2018-12-21 18:10                                 ` Michael S. Tsirkin
  2018-12-25  1:59                                 ` Tian, Kevin
  0 siblings, 2 replies; 57+ messages in thread
From: Yu Zhang @ 2018-12-21 17:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eduardo Habkost, qemu-devel, Peter Xu, Igor Mammedov,
	Paolo Bonzini, Richard Henderson

On Fri, Dec 21, 2018 at 12:15:26PM -0500, Michael S. Tsirkin wrote:
> On Sat, Dec 22, 2018 at 12:19:20AM +0800, Yu Zhang wrote:
> > > I'd like to avoid poking at the CPU from VTD code. That's all.
> > 
> > OK. So for the short term, how about I remove the check of the host CPU, and add a TODO
> > in the comments in vtd_decide_config()?
> 
> My question would be what happens on an incorrect use?

I believe the vfio_dma_map will return failure for an incorrect use.

> And how does user figure out which values to set?

Well, for now I don't think the user can figure it out. E.g. if we expose a vIOMMU with
48-bit IOVA capability, yet the host only supports 39-bit IOVA, vfio shall return failure,
but the user does not know whose fault it is.

> 
> > As to the check against hardware IOMMU, Peter once had a proposal in
> > http://lists.nongnu.org/archive/html/qemu-devel/2018-11/msg02281.html
> > 
> > Do you have any comment or suggestion on Peter's proposal?
> 
> Sounds reasonable to me. Do we do it on vfio attach or unconditionally?
> 

I guess on vfio attach? Will need more thinking on it.

> 
> > I still do not quite know
> > how to do it for now...
> > 
> > [...]
> > 
> > 
> > B.R.
> > Yu
> 
> 
> 
> -- 
> MST

B.R.
Yu


* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-21 17:04                   ` Michael S. Tsirkin
@ 2018-12-21 17:37                     ` Yu Zhang
  2018-12-21 19:02                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 57+ messages in thread
From: Yu Zhang @ 2018-12-21 17:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Igor Mammedov, Eduardo Habkost, qemu-devel, Peter Xu,
	Paolo Bonzini, Richard Henderson

On Fri, Dec 21, 2018 at 12:04:49PM -0500, Michael S. Tsirkin wrote:
> On Sat, Dec 22, 2018 at 12:09:44AM +0800, Yu Zhang wrote:
> > Well, my understanding of the vt-d spec is that the address limitation in
> > the DMAR refers to the same concept as CPUID.MAXPHYSADDR. I do not think
> > there's any difference in the native scenario. :)
> 
> I think native machines exist on which the two values are different.
> Is that true?

I think the answer is no. My understanding is that HAW (host address width) is
the maximum physical address width a CPU can detect (via cpuid.0x80000008).

I agree there are some addresses the CPU does not touch, but they are still in
the physical address space, and there's only one physical address space...

B.R.
Yu


* Re: [Qemu-devel] [PATCH v3 2/2] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.
  2018-12-21 17:34                               ` Yu Zhang
@ 2018-12-21 18:10                                 ` Michael S. Tsirkin
  2018-12-22  0:41                                   ` Yu Zhang
  2018-12-25  1:59                                 ` Tian, Kevin
  1 sibling, 1 reply; 57+ messages in thread
From: Michael S. Tsirkin @ 2018-12-21 18:10 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Eduardo Habkost, qemu-devel, Peter Xu, Igor Mammedov,
	Paolo Bonzini, Richard Henderson

On Sat, Dec 22, 2018 at 01:34:01AM +0800, Yu Zhang wrote:
> On Fri, Dec 21, 2018 at 12:15:26PM -0500, Michael S. Tsirkin wrote:
> > On Sat, Dec 22, 2018 at 12:19:20AM +0800, Yu Zhang wrote:
> > > > I'd like to avoid poking at the CPU from VTD code. That's all.
> > > 
> > > OK. So for the short term, how about I remove the check of the host CPU, and add a TODO
> > > in the comments in vtd_decide_config()?
> > 
> > My question would be what happens on an incorrect use?
> 
> I believe the vfio_dma_map will return failure for an incorrect use.
> 
> > And how does user figure out which values to set?
> 
> Well, for now I don't think the user can figure it out. E.g. if we expose a vIOMMU with
> 48-bit IOVA capability, yet the host only supports 39-bit IOVA, vfio shall return failure,
> but the user does not know whose fault it is.
> > 
> > > As to the check against hardware IOMMU, Peter once had a proposal in
> > > http://lists.nongnu.org/archive/html/qemu-devel/2018-11/msg02281.html
> > > 
> > > Do you have any comment or suggestion on Peter's proposal?
> > 
> > Sounds reasonable to me. Do we do it on vfio attach or unconditionally?
> > 
> 
> I guess on vfio attach? Will need more thinking on it.


Things like live migration (e.g. after hot removal of the vfio device)
are also concerns.

> > 
> > > I still do not quite know
> > > how to do it for now...
> > > 
> > > [...]
> > > 
> > > 
> > > B.R.
> > > Yu
> > 
> > 
> > 
> > -- 
> > MST
> 
> B.R.
> Yu


* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-21 17:37                     ` Yu Zhang
@ 2018-12-21 19:02                       ` Michael S. Tsirkin
  2018-12-21 20:01                         ` Eduardo Habkost
  2018-12-22  1:11                         ` Yu Zhang
  0 siblings, 2 replies; 57+ messages in thread
From: Michael S. Tsirkin @ 2018-12-21 19:02 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Igor Mammedov, Eduardo Habkost, qemu-devel, Peter Xu,
	Paolo Bonzini, Richard Henderson

On Sat, Dec 22, 2018 at 01:37:58AM +0800, Yu Zhang wrote:
> On Fri, Dec 21, 2018 at 12:04:49PM -0500, Michael S. Tsirkin wrote:
> > On Sat, Dec 22, 2018 at 12:09:44AM +0800, Yu Zhang wrote:
> > > Well, my understanding of the vt-d spec is that the address limitation in
> > > the DMAR refers to the same concept as CPUID.MAXPHYSADDR. I do not think
> > > there's any difference in the native scenario. :)
> > 
> > I think native machines exist on which the two values are different.
> > Is that true?
> 
> I think the answer is no. My understanding is that HAW (host address width) is
> the maximum physical address width a CPU can detect (via cpuid.0x80000008).
> 
> I agree there are some addresses the CPU does not touch, but they are still in
> the physical address space, and there's only one physical address space...
> 
> B.R.
> Yu

Ouch, I thought we were talking about the virtual address size.
I think I did have a box where VTD's virtual address size was
smaller than the CPU's.
For the physical one - we just need to make it as big as the max supported
memory, right?

-- 
MST


* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-21 19:02                       ` Michael S. Tsirkin
@ 2018-12-21 20:01                         ` Eduardo Habkost
  2018-12-22  1:11                         ` Yu Zhang
  1 sibling, 0 replies; 57+ messages in thread
From: Eduardo Habkost @ 2018-12-21 20:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Yu Zhang, Igor Mammedov, qemu-devel, Peter Xu, Paolo Bonzini,
	Richard Henderson

On Fri, Dec 21, 2018 at 02:02:28PM -0500, Michael S. Tsirkin wrote:
> On Sat, Dec 22, 2018 at 01:37:58AM +0800, Yu Zhang wrote:
> > On Fri, Dec 21, 2018 at 12:04:49PM -0500, Michael S. Tsirkin wrote:
> > > On Sat, Dec 22, 2018 at 12:09:44AM +0800, Yu Zhang wrote:
> > > > Well, my understanding of the vt-d spec is that the address limitation in
> > > > the DMAR refers to the same concept as CPUID.MAXPHYSADDR. I do not think
> > > > there's any difference in the native scenario. :)
> > > 
> > > I think native machines exist on which the two values are different.
> > > Is that true?
> > 
> > I think the answer is no. My understanding is that HAW (host address width) is
> > the maximum physical address width a CPU can detect (via cpuid.0x80000008).
> > 
> > I agree there are some addresses the CPU does not touch, but they are still in
> > the physical address space, and there's only one physical address space...
> > 
> > B.R.
> > Yu
> 
> Ouch, I thought we were talking about the virtual address size.
> I think I did have a box where VTD's virtual address size was
> smaller than the CPU's.
> For the physical one - we just need to make it as big as the max supported
> memory, right?

What exactly do you mean by "max supported memory"?

-- 
Eduardo


* Re: [Qemu-devel] [PATCH v3 2/2] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.
  2018-12-21 18:10                                 ` Michael S. Tsirkin
@ 2018-12-22  0:41                                   ` Yu Zhang
  2018-12-25 17:00                                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 57+ messages in thread
From: Yu Zhang @ 2018-12-22  0:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eduardo Habkost, qemu-devel, Peter Xu, Igor Mammedov,
	Paolo Bonzini, Richard Henderson

On Fri, Dec 21, 2018 at 01:10:13PM -0500, Michael S. Tsirkin wrote:
> On Sat, Dec 22, 2018 at 01:34:01AM +0800, Yu Zhang wrote:
> > On Fri, Dec 21, 2018 at 12:15:26PM -0500, Michael S. Tsirkin wrote:
> > > On Sat, Dec 22, 2018 at 12:19:20AM +0800, Yu Zhang wrote:
> > > > > I'd like to avoid poking at the CPU from VTD code. That's all.
> > > > 
> > > > OK. So for the short term, how about I remove the check of the host CPU, and add a TODO
> > > > in the comments in vtd_decide_config()?
> > > 
> > > My question would be what happens on an incorrect use?
> > 
> > I believe the vfio_dma_map will return failure for an incorrect use.
> > 
> > > And how does user figure out which values to set?
> > 
> > Well, for now I don't think the user can figure it out. E.g. if we expose a vIOMMU with
> > 48-bit IOVA capability, yet the host only supports 39-bit IOVA, vfio shall return failure,
> > but the user does not know whose fault it is.
> > > 
> > > > As to the check against hardware IOMMU, Peter once had a proposal in
> > > > http://lists.nongnu.org/archive/html/qemu-devel/2018-11/msg02281.html
> > > > 
> > > > Do you have any comment or suggestion on Peter's proposal?
> > > 
> > > Sounds reasonable to me. Do we do it on vfio attach or unconditionally?
> > > 
> > 
> > I guess on vfio attach? Will need more thinking on it.
> 
> 
> Things like live migration (e.g. after hot removal of the vfio device)
> are also concerns.

Sorry, why would live migration be a problem? I mean, if the DMA address
width of the vIOMMU does not match the host IOMMU's, we can just refuse to create
the VM, so there's no live migration. 

> 
> > > 
> > > > I still do not quite know
> > > > how to do it for now...
> > > > 
> > > > [...]
> > > > 
> > > > 
> > > > B.R.
> > > > Yu
> > > 
> > > 
> > > 
> > > -- 
> > > MST
> > 
> > B.R.
> > Yu

B.R.
Yu


* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-21 19:02                       ` Michael S. Tsirkin
  2018-12-21 20:01                         ` Eduardo Habkost
@ 2018-12-22  1:11                         ` Yu Zhang
  2018-12-25 16:56                           ` Michael S. Tsirkin
  1 sibling, 1 reply; 57+ messages in thread
From: Yu Zhang @ 2018-12-22  1:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Igor Mammedov, Eduardo Habkost, qemu-devel, Peter Xu,
	Paolo Bonzini, Richard Henderson

On Fri, Dec 21, 2018 at 02:02:28PM -0500, Michael S. Tsirkin wrote:
> On Sat, Dec 22, 2018 at 01:37:58AM +0800, Yu Zhang wrote:
> > On Fri, Dec 21, 2018 at 12:04:49PM -0500, Michael S. Tsirkin wrote:
> > > On Sat, Dec 22, 2018 at 12:09:44AM +0800, Yu Zhang wrote:
> > > > Well, my understanding of the vt-d spec is that the address limitation in
> > > > the DMAR refers to the same concept as CPUID.MAXPHYSADDR. I do not think
> > > > there's any difference in the native scenario. :)
> > > 
> > > I think native machines exist on which the two values are different.
> > > Is that true?
> > 
> > I think the answer is no. My understanding is that HAW (host address width) is
> > the maximum physical address width a CPU can detect (via cpuid.0x80000008).
> > 
> > I agree there are some addresses the CPU does not touch, but they are still in
> > the physical address space, and there's only one physical address space...
> > 
> > B.R.
> > Yu
> 
> Ouch, I thought we were talking about the virtual address size.
> I think I did have a box where VTD's virtual address size was
> smaller than the CPU's.
> For the physical one - we just need to make it as big as the max supported
> memory, right?

Well, my understanding of the physical one is the maximum physical address
width. Sorry, this explanation may sound confusing... I mean, it's not just about
the max supported memory; it also covers MMIO. It shall be detectable
from cpuid, or from ACPI's DMAR table, instead of being calculated from the max memory
size. One common usage of this value is to tell the paging structure entries
(the CPU's or the IOMMU's) which bits shall be reserved. There are also some registers,
e.g. the apic base reg etc., whose contents are physical addresses and therefore also
need to follow the same requirement for the reserved bits.

So I think the correct direction might be to define this property in the
machine status level, instead of the CPU level. Is this reasonable to you?

> 
> -- 
> MST

B.R.
Yu


* Re: [Qemu-devel] [PATCH v3 2/2] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.
  2018-12-21 17:34                               ` Yu Zhang
  2018-12-21 18:10                                 ` Michael S. Tsirkin
@ 2018-12-25  1:59                                 ` Tian, Kevin
  1 sibling, 0 replies; 57+ messages in thread
From: Tian, Kevin @ 2018-12-25  1:59 UTC (permalink / raw)
  To: Yu Zhang, Michael S. Tsirkin
  Cc: Eduardo Habkost, qemu-devel, Peter Xu, Paolo Bonzini,
	Igor Mammedov, Richard Henderson, Liu, Yi L

> From: Yu Zhang
> Sent: Saturday, December 22, 2018 1:34 AM
> 
[...]
> >
> > > As to the check against hardware IOMMU, Peter once had a proposal in
> > > http://lists.nongnu.org/archive/html/qemu-devel/2018-
> 11/msg02281.html
> > >
> > > Do you have any comment or suggestion on Peter's proposal?
> >
> > Sounds reasonable to me. Do we do it on vfio attach or unconditionally?
> >
> 
> I guess on vfio attach? Will need more thinking on it.
> 

either way is not perfect. An unconditional check doesn't make sense if
there is no vfio device attached, while vfio attach might happen late
(e.g. hotplug), after the vIOMMU is initialized...

Basically there are two checks to be concerned with. One is the check at
boot time, which decides the vIOMMU capabilities. The other is the check
at vfio attach, which decides whether the attachment can succeed (i.e.
whether the vIOMMU capabilities used by the device are indeed
supported by hardware).

Possibly we can make the boot-time check configurable.

If the boot-time check is turned on, vIOMMU capabilities are always a
subset of the pIOMMU's, regardless of whether a vfio device is attached.
The check on vfio attach may be skipped, since it will always pass. Virtual
devices also bear the same limitations as the pIOMMU.

If the boot-time check is off, vIOMMU capabilities are always specified
by the end user, and might differ from the pIOMMU's. Virtual devices
can use any capability, but vfio attach may fail if the required vIOMMU
capabilities are not supported by the pIOMMU.

btw, 5-level is just one example of a capability demanding a check against the pIOMMU.
There are more when emulating VT-d scalable mode (+Yi).

Thanks
Kevin


* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-22  1:11                         ` Yu Zhang
@ 2018-12-25 16:56                           ` Michael S. Tsirkin
  2018-12-26  5:30                             ` Yu Zhang
  0 siblings, 1 reply; 57+ messages in thread
From: Michael S. Tsirkin @ 2018-12-25 16:56 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Igor Mammedov, Eduardo Habkost, qemu-devel, Peter Xu,
	Paolo Bonzini, Richard Henderson

On Sat, Dec 22, 2018 at 09:11:26AM +0800, Yu Zhang wrote:
> On Fri, Dec 21, 2018 at 02:02:28PM -0500, Michael S. Tsirkin wrote:
> > On Sat, Dec 22, 2018 at 01:37:58AM +0800, Yu Zhang wrote:
> > > On Fri, Dec 21, 2018 at 12:04:49PM -0500, Michael S. Tsirkin wrote:
> > > > On Sat, Dec 22, 2018 at 12:09:44AM +0800, Yu Zhang wrote:
> > > > > Well, my understanding of the vt-d spec is that the address limitation in
> > > > > the DMAR refers to the same concept as CPUID.MAXPHYSADDR. I do not think
> > > > > there's any difference in the native scenario. :)
> > > > 
> > > > I think native machines exist on which the two values are different.
> > > > Is that true?
> > > 
> > > I think the answer is no. My understanding is that HAW (host address width) is
> > > the maximum physical address width a CPU can detect (via cpuid.0x80000008).
> > > 
> > > I agree there are some addresses the CPU does not touch, but they are still in
> > > the physical address space, and there's only one physical address space...
> > > 
> > > B.R.
> > > Yu
> > 
> > Ouch, I thought we were talking about the virtual address size.
> > I think I did have a box where VTD's virtual address size was
> > smaller than the CPU's.
> > For the physical one - we just need to make it as big as the max supported
> > memory, right?
> 
> Well, my understanding of the physical one is the maximum physical address
> width. Sorry, this explanation may sound confusing... I mean, it's not just about
> the max supported memory; it also covers MMIO. It shall be detectable
> from cpuid, or from ACPI's DMAR table, instead of being calculated from the max memory
> size. One common usage of this value is to tell the paging structure entries
> (the CPU's or the IOMMU's) which bits shall be reserved. There are also some registers,
> e.g. the apic base reg etc., whose contents are physical addresses and therefore also
> need to follow the same requirement for the reserved bits.
> 
> So I think the correct direction might be to define this property in the
> machine status level, instead of the CPU level. Is this reasonable to you?

At that level yes. But isn't this already specified by "pci-hole64-end"?



> > 
> > -- 
> > MST
> 
> B.R.
> Yu
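The reserved-bit requirement quoted in this message can be sketched as follows. This is only a hedged illustration, not QEMU's actual code: `sl_entry_rsvd_mask` is a hypothetical name, and the sketch assumes a second-level paging entry whose address field spans bits 12..51 (4KB pages).

```c
#include <assert.h>
#include <stdint.h>

/* Bits of a second-level paging entry that must be zero when the host
 * address width is haw_bits: everything in the address field (bits
 * 12..51) at or above haw_bits.  Using the IOVA width (e.g. 48) here
 * instead of the HAW (e.g. 46) would wrongly treat bits 46..47 as
 * usable, letting an invalid address through. */
static uint64_t sl_entry_rsvd_mask(unsigned haw_bits)
{
    return ((1ULL << 52) - 1) & ~((1ULL << haw_bits) - 1);
}
```

With haw_bits = 46 the mask covers bits 46..51, so an entry pointing above the host address space is rejected even though a 48-bit IOVA mask would have let it pass.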

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 2/2] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.
  2018-12-22  0:41                                   ` Yu Zhang
@ 2018-12-25 17:00                                     ` Michael S. Tsirkin
  2018-12-26  5:58                                       ` Yu Zhang
  0 siblings, 1 reply; 57+ messages in thread
From: Michael S. Tsirkin @ 2018-12-25 17:00 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Eduardo Habkost, qemu-devel, Peter Xu, Igor Mammedov,
	Paolo Bonzini, Richard Henderson

On Sat, Dec 22, 2018 at 08:41:37AM +0800, Yu Zhang wrote:
> On Fri, Dec 21, 2018 at 01:10:13PM -0500, Michael S. Tsirkin wrote:
> > On Sat, Dec 22, 2018 at 01:34:01AM +0800, Yu Zhang wrote:
> > > On Fri, Dec 21, 2018 at 12:15:26PM -0500, Michael S. Tsirkin wrote:
> > > > On Sat, Dec 22, 2018 at 12:19:20AM +0800, Yu Zhang wrote:
> > > > > > I'd like to avoid poking at the CPU from VTD code. That's all.
> > > > > 
> > > > > OK. So for the short term, how about I remove the check of the host CPU, and add a TODO
> > > > > in the comments in vtd_decide_config()? 
> > > > 
> > > > My question would be what happens on an incorrect use?
> > > 
> > > I believe the vfio_dma_map will return failure for an incorrect use.
> > > 
> > > > And how does user figure out which values to set?
> > > 
> > > Well, for now I don't think the user can figure it out. E.g. if we expose a vIOMMU with
> > > 48-bit IOVA capability, yet the host only supports 39-bit IOVA, vfio shall return failure,
> > > but the user will not know whose fault it is.
> > > > 
> > > > > As to the check against hardware IOMMU, Peter once had a proposal in
> > > > > http://lists.nongnu.org/archive/html/qemu-devel/2018-11/msg02281.html
> > > > > 
> > > > > Do you have any comment or suggestion on Peter's proposal?
> > > > 
> > > > Sounds reasonable to me. Do we do it on vfio attach or unconditionally?
> > > > 
> > > 
> > > I guess on vfio attach? Will need more thought on it.
> > 
> > 
> > Things like live migration (e.g. after hot removal of the vfio device)
> > are also concerns.
> 
> Sorry, why would live migration be a problem? I mean, if the DMA address
> width of the vIOMMU does not match the host IOMMU's, we can just refuse to
> create the VM, so there's no live migration. 

I don't see code like this though.

Also, management needs to somehow be able to figure out that migration
will fail. It's not nice to transfer all memory and then have it fail
when the viommu is migrated.  So from that POV a flag is better. It can be
validated against host capabilities.

We can still have something like aw=host just like cpu host.

> > 
> > > > 
> > > > > I still do not quite know
> > > > > how to do it for now...
> > > > > 
> > > > > [...]
> > > > > 
> > > > > 
> > > > > B.R.
> > > > > Yu
> > > > 
> > > > 
> > > > 
> > > > -- 
> > > > MST
> > > 
> > > B.R.
> > > Yu
> 
> B.R.
> Yu

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-25 16:56                           ` Michael S. Tsirkin
@ 2018-12-26  5:30                             ` Yu Zhang
  2018-12-27 15:14                               ` Eduardo Habkost
  0 siblings, 1 reply; 57+ messages in thread
From: Yu Zhang @ 2018-12-26  5:30 UTC (permalink / raw)
  To: Michael S. Tsirkin, Eduardo Habkost
  Cc: qemu-devel, Peter Xu, Paolo Bonzini, Igor Mammedov, Richard Henderson

On Tue, Dec 25, 2018 at 11:56:19AM -0500, Michael S. Tsirkin wrote:
> On Sat, Dec 22, 2018 at 09:11:26AM +0800, Yu Zhang wrote:
> > On Fri, Dec 21, 2018 at 02:02:28PM -0500, Michael S. Tsirkin wrote:
> > > On Sat, Dec 22, 2018 at 01:37:58AM +0800, Yu Zhang wrote:
> > > > On Fri, Dec 21, 2018 at 12:04:49PM -0500, Michael S. Tsirkin wrote:
> > > > > On Sat, Dec 22, 2018 at 12:09:44AM +0800, Yu Zhang wrote:
> > > > > > Well, my understanding of the vt-d spec is that the address limitation in
> > > > > > DMAR refers to the same concept as CPUID.MAXPHYSADDR. I do not think
> > > > > > there's any difference in the native scenario. :)
> > > > > 
> > > > > I think native machines exist on which the two values are different.
> > > > > Is that true?
> > > > 
> > > > I think the answer is no. My understanding is that the HAW (host address width) is
> > > > the maximum physical address width a CPU can detect (via cpuid.0x80000008).
> > > > 
> > > > I agree there are some addresses the CPU does not touch, but they are still in
> > > > the physical address space, and there's only one physical address space...
> > > > 
> > > > B.R.
> > > > Yu
> > > 
> > > Ouch, I thought we were talking about the virtual address size.
> > > I think I did have a box where VTD's virtual address size was
> > > smaller than the CPU's.
> > > For the physical one - we just need to make it as big as the max supported
> > > memory, right?
> > 
> > Well, my understanding of the physical one is the maximum physical address
> > width. Sorry, that explanation was unclear... I mean, it's not just about
> > the max supported memory; it also covers MMIO. It shall be detectable
> > from cpuid, or ACPI's DMAR table, instead of being calculated from the max
> > memory size. One common use of this value is to tell the paging structure
> > entries (the CPU's or the IOMMU's) which bits shall be reserved. There are
> > also some registers, e.g. the APIC base register, whose contents are physical
> > addresses and therefore also need to follow a similar requirement for the
> > reserved bits.
> > 
> > So I think the correct direction might be to define this property in the
> > machine status level, instead of the CPU level. Is this reasonable to you?
> 
> At that level yes. But isn't this already specified by "pci-hole64-end"?

But isn't this value set by guest firmware? And will PCI hotplug change this address?

@Eduardo, do you have any plan to calculate the phys-bits from "pci-hole64-end"?
Or to introduce another property, say "max-phys-bits", at the machine level?

> 
> 
> 
> > > 
> > > -- 
> > > MST
> > 
> > B.R.
> > Yu
> 

B.R.
Yu

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 2/2] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.
  2018-12-25 17:00                                     ` Michael S. Tsirkin
@ 2018-12-26  5:58                                       ` Yu Zhang
  0 siblings, 0 replies; 57+ messages in thread
From: Yu Zhang @ 2018-12-26  5:58 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eduardo Habkost, qemu-devel, Peter Xu, Paolo Bonzini,
	Igor Mammedov, Richard Henderson

On Tue, Dec 25, 2018 at 12:00:08PM -0500, Michael S. Tsirkin wrote:
> On Sat, Dec 22, 2018 at 08:41:37AM +0800, Yu Zhang wrote:
> > On Fri, Dec 21, 2018 at 01:10:13PM -0500, Michael S. Tsirkin wrote:
> > > On Sat, Dec 22, 2018 at 01:34:01AM +0800, Yu Zhang wrote:
> > > > On Fri, Dec 21, 2018 at 12:15:26PM -0500, Michael S. Tsirkin wrote:
> > > > > On Sat, Dec 22, 2018 at 12:19:20AM +0800, Yu Zhang wrote:
> > > > > > > I'd like to avoid poking at the CPU from VTD code. That's all.
> > > > > > 
> > > > > > OK. So for the short term, how about I remove the check of the host CPU, and add a TODO
> > > > > > in the comments in vtd_decide_config()? 
> > > > > 
> > > > > My question would be what happens on an incorrect use?
> > > > 
> > > > I believe the vfio_dma_map will return failure for an incorrect use.
> > > > 
> > > > > And how does user figure out which values to set?
> > > > 
> > > > Well, for now I don't think the user can figure it out. E.g. if we expose a vIOMMU with
> > > > 48-bit IOVA capability, yet the host only supports 39-bit IOVA, vfio shall return failure,
> > > > but the user will not know whose fault it is.
> > > > > 
> > > > > > As to the check against hardware IOMMU, Peter once had a proposal in
> > > > > > http://lists.nongnu.org/archive/html/qemu-devel/2018-11/msg02281.html
> > > > > > 
> > > > > > Do you have any comment or suggestion on Peter's proposal?
> > > > > 
> > > > > Sounds reasonable to me. Do we do it on vfio attach or unconditionally?
> > > > > 
> > > > 
> > > > I guess on vfio attach? Will need more thought on it.
> > > 
> > > 
> > > Things like live migration (e.g. after hot removal of the vfio device)
> > > are also concerns.
> > 
> > Sorry, why would live migration be a problem? I mean, if the DMA address
> > width of the vIOMMU does not match the host IOMMU's, we can just refuse to
> > create the VM, so there's no live migration. 
> 
> I don't see code like this though.
> 
> Also, management needs to somehow be able to figure out that migration
> will fail. It's not nice to transfer all memory and then have it fail
> when the viommu is migrated.  So from that POV a flag is better. It can be
> validated against host capabilities.
> 
> We can still have something like aw=host just like cpu host.

Well, I think the vIOMMU's requirements are somewhat different:
1> the vIOMMU could be a purely emulated one, with no physical IOMMU
underneath, and emulated devices can still use this vIOMMU;
2> there might be multiple physical IOMMUs on one platform, and I am not
sure all of them will have the same capability settings.

So I think we should have a more generic solution to check the host
capability, e.g. like Kevin's and Peter's suggestions. It's not just about
the 5-level vIOMMU; the existing 4-level vIOMMU and future virtual SVM have
similar requirements. :)

> 
> > > 
> > > > > 
> > > > > > I still do not quite know
> > > > > > how to do it for now...
> > > > > > 
> > > > > > [...]
> > > > > > 
> > > > > > 
> > > > > > B.R.
> > > > > > Yu
> > > > > 
> > > > > 
> > > > > 
> > > > > -- 
> > > > > MST
> > > > 
> > > > B.R.
> > > > Yu
> > 
> > B.R.
> > Yu
> 

B.R.
Yu
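The kind of host-capability check discussed in this message could look roughly like the sketch below. It decodes the guest-address-width fields of a VT-d capability register value (per the VT-d spec: SAGAW in bits 12:8, MGAW in bits 21:16); how QEMU would actually obtain the host's cap value (sysfs, a vfio interface, etc.) is deliberately left open, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* MGAW (bits 21:16) encodes the maximum guest address width minus one. */
static unsigned cap_mgaw(uint64_t cap)
{
    return ((cap >> 16) & 0x3f) + 1;
}

/* SAGAW (bits 12:8) is a bitmask of supported second-level page-table
 * depths: bit 1 = 39-bit (3-level), bit 2 = 48-bit (4-level),
 * bit 3 = 57-bit (5-level). */
static bool cap_supports_aw(uint64_t cap, unsigned aw_bits)
{
    unsigned sagaw = (cap >> 8) & 0x1f;

    switch (aw_bits) {
    case 39: return sagaw & (1 << 1);
    case 48: return sagaw & (1 << 2);
    case 57: return sagaw & (1 << 3);
    default: return false;
    }
}

/* A vIOMMU configured with x-aw-bits=57 would be rejected unless the
 * host cap advertises both MGAW >= 57 and 5-level SAGAW support. */
static bool host_allows_viommu_aw(uint64_t host_cap, unsigned aw_bits)
{
    return cap_mgaw(host_cap) >= aw_bits && cap_supports_aw(host_cap, aw_bits);
}
```

The same check applies unchanged to a 4-level vIOMMU on a 3-level-only host, which is why a generic host-capability query helps beyond the 57-bit case.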

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-21 14:13               ` Igor Mammedov
  2018-12-21 16:09                 ` Yu Zhang
@ 2018-12-27 14:54                 ` Eduardo Habkost
  2018-12-28 11:42                   ` Igor Mammedov
  1 sibling, 1 reply; 57+ messages in thread
From: Eduardo Habkost @ 2018-12-27 14:54 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Michael S. Tsirkin, qemu-devel, Peter Xu, Yu Zhang,
	Paolo Bonzini, Richard Henderson

On Fri, Dec 21, 2018 at 03:13:25PM +0100, Igor Mammedov wrote:
> On Thu, 20 Dec 2018 19:18:01 -0200
> Eduardo Habkost <ehabkost@redhat.com> wrote:
> 
> > On Wed, Dec 19, 2018 at 11:40:37AM +0100, Igor Mammedov wrote:
> > > On Wed, 19 Dec 2018 10:57:17 +0800
> > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > >   
> > > > On Tue, Dec 18, 2018 at 03:55:36PM +0100, Igor Mammedov wrote:  
> > > > > On Tue, 18 Dec 2018 17:27:23 +0800
> > > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > > >     
> > > > > > On Mon, Dec 17, 2018 at 02:17:40PM +0100, Igor Mammedov wrote:    
> > > > > > > On Wed, 12 Dec 2018 21:05:38 +0800
> > > > > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > > > > >     
> > > > > > > > Currently, vIOMMU is using the value of IOVA address width, instead of
> > > > > > > > the host address width(HAW) to calculate the number of reserved bits in
> > > > > > > > data structures such as root entries, context entries, and entries of
> > > > > > > > DMA paging structures etc.
> > > > > > > > 
> > > > > > > > However, the values of the IOVA address width and of the HAW may not be
> > > > > > > > equal. For example, a 48-bit IOVA can only be mapped to host addresses no
> > > > > > > > wider than 46 bits. Using 48, instead of 46, to calculate the reserved bits
> > > > > > > > may result in an invalid IOVA being accepted.
> > > > > > > > 
> > > > > > > > To fix this, a new field - haw_bits is introduced in struct IntelIOMMUState,
> > > > > > > > whose value is initialized based on the maximum physical address set to
> > > > > > > > guest CPU.    
> > > > > > >     
> > > > > > > > Also, definitions such as VTD_HOST_AW_39/48BIT etc. are renamed
> > > > > > > > to clarify.
> > > > > > > > 
> > > > > > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > > > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > > > > > > ---    
> > > > > > > [...]
> > > > > > >     
> > > > > > > > @@ -3100,6 +3104,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> > > > > > > >  static void vtd_init(IntelIOMMUState *s)
> > > > > > > >  {
> > > > > > > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > > > > > > +    CPUState *cs = first_cpu;
> > > > > > > > +    X86CPU *cpu = X86_CPU(cs);
> > > > > > > >  
> > > > > > > >      memset(s->csr, 0, DMAR_REG_SIZE);
> > > > > > > >      memset(s->wmask, 0, DMAR_REG_SIZE);
> > > > > > > > @@ -3119,23 +3125,24 @@ static void vtd_init(IntelIOMMUState *s)
> > > > > > > >      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
> > > > > > > >               VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> > > > > > > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > > > > > > -    if (s->aw_bits == VTD_HOST_AW_48BIT) {
> > > > > > > > +    if (s->aw_bits == VTD_AW_48BIT) {
> > > > > > > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > > > > > > >      }
> > > > > > > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > > > > > > +    s->haw_bits = cpu->phys_bits;    
> > > > > > > Is it possible to avoid accessing CPU fields directly or cpu altogether
> > > > > > > and set phys_bits when iommu is created?    
> > > > > > 
> > > > > > Thanks for your comments, Igor.
> > > > > > 
> > > > > > Well, I guess you prefer not to query the CPU capabilities while deciding
> > > > > > the vIOMMU features. But to me, they are not that irrelevant.:)
> > > > > > 
> > > > > > Here the hardware address width in vt-d, and the one in cpuid.MAXPHYSADDR
> > > > > > are referring to the same concept. In VM, both are the maximum guest physical
> > > > > > address width. If we do not check the CPU field here, we will still have to
> > > > > > check the CPU field in other places such as build_dmar_q35(), and reset the
> > > > > > s->haw_bits again.
> > > > > > 
> > > > > > Is this explanation convincing enough? :)    
> > > > > current build_dmar_q35() doesn't do it; it's all new code in this series that
> > > > > contains an unacceptable direct access from one device (iommu) to another (cpu).
> > > > > The proper way would be for the owner of the iommu to fish the limits from
> > > > > somewhere and set the values during iommu creation.
> > > > 
> > > > Well, current build_dmar_q35() doesn't do it, because it is using the incorrect value. :)
> > > > According to the spec, the host address width is the maximum physical address width,
> > > > yet the current implementation is using the DMA address width. For me, this is not only
> > > > wrong, but also insecure. On this point, I think we all agree this needs to be fixed.
> > > > 
> > > > As to how to fix it - should we query the cpu fields, I still do not understand why
> > > > this is not acceptable. :)
> > > > 
> > > > I had thought of other approaches before, yet I did not choose:
> > > >   
> > > > 1> Introduce a new parameter, say, "x-haw-bits", which is used by the iommu to limit its    
> > > > physical address width (similar to "x-aw-bits" for the IOVA). But should we check
> > > > this parameter or not? And what if it is set to something different from
> > > > "phys-bits"?
> > > >   
> > > > 2> Another choice I had thought of is to query the physical iommu. I abandoned this    
> > > > idea because my understanding is that the vIOMMU is not a passed-through device; it is emulated.  
> > >   
> > > > So Igor, may I ask why you think checking against the cpu fields so not acceptable? :)  
> > > Because accessing private fields of a device from another random device is not robust
> > > and is subject to breaking in unpredictable ways when field meanings or the
> > > initialization order change. (Analogy to bare metal: one does not solder a wire to a
> > > CPU die to let some random device access a piece of data.)
> > >   
> > 
> > With either the solution below or the one I proposed, we still
> > have a ordering problem: if we want "-cpu ...,phys-bits=..." to
> As Michael said, it's questionable whether the iommu should rely on the guest's
> phys-bits at all,

Agreed, this is not clear.  I don't know yet if we really want to
make "-cpu" affect other devices.  Probably not.

>                   but that aside, we should use proper interfaces
> and hierarchy to initialize devices; see below for why I dislike the
> simplistic pc_max_phys_bits().

What do you mean by proper interfaces and hierarchy?

pc_max_phys_bits() is simple, and that's supposed to be a good
thing.

> 
> > affect the IOMMU device, we will need the CPU objects to be
> > created before IOMMU realize.
> > 
> > At least both proposals make the initialization ordering
> > explicitly a responsibility of the machine code.  In either case,
> > I don't think we will start creating all CPU objects after device
> > realize any time soon.
> > 
> > 
> > > I've looked at intel-iommu code and how it's created so here is a way to do the thing
> > > you need using proper interfaces:
> > > 
> > > 1. add x-haw_bits property
> > > 2. include in your series patch
> > >     '[Qemu-devel] [PATCH] qdev: let machine hotplug handler to override  bus hotplug handler'
> > > 3. add your iommu to pc_get_hotpug_handler() to redirect plug flow to
> > >    machine and let _pre_plug handler to check and set x-haw_bits for machine level  
> > 
> > Wow, that's a very complex way to pass a single integer from
> > machine code to device code.  If this is the only way to do that,
> > we really need to take a step back and rethink our API design.
> > 
> > What's wrong with having a simple
> >   uint32_t pc_max_phys_bits(PCMachineState*)
> > function?
> As suggested, it would be only an aesthetic change to accessing first_cpu from
> a random device at a random time. The IOMMU would still access the cpu instance
> directly no matter how many wrappers one used, so it's still the same hack.
> If phys_bits changed during the VM's lifecycle and the iommu needed to use
> the updated value, then using pc_max_phys_bits() might be justified, as
> we don't have interfaces to handle that; but that's not the case here.

I don't understand what you mean here.  Which "interfaces to
handle that" you are talking about?

> 
> I suggested a typical way (albeit a bit complex) to handle device
> initialization in cases where the bus plug handler is not sufficient.
> It follows the proper hierarchy without any layer violations and can fail
> gracefully even if we start creating CPUs later using only '-device cpufoo',
> without needing to fix the iommu code to handle that (it would fail creating
> the iommu with a clear error that the CPU isn't available, and all the user has
> to do is fix the CLI to make sure that the CPU is created before the iommu).

What do you mean by "proper hierarchy" and "layer violations"?
What exactly is wrong with having device code talking to the
machine object?

You do have a point about "-device cpufoo": making "-cpu" affect
iommu phys-bits is probably not a good idea after all.

> 
> So I'd prefer if we used the existing pattern for device initialization
> instead of hacks whenever it is possible.

Why do you describe it as a hack?  It's just C code calling a C
function.  I don't see any problem in having device code talking
to the machine code to get a bit of information.


> 
> > 
> > > 4. you probably can use phys-bits/host-phys-bits properties to get data that you need
> > >    also see how ms->possible_cpus, that's how you can get access to CPU from machine
> > >    layer.
> > >   
> > [...]
> > 
> PS:
> Another thing I'd like to draw your attention to (since you recently looked at
> phys-bits) is host/guest phys_bits, and whether migration between hosts with
> different limits is safe.
> 
> 

-- 
Eduardo

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-26  5:30                             ` Yu Zhang
@ 2018-12-27 15:14                               ` Eduardo Habkost
  2018-12-28  2:32                                 ` Yu Zhang
  0 siblings, 1 reply; 57+ messages in thread
From: Eduardo Habkost @ 2018-12-27 15:14 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Michael S. Tsirkin, qemu-devel, Peter Xu, Paolo Bonzini,
	Igor Mammedov, Richard Henderson

On Wed, Dec 26, 2018 at 01:30:00PM +0800, Yu Zhang wrote:
> On Tue, Dec 25, 2018 at 11:56:19AM -0500, Michael S. Tsirkin wrote:
> > On Sat, Dec 22, 2018 at 09:11:26AM +0800, Yu Zhang wrote:
> > > On Fri, Dec 21, 2018 at 02:02:28PM -0500, Michael S. Tsirkin wrote:
> > > > On Sat, Dec 22, 2018 at 01:37:58AM +0800, Yu Zhang wrote:
> > > > > On Fri, Dec 21, 2018 at 12:04:49PM -0500, Michael S. Tsirkin wrote:
> > > > > > On Sat, Dec 22, 2018 at 12:09:44AM +0800, Yu Zhang wrote:
> > > > > > > Well, my understanding of the vt-d spec is that the address limitation in
> > > > > > > DMAR refers to the same concept as CPUID.MAXPHYSADDR. I do not think
> > > > > > > there's any difference in the native scenario. :)
> > > > > > 
> > > > > > I think native machines exist on which the two values are different.
> > > > > > Is that true?
> > > > > 
> > > > > I think the answer is no. My understanding is that the HAW (host address width) is
> > > > > the maximum physical address width a CPU can detect (via cpuid.0x80000008).
> > > > > 
> > > > > I agree there are some addresses the CPU does not touch, but they are still in
> > > > > the physical address space, and there's only one physical address space...
> > > > > 
> > > > > B.R.
> > > > > Yu
> > > > 
> > > > Ouch, I thought we were talking about the virtual address size.
> > > > I think I did have a box where VTD's virtual address size was
> > > > smaller than the CPU's.
> > > > For the physical one - we just need to make it as big as the max supported
> > > > memory, right?
> > > 
> > > Well, my understanding of the physical one is the maximum physical address
> > > width. Sorry, that explanation was unclear... I mean, it's not just about
> > > the max supported memory; it also covers MMIO. It shall be detectable
> > > from cpuid, or ACPI's DMAR table, instead of being calculated from the max
> > > memory size. One common use of this value is to tell the paging structure
> > > entries (the CPU's or the IOMMU's) which bits shall be reserved. There are
> > > also some registers, e.g. the APIC base register, whose contents are physical
> > > addresses and therefore also need to follow a similar requirement for the
> > > reserved bits.
> > > 
> > > So I think the correct direction might be to define this property in the
> > > machine status level, instead of the CPU level. Is this reasonable to you?
> > 
> > At that level yes. But isn't this already specified by "pci-hole64-end"?
> 
> But isn't this value set by guest firmware? And will PCI hotplug change this address?
> 
> @Eduardo, do you have any plan to calculate the phys-bits from "pci-hole64-end"?
> Or to introduce another property, say "max-phys-bits", at the machine level?

I agree it may make sense to make the machine code control
phys-bits instead of the CPU object.  A machine property sounds
like the simplest solution.

But I don't think we can have a meaningful discussion about
implementation if we don't agree about the command-line
interface.  We must decide what will happen to the CPU and iommu
physical address width in cases like:

  $QEMU -device intel-iommu
  $QEMU -cpu ...,phys-bits=50 -device intel-iommu
  $QEMU -cpu ...,host-phys-bits=on -device intel-iommu
  $QEMU -machine phys-bits=50 -device intel-iommu
  $QEMU -machine phys-bits=50 -cpu ...,phys-bits=48 -device intel-iommu
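One possible resolution order for the command lines listed above can be sketched as follows. This is only an illustration of the open design question, not QEMU's actual behavior; the type and function names are hypothetical, and the fallback of 40 assumes TCG's usual default for 64-bit guests.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical precedence: an explicit machine-level phys-bits wins,
 * then an explicit CPU-level phys-bits, then a default of 40.
 * 0 means "not set on the command line". */
typedef struct {
    uint32_t machine_phys_bits;  /* from -machine phys-bits=... */
    uint32_t cpu_phys_bits;      /* from -cpu ...,phys-bits=... */
} PhysBitsConfig;

static uint32_t resolve_phys_bits(const PhysBitsConfig *c)
{
    if (c->machine_phys_bits) {
        return c->machine_phys_bits;
    }
    if (c->cpu_phys_bits) {
        return c->cpu_phys_bits;
    }
    return 40;
}
```

Under this rule the last case in the list, "-machine phys-bits=50 -cpu ...,phys-bits=48", would give the iommu 50 bits while the CPU keeps 48 - exactly the kind of mismatch the command-line interface discussion needs to settle.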

-- 
Eduardo

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-27 15:14                               ` Eduardo Habkost
@ 2018-12-28  2:32                                 ` Yu Zhang
  2018-12-29  1:29                                   ` Eduardo Habkost
  0 siblings, 1 reply; 57+ messages in thread
From: Yu Zhang @ 2018-12-28  2:32 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Michael S. Tsirkin, qemu-devel, Peter Xu, Igor Mammedov,
	Paolo Bonzini, Richard Henderson

On Thu, Dec 27, 2018 at 01:14:11PM -0200, Eduardo Habkost wrote:
> On Wed, Dec 26, 2018 at 01:30:00PM +0800, Yu Zhang wrote:
> > On Tue, Dec 25, 2018 at 11:56:19AM -0500, Michael S. Tsirkin wrote:
> > > On Sat, Dec 22, 2018 at 09:11:26AM +0800, Yu Zhang wrote:
> > > > On Fri, Dec 21, 2018 at 02:02:28PM -0500, Michael S. Tsirkin wrote:
> > > > > On Sat, Dec 22, 2018 at 01:37:58AM +0800, Yu Zhang wrote:
> > > > > > On Fri, Dec 21, 2018 at 12:04:49PM -0500, Michael S. Tsirkin wrote:
> > > > > > > On Sat, Dec 22, 2018 at 12:09:44AM +0800, Yu Zhang wrote:
> > > > > > > > Well, my understanding of the vt-d spec is that the address limitation in
> > > > > > > > DMAR refers to the same concept as CPUID.MAXPHYSADDR. I do not think
> > > > > > > > there's any difference in the native scenario. :)
> > > > > > > 
> > > > > > > I think native machines exist on which the two values are different.
> > > > > > > Is that true?
> > > > > > 
> > > > > > I think the answer is no. My understanding is that the HAW (host address width) is
> > > > > > the maximum physical address width a CPU can detect (via cpuid.0x80000008).
> > > > > > 
> > > > > > I agree there are some addresses the CPU does not touch, but they are still in
> > > > > > the physical address space, and there's only one physical address space...
> > > > > > 
> > > > > > B.R.
> > > > > > Yu
> > > > > 
> > > > > Ouch, I thought we were talking about the virtual address size.
> > > > > I think I did have a box where VTD's virtual address size was
> > > > > smaller than the CPU's.
> > > > > For the physical one - we just need to make it as big as the max supported
> > > > > memory, right?
> > > > 
> > > > Well, my understanding of the physical one is the maximum physical address
> > > > width. Sorry, that explanation was unclear... I mean, it's not just about
> > > > the max supported memory; it also covers MMIO. It shall be detectable
> > > > from cpuid, or ACPI's DMAR table, instead of being calculated from the max
> > > > memory size. One common use of this value is to tell the paging structure
> > > > entries (the CPU's or the IOMMU's) which bits shall be reserved. There are
> > > > also some registers, e.g. the APIC base register, whose contents are physical
> > > > addresses and therefore also need to follow a similar requirement for the
> > > > reserved bits.
> > > > 
> > > > So I think the correct direction might be to define this property in the
> > > > machine status level, instead of the CPU level. Is this reasonable to you?
> > > 
> > > At that level yes. But isn't this already specified by "pci-hole64-end"?
> > 
> > But isn't this value set by guest firmware? And will PCI hotplug change this address?
> > 
> > @Eduardo, do you have any plan to calculate the phys-bits from "pci-hole64-end"?
> > Or to introduce another property, say "max-phys-bits", at the machine level?
> 
> I agree it may make sense to make the machine code control
> phys-bits instead of the CPU object.  A machine property sounds
> like the simplest solution.
> 
> But I don't think we can have a meaningful discussion about
> implementation if we don't agree about the command-line
> interface.  We must decide what will happen to the CPU and iommu
> physical address width in cases like:

Thanks, Eduardo.

What if we just use "-machine phys-bits=52" and remove the
"phys-bits" parameter from the CPU?

> 
>   $QEMU -device intel-iommu
>   $QEMU -cpu ...,phys-bits=50 -device intel-iommu
>   $QEMU -cpu ...,host-phys-bits=on -device intel-iommu
>   $QEMU -machine phys-bits=50 -device intel-iommu
>   $QEMU -machine phys-bits=50 -cpu ...,phys-bits=48 -device intel-iommu
> 
> -- 
> Eduardo
> 

B.R.
Yu

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-27 14:54                 ` Eduardo Habkost
@ 2018-12-28 11:42                   ` Igor Mammedov
  0 siblings, 0 replies; 57+ messages in thread
From: Igor Mammedov @ 2018-12-28 11:42 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Michael S. Tsirkin, qemu-devel, Peter Xu, Yu Zhang,
	Paolo Bonzini, Richard Henderson

On Thu, 27 Dec 2018 12:54:02 -0200
Eduardo Habkost <ehabkost@redhat.com> wrote:

> On Fri, Dec 21, 2018 at 03:13:25PM +0100, Igor Mammedov wrote:
> > On Thu, 20 Dec 2018 19:18:01 -0200
> > Eduardo Habkost <ehabkost@redhat.com> wrote:
> > 
> > > On Wed, Dec 19, 2018 at 11:40:37AM +0100, Igor Mammedov wrote:
> > > > On Wed, 19 Dec 2018 10:57:17 +0800
> > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > >   
> > > > > On Tue, Dec 18, 2018 at 03:55:36PM +0100, Igor Mammedov wrote:  
> > > > > > On Tue, 18 Dec 2018 17:27:23 +0800
> > > > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > > > >     
> > > > > > > On Mon, Dec 17, 2018 at 02:17:40PM +0100, Igor Mammedov wrote:    
> > > > > > > > On Wed, 12 Dec 2018 21:05:38 +0800
> > > > > > > > Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > > > > > >     
> > > > > > > > > Currently, vIOMMU is using the value of IOVA address width, instead of
> > > > > > > > > the host address width(HAW) to calculate the number of reserved bits in
> > > > > > > > > data structures such as root entries, context entries, and entries of
> > > > > > > > > DMA paging structures etc.
> > > > > > > > > 
> > > > > > > > > However, the values of the IOVA address width and of the HAW may not be
> > > > > > > > > equal. For example, a 48-bit IOVA can only be mapped to host addresses no
> > > > > > > > > wider than 46 bits. Using 48, instead of 46, to calculate the reserved bits
> > > > > > > > > may result in an invalid IOVA being accepted.
> > > > > > > > > 
> > > > > > > > > To fix this, a new field - haw_bits is introduced in struct IntelIOMMUState,
> > > > > > > > > whose value is initialized based on the maximum physical address set to
> > > > > > > > > guest CPU.    
> > > > > > > >     
> > > > > > > > > Also, definitions such as VTD_HOST_AW_39/48BIT etc. are renamed
> > > > > > > > > to clarify.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > > > > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > > > > > > > ---    
> > > > > > > > [...]
> > > > > > > >     
> > > > > > > > > @@ -3100,6 +3104,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> > > > > > > > >  static void vtd_init(IntelIOMMUState *s)
> > > > > > > > >  {
> > > > > > > > >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> > > > > > > > > +    CPUState *cs = first_cpu;
> > > > > > > > > +    X86CPU *cpu = X86_CPU(cs);
> > > > > > > > >  
> > > > > > > > >      memset(s->csr, 0, DMAR_REG_SIZE);
> > > > > > > > >      memset(s->wmask, 0, DMAR_REG_SIZE);
> > > > > > > > > @@ -3119,23 +3125,24 @@ static void vtd_init(IntelIOMMUState *s)
> > > > > > > > >      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
> > > > > > > > >               VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> > > > > > > > >               VTD_CAP_SAGAW_39bit | VTD_CAP_MGAW(s->aw_bits);
> > > > > > > > > -    if (s->aw_bits == VTD_HOST_AW_48BIT) {
> > > > > > > > > +    if (s->aw_bits == VTD_AW_48BIT) {
> > > > > > > > >          s->cap |= VTD_CAP_SAGAW_48bit;
> > > > > > > > >      }
> > > > > > > > >      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
> > > > > > > > > +    s->haw_bits = cpu->phys_bits;    
> > > > > > > > Is it possible to avoid accessing CPU fields directly or cpu altogether
> > > > > > > > and set phys_bits when iommu is created?    
> > > > > > > 
> > > > > > > Thanks for your comments, Igor.
> > > > > > > 
> > > > > > > Well, I guess you prefer not to query the CPU capabilities while deciding
> > > > > > > the vIOMMU features. But to me, they are closely related. :)
> > > > > > > 
> > > > > > > Here the hardware address width in vt-d and the one in cpuid.MAXPHYSADDR
> > > > > > > refer to the same concept. In a VM, both are the maximum guest physical
> > > > > > > address width. If we do not check the CPU field here, we will still have to
> > > > > > > check the CPU field in other places such as build_dmar_q35(), and reset the
> > > > > > > s->haw_bits again.
> > > > > > > 
> > > > > > > Is this explanation convincing enough? :)    
> > > > > > current build_dmar_q35() doesn't do it; it's all new code in this series that
> > > > > > contains unacceptable direct access from one device (iommu) to another (cpu).
> > > > > > The proper way would be for the owner of the iommu to fetch the limits from
> > > > > > somewhere and set the values during iommu creation.
> > > > > 
> > > > > Well, current build_dmar_q35() doesn't do it because it is using the incorrect value. :)
> > > > > According to the spec, the host address width is the maximum physical address width,
> > > > > yet the current implementation is using the DMA address width. To me, this is not only
> > > > > wrong, but also insecure. On this point, I think we all agree it needs to be fixed.
> > > > > 
> > > > > As to how to fix it (i.e. whether we should query the cpu fields), I still do not
> > > > > understand why this is not acceptable. :)
> > > > > 
> > > > > I had thought of other approaches before, yet I did not choose:
> > > > >   
> > > > > 1> Introduce a new parameter, say, "x-haw-bits", which is used by the iommu to limit its
> > > > > physical address width (similar to "x-aw-bits" for the IOVA). But should we validate
> > > > > this parameter at all? And what if it is set to something different from the CPU's
> > > > > "phys-bits"?
> > > > >   
> > > > > 2> Another choice I had thought of is to query the physical iommu. I abandoned this
> > > > > idea because my understanding is that the vIOMMU is not a passed-through device; it is emulated.
> > > >   
> > > > > > So Igor, may I ask why you think checking against the cpu fields is so unacceptable? :)
> > > > Because accessing private fields of a device from another random device is not robust
> > > > and is subject to breaking in unpredictable ways when a field's meaning or initialization
> > > > order changes. (Analogy to bare metal: one does not solder a wire onto a CPU die to let
> > > > some random device access a piece of data.)
> > > >   
> > > 
> > > With either the solution below or the one I proposed, we still
> > > have an ordering problem: if we want "-cpu ...,phys-bits=..." to
> > As Michael said, it's questionable if iommu should rely on guest's
> > phys-bits at all,
> 
> Agreed, this is not clear.  I don't know yet if we really want to
> make "-cpu" affect other devices.  Probably not.
> 
> >                   but that aside we should use proper interfaces
> > and hierarchy to initialize devices, see below why I dislike
> > simplistic pc_max_phys_bits().
> 
> What do you mean by proper interfaces and hierarchy?
set properties on the created iommu object from one of its parents
(SysBus or machine)

> 
> pc_max_phys_bits() is simple, and that's supposed to be a good
> thing.
first_cpu->phys_bits is even simpler, shouldn't we use it then?

it only seems simple, but with this approach one would end up
creating a bunch of custom APIs for every little thing, then trying
to generalize them to share with other machine types, pushing APIs
into the generic machine code where they are irrelevant for most
machines. So one ends up with a lot of hard-to-manage APIs that are
called at random times by devices.

> > > affect the IOMMU device, we will need the CPU objects to be
> > > created before IOMMU realize.
> > > 
> > > At least both proposals make the initialization ordering
> > > explicitly a responsibility of the machine code.  In either case,
> > > I don't think we will start creating all CPU objects after device
> > > realize any time soon.
> > > 
> > > 
> > > > I've looked at intel-iommu code and how it's created so here is a way to do the thing
> > > > you need using proper interfaces:
> > > > 
> > > > 1. add x-haw_bits property
> > > > 2. include in your series patch
> > > >     '[Qemu-devel] [PATCH] qdev: let machine hotplug handler to override  bus hotplug handler'
> > > > 3. add your iommu to pc_get_hotplug_handler() to redirect the plug flow to the
> > > >    machine, and let the _pre_plug handler check and set x-haw_bits at the machine level  
> > > 
> > > Wow, that's a very complex way to pass a single integer from
> > > machine code to device code.  If this is the only way to do that,
> > > we really need to take a step back and rethink our API design.
> > > 
> > > What's wrong with having a simple
> > >   uint32_t pc_max_phys_bits(PCMachineState*)
> > > function?
> > As suggested, it would be only an aesthetic change to accessing first_cpu from a
> > random device at a random time. The IOMMU would still access the cpu instance
> > directly no matter how many wrappers one used, so it's still the same hack.
> > If phys_bits changed during the VM's lifecycle and the iommu needed to use the
> > updated value, then using pc_max_phys_bits() might be justified, as we don't
> > have interfaces to handle that; but that's not the case here.
> 
> I don't understand what you mean here.  Which "interfaces to
> handle that" you are talking about?
There is HotplugHandler and its pre_plug() handler, which initializes
a device being created before its realize() method is called.
In the iommu case, I suggest the machine set the iommu::phys_bits property
when the device is created, at pre_plug() time, and fail gracefully if that's
not possible. It's a bit more complex than pc_max_phys_bits(), but it
follows the QOM device life-cycle just as expected, without unnecessary
relations.

> > I suggested a typical way (albeit a bit complex) to handle device
> > initialization in cases where bus plug handler is not sufficient.
> > It follows proper hierarchy without any layer violations and can fail
> > gracefully even if we start creating CPUs later using only '-device cpufoo'
> > without any need to fix the iommu code to handle that (it would fail creating
> > the iommu with a clear error that the CPU isn't available, and all the user has
> > to do is fix the CLI to make sure the CPU is created before the iommu).
> 
> What do you mean by "proper hierarchy" and "layer violations"?

it means that an object shouldn't reach into its parent to fetch
a random bit of configuration (ideally the child shouldn't be aware of
the parent's existence at all); instead, it's the parent's responsibility
to configure the child during its creation and set all the necessary
properties/resources for the child to function properly.

That model was used in qdev and it's still true with the QOM device
models we use now. The difference is that instead of configuring
device fields directly, the QOM approach uses the more unified
object_new / set-properties / realize workflow.

> What exactly is wrong with having device code talking to the
> machine object?

it breaks device abstraction boundaries and introduces unnecessary
relationship between devices instead of reusing existing device
initialization framework.

It will also be a problem if we start isolating device models/backends
into separate processes, as one would have to add/maintain/secure ABIs
for 'simple' APIs where the property-setting ABI would be sufficient. 


> You do have a point about "-device cpufoo": making "-cpu" affect
> iommu phys-bits is probably not a good idea after all.
> 
> > 
> > So I'd prefer if we used the existing pattern for device initialization
> > instead of hacks whenever it is possible.
> 
> Why do you describe it as a hack?  It's just C code calling a C
> function.  I don't see any problem in having device code talking
> to the machine code to get a bit of information.

(Well, we can get rid of a bunch of properties and query QemuOpts
directly from each device whenever configuration info is needed,
it's just C code calling C functions, after all)

Writing and calling a random set of C functions at random times is fine
if we give up on modeling devices as QOM objects (QEMU was like that
at some point), but that becomes unmanageable as complexity grows.
 
 
> > > > 4. you probably can use phys-bits/host-phys-bits properties to get data that you need
> > > >    also see ms->possible_cpus; that's how you can get access to the CPU from
> > > >    the machine layer.
> > > >   
> > > [...]
> > > 
> > PS:
> > Another thing I'd like to draw your attention to (since you recently looked at
> > phys-bits) is the host/guest phys_bits and whether it's safe, from a migration
> > point of view, between hosts with different limits.
> > 
> > 
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-28  2:32                                 ` Yu Zhang
@ 2018-12-29  1:29                                   ` Eduardo Habkost
  2019-01-15  7:13                                     ` Yu Zhang
  0 siblings, 1 reply; 57+ messages in thread
From: Eduardo Habkost @ 2018-12-29  1:29 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Michael S. Tsirkin, qemu-devel, Peter Xu, Igor Mammedov,
	Paolo Bonzini, Richard Henderson

On Fri, Dec 28, 2018 at 10:32:59AM +0800, Yu Zhang wrote:
> On Thu, Dec 27, 2018 at 01:14:11PM -0200, Eduardo Habkost wrote:
> > On Wed, Dec 26, 2018 at 01:30:00PM +0800, Yu Zhang wrote:
> > > On Tue, Dec 25, 2018 at 11:56:19AM -0500, Michael S. Tsirkin wrote:
> > > > On Sat, Dec 22, 2018 at 09:11:26AM +0800, Yu Zhang wrote:
> > > > > On Fri, Dec 21, 2018 at 02:02:28PM -0500, Michael S. Tsirkin wrote:
> > > > > > On Sat, Dec 22, 2018 at 01:37:58AM +0800, Yu Zhang wrote:
> > > > > > > On Fri, Dec 21, 2018 at 12:04:49PM -0500, Michael S. Tsirkin wrote:
> > > > > > > > On Sat, Dec 22, 2018 at 12:09:44AM +0800, Yu Zhang wrote:
> > > > > > > > > Well, my understanding of the vt-d spec is that the address limitation in
> > > > > > > > > DMAR refers to the same concept as CPUID.MAXPHYSADDR. I do not think
> > > > > > > > > there's any difference in the native scenario. :)
> > > > > > > > 
> > > > > > > > I think native machines exist on which the two values are different.
> > > > > > > > Is that true?
> > > > > > > 
> > > > > > > I think the answer is no. My understanding is that HAW (host address width) is
> > > > > > > the maximum physical address width a CPU can detect (via cpuid.0x80000008).
> > > > > > > 
> > > > > > > I agree there are some addresses the CPU does not touch, but they are still in
> > > > > > > the physical address space, and there's only one physical address space...
> > > > > > > 
> > > > > > > B.R.
> > > > > > > Yu
> > > > > > 
> > > > > > Ouch I thought we are talking about the virtual address size.
> > > > > > I think I did have a box where VTD's virtual address size was
> > > > > > smaller than CPU's.
> > > > > > For physical one - we just need to make it as big as max supported
> > > > > > memory right?
> > > > > 
> > > > > Well, my understanding of the physical one is the maximum physical address
> > > > > width. Sorry, that explanation may sound like nonsense... I mean, it's not just about
> > > > > the max supported memory, but also covers MMIO. It shall be detectable
> > > > > from cpuid, or ACPI's DMAR table, instead of calculated by the max memory
> > > > > size. One common usage of this value is to tell the paging structure entries(
> > > > > CPU's or IOMMU's) which bits shall be reserved. There are also some registers
> > > > > e.g. apic base reg etc, whose contents are physical addresses, therefore also
> > > > > need to follow the similar requirement for the reserved bits.
> > > > > 
> > > > > So I think the correct direction might be to define this property in the
> > > > > machine status level, instead of the CPU level. Is this reasonable to you?
> > > > 
> > > > At that level yes. But isn't this already specified by "pci-hole64-end"?
> > > 
> > > But this value is set by guest firmware? Will PCI hotplug change this address?
> > > 
> > > @Eduardo, do you have any plan to calculate the phys-bits by "pci-hole64-end"?
> > > Or introduce another property, say "max-phys-bits" in machine status?
> > 
> > I agree it may make sense to make the machine code control
> > phys-bits instead of the CPU object.  A machine property sounds
> > like the simplest solution.
> > 
> > But I don't think we can have a meaningful discussion about
> > implementation if we don't agree about the command-line
> > interface.  We must decide what will happen to the CPU and iommu
> > physical address width in cases like:
> 
> Thanks, Eduardo.
> 
> What about we just use "-machine phys-bits=52", and remove the
> "phys-bits" from CPU parameter?

Maybe we can deprecate it, but we can't remove it immediately.
We still need to decide what to do on the cases below, while the
option is still available.

> 
> > 
> >   $QEMU -device intel-iommu
> >   $QEMU -cpu ...,phys-bits=50 -device intel-iommu
> >   $QEMU -cpu ...,host-phys-bits=on -device intel-iommu
> >   $QEMU -machine phys-bits=50 -device intel-iommu
> >   $QEMU -machine phys-bits=50 -cpu ...,phys-bits=48 -device intel-iommu
> > 
> > -- 
> > Eduardo
> > 
> 
> B.R.
> Yu

-- 
Eduardo

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/2] intel-iommu: add support for 5-level virtual IOMMU.
  2018-12-12 13:05 [Qemu-devel] [PATCH v3 0/2] intel-iommu: add support for 5-level virtual IOMMU Yu Zhang
                   ` (2 preceding siblings ...)
  2018-12-14  9:17 ` [Qemu-devel] [PATCH v3 0/2] intel-iommu: add support for 5-level virtual IOMMU Yu Zhang
@ 2019-01-15  4:02 ` Michael S. Tsirkin
  2019-01-15  7:27   ` Yu Zhang
  3 siblings, 1 reply; 57+ messages in thread
From: Michael S. Tsirkin @ 2019-01-15  4:02 UTC (permalink / raw)
  To: Yu Zhang
  Cc: qemu-devel, Igor Mammedov, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost, Peter Xu

On Wed, Dec 12, 2018 at 09:05:37PM +0800, Yu Zhang wrote:
> Intel's upcoming processors will extend maximum linear address width to
> 57 bits, and introduce 5-level paging for CPU. Meanwhile, the platform
> will also extend the maximum guest address width for IOMMU to 57 bits,
> thus introducing the 5-level paging for 2nd level translation(See chapter
> 3 in Intel Virtualization Technology for Directed I/O). 
> 
> This patch series extends the current logic to support a wider address width.
> A 5-level paging capable IOMMU(for 2nd level translation) can be rendered
> with configuration "device intel-iommu,x-aw-bits=57".
> 
> Also, kvm-unit-tests were updated to verify this patch series. Patch for
> the test was sent out at: https://www.spinics.net/lists/kvm/msg177425.html.
> 
> Note: this patch series checks the existence of 5-level paging in the host
> and in the guest, and rejects configurations for a 57-bit IOVA if either check
> fails (VT-d hardware shall not support 57-bit IOVA on platforms without CPU
> 5-level paging). However, current vIOMMU implementation still lacks logic to
> check against the physical IOMMU capability, future enhancements are expected
> to do this.
> 
> Changes in V3: 
> - Address comments from Peter Xu: squash the 3rd patch in v2 into the 2nd
>   patch in this version.
> - Added "Reviewed-by: Peter Xu <peterx@redhat.com>"
> 
> Changes in V2: 
> - Address comments from Peter Xu: add haw member in vtd_page_walk_info.
> - Address comments from Peter Xu: only searches for 4K/2M/1G mappings in
> iotlb are meaningful. 
> - Address comments from Peter Xu: cover letter changes(e.g. mention the test
> patch in kvm-unit-tests).
> - Coding style changes.
> ---
> Cc: "Michael S. Tsirkin" <mst@redhat.com> 
> Cc: Igor Mammedov <imammedo@redhat.com> 
> Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com> 
> Cc: Richard Henderson <rth@twiddle.net> 
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Cc: Peter Xu <peterx@redhat.com>


OK is this going anywhere?
How about dropping cpu flags probing for now, you can
always revisit it later.
Will make it maybe a bit less user friendly but OTOH
uncontriversial...

> ---
> 
> Yu Zhang (2):
>   intel-iommu: differentiate host address width from IOVA address width.
>   intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.
> 
>  hw/i386/acpi-build.c           |  2 +-
>  hw/i386/intel_iommu.c          | 96 +++++++++++++++++++++++++++++-------------
>  hw/i386/intel_iommu_internal.h | 10 ++++-
>  include/hw/i386/intel_iommu.h  | 10 +++--
>  4 files changed, 81 insertions(+), 37 deletions(-)
> 
> -- 
> 1.9.1

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2018-12-29  1:29                                   ` Eduardo Habkost
@ 2019-01-15  7:13                                     ` Yu Zhang
  2019-01-18  7:10                                       ` Yu Zhang
  0 siblings, 1 reply; 57+ messages in thread
From: Yu Zhang @ 2019-01-15  7:13 UTC (permalink / raw)
  To: Eduardo Habkost, Michael S. Tsirkin, Igor Mammedov, Peter Xu
  Cc: qemu-devel, Paolo Bonzini, Richard Henderson

On Fri, Dec 28, 2018 at 11:29:41PM -0200, Eduardo Habkost wrote:
> On Fri, Dec 28, 2018 at 10:32:59AM +0800, Yu Zhang wrote:
> > On Thu, Dec 27, 2018 at 01:14:11PM -0200, Eduardo Habkost wrote:
> > > On Wed, Dec 26, 2018 at 01:30:00PM +0800, Yu Zhang wrote:
> > > > On Tue, Dec 25, 2018 at 11:56:19AM -0500, Michael S. Tsirkin wrote:
> > > > > On Sat, Dec 22, 2018 at 09:11:26AM +0800, Yu Zhang wrote:
> > > > > > On Fri, Dec 21, 2018 at 02:02:28PM -0500, Michael S. Tsirkin wrote:
> > > > > > > On Sat, Dec 22, 2018 at 01:37:58AM +0800, Yu Zhang wrote:
> > > > > > > > On Fri, Dec 21, 2018 at 12:04:49PM -0500, Michael S. Tsirkin wrote:
> > > > > > > > > On Sat, Dec 22, 2018 at 12:09:44AM +0800, Yu Zhang wrote:
> > > > > > > > > > Well, my understanding of the vt-d spec is that the address limitation in
> > > > > > > > > > DMAR refers to the same concept as CPUID.MAXPHYSADDR. I do not think
> > > > > > > > > > there's any difference in the native scenario. :)
> > > > > > > > > 
> > > > > > > > > I think native machines exist on which the two values are different.
> > > > > > > > > Is that true?
> > > > > > > > 
> > > > > > > > I think the answer is no. My understanding is that HAW (host address width) is
> > > > > > > > the maximum physical address width a CPU can detect (via cpuid.0x80000008).
> > > > > > > > 
> > > > > > > > I agree there are some addresses the CPU does not touch, but they are still in
> > > > > > > > the physical address space, and there's only one physical address space...
> > > > > > > > 
> > > > > > > > B.R.
> > > > > > > > Yu
> > > > > > > 
> > > > > > > Ouch I thought we are talking about the virtual address size.
> > > > > > > I think I did have a box where VTD's virtual address size was
> > > > > > > smaller than CPU's.
> > > > > > > For physical one - we just need to make it as big as max supported
> > > > > > > memory right?
> > > > > > 
> > > > > > Well, my understanding of the physical one is the maximum physical address
> > > > > > width. Sorry, that explanation may sound like nonsense... I mean, it's not just about
> > > > > > the max supported memory, but also covers MMIO. It shall be detectable
> > > > > > from cpuid, or ACPI's DMAR table, instead of calculated by the max memory
> > > > > > size. One common usage of this value is to tell the paging structure entries(
> > > > > > CPU's or IOMMU's) which bits shall be reserved. There are also some registers
> > > > > > e.g. apic base reg etc, whose contents are physical addresses, therefore also
> > > > > > need to follow the similar requirement for the reserved bits.
> > > > > > 
> > > > > > So I think the correct direction might be to define this property in the
> > > > > > machine status level, instead of the CPU level. Is this reasonable to you?
> > > > > 
> > > > > At that level yes. But isn't this already specified by "pci-hole64-end"?
> > > > 
> > > > But this value is set by guest firmware? Will PCI hotplug change this address?
> > > > 
> > > > @Eduardo, do you have any plan to calculate the phys-bits by "pci-hole64-end"?
> > > > Or introduce another property, say "max-phys-bits" in machine status?
> > > 
> > > I agree it may make sense to make the machine code control
> > > phys-bits instead of the CPU object.  A machine property sounds
> > > like the simplest solution.
> > > 
> > > But I don't think we can have a meaningful discussion about
> > > implementation if we don't agree about the command-line
> > > interface.  We must decide what will happen to the CPU and iommu
> > > physical address width in cases like:
> > 
> > Thanks, Eduardo.
> > 
> > What about we just use "-machine phys-bits=52", and remove the
> > "phys-bits" from CPU parameter?
> 
> Maybe we can deprecate it, but we can't remove it immediately.
> We still need to decide what to do on the cases below, while the
> option is still available.

I saw that the ACPI DMAR table is initialized in acpi_build(), which is called
by pc_machine_done(). I guess this is done after the initialization of
the vCPUs and the vIOMMU.

So I am wondering, instead of moving "phys-bits" from X86CPU into the
MachineState, maybe we could:

1> Define a "phys_bits" in MachineState or PCMachineState(not sure which
one is more suitable).

2> Set ms->phys_bits in x86_cpu_realizefn().

3> Since DMAR is created after vCPU creation, we can build DMAR table
with ms->phys_bits.

4> Also, we can reset the hardware address width for the vIOMMU (and the
vtd_paging_entry_rsvd_field array) in pc_machine_done(), based on the value
of ms->phys_bits, or from the ACPI DMAR table (from the spec's point of view,
the address width limitation of the IOMMU shall come from DMAR, yet I have not
figured out any simple approach to probe that ACPI property). 

This way, we do not need to worry about the initialization sequence of the
vCPUs and the vIOMMU, and both the DMAR and IOMMU settings come from the
machine level, which follows the spec.

Any comments? :)

B.R.
Yu

> 
> > 
> > > 
> > >   $QEMU -device intel-iommu
> > >   $QEMU -cpu ...,phys-bits=50 -device intel-iommu
> > >   $QEMU -cpu ...,host-phys-bits=on -device intel-iommu
> > >   $QEMU -machine phys-bits=50 -device intel-iommu
> > >   $QEMU -machine phys-bits=50 -cpu ...,phys-bits=48 -device intel-iommu
> > > 
> > > -- 
> > > Eduardo
> > > 
> > 
> > B.R.
> > Yu
> 
> -- 
> Eduardo
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/2] intel-iommu: add support for 5-level virtual IOMMU.
  2019-01-15  4:02 ` Michael S. Tsirkin
@ 2019-01-15  7:27   ` Yu Zhang
  0 siblings, 0 replies; 57+ messages in thread
From: Yu Zhang @ 2019-01-15  7:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eduardo Habkost, qemu-devel, Peter Xu, Paolo Bonzini,
	Igor Mammedov, Richard Henderson

On Mon, Jan 14, 2019 at 11:02:28PM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 12, 2018 at 09:05:37PM +0800, Yu Zhang wrote:
> > Intel's upcoming processors will extend maximum linear address width to
> > 57 bits, and introduce 5-level paging for CPU. Meanwhile, the platform
> > will also extend the maximum guest address width for IOMMU to 57 bits,
> > thus introducing the 5-level paging for 2nd level translation(See chapter
> > 3 in Intel Virtualization Technology for Directed I/O). 
> > 
> > This patch series extends the current logic to support a wider address width.
> > A 5-level paging capable IOMMU(for 2nd level translation) can be rendered
> > with configuration "device intel-iommu,x-aw-bits=57".
> > 
> > Also, kvm-unit-tests were updated to verify this patch series. Patch for
> > the test was sent out at: https://www.spinics.net/lists/kvm/msg177425.html.
> > 
> > Note: this patch series checks the existence of 5-level paging in the host
> > and in the guest, and rejects configurations for a 57-bit IOVA if either check
> > fails (VT-d hardware shall not support 57-bit IOVA on platforms without CPU
> > 5-level paging). However, current vIOMMU implementation still lacks logic to
> > check against the physical IOMMU capability, future enhancements are expected
> > to do this.
> > 
> > Changes in V3: 
> > - Address comments from Peter Xu: squash the 3rd patch in v2 into the 2nd
> >   patch in this version.
> > - Added "Reviewed-by: Peter Xu <peterx@redhat.com>"
> > 
> > Changes in V2: 
> > - Address comments from Peter Xu: add haw member in vtd_page_walk_info.
> > - Address comments from Peter Xu: only searches for 4K/2M/1G mappings in
> > iotlb are meaningful. 
> > - Address comments from Peter Xu: cover letter changes(e.g. mention the test
> > patch in kvm-unit-tests).
> > - Coding style changes.
> > ---
> > Cc: "Michael S. Tsirkin" <mst@redhat.com> 
> > Cc: Igor Mammedov <imammedo@redhat.com> 
> > Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com> 
> > Cc: Richard Henderson <rth@twiddle.net> 
> > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > Cc: Peter Xu <peterx@redhat.com>
> 
> 
> OK, is this going anywhere?
> How about dropping the cpu flags probing for now? You can
> always revisit it later.
> It will make it maybe a bit less user-friendly, but OTOH
> uncontroversial...

Thanks Michael, and sorry for the late reply.

Sure. For patch 2/2, I'd like to drop the cpu check.

And we are working on another patch to check the host capability.
This is supposed to be done via sysfs, similar to Peter's previous
suggestion. One difference is that our plan is to use the minimal
capability of all host VT-d hardware: for example, only allow a
4-level vIOMMU as long as any VT-d unit does not support 5-level
paging, in case we offered a 5-level vIOMMU only to find later that
a hotplugged device is bound to a 4-level VT-d unit. This patch is
not ready yet, because we also would like to cover the requirements
of scalable mode. So for now, I'm more inclined to just drop the cpu
check and add some TODO comments.

And as to 1/2, I am proposing to address the initialization problem
by resetting the haw in vIOMMU in pc_machine_done() in my another
reply. If you are OK with this direction, I'll send out the patch after
testing. :-)

B.R.
Yu

> 
> > ---
> > 
> > Yu Zhang (2):
> >   intel-iommu: differentiate host address width from IOVA address width.
> >   intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.
> > 
> >  hw/i386/acpi-build.c           |  2 +-
> >  hw/i386/intel_iommu.c          | 96 +++++++++++++++++++++++++++++-------------
> >  hw/i386/intel_iommu_internal.h | 10 ++++-
> >  include/hw/i386/intel_iommu.h  | 10 +++--
> >  4 files changed, 81 insertions(+), 37 deletions(-)
> > 
> > -- 
> > 1.9.1
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width.
  2019-01-15  7:13                                     ` Yu Zhang
@ 2019-01-18  7:10                                       ` Yu Zhang
  0 siblings, 0 replies; 57+ messages in thread
From: Yu Zhang @ 2019-01-18  7:10 UTC (permalink / raw)
  To: Eduardo Habkost, Michael S. Tsirkin, Igor Mammedov, Peter Xu
  Cc: Paolo Bonzini, qemu-devel, Richard Henderson

On Tue, Jan 15, 2019 at 03:13:14PM +0800, Yu Zhang wrote:
> On Fri, Dec 28, 2018 at 11:29:41PM -0200, Eduardo Habkost wrote:
> > On Fri, Dec 28, 2018 at 10:32:59AM +0800, Yu Zhang wrote:
> > > On Thu, Dec 27, 2018 at 01:14:11PM -0200, Eduardo Habkost wrote:
> > > > On Wed, Dec 26, 2018 at 01:30:00PM +0800, Yu Zhang wrote:
> > > > > On Tue, Dec 25, 2018 at 11:56:19AM -0500, Michael S. Tsirkin wrote:
> > > > > > On Sat, Dec 22, 2018 at 09:11:26AM +0800, Yu Zhang wrote:
> > > > > > > On Fri, Dec 21, 2018 at 02:02:28PM -0500, Michael S. Tsirkin wrote:
> > > > > > > > On Sat, Dec 22, 2018 at 01:37:58AM +0800, Yu Zhang wrote:
> > > > > > > > > On Fri, Dec 21, 2018 at 12:04:49PM -0500, Michael S. Tsirkin wrote:
> > > > > > > > > > On Sat, Dec 22, 2018 at 12:09:44AM +0800, Yu Zhang wrote:
> > > > > > > > > > > Well, my understanding of the vt-d spec is that the address limitation in
> > > > > > > > > > > DMAR refers to the same concept as CPUID.MAXPHYSADDR. I do not think
> > > > > > > > > > > there's any difference in the native scenario. :)
> > > > > > > > > > 
> > > > > > > > > > I think native machines exist on which the two values are different.
> > > > > > > > > > Is that true?
> > > > > > > > > 
> > > > > > > > > I think the answer is no. My understanding is that HAW (host address width) is
> > > > > > > > > the maximum physical address width a CPU can detect (via cpuid.0x80000008).
> > > > > > > > > 
> > > > > > > > > I agree there are some addresses the CPU does not touch, but they are still in
> > > > > > > > > the physical address space, and there's only one physical address space...
> > > > > > > > > 
> > > > > > > > > B.R.
> > > > > > > > > Yu
> > > > > > > > 
> > > > > > > > Ouch I thought we are talking about the virtual address size.
> > > > > > > > I think I did have a box where VTD's virtual address size was
> > > > > > > > smaller than CPU's.
> > > > > > > > For physical one - we just need to make it as big as max supported
> > > > > > > > memory right?
> > > > > > > 
> > > > > > > Well, my understanding of the physical one is the maximum physical address
> > > > > > > width. Sorry, that explanation may sound like nonsense... I mean, it's not just about
> > > > > > > the max supported memory, but also covers MMIO. It shall be detectable
> > > > > > > from cpuid, or ACPI's DMAR table, instead of calculated by the max memory
> > > > > > > size. One common usage of this value is to tell the paging structure entries(
> > > > > > > CPU's or IOMMU's) which bits shall be reserved. There are also some registers
> > > > > > > e.g. apic base reg etc, whose contents are physical addresses, therefore also
> > > > > > > need to follow the similar requirement for the reserved bits.
> > > > > > > 
> > > > > > > So I think the correct direction might be to define this property in the
> > > > > > > machine status level, instead of the CPU level. Is this reasonable to you?
> > > > > > 
> > > > > > At that level yes. But isn't this already specified by "pci-hole64-end"?
> > > > > 
> > > > > But this value is set by guest firmware? Will PCI hotplug change this address?
> > > > > 
> > > > > @Eduardo, do you have any plan to calculate the phys-bits by "pci-hole64-end"?
> > > > > Or introduce another property, say "max-phys-bits" in machine status?
> > > > 
> > > > I agree it may make sense to make the machine code control
> > > > phys-bits instead of the CPU object.  A machine property sounds
> > > > like the simplest solution.
> > > > 
> > > > But I don't think we can have a meaningful discussion about
> > > > implementation if we don't agree about the command-line
> > > > interface.  We must decide what will happen to the CPU and iommu
> > > > physical address width in cases like:
> > > 
> > > Thanks, Eduardo.
> > > 
> > > What about we just use "-machine phys-bits=52", and remove the
> > > "phys-bits" from CPU parameter?
> > 
> > Maybe we can deprecate it, but we can't remove it immediately.
> > We still need to decide what to do on the cases below, while the
> > option is still available.
> 
> I saw that the ACPI DMAR table is initialized in acpi_build(), which is called
> by pc_machine_done(). I guess this is done after the initialization of
> the vCPUs and the vIOMMU.
> 
> So I am wondering, instead of moving "phys-bits" from X86CPU into the
> MachineState, maybe we could:
> 
> 1> Define a "phys_bits" field in MachineState or PCMachineState (not sure
> which one is more suitable).
> 
> 2> Set ms->phys_bits in x86_cpu_realizefn().
> 
> 3> Since the DMAR table is created after vCPU creation, we can build it
> with ms->phys_bits.
> 
> 4> Also, we can reset the hardware address width for the vIOMMU (and the
> vtd_paging_entry_rsvd_field array) in pc_machine_done(), based on the value
> of ms->phys_bits, or on the ACPI DMAR table (from the spec's point of view,
> the address width limitation of the IOMMU shall come from the DMAR table,
> yet I have not figured out any simple approach to probe this ACPI property).
> 
> This way, we do not need to worry about the initialization sequence of the
> vCPU and the vIOMMU, and both the DMAR table and the IOMMU settings come
> from the machine level, which follows the spec.
> 
> Any comments? :)
> 

Ping... Any comments on this proposal? Thanks! :)

Yu

> B.R.
> Yu
> 
> > 
> > > 
> > > > 
> > > >   $QEMU -device intel-iommu
> > > >   $QEMU -cpu ...,phys-bits=50 -device intel-iommu
> > > >   $QEMU -cpu ...,host-phys-bits=on -device intel-iommu
> > > >   $QEMU -machine phys-bits=50 -device intel-iommu
> > > >   $QEMU -machine phys-bits=50 -cpu ...,phys-bits=48 -device intel-iommu
> > > > 
> > > > -- 
> > > > Eduardo
> > > > 
> > > 
> > > B.R.
> > > Yu
> > 
> > -- 
> > Eduardo
> > 
> 



Thread overview: 57+ messages
2018-12-12 13:05 [Qemu-devel] [PATCH v3 0/2] intel-iommu: add support for 5-level virtual IOMMU Yu Zhang
2018-12-12 13:05 ` [Qemu-devel] [PATCH v3 1/2] intel-iommu: differentiate host address width from IOVA address width Yu Zhang
2018-12-17 13:17   ` Igor Mammedov
2018-12-18  9:27     ` Yu Zhang
2018-12-18 14:23       ` Michael S. Tsirkin
2018-12-18 14:55       ` Igor Mammedov
2018-12-18 14:58         ` Michael S. Tsirkin
2018-12-19  3:03           ` Yu Zhang
2018-12-19  3:12             ` Michael S. Tsirkin
2018-12-19  6:28               ` Yu Zhang
2018-12-19 15:30                 ` Michael S. Tsirkin
2018-12-19  2:57         ` Yu Zhang
2018-12-19 10:40           ` Igor Mammedov
2018-12-19 16:47             ` Michael S. Tsirkin
2018-12-20  5:59               ` Yu Zhang
2018-12-20 21:18             ` Eduardo Habkost
2018-12-21 14:13               ` Igor Mammedov
2018-12-21 16:09                 ` Yu Zhang
2018-12-21 17:04                   ` Michael S. Tsirkin
2018-12-21 17:37                     ` Yu Zhang
2018-12-21 19:02                       ` Michael S. Tsirkin
2018-12-21 20:01                         ` Eduardo Habkost
2018-12-22  1:11                         ` Yu Zhang
2018-12-25 16:56                           ` Michael S. Tsirkin
2018-12-26  5:30                             ` Yu Zhang
2018-12-27 15:14                               ` Eduardo Habkost
2018-12-28  2:32                                 ` Yu Zhang
2018-12-29  1:29                                   ` Eduardo Habkost
2019-01-15  7:13                                     ` Yu Zhang
2019-01-18  7:10                                       ` Yu Zhang
2018-12-27 14:54                 ` Eduardo Habkost
2018-12-28 11:42                   ` Igor Mammedov
2018-12-20 20:58       ` Eduardo Habkost
2018-12-12 13:05 ` [Qemu-devel] [PATCH v3 2/2] intel-iommu: extend VTD emulation to allow 57-bit " Yu Zhang
2018-12-17 13:29   ` Igor Mammedov
2018-12-18  9:47     ` Yu Zhang
2018-12-18 10:01       ` Yu Zhang
2018-12-18 12:43         ` Michael S. Tsirkin
2018-12-18 13:45           ` Yu Zhang
2018-12-18 14:49             ` Michael S. Tsirkin
2018-12-19  3:40               ` Yu Zhang
2018-12-19  4:35                 ` Michael S. Tsirkin
2018-12-19  5:57                   ` Yu Zhang
2018-12-19 15:23                     ` Michael S. Tsirkin
2018-12-20  5:49                       ` Yu Zhang
2018-12-20 18:28                         ` Michael S. Tsirkin
2018-12-21 16:19                           ` Yu Zhang
2018-12-21 17:15                             ` Michael S. Tsirkin
2018-12-21 17:34                               ` Yu Zhang
2018-12-21 18:10                                 ` Michael S. Tsirkin
2018-12-22  0:41                                   ` Yu Zhang
2018-12-25 17:00                                     ` Michael S. Tsirkin
2018-12-26  5:58                                       ` Yu Zhang
2018-12-25  1:59                                 ` Tian, Kevin
2018-12-14  9:17 ` [Qemu-devel] [PATCH v3 0/2] intel-iommu: add support for 5-level virtual IOMMU Yu Zhang
2019-01-15  4:02 ` Michael S. Tsirkin
2019-01-15  7:27   ` Yu Zhang
