All of lore.kernel.org
* [Qemu-devel] [PATCH v3 0/3] IOMMU: Add Support to VFIO devices with vIOMMU present
@ 2016-05-21 16:19 Aviv B.D
  2016-05-21 16:19 ` [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest Aviv B.D
                   ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Aviv B.D @ 2016-05-21 16:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Alex Williamson, Peter Xu, Jan Kiszka,
	Aviv Ben-David

From: "Aviv Ben-David" <bd.aviv@gmail.com>

* Advertise the Cache Mode capability in the IOMMU capability register.
* Register every VFIO device with the IOMMU state.
* On page cache invalidation in the vIOMMU, check whether the domain belongs
  to a VFIO device and mirror the guest request to the host.
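The third bullet is the core of the series. A minimal self-contained sketch of that replay step, with simplified stand-in types and a hypothetical guest mapping (this is not the QEMU API; in QEMU the notifications go through memory_region_notify_iommu(), which VFIO turns into MAP/UNMAP_DMA ioctls on the host):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_4K 12
#define PAGE_MASK(shift) (~((1ULL << (shift)) - 1))

/* Hypothetical guest mapping: the first 16 pages translate read/write,
 * everything above that is unmapped. Stands in for the vIOMMU's
 * translate callback. */
static int translates(uint64_t addr)
{
    return addr < (16ULL << PAGE_SHIFT_4K);
}

typedef struct { int unmaps; int maps; } Counts;

/* Mirror a guest page-selective invalidation of 2^am pages at addr:
 * emit one unmap notification covering the whole range, then
 * re-translate each 4K page and emit a map notification for every page
 * that still translates. Here we only count the notifications. */
static Counts mirror_invalidate(uint64_t addr, uint8_t am)
{
    Counts c = { 0, 0 };

    addr &= PAGE_MASK(PAGE_SHIFT_4K);
    c.unmaps++;                        /* range-wide unmap notification */

    for (int i = 0; i < (1 << am); i++, addr += 1ULL << PAGE_SHIFT_4K) {
        if (translates(addr)) {
            c.maps++;                  /* per-page map notification */
        }
    }
    return c;
}
```

The unmap-then-remap order matters: the host mapping for the invalidated range is dropped first, so pages the guest actually unmapped stay unmapped, and only pages that still translate are re-established.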

Changes from v1 to v2:
* remove the assumption that the cache is never cleared
* fix a lockup under high load.

Changes from v2 to v3:
* remove debug leftovers
* split into separate commits
* change is_write to flags in vtd_do_iommu_translate, add IOMMU_NO_FAIL
  to suppress error propagation to the guest.


Aviv Ben-David (3):
  IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest
  IOMMU: change iommu_op->translate's is_write to flags, add support to
    NO_FAIL flag mode
  IOMMU: Integrate between VFIO and vIOMMU to support device assignment

 exec.c                         |   2 +-
 hw/i386/intel_iommu.c          | 137 ++++++++++++++++++++++++++++++++---------
 hw/i386/intel_iommu_internal.h |   3 +
 hw/vfio/common.c               |  11 +++-
 include/exec/memory.h          |   4 +-
 include/hw/i386/intel_iommu.h  |   4 ++
 include/hw/vfio/vfio-common.h  |   1 +
 memory.c                       |   2 +-
 8 files changed, 129 insertions(+), 35 deletions(-)

-- 
1.9.1

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest
  2016-05-21 16:19 [Qemu-devel] [PATCH v3 0/3] IOMMU: Add Support to VFIO devices with vIOMMU present Aviv B.D
@ 2016-05-21 16:19 ` Aviv B.D
  2016-05-21 16:42   ` Jan Kiszka
  2016-05-24  8:14   ` Jason Wang
  2016-05-21 16:19 ` [Qemu-devel] [PATCH v3 2/3] IOMMU: change iommu_op->translate's is_write to flags, add support to NO_FAIL flag mode Aviv B.D
  2016-05-21 16:19 ` [Qemu-devel] [PATCH v3 3/3] IOMMU: Integrate between VFIO and vIOMMU to support device assignment Aviv B.D
  2 siblings, 2 replies; 40+ messages in thread
From: Aviv B.D @ 2016-05-21 16:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Alex Williamson, Peter Xu, Jan Kiszka,
	Aviv Ben-David

From: "Aviv Ben-David" <bd.aviv@gmail.com>

This flag tells the guest to invalidate the TLB cache after unmap operations as well.
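For context, a tiny sketch of what the bit means to a guest driver (bit positions as defined in this patch; the helper name is illustrative, not a QEMU function). With CM set, not-present and faulted translations may be cached, so the guest must invalidate after unmaps too, and that is exactly what lets the emulator observe guest unmap operations:

```c
#include <assert.h>
#include <stdint.h>

/* Capability-register bits as used in this patch. */
#define VTD_CAP_CM   (1ULL << 7)    /* Caching Mode */
#define VTD_CAP_PSI  (1ULL << 39)   /* Page Selective Invalidation */

/* A guest driver's decision: does an unmap need an explicit
 * invalidation command, or is dropping the page table entry enough? */
static int guest_must_invalidate_on_unmap(uint64_t cap)
{
    return (cap & VTD_CAP_CM) != 0;
}
```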

Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
---
 hw/i386/intel_iommu.c          | 3 ++-
 hw/i386/intel_iommu_internal.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 347718f..1af8da8 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1949,7 +1949,8 @@ static void vtd_init(IntelIOMMUState *s)
     s->iq_last_desc_type = VTD_INV_DESC_NONE;
     s->next_frcd_reg = 0;
     s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
-             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS;
+             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
+             VTD_CAP_CM;
     s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
 
     vtd_reset_context_cache(s);
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index e5f514c..ae40f73 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -190,6 +190,7 @@
 #define VTD_CAP_MAMV                (VTD_MAMV << 48)
 #define VTD_CAP_PSI                 (1ULL << 39)
 #define VTD_CAP_SLLPS               ((1ULL << 34) | (1ULL << 35))
+#define VTD_CAP_CM                  (1ULL << 7)
 
 /* Supported Adjusted Guest Address Widths */
 #define VTD_CAP_SAGAW_SHIFT         8
-- 
1.9.1


* [Qemu-devel] [PATCH v3 2/3] IOMMU: change iommu_op->translate's is_write to flags, add support to NO_FAIL flag mode
  2016-05-21 16:19 [Qemu-devel] [PATCH v3 0/3] IOMMU: Add Support to VFIO devices with vIOMMU present Aviv B.D
  2016-05-21 16:19 ` [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest Aviv B.D
@ 2016-05-21 16:19 ` Aviv B.D
  2016-06-06  5:04   ` Peter Xu
  2016-05-21 16:19 ` [Qemu-devel] [PATCH v3 3/3] IOMMU: Integrate between VFIO and vIOMMU to support device assignment Aviv B.D
  2 siblings, 1 reply; 40+ messages in thread
From: Aviv B.D @ 2016-05-21 16:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Alex Williamson, Peter Xu, Jan Kiszka,
	Aviv Ben-David

From: "Aviv Ben-David" <bd.aviv@gmail.com>

Support translation attempts that do not report an error to the guest on
translation failure.
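The permission-check side of the new flag can be summarized in isolation (types and bit values copied from this patch and from intel_iommu_internal.h; self-contained sketch, not the QEMU code itself). A NO_FAIL lookup checks both permission bits, since the caller only wants to know whether the page is mapped at all:

```c
#include <assert.h>
#include <stdint.h>

/* The extended flags enum this patch adds to memory.h. */
typedef enum {
    IOMMU_NONE = 0,
    IOMMU_RO   = 1,
    IOMMU_WO   = 2,
    IOMMU_RW   = 3,
    IOMMU_NO_FAIL = 4,  /* translation attempt; don't fault the guest */
} IOMMUAccessFlags;

/* Second-level PTE permission bits. */
#define VTD_SL_R 1ULL
#define VTD_SL_W (1ULL << 1)

/* Mirrors the switch this patch adds to vtd_gpa_to_slpte(). */
static uint64_t access_right_check(IOMMUAccessFlags flags)
{
    switch (flags) {
    case IOMMU_WO:
        return VTD_SL_W;
    case IOMMU_RO:
        return VTD_SL_R;
    case IOMMU_RW:       /* fall through */
    case IOMMU_NO_FAIL:
        return VTD_SL_R | VTD_SL_W;
    default:
        assert(0);
        return 0;
    }
}
```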

Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
---
 exec.c                |  2 +-
 hw/i386/intel_iommu.c | 65 ++++++++++++++++++++++++++++++++-------------------
 include/exec/memory.h |  4 ++--
 memory.c              |  2 +-
 4 files changed, 45 insertions(+), 28 deletions(-)

diff --git a/exec.c b/exec.c
index 2e363f0..028c8e0 100644
--- a/exec.c
+++ b/exec.c
@@ -431,7 +431,7 @@ MemoryRegion *address_space_translate(AddressSpace *as, hwaddr addr,
             break;
         }
 
-        iotlb = mr->iommu_ops->translate(mr, addr, is_write);
+        iotlb = mr->iommu_ops->translate(mr, addr, is_write ? IOMMU_WO : IOMMU_RO);
         addr = ((iotlb.translated_addr & ~iotlb.addr_mask)
                 | (addr & iotlb.addr_mask));
         *plen = MIN(*plen, (addr | iotlb.addr_mask) - addr + 1);
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 1af8da8..410f810 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -356,7 +356,7 @@ static void vtd_set_frcd_and_update_ppf(IntelIOMMUState *s, uint16_t index)
 /* Must not update F field now, should be done later */
 static void vtd_record_frcd(IntelIOMMUState *s, uint16_t index,
                             uint16_t source_id, hwaddr addr,
-                            VTDFaultReason fault, bool is_write)
+                            VTDFaultReason fault, IOMMUAccessFlags flags)
 {
     uint64_t hi = 0, lo;
     hwaddr frcd_reg_addr = DMAR_FRCD_REG_OFFSET + (((uint64_t)index) << 4);
@@ -365,7 +365,7 @@ static void vtd_record_frcd(IntelIOMMUState *s, uint16_t index,
 
     lo = VTD_FRCD_FI(addr);
     hi = VTD_FRCD_SID(source_id) | VTD_FRCD_FR(fault);
-    if (!is_write) {
+    if (!(flags == IOMMU_WO || flags == IOMMU_RW)) {
         hi |= VTD_FRCD_T;
     }
     vtd_set_quad_raw(s, frcd_reg_addr, lo);
@@ -396,7 +396,7 @@ static bool vtd_try_collapse_fault(IntelIOMMUState *s, uint16_t source_id)
 /* Log and report an DMAR (address translation) fault to software */
 static void vtd_report_dmar_fault(IntelIOMMUState *s, uint16_t source_id,
                                   hwaddr addr, VTDFaultReason fault,
-                                  bool is_write)
+                                  IOMMUAccessFlags flags)
 {
     uint32_t fsts_reg = vtd_get_long_raw(s, DMAR_FSTS_REG);
 
@@ -407,7 +407,7 @@ static void vtd_report_dmar_fault(IntelIOMMUState *s, uint16_t source_id,
         return;
     }
     VTD_DPRINTF(FLOG, "sid 0x%"PRIx16 ", fault %d, addr 0x%"PRIx64
-                ", is_write %d", source_id, fault, addr, is_write);
+                ", flags %d", source_id, fault, addr, flags);
     if (fsts_reg & VTD_FSTS_PFO) {
         VTD_DPRINTF(FLOG, "new fault is not recorded due to "
                     "Primary Fault Overflow");
@@ -425,7 +425,7 @@ static void vtd_report_dmar_fault(IntelIOMMUState *s, uint16_t source_id,
         return;
     }
 
-    vtd_record_frcd(s, s->next_frcd_reg, source_id, addr, fault, is_write);
+    vtd_record_frcd(s, s->next_frcd_reg, source_id, addr, fault, flags);
 
     if (fsts_reg & VTD_FSTS_PPF) {
         VTD_DPRINTF(FLOG, "there are pending faults already, "
@@ -621,7 +621,7 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
 /* Given the @gpa, get relevant @slptep. @slpte_level will be the last level
  * of the translation, can be used for deciding the size of large page.
  */
-static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
+static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, IOMMUAccessFlags flags,
                             uint64_t *slptep, uint32_t *slpte_level,
                             bool *reads, bool *writes)
 {
@@ -641,7 +641,20 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
     }
 
     /* FIXME: what is the Atomics request here? */
-    access_right_check = is_write ? VTD_SL_W : VTD_SL_R;
+    switch(flags){
+        case IOMMU_WO:
+            access_right_check = VTD_SL_W;
+            break;
+        case IOMMU_RO:
+            access_right_check = VTD_SL_R;
+            break;
+        case IOMMU_RW: /* fall through */
+        case IOMMU_NO_FAIL:
+            access_right_check = VTD_SL_R | VTD_SL_W;
+            break;
+        default:
+            assert(0);
+    }
 
     while (true) {
         offset = vtd_gpa_level_offset(gpa, level);
@@ -663,8 +676,8 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
         if (!(slpte & access_right_check)) {
             VTD_DPRINTF(GENERAL, "error: lack of %s permission for "
                         "gpa 0x%"PRIx64 " slpte 0x%"PRIx64,
-                        (is_write ? "write" : "read"), gpa, slpte);
-            return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
+                        (flags == IOMMU_WO ? "write" : "read"), gpa, slpte);
+            return (flags == IOMMU_RW || flags == IOMMU_WO) ? -VTD_FR_WRITE : -VTD_FR_READ;
         }
         if (vtd_slpte_nonzero_rsvd(slpte, level)) {
             VTD_DPRINTF(GENERAL, "error: non-zero reserved field in second "
@@ -781,11 +794,11 @@ static inline bool vtd_is_interrupt_addr(hwaddr addr)
  *
  * @bus_num: The bus number
  * @devfn: The devfn, which is the  combined of device and function number
- * @is_write: The access is a write operation
+ * @flags: The requested access permissions (IOMMUAccessFlags)
  * @entry: IOMMUTLBEntry that contain the addr to be translated and result
  */
 static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
-                                   uint8_t devfn, hwaddr addr, bool is_write,
+                                   uint8_t devfn, hwaddr addr, IOMMUAccessFlags flags,
                                    IOMMUTLBEntry *entry)
 {
     IntelIOMMUState *s = vtd_as->iommu_state;
@@ -803,7 +816,7 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
 
     /* Check if the request is in interrupt address range */
     if (vtd_is_interrupt_addr(addr)) {
-        if (is_write) {
+        if (flags == IOMMU_WO || flags == IOMMU_RW) {
             /* FIXME: since we don't know the length of the access here, we
              * treat Non-DWORD length write requests without PASID as
              * interrupt requests, too. Withoud interrupt remapping support,
@@ -819,7 +832,7 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         } else {
             VTD_DPRINTF(GENERAL, "error: read request from interrupt address "
                         "gpa 0x%"PRIx64, addr);
-            vtd_report_dmar_fault(s, source_id, addr, VTD_FR_READ, is_write);
+            vtd_report_dmar_fault(s, source_id, addr, VTD_FR_READ, flags);
             return;
         }
     }
@@ -848,12 +861,14 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
         if (ret_fr) {
             ret_fr = -ret_fr;
-            if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
-                VTD_DPRINTF(FLOG, "fault processing is disabled for DMA "
+            if (flags != IOMMU_NO_FAIL){
+                if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
+                    VTD_DPRINTF(FLOG, "fault processing is disabled for DMA "
                             "requests through this context-entry "
                             "(with FPD Set)");
-            } else {
-                vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
+                } else {
+                    vtd_report_dmar_fault(s, source_id, addr, ret_fr, flags);
+                }
             }
             return;
         }
@@ -866,15 +881,17 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         cc_entry->context_cache_gen = s->context_cache_gen;
     }
 
-    ret_fr = vtd_gpa_to_slpte(&ce, addr, is_write, &slpte, &level,
+    ret_fr = vtd_gpa_to_slpte(&ce, addr, flags, &slpte, &level,
                               &reads, &writes);
     if (ret_fr) {
         ret_fr = -ret_fr;
-        if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
-            VTD_DPRINTF(FLOG, "fault processing is disabled for DMA requests "
+        if (flags != IOMMU_NO_FAIL){
+            if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
+                VTD_DPRINTF(FLOG, "fault processing is disabled for DMA requests "
                         "through this context-entry (with FPD Set)");
-        } else {
-            vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
+            } else {
+                vtd_report_dmar_fault(s, source_id, addr, ret_fr, flags);
+            }
         }
         return;
     }
@@ -1840,7 +1857,7 @@ static void vtd_mem_write(void *opaque, hwaddr addr,
 }
 
 static IOMMUTLBEntry vtd_iommu_translate(MemoryRegion *iommu, hwaddr addr,
-                                         bool is_write)
+                                         IOMMUAccessFlags flags)
 {
     VTDAddressSpace *vtd_as = container_of(iommu, VTDAddressSpace, iommu);
     IntelIOMMUState *s = vtd_as->iommu_state;
@@ -1862,7 +1879,7 @@ static IOMMUTLBEntry vtd_iommu_translate(MemoryRegion *iommu, hwaddr addr,
     }
 
     vtd_do_iommu_translate(vtd_as, vtd_as->bus, vtd_as->devfn, addr,
-                           is_write, &ret);
+                           flags, &ret);
     VTD_DPRINTF(MMU,
                 "bus %"PRIu8 " slot %"PRIu8 " func %"PRIu8 " devfn %"PRIu8
                 " gpa 0x%"PRIx64 " hpa 0x%"PRIx64, pci_bus_num(vtd_as->bus),
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 7fb9188..755a026 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -55,6 +55,7 @@ typedef enum {
     IOMMU_RO   = 1,
     IOMMU_WO   = 2,
     IOMMU_RW   = 3,
+    IOMMU_NO_FAIL  = 4, /* may not be present; don't report error to guest */
 } IOMMUAccessFlags;
 
 struct IOMMUTLBEntry {
@@ -145,10 +146,9 @@ struct MemoryRegionOps {
 };
 
 typedef struct MemoryRegionIOMMUOps MemoryRegionIOMMUOps;
-
 struct MemoryRegionIOMMUOps {
     /* Return a TLB entry that contains a given address. */
-    IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
+    IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, IOMMUAccessFlags flags);
 };
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
diff --git a/memory.c b/memory.c
index a8ef852..ba46cbb 100644
--- a/memory.c
+++ b/memory.c
@@ -1522,7 +1522,7 @@ void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n,
     IOMMUTLBEntry iotlb;
 
     for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
-        iotlb = mr->iommu_ops->translate(mr, addr, is_write);
+        iotlb = mr->iommu_ops->translate(mr, addr, is_write ? IOMMU_WO : IOMMU_RO);
         if (iotlb.perm != IOMMU_NONE) {
             n->notify(n, &iotlb);
         }
-- 
1.9.1


* [Qemu-devel] [PATCH v3 3/3] IOMMU: Integrate between VFIO and vIOMMU to support device assignment
  2016-05-21 16:19 [Qemu-devel] [PATCH v3 0/3] IOMMU: Add Support to VFIO devices with vIOMMU present Aviv B.D
  2016-05-21 16:19 ` [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest Aviv B.D
  2016-05-21 16:19 ` [Qemu-devel] [PATCH v3 2/3] IOMMU: change iommu_op->translate's is_write to flags, add support to NO_FAIL flag mode Aviv B.D
@ 2016-05-21 16:19 ` Aviv B.D
  2016-05-23 17:53   ` Alex Williamson
  2 siblings, 1 reply; 40+ messages in thread
From: Aviv B.D @ 2016-05-21 16:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Alex Williamson, Peter Xu, Jan Kiszka,
	Aviv Ben-David

From: "Aviv Ben-David" <bd.aviv@gmail.com>

Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
---
 hw/i386/intel_iommu.c          | 69 ++++++++++++++++++++++++++++++++++++++++--
 hw/i386/intel_iommu_internal.h |  2 ++
 hw/vfio/common.c               | 11 +++++--
 include/hw/i386/intel_iommu.h  |  4 +++
 include/hw/vfio/vfio-common.h  |  1 +
 5 files changed, 81 insertions(+), 6 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 410f810..128ec7c 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -43,6 +43,9 @@ static int vtd_dbgflags = VTD_DBGBIT(GENERAL) | VTD_DBGBIT(CSR);
 #define VTD_DPRINTF(what, fmt, ...) do {} while (0)
 #endif
 
+static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
+                                    uint8_t devfn, VTDContextEntry *ce);
+
 static void vtd_define_quad(IntelIOMMUState *s, hwaddr addr, uint64_t val,
                             uint64_t wmask, uint64_t w1cmask)
 {
@@ -126,6 +129,22 @@ static uint32_t vtd_set_clear_mask_long(IntelIOMMUState *s, hwaddr addr,
     return new_val;
 }
 
+static int vtd_get_did_dev(IntelIOMMUState *s, uint8_t bus_num, uint8_t devfn, uint16_t * domain_id)
+{
+    VTDContextEntry ce;
+    int ret_fr;
+
+    assert(domain_id);
+
+    ret_fr = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
+    if (ret_fr){
+        return -1;
+    }
+
+    *domain_id =  VTD_CONTEXT_ENTRY_DID(ce.hi);
+    return 0;
+}
+
 static uint64_t vtd_set_clear_mask_quad(IntelIOMMUState *s, hwaddr addr,
                                         uint64_t clear, uint64_t mask)
 {
@@ -724,9 +743,6 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
     }
 
     if (!vtd_context_entry_present(ce)) {
-        VTD_DPRINTF(GENERAL,
-                    "error: context-entry #%"PRIu8 "(bus #%"PRIu8 ") "
-                    "is not present", devfn, bus_num);
         return -VTD_FR_CONTEXT_ENTRY_P;
     } else if ((ce->hi & VTD_CONTEXT_ENTRY_RSVD_HI) ||
                (ce->lo & VTD_CONTEXT_ENTRY_RSVD_LO)) {
@@ -1033,18 +1049,58 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
                                 &domain_id);
 }
 
+static void vtd_iotlb_page_invalidate_vfio(IntelIOMMUState *s, uint16_t domain_id,
+                                      hwaddr addr, uint8_t am)
+{
+    VFIOGuestIOMMU * giommu;
+
+    QLIST_FOREACH(giommu, &(s->giommu_list), iommu_next){
+        VTDAddressSpace *vtd_as = container_of(giommu->iommu, VTDAddressSpace, iommu);
+        uint16_t vfio_domain_id; 
+        int ret = vtd_get_did_dev(s, pci_bus_num(vtd_as->bus), vtd_as->devfn, &vfio_domain_id);
+        int i=0;
+        if (!ret && domain_id == vfio_domain_id){
+            IOMMUTLBEntry entry; 
+            
+            /* do vfio unmap */
+            VTD_DPRINTF(GENERAL, "Remove addr 0x%"PRIx64 " mask %d", addr, am);
+            entry.target_as = NULL;
+            entry.iova = addr & VTD_PAGE_MASK_4K;
+            entry.translated_addr = 0;
+            entry.addr_mask = ~VTD_PAGE_MASK(VTD_PAGE_SHIFT_4K + am);
+            entry.perm = IOMMU_NONE;
+            memory_region_notify_iommu(giommu->iommu, entry);
+       
+            /* do vfio map */
+            VTD_DPRINTF(GENERAL, "add addr 0x%"PRIx64 " mask %d", addr, am);
+            /* call to vtd_iommu_translate */
+            for (i = 0; i < (1 << am); i++, addr+=(1 << VTD_PAGE_SHIFT_4K)){
+                IOMMUTLBEntry entry = s->iommu_ops.translate(giommu->iommu, addr, IOMMU_NO_FAIL); 
+                if (entry.perm != IOMMU_NONE){
+                    memory_region_notify_iommu(giommu->iommu, entry);
+                }
+            }
+        }
+    }
+}
+
 static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
                                       hwaddr addr, uint8_t am)
 {
     VTDIOTLBPageInvInfo info;
 
     assert(am <= VTD_MAMV);
+    
     info.domain_id = domain_id;
     info.addr = addr;
     info.mask = ~((1 << am) - 1);
+    
     g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
+
+    vtd_iotlb_page_invalidate_vfio(s, domain_id, addr, am);
 }
 
+
 /* Flush IOTLB
  * Returns the IOTLB Actual Invalidation Granularity.
  * @val: the content of the IOTLB_REG
@@ -1912,6 +1968,13 @@ static Property vtd_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
+void vtd_register_giommu(VFIOGuestIOMMU * giommu)
+{
+    VTDAddressSpace *vtd_as = container_of(giommu->iommu, VTDAddressSpace, iommu);
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    
+    QLIST_INSERT_HEAD(&s->giommu_list, giommu, iommu_next);
+}
 
 VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
 {
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index ae40f73..102e9a5 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -339,6 +339,8 @@ typedef struct VTDIOTLBPageInvInfo VTDIOTLBPageInvInfo;
 #define VTD_PAGE_SHIFT_1G           30
 #define VTD_PAGE_MASK_1G            (~((1ULL << VTD_PAGE_SHIFT_1G) - 1))
 
+#define VTD_PAGE_MASK(shift)        (~((1ULL << (shift)) - 1))
+
 struct VTDRootEntry {
     uint64_t val;
     uint64_t rsvd;
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 88154a1..54fc8bc 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -35,6 +35,9 @@
 #endif
 #include "trace.h"
 
+#include "hw/sysbus.h"
+#include "hw/i386/intel_iommu.h"
+
 struct vfio_group_head vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
 struct vfio_as_head vfio_address_spaces =
@@ -315,12 +318,12 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
 out:
     rcu_read_unlock();
 }
-
+#if 0
 static hwaddr vfio_container_granularity(VFIOContainer *container)
 {
     return (hwaddr)1 << ctz64(container->iova_pgsizes);
 }
-
+#endif
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
@@ -384,11 +387,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
         giommu->n.notify = vfio_iommu_map_notify;
         QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
 
+        vtd_register_giommu(giommu);
         memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
+#if 0
         memory_region_iommu_replay(giommu->iommu, &giommu->n,
                                    vfio_container_granularity(container),
                                    false);
-
+#endif
         return;
     }
 
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index b024ffa..22f3f83 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -23,6 +23,7 @@
 #define INTEL_IOMMU_H
 #include "hw/qdev.h"
 #include "sysemu/dma.h"
+#include "hw/vfio/vfio-common.h"
 
 #define TYPE_INTEL_IOMMU_DEVICE "intel-iommu"
 #define INTEL_IOMMU_DEVICE(obj) \
@@ -123,6 +124,8 @@ struct IntelIOMMUState {
     MemoryRegionIOMMUOps iommu_ops;
     GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
     VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
+
+    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
 };
 
 /* Find the VTD Address space associated with the given bus pointer,
@@ -130,4 +133,5 @@ struct IntelIOMMUState {
  */
 VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn);
 
+void vtd_register_giommu(VFIOGuestIOMMU * giommu);
 #endif
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index eb0e1b0..bf56a1d 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -92,6 +92,7 @@ typedef struct VFIOGuestIOMMU {
     MemoryRegion *iommu;
     Notifier n;
     QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
+    QLIST_ENTRY(VFIOGuestIOMMU) iommu_next;
 } VFIOGuestIOMMU;
 
 typedef struct VFIODeviceOps VFIODeviceOps;
-- 
1.9.1


* Re: [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest
  2016-05-21 16:19 ` [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest Aviv B.D
@ 2016-05-21 16:42   ` Jan Kiszka
  2016-06-02  8:44     ` Peter Xu
  2016-05-24  8:14   ` Jason Wang
  1 sibling, 1 reply; 40+ messages in thread
From: Jan Kiszka @ 2016-05-21 16:42 UTC (permalink / raw)
  To: Aviv B.D, qemu-devel; +Cc: Alex Williamson, Peter Xu, Michael S. Tsirkin

[-- Attachment #1: Type: text/plain, Size: 1723 bytes --]

On 2016-05-21 18:19, Aviv B.D wrote:
> From: "Aviv Ben-David" <bd.aviv@gmail.com>
> 
> This flag tells the guest to invalidate the TLB cache after unmap operations as well.
> 
> Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
> ---
>  hw/i386/intel_iommu.c          | 3 ++-
>  hw/i386/intel_iommu_internal.h | 1 +
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 347718f..1af8da8 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1949,7 +1949,8 @@ static void vtd_init(IntelIOMMUState *s)
>      s->iq_last_desc_type = VTD_INV_DESC_NONE;
>      s->next_frcd_reg = 0;
>      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
> -             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS;
> +             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> +             VTD_CAP_CM;

Again, this needs to be optional, because not all guests will support it,
and some behave differently when it's set (I have one that refuses to work).

Jan
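One way to do that, sketched with simplified stand-in types (the property name "caching-mode" and the caching_mode field are assumptions for illustration, not part of this series), is to gate the bit on a qdev boolean property that defaults to off, so existing guests see an unchanged capability register:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VTD_CAP_CM (1ULL << 7)

/* Stand-in for IntelIOMMUState; in QEMU the flag would come from a
 * qdev property, e.g. a hypothetical
 * DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, false). */
typedef struct {
    bool caching_mode;
    uint64_t cap;
} IOMMUStateSketch;

static void init_cap(IOMMUStateSketch *s)
{
    s->cap = 0;                 /* the other VTD_CAP_* bits, elided here */
    if (s->caching_mode) {
        s->cap |= VTD_CAP_CM;   /* advertise CM only when asked to */
    }
}
```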

>      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
>  
>      vtd_reset_context_cache(s);
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index e5f514c..ae40f73 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -190,6 +190,7 @@
>  #define VTD_CAP_MAMV                (VTD_MAMV << 48)
>  #define VTD_CAP_PSI                 (1ULL << 39)
>  #define VTD_CAP_SLLPS               ((1ULL << 34) | (1ULL << 35))
> +#define VTD_CAP_CM                  (1ULL << 7)
>  
>  /* Supported Adjusted Guest Address Widths */
>  #define VTD_CAP_SAGAW_SHIFT         8
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]


* Re: [Qemu-devel] [PATCH v3 3/3] IOMMU: Integrate between VFIO and vIOMMU to support device assignment
  2016-05-21 16:19 ` [Qemu-devel] [PATCH v3 3/3] IOMMU: Integrate between VFIO and vIOMMU to support device assignment Aviv B.D
@ 2016-05-23 17:53   ` Alex Williamson
  2016-05-26 20:58     ` Alex Williamson
  2016-06-06  7:38     ` Peter Xu
  0 siblings, 2 replies; 40+ messages in thread
From: Alex Williamson @ 2016-05-23 17:53 UTC (permalink / raw)
  To: Aviv B.D; +Cc: qemu-devel, Michael S. Tsirkin, Peter Xu, Jan Kiszka

On Sat, 21 May 2016 19:19:50 +0300
"Aviv B.D" <bd.aviv@gmail.com> wrote:

> From: "Aviv Ben-David" <bd.aviv@gmail.com>
> 

Some commentary about the changes necessary to achieve $SUBJECT would
be nice here.

> Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
> ---
>  hw/i386/intel_iommu.c          | 69 ++++++++++++++++++++++++++++++++++++++++--
>  hw/i386/intel_iommu_internal.h |  2 ++
>  hw/vfio/common.c               | 11 +++++--
>  include/hw/i386/intel_iommu.h  |  4 +++
>  include/hw/vfio/vfio-common.h  |  1 +
>  5 files changed, 81 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 410f810..128ec7c 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -43,6 +43,9 @@ static int vtd_dbgflags = VTD_DBGBIT(GENERAL) | VTD_DBGBIT(CSR);
>  #define VTD_DPRINTF(what, fmt, ...) do {} while (0)
>  #endif
>  
> +static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
> +                                    uint8_t devfn, VTDContextEntry *ce);
> +
>  static void vtd_define_quad(IntelIOMMUState *s, hwaddr addr, uint64_t val,
>                              uint64_t wmask, uint64_t w1cmask)
>  {
> @@ -126,6 +129,22 @@ static uint32_t vtd_set_clear_mask_long(IntelIOMMUState *s, hwaddr addr,
>      return new_val;
>  }
>  
> +static int vtd_get_did_dev(IntelIOMMUState *s, uint8_t bus_num, uint8_t devfn, uint16_t * domain_id)
> +{
> +    VTDContextEntry ce;
> +    int ret_fr;
> +
> +    assert(domain_id);
> +
> +    ret_fr = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
> +    if (ret_fr){
> +        return -1;
> +    }
> +
> +    *domain_id =  VTD_CONTEXT_ENTRY_DID(ce.hi);
> +    return 0;
> +}
> +
>  static uint64_t vtd_set_clear_mask_quad(IntelIOMMUState *s, hwaddr addr,
>                                          uint64_t clear, uint64_t mask)
>  {
> @@ -724,9 +743,6 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>      }
>  
>      if (!vtd_context_entry_present(ce)) {
> -        VTD_DPRINTF(GENERAL,
> -                    "error: context-entry #%"PRIu8 "(bus #%"PRIu8 ") "
> -                    "is not present", devfn, bus_num);
>          return -VTD_FR_CONTEXT_ENTRY_P;
>      } else if ((ce->hi & VTD_CONTEXT_ENTRY_RSVD_HI) ||
>                 (ce->lo & VTD_CONTEXT_ENTRY_RSVD_LO)) {
> @@ -1033,18 +1049,58 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
>                                  &domain_id);
>  }
>  
> +static void vtd_iotlb_page_invalidate_vfio(IntelIOMMUState *s, uint16_t domain_id,
> +                                      hwaddr addr, uint8_t am)
> +{
> +    VFIOGuestIOMMU * giommu;
> +

VT-d parsing VFIO private data structures, nope this is not a good idea.

> +    QLIST_FOREACH(giommu, &(s->giommu_list), iommu_next){
> +        VTDAddressSpace *vtd_as = container_of(giommu->iommu, VTDAddressSpace, iommu);

VT-d needs to keep track of its own address spaces and call the iommu
notifier, it should have no visibility whatsoever that there are vfio
devices attached.
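The layering being asked for can be sketched like this (simplified stand-in types, not the QEMU API): the IOMMU model walks only its own per-device address-space list and fires a generic notifier callback, so any consumer, VFIO or otherwise, registers itself without the IOMMU code ever naming it:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct AddrSpace AddrSpace;
typedef void (*InvalidateNotify)(AddrSpace *as, uint64_t iova);

/* One entry per device address space, owned by the IOMMU model itself. */
struct AddrSpace {
    uint16_t domain_id;
    InvalidateNotify notify;   /* registered by a consumer; may be NULL */
    AddrSpace *next;
};

typedef struct {
    AddrSpace *as_list;        /* the model's own list; no VFIO types */
} IOMMUModelSketch;

/* The guest invalidated a page in domain_id: walk the model's own list
 * and notify every registered listener in that domain. Returns how many
 * address spaces matched. */
static int invalidate_domain(IOMMUModelSketch *s, uint16_t domain_id,
                             uint64_t iova)
{
    int matched = 0;
    for (AddrSpace *as = s->as_list; as; as = as->next) {
        if (as->domain_id == domain_id) {
            if (as->notify) {
                as->notify(as, iova);
            }
            matched++;
        }
    }
    return matched;
}
```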

> +        uint16_t vfio_domain_id; 
> +        int ret = vtd_get_did_dev(s, pci_bus_num(vtd_as->bus), vtd_as->devfn, &vfio_domain_id);
> +        int i=0;
> +        if (!ret && domain_id == vfio_domain_id){
> +            IOMMUTLBEntry entry; 
> +            
> +            /* do vfio unmap */
> +            VTD_DPRINTF(GENERAL, "Remove addr 0x%"PRIx64 " mask %d", addr, am);
> +            entry.target_as = NULL;
> +            entry.iova = addr & VTD_PAGE_MASK_4K;
> +            entry.translated_addr = 0;
> +            entry.addr_mask = ~VTD_PAGE_MASK(VTD_PAGE_SHIFT_4K + am);
> +            entry.perm = IOMMU_NONE;
> +            memory_region_notify_iommu(giommu->iommu, entry);
> +       
> +            /* do vfio map */
> +            VTD_DPRINTF(GENERAL, "add addr 0x%"PRIx64 " mask %d", addr, am);
> +            /* call to vtd_iommu_translate */
> +            for (i = 0; i < (1 << am); i++, addr+=(1 << VTD_PAGE_SHIFT_4K)){
> +                IOMMUTLBEntry entry = s->iommu_ops.translate(giommu->iommu, addr, IOMMU_NO_FAIL); 
> +                if (entry.perm != IOMMU_NONE){
> +                    memory_region_notify_iommu(giommu->iommu, entry);
> +                }
> +            }
> +        }
> +    }
> +}
> +
>  static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
>                                        hwaddr addr, uint8_t am)
>  {
>      VTDIOTLBPageInvInfo info;
>  
>      assert(am <= VTD_MAMV);
> +    
>      info.domain_id = domain_id;
>      info.addr = addr;
>      info.mask = ~((1 << am) - 1);
> +    
>      g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
> +
> +    vtd_iotlb_page_invalidate_vfio(s, domain_id, addr, am);
>  }
>  
> +
>  /* Flush IOTLB
>   * Returns the IOTLB Actual Invalidation Granularity.
>   * @val: the content of the IOTLB_REG
> @@ -1912,6 +1968,13 @@ static Property vtd_properties[] = {
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> +void vtd_register_giommu(VFIOGuestIOMMU * giommu)
> +{
> +    VTDAddressSpace *vtd_as = container_of(giommu->iommu, VTDAddressSpace, iommu);
> +    IntelIOMMUState *s = vtd_as->iommu_state;
> +    
> +    QLIST_INSERT_HEAD(&s->giommu_list, giommu, iommu_next);
> +}
>  
>  VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>  {
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index ae40f73..102e9a5 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -339,6 +339,8 @@ typedef struct VTDIOTLBPageInvInfo VTDIOTLBPageInvInfo;
>  #define VTD_PAGE_SHIFT_1G           30
>  #define VTD_PAGE_MASK_1G            (~((1ULL << VTD_PAGE_SHIFT_1G) - 1))
>  
> +#define VTD_PAGE_MASK(shift)        (~((1ULL << (shift)) - 1))
> +
>  struct VTDRootEntry {
>      uint64_t val;
>      uint64_t rsvd;
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 88154a1..54fc8bc 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -35,6 +35,9 @@
>  #endif
>  #include "trace.h"
>  
> +#include "hw/sysbus.h"
> +#include "hw/i386/intel_iommu.h"
> +
>  struct vfio_group_head vfio_group_list =
>      QLIST_HEAD_INITIALIZER(vfio_group_list);
>  struct vfio_as_head vfio_address_spaces =
> @@ -315,12 +318,12 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
>  out:
>      rcu_read_unlock();
>  }
> -
> +#if 0
>  static hwaddr vfio_container_granularity(VFIOContainer *container)
>  {
>      return (hwaddr)1 << ctz64(container->iova_pgsizes);
>  }
> -
> +#endif


Clearly this is unacceptable, the code has a purpose.

>  static void vfio_listener_region_add(MemoryListener *listener,
>                                       MemoryRegionSection *section)
>  {
> @@ -384,11 +387,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
>          giommu->n.notify = vfio_iommu_map_notify;
>          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>  
> +        vtd_register_giommu(giommu);

vfio will not assume VT-d, this is why we register the notifier below.

>          memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> +#if 0
>          memory_region_iommu_replay(giommu->iommu, &giommu->n,
>                                     vfio_container_granularity(container),
>                                     false);
> -
> +#endif

Clearly this also has a purpose.

>          return;
>      }
>  
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index b024ffa..22f3f83 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -23,6 +23,7 @@
>  #define INTEL_IOMMU_H
>  #include "hw/qdev.h"
>  #include "sysemu/dma.h"
> +#include "hw/vfio/vfio-common.h"

No.  This header probably should not have been put under include, VT-d
has no business walking our guest IOMMU list.

>  
>  #define TYPE_INTEL_IOMMU_DEVICE "intel-iommu"
>  #define INTEL_IOMMU_DEVICE(obj) \
> @@ -123,6 +124,8 @@ struct IntelIOMMUState {
>      MemoryRegionIOMMUOps iommu_ops;
>      GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
>      VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
> +
> +    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>  };
>  
>  /* Find the VTD Address space associated with the given bus pointer,
> @@ -130,4 +133,5 @@ struct IntelIOMMUState {
>   */
>  VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn);
>  
> +void vtd_register_giommu(VFIOGuestIOMMU * giommu);
>  #endif
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index eb0e1b0..bf56a1d 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -92,6 +92,7 @@ typedef struct VFIOGuestIOMMU {
>      MemoryRegion *iommu;
>      Notifier n;
>      QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
> +    QLIST_ENTRY(VFIOGuestIOMMU) iommu_next;

No.  Use the existing interfaces, create your own address space
tracking in VT-d, we are not going to host a list for VT-d to use.
Also note that there's no consideration of hot-unplug support in these
changes.  vfio already works with guest iommus on powerpc, so any
change to vfio needs to be justified and generalized to a common
guest iommu api.  Thanks,

Alex

>  } VFIOGuestIOMMU;
>  
>  typedef struct VFIODeviceOps VFIODeviceOps;

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest
  2016-05-21 16:19 ` [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest Aviv B.D
  2016-05-21 16:42   ` Jan Kiszka
@ 2016-05-24  8:14   ` Jason Wang
  2016-05-24  9:25     ` Jan Kiszka
  1 sibling, 1 reply; 40+ messages in thread
From: Jason Wang @ 2016-05-24  8:14 UTC (permalink / raw)
  To: Aviv B.D, qemu-devel
  Cc: Jan Kiszka, Alex Williamson, Peter Xu, Michael S. Tsirkin



On 2016-05-22 00:19, Aviv B.D wrote:
> From: "Aviv Ben-David" <bd.aviv@gmail.com>
>
> This flag tells the guest to invalidate tlb cache also after unmap operations.
>
> Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
> ---

Is this a guest visible behavior?  If yes, shouldn't we cache 
translation fault conditions in IOTLB?

>   hw/i386/intel_iommu.c          | 3 ++-
>   hw/i386/intel_iommu_internal.h | 1 +
>   2 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 347718f..1af8da8 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1949,7 +1949,8 @@ static void vtd_init(IntelIOMMUState *s)
>       s->iq_last_desc_type = VTD_INV_DESC_NONE;
>       s->next_frcd_reg = 0;
>       s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
> -             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS;
> +             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> +             VTD_CAP_CM;
>       s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
>   
>       vtd_reset_context_cache(s);
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index e5f514c..ae40f73 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -190,6 +190,7 @@
>   #define VTD_CAP_MAMV                (VTD_MAMV << 48)
>   #define VTD_CAP_PSI                 (1ULL << 39)
>   #define VTD_CAP_SLLPS               ((1ULL << 34) | (1ULL << 35))
> +#define VTD_CAP_CM                  (1ULL << 7)
>   
>   /* Supported Adjusted Guest Address Widths */
>   #define VTD_CAP_SAGAW_SHIFT         8


* Re: [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest
  2016-05-24  8:14   ` Jason Wang
@ 2016-05-24  9:25     ` Jan Kiszka
  2016-05-28 16:12       ` Aviv B.D.
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Kiszka @ 2016-05-24  9:25 UTC (permalink / raw)
  To: Jason Wang, Aviv B.D, qemu-devel
  Cc: Alex Williamson, Peter Xu, Michael S. Tsirkin

On 2016-05-24 10:14, Jason Wang wrote:
> On 2016-05-22 00:19, Aviv B.D wrote:
>> From: "Aviv Ben-David" <bd.aviv@gmail.com>
>>
>> This flag tells the guest to invalidate tlb cache also after unmap
>> operations.
>>
>> Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
>> ---
> 
> Is this a guest visible behavior?  If yes, shouldn't we cache
> translation fault conditions in IOTLB?

It is guest visible, and this first of all means, besides requiring to
be optional, that it definitely needs to be off for compat systems. Or
it stays off by default.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/3] IOMMU: Integrate between VFIO and vIOMMU to support device assignment
  2016-05-23 17:53   ` Alex Williamson
@ 2016-05-26 20:58     ` Alex Williamson
  2016-05-28 10:52       ` Aviv B.D.
  2016-06-06  7:38     ` Peter Xu
  1 sibling, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2016-05-26 20:58 UTC (permalink / raw)
  To: Aviv B.D; +Cc: Jan Kiszka, qemu-devel, Peter Xu, Michael S. Tsirkin

On Mon, 23 May 2016 11:53:42 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Sat, 21 May 2016 19:19:50 +0300
> "Aviv B.D" <bd.aviv@gmail.com> wrote:
> 
> > From: "Aviv Ben-David" <bd.aviv@gmail.com>
> >   
> 
> Some commentary about the changes necessary to achieve $SUBJECT would
> be nice here.
> 
> > Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
> > ---
> >  hw/i386/intel_iommu.c          | 69 ++++++++++++++++++++++++++++++++++++++++--
> >  hw/i386/intel_iommu_internal.h |  2 ++
> >  hw/vfio/common.c               | 11 +++++--
> >  include/hw/i386/intel_iommu.h  |  4 +++
> >  include/hw/vfio/vfio-common.h  |  1 +
> >  5 files changed, 81 insertions(+), 6 deletions(-)
> > 
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index 410f810..128ec7c 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -43,6 +43,9 @@ static int vtd_dbgflags = VTD_DBGBIT(GENERAL) | VTD_DBGBIT(CSR);
> >  #define VTD_DPRINTF(what, fmt, ...) do {} while (0)
> >  #endif
> >  
> > +static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
> > +                                    uint8_t devfn, VTDContextEntry *ce);
> > +
> >  static void vtd_define_quad(IntelIOMMUState *s, hwaddr addr, uint64_t val,
> >                              uint64_t wmask, uint64_t w1cmask)
> >  {
> > @@ -126,6 +129,22 @@ static uint32_t vtd_set_clear_mask_long(IntelIOMMUState *s, hwaddr addr,
> >      return new_val;
> >  }
> >  
> > +static int vtd_get_did_dev(IntelIOMMUState *s, uint8_t bus_num, uint8_t devfn, uint16_t * domain_id)
> > +{
> > +    VTDContextEntry ce;
> > +    int ret_fr;
> > +
> > +    assert(domain_id);
> > +
> > +    ret_fr = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
> > +    if (ret_fr){
> > +        return -1;
> > +    }
> > +
> > +    *domain_id =  VTD_CONTEXT_ENTRY_DID(ce.hi);
> > +    return 0;
> > +}
> > +
> >  static uint64_t vtd_set_clear_mask_quad(IntelIOMMUState *s, hwaddr addr,
> >                                          uint64_t clear, uint64_t mask)
> >  {
> > @@ -724,9 +743,6 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
> >      }
> >  
> >      if (!vtd_context_entry_present(ce)) {
> > -        VTD_DPRINTF(GENERAL,
> > -                    "error: context-entry #%"PRIu8 "(bus #%"PRIu8 ") "
> > -                    "is not present", devfn, bus_num);
> >          return -VTD_FR_CONTEXT_ENTRY_P;
> >      } else if ((ce->hi & VTD_CONTEXT_ENTRY_RSVD_HI) ||
> >                 (ce->lo & VTD_CONTEXT_ENTRY_RSVD_LO)) {
> > @@ -1033,18 +1049,58 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
> >                                  &domain_id);
> >  }
> >  
> > +static void vtd_iotlb_page_invalidate_vfio(IntelIOMMUState *s, uint16_t domain_id,
> > +                                      hwaddr addr, uint8_t am)
> > +{
> > +    VFIOGuestIOMMU * giommu;
> > +  
> 
> VT-d parsing VFIO private data structures, nope this is not a good idea.
> 
> > +    QLIST_FOREACH(giommu, &(s->giommu_list), iommu_next){
> > +        VTDAddressSpace *vtd_as = container_of(giommu->iommu, VTDAddressSpace, iommu);  
> 
> VT-d needs to keep track of its own address spaces and call the iommu
> notifier, it should have no visibility whatsoever that there are vfio
> devices attached.
> 
> > +        uint16_t vfio_domain_id; 
> > +        int ret = vtd_get_did_dev(s, pci_bus_num(vtd_as->bus), vtd_as->devfn, &vfio_domain_id);
> > +        int i=0;
> > +        if (!ret && domain_id == vfio_domain_id){
> > +            IOMMUTLBEntry entry; 
> > +            
> > +            /* do vfio unmap */
> > +            VTD_DPRINTF(GENERAL, "Remove addr 0x%"PRIx64 " mask %d", addr, am);
> > +            entry.target_as = NULL;
> > +            entry.iova = addr & VTD_PAGE_MASK_4K;
> > +            entry.translated_addr = 0;
> > +            entry.addr_mask = ~VTD_PAGE_MASK(VTD_PAGE_SHIFT_4K + am);
> > +            entry.perm = IOMMU_NONE;
> > +            memory_region_notify_iommu(giommu->iommu, entry);
> > +       
> > +            /* do vfio map */
> > +            VTD_DPRINTF(GENERAL, "add addr 0x%"PRIx64 " mask %d", addr, am);
> > +            /* call to vtd_iommu_translate */
> > +            for (i = 0; i < (1 << am); i++, addr+=(1 << VTD_PAGE_SHIFT_4K)){
> > +                IOMMUTLBEntry entry = s->iommu_ops.translate(giommu->iommu, addr, IOMMU_NO_FAIL); 
> > +                if (entry.perm != IOMMU_NONE){
> > +                    memory_region_notify_iommu(giommu->iommu, entry);
> > +                }
> > +            }
> > +        }
> > +    }
> > +}
> > +


I think I see what you're trying to do here, find the VTDAddressSpaces
that vfio knows about and determine which are affected by an
invalidation, but I think it really just represents a flaw in the VT-d
code not really matching real hardware.  As I understand the real
hardware, per device translations use the bus:devfn to get the context
entry.  The context entry contains both the domain_id and a pointer to
the page table.  Note that VT-d requires that domain_id and page table
pointers are consistent for all entries, there cannot be two separate
context entries with the same domain_id that point to different page
tables. So really, VTDAddressSpace should be based at the domain, not
the context entry.  Multiple context entries can point to the same
domain, and thus the same VTDAddressSpace.  Doing so would make the
invalidation trivial, simply lookup the VTDAddressSpace based on the
domain_id, get the MemoryRegion, and fire off
memory_region_notify_iommu()s, which then get {un}mapped through vfio
without any direct interaction.  Unfortunately I think that means that
intel-iommu needs some data structure surgery.

With the current code, I think the best we could do would be to look
through every context for matching domain_ids.  For each one, use that
bus:devfn to get the VTDAddressSpace and thus the MemoryRegion and
call a notify for each.  That's not only an inefficient lookup, but
requires a notify per matching bus:devfn since each uses a separate
VTDAddressSpace when we should really only need a notify per domain_id. 

Also, I haven't figured it out yet, but I think we're also going to
need to actually populate the MemoryRegion rather than just use it to
send notifies, otherwise replay won't work, which means that a device
added to a domain with existing mappings wouldn't work.  Thanks,

Alex


* Re: [Qemu-devel] [PATCH v3 3/3] IOMMU: Integrate between VFIO and vIOMMU to support device assignment
  2016-05-26 20:58     ` Alex Williamson
@ 2016-05-28 10:52       ` Aviv B.D.
  2016-05-28 16:02         ` Alex Williamson
  0 siblings, 1 reply; 40+ messages in thread
From: Aviv B.D. @ 2016-05-28 10:52 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Jan Kiszka, qemu-devel, Peter Xu, Michael S. Tsirkin

Hi,
Your idea to search the relevant VTDAddressSpace and call its notifier
will probably work. Next week I'll try to implement it (for now with the
costly scan of each context).
I'm still not sure whether populating the MemoryRegion will suffice for a
hot-plugged vfio device, but I'll try to look into it.

As far as I understand the memory_region_iommu_replay function, it still
scans the whole 64-bit address space, and therefore may hang the VM for a
long time.

Aviv.

On Thu, May 26, 2016 at 11:58 PM Alex Williamson <alex.williamson@redhat.com>
wrote:

> On Mon, 23 May 2016 11:53:42 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
>
> > On Sat, 21 May 2016 19:19:50 +0300
> > "Aviv B.D" <bd.aviv@gmail.com> wrote:
> >
> > > From: "Aviv Ben-David" <bd.aviv@gmail.com>
> > >
> >
> > Some commentary about the changes necessary to achieve $SUBJECT would
> > be nice here.
> >
> > > Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
> > > ---
> > >  hw/i386/intel_iommu.c          | 69
> ++++++++++++++++++++++++++++++++++++++++--
> > >  hw/i386/intel_iommu_internal.h |  2 ++
> > >  hw/vfio/common.c               | 11 +++++--
> > >  include/hw/i386/intel_iommu.h  |  4 +++
> > >  include/hw/vfio/vfio-common.h  |  1 +
> > >  5 files changed, 81 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > index 410f810..128ec7c 100644
> > > --- a/hw/i386/intel_iommu.c
> > > +++ b/hw/i386/intel_iommu.c
> > > @@ -43,6 +43,9 @@ static int vtd_dbgflags = VTD_DBGBIT(GENERAL) |
> VTD_DBGBIT(CSR);
> > >  #define VTD_DPRINTF(what, fmt, ...) do {} while (0)
> > >  #endif
> > >
> > > +static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t
> bus_num,
> > > +                                    uint8_t devfn, VTDContextEntry
> *ce);
> > > +
> > >  static void vtd_define_quad(IntelIOMMUState *s, hwaddr addr, uint64_t
> val,
> > >                              uint64_t wmask, uint64_t w1cmask)
> > >  {
> > > @@ -126,6 +129,22 @@ static uint32_t
> vtd_set_clear_mask_long(IntelIOMMUState *s, hwaddr addr,
> > >      return new_val;
> > >  }
> > >
> > > +static int vtd_get_did_dev(IntelIOMMUState *s, uint8_t bus_num,
> uint8_t devfn, uint16_t * domain_id)
> > > +{
> > > +    VTDContextEntry ce;
> > > +    int ret_fr;
> > > +
> > > +    assert(domain_id);
> > > +
> > > +    ret_fr = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
> > > +    if (ret_fr){
> > > +        return -1;
> > > +    }
> > > +
> > > +    *domain_id =  VTD_CONTEXT_ENTRY_DID(ce.hi);
> > > +    return 0;
> > > +}
> > > +
> > >  static uint64_t vtd_set_clear_mask_quad(IntelIOMMUState *s, hwaddr
> addr,
> > >                                          uint64_t clear, uint64_t mask)
> > >  {
> > > @@ -724,9 +743,6 @@ static int
> vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
> > >      }
> > >
> > >      if (!vtd_context_entry_present(ce)) {
> > > -        VTD_DPRINTF(GENERAL,
> > > -                    "error: context-entry #%"PRIu8 "(bus #%"PRIu8 ") "
> > > -                    "is not present", devfn, bus_num);
> > >          return -VTD_FR_CONTEXT_ENTRY_P;
> > >      } else if ((ce->hi & VTD_CONTEXT_ENTRY_RSVD_HI) ||
> > >                 (ce->lo & VTD_CONTEXT_ENTRY_RSVD_LO)) {
> > > @@ -1033,18 +1049,58 @@ static void
> vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
> > >                                  &domain_id);
> > >  }
> > >
> > > +static void vtd_iotlb_page_invalidate_vfio(IntelIOMMUState *s,
> uint16_t domain_id,
> > > +                                      hwaddr addr, uint8_t am)
> > > +{
> > > +    VFIOGuestIOMMU * giommu;
> > > +
> >
> > VT-d parsing VFIO private data structures, nope this is not a good idea.
> >
> > > +    QLIST_FOREACH(giommu, &(s->giommu_list), iommu_next){
> > > +        VTDAddressSpace *vtd_as = container_of(giommu->iommu,
> VTDAddressSpace, iommu);
> >
> > VT-d needs to keep track of its own address spaces and call the iommu
> > notifier, it should have no visibility whatsoever that there are vfio
> > devices attached.
> >
> > > +        uint16_t vfio_domain_id;
> > > +        int ret = vtd_get_did_dev(s, pci_bus_num(vtd_as->bus),
> vtd_as->devfn, &vfio_domain_id);
> > > +        int i=0;
> > > +        if (!ret && domain_id == vfio_domain_id){
> > > +            IOMMUTLBEntry entry;
> > > +
> > > +            /* do vfio unmap */
> > > +            VTD_DPRINTF(GENERAL, "Remove addr 0x%"PRIx64 " mask %d",
> addr, am);
> > > +            entry.target_as = NULL;
> > > +            entry.iova = addr & VTD_PAGE_MASK_4K;
> > > +            entry.translated_addr = 0;
> > > +            entry.addr_mask = ~VTD_PAGE_MASK(VTD_PAGE_SHIFT_4K + am);
> > > +            entry.perm = IOMMU_NONE;
> > > +            memory_region_notify_iommu(giommu->iommu, entry);
> > > +
> > > +            /* do vfio map */
> > > +            VTD_DPRINTF(GENERAL, "add addr 0x%"PRIx64 " mask %d",
> addr, am);
> > > +            /* call to vtd_iommu_translate */
> > > +            for (i = 0; i < (1 << am); i++, addr+=(1 <<
> VTD_PAGE_SHIFT_4K)){
> > > +                IOMMUTLBEntry entry =
> s->iommu_ops.translate(giommu->iommu, addr, IOMMU_NO_FAIL);
> > > +                if (entry.perm != IOMMU_NONE){
> > > +                    memory_region_notify_iommu(giommu->iommu, entry);
> > > +                }
> > > +            }
> > > +        }
> > > +    }
> > > +}
> > > +
>
>
> I think I see what you're trying to do here, find the VTDAddressSpaces
> that vfio knows about and determine which are affected by an
> invalidation, but I think it really just represents a flaw in the VT-d
> code not really matching real hardware.  As I understand the real
> hardware, per device translations use the bus:devfn to get the context
> entry.  The context entry contains both the domain_id and a pointer to
> the page table.  Note that VT-d requires that domain_id and page table
> pointers are consistent for all entries, there cannot be two separate
> context entries with the same domain_id that point to different page
> tables. So really, VTDAddressSpace should be based at the domain, not
> the context entry.  Multiple context entries can point to the same
> domain, and thus the same VTDAddressSpace.  Doing so would make the
> invalidation trivial, simply lookup the VTDAddressSpace based on the
> domain_id, get the MemoryRegion, and fire off
> memory_region_notify_iommu()s, which then get {un}mapped through vfio
> without any direct interaction.  Unfortunately I think that means that
> intel-iommu needs some data structure surgery.
>
> With the current code, I think the best we could do would be to look
> through every context for matching domain_ids.  For each one, use that
> bus:devfn to get the VTDAddressSpace and thus the MemoryRegion and
> call a notify for each.  That's not only an inefficient lookup, but
> requires a notify per matching bus:devfn since each uses a separate
> VTDAddressSpace when we should really only need a notify per domain_id.
>
> Also, I haven't figured it out yet, but I think we're also going to
> need to actually populate the MemoryRegion rather than just use it to
> send notifies, otherwise replay won't work, which means that a device
> added to a domain with existing mappings wouldn't work.  Thanks,
>
> Alex
>


* Re: [Qemu-devel] [PATCH v3 3/3] IOMMU: Integrate between VFIO and vIOMMU to support device assignment
  2016-05-28 10:52       ` Aviv B.D.
@ 2016-05-28 16:02         ` Alex Williamson
  2016-05-28 16:10           ` Aviv B.D.
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2016-05-28 16:02 UTC (permalink / raw)
  To: Aviv B.D.; +Cc: Jan Kiszka, qemu-devel, Peter Xu, Michael S. Tsirkin

On Sat, 28 May 2016 10:52:58 +0000
"Aviv B.D." <bd.aviv@gmail.com> wrote:

> Hi,
> Your idea to search the relevent VTDAddressSpace and call it's notifier
> will
> probably work. Next week I'll try to implement it (for now with the costly
> scan
> of each context).

I think an optimization we can make is to use pci_for_each_bus() and
pci_for_each_device() to scan only context entries where devices are
present.  Then for each context entry, retrieve the DID, if it matches
the invalidation domain_id, retrieve the VTDAddressSpace and perform a
memory_region_notify_iommu() using VTDAddressSpace.iommu.  Still
horribly inefficient, but an improvement over walking all context
entries and avoids gratuitous callbacks between unrelated drivers in
QEMU.

Overall, I have very little faith that this will be the only change
required to make this work though.  For instance, if a device is added
or removed from a domain, where is that accounted for?  Ideally this
should trigger the region_add/region_del listener callbacks, but I
don't see how that works with how VT-d creates a fixed VTDAddressSpace
per device, and in fact how our QEMU memory model doesn't allow the
address space of a device to be dynamically aliased against other
address spaces or really changed at all. 

> I still not sure if populating the MemoryRegion will suffice for hot plug
> vfio
> device but i'll try to look into it.
> 
> As far as I understand the memory_region_iommu_replay function, it still
> scans
> the whole 64bit address space, and therefore may hang the VM for a long
> time.

Then we need to fix that problem, one option might be to make a replay
callback on MemoryRegionIOMMUOps that walks the page tables for a given
context entry rather than blindly traversing a 64bit address space.  We
can't simply ignore the issue by #ifdef'ing out the code.  I suspect
there's a lot more involved to make VT-d interact properly with a
physical device than what's been proposed so far.  At every
invalidation, we need to figure out what's changed and update the host
mappings.  We also need better, more dynamic address space management
to make the virtual hardware reflect physical hardware when we enable
things like passthrough mode or have multiple devices sharing an iommu
domain.  I think we're just barely scratching the surface here.  Thanks,

Alex


* Re: [Qemu-devel] [PATCH v3 3/3] IOMMU: Integrate between VFIO and vIOMMU to support device assignment
  2016-05-28 16:02         ` Alex Williamson
@ 2016-05-28 16:10           ` Aviv B.D.
  2016-05-28 17:39             ` Alex Williamson
  0 siblings, 1 reply; 40+ messages in thread
From: Aviv B.D. @ 2016-05-28 16:10 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Jan Kiszka, qemu-devel, Peter Xu, Michael S. Tsirkin

On Sat, May 28, 2016 at 7:02 PM Alex Williamson <alex.williamson@redhat.com>
wrote:

> On Sat, 28 May 2016 10:52:58 +0000
> "Aviv B.D." <bd.aviv@gmail.com> wrote:
>
> > Hi,
> > Your idea to search the relevent VTDAddressSpace and call it's notifier
> > will
> > probably work. Next week I'll try to implement it (for now with the
> costly
> > scan
> > of each context).
>
> I think an optimization we can make is to use pci_for_each_bus() and
> pci_for_each_device() to scan only context entries where devices are
> present.  Then for each context entry, retrieve the DID, if it matches
> the invalidation domain_id, retrieve the VTDAddressSpace and perform a
> memory_region_notify_iommu() using VTDAddressSpace.iommu.  Still
> horribly inefficient, but an improvement over walking all context
> entries and avoids gratuitous callbacks between unrelated drivers in
> QEMU.
>

Thanks for the references on how I can do it. :)

>
> Overall, I have very little faith that this will be the only change
> required to make this work though.  For instance, if a device is added
> or removed from a domain, where is that accounted for?  Ideally this
> should trigger the region_add/region_del listener callbacks, but I
> don't see how that works with how VT-d creates a fixed VTDAddressSpace
> per device, and in fact how our QEMU memory model doesn't allow the
> address space of a device to be dynamically aliased against other
> address spaces or really changed at all.
>
> > I still not sure if populating the MemoryRegion will suffice for hot plug
> > vfio
> > device but i'll try to look into it.
> >
> > As far as I understand the memory_region_iommu_replay function, it still
> > scans
> > the whole 64bit address space, and therefore may hang the VM for a long
> > time.
>
> Then we need to fix that problem, one option might be to make a replay
> callback on MemoryRegionIOMMUOps that walks the page tables for a given
> context entry rather than blindly traversing a 64bit address space.  We
> can't simply ignore the issue by #ifdef'ing out the code.  I suspect
> there's a lot more involved to make VT-d interact properly with a
> physical device than what's been proposed so far.  At every
> invalidation, we need to figure out what's changed and update the host
> mappings.  We also need better, more dynamic address space management
> to make the virtual hardware reflect physical hardware when we enable
> things like passthrough mode or have multiple devices sharing an iommu
> domain.  I think we're just barely scratching the surface here.  Thanks,
>
> Alex
>


I agree with you regarding hotplug; that is why I only #ifdef'd this code
out and didn't delete it. With the call to memory_region_iommu_replay,
QEMU hangs on startup in a very long loop that prevents any device
assignment with the vIOMMU enabled.

I'm hoping not to enlarge the scope of this patch to include hotplug
device assignment with the iommu enabled.

Thanks,
Aviv.


* Re: [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest
  2016-05-24  9:25     ` Jan Kiszka
@ 2016-05-28 16:12       ` Aviv B.D.
  2016-05-28 16:34         ` Kiszka, Jan
  0 siblings, 1 reply; 40+ messages in thread
From: Aviv B.D. @ 2016-05-28 16:12 UTC (permalink / raw)
  To: Jan Kiszka, Jason Wang, qemu-devel
  Cc: Alex Williamson, Peter Xu, Michael S. Tsirkin

What is the best way to add this configuration option?

Aviv.

On Tue, May 24, 2016 at 12:25 PM Jan Kiszka <jan.kiszka@siemens.com> wrote:

> On 2016-05-24 10:14, Jason Wang wrote:
> > On 2016-05-22 00:19, Aviv B.D wrote:
> >> From: "Aviv Ben-David" <bd.aviv@gmail.com>
> >>
> >> This flag tells the guest to invalidate tlb cache also after unmap
> >> operations.
> >>
> >> Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
> >> ---
> >
> > Is this a guest visible behavior?  If yes, shouldn't we cache
> > translation fault conditions in IOTLB?
>
> It is guest visible, and this first of all means, besides requiring to
> be optional, that it definitely needs to be off for compat systems. Or
> it stays off by default.
>
> Jan
>
> --
> Siemens AG, Corporate Technology, CT RDA ITP SES-DE
> Corporate Competence Center Embedded Linux
>


* Re: [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest
  2016-05-28 16:12       ` Aviv B.D.
@ 2016-05-28 16:34         ` Kiszka, Jan
  0 siblings, 0 replies; 40+ messages in thread
From: Kiszka, Jan @ 2016-05-28 16:34 UTC (permalink / raw)
  To: jasowang, qemu-devel, bd.aviv; +Cc: alex.williamson, peterx, mst

Build on top of Marcel's -device for IOMMUs patches, add a device property. Sorry, no reference at hand.

Jan

Sent with TouchDown from my Android phone (www.symantec.com)

-----Original Message-----
From: Aviv B.D. [bd.aviv@gmail.com]
Received: Samstag, 28 Mai 2016, 18:12
To: Kiszka, Jan (CT RDA ITP SES-DE) [jan.kiszka@siemens.com]; Jason Wang [jasowang@redhat.com]; qemu-devel@nongnu.org [qemu-devel@nongnu.org]
CC: Alex Williamson [alex.williamson@redhat.com]; Peter Xu [peterx@redhat.com]; Michael S. Tsirkin [mst@redhat.com]
Subject: Re: [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest

What is the best way to add this configuration option?

Aviv.

On Tue, May 24, 2016 at 12:25 PM Jan Kiszka <jan.kiszka@siemens.com<mailto:jan.kiszka@siemens.com>> wrote:
On 2016-05-24 10:14, Jason Wang wrote:
On 2016-05-22 00:19, Aviv B.D wrote:
>> From: "Aviv Ben-David" <bd.aviv@gmail.com<mailto:bd.aviv@gmail.com>>
>>
>> This flag tells the guest to invalidate tlb cache also after unmap
>> operations.
>>
>> Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com<mailto:bd.aviv@gmail.com>>
>> ---
>
> Is this a guest visible behavior?  If yes, shouldn't we cache
> translation fault conditions in IOTLB?

It is guest visible, and this first of all means, besides requiring to
be optional, that it definitely needs to be off for compat systems. Or
it stays off by default.

Jan

--
Siemens AG, Corporate Technology, CT RDA ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/3] IOMMU: Integrate between VFIO and vIOMMU to support device assignment
  2016-05-28 16:10           ` Aviv B.D.
@ 2016-05-28 17:39             ` Alex Williamson
  2016-05-28 18:14               ` Aviv B.D.
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2016-05-28 17:39 UTC (permalink / raw)
  To: Aviv B.D.; +Cc: Jan Kiszka, qemu-devel, Peter Xu, Michael S. Tsirkin

On Sat, 28 May 2016 16:10:55 +0000
"Aviv B.D." <bd.aviv@gmail.com> wrote:

> On Sat, May 28, 2016 at 7:02 PM Alex Williamson <alex.williamson@redhat.com>
> wrote:
> 
> > On Sat, 28 May 2016 10:52:58 +0000
> > "Aviv B.D." <bd.aviv@gmail.com> wrote:
> >  
> > > Hi,
> > > Your idea to search the relevant VTDAddressSpace and call its notifier
> > > will
> > > probably work. Next week I'll try to implement it (for now with the  
> > costly  
> > > scan
> > > of each context).  
> >
> > I think an optimization we can make is to use pci_for_each_bus() and
> > pci_for_each_device() to scan only context entries where devices are
> > present.  Then for each context entry, retrieve the DID, if it matches
> > the invalidation domain_id, retrieve the VTDAddressSpace and perform a
> > memory_region_notify_iommu() using VTDAddressSpace.iommu.  Still
> > horribly inefficient, but an improvement over walking all context
> > entries and avoids gratuitous callbacks between unrelated drivers in
> > QEMU.
> >  
> 
> Thanks for the references on how I can do it. :)
> 
> >
> > Overall, I have very little faith that this will be the only change
> > required to make this work though.  For instance, if a device is added
> > or removed from a domain, where is that accounted for?  Ideally this
> > should trigger the region_add/region_del listener callbacks, but I
> > don't see how that works with how VT-d creates a fixed VTDAddressSpace
> > per device, and in fact how our QEMU memory model doesn't allow the
> > address space of a device to be dynamically aliased against other
> > address spaces or really changed at all.
> >  
> > > I'm still not sure if populating the MemoryRegion will suffice for hot plug
> > > vfio
> > > device but I'll try to look into it.
> > >
> > > As far as I understand the memory_region_iommu_replay function, it still
> > > scans
> > > the whole 64bit address space, and therefore may hang the VM for a long
> > > time.  
> >
> > Then we need to fix that problem, one option might be to make a replay
> > callback on MemoryRegionIOMMUOps that walks the page tables for a given
> > context entry rather than blindly traversing a 64bit address space.  We
> > can't simply ignore the issue by #ifdef'ing out the code.  I suspect
> > there's a lot more involved to make VT-d interact properly with a
> > physical device than what's been proposed so far.  At every
> > invalidation, we need to figure out what's changed and update the host
> > mappings.  We also need better, more dynamic address space management
> > to make the virtual hardware reflect physical hardware when we enable
> > things like passthrough mode or have multiple devices sharing an iommu
> > domain.  I think we're just barely scratching the surface here.  Thanks,
> >
> > Alex
> >  
> 
> 
> I agree with you regarding hotplug, therefore I only ifdef this code out
> and didn't
> delete it. With the call to memory_region_iommu_replay QEMU hangs on startup
> with a very long loop that prevents any device assignment with vIOMMU
> enabled.
> 
> I'm hoping not to enlarge the scope of this patch to include hotplug device
> assignment
> with iommu enabled.

It's not just hotplug, any case where an existing domain can be applied
to a device.  The series is incomplete without such support and I won't
accept any changes into vfio that disables code that's correct in other
contexts.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/3] IOMMU: Integrate between VFIO and vIOMMU to support device assignment
  2016-05-28 17:39             ` Alex Williamson
@ 2016-05-28 18:14               ` Aviv B.D.
  2016-05-28 19:48                 ` Alex Williamson
  0 siblings, 1 reply; 40+ messages in thread
From: Aviv B.D. @ 2016-05-28 18:14 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Jan Kiszka, qemu-devel, Peter Xu, Michael S. Tsirkin

Hi,
As far as I tested, the disabled code (the call to memory_region_iommu_replay)
hangs QEMU on startup if the IOMMU is enabled (scanning a 64-bit address
space takes more than an hour on modern hardware), at least on x86. So the
code is not 100% correct in every context. Maybe it should just be disabled
for the x86 architecture?

By the specification, any such application of a domain to a device should
include cache invalidation if the CM flag is present, so I don't think my
patch breaks this scenario.

Thanks,
Aviv.

On Sat, May 28, 2016 at 8:39 PM Alex Williamson <alex.williamson@redhat.com>
wrote:

> On Sat, 28 May 2016 16:10:55 +0000
> "Aviv B.D." <bd.aviv@gmail.com> wrote:
>
> > On Sat, May 28, 2016 at 7:02 PM Alex Williamson <
> alex.williamson@redhat.com>
> > wrote:
> >
> > > On Sat, 28 May 2016 10:52:58 +0000
> > > "Aviv B.D." <bd.aviv@gmail.com> wrote:
> > >
> > > > Hi,
> > > > Your idea to search the relevant VTDAddressSpace and call its
> notifier
> > > > will
> > > > probably work. Next week I'll try to implement it (for now with the
> > > costly
> > > > scan
> > > > of each context).
> > >
> > > I think an optimization we can make is to use pci_for_each_bus() and
> > > pci_for_each_device() to scan only context entries where devices are
> > > present.  Then for each context entry, retrieve the DID, if it matches
> > > the invalidation domain_id, retrieve the VTDAddressSpace and perform a
> > > memory_region_notify_iommu() using VTDAddressSpace.iommu.  Still
> > > horribly inefficient, but an improvement over walking all context
> > > entries and avoids gratuitous callbacks between unrelated drivers in
> > > QEMU.
> > >
> >
> > Thanks for the references on how I can do it. :)
> >
> > >
> > > Overall, I have very little faith that this will be the only change
> > > required to make this work though.  For instance, if a device is added
> > > or removed from a domain, where is that accounted for?  Ideally this
> > > should trigger the region_add/region_del listener callbacks, but I
> > > don't see how that works with how VT-d creates a fixed VTDAddressSpace
> > > per device, and in fact how our QEMU memory model doesn't allow the
> > > address space of a device to be dynamically aliased against other
> > > address spaces or really changed at all.
> > >
> > > > I'm still not sure if populating the MemoryRegion will suffice for hot
> plug
> > > > vfio
> > > > device but I'll try to look into it.
> > > >
> > > > As far as I understand the memory_region_iommu_replay function, it
> still
> > > > scans
> > > > the whole 64bit address space, and therefore may hang the VM for a
> long
> > > > time.
> > >
> > > Then we need to fix that problem, one option might be to make a replay
> > > callback on MemoryRegionIOMMUOps that walks the page tables for a given
> > > context entry rather than blindly traversing a 64bit address space.  We
> > > can't simply ignore the issue by #ifdef'ing out the code.  I suspect
> > > there's a lot more involved to make VT-d interact properly with a
> > > physical device than what's been proposed so far.  At every
> > > invalidation, we need to figure out what's changed and update the host
> > > mappings.  We also need better, more dynamic address space management
> > > to make the virtual hardware reflect physical hardware when we enable
> > > things like passthrough mode or have multiple devices sharing an iommu
> > > domain.  I think we're just barely scratching the surface here.
> Thanks,
> > >
> > > Alex
> > >
> >
> >
> > I agree with you regarding hotplug, therefore I only ifdef this code out
> > and didn't
> > delete it. With the call to memory_region_iommu_replay QEMU hangs on
> startup
> > with a very long loop that prevents any device assignment with vIOMMU
> > enabled.
> >
> > I'm hoping not to enlarge the scope of this patch to include hotplug
> device
> > assignment
> > with iommu enabled.
>
> It's not just hotplug, any case where an existing domain can be applied
> to a device.  The series is incomplete without such support and I won't
> accept any changes into vfio that disables code that's correct in other
> contexts.  Thanks,
>
> Alex
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/3] IOMMU: Integrate between VFIO and vIOMMU to support device assignment
  2016-05-28 18:14               ` Aviv B.D.
@ 2016-05-28 19:48                 ` Alex Williamson
  2016-06-02 13:09                   ` Aviv B.D.
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2016-05-28 19:48 UTC (permalink / raw)
  To: Aviv B.D.; +Cc: Jan Kiszka, qemu-devel, Peter Xu, Michael S. Tsirkin

On Sat, 28 May 2016 18:14:18 +0000
"Aviv B.D." <bd.aviv@gmail.com> wrote:

> Hi,
> As far as I tested, the disabled code (the call to
> memory_region_iommu_replay) hangs QEMU on startup if the IOMMU is
> enabled (scanning a 64-bit address space takes more than an hour on
> modern hardware), at least on x86. So the code is not 100% correct in
> every context. Maybe it should just be disabled for the x86
> architecture?
> 
> By the specification, any such application of a domain to a device
> should include cache invalidation if the CM flag is present, so I don't
> think my patch breaks this scenario.

The functionality is completely necessary, imagine moving a device from
an IOMMU API domain in the guest back to the passthrough domain, if
there is no replay of the IOMMU context, the device cannot perform any
DMA at all.  The current replay mechanism is obviously not designed for
iterating over every page of a 64bit address space, which is why I
suggest a replay callback on MemoryRegionIOMMUOps so that VT-d can
optimize the replay by walking the VT-d page tables and perhaps
implementation of hardware passthrough mode and the ability to
dynamically switch a device to address_space_memory.  The current
replay code is correct and functional in a context with a window based
IOMMU where the IOMMU address space is much smaller.  We cannot have
correct operation without a mechanism to rebuild the host IOMMU context
when a device is switched to a new domain.  Please address it.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest
  2016-05-21 16:42   ` Jan Kiszka
@ 2016-06-02  8:44     ` Peter Xu
  2016-06-02 13:00       ` Alex Williamson
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Xu @ 2016-06-02  8:44 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Aviv B.D, qemu-devel, Alex Williamson, Michael S. Tsirkin

On Sat, May 21, 2016 at 06:42:03PM +0200, Jan Kiszka wrote:
> On 2016-05-21 18:19, Aviv B.D wrote:
> > From: "Aviv Ben-David" <bd.aviv@gmail.com>
> > 
> > This flag tells the guest to invalidate tlb cache also after unmap operations.
> > 
> > Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
> > ---
> >  hw/i386/intel_iommu.c          | 3 ++-
> >  hw/i386/intel_iommu_internal.h | 1 +
> >  2 files changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index 347718f..1af8da8 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -1949,7 +1949,8 @@ static void vtd_init(IntelIOMMUState *s)
> >      s->iq_last_desc_type = VTD_INV_DESC_NONE;
> >      s->next_frcd_reg = 0;
> >      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
> > -             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS;
> > +             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> > +             VTD_CAP_CM;
> 
> Again, needs to be optional because not all guests will support it or
> behave differently when it's set (I've one that refuses to work).

There should be more than one way to make it optional. Which is
better? What I can think of:

(Assume we have Marcel's "-device intel_iommu" working already)

1. Let the CM bit optional, or say, we need to specify something like
   "-device intel_iommu,cmbit=on" or we will disable CM bit. If we
   have CM disabled but with VFIO device, let QEMU raise error.

2. We automatically detect whether we need CM bit. E.g., if we have
   VFIO and vIOMMU both enabled, we automatically set the bit. Another
   case is maybe we would in the future support nested vIOMMU? If so,
   we can do the same thing for the nested feature.

-- peterx

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest
  2016-06-02  8:44     ` Peter Xu
@ 2016-06-02 13:00       ` Alex Williamson
  2016-06-02 13:14         ` Jan Kiszka
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2016-06-02 13:00 UTC (permalink / raw)
  To: Peter Xu; +Cc: Jan Kiszka, Aviv B.D, qemu-devel, Michael S. Tsirkin

On Thu, 2 Jun 2016 16:44:39 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Sat, May 21, 2016 at 06:42:03PM +0200, Jan Kiszka wrote:
> > On 2016-05-21 18:19, Aviv B.D wrote:  
> > > From: "Aviv Ben-David" <bd.aviv@gmail.com>
> > > 
> > > This flag tells the guest to invalidate tlb cache also after unmap operations.
> > > 
> > > Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
> > > ---
> > >  hw/i386/intel_iommu.c          | 3 ++-
> > >  hw/i386/intel_iommu_internal.h | 1 +
> > >  2 files changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > index 347718f..1af8da8 100644
> > > --- a/hw/i386/intel_iommu.c
> > > +++ b/hw/i386/intel_iommu.c
> > > @@ -1949,7 +1949,8 @@ static void vtd_init(IntelIOMMUState *s)
> > >      s->iq_last_desc_type = VTD_INV_DESC_NONE;
> > >      s->next_frcd_reg = 0;
> > >      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
> > > -             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS;
> > > +             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> > > +             VTD_CAP_CM;  
> > 
> > Again, needs to be optional because not all guests will support it or
> > behave differently when it's set (I've one that refuses to work).  
> 
> There should be more than one way to make it optional. Which is
> better? What I can think of:
> 
> (Assume we have Marcel's "-device intel_iommu" working already)
> 
> 1. Let the CM bit optional, or say, we need to specify something like
>    "-device intel_iommu,cmbit=on" or we will disable CM bit. If we
>    have CM disabled but with VFIO device, let QEMU raise error.
> 
> 2. We automatically detect whether we need CM bit. E.g., if we have
>    VFIO and vIOMMU both enabled, we automatically set the bit. Another
>    case is maybe we would in the future support nested vIOMMU? If so,
>    we can do the same thing for the nested feature.


Why do we need to support VT-d for guests that do not support CM=1?
The VT-d spec indicates that software should be written to handle both
caching modes (6.1).  Granted this is a *should* and not a *must*,
but can't we consider guests that do not support CM=1 incompatible with
emulated VT-d?  If CM=0 needs to be supported then we need to shadow
all of the remapping structures since vfio effectively becomes a cache
of the translations that would otherwise depend on the invalidation of both
present and non-present entries.  What guests do not support CM=1?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/3] IOMMU: Integrate between VFIO and vIOMMU to support device assignment
  2016-05-28 19:48                 ` Alex Williamson
@ 2016-06-02 13:09                   ` Aviv B.D.
  2016-06-02 13:34                     ` Alex Williamson
  0 siblings, 1 reply; 40+ messages in thread
From: Aviv B.D. @ 2016-06-02 13:09 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Jan Kiszka, qemu-devel, Peter Xu, Michael S. Tsirkin

Hi,

In case of hot-plugging a vfio device, there should not be any active
mappings for the device prior to its addition. Also, before it is added to
a guest, the guest should not attach the device to any domain, as the
device is not present.
With CM enabled, the guest must invalidate the domain or the individual
mappings that belong to this new device before any use of those maps.

I'm still not sure that this functionality is necessary on x86, and
currently there is no scenario (for x86) that uses this functionality.

Thanks,
Aviv.

On Sat, May 28, 2016 at 10:48 PM Alex Williamson <alex.williamson@redhat.com>
wrote:

> On Sat, 28 May 2016 18:14:18 +0000
> "Aviv B.D." <bd.aviv@gmail.com> wrote:
>
> > Hi,
> > As far as I tested, the disabled code (the call to
> > memory_region_iommu_replay) hangs QEMU on startup if the IOMMU is
> > enabled (scanning a 64-bit address space takes more than an hour on
> > modern hardware), at least on x86. So the code is not 100% correct in
> > every context. Maybe it should just be disabled for the x86
> > architecture?
> >
> > By the specification, any such application of a domain to a device
> > should include cache invalidation if the CM flag is present, so I
> > don't think my patch breaks this scenario.
>
> The functionality is completely necessary, imagine moving a device from
> an IOMMU API domain in the guest back to the passthrough domain, if
> there is no replay of the IOMMU context, the device cannot perform any
> DMA at all.  The current replay mechanism is obviously not designed for
> iterating over every page of a 64bit address space, which is why I
> suggest a replay callback on MemoryRegionIOMMUOps so that VT-d can
> optimize the replay by walking the VT-d page tables and perhaps
> implementation of hardware passthrough mode and the ability to
> dynamically switch a device to address_space_memory.  The current
> replay code is correct and functional in a context with a window based
> IOMMU where the IOMMU address space is much smaller.  We cannot have
> correct operation without a mechanism to rebuild the host IOMMU context
> when a device is switched to a new domain.  Please address it.  Thanks,
>
> Alex
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest
  2016-06-02 13:00       ` Alex Williamson
@ 2016-06-02 13:14         ` Jan Kiszka
  2016-06-02 13:17           ` Jan Kiszka
                             ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Jan Kiszka @ 2016-06-02 13:14 UTC (permalink / raw)
  To: Alex Williamson, Peter Xu; +Cc: Aviv B.D, qemu-devel, Michael S. Tsirkin

[-- Attachment #1: Type: text/plain, Size: 3012 bytes --]

On 2016-06-02 15:00, Alex Williamson wrote:
> On Thu, 2 Jun 2016 16:44:39 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
>> On Sat, May 21, 2016 at 06:42:03PM +0200, Jan Kiszka wrote:
>>> On 2016-05-21 18:19, Aviv B.D wrote:  
>>>> From: "Aviv Ben-David" <bd.aviv@gmail.com>
>>>>
>>>> This flag tells the guest to invalidate tlb cache also after unmap operations.
>>>>
>>>> Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
>>>> ---
>>>>  hw/i386/intel_iommu.c          | 3 ++-
>>>>  hw/i386/intel_iommu_internal.h | 1 +
>>>>  2 files changed, 3 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>> index 347718f..1af8da8 100644
>>>> --- a/hw/i386/intel_iommu.c
>>>> +++ b/hw/i386/intel_iommu.c
>>>> @@ -1949,7 +1949,8 @@ static void vtd_init(IntelIOMMUState *s)
>>>>      s->iq_last_desc_type = VTD_INV_DESC_NONE;
>>>>      s->next_frcd_reg = 0;
>>>>      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
>>>> -             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS;
>>>> +             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
>>>> +             VTD_CAP_CM;  
>>>
>>> Again, needs to be optional because not all guests will support it or
>>> behave differently when it's set (I've one that refuses to work).  
>>
>> There should be more than one way to make it optional. Which is
>> better? What I can think of:
>>
>> (Assume we have Marcel's "-device intel_iommu" working already)
>>
>> 1. Let the CM bit optional, or say, we need to specify something like
>>    "-device intel_iommu,cmbit=on" or we will disable CM bit. If we
>>    have CM disabled but with VFIO device, let QEMU raise error.
>>
>> 2. We automatically detect whether we need CM bit. E.g., if we have
>>    VFIO and vIOMMU both enabled, we automatically set the bit. Another
>>    case is maybe we would in the future support nested vIOMMU? If so,
>>    we can do the same thing for the nested feature.
> 
> 
> Why do we need to support VT-d for guests that do not support CM=1?
> The VT-d spec indicates that software should be written to handle both
> caching modes (6.1).  Granted this is a *should* and not a *must*,
> but can't we consider guests that do not support CM=1 incompatible with
> emulated VT-d?  If CM=0 needs to be supported then we need to shadow
> all of the remapping structures since vfio effectively becomes a cache
> > of the translations that would otherwise depend on the invalidation of
> > both present and non-present entries.  What guests do not support CM=1?  Thanks,

- there is at least one guest that does not support CM=1 yet (Jailhouse)
- there might be more or there might be broken ones as hardware
  generally doesn't have CM=1, thus this case is typically untested
- an AMD IOMMU (to my current understanding) will require shadowing
  anyway, as it has no comparable concept; thus we will eventually be
  able to use that strategy also for VT-d

Jan



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest
  2016-06-02 13:14         ` Jan Kiszka
@ 2016-06-02 13:17           ` Jan Kiszka
  2016-06-02 16:15           ` Michael S. Tsirkin
  2016-06-06  5:04           ` Peter Xu
  2 siblings, 0 replies; 40+ messages in thread
From: Jan Kiszka @ 2016-06-02 13:17 UTC (permalink / raw)
  To: Alex Williamson, Peter Xu; +Cc: Aviv B.D, qemu-devel, Michael S. Tsirkin

[-- Attachment #1: Type: text/plain, Size: 3308 bytes --]

On 2016-06-02 15:14, Jan Kiszka wrote:
> On 2016-06-02 15:00, Alex Williamson wrote:
>> On Thu, 2 Jun 2016 16:44:39 +0800
>> Peter Xu <peterx@redhat.com> wrote:
>>
>>> On Sat, May 21, 2016 at 06:42:03PM +0200, Jan Kiszka wrote:
>>>> On 2016-05-21 18:19, Aviv B.D wrote:  
>>>>> From: "Aviv Ben-David" <bd.aviv@gmail.com>
>>>>>
>>>>> This flag tells the guest to invalidate tlb cache also after unmap operations.
>>>>>
>>>>> Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
>>>>> ---
>>>>>  hw/i386/intel_iommu.c          | 3 ++-
>>>>>  hw/i386/intel_iommu_internal.h | 1 +
>>>>>  2 files changed, 3 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>>> index 347718f..1af8da8 100644
>>>>> --- a/hw/i386/intel_iommu.c
>>>>> +++ b/hw/i386/intel_iommu.c
>>>>> @@ -1949,7 +1949,8 @@ static void vtd_init(IntelIOMMUState *s)
>>>>>      s->iq_last_desc_type = VTD_INV_DESC_NONE;
>>>>>      s->next_frcd_reg = 0;
>>>>>      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
>>>>> -             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS;
>>>>> +             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
>>>>> +             VTD_CAP_CM;  
>>>>
>>>> Again, needs to be optional because not all guests will support it or
>>>> behave differently when it's set (I've one that refuses to work).  
>>>
>>> There should be more than one way to make it optional. Which is
>>> better? What I can think of:
>>>
>>> (Assume we have Marcel's "-device intel_iommu" working already)
>>>
>>> 1. Let the CM bit optional, or say, we need to specify something like
>>>    "-device intel_iommu,cmbit=on" or we will disable CM bit. If we
>>>    have CM disabled but with VFIO device, let QEMU raise error.
>>>
>>> 2. We automatically detect whether we need CM bit. E.g., if we have
>>>    VFIO and vIOMMU both enabled, we automatically set the bit. Another
>>>    case is maybe we would in the future support nested vIOMMU? If so,
>>>    we can do the same thing for the nested feature.
>>
>>
>> Why do we need to support VT-d for guests that do not support CM=1?
>> The VT-d spec indicates that software should be written to handle both
>> caching modes (6.1).  Granted this is a *should* and not a *must*,
>> but can't we consider guests that do not support CM=1 incompatible with
>> emulated VT-d?  If CM=0 needs to be supported then we need to shadow
>> all of the remapping structures since vfio effectively becomes a cache
>> of the translations that would otherwise depend on the invalidation of
>> both present and non-present entries.  What guests do not support CM=1?  Thanks,
> 
> - there is at least one guest that does not support CM=1 yet (Jailhouse)
> - there might be more or there might be broken ones as hardware
>   generally doesn't have CM=1, thus this case is typically untested
> - an AMD IOMMU (to my current understanding) will require shadowing
>   anyway, as it has no comparable concept; thus we will eventually be
>   able to use that strategy also for VT-d

- more accurate hardware emulation, ie. less differences between
  modelled and real behaviour
  (one reason IR will be optional with VT-d because the Q35 chipset
  didn't include it)

Jan



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/3] IOMMU: Integrate between VFIO and vIOMMU to support device assignment
  2016-06-02 13:09                   ` Aviv B.D.
@ 2016-06-02 13:34                     ` Alex Williamson
  2016-06-06  8:09                       ` Peter Xu
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2016-06-02 13:34 UTC (permalink / raw)
  To: Aviv B.D.; +Cc: Jan Kiszka, qemu-devel, Peter Xu, Michael S. Tsirkin

On Thu, 02 Jun 2016 13:09:27 +0000
"Aviv B.D." <bd.aviv@gmail.com> wrote:

> Hi,
> 
> In case of hot-plugging a vfio device, there should not be any active
> mappings for the device prior to its addition.

Counterexample - a device is hot-added to a guest booted with iommu=pt.

> Also, before it is added to a guest, the guest should not attach the
> device to any domain, as the device is not present.

The static identity map domain (i.e. the passthrough domain) can precede the
device existing.

> With CM enabled the guest must invalidate the domain or individual mappings
> that belong to this new device before any use of those maps.
> 
> I'm still not sure that this functionality is necessary on x86, and
> currently there is no scenario (for x86) that uses this functionality.

Clearly I disagree, it is necessary.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest
  2016-06-02 13:14         ` Jan Kiszka
  2016-06-02 13:17           ` Jan Kiszka
@ 2016-06-02 16:15           ` Michael S. Tsirkin
  2016-06-06  5:04           ` Peter Xu
  2 siblings, 0 replies; 40+ messages in thread
From: Michael S. Tsirkin @ 2016-06-02 16:15 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Alex Williamson, Peter Xu, Aviv B.D, qemu-devel

On Thu, Jun 02, 2016 at 03:14:36PM +0200, Jan Kiszka wrote:
> On 2016-06-02 15:00, Alex Williamson wrote:
> > On Thu, 2 Jun 2016 16:44:39 +0800
> > Peter Xu <peterx@redhat.com> wrote:
> > 
> >> On Sat, May 21, 2016 at 06:42:03PM +0200, Jan Kiszka wrote:
> >>> On 2016-05-21 18:19, Aviv B.D wrote:  
> >>>> From: "Aviv Ben-David" <bd.aviv@gmail.com>
> >>>>
> >>>> This flag tells the guest to invalidate tlb cache also after unmap operations.
> >>>>
> >>>> Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
> >>>> ---
> >>>>  hw/i386/intel_iommu.c          | 3 ++-
> >>>>  hw/i386/intel_iommu_internal.h | 1 +
> >>>>  2 files changed, 3 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> >>>> index 347718f..1af8da8 100644
> >>>> --- a/hw/i386/intel_iommu.c
> >>>> +++ b/hw/i386/intel_iommu.c
> >>>> @@ -1949,7 +1949,8 @@ static void vtd_init(IntelIOMMUState *s)
> >>>>      s->iq_last_desc_type = VTD_INV_DESC_NONE;
> >>>>      s->next_frcd_reg = 0;
> >>>>      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
> >>>> -             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS;
> >>>> +             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> >>>> +             VTD_CAP_CM;  
> >>>
> >>> Again, needs to be optional because not all guests will support it or
> >>> behave differently when it's set (I've one that refuses to work).  
> >>
> >> There should be more than one way to make it optional. Which is
> >> better? What I can think of:
> >>
> >> (Assume we have Marcel's "-device intel_iommu" working already)
> >>
> >> 1. Let the CM bit optional, or say, we need to specify something like
> >>    "-device intel_iommu,cmbit=on" or we will disable CM bit. If we
> >>    have CM disabled but with VFIO device, let QEMU raise error.
> >>
> >> 2. We automatically detect whether we need CM bit. E.g., if we have
> >>    VFIO and vIOMMU both enabled, we automatically set the bit. Another
> >>    case is maybe we would in the future support nested vIOMMU? If so,
> >>    we can do the same thing for the nested feature.
> > 
> > 
> > Why do we need to support VT-d for guests that do not support CM=1?

I don't think we need to do this. The spec is rather clear on this point.
Just fix the guests.
Support for CM=0 might still be useful because CM=1 is not required if
the device is restartable (most software devices would be).

I prefer cmbit=on by default, with cmbit=off, don't allow VFIO.


> > The VT-d spec indicates that software should be written to handle both
> > caching modes (6.1).  Granted this is a *should* and not a *must*,
> > but can't we consider guests that do not support CM=1 incompatible with
> > emulated VT-d?  If CM=0 needs to be supported then we need to shadow
> > all of the remapping structures since vfio effectively becomes a cache
> > of the translations that would otherwise depend on the invalidation of
> > both present and non-present entries.  What guests do not support CM=1?  Thanks,
> 
> - there is at least one guest that does not support CM=1 yet (Jailhouse)
> - there might be more or there might be broken ones as hardware
>   generally doesn't have CM=1, thus this case is typically untested
> - an AMD IOMMU (to my current understanding) will require shadowing
>   anyway, as it has no comparable concept,


I was rather sure this is it:
	26 NpCache: not present table entries cached. RO. Reset Xb. 1=Indicates
	that the IOMMU caches page table entries that are marked as not present.
	When this bit is set, software must issue an invalidate after any change
	to a PDE or PTE. 0=Indicates that the IOMMU caches only page table
	entries that are marked as present. When NpCache is clear, software must
	issue an invalidate after any change to a PDE or PTE marked present
	before the change.  Implementation note: For hardware implementations of
	the IOMMU, this bit must be 0b.

this implementation seems to set this bit.
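For what it's worth, the quoted text boils down to a single capability bit; here is a minimal sketch of the check (the bit position comes from the spec excerpt above, but the accessor name is made up for illustration and is not the real AMD IOMMU driver API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 26: NpCache, per the AMD IOMMU spec text quoted above.  When set,
 * software must invalidate after *any* PDE/PTE change, not only after
 * changes to entries marked present. */
#define AMD_IOMMU_NPCACHE_BIT 26

static bool np_cache_set(uint64_t cap_reg)
{
    return (cap_reg >> AMD_IOMMU_NPCACHE_BIT) & 1;
}
```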


> thus we will eventually be
>   able to use that strategy also for VT-d
> 
> Jan
> 
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest
  2016-06-02 13:14         ` Jan Kiszka
  2016-06-02 13:17           ` Jan Kiszka
  2016-06-02 16:15           ` Michael S. Tsirkin
@ 2016-06-06  5:04           ` Peter Xu
  2016-06-06 13:11             ` Alex Williamson
  2 siblings, 1 reply; 40+ messages in thread
From: Peter Xu @ 2016-06-06  5:04 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Alex Williamson, Aviv B.D, qemu-devel, Michael S. Tsirkin

On Thu, Jun 02, 2016 at 03:14:36PM +0200, Jan Kiszka wrote:
> On 2016-06-02 15:00, Alex Williamson wrote:
> > On Thu, 2 Jun 2016 16:44:39 +0800
> > Peter Xu <peterx@redhat.com> wrote:
[...]
> >> There should be more than one way to make it optional. Which is
> >> better? What I can think of:
> >>
> >> (Assume we have Marcel's "-device intel_iommu" working already)
> >>
> >> 1. Let the CM bit optional, or say, we need to specify something like
> >>    "-device intel_iommu,cmbit=on" or we will disable CM bit. If we
> >>    have CM disabled but with VFIO device, let QEMU raise error.
> >>
> >> 2. We automatically detect whether we need CM bit. E.g., if we have
> >>    VFIO and vIOMMU both enabled, we automatically set the bit. Another
> >>    case is maybe we would in the future support nested vIOMMU? If so,
> >>    we can do the same thing for the nested feature.
> > 
> > 
> > Why do we need to support VT-d for guests that do not support CM=1?
> > The VT-d spec indicates that software should be written to handle both
> > caching modes (6.1).  Granted this is a *should* and not a *must*,
> > but can't we consider guests that do not support CM=1 incompatible with
> > emulated VT-d?  If CM=0 needs to be supported then we need to shadow
> > all of the remapping structures since vfio effectively becomes a cache
> > of the translations that would otherwise depend on the invalidation of both present
> > and non-present entries.  What guests do not support CM=1?  Thanks,
> 
> - there is at least one guest that does not support CM=1 yet (Jailhouse)

Besides the reason that there might be guests that do not support
CM=1, will there be performance considerations? When the user's
configuration does not require the CM capability (e.g., a generic VM
configuration without VFIO), shall we allow the user to disable the CM
bit so that we can get better IOMMU performance (avoiding extra,
useless invalidations)?

-- peterx


* Re: [Qemu-devel] [PATCH v3 2/3] IOMMU: change iommu_op->translate's is_write to flags, add support to NO_FAIL flag mode
  2016-05-21 16:19 ` [Qemu-devel] [PATCH v3 2/3] IOMMU: change iommu_op->translate's is_write to flags, add support to NO_FAIL flag mode Aviv B.D
@ 2016-06-06  5:04   ` Peter Xu
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Xu @ 2016-06-06  5:04 UTC (permalink / raw)
  To: Aviv B.D; +Cc: qemu-devel, Michael S. Tsirkin, Alex Williamson, Jan Kiszka

On Sat, May 21, 2016 at 07:19:49PM +0300, Aviv B.D wrote:
[...]
>  static void vtd_record_frcd(IntelIOMMUState *s, uint16_t index,
>                              uint16_t source_id, hwaddr addr,
> -                            VTDFaultReason fault, bool is_write)
> +                            VTDFaultReason fault, IOMMUAccessFlags flags)
>  {
>      uint64_t hi = 0, lo;
>      hwaddr frcd_reg_addr = DMAR_FRCD_REG_OFFSET + (((uint64_t)index) << 4);
> @@ -365,7 +365,7 @@ static void vtd_record_frcd(IntelIOMMUState *s, uint16_t index,
>  
>      lo = VTD_FRCD_FI(addr);
>      hi = VTD_FRCD_SID(source_id) | VTD_FRCD_FR(fault);
> -    if (!is_write) {
> +    if (!(flags == IOMMU_WO || flags == IOMMU_RW)) {

Is it possible that we have IOMMU_RW here? I see this flag is only
used when DMAR is disabled and we assume all addresses are RW; no other
place uses it. If so, I'd suggest we make it simpler, like:

     if (!(flags == IOMMU_WO)) {

>          hi |= VTD_FRCD_T;
>      }
>      vtd_set_quad_raw(s, frcd_reg_addr, lo);
> @@ -396,7 +396,7 @@ static bool vtd_try_collapse_fault(IntelIOMMUState *s, uint16_t source_id)
>  /* Log and report an DMAR (address translation) fault to software */
>  static void vtd_report_dmar_fault(IntelIOMMUState *s, uint16_t source_id,
>                                    hwaddr addr, VTDFaultReason fault,
> -                                  bool is_write)
> +                                  IOMMUAccessFlags flags)
>  {
>      uint32_t fsts_reg = vtd_get_long_raw(s, DMAR_FSTS_REG);
>  
> @@ -407,7 +407,7 @@ static void vtd_report_dmar_fault(IntelIOMMUState *s, uint16_t source_id,
>          return;
>      }
>      VTD_DPRINTF(FLOG, "sid 0x%"PRIx16 ", fault %d, addr 0x%"PRIx64
> -                ", is_write %d", source_id, fault, addr, is_write);
> +                ", flags %d", source_id, fault, addr, flags);
>      if (fsts_reg & VTD_FSTS_PFO) {
>          VTD_DPRINTF(FLOG, "new fault is not recorded due to "
>                      "Primary Fault Overflow");
> @@ -425,7 +425,7 @@ static void vtd_report_dmar_fault(IntelIOMMUState *s, uint16_t source_id,
>          return;
>      }
>  
> -    vtd_record_frcd(s, s->next_frcd_reg, source_id, addr, fault, is_write);
> +    vtd_record_frcd(s, s->next_frcd_reg, source_id, addr, fault, flags);
>  
>      if (fsts_reg & VTD_FSTS_PPF) {
>          VTD_DPRINTF(FLOG, "there are pending faults already, "
> @@ -621,7 +621,7 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
>  /* Given the @gpa, get relevant @slptep. @slpte_level will be the last level
>   * of the translation, can be used for deciding the size of large page.
>   */
> -static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
> +static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, IOMMUAccessFlags flags,
>                              uint64_t *slptep, uint32_t *slpte_level,
>                              bool *reads, bool *writes)
>  {
> @@ -641,7 +641,20 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
>      }
>  
>      /* FIXME: what is the Atomics request here? */
> -    access_right_check = is_write ? VTD_SL_W : VTD_SL_R;
> +    switch(flags){
> +        case IOMMU_WO:
> +            access_right_check = VTD_SL_W;
> +            break;
> +        case IOMMU_RO:
> +            access_right_check = VTD_SL_R;
> +            break;
> +        case IOMMU_RW: /* passthrow */

Same as above. Remove this line? (btw, it's "passthrough" :)

> +        case IOMMU_NO_FAIL:
> +            access_right_check = VTD_SL_R | VTD_SL_W;
> +            break;
> +        default:
> +            assert(0);
> +    }
>  
>      while (true) {
>          offset = vtd_gpa_level_offset(gpa, level);
> @@ -663,8 +676,8 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
>          if (!(slpte & access_right_check)) {
>              VTD_DPRINTF(GENERAL, "error: lack of %s permission for "
>                          "gpa 0x%"PRIx64 " slpte 0x%"PRIx64,
> -                        (is_write ? "write" : "read"), gpa, slpte);
> -            return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
> +                        (flags == IOMMU_WO ? "write" : "read"), gpa, slpte);
> +            return (flags == IOMMU_RW || flags == IOMMU_WO) ? -VTD_FR_WRITE : -VTD_FR_READ;

Same. (Please look for all the IOMMU_RW uses; I'll stop here.)

>          }
>          if (vtd_slpte_nonzero_rsvd(slpte, level)) {
>              VTD_DPRINTF(GENERAL, "error: non-zero reserved field in second "
> @@ -781,11 +794,11 @@ static inline bool vtd_is_interrupt_addr(hwaddr addr)
>   *
>   * @bus_num: The bus number
>   * @devfn: The devfn, which is the  combined of device and function number
> - * @is_write: The access is a write operation
> + * @flags: The access is a write operation

Please change comments after "flags:" correspondingly.
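Taken together, the suggestions above would reduce the switch to something like the sketch below (the enum values and VTD_SL_* bits are stand-ins rather than the actual QEMU definitions; IOMMU_NO_FAIL is the new flag from this series):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for QEMU's IOMMUAccessFlags plus the proposed IOMMU_NO_FAIL. */
typedef enum {
    IOMMU_NONE,
    IOMMU_RO,
    IOMMU_WO,
    IOMMU_RW,
    IOMMU_NO_FAIL,   /* translate must not raise a fault to the guest */
} IOMMUAccessFlags;

/* Stand-ins for the VT-d second-level PTE permission bits. */
#define VTD_SL_R 1ULL
#define VTD_SL_W 2ULL

static uint64_t access_right_check(IOMMUAccessFlags flags)
{
    switch (flags) {
    case IOMMU_WO:
        return VTD_SL_W;
    case IOMMU_RO:
        return VTD_SL_R;
    case IOMMU_NO_FAIL:
        /* Accept either permission so the walk itself cannot fault. */
        return VTD_SL_R | VTD_SL_W;
    default:
        /* Per the review, IOMMU_RW should never reach translate(). */
        assert(0);
        return 0;
    }
}
```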

-- peterx


* Re: [Qemu-devel] [PATCH v3 3/3] IOMMU: Integrate between VFIO and vIOMMU to support device assignment
  2016-05-23 17:53   ` Alex Williamson
  2016-05-26 20:58     ` Alex Williamson
@ 2016-06-06  7:38     ` Peter Xu
  2016-06-06 17:30       ` Alex Williamson
  1 sibling, 1 reply; 40+ messages in thread
From: Peter Xu @ 2016-06-06  7:38 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Aviv B.D, qemu-devel, Michael S. Tsirkin, Jan Kiszka

Some questions not quite related to this patch content but vfio...

On Mon, May 23, 2016 at 11:53:42AM -0600, Alex Williamson wrote:
> On Sat, 21 May 2016 19:19:50 +0300
> "Aviv B.D" <bd.aviv@gmail.com> wrote:

[...]

> > +#if 0
> >  static hwaddr vfio_container_granularity(VFIOContainer *container)
> >  {
> >      return (hwaddr)1 << ctz64(container->iova_pgsizes);
> >  }
> > -
> > +#endif

Here we are fetching the smallest page size that the host IOMMU
supports, so even if the host IOMMU supports large pages, they will not
be used as long as the guest has enabled the vIOMMU, right?

> 
> 
> Clearly this is unacceptable, the code has a purpose.
> 
> >  static void vfio_listener_region_add(MemoryListener *listener,
> >                                       MemoryRegionSection *section)
> >  {
> > @@ -384,11 +387,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >          giommu->n.notify = vfio_iommu_map_notify;
> >          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
> >  
> > +        vtd_register_giommu(giommu);
> 
> vfio will not assume VT-d, this is why we register the notifier below.
> 
> >          memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> > +#if 0
> >          memory_region_iommu_replay(giommu->iommu, &giommu->n,
> >                                     vfio_container_granularity(container),
> >                                     false);

For memory_region_iommu_replay(), we are using
vfio_container_granularity() as the granularity, which is the host
IOMMU page size. However inside it:

void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n,
                                hwaddr granularity, bool is_write)
{
    hwaddr addr;
    IOMMUTLBEntry iotlb;

    for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
        iotlb = mr->iommu_ops->translate(mr, addr, is_write);
        if (iotlb.perm != IOMMU_NONE) {
            n->notify(n, &iotlb);
        }

        /* if (2^64 - MR size) < granularity, it's possible to get an
         * infinite loop here.  This should catch such a wraparound */
        if ((addr + granularity) < addr) {
            break;
        }
    }
}

Is it possible that the iotlb maps to a large page (or any page size
that is not the same as the granularity)? The above code seems to assume
that the host and guest IOMMUs have the same page size == granularity?
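To make the concern concrete, here is a toy model of that loop (nothing below is QEMU API; the translate function and sizes are invented): a single 2 MiB guest mapping, replayed at a 4 KiB granularity, is reported as 512 separate notifications rather than one large-page entry:

```c
#include <assert.h>
#include <stdint.h>

#define KiB 1024ULL
#define MiB (1024ULL * KiB)

/* Toy translate: pretend the guest IOMMU has one 2 MiB large-page
 * mapping at IOVA 0.  Returns nonzero if the address is mapped. */
static int toy_translate(uint64_t addr)
{
    return addr < 2 * MiB;
}

/* Mimics the shape of memory_region_iommu_replay(): a fixed-granularity
 * walk that emits one notification per mapped granule. */
static unsigned replay(uint64_t region_size, uint64_t granularity)
{
    unsigned notifies = 0;

    for (uint64_t addr = 0; addr < region_size; addr += granularity) {
        if (toy_translate(addr)) {
            notifies++;   /* one notify per granule, even inside a large page */
        }
        /* wraparound guard, as in the original */
        if (addr + granularity < addr) {
            break;
        }
    }
    return notifies;
}
```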

Thanks,

-- peterx


* Re: [Qemu-devel] [PATCH v3 3/3] IOMMU: Integrate between VFIO and vIOMMU to support device assignment
  2016-06-02 13:34                     ` Alex Williamson
@ 2016-06-06  8:09                       ` Peter Xu
  2016-06-06 18:21                         ` Alex Williamson
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Xu @ 2016-06-06  8:09 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Aviv B.D., Jan Kiszka, qemu-devel, Michael S. Tsirkin

On Thu, Jun 02, 2016 at 07:34:17AM -0600, Alex Williamson wrote:
> On Thu, 02 Jun 2016 13:09:27 +0000
> "Aviv B.D." <bd.aviv@gmail.com> wrote:
> 
> > Hi,
> > 
> > In case of hot plug vfio device there should not be any active mapping
> > to this device prior to the device addition.
> 
> Counter example - a device is hot added to a guest booted with iommu=pt.

I have the same question as Aviv...

For hot-plug devices, even if they are using iommu=pt, shouldn't we
still follow the steps of first initializing the vfio device, then
configuring the device context entry? Let me list the steps for device
addition in case I made any mistake:

1. user add new VFIO device A

2. vfio_listener_region_add() called for device A on the IOMMU mr,
   here we should create the iommu notifier. However since the context
   entry still does not exist, memory_region_iommu_replay() will get
   all invalid IOTLB (IOMMU_NONE entries)

3. guest kernel found the device, enabled the device, filled in
   context entry for device A with "pass-through" (so the SLPTPTR is
   invalid)

4. guest sent context invalidation to QEMU vIOMMU since we have CM=1
   set for guest vIOMMU

5. QEMU vIOMMU handle the invalidation, trigger VFIO notify to do
   correct VFIO mapping for device A

Though step 5 here is still missing (IIUC, Aviv's patch 3 does not yet
handle context invalidation). I just want to know whether we can
avoid the replay operation for Intel vIOMMUs (for Intel only, because
Intel has context invalidation and cache mode support; I'm not sure
about other platforms)?
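A toy model of steps 2-5 (all names invented; this is just the ordering argument expressed in code): the replay at region_add time maps nothing because the context entry is still empty, and the VFIO mapping is only established once the guest's context invalidation is handled:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy vIOMMU/VFIO state for a single device A. */
static bool context_entry_present; /* filled in by the guest in step 3 */
static bool vfio_mapping_done;     /* outcome of step 5 */

/* Step 2: vfio_listener_region_add() replays, but the context entry
 * does not exist yet, so every lookup is IOMMU_NONE and nothing maps. */
static void replay_on_region_add(void)
{
    if (context_entry_present) {
        vfio_mapping_done = true;
    }
}

/* Steps 4-5: with CM=1 the guest must send a context invalidation,
 * which is the vIOMMU's chance to install the real VFIO mapping. */
static void handle_context_invalidation(void)
{
    if (context_entry_present) {
        vfio_mapping_done = true;
    }
}
```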

Thanks,

-- peterx


* Re: [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest
  2016-06-06  5:04           ` Peter Xu
@ 2016-06-06 13:11             ` Alex Williamson
  2016-06-06 13:43               ` Peter Xu
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2016-06-06 13:11 UTC (permalink / raw)
  To: Peter Xu; +Cc: Jan Kiszka, Aviv B.D, qemu-devel, Michael S. Tsirkin

On Mon, 6 Jun 2016 13:04:07 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Thu, Jun 02, 2016 at 03:14:36PM +0200, Jan Kiszka wrote:
> > On 2016-06-02 15:00, Alex Williamson wrote:  
> > > On Thu, 2 Jun 2016 16:44:39 +0800
> > > Peter Xu <peterx@redhat.com> wrote:  
> [...]
> > >> There should be more than one way to make it optional. Which is
> > >> better? What I can think of:
> > >>
> > >> (Assume we have Marcel's "-device intel_iommu" working already)
> > >>
> > >> 1. Let the CM bit optional, or say, we need to specify something like
> > >>    "-device intel_iommu,cmbit=on" or we will disable CM bit. If we
> > >>    have CM disabled but with VFIO device, let QEMU raise error.
> > >>
> > >> 2. We automatically detect whether we need CM bit. E.g., if we have
> > >>    VFIO and vIOMMU both enabled, we automatically set the bit. Another
> > >>    case is maybe we would in the future support nested vIOMMU? If so,
> > >>    we can do the same thing for the nested feature.  
> > > 
> > > 
> > > Why do we need to support VT-d for guests that do not support CM=1?
> > > The VT-d spec indicates that software should be written to handle both
> > > caching modes (6.1).  Granted this is a *should* and not a *must*,
> > > but can't we consider guests that do not support CM=1 incompatible with
> > > emulated VT-d?  If CM=0 needs to be supported then we need to shadow
> > > all of the remapping structures since vfio effectively becomes a cache
> > > of the translations that would otherwise depend on the invalidation of both present
> > > and non-present entries.  What guests do not support CM=1?  Thanks,  
> > 
> > - there is at least one guest that does not support CM=1 yet (Jailhouse)  
> 
> Besides the reason that there might be guests that do not support
> CM=1, will there be performance considerations? When user's
> configuration does not require CM capability (e.g., generic VM
> configuration, without VFIO), shall we allow user to disable the CM
> bit so that we can have better IOMMU performance (avoid extra and
> useless invalidations)?

With Alexey's proposed patch to have callback ops when the iommu
notifier list adds its first entry and removes its last, any of the
additional overhead to generate notifies when nobody is listening can
be avoided.  These same callbacks would be the ones that need to
generate a hw_error if a notifier is added while running in CM=0.
Thanks,

Alex


* Re: [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest
  2016-06-06 13:11             ` Alex Williamson
@ 2016-06-06 13:43               ` Peter Xu
  2016-06-06 17:02                 ` Alex Williamson
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Xu @ 2016-06-06 13:43 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Jan Kiszka, Aviv B.D, qemu-devel, Michael S. Tsirkin

On Mon, Jun 06, 2016 at 07:11:41AM -0600, Alex Williamson wrote:
> On Mon, 6 Jun 2016 13:04:07 +0800
> Peter Xu <peterx@redhat.com> wrote:
[...]
> > Besides the reason that there might be guests that do not support
> > CM=1, will there be performance considerations? When user's
> > configuration does not require CM capability (e.g., generic VM
> > configuration, without VFIO), shall we allow user to disable the CM
> > bit so that we can have better IOMMU performance (avoid extra and
> > useless invalidations)?
> 
> With Alexey's proposed patch to have callback ops when the iommu
> notifier list adds its first entry and removes its last, any of the
> additional overhead to generate notifies when nobody is listening can
> be avoided.  These same callbacks would be the ones that need to
> generate a hw_error if a notifier is added while running in CM=0.

Not familiar with Alexey's patch, but is that for VFIO only? I mean, if
we configure CMbit=1, the guest kernel will send an invalidation
request every time it creates new entries (context entries, or iotlb
entries). Even without VFIO notifiers, the guest needs to trap into QEMU
and process the invalidation requests. This is avoidable if we are not
using VFIO devices at all (so there is no need to maintain any mappings),
right?

If we allow user to specify cmbit={0|1}, user can decide whether
he/she would like to take this benefit.

Thanks,

-- peterx


* Re: [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest
  2016-06-06 13:43               ` Peter Xu
@ 2016-06-06 17:02                 ` Alex Williamson
  2016-06-07  3:20                   ` Peter Xu
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2016-06-06 17:02 UTC (permalink / raw)
  To: Peter Xu; +Cc: Jan Kiszka, Aviv B.D, qemu-devel, Michael S. Tsirkin

On Mon, 6 Jun 2016 21:43:17 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Mon, Jun 06, 2016 at 07:11:41AM -0600, Alex Williamson wrote:
> > On Mon, 6 Jun 2016 13:04:07 +0800
> > Peter Xu <peterx@redhat.com> wrote:  
> [...]
> > > Besides the reason that there might be guests that do not support
> > > CM=1, will there be performance considerations? When user's
> > > configuration does not require CM capability (e.g., generic VM
> > > configuration, without VFIO), shall we allow user to disable the CM
> > > bit so that we can have better IOMMU performance (avoid extra and
> > > useless invalidations)?  
> > 
> > With Alexey's proposed patch to have callback ops when the iommu
> > notifier list adds its first entry and removes its last, any of the
> > additional overhead to generate notifies when nobody is listening can
> > be avoided.  These same callbacks would be the ones that need to
> > generate a hw_error if a notifier is added while running in CM=0.  
> 
> Not familiar with Alexey's patch

https://lists.nongnu.org/archive/html/qemu-devel/2016-06/msg00079.html

>, but is that for VFIO only?

vfio is currently the only user of the iommu notifier, but the
interface is generic, which is how it should (must) be.

> I mean, if
> we configured CMbit=1, guest kernel will send invalidation request
> every time it creates new entries (context entries, or iotlb
> entries). Even without VFIO notifiers, the guest needs to trap into QEMU
> and process the invalidation requests. This is avoidable if we are not
> using VFIO devices at all (so no need to maintain any mappings),
> right?

CM=1 only defines that not-present and invalid entries can be cached,
any changes to existing entries requires an invalidation regardless of
CM.  What you're looking for sounds more like ECAP.C:

C: Page-walk Coherency
  This field indicates if hardware access to the root, context,
  extended-context and interrupt-remap tables, and second-level paging
  structures for requests-without PASID, are coherent (snooped) or not.
    • 0: Indicates hardware accesses to remapping structures are non-coherent.
    • 1: Indicates hardware accesses to remapping structures are coherent.

Without both CM=0 and C=0, our only virtualization mechanism for
maintaining a hardware cache coherent with the guest view of the iommu
would be to shadow all of the VT-d structures.  For purely emulated
devices, maybe we can get away with that, but I doubt the current
ghashes used for the iotlb are prepared for it.
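For reference, the two bits being contrasted here live in different registers; a sketch of the checks (bit positions taken from the VT-d spec as quoted: CM is bit 7 of the capability register, C is bit 0 of the extended capability register; worth double-checking against the spec revision you target):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CAP_REG bit 7: Caching Mode - not-present/invalid entries may be
 * cached, so the guest must invalidate even non-present updates. */
#define VTD_CAP_CM_BIT  7
/* ECAP_REG bit 0: Page-walk Coherency - hardware accesses to the
 * remapping structures are snooped (coherent). */
#define VTD_ECAP_C_BIT  0

static bool vtd_caching_mode(uint64_t cap)
{
    return (cap >> VTD_CAP_CM_BIT) & 1;
}

static bool vtd_pw_coherent(uint64_t ecap)
{
    return (ecap >> VTD_ECAP_C_BIT) & 1;
}
```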

> If we allow user to specify cmbit={0|1}, user can decide whether
> he/she would like to take this benefit.

So long as the *default* gives us the ability to support an external
hardware cache, like vfio, and we generate a hw_error or equivalent to
avoid unsafe combinations, you're free to enable whatever other
shortcuts you want.  Thanks,

Alex


* Re: [Qemu-devel] [PATCH v3 3/3] IOMMU: Integrate between VFIO and vIOMMU to support device assignment
  2016-06-06  7:38     ` Peter Xu
@ 2016-06-06 17:30       ` Alex Williamson
  0 siblings, 0 replies; 40+ messages in thread
From: Alex Williamson @ 2016-06-06 17:30 UTC (permalink / raw)
  To: Peter Xu; +Cc: Aviv B.D, qemu-devel, Michael S. Tsirkin, Jan Kiszka

On Mon, 6 Jun 2016 15:38:25 +0800
Peter Xu <peterx@redhat.com> wrote:

> Some questions not quite related to this patch content but vfio...
> 
> On Mon, May 23, 2016 at 11:53:42AM -0600, Alex Williamson wrote:
> > On Sat, 21 May 2016 19:19:50 +0300
> > "Aviv B.D" <bd.aviv@gmail.com> wrote:  
> 
> [...]
> 
> > > +#if 0
> > >  static hwaddr vfio_container_granularity(VFIOContainer *container)
> > >  {
> > >      return (hwaddr)1 << ctz64(container->iova_pgsizes);
> > >  }
> > > -
> > > +#endif  
> 
> Here we are fetching the smallest page size that the host IOMMU supports,
> so even if the host IOMMU supports large pages, they will not be used as long
> as the guest has enabled the vIOMMU, right?

Not using this replay mechanism, correct.  AFAIK, this replay code has
only been tested on POWER where the window is much, much smaller than
the 64bit address space and hugepages are not supported.  A replay
callback into the iommu could could not only walk the address space
more efficiently, but also attempt to map with hugepages.  It would
however need to be cautious not to coalesce separate mappings by the
guest into a single mapping through vfio, or else we're going to have
inconsistency for mapping vs unmapping that vfio does not expect or
support.
 
> > 
> > 
> > Clearly this is unacceptable, the code has a purpose.
> >   
> > >  static void vfio_listener_region_add(MemoryListener *listener,
> > >                                       MemoryRegionSection *section)
> > >  {
> > > @@ -384,11 +387,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
> > >          giommu->n.notify = vfio_iommu_map_notify;
> > >          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
> > >  
> > > +        vtd_register_giommu(giommu);  
> > 
> > vfio will not assume VT-d, this is why we register the notifier below.
> >   
> > >          memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> > > +#if 0
> > >          memory_region_iommu_replay(giommu->iommu, &giommu->n,
> > >                                     vfio_container_granularity(container),
> > >                                     false);  
> 
> For memory_region_iommu_replay(), we are using
> vfio_container_granularity() as the granularity, which is the host
> IOMMU page size. However inside it:
> 
> void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n,
>                                 hwaddr granularity, bool is_write)
> {
>     hwaddr addr;
>     IOMMUTLBEntry iotlb;
> 
>     for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
>         iotlb = mr->iommu_ops->translate(mr, addr, is_write);
>         if (iotlb.perm != IOMMU_NONE) {
>             n->notify(n, &iotlb);
>         }
> 
>         /* if (2^64 - MR size) < granularity, it's possible to get an
>          * infinite loop here.  This should catch such a wraparound */
>         if ((addr + granularity) < addr) {
>             break;
>         }
>     }
> }
> 
> Is it possible that the iotlb maps to a large page (or any page size
> that is not the same as the granularity)? The above code seems to assume
> that the host and guest IOMMUs have the same page size == granularity?

I think this is answered above.  This is not remotely efficient code
for a real 64bit IOMMU (BTW, VT-d does not support the full 64bit
address space either, I believe it's more like 48bits) and is not going
to replay hugepages, but it will give us sufficiently correct IOMMU
entries... eventually. Thanks,

Alex


* Re: [Qemu-devel] [PATCH v3 3/3] IOMMU: Integrate between VFIO and vIOMMU to support device assignment
  2016-06-06  8:09                       ` Peter Xu
@ 2016-06-06 18:21                         ` Alex Williamson
  2016-06-07 13:20                           ` Peter Xu
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2016-06-06 18:21 UTC (permalink / raw)
  To: Peter Xu; +Cc: Aviv B.D., Jan Kiszka, qemu-devel, Michael S. Tsirkin

On Mon, 6 Jun 2016 16:09:11 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Thu, Jun 02, 2016 at 07:34:17AM -0600, Alex Williamson wrote:
> > On Thu, 02 Jun 2016 13:09:27 +0000
> > "Aviv B.D." <bd.aviv@gmail.com> wrote:
> >   
> > > Hi,
> > > 
> > > In case of hot plug vfio device there should not be any active mapping
> > > to this device prior to the device addition.  
> > 
> > Counter example - a device is hot added to a guest booted with iommu=pt.  
> 
> I have the same question as Aviv...
> 
> For hot-plug devices

First, let's just remove "hot-plug" from this discussion because I'm
afraid someone is going to say "let's just not support hotplug
devices".  The issue of moving a device to a pre-populated domain can
occur any time a device attaches to a new domain.  It might occur if a
device is added to a vfio driver in the L1 guest where it's already
managing a device.  It might occur any time the device is released from
an L1 vfio user and returned to the static identity map in the L1 guest
kernel.

>, even if it is using iommu=pt, shouldn't it still
> follow the steps that first init vfio device, then configure device
> context entry? Let me list the steps for device addition in case I got
> any mistake:
> 
> 1. user add new VFIO device A
> 
> 2. vfio_listener_region_add() called for device A on the IOMMU mr,
>    here we should create the iommu notifier. However since the context
>    entry still does not exist, memory_region_iommu_replay() will got
>    all invalid IOTLB (IOMMU_NONE entries)
> 
> 3. guest kernel found the device, enabled the device, filled in
>    context entry for device A with "pass-through" (so the SLPTPTR is
>    invalid)
> 
> 4. guest sent context invalidation to QEMU vIOMMU since we have CM=1
>    set for guest vIOMMU
> 
> 5. QEMU vIOMMU handle the invalidation, trigger VFIO notify to do
>    correct VFIO mapping for device A
> 
> Though here step 5 should still be missing (IIUC Aviv's patch 3 still
> not handled context invalidation). Just want to know whether we can
> avoid the replay operation for Intel vIOMMUs (for Intel only, because
> Intel has context invalidation and cache mode support, not sure about
> other platform)?

I'm not sure why you're so eager to avoid implementing a replay
callback for VT-d.  What happens when VT-d is not enabled by the
guest?  vfio/pci.c calls pci_device_iommu_address_space() to get the
address space for the device, which results in vtd_find_add_as() which
gives us a unique VTDAddressSpace.as for that device.  With VT-d not in
use by the guest, when do steps 3-5 occur?  I agree with your reasoning
when VT-d is enabled, but the BIOS/UEFI does not enable VT-d, so are we
going to exclude all use cases of an assigned device prior to the guest
enabling VT-d?

On that same topic, I'm really dubious that we have the flexibility in
our memory API to really support an IOMMU like VT-d and the layout of
having a VTDAddressSpace per device, rather than exposing shared
address spaces, has implications on the efficiency and locked memory
requirements for vfio.  In the above case with VT-d disabled, the
VTDAddressSpace should alias to the system memory AddressSpace and
dynamically switch to a per device address space when VT-d is enabled.
AFAICT, we don't have anything resembling that sort of feature, so we
rely on the IOMMU driver to replay, perhaps even configuring its own
MemoryListener to IOMMU notifier gateway, which is also something that
doesn't currently exist.

Additionally, if there are things unique to VT-d, for instance if VT-d
is enabled and we can rely on the sequence of events you've set forth,
then yes, the replay mechanism should do nothing.  But only the VT-d
code can decide that, which implies that vfio always needs to call the
replay function and a new MemoryRegionIOMMUOps callback needs to decide
what to do given the current state of the vIOMMU.  Thanks,

Alex


* Re: [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest
  2016-06-06 17:02                 ` Alex Williamson
@ 2016-06-07  3:20                   ` Peter Xu
  2016-06-07  3:58                     ` Alex Williamson
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Xu @ 2016-06-07  3:20 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Jan Kiszka, Aviv B.D, qemu-devel, Michael S. Tsirkin

On Mon, Jun 06, 2016 at 11:02:11AM -0600, Alex Williamson wrote:
> On Mon, 6 Jun 2016 21:43:17 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Mon, Jun 06, 2016 at 07:11:41AM -0600, Alex Williamson wrote:
> > > On Mon, 6 Jun 2016 13:04:07 +0800
> > > Peter Xu <peterx@redhat.com> wrote:  
> > [...]
> > > > Besides the reason that there might be guests that do not support
> > > > CM=1, will there be performance considerations? When user's
> > > > configuration does not require CM capability (e.g., generic VM
> > > > configuration, without VFIO), shall we allow user to disable the CM
> > > > bit so that we can have better IOMMU performance (avoid extra and
> > > > useless invalidations)?  
> > > 
> > > With Alexey's proposed patch to have callback ops when the iommu
> > > notifier list adds its first entry and removes its last, any of the
> > > additional overhead to generate notifies when nobody is listening can
> > > be avoided.  These same callbacks would be the ones that need to
> > > generate a hw_error if a notifier is added while running in CM=0.  
> > 
> > Not familar with Alexey's patch
> 
> https://lists.nongnu.org/archive/html/qemu-devel/2016-06/msg00079.html

Thanks for the pointer. :)

> 
> >, but is that for VFIO only?
> 
> vfio is currently the only user of the iommu notifier, but the
> interface is generic, which is how it should (must) be.

Yes.

> 
> > I mean, if
> > we configured CMbit=1, guest kernel will send invalidation request
> > every time it creates new entries (context entries, or iotlb
> > entries). Even without VFIO notifiers, the guest needs to trap into QEMU
> > and process the invalidation requests. This is avoidable if we are not
> > using VFIO devices at all (so no need to maintain any mappings),
> > right?
> 
> CM=1 only defines that not-present and invalid entries can be cached,
> any changes to existing entries requires an invalidation regardless of
> CM.  What you're looking for sounds more like ECAP.C:

Yes, but I guess what I was talking about is the CM bit, not ECAP.C.
When we clear/replace one context entry, the guest kernel will
definitely send one context-entry invalidation to QEMU:

static void domain_context_clear_one(struct intel_iommu *iommu, u8 bus, u8 devfn)
{
	if (!iommu)
		return;

	clear_context_table(iommu, bus, devfn);
	iommu->flush.flush_context(iommu, 0, 0, 0,
					   DMA_CCMD_GLOBAL_INVL);
	iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
}

... While if we are creating a new one (like when attaching a new VFIO
device?), it's optional behavior depending on whether the CM bit is
set:

static int domain_context_mapping_one(struct dmar_domain *domain,
				      struct intel_iommu *iommu,
				      u8 bus, u8 devfn)
{
    ...
	/*
	 * It's a non-present to present mapping. If hardware doesn't cache
	 * non-present entry we only need to flush the write-buffer. If the
	 * _does_ cache non-present entries, then it does so in the special
	 * domain #0, which we have to flush:
	 */
	if (cap_caching_mode(iommu->cap)) {
		iommu->flush.flush_context(iommu, 0,
					   (((u16)bus) << 8) | devfn,
					   DMA_CCMD_MASK_NOBIT,
					   DMA_CCMD_DEVICE_INVL);
		iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
	} else {
		iommu_flush_write_buffer(iommu);
	}
    ...
}

Only if cap_caching_mode() is set (which checks bit 7, the CM bit)
will we send these invalidations. What I meant is that we should allow
the user to specify the CM bit, so that when we are not using VFIO
devices, we can skip the above flush_context() and flush_iotlb()
calls, etc... So, besides the fact that we have some guests that do
not support the CM bit (like Jailhouse), performance might be another
reason to allow users to specify the CM bit themselves.
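As a rough sketch of the idea, making the CM capability user-configurable
could look like the following. This is illustrative only: VTD_CAP_CM matches
the real bit-7 definition, but vtd_build_cap() and the "caching_mode" flag
are hypothetical names, not QEMU's actual code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 7 of the VT-d capability register is the Caching Mode (CM) bit. */
#define VTD_CAP_CM (1ULL << 7)

/* Gate whether the emulated IOMMU advertises CM on a user option
 * (e.g. a hypothetical "caching-mode" device property). */
static uint64_t vtd_build_cap(uint64_t base_cap, bool caching_mode)
{
    return caching_mode ? (base_cap | VTD_CAP_CM)
                        : (base_cap & ~VTD_CAP_CM);
}
```

With caching_mode off, the guest skips the extra flush_context()/flush_iotlb()
calls on non-present-to-present mappings; with it on, QEMU sees every mapping
change, which VFIO needs.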

> 
> C: Page-walk Coherency
>   This field indicates if hardware access to the root, context,
>   extended-context and interrupt-remap tables, and second-level paging
>   structures for requests-without PASID, are coherent (snooped) or not.
>     • 0: Indicates hardware accesses to remapping structures are non-coherent.
>     • 1: Indicates hardware accesses to remapping structures are coherent.
> 
> Without both CM=0 and C=0, our only virtualization mechanism for
> maintaining a hardware cache coherent with the guest view of the iommu
> would be to shadow all of the VT-d structures.  For purely emulated
> devices, maybe we can get away with that, but I doubt the current
> ghashes used for the iotlb are prepared for it.

Actually I hadn't noticed this bit yet. I see that it decides whether
the guest kernel needs to issue explicit clflush() calls when modifying
IOMMU PTEs, but shouldn't we always flush the memory cache so that we
can be sure the IOMMU sees the same memory data as the CPU does?

Thanks!

-- peterx

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest
  2016-06-07  3:20                   ` Peter Xu
@ 2016-06-07  3:58                     ` Alex Williamson
  2016-06-07  5:00                       ` Peter Xu
  2016-06-07  5:21                       ` Huang, Kai
  0 siblings, 2 replies; 40+ messages in thread
From: Alex Williamson @ 2016-06-07  3:58 UTC (permalink / raw)
  To: Peter Xu; +Cc: Jan Kiszka, Aviv B.D, qemu-devel, Michael S. Tsirkin

On Tue, 7 Jun 2016 11:20:32 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Mon, Jun 06, 2016 at 11:02:11AM -0600, Alex Williamson wrote:
> > On Mon, 6 Jun 2016 21:43:17 +0800
> > Peter Xu <peterx@redhat.com> wrote:
> >   
> > > On Mon, Jun 06, 2016 at 07:11:41AM -0600, Alex Williamson wrote:  
> > > > On Mon, 6 Jun 2016 13:04:07 +0800
> > > > Peter Xu <peterx@redhat.com> wrote:    
> > > [...]  
> > > > > Besides the reason that there might have guests that do not support
> > > > > CM=1, will there be performance considerations? When user's
> > > > > configuration does not require CM capability (e.g., generic VM
> > > > > configuration, without VFIO), shall we allow user to disable the CM
> > > > > bit so that we can have better IOMMU performance (avoid extra and
> > > > > useless invalidations)?    
> > > > 
> > > > With Alexey's proposed patch to have callback ops when the iommu
> > > > notifier list adds its first entry and removes its last, any of the
> > > > additional overhead to generate notifies when nobody is listening can
> > > > be avoided.  These same callbacks would be the ones that need to
> > > > generate a hw_error if a notifier is added while running in CM=0.    
> > > 
> > > Not familar with Alexey's patch  
> > 
> > https://lists.nongnu.org/archive/html/qemu-devel/2016-06/msg00079.html  
> 
> Thanks for the pointer. :)
> 
> >   
> > >, but is that for VFIO only?  
> > 
> > vfio is currently the only user of the iommu notifier, but the
> > interface is generic, which is how it should (must) be.  
> 
> Yes.
> 
> >   
> > > I mean, if
> > > we configured CMbit=1, guest kernel will send invalidation request
> > > every time it creates new entries (context entries, or iotlb
> > > entries). Even without VFIO notifiers, guest need to trap into QEMU
> > > and process the invalidation requests. This is avoidable if we are not
> > > using VFIO devices at all (so no need to maintain any mappings),
> > > right?  
> > 
> > CM=1 only defines that not-present and invalid entries can be cached,
> > any changes to existing entries requires an invalidation regardless of
> > CM.  What you're looking for sounds more like ECAP.C:  
> 
> Yes, but I guess what I was talking about is CM bit but not ECAP.C.
> When we clear/replace one context entry, guest kernel will definitely
> send one context entry invalidation to QEMU:
> 
> static void domain_context_clear_one(struct intel_iommu *iommu, u8 bus, u8 devfn)
> {
> 	if (!iommu)
> 		return;
> 
> 	clear_context_table(iommu, bus, devfn);
> 	iommu->flush.flush_context(iommu, 0, 0, 0,
> 					   DMA_CCMD_GLOBAL_INVL);
> 	iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
> }
> 
> ... While if we are creating a new one (like attaching a new VFIO
> device?), it's an optional behavior depending on whether CM bit is
> set:
> 
> static int domain_context_mapping_one(struct dmar_domain *domain,
> 				      struct intel_iommu *iommu,
> 				      u8 bus, u8 devfn)
> {
>     ...
> 	/*
> 	 * It's a non-present to present mapping. If hardware doesn't cache
> 	 * non-present entry we only need to flush the write-buffer. If the
> 	 * _does_ cache non-present entries, then it does so in the special
> 	 * domain #0, which we have to flush:
> 	 */
> 	if (cap_caching_mode(iommu->cap)) {
> 		iommu->flush.flush_context(iommu, 0,
> 					   (((u16)bus) << 8) | devfn,
> 					   DMA_CCMD_MASK_NOBIT,
> 					   DMA_CCMD_DEVICE_INVL);
> 		iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
> 	} else {
> 		iommu_flush_write_buffer(iommu);
> 	}
>     ...
> }
> 
> Only if cap_caching_mode() is set (which is bit 7, the CM bit), we
> will send these invalidations. What I meant is that, we should allow
> user to specify the CM bit, so that when we are not using VFIO
> devices, we can skip the above flush_content() and flush_iotlb()
> etc... So, besides the truth that we have some guests do not support
> CM bit (like Jailhouse), performance might be another consideration
> point that we should allow user to specify the CM bit themselfs.

I'm dubious of this, IOMMU drivers are already aware that hardware
flushes are expensive and do batching to optimize it.  The queued
invalidation mechanism itself is meant to allow asynchronous
invalidations.  QEMU invalidating a virtual IOMMU might very well be
faster than hardware.

> > 
> > C: Page-walk Coherency
> >   This field indicates if hardware access to the root, context,
> >   extended-context and interrupt-remap tables, and second-level paging
> >   structures for requests-without PASID, are coherent (snooped) or not.
> >     • 0: Indicates hardware accesses to remapping structures are non-coherent.
> >     • 1: Indicates hardware accesses to remapping structures are coherent.
> > 
> > Without both CM=0 and C=0, our only virtualization mechanism for
> > maintaining a hardware cache coherent with the guest view of the iommu
> > would be to shadow all of the VT-d structures.  For purely emulated
> > devices, maybe we can get away with that, but I doubt the current
> > ghashes used for the iotlb are prepared for it.  
> 
> Actually I haven't noticed this bit yet. I see that this will decide
> whether guest kernel need to send specific clflush() when modifying
> IOMMU PTEs, but shouldn't we flush the memory cache always so that we
> can sure IOMMU can see the same memory data as CPU does?

I think it would be a question of how much the g_hash code really buys
us in the VT-d code, it might be faster to do a lookup each time if it
means fewer flushes.  Those hashes are useless overhead for assigned
devices, so maybe we can avoid them when we only have assigned
devices ;)  Thanks,

Alex


* Re: [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest
  2016-06-07  3:58                     ` Alex Williamson
@ 2016-06-07  5:00                       ` Peter Xu
  2016-06-07  5:21                       ` Huang, Kai
  1 sibling, 0 replies; 40+ messages in thread
From: Peter Xu @ 2016-06-07  5:00 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Jan Kiszka, Aviv B.D, qemu-devel, Michael S. Tsirkin

On Mon, Jun 06, 2016 at 09:58:09PM -0600, Alex Williamson wrote:
> On Tue, 7 Jun 2016 11:20:32 +0800
> Peter Xu <peterx@redhat.com> wrote:
[...]
> > Only if cap_caching_mode() is set (which is bit 7, the CM bit), we
> > will send these invalidations. What I meant is that, we should allow
> > user to specify the CM bit, so that when we are not using VFIO
> > devices, we can skip the above flush_content() and flush_iotlb()
> > etc... So, besides the truth that we have some guests do not support
> > CM bit (like Jailhouse), performance might be another consideration
> > point that we should allow user to specify the CM bit themselfs.
> 
> I'm dubious of this, IOMMU drivers are already aware that hardware
> flushes are expensive and do batching to optimize it.  The queued
> invalidation mechanism itself is meant to allow asynchronous
> invalidations.  QEMU invalidating a virtual IOMMU might very well be
> faster than hardware.

Agree. However, it seems that current Linux is still not taking
advantage of this... check qi_flush_context() and qi_flush_iotlb():
qi_submit_sync() is used for both, which sends one invalidation with an
explicit wait to make sure it's synchronous.
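The difference between the two styles can be modeled in a few lines. This is a
toy model, not the kernel's QI implementation: qi_post()/qi_wait() and
struct qi_queue are made-up names standing in for posting an invalidation
descriptor and blocking on a wait descriptor.

```c
/* Toy model of VT-d queued invalidation (QI): batching posts many
 * descriptors and pays for a single wait, while the sync style (what
 * qi_submit_sync() effectively does today) waits once per request. */
struct qi_queue {
    int posted;   /* descriptors queued but not yet completed */
    int waits;    /* how many times we blocked on a wait descriptor */
};

static void qi_post(struct qi_queue *q) { q->posted++; }
static void qi_wait(struct qi_queue *q) { q->waits++; q->posted = 0; }

/* Sync style: one wait per descriptor. */
static int qi_flush_sync(struct qi_queue *q, int n)
{
    for (int i = 0; i < n; i++) {
        qi_post(q);
        qi_wait(q);
    }
    return q->waits;
}

/* Batched style: post all descriptors, then wait once. */
static int qi_flush_batched(struct qi_queue *q, int n)
{
    for (int i = 0; i < n; i++) {
        qi_post(q);
    }
    qi_wait(q);
    return q->waits;
}
```

Batching eight invalidations costs one wait instead of eight, which is where
the potential win over the current per-request sync lies.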

> 
> > > 
> > > C: Page-walk Coherency
> > >   This field indicates if hardware access to the root, context,
> > >   extended-context and interrupt-remap tables, and second-level paging
> > >   structures for requests-without PASID, are coherent (snooped) or not.
> > >     • 0: Indicates hardware accesses to remapping structures are non-coherent.
> > >     • 1: Indicates hardware accesses to remapping structures are coherent.
> > > 
> > > Without both CM=0 and C=0, our only virtualization mechanism for
> > > maintaining a hardware cache coherent with the guest view of the iommu
> > > would be to shadow all of the VT-d structures.  For purely emulated
> > > devices, maybe we can get away with that, but I doubt the current
> > > ghashes used for the iotlb are prepared for it.  
> > 
> > Actually I haven't noticed this bit yet. I see that this will decide
> > whether guest kernel need to send specific clflush() when modifying
> > IOMMU PTEs, but shouldn't we flush the memory cache always so that we
> > can sure IOMMU can see the same memory data as CPU does?
> 
> I think it would be a question of how much the g_hash code really buys
> us in the VT-d code, it might be faster to do a lookup each time if it
> means fewer flushes.  Those hashes are useless overhead for assigned
> devices, so maybe we can avoid them when we only have assigned
> devices ;)  Thanks,

Errr, I just noticed that VFIO devices do not need an emulated
cache. There is indeed lots of pending work TBD on the vIOMMU side...

Thanks!

-- peterx


* Re: [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest
  2016-06-07  3:58                     ` Alex Williamson
  2016-06-07  5:00                       ` Peter Xu
@ 2016-06-07  5:21                       ` Huang, Kai
  2016-06-07 18:46                         ` Alex Williamson
  1 sibling, 1 reply; 40+ messages in thread
From: Huang, Kai @ 2016-06-07  5:21 UTC (permalink / raw)
  To: Alex Williamson, Peter Xu
  Cc: Aviv B.D, Jan Kiszka, qemu-devel, Michael S. Tsirkin



On 6/7/2016 3:58 PM, Alex Williamson wrote:
> On Tue, 7 Jun 2016 11:20:32 +0800
> Peter Xu <peterx@redhat.com> wrote:
>
>> On Mon, Jun 06, 2016 at 11:02:11AM -0600, Alex Williamson wrote:
>>> On Mon, 6 Jun 2016 21:43:17 +0800
>>> Peter Xu <peterx@redhat.com> wrote:
>>>
>>>> On Mon, Jun 06, 2016 at 07:11:41AM -0600, Alex Williamson wrote:
>>>>> On Mon, 6 Jun 2016 13:04:07 +0800
>>>>> Peter Xu <peterx@redhat.com> wrote:
>>>> [...]
>>>>>> Besides the reason that there might have guests that do not support
>>>>>> CM=1, will there be performance considerations? When user's
>>>>>> configuration does not require CM capability (e.g., generic VM
>>>>>> configuration, without VFIO), shall we allow user to disable the CM
>>>>>> bit so that we can have better IOMMU performance (avoid extra and
>>>>>> useless invalidations)?
>>>>>
>>>>> With Alexey's proposed patch to have callback ops when the iommu
>>>>> notifier list adds its first entry and removes its last, any of the
>>>>> additional overhead to generate notifies when nobody is listening can
>>>>> be avoided.  These same callbacks would be the ones that need to
>>>>> generate a hw_error if a notifier is added while running in CM=0.
>>>>
>>>> Not familar with Alexey's patch
>>>
>>> https://lists.nongnu.org/archive/html/qemu-devel/2016-06/msg00079.html
>>
>> Thanks for the pointer. :)
>>
>>>
>>>> , but is that for VFIO only?
>>>
>>> vfio is currently the only user of the iommu notifier, but the
>>> interface is generic, which is how it should (must) be.
>>
>> Yes.
>>
>>>
>>>> I mean, if
>>>> we configured CMbit=1, guest kernel will send invalidation request
>>>> every time it creates new entries (context entries, or iotlb
>>>> entries). Even without VFIO notifiers, guest need to trap into QEMU
>>>> and process the invalidation requests. This is avoidable if we are not
>>>> using VFIO devices at all (so no need to maintain any mappings),
>>>> right?
>>>
>>> CM=1 only defines that not-present and invalid entries can be cached,
>>> any changes to existing entries requires an invalidation regardless of
>>> CM.  What you're looking for sounds more like ECAP.C:
>>
>> Yes, but I guess what I was talking about is CM bit but not ECAP.C.
>> When we clear/replace one context entry, guest kernel will definitely
>> send one context entry invalidation to QEMU:
>>
>> static void domain_context_clear_one(struct intel_iommu *iommu, u8 bus, u8 devfn)
>> {
>> 	if (!iommu)
>> 		return;
>>
>> 	clear_context_table(iommu, bus, devfn);
>> 	iommu->flush.flush_context(iommu, 0, 0, 0,
>> 					   DMA_CCMD_GLOBAL_INVL);
>> 	iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
>> }
>>
>> ... While if we are creating a new one (like attaching a new VFIO
>> device?), it's an optional behavior depending on whether CM bit is
>> set:
>>
>> static int domain_context_mapping_one(struct dmar_domain *domain,
>> 				      struct intel_iommu *iommu,
>> 				      u8 bus, u8 devfn)
>> {
>>     ...
>> 	/*
>> 	 * It's a non-present to present mapping. If hardware doesn't cache
>> 	 * non-present entry we only need to flush the write-buffer. If the
>> 	 * _does_ cache non-present entries, then it does so in the special
>> 	 * domain #0, which we have to flush:
>> 	 */
>> 	if (cap_caching_mode(iommu->cap)) {
>> 		iommu->flush.flush_context(iommu, 0,
>> 					   (((u16)bus) << 8) | devfn,
>> 					   DMA_CCMD_MASK_NOBIT,
>> 					   DMA_CCMD_DEVICE_INVL);
>> 		iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
>> 	} else {
>> 		iommu_flush_write_buffer(iommu);
>> 	}
>>     ...
>> }
>>
>> Only if cap_caching_mode() is set (which is bit 7, the CM bit), we
>> will send these invalidations. What I meant is that, we should allow
>> user to specify the CM bit, so that when we are not using VFIO
>> devices, we can skip the above flush_content() and flush_iotlb()
>> etc... So, besides the truth that we have some guests do not support
>> CM bit (like Jailhouse), performance might be another consideration
>> point that we should allow user to specify the CM bit themselfs.
>
> I'm dubious of this, IOMMU drivers are already aware that hardware
> flushes are expensive and do batching to optimize it.  The queued
> invalidation mechanism itself is meant to allow asynchronous
> invalidations.  QEMU invalidating a virtual IOMMU might very well be
> faster than hardware.

Batching doesn't mean we can eliminate the IOTLB flushes for mappings 
going from non-present to present in the CM=1 case, while with CM=0 
those IOTLB flushes are not necessary, just as the code above shows. 
Therefore, generally speaking, CM=0 should have better performance than 
CM=1, even for QEMU's vIOMMU.

In my understanding the purpose of exposing CM=1 is to force the guest 
to do an IOTLB flush for each mapping change (including from 
non-present to present) so QEMU is able to emulate each mapping change 
from the guest (correct me if I'm wrong). If the previous statement 
stands, CM=1 is really a workaround for making vfio-assigned devices 
and the vIOMMU work together, and unfortunately this cannot work on 
other vendors' IOMMUs without a CM bit, such as AMD's.

So what are the requirements for making vfio-assigned devices and the 
vIOMMU work together? I think it would be more helpful to implement a 
more generic solution that monitors and emulates the guest vIOMMU's 
page table, rather than simply exposing CM=1 to the guest, as that only 
works on Intel's IOMMU.

And what do you mean by asynchronous invalidations? I think the IOVAs 
of the changed mappings cannot be used until the mappings are 
invalidated. It doesn't matter whether the invalidation is done via QI 
or a register.

Thanks,
-Kai

>
>>>
>>> C: Page-walk Coherency
>>>   This field indicates if hardware access to the root, context,
>>>   extended-context and interrupt-remap tables, and second-level paging
>>>   structures for requests-without PASID, are coherent (snooped) or not.
>>>     • 0: Indicates hardware accesses to remapping structures are non-coherent.
>>>     • 1: Indicates hardware accesses to remapping structures are coherent.
>>>
>>> Without both CM=0 and C=0, our only virtualization mechanism for
>>> maintaining a hardware cache coherent with the guest view of the iommu
>>> would be to shadow all of the VT-d structures.  For purely emulated
>>> devices, maybe we can get away with that, but I doubt the current
>>> ghashes used for the iotlb are prepared for it.
>>
>> Actually I haven't noticed this bit yet. I see that this will decide
>> whether guest kernel need to send specific clflush() when modifying
>> IOMMU PTEs, but shouldn't we flush the memory cache always so that we
>> can sure IOMMU can see the same memory data as CPU does?
>
> I think it would be a question of how much the g_hash code really buys
> us in the VT-d code, it might be faster to do a lookup each time if it
> means fewer flushes.  Those hashes are useless overhead for assigned
> devices, so maybe we can avoid them when we only have assigned
> devices ;)  Thanks,
>
> Alex
>
>


* Re: [Qemu-devel] [PATCH v3 3/3] IOMMU: Integrate between VFIO and vIOMMU to support device assignment
  2016-06-06 18:21                         ` Alex Williamson
@ 2016-06-07 13:20                           ` Peter Xu
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Xu @ 2016-06-07 13:20 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Aviv B.D., Jan Kiszka, qemu-devel, Michael S. Tsirkin

On Mon, Jun 06, 2016 at 12:21:34PM -0600, Alex Williamson wrote:
[...]
> I'm not sure why you're so eager to avoid implementing a replay
> callback for VT-d.  What happens when VT-d is not enabled by the
> guest?  vfio/pci.c calls pci_device_iommu_address_space() to get the
> address space for the device, which results in vtd_find_add_as() which
> gives us a unique VTDAddressSpace.as for that device.  With VT-d not in
> use by the guest, when do steps 3-5 occur?  I agree with your reasoning
> when VT-d is enabled, but the BIOS/UEFI does not enable VT-d, so are we
> going to exclude all use cases of an assigned device prior to the guest
> enabling VT-d?

I think I got the point. I failed to consider the case where an IOMMU
is present but not enabled, like in the BIOS (as you have mentioned),
or when the IOMMU is disabled via kernel boot parameters (though
iommu=pt should still follow the IOMMU-enabled case).

Sounds like a replay callback is a good idea. For the Intel version of
the callback: when DMAR is enabled, we can just return directly; when
DMAR is disabled, we do whatever region_add() needs to do for the
global address_space_memory.

> 
> On that same topic, I'm really dubious that we have the flexibility in
> our memory API to really support an IOMMU like VT-d and the layout of
> having a VTDAddressSpace per device, rather than exposing shared
> address spaces, has implications on the efficiency and locked memory
> requirements for vfio.  In the above case with VT-d disabled, the
> VTDAddressSpace should alias to the system memory AddressSpace and
> dynamically switch to a per device address space when VT-d is enabled.
> AFAICT, we don't have anything resembling that sort of feature, so we
> rely on the IOMMU driver to replay, perhaps even configuring its own
> MemoryListener to IOMMU notifier gateway, which is also something that
> doesn't currently exist.

It sounds more like a notifier for "IOMMU enablement"? The notifier
would be triggered when the IOMMU switches between enabled and
disabled; when that happens, we rebuild the mappings in some way.
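Such an enablement notifier might look like the following. This is
illustrative C, not an existing QEMU interface; the listener here stands in
for something like vfio rebuilding its mappings on each transition.

```c
#include <stdbool.h>

/* Sketch of an "IOMMU enablement" notifier: listeners are told when the
 * guest flips DMAR between enabled and disabled so they can rebuild
 * their mappings.  All names are hypothetical. */
#define MAX_LISTENERS 4

typedef void (*enable_notify_fn)(bool enabled);

struct iommu_enable_notifier {
    enable_notify_fn fns[MAX_LISTENERS];
    int n;
    bool enabled;
};

static int g_rebuilds;  /* counts rebuilds done by the example listener */

/* Example listener: rebuild mappings whenever enablement changes. */
static void vfio_rebuild(bool enabled)
{
    (void)enabled;
    g_rebuilds++;
}

static void notifier_add(struct iommu_enable_notifier *nt,
                         enable_notify_fn fn)
{
    nt->fns[nt->n++] = fn;
}

static void iommu_set_enabled(struct iommu_enable_notifier *nt, bool on)
{
    if (nt->enabled == on) {
        return;                 /* fire only on actual transitions */
    }
    nt->enabled = on;
    for (int i = 0; i < nt->n; i++) {
        nt->fns[i](on);
    }
}
```

Note the guard against redundant writes: the guest re-writing the same
enable state should not trigger a full rebuild.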

> 
> Additionally, if there are things unique to VT-d, for instance if VT-d
> is enabled and we can rely on the sequence of events you've set forth,
> then yes, the replay mechanism should do nothing.  But only the VT-d
> code can decide that, which implies that vfio always needs to call the
> replay function and a new MemoryRegionIOMMUOps callback needs to decide
> what to do given the current state of the vIOMMU.  Thanks,

Right. Thanks.

-- peterx


* Re: [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest
  2016-06-07  5:21                       ` Huang, Kai
@ 2016-06-07 18:46                         ` Alex Williamson
  2016-06-07 22:39                           ` Huang, Kai
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2016-06-07 18:46 UTC (permalink / raw)
  To: Huang, Kai; +Cc: Peter Xu, Aviv B.D, Jan Kiszka, qemu-devel, Michael S. Tsirkin

On Tue, 7 Jun 2016 17:21:06 +1200
"Huang, Kai" <kai.huang@linux.intel.com> wrote:

> On 6/7/2016 3:58 PM, Alex Williamson wrote:
> > On Tue, 7 Jun 2016 11:20:32 +0800
> > Peter Xu <peterx@redhat.com> wrote:
> >  
> >> On Mon, Jun 06, 2016 at 11:02:11AM -0600, Alex Williamson wrote:  
> >>> On Mon, 6 Jun 2016 21:43:17 +0800
> >>> Peter Xu <peterx@redhat.com> wrote:
> >>>  
> >>>> On Mon, Jun 06, 2016 at 07:11:41AM -0600, Alex Williamson wrote:  
> >>>>> On Mon, 6 Jun 2016 13:04:07 +0800
> >>>>> Peter Xu <peterx@redhat.com> wrote:  
> >>>> [...]  
> >>>>>> Besides the reason that there might have guests that do not support
> >>>>>> CM=1, will there be performance considerations? When user's
> >>>>>> configuration does not require CM capability (e.g., generic VM
> >>>>>> configuration, without VFIO), shall we allow user to disable the CM
> >>>>>> bit so that we can have better IOMMU performance (avoid extra and
> >>>>>> useless invalidations)?  
> >>>>>
> >>>>> With Alexey's proposed patch to have callback ops when the iommu
> >>>>> notifier list adds its first entry and removes its last, any of the
> >>>>> additional overhead to generate notifies when nobody is listening can
> >>>>> be avoided.  These same callbacks would be the ones that need to
> >>>>> generate a hw_error if a notifier is added while running in CM=0.  
> >>>>
> >>>> Not familar with Alexey's patch  
> >>>
> >>> https://lists.nongnu.org/archive/html/qemu-devel/2016-06/msg00079.html  
> >>
> >> Thanks for the pointer. :)
> >>  
> >>>  
> >>>> , but is that for VFIO only?  
> >>>
> >>> vfio is currently the only user of the iommu notifier, but the
> >>> interface is generic, which is how it should (must) be.  
> >>
> >> Yes.
> >>  
> >>>  
> >>>> I mean, if
> >>>> we configured CMbit=1, guest kernel will send invalidation request
> >>>> every time it creates new entries (context entries, or iotlb
> >>>> entries). Even without VFIO notifiers, guest need to trap into QEMU
> >>>> and process the invalidation requests. This is avoidable if we are not
> >>>> using VFIO devices at all (so no need to maintain any mappings),
> >>>> right?  
> >>>
> >>> CM=1 only defines that not-present and invalid entries can be cached,
> >>> any changes to existing entries requires an invalidation regardless of
> >>> CM.  What you're looking for sounds more like ECAP.C:  
> >>
> >> Yes, but I guess what I was talking about is CM bit but not ECAP.C.
> >> When we clear/replace one context entry, guest kernel will definitely
> >> send one context entry invalidation to QEMU:
> >>
> >> static void domain_context_clear_one(struct intel_iommu *iommu, u8 bus, u8 devfn)
> >> {
> >> 	if (!iommu)
> >> 		return;
> >>
> >> 	clear_context_table(iommu, bus, devfn);
> >> 	iommu->flush.flush_context(iommu, 0, 0, 0,
> >> 					   DMA_CCMD_GLOBAL_INVL);
> >> 	iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
> >> }
> >>
> >> ... While if we are creating a new one (like attaching a new VFIO
> >> device?), it's an optional behavior depending on whether CM bit is
> >> set:
> >>
> >> static int domain_context_mapping_one(struct dmar_domain *domain,
> >> 				      struct intel_iommu *iommu,
> >> 				      u8 bus, u8 devfn)
> >> {
> >>     ...
> >> 	/*
> >> 	 * It's a non-present to present mapping. If hardware doesn't cache
> >> 	 * non-present entry we only need to flush the write-buffer. If the
> >> 	 * _does_ cache non-present entries, then it does so in the special
> >> 	 * domain #0, which we have to flush:
> >> 	 */
> >> 	if (cap_caching_mode(iommu->cap)) {
> >> 		iommu->flush.flush_context(iommu, 0,
> >> 					   (((u16)bus) << 8) | devfn,
> >> 					   DMA_CCMD_MASK_NOBIT,
> >> 					   DMA_CCMD_DEVICE_INVL);
> >> 		iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
> >> 	} else {
> >> 		iommu_flush_write_buffer(iommu);
> >> 	}
> >>     ...
> >> }
> >>
> >> Only if cap_caching_mode() is set (which is bit 7, the CM bit), we
> >> will send these invalidations. What I meant is that, we should allow
> >> user to specify the CM bit, so that when we are not using VFIO
> >> devices, we can skip the above flush_content() and flush_iotlb()
> >> etc... So, besides the truth that we have some guests do not support
> >> CM bit (like Jailhouse), performance might be another consideration
> >> point that we should allow user to specify the CM bit themselfs.  
> >
> > I'm dubious of this, IOMMU drivers are already aware that hardware
> > flushes are expensive and do batching to optimize it.  The queued
> > invalidation mechanism itself is meant to allow asynchronous
> > invalidations.  QEMU invalidating a virtual IOMMU might very well be
> > faster than hardware.  
> 
> Do batching doesn't mean we can eliminate the IOTLB flush for mappings 
> from non-present to present, in case of CM=1, while in case CM=0 those 
> IOTLB flush are not necessary, just like the code above shows. Therefore 
> generally speaking CM=0 should have better performance than CM=1, even 
> for Qemu's vIOMMU.
> 
> In my understanding the purpose of exposing CM=1 is to force guest do 
> IOTLB flush for each mapping change (including from non-present to 
> present) so Qemu is able to emulate each mapping change from guest 
> (correct me if I was wrong). If previous statement stands, CM=1 is 
> really a workaround for making vfio assigned devices and vIOMMU work 
> together, and unfortunately this cannot work on other vendor's IOMMU 
> without CM bit, such as AMD's IOMMU.
> 
> So what's the requirements of making vfio assigned devices and vIOMMU 
> work together? I think it should be more helpful to implement a more 
> generic solution to monitor and emulate guest vIOMMU's page table, 
> rather than simply exposing CM=1 to guest, as it only works on intel IOMMU.
> 
> And what do you mean asynchronous invalidations? I think the iova of the 
> changed mappings cannot be used until the mappings are invalidated. It 
> doesn't matter whether the invalidation is done via QI or register.

Ok, so you're arguing that CM=0 is more efficient than CM=1 because it
eliminates some portion of the invalidations necessary by the guest,
while at the same time arguing for a more general solution of shadow
page tables which would trap into the vIOMMU at every update,
eliminating all the batching done by the guest IOMMU driver code
attempting to reduce and consolidate the number of flushes done.  All
of this with only speculation on what might be more efficient.  Can we
get vIOMMU working, especially with assigned devices, before we
speculate further?

How do we expect a vIOMMU to be used in a guest?  In the case of
emulated devices, what value does it provide?  Are emulated devices
isolated from one another by the vIOMMU?  No.  Do we have 32-bit
emulated devices for which DMA translation at the vIOMMU is
significantly more efficient than bounce buffers within the guest?
Probably not, and if we did we could just emulate 64bit devices.  So I
assume that beyond being a development tool, our primary useful feature
of a vIOMMU is to expose devices to guest userspace (and thus nested
guests) via tools like vfio.  Am I wrong here?  In this use case, it's
most efficient to run with iommu=pt in the L1 guest, which would make
any sort of invalidations a very rare event.  Making use of vfio inside
the L1 guest would then move a device from the static-identity domain
in the L1 guest into a new domain where the mappings for that domain
are also relatively static.  So I really don't see why we're trying to
optimize the invalidation of the vIOMMU at this point.  I also have to
believe that the hardware folks that designed VT-d believed there to be
a performance advantage to using emulated VT-d with CM=1 versus
shadowing all of the VT-d data structures in the hypervisor or they
wouldn't have added this bit to the specification.  Thanks,

Alex


* Re: [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest
  2016-06-07 18:46                         ` Alex Williamson
@ 2016-06-07 22:39                           ` Huang, Kai
  0 siblings, 0 replies; 40+ messages in thread
From: Huang, Kai @ 2016-06-07 22:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Aviv B.D, Jan Kiszka, qemu-devel, Peter Xu, Michael S. Tsirkin



On 6/8/2016 6:46 AM, Alex Williamson wrote:
> On Tue, 7 Jun 2016 17:21:06 +1200
> "Huang, Kai" <kai.huang@linux.intel.com> wrote:
>
>> On 6/7/2016 3:58 PM, Alex Williamson wrote:
>>> On Tue, 7 Jun 2016 11:20:32 +0800
>>> Peter Xu <peterx@redhat.com> wrote:
>>>
>>>> On Mon, Jun 06, 2016 at 11:02:11AM -0600, Alex Williamson wrote:
>>>>> On Mon, 6 Jun 2016 21:43:17 +0800
>>>>> Peter Xu <peterx@redhat.com> wrote:
>>>>>
>>>>>> On Mon, Jun 06, 2016 at 07:11:41AM -0600, Alex Williamson wrote:
>>>>>>> On Mon, 6 Jun 2016 13:04:07 +0800
>>>>>>> Peter Xu <peterx@redhat.com> wrote:
>>>>>> [...]
>>>>>>>> Besides the reason that there might be guests that do not support
>>>>>>>> CM=1, will there be performance considerations? When user's
>>>>>>>> configuration does not require CM capability (e.g., generic VM
>>>>>>>> configuration, without VFIO), shall we allow user to disable the CM
>>>>>>>> bit so that we can have better IOMMU performance (avoid extra and
>>>>>>>> useless invalidations)?
>>>>>>>
>>>>>>> With Alexey's proposed patch to have callback ops when the iommu
>>>>>>> notifier list adds its first entry and removes its last, any of the
>>>>>>> additional overhead to generate notifies when nobody is listening can
>>>>>>> be avoided.  These same callbacks would be the ones that need to
>>>>>>> generate a hw_error if a notifier is added while running in CM=0.
>>>>>>
>>>>>> Not familiar with Alexey's patch
>>>>>
>>>>> https://lists.nongnu.org/archive/html/qemu-devel/2016-06/msg00079.html
>>>>
>>>> Thanks for the pointer. :)
>>>>
>>>>>
>>>>>> , but is that for VFIO only?
>>>>>
>>>>> vfio is currently the only user of the iommu notifier, but the
>>>>> interface is generic, which is how it should (must) be.
>>>>
>>>> Yes.
>>>>
>>>>>
>>>>>> I mean, if
>>>>>> we configure CM bit = 1, the guest kernel will send an invalidation
>>>>>> request every time it creates new entries (context entries, or IOTLB
>>>>>> entries). Even without VFIO notifiers, the guest needs to trap into
>>>>>> QEMU to process the invalidation requests. This is avoidable if we are
>>>>>> not using VFIO devices at all (so no need to maintain any mappings),
>>>>>> right?
>>>>>
>>>>> CM=1 only defines that not-present and invalid entries can be cached;
>>>>> any changes to existing entries require an invalidation regardless of
>>>>> CM.  What you're looking for sounds more like ECAP.C:
>>>>
>>>> Yes, but I guess what I was talking about is CM bit but not ECAP.C.
>>>> When we clear/replace one context entry, guest kernel will definitely
>>>> send one context entry invalidation to QEMU:
>>>>
>>>> static void domain_context_clear_one(struct intel_iommu *iommu, u8 bus, u8 devfn)
>>>> {
>>>> 	if (!iommu)
>>>> 		return;
>>>>
>>>> 	clear_context_table(iommu, bus, devfn);
>>>> 	iommu->flush.flush_context(iommu, 0, 0, 0,
>>>> 					   DMA_CCMD_GLOBAL_INVL);
>>>> 	iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
>>>> }
>>>>
>>>> ... While if we are creating a new one (like attaching a new VFIO
>>>> device?), it's an optional behavior depending on whether CM bit is
>>>> set:
>>>>
>>>> static int domain_context_mapping_one(struct dmar_domain *domain,
>>>> 				      struct intel_iommu *iommu,
>>>> 				      u8 bus, u8 devfn)
>>>> {
>>>>     ...
>>>> 	/*
>>>> 	 * It's a non-present to present mapping. If hardware doesn't cache
>>>> 	 * non-present entries we only need to flush the write-buffer. If it
>>>> 	 * _does_ cache non-present entries, then it does so in the special
>>>> 	 * domain #0, which we have to flush:
>>>> 	 */
>>>> 	if (cap_caching_mode(iommu->cap)) {
>>>> 		iommu->flush.flush_context(iommu, 0,
>>>> 					   (((u16)bus) << 8) | devfn,
>>>> 					   DMA_CCMD_MASK_NOBIT,
>>>> 					   DMA_CCMD_DEVICE_INVL);
>>>> 		iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
>>>> 	} else {
>>>> 		iommu_flush_write_buffer(iommu);
>>>> 	}
>>>>     ...
>>>> }
>>>>
>>>> Only if cap_caching_mode() is set (which is bit 7, the CM bit), we
>>>> will send these invalidations. What I meant is that, we should allow
>>>> user to specify the CM bit, so that when we are not using VFIO
>>>> devices, we can skip the above flush_context() and flush_iotlb()
>>>> etc... So, besides the fact that we have some guests that do not
>>>> support the CM bit (like Jailhouse), performance might be another
>>>> reason why we should allow the user to specify the CM bit themselves.
>>>
>>> I'm dubious of this; IOMMU drivers are already aware that hardware
>>> flushes are expensive and do batching to optimize it.  The queued
>>> invalidation mechanism itself is meant to allow asynchronous
>>> invalidations.  QEMU invalidating a virtual IOMMU might very well be
>>> faster than hardware.
>>
>> Doing batching doesn't mean we can eliminate the IOTLB flushes for
>> mappings going from non-present to present in the CM=1 case, while with
>> CM=0 those IOTLB flushes are not necessary, just as the code above
>> shows. Therefore, generally speaking, CM=0 should have better
>> performance than CM=1, even for QEMU's vIOMMU.
>>
>> In my understanding, the purpose of exposing CM=1 is to force the guest
>> to do an IOTLB flush for each mapping change (including from
>> non-present to present) so QEMU is able to emulate each mapping change
>> from the guest (correct me if I'm wrong). If the previous statement
>> stands, CM=1 is really a workaround for making vfio-assigned devices
>> and the vIOMMU work together, and unfortunately this cannot work on
>> other vendors' IOMMUs without a CM bit, such as AMD's IOMMU.
>>
>> So what are the requirements for making vfio-assigned devices and the
>> vIOMMU work together? I think it would be more helpful to implement a
>> more generic solution to monitor and emulate the guest vIOMMU's page
>> tables, rather than simply exposing CM=1 to the guest, as that only
>> works on Intel's IOMMU.
>>
>> And what do you mean by asynchronous invalidations? I think the IOVAs
>> of the changed mappings cannot be used until the mappings are
>> invalidated. It doesn't matter whether the invalidation is done via QI
>> or a register write.
>
> Ok, so you're arguing that CM=0 is more efficient than CM=1 because it
> eliminates some portion of the invalidations necessary by the guest,
> while at the same time arguing for a more general solution of shadow
> page tables which would trap into the vIOMMU at every update,
> eliminating all the batching done by the guest IOMMU driver code
> attempting to reduce and consolidate the number of flushes done.  All
> of this with only speculation on what might be more efficient.  Can we
> get vIOMMU working, especially with assigned devices, before we
> speculate further?
>
> How do we expect a vIOMMU to be used in a guest?  In the case of
> emulated devices, what value does it provide?  Are emulated devices
> isolated from one another by the vIOMMU?  No.  Do we have 32-bit
> emulated devices for which DMA translation at the vIOMMU is
> significantly more efficient than bounce buffers within the guest?
> Probably not, and if we did we could just emulate 64bit devices.  So I
> assume that beyond being a development tool, our primary useful feature
> of a vIOMMU is to expose devices to guest userspace (and thus nested
> guests) via tools like vfio.  Am I wrong here?  In this use case, it's
> most efficient to run with iommu=pt in the L1 guest, which would make
> any sort of invalidations a very rare event.  Making use of vfio inside
> the L1 guest would then move a device from the static-identity domain
> in the L1 guest into a new domain where the mappings for that domain
> are also relatively static.  So I really don't see why we're trying to
> optimize the invalidation of the vIOMMU at this point.  I also have to
> believe that the hardware folks that designed VT-d believed there to be
> a performance advantage to using emulated VT-d with CM=1 versus
> shadowing all of the VT-d data structures in the hypervisor or they
> wouldn't have added this bit to the specification.  Thanks,
Hi Alex,

Sorry for jumping into this discussion suddenly. Yes, I absolutely agree
that getting the vIOMMU working with assigned devices is a more
important thing than arguing about vIOMMU performance with the CM bit.
Actually, I am very eager to get the vIOMMU working with vfio for
assigned devices in the guest, as I want to try DPDK via VFIO in the
guest (this is the reason I searched for vIOMMU support in QEMU and
found this discussion). My concern with the CM bit is not performance,
but that it is not a generic way; still, it is better than nothing :)

Thanks,
-Kai

>
> Alex
>
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2016-06-07 22:39 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-21 16:19 [Qemu-devel] [PATCH v3 0/3] IOMMU: Add Support to VFIO devices with vIOMMU present Aviv B.D
2016-05-21 16:19 ` [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest Aviv B.D
2016-05-21 16:42   ` Jan Kiszka
2016-06-02  8:44     ` Peter Xu
2016-06-02 13:00       ` Alex Williamson
2016-06-02 13:14         ` Jan Kiszka
2016-06-02 13:17           ` Jan Kiszka
2016-06-02 16:15           ` Michael S. Tsirkin
2016-06-06  5:04           ` Peter Xu
2016-06-06 13:11             ` Alex Williamson
2016-06-06 13:43               ` Peter Xu
2016-06-06 17:02                 ` Alex Williamson
2016-06-07  3:20                   ` Peter Xu
2016-06-07  3:58                     ` Alex Williamson
2016-06-07  5:00                       ` Peter Xu
2016-06-07  5:21                       ` Huang, Kai
2016-06-07 18:46                         ` Alex Williamson
2016-06-07 22:39                           ` Huang, Kai
2016-05-24  8:14   ` Jason Wang
2016-05-24  9:25     ` Jan Kiszka
2016-05-28 16:12       ` Aviv B.D.
2016-05-28 16:34         ` Kiszka, Jan
2016-05-21 16:19 ` [Qemu-devel] [PATCH v3 2/3] IOMMU: change iommu_op->translate's is_write to flags, add support to NO_FAIL flag mode Aviv B.D
2016-06-06  5:04   ` Peter Xu
2016-05-21 16:19 ` [Qemu-devel] [PATCH v3 3/3] IOMMU: Integrate between VFIO and vIOMMU to support device assignment Aviv B.D
2016-05-23 17:53   ` Alex Williamson
2016-05-26 20:58     ` Alex Williamson
2016-05-28 10:52       ` Aviv B.D.
2016-05-28 16:02         ` Alex Williamson
2016-05-28 16:10           ` Aviv B.D.
2016-05-28 17:39             ` Alex Williamson
2016-05-28 18:14               ` Aviv B.D.
2016-05-28 19:48                 ` Alex Williamson
2016-06-02 13:09                   ` Aviv B.D.
2016-06-02 13:34                     ` Alex Williamson
2016-06-06  8:09                       ` Peter Xu
2016-06-06 18:21                         ` Alex Williamson
2016-06-07 13:20                           ` Peter Xu
2016-06-06  7:38     ` Peter Xu
2016-06-06 17:30       ` Alex Williamson
