* [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes
From: Peter Xu @ 2018-05-04  3:08 UTC
  To: qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	peterx, Jason Wang

v2:
- fix patchew code style warnings
- interval tree: postpone malloc when inserting; simplify node removal
  a bit where appropriate [Jason]
- fix up comment and commit message for iommu lock patch [Kevin]
- protect the context cache too using the iommu lock [Kevin, Jason]
- add a detailed comment in patch 8 to explain the modify-PTE problem
  [Jason, Kevin]

Online repo:

  https://github.com/xzpeter/qemu/tree/fix-vtd-dma

This series fixes several major problems that current code has:

- Issue 1: when handling very large PSI UNMAP invalidations, the
  current code may skip the notification entirely, while we should
  always send it.

- Issue 2: the IOTLB is not thread safe, yet block dataplane can
  access and update it in parallel.

- Issue 3: for devices registered with UNMAP-only notifiers (e.g.,
  vhost), we don't really need to walk the page tables for PSIs; we
  can deliver the notification directly.

- Issue 4: an unsafe window for MAP-notified devices like vfio-pci
  (and, in the future, vDPA as well).  The problem is that, for
  domain invalidations, we currently do the following to make sure
  the shadow page tables are correctly synced:

  1. unmap the whole address space
  2. replay the whole address space, map existing pages

  However, between step 1 and step 2 there is a tiny window (it can be
  as long as 3ms) during which the shadow page table is either invalid
  or incomplete (since we're still rebuilding it).  That's a fatal
  flaw: the device never knows this is happening and may still DMA
  into those memory regions (see the sketch below).
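
A minimal sketch of where the window sits (illustrative stubs only,
not QEMU code; all names here are hypothetical):

    /* Illustrative stubs; names are hypothetical, not QEMU APIs. */
    typedef struct { int dummy; } ShadowTable;

    static void unmap_all(ShadowTable *st)  { /* UNMAP all entries */ }
    static void replay_map(ShadowTable *st) { /* walk + re-MAP     */ }

    static void domain_invalidate_sync(ShadowTable *st)
    {
        unmap_all(st);    /* step 1: shadow table now empty */
                          /* <-- unsafe window (up to ~3ms): the device
                           *     may still DMA, but mappings are gone or
                           *     partial until the replay completes */
        replay_map(st);   /* step 2: shadow table complete again */
    }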

Patch 1 fixes issue 1.  I put it first since it's picked from an old
post.

Patch 2 is a cleanup that removes the useless IntelIOMMUNotifierNode struct.

Patch 3 fixes issue 2.

Patch 4 fixes issue 3.

Patches 5-9 fix issue 4.  Here a very simple interval tree is
implemented based on GTree.  It differs from a general interval tree
in that it does not allow the user to attach private data (e.g.,
translated addresses) to the ranges.  The upside is that the tree can
then merge adjacent interval leaves, so hopefully we won't consume
much memory even if there are a lot of mappings (which happens for
nested virt - when mapping the whole L2 guest RAM range, it can be
GBs at least).

Patch 10 is another big cleanup that only works after patch 9.

Tests:

- device assignment to L1, and even L2, guests.  With this series
  applied (and the kernel IOMMU patches:
  https://lkml.org/lkml/2018/4/18/5), we can even nest vIOMMUs now,
  e.g., we can specify a vIOMMU in an L2 guest with assigned devices
  and things will work.  We couldn't before.

- vhost smoke test for regression.

Please review.  Thanks,

Peter Xu (10):
  intel-iommu: send PSI always even if across PDEs
  intel-iommu: remove IntelIOMMUNotifierNode
  intel-iommu: add iommu lock
  intel-iommu: only do page walk for MAP notifiers
  intel-iommu: introduce vtd_page_walk_info
  intel-iommu: pass in address space when page walk
  util: implement simple interval tree logic
  intel-iommu: maintain per-device iova ranges
  intel-iommu: don't unmap all for shadow page table
  intel-iommu: remove notify_unmap for page walk

 include/hw/i386/intel_iommu.h |  19 ++-
 include/qemu/interval-tree.h  | 130 +++++++++++++++
 hw/i386/intel_iommu.c         | 306 +++++++++++++++++++++++++---------
 util/interval-tree.c          | 208 +++++++++++++++++++++++
 hw/i386/trace-events          |   3 +-
 util/Makefile.objs            |   1 +
 6 files changed, 579 insertions(+), 88 deletions(-)
 create mode 100644 include/qemu/interval-tree.h
 create mode 100644 util/interval-tree.c

-- 
2.17.0


* [Qemu-devel] [PATCH v2 01/10] intel-iommu: send PSI always even if across PDEs
From: Peter Xu @ 2018-05-04  3:08 UTC
  To: qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	peterx, Jason Wang

During IOVA page table walking, there is a special case when the PSI
covers one whole PDE (Page Directory Entry, which contains 512 Page
Table Entries) or more.  Previously we skipped that entry and did not
notify the IOMMU notifiers.  This is not correct: we should always send
an UNMAP notification to registered UNMAP notifiers in this case.

For UNMAP-only notifiers, this can leave IOTLB entries cached in the
devices even after they have become invalid.  For MAP/UNMAP notifiers
like vfio-pci, this will cause stale page mappings.

This special case doesn't trigger often, but it is very easily
triggered by nested device assignment, since in that case we'll
possibly map the whole L2 guest RAM region into the device's IOVA
address space (several GBs at least), which is far bigger than the
normal kernel driver usage of the device (normally tens of MBs).
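
As a quick, self-contained illustration of the arithmetic (not QEMU
code): the PSI address mask "am" encodes the number of 4K pages as
(1 << am), so with the naturally aligned base address the spec
requires, any PSI with am >= 9 spans one whole PDE (512 PTEs x 4K =
2MB) or more, and used to hit the skipped path:

    #include <inttypes.h>
    #include <stdio.h>

    #define VTD_PAGE_SIZE 4096ULL
    #define PTES_PER_PDE  512ULL

    int main(void)
    {
        for (unsigned am = 0; am <= 10; am++) {
            /* same formula as the series uses: size = (1 << am) * 4K */
            uint64_t size = (1ULL << am) * VTD_PAGE_SIZE;
            printf("am=%2u -> %6" PRIu64 " KiB%s\n", am, size / 1024,
                   size >= PTES_PER_PDE * VTD_PAGE_SIZE ?
                   "  (covers one whole PDE or more)" : "");
        }
        return 0;
    }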

Without this patch applied to L1 QEMU, nested device assignment to L2
guests will dump some errors like:

qemu-system-x86_64: VFIO_MAP_DMA: -17
qemu-system-x86_64: vfio_dma_map(0x557305420c30, 0xad000, 0x1000,
                    0x7f89a920d000) = -17 (File exists)

Acked-by: Jason Wang <jasowang@redhat.com>
[peterx: rewrite the commit message]
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 42 ++++++++++++++++++++++++++++++------------
 1 file changed, 30 insertions(+), 12 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index fb31de9416..b359efd6f9 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -722,6 +722,15 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
 
 typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
 
+static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
+                             vtd_page_walk_hook hook_fn, void *private)
+{
+    assert(hook_fn);
+    trace_vtd_page_walk_one(level, entry->iova, entry->translated_addr,
+                            entry->addr_mask, entry->perm);
+    return hook_fn(entry, private);
+}
+
 /**
  * vtd_page_walk_level - walk over specific level for IOVA range
  *
@@ -781,28 +790,37 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
          */
         entry_valid = read_cur | write_cur;
 
+        entry.target_as = &address_space_memory;
+        entry.iova = iova & subpage_mask;
+        entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
+        entry.addr_mask = ~subpage_mask;
+
         if (vtd_is_last_slpte(slpte, level)) {
-            entry.target_as = &address_space_memory;
-            entry.iova = iova & subpage_mask;
             /* NOTE: this is only meaningful if entry_valid == true */
             entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
-            entry.addr_mask = ~subpage_mask;
-            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
             if (!entry_valid && !notify_unmap) {
                 trace_vtd_page_walk_skip_perm(iova, iova_next);
                 goto next;
             }
-            trace_vtd_page_walk_one(level, entry.iova, entry.translated_addr,
-                                    entry.addr_mask, entry.perm);
-            if (hook_fn) {
-                ret = hook_fn(&entry, private);
-                if (ret < 0) {
-                    return ret;
-                }
+            ret = vtd_page_walk_one(&entry, level, hook_fn, private);
+            if (ret < 0) {
+                return ret;
             }
         } else {
             if (!entry_valid) {
-                trace_vtd_page_walk_skip_perm(iova, iova_next);
+                if (notify_unmap) {
+                    /*
+                     * The whole entry is invalid; unmap it all.
+                     * Translated address is meaningless, zero it.
+                     */
+                    entry.translated_addr = 0x0;
+                    ret = vtd_page_walk_one(&entry, level, hook_fn, private);
+                    if (ret < 0) {
+                        return ret;
+                    }
+                } else {
+                    trace_vtd_page_walk_skip_perm(iova, iova_next);
+                }
                 goto next;
             }
             ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, aw), iova,
-- 
2.17.0


* [Qemu-devel] [PATCH v2 02/10] intel-iommu: remove IntelIOMMUNotifierNode
From: Peter Xu @ 2018-05-04  3:08 UTC
  To: qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	peterx, Jason Wang

That struct is not really necessary.  Remove it and put the list entry
directly into VTDAddressSpace.  That simplifies the code a lot.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/hw/i386/intel_iommu.h |  9 ++------
 hw/i386/intel_iommu.c         | 41 ++++++++++-------------------------
 2 files changed, 14 insertions(+), 36 deletions(-)

diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 45ec8919b6..220697253f 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -67,7 +67,6 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
 typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
 typedef struct VTDIrq VTDIrq;
 typedef struct VTD_MSIMessage VTD_MSIMessage;
-typedef struct IntelIOMMUNotifierNode IntelIOMMUNotifierNode;
 
 /* Context-Entry */
 struct VTDContextEntry {
@@ -93,6 +92,7 @@ struct VTDAddressSpace {
     MemoryRegion iommu_ir;      /* Interrupt region: 0xfeeXXXXX */
     IntelIOMMUState *iommu_state;
     VTDContextCacheEntry context_cache_entry;
+    QLIST_ENTRY(VTDAddressSpace) next;
 };
 
 struct VTDBus {
@@ -253,11 +253,6 @@ struct VTD_MSIMessage {
 /* When IR is enabled, all MSI/MSI-X data bits should be zero */
 #define VTD_IR_MSI_DATA          (0)
 
-struct IntelIOMMUNotifierNode {
-    VTDAddressSpace *vtd_as;
-    QLIST_ENTRY(IntelIOMMUNotifierNode) next;
-};
-
 /* The iommu (DMAR) device state struct */
 struct IntelIOMMUState {
     X86IOMMUState x86_iommu;
@@ -295,7 +290,7 @@ struct IntelIOMMUState {
     GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
     VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
     /* list of registered notifiers */
-    QLIST_HEAD(, IntelIOMMUNotifierNode) notifiers_list;
+    QLIST_HEAD(, VTDAddressSpace) notifiers_list;
 
     /* interrupt remapping */
     bool intr_enabled;              /* Whether guest enabled IR */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index b359efd6f9..5987b48d43 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1248,10 +1248,10 @@ static void vtd_interrupt_remap_table_setup(IntelIOMMUState *s)
 
 static void vtd_iommu_replay_all(IntelIOMMUState *s)
 {
-    IntelIOMMUNotifierNode *node;
+    VTDAddressSpace *vtd_as;
 
-    QLIST_FOREACH(node, &s->notifiers_list, next) {
-        memory_region_iommu_replay_all(&node->vtd_as->iommu);
+    QLIST_FOREACH(vtd_as, &s->notifiers_list, next) {
+        memory_region_iommu_replay_all(&vtd_as->iommu);
     }
 }
 
@@ -1372,7 +1372,6 @@ static void vtd_iotlb_global_invalidate(IntelIOMMUState *s)
 
 static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
 {
-    IntelIOMMUNotifierNode *node;
     VTDContextEntry ce;
     VTDAddressSpace *vtd_as;
 
@@ -1381,8 +1380,7 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
     g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_domain,
                                 &domain_id);
 
-    QLIST_FOREACH(node, &s->notifiers_list, next) {
-        vtd_as = node->vtd_as;
+    QLIST_FOREACH(vtd_as, &s->notifiers_list, next) {
         if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
                                       vtd_as->devfn, &ce) &&
             domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
@@ -1402,12 +1400,11 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
                                            uint16_t domain_id, hwaddr addr,
                                            uint8_t am)
 {
-    IntelIOMMUNotifierNode *node;
+    VTDAddressSpace *vtd_as;
     VTDContextEntry ce;
     int ret;
 
-    QLIST_FOREACH(node, &(s->notifiers_list), next) {
-        VTDAddressSpace *vtd_as = node->vtd_as;
+    QLIST_FOREACH(vtd_as, &(s->notifiers_list), next) {
         ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
                                        vtd_as->devfn, &ce);
         if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
@@ -2344,8 +2341,6 @@ static void vtd_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
 {
     VTDAddressSpace *vtd_as = container_of(iommu, VTDAddressSpace, iommu);
     IntelIOMMUState *s = vtd_as->iommu_state;
-    IntelIOMMUNotifierNode *node = NULL;
-    IntelIOMMUNotifierNode *next_node = NULL;
 
     if (!s->caching_mode && new & IOMMU_NOTIFIER_MAP) {
         error_report("We need to set caching-mode=1 for intel-iommu to enable "
@@ -2354,21 +2349,11 @@ static void vtd_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
     }
 
     if (old == IOMMU_NOTIFIER_NONE) {
-        node = g_malloc0(sizeof(*node));
-        node->vtd_as = vtd_as;
-        QLIST_INSERT_HEAD(&s->notifiers_list, node, next);
-        return;
-    }
-
-    /* update notifier node with new flags */
-    QLIST_FOREACH_SAFE(node, &s->notifiers_list, next, next_node) {
-        if (node->vtd_as == vtd_as) {
-            if (new == IOMMU_NOTIFIER_NONE) {
-                QLIST_REMOVE(node, next);
-                g_free(node);
-            }
-            return;
-        }
+        /* Insert new ones */
+        QLIST_INSERT_HEAD(&s->notifiers_list, vtd_as, next);
+    } else if (new == IOMMU_NOTIFIER_NONE) {
+        /* Remove old ones */
+        QLIST_REMOVE(vtd_as, next);
     }
 }
 
@@ -2838,12 +2823,10 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
 
 static void vtd_address_space_unmap_all(IntelIOMMUState *s)
 {
-    IntelIOMMUNotifierNode *node;
     VTDAddressSpace *vtd_as;
     IOMMUNotifier *n;
 
-    QLIST_FOREACH(node, &s->notifiers_list, next) {
-        vtd_as = node->vtd_as;
+    QLIST_FOREACH(vtd_as, &s->notifiers_list, next) {
         IOMMU_NOTIFIER_FOREACH(n, &vtd_as->iommu) {
             vtd_address_space_unmap(vtd_as, n);
         }
-- 
2.17.0


* [Qemu-devel] [PATCH v2 03/10] intel-iommu: add iommu lock
From: Peter Xu @ 2018-05-04  3:08 UTC
  To: qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	peterx, Jason Wang

Add a per-IOMMU big lock to protect IOMMU state.  Currently the only
things to be protected are the IOTLB and context caches, since they
can be accessed even without the BQL, e.g., in IO dataplane.

Note that we don't need to protect the device page tables, since those
are fully controlled by the guest kernel.  However, there is still a
possibility that malicious drivers will program the device to disobey
that rule.  In that case QEMU can't really do anything useful; the
guest itself will have to bear the consequences.
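
Condensed from the diff below (trimmed, for illustration only), the
translate fast path becomes a classic guarded section; note that the
lock is dropped on every exit, including the error path:

    vtd_iommu_lock(s);
    /* cache lookup, potentially racing with invalidation threads */
    iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
    ...
    vtd_update_iotlb(s, source_id, VTD_CONTEXT_ENTRY_DID(ce.hi), addr,
                     slpte, access_flags, level);    /* cache fill */
    vtd_iommu_unlock(s);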

Reported-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/hw/i386/intel_iommu.h |  6 +++++
 hw/i386/intel_iommu.c         | 43 +++++++++++++++++++++++++++++++----
 2 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 220697253f..ee517704e7 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -300,6 +300,12 @@ struct IntelIOMMUState {
     OnOffAuto intr_eim;             /* Toggle for EIM cabability */
     bool buggy_eim;                 /* Force buggy EIM unless eim=off */
     uint8_t aw_bits;                /* Host/IOVA address width (in bits) */
+
+    /*
+     * Protects IOMMU states in general.  Currently it protects the
+     * per-IOMMU IOTLB cache, and context entry cache in VTDAddressSpace.
+     */
+    QemuMutex iommu_lock;
 };
 
 /* Find the VTD Address space associated with the given bus pointer,
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 5987b48d43..112971638d 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -128,6 +128,16 @@ static uint64_t vtd_set_clear_mask_quad(IntelIOMMUState *s, hwaddr addr,
     return new_val;
 }
 
+static inline void vtd_iommu_lock(IntelIOMMUState *s)
+{
+    qemu_mutex_lock(&s->iommu_lock);
+}
+
+static inline void vtd_iommu_unlock(IntelIOMMUState *s)
+{
+    qemu_mutex_unlock(&s->iommu_lock);
+}
+
 /* GHashTable functions */
 static gboolean vtd_uint64_equal(gconstpointer v1, gconstpointer v2)
 {
@@ -172,7 +182,7 @@ static gboolean vtd_hash_remove_by_page(gpointer key, gpointer value,
 }
 
 /* Reset all the gen of VTDAddressSpace to zero and set the gen of
- * IntelIOMMUState to 1.
+ * IntelIOMMUState to 1.  Must be with IOMMU lock held.
  */
 static void vtd_reset_context_cache(IntelIOMMUState *s)
 {
@@ -197,12 +207,19 @@ static void vtd_reset_context_cache(IntelIOMMUState *s)
     s->context_cache_gen = 1;
 }
 
-static void vtd_reset_iotlb(IntelIOMMUState *s)
+static void vtd_reset_iotlb_locked(IntelIOMMUState *s)
 {
     assert(s->iotlb);
     g_hash_table_remove_all(s->iotlb);
 }
 
+static void vtd_reset_iotlb(IntelIOMMUState *s)
+{
+    vtd_iommu_lock(s);
+    vtd_reset_iotlb_locked(s);
+    vtd_iommu_unlock(s);
+}
+
 static uint64_t vtd_get_iotlb_key(uint64_t gfn, uint16_t source_id,
                                   uint32_t level)
 {
@@ -215,6 +232,7 @@ static uint64_t vtd_get_iotlb_gfn(hwaddr addr, uint32_t level)
     return (addr & vtd_slpt_level_page_mask(level)) >> VTD_PAGE_SHIFT_4K;
 }
 
+/* Must be with IOMMU lock held */
 static VTDIOTLBEntry *vtd_lookup_iotlb(IntelIOMMUState *s, uint16_t source_id,
                                        hwaddr addr)
 {
@@ -235,6 +253,7 @@ out:
     return entry;
 }
 
+/* Must be with IOMMU lock held */
 static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
                              uint16_t domain_id, hwaddr addr, uint64_t slpte,
                              uint8_t access_flags, uint32_t level)
@@ -246,7 +265,7 @@ static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
     trace_vtd_iotlb_page_update(source_id, addr, slpte, domain_id);
     if (g_hash_table_size(s->iotlb) >= VTD_IOTLB_MAX_SIZE) {
         trace_vtd_iotlb_reset("iotlb exceeds size limit");
-        vtd_reset_iotlb(s);
+        vtd_reset_iotlb_locked(s);
     }
 
     entry->gfn = gfn;
@@ -1106,7 +1125,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     IntelIOMMUState *s = vtd_as->iommu_state;
     VTDContextEntry ce;
     uint8_t bus_num = pci_bus_num(bus);
-    VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
+    VTDContextCacheEntry *cc_entry;
     uint64_t slpte, page_mask;
     uint32_t level;
     uint16_t source_id = vtd_make_source_id(bus_num, devfn);
@@ -1123,6 +1142,10 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
      */
     assert(!vtd_is_interrupt_addr(addr));
 
+    vtd_iommu_lock(s);
+
+    cc_entry = &vtd_as->context_cache_entry;
+
     /* Try to fetch slpte form IOTLB */
     iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
     if (iotlb_entry) {
@@ -1182,7 +1205,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
          * IOMMU region can be swapped back.
          */
         vtd_pt_enable_fast_path(s, source_id);
-
+        vtd_iommu_unlock(s);
         return true;
     }
 
@@ -1203,6 +1226,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     vtd_update_iotlb(s, source_id, VTD_CONTEXT_ENTRY_DID(ce.hi), addr, slpte,
                      access_flags, level);
 out:
+    vtd_iommu_unlock(s);
     entry->iova = addr & page_mask;
     entry->translated_addr = vtd_get_slpte_addr(slpte, s->aw_bits) & page_mask;
     entry->addr_mask = ~page_mask;
@@ -1210,6 +1234,7 @@ out:
     return true;
 
 error:
+    vtd_iommu_unlock(s);
     entry->iova = 0;
     entry->translated_addr = 0;
     entry->addr_mask = 0;
@@ -1258,10 +1283,13 @@ static void vtd_iommu_replay_all(IntelIOMMUState *s)
 static void vtd_context_global_invalidate(IntelIOMMUState *s)
 {
     trace_vtd_inv_desc_cc_global();
+    /* Protects context cache */
+    vtd_iommu_lock(s);
     s->context_cache_gen++;
     if (s->context_cache_gen == VTD_CONTEXT_CACHE_GEN_MAX) {
         vtd_reset_context_cache(s);
     }
+    vtd_iommu_unlock(s);
     vtd_switch_address_space_all(s);
     /*
      * From VT-d spec 6.5.2.1, a global context entry invalidation
@@ -1377,8 +1405,10 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
 
     trace_vtd_inv_desc_iotlb_domain(domain_id);
 
+    vtd_iommu_lock(s);
     g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_domain,
                                 &domain_id);
+    vtd_iommu_unlock(s);
 
     QLIST_FOREACH(vtd_as, &s->notifiers_list, next) {
         if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
@@ -1426,7 +1456,9 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
     info.domain_id = domain_id;
     info.addr = addr;
     info.mask = ~((1 << am) - 1);
+    vtd_iommu_lock(s);
     g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
+    vtd_iommu_unlock(s);
     vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
 }
 
@@ -3072,6 +3104,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
     }
 
     QLIST_INIT(&s->notifiers_list);
+    qemu_mutex_init(&s->iommu_lock);
     memset(s->vtd_as_by_bus_num, 0, sizeof(s->vtd_as_by_bus_num));
     memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
                           "intel_iommu", DMAR_REG_SIZE);
-- 
2.17.0


* [Qemu-devel] [PATCH v2 04/10] intel-iommu: only do page walk for MAP notifiers
From: Peter Xu @ 2018-05-04  3:08 UTC
  To: qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	peterx, Jason Wang

For UNMAP-only IOMMU notifiers, we don't really need to walk the page
tables.  Speed that procedure up by skipping the page table walk.  That
should boost performance for UNMAP-only notifiers like vhost.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/hw/i386/intel_iommu.h |  2 ++
 hw/i386/intel_iommu.c         | 43 +++++++++++++++++++++++++++++++----
 2 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index ee517704e7..9e0a6c1c6a 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -93,6 +93,8 @@ struct VTDAddressSpace {
     IntelIOMMUState *iommu_state;
     VTDContextCacheEntry context_cache_entry;
     QLIST_ENTRY(VTDAddressSpace) next;
+    /* Superset of notifier flags that this address space has */
+    IOMMUNotifierFlag notifier_flags;
 };
 
 struct VTDBus {
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 112971638d..9a418abfb6 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -138,6 +138,12 @@ static inline void vtd_iommu_unlock(IntelIOMMUState *s)
     qemu_mutex_unlock(&s->iommu_lock);
 }
 
+/* Whether the address space needs to notify new mappings */
+static inline gboolean vtd_as_notify_mappings(VTDAddressSpace *as)
+{
+    return as->notifier_flags & IOMMU_NOTIFIER_MAP;
+}
+
 /* GHashTable functions */
 static gboolean vtd_uint64_equal(gconstpointer v1, gconstpointer v2)
 {
@@ -1433,14 +1439,35 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
     VTDAddressSpace *vtd_as;
     VTDContextEntry ce;
     int ret;
+    hwaddr size = (1 << am) * VTD_PAGE_SIZE;
 
     QLIST_FOREACH(vtd_as, &(s->notifiers_list), next) {
         ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
                                        vtd_as->devfn, &ce);
         if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
-            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
-                          vtd_page_invalidate_notify_hook,
-                          (void *)&vtd_as->iommu, true, s->aw_bits);
+            if (vtd_as_notify_mappings(vtd_as)) {
+                /*
+                 * For MAP-inclusive notifiers, we need to walk the
+                 * page table to sync the shadow page table.
+                 */
+                vtd_page_walk(&ce, addr, addr + size,
+                              vtd_page_invalidate_notify_hook,
+                              (void *)&vtd_as->iommu, true, s->aw_bits);
+            } else {
+                /*
+                 * For UNMAP-only notifiers, we don't need to walk the
+                 * page tables.  We just deliver the PSI down to
+                 * invalidate caches.
+                 */
+                IOMMUTLBEntry entry = {
+                    .target_as = &address_space_memory,
+                    .iova = addr,
+                    .translated_addr = 0,
+                    .addr_mask = size - 1,
+                    .perm = IOMMU_NONE,
+                };
+                memory_region_notify_iommu(&vtd_as->iommu, entry);
+            }
         }
     }
 }
@@ -2380,6 +2407,9 @@ static void vtd_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
         exit(1);
     }
 
+    /* Update per-address-space notifier flags */
+    vtd_as->notifier_flags = new;
+
     if (old == IOMMU_NOTIFIER_NONE) {
         /* Insert new ones */
         QLIST_INSERT_HEAD(&s->notifiers_list, vtd_as, next);
@@ -2890,8 +2920,11 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
                                   PCI_FUNC(vtd_as->devfn),
                                   VTD_CONTEXT_ENTRY_DID(ce.hi),
                                   ce.hi, ce.lo);
-        vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
-                      s->aw_bits);
+        if (vtd_as_notify_mappings(vtd_as)) {
+            /* This is required only for MAP typed notifiers */
+            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
+                          s->aw_bits);
+        }
     } else {
         trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
                                     PCI_FUNC(vtd_as->devfn));
-- 
2.17.0


* [Qemu-devel] [PATCH v2 05/10] intel-iommu: introduce vtd_page_walk_info
From: Peter Xu @ 2018-05-04  3:08 UTC
  To: qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	peterx, Jason Wang

During the recursive walk of the IOVA page tables, some stack
variables are constants that never change during the whole walk.
Isolate them into a struct so that we don't need to pass those
constants down the stack again and again.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 56 ++++++++++++++++++++++++++++---------------
 1 file changed, 37 insertions(+), 19 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 9a418abfb6..b2b2a0a441 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -747,9 +747,27 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
 
 typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
 
+/**
+ * Constant information used during page walking
+ *
+ * @hook_fn: hook func to be called when detected page
+ * @private: private data to be passed into hook func
+ * @notify_unmap: whether we should notify invalid entries
+ * @aw: maximum address width
+ */
+typedef struct {
+    vtd_page_walk_hook hook_fn;
+    void *private;
+    bool notify_unmap;
+    uint8_t aw;
+} vtd_page_walk_info;
+
 static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
-                             vtd_page_walk_hook hook_fn, void *private)
+                             vtd_page_walk_info *info)
 {
+    vtd_page_walk_hook hook_fn = info->hook_fn;
+    void *private = info->private;
+
     assert(hook_fn);
     trace_vtd_page_walk_one(level, entry->iova, entry->translated_addr,
                             entry->addr_mask, entry->perm);
@@ -762,17 +780,13 @@ static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
  * @addr: base GPA addr to start the walk
  * @start: IOVA range start address
  * @end: IOVA range end address (start <= addr < end)
- * @hook_fn: hook func to be called when detected page
- * @private: private data to be passed into hook func
  * @read: whether parent level has read permission
  * @write: whether parent level has write permission
- * @notify_unmap: whether we should notify invalid entries
- * @aw: maximum address width
+ * @info: constant information for the page walk
  */
 static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
-                               uint64_t end, vtd_page_walk_hook hook_fn,
-                               void *private, uint32_t level, bool read,
-                               bool write, bool notify_unmap, uint8_t aw)
+                               uint64_t end, uint32_t level, bool read,
+                               bool write, vtd_page_walk_info *info)
 {
     bool read_cur, write_cur, entry_valid;
     uint32_t offset;
@@ -822,24 +836,24 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
 
         if (vtd_is_last_slpte(slpte, level)) {
             /* NOTE: this is only meaningful if entry_valid == true */
-            entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
-            if (!entry_valid && !notify_unmap) {
+            entry.translated_addr = vtd_get_slpte_addr(slpte, info->aw);
+            if (!entry_valid && !info->notify_unmap) {
                 trace_vtd_page_walk_skip_perm(iova, iova_next);
                 goto next;
             }
-            ret = vtd_page_walk_one(&entry, level, hook_fn, private);
+            ret = vtd_page_walk_one(&entry, level, info);
             if (ret < 0) {
                 return ret;
             }
         } else {
             if (!entry_valid) {
-                if (notify_unmap) {
+                if (info->notify_unmap) {
                     /*
                      * The whole entry is invalid; unmap it all.
                      * Translated address is meaningless, zero it.
                      */
                     entry.translated_addr = 0x0;
-                    ret = vtd_page_walk_one(&entry, level, hook_fn, private);
+                    ret = vtd_page_walk_one(&entry, level, info);
                     if (ret < 0) {
                         return ret;
                     }
@@ -848,10 +862,9 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
                 }
                 goto next;
             }
-            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, aw), iova,
-                                      MIN(iova_next, end), hook_fn, private,
-                                      level - 1, read_cur, write_cur,
-                                      notify_unmap, aw);
+            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, info->aw),
+                                      iova, MIN(iova_next, end), level - 1,
+                                      read_cur, write_cur, info);
             if (ret < 0) {
                 return ret;
             }
@@ -880,6 +893,12 @@ static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
 {
     dma_addr_t addr = vtd_ce_get_slpt_base(ce);
     uint32_t level = vtd_ce_get_level(ce);
+    vtd_page_walk_info info = {
+        .hook_fn = hook_fn,
+        .private = private,
+        .notify_unmap = notify_unmap,
+        .aw = aw,
+    };
 
     if (!vtd_iova_range_check(start, ce, aw)) {
         return -VTD_FR_ADDR_BEYOND_MGAW;
@@ -890,8 +909,7 @@ static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
         end = vtd_iova_limit(ce, aw);
     }
 
-    return vtd_page_walk_level(addr, start, end, hook_fn, private,
-                               level, true, true, notify_unmap, aw);
+    return vtd_page_walk_level(addr, start, end, level, true, true, &info);
 }
 
 /* Map a device to its corresponding domain (context-entry) */
-- 
2.17.0


* [Qemu-devel] [PATCH v2 06/10] intel-iommu: pass in address space when page walk
From: Peter Xu @ 2018-05-04  3:08 UTC
  To: qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	peterx, Jason Wang

Pass in the VTDAddressSpace instead of the aw bits when doing the page
walk.  The VTDAddressSpace contains the aw bits information, and we'll
need more from it in the follow-up patches regarding the address
spaces.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index b2b2a0a441..83769f2b8c 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -753,13 +753,13 @@ typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
  * @hook_fn: hook func to be called when detected page
  * @private: private data to be passed into hook func
  * @notify_unmap: whether we should notify invalid entries
- * @aw: maximum address width
+ * @as: VT-d address space of the device
  */
 typedef struct {
+    VTDAddressSpace *as;
     vtd_page_walk_hook hook_fn;
     void *private;
     bool notify_unmap;
-    uint8_t aw;
 } vtd_page_walk_info;
 
 static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
@@ -796,6 +796,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
     uint64_t iova = start;
     uint64_t iova_next;
     int ret = 0;
+    uint8_t aw = info->as->iommu_state->aw_bits;
 
     trace_vtd_page_walk_level(addr, level, start, end);
 
@@ -836,7 +837,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
 
         if (vtd_is_last_slpte(slpte, level)) {
             /* NOTE: this is only meaningful if entry_valid == true */
-            entry.translated_addr = vtd_get_slpte_addr(slpte, info->aw);
+            entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
             if (!entry_valid && !info->notify_unmap) {
                 trace_vtd_page_walk_skip_perm(iova, iova_next);
                 goto next;
@@ -862,7 +863,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
                 }
                 goto next;
             }
-            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, info->aw),
+            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, aw),
                                       iova, MIN(iova_next, end), level - 1,
                                       read_cur, write_cur, info);
             if (ret < 0) {
@@ -885,19 +886,20 @@ next:
  * @end: IOVA range end address (start <= addr < end)
  * @hook_fn: the hook that to be called for each detected area
  * @private: private data for the hook function
- * @aw: maximum address width
+ * @as: the VT-d address space of the device
  */
 static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
                          vtd_page_walk_hook hook_fn, void *private,
-                         bool notify_unmap, uint8_t aw)
+                         bool notify_unmap, VTDAddressSpace *as)
 {
     dma_addr_t addr = vtd_ce_get_slpt_base(ce);
     uint32_t level = vtd_ce_get_level(ce);
+    uint8_t aw = as->iommu_state->aw_bits;
     vtd_page_walk_info info = {
         .hook_fn = hook_fn,
         .private = private,
         .notify_unmap = notify_unmap,
-        .aw = aw,
+        .as = as,
     };
 
     if (!vtd_iova_range_check(start, ce, aw)) {
@@ -1470,7 +1472,7 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
                  */
                 vtd_page_walk(&ce, addr, addr + size,
                               vtd_page_invalidate_notify_hook,
-                              (void *)&vtd_as->iommu, true, s->aw_bits);
+                              (void *)&vtd_as->iommu, true, vtd_as);
             } else {
                 /*
                  * For UNMAP-only notifiers, we don't need to walk the
@@ -2941,7 +2943,7 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
         if (vtd_as_notify_mappings(vtd_as)) {
             /* This is required only for MAP typed notifiers */
             vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
-                          s->aw_bits);
+                          vtd_as);
         }
     } else {
         trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
-- 
2.17.0


* [Qemu-devel] [PATCH v2 07/10] util: implement simple interval tree logic
From: Peter Xu @ 2018-05-04  3:08 UTC
  To: qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	peterx, Jason Wang

Introduce a very simple interval tree implementation based on GTree.
The current implementation is mostly tailored to maintaining and
tracing device-mapped IOVA ranges, but it might still be useful to
other modules in the future.

It is naive in that it does not even allow the user to pass in private
structs along with the ranges.  The upside is that, without such
information, the tree can merge adjacent ranges whenever necessary.
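
As a usage sketch, here is a hypothetical standalone program written
against the API added below (glib plus the new header are the only
dependencies); it shows how adjacent inserts get merged and how a
partial remove splits a node:

    #include <stdio.h>
    #include "qemu/interval-tree.h"

    static gboolean print_range(ITValue start, ITValue end)
    {
        printf("  [0x%llx, 0x%llx]\n", start, end);
        return FALSE;   /* returning TRUE would stop the iteration */
    }

    int main(void)
    {
        ITTree *t = it_tree_new();

        it_tree_insert(t, 0x0, 0xfff);
        it_tree_insert(t, 0x1000, 0x1fff);  /* adjacent: merged      */
        it_tree_foreach(t, print_range);    /* one node [0x0,0x1fff] */

        it_tree_remove(t, 0x800, 0xfff);    /* inner remove: splits  */
        it_tree_foreach(t, print_range);    /* [0x0,0x7ff] and       */
                                            /* [0x1000,0x1fff]       */
        it_tree_destroy(t);
        return 0;
    }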

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/qemu/interval-tree.h | 130 ++++++++++++++++++++++
 util/interval-tree.c         | 208 +++++++++++++++++++++++++++++++++++
 util/Makefile.objs           |   1 +
 3 files changed, 339 insertions(+)
 create mode 100644 include/qemu/interval-tree.h
 create mode 100644 util/interval-tree.c

diff --git a/include/qemu/interval-tree.h b/include/qemu/interval-tree.h
new file mode 100644
index 0000000000..14d364455b
--- /dev/null
+++ b/include/qemu/interval-tree.h
@@ -0,0 +1,130 @@
+/*
+ * A very simplified interval tree implementation based on GTree.
+ *
+ * Copyright 2018 Red Hat, Inc.
+ *
+ * Authors:
+ *  Peter Xu <peterx@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ */
+#ifndef INTERVAL_TREE_H
+#define INTERVAL_TREE_H
+
+/*
+ * Currently the interval tree only keeps range information, and no
+ * extra user data is allowed for each element.  A benefit is that we
+ * can merge adjacent ranges internally within the tree.  That can
+ * save a lot of memory when the ranges are split up but mostly
+ * contiguous.
+ *
+ * Note that the current implementation does not provide any thread
+ * protection.  Callers of the interval tree are themselves
+ * responsible for thread safety.
+ */
+
+#include <glib.h>
+
+#define  IT_OK           (0)
+#define  IT_ERR_OVERLAP  (-1)
+
+typedef unsigned long long ITValue;
+typedef struct ITTree ITTree;
+typedef gboolean (*it_tree_iterator)(ITValue start, ITValue end);
+
+struct ITRange {
+    ITValue start;
+    ITValue end;
+};
+typedef struct ITRange ITRange;
+
+/**
+ * it_tree_new:
+ *
+ * Create a new interval tree.
+ *
+ * Returns: the tree pointer when succeeded, or NULL if error.
+ */
+ITTree *it_tree_new(void);
+
+/**
+ * it_tree_insert:
+ *
+ * @tree: the interval tree to insert into
+ * @start: the start of range, inclusive
+ * @end: the end of range, inclusive
+ *
+ * Insert an interval range into the tree.  If it overlaps with any
+ * existing range, IT_ERR_OVERLAP will be returned.
+ *
+ * Return: 0 if succeeded, or <0 if error.
+ */
+int it_tree_insert(ITTree *tree, ITValue start, ITValue end);
+
+/**
+ * it_tree_remove:
+ *
+ * @tree: the interval tree to remove range from
+ * @start: the start of range, inclusive
+ * @end: the end of range, inclusive
+ *
+ * Remove a range from the tree.  The range does not need to be
+ * exactly what was inserted.  All the ranges that are covered by the
+ * provided range will be removed from the tree.
+ *
+ * Return: 0 if succeeded, or <0 if error.
+ */
+int it_tree_remove(ITTree *tree, ITValue start, ITValue end);
+
+/**
+ * it_tree_find:
+ *
+ * @tree: the interval tree to search from
+ * @start: the start of range, inclusive
+ * @end: the end of range, inclusive
+ *
+ * Search for a range in the interval tree that overlaps with the
+ * range specified.  Only the first found range will be returned.
+ *
+ * Return: ITRange if found, or NULL if not found.  Note: the returned
+ * ITRange pointer is maintained internally.  User should only read
+ * the content but never modify or free the content.
+ */
+ITRange *it_tree_find(ITTree *tree, ITValue start, ITValue end);
+
+/**
+ * it_tree_find_value:
+ *
+ * @tree: the interval tree to search from
+ * @value: the value to find
+ *
+ * Similar to it_tree_find(), but it searches for the range [value, value].
+ *
+ * Return: same as it_tree_find().
+ */
+ITRange *it_tree_find_value(ITTree *tree, ITValue value);
+
+/**
+ * it_tree_foreach:
+ *
+ * @tree: the interval tree to iterate on
+ * @iterator: the iterator callback for the ranges; return true to stop
+ *
+ * Iterate over all the ranges in the tree, calling @iterator on each.
+ *
+ * Return: None.
+ */
+void it_tree_foreach(ITTree *tree, it_tree_iterator iterator);
+
+/**
+ * it_tree_destroy:
+ *
+ * @tree: the interval tree to destroy
+ *
+ * Destroy an existing interval tree.
+ *
+ * Return: None.
+ */
+void it_tree_destroy(ITTree *tree);
+
+#endif
diff --git a/util/interval-tree.c b/util/interval-tree.c
new file mode 100644
index 0000000000..3d2bb951aa
--- /dev/null
+++ b/util/interval-tree.c
@@ -0,0 +1,208 @@
+/*
+ * A very simplified interval tree implementation based on GTree.
+ *
+ * Copyright 2018 Red Hat, Inc.
+ *
+ * Authors:
+ *  Peter Xu <peterx@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ */
+
+#include <glib.h>
+#include "qemu/interval-tree.h"
+
+/*
+ * Each element of the internal tree is an ITRange.  It is shared
+ * between the key and the value of the element, i.e., we can view it
+ * as a tree with keys only and no values.
+ */
+
+struct ITTree {
+    GTree *tree;
+};
+
+static int it_tree_compare(gconstpointer a, gconstpointer b, gpointer data)
+{
+    const ITRange *r1 = a, *r2 = b;
+
+    if (r1->start > r2->end) {
+        return 1;
+    }
+
+    if (r1->end < r2->start) {
+        return -1;
+    }
+
+    /* Overlapped */
+    return 0;
+}
+
+/* Find out intersection of range A and B, put into OUT */
+static inline void it_range_and(ITRange *out, ITRange *a, ITRange *b)
+{
+    out->start = MAX(a->start, b->start);
+    out->end = MIN(a->end, b->end);
+}
+
+static inline gboolean it_range_equal(ITRange *a, ITRange *b)
+{
+    return a->start == b->start && a->end == b->end;
+}
+
+/* Whether ITRange A is a superset of B */
+static inline gboolean it_range_cover(ITRange *a, ITRange *b)
+{
+    return a->start <= b->start && a->end >= b->end;
+}
+
+ITTree *it_tree_new(void)
+{
+    ITTree *ittree = g_new0(ITTree, 1);
+
+    /* We don't have values actually, no need to free */
+    ittree->tree = g_tree_new_full(it_tree_compare, NULL, g_free, NULL);
+
+    return ittree;
+}
+
+ITRange *it_tree_find(ITTree *tree, ITValue start, ITValue end)
+{
+    ITRange range;
+
+    g_assert(tree);
+
+    range.start = start;
+    range.end = end;
+
+    return g_tree_lookup(tree->tree, &range);
+}
+
+ITRange *it_tree_find_value(ITTree *tree, ITValue value)
+{
+    return it_tree_find(tree, value, value);
+}
+
+static inline void it_tree_insert_internal(GTree *gtree, ITRange *range)
+{
+    /* Key and value are sharing the same range data */
+    g_tree_insert(gtree, range, range);
+}
+
+int it_tree_insert(ITTree *tree, ITValue start, ITValue end)
+{
+    ITRange range, *new, *overlap;
+    GTree *gtree;
+
+    g_assert(tree);
+    g_assert(start <= end);
+
+    gtree = tree->tree;
+
+    range.start = start;
+    range.end = end;
+
+    /* We don't allow inserting a range that overlaps with existing ones */
+    if (g_tree_lookup(gtree, &range)) {
+        return IT_ERR_OVERLAP;
+    }
+
+    /* Merge left adjacent range */
+    overlap = it_tree_find_value(tree, start - 1);
+    if (overlap) {
+        range.start = overlap->start;
+        g_tree_remove(gtree, overlap);
+    }
+
+    /* Merge right adjacent range */
+    overlap = it_tree_find_value(tree, end + 1);
+    if (overlap) {
+        range.end = overlap->end;
+        g_tree_remove(gtree, overlap);
+    }
+
+    new = g_new0(ITRange, 1);
+    new->start = range.start;
+    new->end = range.end;
+    it_tree_insert_internal(gtree, new);
+
+    return IT_OK;
+}
+
+static gboolean it_tree_traverse(gpointer key, gpointer value,
+                                gpointer data)
+{
+    it_tree_iterator iterator = data;
+    ITRange *range = key;
+
+    g_assert(key == value);
+
+    return iterator(range->start, range->end);
+}
+
+void it_tree_foreach(ITTree *tree, it_tree_iterator iterator)
+{
+    g_assert(tree && iterator);
+    g_tree_foreach(tree->tree, it_tree_traverse, iterator);
+}
+
+/* Remove subset `range', which is part of `overlap'. */
+static void it_tree_remove_subset(GTree *gtree, const ITRange *overlap,
+                                  const ITRange *range)
+{
+    ITRange *range1, *range2;
+
+    if (overlap->start < range->start) {
+        range1 = g_new0(ITRange, 1);
+        range1->start = overlap->start;
+        range1->end = range->start - 1;
+    } else {
+        range1 = NULL;
+    }
+    if (range->end < overlap->end) {
+        range2 = g_new0(ITRange, 1);
+        range2->start = range->end + 1;
+        range2->end = overlap->end;
+    } else {
+        range2 = NULL;
+    }
+
+    g_tree_remove(gtree, overlap);
+
+    if (range1) {
+        it_tree_insert_internal(gtree, range1);
+    }
+    if (range2) {
+        it_tree_insert_internal(gtree, range2);
+    }
+}
+
+int it_tree_remove(ITTree *tree, ITValue start, ITValue end)
+{
+    ITRange range = { .start = start, .end = end }, *overlap, and;
+    GTree *gtree;
+
+    g_assert(tree);
+
+    gtree = tree->tree;
+    while ((overlap = g_tree_lookup(gtree, &range))) {
+        if (it_range_cover(overlap, &range)) {
+            /* Split existing range into two if needed; done */
+            it_tree_remove_subset(gtree, overlap, &range);
+            break;
+        } else {
+            /* Remove intersection and continue */
+            it_range_and(&and, overlap, &range);
+            g_assert(and.start <= and.end);
+            it_tree_remove_subset(gtree, overlap, &and);
+        }
+    }
+
+    return IT_OK;
+}
+
+void it_tree_destroy(ITTree *tree)
+{
+    g_tree_destroy(tree->tree);
+    g_free(tree);
+}
diff --git a/util/Makefile.objs b/util/Makefile.objs
index 728c3541db..4ac33910ed 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -47,4 +47,5 @@ util-obj-y += qht.o
 util-obj-y += range.o
 util-obj-y += stats64.o
 util-obj-y += systemd.o
+util-obj-y += interval-tree.o
 util-obj-$(CONFIG_LINUX) += vfio-helpers.o
-- 
2.17.0


* [Qemu-devel] [PATCH v2 08/10] intel-iommu: maintain per-device iova ranges
From: Peter Xu @ 2018-05-04  3:08 UTC
  To: qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	peterx, Jason Wang

For each VTDAddressSpace, we now maintain which IOVA ranges we have
mapped and which we have not.  With that information, we only send MAP
or UNMAP notifications when necessary: we don't send MAP notifications
if we know the range is already mapped, and we don't send UNMAP
notifications if we know the range was never mapped at all.
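
The filtering rule boils down to one tree lookup per notification; here
is a hypothetical standalone demo (not QEMU code) of the decisions that
vtd_page_walk_one() makes in the diff below:

    #include <stdio.h>
    #include "qemu/interval-tree.h"

    int main(void)
    {
        ITTree *iova_tree = it_tree_new();

        /* MAP of an untracked range: notify, then start tracking */
        if (!it_tree_find(iova_tree, 0x1000, 0x1fff)) {
            it_tree_insert(iova_tree, 0x1000, 0x1fff);
            printf("MAP notified\n");
        }

        /* MAP of the same range again: already tracked, skip */
        if (it_tree_find(iova_tree, 0x1000, 0x1fff)) {
            printf("duplicated MAP skipped\n");
        }

        /* UNMAP of a range we never mapped: nothing to do, skip */
        if (!it_tree_find(iova_tree, 0x8000, 0x8fff)) {
            printf("UNMAP of never-mapped range skipped\n");
        }

        it_tree_destroy(iova_tree);
        return 0;
    }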

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/hw/i386/intel_iommu.h |  2 +
 hw/i386/intel_iommu.c         | 73 +++++++++++++++++++++++++++++++++++
 hw/i386/trace-events          |  2 +
 3 files changed, 77 insertions(+)

diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 9e0a6c1c6a..5e54f4fae2 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -27,6 +27,7 @@
 #include "hw/i386/ioapic.h"
 #include "hw/pci/msi.h"
 #include "hw/sysbus.h"
+#include "qemu/interval-tree.h"
 
 #define TYPE_INTEL_IOMMU_DEVICE "intel-iommu"
 #define INTEL_IOMMU_DEVICE(obj) \
@@ -95,6 +96,7 @@ struct VTDAddressSpace {
     QLIST_ENTRY(VTDAddressSpace) next;
     /* Superset of notifier flags that this address space has */
     IOMMUNotifierFlag notifier_flags;
+    ITTree *iova_tree;          /* Traces mapped IOVA ranges */
 };
 
 struct VTDBus {
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 83769f2b8c..16b3514afb 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -765,12 +765,82 @@ typedef struct {
 static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
                              vtd_page_walk_info *info)
 {
+    VTDAddressSpace *as = info->as;
     vtd_page_walk_hook hook_fn = info->hook_fn;
     void *private = info->private;
+    ITRange *mapped = it_tree_find(as->iova_tree, entry->iova,
+                                   entry->iova + entry->addr_mask);
 
     assert(hook_fn);
+
+    /* Update local IOVA mapped ranges */
+    if (entry->perm) {
+        if (mapped) {
+            /*
+             * Skip since we have already mapped this range.
+             *
+             * NOTE: here we didn't solve the modify-PTE problem. For
+             * example:
+             *
+             * (1) map iova1 to 4K page P1
+             * (2) send PSI on (iova1, iova1 + 4k)
+             * (3) modify iova1 to 4K page P2
+             * (4) send PSI on (iova1, iova1 + 4k)
+             *
+             * Logically, QEMU as the emulator should atomically
+             * modify the shadow page table PTE to follow what has
+             * been changed.  However, we actually have no way to
+             * emulate this "atomic PTE modification" procedure:
+             * even if we know that a PTE has changed, we still need
+             * to unmap it first and then remap it (see the
+             * VFIO_IOMMU_MAP_DMA and VFIO_IOMMU_UNMAP_DMA APIs as an
+             * example); there is no host API that atomically updates
+             * an IOMMU PTE.  So we'll have a small window during
+             * which we still have an invalid page mapping (while
+             * ideally we shouldn't).
+             *
+             * With our current code, at step (4) we just ignore the
+             * PTE update and skip the map update, which leaves a
+             * stale mapping.
+             *
+             * Modifying PTEs won't happen on general OSes (e.g.,
+             * Linux) - it should be prohibited, since the device may
+             * be using the page table during the modification, in
+             * which case the behavior of the modification is
+             * undefined.  However other guests may still do it.
+             *
+             * Let's mark this as a TODO.  After all, it should never
+             * happen on general OSes like Linux/Windows/...  And even
+             * if the guest is running a private and problematic OS,
+             * the stale page (P1) will definitely still be a valid
+             * page for the current L1 QEMU, so the worst case is that
+             * the L1 guest puts itself at risk on that single device
+             * only.  It can never harm the rest of the L1 guest, the
+             * host memory, or other guests on the host.
+             *
+             * Let's solve this once and for all when it is really
+             * needed and when we are capable of it (for now, we are
+             * not).  Though I suspect that won't happen soon.
+             */
+            trace_vtd_page_walk_one_skip_map(entry->iova, entry->addr_mask,
+                                             mapped->start, mapped->end);
+            return 0;
+        }
+        it_tree_insert(as->iova_tree, entry->iova,
+                       entry->iova + entry->addr_mask);
+    } else {
+        if (!mapped) {
+            /* Skip since we didn't map this range at all */
+            trace_vtd_page_walk_one_skip_unmap(entry->iova, entry->addr_mask);
+            return 0;
+        }
+        it_tree_remove(as->iova_tree, entry->iova,
+                       entry->iova + entry->addr_mask);
+    }
+
     trace_vtd_page_walk_one(level, entry->iova, entry->translated_addr,
                             entry->addr_mask, entry->perm);
+
     return hook_fn(entry, private);
 }
 
@@ -2804,6 +2874,7 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
         vtd_dev_as->devfn = (uint8_t)devfn;
         vtd_dev_as->iommu_state = s;
         vtd_dev_as->context_cache_entry.context_cache_gen = 0;
+        vtd_dev_as->iova_tree = it_tree_new();
 
         /*
          * Memory region relationships looks like (Address range shows
@@ -2900,6 +2971,8 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
                              VTD_PCI_FUNC(as->devfn),
                              entry.iova, size);
 
+    it_tree_remove(as->iova_tree, entry.iova, entry.iova + entry.addr_mask);
+
     memory_region_notify_one(n, &entry);
 }
 
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 22d44648af..677f83420d 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -40,6 +40,8 @@ vtd_replay_ce_valid(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t domain, uint6
 vtd_replay_ce_invalid(uint8_t bus, uint8_t dev, uint8_t fn) "replay invalid context device %02"PRIx8":%02"PRIx8".%02"PRIx8
 vtd_page_walk_level(uint64_t addr, uint32_t level, uint64_t start, uint64_t end) "walk (base=0x%"PRIx64", level=%"PRIu32") iova range 0x%"PRIx64" - 0x%"PRIx64
 vtd_page_walk_one(uint32_t level, uint64_t iova, uint64_t gpa, uint64_t mask, int perm) "detected page level 0x%"PRIx32" iova 0x%"PRIx64" -> gpa 0x%"PRIx64" mask 0x%"PRIx64" perm %d"
+vtd_page_walk_one_skip_map(uint64_t iova, uint64_t mask, uint64_t start, uint64_t end) "iova 0x%"PRIx64" mask 0x%"PRIx64" start 0x%"PRIx64" end 0x%"PRIx64
+vtd_page_walk_one_skip_unmap(uint64_t iova, uint64_t mask) "iova 0x%"PRIx64" mask 0x%"PRIx64
 vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
 vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
 vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
-- 
2.17.0
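
As an illustration of why the "atomic PTE modification" mentioned in
the comment above cannot be emulated: the only primitive VFIO offers
is an unmap followed by a map, with an unavoidable window in between.
Below is a minimal user-space sketch (container fd setup and error
details omitted; the helper name is made up for illustration):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /*
     * Change an existing IOVA mapping to point at new_vaddr.  There
     * is no "replace" ioctl, so device DMA hitting this IOVA between
     * the two calls below has no valid translation - that is the
     * window described in the comment.
     */
    static int vfio_remap(int container, uint64_t iova, uint64_t size,
                          uint64_t new_vaddr)
    {
        struct vfio_iommu_type1_dma_unmap unmap = {
            .argsz = sizeof(unmap),
            .iova  = iova,
            .size  = size,
        };
        struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
            .vaddr = new_vaddr,
            .iova  = iova,
            .size  = size,
        };

        if (ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
            return -1;
        }
        /* <-- the IOVA is unmapped here; DMA to it would fault */
        return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
    }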

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Qemu-devel] [PATCH v2 09/10] intel-iommu: don't unmap all for shadow page table
  2018-05-04  3:08 [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes Peter Xu
                   ` (7 preceding siblings ...)
  2018-05-04  3:08 ` [Qemu-devel] [PATCH v2 08/10] intel-iommu: maintain per-device iova ranges Peter Xu
@ 2018-05-04  3:08 ` Peter Xu
  2018-05-17 17:23   ` Auger Eric
  2018-05-04  3:08 ` [Qemu-devel] [PATCH v2 10/10] intel-iommu: remove notify_unmap for page walk Peter Xu
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Peter Xu @ 2018-05-04  3:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	peterx, Jason Wang

IOMMU replay was previously carried out in many use cases, e.g.,
context cache invalidations and domain flushes.  We used this
mechanism to sync the shadow page table by first (1) unmapping the
whole address space, then (2) walking the page table to remap what's
in the table.

This is very dangerous.

The problem is that there is a very small window (in my measurement,
it can be about 3ms) between steps (1) and (2) above during which the
device sees no (or an incomplete) page table.  However the device
never knows that.  This can cause DMA errors on devices that assume
the page table is always there.

So the point is that, for MAP typed notifiers (vfio-pci, for example),
the mapped page entries must always be there.  We can never unmap
existing page entries the way we did in step (1) above.

The only solution is to remove step (1).  We couldn't do that before,
since we didn't know which device pages were mapped and which were
not, so we unmapped them all.  Now, with the new IOVA tree, QEMU knows
what has been mapped and what has not.  We don't need step (1) any
more.  Remove it.

Note that after removing that global unmap, we now need to notify
unmaps during page walks.

This should fix the DMA error problem that Jintack Lim reported with
nested device assignment.  The problem doesn't always happen, e.g., I
cannot reproduce the error myself.  However the collected logs show
that this is the likely cause of Jintack's problem.

Reported-by: Jintack Lim <jintack@cs.columbia.edu>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 16b3514afb..9439103cac 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3003,10 +3003,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
 
     /*
      * The replay can be triggered by either a invalidation or a newly
-     * created entry. No matter what, we release existing mappings
-     * (it means flushing caches for UNMAP-only registers).
+     * created entry.
      */
-    vtd_address_space_unmap(vtd_as, n);
 
     if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
         trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
@@ -3015,8 +3013,10 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
                                   ce.hi, ce.lo);
         if (vtd_as_notify_mappings(vtd_as)) {
             /* This is required only for MAP typed notifiers */
-            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
+            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, true,
                           vtd_as);
+        } else {
+            vtd_address_space_unmap(vtd_as, n);
         }
     } else {
         trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Qemu-devel] [PATCH v2 10/10] intel-iommu: remove notify_unmap for page walk
  2018-05-04  3:08 [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes Peter Xu
                   ` (8 preceding siblings ...)
  2018-05-04  3:08 ` [Qemu-devel] [PATCH v2 09/10] intel-iommu: don't unmap all for shadow page table Peter Xu
@ 2018-05-04  3:08 ` Peter Xu
  2018-05-04  3:20 ` [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes no-reply
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 40+ messages in thread
From: Peter Xu @ 2018-05-04  3:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	peterx, Jason Wang

After the previous changes, we will always pass that parameter as
true.  Remove it.  With that variable gone, we can greatly simplify
the page walking logic.

No functional change at all.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 70 +++++++++++++++++--------------------------
 hw/i386/trace-events  |  1 -
 2 files changed, 28 insertions(+), 43 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 9439103cac..c156235135 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -759,7 +759,6 @@ typedef struct {
     VTDAddressSpace *as;
     vtd_page_walk_hook hook_fn;
     void *private;
-    bool notify_unmap;
 } vtd_page_walk_info;
 
 static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
@@ -900,45 +899,34 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
          */
         entry_valid = read_cur | write_cur;
 
-        entry.target_as = &address_space_memory;
-        entry.iova = iova & subpage_mask;
-        entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
-        entry.addr_mask = ~subpage_mask;
-
-        if (vtd_is_last_slpte(slpte, level)) {
-            /* NOTE: this is only meaningful if entry_valid == true */
-            entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
-            if (!entry_valid && !info->notify_unmap) {
-                trace_vtd_page_walk_skip_perm(iova, iova_next);
-                goto next;
-            }
-            ret = vtd_page_walk_one(&entry, level, info);
-            if (ret < 0) {
-                return ret;
-            }
-        } else {
-            if (!entry_valid) {
-                if (info->notify_unmap) {
-                    /*
-                     * The whole entry is invalid; unmap it all.
-                     * Translated address is meaningless, zero it.
-                     */
-                    entry.translated_addr = 0x0;
-                    ret = vtd_page_walk_one(&entry, level, info);
-                    if (ret < 0) {
-                        return ret;
-                    }
-                } else {
-                    trace_vtd_page_walk_skip_perm(iova, iova_next);
-                }
-                goto next;
-            }
+        if (!vtd_is_last_slpte(slpte, level) && entry_valid) {
+            /*
+             * This is a valid PDE (or even bigger than PDE).  We need
+             * to walk one further level.
+             */
             ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, aw),
                                       iova, MIN(iova_next, end), level - 1,
                                       read_cur, write_cur, info);
-            if (ret < 0) {
-                return ret;
-            }
+        } else {
+            /*
+             * This means we are in one of two cases:
+             *
+             * (1) the real page entry (either 4K page, or huge page)
+             * (2) the whole range is invalid
+             *
+             * In either case, we send an IOTLB notification down.
+             */
+            entry.target_as = &address_space_memory;
+            entry.iova = iova & subpage_mask;
+            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
+            entry.addr_mask = ~subpage_mask;
+            /* NOTE: this is only meaningful if entry_valid == true */
+            entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
+            ret = vtd_page_walk_one(&entry, level, info);
+        }
+
+        if (ret < 0) {
+            return ret;
         }
 
 next:
@@ -960,7 +948,7 @@ next:
  */
 static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
                          vtd_page_walk_hook hook_fn, void *private,
-                         bool notify_unmap, VTDAddressSpace *as)
+                         VTDAddressSpace *as)
 {
     dma_addr_t addr = vtd_ce_get_slpt_base(ce);
     uint32_t level = vtd_ce_get_level(ce);
@@ -968,7 +956,6 @@ static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
     vtd_page_walk_info info = {
         .hook_fn = hook_fn,
         .private = private,
-        .notify_unmap = notify_unmap,
         .as = as,
     };
 
@@ -1542,7 +1529,7 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
                  */
                 vtd_page_walk(&ce, addr, addr + size,
                               vtd_page_invalidate_notify_hook,
-                              (void *)&vtd_as->iommu, true, vtd_as);
+                              (void *)&vtd_as->iommu, vtd_as);
             } else {
                 /*
                  * For UNMAP-only notifiers, we don't need to walk the
@@ -3013,8 +3000,7 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
                                   ce.hi, ce.lo);
         if (vtd_as_notify_mappings(vtd_as)) {
             /* This is required only for MAP typed notifiers */
-            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, true,
-                          vtd_as);
+            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, vtd_as);
         } else {
             vtd_address_space_unmap(vtd_as, n);
         }
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 677f83420d..eb23ace5fb 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -43,7 +43,6 @@ vtd_page_walk_one(uint32_t level, uint64_t iova, uint64_t gpa, uint64_t mask, in
 vtd_page_walk_one_skip_map(uint64_t iova, uint64_t mask, uint64_t start, uint64_t end) "iova 0x%"PRIx64" mask 0x%"PRIx64" start 0x%"PRIx64" end 0x%"PRIx64
 vtd_page_walk_one_skip_unmap(uint64_t iova, uint64_t mask) "iova 0x%"PRIx64" mask 0x%"PRIx64
 vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
-vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
 vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
 vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
 vtd_as_unmap_whole(uint8_t bus, uint8_t slot, uint8_t fn, uint64_t iova, uint64_t size) "Device %02x:%02x.%x start 0x%"PRIx64" size 0x%"PRIx64
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes
  2018-05-04  3:08 [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes Peter Xu
                   ` (9 preceding siblings ...)
  2018-05-04  3:08 ` [Qemu-devel] [PATCH v2 10/10] intel-iommu: remove notify_unmap for page walk Peter Xu
@ 2018-05-04  3:20 ` no-reply
  2018-05-04  3:40   ` Peter Xu
  2018-05-08  7:29 ` [Qemu-devel] [PATCH v2 11/10] tests: add interval tree unit test Peter Xu
  2018-05-16  6:30 ` [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes Peter Xu
  12 siblings, 1 reply; 40+ messages in thread
From: no-reply @ 2018-05-04  3:20 UTC (permalink / raw)
  To: peterx
  Cc: famz, qemu-devel, kevin.tian, mst, jasowang, alex.williamson, jintack

Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20180504030811.28111-1-peterx@redhat.com
Subject: [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
    echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
    if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
        failed=1
        echo
    fi
    n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 t [tag update]            patchew/20180503213458.2749.28566.stgit@gimli.home -> patchew/20180503213458.2749.28566.stgit@gimli.home
 * [new tag]               patchew/20180504030811.28111-1-peterx@redhat.com -> patchew/20180504030811.28111-1-peterx@redhat.com
Switched to a new branch 'test'
55b6c6b1f5 intel-iommu: remove notify_unmap for page walk
72ffe9314a intel-iommu: don't unmap all for shadow page table
7bb663682b intel-iommu: maintain per-device iova ranges
e77921e63f util: implement simple interval tree logic
356489cd59 intel-iommu: pass in address space when page walk
27fcd3062f intel-iommu: introduce vtd_page_walk_info
953a6db666 intel-iommu: only do page walk for MAP notifiers
0ccfedc1e7 intel-iommu: add iommu lock
47a29fd165 intel-iommu: remove IntelIOMMUNotifierNode
aacd552d3a intel-iommu: send PSI always even if across PDEs

=== OUTPUT BEGIN ===
Checking PATCH 1/10: intel-iommu: send PSI always even if across PDEs...
Checking PATCH 2/10: intel-iommu: remove IntelIOMMUNotifierNode...
Checking PATCH 3/10: intel-iommu: add iommu lock...
Checking PATCH 4/10: intel-iommu: only do page walk for MAP notifiers...
Checking PATCH 5/10: intel-iommu: introduce vtd_page_walk_info...
Checking PATCH 6/10: intel-iommu: pass in address space when page walk...
Checking PATCH 7/10: util: implement simple interval tree logic...
ERROR: space prohibited between function name and open parenthesis '('
#56: FILE: include/qemu/interval-tree.h:33:
+typedef gboolean (*it_tree_iterator)(ITValue start, ITValue end);

total: 1 errors, 0 warnings, 343 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 8/10: intel-iommu: maintain per-device iova ranges...
Checking PATCH 9/10: intel-iommu: don't unmap all for shadow page table...
Checking PATCH 10/10: intel-iommu: remove notify_unmap for page walk...
=== OUTPUT END ===

Test command exited with code: 1


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes
  2018-05-04  3:20 ` [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes no-reply
@ 2018-05-04  3:40   ` Peter Xu
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Xu @ 2018-05-04  3:40 UTC (permalink / raw)
  To: qemu-devel; +Cc: famz, kevin.tian, mst, jasowang, alex.williamson, jintack

On Thu, May 03, 2018 at 08:20:33PM -0700, no-reply@patchew.org wrote:

[...]

> === OUTPUT BEGIN ===
> Checking PATCH 1/10: intel-iommu: send PSI always even if across PDEs...
> Checking PATCH 2/10: intel-iommu: remove IntelIOMMUNotifierNode...
> Checking PATCH 3/10: intel-iommu: add iommu lock...
> Checking PATCH 4/10: intel-iommu: only do page walk for MAP notifiers...
> Checking PATCH 5/10: intel-iommu: introduce vtd_page_walk_info...
> Checking PATCH 6/10: intel-iommu: pass in address space when page walk...
> Checking PATCH 7/10: util: implement simple interval tree logic...
> ERROR: space prohibited between function name and open parenthesis '('
> #56: FILE: include/qemu/interval-tree.h:33:
> +typedef gboolean (*it_tree_iterator)(ITValue start, ITValue end);
> 
> total: 1 errors, 0 warnings, 343 lines checked
> 
> Your patch has style problems, please review.  If any of these errors
> are false positives report them to the maintainer, see
> CHECKPATCH in MAINTAINERS.

This is a false positive, and should be resolved once the patch below
is merged (now in Paolo's queue):

  [PATCH v2] checkpatch.pl: add common glib defines to typelist

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Qemu-devel] [PATCH v2 11/10] tests: add interval tree unit test
  2018-05-04  3:08 [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes Peter Xu
                   ` (10 preceding siblings ...)
  2018-05-04  3:20 ` [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes no-reply
@ 2018-05-08  7:29 ` Peter Xu
  2018-05-16  6:30 ` [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes Peter Xu
  12 siblings, 0 replies; 40+ messages in thread
From: Peter Xu @ 2018-05-08  7:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Tian Kevin, Michael S . Tsirkin, Alex Williamson,
	Jintack Lim, Jason Wang

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tests/test-interval-tree.c | 190 +++++++++++++++++++++++++++++++++++++
 tests/Makefile.include     |   2 +
 2 files changed, 192 insertions(+)
 create mode 100644 tests/test-interval-tree.c

diff --git a/tests/test-interval-tree.c b/tests/test-interval-tree.c
new file mode 100644
index 0000000000..a0e3decca0
--- /dev/null
+++ b/tests/test-interval-tree.c
@@ -0,0 +1,190 @@
+/*
+ * Interval tree tests
+ *
+ * Copyright Red Hat, Inc. 2018
+ *
+ * Authors:
+ *   Peter Xu <peterx@redhat.com>,
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/interval-tree.h"
+
+static ITRange ranges[2];
+static int range_i;
+
+static void ranges_reset(void)
+{
+    memset(&ranges, 0, sizeof(ranges));
+    range_i = 0;
+}
+
+static gboolean ranges_iterate(ITValue start, ITValue end)
+{
+    g_assert(range_i < ARRAY_SIZE(ranges));
+    ranges[range_i].start = start;
+    ranges[range_i].end = end;
+    range_i++;
+    return FALSE;
+}
+
+static void ranges_check(void)
+{
+    g_assert(range_i == 2);
+    g_assert(ranges[0].start == 10 && ranges[0].end == 19);
+    g_assert(ranges[1].start == 30 && ranges[1].end == 39);
+}
+
+static void test_interval_tree_common(void)
+{
+    int ret;
+    ITTree *tree = it_tree_new();
+    ITRange *range;
+
+    g_assert(tree);
+
+    /* Test insertion */
+    ret = it_tree_insert(tree, 10, 19);
+    g_assert(ret == 0);
+    ret = it_tree_insert(tree, 30, 39);
+    g_assert(ret == 0);
+    ret = it_tree_insert(tree, 15, 19);
+    g_assert(ret == IT_ERR_OVERLAP);
+    ret = it_tree_insert(tree, 0, 99);
+    g_assert(ret == IT_ERR_OVERLAP);
+
+    /* Test searching */
+    range = it_tree_find(tree, 0, 9);
+    g_assert(range == NULL);
+    range = it_tree_find(tree, 10, 19);
+    g_assert(range->start == 10 && range->end == 19);
+    range = it_tree_find_value(tree, 15);
+    g_assert(range->start == 10 && range->end == 19);
+    range = it_tree_find(tree, 15, 99);
+    g_assert(range->start == 10 && range->end == 19);
+    range = it_tree_find_value(tree, 35);
+    g_assert(range->start == 30 && range->end == 39);
+
+    /* Test iterations */
+    ranges_reset();
+    it_tree_foreach(tree, ranges_iterate);
+    ranges_check();
+
+    /* Remove one of them */
+    ret = it_tree_remove(tree, 10, 19);
+    g_assert(ret == 0);
+    g_assert(!it_tree_find(tree, 10, 19));
+    g_assert(it_tree_find(tree, 30, 39));
+
+    it_tree_destroy(tree);
+}
+
+static void test_interval_tree_merging(void)
+{
+    int ret;
+    ITTree *tree = it_tree_new();
+    ITRange *range;
+
+    g_assert(tree);
+
+    ret = it_tree_insert(tree, 10, 19);
+    g_assert(ret == 0);
+    ret = it_tree_insert(tree, 30, 39);
+    g_assert(ret == 0);
+
+    /* Test left side merging */
+    ret = it_tree_insert(tree, 40, 59);
+    g_assert(ret == 0);
+    range = it_tree_find(tree, 30, 39);
+    g_assert(range->start == 30 && range->end == 59);
+
+    /* Test right side merging */
+    ret = it_tree_insert(tree, 0, 9);
+    g_assert(ret == 0);
+    range = it_tree_find(tree, 10, 19);
+    g_assert(range->start == 0 && range->end == 19);
+
+    /* Test bidirectional merging */
+    ret = it_tree_insert(tree, 20, 29);
+    g_assert(ret == 0);
+    range = it_tree_find(tree, 20, 29);
+    g_assert(range->start == 0 && range->end == 59);
+    range = it_tree_find(tree, 0, 29);
+    g_assert(range->start == 0 && range->end == 59);
+    range = it_tree_find(tree, 40, 45);
+    g_assert(range->start == 0 && range->end == 59);
+
+    it_tree_destroy(tree);
+}
+
+static void test_interval_tree_removal(void)
+{
+    int ret;
+    ITTree *tree = it_tree_new();
+    ITRange *range;
+
+    g_assert(tree);
+
+    ret = it_tree_insert(tree, 10, 19);
+    g_assert(ret == 0);
+    ret = it_tree_insert(tree, 30, 39);
+    g_assert(ret == 0);
+
+    /*
+     * Remove some useless areas, which should not remove any existing
+     * ranges in the tree
+     */
+    ret = it_tree_remove(tree, 0, 9);
+    g_assert(ret == 0);
+    ret = it_tree_remove(tree, 50, 99);
+    g_assert(ret == 0);
+    ret = it_tree_remove(tree, 20, 29);
+    g_assert(ret == 0);
+    /* Make sure the elements are not removed */
+    g_assert(it_tree_find(tree, 10, 19));
+    g_assert(it_tree_find(tree, 30, 39));
+
+    /* Remove left subset of a range */
+    ret = it_tree_remove(tree, 0, 14);
+    g_assert(ret == 0);
+    range = it_tree_find(tree, 10, 19);
+    g_assert(range->start == 15 && range->end == 19);
+    it_tree_insert(tree, 10, 15);
+
+    /* Remove right subset of a range */
+    ret = it_tree_remove(tree, 35, 45);
+    g_assert(ret == 0);
+    range = it_tree_find(tree, 30, 39);
+    g_assert(range->start == 30 && range->end == 34);
+    it_tree_insert(tree, 35, 39);
+
+    /* Remove covers more than one range */
+    ret = it_tree_remove(tree, 0, 40);
+    g_assert(ret == 0);
+    g_assert(!it_tree_find(tree, 10, 19));
+    g_assert(!it_tree_find(tree, 30, 39));
+    it_tree_insert(tree, 10, 19);
+    it_tree_insert(tree, 30, 39);
+
+    /* Remove in the middle */
+    ret = it_tree_remove(tree, 12, 16);
+    g_assert(ret == 0);
+    range = it_tree_find_value(tree, 10);
+    g_assert(range->start == 10 && range->end == 11);
+    range = it_tree_find_value(tree, 17);
+    g_assert(range->start == 17 && range->end == 19);
+
+    it_tree_destroy(tree);
+}
+
+int main(int argc, char *argv[])
+{
+    g_test_init(&argc, &argv, NULL);
+    g_test_add_func("/interval-tree/common", test_interval_tree_common);
+    g_test_add_func("/interval-tree/merging", test_interval_tree_merging);
+    g_test_add_func("/interval-tree/removal", test_interval_tree_removal);
+    return g_test_run();
+}
diff --git a/tests/Makefile.include b/tests/Makefile.include
index 3b9a5e31a2..db1e6c93db 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -169,6 +169,7 @@ check-unit-y += tests/ptimer-test$(EXESUF)
 gcov-files-ptimer-test-y = hw/core/ptimer.c
 check-unit-y += tests/test-qapi-util$(EXESUF)
 gcov-files-test-qapi-util-y = qapi/qapi-util.c
+check-unit-y += tests/test-interval-tree$(EXESUF)
 
 check-block-$(CONFIG_POSIX) += tests/qemu-iotests-quick.sh
 
@@ -642,6 +643,7 @@ tests/test-qht-par$(EXESUF): tests/test-qht-par.o tests/qht-bench$(EXESUF) $(tes
 tests/qht-bench$(EXESUF): tests/qht-bench.o $(test-util-obj-y)
 tests/test-bufferiszero$(EXESUF): tests/test-bufferiszero.o $(test-util-obj-y)
 tests/atomic_add-bench$(EXESUF): tests/atomic_add-bench.o $(test-util-obj-y)
+tests/test-interval-tree$(EXESUF): tests/test-interval-tree.o $(test-util-obj-y)
 
 tests/test-qdev-global-props$(EXESUF): tests/test-qdev-global-props.o \
 	hw/core/qdev.o hw/core/qdev-properties.o hw/core/hotplug.o\
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes
  2018-05-04  3:08 [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes Peter Xu
                   ` (11 preceding siblings ...)
  2018-05-08  7:29 ` [Qemu-devel] [PATCH v2 11/10] tests: add interval tree unit test Peter Xu
@ 2018-05-16  6:30 ` Peter Xu
  2018-05-16 13:57   ` Jason Wang
  12 siblings, 1 reply; 40+ messages in thread
From: Peter Xu @ 2018-05-16  6:30 UTC (permalink / raw)
  To: qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	Jason Wang

On Fri, May 04, 2018 at 11:08:01AM +0800, Peter Xu wrote:
> v2:
> - fix patchew code style warnings
> - interval tree: postpone malloc when inserting; simplify node remove
>   a bit where proper [Jason]
> - fix up comment and commit message for iommu lock patch [Kevin]
> - protect context cache too using the iommu lock [Kevin, Jason]
> - add vast comment in patch 8 to explain the modify-PTE problem
>   [Jason, Kevin]

Let's hold off a bit on reviewing this series.  Jintack reported a scp
DMAR issue that might happen even with an L1 guest with this series:
the scp can stall randomly after copying tens or hundreds of MBs.  I'm
still investigating the problem.  It should be related to the deferred
flushing of the VT-d kernel driver, since the problem goes away if we
use "intel_iommu=on,strict".  However I'm still trying to figure out
what's going on behind the scenes with that deferred flushing feature.

Meanwhile, during the investigation I found another "possibly valid"
use case for the modify-PTE problem that Jason has mentioned, when
deferred flushing is in use:

         vcpu1                        vcpu2
      map page A
  explicitly send PSI for A
   queue unmap page A [1]
                                 map the same page A [2]
                                explcitly send PSI for A [3]
   flush unmap page A [4]

Due to deferred flushing, the UNMAP PSI might be postponed (or even
end up as a DSI) from step [1] to step [4].  If we allocate the same
page somewhere else, we might trigger the modify-PTE case at [2],
since we haven't yet received the deferred PSI that unmaps A from
vcpu1.

Note that this will not happen with the latest upstream Linux, since
the IOVA allocation algorithm in the current kernel makes sure that
the IOVA range won't be freed until [4], so we can't really allocate
the same page address at [2].  However this makes me tend to agree
with Jason's and Kevin's worry about future potential issues if this
can be triggered easily by common guest kernel bugs.  So now I'm
considering dropping my mergeable interval tree and just using a
simpler tree that caches everything, including translated addresses.
The metadata may take about 2% of managed memory with that approach.
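
For concreteness, a rough sketch of what a node in such a simpler
tree could look like (the names are hypothetical, not code from this
series):

    /*
     * Per-page cache node.  Caching the translated address is what
     * would let a PSI that carries a different translation for an
     * already-mapped IOVA be detected and handled as unmap + remap
     * instead of being silently skipped.
     */
    typedef struct VTDIOVACacheNode {
        uint64_t iova;        /* IOVA of the cached page */
        uint64_t translated;  /* cached translated address */
        uint64_t mask;        /* page mask (4K/2M/1G) */
    } VTDIOVACacheNode;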

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes
  2018-05-16  6:30 ` [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes Peter Xu
@ 2018-05-16 13:57   ` Jason Wang
  2018-05-17  2:45     ` Peter Xu
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Wang @ 2018-05-16 13:57 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim



On 2018年05月16日 14:30, Peter Xu wrote:
> On Fri, May 04, 2018 at 11:08:01AM +0800, Peter Xu wrote:
>> v2:
>> - fix patchew code style warnings
>> - interval tree: postpone malloc when inserting; simplify node remove
>>    a bit where proper [Jason]
>> - fix up comment and commit message for iommu lock patch [Kevin]
>> - protect context cache too using the iommu lock [Kevin, Jason]
>> - add vast comment in patch 8 to explain the modify-PTE problem
>>    [Jason, Kevin]
> We can hold a bit on reviewing this series.  Jintack reported a scp
> DMAR issue that might happen even with L1 guest with this series, and
> the scp can stall after copied tens or hundreds of MBs randomly.  I'm
> still investigating the problem.  This problem should be related to
> deferred flushing of VT-d kernel driver, since the problem will go
> away if we use "intel_iommu=on,strict".  However I'm still trying to
> figure out what's the thing behind the scene even with that deferred
> flushing feature.

I vaguely remember that recent upstream vfio supports delayed flush;
maybe it's related.

>
> Meanwhile, during the investigation I found another "possibly valid"
> use case about the modify-PTE problem that Jason has mentioned when
> with deferred flushing:
>
>           vcpu1                        vcpu2
>        map page A
>    explicitly send PSI for A
>     queue unmap page A [1]
>                                   map the same page A [2]
>                                  explcitly send PSI for A [3]
>     flush unmap page A [4]
>
> Due to deferred flushing, the UNMAP PSI might be postponed (or it can
> be finally a DSI) from step [1] to step [4].  If we allocate the same
> page somewhere else, we might trigger this modify-PTE at [2] since we
> haven't yet received the deferred PSI to unmap A from vcpu1.
>
> Note that this will not happen with latest upstream Linux, since the
> IOVA allocation algorithm in current Linux kernel made sure that the
> IOVA range won't be freed until [4], so we can't really allocate the
> same page address at [2].

Yes, so the vfio + vIOMMU work will probably uncover more bugs in the
IOMMU driver (especially CM mode). I suspect CM mode does not have
sufficient test coverage (since it probably wasn't used in any
production environment before the vfio + vIOMMU work).

>    However this let me tend to agree with
> Jason and Kevin's worry on future potential issues if that can be
> triggered easily by common guest kernel bugs.  So now I'm considering
> to drop my mergable interval tree but just use a simpler tree to cache
> everything including translated addresses.  The metadata will possibly
> take 2% of managed memory if with that.
>

Good to know.  We can start with the approach we know is correct for
sure, then add optimizations on top.

Thanks

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes
  2018-05-16 13:57   ` Jason Wang
@ 2018-05-17  2:45     ` Peter Xu
  2018-05-17  3:39       ` Alex Williamson
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Xu @ 2018-05-17  2:45 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, Tian Kevin, Michael S . Tsirkin, Alex Williamson,
	Jintack Lim

On Wed, May 16, 2018 at 09:57:40PM +0800, Jason Wang wrote:
> 
> 
> On 2018年05月16日 14:30, Peter Xu wrote:
> > On Fri, May 04, 2018 at 11:08:01AM +0800, Peter Xu wrote:
> > > v2:
> > > - fix patchew code style warnings
> > > - interval tree: postpone malloc when inserting; simplify node remove
> > >    a bit where proper [Jason]
> > > - fix up comment and commit message for iommu lock patch [Kevin]
> > > - protect context cache too using the iommu lock [Kevin, Jason]
> > > - add vast comment in patch 8 to explain the modify-PTE problem
> > >    [Jason, Kevin]
> > We can hold a bit on reviewing this series.  Jintack reported a scp
> > DMAR issue that might happen even with L1 guest with this series, and
> > the scp can stall after copied tens or hundreds of MBs randomly.  I'm
> > still investigating the problem.  This problem should be related to
> > deferred flushing of VT-d kernel driver, since the problem will go
> > away if we use "intel_iommu=on,strict".  However I'm still trying to
> > figure out what's the thing behind the scene even with that deferred
> > flushing feature.
> 
> I vaguely remember recent upstream vfio support delayed flush, maybe it's
> related.

I'm a bit confused about why vfio is related to the deferred flushing.
Could you provide a pointer for this?

> 
> > 
> > Meanwhile, during the investigation I found another "possibly valid"
> > use case about the modify-PTE problem that Jason has mentioned when
> > with deferred flushing:
> > 
> >           vcpu1                        vcpu2
> >        map page A
> >    explicitly send PSI for A
> >     queue unmap page A [1]
> >                                   map the same page A [2]
> >                                  explcitly send PSI for A [3]
> >     flush unmap page A [4]
> > 
> > Due to deferred flushing, the UNMAP PSI might be postponed (or it can
> > be finally a DSI) from step [1] to step [4].  If we allocate the same
> > page somewhere else, we might trigger this modify-PTE at [2] since we
> > haven't yet received the deferred PSI to unmap A from vcpu1.
> > 
> > Note that this will not happen with latest upstream Linux, since the
> > IOVA allocation algorithm in current Linux kernel made sure that the
> > IOVA range won't be freed until [4], so we can't really allocate the
> > same page address at [2].
> 
> Yes, so the vfio + vIOMMU work will probably uncover more bugs in the IOMMU
> driver (especially CM mode). I suspect CM mode does not have sufficient test
> (since it probably wasn't used in any production environment before the vfio
> + vIOMMU work).

Yes, maybe. I might continue investigating some of this after this
"functional" series.  My plan is to first make it work functionally
(we can tolerate some trivial bugs, but the scp error doesn't count
as one), then consider other things.

> 
> >    However this let me tend to agree with
> > Jason and Kevin's worry on future potential issues if that can be
> > triggered easily by common guest kernel bugs.  So now I'm considering
> > to drop my mergable interval tree but just use a simpler tree to cache
> > everything including translated addresses.  The metadata will possibly
> > take 2% of managed memory if with that.
> > 
> 
> Good to know this, we can start from the way we know correct for sure then
> optimizations on top.

Yes.

Btw, it's still not really "correct" here AFAIU - the correct behavior
would be to "modify" the PTE without any invalid DMA window.  Even if
I switch to the other method, we still need to unmap and remap, so we
will still have an invalid window for that DMA range.  For example, if
the Linux kernel freed IOVA ranges right after queueing the unmap in
the current VT-d driver (again, it does not do this; I'm just
assuming), then my current series would be affected while the other
way would not.  I'll still add a comment there to mark this as TODO.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes
  2018-05-17  2:45     ` Peter Xu
@ 2018-05-17  3:39       ` Alex Williamson
  2018-05-17  4:16         ` Peter Xu
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2018-05-17  3:39 UTC (permalink / raw)
  To: Peter Xu
  Cc: Jason Wang, qemu-devel, Tian Kevin, Michael S . Tsirkin, Jintack Lim

On Thu, 17 May 2018 10:45:44 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Wed, May 16, 2018 at 09:57:40PM +0800, Jason Wang wrote:
> > 
> > 
> > On 2018年05月16日 14:30, Peter Xu wrote:  
> > > On Fri, May 04, 2018 at 11:08:01AM +0800, Peter Xu wrote:  
> > > > v2:
> > > > - fix patchew code style warnings
> > > > - interval tree: postpone malloc when inserting; simplify node remove
> > > >    a bit where proper [Jason]
> > > > - fix up comment and commit message for iommu lock patch [Kevin]
> > > > - protect context cache too using the iommu lock [Kevin, Jason]
> > > > - add vast comment in patch 8 to explain the modify-PTE problem
> > > >    [Jason, Kevin]  
> > > We can hold a bit on reviewing this series.  Jintack reported a scp
> > > DMAR issue that might happen even with L1 guest with this series, and
> > > the scp can stall after copied tens or hundreds of MBs randomly.  I'm
> > > still investigating the problem.  This problem should be related to
> > > deferred flushing of VT-d kernel driver, since the problem will go
> > > away if we use "intel_iommu=on,strict".  However I'm still trying to
> > > figure out what's the thing behind the scene even with that deferred
> > > flushing feature.  
> > 
> > I vaguely remember recent upstream vfio support delayed flush, maybe it's
> > related.  
> 
> I'm a bit confused on why vfio is related to the deferred flushing.
> Could you provide a pointer for this?

Perhaps referring to this:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6bd06f5a486c06023a618a86e8153b91d26f75f4

Rather than calling iommu_unmap() for each chunk of a mapping, we'll
make multiple calls to iommu_unmap_fast() and flush with
iommu_tlb_sync() to defer and batch the hardware flushing.  Thanks,

Alex
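
For reference, the batching pattern that commit introduces looks
roughly like the sketch below (simplified against the ~4.17 kernel
IOMMU API; this is not the actual vfio code and the helper name is
made up):

    #include <linux/iommu.h>
    #include <linux/kernel.h>

    /*
     * Unmap [iova, iova + size) in chunks, deferring the hardware
     * IOTLB flush until the whole range has been torn down.
     */
    static size_t unmap_deferred(struct iommu_domain *d,
                                 unsigned long iova, size_t size,
                                 size_t chunk)
    {
        size_t unmapped = 0;

        while (unmapped < size) {
            /* Tears down PTEs but does not flush the IOTLB yet */
            size_t ret = iommu_unmap_fast(d, iova + unmapped,
                                          min(chunk, size - unmapped));
            if (!ret)
                break;
            unmapped += ret;
        }

        /* One hardware flush covers the whole batch */
        iommu_tlb_sync(d);
        return unmapped;
    }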

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes
  2018-05-17  3:39       ` Alex Williamson
@ 2018-05-17  4:16         ` Peter Xu
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Xu @ 2018-05-17  4:16 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Wang, qemu-devel, Tian Kevin, Michael S . Tsirkin, Jintack Lim

On Wed, May 16, 2018 at 09:39:41PM -0600, Alex Williamson wrote:
> On Thu, 17 May 2018 10:45:44 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Wed, May 16, 2018 at 09:57:40PM +0800, Jason Wang wrote:
> > > 
> > > 
> > > On 2018年05月16日 14:30, Peter Xu wrote:  
> > > > On Fri, May 04, 2018 at 11:08:01AM +0800, Peter Xu wrote:  
> > > > > v2:
> > > > > - fix patchew code style warnings
> > > > > - interval tree: postpone malloc when inserting; simplify node remove
> > > > >    a bit where proper [Jason]
> > > > > - fix up comment and commit message for iommu lock patch [Kevin]
> > > > > - protect context cache too using the iommu lock [Kevin, Jason]
> > > > > - add vast comment in patch 8 to explain the modify-PTE problem
> > > > >    [Jason, Kevin]  
> > > > We can hold a bit on reviewing this series.  Jintack reported a scp
> > > > DMAR issue that might happen even with L1 guest with this series, and
> > > > the scp can stall after copied tens or hundreds of MBs randomly.  I'm
> > > > still investigating the problem.  This problem should be related to
> > > > deferred flushing of VT-d kernel driver, since the problem will go
> > > > away if we use "intel_iommu=on,strict".  However I'm still trying to
> > > > figure out what's the thing behind the scene even with that deferred
> > > > flushing feature.  
> > > 
> > > I vaguely remember recent upstream vfio support delayed flush, maybe it's
> > > related.  
> > 
> > I'm a bit confused on why vfio is related to the deferred flushing.
> > Could you provide a pointer for this?
> 
> Perhaps referring to this:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6bd06f5a486c06023a618a86e8153b91d26f75f4
> 
> Rather than calling iommu_unmap() for each chunk of a mapping we'll
> make multiple calls to iommu_unmap_fast() and flush with
> iommu_tlb_sync() to defer and batch the hardware flushing.  Thanks,

Thanks for the link!

It seems to be a good performance enhancement for vfio, but it might
not be related to the current problem.

My latest clue shows that the issue is possibly caused by replaying
each IOMMU region twice on a single DSI.  For example, when we get one
DSI we'll actually call vtd_page_walk() twice (the IOMMU region is
split by the MSI region 0xfeeXXXXX, so we now have two notifiers for
each vfio-pci device... and memory_region_iommu_replay_all will call
vtd_page_walk twice), which confuses the IOVA tree a bit.  I'll
verify that and see how I can fix it up.
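
For reference, this is roughly the per-device layout that yields two
notifiers (assuming the MSI window is the architectural
0xfee00000-0xfeefffff range):

    0x00000000 .. 0xfedfffff   IOMMU region, first half  (notifier 1)
    0xfee00000 .. 0xfeefffff   interrupt (MSI) region    (no notifier)
    0xfef00000 .. end          IOMMU region, second half (notifier 2)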

(PS: it seems that in the above patch unmapped_region_cnt is not
 needed in vfio_unmap_unpin, considering that we already have
 unmapped_region_list there?  If that's correct, then we can remove
 all the references too, e.g., we don't need to pass unmapped_cnt
 into unmap_unpin_fast either.)

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 02/10] intel-iommu: remove IntelIOMMUNotifierNode
  2018-05-04  3:08 ` [Qemu-devel] [PATCH v2 02/10] intel-iommu: remove IntelIOMMUNotifierNode Peter Xu
@ 2018-05-17  9:46   ` Auger Eric
  2018-05-17 10:02     ` Peter Xu
  0 siblings, 1 reply; 40+ messages in thread
From: Auger Eric @ 2018-05-17  9:46 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Jason Wang, Alex Williamson,
	Jintack Lim

Hi Peter,

On 05/04/2018 05:08 AM, Peter Xu wrote:
> That is not really necessary.  Removing that node struct and put the
> list entry directly into VTDAddressSpace.  It simplfies the code a lot.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/hw/i386/intel_iommu.h |  9 ++------
>  hw/i386/intel_iommu.c         | 41 ++++++++++-------------------------
>  2 files changed, 14 insertions(+), 36 deletions(-)
> 
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 45ec8919b6..220697253f 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -67,7 +67,6 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
>  typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
>  typedef struct VTDIrq VTDIrq;
>  typedef struct VTD_MSIMessage VTD_MSIMessage;
> -typedef struct IntelIOMMUNotifierNode IntelIOMMUNotifierNode;
>  
>  /* Context-Entry */
>  struct VTDContextEntry {
> @@ -93,6 +92,7 @@ struct VTDAddressSpace {
>      MemoryRegion iommu_ir;      /* Interrupt region: 0xfeeXXXXX */
>      IntelIOMMUState *iommu_state;
>      VTDContextCacheEntry context_cache_entry;
> +    QLIST_ENTRY(VTDAddressSpace) next;
>  };
>  
>  struct VTDBus {
> @@ -253,11 +253,6 @@ struct VTD_MSIMessage {
>  /* When IR is enabled, all MSI/MSI-X data bits should be zero */
>  #define VTD_IR_MSI_DATA          (0)
>  
> -struct IntelIOMMUNotifierNode {
> -    VTDAddressSpace *vtd_as;
> -    QLIST_ENTRY(IntelIOMMUNotifierNode) next;
> -};
> -
>  /* The iommu (DMAR) device state struct */
>  struct IntelIOMMUState {
>      X86IOMMUState x86_iommu;
> @@ -295,7 +290,7 @@ struct IntelIOMMUState {
>      GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
>      VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
>      /* list of registered notifiers */
> -    QLIST_HEAD(, IntelIOMMUNotifierNode) notifiers_list;
> +    QLIST_HEAD(, VTDAddressSpace) notifiers_list;
Wouldn't it make sense to rename notifiers_list to something more
understandable, like address_spaces?
>  
>      /* interrupt remapping */
>      bool intr_enabled;              /* Whether guest enabled IR */
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index b359efd6f9..5987b48d43 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1248,10 +1248,10 @@ static void vtd_interrupt_remap_table_setup(IntelIOMMUState *s)
>  
>  static void vtd_iommu_replay_all(IntelIOMMUState *s)
>  {
> -    IntelIOMMUNotifierNode *node;
> +    VTDAddressSpace *vtd_as;
>  
> -    QLIST_FOREACH(node, &s->notifiers_list, next) {
> -        memory_region_iommu_replay_all(&node->vtd_as->iommu);
> +    QLIST_FOREACH(vtd_as, &s->notifiers_list, next) {
> +        memory_region_iommu_replay_all(&vtd_as->iommu);
>      }
>  }
>  
> @@ -1372,7 +1372,6 @@ static void vtd_iotlb_global_invalidate(IntelIOMMUState *s)
>  
>  static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
>  {
> -    IntelIOMMUNotifierNode *node;
>      VTDContextEntry ce;
>      VTDAddressSpace *vtd_as;
>  
> @@ -1381,8 +1380,7 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
>      g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_domain,
>                                  &domain_id);
>  
> -    QLIST_FOREACH(node, &s->notifiers_list, next) {
> -        vtd_as = node->vtd_as;
> +    QLIST_FOREACH(vtd_as, &s->notifiers_list, next) {
>          if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>                                        vtd_as->devfn, &ce) &&
>              domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> @@ -1402,12 +1400,11 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
>                                             uint16_t domain_id, hwaddr addr,
>                                             uint8_t am)
>  {
> -    IntelIOMMUNotifierNode *node;
> +    VTDAddressSpace *vtd_as;
>      VTDContextEntry ce;
>      int ret;
>  
> -    QLIST_FOREACH(node, &(s->notifiers_list), next) {
> -        VTDAddressSpace *vtd_as = node->vtd_as;
> +    QLIST_FOREACH(vtd_as, &(s->notifiers_list), next) {
>          ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>                                         vtd_as->devfn, &ce);
>          if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> @@ -2344,8 +2341,6 @@ static void vtd_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
>  {
>      VTDAddressSpace *vtd_as = container_of(iommu, VTDAddressSpace, iommu);
>      IntelIOMMUState *s = vtd_as->iommu_state;
> -    IntelIOMMUNotifierNode *node = NULL;
> -    IntelIOMMUNotifierNode *next_node = NULL;
>  
>      if (!s->caching_mode && new & IOMMU_NOTIFIER_MAP) {
>          error_report("We need to set caching-mode=1 for intel-iommu to enable "
> @@ -2354,21 +2349,11 @@ static void vtd_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
>      }
>  
>      if (old == IOMMU_NOTIFIER_NONE) {
> -        node = g_malloc0(sizeof(*node));
> -        node->vtd_as = vtd_as;
> -        QLIST_INSERT_HEAD(&s->notifiers_list, node, next);
> -        return;
> -    }
> -
> -    /* update notifier node with new flags */
> -    QLIST_FOREACH_SAFE(node, &s->notifiers_list, next, next_node) {
> -        if (node->vtd_as == vtd_as) {
> -            if (new == IOMMU_NOTIFIER_NONE) {
> -                QLIST_REMOVE(node, next);
> -                g_free(node);
> -            }
> -            return;
> -        }
> +        /* Insert new ones */
s/ones/one

> +        QLIST_INSERT_HEAD(&s->notifiers_list, vtd_as, next);
> +    } else if (new == IOMMU_NOTIFIER_NONE) {
> +        /* Remove old ones */
same. Not sure the comments are worth keeping.
> +        QLIST_REMOVE(vtd_as, next);
>      }
>  }
>  
> @@ -2838,12 +2823,10 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>  
>  static void vtd_address_space_unmap_all(IntelIOMMUState *s)
>  {
> -    IntelIOMMUNotifierNode *node;
>      VTDAddressSpace *vtd_as;
>      IOMMUNotifier *n;
>  
> -    QLIST_FOREACH(node, &s->notifiers_list, next) {
> -        vtd_as = node->vtd_as;
> +    QLIST_FOREACH(vtd_as, &s->notifiers_list, next) {
>          IOMMU_NOTIFIER_FOREACH(n, &vtd_as->iommu) {
>              vtd_address_space_unmap(vtd_as, n);
>          }
> 
Thanks

Eric

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 02/10] intel-iommu: remove IntelIOMMUNotifierNode
  2018-05-17  9:46   ` Auger Eric
@ 2018-05-17 10:02     ` Peter Xu
  2018-05-17 10:10       ` Auger Eric
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Xu @ 2018-05-17 10:02 UTC (permalink / raw)
  To: Auger Eric
  Cc: qemu-devel, Tian Kevin, Michael S . Tsirkin, Jason Wang,
	Alex Williamson, Jintack Lim

On Thu, May 17, 2018 at 11:46:22AM +0200, Auger Eric wrote:
> Hi Peter,
> 
> On 05/04/2018 05:08 AM, Peter Xu wrote:
> > That is not really necessary.  Removing that node struct and put the
> > list entry directly into VTDAddressSpace.  It simplfies the code a lot.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  include/hw/i386/intel_iommu.h |  9 ++------
> >  hw/i386/intel_iommu.c         | 41 ++++++++++-------------------------
> >  2 files changed, 14 insertions(+), 36 deletions(-)
> > 
> > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > index 45ec8919b6..220697253f 100644
> > --- a/include/hw/i386/intel_iommu.h
> > +++ b/include/hw/i386/intel_iommu.h
> > @@ -67,7 +67,6 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
> >  typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
> >  typedef struct VTDIrq VTDIrq;
> >  typedef struct VTD_MSIMessage VTD_MSIMessage;
> > -typedef struct IntelIOMMUNotifierNode IntelIOMMUNotifierNode;
> >  
> >  /* Context-Entry */
> >  struct VTDContextEntry {
> > @@ -93,6 +92,7 @@ struct VTDAddressSpace {
> >      MemoryRegion iommu_ir;      /* Interrupt region: 0xfeeXXXXX */
> >      IntelIOMMUState *iommu_state;
> >      VTDContextCacheEntry context_cache_entry;
> > +    QLIST_ENTRY(VTDAddressSpace) next;
> >  };
> >  
> >  struct VTDBus {
> > @@ -253,11 +253,6 @@ struct VTD_MSIMessage {
> >  /* When IR is enabled, all MSI/MSI-X data bits should be zero */
> >  #define VTD_IR_MSI_DATA          (0)
> >  
> > -struct IntelIOMMUNotifierNode {
> > -    VTDAddressSpace *vtd_as;
> > -    QLIST_ENTRY(IntelIOMMUNotifierNode) next;
> > -};
> > -
> >  /* The iommu (DMAR) device state struct */
> >  struct IntelIOMMUState {
> >      X86IOMMUState x86_iommu;
> > @@ -295,7 +290,7 @@ struct IntelIOMMUState {
> >      GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
> >      VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
> >      /* list of registered notifiers */
> > -    QLIST_HEAD(, IntelIOMMUNotifierNode) notifiers_list;
> > +    QLIST_HEAD(, VTDAddressSpace) notifiers_list;
> Wouldn't it make sense to rename notifiers_list into something more
> understandable like address_spaces?

But address_spaces might be a bit confusing too, on the other hand,
since it reads as "a list of all VT-d address spaces".  How about
something like:

     address_spaces_with_notifiers

?

[...]

> > -    /* update notifier node with new flags */
> > -    QLIST_FOREACH_SAFE(node, &s->notifiers_list, next, next_node) {
> > -        if (node->vtd_as == vtd_as) {
> > -            if (new == IOMMU_NOTIFIER_NONE) {
> > -                QLIST_REMOVE(node, next);
> > -                g_free(node);
> > -            }
> > -            return;
> > -        }
> > +        /* Insert new ones */
> s/ones/one
> 
> > +        QLIST_INSERT_HEAD(&s->notifiers_list, vtd_as, next);
> > +    } else if (new == IOMMU_NOTIFIER_NONE) {
> > +        /* Remove old ones */
> same. Not sure the comments are worth.

Will remove two "s"s there.  Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 02/10] intel-iommu: remove IntelIOMMUNotifierNode
  2018-05-17 10:02     ` Peter Xu
@ 2018-05-17 10:10       ` Auger Eric
  2018-05-17 10:14         ` Peter Xu
  0 siblings, 1 reply; 40+ messages in thread
From: Auger Eric @ 2018-05-17 10:10 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Tian Kevin, Michael S . Tsirkin, Jason Wang,
	Alex Williamson, Jintack Lim

Hi Peter,

On 05/17/2018 12:02 PM, Peter Xu wrote:
> On Thu, May 17, 2018 at 11:46:22AM +0200, Auger Eric wrote:
>> Hi Peter,
>>
>> On 05/04/2018 05:08 AM, Peter Xu wrote:
>>> That is not really necessary.  Removing that node struct and put the
>>> list entry directly into VTDAddressSpace.  It simplfies the code a lot.
>>>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>>  include/hw/i386/intel_iommu.h |  9 ++------
>>>  hw/i386/intel_iommu.c         | 41 ++++++++++-------------------------
>>>  2 files changed, 14 insertions(+), 36 deletions(-)
>>>
>>> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
>>> index 45ec8919b6..220697253f 100644
>>> --- a/include/hw/i386/intel_iommu.h
>>> +++ b/include/hw/i386/intel_iommu.h
>>> @@ -67,7 +67,6 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
>>>  typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
>>>  typedef struct VTDIrq VTDIrq;
>>>  typedef struct VTD_MSIMessage VTD_MSIMessage;
>>> -typedef struct IntelIOMMUNotifierNode IntelIOMMUNotifierNode;
>>>  
>>>  /* Context-Entry */
>>>  struct VTDContextEntry {
>>> @@ -93,6 +92,7 @@ struct VTDAddressSpace {
>>>      MemoryRegion iommu_ir;      /* Interrupt region: 0xfeeXXXXX */
>>>      IntelIOMMUState *iommu_state;
>>>      VTDContextCacheEntry context_cache_entry;
>>> +    QLIST_ENTRY(VTDAddressSpace) next;
>>>  };
>>>  
>>>  struct VTDBus {
>>> @@ -253,11 +253,6 @@ struct VTD_MSIMessage {
>>>  /* When IR is enabled, all MSI/MSI-X data bits should be zero */
>>>  #define VTD_IR_MSI_DATA          (0)
>>>  
>>> -struct IntelIOMMUNotifierNode {
>>> -    VTDAddressSpace *vtd_as;
>>> -    QLIST_ENTRY(IntelIOMMUNotifierNode) next;
>>> -};
>>> -
>>>  /* The iommu (DMAR) device state struct */
>>>  struct IntelIOMMUState {
>>>      X86IOMMUState x86_iommu;
>>> @@ -295,7 +290,7 @@ struct IntelIOMMUState {
>>>      GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
>>>      VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
>>>      /* list of registered notifiers */
>>> -    QLIST_HEAD(, IntelIOMMUNotifierNode) notifiers_list;
>>> +    QLIST_HEAD(, VTDAddressSpace) notifiers_list;
>> Wouldn't it make sense to rename notifiers_list into something more
>> understandable like address_spaces?
> 
> But address_spaces might be a bit confusing too on the other side as
> "a list of all VT-d address spaces".  How about something like:
> 
>      address_spaces_with_notifiers
Hmm, I missed that not all of them belong to that list. A bit long
now?  vtd_as_with_notifiers?

Thanks

Eric
> 
> ?
> 
> [...]
> 
>>> -    /* update notifier node with new flags */
>>> -    QLIST_FOREACH_SAFE(node, &s->notifiers_list, next, next_node) {
>>> -        if (node->vtd_as == vtd_as) {
>>> -            if (new == IOMMU_NOTIFIER_NONE) {
>>> -                QLIST_REMOVE(node, next);
>>> -                g_free(node);
>>> -            }
>>> -            return;
>>> -        }
>>> +        /* Insert new ones */
>> s/ones/one
>>
>>> +        QLIST_INSERT_HEAD(&s->notifiers_list, vtd_as, next);
>>> +    } else if (new == IOMMU_NOTIFIER_NONE) {
>>> +        /* Remove old ones */
>> same. Not sure the comments are worth.
> 
> Will remove two "s"s there.  Thanks,
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 02/10] intel-iommu: remove IntelIOMMUNotifierNode
  2018-05-17 10:10       ` Auger Eric
@ 2018-05-17 10:14         ` Peter Xu
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Xu @ 2018-05-17 10:14 UTC (permalink / raw)
  To: Auger Eric
  Cc: qemu-devel, Tian Kevin, Michael S . Tsirkin, Jason Wang,
	Alex Williamson, Jintack Lim

On Thu, May 17, 2018 at 12:10:41PM +0200, Auger Eric wrote:
> Hi Peter,
> 
> On 05/17/2018 12:02 PM, Peter Xu wrote:
> > On Thu, May 17, 2018 at 11:46:22AM +0200, Auger Eric wrote:
> >> Hi Peter,
> >>
> >> On 05/04/2018 05:08 AM, Peter Xu wrote:
> >>> That is not really necessary.  Remove that node struct and put the
> >>> list entry directly into VTDAddressSpace.  It simplifies the code a lot.
> >>>
> >>> Signed-off-by: Peter Xu <peterx@redhat.com>
> >>> ---
> >>>  include/hw/i386/intel_iommu.h |  9 ++------
> >>>  hw/i386/intel_iommu.c         | 41 ++++++++++-------------------------
> >>>  2 files changed, 14 insertions(+), 36 deletions(-)
> >>>
> >>> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> >>> index 45ec8919b6..220697253f 100644
> >>> --- a/include/hw/i386/intel_iommu.h
> >>> +++ b/include/hw/i386/intel_iommu.h
> >>> @@ -67,7 +67,6 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
> >>>  typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
> >>>  typedef struct VTDIrq VTDIrq;
> >>>  typedef struct VTD_MSIMessage VTD_MSIMessage;
> >>> -typedef struct IntelIOMMUNotifierNode IntelIOMMUNotifierNode;
> >>>  
> >>>  /* Context-Entry */
> >>>  struct VTDContextEntry {
> >>> @@ -93,6 +92,7 @@ struct VTDAddressSpace {
> >>>      MemoryRegion iommu_ir;      /* Interrupt region: 0xfeeXXXXX */
> >>>      IntelIOMMUState *iommu_state;
> >>>      VTDContextCacheEntry context_cache_entry;
> >>> +    QLIST_ENTRY(VTDAddressSpace) next;
> >>>  };
> >>>  
> >>>  struct VTDBus {
> >>> @@ -253,11 +253,6 @@ struct VTD_MSIMessage {
> >>>  /* When IR is enabled, all MSI/MSI-X data bits should be zero */
> >>>  #define VTD_IR_MSI_DATA          (0)
> >>>  
> >>> -struct IntelIOMMUNotifierNode {
> >>> -    VTDAddressSpace *vtd_as;
> >>> -    QLIST_ENTRY(IntelIOMMUNotifierNode) next;
> >>> -};
> >>> -
> >>>  /* The iommu (DMAR) device state struct */
> >>>  struct IntelIOMMUState {
> >>>      X86IOMMUState x86_iommu;
> >>> @@ -295,7 +290,7 @@ struct IntelIOMMUState {
> >>>      GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
> >>>      VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
> >>>      /* list of registered notifiers */
> >>> -    QLIST_HEAD(, IntelIOMMUNotifierNode) notifiers_list;
> >>> +    QLIST_HEAD(, VTDAddressSpace) notifiers_list;
> >> Wouldn't it make sense to rename notifiers_list into something more
> >> understandable like address_spaces?
> > 
> > But address_spaces might be a bit confusing too, since it reads as
> > "a list of all VT-d address spaces".  How about something like:
> > 
> >      address_spaces_with_notifiers
> Hmm, I missed that not all of them belong to that list. A bit long now?
> vtd_as_with_notifiers?

Okay, I can use that.  Regarding the other "s" issues - I think I'll
just drop those comments since they aren't really helpful after all.
Thanks,
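
For reference, a minimal sketch of how the agreed rename would read in
IntelIOMMUState (the field name follows the discussion above; the
comment text is illustrative, not from the final patch):

    struct IntelIOMMUState {
        ...
        /* Address spaces with at least one IOMMU notifier registered */
        QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
        ...
    };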

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 04/10] intel-iommu: only do page walk for MAP notifiers
  2018-05-04  3:08 ` [Qemu-devel] [PATCH v2 04/10] intel-iommu: only do page walk for MAP notifiers Peter Xu
@ 2018-05-17 13:39   ` Auger Eric
  2018-05-18  5:53     ` Peter Xu
  0 siblings, 1 reply; 40+ messages in thread
From: Auger Eric @ 2018-05-17 13:39 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Jason Wang, Alex Williamson,
	Jintack Lim

Hi Peter,

On 05/04/2018 05:08 AM, Peter Xu wrote:
> For UNMAP-only IOMMU notifiers, we don't really need to walk the page
s/really// ;-)
> tables.  Speed that procedure up by skipping the page table walk.  That
> should boost performance for UNMAP-only notifiers like vhost.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/hw/i386/intel_iommu.h |  2 ++
>  hw/i386/intel_iommu.c         | 43 +++++++++++++++++++++++++++++++----
>  2 files changed, 40 insertions(+), 5 deletions(-)
> 
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index ee517704e7..9e0a6c1c6a 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -93,6 +93,8 @@ struct VTDAddressSpace {
>      IntelIOMMUState *iommu_state;
>      VTDContextCacheEntry context_cache_entry;
>      QLIST_ENTRY(VTDAddressSpace) next;
> +    /* Superset of notifier flags that this address space has */
> +    IOMMUNotifierFlag notifier_flags;
>  };
>  
>  struct VTDBus {
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 112971638d..9a418abfb6 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -138,6 +138,12 @@ static inline void vtd_iommu_unlock(IntelIOMMUState *s)
>      qemu_mutex_unlock(&s->iommu_lock);
>  }
>  
> +/* Whether the address space needs to notify new mappings */
> +static inline gboolean vtd_as_notify_mappings(VTDAddressSpace *as)
would suggest vtd_as_has_map_notifier()? But tastes & colours ;-)
> +{
> +    return as->notifier_flags & IOMMU_NOTIFIER_MAP;
> +}
> +
>  /* GHashTable functions */
>  static gboolean vtd_uint64_equal(gconstpointer v1, gconstpointer v2)
>  {
> @@ -1433,14 +1439,35 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
>      VTDAddressSpace *vtd_as;
>      VTDContextEntry ce;
>      int ret;
> +    hwaddr size = (1 << am) * VTD_PAGE_SIZE;
>  
>      QLIST_FOREACH(vtd_as, &(s->notifiers_list), next) {
>          ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>                                         vtd_as->devfn, &ce);
>          if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> -            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
> -                          vtd_page_invalidate_notify_hook,
> -                          (void *)&vtd_as->iommu, true, s->aw_bits);
> +            if (vtd_as_notify_mappings(vtd_as)) {
> +                /*
> +                 * For MAP-inclusive notifiers, we need to walk the
> +                 * page table to sync the shadow page table.
> +                 */
Potentially we may have several notifiers attached to the IOMMU MR ~
vtd_as, each of them having different flags. Those flags are OR'ed in
memory_region_update_iommu_notify_flags and this is the one you now
store in the vtd_as. So maybe your comment should rather state:
as soon as we have at least one MAP notifier, we need to do the PTW?
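
For reference, the aggregation in question looks roughly like this (a
simplified sketch of memory_region_update_iommu_notify_flags() as of
this era; the exact code in memory.c may differ):

    static void memory_region_update_iommu_notify_flags(IOMMUMemoryRegion *iommu_mr)
    {
        IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr);
        IOMMUNotifierFlag flags = IOMMU_NOTIFIER_NONE;
        IOMMUNotifier *n;

        /* OR together the flags of every notifier on this IOMMU MR */
        IOMMU_NOTIFIER_FOREACH(n, iommu_mr) {
            flags |= n->notifier_flags;
        }

        if (flags != iommu_mr->iommu_notify_flags) {
            /* This is what lands in vtd_iommu_notify_flag_changed() */
            imrc->notify_flag_changed(iommu_mr, iommu_mr->iommu_notify_flags,
                                      flags);
            iommu_mr->iommu_notify_flags = flags;
        }
    }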

nit: not related to this patch: the vtd_page_walk kerneldoc comment
misses the @notify_unmap param
side note: the name of the hook is a bit misleading as it suggests we
invalidate the entry, whereas we update any valid entry and invalidate
stale ones (if notify_unmap=true)?
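
Something like the following added line would complete that kerneldoc
(the wording is illustrative):

     * @notify_unmap: whether we should notify invalid (unmapped) entries
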
> +                vtd_page_walk(&ce, addr, addr + size,
> +                              vtd_page_invalidate_notify_hook,
> +                              (void *)&vtd_as->iommu, true, s->aw_bits);
> +            } else {
> +                /*
> +                 * For UNMAP-only notifiers, we don't need to walk the
> +                 * page tables.  We just deliver the PSI down to
> +                 * invalidate caches.

We just unmap the range?
> +                 */
> +                IOMMUTLBEntry entry = {
> +                    .target_as = &address_space_memory,
> +                    .iova = addr,
> +                    .translated_addr = 0,
> +                    .addr_mask = size - 1,
> +                    .perm = IOMMU_NONE,
> +                };
> +                memory_region_notify_iommu(&vtd_as->iommu, entry);
> +            }
>          }
>      }
>  }
> @@ -2380,6 +2407,9 @@ static void vtd_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
>          exit(1);
>      }
>  
> +    /* Update per-address-space notifier flags */
> +    vtd_as->notifier_flags = new;
> +
>      if (old == IOMMU_NOTIFIER_NONE) {
>          /* Insert new ones */
>          QLIST_INSERT_HEAD(&s->notifiers_list, vtd_as, next);
> @@ -2890,8 +2920,11 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>                                    PCI_FUNC(vtd_as->devfn),
>                                    VTD_CONTEXT_ENTRY_DID(ce.hi),
>                                    ce.hi, ce.lo);
> -        vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
> -                      s->aw_bits);
> +        if (vtd_as_notify_mappings(vtd_as)) {
> +            /* This is required only for MAP typed notifiers */
> +            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
> +                          s->aw_bits);
> +        }
>      } else {
>          trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
>                                      PCI_FUNC(vtd_as->devfn));
> 
A worthwhile improvement indeed!

Thanks

Eric

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 03/10] intel-iommu: add iommu lock
  2018-05-04  3:08 ` [Qemu-devel] [PATCH v2 03/10] intel-iommu: add iommu lock Peter Xu
@ 2018-05-17 14:32   ` Auger Eric
  2018-05-18  5:32     ` Peter Xu
  0 siblings, 1 reply; 40+ messages in thread
From: Auger Eric @ 2018-05-17 14:32 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Jason Wang, Alex Williamson,
	Jintack Lim

Hi Peter,

On 05/04/2018 05:08 AM, Peter Xu wrote:
> Add a per-iommu big lock to protect IOMMU status.  Currently the only
> thing to be protected is the IOTLB/context cache, since that can be
> accessed even without BQL, e.g., in IO dataplane.

As discussed together, Peter challenged the per-device mutex in
"Re: [PATCH v11 15/17] hw/arm/smmuv3: Cache/invalidate config data"
https://lists.gnu.org/archive/html/qemu-devel/2018-04/msg02403.html
but at that time I failed to justify the use :-(

> 
> Note that we don't need to protect device page tables since that's fully
> controlled by the guest kernel.  However there is still possibility that
> malicious drivers will program the device to not obey the rule.  In that
> case QEMU can't really do anything useful, instead the guest itself will
> be responsible for all uncertainties.
> 
> Reported-by: Fam Zheng <famz@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/hw/i386/intel_iommu.h |  6 +++++
>  hw/i386/intel_iommu.c         | 43 +++++++++++++++++++++++++++++++----
>  2 files changed, 44 insertions(+), 5 deletions(-)
> 
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 220697253f..ee517704e7 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -300,6 +300,12 @@ struct IntelIOMMUState {
>      OnOffAuto intr_eim;             /* Toggle for EIM cabability */
>      bool buggy_eim;                 /* Force buggy EIM unless eim=off */
>      uint8_t aw_bits;                /* Host/IOVA address width (in bits) */
> +
> +    /*
> +     * Protects IOMMU states in general.  Currently it protects the
> +     * per-IOMMU IOTLB cache, and context entry cache in VTDAddressSpace.
> +     */
> +    QemuMutex iommu_lock;
>  };
>  
>  /* Find the VTD Address space associated with the given bus pointer,
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 5987b48d43..112971638d 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -128,6 +128,16 @@ static uint64_t vtd_set_clear_mask_quad(IntelIOMMUState *s, hwaddr addr,
>      return new_val;
>  }
>  
> +static inline void vtd_iommu_lock(IntelIOMMUState *s)
> +{
> +    qemu_mutex_lock(&s->iommu_lock);
> +}
> +
> +static inline void vtd_iommu_unlock(IntelIOMMUState *s)
> +{
> +    qemu_mutex_unlock(&s->iommu_lock);
> +}
> +
>  /* GHashTable functions */
>  static gboolean vtd_uint64_equal(gconstpointer v1, gconstpointer v2)
>  {
> @@ -172,7 +182,7 @@ static gboolean vtd_hash_remove_by_page(gpointer key, gpointer value,
>  }
>  
>  /* Reset all the gen of VTDAddressSpace to zero and set the gen of
> - * IntelIOMMUState to 1.
> + * IntelIOMMUState to 1.  Must be with IOMMU lock held.
s/Must be/ Must be called ?
not done in vtd_init()
>   */
>  static void vtd_reset_context_cache(IntelIOMMUState *s)
>  {
> @@ -197,12 +207,19 @@ static void vtd_reset_context_cache(IntelIOMMUState *s)
>      s->context_cache_gen = 1;
>  }
>  
> -static void vtd_reset_iotlb(IntelIOMMUState *s)
> +static void vtd_reset_iotlb_locked(IntelIOMMUState *s)
add the above comment and keep the original name?
>  {
>      assert(s->iotlb);
>      g_hash_table_remove_all(s->iotlb);
>  }
>  
> +static void vtd_reset_iotlb(IntelIOMMUState *s)
> +{
> +    vtd_iommu_lock(s);
> +    vtd_reset_iotlb_locked(s);
> +    vtd_iommu_unlock(s);
> +}
> +
>  static uint64_t vtd_get_iotlb_key(uint64_t gfn, uint16_t source_id,
>                                    uint32_t level)
>  {
> @@ -215,6 +232,7 @@ static uint64_t vtd_get_iotlb_gfn(hwaddr addr, uint32_t level)
>      return (addr & vtd_slpt_level_page_mask(level)) >> VTD_PAGE_SHIFT_4K;
>  }
>  
> +/* Must be with IOMMU lock held */
>  static VTDIOTLBEntry *vtd_lookup_iotlb(IntelIOMMUState *s, uint16_t source_id,
>                                         hwaddr addr)
>  {
> @@ -235,6 +253,7 @@ out:
>      return entry;
>  }
>  
> +/* Must be with IOMMU lock held */
>  static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
>                               uint16_t domain_id, hwaddr addr, uint64_t slpte,
>                               uint8_t access_flags, uint32_t level)
> @@ -246,7 +265,7 @@ static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
>      trace_vtd_iotlb_page_update(source_id, addr, slpte, domain_id);
>      if (g_hash_table_size(s->iotlb) >= VTD_IOTLB_MAX_SIZE) {
>          trace_vtd_iotlb_reset("iotlb exceeds size limit");
> -        vtd_reset_iotlb(s);
> +        vtd_reset_iotlb_locked(s);
>      }
>  
>      entry->gfn = gfn;
> @@ -1106,7 +1125,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>      IntelIOMMUState *s = vtd_as->iommu_state;
>      VTDContextEntry ce;
>      uint8_t bus_num = pci_bus_num(bus);
> -    VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
> +    VTDContextCacheEntry *cc_entry;
>      uint64_t slpte, page_mask;
>      uint32_t level;
>      uint16_t source_id = vtd_make_source_id(bus_num, devfn);
> @@ -1123,6 +1142,10 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>       */
>      assert(!vtd_is_interrupt_addr(addr));
>  
> +    vtd_iommu_lock(s);
> +
> +    cc_entry = &vtd_as->context_cache_entry;
> +
>      /* Try to fetch slpte form IOTLB */
>      iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
>      if (iotlb_entry) {
> @@ -1182,7 +1205,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>           * IOMMU region can be swapped back.
>           */
>          vtd_pt_enable_fast_path(s, source_id);
> -
> +        vtd_iommu_unlock(s);
>          return true;
>      }
>  
> @@ -1203,6 +1226,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>      vtd_update_iotlb(s, source_id, VTD_CONTEXT_ENTRY_DID(ce.hi), addr, slpte,
>                       access_flags, level);
>  out:
> +    vtd_iommu_unlock(s);
>      entry->iova = addr & page_mask;
>      entry->translated_addr = vtd_get_slpte_addr(slpte, s->aw_bits) & page_mask;
>      entry->addr_mask = ~page_mask;
> @@ -1210,6 +1234,7 @@ out:
>      return true;
>  
>  error:
> +    vtd_iommu_unlock(s);
>      entry->iova = 0;
>      entry->translated_addr = 0;
>      entry->addr_mask = 0;
> @@ -1258,10 +1283,13 @@ static void vtd_iommu_replay_all(IntelIOMMUState *s)
>  static void vtd_context_global_invalidate(IntelIOMMUState *s)
>  {
>      trace_vtd_inv_desc_cc_global();
> +    /* Protects context cache */
> +    vtd_iommu_lock(s);
>      s->context_cache_gen++;
>      if (s->context_cache_gen == VTD_CONTEXT_CACHE_GEN_MAX) {
>          vtd_reset_context_cache(s);
>      }
> +    vtd_iommu_unlock(s);
>      vtd_switch_address_space_all(s);
>      /*
>       * From VT-d spec 6.5.2.1, a global context entry invalidation
> @@ -1377,8 +1405,10 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
>  
>      trace_vtd_inv_desc_iotlb_domain(domain_id);
>  
> +    vtd_iommu_lock(s);
>      g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_domain,
>                                  &domain_id);
> +    vtd_iommu_unlock(s);
>  
>      QLIST_FOREACH(vtd_as, &s->notifiers_list, next) {
>          if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> @@ -1426,7 +1456,9 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
>      info.domain_id = domain_id;
>      info.addr = addr;
>      info.mask = ~((1 << am) - 1);
> +    vtd_iommu_lock(s);
>      g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
> +    vtd_iommu_unlock(s);
>      vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
>  }
>  
> @@ -3072,6 +3104,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
>      }
>  
>      QLIST_INIT(&s->notifiers_list);
> +    qemu_mutex_init(&s->iommu_lock);
>      memset(s->vtd_as_by_bus_num, 0, sizeof(s->vtd_as_by_bus_num));
>      memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
>                            "intel_iommu", DMAR_REG_SIZE);
> 
Don't you need to take the lock in vtd_context_device_invalidate() which
manipulates the vtd_as->context_cache_entry.context_cache_gen?

Thanks

Eric

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 06/10] intel-iommu: pass in address space when page walk
  2018-05-04  3:08 ` [Qemu-devel] [PATCH v2 06/10] intel-iommu: pass in address space when page walk Peter Xu
@ 2018-05-17 14:32   ` Auger Eric
  2018-05-18  6:02     ` Peter Xu
  0 siblings, 1 reply; 40+ messages in thread
From: Auger Eric @ 2018-05-17 14:32 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Jason Wang, Alex Williamson,
	Jintack Lim

Hi Peter,

On 05/04/2018 05:08 AM, Peter Xu wrote:
> We pass in the VTDAddressSpace to replace the aw bits when doing page
> walk.  The VTDAddressSpace contains the aw bits information; meanwhile
> we'll need to do something more in the follow-up patches regarding
> the address spaces.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric
> ---
>  hw/i386/intel_iommu.c | 20 +++++++++++---------
>  1 file changed, 11 insertions(+), 9 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index b2b2a0a441..83769f2b8c 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -753,13 +753,13 @@ typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
>   * @hook_fn: hook func to be called when detected page
>   * @private: private data to be passed into hook func
>   * @notify_unmap: whether we should notify invalid entries
> - * @aw: maximum address width
> + * @as: VT-d address space of the device
>   */
>  typedef struct {
> +    VTDAddressSpace *as;
>      vtd_page_walk_hook hook_fn;
>      void *private;
>      bool notify_unmap;
> -    uint8_t aw;
>  } vtd_page_walk_info;
>  
>  static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
> @@ -796,6 +796,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
>      uint64_t iova = start;
>      uint64_t iova_next;
>      int ret = 0;
> +    uint8_t aw = info->as->iommu_state->aw_bits;
>  
>      trace_vtd_page_walk_level(addr, level, start, end);
>  
> @@ -836,7 +837,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
>  
>          if (vtd_is_last_slpte(slpte, level)) {
>              /* NOTE: this is only meaningful if entry_valid == true */
> -            entry.translated_addr = vtd_get_slpte_addr(slpte, info->aw);
> +            entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
>              if (!entry_valid && !info->notify_unmap) {
>                  trace_vtd_page_walk_skip_perm(iova, iova_next);
>                  goto next;
> @@ -862,7 +863,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
>                  }
>                  goto next;
>              }
> -            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, info->aw),
> +            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, aw),
>                                        iova, MIN(iova_next, end), level - 1,
>                                        read_cur, write_cur, info);
>              if (ret < 0) {
> @@ -885,19 +886,20 @@ next:
>   * @end: IOVA range end address (start <= addr < end)
>   * @hook_fn: the hook that to be called for each detected area
>   * @private: private data for the hook function
> - * @aw: maximum address width
> + * @as: the VT-d address space of the device
>   */
>  static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
>                           vtd_page_walk_hook hook_fn, void *private,
> -                         bool notify_unmap, uint8_t aw)
> +                         bool notify_unmap, VTDAddressSpace *as)
>  {
>      dma_addr_t addr = vtd_ce_get_slpt_base(ce);
>      uint32_t level = vtd_ce_get_level(ce);
> +    uint8_t aw = as->iommu_state->aw_bits;
>      vtd_page_walk_info info = {
>          .hook_fn = hook_fn,
>          .private = private,
>          .notify_unmap = notify_unmap,
> -        .aw = aw,
> +        .as = as,
>      };
>  
>      if (!vtd_iova_range_check(start, ce, aw)) {
> @@ -1470,7 +1472,7 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
>                   */
>                  vtd_page_walk(&ce, addr, addr + size,
>                                vtd_page_invalidate_notify_hook,
> -                              (void *)&vtd_as->iommu, true, s->aw_bits);
> +                              (void *)&vtd_as->iommu, true, vtd_as);
>              } else {
>                  /*
>                   * For UNMAP-only notifiers, we don't need to walk the
> @@ -2941,7 +2943,7 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>          if (vtd_as_notify_mappings(vtd_as)) {
>              /* This is required only for MAP typed notifiers */
>              vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
> -                          s->aw_bits);
> +                          vtd_as);
>          }
>      } else {
>          trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 05/10] intel-iommu: introduce vtd_page_walk_info
  2018-05-04  3:08 ` [Qemu-devel] [PATCH v2 05/10] intel-iommu: introduce vtd_page_walk_info Peter Xu
@ 2018-05-17 14:32   ` Auger Eric
  2018-05-18  5:59     ` Peter Xu
  0 siblings, 1 reply; 40+ messages in thread
From: Auger Eric @ 2018-05-17 14:32 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Jason Wang, Alex Williamson,
	Jintack Lim

Hi Peter,

On 05/04/2018 05:08 AM, Peter Xu wrote:
> During the recursive page walking of IOVA page tables, some stack
> variables are constants that never change during the whole page
> walking procedure.  Isolate them into a struct so that we don't need to
> pass those constants down the stack at every level.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric
> ---
>  hw/i386/intel_iommu.c | 56 ++++++++++++++++++++++++++++---------------
>  1 file changed, 37 insertions(+), 19 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 9a418abfb6..b2b2a0a441 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -747,9 +747,27 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
>  
>  typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
>  
> +/**
> + * Constant information used during page walking
> + *
> + * @hook_fn: hook func to be called when detected page
> + * @private: private data to be passed into hook func
> + * @notify_unmap: whether we should notify invalid entries
> + * @aw: maximum address width
> + */
> +typedef struct {
> +    vtd_page_walk_hook hook_fn;
> +    void *private;
> +    bool notify_unmap;
> +    uint8_t aw;
> +} vtd_page_walk_info;
> +
>  static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
> -                             vtd_page_walk_hook hook_fn, void *private)
> +                             vtd_page_walk_info *info)
>  {
> +    vtd_page_walk_hook hook_fn = info->hook_fn;
> +    void *private = info->private;
> +
>      assert(hook_fn);
>      trace_vtd_page_walk_one(level, entry->iova, entry->translated_addr,
>                              entry->addr_mask, entry->perm);
> @@ -762,17 +780,13 @@ static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
>   * @addr: base GPA addr to start the walk
>   * @start: IOVA range start address
>   * @end: IOVA range end address (start <= addr < end)
> - * @hook_fn: hook func to be called when detected page
> - * @private: private data to be passed into hook func
>   * @read: whether parent level has read permission
>   * @write: whether parent level has write permission
> - * @notify_unmap: whether we should notify invalid entries
> - * @aw: maximum address width
> + * @info: constant information for the page walk
>   */
>  static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> -                               uint64_t end, vtd_page_walk_hook hook_fn,
> -                               void *private, uint32_t level, bool read,
> -                               bool write, bool notify_unmap, uint8_t aw)
> +                               uint64_t end, uint32_t level, bool read,
> +                               bool write, vtd_page_walk_info *info)
>  {
>      bool read_cur, write_cur, entry_valid;
>      uint32_t offset;
> @@ -822,24 +836,24 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
>  
>          if (vtd_is_last_slpte(slpte, level)) {
>              /* NOTE: this is only meaningful if entry_valid == true */
> -            entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
> -            if (!entry_valid && !notify_unmap) {
> +            entry.translated_addr = vtd_get_slpte_addr(slpte, info->aw);
> +            if (!entry_valid && !info->notify_unmap) {
>                  trace_vtd_page_walk_skip_perm(iova, iova_next);
>                  goto next;
>              }
> -            ret = vtd_page_walk_one(&entry, level, hook_fn, private);
> +            ret = vtd_page_walk_one(&entry, level, info);
>              if (ret < 0) {
>                  return ret;
>              }
>          } else {
>              if (!entry_valid) {
> -                if (notify_unmap) {
> +                if (info->notify_unmap) {
>                      /*
>                       * The whole entry is invalid; unmap it all.
>                       * Translated address is meaningless, zero it.
>                       */
>                      entry.translated_addr = 0x0;
> -                    ret = vtd_page_walk_one(&entry, level, hook_fn, private);
> +                    ret = vtd_page_walk_one(&entry, level, info);
>                      if (ret < 0) {
>                          return ret;
>                      }
> @@ -848,10 +862,9 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
>                  }
>                  goto next;
>              }
> -            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, aw), iova,
> -                                      MIN(iova_next, end), hook_fn, private,
> -                                      level - 1, read_cur, write_cur,
> -                                      notify_unmap, aw);
> +            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, info->aw),
> +                                      iova, MIN(iova_next, end), level - 1,
> +                                      read_cur, write_cur, info);
>              if (ret < 0) {
>                  return ret;
>              }
> @@ -880,6 +893,12 @@ static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
>  {
>      dma_addr_t addr = vtd_ce_get_slpt_base(ce);
>      uint32_t level = vtd_ce_get_level(ce);
> +    vtd_page_walk_info info = {
> +        .hook_fn = hook_fn,
> +        .private = private,
> +        .notify_unmap = notify_unmap,
> +        .aw = aw,
> +    };
>  
>      if (!vtd_iova_range_check(start, ce, aw)) {
>          return -VTD_FR_ADDR_BEYOND_MGAW;
> @@ -890,8 +909,7 @@ static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
>          end = vtd_iova_limit(ce, aw);
>      }
>  
> -    return vtd_page_walk_level(addr, start, end, hook_fn, private,
> -                               level, true, true, notify_unmap, aw);
> +    return vtd_page_walk_level(addr, start, end, level, true, true, &info);
>  }
>  
>  /* Map a device to its corresponding domain (context-entry) */
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 01/10] intel-iommu: send PSI always even if across PDEs
  2018-05-04  3:08 ` [Qemu-devel] [PATCH v2 01/10] intel-iommu: send PSI always even if across PDEs Peter Xu
@ 2018-05-17 14:42   ` Auger Eric
  2018-05-18  3:41     ` Peter Xu
  0 siblings, 1 reply; 40+ messages in thread
From: Auger Eric @ 2018-05-17 14:42 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Jason Wang, Alex Williamson,
	Jintack Lim

Hi Peter,

On 05/04/2018 05:08 AM, Peter Xu wrote:
> During IOVA page table walking, there is a special case when the PSI
> covers one whole PDE (Page Directory Entry, which contains 512 Page
> Table Entries) or more.  In the past, we skipped that entry and did not
> notify the IOMMU notifiers.  This is not correct.  We should send an UNMAP
> notification to registered UNMAP notifiers in this case.
> 
> For UNMAP-only notifiers, this might leave IOTLB entries cached in the
> devices even after they have become invalid.  For MAP/UNMAP notifiers like
> vfio-pci, this will cause stale page mappings.
> 
> This special case doesn't trigger often, but it is easily
> triggered by nested device assignment, since in that case we'll
> possibly map the whole L2 guest RAM region into the device's IOVA
> address space (several GBs at least), which is far bigger than normal
> kernel driver usages of the device (tens of MBs normally).
> 
> Without this patch applied to L1 QEMU, nested device assignment to L2
> guests will dump some errors like:
> 
> qemu-system-x86_64: VFIO_MAP_DMA: -17
> qemu-system-x86_64: vfio_dma_map(0x557305420c30, 0xad000, 0x1000,
>                     0x7f89a920d000) = -17 (File exists)
> 
> Acked-by: Jason Wang <jasowang@redhat.com>
> [peterx: rewrite the commit message]
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/i386/intel_iommu.c | 42 ++++++++++++++++++++++++++++++------------
>  1 file changed, 30 insertions(+), 12 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index fb31de9416..b359efd6f9 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -722,6 +722,15 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
>  
>  typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
>  
> +static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
> +                             vtd_page_walk_hook hook_fn, void *private)
I find the function name a bit weird as it does not do a PTW but
rather calls a callback on an entry. vtd_callback_wrapper?
> +{
> +    assert(hook_fn);
> +    trace_vtd_page_walk_one(level, entry->iova, entry->translated_addr,
> +                            entry->addr_mask, entry->perm);
> +    return hook_fn(entry, private);
> +}
> +
>  /**
>   * vtd_page_walk_level - walk over specific level for IOVA range
>   *
> @@ -781,28 +790,37 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
>           */
>          entry_valid = read_cur | write_cur;
>  
> +        entry.target_as = &address_space_memory;
> +        entry.iova = iova & subpage_mask;
> +        entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
> +        entry.addr_mask = ~subpage_mask;
> +
>          if (vtd_is_last_slpte(slpte, level)) {
> -            entry.target_as = &address_space_memory;
> -            entry.iova = iova & subpage_mask;
>              /* NOTE: this is only meaningful if entry_valid == true */
>              entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
> -            entry.addr_mask = ~subpage_mask;
> -            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
>              if (!entry_valid && !notify_unmap) {
>                  trace_vtd_page_walk_skip_perm(iova, iova_next);
>                  goto next;
>              }
> -            trace_vtd_page_walk_one(level, entry.iova, entry.translated_addr,
> -                                    entry.addr_mask, entry.perm);
> -            if (hook_fn) {
> -                ret = hook_fn(&entry, private);
> -                if (ret < 0) {
> -                    return ret;
> -                }
> +            ret = vtd_page_walk_one(&entry, level, hook_fn, private);
> +            if (ret < 0) {
> +                return ret;
>              }
>          } else {
>              if (!entry_valid) {
> -                trace_vtd_page_walk_skip_perm(iova, iova_next);
> +                if (notify_unmap) {
> +                    /*
> +                     * The whole entry is invalid; unmap it all.
> +                     * Translated address is meaningless, zero it.
> +                     */
> +                    entry.translated_addr = 0x0;
Do you really need to zero the translated_addr and add the related
comment? As soon as perm is NONE this should not be used?
> +                    ret = vtd_page_walk_one(&entry, level, hook_fn, private);
> +                    if (ret < 0) {
> +                        return ret;
> +                    }
> +                } else {
> +                    trace_vtd_page_walk_skip_perm(iova, iova_next);
> +                }
>                  goto next;
>              }
>              ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, aw), iova,
> 

Besides
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 09/10] intel-iommu: don't unmap all for shadow page table
  2018-05-04  3:08 ` [Qemu-devel] [PATCH v2 09/10] intel-iommu: don't unmap all for shadow page table Peter Xu
@ 2018-05-17 17:23   ` Auger Eric
  2018-05-18  6:06     ` Peter Xu
  0 siblings, 1 reply; 40+ messages in thread
From: Auger Eric @ 2018-05-17 17:23 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Jason Wang, Alex Williamson,
	Jintack Lim

Hi Peter,

On 05/04/2018 05:08 AM, Peter Xu wrote:
> IOMMU replay was previously carried out in many use cases, e.g., context
> cache invalidations, domain flushes.  We used this mechanism to sync the
> shadow page table by first (1) unmapping the whole address space, then
> (2) walking the page table to remap what's in the table.
> 
> This is very dangerous.
> 
> The problem is that we'll have a very small window (in my measurement,
> it can be about 3ms) between steps (1) and (2) where the device will
> see no (or an incomplete) device page table.  However the device never
> knows that.  This can cause DMA errors for devices, which assume the page
> table is always there.
But now that you have the QemuMutex, can a translate and an
invalidation occur concurrently? Don't the IOTLB flush and replay
happen while the lock is held?

Thanks

Eric
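
To illustrate the window in question, the replay flow before this patch
was roughly the following (a sketch distilled from the commit message
and diff below, not the exact code):

    /* Old flow - racy for MAP notifiers such as vfio-pci: */
    vtd_address_space_unmap(vtd_as, n);   /* (1) drop the whole shadow table */
    /* <-- window (up to ~3ms): device sees no/incomplete page table --> */
    vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false, vtd_as);
                                          /* (2) remap what's in the table  */

    /* New flow - a single walk with notify_unmap=true, so stale entries
     * are invalidated and valid ones re-sent without a full teardown: */
    vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, true, vtd_as);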
> 
> So the point is that MAP typed notifiers (vfio-pci, for example)
> need the mapped page entries to always be there.  We can never
> unmap existing page entries like we did in (1) above.
> 
> The only solution is to remove step (1).  We couldn't do that before
> since we didn't know which device pages were mapped and which were not,
> so we unmapped them all.  Now with the new IOVA tree QEMU knows what is
> mapped and what is not.  We don't need step (1) any more.  Remove it.
> 
> Note that after removing that global unmap flush, we'll now need to
> notify unmaps during page walks.
> 
> This should fix the DMA error problem that Jintack Lim reported with
> nested device assignment.  This problem doesn't always happen, e.g., I
> cannot reproduce the error myself.  However the collected logs show that
> this is the likely cause of Jintack's problem.
> 
> Reported-by: Jintack Lim <jintack@cs.columbia.edu>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/i386/intel_iommu.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 16b3514afb..9439103cac 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -3003,10 +3003,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>  
>      /*
>       * The replay can be triggered by either a invalidation or a newly
> -     * created entry. No matter what, we release existing mappings
> -     * (it means flushing caches for UNMAP-only registers).
> +     * created entry.
>       */
> -    vtd_address_space_unmap(vtd_as, n);
>  
>      if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
>          trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
> @@ -3015,8 +3013,10 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>                                    ce.hi, ce.lo);
>          if (vtd_as_notify_mappings(vtd_as)) {
>              /* This is required only for MAP typed notifiers */
> -            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
> +            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, true,
>                            vtd_as);
> +        } else {
> +            vtd_address_space_unmap(vtd_as, n);
>          }
>      } else {
>          trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 01/10] intel-iommu: send PSI always even if across PDEs
  2018-05-17 14:42   ` Auger Eric
@ 2018-05-18  3:41     ` Peter Xu
  2018-05-18  7:39       ` Auger Eric
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Xu @ 2018-05-18  3:41 UTC (permalink / raw)
  To: Auger Eric
  Cc: qemu-devel, Tian Kevin, Michael S . Tsirkin, Jason Wang,
	Alex Williamson, Jintack Lim

On Thu, May 17, 2018 at 04:42:54PM +0200, Auger Eric wrote:
> Hi Peter,
> 
> On 05/04/2018 05:08 AM, Peter Xu wrote:
> > During IOVA page table walking, there is a special case when the PSI
> > covers one whole PDE (Page Directory Entry, which contains 512 Page
> > Table Entries) or more.  In the past, we skipped that entry and did not
> > notify the IOMMU notifiers.  This is not correct.  We should send an UNMAP
> > notification to registered UNMAP notifiers in this case.
> > 
> > For UNMAP-only notifiers, this might leave IOTLB entries cached in the
> > devices even after they have become invalid.  For MAP/UNMAP notifiers like
> > vfio-pci, this will cause stale page mappings.
> > 
> > This special case doesn't trigger often, but it is easily
> > triggered by nested device assignment, since in that case we'll
> > possibly map the whole L2 guest RAM region into the device's IOVA
> > address space (several GBs at least), which is far bigger than normal
> > kernel driver usages of the device (tens of MBs normally).
> > 
> > Without this patch applied to L1 QEMU, nested device assignment to L2
> > guests will dump some errors like:
> > 
> > qemu-system-x86_64: VFIO_MAP_DMA: -17
> > qemu-system-x86_64: vfio_dma_map(0x557305420c30, 0xad000, 0x1000,
> >                     0x7f89a920d000) = -17 (File exists)
> > 
> > Acked-by: Jason Wang <jasowang@redhat.com>
> > [peterx: rewrite the commit message]
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  hw/i386/intel_iommu.c | 42 ++++++++++++++++++++++++++++++------------
> >  1 file changed, 30 insertions(+), 12 deletions(-)
> > 
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index fb31de9416..b359efd6f9 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -722,6 +722,15 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
> >  
> >  typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
> >  
> > +static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
> > +                             vtd_page_walk_hook hook_fn, void *private)
> I find the function name a bit weird as it does not do a PTW but
> rather calls a callback on an entry. vtd_callback_wrapper?

It's a hook for the page walk process, and IMHO vtd_callback_wrapper
does not really provide any hint about the page walking.  So even if you
prefer the "callback_wrapper" naming I would still prefer:

  vtd_page_walk_callback[_wrapper]

though even then I don't see much benefit compared to using the old
vtd_page_walk_hook, which seems fine to me too...

> > +{
> > +    assert(hook_fn);
> > +    trace_vtd_page_walk_one(level, entry->iova, entry->translated_addr,
> > +                            entry->addr_mask, entry->perm);
> > +    return hook_fn(entry, private);
> > +}
> > +
> >  /**
> >   * vtd_page_walk_level - walk over specific level for IOVA range
> >   *
> > @@ -781,28 +790,37 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> >           */
> >          entry_valid = read_cur | write_cur;
> >  
> > +        entry.target_as = &address_space_memory;
> > +        entry.iova = iova & subpage_mask;
> > +        entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
> > +        entry.addr_mask = ~subpage_mask;
> > +
> >          if (vtd_is_last_slpte(slpte, level)) {
> > -            entry.target_as = &address_space_memory;
> > -            entry.iova = iova & subpage_mask;
> >              /* NOTE: this is only meaningful if entry_valid == true */
> >              entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
> > -            entry.addr_mask = ~subpage_mask;
> > -            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
> >              if (!entry_valid && !notify_unmap) {
> >                  trace_vtd_page_walk_skip_perm(iova, iova_next);
> >                  goto next;
> >              }
> > -            trace_vtd_page_walk_one(level, entry.iova, entry.translated_addr,
> > -                                    entry.addr_mask, entry.perm);
> > -            if (hook_fn) {
> > -                ret = hook_fn(&entry, private);
> > -                if (ret < 0) {
> > -                    return ret;
> > -                }
> > +            ret = vtd_page_walk_one(&entry, level, hook_fn, private);
> > +            if (ret < 0) {
> > +                return ret;
> >              }
> >          } else {
> >              if (!entry_valid) {
> > -                trace_vtd_page_walk_skip_perm(iova, iova_next);
> > +                if (notify_unmap) {
> > +                    /*
> > +                     * The whole entry is invalid; unmap it all.
> > +                     * Translated address is meaningless, zero it.
> > +                     */
> > +                    entry.translated_addr = 0x0;
> Do you really need to zero the translated_addr and add the related
> comment? As soon as perm is NONE this should not be used?

Yes, here we could avoid setting it.  However zeroing makes sure we don't
leak strange numbers to the notifiers below, so I would still slightly
prefer to zero it here.

> > +                    ret = vtd_page_walk_one(&entry, level, hook_fn, private);
> > +                    if (ret < 0) {
> > +                        return ret;
> > +                    }
> > +                } else {
> > +                    trace_vtd_page_walk_skip_perm(iova, iova_next);
> > +                }
> >                  goto next;
> >              }
> >              ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, aw), iova,
> > 
> 
> Besides
> Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks for reviewing the series.

Note that this v2 is obsolete, please feel free to read version 3 of
this series, which contains quite a lot of functional changes.

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 03/10] intel-iommu: add iommu lock
  2018-05-17 14:32   ` Auger Eric
@ 2018-05-18  5:32     ` Peter Xu
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Xu @ 2018-05-18  5:32 UTC (permalink / raw)
  To: Auger Eric
  Cc: qemu-devel, Tian Kevin, Michael S . Tsirkin, Jason Wang,
	Alex Williamson, Jintack Lim

On Thu, May 17, 2018 at 04:32:52PM +0200, Auger Eric wrote:
> Hi Peter,
> 
> On 05/04/2018 05:08 AM, Peter Xu wrote:
> > Add a per-iommu big lock to protect IOMMU status.  Currently the only
> > thing to be protected is the IOTLB/context cache, since that can be
> > accessed even without BQL, e.g., in IO dataplane.
> 
> As discussed together, Peter challenged the per-device mutex in
> "Re: [PATCH v11 15/17] hw/arm/smmuv3: Cache/invalidate config data"
> https://lists.gnu.org/archive/html/qemu-devel/2018-04/msg02403.html
> but at that time I failed to justify the use :-(
> 
> > 
> > Note that we don't need to protect device page tables since that's fully
> > controlled by the guest kernel.  However there is still a possibility that
> > malicious drivers will program the device to not obey the rule.  In that
> > case QEMU can't really do anything useful, instead the guest itself will
> > be responsible for all uncertainties.
> > 
> > Reported-by: Fam Zheng <famz@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  include/hw/i386/intel_iommu.h |  6 +++++
> >  hw/i386/intel_iommu.c         | 43 +++++++++++++++++++++++++++++++----
> >  2 files changed, 44 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > index 220697253f..ee517704e7 100644
> > --- a/include/hw/i386/intel_iommu.h
> > +++ b/include/hw/i386/intel_iommu.h
> > @@ -300,6 +300,12 @@ struct IntelIOMMUState {
> >      OnOffAuto intr_eim;             /* Toggle for EIM cabability */
> >      bool buggy_eim;                 /* Force buggy EIM unless eim=off */
> >      uint8_t aw_bits;                /* Host/IOVA address width (in bits) */
> > +
> > +    /*
> > +     * Protects IOMMU states in general.  Currently it protects the
> > +     * per-IOMMU IOTLB cache, and context entry cache in VTDAddressSpace.
> > +     */
> > +    QemuMutex iommu_lock;
> >  };
> >  
> >  /* Find the VTD Address space associated with the given bus pointer,
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index 5987b48d43..112971638d 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -128,6 +128,16 @@ static uint64_t vtd_set_clear_mask_quad(IntelIOMMUState *s, hwaddr addr,
> >      return new_val;
> >  }
> >  
> > +static inline void vtd_iommu_lock(IntelIOMMUState *s)
> > +{
> > +    qemu_mutex_lock(&s->iommu_lock);
> > +}
> > +
> > +static inline void vtd_iommu_unlock(IntelIOMMUState *s)
> > +{
> > +    qemu_mutex_unlock(&s->iommu_lock);
> > +}
> > +
> >  /* GHashTable functions */
> >  static gboolean vtd_uint64_equal(gconstpointer v1, gconstpointer v2)
> >  {
> > @@ -172,7 +182,7 @@ static gboolean vtd_hash_remove_by_page(gpointer key, gpointer value,
> >  }
> >  
> >  /* Reset all the gen of VTDAddressSpace to zero and set the gen of
> > - * IntelIOMMUState to 1.
> > + * IntelIOMMUState to 1.  Must be with IOMMU lock held.
> s/Must be/ Must be called ?

Ok.

> not done in vtd_init()

IMHO we can omit that since it's only used during either realization
or system reset.  But sure I can add it too.

> >   */
> >  static void vtd_reset_context_cache(IntelIOMMUState *s)
> >  {
> > @@ -197,12 +207,19 @@ static void vtd_reset_context_cache(IntelIOMMUState *s)
> >      s->context_cache_gen = 1;
> >  }
> >  
> > -static void vtd_reset_iotlb(IntelIOMMUState *s)
> > +static void vtd_reset_iotlb_locked(IntelIOMMUState *s)
> add the above comment and keep the original name?

I can add the comment; the original name is already used for another
function below.

> >  {
> >      assert(s->iotlb);
> >      g_hash_table_remove_all(s->iotlb);
> >  }
> >  
> > +static void vtd_reset_iotlb(IntelIOMMUState *s)

[1]

> > +{
> > +    vtd_iommu_lock(s);
> > +    vtd_reset_iotlb_locked(s);
> > +    vtd_iommu_unlock(s);
> > +}
> > +
> >  static uint64_t vtd_get_iotlb_key(uint64_t gfn, uint16_t source_id,
> >                                    uint32_t level)
> >  {
> > @@ -215,6 +232,7 @@ static uint64_t vtd_get_iotlb_gfn(hwaddr addr, uint32_t level)
> >      return (addr & vtd_slpt_level_page_mask(level)) >> VTD_PAGE_SHIFT_4K;
> >  }
> >  
> > +/* Must be with IOMMU lock held */
> >  static VTDIOTLBEntry *vtd_lookup_iotlb(IntelIOMMUState *s, uint16_t source_id,
> >                                         hwaddr addr)
> >  {
> > @@ -235,6 +253,7 @@ out:
> >      return entry;
> >  }
> >  
> > +/* Must be with IOMMU lock held */
> >  static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
> >                               uint16_t domain_id, hwaddr addr, uint64_t slpte,
> >                               uint8_t access_flags, uint32_t level)
> > @@ -246,7 +265,7 @@ static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
> >      trace_vtd_iotlb_page_update(source_id, addr, slpte, domain_id);
> >      if (g_hash_table_size(s->iotlb) >= VTD_IOTLB_MAX_SIZE) {
> >          trace_vtd_iotlb_reset("iotlb exceeds size limit");
> > -        vtd_reset_iotlb(s);
> > +        vtd_reset_iotlb_locked(s);
> >      }
> >  
> >      entry->gfn = gfn;
> > @@ -1106,7 +1125,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
> >      IntelIOMMUState *s = vtd_as->iommu_state;
> >      VTDContextEntry ce;
> >      uint8_t bus_num = pci_bus_num(bus);
> > -    VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
> > +    VTDContextCacheEntry *cc_entry;
> >      uint64_t slpte, page_mask;
> >      uint32_t level;
> >      uint16_t source_id = vtd_make_source_id(bus_num, devfn);
> > @@ -1123,6 +1142,10 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
> >       */
> >      assert(!vtd_is_interrupt_addr(addr));
> >  
> > +    vtd_iommu_lock(s);
> > +
> > +    cc_entry = &vtd_as->context_cache_entry;
> > +
> >      /* Try to fetch slpte form IOTLB */
> >      iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
> >      if (iotlb_entry) {
> > @@ -1182,7 +1205,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
> >           * IOMMU region can be swapped back.
> >           */
> >          vtd_pt_enable_fast_path(s, source_id);
> > -
> > +        vtd_iommu_unlock(s);
> >          return true;
> >      }
> >  
> > @@ -1203,6 +1226,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
> >      vtd_update_iotlb(s, source_id, VTD_CONTEXT_ENTRY_DID(ce.hi), addr, slpte,
> >                       access_flags, level);
> >  out:
> > +    vtd_iommu_unlock(s);
> >      entry->iova = addr & page_mask;
> >      entry->translated_addr = vtd_get_slpte_addr(slpte, s->aw_bits) & page_mask;
> >      entry->addr_mask = ~page_mask;
> > @@ -1210,6 +1234,7 @@ out:
> >      return true;
> >  
> >  error:
> > +    vtd_iommu_unlock(s);
> >      entry->iova = 0;
> >      entry->translated_addr = 0;
> >      entry->addr_mask = 0;
> > @@ -1258,10 +1283,13 @@ static void vtd_iommu_replay_all(IntelIOMMUState *s)
> >  static void vtd_context_global_invalidate(IntelIOMMUState *s)
> >  {
> >      trace_vtd_inv_desc_cc_global();
> > +    /* Protects context cache */
> > +    vtd_iommu_lock(s);
> >      s->context_cache_gen++;
> >      if (s->context_cache_gen == VTD_CONTEXT_CACHE_GEN_MAX) {
> >          vtd_reset_context_cache(s);
> >      }
> > +    vtd_iommu_unlock(s);
> >      vtd_switch_address_space_all(s);
> >      /*
> >       * From VT-d spec 6.5.2.1, a global context entry invalidation
> > @@ -1377,8 +1405,10 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
> >  
> >      trace_vtd_inv_desc_iotlb_domain(domain_id);
> >  
> > +    vtd_iommu_lock(s);
> >      g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_domain,
> >                                  &domain_id);
> > +    vtd_iommu_unlock(s);
> >  
> >      QLIST_FOREACH(vtd_as, &s->notifiers_list, next) {
> >          if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> > @@ -1426,7 +1456,9 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
> >      info.domain_id = domain_id;
> >      info.addr = addr;
> >      info.mask = ~((1 << am) - 1);
> > +    vtd_iommu_lock(s);
> >      g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
> > +    vtd_iommu_unlock(s);
> >      vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
> >  }
> >  
> > @@ -3072,6 +3104,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
> >      }
> >  
> >      QLIST_INIT(&s->notifiers_list);
> > +    qemu_mutex_init(&s->iommu_lock);
> >      memset(s->vtd_as_by_bus_num, 0, sizeof(s->vtd_as_by_bus_num));
> >      memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
> >                            "intel_iommu", DMAR_REG_SIZE);
> > 
> Don't you need to take the lock in vtd_context_device_invalidate() which
> manipulates the vtd_as->context_cache_entry.context_cache_gen?

Yes I think we need that.  Thanks,
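
A minimal sketch of what that fix could look like, assuming only the
per-device context cache generation update in
vtd_context_device_invalidate() needs protection (the actual change in
the next revision may differ):

    static void vtd_context_device_invalidate(IntelIOMMUState *s,
                                              uint16_t source_id,
                                              uint16_t func_mask)
    {
        ...
        /* Serialize against vtd_do_iommu_translate(), which reads the
         * per-device context cache entry without holding the BQL. */
        vtd_iommu_lock(s);
        vtd_as->context_cache_entry.context_cache_gen = 0;
        vtd_iommu_unlock(s);
        ...
    }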

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 04/10] intel-iommu: only do page walk for MAP notifiers
  2018-05-17 13:39   ` Auger Eric
@ 2018-05-18  5:53     ` Peter Xu
  2018-05-18  7:38       ` Auger Eric
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Xu @ 2018-05-18  5:53 UTC (permalink / raw)
  To: Auger Eric
  Cc: qemu-devel, Tian Kevin, Michael S . Tsirkin, Jason Wang,
	Alex Williamson, Jintack Lim

On Thu, May 17, 2018 at 03:39:50PM +0200, Auger Eric wrote:
> Hi Peter,
> 
> On 05/04/2018 05:08 AM, Peter Xu wrote:
> > For UNMAP-only IOMMU notifiers, we don't really need to walk the page
> s/really// ;-)

Ok.

> > tables.  Speed that procedure up by skipping the page table walk.  That
> > should boost performance for UNMAP-only notifiers like vhost.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  include/hw/i386/intel_iommu.h |  2 ++
> >  hw/i386/intel_iommu.c         | 43 +++++++++++++++++++++++++++++++----
> >  2 files changed, 40 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > index ee517704e7..9e0a6c1c6a 100644
> > --- a/include/hw/i386/intel_iommu.h
> > +++ b/include/hw/i386/intel_iommu.h
> > @@ -93,6 +93,8 @@ struct VTDAddressSpace {
> >      IntelIOMMUState *iommu_state;
> >      VTDContextCacheEntry context_cache_entry;
> >      QLIST_ENTRY(VTDAddressSpace) next;
> > +    /* Superset of notifier flags that this address space has */
> > +    IOMMUNotifierFlag notifier_flags;
> >  };
> >  
> >  struct VTDBus {
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index 112971638d..9a418abfb6 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -138,6 +138,12 @@ static inline void vtd_iommu_unlock(IntelIOMMUState *s)
> >      qemu_mutex_unlock(&s->iommu_lock);
> >  }
> >  
> > +/* Whether the address space needs to notify new mappings */
> > +static inline gboolean vtd_as_notify_mappings(VTDAddressSpace *as)
> would suggest vtd_as_has_map_notifier()? But tastes & colours ;-)

Yeah it is.  But okay, I can switch to that, especially since it's only
used in this patch and it's new.

> > +{
> > +    return as->notifier_flags & IOMMU_NOTIFIER_MAP;
> > +}
> > +
> >  /* GHashTable functions */
> >  static gboolean vtd_uint64_equal(gconstpointer v1, gconstpointer v2)
> >  {
> > @@ -1433,14 +1439,35 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
> >      VTDAddressSpace *vtd_as;
> >      VTDContextEntry ce;
> >      int ret;
> > +    hwaddr size = (1 << am) * VTD_PAGE_SIZE;
> >  
> >      QLIST_FOREACH(vtd_as, &(s->notifiers_list), next) {
> >          ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> >                                         vtd_as->devfn, &ce);
> >          if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> > -            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
> > -                          vtd_page_invalidate_notify_hook,
> > -                          (void *)&vtd_as->iommu, true, s->aw_bits);
> > +            if (vtd_as_notify_mappings(vtd_as)) {
> > +                /*
> > +                 * For MAP-inclusive notifiers, we need to walk the
> > +                 * page table to sync the shadow page table.
> > +                 */
> Potentially we may have several notifiers attached to the IOMMU MR ~
> vtd_as, each of them having different flags. Those flags are OR'ed in
> memory_region_update_iommu_notify_flags and this is the one you now
> store in the vtd_as. So maybe your comment should rather state:
> as soon as we have at least one MAP notifier, we need to do the PTW?

Actually this is not 100% clear either, since all the "MAP notifiers" are
actually both MAP+UNMAP notifiers...  Maybe:

  As long as we have MAP notifications registered in any of our IOMMU
  notifiers, we need to sync the shadow page table.

> 
> nit: not related to this patch: vtd_page_walk's kerneldoc comment misses
> the @notify_unmap param comment
> side note: the name of the hook is a bit misleading as it suggests we
> invalidate the entry, whereas we update any valid entry and invalidate
> stale ones (if notify_unmap=true)?
> > +                vtd_page_walk(&ce, addr, addr + size,
> > +                              vtd_page_invalidate_notify_hook,
> > +                              (void *)&vtd_as->iommu, true, s->aw_bits);
> > +            } else {
> > +                /*
> > +                 * For UNMAP-only notifiers, we don't need to walk the
> > +                 * page tables.  We just deliver the PSI down to
> > +                 * invalidate caches.
> 
> We just unmap the range?

Isn't it the same thing? :)

To be explicit: here we know only UNMAP notifications were registered,
so it's not really an "unmap"; it's really only a cache invalidation.
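
For reference, this is roughly how an UNMAP-only consumer treats such a
notification (a sketch modeled on vhost's notifier; the backend call
name is hypothetical, not the exact QEMU API):

    /* An UNMAP-only notifier never builds mappings; it only forwards
     * the invalidated IOVA range to the backend IOTLB. */
    static void unmap_only_notify(IOMMUNotifier *n, IOMMUTLBEntry *entry)
    {
        /* UNMAP events always carry perm == IOMMU_NONE */
        assert(entry->perm == IOMMU_NONE);
        backend_invalidate_iotlb(entry->iova, entry->addr_mask + 1);
    }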

> > +                 */
> > +                IOMMUTLBEntry entry = {
> > +                    .target_as = &address_space_memory,
> > +                    .iova = addr,
> > +                    .translated_addr = 0,
> > +                    .addr_mask = size - 1,
> > +                    .perm = IOMMU_NONE,
> > +                };
> > +                memory_region_notify_iommu(&vtd_as->iommu, entry);
> > +            }
> >          }
> >      }
> >  }
> > @@ -2380,6 +2407,9 @@ static void vtd_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
> >          exit(1);
> >      }
> >  
> > +    /* Update per-address-space notifier flags */
> > +    vtd_as->notifier_flags = new;
> > +
> >      if (old == IOMMU_NOTIFIER_NONE) {
> >          /* Insert new ones */
> >          QLIST_INSERT_HEAD(&s->notifiers_list, vtd_as, next);
> > @@ -2890,8 +2920,11 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> >                                    PCI_FUNC(vtd_as->devfn),
> >                                    VTD_CONTEXT_ENTRY_DID(ce.hi),
> >                                    ce.hi, ce.lo);
> > -        vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
> > -                      s->aw_bits);
> > +        if (vtd_as_notify_mappings(vtd_as)) {
> > +            /* This is required only for MAP typed notifiers */
> > +            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
> > +                          s->aw_bits);
> > +        }
> >      } else {
> >          trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
> >                                      PCI_FUNC(vtd_as->devfn));
> > 
> A worthwhile improvement indeed!

I hope so. :) Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 05/10] intel-iommu: introduce vtd_page_walk_info
  2018-05-17 14:32   ` Auger Eric
@ 2018-05-18  5:59     ` Peter Xu
  2018-05-18  7:24       ` Auger Eric
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Xu @ 2018-05-18  5:59 UTC (permalink / raw)
  To: Auger Eric
  Cc: qemu-devel, Tian Kevin, Michael S . Tsirkin, Jason Wang,
	Alex Williamson, Jintack Lim

On Thu, May 17, 2018 at 04:32:58PM +0200, Auger Eric wrote:
> Hi Peter,
> 
> On 05/04/2018 05:08 AM, Peter Xu wrote:
> > During the recursive page walking of IOVA page tables, some stack
> > variables are constant and never change during the whole page
> > walking procedure.  Isolate them into a struct so that we don't need
> > to pass those constants down the stack multiple times.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks for the r-b, but this patch has changed in version 3.  Please
feel free to have a look at that version.

(For all the review comments on version 2, I adopted them where the
 contents are still there in version 3.)

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 06/10] intel-iommu: pass in address space when page walk
  2018-05-17 14:32   ` Auger Eric
@ 2018-05-18  6:02     ` Peter Xu
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Xu @ 2018-05-18  6:02 UTC (permalink / raw)
  To: Auger Eric
  Cc: qemu-devel, Tian Kevin, Michael S . Tsirkin, Jason Wang,
	Alex Williamson, Jintack Lim

On Thu, May 17, 2018 at 04:32:55PM +0200, Auger Eric wrote:
> Hi Peter,
> 
> On 05/04/2018 05:08 AM, Peter Xu wrote:
> > We pass in the VTDAddressSpace to replace the aw bits when doing page
> > walk.  The VTDAddressSpace contains the aw bits information, meanwhile
> > we'll need to do something more in the follow up patches regarding to
> > the address spaces.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>

This one has also changed in version 3.  Please have a look there too.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 09/10] intel-iommu: don't unmap all for shadow page table
  2018-05-17 17:23   ` Auger Eric
@ 2018-05-18  6:06     ` Peter Xu
  2018-05-18  7:31       ` Auger Eric
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Xu @ 2018-05-18  6:06 UTC (permalink / raw)
  To: Auger Eric
  Cc: qemu-devel, Tian Kevin, Michael S . Tsirkin, Jason Wang,
	Alex Williamson, Jintack Lim

On Thu, May 17, 2018 at 07:23:33PM +0200, Auger Eric wrote:
> Hi Peter,
> 
> On 05/04/2018 05:08 AM, Peter Xu wrote:
> > IOMMU replay was carried out before in many use cases, e.g., context
> > cache invalidations, domain flushes.  We used this mechanism to sync the
> > shadow page table by firstly (1) unmap the whole address space, then
> > (2) walk the page table to remap what's in the table.
> > 
> > This is very dangerous.
> > 
> > The problem is that we'll have a very small window (in my measurement,
> > it can be about 3ms) during steps (1) and (2) above in which the device
> > will see no (or an incomplete) page table.  However, the device never
> > knows that.  This can cause DMA errors on devices, which assume the
> > page table is always there.
> But now that you have the QemuMutex, can a translate and an
> invalidation occur concurrently? Don't the IOTLB flush and replay
> happen while the lock is held?

Note that when we are using vfio-pci devices we can't really know when
the device starts a DMA, since that happens entirely between the host
IOMMU and the hardware.  That is, vfio-pci page translation happens in
the shadow page table, not in QEMU.  So IMO we aren't protected by
anything.

Also, this patch was dropped in version 3, and I did something else to
achieve a similar goal (by introducing the shadow page sync helper, and
then for DSIs we'll use that instead of calling IOMMU replay here).
Please have a look.  Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 05/10] intel-iommu: introduce vtd_page_walk_info
  2018-05-18  5:59     ` Peter Xu
@ 2018-05-18  7:24       ` Auger Eric
  0 siblings, 0 replies; 40+ messages in thread
From: Auger Eric @ 2018-05-18  7:24 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Tian Kevin, Michael S . Tsirkin, Jason Wang,
	Alex Williamson, Jintack Lim

Hi Peter,
On 05/18/2018 07:59 AM, Peter Xu wrote:
> On Thu, May 17, 2018 at 04:32:58PM +0200, Auger Eric wrote:
>> Hi Peter,
>>
>> On 05/04/2018 05:08 AM, Peter Xu wrote:
>>> During the recursive page walking of IOVA page tables, some stack
>>> variables are constant and never change during the whole page
>>> walking procedure.  Isolate them into a struct so that we don't need
>>> to pass those constants down the stack multiple times.
>>>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> 
> Thanks for the r-b, but this patch has changed in version 3.  Please
> feel free to have a look at that version.

Sure. My apologies, I missed your v3. Looking at it now.

Thanks

Eric
> 
> (For all the review comments on version 2, I adopted them where the
>  contents are still there in version 3.)
> 
> Regards,
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 09/10] intel-iommu: don't unmap all for shadow page table
  2018-05-18  6:06     ` Peter Xu
@ 2018-05-18  7:31       ` Auger Eric
  0 siblings, 0 replies; 40+ messages in thread
From: Auger Eric @ 2018-05-18  7:31 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Tian Kevin, Michael S . Tsirkin, Jason Wang,
	Alex Williamson, Jintack Lim

Hi Peter,

On 05/18/2018 08:06 AM, Peter Xu wrote:
> On Thu, May 17, 2018 at 07:23:33PM +0200, Auger Eric wrote:
>> Hi Peter,
>>
>> On 05/04/2018 05:08 AM, Peter Xu wrote:
>>> IOMMU replay was carried out before in many use cases, e.g., context
>>> cache invalidations, domain flushes.  We used this mechanism to sync the
>>> shadow page table by first (1) unmapping the whole address space, then
>>> (2) walking the page table to remap what's in the table.
>>>
>>> This is very dangerous.
>>>
>>> The problem is that we'll have a very small window (in my measurement,
>>> it can be about 3ms) during steps (1) and (2) above in which the device
>>> will see no (or an incomplete) page table.  However, the device never
>>> knows that.  This can cause DMA errors on devices, which assume the
>>> page table is always there.
>> But now that you have the QemuMutex, can a translate and an
>> invalidation occur concurrently? Don't the IOTLB flush and replay
>> happen while the lock is held?
> 
> Note that when we are using vfio-pci devices we can't really know when
> the device starts a DMA, since that happens entirely between the host
> IOMMU and the hardware.

Oh yes, that's fully relevant in the vfio-pci use case. Thank you for
the clarification.

> That is, vfio-pci page translation happens in the shadow page table,
> not in QEMU.  So IMO we aren't protected by anything.
> 
> Also, this patch was dropped in version 3, and I did something else to
> achieve a similar goal (by introducing the shadow page sync helper, and
> then for DSIs we'll use that instead of calling IOMMU replay here).
> Please have a look.  Thanks,

OK

Thanks

Eric
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 04/10] intel-iommu: only do page walk for MAP notifiers
  2018-05-18  5:53     ` Peter Xu
@ 2018-05-18  7:38       ` Auger Eric
  2018-05-18 10:02         ` Peter Xu
  0 siblings, 1 reply; 40+ messages in thread
From: Auger Eric @ 2018-05-18  7:38 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Tian Kevin, Michael S . Tsirkin, Jason Wang,
	Alex Williamson, Jintack Lim

Hi Peter,

On 05/18/2018 07:53 AM, Peter Xu wrote:
> On Thu, May 17, 2018 at 03:39:50PM +0200, Auger Eric wrote:
>> Hi Peter,
>>
>> On 05/04/2018 05:08 AM, Peter Xu wrote:
>>> For UNMAP-only IOMMU notifiers, we don't really need to walk the page
>> s/really// ;-)
> 
> Ok.
> 
>>> tables.  Speed up that procedure by skipping the page table walk.  That
>>> should boost performance for UNMAP-only notifiers like vhost.
>>>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>>  include/hw/i386/intel_iommu.h |  2 ++
>>>  hw/i386/intel_iommu.c         | 43 +++++++++++++++++++++++++++++++----
>>>  2 files changed, 40 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
>>> index ee517704e7..9e0a6c1c6a 100644
>>> --- a/include/hw/i386/intel_iommu.h
>>> +++ b/include/hw/i386/intel_iommu.h
>>> @@ -93,6 +93,8 @@ struct VTDAddressSpace {
>>>      IntelIOMMUState *iommu_state;
>>>      VTDContextCacheEntry context_cache_entry;
>>>      QLIST_ENTRY(VTDAddressSpace) next;
>>> +    /* Superset of notifier flags that this address space has */
>>> +    IOMMUNotifierFlag notifier_flags;
>>>  };
>>>  
>>>  struct VTDBus {
>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>> index 112971638d..9a418abfb6 100644
>>> --- a/hw/i386/intel_iommu.c
>>> +++ b/hw/i386/intel_iommu.c
>>> @@ -138,6 +138,12 @@ static inline void vtd_iommu_unlock(IntelIOMMUState *s)
>>>      qemu_mutex_unlock(&s->iommu_lock);
>>>  }
>>>  
>>> +/* Whether the address space needs to notify new mappings */
>>> +static inline gboolean vtd_as_notify_mappings(VTDAddressSpace *as)
>> would suggest vtd_as_has_map_notifier()? But tastes & colours ;-)
> 
> Yeah, it is.  But okay, I can switch to that, especially since it's
> only used in this patch and it's new.
> 
>>> +{
>>> +    return as->notifier_flags & IOMMU_NOTIFIER_MAP;
>>> +}
>>> +
>>>  /* GHashTable functions */
>>>  static gboolean vtd_uint64_equal(gconstpointer v1, gconstpointer v2)
>>>  {
>>> @@ -1433,14 +1439,35 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
>>>      VTDAddressSpace *vtd_as;
>>>      VTDContextEntry ce;
>>>      int ret;
>>> +    hwaddr size = (1 << am) * VTD_PAGE_SIZE;
>>>  
>>>      QLIST_FOREACH(vtd_as, &(s->notifiers_list), next) {
>>>          ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>>>                                         vtd_as->devfn, &ce);
>>>          if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
>>> -            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
>>> -                          vtd_page_invalidate_notify_hook,
>>> -                          (void *)&vtd_as->iommu, true, s->aw_bits);
>>> +            if (vtd_as_notify_mappings(vtd_as)) {
>>> +                /*
>>> +                 * For MAP-inclusive notifiers, we need to walk the
>>> +                 * page table to sync the shadow page table.
>>> +                 */
>> Potentially we may have several notifiers attached to the IOMMU MR ~
>> vtd_as, each of them having different flags. Those flags are OR'ed in
>> memory_region_update_iommu_notify_flags and this is the one you now
>> store in the vtd_as. So maybe your comment should rather state:
>> as soon as we have at least one MAP notifier, we need to do the PTW?
> 
> > Actually this is not 100% clear either, since all the "MAP notifiers" are
> actually both MAP+UNMAP notifiers...  Maybe:

Can't the IOMMU_NOTIFIER_MAP flag be used without
IOMMU_NOTIFIER_UNMAP? I don't see such a restriction in the
memory_region_register_iommu_notifier API.
> 
>   As long as we have MAP notifications registered in any of our IOMMU
>   notifiers, we need to sync the shadow page table.
> 
>>
>> nit: not related to this patch: vtd_page_walk's kerneldoc comment misses
>> the @notify_unmap param comment
>> side note: the name of the hook is a bit misleading as it suggests we
>> invalidate the entry, whereas we update any valid entry and invalidate
>> stale ones (if notify_unmap=true)?
>>> +                vtd_page_walk(&ce, addr, addr + size,
>>> +                              vtd_page_invalidate_notify_hook,
>>> +                              (void *)&vtd_as->iommu, true, s->aw_bits);
>>> +            } else {
>>> +                /*
>>> +                 * For UNMAP-only notifiers, we don't need to walk the
>>> +                 * page tables.  We just deliver the PSI down to
>>> +                 * invalidate caches.
>>
>> We just unmap the range?
> 
> Isn't it the same thing? :)
> 
> To be explicit: here we know only UNMAP notifications were registered,
> so it's not really an "unmap"; it's really only a cache invalidation.
Yes, you're right; I meant "We just invalidate the range in the cache".
The sentence "We just deliver the PSI down to invalidate caches." was
not crystal clear to me at first reading.

Thanks

Eric
> 
>>> +                 */
>>> +                IOMMUTLBEntry entry = {
>>> +                    .target_as = &address_space_memory,
>>> +                    .iova = addr,
>>> +                    .translated_addr = 0,
>>> +                    .addr_mask = size - 1,
>>> +                    .perm = IOMMU_NONE,
>>> +                };
>>> +                memory_region_notify_iommu(&vtd_as->iommu, entry);
>>> +            }
>>>          }
>>>      }
>>>  }
>>> @@ -2380,6 +2407,9 @@ static void vtd_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
>>>          exit(1);
>>>      }
>>>  
>>> +    /* Update per-address-space notifier flags */
>>> +    vtd_as->notifier_flags = new;
>>> +
>>>      if (old == IOMMU_NOTIFIER_NONE) {
>>>          /* Insert new ones */
>>>          QLIST_INSERT_HEAD(&s->notifiers_list, vtd_as, next);
>>> @@ -2890,8 +2920,11 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>>>                                    PCI_FUNC(vtd_as->devfn),
>>>                                    VTD_CONTEXT_ENTRY_DID(ce.hi),
>>>                                    ce.hi, ce.lo);
>>> -        vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
>>> -                      s->aw_bits);
>>> +        if (vtd_as_notify_mappings(vtd_as)) {
>>> +            /* This is required only for MAP typed notifiers */
>>> +            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
>>> +                          s->aw_bits);
>>> +        }
>>>      } else {
>>>          trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
>>>                                      PCI_FUNC(vtd_as->devfn));
>>>
>> A worthwhile improvement indeed!
> 
> I hope so. :) Thanks,
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 01/10] intel-iommu: send PSI always even if across PDEs
  2018-05-18  3:41     ` Peter Xu
@ 2018-05-18  7:39       ` Auger Eric
  0 siblings, 0 replies; 40+ messages in thread
From: Auger Eric @ 2018-05-18  7:39 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Tian Kevin, Michael S . Tsirkin, Jason Wang,
	Alex Williamson, Jintack Lim

Hi,

On 05/18/2018 05:41 AM, Peter Xu wrote:
> On Thu, May 17, 2018 at 04:42:54PM +0200, Auger Eric wrote:
>> Hi Peter,
>>
>> On 05/04/2018 05:08 AM, Peter Xu wrote:
>>> During IOVA page table walking, there is a special case when the PSI
>>> covers one whole PDE (Page Directory Entry, which contains 512 Page
>>> Table Entries) or more.  Previously, we skipped that entry and did not
>>> notify the IOMMU notifiers.  This is not correct.  We should always
>>> send an UNMAP notification to registered UNMAP notifiers in this case.
>>>
>>> For UNMAP-only notifiers, this might leave IOTLB entries cached in
>>> the devices even after they are already invalid.  For MAP/UNMAP notifiers like
>>> vfio-pci, this will cause stale page mappings.
>>>
>>> This special case doesn't trigger often, but it is easily triggered
>>> by nested device assignments, since in that case we'll
>>> possibly map the whole L2 guest RAM region into the device's IOVA
>>> address space (several GBs at least), which is far bigger than normal
>>> kernel driver usages of the device (tens of MBs normally).
>>>
>>> Without this patch applied to L1 QEMU, nested device assignment to L2
>>> guests will dump some errors like:
>>>
>>> qemu-system-x86_64: VFIO_MAP_DMA: -17
>>> qemu-system-x86_64: vfio_dma_map(0x557305420c30, 0xad000, 0x1000,
>>>                     0x7f89a920d000) = -17 (File exists)
>>>
>>> Acked-by: Jason Wang <jasowang@redhat.com>
>>> [peterx: rewrite the commit message]
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>>  hw/i386/intel_iommu.c | 42 ++++++++++++++++++++++++++++++------------
>>>  1 file changed, 30 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>> index fb31de9416..b359efd6f9 100644
>>> --- a/hw/i386/intel_iommu.c
>>> +++ b/hw/i386/intel_iommu.c
>>> @@ -722,6 +722,15 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
>>>  
>>>  typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
>>>  
>>> +static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
>>> +                             vtd_page_walk_hook hook_fn, void *private)
>> I find the function name a bit weird as it does not do a PTW but
>> rather calls a callback on an entry. vtd_callback_wrapper?
> 
> It's a hook for the page walk process, and IMHO vtd_callback_wrapper
> does not really provide any hint for the page walking.  So even if you
> prefer the "callback_wrapper" naming I would still prefer:
> 
>   vtd_page_walk_callback[_wrapper]
> 
> though if so I'd say I don't see much benefit compared to keeping the
> old vtd_page_walk_hook, which seems fine to me too...
I preferred vtd_page_walk_hook too.

Thanks

Eric
> 
>>> +{
>>> +    assert(hook_fn);
>>> +    trace_vtd_page_walk_one(level, entry->iova, entry->translated_addr,
>>> +                            entry->addr_mask, entry->perm);
>>> +    return hook_fn(entry, private);
>>> +}
>>> +
>>>  /**
>>>   * vtd_page_walk_level - walk over specific level for IOVA range
>>>   *
>>> @@ -781,28 +790,37 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
>>>           */
>>>          entry_valid = read_cur | write_cur;
>>>  
>>> +        entry.target_as = &address_space_memory;
>>> +        entry.iova = iova & subpage_mask;
>>> +        entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
>>> +        entry.addr_mask = ~subpage_mask;
>>> +
>>>          if (vtd_is_last_slpte(slpte, level)) {
>>> -            entry.target_as = &address_space_memory;
>>> -            entry.iova = iova & subpage_mask;
>>>              /* NOTE: this is only meaningful if entry_valid == true */
>>>              entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
>>> -            entry.addr_mask = ~subpage_mask;
>>> -            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
>>>              if (!entry_valid && !notify_unmap) {
>>>                  trace_vtd_page_walk_skip_perm(iova, iova_next);
>>>                  goto next;
>>>              }
>>> -            trace_vtd_page_walk_one(level, entry.iova, entry.translated_addr,
>>> -                                    entry.addr_mask, entry.perm);
>>> -            if (hook_fn) {
>>> -                ret = hook_fn(&entry, private);
>>> -                if (ret < 0) {
>>> -                    return ret;
>>> -                }
>>> +            ret = vtd_page_walk_one(&entry, level, hook_fn, private);
>>> +            if (ret < 0) {
>>> +                return ret;
>>>              }
>>>          } else {
>>>              if (!entry_valid) {
>>> -                trace_vtd_page_walk_skip_perm(iova, iova_next);
>>> +                if (notify_unmap) {
>>> +                    /*
>>> +                     * The whole entry is invalid; unmap it all.
>>> +                     * Translated address is meaningless, zero it.
>>> +                     */
>>> +                    entry.translated_addr = 0x0;
>> do you really need to zero the translated_addr and add the related
>> comment?  As soon as perm is NONE this should not be used.
> 
> Yes, here we can avoid setting it.  However, zeroing makes sure we don't
> leak strange numbers to the notifiers below, so I would still slightly
> prefer to zero it here.
> 
>>> +                    ret = vtd_page_walk_one(&entry, level, hook_fn, private);
>>> +                    if (ret < 0) {
>>> +                        return ret;
>>> +                    }
>>> +                } else {
>>> +                    trace_vtd_page_walk_skip_perm(iova, iova_next);
>>> +                }
>>>                  goto next;
>>>              }
>>>              ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, aw), iova,
>>>
>>
>> Besides
>> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> 
> Thanks for reviewing the series.
> 
> Note that this v2 is obsolete; please feel free to read version 3 of
> this series, which contains quite a lot of functional changes.
> 
> Regards,
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v2 04/10] intel-iommu: only do page walk for MAP notifiers
  2018-05-18  7:38       ` Auger Eric
@ 2018-05-18 10:02         ` Peter Xu
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Xu @ 2018-05-18 10:02 UTC (permalink / raw)
  To: Auger Eric
  Cc: qemu-devel, Tian Kevin, Michael S . Tsirkin, Jason Wang,
	Alex Williamson, Jintack Lim

On Fri, May 18, 2018 at 09:38:07AM +0200, Auger Eric wrote:
> Hi Peter,
> 
> On 05/18/2018 07:53 AM, Peter Xu wrote:
> > On Thu, May 17, 2018 at 03:39:50PM +0200, Auger Eric wrote:
> >> Hi Peter,
> >>
> >> On 05/04/2018 05:08 AM, Peter Xu wrote:
> >>> For UNMAP-only IOMMU notifiers, we don't really need to walk the page
> >> s/really// ;-)
> > 
> > Ok.
> > 
> >>> tables.  Speed up that procedure by skipping the page table walk.  That
> >>> should boost performance for UNMAP-only notifiers like vhost.
> >>>
> >>> Signed-off-by: Peter Xu <peterx@redhat.com>
> >>> ---
> >>>  include/hw/i386/intel_iommu.h |  2 ++
> >>>  hw/i386/intel_iommu.c         | 43 +++++++++++++++++++++++++++++++----
> >>>  2 files changed, 40 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> >>> index ee517704e7..9e0a6c1c6a 100644
> >>> --- a/include/hw/i386/intel_iommu.h
> >>> +++ b/include/hw/i386/intel_iommu.h
> >>> @@ -93,6 +93,8 @@ struct VTDAddressSpace {
> >>>      IntelIOMMUState *iommu_state;
> >>>      VTDContextCacheEntry context_cache_entry;
> >>>      QLIST_ENTRY(VTDAddressSpace) next;
> >>> +    /* Superset of notifier flags that this address space has */
> >>> +    IOMMUNotifierFlag notifier_flags;
> >>>  };
> >>>  
> >>>  struct VTDBus {
> >>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> >>> index 112971638d..9a418abfb6 100644
> >>> --- a/hw/i386/intel_iommu.c
> >>> +++ b/hw/i386/intel_iommu.c
> >>> @@ -138,6 +138,12 @@ static inline void vtd_iommu_unlock(IntelIOMMUState *s)
> >>>      qemu_mutex_unlock(&s->iommu_lock);
> >>>  }
> >>>  
> >>> +/* Whether the address space needs to notify new mappings */
> >>> +static inline gboolean vtd_as_notify_mappings(VTDAddressSpace *as)
> >> would suggest vtd_as_has_map_notifier()? But tastes & colours ;-)
> > 
> > Yeah, it is.  But okay, I can switch to that, especially since it's
> > only used in this patch and it's new.
> > 
> >>> +{
> >>> +    return as->notifier_flags & IOMMU_NOTIFIER_MAP;
> >>> +}
> >>> +
> >>>  /* GHashTable functions */
> >>>  static gboolean vtd_uint64_equal(gconstpointer v1, gconstpointer v2)
> >>>  {
> >>> @@ -1433,14 +1439,35 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
> >>>      VTDAddressSpace *vtd_as;
> >>>      VTDContextEntry ce;
> >>>      int ret;
> >>> +    hwaddr size = (1 << am) * VTD_PAGE_SIZE;
> >>>  
> >>>      QLIST_FOREACH(vtd_as, &(s->notifiers_list), next) {
> >>>          ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> >>>                                         vtd_as->devfn, &ce);
> >>>          if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> >>> -            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
> >>> -                          vtd_page_invalidate_notify_hook,
> >>> -                          (void *)&vtd_as->iommu, true, s->aw_bits);
> >>> +            if (vtd_as_notify_mappings(vtd_as)) {
> >>> +                /*
> >>> +                 * For MAP-inclusive notifiers, we need to walk the
> >>> +                 * page table to sync the shadow page table.
> >>> +                 */
> >> Potentially we may have several notifiers attached to the IOMMU MR ~
> >> vtd_as, each of them having different flags. Those flags are OR'ed in
> >> memory_region_update_iommu_notify_flags and this is the one you now
> >> store in the vtd_as. So maybe your comment should rather state:
> >> as soon as we have at least one MAP notifier, we need to do the PTW?
> > 
> > > Actually this is not 100% clear either, since all the "MAP notifiers" are
> > actually both MAP+UNMAP notifiers...  Maybe:
> 
> Can't the IOMMU_NOTIFIER_MAP flag be used without
> IOMMU_NOTIFIER_UNMAP? I don't see such a restriction in the
> memory_region_register_iommu_notifier API.

Yes, per the API it can, but I can hardly think of a use case for
that.
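
For completeness, a registration sketch (using the notifier API as of
this writing, as I read it; map_notify is a hypothetical callback):

    IOMMUNotifier n;

    /* Legal per the API, though without an obvious user: MAP-only. */
    iommu_notifier_init(&n, map_notify, IOMMU_NOTIFIER_MAP,
                        0, HWADDR_MAX);
    memory_region_register_iommu_notifier(MEMORY_REGION(&vtd_as->iommu),
                                          &n);

    /* In practice, vfio registers IOMMU_NOTIFIER_ALL (MAP | UNMAP)
     * and vhost registers IOMMU_NOTIFIER_UNMAP only. */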

> > 
> >   As long as we have MAP notifications registered in any of our IOMMU
> >   notifiers, we need to sync the shadow page table.
> > 
> >>
> >> nit: not related to this patch: vtd_page_walk's kerneldoc comment misses
> >> the @notify_unmap param comment
> >> side note: the name of the hook is a bit misleading as it suggests we
> >> invalidate the entry, whereas we update any valid entry and invalidate
> >> stale ones (if notify_unmap=true)?
> >>> +                vtd_page_walk(&ce, addr, addr + size,
> >>> +                              vtd_page_invalidate_notify_hook,
> >>> +                              (void *)&vtd_as->iommu, true, s->aw_bits);
> >>> +            } else {
> >>> +                /*
> >>> +                 * For UNMAP-only notifiers, we don't need to walk the
> >>> +                 * page tables.  We just deliver the PSI down to
> >>> +                 * invalidate caches.
> >>
> >> We just unmap the range?
> > 
> > Isn't it the same thing? :)
> > 
> > To be explicit: here we know only UNMAP notifications were registered,
> > so it's not really an "unmap"; it's really only a cache invalidation.
> Yes, you're right; I meant "We just invalidate the range in the cache".
> The sentence "We just deliver the PSI down to invalidate caches." was
> not crystal clear to me at first reading.

Okay.  I just posted a new version; please feel free to comment again
if you have better suggestions.  Otherwise I'll just keep the comment
for now.  Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2018-05-18 10:03 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-04  3:08 [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes Peter Xu
2018-05-04  3:08 ` [Qemu-devel] [PATCH v2 01/10] intel-iommu: send PSI always even if across PDEs Peter Xu
2018-05-17 14:42   ` Auger Eric
2018-05-18  3:41     ` Peter Xu
2018-05-18  7:39       ` Auger Eric
2018-05-04  3:08 ` [Qemu-devel] [PATCH v2 02/10] intel-iommu: remove IntelIOMMUNotifierNode Peter Xu
2018-05-17  9:46   ` Auger Eric
2018-05-17 10:02     ` Peter Xu
2018-05-17 10:10       ` Auger Eric
2018-05-17 10:14         ` Peter Xu
2018-05-04  3:08 ` [Qemu-devel] [PATCH v2 03/10] intel-iommu: add iommu lock Peter Xu
2018-05-17 14:32   ` Auger Eric
2018-05-18  5:32     ` Peter Xu
2018-05-04  3:08 ` [Qemu-devel] [PATCH v2 04/10] intel-iommu: only do page walk for MAP notifiers Peter Xu
2018-05-17 13:39   ` Auger Eric
2018-05-18  5:53     ` Peter Xu
2018-05-18  7:38       ` Auger Eric
2018-05-18 10:02         ` Peter Xu
2018-05-04  3:08 ` [Qemu-devel] [PATCH v2 05/10] intel-iommu: introduce vtd_page_walk_info Peter Xu
2018-05-17 14:32   ` Auger Eric
2018-05-18  5:59     ` Peter Xu
2018-05-18  7:24       ` Auger Eric
2018-05-04  3:08 ` [Qemu-devel] [PATCH v2 06/10] intel-iommu: pass in address space when page walk Peter Xu
2018-05-17 14:32   ` Auger Eric
2018-05-18  6:02     ` Peter Xu
2018-05-04  3:08 ` [Qemu-devel] [PATCH v2 07/10] util: implement simple interval tree logic Peter Xu
2018-05-04  3:08 ` [Qemu-devel] [PATCH v2 08/10] intel-iommu: maintain per-device iova ranges Peter Xu
2018-05-04  3:08 ` [Qemu-devel] [PATCH v2 09/10] intel-iommu: don't unmap all for shadow page table Peter Xu
2018-05-17 17:23   ` Auger Eric
2018-05-18  6:06     ` Peter Xu
2018-05-18  7:31       ` Auger Eric
2018-05-04  3:08 ` [Qemu-devel] [PATCH v2 10/10] intel-iommu: remove notify_unmap for page walk Peter Xu
2018-05-04  3:20 ` [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes no-reply
2018-05-04  3:40   ` Peter Xu
2018-05-08  7:29 ` [Qemu-devel] [PATCH v2 11/10] tests: add interval tree unit test Peter Xu
2018-05-16  6:30 ` [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes Peter Xu
2018-05-16 13:57   ` Jason Wang
2018-05-17  2:45     ` Peter Xu
2018-05-17  3:39       ` Alex Williamson
2018-05-17  4:16         ` Peter Xu
