* [Qemu-devel] [PATCH v3 00/12] intel-iommu: nested vIOMMU, cleanups, bug fixes
@ 2018-05-17  8:59 Peter Xu
From: Peter Xu @ 2018-05-17  8:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	peterx, Jason Wang

(Hello Jintack, feel free to test this branch again against your scp
 error case when you get some free time.)

I rewrote some of the patches in V3.  Major changes:

- Dropped the mergeable interval tree; introduced an IOVA tree instead,
  which is even simpler.

- Fixed the scp error issue that Jintack reported.  Please see the
  patches for detailed information; that is the major reason a few of
  the patches were rewritten.  Using replay for domain flushes was
  possibly incorrect in the past: IOMMU replay is "defined" as "only
  send MAP when a new page is detected", while shadow page syncing
  actually needs something different.  So in this version I started to
  use a new vtd_sync_shadow_page_table() helper to do the page sync.

- Some other refinements after the refactoring.

I'll add a unit test for the IOVA tree after this series is merged, to
make sure we won't switch to yet another tree implementation...
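
As a reference for what that test would exercise, here is a minimal
usage sketch of the API introduced in patch 8 (the addresses and sizes
below are made up purely for illustration):

    #include "qemu/osdep.h"
    #include "qemu/iova-tree.h"

    static void iova_tree_example(void)
    {
        IOVATree *tree = iova_tree_new();
        DMAMap map = {
            .iova = 0x10000,
            .translated_addr = 0x40000,
            .size = 0xfff,              /* size is inclusive: one 4K page */
            .perm = IOMMU_RW,
        };

        /* Insert one mapping; an overlap would return IOVA_ERR_OVERLAP */
        g_assert(iova_tree_insert(tree, &map) == IOVA_OK);

        /* Any address inside the range finds the internally kept DMAMap */
        g_assert(iova_tree_find_address(tree, 0x10800) != NULL);

        /* Remove everything covered by the range, then free the tree */
        iova_tree_remove(tree, &map);
        iova_tree_destroy(tree);
    }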

The element size in the new IOVA tree should be around
sizeof(GTreeNode + IOMMUTLBEntry) ~= (5*8+4*8) = 72 bytes.  So the
worst-case overhead ratio would be 72/4K ~= 2%, which still seems
acceptable (it means an 8G L2 guest will use roughly 8G*2% = 160MB of
metadata in QEMU to maintain the mappings).

I did explicit tests with scp this time, copying a 1G file more than
10 times in each of the following cases:

- L1 guest, with vIOMMU and with assigned device
- L2 guest, without vIOMMU and with assigned device
- L2 guest, with vIOMMU (so 3-layer nested IOMMU) and with assigned device

Please review.  Thanks,

(Below is old content from the previous cover letter)

==========================

v2:
- fix patchew code style warnings
- interval tree: postpone malloc when inserting; simplify node removal
  a bit where appropriate [Jason]
- fix up comment and commit message for iommu lock patch [Kevin]
- protect context cache too using the iommu lock [Kevin, Jason]
- add a detailed comment in patch 8 to explain the modify-PTE problem
  [Jason, Kevin]

Online repo:

  https://github.com/xzpeter/qemu/tree/fix-vtd-dma

This series fixes several major problems in the current code:

- Issue 1: when receiving very big PSI UNMAP invalidations, the current
  code is buggy in that it might skip the notification, while actually
  we should always send it.

- Issue 2: the IOTLB is not thread safe, while block dataplane can
  access and update it in parallel.

- Issue 3: for devices that registered UNMAP-only notifiers, we don't
  really need to do a page walk for PSIs; we can directly deliver the
  notification down.  For example, vhost.

- Issue 4: there is an unsafe window for MAP-notified devices like
  vfio-pci (and, in the future, vDPA as well).  The problem is that,
  currently, for domain invalidations we do this to make sure the
  shadow page tables are correctly synced:

  1. unmap the whole address space
  2. replay the whole address space, mapping the existing pages

  However between steps 1 and 2 there is a very small window (it can be
  as long as 3ms) during which the shadow page table is either invalid
  or incomplete (since we are still rebuilding it).  That is fatal,
  since the device never knows this is happening and may still DMA to
  those memory regions.

Patch 1 fixes issue 1.  I put it first since it's picked from an old
post.

Patch 2 is a cleanup to remove useless IntelIOMMUNotifierNode struct.

Patch 3 fixes issue 2.

Patch 4 fixes issue 3.
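
Concretely, the UNMAP-only delivery in patch 4 boils down to building an
invalidation entry and notifying it directly (shape excerpted from that
patch; the surrounding per-device loop is omitted here):

    IOMMUTLBEntry entry = {
        .target_as = &address_space_memory,
        .iova = addr,
        .translated_addr = 0,
        .addr_mask = size - 1,   /* size = (1 << am) * VTD_PAGE_SIZE */
        .perm = IOMMU_NONE,      /* pure invalidation, no new mapping */
    };
    memory_region_notify_iommu(&vtd_as->iommu, entry);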

Patches 5-9 fix issue 4.  Here a very simple interval tree is
implemented based on GTree.  It differs from a general interval tree in
that it does not allow the user to attach private data (e.g.,
translated addresses) to the elements.  The benefit is that we can then
merge adjacent interval leaves (for example, two back-to-back 4K ranges
are kept as a single 8K node), so that hopefully we won't consume much
memory even if there are a lot of mappings (that happens for nested
virt - when mapping the whole L2 guest RAM range, it can be at least
several GBs).

Patch 10 is another big cleanup that only works after patch 9.

Tests:

- device assignment to L1 and even L2 guests.  With this series applied
  (and the kernel IOMMU patches: https://lkml.org/lkml/2018/4/18/5),
  we can even nest vIOMMUs now, e.g., we can specify a vIOMMU in an L2
  guest with assigned devices and things will work.  We couldn't
  before.

- vhost smoke test for regression.

Please review.  Thanks,

Peter Xu (12):
  intel-iommu: send PSI always even if across PDEs
  intel-iommu: remove IntelIOMMUNotifierNode
  intel-iommu: add iommu lock
  intel-iommu: only do page walk for MAP notifiers
  intel-iommu: introduce vtd_page_walk_info
  intel-iommu: pass in address space when page walk
  intel-iommu: trace domain id during page walk
  util: implement simple iova tree
  intel-iommu: maintain per-device iova ranges
  intel-iommu: simplify page walk logic
  intel-iommu: new vtd_sync_shadow_page_table_range
  intel-iommu: new sync_shadow_page_table

 include/hw/i386/intel_iommu.h |  19 +-
 include/qemu/iova-tree.h      | 134 ++++++++++++
 hw/i386/intel_iommu.c         | 381 +++++++++++++++++++++++++---------
 util/iova-tree.c              | 114 ++++++++++
 MAINTAINERS                   |   6 +
 hw/i386/trace-events          |   5 +-
 util/Makefile.objs            |   1 +
 7 files changed, 556 insertions(+), 104 deletions(-)
 create mode 100644 include/qemu/iova-tree.h
 create mode 100644 util/iova-tree.c

-- 
2.17.0


* [Qemu-devel] [PATCH v3 01/12] intel-iommu: send PSI always even if across PDEs
From: Peter Xu @ 2018-05-17  8:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	peterx, Jason Wang

During IOVA page table walking, there is a special case when the PSI
covers one whole PDE (Page Directory Entry, which contains 512 Page
Table Entries, i.e. 2MB of IOVA space with 4K pages) or more.  In the
past, we skipped that entry and did not notify the IOMMU notifiers.
This is not correct: we should send an UNMAP notification to registered
UNMAP notifiers in this case.

For UNMAP-only notifiers, this can leave stale IOTLB entries cached in
the devices even though they are already invalid.  For MAP/UNMAP
notifiers like vfio-pci, this will cause stale page mappings.

This special case doesn't trigger often, but it is easily triggered by
nested device assignment, since in that case we will possibly map the
whole L2 guest RAM region into the device's IOVA address space (several
GBs at least), which is far bigger than what a normal kernel driver
uses (typically tens of MBs).

Without this patch applied to L1 QEMU, nested device assignment to L2
guests will dump some errors like:

qemu-system-x86_64: VFIO_MAP_DMA: -17
qemu-system-x86_64: vfio_dma_map(0x557305420c30, 0xad000, 0x1000,
                    0x7f89a920d000) = -17 (File exists)

Acked-by: Jason Wang <jasowang@redhat.com>
[peterx: rewrite the commit message]
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 42 ++++++++++++++++++++++++++++++------------
 1 file changed, 30 insertions(+), 12 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index fb31de9416..b359efd6f9 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -722,6 +722,15 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
 
 typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
 
+static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
+                             vtd_page_walk_hook hook_fn, void *private)
+{
+    assert(hook_fn);
+    trace_vtd_page_walk_one(level, entry->iova, entry->translated_addr,
+                            entry->addr_mask, entry->perm);
+    return hook_fn(entry, private);
+}
+
 /**
  * vtd_page_walk_level - walk over specific level for IOVA range
  *
@@ -781,28 +790,37 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
          */
         entry_valid = read_cur | write_cur;
 
+        entry.target_as = &address_space_memory;
+        entry.iova = iova & subpage_mask;
+        entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
+        entry.addr_mask = ~subpage_mask;
+
         if (vtd_is_last_slpte(slpte, level)) {
-            entry.target_as = &address_space_memory;
-            entry.iova = iova & subpage_mask;
             /* NOTE: this is only meaningful if entry_valid == true */
             entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
-            entry.addr_mask = ~subpage_mask;
-            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
             if (!entry_valid && !notify_unmap) {
                 trace_vtd_page_walk_skip_perm(iova, iova_next);
                 goto next;
             }
-            trace_vtd_page_walk_one(level, entry.iova, entry.translated_addr,
-                                    entry.addr_mask, entry.perm);
-            if (hook_fn) {
-                ret = hook_fn(&entry, private);
-                if (ret < 0) {
-                    return ret;
-                }
+            ret = vtd_page_walk_one(&entry, level, hook_fn, private);
+            if (ret < 0) {
+                return ret;
             }
         } else {
             if (!entry_valid) {
-                trace_vtd_page_walk_skip_perm(iova, iova_next);
+                if (notify_unmap) {
+                    /*
+                     * The whole entry is invalid; unmap it all.
+                     * Translated address is meaningless, zero it.
+                     */
+                    entry.translated_addr = 0x0;
+                    ret = vtd_page_walk_one(&entry, level, hook_fn, private);
+                    if (ret < 0) {
+                        return ret;
+                    }
+                } else {
+                    trace_vtd_page_walk_skip_perm(iova, iova_next);
+                }
                 goto next;
             }
             ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, aw), iova,
-- 
2.17.0


* [Qemu-devel] [PATCH v3 02/12] intel-iommu: remove IntelIOMMUNotifierNode
From: Peter Xu @ 2018-05-17  8:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	peterx, Jason Wang

That struct is not really necessary.  Remove it and put the list entry
directly into VTDAddressSpace.  It simplifies the code a lot.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/hw/i386/intel_iommu.h |  9 ++------
 hw/i386/intel_iommu.c         | 41 ++++++++++-------------------------
 2 files changed, 14 insertions(+), 36 deletions(-)

diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 45ec8919b6..220697253f 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -67,7 +67,6 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
 typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
 typedef struct VTDIrq VTDIrq;
 typedef struct VTD_MSIMessage VTD_MSIMessage;
-typedef struct IntelIOMMUNotifierNode IntelIOMMUNotifierNode;
 
 /* Context-Entry */
 struct VTDContextEntry {
@@ -93,6 +92,7 @@ struct VTDAddressSpace {
     MemoryRegion iommu_ir;      /* Interrupt region: 0xfeeXXXXX */
     IntelIOMMUState *iommu_state;
     VTDContextCacheEntry context_cache_entry;
+    QLIST_ENTRY(VTDAddressSpace) next;
 };
 
 struct VTDBus {
@@ -253,11 +253,6 @@ struct VTD_MSIMessage {
 /* When IR is enabled, all MSI/MSI-X data bits should be zero */
 #define VTD_IR_MSI_DATA          (0)
 
-struct IntelIOMMUNotifierNode {
-    VTDAddressSpace *vtd_as;
-    QLIST_ENTRY(IntelIOMMUNotifierNode) next;
-};
-
 /* The iommu (DMAR) device state struct */
 struct IntelIOMMUState {
     X86IOMMUState x86_iommu;
@@ -295,7 +290,7 @@ struct IntelIOMMUState {
     GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
     VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
     /* list of registered notifiers */
-    QLIST_HEAD(, IntelIOMMUNotifierNode) notifiers_list;
+    QLIST_HEAD(, VTDAddressSpace) notifiers_list;
 
     /* interrupt remapping */
     bool intr_enabled;              /* Whether guest enabled IR */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index b359efd6f9..5987b48d43 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1248,10 +1248,10 @@ static void vtd_interrupt_remap_table_setup(IntelIOMMUState *s)
 
 static void vtd_iommu_replay_all(IntelIOMMUState *s)
 {
-    IntelIOMMUNotifierNode *node;
+    VTDAddressSpace *vtd_as;
 
-    QLIST_FOREACH(node, &s->notifiers_list, next) {
-        memory_region_iommu_replay_all(&node->vtd_as->iommu);
+    QLIST_FOREACH(vtd_as, &s->notifiers_list, next) {
+        memory_region_iommu_replay_all(&vtd_as->iommu);
     }
 }
 
@@ -1372,7 +1372,6 @@ static void vtd_iotlb_global_invalidate(IntelIOMMUState *s)
 
 static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
 {
-    IntelIOMMUNotifierNode *node;
     VTDContextEntry ce;
     VTDAddressSpace *vtd_as;
 
@@ -1381,8 +1380,7 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
     g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_domain,
                                 &domain_id);
 
-    QLIST_FOREACH(node, &s->notifiers_list, next) {
-        vtd_as = node->vtd_as;
+    QLIST_FOREACH(vtd_as, &s->notifiers_list, next) {
         if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
                                       vtd_as->devfn, &ce) &&
             domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
@@ -1402,12 +1400,11 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
                                            uint16_t domain_id, hwaddr addr,
                                            uint8_t am)
 {
-    IntelIOMMUNotifierNode *node;
+    VTDAddressSpace *vtd_as;
     VTDContextEntry ce;
     int ret;
 
-    QLIST_FOREACH(node, &(s->notifiers_list), next) {
-        VTDAddressSpace *vtd_as = node->vtd_as;
+    QLIST_FOREACH(vtd_as, &(s->notifiers_list), next) {
         ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
                                        vtd_as->devfn, &ce);
         if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
@@ -2344,8 +2341,6 @@ static void vtd_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
 {
     VTDAddressSpace *vtd_as = container_of(iommu, VTDAddressSpace, iommu);
     IntelIOMMUState *s = vtd_as->iommu_state;
-    IntelIOMMUNotifierNode *node = NULL;
-    IntelIOMMUNotifierNode *next_node = NULL;
 
     if (!s->caching_mode && new & IOMMU_NOTIFIER_MAP) {
         error_report("We need to set caching-mode=1 for intel-iommu to enable "
@@ -2354,21 +2349,11 @@ static void vtd_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
     }
 
     if (old == IOMMU_NOTIFIER_NONE) {
-        node = g_malloc0(sizeof(*node));
-        node->vtd_as = vtd_as;
-        QLIST_INSERT_HEAD(&s->notifiers_list, node, next);
-        return;
-    }
-
-    /* update notifier node with new flags */
-    QLIST_FOREACH_SAFE(node, &s->notifiers_list, next, next_node) {
-        if (node->vtd_as == vtd_as) {
-            if (new == IOMMU_NOTIFIER_NONE) {
-                QLIST_REMOVE(node, next);
-                g_free(node);
-            }
-            return;
-        }
+        /* Insert new ones */
+        QLIST_INSERT_HEAD(&s->notifiers_list, vtd_as, next);
+    } else if (new == IOMMU_NOTIFIER_NONE) {
+        /* Remove old ones */
+        QLIST_REMOVE(vtd_as, next);
     }
 }
 
@@ -2838,12 +2823,10 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
 
 static void vtd_address_space_unmap_all(IntelIOMMUState *s)
 {
-    IntelIOMMUNotifierNode *node;
     VTDAddressSpace *vtd_as;
     IOMMUNotifier *n;
 
-    QLIST_FOREACH(node, &s->notifiers_list, next) {
-        vtd_as = node->vtd_as;
+    QLIST_FOREACH(vtd_as, &s->notifiers_list, next) {
         IOMMU_NOTIFIER_FOREACH(n, &vtd_as->iommu) {
             vtd_address_space_unmap(vtd_as, n);
         }
-- 
2.17.0


* [Qemu-devel] [PATCH v3 03/12] intel-iommu: add iommu lock
From: Peter Xu @ 2018-05-17  8:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	peterx, Jason Wang

Add a per-IOMMU big lock to protect IOMMU state.  Currently the only
thing to be protected is the IOTLB/context cache, since that can be
accessed even without the BQL, e.g., in IO dataplane.

Note that we don't need to protect the device page tables since those
are fully controlled by the guest kernel.  However there is still the
possibility that a malicious driver will program the device to not obey
the rules.  In that case QEMU can't really do anything useful; instead
the guest itself will be responsible for all the uncertainties.

Reported-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/hw/i386/intel_iommu.h |  6 +++++
 hw/i386/intel_iommu.c         | 43 +++++++++++++++++++++++++++++++----
 2 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 220697253f..ee517704e7 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -300,6 +300,12 @@ struct IntelIOMMUState {
     OnOffAuto intr_eim;             /* Toggle for EIM cabability */
     bool buggy_eim;                 /* Force buggy EIM unless eim=off */
     uint8_t aw_bits;                /* Host/IOVA address width (in bits) */
+
+    /*
+     * Protects IOMMU states in general.  Currently it protects the
+     * per-IOMMU IOTLB cache, and context entry cache in VTDAddressSpace.
+     */
+    QemuMutex iommu_lock;
 };
 
 /* Find the VTD Address space associated with the given bus pointer,
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 5987b48d43..112971638d 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -128,6 +128,16 @@ static uint64_t vtd_set_clear_mask_quad(IntelIOMMUState *s, hwaddr addr,
     return new_val;
 }
 
+static inline void vtd_iommu_lock(IntelIOMMUState *s)
+{
+    qemu_mutex_lock(&s->iommu_lock);
+}
+
+static inline void vtd_iommu_unlock(IntelIOMMUState *s)
+{
+    qemu_mutex_unlock(&s->iommu_lock);
+}
+
 /* GHashTable functions */
 static gboolean vtd_uint64_equal(gconstpointer v1, gconstpointer v2)
 {
@@ -172,7 +182,7 @@ static gboolean vtd_hash_remove_by_page(gpointer key, gpointer value,
 }
 
 /* Reset all the gen of VTDAddressSpace to zero and set the gen of
- * IntelIOMMUState to 1.
+ * IntelIOMMUState to 1.  Must be with IOMMU lock held.
  */
 static void vtd_reset_context_cache(IntelIOMMUState *s)
 {
@@ -197,12 +207,19 @@ static void vtd_reset_context_cache(IntelIOMMUState *s)
     s->context_cache_gen = 1;
 }
 
-static void vtd_reset_iotlb(IntelIOMMUState *s)
+static void vtd_reset_iotlb_locked(IntelIOMMUState *s)
 {
     assert(s->iotlb);
     g_hash_table_remove_all(s->iotlb);
 }
 
+static void vtd_reset_iotlb(IntelIOMMUState *s)
+{
+    vtd_iommu_lock(s);
+    vtd_reset_iotlb_locked(s);
+    vtd_iommu_unlock(s);
+}
+
 static uint64_t vtd_get_iotlb_key(uint64_t gfn, uint16_t source_id,
                                   uint32_t level)
 {
@@ -215,6 +232,7 @@ static uint64_t vtd_get_iotlb_gfn(hwaddr addr, uint32_t level)
     return (addr & vtd_slpt_level_page_mask(level)) >> VTD_PAGE_SHIFT_4K;
 }
 
+/* Must be with IOMMU lock held */
 static VTDIOTLBEntry *vtd_lookup_iotlb(IntelIOMMUState *s, uint16_t source_id,
                                        hwaddr addr)
 {
@@ -235,6 +253,7 @@ out:
     return entry;
 }
 
+/* Must be with IOMMU lock held */
 static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
                              uint16_t domain_id, hwaddr addr, uint64_t slpte,
                              uint8_t access_flags, uint32_t level)
@@ -246,7 +265,7 @@ static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
     trace_vtd_iotlb_page_update(source_id, addr, slpte, domain_id);
     if (g_hash_table_size(s->iotlb) >= VTD_IOTLB_MAX_SIZE) {
         trace_vtd_iotlb_reset("iotlb exceeds size limit");
-        vtd_reset_iotlb(s);
+        vtd_reset_iotlb_locked(s);
     }
 
     entry->gfn = gfn;
@@ -1106,7 +1125,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     IntelIOMMUState *s = vtd_as->iommu_state;
     VTDContextEntry ce;
     uint8_t bus_num = pci_bus_num(bus);
-    VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
+    VTDContextCacheEntry *cc_entry;
     uint64_t slpte, page_mask;
     uint32_t level;
     uint16_t source_id = vtd_make_source_id(bus_num, devfn);
@@ -1123,6 +1142,10 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
      */
     assert(!vtd_is_interrupt_addr(addr));
 
+    vtd_iommu_lock(s);
+
+    cc_entry = &vtd_as->context_cache_entry;
+
     /* Try to fetch slpte form IOTLB */
     iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
     if (iotlb_entry) {
@@ -1182,7 +1205,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
          * IOMMU region can be swapped back.
          */
         vtd_pt_enable_fast_path(s, source_id);
-
+        vtd_iommu_unlock(s);
         return true;
     }
 
@@ -1203,6 +1226,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     vtd_update_iotlb(s, source_id, VTD_CONTEXT_ENTRY_DID(ce.hi), addr, slpte,
                      access_flags, level);
 out:
+    vtd_iommu_unlock(s);
     entry->iova = addr & page_mask;
     entry->translated_addr = vtd_get_slpte_addr(slpte, s->aw_bits) & page_mask;
     entry->addr_mask = ~page_mask;
@@ -1210,6 +1234,7 @@ out:
     return true;
 
 error:
+    vtd_iommu_unlock(s);
     entry->iova = 0;
     entry->translated_addr = 0;
     entry->addr_mask = 0;
@@ -1258,10 +1283,13 @@ static void vtd_iommu_replay_all(IntelIOMMUState *s)
 static void vtd_context_global_invalidate(IntelIOMMUState *s)
 {
     trace_vtd_inv_desc_cc_global();
+    /* Protects context cache */
+    vtd_iommu_lock(s);
     s->context_cache_gen++;
     if (s->context_cache_gen == VTD_CONTEXT_CACHE_GEN_MAX) {
         vtd_reset_context_cache(s);
     }
+    vtd_iommu_unlock(s);
     vtd_switch_address_space_all(s);
     /*
      * From VT-d spec 6.5.2.1, a global context entry invalidation
@@ -1377,8 +1405,10 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
 
     trace_vtd_inv_desc_iotlb_domain(domain_id);
 
+    vtd_iommu_lock(s);
     g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_domain,
                                 &domain_id);
+    vtd_iommu_unlock(s);
 
     QLIST_FOREACH(vtd_as, &s->notifiers_list, next) {
         if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
@@ -1426,7 +1456,9 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
     info.domain_id = domain_id;
     info.addr = addr;
     info.mask = ~((1 << am) - 1);
+    vtd_iommu_lock(s);
     g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
+    vtd_iommu_unlock(s);
     vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
 }
 
@@ -3072,6 +3104,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
     }
 
     QLIST_INIT(&s->notifiers_list);
+    qemu_mutex_init(&s->iommu_lock);
     memset(s->vtd_as_by_bus_num, 0, sizeof(s->vtd_as_by_bus_num));
     memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
                           "intel_iommu", DMAR_REG_SIZE);
-- 
2.17.0


* [Qemu-devel] [PATCH v3 04/12] intel-iommu: only do page walk for MAP notifiers
From: Peter Xu @ 2018-05-17  8:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	peterx, Jason Wang

For UNMAP-only IOMMU notifiers, we don't really need to walk the page
tables.  Speed that path up by skipping the page table walk.  That
should boost performance for UNMAP-only notifiers like vhost.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/hw/i386/intel_iommu.h |  2 ++
 hw/i386/intel_iommu.c         | 43 +++++++++++++++++++++++++++++++----
 2 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index ee517704e7..9e0a6c1c6a 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -93,6 +93,8 @@ struct VTDAddressSpace {
     IntelIOMMUState *iommu_state;
     VTDContextCacheEntry context_cache_entry;
     QLIST_ENTRY(VTDAddressSpace) next;
+    /* Superset of notifier flags that this address space has */
+    IOMMUNotifierFlag notifier_flags;
 };
 
 struct VTDBus {
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 112971638d..9a418abfb6 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -138,6 +138,12 @@ static inline void vtd_iommu_unlock(IntelIOMMUState *s)
     qemu_mutex_unlock(&s->iommu_lock);
 }
 
+/* Whether the address space needs to notify new mappings */
+static inline gboolean vtd_as_notify_mappings(VTDAddressSpace *as)
+{
+    return as->notifier_flags & IOMMU_NOTIFIER_MAP;
+}
+
 /* GHashTable functions */
 static gboolean vtd_uint64_equal(gconstpointer v1, gconstpointer v2)
 {
@@ -1433,14 +1439,35 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
     VTDAddressSpace *vtd_as;
     VTDContextEntry ce;
     int ret;
+    hwaddr size = (1 << am) * VTD_PAGE_SIZE;
 
     QLIST_FOREACH(vtd_as, &(s->notifiers_list), next) {
         ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
                                        vtd_as->devfn, &ce);
         if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
-            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
-                          vtd_page_invalidate_notify_hook,
-                          (void *)&vtd_as->iommu, true, s->aw_bits);
+            if (vtd_as_notify_mappings(vtd_as)) {
+                /*
+                 * For MAP-inclusive notifiers, we need to walk the
+                 * page table to sync the shadow page table.
+                 */
+                vtd_page_walk(&ce, addr, addr + size,
+                              vtd_page_invalidate_notify_hook,
+                              (void *)&vtd_as->iommu, true, s->aw_bits);
+            } else {
+                /*
+                 * For UNMAP-only notifiers, we don't need to walk the
+                 * page tables.  We just deliver the PSI down to
+                 * invalidate caches.
+                 */
+                IOMMUTLBEntry entry = {
+                    .target_as = &address_space_memory,
+                    .iova = addr,
+                    .translated_addr = 0,
+                    .addr_mask = size - 1,
+                    .perm = IOMMU_NONE,
+                };
+                memory_region_notify_iommu(&vtd_as->iommu, entry);
+            }
         }
     }
 }
@@ -2380,6 +2407,9 @@ static void vtd_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
         exit(1);
     }
 
+    /* Update per-address-space notifier flags */
+    vtd_as->notifier_flags = new;
+
     if (old == IOMMU_NOTIFIER_NONE) {
         /* Insert new ones */
         QLIST_INSERT_HEAD(&s->notifiers_list, vtd_as, next);
@@ -2890,8 +2920,11 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
                                   PCI_FUNC(vtd_as->devfn),
                                   VTD_CONTEXT_ENTRY_DID(ce.hi),
                                   ce.hi, ce.lo);
-        vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
-                      s->aw_bits);
+        if (vtd_as_notify_mappings(vtd_as)) {
+            /* This is required only for MAP typed notifiers */
+            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
+                          s->aw_bits);
+        }
     } else {
         trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
                                     PCI_FUNC(vtd_as->devfn));
-- 
2.17.0


* [Qemu-devel] [PATCH v3 05/12] intel-iommu: introduce vtd_page_walk_info
From: Peter Xu @ 2018-05-17  8:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	peterx, Jason Wang

During the recursive walking of IOVA page tables, some stack variables
are constant and never change during the whole page walk.  Isolate them
into a struct so that we don't need to pass those constants down the
stack over and over again.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 84 ++++++++++++++++++++++++++-----------------
 1 file changed, 52 insertions(+), 32 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 9a418abfb6..4953d02ed0 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -747,9 +747,27 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
 
 typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
 
+/**
+ * Constant information used during page walking
+ *
+ * @hook_fn: hook func to be called when detected page
+ * @private: private data to be passed into hook func
+ * @notify_unmap: whether we should notify invalid entries
+ * @aw: maximum address width
+ */
+typedef struct {
+    vtd_page_walk_hook hook_fn;
+    void *private;
+    bool notify_unmap;
+    uint8_t aw;
+} vtd_page_walk_info;
+
 static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
-                             vtd_page_walk_hook hook_fn, void *private)
+                             vtd_page_walk_info *info)
 {
+    vtd_page_walk_hook hook_fn = info->hook_fn;
+    void *private = info->private;
+
     assert(hook_fn);
     trace_vtd_page_walk_one(level, entry->iova, entry->translated_addr,
                             entry->addr_mask, entry->perm);
@@ -762,17 +780,13 @@ static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
  * @addr: base GPA addr to start the walk
  * @start: IOVA range start address
  * @end: IOVA range end address (start <= addr < end)
- * @hook_fn: hook func to be called when detected page
- * @private: private data to be passed into hook func
  * @read: whether parent level has read permission
  * @write: whether parent level has write permission
- * @notify_unmap: whether we should notify invalid entries
- * @aw: maximum address width
+ * @info: constant information for the page walk
  */
 static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
-                               uint64_t end, vtd_page_walk_hook hook_fn,
-                               void *private, uint32_t level, bool read,
-                               bool write, bool notify_unmap, uint8_t aw)
+                               uint64_t end, uint32_t level, bool read,
+                               bool write, vtd_page_walk_info *info)
 {
     bool read_cur, write_cur, entry_valid;
     uint32_t offset;
@@ -822,24 +836,24 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
 
         if (vtd_is_last_slpte(slpte, level)) {
             /* NOTE: this is only meaningful if entry_valid == true */
-            entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
-            if (!entry_valid && !notify_unmap) {
+            entry.translated_addr = vtd_get_slpte_addr(slpte, info->aw);
+            if (!entry_valid && !info->notify_unmap) {
                 trace_vtd_page_walk_skip_perm(iova, iova_next);
                 goto next;
             }
-            ret = vtd_page_walk_one(&entry, level, hook_fn, private);
+            ret = vtd_page_walk_one(&entry, level, info);
             if (ret < 0) {
                 return ret;
             }
         } else {
             if (!entry_valid) {
-                if (notify_unmap) {
+                if (info->notify_unmap) {
                     /*
                      * The whole entry is invalid; unmap it all.
                      * Translated address is meaningless, zero it.
                      */
                     entry.translated_addr = 0x0;
-                    ret = vtd_page_walk_one(&entry, level, hook_fn, private);
+                    ret = vtd_page_walk_one(&entry, level, info);
                     if (ret < 0) {
                         return ret;
                     }
@@ -848,10 +862,9 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
                 }
                 goto next;
             }
-            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, aw), iova,
-                                      MIN(iova_next, end), hook_fn, private,
-                                      level - 1, read_cur, write_cur,
-                                      notify_unmap, aw);
+            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, info->aw),
+                                      iova, MIN(iova_next, end), level - 1,
+                                      read_cur, write_cur, info);
             if (ret < 0) {
                 return ret;
             }
@@ -870,28 +883,24 @@ next:
  * @ce: context entry to walk upon
  * @start: IOVA address to start the walk
  * @end: IOVA range end address (start <= addr < end)
- * @hook_fn: the hook that to be called for each detected area
- * @private: private data for the hook function
- * @aw: maximum address width
+ * @info: page walking information struct
  */
 static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
-                         vtd_page_walk_hook hook_fn, void *private,
-                         bool notify_unmap, uint8_t aw)
+                         vtd_page_walk_info *info)
 {
     dma_addr_t addr = vtd_ce_get_slpt_base(ce);
     uint32_t level = vtd_ce_get_level(ce);
 
-    if (!vtd_iova_range_check(start, ce, aw)) {
+    if (!vtd_iova_range_check(start, ce, info->aw)) {
         return -VTD_FR_ADDR_BEYOND_MGAW;
     }
 
-    if (!vtd_iova_range_check(end, ce, aw)) {
+    if (!vtd_iova_range_check(end, ce, info->aw)) {
         /* Fix end so that it reaches the maximum */
-        end = vtd_iova_limit(ce, aw);
+        end = vtd_iova_limit(ce, info->aw);
     }
 
-    return vtd_page_walk_level(addr, start, end, hook_fn, private,
-                               level, true, true, notify_unmap, aw);
+    return vtd_page_walk_level(addr, start, end, level, true, true, info);
 }
 
 /* Map a device to its corresponding domain (context-entry) */
@@ -1446,13 +1455,18 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
                                        vtd_as->devfn, &ce);
         if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
             if (vtd_as_notify_mappings(vtd_as)) {
+                vtd_page_walk_info info = {
+                    .hook_fn = vtd_page_invalidate_notify_hook,
+                    .private = (void *)&vtd_as->iommu,
+                    .notify_unmap = true,
+                    .aw = s->aw_bits,
+                };
+
                 /*
                  * For MAP-inclusive notifiers, we need to walk the
                  * page table to sync the shadow page table.
                  */
-                vtd_page_walk(&ce, addr, addr + size,
-                              vtd_page_invalidate_notify_hook,
-                              (void *)&vtd_as->iommu, true, s->aw_bits);
+                vtd_page_walk(&ce, addr, addr + size, &info);
             } else {
                 /*
                  * For UNMAP-only notifiers, we don't need to walk the
@@ -2922,8 +2936,14 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
                                   ce.hi, ce.lo);
         if (vtd_as_notify_mappings(vtd_as)) {
             /* This is required only for MAP typed notifiers */
-            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
-                          s->aw_bits);
+            vtd_page_walk_info info = {
+                .hook_fn = vtd_replay_hook,
+                .private = (void *)n,
+                .notify_unmap = false,
+                .aw = s->aw_bits,
+            };
+
+            vtd_page_walk(&ce, 0, ~0ULL, &info);
         }
     } else {
         trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
-- 
2.17.0


* [Qemu-devel] [PATCH v3 06/12] intel-iommu: pass in address space when page walk
From: Peter Xu @ 2018-05-17  8:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	peterx, Jason Wang

Pass in the VTDAddressSpace too.  It'll be used in the follow-up
patches.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 4953d02ed0..fe5ee77d46 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -753,9 +753,11 @@ typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
  * @hook_fn: hook func to be called when detected page
  * @private: private data to be passed into hook func
  * @notify_unmap: whether we should notify invalid entries
+ * @as: VT-d address space of the device
  * @aw: maximum address width
  */
 typedef struct {
+    VTDAddressSpace *as;
     vtd_page_walk_hook hook_fn;
     void *private;
     bool notify_unmap;
@@ -1460,6 +1462,7 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
                     .private = (void *)&vtd_as->iommu,
                     .notify_unmap = true,
                     .aw = s->aw_bits,
+                    .as = vtd_as,
                 };
 
                 /*
@@ -2941,6 +2944,7 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
                 .private = (void *)n,
                 .notify_unmap = false,
                 .aw = s->aw_bits,
+                .as = vtd_as,
             };
 
             vtd_page_walk(&ce, 0, ~0ULL, &info);
-- 
2.17.0


* [Qemu-devel] [PATCH v3 07/12] intel-iommu: trace domain id during page walk
From: Peter Xu @ 2018-05-17  8:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	peterx, Jason Wang

This patch only modifies the trace points.

Previously we were tracing the page walk levels.  They are redundant
since we already have the page mask (size).  Now we trace something
much more useful: the domain ID of the page walk.  That can be very
useful when we trace more than one device on the same system, so that
we can know which map belongs to which domain.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 16 ++++++++++------
 hw/i386/trace-events  |  2 +-
 2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index fe5ee77d46..29fcf2b3a8 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -755,6 +755,7 @@ typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
  * @notify_unmap: whether we should notify invalid entries
  * @as: VT-d address space of the device
  * @aw: maximum address width
+ * @domain: domain ID of the page walk
  */
 typedef struct {
     VTDAddressSpace *as;
@@ -762,17 +763,18 @@ typedef struct {
     void *private;
     bool notify_unmap;
     uint8_t aw;
+    uint16_t domain_id;
 } vtd_page_walk_info;
 
-static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
-                             vtd_page_walk_info *info)
+static int vtd_page_walk_one(IOMMUTLBEntry *entry, vtd_page_walk_info *info)
 {
     vtd_page_walk_hook hook_fn = info->hook_fn;
     void *private = info->private;
 
     assert(hook_fn);
-    trace_vtd_page_walk_one(level, entry->iova, entry->translated_addr,
-                            entry->addr_mask, entry->perm);
+    trace_vtd_page_walk_one(info->domain_id, entry->iova,
+                            entry->translated_addr, entry->addr_mask,
+                            entry->perm);
     return hook_fn(entry, private);
 }
 
@@ -843,7 +845,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
                 trace_vtd_page_walk_skip_perm(iova, iova_next);
                 goto next;
             }
-            ret = vtd_page_walk_one(&entry, level, info);
+            ret = vtd_page_walk_one(&entry, info);
             if (ret < 0) {
                 return ret;
             }
@@ -855,7 +857,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
                      * Translated address is meaningless, zero it.
                      */
                     entry.translated_addr = 0x0;
-                    ret = vtd_page_walk_one(&entry, level, info);
+                    ret = vtd_page_walk_one(&entry, info);
                     if (ret < 0) {
                         return ret;
                     }
@@ -1463,6 +1465,7 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
                     .notify_unmap = true,
                     .aw = s->aw_bits,
                     .as = vtd_as,
+                    .domain_id = domain_id,
                 };
 
                 /*
@@ -2945,6 +2948,7 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
                 .notify_unmap = false,
                 .aw = s->aw_bits,
                 .as = vtd_as,
+                .domain_id = VTD_CONTEXT_ENTRY_DID(ce.hi),
             };
 
             vtd_page_walk(&ce, 0, ~0ULL, &info);
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 22d44648af..ca23ba9fad 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -39,7 +39,7 @@ vtd_fault_disabled(void) "Fault processing disabled for context entry"
 vtd_replay_ce_valid(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t domain, uint64_t hi, uint64_t lo) "replay valid context device %02"PRIx8":%02"PRIx8".%02"PRIx8" domain 0x%"PRIx16" hi 0x%"PRIx64" lo 0x%"PRIx64
 vtd_replay_ce_invalid(uint8_t bus, uint8_t dev, uint8_t fn) "replay invalid context device %02"PRIx8":%02"PRIx8".%02"PRIx8
 vtd_page_walk_level(uint64_t addr, uint32_t level, uint64_t start, uint64_t end) "walk (base=0x%"PRIx64", level=%"PRIu32") iova range 0x%"PRIx64" - 0x%"PRIx64
-vtd_page_walk_one(uint32_t level, uint64_t iova, uint64_t gpa, uint64_t mask, int perm) "detected page level 0x%"PRIx32" iova 0x%"PRIx64" -> gpa 0x%"PRIx64" mask 0x%"PRIx64" perm %d"
+vtd_page_walk_one(uint16_t domain, uint64_t iova, uint64_t gpa, uint64_t mask, int perm) "domain 0x%"PRIu16" iova 0x%"PRIx64" -> gpa 0x%"PRIx64" mask 0x%"PRIx64" perm %d"
 vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
 vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
 vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
-- 
2.17.0


* [Qemu-devel] [PATCH v3 08/12] util: implement simple iova tree
From: Peter Xu @ 2018-05-17  8:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	peterx, Jason Wang

Introduce a very simple IOVA tree implementation based on GTree.  The
tree is keyed by the DMAMap range itself, and the comparator returns 0
when two ranges overlap, so lookups naturally find overlapping entries.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/qemu/iova-tree.h | 134 +++++++++++++++++++++++++++++++++++++++
 util/iova-tree.c         | 114 +++++++++++++++++++++++++++++++++
 MAINTAINERS              |   6 ++
 util/Makefile.objs       |   1 +
 4 files changed, 255 insertions(+)
 create mode 100644 include/qemu/iova-tree.h
 create mode 100644 util/iova-tree.c

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
new file mode 100644
index 0000000000..b061932097
--- /dev/null
+++ b/include/qemu/iova-tree.h
@@ -0,0 +1,134 @@
+/*
+ * An very simplified iova tree implementation based on GTree.
+ *
+ * Copyright 2018 Red Hat, Inc.
+ *
+ * Authors:
+ *  Peter Xu <peterx@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ */
+#ifndef IOVA_TREE_H
+#define IOVA_TREE_H
+
+/*
+ * Currently the iova tree will only allow to keep ranges
+ * information, and no extra user data is allowed for each element.  A
+ * benefit is that we can merge adjacent ranges internally within the
+ * tree.  It can save a lot of memory when the ranges are splitted but
+ * mostly continuous.
+ *
+ * Note that current implementation does not provide any thread
+ * protections.  Callers of the iova tree should be responsible
+ * for the thread safety issue.
+ */
+
+#include "qemu/osdep.h"
+#include "exec/memory.h"
+#include "exec/hwaddr.h"
+
+#define  IOVA_OK           (0)
+#define  IOVA_ERR_INVALID  (-1) /* Invalid parameters */
+#define  IOVA_ERR_OVERLAP  (-2) /* IOVA range overlapped */
+
+typedef struct IOVATree IOVATree;
+typedef struct DMAMap {
+    hwaddr iova;
+    hwaddr translated_addr;
+    hwaddr size;                /* Inclusive */
+    IOMMUAccessFlags perm;
+} QEMU_PACKED DMAMap;
+typedef gboolean (*iova_tree_iterator)(DMAMap *map);
+
+/**
+ * iova_tree_new:
+ *
+ * Create a new iova tree.
+ *
+ * Returns: the tree pointer when succeeded, or NULL if error.
+ */
+IOVATree *iova_tree_new(void);
+
+/**
+ * iova_tree_insert:
+ *
+ * @tree: the iova tree to insert
+ * @map: the mapping to insert
+ *
+ * Insert an iova range to the tree.  If there is overlapped
+ * ranges, IOVA_ERR_OVERLAP will be returned.
+ *
+ * Return: 0 if succeeded, or <0 if error.
+ */
+int iova_tree_insert(IOVATree *tree, DMAMap *map);
+
+/**
+ * iova_tree_remove:
+ *
+ * @tree: the iova tree to remove range from
+ * @map: the map range to remove
+ *
+ * Remove mappings from the tree that are covered by the map range
+ * provided.  The range does not need to be exactly what has inserted,
+ * all the mappings that are included in the provided range will be
+ * removed from the tree.  Here map->translated_addr is meaningless.
+ *
+ * Return: 0 if succeeded, or <0 if error.
+ */
+int iova_tree_remove(IOVATree *tree, DMAMap *map);
+
+/**
+ * iova_tree_find:
+ *
+ * @tree: the iova tree to search from
+ * @map: the mapping to search
+ *
+ * Search for a mapping in the iova tree that overlaps with the
+ * mapping range specified.  Only the first found mapping will be
+ * returned.
+ *
+ * Return: DMAMap pointer if found, or NULL if not found.  Note that
+ * the returned DMAMap pointer is maintained internally.  User should
+ * only read the content but never modify or free the content.  Also,
+ * user is responsible to make sure the pointer is valid (say, no
+ * concurrent deletion in progress).
+ */
+DMAMap *iova_tree_find(IOVATree *tree, DMAMap *map);
+
+/**
+ * iova_tree_find_address:
+ *
+ * @tree: the iova tree to search from
+ * @iova: the iova address to find
+ *
+ * Similar to iova_tree_find(), but it tries to find mapping with
+ * range iova=iova & size=0.
+ *
+ * Return: same as iova_tree_find().
+ */
+DMAMap *iova_tree_find_address(IOVATree *tree, hwaddr iova);
+
+/**
+ * iova_tree_foreach:
+ *
+ * @tree: the iova tree to iterate on
+ * @iterator: the interator for the mappings, return true to stop
+ *
+ * Iterate over the iova tree.
+ *
+ * Return: 1 if found any overlap, 0 if not, <0 if error.
+ */
+void iova_tree_foreach(IOVATree *tree, iova_tree_iterator iterator);
+
+/**
+ * iova_tree_destroy:
+ *
+ * @tree: the iova tree to destroy
+ *
+ * Destroy an existing iova tree.
+ *
+ * Return: None.
+ */
+void iova_tree_destroy(IOVATree *tree);
+
+#endif
diff --git a/util/iova-tree.c b/util/iova-tree.c
new file mode 100644
index 0000000000..2d9cebfc89
--- /dev/null
+++ b/util/iova-tree.c
@@ -0,0 +1,114 @@
+/*
+ * IOVA tree implementation based on GTree.
+ *
+ * Copyright 2018 Red Hat, Inc.
+ *
+ * Authors:
+ *  Peter Xu <peterx@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ */
+
+#include <glib.h>
+#include "qemu/iova-tree.h"
+
+struct IOVATree {
+    GTree *tree;
+};
+
+static int iova_tree_compare(gconstpointer a, gconstpointer b, gpointer data)
+{
+    const DMAMap *m1 = a, *m2 = b;
+
+    if (m1->iova > m2->iova + m2->size) {
+        return 1;
+    }
+
+    if (m1->iova + m1->size < m2->iova) {
+        return -1;
+    }
+
+    /* Overlapped */
+    return 0;
+}
+
+IOVATree *iova_tree_new(void)
+{
+    IOVATree *iova_tree = g_new0(IOVATree, 1);
+
+    /* We don't have values actually, no need to free */
+    iova_tree->tree = g_tree_new_full(iova_tree_compare, NULL, g_free, NULL);
+
+    return iova_tree;
+}
+
+DMAMap *iova_tree_find(IOVATree *tree, DMAMap *map)
+{
+    return g_tree_lookup(tree->tree, map);
+}
+
+DMAMap *iova_tree_find_address(IOVATree *tree, hwaddr iova)
+{
+    DMAMap map = { .iova = iova, .size = 0 };
+
+    return iova_tree_find(tree, &map);
+}
+
+static inline void iova_tree_insert_internal(GTree *gtree, DMAMap *range)
+{
+    /* Key and value are sharing the same range data */
+    g_tree_insert(gtree, range, range);
+}
+
+int iova_tree_insert(IOVATree *tree, DMAMap *map)
+{
+    DMAMap *new;
+
+    if (map->iova + map->size < map->iova || map->perm == IOMMU_NONE) {
+        return IOVA_ERR_INVALID;
+    }
+
+    /* We don't allow to insert range that overlaps with existings */
+    if (iova_tree_find(tree, map)) {
+        return IOVA_ERR_OVERLAP;
+    }
+
+    new = g_new0(DMAMap, 1);
+    memcpy(new, map, sizeof(*new));
+    iova_tree_insert_internal(tree->tree, new);
+
+    return IOVA_OK;
+}
+
+static gboolean iova_tree_traverse(gpointer key, gpointer value,
+                                gpointer data)
+{
+    iova_tree_iterator iterator = data;
+    DMAMap *map = key;
+
+    g_assert(key == value);
+
+    return iterator(map);
+}
+
+void iova_tree_foreach(IOVATree *tree, iova_tree_iterator iterator)
+{
+    g_tree_foreach(tree->tree, iova_tree_traverse, iterator);
+}
+
+int iova_tree_remove(IOVATree *tree, DMAMap *map)
+{
+    DMAMap *overlap;
+
+    while ((overlap = iova_tree_find(tree, map))) {
+        g_tree_remove(tree->tree, overlap);
+    }
+
+    return IOVA_OK;
+}
+
+void iova_tree_destroy(IOVATree *tree)
+{
+    g_tree_destroy(tree->tree);
+    g_free(tree);
+}
diff --git a/MAINTAINERS b/MAINTAINERS
index e187b1f18f..f07fcee72e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1783,6 +1783,12 @@ F: include/sysemu/replay.h
 F: docs/replay.txt
 F: stubs/replay.c
 
+IOVA Tree
+M: Peter Xu <peterx@redhat.com>
+S: Maintained
+F: include/qemu/iova-tree.h
+F: util/iova-tree.c
+
 Usermode Emulation
 ------------------
 Overall
diff --git a/util/Makefile.objs b/util/Makefile.objs
index 728c3541db..e1c3fed4dc 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -47,4 +47,5 @@ util-obj-y += qht.o
 util-obj-y += range.o
 util-obj-y += stats64.o
 util-obj-y += systemd.o
+util-obj-y += iova-tree.o
 util-obj-$(CONFIG_LINUX) += vfio-helpers.o
-- 
2.17.0


* [Qemu-devel] [PATCH v3 09/12] intel-iommu: maintain per-device iova ranges
From: Peter Xu @ 2018-05-17  8:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	peterx, Jason Wang

For each VTDAddressSpace, we now maintain which IOVA ranges we have
mapped and which we have not.  With that information, we only send MAP
or UNMAP notifications when necessary: we don't send MAP notifications
if we know we have already mapped the range, and likewise we don't send
UNMAP notifications if we know we never mapped the range at all.
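
A compact summary of the resulting behaviour (my reading of the
vtd_page_walk_one() changes below, not wording taken from the patch):

    /*
     * MAP,   range already tracked, identical  -> skip the notification
     * MAP,   tracked but translation changed   -> notify UNMAP, then MAP again
     * MAP,   not tracked yet                   -> notify MAP, start tracking
     * UNMAP, range not tracked                 -> skip the notification
     * UNMAP, range tracked                     -> notify UNMAP, stop tracking
     */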

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/hw/i386/intel_iommu.h |  2 ++
 hw/i386/intel_iommu.c         | 67 +++++++++++++++++++++++++++++++++++
 hw/i386/trace-events          |  2 ++
 3 files changed, 71 insertions(+)

diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 9e0a6c1c6a..90190bfaa1 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -27,6 +27,7 @@
 #include "hw/i386/ioapic.h"
 #include "hw/pci/msi.h"
 #include "hw/sysbus.h"
+#include "qemu/iova-tree.h"
 
 #define TYPE_INTEL_IOMMU_DEVICE "intel-iommu"
 #define INTEL_IOMMU_DEVICE(obj) \
@@ -95,6 +96,7 @@ struct VTDAddressSpace {
     QLIST_ENTRY(VTDAddressSpace) next;
     /* Superset of notifier flags that this address space has */
     IOMMUNotifierFlag notifier_flags;
+    IOVATree *iova_tree;          /* Traces mapped IOVA ranges */
 };
 
 struct VTDBus {
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 29fcf2b3a8..5a5175a4ed 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -768,10 +768,71 @@ typedef struct {
 
 static int vtd_page_walk_one(IOMMUTLBEntry *entry, vtd_page_walk_info *info)
 {
+    VTDAddressSpace *as = info->as;
     vtd_page_walk_hook hook_fn = info->hook_fn;
     void *private = info->private;
+    DMAMap target = {
+        .iova = entry->iova,
+        .size = entry->addr_mask,
+        .translated_addr = entry->translated_addr,
+        .perm = entry->perm,
+    };
+    DMAMap *mapped = iova_tree_find(as->iova_tree, &target);
 
     assert(hook_fn);
+
+    /* Update local IOVA mapped ranges */
+    if (entry->perm) {
+        if (mapped) {
+            /* If it's exactly the same translation, skip */
+            if (!memcmp(mapped, &target, sizeof(target))) {
+                trace_vtd_page_walk_one_skip_map(entry->iova, entry->addr_mask,
+                                                 entry->translated_addr);
+                return 0;
+            } else {
+                /*
+                 * Translation changed.  This should not happen with
+                 * "intel_iommu=on,strict", but it can happen when
+                 * delayed flushing is used in guest IOMMU driver
+                 * (when without "strict") when page A is reused
+                 * before its previous unmap (the unmap can still be
+                 * queued in the delayed flushing queue).  Now we do
+                 * our best to remap.  Note that there will be a small
+                 * window that we don't have map at all.  But that's
+                 * the best effort we can do, and logically
+                 * well-behaved guests should not really be using this
+                 * DMA region yet so we should be very safe.
+                 */
+                IOMMUAccessFlags cache_perm = entry->perm;
+                int ret;
+
+                /* Emulate an UNMAP */
+                entry->perm = IOMMU_NONE;
+                trace_vtd_page_walk_one(info->domain_id,
+                                        entry->iova,
+                                        entry->translated_addr,
+                                        entry->addr_mask,
+                                        entry->perm);
+                ret = hook_fn(entry, private);
+                if (ret) {
+                    return ret;
+                }
+                /* Drop any existing mapping */
+                iova_tree_remove(as->iova_tree, &target);
+                /* Recover the correct permission */
+                entry->perm = cache_perm;
+            }
+        }
+        iova_tree_insert(as->iova_tree, &target);
+    } else {
+        if (!mapped) {
+            /* Skip since we didn't map this range at all */
+            trace_vtd_page_walk_one_skip_unmap(entry->iova, entry->addr_mask);
+            return 0;
+        }
+        iova_tree_remove(as->iova_tree, &target);
+    }
+
     trace_vtd_page_walk_one(info->domain_id, entry->iova,
                             entry->translated_addr, entry->addr_mask,
                             entry->perm);
@@ -2804,6 +2865,7 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
         vtd_dev_as->devfn = (uint8_t)devfn;
         vtd_dev_as->iommu_state = s;
         vtd_dev_as->context_cache_entry.context_cache_gen = 0;
+        vtd_dev_as->iova_tree = iova_tree_new();
 
         /*
          * Memory region relationships looks like (Address range shows
@@ -2856,6 +2918,7 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
     hwaddr start = n->start;
     hwaddr end = n->end;
     IntelIOMMUState *s = as->iommu_state;
+    DMAMap map;
 
     /*
      * Note: all the codes in this function has a assumption that IOVA
@@ -2900,6 +2963,10 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
                              VTD_PCI_FUNC(as->devfn),
                              entry.iova, size);
 
+    map.iova = entry.iova;
+    map.size = entry.addr_mask;
+    iova_tree_remove(as->iova_tree, &map);
+
     memory_region_notify_one(n, &entry);
 }
 
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index ca23ba9fad..d8194b80e3 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -40,6 +40,8 @@ vtd_replay_ce_valid(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t domain, uint6
 vtd_replay_ce_invalid(uint8_t bus, uint8_t dev, uint8_t fn) "replay invalid context device %02"PRIx8":%02"PRIx8".%02"PRIx8
 vtd_page_walk_level(uint64_t addr, uint32_t level, uint64_t start, uint64_t end) "walk (base=0x%"PRIx64", level=%"PRIu32") iova range 0x%"PRIx64" - 0x%"PRIx64
 vtd_page_walk_one(uint16_t domain, uint64_t iova, uint64_t gpa, uint64_t mask, int perm) "domain 0x%"PRIu16" iova 0x%"PRIx64" -> gpa 0x%"PRIx64" mask 0x%"PRIx64" perm %d"
+vtd_page_walk_one_skip_map(uint64_t iova, uint64_t mask, uint64_t translated) "iova 0x%"PRIx64" mask 0x%"PRIx64" translated 0x%"PRIx64
+vtd_page_walk_one_skip_unmap(uint64_t iova, uint64_t mask) "iova 0x%"PRIx64" mask 0x%"PRIx64
 vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
 vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
 vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Qemu-devel] [PATCH v3 10/12] intel-iommu: simplify page walk logic
  2018-05-17  8:59 [Qemu-devel] [PATCH v3 00/12] intel-iommu: nested vIOMMU, cleanups, bug fixes Peter Xu
                   ` (8 preceding siblings ...)
  2018-05-17  8:59 ` [Qemu-devel] [PATCH v3 09/12] intel-iommu: maintain per-device iova ranges Peter Xu
@ 2018-05-17  8:59 ` Peter Xu
  2018-05-17  8:59 ` [Qemu-devel] [PATCH v3 11/12] intel-iommu: new vtd_sync_shadow_page_table_range Peter Xu
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 27+ messages in thread
From: Peter Xu @ 2018-05-17  8:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	peterx, Jason Wang

Let's move the notify_unmap check into the new vtd_page_walk_one()
function so that we can greatly simplify the vtd_page_walk_level()
logic.

No functional change at all.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 66 ++++++++++++++++++++-----------------------
 hw/i386/trace-events  |  1 -
 2 files changed, 30 insertions(+), 37 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 5a5175a4ed..272e49ff66 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -779,6 +779,11 @@ static int vtd_page_walk_one(IOMMUTLBEntry *entry, vtd_page_walk_info *info)
     };
     DMAMap *mapped = iova_tree_find(as->iova_tree, &target);
 
+    if (entry->perm == IOMMU_NONE && !info->notify_unmap) {
+        trace_vtd_page_walk_one_skip_unmap(entry->iova, entry->addr_mask);
+        return 0;
+    }
+
     assert(hook_fn);
 
     /* Update local IOVA mapped ranges */
@@ -894,45 +899,34 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
          */
         entry_valid = read_cur | write_cur;
 
-        entry.target_as = &address_space_memory;
-        entry.iova = iova & subpage_mask;
-        entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
-        entry.addr_mask = ~subpage_mask;
-
-        if (vtd_is_last_slpte(slpte, level)) {
-            /* NOTE: this is only meaningful if entry_valid == true */
-            entry.translated_addr = vtd_get_slpte_addr(slpte, info->aw);
-            if (!entry_valid && !info->notify_unmap) {
-                trace_vtd_page_walk_skip_perm(iova, iova_next);
-                goto next;
-            }
-            ret = vtd_page_walk_one(&entry, info);
-            if (ret < 0) {
-                return ret;
-            }
-        } else {
-            if (!entry_valid) {
-                if (info->notify_unmap) {
-                    /*
-                     * The whole entry is invalid; unmap it all.
-                     * Translated address is meaningless, zero it.
-                     */
-                    entry.translated_addr = 0x0;
-                    ret = vtd_page_walk_one(&entry, info);
-                    if (ret < 0) {
-                        return ret;
-                    }
-                } else {
-                    trace_vtd_page_walk_skip_perm(iova, iova_next);
-                }
-                goto next;
-            }
+        if (!vtd_is_last_slpte(slpte, level) && entry_valid) {
+            /*
+             * This is a valid PDE (or even bigger than PDE).  We need
+             * to walk one further level.
+             */
             ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, info->aw),
                                       iova, MIN(iova_next, end), level - 1,
                                       read_cur, write_cur, info);
-            if (ret < 0) {
-                return ret;
-            }
+        } else {
+            /*
+             * This means we are either:
+             *
+             * (1) the real page entry (either 4K page, or huge page)
+             * (2) the whole range is invalid
+             *
+             * In either case, we send an IOTLB notification down.
+             */
+            entry.target_as = &address_space_memory;
+            entry.iova = iova & subpage_mask;
+            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
+            entry.addr_mask = ~subpage_mask;
+            /* NOTE: this is only meaningful if entry_valid == true */
+            entry.translated_addr = vtd_get_slpte_addr(slpte, info->aw);
+            ret = vtd_page_walk_one(&entry, info);
+        }
+
+        if (ret < 0) {
+            return ret;
         }
 
 next:
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index d8194b80e3..e14d06ec83 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -43,7 +43,6 @@ vtd_page_walk_one(uint16_t domain, uint64_t iova, uint64_t gpa, uint64_t mask, i
 vtd_page_walk_one_skip_map(uint64_t iova, uint64_t mask, uint64_t translated) "iova 0x%"PRIx64" mask 0x%"PRIx64" translated 0x%"PRIx64
 vtd_page_walk_one_skip_unmap(uint64_t iova, uint64_t mask) "iova 0x%"PRIx64" mask 0x%"PRIx64
 vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
-vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
 vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
 vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
 vtd_as_unmap_whole(uint8_t bus, uint8_t slot, uint8_t fn, uint64_t iova, uint64_t size) "Device %02x:%02x.%x start 0x%"PRIx64" size 0x%"PRIx64
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Qemu-devel] [PATCH v3 11/12] intel-iommu: new vtd_sync_shadow_page_table_range
  2018-05-17  8:59 [Qemu-devel] [PATCH v3 00/12] intel-iommu: nested vIOMMU, cleanups, bug fixes Peter Xu
                   ` (9 preceding siblings ...)
  2018-05-17  8:59 ` [Qemu-devel] [PATCH v3 10/12] intel-iommu: simplify page walk logic Peter Xu
@ 2018-05-17  8:59 ` Peter Xu
  2018-05-17  8:59 ` [Qemu-devel] [PATCH v3 12/12] intel-iommu: new sync_shadow_page_table Peter Xu
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 27+ messages in thread
From: Peter Xu @ 2018-05-17  8:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	peterx, Jason Wang

Pick part of the page walk implementation out into this single
function.  For now it is only used for PSIs, but it can be used not
only for invalidations but for any place where we want to synchronize
the shadow page table.  No functional change at all.

The context entry is passed in explicitly to avoid fetching it again.
However, the helper can also fetch the entry on its own, since
follow-up patches may call it without a context entry.
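
For reference, the two call styles look like this (sketch; the first
call appears in this patch's PSI invalidation path, the second is how
the next patch's vtd_sync_shadow_page_table() wrapper invokes it):

    /* PSI path: the caller already holds the context entry */
    vtd_sync_shadow_page_table_range(vtd_as, &ce, addr, size);

    /* Domain/context invalidation path: let the helper fetch it */
    vtd_sync_shadow_page_table_range(vtd_as, NULL, 0, UINT64_MAX);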

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 65 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 48 insertions(+), 17 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 272e49ff66..a1a2a009c1 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1018,6 +1018,53 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
     return 0;
 }
 
+static int vtd_sync_shadow_page_hook(IOMMUTLBEntry *entry,
+                                     void *private)
+{
+    memory_region_notify_iommu((IOMMUMemoryRegion *)private, *entry);
+    return 0;
+}
+
+/* If context entry is NULL, we'll try to fetch it on our own. */
+static int vtd_sync_shadow_page_table_range(VTDAddressSpace *vtd_as,
+                                            VTDContextEntry *ce,
+                                            hwaddr addr, hwaddr size)
+{
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    vtd_page_walk_info info = {
+        .hook_fn = vtd_sync_shadow_page_hook,
+        .private = (void *)&vtd_as->iommu,
+        .notify_unmap = true,
+        .aw = s->aw_bits,
+        .as = vtd_as,
+    };
+    VTDContextEntry ce_cache;
+    int ret;
+
+    if (ce) {
+        /* If the caller provided context entry, use it */
+        ce_cache = *ce;
+    } else {
+        /* If the caller didn't provide ce, try to fetch */
+        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
+                                       vtd_as->devfn, &ce_cache);
+        if (ret) {
+            /*
+             * This should not really happen, but in case it happens,
+             * we just skip the sync for this time.  After all we don't
+             * even have the root table pointer!
+             */
+            trace_vtd_err("Detected invalid context entry when "
+                          "trying to sync shadow page table");
+            return 0;
+        }
+    }
+
+    info.domain_id = VTD_CONTEXT_ENTRY_DID(ce_cache.hi);
+
+    return vtd_page_walk(&ce_cache, addr, addr + size, &info);
+}
+
 /*
  * Fetch translation type for specific device. Returns <0 if error
  * happens, otherwise return the shifted type to check against
@@ -1493,13 +1540,6 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
     }
 }
 
-static int vtd_page_invalidate_notify_hook(IOMMUTLBEntry *entry,
-                                           void *private)
-{
-    memory_region_notify_iommu((IOMMUMemoryRegion *)private, *entry);
-    return 0;
-}
-
 static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
                                            uint16_t domain_id, hwaddr addr,
                                            uint8_t am)
@@ -1514,20 +1554,11 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
                                        vtd_as->devfn, &ce);
         if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
             if (vtd_as_notify_mappings(vtd_as)) {
-                vtd_page_walk_info info = {
-                    .hook_fn = vtd_page_invalidate_notify_hook,
-                    .private = (void *)&vtd_as->iommu,
-                    .notify_unmap = true,
-                    .aw = s->aw_bits,
-                    .as = vtd_as,
-                    .domain_id = domain_id,
-                };
-
                 /*
                  * For MAP-inclusive notifiers, we need to walk the
                  * page table to sync the shadow page table.
                  */
-                vtd_page_walk(&ce, addr, addr + size, &info);
+                vtd_sync_shadow_page_table_range(vtd_as, &ce, addr, size);
             } else {
                 /*
                  * For UNMAP-only notifiers, we don't need to walk the
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Qemu-devel] [PATCH v3 12/12] intel-iommu: new sync_shadow_page_table
  2018-05-17  8:59 [Qemu-devel] [PATCH v3 00/12] intel-iommu: nested vIOMMU, cleanups, bug fixes Peter Xu
                   ` (10 preceding siblings ...)
  2018-05-17  8:59 ` [Qemu-devel] [PATCH v3 11/12] intel-iommu: new vtd_sync_shadow_page_table_range Peter Xu
@ 2018-05-17  8:59 ` Peter Xu
  2018-05-17 21:06   ` Michael S. Tsirkin
  2018-05-17 19:49 ` [Qemu-devel] [PATCH v3 00/12] intel-iommu: nested vIOMMU, cleanups, bug fixes Jintack Lim
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 27+ messages in thread
From: Peter Xu @ 2018-05-17  8:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	peterx, Jason Wang

Firstly, introduce the vtd_sync_shadow_page_table() helper to resync the
whole shadow page table of an IOMMU address space.  Meanwhile, when we
receive domain invalidation or similar requests (for example, context
entry invalidations, global invalidations, ...), we should not really
run the replay logic; instead we can now use the new sync shadow page
table API to resync the whole shadow page table.

There will be two major differences:

1. We don't unmap-all before walking the page table; we just sync.  The
   global unmap-all can create a very small window in which the page
   table is invalid or incomplete

2. We only walk the page table once now (while replay can be triggered
   multiple times depending on how many notifiers there are)

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index a1a2a009c1..fbb2f763f0 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1065,6 +1065,11 @@ static int vtd_sync_shadow_page_table_range(VTDAddressSpace *vtd_as,
     return vtd_page_walk(&ce_cache, addr, addr + size, &info);
 }
 
+static int vtd_sync_shadow_page_table(VTDAddressSpace *vtd_as)
+{
+    return vtd_sync_shadow_page_table_range(vtd_as, NULL, 0, UINT64_MAX);
+}
+
 /*
  * Fetch translation type for specific device. Returns <0 if error
  * happens, otherwise return the shifted type to check against
@@ -1397,7 +1402,7 @@ static void vtd_iommu_replay_all(IntelIOMMUState *s)
     VTDAddressSpace *vtd_as;
 
     QLIST_FOREACH(vtd_as, &s->notifiers_list, next) {
-        memory_region_iommu_replay_all(&vtd_as->iommu);
+        vtd_sync_shadow_page_table(vtd_as);
     }
 }
 
@@ -1470,14 +1475,13 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
                 vtd_switch_address_space(vtd_as);
                 /*
                  * So a device is moving out of (or moving into) a
-                 * domain, a replay() suites here to notify all the
-                 * IOMMU_NOTIFIER_MAP registers about this change.
+                 * domain, resync the shadow page table.
                  * This won't bring bad even if we have no such
                  * notifier registered - the IOMMU notification
                  * framework will skip MAP notifications if that
                  * happened.
                  */
-                memory_region_iommu_replay_all(&vtd_as->iommu);
+                vtd_sync_shadow_page_table(vtd_as);
             }
         }
     }
@@ -1535,7 +1539,7 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
         if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
                                       vtd_as->devfn, &ce) &&
             domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
-            memory_region_iommu_replay_all(&vtd_as->iommu);
+            vtd_sync_shadow_page_table(vtd_as);
         }
     }
 }
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v3 09/12] intel-iommu: maintain per-device iova ranges
  2018-05-17  8:59 ` [Qemu-devel] [PATCH v3 09/12] intel-iommu: maintain per-device iova ranges Peter Xu
@ 2018-05-17  9:46   ` Peter Xu
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Xu @ 2018-05-17  9:46 UTC (permalink / raw)
  To: qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	Jason Wang

On Thu, May 17, 2018 at 04:59:24PM +0800, Peter Xu wrote:

[...]

> +    /* Update local IOVA mapped ranges */
> +    if (entry->perm) {
> +        if (mapped) {
> +            /* If it's exactly the same translation, skip */
> +            if (!memcmp(mapped, &target, sizeof(target))) {
> +                trace_vtd_page_walk_one_skip_map(entry->iova, entry->addr_mask,
> +                                                 entry->translated_addr);
> +                return 0;
> +            } else {
> +                /*
> +                 * Translation changed.  This should not happen with
> +                 * "intel_iommu=on,strict", but it can happen when
> +                 * delayed flushing is used in guest IOMMU driver
> +                 * (when without "strict") when page A is reused
> +                 * before its previous unmap (the unmap can still be
> +                 * queued in the delayed flushing queue).  Now we do

This comment is wrong.  We can ignore the comment above for now since,
as I explained in the other thread, Linux IOVA deferred flushing won't
free an IOVA range until the unmap is flushed.  But the comment below
is still valid.

Regards,

> +                 * our best to remap.  Note that there will be a small
> +                 * window that we don't have map at all.  But that's
> +                 * the best effort we can do, and logically
> +                 * well-behaved guests should not really be using this
> +                 * DMA region yet so we should be very safe.
> +                 */
> +                IOMMUAccessFlags cache_perm = entry->perm;
> +                int ret;
> +
> +                /* Emulate an UNMAP */
> +                entry->perm = IOMMU_NONE;
> +                trace_vtd_page_walk_one(info->domain_id,
> +                                        entry->iova,
> +                                        entry->translated_addr,
> +                                        entry->addr_mask,
> +                                        entry->perm);
> +                ret = hook_fn(entry, private);
> +                if (ret) {
> +                    return ret;
> +                }
> +                /* Drop any existing mapping */
> +                iova_tree_remove(as->iova_tree, &target);
> +                /* Recover the correct permission */
> +                entry->perm = cache_perm;
> +            }
> +        }
> +        iova_tree_insert(as->iova_tree, &target);
> +    } else {
> +        if (!mapped) {
> +            /* Skip since we didn't map this range at all */
> +            trace_vtd_page_walk_one_skip_unmap(entry->iova, entry->addr_mask);
> +            return 0;
> +        }
> +        iova_tree_remove(as->iova_tree, &target);
> +    }
> +
>      trace_vtd_page_walk_one(info->domain_id, entry->iova,
>                              entry->translated_addr, entry->addr_mask,
>                              entry->perm);

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v3 00/12] intel-iommu: nested vIOMMU, cleanups, bug fixes
  2018-05-17  8:59 [Qemu-devel] [PATCH v3 00/12] intel-iommu: nested vIOMMU, cleanups, bug fixes Peter Xu
                   ` (11 preceding siblings ...)
  2018-05-17  8:59 ` [Qemu-devel] [PATCH v3 12/12] intel-iommu: new sync_shadow_page_table Peter Xu
@ 2018-05-17 19:49 ` Jintack Lim
  2018-05-18  6:26   ` Peter Xu
  2018-05-17 21:04 ` Michael S. Tsirkin
  2018-05-17 21:08 ` Michael S. Tsirkin
  14 siblings, 1 reply; 27+ messages in thread
From: Jintack Lim @ 2018-05-17 19:49 UTC (permalink / raw)
  To: Peter Xu
  Cc: QEMU Devel Mailing List, Tian Kevin, Michael S . Tsirkin,
	Alex Williamson, Jason Wang

On Thu, May 17, 2018 at 4:59 AM, Peter Xu <peterx@redhat.com> wrote:
> (Hello, Jintack, Feel free to test this branch again against your scp
>  error case when you got free time)

Hi Peter,

>
> I rewrote some of the patches in V3.  Major changes:
>
> - Dropped mergable interval tree, instead introduced IOVA tree, which
>   is even simpler.
>
> - Fix the scp error issue that Jintack reported.  Please see patches
>   for detailed information.  That's the major reason to rewrite a few
>   of the patches.  We use replay for domain flushes are possibly
>   incorrect in the past.  The thing is that IOMMU replay has an
>   "definition" that "we should only send MAP when new page detected",
>   while for shadow page syncing we actually need something else than
>   that.  So in this version I started to use a new
>   vtd_sync_shadow_page_table() helper to do the page sync.

I checked that the scp problem I had (i.e. scp from the host to a
guest with a virtual IOMMU and an assigned network device) was gone
with this patch series. Cool!

Please feel free to move this tag if this is not the right place!
Tested-by: Jintack Lim <jintack@cs.columbia.edu>

Thanks,
Jintack

>
> - Some other refines after the refactoring.
>
> I'll add unit test for the IOVA tree after this series merged to make
> sure we won't switch to another new tree implementaion...
>
> The element size in the new IOVA tree should be around
> sizeof(GTreeNode + IOMMUTLBEntry) ~= (5*8+4*8) = 72 bytes.  So the
> worst case usage ratio would be 72/4K=2%, which still seems acceptable
> (it means 8G L2 guest will use 8G*2%=160MB as metadata to maintain the
> mapping in QEMU).
>
> I did explicit test with scp this time, copying 1G sized file for >10
> times on each of the following case:
>
> - L1 guest, with vIOMMU and with assigned device
> - L2 guest, without vIOMMU and with assigned device
> - L2 guest, with vIOMMU (so 3-layer nested IOMMU) and with assigned device
>
> Please review.  Thanks,
>
> (Below are old content from previous cover letter)
>
> ==========================
>
> v2:
> - fix patchew code style warnings
> - interval tree: postpone malloc when inserting; simplify node remove
>   a bit where proper [Jason]
> - fix up comment and commit message for iommu lock patch [Kevin]
> - protect context cache too using the iommu lock [Kevin, Jason]
> - add vast comment in patch 8 to explain the modify-PTE problem
>   [Jason, Kevin]
>
> Online repo:
>
>   https://github.com/xzpeter/qemu/tree/fix-vtd-dma
>
> This series fixes several major problems that current code has:
>
> - Issue 1: when getting very big PSI UNMAP invalidations, the current
>   code is buggy in that we might skip the notification while actually
>   we should always send that notification.
>
> - Issue 2: IOTLB is not thread safe, while block dataplane can be
>   accessing and updating it in parallel.
>
> - Issue 3: For devices that only registered with UNMAP-only notifiers,
>   we don't really need to do page walking for PSIs, we can directly
>   deliver the notification down.  For example, vhost.
>
> - Issue 4: unsafe window for MAP notified devices like vfio-pci (and
>   in the future, vDPA as well).  The problem is that, now for domain
>   invalidations we do this to make sure the shadow page tables are
>   correctly synced:
>
>   1. unmap the whole address space
>   2. replay the whole address space, map existing pages
>
>   However during step 1 and 2 there will be a very tiny window (it can
>   be as big as 3ms) that the shadow page table is either invalid or
>   incomplete (since we're rebuilding it up).  That's fatal error since
>   devices never know that happending and it's still possible to DMA to
>   memories.
>
> Patch 1 fixes issue 1.  I put it at the first since it's picked from
> an old post.
>
> Patch 2 is a cleanup to remove useless IntelIOMMUNotifierNode struct.
>
> Patch 3 fixes issue 2.
>
> Patch 4 fixes issue 3.
>
> Patch 5-9 fix issue 4.  Here a very simple interval tree is
> implemented based on Gtree.  It's different with general interval tree
> in that it does not allow user to pass in private data (e.g.,
> translated addresses).  However that benefits us that then we can
> merge adjacent interval leaves so that hopefully we won't consume much
> memory even if the mappings are a lot (that happens for nested virt -
> when mapping the whole L2 guest RAM range, it can be at least in GBs).
>
> Patch 10 is another big cleanup only can work after patch 9.
>
> Tests:
>
> - device assignments to L1, even L2 guests.  With this series applied
>   (and the kernel IOMMU patches: https://lkml.org/lkml/2018/4/18/5),
>   we can even nest vIOMMU now, e.g., we can specify vIOMMU in L2 guest
>   with assigned devices and things will work.  We can't before.
>
> - vhost smoke test for regression.
>
> Please review.  Thanks,
>
> Peter Xu (12):
>   intel-iommu: send PSI always even if across PDEs
>   intel-iommu: remove IntelIOMMUNotifierNode
>   intel-iommu: add iommu lock
>   intel-iommu: only do page walk for MAP notifiers
>   intel-iommu: introduce vtd_page_walk_info
>   intel-iommu: pass in address space when page walk
>   intel-iommu: trace domain id during page walk
>   util: implement simple iova tree
>   intel-iommu: maintain per-device iova ranges
>   intel-iommu: simplify page walk logic
>   intel-iommu: new vtd_sync_shadow_page_table_range
>   intel-iommu: new sync_shadow_page_table
>
>  include/hw/i386/intel_iommu.h |  19 +-
>  include/qemu/iova-tree.h      | 134 ++++++++++++
>  hw/i386/intel_iommu.c         | 381 +++++++++++++++++++++++++---------
>  util/iova-tree.c              | 114 ++++++++++
>  MAINTAINERS                   |   6 +
>  hw/i386/trace-events          |   5 +-
>  util/Makefile.objs            |   1 +
>  7 files changed, 556 insertions(+), 104 deletions(-)
>  create mode 100644 include/qemu/iova-tree.h
>  create mode 100644 util/iova-tree.c
>
> --
> 2.17.0
>
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v3 01/12] intel-iommu: send PSI always even if across PDEs
  2018-05-17  8:59 ` [Qemu-devel] [PATCH v3 01/12] intel-iommu: send PSI always even if across PDEs Peter Xu
@ 2018-05-17 21:00   ` Michael S. Tsirkin
  2018-05-18  8:23   ` Auger Eric
  1 sibling, 0 replies; 27+ messages in thread
From: Michael S. Tsirkin @ 2018-05-17 21:00 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, Tian Kevin, Alex Williamson, Jintack Lim, Jason Wang

On Thu, May 17, 2018 at 04:59:16PM +0800, Peter Xu wrote:
> During IOVA page table walking, there is a special case when the PSI
> covers one whole PDE (Page Directory Entry, which contains 512 Page
> Table Entries) or more.  In the past, we skip that entry and we don't
> notify the IOMMU notifiers.  This is not correct.  We should send UNMAP
> notification to registered UNMAP notifiers in this case.
> 
> For UNMAP only notifiers, this might cause IOTLBs cached in the devices
> even if they were already invalid.  For MAP/UNMAP notifiers like
> vfio-pci, this will cause stale page mappings.
> 
> This special case doesn't trigger often, but it is very easy to be
> triggered by nested device assignments, since in that case we'll
> possibly map the whole L2 guest RAM region into the device's IOVA
> address space (several GBs at least), which is far bigger than normal
> kernel driver usages of the device (tens of MBs normally).
> 
> Without this patch applied to L1 QEMU, nested device assignment to L2
> guests will dump some errors like:
> 
> qemu-system-x86_64: VFIO_MAP_DMA: -17
> qemu-system-x86_64: vfio_dma_map(0x557305420c30, 0xad000, 0x1000,
>                     0x7f89a920d000) = -17 (File exists)

Could you add description of security implications if any and their
severity?

> Acked-by: Jason Wang <jasowang@redhat.com>
> [peterx: rewrite the commit message]
> Signed-off-by: Peter Xu <peterx@redhat.com>

Cc stable here? For any other patches as well?
> ---
>  hw/i386/intel_iommu.c | 42 ++++++++++++++++++++++++++++++------------
>  1 file changed, 30 insertions(+), 12 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index fb31de9416..b359efd6f9 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -722,6 +722,15 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
>  
>  typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
>  
> +static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
> +                             vtd_page_walk_hook hook_fn, void *private)
> +{
> +    assert(hook_fn);
> +    trace_vtd_page_walk_one(level, entry->iova, entry->translated_addr,
> +                            entry->addr_mask, entry->perm);
> +    return hook_fn(entry, private);
> +}
> +
>  /**
>   * vtd_page_walk_level - walk over specific level for IOVA range
>   *
> @@ -781,28 +790,37 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
>           */
>          entry_valid = read_cur | write_cur;
>  
> +        entry.target_as = &address_space_memory;
> +        entry.iova = iova & subpage_mask;
> +        entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
> +        entry.addr_mask = ~subpage_mask;
> +
>          if (vtd_is_last_slpte(slpte, level)) {
> -            entry.target_as = &address_space_memory;
> -            entry.iova = iova & subpage_mask;
>              /* NOTE: this is only meaningful if entry_valid == true */
>              entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
> -            entry.addr_mask = ~subpage_mask;
> -            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
>              if (!entry_valid && !notify_unmap) {
>                  trace_vtd_page_walk_skip_perm(iova, iova_next);
>                  goto next;
>              }
> -            trace_vtd_page_walk_one(level, entry.iova, entry.translated_addr,
> -                                    entry.addr_mask, entry.perm);
> -            if (hook_fn) {
> -                ret = hook_fn(&entry, private);
> -                if (ret < 0) {
> -                    return ret;
> -                }
> +            ret = vtd_page_walk_one(&entry, level, hook_fn, private);
> +            if (ret < 0) {
> +                return ret;
>              }
>          } else {
>              if (!entry_valid) {
> -                trace_vtd_page_walk_skip_perm(iova, iova_next);
> +                if (notify_unmap) {
> +                    /*
> +                     * The whole entry is invalid; unmap it all.
> +                     * Translated address is meaningless, zero it.
> +                     */
> +                    entry.translated_addr = 0x0;
> +                    ret = vtd_page_walk_one(&entry, level, hook_fn, private);
> +                    if (ret < 0) {
> +                        return ret;
> +                    }
> +                } else {
> +                    trace_vtd_page_walk_skip_perm(iova, iova_next);
> +                }
>                  goto next;
>              }
>              ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, aw), iova,
> -- 
> 2.17.0

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v3 00/12] intel-iommu: nested vIOMMU, cleanups, bug fixes
  2018-05-17  8:59 [Qemu-devel] [PATCH v3 00/12] intel-iommu: nested vIOMMU, cleanups, bug fixes Peter Xu
                   ` (12 preceding siblings ...)
  2018-05-17 19:49 ` [Qemu-devel] [PATCH v3 00/12] intel-iommu: nested vIOMMU, cleanups, bug fixes Jintack Lim
@ 2018-05-17 21:04 ` Michael S. Tsirkin
  2018-05-18  6:34   ` Peter Xu
  2018-05-17 21:08 ` Michael S. Tsirkin
  14 siblings, 1 reply; 27+ messages in thread
From: Michael S. Tsirkin @ 2018-05-17 21:04 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, Tian Kevin, Alex Williamson, Jintack Lim, Jason Wang

On Thu, May 17, 2018 at 04:59:15PM +0800, Peter Xu wrote:
> (Hello, Jintack, Feel free to test this branch again against your scp
>  error case when you got free time)
> 
> I rewrote some of the patches in V3.  Major changes:
> 
> - Dropped mergable interval tree, instead introduced IOVA tree, which
>   is even simpler.
> 
> - Fix the scp error issue that Jintack reported.  Please see patches
>   for detailed information.  That's the major reason to rewrite a few
>   of the patches.  We use replay for domain flushes are possibly
>   incorrect in the past.  The thing is that IOMMU replay has an
>   "definition" that "we should only send MAP when new page detected",
>   while for shadow page syncing we actually need something else than
>   that.  So in this version I started to use a new
>   vtd_sync_shadow_page_table() helper to do the page sync.
> 
> - Some other refines after the refactoring.
> 
> I'll add unit test for the IOVA tree after this series merged to make
> sure we won't switch to another new tree implementaion...
> 
> The element size in the new IOVA tree should be around
> sizeof(GTreeNode + IOMMUTLBEntry) ~= (5*8+4*8) = 72 bytes.  So the
> worst case usage ratio would be 72/4K=2%, which still seems acceptable
> (it means 8G L2 guest will use 8G*2%=160MB as metadata to maintain the
> mapping in QEMU).
> 
> I did explicit test with scp this time, copying 1G sized file for >10
> times on each of the following case:
> 
> - L1 guest, with vIOMMU and with assigned device
> - L2 guest, without vIOMMU and with assigned device
> - L2 guest, with vIOMMU (so 3-layer nested IOMMU) and with assigned device
> 
> Please review.  Thanks,
> 
> (Below are old content from previous cover letter)
> 
> ==========================
> 
> v2:
> - fix patchew code style warnings
> - interval tree: postpone malloc when inserting; simplify node remove
>   a bit where proper [Jason]
> - fix up comment and commit message for iommu lock patch [Kevin]
> - protect context cache too using the iommu lock [Kevin, Jason]
> - add vast comment in patch 8 to explain the modify-PTE problem
>   [Jason, Kevin]
> 
> Online repo:
> 
>   https://github.com/xzpeter/qemu/tree/fix-vtd-dma
> 
> This series fixes several major problems that current code has:
> 
> - Issue 1: when getting very big PSI UNMAP invalidations, the current
>   code is buggy in that we might skip the notification while actually
>   we should always send that notification.

security issue

> - Issue 2: IOTLB is not thread safe, while block dataplane can be
>   accessing and updating it in parallel.

security issue

> - Issue 3: For devices that only registered with UNMAP-only notifiers,
>   we don't really need to do page walking for PSIs, we can directly
>   deliver the notification down.  For example, vhost.

optimization

> - Issue 4: unsafe window for MAP notified devices like vfio-pci (and
>   in the future, vDPA as well).  The problem is that, now for domain
>   invalidations we do this to make sure the shadow page tables are
>   correctly synced:
> 
>   1. unmap the whole address space
>   2. replay the whole address space, map existing pages
> 
>   However during step 1 and 2 there will be a very tiny window (it can
>   be as big as 3ms) that the shadow page table is either invalid or
>   incomplete (since we're rebuilding it up).  That's fatal error since
>   devices never know that happending and it's still possible to DMA to
>   memories.

correctness but not a security issue

> Patch 1 fixes issue 1.  I put it at the first since it's picked from
> an old post.
> 
> Patch 2 is a cleanup to remove useless IntelIOMMUNotifierNode struct.
> 
> Patch 3 fixes issue 2.
> 
> Patch 4 fixes issue 3.
> 
> Patch 5-9 fix issue 4.  Here a very simple interval tree is
> implemented based on Gtree.  It's different with general interval tree
> in that it does not allow user to pass in private data (e.g.,
> translated addresses).  However that benefits us that then we can
> merge adjacent interval leaves so that hopefully we won't consume much
> memory even if the mappings are a lot (that happens for nested virt -
> when mapping the whole L2 guest RAM range, it can be at least in GBs).
> 
> Patch 10 is another big cleanup only can work after patch 9.


So 1-2 are needed on stable. 1-9 would be nice to have
there too, even though they are big and it looks risky.

> Tests:
> 
> - device assignments to L1, even L2 guests.  With this series applied
>   (and the kernel IOMMU patches: https://lkml.org/lkml/2018/4/18/5),
>   we can even nest vIOMMU now, e.g., we can specify vIOMMU in L2 guest
>   with assigned devices and things will work.  We can't before.
> 
> - vhost smoke test for regression.
> 
> Please review.  Thanks,
> 
> Peter Xu (12):
>   intel-iommu: send PSI always even if across PDEs
>   intel-iommu: remove IntelIOMMUNotifierNode
>   intel-iommu: add iommu lock
>   intel-iommu: only do page walk for MAP notifiers
>   intel-iommu: introduce vtd_page_walk_info
>   intel-iommu: pass in address space when page walk
>   intel-iommu: trace domain id during page walk
>   util: implement simple iova tree
>   intel-iommu: maintain per-device iova ranges
>   intel-iommu: simplify page walk logic
>   intel-iommu: new vtd_sync_shadow_page_table_range
>   intel-iommu: new sync_shadow_page_table
> 
>  include/hw/i386/intel_iommu.h |  19 +-
>  include/qemu/iova-tree.h      | 134 ++++++++++++
>  hw/i386/intel_iommu.c         | 381 +++++++++++++++++++++++++---------
>  util/iova-tree.c              | 114 ++++++++++
>  MAINTAINERS                   |   6 +
>  hw/i386/trace-events          |   5 +-
>  util/Makefile.objs            |   1 +
>  7 files changed, 556 insertions(+), 104 deletions(-)
>  create mode 100644 include/qemu/iova-tree.h
>  create mode 100644 util/iova-tree.c
> 
> -- 
> 2.17.0

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v3 12/12] intel-iommu: new sync_shadow_page_table
  2018-05-17  8:59 ` [Qemu-devel] [PATCH v3 12/12] intel-iommu: new sync_shadow_page_table Peter Xu
@ 2018-05-17 21:06   ` Michael S. Tsirkin
  2018-05-18  6:22     ` Peter Xu
  0 siblings, 1 reply; 27+ messages in thread
From: Michael S. Tsirkin @ 2018-05-17 21:06 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, Tian Kevin, Alex Williamson, Jintack Lim, Jason Wang

On Thu, May 17, 2018 at 04:59:27PM +0800, Peter Xu wrote:
> Firstly, introduce the vtd_sync_shadow_page_table() helper to resync the
> whole shadow page table of an IOMMU address space.  Meanwhile, when we
> receive domain invalidation or similar requests (for example, context
> entry invalidations, global invalidations, ...), we should not really
> run the replay logic; instead we can now use the new sync shadow page
> table API to resync the whole shadow page table.
> 
> There will be two major differences:
> 
> 1. We don't unmap-all before walking the page table; we just sync.  The
>    global unmap-all can create a very small window in which the page
>    table is invalid or incomplete

The implication is that with vfio, device might stop working
without this change.


> 2. We only walk the page table once now (while replay can be triggered
>    multiple times depending on how many notifiers there are)
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/i386/intel_iommu.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index a1a2a009c1..fbb2f763f0 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1065,6 +1065,11 @@ static int vtd_sync_shadow_page_table_range(VTDAddressSpace *vtd_as,
>      return vtd_page_walk(&ce_cache, addr, addr + size, &info);
>  }
>  
> +static int vtd_sync_shadow_page_table(VTDAddressSpace *vtd_as)
> +{
> +    return vtd_sync_shadow_page_table_range(vtd_as, NULL, 0, UINT64_MAX);
> +}
> +
>  /*
>   * Fetch translation type for specific device. Returns <0 if error
>   * happens, otherwise return the shifted type to check against
> @@ -1397,7 +1402,7 @@ static void vtd_iommu_replay_all(IntelIOMMUState *s)
>      VTDAddressSpace *vtd_as;
>  
>      QLIST_FOREACH(vtd_as, &s->notifiers_list, next) {
> -        memory_region_iommu_replay_all(&vtd_as->iommu);
> +        vtd_sync_shadow_page_table(vtd_as);
>      }
>  }
>  
> @@ -1470,14 +1475,13 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
>                  vtd_switch_address_space(vtd_as);
>                  /*
>                   * So a device is moving out of (or moving into) a
> -                 * domain, a replay() suites here to notify all the
> -                 * IOMMU_NOTIFIER_MAP registers about this change.
> +                 * domain, resync the shadow page table.
>                   * This won't bring bad even if we have no such
>                   * notifier registered - the IOMMU notification
>                   * framework will skip MAP notifications if that
>                   * happened.
>                   */
> -                memory_region_iommu_replay_all(&vtd_as->iommu);
> +                vtd_sync_shadow_page_table(vtd_as);
>              }
>          }
>      }
> @@ -1535,7 +1539,7 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
>          if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>                                        vtd_as->devfn, &ce) &&
>              domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> -            memory_region_iommu_replay_all(&vtd_as->iommu);
> +            vtd_sync_shadow_page_table(vtd_as);
>          }
>      }
>  }
> -- 
> 2.17.0

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v3 00/12] intel-iommu: nested vIOMMU, cleanups, bug fixes
  2018-05-17  8:59 [Qemu-devel] [PATCH v3 00/12] intel-iommu: nested vIOMMU, cleanups, bug fixes Peter Xu
                   ` (13 preceding siblings ...)
  2018-05-17 21:04 ` Michael S. Tsirkin
@ 2018-05-17 21:08 ` Michael S. Tsirkin
  2018-05-18  6:30   ` Peter Xu
  14 siblings, 1 reply; 27+ messages in thread
From: Michael S. Tsirkin @ 2018-05-17 21:08 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, Tian Kevin, Alex Williamson, Jintack Lim, Jason Wang

On Thu, May 17, 2018 at 04:59:15PM +0800, Peter Xu wrote:
> (Hello, Jintack, Feel free to test this branch again against your scp
>  error case when you got free time)
> 
> I rewrote some of the patches in V3.  Major changes:
> 
> - Dropped mergable interval tree, instead introduced IOVA tree, which
>   is even simpler.
> 
> - Fix the scp error issue that Jintack reported.  Please see patches
>   for detailed information.  That's the major reason to rewrite a few
>   of the patches.  We use replay for domain flushes are possibly
>   incorrect in the past.  The thing is that IOMMU replay has an
>   "definition" that "we should only send MAP when new page detected",
>   while for shadow page syncing we actually need something else than
>   that.  So in this version I started to use a new
>   vtd_sync_shadow_page_table() helper to do the page sync.
> 
> - Some other refines after the refactoring.
> 
> I'll add unit test for the IOVA tree after this series merged to make
> sure we won't switch to another new tree implementaion...
> 
> The element size in the new IOVA tree should be around
> sizeof(GTreeNode + IOMMUTLBEntry) ~= (5*8+4*8) = 72 bytes.  So the
> worst case usage ratio would be 72/4K=2%, which still seems acceptable
> (it means 8G L2 guest will use 8G*2%=160MB as metadata to maintain the
> mapping in QEMU).
> 
> I did explicit test with scp this time, copying 1G sized file for >10
> times on each of the following case:
> 
> - L1 guest, with vIOMMU and with assigned device
> - L2 guest, without vIOMMU and with assigned device
> - L2 guest, with vIOMMU (so 3-layer nested IOMMU) and with assigned device
> 
> Please review.  Thanks,
> 
> (Below are old content from previous cover letter)
> 
> ==========================
> 
> v2:
> - fix patchew code style warnings
> - interval tree: postpone malloc when inserting; simplify node remove
>   a bit where proper [Jason]
> - fix up comment and commit message for iommu lock patch [Kevin]
> - protect context cache too using the iommu lock [Kevin, Jason]
> - add vast comment in patch 8 to explain the modify-PTE problem
>   [Jason, Kevin]
> 
> Online repo:
> 
>   https://github.com/xzpeter/qemu/tree/fix-vtd-dma
> 
> This series fixes several major problems that current code has:
> 
> - Issue 1: when getting very big PSI UNMAP invalidations, the current
>   code is buggy in that we might skip the notification while actually
>   we should always send that notification.
> 
> - Issue 2: IOTLB is not thread safe, while block dataplane can be
>   accessing and updating it in parallel.
> 
> - Issue 3: For devices that only registered with UNMAP-only notifiers,
>   we don't really need to do page walking for PSIs, we can directly
>   deliver the notification down.  For example, vhost.
> 
> - Issue 4: unsafe window for MAP notified devices like vfio-pci (and
>   in the future, vDPA as well).  The problem is that, now for domain
>   invalidations we do this to make sure the shadow page tables are
>   correctly synced:
> 
>   1. unmap the whole address space
>   2. replay the whole address space, map existing pages
> 
>   However during step 1 and 2 there will be a very tiny window (it can
>   be as big as 3ms) that the shadow page table is either invalid or
>   incomplete (since we're rebuilding it up).  That's fatal error since
>   devices never know that happending and it's still possible to DMA to
>   memories.
> 
> Patch 1 fixes issue 1.  I put it at the first since it's picked from
> an old post.
> 
> Patch 2 is a cleanup to remove useless IntelIOMMUNotifierNode struct.
> 
> Patch 3 fixes issue 2.
> 
> Patch 4 fixes issue 3.
> 
> Patch 5-9 fix issue 4.  Here a very simple interval tree is
> implemented based on Gtree.  It's different with general interval tree
> in that it does not allow user to pass in private data (e.g.,
> translated addresses).  However that benefits us that then we can
> merge adjacent interval leaves so that hopefully we won't consume much
> memory even if the mappings are a lot (that happens for nested virt -
> when mapping the whole L2 guest RAM range, it can be at least in GBs).
> 
> Patch 10 is another big cleanup only can work after patch 9.

I think patch numbers are wrong somehow.

Given you also want to tweak one comment, could
you please repost with this fix, and also
in commit log for each patch
- Cc stable
- for security patches mention as much, if possible
  add data about the issue and its severity

> Tests:
> 
> - device assignments to L1, even L2 guests.  With this series applied
>   (and the kernel IOMMU patches: https://lkml.org/lkml/2018/4/18/5),
>   we can even nest vIOMMU now, e.g., we can specify vIOMMU in L2 guest
>   with assigned devices and things will work.  We can't before.
> 
> - vhost smoke test for regression.
> 
> Please review.  Thanks,
> 
> Peter Xu (12):
>   intel-iommu: send PSI always even if across PDEs
>   intel-iommu: remove IntelIOMMUNotifierNode
>   intel-iommu: add iommu lock
>   intel-iommu: only do page walk for MAP notifiers
>   intel-iommu: introduce vtd_page_walk_info
>   intel-iommu: pass in address space when page walk
>   intel-iommu: trace domain id during page walk
>   util: implement simple iova tree
>   intel-iommu: maintain per-device iova ranges
>   intel-iommu: simplify page walk logic
>   intel-iommu: new vtd_sync_shadow_page_table_range
>   intel-iommu: new sync_shadow_page_table
> 
>  include/hw/i386/intel_iommu.h |  19 +-
>  include/qemu/iova-tree.h      | 134 ++++++++++++
>  hw/i386/intel_iommu.c         | 381 +++++++++++++++++++++++++---------
>  util/iova-tree.c              | 114 ++++++++++
>  MAINTAINERS                   |   6 +
>  hw/i386/trace-events          |   5 +-
>  util/Makefile.objs            |   1 +
>  7 files changed, 556 insertions(+), 104 deletions(-)
>  create mode 100644 include/qemu/iova-tree.h
>  create mode 100644 util/iova-tree.c
> 
> -- 
> 2.17.0

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v3 12/12] intel-iommu: new sync_shadow_page_table
  2018-05-17 21:06   ` Michael S. Tsirkin
@ 2018-05-18  6:22     ` Peter Xu
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Xu @ 2018-05-18  6:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, Tian Kevin, Alex Williamson, Jintack Lim, Jason Wang

On Fri, May 18, 2018 at 12:06:14AM +0300, Michael S. Tsirkin wrote:
> On Thu, May 17, 2018 at 04:59:27PM +0800, Peter Xu wrote:
> > Firstly, introduce the vtd_sync_shadow_page_table() helper to resync the
> > whole shadow page table of an IOMMU address space.  Meanwhile, when we
> > receive domain invalidation or similar requests (for example, context
> > entry invalidations, global invalidations, ...), we should not really
> > run the replay logic; instead we can now use the new sync shadow page
> > table API to resync the whole shadow page table.
> > 
> > There will be two major differences:
> > 
> > 1. We don't unmap-all before walking the page table; we just sync.  The
> >    global unmap-all can create a very small window in which the page
> >    table is invalid or incomplete
> 
> The implication is that with vfio, device might stop working
> without this change.

I guess it was working before.  However, patch 9 changed the page
walk to be stateful now that we have the IOVA tree (it was not
stateful before), so we might encounter the scp DMA error after
patch 9.  I still kept patches 9-12 separate since logically they are
isolated.  However, if you want, we can also squash patches 9-12 into
one single (but rather huge) patch.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v3 00/12] intel-iommu: nested vIOMMU, cleanups, bug fixes
  2018-05-17 19:49 ` [Qemu-devel] [PATCH v3 00/12] intel-iommu: nested vIOMMU, cleanups, bug fixes Jintack Lim
@ 2018-05-18  6:26   ` Peter Xu
  2018-05-18  6:28     ` Peter Xu
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Xu @ 2018-05-18  6:26 UTC (permalink / raw)
  To: Jintack Lim
  Cc: QEMU Devel Mailing List, Tian Kevin, Michael S . Tsirkin,
	Alex Williamson, Jason Wang

On Thu, May 17, 2018 at 03:49:05PM -0400, Jintack Lim wrote:
> On Thu, May 17, 2018 at 4:59 AM, Peter Xu <peterx@redhat.com> wrote:
> > (Hello, Jintack, Feel free to test this branch again against your scp
> >  error case when you got free time)
> 
> Hi Peter,
> 
> >
> > I rewrote some of the patches in V3.  Major changes:
> >
> > - Dropped mergable interval tree, instead introduced IOVA tree, which
> >   is even simpler.
> >
> > - Fix the scp error issue that Jintack reported.  Please see patches
> >   for detailed information.  That's the major reason to rewrite a few
> >   of the patches.  We use replay for domain flushes are possibly
> >   incorrect in the past.  The thing is that IOMMU replay has an
> >   "definition" that "we should only send MAP when new page detected",
> >   while for shadow page syncing we actually need something else than
> >   that.  So in this version I started to use a new
> >   vtd_sync_shadow_page_table() helper to do the page sync.
> 
> I checked that the scp problem I had (i.e. scp from the host to a
> guest with a virtual IOMMU and an assigned network device) was gone
> with this patch series. Cool!
> 
> Please feel free to move this tag if this is not the right place!
> Tested-by: Jintack Lim <jintack@cs.columbia.edu>

Thanks for the quick feedback!

I'll temporarily consider putting it at the last patch of series if no
one jumps out and tells me another more correct way.  Also I'll
possibly make bold to append your suggested-by too to further claim
your contribution to this problem.

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v3 00/12] intel-iommu: nested vIOMMU, cleanups, bug fixes
  2018-05-18  6:26   ` Peter Xu
@ 2018-05-18  6:28     ` Peter Xu
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Xu @ 2018-05-18  6:28 UTC (permalink / raw)
  To: Jintack Lim
  Cc: QEMU Devel Mailing List, Tian Kevin, Michael S . Tsirkin,
	Alex Williamson, Jason Wang

On Fri, May 18, 2018 at 02:26:59PM +0800, Peter Xu wrote:

[...]

> I'll temporarily consider putting it at the last patch of series if no
> one jumps out and tells me another more correct way.  Also I'll
> possibly make bold to append your suggested-by too to further claim
                                    ^^^^^^^^^^^^
Sorry, I mean "Reported-by".

> your contribution to this problem.

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v3 00/12] intel-iommu: nested vIOMMU, cleanups, bug fixes
  2018-05-17 21:08 ` Michael S. Tsirkin
@ 2018-05-18  6:30   ` Peter Xu
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Xu @ 2018-05-18  6:30 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, Tian Kevin, Alex Williamson, Jintack Lim, Jason Wang

On Fri, May 18, 2018 at 12:08:01AM +0300, Michael S. Tsirkin wrote:
> On Thu, May 17, 2018 at 04:59:15PM +0800, Peter Xu wrote:
> > (Hello, Jintack, Feel free to test this branch again against your scp
> >  error case when you got free time)
> > 
> > I rewrote some of the patches in V3.  Major changes:
> > 
> > - Dropped mergable interval tree, instead introduced IOVA tree, which
> >   is even simpler.
> > 
> > - Fix the scp error issue that Jintack reported.  Please see patches
> >   for detailed information.  That's the major reason to rewrite a few
> >   of the patches.  We use replay for domain flushes are possibly
> >   incorrect in the past.  The thing is that IOMMU replay has an
> >   "definition" that "we should only send MAP when new page detected",
> >   while for shadow page syncing we actually need something else than
> >   that.  So in this version I started to use a new
> >   vtd_sync_shadow_page_table() helper to do the page sync.
> > 
> > - Some other refines after the refactoring.
> > 
> > I'll add unit test for the IOVA tree after this series merged to make
> > sure we won't switch to another new tree implementaion...
> > 
> > The element size in the new IOVA tree should be around
> > sizeof(GTreeNode + IOMMUTLBEntry) ~= (5*8+4*8) = 72 bytes.  So the
> > worst case usage ratio would be 72/4K=2%, which still seems acceptable
> > (it means 8G L2 guest will use 8G*2%=160MB as metadata to maintain the
> > mapping in QEMU).
> > 
> > I did explicit test with scp this time, copying 1G sized file for >10
> > times on each of the following case:
> > 
> > - L1 guest, with vIOMMU and with assigned device
> > - L2 guest, without vIOMMU and with assigned device
> > - L2 guest, with vIOMMU (so 3-layer nested IOMMU) and with assigned device
> > 
> > Please review.  Thanks,
> > 
> > (Below are old content from previous cover letter)
> > 
> > ==========================
> > 
> > v2:
> > - fix patchew code style warnings
> > - interval tree: postpone malloc when inserting; simplify node remove
> >   a bit where proper [Jason]
> > - fix up comment and commit message for iommu lock patch [Kevin]
> > - protect context cache too using the iommu lock [Kevin, Jason]
> > - add vast comment in patch 8 to explain the modify-PTE problem
> >   [Jason, Kevin]
> > 
> > Online repo:
> > 
> >   https://github.com/xzpeter/qemu/tree/fix-vtd-dma
> > 
> > This series fixes several major problems that current code has:
> > 
> > - Issue 1: when getting very big PSI UNMAP invalidations, the current
> >   code is buggy in that we might skip the notification while actually
> >   we should always send that notification.
> > 
> > - Issue 2: IOTLB is not thread safe, while block dataplane can be
> >   accessing and updating it in parallel.
> > 
> > - Issue 3: For devices that only registered with UNMAP-only notifiers,
> >   we don't really need to do page walking for PSIs, we can directly
> >   deliver the notification down.  For example, vhost.
> > 
> > - Issue 4: unsafe window for MAP notified devices like vfio-pci (and
> >   in the future, vDPA as well).  The problem is that, now for domain
> >   invalidations we do this to make sure the shadow page tables are
> >   correctly synced:
> > 
> >   1. unmap the whole address space
> >   2. replay the whole address space, map existing pages
> > 
> >   However during step 1 and 2 there will be a very tiny window (it can
> >   be as big as 3ms) that the shadow page table is either invalid or
> >   incomplete (since we're rebuilding it up).  That's fatal error since
> >   devices never know that happending and it's still possible to DMA to
> >   memories.
> > 
> > Patch 1 fixes issue 1.  I put it at the first since it's picked from
> > an old post.
> > 
> > Patch 2 is a cleanup to remove useless IntelIOMMUNotifierNode struct.
> > 
> > Patch 3 fixes issue 2.
> > 
> > Patch 4 fixes issue 3.
> > 
> > Patch 5-9 fix issue 4.  Here a very simple interval tree is
> > implemented based on Gtree.  It's different with general interval tree
> > in that it does not allow user to pass in private data (e.g.,
> > translated addresses).  However that benefits us that then we can
> > merge adjacent interval leaves so that hopefully we won't consume much
> > memory even if the mappings are a lot (that happens for nested virt -
> > when mapping the whole L2 guest RAM range, it can be at least in GBs).
> > 
> > Patch 10 is another big cleanup only can work after patch 9.
> 
> I think patch numbers are wrong somehow.

Yes, this is the old cover letter, so the numbers do not match.

> 
> Given you also want to tweak one comment, could
> you please repost with this fix, and also
> in commit log for each patch
> - Cc stable
> - for security patches mention as much, if possible
>   add data about the issue and its severity
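
For the "Cc stable" part, that would just mean appending a trailer
like this to each commit message (illustration only, I'd adjust the
exact form to whatever the stable maintainers prefer):

  Cc: qemu-stable@nongnu.org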

The rest I can try to do.  Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v3 00/12] intel-iommu: nested vIOMMU, cleanups, bug fixes
  2018-05-17 21:04 ` Michael S. Tsirkin
@ 2018-05-18  6:34   ` Peter Xu
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Xu @ 2018-05-18  6:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, Tian Kevin, Alex Williamson, Jintack Lim, Jason Wang

On Fri, May 18, 2018 at 12:04:04AM +0300, Michael S. Tsirkin wrote:
> On Thu, May 17, 2018 at 04:59:15PM +0800, Peter Xu wrote:
> > (Hello, Jintack, Feel free to test this branch again against your scp
> >  error case when you got free time)
> > 
> > I rewrote some of the patches in V3.  Major changes:
> > 
> > - Dropped mergable interval tree, instead introduced IOVA tree, which
> >   is even simpler.
> > 
> > - Fix the scp error issue that Jintack reported.  Please see patches
> >   for detailed information.  That's the major reason to rewrite a few
> >   of the patches.  We use replay for domain flushes are possibly
> >   incorrect in the past.  The thing is that IOMMU replay has an
> >   "definition" that "we should only send MAP when new page detected",
> >   while for shadow page syncing we actually need something else than
> >   that.  So in this version I started to use a new
> >   vtd_sync_shadow_page_table() helper to do the page sync.
> > 
> > - Some other refines after the refactoring.
> > 
> > I'll add unit test for the IOVA tree after this series merged to make
> > sure we won't switch to another new tree implementaion...
> > 
> > The element size in the new IOVA tree should be around
> > sizeof(GTreeNode + IOMMUTLBEntry) ~= (5*8+4*8) = 72 bytes.  So the
> > worst case usage ratio would be 72/4K=2%, which still seems acceptable
> > (it means 8G L2 guest will use 8G*2%=160MB as metadata to maintain the
> > mapping in QEMU).
> > 
> > I did explicit test with scp this time, copying 1G sized file for >10
> > times on each of the following case:
> > 
> > - L1 guest, with vIOMMU and with assigned device
> > - L2 guest, without vIOMMU and with assigned device
> > - L2 guest, with vIOMMU (so 3-layer nested IOMMU) and with assigned device
> > 
> > Please review.  Thanks,
> > 
> > (Below are old content from previous cover letter)
> > 
> > ==========================
> > 
> > v2:
> > - fix patchew code style warnings
> > - interval tree: postpone malloc when inserting; simplify node remove
> >   a bit where proper [Jason]
> > - fix up comment and commit message for iommu lock patch [Kevin]
> > - protect context cache too using the iommu lock [Kevin, Jason]
> > - add vast comment in patch 8 to explain the modify-PTE problem
> >   [Jason, Kevin]
> > 
> > Online repo:
> > 
> >   https://github.com/xzpeter/qemu/tree/fix-vtd-dma
> > 
> > This series fixes several major problems that current code has:
> > 
> > - Issue 1: when getting very big PSI UNMAP invalidations, the current
> >   code is buggy in that we might skip the notification while actually
> >   we should always send that notification.
> 
> security issue
> 
> > - Issue 2: IOTLB is not thread safe, while block dataplane can be
> >   accessing and updating it in parallel.
> 
> security issue
> 
> > - Issue 3: For devices that only registered with UNMAP-only notifiers,
> >   we don't really need to do page walking for PSIs, we can directly
> >   deliver the notification down.  For example, vhost.
> 
> optimization
> 
> > - Issue 4: unsafe window for MAP notified devices like vfio-pci (and
> >   in the future, vDPA as well).  The problem is that, now for domain
> >   invalidations we do this to make sure the shadow page tables are
> >   correctly synced:
> > 
> >   1. unmap the whole address space
> >   2. replay the whole address space, map existing pages
> > 
> >   However during step 1 and 2 there will be a very tiny window (it can
> >   be as big as 3ms) that the shadow page table is either invalid or
> >   incomplete (since we're rebuilding it up).  That's fatal error since
> >   devices never know that happending and it's still possible to DMA to
> >   memories.
> 
> correctness but not a security issue
> 
> > Patch 1 fixes issue 1.  I put it at the first since it's picked from
> > an old post.
> > 
> > Patch 2 is a cleanup to remove useless IntelIOMMUNotifierNode struct.
> > 
> > Patch 3 fixes issue 2.
> > 
> > Patch 4 fixes issue 3.
> > 
> > Patch 5-9 fix issue 4.  Here a very simple interval tree is
> > implemented based on Gtree.  It's different with general interval tree
> > in that it does not allow user to pass in private data (e.g.,
> > translated addresses).  However that benefits us that then we can
> > merge adjacent interval leaves so that hopefully we won't consume much
> > memory even if the mappings are a lot (that happens for nested virt -
> > when mapping the whole L2 guest RAM range, it can be at least in GBs).
> > 
> > Patch 10 is another big cleanup only can work after patch 9.
> 
> 
> So 1-2 are needed on stable. 1-9 would be nice to have
> there too, even though they are big and it looks risky.

Yes, although issue 4 is not a security issue, it can still cause DMA
errors and make devices unusable.  I don't know much about how the
stable tree should treat patches like this, but considering that this
whole series only touches VT-d code, and that, as you mentioned,
nearly all of the patches would be nice to have even for stable, I'll
just CC stable on all of the patches, refine the commit messages, and
repost.  Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v3 01/12] intel-iommu: send PSI always even if across PDEs
  2018-05-17  8:59 ` [Qemu-devel] [PATCH v3 01/12] intel-iommu: send PSI always even if across PDEs Peter Xu
  2018-05-17 21:00   ` Michael S. Tsirkin
@ 2018-05-18  8:23   ` Auger Eric
  1 sibling, 0 replies; 27+ messages in thread
From: Auger Eric @ 2018-05-18  8:23 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Tian Kevin, Michael S . Tsirkin, Jason Wang, Alex Williamson,
	Jintack Lim

Hi Peter,

On 05/17/2018 10:59 AM, Peter Xu wrote:
> During IOVA page table walking, there is a special case when the PSI
> covers one whole PDE (Page Directory Entry, which contains 512 Page
> Table Entries) or more.  In the past, we skip that entry and we don't
> notify the IOMMU notifiers.  This is not correct.  We should send UNMAP
> notification to registered UNMAP notifiers in this case.
> 
> For UNMAP only notifiers, this might cause IOTLBs cached in the devices
> even if they were already invalid.  For MAP/UNMAP notifiers like
> vfio-pci, this will cause stale page mappings.
> 
> This special case doesn't trigger often, but it is very easy to be
> triggered by nested device assignments, since in that case we'll
> possibly map the whole L2 guest RAM region into the device's IOVA
> address space (several GBs at least), which is far bigger than normal
> kernel driver usages of the device (tens of MBs normally).
> 
> Without this patch applied to L1 QEMU, nested device assignment to L2
> guests will dump some errors like:
> 
> qemu-system-x86_64: VFIO_MAP_DMA: -17
> qemu-system-x86_64: vfio_dma_map(0x557305420c30, 0xad000, 0x1000,
>                     0x7f89a920d000) = -17 (File exists)
> 
> Acked-by: Jason Wang <jasowang@redhat.com>
> [peterx: rewrite the commit message]
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/i386/intel_iommu.c | 42 ++++++++++++++++++++++++++++++------------
>  1 file changed, 30 insertions(+), 12 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index fb31de9416..b359efd6f9 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -722,6 +722,15 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
>  
>  typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
>  
> +static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
> +                             vtd_page_walk_hook hook_fn, void *private)
Besides the name of this function (nit)
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric

> +{
> +    assert(hook_fn);
> +    trace_vtd_page_walk_one(level, entry->iova, entry->translated_addr,
> +                            entry->addr_mask, entry->perm);
> +    return hook_fn(entry, private);
> +}
> +
>  /**
>   * vtd_page_walk_level - walk over specific level for IOVA range
>   *
> @@ -781,28 +790,37 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
>           */
>          entry_valid = read_cur | write_cur;
>  
> +        entry.target_as = &address_space_memory;
> +        entry.iova = iova & subpage_mask;
> +        entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
> +        entry.addr_mask = ~subpage_mask;
> +
>          if (vtd_is_last_slpte(slpte, level)) {
> -            entry.target_as = &address_space_memory;
> -            entry.iova = iova & subpage_mask;
>              /* NOTE: this is only meaningful if entry_valid == true */
>              entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
> -            entry.addr_mask = ~subpage_mask;
> -            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
>              if (!entry_valid && !notify_unmap) {
>                  trace_vtd_page_walk_skip_perm(iova, iova_next);
>                  goto next;
>              }
> -            trace_vtd_page_walk_one(level, entry.iova, entry.translated_addr,
> -                                    entry.addr_mask, entry.perm);
> -            if (hook_fn) {
> -                ret = hook_fn(&entry, private);
> -                if (ret < 0) {
> -                    return ret;
> -                }
> +            ret = vtd_page_walk_one(&entry, level, hook_fn, private);
> +            if (ret < 0) {
> +                return ret;
>              }
>          } else {
>              if (!entry_valid) {
> -                trace_vtd_page_walk_skip_perm(iova, iova_next);
> +                if (notify_unmap) {
> +                    /*
> +                     * The whole entry is invalid; unmap it all.
> +                     * Translated address is meaningless, zero it.
> +                     */
> +                    entry.translated_addr = 0x0;
> +                    ret = vtd_page_walk_one(&entry, level, hook_fn, private);
> +                    if (ret < 0) {
> +                        return ret;
> +                    }
> +                } else {
> +                    trace_vtd_page_walk_skip_perm(iova, iova_next);
> +                }
>                  goto next;
>              }
>              ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, aw), iova,
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v3 06/12] intel-iommu: pass in address space when page walk
  2018-05-17  8:59 ` [Qemu-devel] [PATCH v3 06/12] intel-iommu: pass in address space when page walk Peter Xu
@ 2018-05-18  8:23   ` Auger Eric
  0 siblings, 0 replies; 27+ messages in thread
From: Auger Eric @ 2018-05-18  8:23 UTC (permalink / raw)
  To: qemu-devel

Hi Peter,

On 05/17/2018 10:59 AM, Peter Xu wrote:
> We pass in the VTDAddressSpace too.  It'll be used in the follow up
> patches.
So you eventually preferred to keep .aw.  I don't have a strong
opinion, but maybe a small preference for the v2 version.

Nevertheless
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/i386/intel_iommu.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 4953d02ed0..fe5ee77d46 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -753,9 +753,11 @@ typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
>   * @hook_fn: hook func to be called when detected page
>   * @private: private data to be passed into hook func
>   * @notify_unmap: whether we should notify invalid entries
> + * @as: VT-d address space of the device
>   * @aw: maximum address width
>   */
>  typedef struct {
> +    VTDAddressSpace *as;
>      vtd_page_walk_hook hook_fn;
>      void *private;
>      bool notify_unmap;
> @@ -1460,6 +1462,7 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
>                      .private = (void *)&vtd_as->iommu,
>                      .notify_unmap = true,
>                      .aw = s->aw_bits,
> +                    .as = vtd_as,
>                  };
>  
>                  /*
> @@ -2941,6 +2944,7 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>                  .private = (void *)n,
>                  .notify_unmap = false,
>                  .aw = s->aw_bits,
> +                .as = vtd_as,
>              };
>  
>              vtd_page_walk(&ce, 0, ~0ULL, &info);
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCH v3 05/12] intel-iommu: introduce vtd_page_walk_info
  2018-05-17  8:59 ` [Qemu-devel] [PATCH v3 05/12] intel-iommu: introduce vtd_page_walk_info Peter Xu
@ 2018-05-18  8:23   ` Auger Eric
  0 siblings, 0 replies; 27+ messages in thread
From: Auger Eric @ 2018-05-18  8:23 UTC (permalink / raw)
  To: qemu-devel



On 05/17/2018 10:59 AM, Peter Xu wrote:
> During the recursive page walking of IOVA page tables, some stack
> variables are constant variables and never changed during the whole page
> walking procedure.  Isolate them into a struct so that we don't need to
> pass those contants down the stack every time and multiple times.
s/contants/constants
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/i386/intel_iommu.c | 84 ++++++++++++++++++++++++++-----------------
>  1 file changed, 52 insertions(+), 32 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 9a418abfb6..4953d02ed0 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -747,9 +747,27 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
>  
>  typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
>  
> +/**
> + * Constant information used during page walking
page table walk?
> + *
> + * @hook_fn: hook func to be called when detected page
nit: "when detected page"?  Not necessarily, as it may also be called on
invalid entries.  "On leaf ptw entries"?
> + * @private: private data to be passed into hook func
> + * @notify_unmap: whether we should notify invalid entries
> + * @aw: maximum address width
> + */
> +typedef struct {
> +    vtd_page_walk_hook hook_fn;
> +    void *private;
> +    bool notify_unmap;
> +    uint8_t aw;
> +} vtd_page_walk_info;
> +
>  static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
> -                             vtd_page_walk_hook hook_fn, void *private)
> +                             vtd_page_walk_info *info)
>  {
> +    vtd_page_walk_hook hook_fn = info->hook_fn;
> +    void *private = info->private;
> +
>      assert(hook_fn);
>      trace_vtd_page_walk_one(level, entry->iova, entry->translated_addr,
>                              entry->addr_mask, entry->perm);
> @@ -762,17 +780,13 @@ static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
>   * @addr: base GPA addr to start the walk
>   * @start: IOVA range start address
>   * @end: IOVA range end address (start <= addr < end)
> - * @hook_fn: hook func to be called when detected page
> - * @private: private data to be passed into hook func
>   * @read: whether parent level has read permission
>   * @write: whether parent level has write permission
> - * @notify_unmap: whether we should notify invalid entries
> - * @aw: maximum address width
> + * @info: constant information for the page walk
>   */
>  static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> -                               uint64_t end, vtd_page_walk_hook hook_fn,
> -                               void *private, uint32_t level, bool read,
> -                               bool write, bool notify_unmap, uint8_t aw)
> +                               uint64_t end, uint32_t level, bool read,
> +                               bool write, vtd_page_walk_info *info)
>  {
>      bool read_cur, write_cur, entry_valid;
>      uint32_t offset;
> @@ -822,24 +836,24 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
>  
>          if (vtd_is_last_slpte(slpte, level)) {
>              /* NOTE: this is only meaningful if entry_valid == true */
> -            entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
> -            if (!entry_valid && !notify_unmap) {
> +            entry.translated_addr = vtd_get_slpte_addr(slpte, info->aw);
> +            if (!entry_valid && !info->notify_unmap) {
>                  trace_vtd_page_walk_skip_perm(iova, iova_next);
>                  goto next;
>              }
> -            ret = vtd_page_walk_one(&entry, level, hook_fn, private);
> +            ret = vtd_page_walk_one(&entry, level, info);
>              if (ret < 0) {
>                  return ret;
>              }
>          } else {
>              if (!entry_valid) {
> -                if (notify_unmap) {
> +                if (info->notify_unmap) {
>                      /*
>                       * The whole entry is invalid; unmap it all.
>                       * Translated address is meaningless, zero it.
>                       */
>                      entry.translated_addr = 0x0;
> -                    ret = vtd_page_walk_one(&entry, level, hook_fn, private);
> +                    ret = vtd_page_walk_one(&entry, level, info);
>                      if (ret < 0) {
>                          return ret;
>                      }
> @@ -848,10 +862,9 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
>                  }
>                  goto next;
>              }
> -            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, aw), iova,
> -                                      MIN(iova_next, end), hook_fn, private,
> -                                      level - 1, read_cur, write_cur,
> -                                      notify_unmap, aw);
> +            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, info->aw),
> +                                      iova, MIN(iova_next, end), level - 1,
> +                                      read_cur, write_cur, info);
>              if (ret < 0) {
>                  return ret;
>              }
> @@ -870,28 +883,24 @@ next:
>   * @ce: context entry to walk upon
>   * @start: IOVA address to start the walk
>   * @end: IOVA range end address (start <= addr < end)
> - * @hook_fn: the hook that to be called for each detected area
> - * @private: private data for the hook function
> - * @aw: maximum address width
> + * @info: page walking information struct
>   */
>  static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
> -                         vtd_page_walk_hook hook_fn, void *private,
> -                         bool notify_unmap, uint8_t aw)
> +                         vtd_page_walk_info *info)
>  {
>      dma_addr_t addr = vtd_ce_get_slpt_base(ce);
>      uint32_t level = vtd_ce_get_level(ce);
>  
> -    if (!vtd_iova_range_check(start, ce, aw)) {
> +    if (!vtd_iova_range_check(start, ce, info->aw)) {
>          return -VTD_FR_ADDR_BEYOND_MGAW;
>      }
>  
> -    if (!vtd_iova_range_check(end, ce, aw)) {
> +    if (!vtd_iova_range_check(end, ce, info->aw)) {
>          /* Fix end so that it reaches the maximum */
> -        end = vtd_iova_limit(ce, aw);
> +        end = vtd_iova_limit(ce, info->aw);
>      }
>  
> -    return vtd_page_walk_level(addr, start, end, hook_fn, private,
> -                               level, true, true, notify_unmap, aw);
> +    return vtd_page_walk_level(addr, start, end, level, true, true, info);
>  }
>  
>  /* Map a device to its corresponding domain (context-entry) */
> @@ -1446,13 +1455,18 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
>                                         vtd_as->devfn, &ce);
>          if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
>              if (vtd_as_notify_mappings(vtd_as)) {
> +                vtd_page_walk_info info = {
> +                    .hook_fn = vtd_page_invalidate_notify_hook,
> +                    .private = (void *)&vtd_as->iommu,
> +                    .notify_unmap = true,
> +                    .aw = s->aw_bits,
> +                };
> +
>                  /*
>                   * For MAP-inclusive notifiers, we need to walk the
>                   * page table to sync the shadow page table.
>                   */
> -                vtd_page_walk(&ce, addr, addr + size,
> -                              vtd_page_invalidate_notify_hook,
> -                              (void *)&vtd_as->iommu, true, s->aw_bits);
> +                vtd_page_walk(&ce, addr, addr + size, &info);
>              } else {
>                  /*
>                   * For UNMAP-only notifiers, we don't need to walk the
> @@ -2922,8 +2936,14 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>                                    ce.hi, ce.lo);
>          if (vtd_as_notify_mappings(vtd_as)) {
>              /* This is required only for MAP typed notifiers */
> -            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
> -                          s->aw_bits);
> +            vtd_page_walk_info info = {
> +                .hook_fn = vtd_replay_hook,
> +                .private = (void *)n,
> +                .notify_unmap = false,
> +                .aw = s->aw_bits,
> +            };
> +
> +            vtd_page_walk(&ce, 0, ~0ULL, &info);
>          }
>      } else {
>          trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
> 
Besides the nit
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2018-05-18  8:24 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-17  8:59 [Qemu-devel] [PATCH v3 00/12] intel-iommu: nested vIOMMU, cleanups, bug fixes Peter Xu
2018-05-17  8:59 ` [Qemu-devel] [PATCH v3 01/12] intel-iommu: send PSI always even if across PDEs Peter Xu
2018-05-17 21:00   ` Michael S. Tsirkin
2018-05-18  8:23   ` Auger Eric
2018-05-17  8:59 ` [Qemu-devel] [PATCH v3 02/12] intel-iommu: remove IntelIOMMUNotifierNode Peter Xu
2018-05-17  8:59 ` [Qemu-devel] [PATCH v3 03/12] intel-iommu: add iommu lock Peter Xu
2018-05-17  8:59 ` [Qemu-devel] [PATCH v3 04/12] intel-iommu: only do page walk for MAP notifiers Peter Xu
2018-05-17  8:59 ` [Qemu-devel] [PATCH v3 05/12] intel-iommu: introduce vtd_page_walk_info Peter Xu
2018-05-18  8:23   ` Auger Eric
2018-05-17  8:59 ` [Qemu-devel] [PATCH v3 06/12] intel-iommu: pass in address space when page walk Peter Xu
2018-05-18  8:23   ` Auger Eric
2018-05-17  8:59 ` [Qemu-devel] [PATCH v3 07/12] intel-iommu: trace domain id during " Peter Xu
2018-05-17  8:59 ` [Qemu-devel] [PATCH v3 08/12] util: implement simple iova tree Peter Xu
2018-05-17  8:59 ` [Qemu-devel] [PATCH v3 09/12] intel-iommu: maintain per-device iova ranges Peter Xu
2018-05-17  9:46   ` Peter Xu
2018-05-17  8:59 ` [Qemu-devel] [PATCH v3 10/12] intel-iommu: simplify page walk logic Peter Xu
2018-05-17  8:59 ` [Qemu-devel] [PATCH v3 11/12] intel-iommu: new vtd_sync_shadow_page_table_range Peter Xu
2018-05-17  8:59 ` [Qemu-devel] [PATCH v3 12/12] intel-iommu: new sync_shadow_page_table Peter Xu
2018-05-17 21:06   ` Michael S. Tsirkin
2018-05-18  6:22     ` Peter Xu
2018-05-17 19:49 ` [Qemu-devel] [PATCH v3 00/12] intel-iommu: nested vIOMMU, cleanups, bug fixes Jintack Lim
2018-05-18  6:26   ` Peter Xu
2018-05-18  6:28     ` Peter Xu
2018-05-17 21:04 ` Michael S. Tsirkin
2018-05-18  6:34   ` Peter Xu
2018-05-17 21:08 ` Michael S. Tsirkin
2018-05-18  6:30   ` Peter Xu
