* [Qemu-devel] [PATCH 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes
From: Peter Xu @ 2018-04-25  4:51 UTC
  To: qemu-devel
  Cc: Michael S. Tsirkin, Alex Williamson, Jintack Lim, peterx, Jason Wang

Online repo:

  https://github.com/xzpeter/qemu/tree/fix-vtd-dma

This series fixes several major problems in the current code:

- Issue 1: when receiving very big PSI UNMAP invalidations, the
  current code is buggy in that we may skip the notification, while
  actually we should always send it.

- Issue 2: the IOTLB cache is not thread safe, while block dataplane
  can access and update it in parallel with other threads.

- Issue 3: for devices that registered UNMAP-only notifiers (e.g.,
  vhost), we don't really need to walk the page tables for PSIs; we
  can deliver the notification directly.

- Issue 4: there is an unsafe window for MAP-notified devices like
  vfio-pci (and, in the future, vDPA as well).  The problem is that,
  currently, for domain invalidations we do the following to make
  sure the shadow page tables are correctly synced:

  1. unmap the whole address space
  2. replay the whole address space, mapping the existing pages

  However, between steps 1 and 2 there is a tiny window (it can be as
  long as 3ms) during which the shadow page table is either invalid
  or incomplete (since we are rebuilding it).  That is a fatal error,
  since the device never knows this is happening and may still DMA to
  memory.  The flawed flow is sketched below.
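
  As a rough illustration (paraphrased and simplified from
  vtd_iommu_replay() as it looks before this series; not a verbatim
  quote of the code):

    static void vtd_iommu_replay_sketch(VTDAddressSpace *vtd_as,
                                        IOMMUNotifier *n)
    {
        /* 1) Tear down the whole shadow table first ...              */
        vtd_address_space_unmap(vtd_as, n);

        /* <-- unsafe window: the assigned device sees an empty
         *     shadow table here, yet may still be issuing DMA ...    */

        /* 2) ... until the page walk finishes re-mapping everything
         *    (ce/s are set up by the caller as in the real code).    */
        vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n,
                      false, s->aw_bits);
    }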

Patch 1 fixes issue 1.  I put it first since it's picked from an old
post.

Patch 2 is a cleanup to remove the useless IntelIOMMUNotifierNode struct.

Patch 3 fixes issue 2.

Patch 4 fixes issue 3.

Patches 5-9 fix issue 4.  Here a very simple interval tree is
implemented based on GTree.  It differs from a general interval tree
in that it does not allow the user to attach private data (e.g.,
translated addresses) to the ranges.  That, however, lets us merge
adjacent interval leaves, so hopefully we won't consume much memory
even when the mappings are plentiful.  That matters for nested virt:
mapping the whole L2 guest RAM range covers GBs at least, i.e., on
the order of a million 4K pages, which merging collapses into a
handful of ranges.

Patch 10 is another big cleanup that only becomes possible after patch 9.

Tests:

- device assignment to L1 and even L2 guests.  With this series
  applied (and the kernel IOMMU patches: https://lkml.org/lkml/2018/4/18/5),
  we can even nest the vIOMMU now, e.g., we can specify a vIOMMU in an
  L2 guest with assigned devices and things will work.  We couldn't
  before.

- vhost smoke test for regressions.

Please review.  Thanks,

Peter Xu (10):
  intel-iommu: send PSI always even if across PDEs
  intel-iommu: remove IntelIOMMUNotifierNode
  intel-iommu: add iommu lock
  intel-iommu: only do page walk for MAP notifiers
  intel-iommu: introduce vtd_page_walk_info
  intel-iommu: pass in address space when page walk
  util: implement simple interval tree logic
  intel-iommu: maintain per-device iova ranges
  intel-iommu: don't unmap all for shadow page table
  intel-iommu: remove notify_unmap for page walk

 include/hw/i386/intel_iommu.h |  21 ++--
 include/qemu/interval-tree.h  | 130 ++++++++++++++++++++++
 hw/i386/intel_iommu.c         | 249 +++++++++++++++++++++++++++++-------------
 util/interval-tree.c          | 217 ++++++++++++++++++++++++++++++++++++
 hw/i386/trace-events          |   3 +-
 util/Makefile.objs            |   1 +
 6 files changed, 536 insertions(+), 85 deletions(-)
 create mode 100644 include/qemu/interval-tree.h
 create mode 100644 util/interval-tree.c

-- 
2.14.3


* [Qemu-devel] [PATCH 01/10] intel-iommu: send PSI always even if across PDEs
From: Peter Xu @ 2018-04-25  4:51 UTC
  To: qemu-devel
  Cc: Michael S. Tsirkin, Alex Williamson, Jintack Lim, peterx, Jason Wang

During IOVA page table walking, there is a special case when the PSI
covers one whole PDE (Page Directory Entry, which contains 512 Page
Table Entries) or more.  In the past, we skipped that entry and did
not notify the IOMMU notifiers.  This is not correct.  We should send
UNMAP notifications to registered UNMAP notifiers in this case.

For UNMAP-only notifiers, the old behavior could leave stale IOTLB
entries cached in the devices even after they had been invalidated.
For MAP/UNMAP notifiers like vfio-pci, it caused stale page mappings.

This special case doesn't trigger often, but it is very easily
triggered by nested device assignment, since in that case we may map
the whole L2 guest RAM region into the device's IOVA address space
(several GBs at least), which is far bigger than the usage by a
normal kernel driver of the device (normally tens of MBs).
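
To see when this special case kicks in, a back-of-the-envelope check
(an illustrative snippet, not part of the patch; `am' is the
invalidation's address-mask field and VTD_PAGE_SIZE the 4K VT-d page
size used elsewhere in this file):

    /* A PSI invalidates (1 << am) contiguous 4K pages, while one PDE
     * maps 512 PTEs, i.e. 2MB.  So any PSI with am >= 9 covers at
     * least one whole PDE and used to hit the buggy skip path. */
    hwaddr size = (1ULL << am) * VTD_PAGE_SIZE;  /* bytes invalidated */
    bool covers_whole_pde = (am >= 9);           /* 512 * 4K == 2MB   */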

Without this patch applied to L1 QEMU, nested device assignment to L2
guests will dump some errors like:

qemu-system-x86_64: VFIO_MAP_DMA: -17
qemu-system-x86_64: vfio_dma_map(0x557305420c30, 0xad000, 0x1000,
                    0x7f89a920d000) = -17 (File exists)

Acked-by: Jason Wang <jasowang@redhat.com>
[peterx: rewrite the commit message]
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 42 ++++++++++++++++++++++++++++++------------
 1 file changed, 30 insertions(+), 12 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index fb31de9416..b359efd6f9 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -722,6 +722,15 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
 
 typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
 
+static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
+                             vtd_page_walk_hook hook_fn, void *private)
+{
+    assert(hook_fn);
+    trace_vtd_page_walk_one(level, entry->iova, entry->translated_addr,
+                            entry->addr_mask, entry->perm);
+    return hook_fn(entry, private);
+}
+
 /**
  * vtd_page_walk_level - walk over specific level for IOVA range
  *
@@ -781,28 +790,37 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
          */
         entry_valid = read_cur | write_cur;
 
+        entry.target_as = &address_space_memory;
+        entry.iova = iova & subpage_mask;
+        entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
+        entry.addr_mask = ~subpage_mask;
+
         if (vtd_is_last_slpte(slpte, level)) {
-            entry.target_as = &address_space_memory;
-            entry.iova = iova & subpage_mask;
             /* NOTE: this is only meaningful if entry_valid == true */
             entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
-            entry.addr_mask = ~subpage_mask;
-            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
             if (!entry_valid && !notify_unmap) {
                 trace_vtd_page_walk_skip_perm(iova, iova_next);
                 goto next;
             }
-            trace_vtd_page_walk_one(level, entry.iova, entry.translated_addr,
-                                    entry.addr_mask, entry.perm);
-            if (hook_fn) {
-                ret = hook_fn(&entry, private);
-                if (ret < 0) {
-                    return ret;
-                }
+            ret = vtd_page_walk_one(&entry, level, hook_fn, private);
+            if (ret < 0) {
+                return ret;
             }
         } else {
             if (!entry_valid) {
-                trace_vtd_page_walk_skip_perm(iova, iova_next);
+                if (notify_unmap) {
+                    /*
+                     * The whole entry is invalid; unmap it all.
+                     * Translated address is meaningless, zero it.
+                     */
+                    entry.translated_addr = 0x0;
+                    ret = vtd_page_walk_one(&entry, level, hook_fn, private);
+                    if (ret < 0) {
+                        return ret;
+                    }
+                } else {
+                    trace_vtd_page_walk_skip_perm(iova, iova_next);
+                }
                 goto next;
             }
             ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, aw), iova,
-- 
2.14.3


* [Qemu-devel] [PATCH 02/10] intel-iommu: remove IntelIOMMUNotifierNode
From: Peter Xu @ 2018-04-25  4:51 UTC
  To: qemu-devel
  Cc: Michael S. Tsirkin, Alex Williamson, Jintack Lim, peterx, Jason Wang

That struct is not really necessary.  Remove it and put the list
entry directly into VTDAddressSpace.  This simplifies the code a lot.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/hw/i386/intel_iommu.h |  9 ++-------
 hw/i386/intel_iommu.c         | 41 ++++++++++++-----------------------------
 2 files changed, 14 insertions(+), 36 deletions(-)

diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 45ec8919b6..220697253f 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -67,7 +67,6 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
 typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
 typedef struct VTDIrq VTDIrq;
 typedef struct VTD_MSIMessage VTD_MSIMessage;
-typedef struct IntelIOMMUNotifierNode IntelIOMMUNotifierNode;
 
 /* Context-Entry */
 struct VTDContextEntry {
@@ -93,6 +92,7 @@ struct VTDAddressSpace {
     MemoryRegion iommu_ir;      /* Interrupt region: 0xfeeXXXXX */
     IntelIOMMUState *iommu_state;
     VTDContextCacheEntry context_cache_entry;
+    QLIST_ENTRY(VTDAddressSpace) next;
 };
 
 struct VTDBus {
@@ -253,11 +253,6 @@ struct VTD_MSIMessage {
 /* When IR is enabled, all MSI/MSI-X data bits should be zero */
 #define VTD_IR_MSI_DATA          (0)
 
-struct IntelIOMMUNotifierNode {
-    VTDAddressSpace *vtd_as;
-    QLIST_ENTRY(IntelIOMMUNotifierNode) next;
-};
-
 /* The iommu (DMAR) device state struct */
 struct IntelIOMMUState {
     X86IOMMUState x86_iommu;
@@ -295,7 +290,7 @@ struct IntelIOMMUState {
     GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
     VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
     /* list of registered notifiers */
-    QLIST_HEAD(, IntelIOMMUNotifierNode) notifiers_list;
+    QLIST_HEAD(, VTDAddressSpace) notifiers_list;
 
     /* interrupt remapping */
     bool intr_enabled;              /* Whether guest enabled IR */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index b359efd6f9..5987b48d43 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1248,10 +1248,10 @@ static void vtd_interrupt_remap_table_setup(IntelIOMMUState *s)
 
 static void vtd_iommu_replay_all(IntelIOMMUState *s)
 {
-    IntelIOMMUNotifierNode *node;
+    VTDAddressSpace *vtd_as;
 
-    QLIST_FOREACH(node, &s->notifiers_list, next) {
-        memory_region_iommu_replay_all(&node->vtd_as->iommu);
+    QLIST_FOREACH(vtd_as, &s->notifiers_list, next) {
+        memory_region_iommu_replay_all(&vtd_as->iommu);
     }
 }
 
@@ -1372,7 +1372,6 @@ static void vtd_iotlb_global_invalidate(IntelIOMMUState *s)
 
 static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
 {
-    IntelIOMMUNotifierNode *node;
     VTDContextEntry ce;
     VTDAddressSpace *vtd_as;
 
@@ -1381,8 +1380,7 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
     g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_domain,
                                 &domain_id);
 
-    QLIST_FOREACH(node, &s->notifiers_list, next) {
-        vtd_as = node->vtd_as;
+    QLIST_FOREACH(vtd_as, &s->notifiers_list, next) {
         if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
                                       vtd_as->devfn, &ce) &&
             domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
@@ -1402,12 +1400,11 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
                                            uint16_t domain_id, hwaddr addr,
                                            uint8_t am)
 {
-    IntelIOMMUNotifierNode *node;
+    VTDAddressSpace *vtd_as;
     VTDContextEntry ce;
     int ret;
 
-    QLIST_FOREACH(node, &(s->notifiers_list), next) {
-        VTDAddressSpace *vtd_as = node->vtd_as;
+    QLIST_FOREACH(vtd_as, &(s->notifiers_list), next) {
         ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
                                        vtd_as->devfn, &ce);
         if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
@@ -2344,8 +2341,6 @@ static void vtd_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
 {
     VTDAddressSpace *vtd_as = container_of(iommu, VTDAddressSpace, iommu);
     IntelIOMMUState *s = vtd_as->iommu_state;
-    IntelIOMMUNotifierNode *node = NULL;
-    IntelIOMMUNotifierNode *next_node = NULL;
 
     if (!s->caching_mode && new & IOMMU_NOTIFIER_MAP) {
         error_report("We need to set caching-mode=1 for intel-iommu to enable "
@@ -2354,21 +2349,11 @@ static void vtd_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
     }
 
     if (old == IOMMU_NOTIFIER_NONE) {
-        node = g_malloc0(sizeof(*node));
-        node->vtd_as = vtd_as;
-        QLIST_INSERT_HEAD(&s->notifiers_list, node, next);
-        return;
-    }
-
-    /* update notifier node with new flags */
-    QLIST_FOREACH_SAFE(node, &s->notifiers_list, next, next_node) {
-        if (node->vtd_as == vtd_as) {
-            if (new == IOMMU_NOTIFIER_NONE) {
-                QLIST_REMOVE(node, next);
-                g_free(node);
-            }
-            return;
-        }
+        /* Insert new ones */
+        QLIST_INSERT_HEAD(&s->notifiers_list, vtd_as, next);
+    } else if (new == IOMMU_NOTIFIER_NONE) {
+        /* Remove old ones */
+        QLIST_REMOVE(vtd_as, next);
     }
 }
 
@@ -2838,12 +2823,10 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
 
 static void vtd_address_space_unmap_all(IntelIOMMUState *s)
 {
-    IntelIOMMUNotifierNode *node;
     VTDAddressSpace *vtd_as;
     IOMMUNotifier *n;
 
-    QLIST_FOREACH(node, &s->notifiers_list, next) {
-        vtd_as = node->vtd_as;
+    QLIST_FOREACH(vtd_as, &s->notifiers_list, next) {
         IOMMU_NOTIFIER_FOREACH(n, &vtd_as->iommu) {
             vtd_address_space_unmap(vtd_as, n);
         }
-- 
2.14.3


* [Qemu-devel] [PATCH 03/10] intel-iommu: add iommu lock
From: Peter Xu @ 2018-04-25  4:51 UTC
  To: qemu-devel
  Cc: Michael S. Tsirkin, Alex Williamson, Jintack Lim, peterx, Jason Wang

Add a per-iommu big lock to protect the IOMMU state.  Currently the
only thing to be protected is the IOTLB cache, since that can be
accessed even without the BQL, e.g., in IO dataplane.

Note that the device page tables should not need any protection.
Their safety is provided by the guest OS: e.g., when a page entry is
freed, the guest OS is responsible for making sure that no device is
still using that page.  The race being closed is sketched below.
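
For reference, the kind of race the lock closes can be sketched as
follows (illustrative; the call chains are paraphrased):

    /*
     *  vCPU thread (BQL held)         IO dataplane thread (no BQL)
     *  ----------------------         ----------------------------
     *  vtd_iotlb_page_invalidate()    translation path
     *    g_hash_table_foreach_          vtd_lookup_iotlb()
     *        remove(s->iotlb, ...)        g_hash_table_lookup(...)
     *
     * GHashTable is not thread safe, so concurrent modification and
     * lookup of s->iotlb is undefined; the new iommu_lock serializes
     * all accesses to it.
     */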

Reported-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/hw/i386/intel_iommu.h |  8 ++++++++
 hw/i386/intel_iommu.c         | 31 +++++++++++++++++++++++++++++--
 2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 220697253f..1a8ba8e415 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -262,6 +262,14 @@ struct IntelIOMMUState {
     uint8_t w1cmask[DMAR_REG_SIZE]; /* RW1C(Write 1 to Clear) bytes */
     uint8_t womask[DMAR_REG_SIZE];  /* WO (write only - read returns 0) */
     uint32_t version;
+    /*
+     * Protects IOMMU states in general.  Normally we don't need to
+     * take this lock when we are with BQL held.  However we have code
+     * paths that may run even without BQL.  In those cases, we need
+     * to take the lock when we access the IOMMU state
+     * information, e.g., the IOTLB.
+     */
+    QemuMutex iommu_lock;
 
     bool caching_mode;          /* RO - is cap CM enabled? */
 
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 5987b48d43..e4ee211dde 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -128,6 +128,16 @@ static uint64_t vtd_set_clear_mask_quad(IntelIOMMUState *s, hwaddr addr,
     return new_val;
 }
 
+static inline void vtd_iommu_lock(IntelIOMMUState *s)
+{
+    qemu_mutex_lock(&s->iommu_lock);
+}
+
+static inline void vtd_iommu_unlock(IntelIOMMUState *s)
+{
+    qemu_mutex_unlock(&s->iommu_lock);
+}
+
 /* GHashTable functions */
 static gboolean vtd_uint64_equal(gconstpointer v1, gconstpointer v2)
 {
@@ -197,12 +207,19 @@ static void vtd_reset_context_cache(IntelIOMMUState *s)
     s->context_cache_gen = 1;
 }
 
-static void vtd_reset_iotlb(IntelIOMMUState *s)
+static void vtd_reset_iotlb_locked(IntelIOMMUState *s)
 {
     assert(s->iotlb);
     g_hash_table_remove_all(s->iotlb);
 }
 
+static void vtd_reset_iotlb(IntelIOMMUState *s)
+{
+    vtd_iommu_lock(s);
+    vtd_reset_iotlb_locked(s);
+    vtd_iommu_unlock(s);
+}
+
 static uint64_t vtd_get_iotlb_key(uint64_t gfn, uint16_t source_id,
                                   uint32_t level)
 {
@@ -222,6 +239,7 @@ static VTDIOTLBEntry *vtd_lookup_iotlb(IntelIOMMUState *s, uint16_t source_id,
     uint64_t key;
     int level;
 
+    vtd_iommu_lock(s);
     for (level = VTD_SL_PT_LEVEL; level < VTD_SL_PML4_LEVEL; level++) {
         key = vtd_get_iotlb_key(vtd_get_iotlb_gfn(addr, level),
                                 source_id, level);
@@ -232,6 +250,7 @@ static VTDIOTLBEntry *vtd_lookup_iotlb(IntelIOMMUState *s, uint16_t source_id,
     }
 
 out:
+    vtd_iommu_unlock(s);
     return entry;
 }
 
@@ -244,9 +263,11 @@ static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
     uint64_t gfn = vtd_get_iotlb_gfn(addr, level);
 
     trace_vtd_iotlb_page_update(source_id, addr, slpte, domain_id);
+
+    vtd_iommu_lock(s);
     if (g_hash_table_size(s->iotlb) >= VTD_IOTLB_MAX_SIZE) {
         trace_vtd_iotlb_reset("iotlb exceeds size limit");
-        vtd_reset_iotlb(s);
+        vtd_reset_iotlb_locked(s);
     }
 
     entry->gfn = gfn;
@@ -256,6 +277,7 @@ static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
     entry->mask = vtd_slpt_level_page_mask(level);
     *key = vtd_get_iotlb_key(gfn, source_id, level);
     g_hash_table_replace(s->iotlb, key, entry);
+    vtd_iommu_unlock(s);
 }
 
 /* Given the reg addr of both the message data and address, generate an
@@ -1377,8 +1399,10 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
 
     trace_vtd_inv_desc_iotlb_domain(domain_id);
 
+    vtd_iommu_lock(s);
     g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_domain,
                                 &domain_id);
+    vtd_iommu_unlock(s);
 
     QLIST_FOREACH(vtd_as, &s->notifiers_list, next) {
         if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
@@ -1426,7 +1450,9 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
     info.domain_id = domain_id;
     info.addr = addr;
     info.mask = ~((1 << am) - 1);
+    vtd_iommu_lock(s);
     g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
+    vtd_iommu_unlock(s);
     vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
 }
 
@@ -3072,6 +3098,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
     }
 
     QLIST_INIT(&s->notifiers_list);
+    qemu_mutex_init(&s->iommu_lock);
     memset(s->vtd_as_by_bus_num, 0, sizeof(s->vtd_as_by_bus_num));
     memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
                           "intel_iommu", DMAR_REG_SIZE);
-- 
2.14.3


* [Qemu-devel] [PATCH 04/10] intel-iommu: only do page walk for MAP notifiers
From: Peter Xu @ 2018-04-25  4:51 UTC
  To: qemu-devel
  Cc: Michael S. Tsirkin, Alex Williamson, Jintack Lim, peterx, Jason Wang

For UNMAP-only IOMMU notifiers, we don't really need to walk the page
tables.  Speed that procedure up by skipping the page table walk.
This should boost performance for UNMAP-only notifiers like vhost.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/hw/i386/intel_iommu.h |  2 ++
 hw/i386/intel_iommu.c         | 43 ++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 1a8ba8e415..486e205e79 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -93,6 +93,8 @@ struct VTDAddressSpace {
     IntelIOMMUState *iommu_state;
     VTDContextCacheEntry context_cache_entry;
     QLIST_ENTRY(VTDAddressSpace) next;
+    /* Superset of notifier flags that this address space has */
+    IOMMUNotifierFlag notifier_flags;
 };
 
 struct VTDBus {
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index e4ee211dde..1c252414a9 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -138,6 +138,12 @@ static inline void vtd_iommu_unlock(IntelIOMMUState *s)
     qemu_mutex_unlock(&s->iommu_lock);
 }
 
+/* Whether the address space needs to notify new mappings */
+static inline gboolean vtd_as_notify_mappings(VTDAddressSpace *as)
+{
+    return as->notifier_flags & IOMMU_NOTIFIER_MAP;
+}
+
 /* GHashTable functions */
 static gboolean vtd_uint64_equal(gconstpointer v1, gconstpointer v2)
 {
@@ -1427,14 +1433,35 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
     VTDAddressSpace *vtd_as;
     VTDContextEntry ce;
     int ret;
+    hwaddr size = (1 << am) * VTD_PAGE_SIZE;
 
     QLIST_FOREACH(vtd_as, &(s->notifiers_list), next) {
         ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
                                        vtd_as->devfn, &ce);
         if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
-            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
-                          vtd_page_invalidate_notify_hook,
-                          (void *)&vtd_as->iommu, true, s->aw_bits);
+            if (vtd_as_notify_mappings(vtd_as)) {
+                /*
+                 * For MAP-inclusive notifiers, we need to walk the
+                 * page table to sync the shadow page table.
+                 */
+                vtd_page_walk(&ce, addr, addr + size,
+                              vtd_page_invalidate_notify_hook,
+                              (void *)&vtd_as->iommu, true, s->aw_bits);
+            } else {
+                /*
+                 * For UNMAP-only notifiers, we don't need to walk the
+                 * page tables.  We just deliver the PSI down to
+                 * invalidate caches.
+                 */
+                IOMMUTLBEntry entry = {
+                    .target_as = &address_space_memory,
+                    .iova = addr,
+                    .translated_addr = 0,
+                    .addr_mask = size - 1,
+                    .perm = IOMMU_NONE,
+                };
+                memory_region_notify_iommu(&vtd_as->iommu, entry);
+            }
         }
     }
 }
@@ -2374,6 +2401,9 @@ static void vtd_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
         exit(1);
     }
 
+    /* Update per-address-space notifier flags */
+    vtd_as->notifier_flags = new;
+
     if (old == IOMMU_NOTIFIER_NONE) {
         /* Insert new ones */
         QLIST_INSERT_HEAD(&s->notifiers_list, vtd_as, next);
@@ -2884,8 +2914,11 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
                                   PCI_FUNC(vtd_as->devfn),
                                   VTD_CONTEXT_ENTRY_DID(ce.hi),
                                   ce.hi, ce.lo);
-        vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
-                      s->aw_bits);
+        if (vtd_as_notify_mappings(vtd_as)) {
+            /* This is required only for MAP typed notifiers */
+            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
+                          s->aw_bits);
+        }
     } else {
         trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
                                     PCI_FUNC(vtd_as->devfn));
-- 
2.14.3


* [Qemu-devel] [PATCH 05/10] intel-iommu: introduce vtd_page_walk_info
From: Peter Xu @ 2018-04-25  4:51 UTC
  To: qemu-devel
  Cc: Michael S. Tsirkin, Alex Williamson, Jintack Lim, peterx, Jason Wang

During the recursive page walk over the IOVA page tables, some of the
parameters are constants that never change during the whole walk.
Isolate them into a struct so that we don't need to pass those
constants down the stack at every level, over and over again.
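
In short, comparing the prototypes (copied from the diff below):

    /* Before: every recursion level re-passes the constants: */
    static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
                                   uint64_t end, vtd_page_walk_hook hook_fn,
                                   void *private, uint32_t level, bool read,
                                   bool write, bool notify_unmap, uint8_t aw);

    /* After: the constants travel together in one struct: */
    static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
                                   uint64_t end, uint32_t level, bool read,
                                   bool write, vtd_page_walk_info *info);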

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 56 ++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 37 insertions(+), 19 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 1c252414a9..42f607676c 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -750,9 +750,27 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
 
 typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
 
+/**
+ * Constant information used during page walking
+ *
+ * @hook_fn: hook func to be called when detected page
+ * @private: private data to be passed into hook func
+ * @notify_unmap: whether we should notify invalid entries
+ * @aw: maximum address width
+ */
+typedef struct {
+    vtd_page_walk_hook hook_fn;
+    void *private;
+    bool notify_unmap;
+    uint8_t aw;
+} vtd_page_walk_info;
+
 static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
-                             vtd_page_walk_hook hook_fn, void *private)
+                             vtd_page_walk_info *info)
 {
+    vtd_page_walk_hook hook_fn = info->hook_fn;
+    void *private = info->private;
+
     assert(hook_fn);
     trace_vtd_page_walk_one(level, entry->iova, entry->translated_addr,
                             entry->addr_mask, entry->perm);
@@ -765,17 +783,13 @@ static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
  * @addr: base GPA addr to start the walk
  * @start: IOVA range start address
  * @end: IOVA range end address (start <= addr < end)
- * @hook_fn: hook func to be called when detected page
- * @private: private data to be passed into hook func
  * @read: whether parent level has read permission
  * @write: whether parent level has write permission
- * @notify_unmap: whether we should notify invalid entries
- * @aw: maximum address width
+ * @info: constant information for the page walk
  */
 static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
-                               uint64_t end, vtd_page_walk_hook hook_fn,
-                               void *private, uint32_t level, bool read,
-                               bool write, bool notify_unmap, uint8_t aw)
+                               uint64_t end, uint32_t level, bool read,
+                               bool write, vtd_page_walk_info *info)
 {
     bool read_cur, write_cur, entry_valid;
     uint32_t offset;
@@ -825,24 +839,24 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
 
         if (vtd_is_last_slpte(slpte, level)) {
             /* NOTE: this is only meaningful if entry_valid == true */
-            entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
-            if (!entry_valid && !notify_unmap) {
+            entry.translated_addr = vtd_get_slpte_addr(slpte, info->aw);
+            if (!entry_valid && !info->notify_unmap) {
                 trace_vtd_page_walk_skip_perm(iova, iova_next);
                 goto next;
             }
-            ret = vtd_page_walk_one(&entry, level, hook_fn, private);
+            ret = vtd_page_walk_one(&entry, level, info);
             if (ret < 0) {
                 return ret;
             }
         } else {
             if (!entry_valid) {
-                if (notify_unmap) {
+                if (info->notify_unmap) {
                     /*
                      * The whole entry is invalid; unmap it all.
                      * Translated address is meaningless, zero it.
                      */
                     entry.translated_addr = 0x0;
-                    ret = vtd_page_walk_one(&entry, level, hook_fn, private);
+                    ret = vtd_page_walk_one(&entry, level, info);
                     if (ret < 0) {
                         return ret;
                     }
@@ -851,10 +865,9 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
                 }
                 goto next;
             }
-            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, aw), iova,
-                                      MIN(iova_next, end), hook_fn, private,
-                                      level - 1, read_cur, write_cur,
-                                      notify_unmap, aw);
+            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, info->aw),
+                                      iova, MIN(iova_next, end), level - 1,
+                                      read_cur, write_cur, info);
             if (ret < 0) {
                 return ret;
             }
@@ -883,6 +896,12 @@ static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
 {
     dma_addr_t addr = vtd_ce_get_slpt_base(ce);
     uint32_t level = vtd_ce_get_level(ce);
+    vtd_page_walk_info info = {
+        .hook_fn = hook_fn,
+        .private = private,
+        .notify_unmap = notify_unmap,
+        .aw = aw,
+    };
 
     if (!vtd_iova_range_check(start, ce, aw)) {
         return -VTD_FR_ADDR_BEYOND_MGAW;
@@ -893,8 +912,7 @@ static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
         end = vtd_iova_limit(ce, aw);
     }
 
-    return vtd_page_walk_level(addr, start, end, hook_fn, private,
-                               level, true, true, notify_unmap, aw);
+    return vtd_page_walk_level(addr, start, end, level, true, true, &info);
 }
 
 /* Map a device to its corresponding domain (context-entry) */
-- 
2.14.3


* [Qemu-devel] [PATCH 06/10] intel-iommu: pass in address space when page walk
From: Peter Xu @ 2018-04-25  4:51 UTC
  To: qemu-devel
  Cc: Michael S. Tsirkin, Alex Williamson, Jintack Lim, peterx, Jason Wang

Pass in the VTDAddressSpace rather than the aw bits when doing the
page walk.  The VTDAddressSpace contains the aw bits information, and
we'll need more from it in the follow-up patches regarding the
address spaces.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 42f607676c..a19c18b8d4 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -756,13 +756,13 @@ typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
  * @hook_fn: hook func to be called when detected page
  * @private: private data to be passed into hook func
  * @notify_unmap: whether we should notify invalid entries
- * @aw: maximum address width
+ * @as: VT-d address space of the device
  */
 typedef struct {
+    VTDAddressSpace *as;
     vtd_page_walk_hook hook_fn;
     void *private;
     bool notify_unmap;
-    uint8_t aw;
 } vtd_page_walk_info;
 
 static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
@@ -799,6 +799,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
     uint64_t iova = start;
     uint64_t iova_next;
     int ret = 0;
+    uint8_t aw = info->as->iommu_state->aw_bits;
 
     trace_vtd_page_walk_level(addr, level, start, end);
 
@@ -839,7 +840,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
 
         if (vtd_is_last_slpte(slpte, level)) {
             /* NOTE: this is only meaningful if entry_valid == true */
-            entry.translated_addr = vtd_get_slpte_addr(slpte, info->aw);
+            entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
             if (!entry_valid && !info->notify_unmap) {
                 trace_vtd_page_walk_skip_perm(iova, iova_next);
                 goto next;
@@ -865,7 +866,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
                 }
                 goto next;
             }
-            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, info->aw),
+            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, aw),
                                       iova, MIN(iova_next, end), level - 1,
                                       read_cur, write_cur, info);
             if (ret < 0) {
@@ -888,19 +889,20 @@ next:
  * @end: IOVA range end address (start <= addr < end)
  * @hook_fn: the hook that to be called for each detected area
  * @private: private data for the hook function
- * @aw: maximum address width
+ * @as: the VT-d address space of the device
  */
 static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
                          vtd_page_walk_hook hook_fn, void *private,
-                         bool notify_unmap, uint8_t aw)
+                         bool notify_unmap, VTDAddressSpace *as)
 {
     dma_addr_t addr = vtd_ce_get_slpt_base(ce);
     uint32_t level = vtd_ce_get_level(ce);
+    uint8_t aw = as->iommu_state->aw_bits;
     vtd_page_walk_info info = {
         .hook_fn = hook_fn,
         .private = private,
         .notify_unmap = notify_unmap,
-        .aw = aw,
+        .as = as,
     };
 
     if (!vtd_iova_range_check(start, ce, aw)) {
@@ -1464,7 +1466,7 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
                  */
                 vtd_page_walk(&ce, addr, addr + size,
                               vtd_page_invalidate_notify_hook,
-                              (void *)&vtd_as->iommu, true, s->aw_bits);
+                              (void *)&vtd_as->iommu, true, vtd_as);
             } else {
                 /*
                  * For UNMAP-only notifiers, we don't need to walk the
@@ -2935,7 +2937,7 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
         if (vtd_as_notify_mappings(vtd_as)) {
             /* This is required only for MAP typed notifiers */
             vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
-                          s->aw_bits);
+                          vtd_as);
         }
     } else {
         trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
-- 
2.14.3


* [Qemu-devel] [PATCH 07/10] util: implement simple interval tree logic
From: Peter Xu @ 2018-04-25  4:51 UTC
  To: qemu-devel
  Cc: Michael S. Tsirkin, Alex Williamson, Jintack Lim, peterx, Jason Wang

Introduce a very simple interval tree implementation based on GTree.
The current implementation is mostly tailored to maintaining and
tracing device-mapped IOVA ranges, but it might still be useful to
other modules in the future.

It is naive in that it does not even allow the user to pass in
private structs along with the ranges.  However, precisely because it
carries no such information, the tree can merge ranges whenever they
become adjacent.  A usage sketch follows.
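
A quick usage sketch (illustrative only, exercising the API declared
in the new header below):

    #include <assert.h>
    #include "qemu/interval-tree.h"

    static void iova_tree_demo(void)
    {
        ITTree *t = it_tree_new();

        it_tree_insert(t, 0x0000, 0x0fff);  /* tree: [0x0000, 0x0fff]   */
        it_tree_insert(t, 0x1000, 0x1fff);  /* adjacent; merged into    */
                                            /* one [0x0000, 0x1fff]     */

        /* Overlapping inserts are refused: */
        assert(it_tree_insert(t, 0x0800, 0x17ff) == IT_ERR_OVERLAP);

        /* Removing the middle splits the merged range into
         * [0x0000, 0x07ff] and [0x1800, 0x1fff]: */
        it_tree_remove(t, 0x0800, 0x17ff);

        it_tree_destroy(t);
    }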

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/qemu/interval-tree.h | 130 ++++++++++++++++++++++++++
 util/interval-tree.c         | 217 +++++++++++++++++++++++++++++++++++++++++++
 util/Makefile.objs           |   1 +
 3 files changed, 348 insertions(+)
 create mode 100644 include/qemu/interval-tree.h
 create mode 100644 util/interval-tree.c

diff --git a/include/qemu/interval-tree.h b/include/qemu/interval-tree.h
new file mode 100644
index 0000000000..4485f4b338
--- /dev/null
+++ b/include/qemu/interval-tree.h
@@ -0,0 +1,130 @@
+/*
+ * An very simplified interval tree implementation based on GTree.
+ *
+ * Copyright 2018 Red Hat, Inc.
+ *
+ * Authors:
+ *  Peter Xu <peterx@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ */
+#ifndef __INTERVAL_TREE_H__
+#define __INTERVAL_TREE_H__
+
+/*
+ * Currently the interval tree only keeps range
+ * information; no extra user data is allowed for each element.  A
+ * benefit is that we can merge adjacent ranges internally within the
+ * tree.  That can save a lot of memory when the ranges are split
+ * but mostly contiguous.
+ *
+ * Note that the current implementation does not provide any thread
+ * protection.  Callers of the interval tree are responsible for
+ * thread safety.
+ */
+
+#include <glib.h>
+
+#define  IT_OK           (0)
+#define  IT_ERR_OVERLAP  (-1)
+
+typedef unsigned long long ITValue;
+typedef struct ITTree ITTree;
+typedef gboolean (*it_tree_iterator)(ITValue start, ITValue end);
+
+struct ITRange {
+    ITValue start;
+    ITValue end;
+};
+typedef struct ITRange ITRange;
+
+/**
+ * it_tree_new:
+ *
+ * Create a new interval tree.
+ *
+ * Returns: the tree pointer when succeeded, or NULL if error.
+ */
+ITTree *it_tree_new(void);
+
+/**
+ * it_tree_insert:
+ *
+ * @tree: the interval tree to insert
+ * @start: the start of range, inclusive
+ * @end: the end of range, inclusive
+ *
+ * Insert an interval range into the tree.  If the new range
+ * overlaps with any existing range, IT_ERR_OVERLAP will be returned.
+ *
+ * Return: 0 on success, or <0 on error.
+ */
+int it_tree_insert(ITTree *tree, ITValue start, ITValue end);
+
+/**
+ * it_tree_remove:
+ *
+ * @tree: the interval tree to remove range from
+ * @start: the start of range, inclusive
+ * @end: the end of range, inclusive
+ *
+ * Remove a range from the tree.  The range does not need to match
+ * exactly what was inserted: every part of an existing range that
+ * falls within the provided range will be removed from the tree.
+ *
+ * Return: 0 on success, or <0 on error.
+ */
+int it_tree_remove(ITTree *tree, ITValue start, ITValue end);
+
+/**
+ * it_tree_find:
+ *
+ * @tree: the interval tree to search from
+ * @start: the start of range, inclusive
+ * @end: the end of range, inclusive
+ *
+ * Search for a range in the interval tree that overlaps with the
+ * range specified.  Only the first found range will be returned.
+ *
+ * Return: the ITRange if found, or NULL if not found.  Note: the
+ * returned ITRange pointer is maintained internally.  Users should
+ * only read the content, never modify or free it.
+ */
+ITRange *it_tree_find(ITTree *tree, ITValue start, ITValue end);
+
+/**
+ * it_tree_find_value:
+ *
+ * @tree: the interval tree to search from
+ * @value: the value to find
+ *
+ * Similar to it_tree_find(), but it tries to find range (value, value).
+ *
+ * Return: same as it_tree_find().
+ */
+ITRange *it_tree_find_value(ITTree *tree, ITValue value);
+
+/**
+ * it_tree_foreach:
+ *
+ * @tree: the interval tree to iterate on
+ * @iterator: the iterator called for each range; return true to stop
+ *
+ * Iterate over all the ranges kept in the interval tree.
+ *
+ * Return: None.
+ */
+void it_tree_foreach(ITTree *tree, it_tree_iterator iterator);
+
+/**
+ * it_tree_destroy:
+ *
+ * @tree: the interval tree to destroy
+ *
+ * Destroy an existing interval tree.
+ *
+ * Return: None.
+ */
+void it_tree_destroy(ITTree *tree);
+
+#endif
diff --git a/util/interval-tree.c b/util/interval-tree.c
new file mode 100644
index 0000000000..0e7a37c2c6
--- /dev/null
+++ b/util/interval-tree.c
@@ -0,0 +1,217 @@
+/*
+ * A very simplified interval tree implementation based on GTree.
+ *
+ * Copyright 2018 Red Hat, Inc.
+ *
+ * Authors:
+ *  Peter Xu <peterx@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ */
+
+#include <glib.h>
+#include "qemu/interval-tree.h"
+
+/*
+ * Each element of the internal tree is an ITRange.  It is shared
+ * between the key and value of the element, or we can see it a tree
+ * with keys only but no values.
+ */
+
+struct ITTree {
+    GTree *tree;
+};
+
+static int it_tree_compare(gconstpointer a, gconstpointer b, gpointer data)
+{
+    const ITRange *r1 = a, *r2 = b;
+
+    if (r1->start > r2->end) {
+        return 1;
+    }
+
+    if (r1->end < r2->start) {
+        return -1;
+    }
+
+    /* Overlapped */
+    return 0;
+}
+
+/* Find out intersection of range A and B, put into OUT */
+static inline void it_range_and(ITRange *out, ITRange *a, ITRange *b)
+{
+    out->start = MAX(a->start, b->start);
+    out->end = MIN(a->end, b->end);
+}
+
+static inline gboolean it_range_equal(ITRange *a, ITRange *b)
+{
+    return a->start == b->start && a->end == b->end;
+}
+
+/* Whether ITRange A is superset of B? */
+static inline gboolean it_range_cover(ITRange *a, ITRange *b)
+{
+    return a->start <= b->start && a->end >= b->end;
+}
+
+ITTree *it_tree_new(void)
+{
+    ITTree *ittree = g_new0(ITTree, 1);
+
+    /* We don't have values actually, no need to free */
+    ittree->tree = g_tree_new_full(it_tree_compare, NULL, g_free, NULL);
+
+    return ittree;
+}
+
+ITRange *it_tree_find(ITTree *tree, ITValue start, ITValue end)
+{
+    ITRange range;
+
+    g_assert(tree);
+
+    range.start = start;
+    range.end = end;
+
+    return g_tree_lookup(tree->tree, &range);
+}
+
+ITRange *it_tree_find_value(ITTree *tree, ITValue value)
+{
+    return it_tree_find(tree, value, value);
+}
+
+static inline void it_tree_insert_internal(GTree *gtree, ITRange *range)
+{
+    /* Key and value are sharing the same range data */
+    g_tree_insert(gtree, range, range);
+}
+
+int it_tree_insert(ITTree *tree, ITValue start, ITValue end)
+{
+    ITRange *range = g_new0(ITRange, 1), *overlap;
+    GTree *gtree;
+
+    g_assert(tree);
+    g_assert(start <= end);
+
+    gtree = tree->tree;
+
+    range->start = start;
+    range->end = end;
+
+    /* We don't allow inserting a range that overlaps with existing ones */
+    if (g_tree_lookup(gtree, range)) {
+        g_free(range);
+        return IT_ERR_OVERLAP;
+    }
+
+    /* Merge left adjacent range */
+    overlap = it_tree_find_value(tree, start - 1);
+    if (overlap) {
+        range->start = overlap->start;
+        g_tree_remove(gtree, overlap);
+    }
+
+    /* Merge right adjacent range */
+    overlap = it_tree_find_value(tree, end + 1);
+    if (overlap) {
+        range->end = overlap->end;
+        g_tree_remove(gtree, overlap);
+    }
+
+    /* Key and value are sharing the same range data */
+    it_tree_insert_internal(gtree, range);
+
+    return IT_OK;
+}
+
+static gboolean it_tree_traverse(gpointer key, gpointer value,
+                                gpointer data)
+{
+    it_tree_iterator iterator = data;
+    ITRange *range = key;
+
+    g_assert(key == value);
+
+    return iterator(range->start, range->end);
+}
+
+void it_tree_foreach(ITTree *tree, it_tree_iterator iterator)
+{
+    g_assert(tree && iterator);
+    g_tree_foreach(tree->tree, it_tree_traverse, iterator);
+}
+
+/* Remove subset `range', which is part of `overlap'. */
+static void it_tree_remove_subset(GTree *gtree, const ITRange *overlap,
+                                  const ITRange *range)
+{
+    ITRange *range1, *range2;
+
+    if (overlap->start < range->start) {
+        range1 = g_new0(ITRange, 1);
+        range1->start = overlap->start;
+        range1->end = range->start - 1;
+    } else {
+        range1 = NULL;
+    }
+    if (range->end < overlap->end) {
+        range2 = g_new0(ITRange, 1);
+        range2->start = range->end + 1;
+        range2->end = overlap->end;
+    } else {
+        range2 = NULL;
+    }
+
+    g_tree_remove(gtree, overlap);
+
+    if (range1) {
+        it_tree_insert_internal(gtree, range1);
+    }
+    if (range2) {
+        it_tree_insert_internal(gtree, range2);
+    }
+}
+
+int it_tree_remove(ITTree *tree, ITValue start, ITValue end)
+{
+    ITRange range = { .start = start, .end = end }, *overlap, and;
+    GTree *gtree;
+
+    g_assert(tree);
+
+    gtree = tree->tree;
+    while ((overlap = g_tree_lookup(gtree, &range))) {
+        if (it_range_equal(overlap, &range)) {
+            /* Exactly what we want to remove; done */
+            g_tree_remove(gtree, overlap);
+            break;
+        } else if (it_range_cover(overlap, &range)) {
+            /* Split existing range into two; done */
+            it_tree_remove_subset(gtree, overlap, &range);
+            break;
+        } else if (it_range_cover(&range, overlap)) {
+            /* Drop this range and continue */
+            g_tree_remove(gtree, overlap);
+        } else {
+            /*
+             * The range to remove has intersection with existing
+             * ranges.  Remove part of the range and continue
+             */
+            it_range_and(&and, overlap, &range);
+            g_assert(and.start <= and.end);
+            it_tree_remove_subset(gtree, overlap, &and);
+        }
+    }
+
+    return IT_OK;
+}
+
+void it_tree_destroy(ITTree *tree)
+{
+    g_tree_destroy(tree->tree);
+    g_free(tree);
+}
diff --git a/util/Makefile.objs b/util/Makefile.objs
index 728c3541db..4ac33910ed 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -47,4 +47,5 @@ util-obj-y += qht.o
 util-obj-y += range.o
 util-obj-y += stats64.o
 util-obj-y += systemd.o
+util-obj-y += interval-tree.o
 util-obj-$(CONFIG_LINUX) += vfio-helpers.o
-- 
2.14.3


* [Qemu-devel] [PATCH 08/10] intel-iommu: maintain per-device iova ranges
From: Peter Xu @ 2018-04-25  4:51 UTC
  To: qemu-devel
  Cc: Michael S. Tsirkin, Alex Williamson, Jintack Lim, peterx, Jason Wang

For each VTDAddressSpace, we now maintain what IOVA ranges we have
mapped and what we have not.  With that information, we only send MAP
or UNMAP notifications when necessary: we don't send MAP notifies if
we know we have already mapped the range, and likewise we don't send
UNMAP notifies if we know we never mapped the range at all.  The
decision matrix is sketched below.
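
The resulting logic in vtd_page_walk_one() (a summary of the diff
below; entry->perm != IOMMU_NONE means a MAP event):

    /*
     *  event   range already in iova_tree?   action
     *  -----   ---------------------------   ------------------------
     *  MAP     yes                           skip: already shadowed
     *  MAP     no                            insert range, call hook
     *  UNMAP   yes                           remove range, call hook
     *  UNMAP   no                            skip: was never mapped
     */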

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/hw/i386/intel_iommu.h |  2 ++
 hw/i386/intel_iommu.c         | 28 ++++++++++++++++++++++++++++
 hw/i386/trace-events          |  2 ++
 3 files changed, 32 insertions(+)

diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 486e205e79..09a2e94404 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -27,6 +27,7 @@
 #include "hw/i386/ioapic.h"
 #include "hw/pci/msi.h"
 #include "hw/sysbus.h"
+#include "qemu/interval-tree.h"
 
 #define TYPE_INTEL_IOMMU_DEVICE "intel-iommu"
 #define INTEL_IOMMU_DEVICE(obj) \
@@ -95,6 +96,7 @@ struct VTDAddressSpace {
     QLIST_ENTRY(VTDAddressSpace) next;
     /* Superset of notifier flags that this address space has */
     IOMMUNotifierFlag notifier_flags;
+    ITTree *iova_tree;          /* Traces mapped IOVA ranges */
 };
 
 struct VTDBus {
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index a19c18b8d4..8f396a5d13 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -768,12 +768,37 @@ typedef struct {
 static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
                              vtd_page_walk_info *info)
 {
+    VTDAddressSpace *as = info->as;
     vtd_page_walk_hook hook_fn = info->hook_fn;
     void *private = info->private;
+    ITRange *mapped = it_tree_find(as->iova_tree, entry->iova,
+                                   entry->iova + entry->addr_mask);
 
     assert(hook_fn);
+
+    /* Update local IOVA mapped ranges */
+    if (entry->perm) {
+        if (mapped) {
+            /* Skip since we have already mapped this range */
+            trace_vtd_page_walk_one_skip_map(entry->iova, entry->addr_mask,
+                                             mapped->start, mapped->end);
+            return 0;
+        }
+        it_tree_insert(as->iova_tree, entry->iova,
+                       entry->iova + entry->addr_mask);
+    } else {
+        if (!mapped) {
+            /* Skip since we didn't map this range at all */
+            trace_vtd_page_walk_one_skip_unmap(entry->iova, entry->addr_mask);
+            return 0;
+        }
+        it_tree_remove(as->iova_tree, entry->iova,
+                       entry->iova + entry->addr_mask);
+    }
+
     trace_vtd_page_walk_one(level, entry->iova, entry->translated_addr,
                             entry->addr_mask, entry->perm);
+
     return hook_fn(entry, private);
 }
 
@@ -2798,6 +2823,7 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
         vtd_dev_as->devfn = (uint8_t)devfn;
         vtd_dev_as->iommu_state = s;
         vtd_dev_as->context_cache_entry.context_cache_gen = 0;
+        vtd_dev_as->iova_tree = it_tree_new();
 
         /*
          * Memory region relationships looks like (Address range shows
@@ -2894,6 +2920,8 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
                              VTD_PCI_FUNC(as->devfn),
                              entry.iova, size);
 
+    it_tree_remove(as->iova_tree, entry.iova, entry.iova + entry.addr_mask);
+
     memory_region_notify_one(n, &entry);
 }
 
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 22d44648af..677f83420d 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -40,6 +40,8 @@ vtd_replay_ce_valid(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t domain, uint6
 vtd_replay_ce_invalid(uint8_t bus, uint8_t dev, uint8_t fn) "replay invalid context device %02"PRIx8":%02"PRIx8".%02"PRIx8
 vtd_page_walk_level(uint64_t addr, uint32_t level, uint64_t start, uint64_t end) "walk (base=0x%"PRIx64", level=%"PRIu32") iova range 0x%"PRIx64" - 0x%"PRIx64
 vtd_page_walk_one(uint32_t level, uint64_t iova, uint64_t gpa, uint64_t mask, int perm) "detected page level 0x%"PRIx32" iova 0x%"PRIx64" -> gpa 0x%"PRIx64" mask 0x%"PRIx64" perm %d"
+vtd_page_walk_one_skip_map(uint64_t iova, uint64_t mask, uint64_t start, uint64_t end) "iova 0x%"PRIx64" mask 0x%"PRIx64" start 0x%"PRIx64" end 0x%"PRIx64
+vtd_page_walk_one_skip_unmap(uint64_t iova, uint64_t mask) "iova 0x%"PRIx64" mask 0x%"PRIx64
 vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
 vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
 vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
-- 
2.14.3


* [Qemu-devel] [PATCH 09/10] intel-iommu: don't unmap all for shadow page table
From: Peter Xu @ 2018-04-25  4:51 UTC
  To: qemu-devel
  Cc: Michael S. Tsirkin, Alex Williamson, Jintack Lim, peterx, Jason Wang

IOMMU replay was previously carried out in many use cases, e.g.,
context cache invalidations and domain flushes.  We used this
mechanism to sync the shadow page table by first (1) unmapping the
whole address space, then (2) walking the page table to remap what's
in the table.

This is very dangerous.

The problem is that between steps (1) and (2) there is a very small
window (in my measurements, it can be about 3ms) during which the
device sees no (or an incomplete) page table.  However, the device
never knows that.  This can cause DMA errors for devices, which
assume the page table is always there.

So the point is that, for MAP typed notifiers (vfio-pci, for
example), the mapped page entries must always be there.  We can never
unmap existing page entries like we did in step (1) above.

The only solution is to remove step (1).  We couldn't do that before,
since we didn't know which device pages were mapped and which were
not, so we unmapped them all.  Now, with the new IOVA tree, QEMU
knows what has been mapped and what has not, so we don't need step
(1) any more.  Remove it.

Note that after removing the global unmap flush, we need to notify
unmaps during the page walks instead.

This should fix the DMA error problem that Jintack Lim reported with
nested device assignment.  The problem does not happen every time,
e.g., I cannot reproduce the error myself; however, the collected
logs show that this is the likely cause of Jintack's problem.  The
replay flow before and after this patch is sketched below.

Reported-by: Jintack Lim <jintack@cs.columbia.edu>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 8f396a5d13..dedaebc46b 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2952,10 +2952,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
 
     /*
      * The replay can be triggered by either an invalidation or a newly
-     * created entry. No matter what, we release existing mappings
-     * (it means flushing caches for UNMAP-only registers).
+     * created entry.
      */
-    vtd_address_space_unmap(vtd_as, n);
 
     if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
         trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
@@ -2964,8 +2962,10 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
                                   ce.hi, ce.lo);
         if (vtd_as_notify_mappings(vtd_as)) {
             /* This is required only for MAP typed notifiers */
-            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
+            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, true,
                           vtd_as);
+        } else {
+            vtd_address_space_unmap(vtd_as, n);
         }
     } else {
         trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [Qemu-devel] [PATCH 10/10] intel-iommu: remove notify_unmap for page walk
  2018-04-25  4:51 [Qemu-devel] [PATCH 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes Peter Xu
                   ` (8 preceding siblings ...)
  2018-04-25  4:51 ` [Qemu-devel] [PATCH 09/10] intel-iommu: don't unmap all for shadow page table Peter Xu
@ 2018-04-25  4:51 ` Peter Xu
  2018-04-25  5:05 ` [Qemu-devel] [PATCH 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes no-reply
  10 siblings, 0 replies; 52+ messages in thread
From: Peter Xu @ 2018-04-25  4:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S . Tsirkin, Alex Williamson, Jintack Lim, peterx, Jason Wang

After the previous changes, we will always pass that parameter as true.
Remove it.  With that variable gone, we can greatly simplify the page
walking logic.

No functional change at all.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 70 +++++++++++++++++++++------------------------------
 hw/i386/trace-events  |  1 -
 2 files changed, 28 insertions(+), 43 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index dedaebc46b..31e9b52452 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -762,7 +762,6 @@ typedef struct {
     VTDAddressSpace *as;
     vtd_page_walk_hook hook_fn;
     void *private;
-    bool notify_unmap;
 } vtd_page_walk_info;
 
 static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
@@ -858,45 +857,34 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
          */
         entry_valid = read_cur | write_cur;
 
-        entry.target_as = &address_space_memory;
-        entry.iova = iova & subpage_mask;
-        entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
-        entry.addr_mask = ~subpage_mask;
-
-        if (vtd_is_last_slpte(slpte, level)) {
-            /* NOTE: this is only meaningful if entry_valid == true */
-            entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
-            if (!entry_valid && !info->notify_unmap) {
-                trace_vtd_page_walk_skip_perm(iova, iova_next);
-                goto next;
-            }
-            ret = vtd_page_walk_one(&entry, level, info);
-            if (ret < 0) {
-                return ret;
-            }
-        } else {
-            if (!entry_valid) {
-                if (info->notify_unmap) {
-                    /*
-                     * The whole entry is invalid; unmap it all.
-                     * Translated address is meaningless, zero it.
-                     */
-                    entry.translated_addr = 0x0;
-                    ret = vtd_page_walk_one(&entry, level, info);
-                    if (ret < 0) {
-                        return ret;
-                    }
-                } else {
-                    trace_vtd_page_walk_skip_perm(iova, iova_next);
-                }
-                goto next;
-            }
+        if (!vtd_is_last_slpte(slpte, level) && entry_valid) {
+            /*
+             * This is a valid PDE (or even bigger than PDE).  We need
+             * to walk one further level.
+             */
             ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, aw),
                                       iova, MIN(iova_next, end), level - 1,
                                       read_cur, write_cur, info);
-            if (ret < 0) {
-                return ret;
-            }
+        } else {
+            /*
+             * This means we are either:
+             *
+             * (1) the real page entry (either 4K page, or huge page)
+             * (2) the whole range is invalid
+             *
+             * In either case, we send an IOTLB notification down.
+             */
+            entry.target_as = &address_space_memory;
+            entry.iova = iova & subpage_mask;
+            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
+            entry.addr_mask = ~subpage_mask;
+            /* NOTE: this is only meaningful if entry_valid == true */
+            entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
+            ret = vtd_page_walk_one(&entry, level, info);
+        }
+
+        if (ret < 0) {
+            return ret;
         }
 
 next:
@@ -918,7 +906,7 @@ next:
  */
 static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
                          vtd_page_walk_hook hook_fn, void *private,
-                         bool notify_unmap, VTDAddressSpace *as)
+                         VTDAddressSpace *as)
 {
     dma_addr_t addr = vtd_ce_get_slpt_base(ce);
     uint32_t level = vtd_ce_get_level(ce);
@@ -926,7 +914,6 @@ static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
     vtd_page_walk_info info = {
         .hook_fn = hook_fn,
         .private = private,
-        .notify_unmap = notify_unmap,
         .as = as,
     };
 
@@ -1491,7 +1478,7 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
                  */
                 vtd_page_walk(&ce, addr, addr + size,
                               vtd_page_invalidate_notify_hook,
-                              (void *)&vtd_as->iommu, true, vtd_as);
+                              (void *)&vtd_as->iommu, vtd_as);
             } else {
                 /*
                  * For UNMAP-only notifiers, we don't need to walk the
@@ -2962,8 +2949,7 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
                                   ce.hi, ce.lo);
         if (vtd_as_notify_mappings(vtd_as)) {
             /* This is required only for MAP typed notifiers */
-            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, true,
-                          vtd_as);
+            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, vtd_as);
         } else {
             vtd_address_space_unmap(vtd_as, n);
         }
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 677f83420d..eb23ace5fb 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -43,7 +43,6 @@ vtd_page_walk_one(uint32_t level, uint64_t iova, uint64_t gpa, uint64_t mask, in
 vtd_page_walk_one_skip_map(uint64_t iova, uint64_t mask, uint64_t start, uint64_t end) "iova 0x%"PRIx64" mask 0x%"PRIx64" start 0x%"PRIx64" end 0x%"PRIx64
 vtd_page_walk_one_skip_unmap(uint64_t iova, uint64_t mask) "iova 0x%"PRIx64" mask 0x%"PRIx64
 vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
-vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
 vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
 vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
 vtd_as_unmap_whole(uint8_t bus, uint8_t slot, uint8_t fn, uint64_t iova, uint64_t size) "Device %02x:%02x.%x start 0x%"PRIx64" size 0x%"PRIx64
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes
  2018-04-25  4:51 [Qemu-devel] [PATCH 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes Peter Xu
                   ` (9 preceding siblings ...)
  2018-04-25  4:51 ` [Qemu-devel] [PATCH 10/10] intel-iommu: remove notify_unmap for page walk Peter Xu
@ 2018-04-25  5:05 ` no-reply
  2018-04-25  5:34   ` Peter Xu
  10 siblings, 1 reply; 52+ messages in thread
From: no-reply @ 2018-04-25  5:05 UTC (permalink / raw)
  To: peterx; +Cc: famz, qemu-devel, jasowang, alex.williamson, jintack, mst

Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20180425045129.17449-1-peterx@redhat.com
Subject: [Qemu-devel] [PATCH 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
    echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
    if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
        failed=1
        echo
    fi
    n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 t [tag update]            patchew/1524592709-6553-1-git-send-email-ian.jackson@eu.citrix.com -> patchew/1524592709-6553-1-git-send-email-ian.jackson@eu.citrix.com
 t [tag update]            patchew/20180424152405.10304-1-alex.bennee@linaro.org -> patchew/20180424152405.10304-1-alex.bennee@linaro.org
 * [new tag]               patchew/20180425045129.17449-1-peterx@redhat.com -> patchew/20180425045129.17449-1-peterx@redhat.com
Switched to a new branch 'test'
985e4bedbc intel-iommu: remove notify_unmap for page walk
d83649a53c intel-iommu: don't unmap all for shadow page table
e5c03427c9 intel-iommu: maintain per-device iova ranges
7719aa1ac9 util: implement simple interval tree logic
f62d669eaa intel-iommu: pass in address space when page walk
95259f34e9 intel-iommu: introduce vtd_page_walk_info
ef83611a65 intel-iommu: only do page walk for MAP notifiers
8e606cbed9 intel-iommu: add iommu lock
dbfb3683ce intel-iommu: remove IntelIOMMUNotifierNode
73e8c605b3 intel-iommu: send PSI always even if across PDEs

=== OUTPUT BEGIN ===
Checking PATCH 1/10: intel-iommu: send PSI always even if across PDEs...
Checking PATCH 2/10: intel-iommu: remove IntelIOMMUNotifierNode...
Checking PATCH 3/10: intel-iommu: add iommu lock...
Checking PATCH 4/10: intel-iommu: only do page walk for MAP notifiers...
Checking PATCH 5/10: intel-iommu: introduce vtd_page_walk_info...
Checking PATCH 6/10: intel-iommu: pass in address space when page walk...
Checking PATCH 7/10: util: implement simple interval tree logic...
WARNING: architecture specific defines should be avoided
#34: FILE: include/qemu/interval-tree.h:11:
+#ifndef __INTERVAL_TREE_H__

ERROR: space prohibited between function name and open parenthesis '('
#56: FILE: include/qemu/interval-tree.h:33:
+typedef gboolean (*it_tree_iterator)(ITValue start, ITValue end);

total: 1 errors, 1 warnings, 352 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 8/10: intel-iommu: maintain per-device iova ranges...
Checking PATCH 9/10: intel-iommu: don't unmap all for shadow page table...
Checking PATCH 10/10: intel-iommu: remove notify_unmap for page walk...
=== OUTPUT END ===

Test command exited with code: 1


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes
  2018-04-25  5:05 ` [Qemu-devel] [PATCH 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes no-reply
@ 2018-04-25  5:34   ` Peter Xu
  0 siblings, 0 replies; 52+ messages in thread
From: Peter Xu @ 2018-04-25  5:34 UTC (permalink / raw)
  To: qemu-devel; +Cc: famz, jasowang, alex.williamson, jintack, mst

On Tue, Apr 24, 2018 at 10:05:14PM -0700, no-reply@patchew.org wrote:

[...]

> === OUTPUT BEGIN ===
> Checking PATCH 1/10: intel-iommu: send PSI always even if across PDEs...
> Checking PATCH 2/10: intel-iommu: remove IntelIOMMUNotifierNode...
> Checking PATCH 3/10: intel-iommu: add iommu lock...
> Checking PATCH 4/10: intel-iommu: only do page walk for MAP notifiers...
> Checking PATCH 5/10: intel-iommu: introduce vtd_page_walk_info...
> Checking PATCH 6/10: intel-iommu: pass in address space when page walk...
> Checking PATCH 7/10: util: implement simple interval tree logic...
> WARNING: architecture specific defines should be avoided
> #34: FILE: include/qemu/interval-tree.h:11:
> +#ifndef __INTERVAL_TREE_H__

This is valid; I'll remove the __ prefix/suffix and that will fix it.

> 
> ERROR: space prohibited between function name and open parenthesis '('
> #56: FILE: include/qemu/interval-tree.h:33:
> +typedef gboolean (*it_tree_iterator)(ITValue start, ITValue end);

While this one is not; we possibly need to add the gboolean type to the
type list.  I'll post a standalone patch for checkpatch instead.

> 
> total: 1 errors, 1 warnings, 352 lines checked
> 
> Your patch has style problems, please review.  If any of these errors
> are false positives report them to the maintainer, see
> CHECKPATCH in MAINTAINERS.
> 
> Checking PATCH 8/10: intel-iommu: maintain per-device iova ranges...
> Checking PATCH 9/10: intel-iommu: don't unmap all for shadow page table...
> Checking PATCH 10/10: intel-iommu: remove notify_unmap for page walk...
> === OUTPUT END ===
> 
> Test command exited with code: 1
> 
> 
> ---
> Email generated automatically by Patchew [http://patchew.org/].
> Please send your feedback to patchew-devel@redhat.com

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 03/10] intel-iommu: add iommu lock
  2018-04-25  4:51 ` [Qemu-devel] [PATCH 03/10] intel-iommu: add iommu lock Peter Xu
@ 2018-04-25 16:26   ` Emilio G. Cota
  2018-04-26  5:45     ` Peter Xu
  2018-04-27  5:13   ` Jason Wang
  1 sibling, 1 reply; 52+ messages in thread
From: Emilio G. Cota @ 2018-04-25 16:26 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Jason Wang, Alex Williamson, Jintack Lim,
	Michael S . Tsirkin

On Wed, Apr 25, 2018 at 12:51:22 +0800, Peter Xu wrote:
> Add a per-iommu big lock to protect IOMMU status.  Currently the only
> thing to be protected is the IOTLB cache, since that can be accessed
> even without BQL, e.g., in IO dataplane.

Is the added lock going to protect anything else beyond a g_hash_table?
If not, did you consider using qht instead of the lock + hash table?
It would give you lockless lookups as well as scalable updates.
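
For reference, a minimal sketch of the per-IOMMU lock + hash table
pattern under discussion (the type and helper names here are
hypothetical, not from the actual patch):

  #include "qemu/osdep.h"
  #include "qemu/thread.h"

  typedef struct IOMMUCacheSketch {
      QemuMutex lock;     /* protects iotlb when running without the BQL */
      GHashTable *iotlb;  /* IOTLB cache: key -> cached translation */
  } IOMMUCacheSketch;

  /* Both lookup and update paths take the per-IOMMU lock, not the BQL */
  static gpointer iotlb_lookup(IOMMUCacheSketch *c, gconstpointer key)
  {
      gpointer val;

      qemu_mutex_lock(&c->lock);
      val = g_hash_table_lookup(c->iotlb, key);
      qemu_mutex_unlock(&c->lock);
      return val;
  }

QHT would instead let the lookup side proceed without taking any lock.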

[Sorry if my question isn't relevant, I am not familiar with
this code and just briefly skimmed over this series.]

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 03/10] intel-iommu: add iommu lock
  2018-04-25 16:26   ` Emilio G. Cota
@ 2018-04-26  5:45     ` Peter Xu
  0 siblings, 0 replies; 52+ messages in thread
From: Peter Xu @ 2018-04-26  5:45 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: qemu-devel, Jason Wang, Alex Williamson, Jintack Lim,
	Michael S . Tsirkin

On Wed, Apr 25, 2018 at 12:26:58PM -0400, Emilio G. Cota wrote:
> On Wed, Apr 25, 2018 at 12:51:22 +0800, Peter Xu wrote:
> > Add a per-iommu big lock to protect IOMMU status.  Currently the only
> > thing to be protected is the IOTLB cache, since that can be accessed
> > even without BQL, e.g., in IO dataplane.
> 
> Is the added lock going to protect anything else beyond a g_hash_table?

It's only for that now.  Maybe there will be other candidates later.

> If not, did you consider using qht instead of the lock + hash table?
> It would give you lockless lookups as well as scalable updates.

Thanks for pointing that out.  I didn't even know that we had such a
nice hash there. :) Though for this series I would still prefer to
keep the mutex approach as is for now.

Firstly, I'd better fix the problem before considering performance -
after all this series is not really pursuing performance but fixing
holes.  So we can have follow-up work for QHT conversions.  Meanwhile,
the IOTLB is a bit special in that it might really be updated more
often than we'd think, so I'm not really sure whether an RCU-based
hash would bring better performance.  And that is not even considering
the fact that the Linux guest intel-iommu driver does more tricks
(like frequently flushing IOTLBs) that make the updates even more
frequent than usual.

But again, the suggestion is valid and it is worth further
investigation to make sure my understanding is correct.

> 
> [Sorry if my question isn't relevant, I am not familiar with
> this code and just briefly skimmed over this series.]

It's of course relevant.  Thanks for your input!

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 03/10] intel-iommu: add iommu lock
  2018-04-25  4:51 ` [Qemu-devel] [PATCH 03/10] intel-iommu: add iommu lock Peter Xu
  2018-04-25 16:26   ` Emilio G. Cota
@ 2018-04-27  5:13   ` Jason Wang
  2018-04-27  6:26     ` Peter Xu
  1 sibling, 1 reply; 52+ messages in thread
From: Jason Wang @ 2018-04-27  5:13 UTC (permalink / raw)
  To: Peter Xu, qemu-devel; +Cc: Michael S . Tsirkin, Alex Williamson, Jintack Lim



On 2018-04-25 12:51, Peter Xu wrote:
> Add a per-iommu big lock to protect IOMMU status.  Currently the only
> thing to be protected is the IOTLB cache, since that can be accessed
> even without BQL, e.g., in IO dataplane.
>
> Note that device page tables should not need any protection.  The safety
> of that should be provided by guest OS.  E.g., when a page entry is
> freed, the guest OS should be responsible to make sure that no device
> will be using that page any more.
>
> Reported-by: Fam Zheng<famz@redhat.com>
> Signed-off-by: Peter Xu<peterx@redhat.com>
> ---
>   include/hw/i386/intel_iommu.h |  8 ++++++++
>   hw/i386/intel_iommu.c         | 31 +++++++++++++++++++++++++++++--
>   2 files changed, 37 insertions(+), 2 deletions(-)
>
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 220697253f..1a8ba8e415 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -262,6 +262,14 @@ struct IntelIOMMUState {
>       uint8_t w1cmask[DMAR_REG_SIZE]; /* RW1C(Write 1 to Clear) bytes */
>       uint8_t womask[DMAR_REG_SIZE];  /* WO (write only - read returns 0) */
>       uint32_t version;
> +    /*
> +     * Protects IOMMU states in general.  Normally we don't need to
> +     * take this lock when we are with BQL held.  However we have code
> +     * paths that may run even without BQL.  In those cases, we need
> +     * to take the lock when we have access to IOMMU state
> +     * informations, e.g., the IOTLB.
> +     */
> +    QemuMutex iommu_lock;

Some questions:

1) Do we need to protect context cache too?
2) Can we just reuse qemu BQL here?
3) I think the issue is common to all other kinds of IOMMUs, so can we 
simply synchronize before calling ->translate() in memory.c? This seems 
a more common solution.

Thanks

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] util: implement simple interval tree logic
  2018-04-25  4:51 ` [Qemu-devel] [PATCH 07/10] util: implement simple interval tree logic Peter Xu
@ 2018-04-27  5:53   ` Jason Wang
  2018-04-27  6:27     ` Peter Xu
  2018-05-03  7:10     ` Peter Xu
  0 siblings, 2 replies; 52+ messages in thread
From: Jason Wang @ 2018-04-27  5:53 UTC (permalink / raw)
  To: Peter Xu, qemu-devel; +Cc: Michael S . Tsirkin, Alex Williamson, Jintack Lim



On 2018-04-25 12:51, Peter Xu wrote:
> Introduce a simple interval tree implementation based on GTree.
> Current implementation is mostly tailored to maintain and trace device
> mapped IOVA ranges, but still it might be useful to other modules in the
> future.
>
> It is naive in that it does not even allow the user to pass in private
> structs along with the ranges.  However it's good in that the tree can
> merge adjacent ranges when necessary without such information.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   include/qemu/interval-tree.h | 130 ++++++++++++++++++++++++++
>   util/interval-tree.c         | 217 +++++++++++++++++++++++++++++++++++++++++++
>   util/Makefile.objs           |   1 +
>   3 files changed, 348 insertions(+)
>   create mode 100644 include/qemu/interval-tree.h
>   create mode 100644 util/interval-tree.c
>
> diff --git a/include/qemu/interval-tree.h b/include/qemu/interval-tree.h
> new file mode 100644
> index 0000000000..4485f4b338
> --- /dev/null
> +++ b/include/qemu/interval-tree.h
> @@ -0,0 +1,130 @@
> +/*
> + * A very simplified interval tree implementation based on GTree.
> + *
> + * Copyright 2018 Red Hat, Inc.
> + *
> + * Authors:
> + *  Peter Xu <peterx@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + */
> +#ifndef __INTERVAL_TREE_H__
> +#define __INTERVAL_TREE_H__
> +
> +/*
> + * Currently the interval tree will only keep range
> + * information, and no extra user data is allowed for each element.  A
> + * benefit is that we can merge adjacent ranges internally within the
> + * tree.  It can save a lot of memory when the ranges are split but
> + * mostly contiguous.
> + *
> + * Note that current implementation does not provide any thread
> + * protections.  Callers of the interval tree should be responsible
> + * for the thread safety issue.
> + */
> +
> +#include <glib.h>
> +
> +#define  IT_OK           (0)
> +#define  IT_ERR_OVERLAP  (-1)
> +
> +typedef unsigned long long ITValue;
> +typedef struct ITTree ITTree;
> +typedef gboolean (*it_tree_iterator)(ITValue start, ITValue end);
> +
> +struct ITRange {
> +    ITValue start;
> +    ITValue end;
> +};
> +typedef struct ITRange ITRange;
> +
> +/**
> + * it_tree_new:
> + *
> + * Create a new interval tree.
> + *
> + * Returns: the tree pointer when succeeded, or NULL if error.
> + */
> +ITTree *it_tree_new(void);
> +
> +/**
> + * it_tree_insert:
> + *
> + * @tree: the interval tree to insert
> + * @start: the start of range, inclusive
> + * @end: the end of range, inclusive
> + *
> + * Insert an interval range into the tree.  If it overlaps with
> + * existing ranges, IT_ERR_OVERLAP will be returned.
> + *
> + * Return: 0 if succeeded, or <0 if error.
> + */
> +int it_tree_insert(ITTree *tree, ITValue start, ITValue end);
> +
> +/**
> + * it_tree_remove:
> + *
> + * @tree: the interval tree to remove range from
> + * @start: the start of range, inclusive
> + * @end: the end of range, inclusive
> + *
> + * Remove a range from the tree.  The range does not need to be
> + * exactly what was inserted.  All the ranges that are included in the
> + * provided range will be removed from the tree.
> + *
> + * Return: 0 if succeeded, or <0 if error.
> + */
> +int it_tree_remove(ITTree *tree, ITValue start, ITValue end);
> +
> +/**
> + * it_tree_find:
> + *
> + * @tree: the interval tree to search from
> + * @start: the start of range, inclusive
> + * @end: the end of range, inclusive
> + *
> + * Search for a range in the interval tree that overlaps with the
> + * range specified.  Only the first found range will be returned.
> + *
> + * Return: ITRange if found, or NULL if not found.  Note: the returned
> + * ITRange pointer is maintained internally.  User should only read
> + * the content but never modify or free the content.
> + */
> +ITRange *it_tree_find(ITTree *tree, ITValue start, ITValue end);
> +
> +/**
> + * it_tree_find_value:
> + *
> + * @tree: the interval tree to search from
> + * @value: the value to find
> + *
> + * Similar to it_tree_find(), but it tries to find range (value, value).
> + *
> + * Return: same as it_tree_find().
> + */
> +ITRange *it_tree_find_value(ITTree *tree, ITValue value);
> +
> +/**
> + * it_tree_foreach:
> + *
> + * @tree: the interval tree to iterate on
> + * @iterator: the iterator for the ranges, return true to stop
> + *
> + * Iterate over all the ranges in the interval tree.
> + *
> + * Return: None.
> + */
> +void it_tree_foreach(ITTree *tree, it_tree_iterator iterator);
> +
> +/**
> + * it_tree_destroy:
> + *
> + * @tree: the interval tree to destroy
> + *
> + * Destroy an existing interval tree.
> + *
> + * Return: None.
> + */
> +void it_tree_destroy(ITTree *tree);
> +
> +#endif
> diff --git a/util/interval-tree.c b/util/interval-tree.c
> new file mode 100644
> index 0000000000..0e7a37c2c6
> --- /dev/null
> +++ b/util/interval-tree.c
> @@ -0,0 +1,217 @@
> +/*
> + * A very simplified interval tree implementation based on GTree.
> + *
> + * Copyright 2018 Red Hat, Inc.
> + *
> + * Authors:
> + *  Peter Xu <peterx@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + */
> +
> +#include <glib.h>
> +#include "qemu/interval-tree.h"
> +
> +/*
> + * Each element of the internal tree is an ITRange.  It is shared
> + * between the key and value of the element, or we can see it a tree
> + * with keys only but no values.
> + */
> +
> +struct ITTree {
> +    GTree *tree;
> +};
> +
> +static int it_tree_compare(gconstpointer a, gconstpointer b, gpointer data)
> +{
> +    const ITRange *r1 = a, *r2 = b;
> +
> +    if (r1->start > r2->end) {
> +        return 1;
> +    }
> +
> +    if (r1->end < r2->start) {
> +        return -1;
> +    }
> +
> +    /* Overlapped */
> +    return 0;
> +}
> +
> +/* Find out intersection of range A and B, put into OUT */
> +static inline void it_range_and(ITRange *out, ITRange *a, ITRange *b)
> +{
> +    out->start = MAX(a->start, b->start);
> +    out->end = MIN(a->end, b->end);
> +}
> +
> +static inline gboolean it_range_equal(ITRange *a, ITRange *b)
> +{
> +    return a->start == b->start && a->end == b->end;
> +}
> +
> +/* Whether ITRange A is superset of B? */
> +static inline gboolean it_range_cover(ITRange *a, ITRange *b)
> +{
> +    return a->start <= b->start && a->end >= b->end;
> +}
> +
> +ITTree *it_tree_new(void)
> +{
> +    ITTree *ittree = g_new0(ITTree, 1);
> +
> +    /* We don't have values actually, no need to free */
> +    ittree->tree = g_tree_new_full(it_tree_compare, NULL, g_free, NULL);
> +
> +    return ittree;
> +}
> +
> +ITRange *it_tree_find(ITTree *tree, ITValue start, ITValue end)
> +{
> +    ITRange range;
> +
> +    g_assert(tree);
> +
> +    range.start = start;
> +    range.end = end;
> +
> +    return g_tree_lookup(tree->tree, &range);
> +}
> +
> +ITRange *it_tree_find_value(ITTree *tree, ITValue value)
> +{
> +    return it_tree_find(tree, value, value);
> +}
> +
> +static inline void it_tree_insert_internal(GTree *gtree, ITRange *range)
> +{
> +    /* Key and value are sharing the same range data */
> +    g_tree_insert(gtree, range, range);
> +}
> +
> +int it_tree_insert(ITTree *tree, ITValue start, ITValue end)
> +{
> +    ITRange *range = g_new0(ITRange, 1), *overlap;
> +    GTree *gtree;
> +
> +    g_assert(tree);
> +    g_assert(start <= end);
> +
> +    gtree = tree->tree;
> +
> +    range->start = start;
> +    range->end = end;
> +
> +    /* We don't allow inserting a range that overlaps with existing ones */
> +    if (g_tree_lookup(gtree, range)) {
> +        g_free(range);
> +        return IT_ERR_OVERLAP;
> +    }
> +
> +    /* Merge left adjacent range */
> +    overlap = it_tree_find_value(tree, start - 1);
> +    if (overlap) {
> +        range->start = overlap->start;
> +        g_tree_remove(gtree, overlap);
> +    }
> +
> +    /* Merge right adjacent range */
> +    overlap = it_tree_find_value(tree, end + 1);
> +    if (overlap) {
> +        range->end = overlap->end;
> +        g_tree_remove(gtree, overlap);
> +    }
> +
> +    /* Key and value are sharing the same range data */
> +    it_tree_insert_internal(gtree, range);

A small optimization here is to delay the allocation until you're sure
there's no overlap.
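
Something like the following, perhaps (a sketch of the reordering,
untested; same names as in the patch):

  int it_tree_insert(ITTree *tree, ITValue start, ITValue end)
  {
      ITRange lookup = { .start = start, .end = end }, *range;
      GTree *gtree;

      g_assert(tree);
      g_assert(start <= end);
      gtree = tree->tree;

      /* Check for overlap first; allocate only on the success path */
      if (g_tree_lookup(gtree, &lookup)) {
          return IT_ERR_OVERLAP;
      }

      range = g_new0(ITRange, 1);
      range->start = start;
      range->end = end;
      /* ...merge left/right adjacent ranges and insert as before... */
      it_tree_insert_internal(gtree, range);

      return IT_OK;
  }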

> +
> +    return IT_OK;
> +}
> +
> +static gboolean it_tree_traverse(gpointer key, gpointer value,
> +                                gpointer data)
> +{
> +    it_tree_iterator iterator = data;
> +    ITRange *range = key;
> +
> +    g_assert(key == value);
> +
> +    return iterator(range->start, range->end);
> +}
> +
> +void it_tree_foreach(ITTree *tree, it_tree_iterator iterator)
> +{
> +    g_assert(tree && iterator);
> +    g_tree_foreach(tree->tree, it_tree_traverse, iterator);
> +}
> +
> +/* Remove subset `range', which is part of `overlap'. */
> +static void it_tree_remove_subset(GTree *gtree, const ITRange *overlap,
> +                                  const ITRange *range)
> +{
> +    ITRange *range1, *range2;
> +
> +    if (overlap->start < range->start) {
> +        range1 = g_new0(ITRange, 1);
> +        range1->start = overlap->start;
> +        range1->end = range->start - 1;
> +    } else {
> +        range1 = NULL;
> +    }
> +    if (range->end < overlap->end) {
> +        range2 = g_new0(ITRange, 1);
> +        range2->start = range->end + 1;
> +        range2->end = overlap->end;
> +    } else {
> +        range2 = NULL;
> +    }
> +
> +    g_tree_remove(gtree, overlap);
> +
> +    if (range1) {
> +        it_tree_insert_internal(gtree, range1);
> +    }
> +    if (range2) {
> +        it_tree_insert_internal(gtree, range2);
> +    }
> +}
> +
> +int it_tree_remove(ITTree *tree, ITValue start, ITValue end)
> +{
> +    ITRange range = { .start = start, .end = end }, *overlap, and;
> +    GTree *gtree;
> +
> +    g_assert(tree);
> +
> +    gtree = tree->tree;
> +    while ((overlap = g_tree_lookup(gtree, &range))) {
> +        if (it_range_equal(overlap, &range)) {
> +            /* Exactly what we want to remove; done */
> +            g_tree_remove(gtree, overlap);
> +            break;
> +        } else if (it_range_cover(overlap, &range)) {
> +            /* Split existing range into two; done */
> +            it_tree_remove_subset(gtree, overlap, &range);
> +            break;
> +        } else if (it_range_cover(&range, overlap)) {
> +            /* Drop this range and continue */
> +            g_tree_remove(gtree, overlap);
> +        } else {
> +            /*
> +             * The range to remove has intersection with existing
> +             * ranges.  Remove part of the range and continue
> +             */
> +            it_range_and(&and, overlap, &range);
> +            g_assert(and.start <= and.end);
> +            it_tree_remove_subset(gtree, overlap, &and);
> +        }
> +    }
> +

Looks like what we need here is to just calculate the intersection part
and then remove it.
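
I.e., each iteration could reduce to "intersect, then punch the hole"
(sketch, untested):

  while ((overlap = g_tree_lookup(gtree, &range))) {
      /* The intersection is always a subset of `overlap' */
      it_range_and(&and, overlap, &range);
      g_assert(and.start <= and.end);
      it_tree_remove_subset(gtree, overlap, &and);
  }

since it_tree_remove_subset() already covers the "equal" and "covered"
cases by inserting zero, one, or two remainder ranges.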

Thanks

> +    return IT_OK;
> +}
> +
> +void it_tree_destroy(ITTree *tree)
> +{
> +    g_tree_destroy(tree->tree);
> +    g_free(tree);
> +}
> diff --git a/util/Makefile.objs b/util/Makefile.objs
> index 728c3541db..4ac33910ed 100644
> --- a/util/Makefile.objs
> +++ b/util/Makefile.objs
> @@ -47,4 +47,5 @@ util-obj-y += qht.o
>   util-obj-y += range.o
>   util-obj-y += stats64.o
>   util-obj-y += systemd.o
> +util-obj-y += interval-tree.o
>   util-obj-$(CONFIG_LINUX) += vfio-helpers.o

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 08/10] intel-iommu: maintain per-device iova ranges
  2018-04-25  4:51 ` [Qemu-devel] [PATCH 08/10] intel-iommu: maintain per-device iova ranges Peter Xu
@ 2018-04-27  6:07   ` Jason Wang
  2018-04-27  6:34     ` Peter Xu
  2018-04-27  7:02     ` Tian, Kevin
  0 siblings, 2 replies; 52+ messages in thread
From: Jason Wang @ 2018-04-27  6:07 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Michael S . Tsirkin, Alex Williamson, Jintack Lim, Tian, Kevin



On 2018-04-25 12:51, Peter Xu wrote:
> For each VTDAddressSpace, now we maintain what IOVA ranges we have
> mapped and what we have not.  With that information, now we only send
> MAP or UNMAP when necessary.  Say, we don't send MAP notifies if we know
> we have already mapped the range, meanwhile we don't send UNMAP notifies
> if we know we never mapped the range at all.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   include/hw/i386/intel_iommu.h |  2 ++
>   hw/i386/intel_iommu.c         | 28 ++++++++++++++++++++++++++++
>   hw/i386/trace-events          |  2 ++
>   3 files changed, 32 insertions(+)
>
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 486e205e79..09a2e94404 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -27,6 +27,7 @@
>   #include "hw/i386/ioapic.h"
>   #include "hw/pci/msi.h"
>   #include "hw/sysbus.h"
> +#include "qemu/interval-tree.h"
>   
>   #define TYPE_INTEL_IOMMU_DEVICE "intel-iommu"
>   #define INTEL_IOMMU_DEVICE(obj) \
> @@ -95,6 +96,7 @@ struct VTDAddressSpace {
>       QLIST_ENTRY(VTDAddressSpace) next;
>       /* Superset of notifier flags that this address space has */
>       IOMMUNotifierFlag notifier_flags;
> +    ITTree *iova_tree;          /* Traces mapped IOVA ranges */
>   };
>   
>   struct VTDBus {
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index a19c18b8d4..8f396a5d13 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -768,12 +768,37 @@ typedef struct {
>   static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
>                                vtd_page_walk_info *info)
>   {
> +    VTDAddressSpace *as = info->as;
>       vtd_page_walk_hook hook_fn = info->hook_fn;
>       void *private = info->private;
> +    ITRange *mapped = it_tree_find(as->iova_tree, entry->iova,
> +                                   entry->iova + entry->addr_mask);
>   
>       assert(hook_fn);
> +
> +    /* Update local IOVA mapped ranges */
> +    if (entry->perm) {
> +        if (mapped) {
> +            /* Skip since we have already mapped this range */
> +            trace_vtd_page_walk_one_skip_map(entry->iova, entry->addr_mask,
> +                                             mapped->start, mapped->end);
> +            return 0;
> +        }
> +        it_tree_insert(as->iova_tree, entry->iova,
> +                       entry->iova + entry->addr_mask);

I was considering a case, e.g.:

1) map A (iova) to B (pa)
2) invalidate A
3) map A (iova) to C (pa)
4) invalidate A

In this case, we will probably miss a walk here. But I'm not sure whether
it is allowed by the spec (though I think so).

Thanks

> +    } else {
> +        if (!mapped) {
> +            /* Skip since we didn't map this range at all */
> +            trace_vtd_page_walk_one_skip_unmap(entry->iova, entry->addr_mask);
> +            return 0;
> +        }
> +        it_tree_remove(as->iova_tree, entry->iova,
> +                       entry->iova + entry->addr_mask);
> +    }
> +
>       trace_vtd_page_walk_one(level, entry->iova, entry->translated_addr,
>                               entry->addr_mask, entry->perm);
> +
>       return hook_fn(entry, private);
>   }
>   
> @@ -2798,6 +2823,7 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>           vtd_dev_as->devfn = (uint8_t)devfn;
>           vtd_dev_as->iommu_state = s;
>           vtd_dev_as->context_cache_entry.context_cache_gen = 0;
> +        vtd_dev_as->iova_tree = it_tree_new();
>   
>           /*
>            * Memory region relationships looks like (Address range shows
> @@ -2894,6 +2920,8 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>                                VTD_PCI_FUNC(as->devfn),
>                                entry.iova, size);
>   
> +    it_tree_remove(as->iova_tree, entry.iova, entry.iova + entry.addr_mask);
> +
>       memory_region_notify_one(n, &entry);
>   }
>   
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index 22d44648af..677f83420d 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -40,6 +40,8 @@ vtd_replay_ce_valid(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t domain, uint6
>   vtd_replay_ce_invalid(uint8_t bus, uint8_t dev, uint8_t fn) "replay invalid context device %02"PRIx8":%02"PRIx8".%02"PRIx8
>   vtd_page_walk_level(uint64_t addr, uint32_t level, uint64_t start, uint64_t end) "walk (base=0x%"PRIx64", level=%"PRIu32") iova range 0x%"PRIx64" - 0x%"PRIx64
>   vtd_page_walk_one(uint32_t level, uint64_t iova, uint64_t gpa, uint64_t mask, int perm) "detected page level 0x%"PRIx32" iova 0x%"PRIx64" -> gpa 0x%"PRIx64" mask 0x%"PRIx64" perm %d"
> +vtd_page_walk_one_skip_map(uint64_t iova, uint64_t mask, uint64_t start, uint64_t end) "iova 0x%"PRIx64" mask 0x%"PRIx64" start 0x%"PRIx64" end 0x%"PRIx64
> +vtd_page_walk_one_skip_unmap(uint64_t iova, uint64_t mask) "iova 0x%"PRIx64" mask 0x%"PRIx64
>   vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
>   vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
>   vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 03/10] intel-iommu: add iommu lock
  2018-04-27  5:13   ` Jason Wang
@ 2018-04-27  6:26     ` Peter Xu
  2018-04-27  7:19       ` Tian, Kevin
  2018-04-28  1:43       ` Jason Wang
  0 siblings, 2 replies; 52+ messages in thread
From: Peter Xu @ 2018-04-27  6:26 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	David Gibson

On Fri, Apr 27, 2018 at 01:13:02PM +0800, Jason Wang wrote:
> 
> 
> On 2018-04-25 12:51, Peter Xu wrote:
> > Add a per-iommu big lock to protect IOMMU status.  Currently the only
> > thing to be protected is the IOTLB cache, since that can be accessed
> > even without BQL, e.g., in IO dataplane.
> > 
> > Note that device page tables should not need any protection.  The safety
> > of that should be provided by guest OS.  E.g., when a page entry is
> > freed, the guest OS should be responsible to make sure that no device
> > will be using that page any more.
> > 
> > Reported-by: Fam Zheng<famz@redhat.com>
> > Signed-off-by: Peter Xu<peterx@redhat.com>
> > ---
> >   include/hw/i386/intel_iommu.h |  8 ++++++++
> >   hw/i386/intel_iommu.c         | 31 +++++++++++++++++++++++++++++--
> >   2 files changed, 37 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > index 220697253f..1a8ba8e415 100644
> > --- a/include/hw/i386/intel_iommu.h
> > +++ b/include/hw/i386/intel_iommu.h
> > @@ -262,6 +262,14 @@ struct IntelIOMMUState {
> >       uint8_t w1cmask[DMAR_REG_SIZE]; /* RW1C(Write 1 to Clear) bytes */
> >       uint8_t womask[DMAR_REG_SIZE];  /* WO (write only - read returns 0) */
> >       uint32_t version;
> > +    /*
> > +     * Protects IOMMU states in general.  Normally we don't need to
> > +     * take this lock when we are with BQL held.  However we have code
> > +     * paths that may run even without BQL.  In those cases, we need
> > +     * to take the lock when we have access to IOMMU state
> > +     * informations, e.g., the IOTLB.
> > +     */
> > +    QemuMutex iommu_lock;
> 
> Some questions:
> 
> 1) Do we need to protect context cache too?

IMHO the context cache entry should work even without the lock.  That's
a bit tricky since we have two cases where this cache will be updated:

  (1) first translation of the address space of a device
  (2) invalidation of context entries

For (2) IMHO we don't need to worry, since the guest OS should be
controlling that part; say, the device should not be doing any
translation (DMA operations) when the context entry is invalidated.

For (1) the worst case is that the context entry cache is updated
multiple times with the same value by multiple threads.  IMHO that'll
be fine too.

But yes for sure we can protect that too with the iommu lock.
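
For instance, a minimal sketch of that (the exact update site is
illustrative; cc_entry stands for the per-device
vtd_as->context_cache_entry):

  qemu_mutex_lock(&s->iommu_lock);
  cc_entry->context_cache_gen = s->context_cache_gen;
  cc_entry->context_entry = ce;
  qemu_mutex_unlock(&s->iommu_lock);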

> 2) Can we just reuse qemu BQL here?

I would prefer not.  As I mentioned, I have spent too much time
fighting the BQL already.  I really hope we can start to use isolated
locks where possible.  The BQL is always the worst choice to me.

> 3) I think the issue is common to all other kinds of IOMMU, so can we simply
> synchronize before calling ->translate() in memory.c. This seems a more
> common solution.

I suspect Power and s390 live well with that.  I think it means at
least these platforms won't have concurrency problems.  I'm adding
DavidG to the loop in case there are further comments.  IMHO we should
just make sure the IOMMU code is thread safe, and fix problems as they
appear.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] util: implement simple interval tree logic
  2018-04-27  5:53   ` Jason Wang
@ 2018-04-27  6:27     ` Peter Xu
  2018-05-03  7:10     ` Peter Xu
  1 sibling, 0 replies; 52+ messages in thread
From: Peter Xu @ 2018-04-27  6:27 UTC (permalink / raw)
  To: Jason Wang; +Cc: qemu-devel, Michael S . Tsirkin, Alex Williamson, Jintack Lim

On Fri, Apr 27, 2018 at 01:53:35PM +0800, Jason Wang wrote:

[...]

> > +    /* Merge right adjacent range */
> > +    overlap = it_tree_find_value(tree, end + 1);
> > +    if (overlap) {
> > +        range->end = overlap->end;
> > +        g_tree_remove(gtree, overlap);
> > +    }
> > +
> > +    /* Key and value are sharing the same range data */
> > +    it_tree_insert_internal(gtree, range);
> 
> A small optimization here is to delay the allocation until you're sure
> there's not overlapping.

Yes I can.

[...]

> > +int it_tree_remove(ITTree *tree, ITValue start, ITValue end)
> > +{
> > +    ITRange range = { .start = start, .end = end }, *overlap, and;
> > +    GTree *gtree;
> > +
> > +    g_assert(tree);
> > +
> > +    gtree = tree->tree;
> > +    while ((overlap = g_tree_lookup(gtree, &range))) {
> > +        if (it_range_equal(overlap, &range)) {
> > +            /* Exactly what we want to remove; done */
> > +            g_tree_remove(gtree, overlap);
> > +            break;
> > +        } else if (it_range_cover(overlap, &range)) {
> > +            /* Split existing range into two; done */
> > +            it_tree_remove_subset(gtree, overlap, &range);
> > +            break;
> > +        } else if (it_range_cover(&range, overlap)) {
> > +            /* Drop this range and continue */
> > +            g_tree_remove(gtree, overlap);
> > +        } else {
> > +            /*
> > +             * The range to remove has intersection with existing
> > +             * ranges.  Remove part of the range and continue
> > +             */
> > +            it_range_and(&and, overlap, &range);
> > +            g_assert(and.start <= and.end);
> > +            it_tree_remove_subset(gtree, overlap, &and);
> > +        }
> > +    }
> > +
> 
> Looks like what we need here is just calculate the intersection part the
> remove it.

Yes.  Will fix that.  Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 08/10] intel-iommu: maintain per-device iova ranges
  2018-04-27  6:07   ` Jason Wang
@ 2018-04-27  6:34     ` Peter Xu
  2018-04-27  7:02     ` Tian, Kevin
  1 sibling, 0 replies; 52+ messages in thread
From: Peter Xu @ 2018-04-27  6:34 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	Tian, Kevin

On Fri, Apr 27, 2018 at 02:07:46PM +0800, Jason Wang wrote:
> 
> 
> On 2018-04-25 12:51, Peter Xu wrote:
> > For each VTDAddressSpace, now we maintain what IOVA ranges we have
> > mapped and what we have not.  With that information, now we only send
> > MAP or UNMAP when necessary.  Say, we don't send MAP notifies if we know
> > we have already mapped the range, meanwhile we don't send UNMAP notifies
> > if we know we never mapped the range at all.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >   include/hw/i386/intel_iommu.h |  2 ++
> >   hw/i386/intel_iommu.c         | 28 ++++++++++++++++++++++++++++
> >   hw/i386/trace-events          |  2 ++
> >   3 files changed, 32 insertions(+)
> > 
> > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > index 486e205e79..09a2e94404 100644
> > --- a/include/hw/i386/intel_iommu.h
> > +++ b/include/hw/i386/intel_iommu.h
> > @@ -27,6 +27,7 @@
> >   #include "hw/i386/ioapic.h"
> >   #include "hw/pci/msi.h"
> >   #include "hw/sysbus.h"
> > +#include "qemu/interval-tree.h"
> >   #define TYPE_INTEL_IOMMU_DEVICE "intel-iommu"
> >   #define INTEL_IOMMU_DEVICE(obj) \
> > @@ -95,6 +96,7 @@ struct VTDAddressSpace {
> >       QLIST_ENTRY(VTDAddressSpace) next;
> >       /* Superset of notifier flags that this address space has */
> >       IOMMUNotifierFlag notifier_flags;
> > +    ITTree *iova_tree;          /* Traces mapped IOVA ranges */
> >   };
> >   struct VTDBus {
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index a19c18b8d4..8f396a5d13 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -768,12 +768,37 @@ typedef struct {
> >   static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
> >                                vtd_page_walk_info *info)
> >   {
> > +    VTDAddressSpace *as = info->as;
> >       vtd_page_walk_hook hook_fn = info->hook_fn;
> >       void *private = info->private;
> > +    ITRange *mapped = it_tree_find(as->iova_tree, entry->iova,
> > +                                   entry->iova + entry->addr_mask);
> >       assert(hook_fn);
> > +
> > +    /* Update local IOVA mapped ranges */
> > +    if (entry->perm) {
> > +        if (mapped) {
> > +            /* Skip since we have already mapped this range */
> > +            trace_vtd_page_walk_one_skip_map(entry->iova, entry->addr_mask,
> > +                                             mapped->start, mapped->end);
> > +            return 0;
> > +        }
> > +        it_tree_insert(as->iova_tree, entry->iova,
> > +                       entry->iova + entry->addr_mask);
> 
> I was considering a case, e.g.:
> 
> 1) map A (iova) to B (pa)
> 2) invalidate A

Here, to be more explicit, you mean the guest sends a PSI, not a real
invalidation of the mapping.

> 3) map A (iova) to C (pa)
> 4) invalidate A

Here too.

> 
> In this case, we will probably miss a walk here. But I'm not sure whether
> it is allowed by the spec (though I think so).

IMHO IOMMU page tables should not be modified by the guest directly.
They can be mapped and unmapped, but entries should not be modified in
place.  I suppose that's why the Linux IOMMU API only provides
iommu_[un]map(), etc., and no iommu_modify().  I don't know whether
there is anything about this in the spec, but AFAIU it can cause a
coherence issue on the device side: after step (1) the device may
already know the mapping, so modifying that mapping without an UNMAP
on the device side will cause undefined behavior.
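
For reference, the shape of the Linux API in question (signatures from
memory, treat as approximate):

  int iommu_map(struct iommu_domain *domain, unsigned long iova,
                phys_addr_t paddr, size_t size, int prot);
  size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova,
                     size_t size);
  /* no iommu_modify(): changing a live mapping requires an unmap/map
   * pair, so the device side always observes an UNMAP first */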

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 08/10] intel-iommu: maintain per-device iova ranges
  2018-04-27  6:07   ` Jason Wang
  2018-04-27  6:34     ` Peter Xu
@ 2018-04-27  7:02     ` Tian, Kevin
  2018-04-27  7:28       ` Peter Xu
  1 sibling, 1 reply; 52+ messages in thread
From: Tian, Kevin @ 2018-04-27  7:02 UTC (permalink / raw)
  To: Jason Wang, Peter Xu, qemu-devel
  Cc: Michael S . Tsirkin, Alex Williamson, Jintack Lim

> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Friday, April 27, 2018 2:08 PM
> 
> On 2018-04-25 12:51, Peter Xu wrote:
> > For each VTDAddressSpace, now we maintain what IOVA ranges we have
> > mapped and what we have not.  With that information, now we only
> send
> > MAP or UNMAP when necessary.  Say, we don't send MAP notifies if we
> know
> > we have already mapped the range, meanwhile we don't send UNMAP
> notifies
> > if we know we never mapped the range at all.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >   include/hw/i386/intel_iommu.h |  2 ++
> >   hw/i386/intel_iommu.c         | 28 ++++++++++++++++++++++++++++
> >   hw/i386/trace-events          |  2 ++
> >   3 files changed, 32 insertions(+)
> >
> > diff --git a/include/hw/i386/intel_iommu.h
> b/include/hw/i386/intel_iommu.h
> > index 486e205e79..09a2e94404 100644
> > --- a/include/hw/i386/intel_iommu.h
> > +++ b/include/hw/i386/intel_iommu.h
> > @@ -27,6 +27,7 @@
> >   #include "hw/i386/ioapic.h"
> >   #include "hw/pci/msi.h"
> >   #include "hw/sysbus.h"
> > +#include "qemu/interval-tree.h"
> >
> >   #define TYPE_INTEL_IOMMU_DEVICE "intel-iommu"
> >   #define INTEL_IOMMU_DEVICE(obj) \
> > @@ -95,6 +96,7 @@ struct VTDAddressSpace {
> >       QLIST_ENTRY(VTDAddressSpace) next;
> >       /* Superset of notifier flags that this address space has */
> >       IOMMUNotifierFlag notifier_flags;
> > +    ITTree *iova_tree;          /* Traces mapped IOVA ranges */
> >   };
> >
> >   struct VTDBus {
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index a19c18b8d4..8f396a5d13 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -768,12 +768,37 @@ typedef struct {
> >   static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
> >                                vtd_page_walk_info *info)
> >   {
> > +    VTDAddressSpace *as = info->as;
> >       vtd_page_walk_hook hook_fn = info->hook_fn;
> >       void *private = info->private;
> > +    ITRange *mapped = it_tree_find(as->iova_tree, entry->iova,
> > +                                   entry->iova + entry->addr_mask);
> >
> >       assert(hook_fn);
> > +
> > +    /* Update local IOVA mapped ranges */
> > +    if (entry->perm) {
> > +        if (mapped) {
> > +            /* Skip since we have already mapped this range */
> > +            trace_vtd_page_walk_one_skip_map(entry->iova, entry-
> >addr_mask,
> > +                                             mapped->start, mapped->end);
> > +            return 0;
> > +        }
> > +        it_tree_insert(as->iova_tree, entry->iova,
> > +                       entry->iova + entry->addr_mask);
> 
> I was considering a case, e.g.:
> 
> 1) map A (iova) to B (pa)
> 2) invalidate A
> 3) map A (iova) to C (pa)
> 4) invalidate A
> 
> In this case, we will probably miss a walk here. But I'm not sure whether
> it is allowed by the spec (though I think so).
> 

I thought it was wrong at a glance, but then changed my mind after
thinking it over. As long as the device driver can quiesce the device
so it does not access A (iova) within the above window, the above
sequence has no problem, since any stale mappings (A->B) added before
step 4) won't be used and will get flushed after step 4). In that
regard, the 1st invalidation can actually be skipped:

1) map A (iova) to B (pa)
2) driver programs device to use A
3) driver programs device to not use A
4) map A (iova) to C (pa)
	A->B may be still valid in IOTLB
5) invalidate A
6) driver programs device to use A

Of course the above doesn't make for a sane IOMMU API framework,
just as Peter pointed out. But from a hardware p.o.v. it looks
fine.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 03/10] intel-iommu: add iommu lock
  2018-04-27  6:26     ` Peter Xu
@ 2018-04-27  7:19       ` Tian, Kevin
  2018-04-27  9:53         ` Peter Xu
  2018-04-28  1:43       ` Jason Wang
  1 sibling, 1 reply; 52+ messages in thread
From: Tian, Kevin @ 2018-04-27  7:19 UTC (permalink / raw)
  To: Peter Xu, Jason Wang
  Cc: Jintack Lim, Alex Williamson, David Gibson, qemu-devel,
	Michael S . Tsirkin

> From: Peter Xu
> Sent: Friday, April 27, 2018 2:26 PM
> 
> On Fri, Apr 27, 2018 at 01:13:02PM +0800, Jason Wang wrote:
> >
> >
> > On 2018-04-25 12:51, Peter Xu wrote:
> > > Add a per-iommu big lock to protect IOMMU status.  Currently the only
> > > thing to be protected is the IOTLB cache, since that can be accessed
> > > even without BQL, e.g., in IO dataplane.
> > >
> > > Note that device page tables should not need any protection.  The
> safety
> > > of that should be provided by guest OS.  E.g., when a page entry is
> > > freed, the guest OS should be responsible to make sure that no device
> > > will be using that page any more.

The device page table definitely doesn't require protection, since it
is an in-memory structure managed by the guest. However the reason
above is not accurate - there is no way the guest OS can make sure no
device uses a non-present page entry, otherwise it wouldn't need the
virtual IOMMU to protect itself. There could be bogus/malicious
drivers which may well program the device to attempt exactly that.

> > >
> > > Reported-by: Fam Zheng<famz@redhat.com>
> > > Signed-off-by: Peter Xu<peterx@redhat.com>
> > > ---
> > >   include/hw/i386/intel_iommu.h |  8 ++++++++
> > >   hw/i386/intel_iommu.c         | 31 +++++++++++++++++++++++++++++--
> > >   2 files changed, 37 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/hw/i386/intel_iommu.h
> b/include/hw/i386/intel_iommu.h
> > > index 220697253f..1a8ba8e415 100644
> > > --- a/include/hw/i386/intel_iommu.h
> > > +++ b/include/hw/i386/intel_iommu.h
> > > @@ -262,6 +262,14 @@ struct IntelIOMMUState {
> > >       uint8_t w1cmask[DMAR_REG_SIZE]; /* RW1C(Write 1 to Clear) bytes
> */
> > >       uint8_t womask[DMAR_REG_SIZE];  /* WO (write only - read returns
> 0) */
> > >       uint32_t version;
> > > +    /*
> > > +     * Protects IOMMU states in general.  Normally we don't need to
> > > +     * take this lock when we are with BQL held.  However we have code
> > > +     * paths that may run even without BQL.  In those cases, we need
> > > +     * to take the lock when we have access to IOMMU state
> > > +     * informations, e.g., the IOTLB.

It would be better if you could whitelist those paths here to clarify.

> > > +     */
> > > +    QemuMutex iommu_lock;
> >
> > Some questions:
> >
> > 1) Do we need to protect context cache too?
> 
> IMHO the context cache entry should work even without the lock.  That's
> a bit tricky since we have two cases where this cache will be updated:
> 
>   (1) first translation of the address space of a device
>   (2) invalidation of context entries
> 
> For (2) IMHO we don't need to worry, since the guest OS should be
> controlling that part; say, the device should not be doing any
> translation (DMA operations) when the context entry is invalidated.

Again, the above cannot be assumed.

> 
> For (1) the worst case is that the context entry cache is updated
> multiple times with the same value by multiple threads.  IMHO that'll
> be fine too.
> 
> But yes for sure we can protect that too with the iommu lock.
> 
> > 2) Can we just reuse qemu BQL here?
> 
> I would prefer not.  As I mentioned, I have spent too much time
> fighting the BQL already.  I really hope we can start to use isolated
> locks where possible.  The BQL is always the worst choice to me.
> 
> > 3) I think the issue is common to all other kinds of IOMMU, so can we
> simply
> > synchronize before calling ->translate() in memory.c. This seems a more
> > common solution.
> 
> I suspect Power and s390 live well with that.  I think it means at
> least these platforms won't have concurrency problems.  I'm adding
> DavidG to the loop in case there are further comments.  IMHO we should
> just make sure the IOMMU code is thread safe, and fix problems as they
> appear.
> 
> Thanks,
> 
> --
> Peter Xu


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 08/10] intel-iommu: maintain per-device iova ranges
  2018-04-27  7:02     ` Tian, Kevin
@ 2018-04-27  7:28       ` Peter Xu
  2018-04-27  7:44         ` Tian, Kevin
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2018-04-27  7:28 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Wang, qemu-devel, Michael S . Tsirkin, Alex Williamson,
	Jintack Lim

On Fri, Apr 27, 2018 at 07:02:14AM +0000, Tian, Kevin wrote:
> > From: Jason Wang [mailto:jasowang@redhat.com]
> > Sent: Friday, April 27, 2018 2:08 PM
> > 
> > On 2018年04月25日 12:51, Peter Xu wrote:
> > > For each VTDAddressSpace, now we maintain what IOVA ranges we have
> > > mapped and what we have not.  With that information, now we only
> > send
> > > MAP or UNMAP when necessary.  Say, we don't send MAP notifies if we
> > know
> > > we have already mapped the range, meanwhile we don't send UNMAP
> > notifies
> > > if we know we never mapped the range at all.
> > >
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >   include/hw/i386/intel_iommu.h |  2 ++
> > >   hw/i386/intel_iommu.c         | 28 ++++++++++++++++++++++++++++
> > >   hw/i386/trace-events          |  2 ++
> > >   3 files changed, 32 insertions(+)
> > >
> > > diff --git a/include/hw/i386/intel_iommu.h
> > b/include/hw/i386/intel_iommu.h
> > > index 486e205e79..09a2e94404 100644
> > > --- a/include/hw/i386/intel_iommu.h
> > > +++ b/include/hw/i386/intel_iommu.h
> > > @@ -27,6 +27,7 @@
> > >   #include "hw/i386/ioapic.h"
> > >   #include "hw/pci/msi.h"
> > >   #include "hw/sysbus.h"
> > > +#include "qemu/interval-tree.h"
> > >
> > >   #define TYPE_INTEL_IOMMU_DEVICE "intel-iommu"
> > >   #define INTEL_IOMMU_DEVICE(obj) \
> > > @@ -95,6 +96,7 @@ struct VTDAddressSpace {
> > >       QLIST_ENTRY(VTDAddressSpace) next;
> > >       /* Superset of notifier flags that this address space has */
> > >       IOMMUNotifierFlag notifier_flags;
> > > +    ITTree *iova_tree;          /* Traces mapped IOVA ranges */
> > >   };
> > >
> > >   struct VTDBus {
> > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > index a19c18b8d4..8f396a5d13 100644
> > > --- a/hw/i386/intel_iommu.c
> > > +++ b/hw/i386/intel_iommu.c
> > > @@ -768,12 +768,37 @@ typedef struct {
> > >   static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
> > >                                vtd_page_walk_info *info)
> > >   {
> > > +    VTDAddressSpace *as = info->as;
> > >       vtd_page_walk_hook hook_fn = info->hook_fn;
> > >       void *private = info->private;
> > > +    ITRange *mapped = it_tree_find(as->iova_tree, entry->iova,
> > > +                                   entry->iova + entry->addr_mask);
> > >
> > >       assert(hook_fn);
> > > +
> > > +    /* Update local IOVA mapped ranges */
> > > +    if (entry->perm) {
> > > +        if (mapped) {
> > > +            /* Skip since we have already mapped this range */
> > > +            trace_vtd_page_walk_one_skip_map(entry->iova, entry-
> > >addr_mask,
> > > +                                             mapped->start, mapped->end);
> > > +            return 0;
> > > +        }
> > > +        it_tree_insert(as->iova_tree, entry->iova,
> > > +                       entry->iova + entry->addr_mask);
> > 
> > I was consider a case e.g:
> > 
> > 1) map A (iova) to B (pa)
> > 2) invalidate A
> > 3) map A (iova) to C (pa)
> > 4) invalidate A
> > 
> > In this case, we will probably miss a walk here. But I'm not sure it was
> > allowed by the spec (though I think so).
> > 

Hi, Kevin,

Thanks for joining the discussion.

> 
> I thought it was wrong in a glimpse, but then changed my mind after
> another thinking. As long as device driver can quiescent the device
> to not access A (iova) within above window, then above sequence
> has no problem since any stale mappings (A->B) added before step 4)
> won't be used and then will get flushed after step 4). Regarding to
> that actually the 1st invalidation can be skipped:
> 
> 1) map A (iova) to B (pa)
> 2) driver programs device to use A
> 3) driver programs device to not use A
> 4) map A (iova) to C (pa)
> 	A->B may be still valid in IOTLB
> 5) invalidate A
> 6) driver programs device to use A

Note that IMHO this is a bit different from Jason's example, and it'll
be fine.  Current code should work well with this scenario since the
emulation code will not be aware of the map of A until step (5).  Then
we'll have the correct mapping.
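
To make it concrete, in your sequence the emulation roughly sees only
one event:

  step 5) PSI(A): page walk finds A->C present, the iova tree has no
          record of A -> insert A, notify MAP(A->C)

so the shadow page table ends up with the correct A->C mapping.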

While for Jason's example it's exactly the extra PSI that might cause
stale mappings (though again I think it's still problematic...).
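
With the current code, Jason's sequence would instead roughly look
like:

  PSI #1: walk finds A->B, tree has no A -> insert A, notify MAP(A->B)
  PSI #2: walk finds A->C, tree has A    -> skip as "already mapped",
          so the device keeps the stale shadow entry A->B

which is exactly the missed walk.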

Actually I think I can fix up the code so that even if Jason's case
happens we unmap the range first and then remap it:

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 31e9b52452..2a9584f9d8 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -778,13 +778,21 @@ static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
     /* Update local IOVA mapped ranges */
     if (entry->perm) {
         if (mapped) {
-            /* Skip since we have already mapped this range */
-            trace_vtd_page_walk_one_skip_map(entry->iova, entry->addr_mask,
-                                             mapped->start, mapped->end);
-            return 0;
+            int ret;
+            /* Cache the write permission */
+            IOMMUAccessFlags flags = entry->perm;
+
+            /* UNMAP the old first then remap.  No need to touch IOVA tree */
+            entry->perm = IOMMU_NONE;
+            ret = hook_fn(entry, private);
+            if (ret) {
+                return ret;
+            }
+            entry->perm = flags;
+        } else {
+            it_tree_insert(as->iova_tree, entry->iova,
+                           entry->iova + entry->addr_mask);
         }
-        it_tree_insert(as->iova_tree, entry->iova,
-                       entry->iova + entry->addr_mask);
     } else {
         if (!mapped) {
             /* Skip since we didn't map this range at all */

If we really think it necessary, I can squash this in, though it is
a bit ugly.  But I just want to confirm whether this is something
we want...

Thanks,

-- 
Peter Xu

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 08/10] intel-iommu: maintain per-device iova ranges
  2018-04-27  7:28       ` Peter Xu
@ 2018-04-27  7:44         ` Tian, Kevin
  2018-04-27  9:55           ` Peter Xu
  0 siblings, 1 reply; 52+ messages in thread
From: Tian, Kevin @ 2018-04-27  7:44 UTC (permalink / raw)
  To: Peter Xu
  Cc: Jason Wang, qemu-devel, Michael S . Tsirkin, Alex Williamson,
	Jintack Lim

> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Friday, April 27, 2018 3:28 PM
> 
> On Fri, Apr 27, 2018 at 07:02:14AM +0000, Tian, Kevin wrote:
> > > From: Jason Wang [mailto:jasowang@redhat.com]
> > > Sent: Friday, April 27, 2018 2:08 PM
> > >
> > > On 2018年04月25日 12:51, Peter Xu wrote:
> > > > For each VTDAddressSpace, now we maintain what IOVA ranges we
> have
> > > > mapped and what we have not.  With that information, now we only
> > > send
> > > > MAP or UNMAP when necessary.  Say, we don't send MAP notifies if
> we
> > > know
> > > > we have already mapped the range, meanwhile we don't send
> UNMAP
> > > notifies
> > > > if we know we never mapped the range at all.
> > > >
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > ---
> > > >   include/hw/i386/intel_iommu.h |  2 ++
> > > >   hw/i386/intel_iommu.c         | 28 ++++++++++++++++++++++++++++
> > > >   hw/i386/trace-events          |  2 ++
> > > >   3 files changed, 32 insertions(+)
> > > >
> > > > diff --git a/include/hw/i386/intel_iommu.h
> > > b/include/hw/i386/intel_iommu.h
> > > > index 486e205e79..09a2e94404 100644
> > > > --- a/include/hw/i386/intel_iommu.h
> > > > +++ b/include/hw/i386/intel_iommu.h
> > > > @@ -27,6 +27,7 @@
> > > >   #include "hw/i386/ioapic.h"
> > > >   #include "hw/pci/msi.h"
> > > >   #include "hw/sysbus.h"
> > > > +#include "qemu/interval-tree.h"
> > > >
> > > >   #define TYPE_INTEL_IOMMU_DEVICE "intel-iommu"
> > > >   #define INTEL_IOMMU_DEVICE(obj) \
> > > > @@ -95,6 +96,7 @@ struct VTDAddressSpace {
> > > >       QLIST_ENTRY(VTDAddressSpace) next;
> > > >       /* Superset of notifier flags that this address space has */
> > > >       IOMMUNotifierFlag notifier_flags;
> > > > +    ITTree *iova_tree;          /* Traces mapped IOVA ranges */
> > > >   };
> > > >
> > > >   struct VTDBus {
> > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > index a19c18b8d4..8f396a5d13 100644
> > > > --- a/hw/i386/intel_iommu.c
> > > > +++ b/hw/i386/intel_iommu.c
> > > > @@ -768,12 +768,37 @@ typedef struct {
> > > >   static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
> > > >                                vtd_page_walk_info *info)
> > > >   {
> > > > +    VTDAddressSpace *as = info->as;
> > > >       vtd_page_walk_hook hook_fn = info->hook_fn;
> > > >       void *private = info->private;
> > > > +    ITRange *mapped = it_tree_find(as->iova_tree, entry->iova,
> > > > +                                   entry->iova + entry->addr_mask);
> > > >
> > > >       assert(hook_fn);
> > > > +
> > > > +    /* Update local IOVA mapped ranges */
> > > > +    if (entry->perm) {
> > > > +        if (mapped) {
> > > > +            /* Skip since we have already mapped this range */
> > > > +            trace_vtd_page_walk_one_skip_map(entry->iova, entry-
> > > >addr_mask,
> > > > +                                             mapped->start, mapped->end);
> > > > +            return 0;
> > > > +        }
> > > > +        it_tree_insert(as->iova_tree, entry->iova,
> > > > +                       entry->iova + entry->addr_mask);
> > >
> > > I was consider a case e.g:
> > >
> > > 1) map A (iova) to B (pa)
> > > 2) invalidate A
> > > 3) map A (iova) to C (pa)
> > > 4) invalidate A
> > >
> > > In this case, we will probably miss a walk here. But I'm not sure it was
> > > allowed by the spec (though I think so).
> > >
> 
> Hi, Kevin,
> 
> Thanks for joining the discussion.
> 
> >
> > I thought it was wrong in a glimpse, but then changed my mind after
> > another thinking. As long as device driver can quiescent the device
> > to not access A (iova) within above window, then above sequence
> > has no problem since any stale mappings (A->B) added before step 4)
> > won't be used and then will get flushed after step 4). Regarding to
> > that actually the 1st invalidation can be skipped:
> >
> > 1) map A (iova) to B (pa)
> > 2) driver programs device to use A
> > 3) driver programs device to not use A
> > 4) map A (iova) to C (pa)
> > 	A->B may be still valid in IOTLB
> > 5) invalidate A
> > 6) driver programs device to use A
> 
> Note that IMHO this is a bit different from Jason's example, and it'll
> be fine.  Current code should work well with this scenario since the
> emulation code will not aware of the map A until step (5).  Then we'll
> have the correct mapping.

You are right. We still need the extra PSI, otherwise the 1st mapping
is problematic to use. So back to Jason's example.

> 
> While for Jason's example it's exactly the extra PSI that might cause
> stale mappings (though again I think it's still problematic...).

Problematic on the software side (e.g. that way the IOMMU core relies
on device drivers, which conflicts with the value of using an IOMMU),
but it is OK from the hardware p.o.v. Btw, the extra PSI itself
doesn't cause the stale mapping; instead it is device activity after
that PSI which may cause it.

> 
> Actually I think I can just fix up the code even if Jason's case
> happens by unmapping that first then remap:
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 31e9b52452..2a9584f9d8 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -778,13 +778,21 @@ static int vtd_page_walk_one(IOMMUTLBEntry
> *entry, int level,
>      /* Update local IOVA mapped ranges */
>      if (entry->perm) {
>          if (mapped) {
> -            /* Skip since we have already mapped this range */
> -            trace_vtd_page_walk_one_skip_map(entry->iova, entry-
> >addr_mask,
> -                                             mapped->start, mapped->end);
> -            return 0;
> +            int ret;
> +            /* Cache the write permission */
> +            IOMMUAccessFlags flags = entry->perm;
> +
> +            /* UNMAP the old first then remap.  No need to touch IOVA tree */
> +            entry->perm = IOMMU_NONE;
> +            ret = hook_fn(entry, private);
> +            if (ret) {
> +                return ret;
> +            }
> +            entry->perm = flags;
> +        } else {
> +            it_tree_insert(as->iova_tree, entry->iova,
> +                           entry->iova + entry->addr_mask);
>          }
> -        it_tree_insert(as->iova_tree, entry->iova,
> -                       entry->iova + entry->addr_mask);
>      } else {
>          if (!mapped) {
>              /* Skip since we didn't map this range at all */
> 
> If we really think it necessary, I can squash this in, though this is
> a bit ugly.  But I just want to confirm whether this would be anything
> we want...
> 

I didn't look into your actual code yet. If others think the
above change is OK then it's definitely good, as we conform
to hardware behavior here. Otherwise, if there is a way to
detect such unusual usage and then take some action (say,
kill the VM), that's also fine, since the user then knows he
runs a bad OS which is not supported by QEMU. It's just not
good if such a situation is left unhandled, leading to some
undefined behavior whose reason nobody can know without hard
debugging. :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 03/10] intel-iommu: add iommu lock
  2018-04-27  7:19       ` Tian, Kevin
@ 2018-04-27  9:53         ` Peter Xu
  2018-04-28  1:54           ` Tian, Kevin
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2018-04-27  9:53 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Wang, Jintack Lim, Alex Williamson, David Gibson,
	qemu-devel, Michael S . Tsirkin

On Fri, Apr 27, 2018 at 07:19:25AM +0000, Tian, Kevin wrote:
> > From: Peter Xu
> > Sent: Friday, April 27, 2018 2:26 PM
> > 
> > On Fri, Apr 27, 2018 at 01:13:02PM +0800, Jason Wang wrote:
> > >
> > >
> > > On 2018年04月25日 12:51, Peter Xu wrote:
> > > > Add a per-iommu big lock to protect IOMMU status.  Currently the only
> > > > thing to be protected is the IOTLB cache, since that can be accessed
> > > > even without BQL, e.g., in IO dataplane.
> > > >
> > > > Note that device page tables should not need any protection.  The
> > safety
> > > > of that should be provided by guest OS.  E.g., when a page entry is
> > > > freed, the guest OS should be responsible to make sure that no device
> > > > will be using that page any more.
> 
> device page table definitely doesn't require protection, since it is
> in-memory structure managed by guest. However the reason
> above is not accurate - there is no way that guest OS can make
> sure no device uses non-present page entry, otherwise it doesn't
> require virtual IOMMU to protect itself. There could be bogus/
> malicious drivers which surely may program the device to attempt so.

How about this:

  Note that we don't need to protect device page tables since those
  are fully controlled by the guest kernel.  However there is still
  the possibility that malicious drivers will program the device to
  disobey the rule.  In that case QEMU can't really do anything
  useful; instead the guest itself will be responsible for all the
  consequences.

> 
> > > >
> > > > Reported-by: Fam Zheng<famz@redhat.com>
> > > > Signed-off-by: Peter Xu<peterx@redhat.com>
> > > > ---
> > > >   include/hw/i386/intel_iommu.h |  8 ++++++++
> > > >   hw/i386/intel_iommu.c         | 31 +++++++++++++++++++++++++++++--
> > > >   2 files changed, 37 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/include/hw/i386/intel_iommu.h
> > b/include/hw/i386/intel_iommu.h
> > > > index 220697253f..1a8ba8e415 100644
> > > > --- a/include/hw/i386/intel_iommu.h
> > > > +++ b/include/hw/i386/intel_iommu.h
> > > > @@ -262,6 +262,14 @@ struct IntelIOMMUState {
> > > >       uint8_t w1cmask[DMAR_REG_SIZE]; /* RW1C(Write 1 to Clear) bytes
> > */
> > > >       uint8_t womask[DMAR_REG_SIZE];  /* WO (write only - read returns
> > 0) */
> > > >       uint32_t version;
> > > > +    /*
> > > > +     * Protects IOMMU states in general.  Normally we don't need to
> > > > +     * take this lock when we are with BQL held.  However we have code
> > > > +     * paths that may run even without BQL.  In those cases, we need
> > > > +     * to take the lock when we have access to IOMMU state
> > > > +     * informations, e.g., the IOTLB.
> 
> better if you can whitelist those paths here to clarify.

Sure.  Basically it's the translation path (vtd_iommu_translate).
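
For illustration, a minimal sketch of what the locked lookup on that
path could look like (the helper names and exact fields here are
assumptions for the sketch, not necessarily the code I'll post):

    static bool vtd_fetch_iotlb(IntelIOMMUState *s, uint16_t source_id,
                                hwaddr addr, VTDIOTLBEntry *out)
    {
        VTDIOTLBEntry *it;

        /* vtd_iommu_translate() may run in an IO thread without the
         * BQL, so the IOTLB hash must only be accessed while holding
         * iommu_lock; copy the entry out so the caller never touches
         * the hash table unlocked. */
        qemu_mutex_lock(&s->iommu_lock);
        it = vtd_lookup_iotlb(s, source_id, addr);
        if (it) {
            *out = *it;
        }
        qemu_mutex_unlock(&s->iommu_lock);

        return it != NULL;
    }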

> 
> > > > +     */
> > > > +    QemuMutex iommu_lock;
> > >
> > > Some questions:
> > >
> > > 1) Do we need to protect context cache too?
> > 
> > IMHO the context cache entry should work even without lock.  That's a
> > bit trickly since we have two cases that this cache will be updated:
> > 
> >   (1) first translation of the address space of a device
> >   (2) invalidation of context entries
> > 
> > For (2) IMHO we don't need to worry about since guest OS should be
> > controlling that part, say, device should not be doing any translation
> > (DMA operations) when the context entry is invalidated.
> 
> again above cannot be assumed.

Yeah, but in that case IMHO it's still the same as with the page tables
- we can't really control anything, and the guest itself will be
responsible for any undefined consequences.

Anyway, let me protect that field too in my next version.
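
Something like the following sketch should do (the helper itself is
invented just to show the idea; the field names follow the existing
VTDAddressSpace/IntelIOMMUState layout):

    static void vtd_update_context_cache(IntelIOMMUState *s,
                                         VTDAddressSpace *vtd_as,
                                         VTDContextEntry *ce)
    {
        /* Take the same iommu_lock, so that a concurrent translation
         * in an IO thread never sees a half-updated context entry. */
        qemu_mutex_lock(&s->iommu_lock);
        vtd_as->context_cache_entry.context_entry = *ce;
        vtd_as->context_cache_entry.context_cache_gen = s->context_cache_gen;
        qemu_mutex_unlock(&s->iommu_lock);
    }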

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 08/10] intel-iommu: maintain per-device iova ranges
  2018-04-27  7:44         ` Tian, Kevin
@ 2018-04-27  9:55           ` Peter Xu
  2018-04-27 11:40             ` Peter Xu
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2018-04-27  9:55 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Wang, qemu-devel, Michael S . Tsirkin, Alex Williamson,
	Jintack Lim

On Fri, Apr 27, 2018 at 07:44:07AM +0000, Tian, Kevin wrote:
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Friday, April 27, 2018 3:28 PM
> > 
> > On Fri, Apr 27, 2018 at 07:02:14AM +0000, Tian, Kevin wrote:
> > > > From: Jason Wang [mailto:jasowang@redhat.com]
> > > > Sent: Friday, April 27, 2018 2:08 PM
> > > >
> > > > On 2018年04月25日 12:51, Peter Xu wrote:
> > > > > For each VTDAddressSpace, now we maintain what IOVA ranges we
> > have
> > > > > mapped and what we have not.  With that information, now we only
> > > > send
> > > > > MAP or UNMAP when necessary.  Say, we don't send MAP notifies if
> > we
> > > > know
> > > > > we have already mapped the range, meanwhile we don't send
> > UNMAP
> > > > notifies
> > > > > if we know we never mapped the range at all.
> > > > >
> > > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > > ---
> > > > >   include/hw/i386/intel_iommu.h |  2 ++
> > > > >   hw/i386/intel_iommu.c         | 28 ++++++++++++++++++++++++++++
> > > > >   hw/i386/trace-events          |  2 ++
> > > > >   3 files changed, 32 insertions(+)
> > > > >
> > > > > diff --git a/include/hw/i386/intel_iommu.h
> > > > b/include/hw/i386/intel_iommu.h
> > > > > index 486e205e79..09a2e94404 100644
> > > > > --- a/include/hw/i386/intel_iommu.h
> > > > > +++ b/include/hw/i386/intel_iommu.h
> > > > > @@ -27,6 +27,7 @@
> > > > >   #include "hw/i386/ioapic.h"
> > > > >   #include "hw/pci/msi.h"
> > > > >   #include "hw/sysbus.h"
> > > > > +#include "qemu/interval-tree.h"
> > > > >
> > > > >   #define TYPE_INTEL_IOMMU_DEVICE "intel-iommu"
> > > > >   #define INTEL_IOMMU_DEVICE(obj) \
> > > > > @@ -95,6 +96,7 @@ struct VTDAddressSpace {
> > > > >       QLIST_ENTRY(VTDAddressSpace) next;
> > > > >       /* Superset of notifier flags that this address space has */
> > > > >       IOMMUNotifierFlag notifier_flags;
> > > > > +    ITTree *iova_tree;          /* Traces mapped IOVA ranges */
> > > > >   };
> > > > >
> > > > >   struct VTDBus {
> > > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > > index a19c18b8d4..8f396a5d13 100644
> > > > > --- a/hw/i386/intel_iommu.c
> > > > > +++ b/hw/i386/intel_iommu.c
> > > > > @@ -768,12 +768,37 @@ typedef struct {
> > > > >   static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
> > > > >                                vtd_page_walk_info *info)
> > > > >   {
> > > > > +    VTDAddressSpace *as = info->as;
> > > > >       vtd_page_walk_hook hook_fn = info->hook_fn;
> > > > >       void *private = info->private;
> > > > > +    ITRange *mapped = it_tree_find(as->iova_tree, entry->iova,
> > > > > +                                   entry->iova + entry->addr_mask);
> > > > >
> > > > >       assert(hook_fn);
> > > > > +
> > > > > +    /* Update local IOVA mapped ranges */
> > > > > +    if (entry->perm) {
> > > > > +        if (mapped) {
> > > > > +            /* Skip since we have already mapped this range */
> > > > > +            trace_vtd_page_walk_one_skip_map(entry->iova, entry-
> > > > >addr_mask,
> > > > > +                                             mapped->start, mapped->end);
> > > > > +            return 0;
> > > > > +        }
> > > > > +        it_tree_insert(as->iova_tree, entry->iova,
> > > > > +                       entry->iova + entry->addr_mask);
> > > >
> > > > I was consider a case e.g:
> > > >
> > > > 1) map A (iova) to B (pa)
> > > > 2) invalidate A
> > > > 3) map A (iova) to C (pa)
> > > > 4) invalidate A
> > > >
> > > > In this case, we will probably miss a walk here. But I'm not sure it was
> > > > allowed by the spec (though I think so).
> > > >
> > 
> > Hi, Kevin,
> > 
> > Thanks for joining the discussion.
> > 
> > >
> > > I thought it was wrong in a glimpse, but then changed my mind after
> > > another thinking. As long as device driver can quiescent the device
> > > to not access A (iova) within above window, then above sequence
> > > has no problem since any stale mappings (A->B) added before step 4)
> > > won't be used and then will get flushed after step 4). Regarding to
> > > that actually the 1st invalidation can be skipped:
> > >
> > > 1) map A (iova) to B (pa)
> > > 2) driver programs device to use A
> > > 3) driver programs device to not use A
> > > 4) map A (iova) to C (pa)
> > > 	A->B may be still valid in IOTLB
> > > 5) invalidate A
> > > 6) driver programs device to use A
> > 
> > Note that IMHO this is a bit different from Jason's example, and it'll
> > be fine.  Current code should work well with this scenario since the
> > emulation code will not aware of the map A until step (5).  Then we'll
> > have the correct mapping.
> 
> you are right. we still need the extra PSI otherwise the 1st mapping
> is problematic for use. So back to Jason's example.
> 
> > 
> > While for Jason's example it's exactly the extra PSI that might cause
> > stale mappings (though again I think it's still problematic...).
> 
> problematic in software side (e.g. that way IOMMU core relies on
> device drivers which conflict with the value of using IOMMU) but
> it is OK from hardware p.o.v. btw the extra PSI itself doesn't cause
> stale mapping. Instead it is device activity after that PSI may cause it.
> 
> > 
> > Actually I think I can just fix up the code even if Jason's case
> > happens by unmapping that first then remap:
> > 
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index 31e9b52452..2a9584f9d8 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -778,13 +778,21 @@ static int vtd_page_walk_one(IOMMUTLBEntry
> > *entry, int level,
> >      /* Update local IOVA mapped ranges */
> >      if (entry->perm) {
> >          if (mapped) {
> > -            /* Skip since we have already mapped this range */
> > -            trace_vtd_page_walk_one_skip_map(entry->iova, entry-
> > >addr_mask,
> > -                                             mapped->start, mapped->end);
> > -            return 0;
> > +            int ret;
> > +            /* Cache the write permission */
> > +            IOMMUAccessFlags flags = entry->perm;
> > +
> > +            /* UNMAP the old first then remap.  No need to touch IOVA tree */
> > +            entry->perm = IOMMU_NONE;
> > +            ret = hook_fn(entry, private);
> > +            if (ret) {
> > +                return ret;
> > +            }
> > +            entry->perm = flags;
> > +        } else {
> > +            it_tree_insert(as->iova_tree, entry->iova,
> > +                           entry->iova + entry->addr_mask);
> >          }
> > -        it_tree_insert(as->iova_tree, entry->iova,
> > -                       entry->iova + entry->addr_mask);
> >      } else {
> >          if (!mapped) {
> >              /* Skip since we didn't map this range at all */
> > 
> > If we really think it necessary, I can squash this in, though this is
> > a bit ugly.  But I just want to confirm whether this would be anything
> > we want...
> > 
> 
> I didn’t look into your actual code yet. If others think above
> change is OK then it's definitely good as we conform to hardware
> behavior here. Otherwise if there is a way to detect such unusual 
> usage and then adopt some action (say kill the VM), it's also fine 
> since user knows he runs a bad OS which is not supported by 
> Qemu. It's just not good if such situation is not handled, which 
> leads to some undefined behavior which nobody knows the reason 
> w/o hard debug into. :-)

Yeah, then let me do this. :)

Jason, would you be good with the above change squashed?

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 08/10] intel-iommu: maintain per-device iova ranges
  2018-04-27  9:55           ` Peter Xu
@ 2018-04-27 11:40             ` Peter Xu
  2018-04-27 23:37               ` Tian, Kevin
  2018-04-28  1:49               ` Jason Wang
  0 siblings, 2 replies; 52+ messages in thread
From: Peter Xu @ 2018-04-27 11:40 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Wang, qemu-devel, Michael S . Tsirkin, Alex Williamson,
	Jintack Lim

On Fri, Apr 27, 2018 at 05:55:27PM +0800, Peter Xu wrote:
> On Fri, Apr 27, 2018 at 07:44:07AM +0000, Tian, Kevin wrote:
> > > From: Peter Xu [mailto:peterx@redhat.com]
> > > Sent: Friday, April 27, 2018 3:28 PM
> > > 
> > > On Fri, Apr 27, 2018 at 07:02:14AM +0000, Tian, Kevin wrote:
> > > > > From: Jason Wang [mailto:jasowang@redhat.com]
> > > > > Sent: Friday, April 27, 2018 2:08 PM
> > > > >
> > > > > On 2018年04月25日 12:51, Peter Xu wrote:
> > > > > > For each VTDAddressSpace, now we maintain what IOVA ranges we
> > > have
> > > > > > mapped and what we have not.  With that information, now we only
> > > > > send
> > > > > > MAP or UNMAP when necessary.  Say, we don't send MAP notifies if
> > > we
> > > > > know
> > > > > > we have already mapped the range, meanwhile we don't send
> > > UNMAP
> > > > > notifies
> > > > > > if we know we never mapped the range at all.
> > > > > >
> > > > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > > > ---
> > > > > >   include/hw/i386/intel_iommu.h |  2 ++
> > > > > >   hw/i386/intel_iommu.c         | 28 ++++++++++++++++++++++++++++
> > > > > >   hw/i386/trace-events          |  2 ++
> > > > > >   3 files changed, 32 insertions(+)
> > > > > >
> > > > > > diff --git a/include/hw/i386/intel_iommu.h
> > > > > b/include/hw/i386/intel_iommu.h
> > > > > > index 486e205e79..09a2e94404 100644
> > > > > > --- a/include/hw/i386/intel_iommu.h
> > > > > > +++ b/include/hw/i386/intel_iommu.h
> > > > > > @@ -27,6 +27,7 @@
> > > > > >   #include "hw/i386/ioapic.h"
> > > > > >   #include "hw/pci/msi.h"
> > > > > >   #include "hw/sysbus.h"
> > > > > > +#include "qemu/interval-tree.h"
> > > > > >
> > > > > >   #define TYPE_INTEL_IOMMU_DEVICE "intel-iommu"
> > > > > >   #define INTEL_IOMMU_DEVICE(obj) \
> > > > > > @@ -95,6 +96,7 @@ struct VTDAddressSpace {
> > > > > >       QLIST_ENTRY(VTDAddressSpace) next;
> > > > > >       /* Superset of notifier flags that this address space has */
> > > > > >       IOMMUNotifierFlag notifier_flags;
> > > > > > +    ITTree *iova_tree;          /* Traces mapped IOVA ranges */
> > > > > >   };
> > > > > >
> > > > > >   struct VTDBus {
> > > > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > > > index a19c18b8d4..8f396a5d13 100644
> > > > > > --- a/hw/i386/intel_iommu.c
> > > > > > +++ b/hw/i386/intel_iommu.c
> > > > > > @@ -768,12 +768,37 @@ typedef struct {
> > > > > >   static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
> > > > > >                                vtd_page_walk_info *info)
> > > > > >   {
> > > > > > +    VTDAddressSpace *as = info->as;
> > > > > >       vtd_page_walk_hook hook_fn = info->hook_fn;
> > > > > >       void *private = info->private;
> > > > > > +    ITRange *mapped = it_tree_find(as->iova_tree, entry->iova,
> > > > > > +                                   entry->iova + entry->addr_mask);
> > > > > >
> > > > > >       assert(hook_fn);
> > > > > > +
> > > > > > +    /* Update local IOVA mapped ranges */
> > > > > > +    if (entry->perm) {
> > > > > > +        if (mapped) {
> > > > > > +            /* Skip since we have already mapped this range */
> > > > > > +            trace_vtd_page_walk_one_skip_map(entry->iova, entry-
> > > > > >addr_mask,
> > > > > > +                                             mapped->start, mapped->end);
> > > > > > +            return 0;
> > > > > > +        }
> > > > > > +        it_tree_insert(as->iova_tree, entry->iova,
> > > > > > +                       entry->iova + entry->addr_mask);
> > > > >
> > > > > I was consider a case e.g:
> > > > >
> > > > > 1) map A (iova) to B (pa)
> > > > > 2) invalidate A
> > > > > 3) map A (iova) to C (pa)
> > > > > 4) invalidate A
> > > > >
> > > > > In this case, we will probably miss a walk here. But I'm not sure it was
> > > > > allowed by the spec (though I think so).
> > > > >
> > > 
> > > Hi, Kevin,
> > > 
> > > Thanks for joining the discussion.
> > > 
> > > >
> > > > I thought it was wrong in a glimpse, but then changed my mind after
> > > > another thinking. As long as device driver can quiescent the device
> > > > to not access A (iova) within above window, then above sequence
> > > > has no problem since any stale mappings (A->B) added before step 4)
> > > > won't be used and then will get flushed after step 4). Regarding to
> > > > that actually the 1st invalidation can be skipped:
> > > >
> > > > 1) map A (iova) to B (pa)
> > > > 2) driver programs device to use A
> > > > 3) driver programs device to not use A
> > > > 4) map A (iova) to C (pa)
> > > > 	A->B may be still valid in IOTLB
> > > > 5) invalidate A
> > > > 6) driver programs device to use A
> > > 
> > > Note that IMHO this is a bit different from Jason's example, and it'll
> > > be fine.  Current code should work well with this scenario since the
> > > emulation code will not aware of the map A until step (5).  Then we'll
> > > have the correct mapping.
> > 
> > you are right. we still need the extra PSI otherwise the 1st mapping
> > is problematic for use. So back to Jason's example.
> > 
> > > 
> > > While for Jason's example it's exactly the extra PSI that might cause
> > > stale mappings (though again I think it's still problematic...).
> > 
> > problematic in software side (e.g. that way IOMMU core relies on
> > device drivers which conflict with the value of using IOMMU) but
> > it is OK from hardware p.o.v. btw the extra PSI itself doesn't cause
> > stale mapping. Instead it is device activity after that PSI may cause it.
> > 
> > > 
> > > Actually I think I can just fix up the code even if Jason's case
> > > happens by unmapping that first then remap:
> > > 
> > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > index 31e9b52452..2a9584f9d8 100644
> > > --- a/hw/i386/intel_iommu.c
> > > +++ b/hw/i386/intel_iommu.c
> > > @@ -778,13 +778,21 @@ static int vtd_page_walk_one(IOMMUTLBEntry
> > > *entry, int level,
> > >      /* Update local IOVA mapped ranges */
> > >      if (entry->perm) {
> > >          if (mapped) {
> > > -            /* Skip since we have already mapped this range */
> > > -            trace_vtd_page_walk_one_skip_map(entry->iova, entry-
> > > >addr_mask,
> > > -                                             mapped->start, mapped->end);
> > > -            return 0;
> > > +            int ret;
> > > +            /* Cache the write permission */
> > > +            IOMMUAccessFlags flags = entry->perm;
> > > +
> > > +            /* UNMAP the old first then remap.  No need to touch IOVA tree */
> > > +            entry->perm = IOMMU_NONE;
> > > +            ret = hook_fn(entry, private);
> > > +            if (ret) {
> > > +                return ret;
> > > +            }
> > > +            entry->perm = flags;
> > > +        } else {
> > > +            it_tree_insert(as->iova_tree, entry->iova,
> > > +                           entry->iova + entry->addr_mask);
> > >          }
> > > -        it_tree_insert(as->iova_tree, entry->iova,
> > > -                       entry->iova + entry->addr_mask);
> > >      } else {
> > >          if (!mapped) {
> > >              /* Skip since we didn't map this range at all */
> > > 
> > > If we really think it necessary, I can squash this in, though this is
> > > a bit ugly.  But I just want to confirm whether this would be anything
> > > we want...
> > > 
> > 
> > I didn’t look into your actual code yet. If others think above
> > change is OK then it's definitely good as we conform to hardware
> > behavior here. Otherwise if there is a way to detect such unusual 
> > usage and then adopt some action (say kill the VM), it's also fine 
> > since user knows he runs a bad OS which is not supported by 
> > Qemu. It's just not good if such situation is not handled, which 
> > leads to some undefined behavior which nobody knows the reason 
> > w/o hard debug into. :-)
> 
> Yeah, then let me do this. :)
> 
> Jason, would you be good with above change squashed?

Self NAK on this...

More than half of the whole series tries to solve the sole problem
that we unmapped some pages that were already mapped, which proved to
be wrong.  Now if we squash the change we will do the same wrong
thing, so we'll still have a very small window in which the remapped
page is missing from the device's POV.

Now to solve this I suppose we'll need to cache every translation so
that we can know whether a mapping has changed, and only remap when it
really has changed.  But I'm afraid that can be a large amount of data
for nested guests.  For the most common 4G L2 guest, I think the worst
case (e.g., no huge pages at all, no contiguous pages) is 4G/4K=1M
entries in that tree.
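
(Back-of-the-envelope: 4G / 4K = 2^20 = 1M ranges; assuming very
roughly 50 bytes per tree node - GTree node plus key/value, an
estimate rather than a measured number - that's on the order of 50MB
per address space, before even storing the translated addresses.)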

Is it really worth it to solve this possibly-buggy-guest-OS problem
with such an overhead?  I don't know...

I'm not sure whether it's still acceptable that we put this issue
aside.  We should know that normal OSes will not do this, and if one
does, IMHO that already proves a buggy OS (so even if from the
hardware POV we allow this, from the software POV it is still
problematic).  Then it'll have problems for sure, but only within the
VM itself, and it won't affect other VMs or the host.  That still
sounds reasonable to me so far.

Of course I'm always willing to listen to more advice on this.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 08/10] intel-iommu: maintain per-device iova ranges
  2018-04-27 11:40             ` Peter Xu
@ 2018-04-27 23:37               ` Tian, Kevin
  2018-05-03  6:04                 ` Peter Xu
  2018-04-28  1:49               ` Jason Wang
  1 sibling, 1 reply; 52+ messages in thread
From: Tian, Kevin @ 2018-04-27 23:37 UTC (permalink / raw)
  To: Peter Xu
  Cc: Jason Wang, qemu-devel, Michael S . Tsirkin, Alex Williamson,
	Jintack Lim

> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Friday, April 27, 2018 7:40 PM
> 
> On Fri, Apr 27, 2018 at 05:55:27PM +0800, Peter Xu wrote:
> > On Fri, Apr 27, 2018 at 07:44:07AM +0000, Tian, Kevin wrote:
> > > > From: Peter Xu [mailto:peterx@redhat.com]
> > > > Sent: Friday, April 27, 2018 3:28 PM
> > > >
> > > > On Fri, Apr 27, 2018 at 07:02:14AM +0000, Tian, Kevin wrote:
> > > > > > From: Jason Wang [mailto:jasowang@redhat.com]
> > > > > > Sent: Friday, April 27, 2018 2:08 PM
> > > > > >
> > > > > > On 2018年04月25日 12:51, Peter Xu wrote:
> > > > > > > For each VTDAddressSpace, now we maintain what IOVA ranges
> we
> > > > have
> > > > > > > mapped and what we have not.  With that information, now we
> only
> > > > > > send
> > > > > > > MAP or UNMAP when necessary.  Say, we don't send MAP
> notifies if
> > > > we
> > > > > > know
> > > > > > > we have already mapped the range, meanwhile we don't send
> > > > UNMAP
> > > > > > notifies
> > > > > > > if we know we never mapped the range at all.
> > > > > > >
> > > > > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > > > > ---
> > > > > > >   include/hw/i386/intel_iommu.h |  2 ++
> > > > > > >   hw/i386/intel_iommu.c         | 28
> ++++++++++++++++++++++++++++
> > > > > > >   hw/i386/trace-events          |  2 ++
> > > > > > >   3 files changed, 32 insertions(+)
> > > > > > >
> > > > > > > diff --git a/include/hw/i386/intel_iommu.h
> > > > > > b/include/hw/i386/intel_iommu.h
> > > > > > > index 486e205e79..09a2e94404 100644
> > > > > > > --- a/include/hw/i386/intel_iommu.h
> > > > > > > +++ b/include/hw/i386/intel_iommu.h
> > > > > > > @@ -27,6 +27,7 @@
> > > > > > >   #include "hw/i386/ioapic.h"
> > > > > > >   #include "hw/pci/msi.h"
> > > > > > >   #include "hw/sysbus.h"
> > > > > > > +#include "qemu/interval-tree.h"
> > > > > > >
> > > > > > >   #define TYPE_INTEL_IOMMU_DEVICE "intel-iommu"
> > > > > > >   #define INTEL_IOMMU_DEVICE(obj) \
> > > > > > > @@ -95,6 +96,7 @@ struct VTDAddressSpace {
> > > > > > >       QLIST_ENTRY(VTDAddressSpace) next;
> > > > > > >       /* Superset of notifier flags that this address space has */
> > > > > > >       IOMMUNotifierFlag notifier_flags;
> > > > > > > +    ITTree *iova_tree;          /* Traces mapped IOVA ranges */
> > > > > > >   };
> > > > > > >
> > > > > > >   struct VTDBus {
> > > > > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > > > > index a19c18b8d4..8f396a5d13 100644
> > > > > > > --- a/hw/i386/intel_iommu.c
> > > > > > > +++ b/hw/i386/intel_iommu.c
> > > > > > > @@ -768,12 +768,37 @@ typedef struct {
> > > > > > >   static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
> > > > > > >                                vtd_page_walk_info *info)
> > > > > > >   {
> > > > > > > +    VTDAddressSpace *as = info->as;
> > > > > > >       vtd_page_walk_hook hook_fn = info->hook_fn;
> > > > > > >       void *private = info->private;
> > > > > > > +    ITRange *mapped = it_tree_find(as->iova_tree, entry->iova,
> > > > > > > +                                   entry->iova + entry->addr_mask);
> > > > > > >
> > > > > > >       assert(hook_fn);
> > > > > > > +
> > > > > > > +    /* Update local IOVA mapped ranges */
> > > > > > > +    if (entry->perm) {
> > > > > > > +        if (mapped) {
> > > > > > > +            /* Skip since we have already mapped this range */
> > > > > > > +            trace_vtd_page_walk_one_skip_map(entry->iova, entry-
> > > > > > >addr_mask,
> > > > > > > +                                             mapped->start, mapped->end);
> > > > > > > +            return 0;
> > > > > > > +        }
> > > > > > > +        it_tree_insert(as->iova_tree, entry->iova,
> > > > > > > +                       entry->iova + entry->addr_mask);
> > > > > >
> > > > > > I was consider a case e.g:
> > > > > >
> > > > > > 1) map A (iova) to B (pa)
> > > > > > 2) invalidate A
> > > > > > 3) map A (iova) to C (pa)
> > > > > > 4) invalidate A
> > > > > >
> > > > > > In this case, we will probably miss a walk here. But I'm not sure it
> was
> > > > > > allowed by the spec (though I think so).
> > > > > >
> > > >
> > > > Hi, Kevin,
> > > >
> > > > Thanks for joining the discussion.
> > > >
> > > > >
> > > > > I thought it was wrong in a glimpse, but then changed my mind after
> > > > > another thinking. As long as device driver can quiescent the device
> > > > > to not access A (iova) within above window, then above sequence
> > > > > has no problem since any stale mappings (A->B) added before step 4)
> > > > > won't be used and then will get flushed after step 4). Regarding to
> > > > > that actually the 1st invalidation can be skipped:
> > > > >
> > > > > 1) map A (iova) to B (pa)
> > > > > 2) driver programs device to use A
> > > > > 3) driver programs device to not use A
> > > > > 4) map A (iova) to C (pa)
> > > > > 	A->B may be still valid in IOTLB
> > > > > 5) invalidate A
> > > > > 6) driver programs device to use A
> > > >
> > > > Note that IMHO this is a bit different from Jason's example, and it'll
> > > > be fine.  Current code should work well with this scenario since the
> > > > emulation code will not aware of the map A until step (5).  Then we'll
> > > > have the correct mapping.
> > >
> > > you are right. we still need the extra PSI otherwise the 1st mapping
> > > is problematic for use. So back to Jason's example.
> > >
> > > >
> > > > While for Jason's example it's exactly the extra PSI that might cause
> > > > stale mappings (though again I think it's still problematic...).
> > >
> > > problematic in software side (e.g. that way IOMMU core relies on
> > > device drivers which conflict with the value of using IOMMU) but
> > > it is OK from hardware p.o.v. btw the extra PSI itself doesn't cause
> > > stale mapping. Instead it is device activity after that PSI may cause it.
> > >
> > > >
> > > > Actually I think I can just fix up the code even if Jason's case
> > > > happens by unmapping that first then remap:
> > > >
> > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > index 31e9b52452..2a9584f9d8 100644
> > > > --- a/hw/i386/intel_iommu.c
> > > > +++ b/hw/i386/intel_iommu.c
> > > > @@ -778,13 +778,21 @@ static int
> vtd_page_walk_one(IOMMUTLBEntry
> > > > *entry, int level,
> > > >      /* Update local IOVA mapped ranges */
> > > >      if (entry->perm) {
> > > >          if (mapped) {
> > > > -            /* Skip since we have already mapped this range */
> > > > -            trace_vtd_page_walk_one_skip_map(entry->iova, entry-
> > > > >addr_mask,
> > > > -                                             mapped->start, mapped->end);
> > > > -            return 0;
> > > > +            int ret;
> > > > +            /* Cache the write permission */
> > > > +            IOMMUAccessFlags flags = entry->perm;
> > > > +
> > > > +            /* UNMAP the old first then remap.  No need to touch IOVA
> tree */
> > > > +            entry->perm = IOMMU_NONE;
> > > > +            ret = hook_fn(entry, private);
> > > > +            if (ret) {
> > > > +                return ret;
> > > > +            }
> > > > +            entry->perm = flags;
> > > > +        } else {
> > > > +            it_tree_insert(as->iova_tree, entry->iova,
> > > > +                           entry->iova + entry->addr_mask);
> > > >          }
> > > > -        it_tree_insert(as->iova_tree, entry->iova,
> > > > -                       entry->iova + entry->addr_mask);
> > > >      } else {
> > > >          if (!mapped) {
> > > >              /* Skip since we didn't map this range at all */
> > > >
> > > > If we really think it necessary, I can squash this in, though this is
> > > > a bit ugly.  But I just want to confirm whether this would be anything
> > > > we want...
> > > >
> > >
> > > I didn’t look into your actual code yet. If others think above
> > > change is OK then it's definitely good as we conform to hardware
> > > behavior here. Otherwise if there is a way to detect such unusual
> > > usage and then adopt some action (say kill the VM), it's also fine
> > > since user knows he runs a bad OS which is not supported by
> > > Qemu. It's just not good if such situation is not handled, which
> > > leads to some undefined behavior which nobody knows the reason
> > > w/o hard debug into. :-)
> >
> > Yeah, then let me do this. :)
> >
> > Jason, would you be good with above change squashed?
> 
> Self NAK on this...
> 
> More than half of the whole series tries to solve the solo problem
> that we unmapped some pages that were already mapped, which proved
> to
> be wrong.  Now if we squash the change we will do the same wrong
> thing, so we'll still have a very small window that the remapped page
> be missing from a device's POV.
> 
> Now to solve this I suppose we'll need to cache every translation then
> we can know whether a mapping has changed, and we only remap when it
> really has changed.  But I'm afraid that can be a big amount of data
> for nested guests.  For a most common 4G L2 guest, I think the worst
> case (e.g., no huge page at all, no continuous pages) is 4G/4K=1M
> entries in that tree.

I think one key factor to think about is the effect of PSI. Per
the VT-d spec, all internal caches (IOTLB entries, paging-
structure cache entries, etc.) associated with the specified
address range must be flushed accordingly, i.e. no caching of
stale mappings. Now the per-device iova range tree is a new
type of cache introduced by QEMU's VT-d. It doesn't cache the
actual mapping, but its purpose is to decide whether to notify
VFIO of an updated mapping. In this manner, if we don't
differentiate whether an entry is for a stale mapping, it
looks like the definition of PSI is broken.

Let me ask another question: in reality, how much improvement
can this patch bring? I.e., is it usual to see a guest map an
already-mapped region, or unmap an already-unmapped region?

> 
> Is it really worth it to solve this possibly-buggy-guest-OS problem
> with such a overhead?  I don't know..

If adding the overhead removes the benefit of this patch, then
it's definitely not a good thing.

> 
> I'm not sure whether it's still acceptable that we put this issue
> aside.  We should know that normal OSs should not do this, and if they
> do, IMHO it proves a buggy OS already (so even from hardware POV we
> allow this, from software POV it should still be problematic), then
> it'll have problem for sure, but only within the VM itself, and it
> won't affect other VMs or the host.  That sounds still reasonable to
> me so far.

As said earlier, what I'm worried about is whether there is a
way to detect such a case when your assumption is violated.
Usually emulation can choose not to implement all the features
which are supported on the real device, but it's done in a way
that unsupported features/behaviors can be reported to the
guest (based on the spec definition), thus the guest knows
what to expect from the emulated device...

> 
> Of course I'm always willing to listen to more advice on this.
> 
> Thanks,
> 
> --
> Peter Xu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 03/10] intel-iommu: add iommu lock
  2018-04-27  6:26     ` Peter Xu
  2018-04-27  7:19       ` Tian, Kevin
@ 2018-04-28  1:43       ` Jason Wang
  2018-04-28  2:24         ` Peter Xu
  1 sibling, 1 reply; 52+ messages in thread
From: Jason Wang @ 2018-04-28  1:43 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	David Gibson



On 2018年04月27日 14:26, Peter Xu wrote:
> On Fri, Apr 27, 2018 at 01:13:02PM +0800, Jason Wang wrote:
>>
>> On 2018年04月25日 12:51, Peter Xu wrote:
>>> Add a per-iommu big lock to protect IOMMU status.  Currently the only
>>> thing to be protected is the IOTLB cache, since that can be accessed
>>> even without BQL, e.g., in IO dataplane.
>>>
>>> Note that device page tables should not need any protection.  The safety
>>> of that should be provided by guest OS.  E.g., when a page entry is
>>> freed, the guest OS should be responsible to make sure that no device
>>> will be using that page any more.
>>>
>>> Reported-by: Fam Zheng<famz@redhat.com>
>>> Signed-off-by: Peter Xu<peterx@redhat.com>
>>> ---
>>>    include/hw/i386/intel_iommu.h |  8 ++++++++
>>>    hw/i386/intel_iommu.c         | 31 +++++++++++++++++++++++++++++--
>>>    2 files changed, 37 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
>>> index 220697253f..1a8ba8e415 100644
>>> --- a/include/hw/i386/intel_iommu.h
>>> +++ b/include/hw/i386/intel_iommu.h
>>> @@ -262,6 +262,14 @@ struct IntelIOMMUState {
>>>        uint8_t w1cmask[DMAR_REG_SIZE]; /* RW1C(Write 1 to Clear) bytes */
>>>        uint8_t womask[DMAR_REG_SIZE];  /* WO (write only - read returns 0) */
>>>        uint32_t version;
>>> +    /*
>>> +     * Protects IOMMU states in general.  Normally we don't need to
>>> +     * take this lock when we are with BQL held.  However we have code
>>> +     * paths that may run even without BQL.  In those cases, we need
>>> +     * to take the lock when we have access to IOMMU state
>>> +     * informations, e.g., the IOTLB.
>>> +     */
>>> +    QemuMutex iommu_lock;
>> Some questions:
>>
>> 1) Do we need to protect context cache too?
> IMHO the context cache entry should work even without lock.  That's a
> bit trickly since we have two cases that this cache will be updated:
>
>    (1) first translation of the address space of a device
>    (2) invalidation of context entries
>
> For (2) IMHO we don't need to worry about since guest OS should be
> controlling that part, say, device should not be doing any translation
> (DMA operations) when the context entry is invalidated.
>
> For (1) the worst case is that the context entry cache be updated
> multiple times with the same value by multiple threads.  IMHO that'll
> be fine too.
>
> But yes for sure we can protect that too with the iommu lock.
>
>> 2) Can we just reuse qemu BQL here?
> I would prefer not.  As I mentioned, at least I have spent too much
> time on fighting BQL already.  I really hope we can start to use
> isolated locks when capable.  BQL is always the worst choice to me.

Just a thought: using the BQL may actually greatly simplify the code
(considering we don't plan to remove the BQL now).

>
>> 3) I think the issue is common to all other kinds of IOMMU, so can we simply
>> synchronize before calling ->translate() in memory.c. This seems a more
>> common solution.
> I suspect Power and s390 live well with that.  I think it mean at
> least these platforms won't have problem in concurrency.  I'm adding
> DavidG in loop in case there is further comment.  IMHO we should just
> make sure IOMMU code be thread safe, and we fix problem if there is.
>
> Thanks,
>

Yes, it needs some investigation, but we have other IOMMUs like AMD's,
and we could have a flag to bypass the BQL if an IOMMU can synchronize
by itself.
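
Just to illustrate the idea (the own_locking flag and the wrapper are
hypothetical; nothing like this exists in memory.c today):

    /*
     * Hypothetical new field in IOMMUMemoryRegionClass, meaning
     * "->translate() synchronizes itself":
     *
     *     bool own_locking;
     *
     * Then memory.c could dispatch like this:
     */
    static IOMMUTLBEntry iommu_translate_synced(IOMMUMemoryRegion *mr,
                                                hwaddr addr,
                                                IOMMUAccessFlags flag)
    {
        IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_GET_CLASS(mr);
        IOMMUTLBEntry ret;

        if (imrc->own_locking) {
            /* e.g. intel-iommu with its own per-IOMMU lock */
            ret = imrc->translate(mr, addr, flag);
        } else {
            /* Fall back to the BQL for IOMMUs that do not (yet)
             * synchronize by themselves. */
            bool locked = qemu_mutex_iothread_locked();

            if (!locked) {
                qemu_mutex_lock_iothread();
            }
            ret = imrc->translate(mr, addr, flag);
            if (!locked) {
                qemu_mutex_unlock_iothread();
            }
        }

        return ret;
    }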

Thanks

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 08/10] intel-iommu: maintain per-device iova ranges
  2018-04-27 11:40             ` Peter Xu
  2018-04-27 23:37               ` Tian, Kevin
@ 2018-04-28  1:49               ` Jason Wang
  1 sibling, 0 replies; 52+ messages in thread
From: Jason Wang @ 2018-04-28  1:49 UTC (permalink / raw)
  To: Peter Xu, Tian, Kevin
  Cc: qemu-devel, Michael S . Tsirkin, Alex Williamson, Jintack Lim



On 2018年04月27日 19:40, Peter Xu wrote:
> On Fri, Apr 27, 2018 at 05:55:27PM +0800, Peter Xu wrote:
>> On Fri, Apr 27, 2018 at 07:44:07AM +0000, Tian, Kevin wrote:
>>>> From: Peter Xu [mailto:peterx@redhat.com]
>>>> Sent: Friday, April 27, 2018 3:28 PM
>>>>
>>>> On Fri, Apr 27, 2018 at 07:02:14AM +0000, Tian, Kevin wrote:
>>>>>> From: Jason Wang [mailto:jasowang@redhat.com]
>>>>>> Sent: Friday, April 27, 2018 2:08 PM
>>>>>>
>>>>>> On 2018年04月25日 12:51, Peter Xu wrote:
>>>>>>> For each VTDAddressSpace, now we maintain what IOVA ranges we
>>>> have
>>>>>>> mapped and what we have not.  With that information, now we only
>>>>>> send
>>>>>>> MAP or UNMAP when necessary.  Say, we don't send MAP notifies if
>>>> we
>>>>>> know
>>>>>>> we have already mapped the range, meanwhile we don't send
>>>> UNMAP
>>>>>> notifies
>>>>>>> if we know we never mapped the range at all.
>>>>>>>
>>>>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>>>>> ---
>>>>>>>    include/hw/i386/intel_iommu.h |  2 ++
>>>>>>>    hw/i386/intel_iommu.c         | 28 ++++++++++++++++++++++++++++
>>>>>>>    hw/i386/trace-events          |  2 ++
>>>>>>>    3 files changed, 32 insertions(+)
>>>>>>>
>>>>>>> diff --git a/include/hw/i386/intel_iommu.h
>>>>>> b/include/hw/i386/intel_iommu.h
>>>>>>> index 486e205e79..09a2e94404 100644
>>>>>>> --- a/include/hw/i386/intel_iommu.h
>>>>>>> +++ b/include/hw/i386/intel_iommu.h
>>>>>>> @@ -27,6 +27,7 @@
>>>>>>>    #include "hw/i386/ioapic.h"
>>>>>>>    #include "hw/pci/msi.h"
>>>>>>>    #include "hw/sysbus.h"
>>>>>>> +#include "qemu/interval-tree.h"
>>>>>>>
>>>>>>>    #define TYPE_INTEL_IOMMU_DEVICE "intel-iommu"
>>>>>>>    #define INTEL_IOMMU_DEVICE(obj) \
>>>>>>> @@ -95,6 +96,7 @@ struct VTDAddressSpace {
>>>>>>>        QLIST_ENTRY(VTDAddressSpace) next;
>>>>>>>        /* Superset of notifier flags that this address space has */
>>>>>>>        IOMMUNotifierFlag notifier_flags;
>>>>>>> +    ITTree *iova_tree;          /* Traces mapped IOVA ranges */
>>>>>>>    };
>>>>>>>
>>>>>>>    struct VTDBus {
>>>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>>>>> index a19c18b8d4..8f396a5d13 100644
>>>>>>> --- a/hw/i386/intel_iommu.c
>>>>>>> +++ b/hw/i386/intel_iommu.c
>>>>>>> @@ -768,12 +768,37 @@ typedef struct {
>>>>>>>    static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
>>>>>>>                                 vtd_page_walk_info *info)
>>>>>>>    {
>>>>>>> +    VTDAddressSpace *as = info->as;
>>>>>>>        vtd_page_walk_hook hook_fn = info->hook_fn;
>>>>>>>        void *private = info->private;
>>>>>>> +    ITRange *mapped = it_tree_find(as->iova_tree, entry->iova,
>>>>>>> +                                   entry->iova + entry->addr_mask);
>>>>>>>
>>>>>>>        assert(hook_fn);
>>>>>>> +
>>>>>>> +    /* Update local IOVA mapped ranges */
>>>>>>> +    if (entry->perm) {
>>>>>>> +        if (mapped) {
>>>>>>> +            /* Skip since we have already mapped this range */
>>>>>>> +            trace_vtd_page_walk_one_skip_map(entry->iova, entry-
>>>>>>> addr_mask,
>>>>>>> +                                             mapped->start, mapped->end);
>>>>>>> +            return 0;
>>>>>>> +        }
>>>>>>> +        it_tree_insert(as->iova_tree, entry->iova,
>>>>>>> +                       entry->iova + entry->addr_mask);
>>>>>> I was consider a case e.g:
>>>>>>
>>>>>> 1) map A (iova) to B (pa)
>>>>>> 2) invalidate A
>>>>>> 3) map A (iova) to C (pa)
>>>>>> 4) invalidate A
>>>>>>
>>>>>> In this case, we will probably miss a walk here. But I'm not sure it was
>>>>>> allowed by the spec (though I think so).
>>>>>>
>>>> Hi, Kevin,
>>>>
>>>> Thanks for joining the discussion.
>>>>
>>>>> I thought it was wrong in a glimpse, but then changed my mind after
>>>>> another thinking. As long as device driver can quiescent the device
>>>>> to not access A (iova) within above window, then above sequence
>>>>> has no problem since any stale mappings (A->B) added before step 4)
>>>>> won't be used and then will get flushed after step 4). Regarding to
>>>>> that actually the 1st invalidation can be skipped:
>>>>>
>>>>> 1) map A (iova) to B (pa)
>>>>> 2) driver programs device to use A
>>>>> 3) driver programs device to not use A
>>>>> 4) map A (iova) to C (pa)
>>>>> 	A->B may be still valid in IOTLB
>>>>> 5) invalidate A
>>>>> 6) driver programs device to use A
>>>> Note that IMHO this is a bit different from Jason's example, and it'll
>>>> be fine.  Current code should work well with this scenario since the
>>>> emulation code will not aware of the map A until step (5).  Then we'll
>>>> have the correct mapping.
>>> you are right. we still need the extra PSI otherwise the 1st mapping
>>> is problematic for use. So back to Jason's example.
>>>
>>>> While for Jason's example it's exactly the extra PSI that might cause
>>>> stale mappings (though again I think it's still problematic...).
>>> problematic in software side (e.g. that way IOMMU core relies on
>>> device drivers which conflict with the value of using IOMMU) but
>>> it is OK from hardware p.o.v. btw the extra PSI itself doesn't cause
>>> stale mapping. Instead it is device activity after that PSI may cause it.
>>>
>>>> Actually I think I can just fix up the code even if Jason's case
>>>> happens by unmapping that first then remap:
>>>>
>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>> index 31e9b52452..2a9584f9d8 100644
>>>> --- a/hw/i386/intel_iommu.c
>>>> +++ b/hw/i386/intel_iommu.c
>>>> @@ -778,13 +778,21 @@ static int vtd_page_walk_one(IOMMUTLBEntry
>>>> *entry, int level,
>>>>       /* Update local IOVA mapped ranges */
>>>>       if (entry->perm) {
>>>>           if (mapped) {
>>>> -            /* Skip since we have already mapped this range */
>>>> -            trace_vtd_page_walk_one_skip_map(entry->iova, entry-
>>>>> addr_mask,
>>>> -                                             mapped->start, mapped->end);
>>>> -            return 0;
>>>> +            int ret;
>>>> +            /* Cache the write permission */
>>>> +            IOMMUAccessFlags flags = entry->perm;
>>>> +
>>>> +            /* UNMAP the old first then remap.  No need to touch IOVA tree */
>>>> +            entry->perm = IOMMU_NONE;
>>>> +            ret = hook_fn(entry, private);
>>>> +            if (ret) {
>>>> +                return ret;
>>>> +            }
>>>> +            entry->perm = flags;
>>>> +        } else {
>>>> +            it_tree_insert(as->iova_tree, entry->iova,
>>>> +                           entry->iova + entry->addr_mask);
>>>>           }
>>>> -        it_tree_insert(as->iova_tree, entry->iova,
>>>> -                       entry->iova + entry->addr_mask);
>>>>       } else {
>>>>           if (!mapped) {
>>>>               /* Skip since we didn't map this range at all */
>>>>
>>>> If we really think it necessary, I can squash this in, though this is
>>>> a bit ugly.  But I just want to confirm whether this is something
>>>> we want...
>>>>
>>> I didn't look into your actual code yet. If others think the above
>>> change is OK then it's definitely good, as we conform to hardware
>>> behavior here. Otherwise, if there is a way to detect such unusual
>>> usage and then take some action (say, kill the VM), that's also fine,
>>> since the user knows he runs a bad OS which is not supported by
>>> QEMU. It's just not good if such a situation is left unhandled, which
>>> leads to some undefined behavior whose cause nobody can know
>>> without hard debugging. :-)
>> Yeah, then let me do this. :)
>>
>> Jason, would you be OK with the above change squashed?
> Self NAK on this...
>
> More than half of the whole series tries to solve the sole problem
> that we unmapped some pages that were already mapped, which proved to
> be wrong.  Now if we squash the change we will do the same wrong
> thing, so we'll still have a very small window in which the remapped
> page is missing from the device's POV.
>
> Now to solve this I suppose we'll need to cache every translation, so
> that we can know whether a mapping has changed and only remap when it
> really has changed.  But I'm afraid that can be a big amount of data
> for nested guests.  For a common 4G L2 guest, I think the worst
> case (e.g., no huge pages at all, no contiguous pages) is 4G/4K=1M
> entries in that tree.
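> For scale, a rough back-of-envelope sketch of that worst case (the
> 64-byte per-entry overhead below is only an assumption, not a measured
> figure):
> 
>     /* Worst-case cost of caching every 4K translation of 4G of RAM */
>     #include <stdio.h>
> 
>     int main(void)
>     {
>         unsigned long long ram = 4ULL << 30;       /* 4G L2 guest RAM */
>         unsigned long long entries = ram / 4096;   /* 1M entries */
>         unsigned long long per_entry = 64;         /* assumed node size */
> 
>         printf("%llu entries, ~%llu MB\n",
>                entries, (entries * per_entry) >> 20);
>         return 0;
>     }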
>
> Is it really worth it to solve this possibly-buggy-guest-OS problem
> with such an overhead?  I don't know..
>
> I'm not sure whether it's still acceptable to put this issue
> aside.  We should know that normal OSes should not do this, and if
> one does, IMHO that proves a buggy OS already (so even if from a
> hardware POV we allow this, from a software POV it is still
> problematic); then it'll have problems for sure, but only within the
> VM itself, and it won't affect other VMs or the host.  That still
> sounds reasonable to me so far.
>
> Of course I'm always willing to listen to more advice on this.
>
> Thanks,
>

I think as QEMU, we need to emulate exactly the behavior of the hardware,
otherwise there will be bugs that are too hard to debug. Looking at the
history of the e1000 emulation code, we even try to emulate undocumented
hardware behavior to unbreak some non-popular operating systems.

Thanks

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 03/10] intel-iommu: add iommu lock
  2018-04-27  9:53         ` Peter Xu
@ 2018-04-28  1:54           ` Tian, Kevin
  0 siblings, 0 replies; 52+ messages in thread
From: Tian, Kevin @ 2018-04-28  1:54 UTC (permalink / raw)
  To: Peter Xu
  Cc: Jason Wang, Jintack Lim, Alex Williamson, David Gibson,
	qemu-devel, Michael S . Tsirkin

> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Friday, April 27, 2018 5:54 PM
> 
> On Fri, Apr 27, 2018 at 07:19:25AM +0000, Tian, Kevin wrote:
> > > From: Peter Xu
> > > Sent: Friday, April 27, 2018 2:26 PM
> > >
> > > On Fri, Apr 27, 2018 at 01:13:02PM +0800, Jason Wang wrote:
> > > >
> > > >
> > > > On 2018年04月25日 12:51, Peter Xu wrote:
> > > > > Add a per-iommu big lock to protect IOMMU status.  Currently the
> only
> > > > > thing to be protected is the IOTLB cache, since that can be accessed
> > > > > even without BQL, e.g., in IO dataplane.
> > > > >
> > > > > Note that device page tables should not need any protection.  The
> > > safety
> > > > > of that should be provided by guest OS.  E.g., when a page entry is
> > > > > freed, the guest OS should be responsible to make sure that no
> device
> > > > > will be using that page any more.
> >
> > the device page table definitely doesn't require protection, since it is
> > an in-memory structure managed by the guest. However the reason
> > above is not accurate - there is no way that the guest OS can make
> > sure no device uses a non-present page entry, otherwise it wouldn't
> > need the virtual IOMMU to protect itself. There could be bogus/
> > malicious drivers which surely may program the device to attempt that.
> 
> How about this:
> 
>   Note that we don't need to protect device page tables since those are
>   fully controlled by the guest kernel.  However there is still the
>   possibility that malicious drivers will program the device to
>   disobey the rule.  In that case QEMU can't really do anything
>   useful; instead the guest itself will be responsible for all
>   uncertainties.
> 

yes, OK to me

Thanks
Kevin

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 03/10] intel-iommu: add iommu lock
  2018-04-28  1:43       ` Jason Wang
@ 2018-04-28  2:24         ` Peter Xu
  2018-04-28  2:42           ` Jason Wang
  2018-04-30  7:20           ` Paolo Bonzini
  0 siblings, 2 replies; 52+ messages in thread
From: Peter Xu @ 2018-04-28  2:24 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	David Gibson, Paolo Bonzini, Stefan Hajnoczi, Fam Zheng

On Sat, Apr 28, 2018 at 09:43:54AM +0800, Jason Wang wrote:
> 
> 
> On 2018年04月27日 14:26, Peter Xu wrote:
> > On Fri, Apr 27, 2018 at 01:13:02PM +0800, Jason Wang wrote:
> > > 
> > > On 2018年04月25日 12:51, Peter Xu wrote:
> > > > Add a per-iommu big lock to protect IOMMU status.  Currently the only
> > > > thing to be protected is the IOTLB cache, since that can be accessed
> > > > even without BQL, e.g., in IO dataplane.
> > > > 
> > > > Note that device page tables should not need any protection.  The safety
> > > > of that should be provided by guest OS.  E.g., when a page entry is
> > > > freed, the guest OS should be responsible to make sure that no device
> > > > will be using that page any more.
> > > > 
> > > > Reported-by: Fam Zheng<famz@redhat.com>
> > > > Signed-off-by: Peter Xu<peterx@redhat.com>
> > > > ---
> > > >    include/hw/i386/intel_iommu.h |  8 ++++++++
> > > >    hw/i386/intel_iommu.c         | 31 +++++++++++++++++++++++++++++--
> > > >    2 files changed, 37 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > > index 220697253f..1a8ba8e415 100644
> > > > --- a/include/hw/i386/intel_iommu.h
> > > > +++ b/include/hw/i386/intel_iommu.h
> > > > @@ -262,6 +262,14 @@ struct IntelIOMMUState {
> > > >        uint8_t w1cmask[DMAR_REG_SIZE]; /* RW1C(Write 1 to Clear) bytes */
> > > >        uint8_t womask[DMAR_REG_SIZE];  /* WO (write only - read returns 0) */
> > > >        uint32_t version;
> > > > +    /*
> > > > +     * Protects IOMMU states in general.  Normally we don't need to
> > > > +     * take this lock when we are with BQL held.  However we have code
> > > > +     * paths that may run even without BQL.  In those cases, we need
> > > > +     * to take the lock when we have access to IOMMU state
> > > > +     * informations, e.g., the IOTLB.
> > > > +     */
> > > > +    QemuMutex iommu_lock;
> > > Some questions:
> > > 
> > > 1) Do we need to protect context cache too?
> > IMHO the context cache entry should work even without the lock.  That's
> > a bit tricky since we have two cases where this cache will be updated:
> > 
> >    (1) first translation of the address space of a device
> >    (2) invalidation of context entries
> > 
> > For (2) IMHO we don't need to worry about it, since the guest OS should
> > be controlling that part; say, the device should not be doing any
> > translation (DMA operations) when the context entry is invalidated.
> > 
> > For (1) the worst case is that the context entry cache be updated
> > multiple times with the same value by multiple threads.  IMHO that'll
> > be fine too.
> > 
> > But yes for sure we can protect that too with the iommu lock.
> > 
> > > 2) Can we just reuse qemu BQL here?
> > I would prefer not.  As I mentioned, at least I have spent too much
> > time on fighting BQL already.  I really hope we can start to use
> > isolated locks when capable.  BQL is always the worst choice to me.
> 
> Just a thought, using BQL may greatly simplify the code actually (consider
> we don't plan to remove BQL now).

Frankly speaking I don't understand why using BQL may greatly simplify
the code... :( IMHO the lock here is really not a complicated one.

Note that IMO BQL is mostly helpful when we really want something to
be run sequentially with some other things _already_ protected by BQL.
In this case, all the stuff is inside VT-d code itself (or other
IOMMUs), why bother taking the BQL to make our life harder?

So, even if we want to provide a general lock for the translation
procedure, I would prefer we add a per AddressSpace lock but not BQL.
However that will still need some extra flag showing whether we
need the protection or not.  For example, we may need to explicitly
turn that off for Power and s390.  Would that really be worth it?

So my final preference is still current patch - we solve thread-safety
problems in VT-d and IOMMU code.  Again, we really should make sure
all IOMMUs work with multithreads.
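As a minimal sketch, the fine-grained version is mostly a pair of
helpers around the new mutex, held across IOTLB accesses (the names
below follow the patch loosely and are partly assumed):

    static inline void vtd_iommu_lock(IntelIOMMUState *s)
    {
        qemu_mutex_lock(&s->iommu_lock);
    }

    static inline void vtd_iommu_unlock(IntelIOMMUState *s)
    {
        qemu_mutex_unlock(&s->iommu_lock);
    }

    /* In the translate path: hold the leaf lock across the IOTLB
     * lookup and the use of the returned entry, with or without BQL. */
    vtd_iommu_lock(s);
    iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
    if (iotlb_entry) {
        /* ... fill the IOMMUTLBEntry from the cached result ... */
    }
    vtd_iommu_unlock(s);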

> 
> > 
> > > 3) I think the issue is common to all other kinds of IOMMU, so can we simply
> > > synchronize before calling ->translate() in memory.c. This seems a more
> > > common solution.
> > I suspect Power and s390 live well with that.  I think it means at
> > least these platforms won't have problems with concurrency.  I'm adding
> > DavidG to the loop in case there is further comment.  IMHO we should
> > just make sure the IOMMU code is thread safe, and fix problems if there are any.
> > 
> > Thanks,
> > 
> 
> Yes, it needs some investigation, but we have other IOMMUs like AMD, and we
> could have a flag to bypass BQL if IOMMU can synchronize by itself.

AMD is still only experimental.  If we really want to use it in
production IMHO it'll need more testing and tuning, not only on
thread-safety but on other stuff too.  So again, we can just fix things
when needed.  I still don't see a reason to depend on BQL here.

I'll see what others think about it.

CCing Paolo, Stefan and Fam too.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 03/10] intel-iommu: add iommu lock
  2018-04-28  2:24         ` Peter Xu
@ 2018-04-28  2:42           ` Jason Wang
  2018-04-28  3:06             ` Peter Xu
  2018-04-28  3:14             ` Peter Xu
  2018-04-30  7:20           ` Paolo Bonzini
  1 sibling, 2 replies; 52+ messages in thread
From: Jason Wang @ 2018-04-28  2:42 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fam Zheng, Michael S . Tsirkin, qemu-devel, Alex Williamson,
	Stefan Hajnoczi, Paolo Bonzini, Jintack Lim, David Gibson



On 2018年04月28日 10:24, Peter Xu wrote:
> On Sat, Apr 28, 2018 at 09:43:54AM +0800, Jason Wang wrote:
>>
>> On 2018年04月27日 14:26, Peter Xu wrote:
>>> On Fri, Apr 27, 2018 at 01:13:02PM +0800, Jason Wang wrote:
>>>> On 2018年04月25日 12:51, Peter Xu wrote:
>>>>> Add a per-iommu big lock to protect IOMMU status.  Currently the only
>>>>> thing to be protected is the IOTLB cache, since that can be accessed
>>>>> even without BQL, e.g., in IO dataplane.
>>>>>
>>>>> Note that device page tables should not need any protection.  The safety
>>>>> of that should be provided by guest OS.  E.g., when a page entry is
>>>>> freed, the guest OS should be responsible to make sure that no device
>>>>> will be using that page any more.
>>>>>
>>>>> Reported-by: Fam Zheng<famz@redhat.com>
>>>>> Signed-off-by: Peter Xu<peterx@redhat.com>
>>>>> ---
>>>>>     include/hw/i386/intel_iommu.h |  8 ++++++++
>>>>>     hw/i386/intel_iommu.c         | 31 +++++++++++++++++++++++++++++--
>>>>>     2 files changed, 37 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
>>>>> index 220697253f..1a8ba8e415 100644
>>>>> --- a/include/hw/i386/intel_iommu.h
>>>>> +++ b/include/hw/i386/intel_iommu.h
>>>>> @@ -262,6 +262,14 @@ struct IntelIOMMUState {
>>>>>         uint8_t w1cmask[DMAR_REG_SIZE]; /* RW1C(Write 1 to Clear) bytes */
>>>>>         uint8_t womask[DMAR_REG_SIZE];  /* WO (write only - read returns 0) */
>>>>>         uint32_t version;
>>>>> +    /*
>>>>> +     * Protects IOMMU states in general.  Normally we don't need to
>>>>> +     * take this lock when we are with BQL held.  However we have code
>>>>> +     * paths that may run even without BQL.  In those cases, we need
>>>>> +     * to take the lock when we have access to IOMMU state
>>>>> +     * informations, e.g., the IOTLB.
>>>>> +     */
>>>>> +    QemuMutex iommu_lock;
>>>> Some questions:
>>>>
>>>> 1) Do we need to protect context cache too?
>>> IMHO the context cache entry should work even without the lock.  That's
>>> a bit tricky since we have two cases where this cache will be updated:
>>>
>>>     (1) first translation of the address space of a device
>>>     (2) invalidation of context entries
>>>
>>> For (2) IMHO we don't need to worry about it, since the guest OS should
>>> be controlling that part; say, the device should not be doing any
>>> translation (DMA operations) when the context entry is invalidated.
>>>
>>> For (1) the worst case is that the context entry cache be updated
>>> multiple times with the same value by multiple threads.  IMHO that'll
>>> be fine too.
>>>
>>> But yes for sure we can protect that too with the iommu lock.
>>>
>>>> 2) Can we just reuse qemu BQL here?
>>> I would prefer not.  As I mentioned, at least I have spent too much
>>> time on fighting BQL already.  I really hope we can start to use
>>> isolated locks when capable.  BQL is always the worst choice to me.
>> Just a thought, using BQL may greatly simplify the code actually (consider
>> we don't plan to remove BQL now).
> Frankly speaking I don't understand why using BQL may greatly simplify
> the code... :( IMHO the lock here is really not a complicated one.
>
> Note that IMO BQL is mostly helpful when we really want something to
> be run sequentially with some other things _already_ protected by BQL.

Except for the translate path from the dataplane, I believe all other
code was already protected by BQL.

> In this case, all the stuff is inside VT-d code itself (or other
> IOMMUs), why bother taking the BQL to make our life harder?

It looks to me it was as simple as:

@@ -494,6 +494,7 @@ static MemoryRegionSection 
flatview_do_translate(FlatView *fv,
      IOMMUMemoryRegionClass *imrc;
      hwaddr page_mask = (hwaddr)(-1);
      hwaddr plen = (hwaddr)(-1);
+    int locked = false;

      if (plen_out) {
          plen = *plen_out;
@@ -510,8 +511,15 @@ static MemoryRegionSection 
flatview_do_translate(FlatView *fv,
          }
          imrc = memory_region_get_iommu_class_nocheck(iommu_mr);

+        if (!qemu_mutex_iothread_locked()) {
+            locked = true;
+            qemu_mutex_lock_iothread();
+        }
          iotlb = imrc->translate(iommu_mr, addr, is_write ?
                                  IOMMU_WO : IOMMU_RO);
+        if (locked) {
+            qemu_mutex_unlock_iothread();
+        }
          addr = ((iotlb.translated_addr & ~iotlb.addr_mask)
                  | (addr & iotlb.addr_mask));
          page_mask &= iotlb.addr_mask;


>
> So, even if we want to provide a general lock for the translation
> procedure, I would prefer we add a per AddressSpace lock but not BQL.

It could be, but it needs more work on each specific IOMMU codes.

> However that will still need some extra flag showing whether we
> need the protection or not.  For example, we may need to explicitly
> turn that off for Power and s390.  Would that really be worth it?

It would cost just several lines of code, anything wrong with this?

>
> So my final preference is still current patch - we solve thread-safety
> problems in VT-d and IOMMU code.  Again, we really should make sure
> all IOMMUs work with multithreads.
>
>>>> 3) I think the issue is common to all other kinds of IOMMU, so can we simply
>>>> synchronize before calling ->translate() in memory.c. This seems a more
>>>> common solution.
> > > I suspect Power and s390 live well with that.  I think it means at
> > > least these platforms won't have problems with concurrency.  I'm adding
> > > DavidG to the loop in case there is further comment.  IMHO we should
> > > just make sure the IOMMU code is thread safe, and fix problems if there are any.
>>>
>>> Thanks,
>>>
>> Yes, it needs some investigation, but we have other IOMMUs like AMD, and we
>> could have a flag to bypass BQL if IOMMU can synchronize by itself.
> AMD is still only experimental.  If we really want to use it in
> production IMHO it'll need more testing and tuning, not only on
> thread-safety but on other stuff too.  So again, we can just fix things
> when needed.  I still don't see a reason to depend on BQL here.

Well, it's not about BQL specifically, it's about whether we have or 
need a generic thread safety solution for all IOMMUs.

We have more IOMMUs than just AMD, s390 and ppc:

# git grep imrc-\>translate\ =
hw/alpha/typhoon.c:    imrc->translate = typhoon_translate_iommu;
hw/dma/rc4030.c:    imrc->translate = rc4030_dma_translate;
hw/i386/amd_iommu.c:    imrc->translate = amdvi_translate;
hw/i386/intel_iommu.c:    imrc->translate = vtd_iommu_translate;
hw/ppc/spapr_iommu.c:    imrc->translate = spapr_tce_translate_iommu;
hw/s390x/s390-pci-bus.c:    imrc->translate = s390_translate_iommu;
hw/sparc/sun4m_iommu.c:    imrc->translate = sun4m_translate_iommu;
hw/sparc64/sun4u_iommu.c:    imrc->translate = sun4u_translate_iommu;

And we know there will be more in the near future.

Thanks

>
> I'll see what others think about it.
>
> CCing Paolo, Stefan and Fam too.
>
> Thanks,
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 03/10] intel-iommu: add iommu lock
  2018-04-28  2:42           ` Jason Wang
@ 2018-04-28  3:06             ` Peter Xu
  2018-04-28  3:11               ` Jason Wang
  2018-04-28  3:14             ` Peter Xu
  1 sibling, 1 reply; 52+ messages in thread
From: Peter Xu @ 2018-04-28  3:06 UTC (permalink / raw)
  To: Jason Wang
  Cc: Fam Zheng, Michael S . Tsirkin, qemu-devel, Alex Williamson,
	Stefan Hajnoczi, Paolo Bonzini, Jintack Lim, David Gibson

On Sat, Apr 28, 2018 at 10:42:11AM +0800, Jason Wang wrote:
> 
> 
> On 2018年04月28日 10:24, Peter Xu wrote:
> > On Sat, Apr 28, 2018 at 09:43:54AM +0800, Jason Wang wrote:
> > > 
> > > On 2018年04月27日 14:26, Peter Xu wrote:
> > > > On Fri, Apr 27, 2018 at 01:13:02PM +0800, Jason Wang wrote:
> > > > > On 2018年04月25日 12:51, Peter Xu wrote:
> > > > > > Add a per-iommu big lock to protect IOMMU status.  Currently the only
> > > > > > thing to be protected is the IOTLB cache, since that can be accessed
> > > > > > even without BQL, e.g., in IO dataplane.
> > > > > > 
> > > > > > Note that device page tables should not need any protection.  The safety
> > > > > > of that should be provided by guest OS.  E.g., when a page entry is
> > > > > > freed, the guest OS should be responsible to make sure that no device
> > > > > > will be using that page any more.
> > > > > > 
> > > > > > Reported-by: Fam Zheng<famz@redhat.com>
> > > > > > Signed-off-by: Peter Xu<peterx@redhat.com>
> > > > > > ---
> > > > > >     include/hw/i386/intel_iommu.h |  8 ++++++++
> > > > > >     hw/i386/intel_iommu.c         | 31 +++++++++++++++++++++++++++++--
> > > > > >     2 files changed, 37 insertions(+), 2 deletions(-)
> > > > > > 
> > > > > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > > > > index 220697253f..1a8ba8e415 100644
> > > > > > --- a/include/hw/i386/intel_iommu.h
> > > > > > +++ b/include/hw/i386/intel_iommu.h
> > > > > > @@ -262,6 +262,14 @@ struct IntelIOMMUState {
> > > > > >         uint8_t w1cmask[DMAR_REG_SIZE]; /* RW1C(Write 1 to Clear) bytes */
> > > > > >         uint8_t womask[DMAR_REG_SIZE];  /* WO (write only - read returns 0) */
> > > > > >         uint32_t version;
> > > > > > +    /*
> > > > > > +     * Protects IOMMU states in general.  Normally we don't need to
> > > > > > +     * take this lock when we are with BQL held.  However we have code
> > > > > > +     * paths that may run even without BQL.  In those cases, we need
> > > > > > +     * to take the lock when we have access to IOMMU state
> > > > > > +     * informations, e.g., the IOTLB.
> > > > > > +     */
> > > > > > +    QemuMutex iommu_lock;
> > > > > Some questions:
> > > > > 
> > > > > 1) Do we need to protect context cache too?
> > > > IMHO the context cache entry should work even without the lock.  That's
> > > > a bit tricky since we have two cases where this cache will be updated:
> > > > 
> > > >     (1) first translation of the address space of a device
> > > >     (2) invalidation of context entries
> > > > 
> > > > For (2) IMHO we don't need to worry about it, since the guest OS should
> > > > be controlling that part; say, the device should not be doing any
> > > > translation (DMA operations) when the context entry is invalidated.
> > > > 
> > > > For (1) the worst case is that the context entry cache be updated
> > > > multiple times with the same value by multiple threads.  IMHO that'll
> > > > be fine too.
> > > > 
> > > > But yes for sure we can protect that too with the iommu lock.
> > > > 
> > > > > 2) Can we just reuse qemu BQL here?
> > > > I would prefer not.  As I mentioned, at least I have spent too much
> > > > time on fighting BQL already.  I really hope we can start to use
> > > > isolated locks when capable.  BQL is always the worst choice to me.
> > > Just a thought, using BQL may greatly simplify the code actually (consider
> > > we don't plan to remove BQL now).
> > Frankly speaking I don't understand why using BQL may greatly simplify
> > the code... :( IMHO the lock here is really not a complicated one.
> > 
> > Note that IMO BQL is mostly helpful when we really want something to
> > be run sequentially with some other things _already_ protected by BQL.
> 
> Except for the translate path from the dataplane, I believe all other
> code was already protected by BQL.
> 
> > In this case, all the stuff is inside VT-d code itself (or other
> > IOMMUs), why bother taking the BQL to make our life harder?
> 
> It looks to me it was as simple as:
> 
> @@ -494,6 +494,7 @@ static MemoryRegionSection
> flatview_do_translate(FlatView *fv,
>      IOMMUMemoryRegionClass *imrc;
>      hwaddr page_mask = (hwaddr)(-1);
>      hwaddr plen = (hwaddr)(-1);
> +    int locked = false;
> 
>      if (plen_out) {
>          plen = *plen_out;
> @@ -510,8 +511,15 @@ static MemoryRegionSection
> flatview_do_translate(FlatView *fv,
>          }
>          imrc = memory_region_get_iommu_class_nocheck(iommu_mr);
> 
> +        if (!qemu_mutex_iothread_locked()) {
> +            locked = true;
> +            qemu_mutex_lock_iothread();
> +        }
>          iotlb = imrc->translate(iommu_mr, addr, is_write ?
>                                  IOMMU_WO : IOMMU_RO);
> +        if (locked) {
> +            qemu_mutex_unlock_iothread();
> +        }
>          addr = ((iotlb.translated_addr & ~iotlb.addr_mask)
>                  | (addr & iotlb.addr_mask));
>          page_mask &= iotlb.addr_mask;

We'll need to add the flags thing too.  How do we flag-out existing
thread-safe IOMMUs?

> 
> 
> > 
> > So, even if we want to provide a general lock for the translation
> > procedure, I would prefer we add a per AddressSpace lock but not BQL.
> 
> It could be, but it needs more work on each specific IOMMU codes.
> 
> > However that will still need some extra flag showing whether we
> > need the protection or not.  For example, we may need to explicitly
> > turn that off for Power and s390.  Would that really be worth it?
> 
> It would cost just several lines of code, anything wrong with this?

It's not about anything wrong; it's just about preference.

I never said BQL won't work here.  It will work.  But if you have
spent tens of hours working on BQL-related problems maybe you'll have
the same preference as me... :)

IMHO the point is to decide which might be simpler and more efficient
in general, really.

> 
> > 
> > So my final preference is still current patch - we solve thread-safety
> > problems in VT-d and IOMMU code.  Again, we really should make sure
> > all IOMMUs work with multithreads.
> > 
> > > > > 3) I think the issue is common to all other kinds of IOMMU, so can we simply
> > > > > synchronize before calling ->translate() in memory.c. This seems a more
> > > > > common solution.
> > > > I suspect Power and s390 live well with that.  I think it means at
> > > > least these platforms won't have problems with concurrency.  I'm adding
> > > > DavidG to the loop in case there is further comment.  IMHO we should
> > > > just make sure the IOMMU code is thread safe, and fix problems if there are any.
> > > > 
> > > > Thanks,
> > > > 
> > > Yes, it needs some investigation, but we have other IOMMUs like AMD, and we
> > > could have a flag to bypass BQL if IOMMU can synchronize by itself.
> > AMD is still only experimental.  If we really want to use it in
> > production IMHO it'll need more testing and tuning, not only on
> > thread-safety but on other stuff too.  So again, we can just fix things
> > when needed.  I still don't see a reason to depend on BQL here.
> 
> Well, it's not about BQL specifically, it's about whether we have or need a
> generic thread safety solution for all IOMMUs.
> 
> We have more IOMMUs than just AMD, s390 and ppc:
> 
> # git grep imrc-\>translate\ =
> hw/alpha/typhoon.c:    imrc->translate = typhoon_translate_iommu;
> hw/dma/rc4030.c:    imrc->translate = rc4030_dma_translate;
> hw/i386/amd_iommu.c:    imrc->translate = amdvi_translate;
> hw/i386/intel_iommu.c:    imrc->translate = vtd_iommu_translate;
> hw/ppc/spapr_iommu.c:    imrc->translate = spapr_tce_translate_iommu;
> hw/s390x/s390-pci-bus.c:    imrc->translate = s390_translate_iommu;
> hw/sparc/sun4m_iommu.c:    imrc->translate = sun4m_translate_iommu;
> hw/sparc64/sun4u_iommu.c:    imrc->translate = sun4u_translate_iommu;
> 
> And we know there will be more in the near future.

Again - here I would suggest we consider thread-safety when implementing
new ones.  I suppose it should not be a hard thing to achieve.

I don't have more new input here since I have given some in previous
posts already.  If this is still under discussion before the next
post, I'll pick this patch out of the series since it is not
related to the other patches at all, so it can be dealt with in isolation.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 03/10] intel-iommu: add iommu lock
  2018-04-28  3:06             ` Peter Xu
@ 2018-04-28  3:11               ` Jason Wang
  0 siblings, 0 replies; 52+ messages in thread
From: Jason Wang @ 2018-04-28  3:11 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fam Zheng, Michael S . Tsirkin, qemu-devel, Alex Williamson,
	Stefan Hajnoczi, Paolo Bonzini, Jintack Lim, David Gibson



On 2018年04月28日 11:06, Peter Xu wrote:
> On Sat, Apr 28, 2018 at 10:42:11AM +0800, Jason Wang wrote:
>>
>> On 2018年04月28日 10:24, Peter Xu wrote:
>>> On Sat, Apr 28, 2018 at 09:43:54AM +0800, Jason Wang wrote:
>>>> On 2018年04月27日 14:26, Peter Xu wrote:
>>>>> On Fri, Apr 27, 2018 at 01:13:02PM +0800, Jason Wang wrote:
>>>>>> On 2018年04月25日 12:51, Peter Xu wrote:
>>>>>>> Add a per-iommu big lock to protect IOMMU status.  Currently the only
>>>>>>> thing to be protected is the IOTLB cache, since that can be accessed
>>>>>>> even without BQL, e.g., in IO dataplane.
>>>>>>>
>>>>>>> Note that device page tables should not need any protection.  The safety
>>>>>>> of that should be provided by guest OS.  E.g., when a page entry is
>>>>>>> freed, the guest OS should be responsible to make sure that no device
>>>>>>> will be using that page any more.
>>>>>>>
>>>>>>> Reported-by: Fam Zheng<famz@redhat.com>
>>>>>>> Signed-off-by: Peter Xu<peterx@redhat.com>
>>>>>>> ---
>>>>>>>      include/hw/i386/intel_iommu.h |  8 ++++++++
>>>>>>>      hw/i386/intel_iommu.c         | 31 +++++++++++++++++++++++++++++--
>>>>>>>      2 files changed, 37 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
>>>>>>> index 220697253f..1a8ba8e415 100644
>>>>>>> --- a/include/hw/i386/intel_iommu.h
>>>>>>> +++ b/include/hw/i386/intel_iommu.h
>>>>>>> @@ -262,6 +262,14 @@ struct IntelIOMMUState {
>>>>>>>          uint8_t w1cmask[DMAR_REG_SIZE]; /* RW1C(Write 1 to Clear) bytes */
>>>>>>>          uint8_t womask[DMAR_REG_SIZE];  /* WO (write only - read returns 0) */
>>>>>>>          uint32_t version;
>>>>>>> +    /*
>>>>>>> +     * Protects IOMMU states in general.  Normally we don't need to
>>>>>>> +     * take this lock when we are with BQL held.  However we have code
>>>>>>> +     * paths that may run even without BQL.  In those cases, we need
>>>>>>> +     * to take the lock when we have access to IOMMU state
>>>>>>> +     * informations, e.g., the IOTLB.
>>>>>>> +     */
>>>>>>> +    QemuMutex iommu_lock;
>>>>>> Some questions:
>>>>>>
>>>>>> 1) Do we need to protect context cache too?
>>>>> IMHO the context cache entry should work even without the lock.  That's
>>>>> a bit tricky since we have two cases where this cache will be updated:
>>>>>
>>>>>      (1) first translation of the address space of a device
>>>>>      (2) invalidation of context entries
>>>>>
>>>>> For (2) IMHO we don't need to worry about it, since the guest OS should
>>>>> be controlling that part; say, the device should not be doing any
>>>>> translation (DMA operations) when the context entry is invalidated.
>>>>>
>>>>> For (1) the worst case is that the context entry cache be updated
>>>>> multiple times with the same value by multiple threads.  IMHO that'll
>>>>> be fine too.
>>>>>
>>>>> But yes for sure we can protect that too with the iommu lock.
>>>>>
>>>>>> 2) Can we just reuse qemu BQL here?
>>>>> I would prefer not.  As I mentioned, at least I have spent too much
>>>>> time on fighting BQL already.  I really hope we can start to use
>>>>> isolated locks when capable.  BQL is always the worst choice to me.
>>>> Just a thought, using BQL may greatly simplify the code actually (consider
>>>> we don't plan to remove BQL now).
>>> Frankly speaking I don't understand why using BQL may greatly simplify
>>> the code... :( IMHO the lock here is really not a complicated one.
>>>
>>> Note that IMO BQL is mostly helpful when we really want something to
>>> be run sequentially with some other things _already_ protected by BQL.
>> Except for the translate path from the dataplane, I believe all other
>> code was already protected by BQL.
>>
>>> In this case, all the stuff is inside VT-d code itself (or other
>>> IOMMUs), why bother taking the BQL to make our life harder?
>> It looks to me it was as simple as:
>>
>> @@ -494,6 +494,7 @@ static MemoryRegionSection
>> flatview_do_translate(FlatView *fv,
>>       IOMMUMemoryRegionClass *imrc;
>>       hwaddr page_mask = (hwaddr)(-1);
>>       hwaddr plen = (hwaddr)(-1);
>> +    int locked = false;
>>
>>       if (plen_out) {
>>           plen = *plen_out;
>> @@ -510,8 +511,15 @@ static MemoryRegionSection
>> flatview_do_translate(FlatView *fv,
>>           }
>>           imrc = memory_region_get_iommu_class_nocheck(iommu_mr);
>>
>> +        if (!qemu_mutex_iothread_locked()) {
>> +            locked = true;
>> +            qemu_mutex_lock_iothread();
>> +        }
>>           iotlb = imrc->translate(iommu_mr, addr, is_write ?
>>                                   IOMMU_WO : IOMMU_RO);
>> +        if (locked) {
>> +            qemu_mutex_unlock_iothread();
>> +        }
>>           addr = ((iotlb.translated_addr & ~iotlb.addr_mask)
>>                   | (addr & iotlb.addr_mask));
>>           page_mask &= iotlb.addr_mask;
> We'll need to add the flags thing too.  How do we flag-out existing
> thread-safe IOMMUs?

We can let thread-safe IOMMU code choose to set a flag somewhere.
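Something as small as this could work (the attrs field is hypothetical,
not an existing memory API; the rest extends the diff above):

    /* Hypothetical: set by IOMMUs whose translate() is thread-safe */
    #define IOMMU_ATTR_THREAD_SAFE  (1 << 0)

    /* flatview_do_translate() would then take BQL only for
     * implementations that did not declare themselves thread-safe: */
    if (!(imrc->attrs & IOMMU_ATTR_THREAD_SAFE) &&
        !qemu_mutex_iothread_locked()) {
        locked = true;
        qemu_mutex_lock_iothread();
    }
    iotlb = imrc->translate(iommu_mr, addr, is_write ? IOMMU_WO : IOMMU_RO);
    if (locked) {
        qemu_mutex_unlock_iothread();
    }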

>
>>
>>> So, even if we want to provide a general lock for the translation
>>> procedure, I would prefer we add a per AddressSpace lock but not BQL.
>> It could be, but it needs more work on each specific IOMMU codes.
>>
>>> However that will still need some extra flag showing whether we
>>> need the protection or not.  For example, we may need to explicitly
>>> turn that off for Power and s390.  Would that really be worth it?
>> It would cost just several lines of code, anything wrong with this?
> It's not about anything wrong; it's just about preference.
>
> I never said BQL won't work here.  It will work.  But if you have
> spent tens of hours working on BQL-related problems maybe you'll have
> the same preference as me... :)
>
> IMHO the point is to decide which might be simpler and more efficient
> in general, really.

So I'm not against your approach. It could be on top of the BQL patch I 
think.

>
>>> So my final preference is still current patch - we solve thread-safety
>>> problems in VT-d and IOMMU code.  Again, we really should make sure
>>> all IOMMUs work with multithreads.
>>>
>>>>>> 3) I think the issue is common to all other kinds of IOMMU, so can we simply
>>>>>> synchronize before calling ->translate() in memory.c. This seems a more
>>>>>> common solution.
>>>>> I suspect Power and s390 live well with that.  I think it means at
>>>>> least these platforms won't have problems with concurrency.  I'm adding
>>>>> DavidG to the loop in case there is further comment.  IMHO we should
>>>>> just make sure the IOMMU code is thread safe, and fix problems if there are any.
>>>>>
>>>>> Thanks,
>>>>>
>>>> Yes, it needs some investigation, but we have other IOMMUs like AMD, and we
>>>> could have a flag to bypass BQL if IOMMU can synchronize by itself.
>>> AMD is still only experimental.  If we really want to use it in
>>> production IMHO it'll need more testing and tuning, not only on
>>> thread-safety but on other stuff too.  So again, we can just fix things
>>> when needed.  I still don't see a reason to depend on BQL here.
>> Well, it's not about BQL specifically, it's about whether we have or need a
>> generic thread safety solution for all IOMMUs.
>>
>> We have more IOMMUs than just AMD, s390 and ppc:
>>
>> # git grep imrc-\>translate\ =
>> hw/alpha/typhoon.c:    imrc->translate = typhoon_translate_iommu;
>> hw/dma/rc4030.c:    imrc->translate = rc4030_dma_translate;
>> hw/i386/amd_iommu.c:    imrc->translate = amdvi_translate;
>> hw/i386/intel_iommu.c:    imrc->translate = vtd_iommu_translate;
>> hw/ppc/spapr_iommu.c:    imrc->translate = spapr_tce_translate_iommu;
>> hw/s390x/s390-pci-bus.c:    imrc->translate = s390_translate_iommu;
>> hw/sparc/sun4m_iommu.c:    imrc->translate = sun4m_translate_iommu;
>> hw/sparc64/sun4u_iommu.c:    imrc->translate = sun4u_translate_iommu;
>>
>> And we know there will be more in the near future.
> Again - here I would suggest we consider thread-safety when implementing
> new ones.  I suppose it should not be a hard thing to achieve.
>
> I don't have more new input here since I have given some in previous
> posts already.  If this is still under discussion before the next
> post, I'll pick this patch out of the series since it is not
> related to the other patches at all, so it can be dealt with in isolation.
>
> Thanks,
>

I fully understand your motivation; I just want to see if we can do
something simple for all the other IOMMUs. I think this series can go
alone without worrying about other IOMMUs, for sure.

Thanks

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 03/10] intel-iommu: add iommu lock
  2018-04-28  2:42           ` Jason Wang
  2018-04-28  3:06             ` Peter Xu
@ 2018-04-28  3:14             ` Peter Xu
  2018-04-28  3:16               ` Jason Wang
  2018-04-30  7:22               ` Paolo Bonzini
  1 sibling, 2 replies; 52+ messages in thread
From: Peter Xu @ 2018-04-28  3:14 UTC (permalink / raw)
  To: Jason Wang
  Cc: Fam Zheng, Michael S . Tsirkin, qemu-devel, Alex Williamson,
	Stefan Hajnoczi, Paolo Bonzini, Jintack Lim, David Gibson

On Sat, Apr 28, 2018 at 10:42:11AM +0800, Jason Wang wrote:

[...]

> Well, it's not about BQL specifically, it's about whether we have or need a
> generic thread safety solution for all IOMMUs.
> 
> We have more IOMMUs than just AMD, s390 and ppc:
> 
> # git grep imrc-\>translate\ =
> hw/alpha/typhoon.c:    imrc->translate = typhoon_translate_iommu;
> hw/dma/rc4030.c:    imrc->translate = rc4030_dma_translate;
> hw/i386/amd_iommu.c:    imrc->translate = amdvi_translate;
> hw/i386/intel_iommu.c:    imrc->translate = vtd_iommu_translate;
> hw/ppc/spapr_iommu.c:    imrc->translate = spapr_tce_translate_iommu;
> hw/s390x/s390-pci-bus.c:    imrc->translate = s390_translate_iommu;
> hw/sparc/sun4m_iommu.c:    imrc->translate = sun4m_translate_iommu;
> hw/sparc64/sun4u_iommu.c:    imrc->translate = sun4u_translate_iommu;

Sorry I didn't notice this one.  This point is valid.  But again, we
need to know whether they are thread-safe already.  For VT-d, it never
hurts to have this patch to fix its own problem, so the above is not a
reason to drop the current patch, since they solve different problems.
Basically I see the solution to the above problem as a separate patch
from the current one.

Meanwhile, even if we want to provide a general protection for that,
again I would prefer not to use BQL but some common, new
iommu-lock.  BQL can be a nightmare sometimes.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 03/10] intel-iommu: add iommu lock
  2018-04-28  3:14             ` Peter Xu
@ 2018-04-28  3:16               ` Jason Wang
  2018-04-30  7:22               ` Paolo Bonzini
  1 sibling, 0 replies; 52+ messages in thread
From: Jason Wang @ 2018-04-28  3:16 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fam Zheng, Michael S . Tsirkin, qemu-devel, Alex Williamson,
	Stefan Hajnoczi, Paolo Bonzini, Jintack Lim, David Gibson



On 2018年04月28日 11:14, Peter Xu wrote:
> On Sat, Apr 28, 2018 at 10:42:11AM +0800, Jason Wang wrote:
>
> [...]
>
>> Well, it's not about BQL specifically, it's about whether we have or need a
>> generic thread safety solution for all IOMMUs.
>>
>> We have more IOMMUs than just AMD, s390 and ppc:
>>
>> # git grep imrc-\>translate\ =
>> hw/alpha/typhoon.c:    imrc->translate = typhoon_translate_iommu;
>> hw/dma/rc4030.c:    imrc->translate = rc4030_dma_translate;
>> hw/i386/amd_iommu.c:    imrc->translate = amdvi_translate;
>> hw/i386/intel_iommu.c:    imrc->translate = vtd_iommu_translate;
>> hw/ppc/spapr_iommu.c:    imrc->translate = spapr_tce_translate_iommu;
>> hw/s390x/s390-pci-bus.c:    imrc->translate = s390_translate_iommu;
>> hw/sparc/sun4m_iommu.c:    imrc->translate = sun4m_translate_iommu;
>> hw/sparc64/sun4u_iommu.c:    imrc->translate = sun4u_translate_iommu;
> Sorry I didn't notice this one.  This point is valid.  But again, we
> need to know whether they are thread-safe already.  For VT-d, it never
> hurts to have this patch to fix its own problem, so the above is not a
> reason to drop the current patch, since they solve different problems.
> Basically I see the solution to the above problem as a separate patch
> from the current one.
>
> Meanwhile, even if we want to provide a general protection for that,
> again I would prefer not to use BQL but some common, new
> iommu-lock.  BQL can be a nightmare sometimes.

Yes, you can use another one if BQL turns out to be troublesome.

Thanks

>
> Thanks,
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 03/10] intel-iommu: add iommu lock
  2018-04-28  2:24         ` Peter Xu
  2018-04-28  2:42           ` Jason Wang
@ 2018-04-30  7:20           ` Paolo Bonzini
  2018-05-03  5:39             ` Peter Xu
  1 sibling, 1 reply; 52+ messages in thread
From: Paolo Bonzini @ 2018-04-30  7:20 UTC (permalink / raw)
  To: Peter Xu, Jason Wang
  Cc: qemu-devel, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	David Gibson, Stefan Hajnoczi, Fam Zheng

On 28/04/2018 04:24, Peter Xu wrote:
>>>> 2) Can we just reuse qemu BQL here?
>>> I would prefer not.  As I mentioned, at least I have spent too much
>>> time on fighting BQL already.  I really hope we can start to use
>>> isolated locks when capable.  BQL is always the worst choice to me.
>> Just a thought, using BQL may greatly simplify the code actually (consider
>> we don't plan to remove BQL now).
> Frankly speaking I don't understand why using BQL may greatly simplify
> the code... :( IMHO the lock here is really not a complicated one.
> 
> Note that IMO BQL is mostly helpful when we really want something to
> be run sequentially with some other things _already_ protected by BQL.
> In this case, all the stuff is inside VT-d code itself (or other
> IOMMUs), why bother taking the BQL to make our life harder?
> 
> So, even if we want to provide a general lock for the translation
> procedure, I would prefer we add a per AddressSpace lock but not BQL.
> However still that will need some extra flag showing that whether we
> need the protection of not.  For example, we may need to expliclitly
> turn that off for Power and s390.  Would that really worth it?
> 
> So my final preference is still current patch - we solve thread-safety
> problems in VT-d and IOMMU code.  Again, we really should make sure
> all IOMMUs work with multithreads.
> 

I agree.  In particular, using BQL is _worse_ because it has very strict
lock ordering requirements.  Using fine-grained locks is greatly
preferred as long as:

1) they are leaves in the lock ordering

2) they are not kept across calls to external callbacks (or there are no
external callbacks involved)

Paolo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 03/10] intel-iommu: add iommu lock
  2018-04-28  3:14             ` Peter Xu
  2018-04-28  3:16               ` Jason Wang
@ 2018-04-30  7:22               ` Paolo Bonzini
  1 sibling, 0 replies; 52+ messages in thread
From: Paolo Bonzini @ 2018-04-30  7:22 UTC (permalink / raw)
  To: Peter Xu, Jason Wang
  Cc: Fam Zheng, Michael S . Tsirkin, qemu-devel, Alex Williamson,
	Stefan Hajnoczi, Jintack Lim, David Gibson

On 28/04/2018 05:14, Peter Xu wrote:
>> # git grep imrc-\>translate\ =
>> hw/alpha/typhoon.c:    imrc->translate = typhoon_translate_iommu;
>> hw/dma/rc4030.c:    imrc->translate = rc4030_dma_translate;
>> hw/i386/amd_iommu.c:    imrc->translate = amdvi_translate;
>> hw/i386/intel_iommu.c:    imrc->translate = vtd_iommu_translate;
>> hw/ppc/spapr_iommu.c:    imrc->translate = spapr_tce_translate_iommu;
>> hw/s390x/s390-pci-bus.c:    imrc->translate = s390_translate_iommu;
>> hw/sparc/sun4m_iommu.c:    imrc->translate = sun4m_translate_iommu;
>> hw/sparc64/sun4u_iommu.c:    imrc->translate = sun4u_translate_iommu;
> Sorry I didn't notice this one.  This point is valid.  But again, we
> need to know whether they are thread-safe already.  For VT-d, it never
> hurts to have this patch to fix its own problem, so the above is not a
> reason to drop the current patch, since they solve different problems.
> Basically I see the solution to the above problem as a separate patch
> from the current one.

If they are not thread-safe, we fix them too.  They are not many.

Paolo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 03/10] intel-iommu: add iommu lock
  2018-04-30  7:20           ` Paolo Bonzini
@ 2018-05-03  5:39             ` Peter Xu
  0 siblings, 0 replies; 52+ messages in thread
From: Peter Xu @ 2018-05-03  5:39 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jason Wang, qemu-devel, Michael S . Tsirkin, Alex Williamson,
	Jintack Lim, David Gibson, Stefan Hajnoczi, Fam Zheng

On Mon, Apr 30, 2018 at 09:20:42AM +0200, Paolo Bonzini wrote:
> On 28/04/2018 04:24, Peter Xu wrote:
> >>>> 2) Can we just reuse qemu BQL here?
> >>> I would prefer not.  As I mentioned, at least I have spent too much
> >>> time on fighting BQL already.  I really hope we can start to use
> >>> isolated locks when capable.  BQL is always the worst choice to me.
> >> Just a thought, using BQL may greatly simplify the code actually (consider
> >> we don't plan to remove BQL now).
> > Frankly speaking I don't understand why using BQL may greatly simplify
> > the code... :( IMHO the lock here is really not a complicated one.
> > 
> > Note that IMO BQL is mostly helpful when we really want something to
> > be run sequentially with some other things _already_ protected by BQL.
> > In this case, all the stuff is inside VT-d code itself (or other
> > IOMMUs), why bother taking the BQL to make our life harder?
> > 
> > So, even if we want to provide a general lock for the translation
> > procedure, I would prefer we add a per AddressSpace lock but not BQL.
> > However still that will need some extra flag showing that whether we
> > need the protection of not.  For example, we may need to expliclitly
> > turn that off for Power and s390.  Would that really worth it?
> > 
> > So my final preference is still current patch - we solve thread-safety
> > problems in VT-d and IOMMU code.  Again, we really should make sure
> > all IOMMUs work with multithreads.
> > 
> 
> I agree.  In particular, using BQL is _worse_ because it has very strict
> lock ordering requirements.  Using fine-grained locks is greatly
> preferred as long as:
> 
> 1) they are leaves in the lock ordering
> 
> 2) they are not kept across calls to external callbacks (or there are no
> external callbacks involved)

Thanks Paolo for the input.

I'll temporarily keep this patch in my next post.  We can further
discuss it there if we have better alternatives.

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 08/10] intel-iommu: maintain per-device iova ranges
  2018-04-27 23:37               ` Tian, Kevin
@ 2018-05-03  6:04                 ` Peter Xu
  2018-05-03  7:20                   ` Jason Wang
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2018-05-03  6:04 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Wang, qemu-devel, Michael S . Tsirkin, Alex Williamson,
	Jintack Lim

On Fri, Apr 27, 2018 at 11:37:24PM +0000, Tian, Kevin wrote:

[...]

> > Self NAK on this...
> > 
> > More than half of the whole series tries to solve the sole problem
> > that we unmapped some pages that were already mapped, which proved
> > to be wrong.  Now if we squash the change we will do the same wrong
> > thing, so we'll still have a very small window in which the remapped
> > page is missing from the device's POV.
> > 
> > Now to solve this I suppose we'll need to cache every translation, so
> > that we can know whether a mapping has changed and only remap when it
> > really has changed.  But I'm afraid that can be a big amount of data
> > for nested guests.  For a common 4G L2 guest, I think the worst
> > case (e.g., no huge pages at all, no contiguous pages) is 4G/4K=1M
> > entries in that tree.
> 
> I think one key factor to think about is the effect of PSI. Per the
> VT-d spec, all internal caches (IOTLB entries, paging-structure
> cache entries, etc.) associated with the specified address range
> must be flushed accordingly, i.e., no caching of stale mappings.
> Now the per-device iova range is a new type of cache introduced
> by QEMU's VT-d. It doesn't cache the actual mapping, but its purpose
> is to decide whether to notify VFIO of an updated mapping. In
> this manner, if we don't differentiate whether an entry is for a
> stale mapping, it looks like the definition of PSI is broken.
> 
> Let me ask another question. In reality how much improvement can this
> patch bring, i.e., is it usual to see a guest map an already
> mapped region, or unmap an already unmapped region?

The funny thing is that there is actually no MAP/UNMAP flag for
PSI/DSI.  For either MAP or UNMAP, the guest sends one PSI for that
range, or even a DSI (Domain Selective Invalidation).

This patch will mostly be helpful not really for PSIs, but for DSIs.
Please have a look at Issue (4) that I mentioned in the cover letter.
This patch is the core part of solving that DMA error problem.

An example is that we can get a DSI for a domain that already has
existing mappings.  In QEMU, we handle a DSI as a whole-range PSI, so
before this patch we would unmap those already-mapped pages then remap
all of them.  However we can't map the same page again, so we cache
what we have mapped here.
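Roughly, the DSI path then reduces to a whole-range replay where the
iova tree does the deduplication (names simplified, a sketch rather
than the exact patch):

    /* Domain invalidation: replay [0, ~0ULL]; vtd_page_walk_one()
     * consults the per-device iova tree and skips ranges that are
     * already mapped, so existing mappings stay untouched. */
    vtd_page_walk(&ce, 0, ~0ULL, &info);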

> 
> > 
> > Is it really worth it to solve this possibly-buggy-guest-OS problem
> > with such a overhead?  I don't know..
> 
> If adding overhead removes the benefit of this patch, then 
> definitely not a good thing.

For the problem that we are going to solve, this patch is not really
about a performance benefit; it fixes a critical bug.  Again, please
refer to issue (4) of the cover letter for that 3ms window problem.

> 
> > 
> > I'm not sure whether it's still acceptable that we put this issue
> > aside.  We should know that normal OSs should not do this, and if they
> > do, IMHO it proves a buggy OS already (so even from hardware POV we
> > allow this, from software POV it should still be problematic), then
> > it'll have problem for sure, but only within the VM itself, and it
> > won't affect other VMs or the host.  That sounds still reasonable to
> > me so far.
> 
> As said earlier, what I'm worried about is whether there is a way to
> detect such a case when your assumption is violated. Usually
> emulation can choose not to implement all the features which
> are supported on the real device, but it's done in a way that
> non-supported features/behaviors can be reported to the guest
> (based on the spec definition), thus the guest knows what to expect
> from the emulated device...

IMHO the guest can't really detect this, but it'll find that the
device is not functioning correctly if it does something like what
Jason has mentioned.

Actually now I have an idea for when we really want to live well even
with Jason's example: maybe we'll need to distinguish PSI from DSI.  For
DSI, we don't remap already-mapped pages; for PSI, we unmap and remap
the mapped pages.  That'll complicate the code a bit, but it should
satisfy everyone.
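A sketch of that idea on top of the vtd_page_walk_one() hunk discussed
earlier in the thread (the is_dsi field would be a new, assumed
addition to vtd_page_walk_info):

    if (entry->perm) {
        if (mapped) {
            if (info->is_dsi) {
                /* DSI replay: the cached range is still valid; skip */
                return 0;
            } else {
                /* PSI: the mapping may have changed; unmap then remap */
                IOMMUAccessFlags flags = entry->perm;
                int ret;

                entry->perm = IOMMU_NONE;
                ret = hook_fn(entry, private);
                if (ret) {
                    return ret;
                }
                entry->perm = flags;
            }
        } else {
            it_tree_insert(as->iova_tree, entry->iova,
                           entry->iova + entry->addr_mask);
        }
    }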

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] util: implement simple interval tree logic
  2018-04-27  5:53   ` Jason Wang
  2018-04-27  6:27     ` Peter Xu
@ 2018-05-03  7:10     ` Peter Xu
  2018-05-03  7:21       ` Jason Wang
  1 sibling, 1 reply; 52+ messages in thread
From: Peter Xu @ 2018-05-03  7:10 UTC (permalink / raw)
  To: Jason Wang; +Cc: qemu-devel, Michael S . Tsirkin, Alex Williamson, Jintack Lim

On Fri, Apr 27, 2018 at 01:53:35PM +0800, Jason Wang wrote:

[...]

> > +int it_tree_remove(ITTree *tree, ITValue start, ITValue end)
> > +{
> > +    ITRange range = { .start = start, .end = end }, *overlap, and;
> > +    GTree *gtree;
> > +
> > +    g_assert(tree);
> > +
> > +    gtree = tree->tree;
> > +    while ((overlap = g_tree_lookup(gtree, &range))) {
> > +        if (it_range_equal(overlap, &range)) {
> > +            /* Exactly what we want to remove; done */

[1]

> > +            g_tree_remove(gtree, overlap);
> > +            break;
> > +        } else if (it_range_cover(overlap, &range)) {

[2]

> > +            /* Split existing range into two; done */
> > +            it_tree_remove_subset(gtree, overlap, &range);
> > +            break;
> > +        } else if (it_range_cover(&range, overlap)) {

[3]

> > +            /* Drop this range and continue */
> > +            g_tree_remove(gtree, overlap);
> > +        } else {

[4]

> > +            /*
> > +             * The range to remove has intersection with existing
> > +             * ranges.  Remove part of the range and continue
> > +             */
> > +            it_range_and(&and, overlap, &range);
> > +            g_assert(and.start <= and.end);
> > +            it_tree_remove_subset(gtree, overlap, &and);
> > +        }
> > +    }
> > +
> 
> Looks like what we need here is to just calculate the intersection part
> then remove it.

Here, after a second thought, I am thinking whether it'd be good to use
a general routine in [4] to replace all of [1-4].

The problem is that if we do that we'll depend on the next
g_tree_lookup() to detect cases [1-2] (in those cases we'll find nothing
in the lookup, but we'll still need to walk the tree) to break the
loop, while IMHO that can sometimes be more expensive than the if
clauses, especially when the tree is a bit large (each level of the
walk costs a comparison), and also note that in our use case we really
hit cases [1] and [2] a lot.

I think I can merge [3] into [4], but I might consider keeping [1] and
[2] since I don't really know whether merging them would be better.
Anyway we can do that as a follow-up enhancement after the series is
merged.  But I think we'll need some performance benchmarks.
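For reference, merging [3] into [4] would leave the loop looking
roughly like this (a sketch; it assumes it_tree_remove_subset() also
copes with the subset being the entire existing range):

    while ((overlap = g_tree_lookup(gtree, &range))) {
        if (it_range_equal(overlap, &range)) {
            /* [1] exact match; done */
            g_tree_remove(gtree, overlap);
            break;
        } else if (it_range_cover(overlap, &range)) {
            /* [2] split the existing range into two; done */
            it_tree_remove_subset(gtree, overlap, &range);
            break;
        } else {
            /* [3]+[4] drop the intersection, then keep looping */
            it_range_and(&and, overlap, &range);
            g_assert(and.start <= and.end);
            it_tree_remove_subset(gtree, overlap, &and);
        }
    }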

Jason, what do you think?

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 08/10] intel-iommu: maintain per-device iova ranges
  2018-05-03  6:04                 ` Peter Xu
@ 2018-05-03  7:20                   ` Jason Wang
  2018-05-03  7:28                     ` Peter Xu
  0 siblings, 1 reply; 52+ messages in thread
From: Jason Wang @ 2018-05-03  7:20 UTC (permalink / raw)
  To: Peter Xu, Tian, Kevin
  Cc: Jintack Lim, Alex Williamson, qemu-devel, Michael S . Tsirkin



On 2018年05月03日 14:04, Peter Xu wrote:
> IMHO the guest can't really detect this, but it'll find that the
> device is not functioning correctly if it does something like what
> Jason has mentioned.
>
> Actually now I have an idea for when we really want to live well even
> with Jason's example: maybe we'll need to distinguish PSI from DSI.  For
> DSI, we don't remap already-mapped pages; for PSI, we unmap and remap
> the mapped pages.  That'll complicate the code a bit, but it should
> satisfy everyone.
>
> Thanks,

So it looks like there will still be unnecessary unmaps.  How about
recording the mappings in the tree too?

Thanks

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] util: implement simple interval tree logic
  2018-05-03  7:10     ` Peter Xu
@ 2018-05-03  7:21       ` Jason Wang
  2018-05-03  7:30         ` Peter Xu
  0 siblings, 1 reply; 52+ messages in thread
From: Jason Wang @ 2018-05-03  7:21 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, Michael S . Tsirkin, Alex Williamson, Jintack Lim



On 2018年05月03日 15:10, Peter Xu wrote:
> On Fri, Apr 27, 2018 at 01:53:35PM +0800, Jason Wang wrote:
>
> [...]
>
>>> +int it_tree_remove(ITTree *tree, ITValue start, ITValue end)
>>> +{
>>> +    ITRange range = { .start = start, .end = end }, *overlap, and;
>>> +    GTree *gtree;
>>> +
>>> +    g_assert(tree);
>>> +
>>> +    gtree = tree->tree;
>>> +    while ((overlap = g_tree_lookup(gtree, &range))) {
>>> +        if (it_range_equal(overlap, &range)) {
>>> +            /* Exactly what we want to remove; done */
> [1]
>
>>> +            g_tree_remove(gtree, overlap);
>>> +            break;
>>> +        } else if (it_range_cover(overlap, &range)) {
> [2]
>
>>> +            /* Split existing range into two; done */
>>> +            it_tree_remove_subset(gtree, overlap, &range);
>>> +            break;
>>> +        } else if (it_range_cover(&range, overlap)) {
> [3]
>
>>> +            /* Drop this range and continue */
>>> +            g_tree_remove(gtree, overlap);
>>> +        } else {
> [4]
>
>>> +            /*
>>> +             * The range to remove has intersection with existing
>>> +             * ranges.  Remove part of the range and continue
>>> +             */
>>> +            it_range_and(&and, overlap, &range);
>>> +            g_assert(and.start <= and.end);
>>> +            it_tree_remove_subset(gtree, overlap, &and);
>>> +        }
>>> +    }
>>> +
>> Looks like what we need here is to just calculate the intersection part
>> then remove it.
> Here after a second thought, I am thining whether it'll be good I use
> a general routine in [4] to replace all [1-4].
>
> The problem is that if we do that we'll depend on the next
> g_tree_lookup() to detect case [1-2] (in that case we'll find nothing
> in the lookup, but still we'll need to walk the tree) to break the
> loop, while IMHO that sometimes can be more expensive than the if
> clauses, especially when the tree is a bit large (each branch needs an
> if clause anyway), and also note that in our use case cases [1] and [2]
> are very common.
>
> I think I can merge [3] into [4], but I might consider keeping [1] and
> [2] since I don't really know whether merging them would be better.
> Anyway, we can do that as a follow-up enhancement after the series is
> merged, but I think we'll need some performance benchmarks.
>
> Jason, what do you think?
>
> Regards,
>

Sounds good (but I don't promise I won't have new comments :))

Thanks

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 08/10] intel-iommu: maintain per-device iova ranges
  2018-05-03  7:20                   ` Jason Wang
@ 2018-05-03  7:28                     ` Peter Xu
  2018-05-03  7:43                       ` Jason Wang
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2018-05-03  7:28 UTC (permalink / raw)
  To: Jason Wang
  Cc: Tian, Kevin, Jintack Lim, Alex Williamson, qemu-devel,
	Michael S . Tsirkin

On Thu, May 03, 2018 at 03:20:11PM +0800, Jason Wang wrote:
> 
> 
> On 2018-05-03 14:04, Peter Xu wrote:
> > IMHO the guest can't really detect this, but it'll find that the
> > device is not working properly if it's doing something like what
> > Jason has mentioned.
> > 
> > Actually now I have an idea in case we really want to handle even
> > Jason's example: maybe we'll need to distinguish PSI from DSI.  For
> > DSI, we don't remap already-mapped pages; for PSI, we unmap and remap
> > the mapped pages.  That'll complicate things a bit, but it should
> > satisfy everyone.
> > 
> > Thanks,
> 
> So it looks like there will still be unnecessary unmaps.

Could I ask what you mean by "unnecessary unmaps"?

> How about recording the mappings in the tree too?

As I mentioned, for an L1 guest (e.g., DPDK applications running in L1)
it'll be fine.  But I'm just afraid we will have other use cases, like
L2 guests, which might need tons of mapping entries in the
worst-case scenario.
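
(Purely illustrative, to show why: if each leaf had to carry its
translation, something like the hypothetical layout below, then two
adjacent leaves with unrelated targets could no longer be merged, so
the tree would need one entry per guest mapping.)

    /* Hypothetical node layout - NOT what the series implements.
     * Carrying the translated address per leaf prevents merging
     * adjacent iova ranges, hence the memory blowup for L2 guests. */
    typedef struct ITMapping {
        ITValue start;    /* first iova covered (inclusive) */
        ITValue end;      /* last iova covered (inclusive) */
        ITValue target;   /* translated address for this range */
    } ITMapping;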

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] util: implement simple interval tree logic
  2018-05-03  7:21       ` Jason Wang
@ 2018-05-03  7:30         ` Peter Xu
  0 siblings, 0 replies; 52+ messages in thread
From: Peter Xu @ 2018-05-03  7:30 UTC (permalink / raw)
  To: Jason Wang; +Cc: qemu-devel, Michael S . Tsirkin, Alex Williamson, Jintack Lim

On Thu, May 03, 2018 at 03:21:13PM +0800, Jason Wang wrote:
> 
> 
On 2018-05-03 15:10, Peter Xu wrote:
> > On Fri, Apr 27, 2018 at 01:53:35PM +0800, Jason Wang wrote:
> > 
> > [...]
> > 
> > > > +int it_tree_remove(ITTree *tree, ITValue start, ITValue end)
> > > > +{
> > > > +    ITRange range = { .start = start, .end = end }, *overlap, and;
> > > > +    GTree *gtree;
> > > > +
> > > > +    g_assert(tree);
> > > > +
> > > > +    gtree = tree->tree;
> > > > +    while ((overlap = g_tree_lookup(gtree, &range))) {
> > > > +        if (it_range_equal(overlap, &range)) {
> > > > +            /* Exactly what we want to remove; done */
> > [1]
> > 
> > > > +            g_tree_remove(gtree, overlap);
> > > > +            break;
> > > > +        } else if (it_range_cover(overlap, &range)) {
> > [2]
> > 
> > > > +            /* Split existing range into two; done */
> > > > +            it_tree_remove_subset(gtree, overlap, &range);
> > > > +            break;
> > > > +        } else if (it_range_cover(&range, overlap)) {
> > [3]
> > 
> > > > +            /* Drop this range and continue */
> > > > +            g_tree_remove(gtree, overlap);
> > > > +        } else {
> > [4]
> > 
> > > > +            /*
> > > > +             * The range to remove has intersection with existing
> > > > +             * ranges.  Remove part of the range and continue
> > > > +             */
> > > > +            it_range_and(&and, overlap, &range);
> > > > +            g_assert(and.start <= and.end);
> > > > +            it_tree_remove_subset(gtree, overlap, &and);
> > > > +        }
> > > > +    }
> > > > +
> > > Looks like what we need here is just to calculate the intersection
> > > part and then remove it.
> > Here, after a second thought, I am thinking whether it'd be good to use
> > a general routine in [4] to replace all of [1-4].
> > 
> > The problem is that if we do that we'll depend on the next
> > g_tree_lookup() to detect case [1-2] (in that case we'll find nothing
> > in the lookup, but still we'll need to walk the tree) to break the
> > loop, while IMHO that sometimes can be more expensive than the if
> > clauses, especially when the tree is a bit large (each branch needs an
> > if clause anyway), and also note that in our use case cases [1] and [2]
> > are very common.
> > 
> > I think I can merge [3] into [4], but I might consider keeping [1] and
> > [2] since I don't really know whether merging them would be better.
> > Anyway, we can do that as a follow-up enhancement after the series is
> > merged, but I think we'll need some performance benchmarks.
> > 
> > Jason, what do you think?
> > 
> > Regards,
> > 
> 
> Sounds good (but I don't promise I won't have new comments :))

Sure!  I suppose that's why we post patches - so that people can see
them and leave comments. :)

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 08/10] intel-iommu: maintain per-device iova ranges
  2018-05-03  7:28                     ` Peter Xu
@ 2018-05-03  7:43                       ` Jason Wang
  2018-05-03  7:53                         ` Peter Xu
  0 siblings, 1 reply; 52+ messages in thread
From: Jason Wang @ 2018-05-03  7:43 UTC (permalink / raw)
  To: Peter Xu
  Cc: Tian, Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	qemu-devel



On 2018-05-03 15:28, Peter Xu wrote:
> On Thu, May 03, 2018 at 03:20:11PM +0800, Jason Wang wrote:
>>
>> On 2018-05-03 14:04, Peter Xu wrote:
>>> IMHO the guest can't really detect this, but it'll find that the
>>> device is not working properly if it's doing something like what
>>> Jason has mentioned.
>>>
>>> Actually now I have an idea in case we really want to handle even
>>> Jason's example: maybe we'll need to distinguish PSI from DSI.  For
>>> DSI, we don't remap already-mapped pages; for PSI, we unmap and remap
>>> the mapped pages.  That'll complicate things a bit, but it should
>>> satisfy everyone.
>>>
>>> Thanks,
>> So it looks like there will still be unnecessary unmaps.
> Could I ask what you mean by "unnecessary unmaps"?

It refers to "for PSI, we unmap and remap the mapped pages".  For the
first "unmap", how do you know it was really necessary without knowing
the state of the current shadow page table?

>> How about recording the mappings in the tree too?
> As I mentioned, for an L1 guest (e.g., DPDK applications running in L1)
> it'll be fine.  But I'm just afraid we will have other use cases, like
> L2 guests, which might need tons of mapping entries in the
> worst-case scenario.
>

Yes, but that's the price of shadow page tables.

Thanks

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 08/10] intel-iommu: maintain per-device iova ranges
  2018-05-03  7:43                       ` Jason Wang
@ 2018-05-03  7:53                         ` Peter Xu
  2018-05-03  9:22                           ` Jason Wang
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2018-05-03  7:53 UTC (permalink / raw)
  To: Jason Wang
  Cc: Tian, Kevin, Michael S . Tsirkin, Alex Williamson, Jintack Lim,
	qemu-devel

On Thu, May 03, 2018 at 03:43:35PM +0800, Jason Wang wrote:
> 
> 
> On 2018-05-03 15:28, Peter Xu wrote:
> > On Thu, May 03, 2018 at 03:20:11PM +0800, Jason Wang wrote:
> > > 
> > > On 2018-05-03 14:04, Peter Xu wrote:
> > > > IMHO the guest can't really detect this, but it'll find that the
> > > > device is not working properly if it's doing something like what
> > > > Jason has mentioned.
> > > > 
> > > > Actually now I have an idea in case we really want to handle even
> > > > Jason's example: maybe we'll need to distinguish PSI from DSI.  For
> > > > DSI, we don't remap already-mapped pages; for PSI, we unmap and remap
> > > > the mapped pages.  That'll complicate things a bit, but it should
> > > > satisfy everyone.
> > > > 
> > > > Thanks,
> > > So it looks like there will still be unnecessary unmaps.
> > Could I ask what you mean by "unnecessary unmaps"?
> 
> It refers to "for PSI, we unmap and remap the mapped pages".  For the
> first "unmap", how do you know it was really necessary without knowing
> the state of the current shadow page table?

I don't.  Could I just unmap it anyway?  Say the guest has already
_modified_ the PTE.  Yes, I think it's following the spec, but it is
really _unsafe_, and we can know that from what it has done already.
Then I really think an unmap+map would be good enough for us...  After
all, that behavior can cause DMA errors even on real hardware.  You
can never tell.

> 
> > > How about recording the mappings in the tree too?
> > As I mentioned, for an L1 guest (e.g., DPDK applications running in L1)
> > it'll be fine.  But I'm just afraid we will have other use cases, like
> > L2 guests, which might need tons of mapping entries in the
> > worst-case scenario.
> > 
> 
> Yes, but that's the price of shadow page tables.

So that's why I would like to propose this mergeable interval tree.  It
might greatly reduce the price if we can reach a consensus on how we
should treat those strangely-behaved guest OSes.  Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 08/10] intel-iommu: maintain per-device iova ranges
  2018-05-03  7:53                         ` Peter Xu
@ 2018-05-03  9:22                           ` Jason Wang
  2018-05-03  9:53                             ` Peter Xu
  0 siblings, 1 reply; 52+ messages in thread
From: Jason Wang @ 2018-05-03  9:22 UTC (permalink / raw)
  To: Peter Xu
  Cc: Jintack Lim, Tian, Kevin, qemu-devel, Alex Williamson,
	Michael S . Tsirkin



On 2018-05-03 15:53, Peter Xu wrote:
> On Thu, May 03, 2018 at 03:43:35PM +0800, Jason Wang wrote:
>>
>> On 2018-05-03 15:28, Peter Xu wrote:
>>> On Thu, May 03, 2018 at 03:20:11PM +0800, Jason Wang wrote:
>>>> On 2018-05-03 14:04, Peter Xu wrote:
>>>>> IMHO the guest can't really detect this, but it'll find that the
>>>>> device is not working properly if it's doing something like what
>>>>> Jason has mentioned.
>>>>>
>>>>> Actually now I have an idea in case we really want to handle even
>>>>> Jason's example: maybe we'll need to distinguish PSI from DSI.  For
>>>>> DSI, we don't remap already-mapped pages; for PSI, we unmap and remap
>>>>> the mapped pages.  That'll complicate things a bit, but it should
>>>>> satisfy everyone.
>>>>>
>>>>> Thanks,
>>>> So it looks like there will still be unnecessary unmaps.
>>> Could I ask what you mean by "unnecessary unmaps"?
>> It refers to "for PSI, we unmap and remap the mapped pages".  For the
>> first "unmap", how do you know it was really necessary without knowing
>> the state of the current shadow page table?
> I don't.  Could I just unmap it anyway?  Say the guest has already
> _modified_ the PTE.  Yes, I think it's following the spec, but it is
> really _unsafe_, and we can know that from what it has done already.
> Then I really think an unmap+map would be good enough for us...  After
> all, that behavior can cause DMA errors even on real hardware.  You
> can never tell.

I mean for the following case:

1) guest maps A1 (iova) to XXX
2) guest maps A2 (A1 + 4K) (iova) to YYY
3) guest maps A3 (A1 + 8K) (iova) to ZZZ
4) guest unmaps A2 and A3; to reduce the number of PSIs, it can
invalidate A1 with a range of 2M

If this is allowed by the spec, it looks like A1 will be unmapped and
remapped.
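
To spell it out against the merged tree (a hypothetical trace, assuming
the it_tree_* API from patch 7 with inclusive range ends):

    /* After steps 1-3 the adjacent leaves merge into one:
     * the tree holds a single range [A1, A1 + 12K - 1]. */
    it_tree_remove(tree, A1, A1 + 2 * 1024 * 1024 - 1);
    /* The 2M PSI overlaps the merged leaf, so the whole leaf is
     * dropped: A1's still-valid mapping gets unmapped and must be
     * re-established by the page walk - the unnecessary unmap+map. */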

Thanks

>
>>>> How about recording the mappings in the tree too?
>>> As I mentioned, for an L1 guest (e.g., DPDK applications running in L1)
>>> it'll be fine.  But I'm just afraid we will have other use cases, like
>>> L2 guests, which might need tons of mapping entries in the
>>> worst-case scenario.
>>>
>> Yes, but that's the price of shadow page tables.
> So that's why I would like to propose this mergeable interval tree.  It
> might greatly reduce the price if we can reach a consensus on how we
> should treat those strangely-behaved guest OSes.  Thanks,
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 08/10] intel-iommu: maintain per-device iova ranges
  2018-05-03  9:22                           ` Jason Wang
@ 2018-05-03  9:53                             ` Peter Xu
  2018-05-03 12:01                               ` Peter Xu
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2018-05-03  9:53 UTC (permalink / raw)
  To: Jason Wang
  Cc: Jintack Lim, Tian, Kevin, qemu-devel, Alex Williamson,
	Michael S . Tsirkin

On Thu, May 03, 2018 at 05:22:03PM +0800, Jason Wang wrote:
> 
> 
> On 2018-05-03 15:53, Peter Xu wrote:
> > On Thu, May 03, 2018 at 03:43:35PM +0800, Jason Wang wrote:
> > > 
> > > On 2018-05-03 15:28, Peter Xu wrote:
> > > > On Thu, May 03, 2018 at 03:20:11PM +0800, Jason Wang wrote:
> > > > > On 2018-05-03 14:04, Peter Xu wrote:
> > > > > > IMHO the guest can't really detect this, but it'll find that the
> > > > > > device is not working properly if it's doing something like what
> > > > > > Jason has mentioned.
> > > > > > 
> > > > > > Actually now I have an idea in case we really want to handle even
> > > > > > Jason's example: maybe we'll need to distinguish PSI from DSI.  For
> > > > > > DSI, we don't remap already-mapped pages; for PSI, we unmap and remap
> > > > > > the mapped pages.  That'll complicate things a bit, but it should
> > > > > > satisfy everyone.
> > > > > > 
> > > > > > Thanks,
> > > > > So it looks like there will still be unnecessary unmaps.
> > > > Could I ask what you mean by "unnecessary unmaps"?
> > > It refers to "for PSI, we unmap and remap the mapped pages".  For the
> > > first "unmap", how do you know it was really necessary without knowing
> > > the state of the current shadow page table?
> > I don't.  Could I just unmap it anyway?  Say the guest has already
> > _modified_ the PTE.  Yes, I think it's following the spec, but it is
> > really _unsafe_, and we can know that from what it has done already.
> > Then I really think an unmap+map would be good enough for us...  After
> > all, that behavior can cause DMA errors even on real hardware.  You
> > can never tell.
> 
> I mean for the following case:
> 
> 1) guest maps A1 (iova) to XXX
> 2) guest maps A2 (A1 + 4K) (iova) to YYY
> 3) guest maps A3 (A1 + 8K) (iova) to ZZZ
> 4) guest unmaps A2 and A3; to reduce the number of PSIs, it can
> invalidate A1 with a range of 2M
> 
> If this is allowed by the spec, it looks like A1 will be unmapped and
> remapped.

My follow-up patch won't survive this case, but the original patch
will work.

Jason and I discussed this a bit on IRC.  Here's the conclusion we
reached: for now we use my original patch (which solves everything
except PTE modifications) and mark the modify-PTE problem as a TODO.
Then at least we can have nested device assignment working well on
known OSes first.

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 08/10] intel-iommu: maintain per-device iova ranges
  2018-05-03  9:53                             ` Peter Xu
@ 2018-05-03 12:01                               ` Peter Xu
  0 siblings, 0 replies; 52+ messages in thread
From: Peter Xu @ 2018-05-03 12:01 UTC (permalink / raw)
  To: Jason Wang
  Cc: Jintack Lim, Tian, Kevin, qemu-devel, Alex Williamson,
	Michael S . Tsirkin

On Thu, May 03, 2018 at 05:53:59PM +0800, Peter Xu wrote:
> On Thu, May 03, 2018 at 05:22:03PM +0800, Jason Wang wrote:
> > 
> > 
> > On 2018-05-03 15:53, Peter Xu wrote:
> > > On Thu, May 03, 2018 at 03:43:35PM +0800, Jason Wang wrote:
> > > > 
> > > > On 2018-05-03 15:28, Peter Xu wrote:
> > > > > On Thu, May 03, 2018 at 03:20:11PM +0800, Jason Wang wrote:
> > > > > > On 2018-05-03 14:04, Peter Xu wrote:
> > > > > > > IMHO the guest can't really detect this, but it'll find that the
> > > > > > > device is not working properly if it's doing something like what
> > > > > > > Jason has mentioned.
> > > > > > > 
> > > > > > > Actually now I have an idea in case we really want to handle even
> > > > > > > Jason's example: maybe we'll need to distinguish PSI from DSI.  For
> > > > > > > DSI, we don't remap already-mapped pages; for PSI, we unmap and remap
> > > > > > > the mapped pages.  That'll complicate things a bit, but it should
> > > > > > > satisfy everyone.
> > > > > > > 
> > > > > > > Thanks,
> > > > > > So it looks like there will still be unnecessary unmaps.
> > > > > Could I ask what you mean by "unnecessary unmaps"?
> > > > It refers to "for PSI, we unmap and remap the mapped pages".  For the
> > > > first "unmap", how do you know it was really necessary without knowing
> > > > the state of the current shadow page table?
> > > I don't.  Could I just unmap it anyway?  Say the guest has already
> > > _modified_ the PTE.  Yes, I think it's following the spec, but it is
> > > really _unsafe_, and we can know that from what it has done already.
> > > Then I really think an unmap+map would be good enough for us...  After
> > > all, that behavior can cause DMA errors even on real hardware.  You
> > > can never tell.
> > 
> > I mean for the following case:
> > 
> > 1) guest maps A1 (iova) to XXX
> > 2) guest maps A2 (A1 + 4K) (iova) to YYY
> > 3) guest maps A3 (A1 + 8K) (iova) to ZZZ
> > 4) guest unmaps A2 and A3; to reduce the number of PSIs, it can
> > invalidate A1 with a range of 2M
> > 
> > If this is allowed by the spec, it looks like A1 will be unmapped and
> > remapped.
> 
> My follow-up patch won't survive this case, but the original patch
> will work.
> 
> Jason and I discussed this a bit on IRC.  Here's the conclusion we
> reached: for now we use my original patch (which solves everything
> except PTE modifications) and mark the modify-PTE problem as a TODO.
> Then at least we can have nested device assignment working well on
> known OSes first.

Here I just want to mention that we actually have no way to emulate
an atomic PTE modification.  The problem is that we can never
atomically modify a PTE on the host with Linux, either via the VFIO
interface or even directly using the IOMMU API in the kernel.  To be
more specific to our use case - VFIO provides VFIO_IOMMU_MAP_DMA and
VFIO_IOMMU_UNMAP_DMA, but it never provides VFIO_IOMMU_MODIFY_DMA to
modify a PTE atomically.  It means that even if we know the PTE has
changed, we can only unmap it and remap it.  That still has the same
"invalid window" problem we have discussed, since between the unmap
and the remap the page is invalid (while from the guest's POV it
should never be, since the PTE modification is atomic).
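
Just to illustrate the shape of the problem (a minimal sketch, not
QEMU code - the helper name vfio_remap_page() and its arguments are
made up; only the two VFIO ioctls are real):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* "Modifying" a translation through VFIO is necessarily two
     * ioctls, so there is a window where the iova is not mapped. */
    static int vfio_remap_page(int container, uint64_t iova,
                               uint64_t size, uint64_t new_vaddr)
    {
        struct vfio_iommu_type1_dma_unmap unmap = {
            .argsz = sizeof(unmap),
            .iova  = iova,
            .size  = size,
        };
        struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
            .vaddr = new_vaddr,
            .iova  = iova,
            .size  = size,
        };

        if (ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap) < 0) {
            return -1;
        }
        /* Window: the iova is invalid here; in-flight DMA to it can
         * fault, while from the guest's POV the PTE update was atomic. */
        return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
    }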

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2018-05-03 12:01 UTC | newest]

Thread overview: 52+ messages
2018-04-25  4:51 [Qemu-devel] [PATCH 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes Peter Xu
2018-04-25  4:51 ` [Qemu-devel] [PATCH 01/10] intel-iommu: send PSI always even if across PDEs Peter Xu
2018-04-25  4:51 ` [Qemu-devel] [PATCH 02/10] intel-iommu: remove IntelIOMMUNotifierNode Peter Xu
2018-04-25  4:51 ` [Qemu-devel] [PATCH 03/10] intel-iommu: add iommu lock Peter Xu
2018-04-25 16:26   ` Emilio G. Cota
2018-04-26  5:45     ` Peter Xu
2018-04-27  5:13   ` Jason Wang
2018-04-27  6:26     ` Peter Xu
2018-04-27  7:19       ` Tian, Kevin
2018-04-27  9:53         ` Peter Xu
2018-04-28  1:54           ` Tian, Kevin
2018-04-28  1:43       ` Jason Wang
2018-04-28  2:24         ` Peter Xu
2018-04-28  2:42           ` Jason Wang
2018-04-28  3:06             ` Peter Xu
2018-04-28  3:11               ` Jason Wang
2018-04-28  3:14             ` Peter Xu
2018-04-28  3:16               ` Jason Wang
2018-04-30  7:22               ` Paolo Bonzini
2018-04-30  7:20           ` Paolo Bonzini
2018-05-03  5:39             ` Peter Xu
2018-04-25  4:51 ` [Qemu-devel] [PATCH 04/10] intel-iommu: only do page walk for MAP notifiers Peter Xu
2018-04-25  4:51 ` [Qemu-devel] [PATCH 05/10] intel-iommu: introduce vtd_page_walk_info Peter Xu
2018-04-25  4:51 ` [Qemu-devel] [PATCH 06/10] intel-iommu: pass in address space when page walk Peter Xu
2018-04-25  4:51 ` [Qemu-devel] [PATCH 07/10] util: implement simple interval tree logic Peter Xu
2018-04-27  5:53   ` Jason Wang
2018-04-27  6:27     ` Peter Xu
2018-05-03  7:10     ` Peter Xu
2018-05-03  7:21       ` Jason Wang
2018-05-03  7:30         ` Peter Xu
2018-04-25  4:51 ` [Qemu-devel] [PATCH 08/10] intel-iommu: maintain per-device iova ranges Peter Xu
2018-04-27  6:07   ` Jason Wang
2018-04-27  6:34     ` Peter Xu
2018-04-27  7:02     ` Tian, Kevin
2018-04-27  7:28       ` Peter Xu
2018-04-27  7:44         ` Tian, Kevin
2018-04-27  9:55           ` Peter Xu
2018-04-27 11:40             ` Peter Xu
2018-04-27 23:37               ` Tian, Kevin
2018-05-03  6:04                 ` Peter Xu
2018-05-03  7:20                   ` Jason Wang
2018-05-03  7:28                     ` Peter Xu
2018-05-03  7:43                       ` Jason Wang
2018-05-03  7:53                         ` Peter Xu
2018-05-03  9:22                           ` Jason Wang
2018-05-03  9:53                             ` Peter Xu
2018-05-03 12:01                               ` Peter Xu
2018-04-28  1:49               ` Jason Wang
2018-04-25  4:51 ` [Qemu-devel] [PATCH 09/10] intel-iommu: don't unmap all for shadow page table Peter Xu
2018-04-25  4:51 ` [Qemu-devel] [PATCH 10/10] intel-iommu: remove notify_unmap for page walk Peter Xu
2018-04-25  5:05 ` [Qemu-devel] [PATCH 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes no-reply
2018-04-25  5:34   ` Peter Xu
