* [Qemu-devel] [PATCH v8 0/9] VT-d: vfio enablement and misc enhances
@ 2017-04-06  7:08 Peter Xu
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 1/9] memory: add section range info for IOMMU notifier Peter Xu
                   ` (10 more replies)
  0 siblings, 11 replies; 34+ messages in thread
From: Peter Xu @ 2017-04-06  7:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, Marcel Apfelbaum, jan.kiszka,
	jasowang, peterx, David Gibson, alex.williamson, bd.aviv

This is v8 of the VT-d vfio enablement series.

v8:
- remove patches 1-9 since they are already merged
- add David's r-b for all the patches
- add Aviv's s-o-b in the last patch
- rename iommu to iommu_dmar [Jason]
- rename last patch subject to "remote IOTLB" [Jason]
- pick up Jason's two patches to fix vhost breakage
- let vhost leverage the new IOMMU notifier interface

v7:
- for the two trace patches: change subjects. Remove the vtd_err() and
  vtd_err_nonzero_rsvd() tracers, instead using a standalone trace for
  each of the places. Don't remove any DPRINTF() if there is no
  replacement. [Jason]
- add r-b and a-b for Alex/David/Jason.
- in patch "intel_iommu: renaming gpa to iova where proper", convert
  one more place that I missed [Jason]
- fix the place where I should use "~0ULL" not "~0" [Jason]
- squash patch 16 into 18 [Jason]

v6:
- do unmap in all cases when replay [Jason]
- do global replay even if context entry is invalidated [Jason]
- when iommu reset, send unmap to all registered notifiers [Jason]
- use rcu read lock to protect the whole vfio_iommu_map_notify()
  [Alex, Paolo]

v5:
- fix patch 4's overly long subject, and a spelling error [Eric]
- add ack-by for Alex in patch 1 [Alex]
- squash patches 19/20 into patch 18 [Jason]
- fix comments in vtd_page_walk() [Jason]
- remove all error_report() [Jason]
- add comment for patch 18, mentioning that it enables vhost without
  ATS as well [Jason]
- remove skipped debug thing during page walk [Jason]
- remove duplicated page walk trace [Jason]
- some tunings in vtd_address_space_unmap(), to provide the correct
  iova and addr_mask. For this, I also slightly tuned the patch
  "memory: add section range info for IOMMU notifier"
  to loosen the range check

v4:
- convert all error_report()s into traces (in the two patches that did
  that)
- rebased to Jason's DMAR series (master + one more patch:
  "[PATCH V4 net-next] vhost_net: device IOTLB support")
- let vhost use the new api iommu_notifier_init() so it won't break
  vhost dmar [Jason]
- touch commit message of the patch:
  "intel_iommu: provide its own replay() callback"
  the old replay is not a dead loop; it just consumes lots of time
  [Jason]
- add comment for patch:
  "intel_iommu: do replay when context invalidate"
  telling why replay won't be a problem even without CM=1 [Jason]
- remove a useless comment line [Jason]
- remove dmar_enabled parameter for vtd_switch_address_space() and
  vtd_switch_address_space_all() [Mst, Jason]
- merged the vfio patches in, to support unmap of big ranges at the
  beginning ("[PATCH RFC 0/3] vfio: allow to notify unmap for very big
  region")
- using caching_mode instead of cache_mode_enabled, and "caching-mode"
  instead of "cache-mode" [Kevin]
- when receiving a context entry invalidation, unmap the entire region
  first, then replay [Alex]
- fix commit message for patch:
  "intel_iommu: simplify irq region translation" [Kevin]
- handle domain/global invalidation, and notify where appropriate
  [Jason, Kevin]

v3:
- fix style error reported by patchew
- fix comment in domain switch patch: use "IOMMU address space" rather
  than "IOMMU region" [Kevin]
- add ack-by for Paolo in patch:
  "memory: add section range info for IOMMU notifier"
  (this was collected separately, outside this thread)
- remove 3 patches which are merged already (from Jason)
- rebase to master b6c0897

v2:
- change comment for "end" parameter in vtd_page_walk() [Tianyu]
- change comment for "a iova" to "an iova" [Yi]
- fix the printed value for the GPA address in vtd_page_walk_level()
  (debug only)
- rebased to master (rather than Aviv's v6 series) and merged Aviv's
  series v6: picked patch 1 (as patch 1 in this series), dropped patch
  2, re-wrote patch 3 (as patch 17 of this series).
- picked up two more bugfix patches from Jason's DMAR series
- picked up the following patch as well:
  "[PATCH v3] intel_iommu: allow dynamic switch of IOMMU region"

This series is a rework of Aviv B.D.'s vfio enablement series
with vt-d:

  https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01452.html

Aviv has done a great job there; what was still missing is
mostly the following:

(1) VFIO got duplicated IOTLB notifications due to the split VT-d
    IOMMU memory region.

(2) VT-d still didn't provide a correct replay() mechanism (e.g.,
    when the IOMMU domain switches, things will break).

This series should have solved the above two issues.

Online repo:

  https://github.com/xzpeter/qemu/tree/vtd-vfio-enablement-v8

I would be glad to hear any review comments on the above patches.

=========
Test Done
=========

Build test passed for x86_64/arm/ppc64.

Simply tested with x86_64, assigning two PCI devices to a single VM,
booting the VM using:

bin=x86_64-softmmu/qemu-system-x86_64
$bin -M q35,accel=kvm,kernel-irqchip=split -m 1G \
     -device intel-iommu,intremap=on,eim=off,caching-mode=on \
     -netdev user,id=net0,hostfwd=tcp::5555-:22 \
     -device virtio-net-pci,netdev=net0 \
     -device vfio-pci,host=03:00.0 \
     -device vfio-pci,host=02:00.0 \
     -trace events=".trace.vfio" \
     /var/lib/libvirt/images/vm1.qcow2

pxdev:bin [vtd-vfio-enablement]# cat .trace.vfio
vtd_page_walk*
vtd_replay*
vtd_inv_desc*

Then, in the guest, run the following tool:

  https://github.com/xzpeter/clibs/blob/master/gpl/userspace/vfio-bind-group/vfio-bind-group.c

With parameter:

  ./vfio-bind-group 00:03.0 00:04.0

Checking the host-side trace log, I can see pages being replayed and
mapped into the 00:04.0 device address space, like:

...
vtd_replay_ce_valid replay valid context device 00:04.00 hi 0x401 lo 0x38fe1001
vtd_page_walk Page walk for ce (0x401, 0x38fe1001) iova range 0x0 - 0x8000000000
vtd_page_walk_level Page walk (base=0x38fe1000, level=3) iova range 0x0 - 0x8000000000
vtd_page_walk_level Page walk (base=0x35d31000, level=2) iova range 0x0 - 0x40000000
vtd_page_walk_level Page walk (base=0x34979000, level=1) iova range 0x0 - 0x200000
vtd_page_walk_one Page walk detected map level 0x1 iova 0x0 -> gpa 0x22dc3000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x1000 -> gpa 0x22e25000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x2000 -> gpa 0x22e12000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x3000 -> gpa 0x22e2d000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x4000 -> gpa 0x12a49000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x5000 -> gpa 0x129bb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x6000 -> gpa 0x128db000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x7000 -> gpa 0x12a80000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x8000 -> gpa 0x12a7e000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x9000 -> gpa 0x12b22000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0xa000 -> gpa 0x12b41000 mask 0xfff perm 3
...

=========
Todo List
=========

- error reporting for the assigned devices (as Tianyu has mentioned)

- per-domain address-space: a better solution in the future may be to
  maintain one address space per IOMMU domain in the guest (so
  multiple devices can share the same address space if they share the
  same IOMMU domain in the guest), rather than one address space per
  device (which is the current vt-d implementation). However, that's
  a step further than this series; let's first see whether we can
  provide a workable version of device assignment with vt-d
  protection.

- no need to notify IOTLB (psi/gsi/global) invalidations to devices
  that have ATS enabled

- investigate the case when the guest maps a page whose mask covers
  already-mapped pages (e.g., map 12k-16k first, then map 0-12k)

- coalesce unmaps during page walk (currently we send one
  notification per page)

- when doing PSI for unmap, check whether we can send one notification
  directly instead of walking over the page table

- more to come...

Thanks,

Jason Wang (1):
  intel_iommu: use the correct memory region for device IOTLB
    notification

Peter Xu (8):
  memory: add section range info for IOMMU notifier
  memory: provide IOMMU_NOTIFIER_FOREACH macro
  memory: provide iommu_replay_all()
  memory: introduce memory_region_notify_one()
  memory: add MemoryRegionIOMMUOps.replay() callback
  intel_iommu: provide its own replay() callback
  intel_iommu: allow dynamic switch of IOMMU region
  intel_iommu: enable remote IOTLB

 hw/i386/intel_iommu.c          | 442 +++++++++++++++++++++++++++++++++++++++--
 hw/i386/intel_iommu_internal.h |   1 +
 hw/i386/trace-events           |  10 +-
 hw/vfio/common.c               |  12 +-
 hw/virtio/vhost.c              |  10 +-
 include/exec/memory.h          |  49 ++++-
 include/hw/i386/intel_iommu.h  |  10 +
 memory.c                       |  52 ++++-
 8 files changed, 552 insertions(+), 34 deletions(-)

-- 
2.7.4

* [Qemu-devel] [PATCH v8 1/9] memory: add section range info for IOMMU notifier
  2017-04-06  7:08 [Qemu-devel] [PATCH v8 0/9] VT-d: vfio enablement and misc enhances Peter Xu
@ 2017-04-06  7:08 ` Peter Xu
  2017-04-06 10:42   ` Auger Eric
                     ` (2 more replies)
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 2/9] memory: provide IOMMU_NOTIFIER_FOREACH macro Peter Xu
                   ` (9 subsequent siblings)
  10 siblings, 3 replies; 34+ messages in thread
From: Peter Xu @ 2017-04-06  7:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, Marcel Apfelbaum, jan.kiszka,
	jasowang, peterx, David Gibson, alex.williamson, bd.aviv

In this patch, IOMMUNotifier.{start|end} are introduced to store section
information for a specific notifier. When a notification occurs, we not
only check the notification type (MAP|UNMAP), but also check whether the
notified iova range overlaps with the range of the specific IOMMU
notifier, and skip notifiers if the range is not in their listened range.

When removing a region, we need to make sure we remove the correct
VFIOGuestIOMMU by checking the IOMMUNotifier.start address as well.

This patch solves the problem that vfio-pci devices receive
duplicated UNMAP notifications on the x86 platform when a vIOMMU is
present. The issue is that the x86 IOMMU has a (0, 2^64-1) IOMMU
region, which is split by the (0xfee00000, 0xfeefffff) IRQ region.
AFAIK this (split IOMMU region) only happens on x86.

This patch also lets vhost leverage the new interface, so that vhost
won't get duplicated cache flushes. In that sense, it's a slight
performance improvement as well.

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
v7->v8:
- let vhost dmar leverage the new interface as well
- add some more comments in commit message, mentioning what issue this
  patch has solved
- since touched up, removing Alex's a-b and DavidG's r-b
---
 hw/vfio/common.c      | 12 +++++++++---
 hw/virtio/vhost.c     | 10 ++++++++--
 include/exec/memory.h | 19 ++++++++++++++++++-
 memory.c              |  9 +++++++++
 4 files changed, 44 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index f3ba9b9..6b33b9f 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -478,8 +478,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
         giommu->iommu_offset = section->offset_within_address_space -
                                section->offset_within_region;
         giommu->container = container;
-        giommu->n.notify = vfio_iommu_map_notify;
-        giommu->n.notifier_flags = IOMMU_NOTIFIER_ALL;
+        llend = int128_add(int128_make64(section->offset_within_region),
+                           section->size);
+        llend = int128_sub(llend, int128_one());
+        iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
+                            IOMMU_NOTIFIER_ALL,
+                            section->offset_within_region,
+                            int128_get64(llend));
         QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
 
         memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
@@ -550,7 +555,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
         VFIOGuestIOMMU *giommu;
 
         QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
-            if (giommu->iommu == section->mr) {
+            if (giommu->iommu == section->mr &&
+                giommu->n.start == section->offset_within_region) {
                 memory_region_unregister_iommu_notifier(giommu->iommu,
                                                         &giommu->n);
                 QLIST_REMOVE(giommu, giommu_next);
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 613494d..185b95b 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -736,14 +736,20 @@ static void vhost_iommu_region_add(MemoryListener *listener,
     struct vhost_dev *dev = container_of(listener, struct vhost_dev,
                                          iommu_listener);
     struct vhost_iommu *iommu;
+    Int128 end;
 
     if (!memory_region_is_iommu(section->mr)) {
         return;
     }
 
     iommu = g_malloc0(sizeof(*iommu));
-    iommu->n.notify = vhost_iommu_unmap_notify;
-    iommu->n.notifier_flags = IOMMU_NOTIFIER_UNMAP;
+    end = int128_add(int128_make64(section->offset_within_region),
+                     section->size);
+    end = int128_sub(end, int128_one());
+    iommu_notifier_init(&iommu->n, vhost_iommu_unmap_notify,
+                        IOMMU_NOTIFIER_UNMAP,
+                        section->offset_within_region,
+                        int128_get64(end));
     iommu->mr = section->mr;
     iommu->iommu_offset = section->offset_within_address_space -
                           section->offset_within_region;
diff --git a/include/exec/memory.h b/include/exec/memory.h
index f20b191..0840c89 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -77,13 +77,30 @@ typedef enum {
 
 #define IOMMU_NOTIFIER_ALL (IOMMU_NOTIFIER_MAP | IOMMU_NOTIFIER_UNMAP)
 
+struct IOMMUNotifier;
+typedef void (*IOMMUNotify)(struct IOMMUNotifier *notifier,
+                            IOMMUTLBEntry *data);
+
 struct IOMMUNotifier {
-    void (*notify)(struct IOMMUNotifier *notifier, IOMMUTLBEntry *data);
+    IOMMUNotify notify;
     IOMMUNotifierFlag notifier_flags;
+    /* Notify for address space range start <= addr <= end */
+    hwaddr start;
+    hwaddr end;
     QLIST_ENTRY(IOMMUNotifier) node;
 };
 typedef struct IOMMUNotifier IOMMUNotifier;
 
+static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
+                                       IOMMUNotifierFlag flags,
+                                       hwaddr start, hwaddr end)
+{
+    n->notify = fn;
+    n->notifier_flags = flags;
+    n->start = start;
+    n->end = end;
+}
+
 /* New-style MMIO accessors can indicate that the transaction failed.
  * A zero (MEMTX_OK) response means success; anything else is a failure
  * of some kind. The memory subsystem will bitwise-OR together results
diff --git a/memory.c b/memory.c
index 4c95aaf..75ac595 100644
--- a/memory.c
+++ b/memory.c
@@ -1606,6 +1606,7 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr,
 
     /* We need to register for at least one bitfield */
     assert(n->notifier_flags != IOMMU_NOTIFIER_NONE);
+    assert(n->start <= n->end);
     QLIST_INSERT_HEAD(&mr->iommu_notify, n, node);
     memory_region_update_iommu_notify_flags(mr);
 }
@@ -1667,6 +1668,14 @@ void memory_region_notify_iommu(MemoryRegion *mr,
     }
 
     QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
+        /*
+         * Skip the notification if the notification does not overlap
+         * with registered range.
+         */
+        if (iommu_notifier->start > entry.iova + entry.addr_mask + 1 ||
+            iommu_notifier->end < entry.iova) {
+            continue;
+        }
         if (iommu_notifier->notifier_flags & request_flags) {
             iommu_notifier->notify(iommu_notifier, &entry);
         }
-- 
2.7.4

* [Qemu-devel] [PATCH v8 2/9] memory: provide IOMMU_NOTIFIER_FOREACH macro
  2017-04-06  7:08 [Qemu-devel] [PATCH v8 0/9] VT-d: vfio enablement and misc enhances Peter Xu
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 1/9] memory: add section range info for IOMMU notifier Peter Xu
@ 2017-04-06  7:08 ` Peter Xu
  2017-04-06 10:45   ` Auger Eric
  2017-04-06 11:54   ` Michael S. Tsirkin
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 3/9] memory: provide iommu_replay_all() Peter Xu
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 34+ messages in thread
From: Peter Xu @ 2017-04-06  7:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, Marcel Apfelbaum, jan.kiszka,
	jasowang, peterx, David Gibson, alex.williamson, bd.aviv

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 3 +++
 memory.c              | 4 ++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 0840c89..07e43da 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -239,6 +239,9 @@ struct MemoryRegion {
     IOMMUNotifierFlag iommu_notify_flags;
 };
 
+#define IOMMU_NOTIFIER_FOREACH(n, mr) \
+    QLIST_FOREACH((n), &(mr)->iommu_notify, node)
+
 /**
  * MemoryListener: callbacks structure for updates to the physical memory map
  *
diff --git a/memory.c b/memory.c
index 75ac595..7496b3d 100644
--- a/memory.c
+++ b/memory.c
@@ -1583,7 +1583,7 @@ static void memory_region_update_iommu_notify_flags(MemoryRegion *mr)
     IOMMUNotifierFlag flags = IOMMU_NOTIFIER_NONE;
     IOMMUNotifier *iommu_notifier;
 
-    QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
+    IOMMU_NOTIFIER_FOREACH(iommu_notifier, mr) {
         flags |= iommu_notifier->notifier_flags;
     }
 
@@ -1667,7 +1667,7 @@ void memory_region_notify_iommu(MemoryRegion *mr,
         request_flags = IOMMU_NOTIFIER_UNMAP;
     }
 
-    QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
+    IOMMU_NOTIFIER_FOREACH(iommu_notifier, mr) {
         /*
          * Skip the notification if the notification does not overlap
          * with registered range.
-- 
2.7.4

* [Qemu-devel] [PATCH v8 3/9] memory: provide iommu_replay_all()
  2017-04-06  7:08 [Qemu-devel] [PATCH v8 0/9] VT-d: vfio enablement and misc enhances Peter Xu
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 1/9] memory: add section range info for IOMMU notifier Peter Xu
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 2/9] memory: provide IOMMU_NOTIFIER_FOREACH macro Peter Xu
@ 2017-04-06  7:08 ` Peter Xu
  2017-04-06 10:52   ` Auger Eric
  2017-04-06 11:55   ` Michael S. Tsirkin
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 4/9] memory: introduce memory_region_notify_one() Peter Xu
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 34+ messages in thread
From: Peter Xu @ 2017-04-06  7:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, Marcel Apfelbaum, jan.kiszka,
	jasowang, peterx, David Gibson, alex.williamson, bd.aviv

This is a "global" version of the existing memory_region_iommu_replay()
- we announce the translations to all the registered notifiers, instead
of a specific one.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 8 ++++++++
 memory.c              | 9 +++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 07e43da..fb7dff3 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -713,6 +713,14 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
                                 bool is_write);
 
 /**
+ * memory_region_iommu_replay_all: replay existing IOMMU translations
+ * to all the notifiers registered.
+ *
+ * @mr: the memory region to observe
+ */
+void memory_region_iommu_replay_all(MemoryRegion *mr);
+
+/**
  * memory_region_unregister_iommu_notifier: unregister a notifier for
  * changes to IOMMU translation entries.
  *
diff --git a/memory.c b/memory.c
index 7496b3d..b4ed67b 100644
--- a/memory.c
+++ b/memory.c
@@ -1642,6 +1642,15 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
     }
 }
 
+void memory_region_iommu_replay_all(MemoryRegion *mr)
+{
+    IOMMUNotifier *notifier;
+
+    IOMMU_NOTIFIER_FOREACH(notifier, mr) {
+        memory_region_iommu_replay(mr, notifier, false);
+    }
+}
+
 void memory_region_unregister_iommu_notifier(MemoryRegion *mr,
                                              IOMMUNotifier *n)
 {
-- 
2.7.4

* [Qemu-devel] [PATCH v8 4/9] memory: introduce memory_region_notify_one()
  2017-04-06  7:08 [Qemu-devel] [PATCH v8 0/9] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (2 preceding siblings ...)
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 3/9] memory: provide iommu_replay_all() Peter Xu
@ 2017-04-06  7:08 ` Peter Xu
  2017-04-06 10:54   ` Auger Eric
  2017-04-06 11:55   ` Michael S. Tsirkin
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 5/9] memory: add MemoryRegionIOMMUOps.replay() callback Peter Xu
                   ` (6 subsequent siblings)
  10 siblings, 2 replies; 34+ messages in thread
From: Peter Xu @ 2017-04-06  7:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, Marcel Apfelbaum, jan.kiszka,
	jasowang, peterx, David Gibson, alex.williamson, bd.aviv

Generalize the notify logic in memory_region_notify_iommu() into a
single function. This can further be used in customized replay()
functions for IOMMUs.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 15 +++++++++++++++
 memory.c              | 40 ++++++++++++++++++++++++----------------
 2 files changed, 39 insertions(+), 16 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index fb7dff3..055b3a8 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -688,6 +688,21 @@ void memory_region_notify_iommu(MemoryRegion *mr,
                                 IOMMUTLBEntry entry);
 
 /**
+ * memory_region_notify_one: notify a change in an IOMMU translation
+ *                           entry to a single notifier
+ *
+ * This works just like memory_region_notify_iommu(), but it only
+ * notifies a specific notifier, not all of them.
+ *
+ * @notifier: the notifier to be notified
+ * @entry: the new entry in the IOMMU translation table.  The entry
+ *         replaces all old entries for the same virtual I/O address range.
+ *         Deleted entries have .@perm == 0.
+ */
+void memory_region_notify_one(IOMMUNotifier *notifier,
+                              IOMMUTLBEntry *entry);
+
+/**
  * memory_region_register_iommu_notifier: register a notifier for changes to
  * IOMMU translation entries.
  *
diff --git a/memory.c b/memory.c
index b4ed67b..ded4bf1 100644
--- a/memory.c
+++ b/memory.c
@@ -1662,32 +1662,40 @@ void memory_region_unregister_iommu_notifier(MemoryRegion *mr,
     memory_region_update_iommu_notify_flags(mr);
 }
 
-void memory_region_notify_iommu(MemoryRegion *mr,
-                                IOMMUTLBEntry entry)
+void memory_region_notify_one(IOMMUNotifier *notifier,
+                              IOMMUTLBEntry *entry)
 {
-    IOMMUNotifier *iommu_notifier;
     IOMMUNotifierFlag request_flags;
 
-    assert(memory_region_is_iommu(mr));
+    /*
+     * Skip the notification if the notification does not overlap
+     * with registered range.
+     */
+    if (notifier->start > entry->iova + entry->addr_mask + 1 ||
+        notifier->end < entry->iova) {
+        return;
+    }
 
-    if (entry.perm & IOMMU_RW) {
+    if (entry->perm & IOMMU_RW) {
         request_flags = IOMMU_NOTIFIER_MAP;
     } else {
         request_flags = IOMMU_NOTIFIER_UNMAP;
     }
 
+    if (notifier->notifier_flags & request_flags) {
+        notifier->notify(notifier, entry);
+    }
+}
+
+void memory_region_notify_iommu(MemoryRegion *mr,
+                                IOMMUTLBEntry entry)
+{
+    IOMMUNotifier *iommu_notifier;
+
+    assert(memory_region_is_iommu(mr));
+
     IOMMU_NOTIFIER_FOREACH(iommu_notifier, mr) {
-        /*
-         * Skip the notification if the notification does not overlap
-         * with registered range.
-         */
-        if (iommu_notifier->start > entry.iova + entry.addr_mask + 1 ||
-            iommu_notifier->end < entry.iova) {
-            continue;
-        }
-        if (iommu_notifier->notifier_flags & request_flags) {
-            iommu_notifier->notify(iommu_notifier, &entry);
-        }
+        memory_region_notify_one(iommu_notifier, &entry);
     }
 }
 
-- 
2.7.4

* [Qemu-devel] [PATCH v8 5/9] memory: add MemoryRegionIOMMUOps.replay() callback
  2017-04-06  7:08 [Qemu-devel] [PATCH v8 0/9] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (3 preceding siblings ...)
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 4/9] memory: introduce memory_region_notify_one() Peter Xu
@ 2017-04-06  7:08 ` Peter Xu
  2017-04-06 10:58   ` Auger Eric
  2017-04-06 11:57   ` Michael S. Tsirkin
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 6/9] intel_iommu: use the correct memory region for device IOTLB notification Peter Xu
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 34+ messages in thread
From: Peter Xu @ 2017-04-06  7:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, Marcel Apfelbaum, jan.kiszka,
	jasowang, peterx, David Gibson, alex.williamson, bd.aviv

Originally we had one memory_region_iommu_replay() function, which is
the default behavior to replay the translations of the whole IOMMU
region. However, on some platforms like x86, we may want our own replay
logic for IOMMU regions. This patch adds one more hook to
MemoryRegionIOMMUOps for the callback; it'll override the default if set.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 2 ++
 memory.c              | 6 ++++++
 2 files changed, 8 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 055b3a8..c0280b7 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
     void (*notify_flag_changed)(MemoryRegion *iommu,
                                 IOMMUNotifierFlag old_flags,
                                 IOMMUNotifierFlag new_flags);
+    /* Set this up to provide customized IOMMU replay function */
+    void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
 };
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
diff --git a/memory.c b/memory.c
index ded4bf1..b782d5b 100644
--- a/memory.c
+++ b/memory.c
@@ -1626,6 +1626,12 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
     hwaddr addr, granularity;
     IOMMUTLBEntry iotlb;
 
+    /* If the IOMMU has its own replay callback, override */
+    if (mr->iommu_ops->replay) {
+        mr->iommu_ops->replay(mr, n);
+        return;
+    }
+
     granularity = memory_region_iommu_get_min_page_size(mr);
 
     for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
-- 
2.7.4

* [Qemu-devel] [PATCH v8 6/9] intel_iommu: use the correct memory region for device IOTLB notification
  2017-04-06  7:08 [Qemu-devel] [PATCH v8 0/9] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (4 preceding siblings ...)
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 5/9] memory: add MemoryRegionIOMMUOps.replay() callback Peter Xu
@ 2017-04-06  7:08 ` Peter Xu
  2017-04-06 11:57   ` Michael S. Tsirkin
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 7/9] intel_iommu: provide its own replay() callback Peter Xu
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2017-04-06  7:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, Marcel Apfelbaum, jan.kiszka,
	jasowang, peterx, David Gibson, alex.williamson, bd.aviv,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost

From: Jason Wang <jasowang@redhat.com>

We have a specific memory region for DMAR now, so it's wrong to
trigger the notifier with the root region.

Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 22d8226..2412df4 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1457,7 +1457,7 @@ static bool vtd_process_device_iotlb_desc(IntelIOMMUState *s,
     entry.iova = addr;
     entry.perm = IOMMU_NONE;
     entry.translated_addr = 0;
-    memory_region_notify_iommu(entry.target_as->root, entry);
+    memory_region_notify_iommu(&vtd_dev_as->iommu, entry);
 
 done:
     return true;
-- 
2.7.4

* [Qemu-devel] [PATCH v8 7/9] intel_iommu: provide its own replay() callback
  2017-04-06  7:08 [Qemu-devel] [PATCH v8 0/9] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (5 preceding siblings ...)
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 6/9] intel_iommu: use the correct memory region for device IOTLB notification Peter Xu
@ 2017-04-06  7:08 ` Peter Xu
  2017-04-06 11:57   ` Michael S. Tsirkin
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 8/9] intel_iommu: allow dynamic switch of IOMMU region Peter Xu
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2017-04-06  7:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, Marcel Apfelbaum, jan.kiszka,
	jasowang, peterx, David Gibson, alex.williamson, bd.aviv

The default replay() doesn't work for VT-d since VT-d will have a huge
default memory region which covers the address range 0-(2^64-1).
Replaying that normally consumes a lot of time (so much that it looks
like a dead loop).

The solution is simple - we don't walk over all the regions. Instead,
we jump over the regions when we find that the page directories are
empty. It'll greatly reduce the time to walk the whole region.

To achieve this, we provide a page walk helper, invoking a
corresponding hook function when we find a page we are interested in.
vtd_page_walk_level() is the core logic for the page walking. Its
interface is designed to suit further use cases, e.g., invalidating a
range of addresses.

Reviewed-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 182 ++++++++++++++++++++++++++++++++++++++++++++++++--
 hw/i386/trace-events  |   7 ++
 include/exec/memory.h |   2 +
 3 files changed, 186 insertions(+), 5 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 2412df4..7af4e22 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -595,6 +595,22 @@ static inline uint32_t vtd_get_agaw_from_context_entry(VTDContextEntry *ce)
     return 30 + (ce->hi & VTD_CONTEXT_ENTRY_AW) * 9;
 }
 
+static inline uint64_t vtd_iova_limit(VTDContextEntry *ce)
+{
+    uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
+    return 1ULL << MIN(ce_agaw, VTD_MGAW);
+}
+
+/* Return true if IOVA passes range check, otherwise false. */
+static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce)
+{
+    /*
+     * Check if @iova is above 2^X-1, where X is the minimum of MGAW
+     * in CAP_REG and AW in context-entry.
+     */
+    return !(iova & ~(vtd_iova_limit(ce) - 1));
+}
+
 static const uint64_t vtd_paging_entry_rsvd_field[] = {
     [0] = ~0ULL,
     /* For not large page */
@@ -630,13 +646,9 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
     uint32_t level = vtd_get_level_from_context_entry(ce);
     uint32_t offset;
     uint64_t slpte;
-    uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
     uint64_t access_right_check;
 
-    /* Check if @iova is above 2^X-1, where X is the minimum of MGAW
-     * in CAP_REG and AW in context-entry.
-     */
-    if (iova & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
+    if (!vtd_iova_range_check(iova, ce)) {
         VTD_DPRINTF(GENERAL, "error: iova 0x%"PRIx64 " exceeds limits", iova);
         return -VTD_FR_ADDR_BEYOND_MGAW;
     }
@@ -684,6 +696,134 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
     }
 }
 
+typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
+
+/**
+ * vtd_page_walk_level - walk over specific level for IOVA range
+ *
+ * @addr: base GPA addr to start the walk
+ * @start: IOVA range start address
+ * @end: IOVA range end address (start <= iova < end)
+ * @hook_fn: hook func to be called for each detected page
+ * @private: private data to be passed into hook func
+ * @read: whether parent level has read permission
+ * @write: whether parent level has write permission
+ * @notify_unmap: whether we should notify invalid entries
+ */
+static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
+                               uint64_t end, vtd_page_walk_hook hook_fn,
+                               void *private, uint32_t level,
+                               bool read, bool write, bool notify_unmap)
+{
+    bool read_cur, write_cur, entry_valid;
+    uint32_t offset;
+    uint64_t slpte;
+    uint64_t subpage_size, subpage_mask;
+    IOMMUTLBEntry entry;
+    uint64_t iova = start;
+    uint64_t iova_next;
+    int ret = 0;
+
+    trace_vtd_page_walk_level(addr, level, start, end);
+
+    subpage_size = 1ULL << vtd_slpt_level_shift(level);
+    subpage_mask = vtd_slpt_level_page_mask(level);
+
+    while (iova < end) {
+        iova_next = (iova & subpage_mask) + subpage_size;
+
+        offset = vtd_iova_level_offset(iova, level);
+        slpte = vtd_get_slpte(addr, offset);
+
+        if (slpte == (uint64_t)-1) {
+            trace_vtd_page_walk_skip_read(iova, iova_next);
+            goto next;
+        }
+
+        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
+            trace_vtd_page_walk_skip_reserve(iova, iova_next);
+            goto next;
+        }
+
+        /* Permissions are stacked with parents' */
+        read_cur = read && (slpte & VTD_SL_R);
+        write_cur = write && (slpte & VTD_SL_W);
+
+        /*
+         * As long as we have either read/write permission, this is a
+         * valid entry. The rule works for both page entries and page
+         * table entries.
+         */
+        entry_valid = read_cur | write_cur;
+
+        if (vtd_is_last_slpte(slpte, level)) {
+            entry.target_as = &address_space_memory;
+            entry.iova = iova & subpage_mask;
+            /* NOTE: this is only meaningful if entry_valid == true */
+            entry.translated_addr = vtd_get_slpte_addr(slpte);
+            entry.addr_mask = ~subpage_mask;
+            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
+            if (!entry_valid && !notify_unmap) {
+                trace_vtd_page_walk_skip_perm(iova, iova_next);
+                goto next;
+            }
+            trace_vtd_page_walk_one(level, entry.iova, entry.translated_addr,
+                                    entry.addr_mask, entry.perm);
+            if (hook_fn) {
+                ret = hook_fn(&entry, private);
+                if (ret < 0) {
+                    return ret;
+                }
+            }
+        } else {
+            if (!entry_valid) {
+                trace_vtd_page_walk_skip_perm(iova, iova_next);
+                goto next;
+            }
+            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte), iova,
+                                      MIN(iova_next, end), hook_fn, private,
+                                      level - 1, read_cur, write_cur,
+                                      notify_unmap);
+            if (ret < 0) {
+                return ret;
+            }
+        }
+
+next:
+        iova = iova_next;
+    }
+
+    return 0;
+}
+
+/**
+ * vtd_page_walk - walk specific IOVA range, and call the hook
+ *
+ * @ce: context entry to walk upon
+ * @start: IOVA address to start the walk
+ * @end: IOVA range end address (start <= iova < end)
+ * @hook_fn: the hook to be called for each detected area
+ * @private: private data for the hook function
+ */
+static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
+                         vtd_page_walk_hook hook_fn, void *private)
+{
+    dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
+    uint32_t level = vtd_get_level_from_context_entry(ce);
+
+    if (!vtd_iova_range_check(start, ce)) {
+        return -VTD_FR_ADDR_BEYOND_MGAW;
+    }
+
+    if (!vtd_iova_range_check(end, ce)) {
+        /* Fix end so that it reaches the maximum */
+        end = vtd_iova_limit(ce);
+    }
+
+    return vtd_page_walk_level(addr, start, end, hook_fn, private,
+                               level, true, true, false);
+}
+
 /* Map a device to its corresponding domain (context-entry) */
 static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
                                     uint8_t devfn, VTDContextEntry *ce)
@@ -2402,6 +2542,37 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
     return vtd_dev_as;
 }
 
+static int vtd_replay_hook(IOMMUTLBEntry *entry, void *private)
+{
+    memory_region_notify_one((IOMMUNotifier *)private, entry);
+    return 0;
+}
+
+static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
+{
+    VTDAddressSpace *vtd_as = container_of(mr, VTDAddressSpace, iommu);
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    uint8_t bus_n = pci_bus_num(vtd_as->bus);
+    VTDContextEntry ce;
+
+    if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
+        /*
+         * Scanned a valid context entry, walk over the pages and
+         * notify when needed.
+         */
+        trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
+                                  PCI_FUNC(vtd_as->devfn),
+                                  VTD_CONTEXT_ENTRY_DID(ce.hi),
+                                  ce.hi, ce.lo);
+        vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n);
+    } else {
+        trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
+                                    PCI_FUNC(vtd_as->devfn));
+    }
+
+    return;
+}
+
 /* Do the initialization. It will also be called when reset, so pay
  * attention when adding new initialization stuff.
  */
@@ -2416,6 +2587,7 @@ static void vtd_init(IntelIOMMUState *s)
 
     s->iommu_ops.translate = vtd_iommu_translate;
     s->iommu_ops.notify_flag_changed = vtd_iommu_notify_flag_changed;
+    s->iommu_ops.replay = vtd_iommu_replay;
     s->root = 0;
     s->root_extended = false;
     s->dmar_enabled = false;
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index baed874..f725bca 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -30,6 +30,13 @@ vtd_iotlb_cc_hit(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32
 vtd_iotlb_cc_update(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen1, uint32_t gen2) "IOTLB context update bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32" -> gen %"PRIu32
 vtd_iotlb_reset(const char *reason) "IOTLB reset (reason: %s)"
 vtd_fault_disabled(void) "Fault processing disabled for context entry"
+vtd_replay_ce_valid(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t domain, uint64_t hi, uint64_t lo) "replay valid context device %02"PRIx8":%02"PRIx8".%02"PRIx8" domain 0x%"PRIx16" hi 0x%"PRIx64" lo 0x%"PRIx64
+vtd_replay_ce_invalid(uint8_t bus, uint8_t dev, uint8_t fn) "replay invalid context device %02"PRIx8":%02"PRIx8".%02"PRIx8
+vtd_page_walk_level(uint64_t addr, uint32_t level, uint64_t start, uint64_t end) "walk (base=0x%"PRIx64", level=%"PRIu32") iova range 0x%"PRIx64" - 0x%"PRIx64
+vtd_page_walk_one(uint32_t level, uint64_t iova, uint64_t gpa, uint64_t mask, int perm) "detected page level 0x%"PRIx32" iova 0x%"PRIx64" -> gpa 0x%"PRIx64" mask 0x%"PRIx64" perm %d"
+vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
+vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
+vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
 
 # hw/i386/amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
diff --git a/include/exec/memory.h b/include/exec/memory.h
index c0280b7..c4fc94d 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -55,6 +55,8 @@ typedef enum {
     IOMMU_RW   = 3,
 } IOMMUAccessFlags;
 
+#define IOMMU_ACCESS_FLAG(r, w) (((r) ? IOMMU_RO : 0) | ((w) ? IOMMU_WO : 0))
+
 struct IOMMUTLBEntry {
     AddressSpace    *target_as;
     hwaddr           iova;
-- 
2.7.4


* [Qemu-devel] [PATCH v8 8/9] intel_iommu: allow dynamic switch of IOMMU region
  2017-04-06  7:08 [Qemu-devel] [PATCH v8 0/9] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (6 preceding siblings ...)
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 7/9] intel_iommu: provide its own replay() callback Peter Xu
@ 2017-04-06  7:08 ` Peter Xu
  2017-04-06 11:58   ` Michael S. Tsirkin
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 9/9] intel_iommu: enable remote IOTLB Peter Xu
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2017-04-06  7:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, Marcel Apfelbaum, jan.kiszka,
	jasowang, peterx, David Gibson, alex.williamson, bd.aviv

This is preparation work to finally enable dynamic ON/OFF switching of
VT-d protection. The old VT-d code uses a static IOMMU address space,
and that won't satisfy vfio-pci device listeners.

Let me explain.

vfio-pci devices depend on the memory region listener and the IOMMU
replay mechanism to keep the device mapping coherent with the guest
even across domain switches. There are two kinds of domain switches:

  (1) switch from domain A -> B
  (2) switch from domain A -> no domain (e.g., turn DMAR off)

Case (1) is handled by the context entry invalidation handling in the
VT-d replay logic. What the replay function should do there is replay
the existing page mappings in domain B.

However, for case (2) we don't want to replay any domain mappings - we
just need the default GPA->HPA mappings (the address_space_memory
mapping). This patch handles case (2) by building up that mapping
automatically, leveraging the vfio-pci memory listeners.

Another important thing this patch does is separate IR (Interrupt
Remapping) from DMAR (DMA Remapping). The IR region should not depend
on the DMAR region (as it did before this patch). It should be a
standalone region that can be activated without DMAR (which is a
common behavior of the Linux kernel - by default it enables IR while
leaving DMAR disabled).

Reviewed-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c         | 81 ++++++++++++++++++++++++++++++++++++++++---
 hw/i386/trace-events          |  2 +-
 include/hw/i386/intel_iommu.h |  2 ++
 3 files changed, 79 insertions(+), 6 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 7af4e22..f7dec82 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1291,9 +1291,49 @@ static void vtd_handle_gcmd_sirtp(IntelIOMMUState *s)
     vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_IRTPS);
 }
 
+static void vtd_switch_address_space(VTDAddressSpace *as)
+{
+    assert(as);
+
+    trace_vtd_switch_address_space(pci_bus_num(as->bus),
+                                   VTD_PCI_SLOT(as->devfn),
+                                   VTD_PCI_FUNC(as->devfn),
+                                   as->iommu_state->dmar_enabled);
+
+    /* Turn off first then on the other */
+    if (as->iommu_state->dmar_enabled) {
+        memory_region_set_enabled(&as->sys_alias, false);
+        memory_region_set_enabled(&as->iommu, true);
+    } else {
+        memory_region_set_enabled(&as->iommu, false);
+        memory_region_set_enabled(&as->sys_alias, true);
+    }
+}
+
+static void vtd_switch_address_space_all(IntelIOMMUState *s)
+{
+    GHashTableIter iter;
+    VTDBus *vtd_bus;
+    int i;
+
+    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
+    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
+        for (i = 0; i < X86_IOMMU_PCI_DEVFN_MAX; i++) {
+            if (!vtd_bus->dev_as[i]) {
+                continue;
+            }
+            vtd_switch_address_space(vtd_bus->dev_as[i]);
+        }
+    }
+}
+
 /* Handle Translation Enable/Disable */
 static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
 {
+    if (s->dmar_enabled == en) {
+        return;
+    }
+
     VTD_DPRINTF(CSR, "Translation Enable %s", (en ? "on" : "off"));
 
     if (en) {
@@ -1308,6 +1348,8 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
         /* Ok - report back to driver */
         vtd_set_clear_mask_long(s, DMAR_GSTS_REG, VTD_GSTS_TES, 0);
     }
+
+    vtd_switch_address_space_all(s);
 }
 
 /* Handle Interrupt Remap Enable/Disable */
@@ -2529,15 +2571,44 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
         vtd_dev_as->devfn = (uint8_t)devfn;
         vtd_dev_as->iommu_state = s;
         vtd_dev_as->context_cache_entry.context_cache_gen = 0;
+
+        /*
+         * Memory region relationships looks like (Address range shows
+         * only lower 32 bits to make it short in length...):
+         *
+         * |-----------------+-------------------+----------|
+         * | Name            | Address range     | Priority |
+         * |-----------------+-------------------+----------+
+         * | vtd_root        | 00000000-ffffffff |        0 |
+         * |  intel_iommu    | 00000000-ffffffff |        1 |
+         * |  vtd_sys_alias  | 00000000-ffffffff |        1 |
+         * |  intel_iommu_ir | fee00000-feefffff |       64 |
+         * |-----------------+-------------------+----------|
+         *
+         * We enable/disable DMAR by switching enablement for
+         * vtd_sys_alias and intel_iommu regions. IR region is always
+         * enabled.
+         */
         memory_region_init_iommu(&vtd_dev_as->iommu, OBJECT(s),
-                                 &s->iommu_ops, "intel_iommu", UINT64_MAX);
+                                 &s->iommu_ops, "intel_iommu_dmar",
+                                 UINT64_MAX);
+        memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
+                                 "vtd_sys_alias", get_system_memory(),
+                                 0, memory_region_size(get_system_memory()));
         memory_region_init_io(&vtd_dev_as->iommu_ir, OBJECT(s),
                               &vtd_mem_ir_ops, s, "intel_iommu_ir",
                               VTD_INTERRUPT_ADDR_SIZE);
-        memory_region_add_subregion(&vtd_dev_as->iommu, VTD_INTERRUPT_ADDR_FIRST,
-                                    &vtd_dev_as->iommu_ir);
-        address_space_init(&vtd_dev_as->as,
-                           &vtd_dev_as->iommu, name);
+        memory_region_init(&vtd_dev_as->root, OBJECT(s),
+                           "vtd_root", UINT64_MAX);
+        memory_region_add_subregion_overlap(&vtd_dev_as->root,
+                                            VTD_INTERRUPT_ADDR_FIRST,
+                                            &vtd_dev_as->iommu_ir, 64);
+        address_space_init(&vtd_dev_as->as, &vtd_dev_as->root, name);
+        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
+                                            &vtd_dev_as->sys_alias, 1);
+        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
+                                            &vtd_dev_as->iommu, 1);
+        vtd_switch_address_space(vtd_dev_as);
     }
     return vtd_dev_as;
 }
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index f725bca..3c3a167 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -4,7 +4,6 @@
 x86_iommu_iec_notify(bool global, uint32_t index, uint32_t mask) "Notify IEC invalidation: global=%d index=%" PRIu32 " mask=%" PRIu32
 
 # hw/i386/intel_iommu.c
-vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
 vtd_inv_desc(const char *type, uint64_t hi, uint64_t lo) "invalidate desc type %s high 0x%"PRIx64" low 0x%"PRIx64
 vtd_inv_desc_invalid(uint64_t hi, uint64_t lo) "invalid inv desc hi 0x%"PRIx64" lo 0x%"PRIx64
 vtd_inv_desc_cc_domain(uint16_t domain) "context invalidate domain 0x%"PRIx16
@@ -37,6 +36,7 @@ vtd_page_walk_one(uint32_t level, uint64_t iova, uint64_t gpa, uint64_t mask, in
 vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
 vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
 vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
+vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
 
 # hw/i386/amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index fe645aa..8f212a1 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -83,6 +83,8 @@ struct VTDAddressSpace {
     uint8_t devfn;
     AddressSpace as;
     MemoryRegion iommu;
+    MemoryRegion root;
+    MemoryRegion sys_alias;
     MemoryRegion iommu_ir;      /* Interrupt region: 0xfeeXXXXX */
     IntelIOMMUState *iommu_state;
     VTDContextCacheEntry context_cache_entry;
-- 
2.7.4


* [Qemu-devel] [PATCH v8 9/9] intel_iommu: enable remote IOTLB
  2017-04-06  7:08 [Qemu-devel] [PATCH v8 0/9] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (7 preceding siblings ...)
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 8/9] intel_iommu: allow dynamic switch of IOMMU region Peter Xu
@ 2017-04-06  7:08 ` Peter Xu
  2017-04-06 11:58   ` Michael S. Tsirkin
  2017-04-06 11:53 ` [Qemu-devel] [PATCH v8 0/9] VT-d: vfio enablement and misc enhances Michael S. Tsirkin
  2017-04-06 12:00 ` Michael S. Tsirkin
  10 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2017-04-06  7:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, Marcel Apfelbaum, jan.kiszka,
	jasowang, peterx, David Gibson, alex.williamson, bd.aviv,
	Aviv Ben-David

This patch is based on Aviv Ben-David (<bd.aviv@gmail.com>)'s patch
upstream:

  "IOMMU: enable intel_iommu map and unmap notifiers"
  https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01453.html

However, I removed/fixed some content and added my own code.

Instead of calling translate() on every page for IOTLB invalidations
(which is slower), we walk the pages only when needed and notify via a
hook function.

This patch enables vfio devices for VT-d emulation.

And, since we already have vhost DMAR support via device-iotlb, a
natural benefit this patch brings is that VT-d enabled vhost can now
live even without the ATS capability. Though more tests are needed.

Signed-off-by: Aviv Ben-David <bdaviv@cs.technion.ac.il>
Reviewed-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c          | 191 ++++++++++++++++++++++++++++++++++++++---
 hw/i386/intel_iommu_internal.h |   1 +
 hw/i386/trace-events           |   1 +
 include/hw/i386/intel_iommu.h  |   8 ++
 4 files changed, 188 insertions(+), 13 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index f7dec82..02f047c 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -806,7 +806,8 @@ next:
  * @private: private data for the hook function
  */
 static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
-                         vtd_page_walk_hook hook_fn, void *private)
+                         vtd_page_walk_hook hook_fn, void *private,
+                         bool notify_unmap)
 {
     dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
     uint32_t level = vtd_get_level_from_context_entry(ce);
@@ -821,7 +822,7 @@ static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
     }
 
     return vtd_page_walk_level(addr, start, end, hook_fn, private,
-                               level, true, true, false);
+                               level, true, true, notify_unmap);
 }
 
 /* Map a device to its corresponding domain (context-entry) */
@@ -1038,6 +1039,15 @@ static void vtd_interrupt_remap_table_setup(IntelIOMMUState *s)
                 s->intr_root, s->intr_size);
 }
 
+static void vtd_iommu_replay_all(IntelIOMMUState *s)
+{
+    IntelIOMMUNotifierNode *node;
+
+    QLIST_FOREACH(node, &s->notifiers_list, next) {
+        memory_region_iommu_replay_all(&node->vtd_as->iommu);
+    }
+}
+
 static void vtd_context_global_invalidate(IntelIOMMUState *s)
 {
     trace_vtd_inv_desc_cc_global();
@@ -1045,6 +1055,14 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
     if (s->context_cache_gen == VTD_CONTEXT_CACHE_GEN_MAX) {
         vtd_reset_context_cache(s);
     }
+    /*
+     * From VT-d spec 6.5.2.1, a global context entry invalidation
+     * should be followed by an IOTLB global invalidation, so we
+     * should be safe even without this. However, let's replay the
+     * region as well to be safer, and come back here when we need
+     * finer tuning for the VT-d emulation code.
+     */
+    vtd_iommu_replay_all(s);
 }
 
 
@@ -1111,6 +1129,16 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
                 trace_vtd_inv_desc_cc_device(bus_n, VTD_PCI_SLOT(devfn_it),
                                              VTD_PCI_FUNC(devfn_it));
                 vtd_as->context_cache_entry.context_cache_gen = 0;
+                /*
+                 * A device is moving out of (or moving into) a
+                 * domain; a replay() suits here to notify all the
+                 * IOMMU_NOTIFIER_MAP registrants about this change.
+                 * This does no harm even if we have no such
+                 * notifier registered - the IOMMU notification
+                 * framework will skip MAP notifications in that
+                 * case.
+                 */
+                memory_region_iommu_replay_all(&vtd_as->iommu);
             }
         }
     }
@@ -1152,12 +1180,53 @@ static void vtd_iotlb_global_invalidate(IntelIOMMUState *s)
 {
     trace_vtd_iotlb_reset("global invalidation recved");
     vtd_reset_iotlb(s);
+    vtd_iommu_replay_all(s);
 }
 
 static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
 {
+    IntelIOMMUNotifierNode *node;
+    VTDContextEntry ce;
+    VTDAddressSpace *vtd_as;
+
     g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_domain,
                                 &domain_id);
+
+    QLIST_FOREACH(node, &s->notifiers_list, next) {
+        vtd_as = node->vtd_as;
+        if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
+                                      vtd_as->devfn, &ce) &&
+            domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
+            memory_region_iommu_replay_all(&vtd_as->iommu);
+        }
+    }
+}
+
+static int vtd_page_invalidate_notify_hook(IOMMUTLBEntry *entry,
+                                           void *private)
+{
+    memory_region_notify_iommu((MemoryRegion *)private, *entry);
+    return 0;
+}
+
+static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
+                                           uint16_t domain_id, hwaddr addr,
+                                           uint8_t am)
+{
+    IntelIOMMUNotifierNode *node;
+    VTDContextEntry ce;
+    int ret;
+
+    QLIST_FOREACH(node, &(s->notifiers_list), next) {
+        VTDAddressSpace *vtd_as = node->vtd_as;
+        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
+                                       vtd_as->devfn, &ce);
+        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
+            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
+                          vtd_page_invalidate_notify_hook,
+                          (void *)&vtd_as->iommu, true);
+        }
+    }
 }
 
 static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
@@ -1170,6 +1239,7 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
     info.addr = addr;
     info.mask = ~((1 << am) - 1);
     g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
+    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
 }
 
 /* Flush IOTLB
@@ -2187,15 +2257,33 @@ static void vtd_iommu_notify_flag_changed(MemoryRegion *iommu,
                                           IOMMUNotifierFlag new)
 {
     VTDAddressSpace *vtd_as = container_of(iommu, VTDAddressSpace, iommu);
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    IntelIOMMUNotifierNode *node = NULL;
+    IntelIOMMUNotifierNode *next_node = NULL;
 
-    if (new & IOMMU_NOTIFIER_MAP) {
-        error_report("Device at bus %s addr %02x.%d requires iommu "
-                     "notifier which is currently not supported by "
-                     "intel-iommu emulation",
-                     vtd_as->bus->qbus.name, PCI_SLOT(vtd_as->devfn),
-                     PCI_FUNC(vtd_as->devfn));
+    if (!s->caching_mode && new & IOMMU_NOTIFIER_MAP) {
+        error_report("We need to set cache_mode=1 for intel-iommu to enable "
+                     "device assignment with IOMMU protection.");
         exit(1);
     }
+
+    if (old == IOMMU_NOTIFIER_NONE) {
+        node = g_malloc0(sizeof(*node));
+        node->vtd_as = vtd_as;
+        QLIST_INSERT_HEAD(&s->notifiers_list, node, next);
+        return;
+    }
+
+    /* update notifier node with new flags */
+    QLIST_FOREACH_SAFE(node, &s->notifiers_list, next, next_node) {
+        if (node->vtd_as == vtd_as) {
+            if (new == IOMMU_NOTIFIER_NONE) {
+                QLIST_REMOVE(node, next);
+                g_free(node);
+            }
+            return;
+        }
+    }
 }
 
 static const VMStateDescription vtd_vmstate = {
@@ -2613,6 +2701,74 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
     return vtd_dev_as;
 }
 
+/* Unmap the whole range in the notifier's scope. */
+static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
+{
+    IOMMUTLBEntry entry;
+    hwaddr size;
+    hwaddr start = n->start;
+    hwaddr end = n->end;
+
+    /*
+     * Note: all the code in this function assumes that IOVA bits
+     * are no more than VTD_MGAW bits (as restricted by the VT-d
+     * spec); otherwise we would need to consider 64-bit overflow.
+     */
+
+    if (end > VTD_ADDRESS_SIZE) {
+        /*
+         * No need to unmap regions that are bigger than the whole
+         * VT-d supported address space size
+         */
+        end = VTD_ADDRESS_SIZE;
+    }
+
+    assert(start <= end);
+    size = end - start;
+
+    if (ctpop64(size) != 1) {
+        /*
+         * This size cannot form a correct mask. Let's enlarge it
+         * to the smallest power of two that can.
+         */
+        int n = 64 - clz64(size);
+        if (n > VTD_MGAW) {
+            /* should not happen, but in case it happens, limit it */
+            n = VTD_MGAW;
+        }
+        size = 1ULL << n;
+    }
+
+    entry.target_as = &address_space_memory;
+    /* Adjust iova for the size */
+    entry.iova = n->start & ~(size - 1);
+    /* This field is meaningless for unmap */
+    entry.translated_addr = 0;
+    entry.perm = IOMMU_NONE;
+    entry.addr_mask = size - 1;
+
+    trace_vtd_as_unmap_whole(pci_bus_num(as->bus),
+                             VTD_PCI_SLOT(as->devfn),
+                             VTD_PCI_FUNC(as->devfn),
+                             entry.iova, size);
+
+    memory_region_notify_one(n, &entry);
+}
+
+static void vtd_address_space_unmap_all(IntelIOMMUState *s)
+{
+    IntelIOMMUNotifierNode *node;
+    VTDAddressSpace *vtd_as;
+    IOMMUNotifier *n;
+
+    QLIST_FOREACH(node, &s->notifiers_list, next) {
+        vtd_as = node->vtd_as;
+        IOMMU_NOTIFIER_FOREACH(n, &vtd_as->iommu) {
+            vtd_address_space_unmap(vtd_as, n);
+        }
+    }
+}
+
 static int vtd_replay_hook(IOMMUTLBEntry *entry, void *private)
 {
     memory_region_notify_one((IOMMUNotifier *)private, entry);
@@ -2626,16 +2782,19 @@ static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
     uint8_t bus_n = pci_bus_num(vtd_as->bus);
     VTDContextEntry ce;
 
+    /*
+     * The replay can be triggered by either an invalidation or a
+     * newly created entry. Either way, we release the existing
+     * mappings (i.e., we flush caches for UNMAP-only registrants).
+     */
+    vtd_address_space_unmap(vtd_as, n);
+
     if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
-        /*
-         * Scanned a valid context entry, walk over the pages and
-         * notify when needed.
-         */
         trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
                                   PCI_FUNC(vtd_as->devfn),
                                   VTD_CONTEXT_ENTRY_DID(ce.hi),
                                   ce.hi, ce.lo);
-        vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n);
+        vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false);
     } else {
         trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
                                     PCI_FUNC(vtd_as->devfn));
@@ -2754,6 +2913,11 @@ static void vtd_reset(DeviceState *dev)
 
     VTD_DPRINTF(GENERAL, "");
     vtd_init(s);
+
+    /*
+     * When device reset, throw away all mappings and external caches
+     */
+    vtd_address_space_unmap_all(s);
 }
 
 static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
@@ -2817,6 +2981,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
         return;
     }
 
+    QLIST_INIT(&s->notifiers_list);
     memset(s->vtd_as_by_bus_num, 0, sizeof(s->vtd_as_by_bus_num));
     memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
                           "intel_iommu", DMAR_REG_SIZE);
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 4104121..29d6707 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -197,6 +197,7 @@
 #define VTD_DOMAIN_ID_MASK          ((1UL << VTD_DOMAIN_ID_SHIFT) - 1)
 #define VTD_CAP_ND                  (((VTD_DOMAIN_ID_SHIFT - 4) / 2) & 7ULL)
 #define VTD_MGAW                    39  /* Maximum Guest Address Width */
+#define VTD_ADDRESS_SIZE            (1ULL << VTD_MGAW)
 #define VTD_CAP_MGAW                (((VTD_MGAW - 1) & 0x3fULL) << 16)
 #define VTD_MAMV                    18ULL
 #define VTD_CAP_MAMV                (VTD_MAMV << 48)
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 3c3a167..04a6980 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -37,6 +37,7 @@ vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"P
 vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
 vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
 vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
+vtd_as_unmap_whole(uint8_t bus, uint8_t slot, uint8_t fn, uint64_t iova, uint64_t size) "Device %02x:%02x.%x start 0x%"PRIx64" size 0x%"PRIx64
 
 # hw/i386/amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 8f212a1..3e51876 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -63,6 +63,7 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
 typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
 typedef struct VTDIrq VTDIrq;
 typedef struct VTD_MSIMessage VTD_MSIMessage;
+typedef struct IntelIOMMUNotifierNode IntelIOMMUNotifierNode;
 
 /* Context-Entry */
 struct VTDContextEntry {
@@ -249,6 +250,11 @@ struct VTD_MSIMessage {
 /* When IR is enabled, all MSI/MSI-X data bits should be zero */
 #define VTD_IR_MSI_DATA          (0)
 
+struct IntelIOMMUNotifierNode {
+    VTDAddressSpace *vtd_as;
+    QLIST_ENTRY(IntelIOMMUNotifierNode) next;
+};
+
 /* The iommu (DMAR) device state struct */
 struct IntelIOMMUState {
     X86IOMMUState x86_iommu;
@@ -286,6 +292,8 @@ struct IntelIOMMUState {
     MemoryRegionIOMMUOps iommu_ops;
     GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
     VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
+    /* list of registered notifiers */
+    QLIST_HEAD(, IntelIOMMUNotifierNode) notifiers_list;
 
     /* interrupt remapping */
     bool intr_enabled;              /* Whether guest enabled IR */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v8 1/9] memory: add section range info for IOMMU notifier
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 1/9] memory: add section range info for IOMMU notifier Peter Xu
@ 2017-04-06 10:42   ` Auger Eric
  2017-04-06 10:52     ` Peter Xu
  2017-04-06 11:54   ` Michael S. Tsirkin
  2017-04-06 15:10   ` Alex Williamson
  2 siblings, 1 reply; 34+ messages in thread
From: Auger Eric @ 2017-04-06 10:42 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, Marcel Apfelbaum, bd.aviv, David Gibson

Hi Peter,
On 06/04/2017 09:08, Peter Xu wrote:
> In this patch, IOMMUNotifier.{start|end} are introduced to store section
> information for a specific notifier. When notification occurs, we not
> only check the notification type (MAP|UNMAP), but also check whether the
> notified iova range overlaps with the range of specific IOMMU notifier,
> and skip those notifiers if not in the listened range.
> 
> When removing a region, we need to make sure we remove the correct
> VFIOGuestIOMMU by checking the IOMMUNotifier.start address as well.
> 
> This patch is solving the problem that vfio-pci devices receive
> duplicated UNMAP notification on x86 platform when vIOMMU is there. The
> issue is that x86 IOMMU has a (0, 2^64-1) IOMMU region, which is
> split by the (0xfee00000, 0xfeefffff) IRQ region. AFAIK
> this (split IOMMU region) only happens on x86.
I think this is likely to happen on other architectures too as "reserved
regions" are now exported to the user space.
> 
> This patch also helps vhost to leverage the new interface as well, so
> that vhost won't get duplicated cache flushes. In that sense, it's a
> slight performance improvement.
> 
> Suggested-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Peter Xu <peterx@redhat.com>
Looks good to me.
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric
> ---
> v7->v8:
> - let vhost dmar leverage the new interface as well
> - add some more comments in commit message, mentioning what issue this
>   patch has solved
> - since touched up, removing Alex's a-b and DavidG's r-b
> ---
>  hw/vfio/common.c      | 12 +++++++++---
>  hw/virtio/vhost.c     | 10 ++++++++--
>  include/exec/memory.h | 19 ++++++++++++++++++-
>  memory.c              |  9 +++++++++
>  4 files changed, 44 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index f3ba9b9..6b33b9f 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -478,8 +478,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
>          giommu->iommu_offset = section->offset_within_address_space -
>                                 section->offset_within_region;
>          giommu->container = container;
> -        giommu->n.notify = vfio_iommu_map_notify;
> -        giommu->n.notifier_flags = IOMMU_NOTIFIER_ALL;
> +        llend = int128_add(int128_make64(section->offset_within_region),
> +                           section->size);
> +        llend = int128_sub(llend, int128_one());
> +        iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
> +                            IOMMU_NOTIFIER_ALL,
> +                            section->offset_within_region,
> +                            int128_get64(llend));
>          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>  
>          memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> @@ -550,7 +555,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
>          VFIOGuestIOMMU *giommu;
>  
>          QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> -            if (giommu->iommu == section->mr) {
> +            if (giommu->iommu == section->mr &&
> +                giommu->n.start == section->offset_within_region) {
>                  memory_region_unregister_iommu_notifier(giommu->iommu,
>                                                          &giommu->n);
>                  QLIST_REMOVE(giommu, giommu_next);
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 613494d..185b95b 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -736,14 +736,20 @@ static void vhost_iommu_region_add(MemoryListener *listener,
>      struct vhost_dev *dev = container_of(listener, struct vhost_dev,
>                                           iommu_listener);
>      struct vhost_iommu *iommu;
> +    Int128 end;
>  
>      if (!memory_region_is_iommu(section->mr)) {
>          return;
>      }
>  
>      iommu = g_malloc0(sizeof(*iommu));
> -    iommu->n.notify = vhost_iommu_unmap_notify;
> -    iommu->n.notifier_flags = IOMMU_NOTIFIER_UNMAP;
> +    end = int128_add(int128_make64(section->offset_within_region),
> +                     section->size);
> +    end = int128_sub(end, int128_one());
> +    iommu_notifier_init(&iommu->n, vhost_iommu_unmap_notify,
> +                        IOMMU_NOTIFIER_UNMAP,
> +                        section->offset_within_region,
> +                        int128_get64(end));
>      iommu->mr = section->mr;
>      iommu->iommu_offset = section->offset_within_address_space -
>                            section->offset_within_region;
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index f20b191..0840c89 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -77,13 +77,30 @@ typedef enum {
>  
>  #define IOMMU_NOTIFIER_ALL (IOMMU_NOTIFIER_MAP | IOMMU_NOTIFIER_UNMAP)
>  
> +struct IOMMUNotifier;
> +typedef void (*IOMMUNotify)(struct IOMMUNotifier *notifier,
> +                            IOMMUTLBEntry *data);
> +
>  struct IOMMUNotifier {
> -    void (*notify)(struct IOMMUNotifier *notifier, IOMMUTLBEntry *data);
> +    IOMMUNotify notify;
>      IOMMUNotifierFlag notifier_flags;
> +    /* Notify for address space range start <= addr <= end */
> +    hwaddr start;
> +    hwaddr end;
>      QLIST_ENTRY(IOMMUNotifier) node;
>  };
>  typedef struct IOMMUNotifier IOMMUNotifier;
>  
> +static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
> +                                       IOMMUNotifierFlag flags,
> +                                       hwaddr start, hwaddr end)
> +{
> +    n->notify = fn;
> +    n->notifier_flags = flags;
> +    n->start = start;
> +    n->end = end;
> +}
> +
>  /* New-style MMIO accessors can indicate that the transaction failed.
>   * A zero (MEMTX_OK) response means success; anything else is a failure
>   * of some kind. The memory subsystem will bitwise-OR together results
> diff --git a/memory.c b/memory.c
> index 4c95aaf..75ac595 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1606,6 +1606,7 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr,
>  
>      /* We need to register for at least one bitfield */
>      assert(n->notifier_flags != IOMMU_NOTIFIER_NONE);
> +    assert(n->start <= n->end);
>      QLIST_INSERT_HEAD(&mr->iommu_notify, n, node);
>      memory_region_update_iommu_notify_flags(mr);
>  }
> @@ -1667,6 +1668,14 @@ void memory_region_notify_iommu(MemoryRegion *mr,
>      }
>  
>      QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
> +        /*
> +         * Skip the notification if the notification does not overlap
> +         * with registered range.
> +         */
> +        if (iommu_notifier->start > entry.iova + entry.addr_mask + 1 ||
> +            iommu_notifier->end < entry.iova) {
> +            continue;
> +        }
>          if (iommu_notifier->notifier_flags & request_flags) {
>              iommu_notifier->notify(iommu_notifier, &entry);
>          }
> 


* Re: [Qemu-devel] [PATCH v8 2/9] memory: provide IOMMU_NOTIFIER_FOREACH macro
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 2/9] memory: provide IOMMU_NOTIFIER_FOREACH macro Peter Xu
@ 2017-04-06 10:45   ` Auger Eric
  2017-04-06 11:12     ` Peter Xu
  2017-04-06 11:54   ` Michael S. Tsirkin
  1 sibling, 1 reply; 34+ messages in thread
From: Auger Eric @ 2017-04-06 10:45 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, Marcel Apfelbaum, bd.aviv, David Gibson

Hi Peter,
On 06/04/2017 09:08, Peter Xu wrote:
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Peter Xu <peterx@redhat.com>
Even if the change is obvious, a commit message may be requested?

Reviewed-by: Eric Auger <eric.auger@redhat.com>

> ---
>  include/exec/memory.h | 3 +++
>  memory.c              | 4 ++--
>  2 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 0840c89..07e43da 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -239,6 +239,9 @@ struct MemoryRegion {
>      IOMMUNotifierFlag iommu_notify_flags;
>  };
>  
> +#define IOMMU_NOTIFIER_FOREACH(n, mr) \
> +    QLIST_FOREACH((n), &(mr)->iommu_notify, node)
> +
>  /**
>   * MemoryListener: callbacks structure for updates to the physical memory map
>   *
> diff --git a/memory.c b/memory.c
> index 75ac595..7496b3d 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1583,7 +1583,7 @@ static void memory_region_update_iommu_notify_flags(MemoryRegion *mr)
>      IOMMUNotifierFlag flags = IOMMU_NOTIFIER_NONE;
>      IOMMUNotifier *iommu_notifier;
>  
> -    QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
> +    IOMMU_NOTIFIER_FOREACH(iommu_notifier, mr) {
>          flags |= iommu_notifier->notifier_flags;
>      }
>  
> @@ -1667,7 +1667,7 @@ void memory_region_notify_iommu(MemoryRegion *mr,
>          request_flags = IOMMU_NOTIFIER_UNMAP;
>      }
>  
> -    QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
> +    IOMMU_NOTIFIER_FOREACH(iommu_notifier, mr) {
>          /*
>           * Skip the notification if the notification does not overlap
>           * with registered range.
> 


* Re: [Qemu-devel] [PATCH v8 3/9] memory: provide iommu_replay_all()
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 3/9] memory: provide iommu_replay_all() Peter Xu
@ 2017-04-06 10:52   ` Auger Eric
  2017-04-06 11:46     ` Peter Xu
  2017-04-06 11:55   ` Michael S. Tsirkin
  1 sibling, 1 reply; 34+ messages in thread
From: Auger Eric @ 2017-04-06 10:52 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, Marcel Apfelbaum, bd.aviv, David Gibson

Hi Peter,

On 06/04/2017 09:08, Peter Xu wrote:
> This is an "global" version of exising memory_region_iommu_replay() - we
s/exising/existing
> announce the translations to all the registered notifiers, instead of a
> specific one.
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/exec/memory.h | 8 ++++++++
>  memory.c              | 9 +++++++++
>  2 files changed, 17 insertions(+)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 07e43da..fb7dff3 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -713,6 +713,14 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
>                                  bool is_write);
>  
>  /**
> + * memory_region_iommu_replay_all: replay existing IOMMU translations
> + * to all the notifiers registered.
> + *
> + * @mr: the memory region to observe
> + */
> +void memory_region_iommu_replay_all(MemoryRegion *mr);
> +
> +/**
>   * memory_region_unregister_iommu_notifier: unregister a notifier for
>   * changes to IOMMU translation entries.
>   *
> diff --git a/memory.c b/memory.c
> index 7496b3d..b4ed67b 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1642,6 +1642,15 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
>      }
>  }
>  
> +void memory_region_iommu_replay_all(MemoryRegion *mr)
> +{
> +    IOMMUNotifier *notifier;
> +
> +    IOMMU_NOTIFIER_FOREACH(notifier, mr) {
> +        memory_region_iommu_replay(mr, notifier, false);
It is not fully clear to me what is the consequence of setting
is_write=false always?

Thanks

Eric
> +    }
> +}
> +
>  void memory_region_unregister_iommu_notifier(MemoryRegion *mr,
>                                               IOMMUNotifier *n)
>  {
> 


* Re: [Qemu-devel] [PATCH v8 1/9] memory: add section range info for IOMMU notifier
  2017-04-06 10:42   ` Auger Eric
@ 2017-04-06 10:52     ` Peter Xu
  0 siblings, 0 replies; 34+ messages in thread
From: Peter Xu @ 2017-04-06 10:52 UTC (permalink / raw)
  To: Auger Eric
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, Marcel Apfelbaum, bd.aviv, David Gibson

On Thu, Apr 06, 2017 at 12:42:54PM +0200, Auger Eric wrote:
> Hi Peter,
> On 06/04/2017 09:08, Peter Xu wrote:
> > In this patch, IOMMUNotifier.{start|end} are introduced to store section
> > information for a specific notifier. When notification occurs, we not
> > only check the notification type (MAP|UNMAP), but also check whether the
> > notified iova range overlaps with the range of specific IOMMU notifier,
> > and skip those notifiers if not in the listened range.
> > 
> > When removing a region, we need to make sure we remove the correct
> > VFIOGuestIOMMU by checking the IOMMUNotifier.start address as well.
> > 
> > This patch is solving the problem that vfio-pci devices receive
> > duplicated UNMAP notification on x86 platform when vIOMMU is there. The
> > issue is that x86 IOMMU has a (0, 2^64-1) IOMMU region, which is
> > split by the (0xfee00000, 0xfeefffff) IRQ region. AFAIK
> > this (split IOMMU region) only happens on x86.
> I think this is likely to happen on other architectures too as "reserved
> regions" are now exported to the user space.

Good to know this.

> > 
> > This patch also helps vhost to leverage the new interface as well, so
> > that vhost won't get duplicated cache flushes. In that sense, it's a
> > slight performance improvement.
> > 
> > Suggested-by: David Gibson <david@gibson.dropbear.id.au>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> Looks good to me.
> Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks!

-- peterx


* Re: [Qemu-devel] [PATCH v8 4/9] memory: introduce memory_region_notify_one()
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 4/9] memory: introduce memory_region_notify_one() Peter Xu
@ 2017-04-06 10:54   ` Auger Eric
  2017-04-06 11:55   ` Michael S. Tsirkin
  1 sibling, 0 replies; 34+ messages in thread
From: Auger Eric @ 2017-04-06 10:54 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, Marcel Apfelbaum, bd.aviv, David Gibson

Hi Peter,
On 06/04/2017 09:08, Peter Xu wrote:
> Generalizing the notify logic in memory_region_notify_iommu() into a
> single function. This can be further used in customized replay()
> functions for IOMMUs.
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric

> ---
>  include/exec/memory.h | 15 +++++++++++++++
>  memory.c              | 40 ++++++++++++++++++++++++----------------
>  2 files changed, 39 insertions(+), 16 deletions(-)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index fb7dff3..055b3a8 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -688,6 +688,21 @@ void memory_region_notify_iommu(MemoryRegion *mr,
>                                  IOMMUTLBEntry entry);
>  
>  /**
> + * memory_region_notify_one: notify a change in an IOMMU translation
> + *                           entry to a single notifier
> + *
> + * This works just like memory_region_notify_iommu(), but it only
> + * notifies a specific notifier, not all of them.
> + *
> + * @notifier: the notifier to be notified
> + * @entry: the new entry in the IOMMU translation table.  The entry
> + *         replaces all old entries for the same virtual I/O address range.
> + *         Deleted entries have .@perm == 0.
> + */
> +void memory_region_notify_one(IOMMUNotifier *notifier,
> +                              IOMMUTLBEntry *entry);
> +
> +/**
>   * memory_region_register_iommu_notifier: register a notifier for changes to
>   * IOMMU translation entries.
>   *
> diff --git a/memory.c b/memory.c
> index b4ed67b..ded4bf1 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1662,32 +1662,40 @@ void memory_region_unregister_iommu_notifier(MemoryRegion *mr,
>      memory_region_update_iommu_notify_flags(mr);
>  }
>  
> -void memory_region_notify_iommu(MemoryRegion *mr,
> -                                IOMMUTLBEntry entry)
> +void memory_region_notify_one(IOMMUNotifier *notifier,
> +                              IOMMUTLBEntry *entry)
>  {
> -    IOMMUNotifier *iommu_notifier;
>      IOMMUNotifierFlag request_flags;
>  
> -    assert(memory_region_is_iommu(mr));
> +    /*
> +     * Skip the notification if the notification does not overlap
> +     * with registered range.
> +     */
> +    if (notifier->start > entry->iova + entry->addr_mask + 1 ||
> +        notifier->end < entry->iova) {
> +        return;
> +    }
>  
> -    if (entry.perm & IOMMU_RW) {
> +    if (entry->perm & IOMMU_RW) {
>          request_flags = IOMMU_NOTIFIER_MAP;
>      } else {
>          request_flags = IOMMU_NOTIFIER_UNMAP;
>      }
>  
> +    if (notifier->notifier_flags & request_flags) {
> +        notifier->notify(notifier, entry);
> +    }
> +}
> +
> +void memory_region_notify_iommu(MemoryRegion *mr,
> +                                IOMMUTLBEntry entry)
> +{
> +    IOMMUNotifier *iommu_notifier;
> +
> +    assert(memory_region_is_iommu(mr));
> +
>      IOMMU_NOTIFIER_FOREACH(iommu_notifier, mr) {
> -        /*
> -         * Skip the notification if the notification does not overlap
> -         * with registered range.
> -         */
> -        if (iommu_notifier->start > entry.iova + entry.addr_mask + 1 ||
> -            iommu_notifier->end < entry.iova) {
> -            continue;
> -        }
> -        if (iommu_notifier->notifier_flags & request_flags) {
> -            iommu_notifier->notify(iommu_notifier, &entry);
> -        }
> +        memory_region_notify_one(iommu_notifier, &entry);
>      }
>  }
>  
> 


* Re: [Qemu-devel] [PATCH v8 5/9] memory: add MemoryRegionIOMMUOps.replay() callback
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 5/9] memory: add MemoryRegionIOMMUOps.replay() callback Peter Xu
@ 2017-04-06 10:58   ` Auger Eric
  2017-04-06 11:57   ` Michael S. Tsirkin
  1 sibling, 0 replies; 34+ messages in thread
From: Auger Eric @ 2017-04-06 10:58 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, Marcel Apfelbaum, bd.aviv, David Gibson

Hi Peter,

On 06/04/2017 09:08, Peter Xu wrote:
> Originally we have one memory_region_iommu_replay() function, which is
> the default behavior to replay the translations of the whole IOMMU
> region. However, on some platforms like x86, we may want our own replay
> logic for IOMMU regions. This patch add one more hook for IOMMUOps for
s/add/adds
> the callback, and it'll override the default if set.
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric
> ---
>  include/exec/memory.h | 2 ++
>  memory.c              | 6 ++++++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 055b3a8..c0280b7 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
>      void (*notify_flag_changed)(MemoryRegion *iommu,
>                                  IOMMUNotifierFlag old_flags,
>                                  IOMMUNotifierFlag new_flags);
> +    /* Set this up to provide a customized IOMMU replay function */
> +    void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
>  };
>  
>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> diff --git a/memory.c b/memory.c
> index ded4bf1..b782d5b 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1626,6 +1626,12 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
>      hwaddr addr, granularity;
>      IOMMUTLBEntry iotlb;
>  
> +    /* If the IOMMU has its own replay callback, override */
> +    if (mr->iommu_ops->replay) {
> +        mr->iommu_ops->replay(mr, n);
> +        return;
> +    }
> +
>      granularity = memory_region_iommu_get_min_page_size(mr);
>  
>      for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
> 


* Re: [Qemu-devel] [PATCH v8 2/9] memory: provide IOMMU_NOTIFIER_FOREACH macro
  2017-04-06 10:45   ` Auger Eric
@ 2017-04-06 11:12     ` Peter Xu
  2017-04-06 11:30       ` Auger Eric
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2017-04-06 11:12 UTC (permalink / raw)
  To: Auger Eric
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, Marcel Apfelbaum, bd.aviv, David Gibson

On Thu, Apr 06, 2017 at 12:45:59PM +0200, Auger Eric wrote:
> Hi Peter,
> On 06/04/2017 09:08, Peter Xu wrote:
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> Even if the change is obvious, a commit message may be requested?

Do you mean we'd better provide a commit message?

How about this:

  A new macro is provided to iterate all the IOMMU notifiers hooked
  under specific IOMMU memory region.

Thanks,

> 
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> 
> > ---
> >  include/exec/memory.h | 3 +++
> >  memory.c              | 4 ++--
> >  2 files changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/exec/memory.h b/include/exec/memory.h
> > index 0840c89..07e43da 100644
> > --- a/include/exec/memory.h
> > +++ b/include/exec/memory.h
> > @@ -239,6 +239,9 @@ struct MemoryRegion {
> >      IOMMUNotifierFlag iommu_notify_flags;
> >  };
> >  
> > +#define IOMMU_NOTIFIER_FOREACH(n, mr) \
> > +    QLIST_FOREACH((n), &(mr)->iommu_notify, node)
> > +
> >  /**
> >   * MemoryListener: callbacks structure for updates to the physical memory map
> >   *
> > diff --git a/memory.c b/memory.c
> > index 75ac595..7496b3d 100644
> > --- a/memory.c
> > +++ b/memory.c
> > @@ -1583,7 +1583,7 @@ static void memory_region_update_iommu_notify_flags(MemoryRegion *mr)
> >      IOMMUNotifierFlag flags = IOMMU_NOTIFIER_NONE;
> >      IOMMUNotifier *iommu_notifier;
> >  
> > -    QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
> > +    IOMMU_NOTIFIER_FOREACH(iommu_notifier, mr) {
> >          flags |= iommu_notifier->notifier_flags;
> >      }
> >  
> > @@ -1667,7 +1667,7 @@ void memory_region_notify_iommu(MemoryRegion *mr,
> >          request_flags = IOMMU_NOTIFIER_UNMAP;
> >      }
> >  
> > -    QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
> > +    IOMMU_NOTIFIER_FOREACH(iommu_notifier, mr) {
> >          /*
> >           * Skip the notification if the notification does not overlap
> >           * with registered range.
> > 

-- peterx


* Re: [Qemu-devel] [PATCH v8 2/9] memory: provide IOMMU_NOTIFIER_FOREACH macro
  2017-04-06 11:12     ` Peter Xu
@ 2017-04-06 11:30       ` Auger Eric
  0 siblings, 0 replies; 34+ messages in thread
From: Auger Eric @ 2017-04-06 11:30 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, Marcel Apfelbaum, bd.aviv, David Gibson

Hi,

On 06/04/2017 13:12, Peter Xu wrote:
> On Thu, Apr 06, 2017 at 12:45:59PM +0200, Auger Eric wrote:
>> Hi Peter,
>> On 06/04/2017 09:08, Peter Xu wrote:
>>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>> Even if the change is obvious, a commit message may be requested?
> 
> Do you mean we'd better provide a commit message?
> 
> How about this:
> 
>   A new macro is provided to iterate all the IOMMU notifiers hooked
>   under specific IOMMU memory region.
yep perfect

Eric
> 
> Thanks,
> 
>>
>> Reviewed-by: Eric Auger <eric.auger@redhat.com>
>>
>>> ---
>>>  include/exec/memory.h | 3 +++
>>>  memory.c              | 4 ++--
>>>  2 files changed, 5 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>>> index 0840c89..07e43da 100644
>>> --- a/include/exec/memory.h
>>> +++ b/include/exec/memory.h
>>> @@ -239,6 +239,9 @@ struct MemoryRegion {
>>>      IOMMUNotifierFlag iommu_notify_flags;
>>>  };
>>>  
>>> +#define IOMMU_NOTIFIER_FOREACH(n, mr) \
>>> +    QLIST_FOREACH((n), &(mr)->iommu_notify, node)
>>> +
>>>  /**
>>>   * MemoryListener: callbacks structure for updates to the physical memory map
>>>   *
>>> diff --git a/memory.c b/memory.c
>>> index 75ac595..7496b3d 100644
>>> --- a/memory.c
>>> +++ b/memory.c
>>> @@ -1583,7 +1583,7 @@ static void memory_region_update_iommu_notify_flags(MemoryRegion *mr)
>>>      IOMMUNotifierFlag flags = IOMMU_NOTIFIER_NONE;
>>>      IOMMUNotifier *iommu_notifier;
>>>  
>>> -    QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
>>> +    IOMMU_NOTIFIER_FOREACH(iommu_notifier, mr) {
>>>          flags |= iommu_notifier->notifier_flags;
>>>      }
>>>  
>>> @@ -1667,7 +1667,7 @@ void memory_region_notify_iommu(MemoryRegion *mr,
>>>          request_flags = IOMMU_NOTIFIER_UNMAP;
>>>      }
>>>  
>>> -    QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
>>> +    IOMMU_NOTIFIER_FOREACH(iommu_notifier, mr) {
>>>          /*
>>>           * Skip the notification if the notification does not overlap
>>>           * with registered range.
>>>
> 
> -- peterx
> 


* Re: [Qemu-devel] [PATCH v8 3/9] memory: provide iommu_replay_all()
  2017-04-06 10:52   ` Auger Eric
@ 2017-04-06 11:46     ` Peter Xu
  2017-04-07  4:17       ` Peter Xu
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2017-04-06 11:46 UTC (permalink / raw)
  To: Auger Eric
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, Marcel Apfelbaum, bd.aviv, David Gibson

On Thu, Apr 06, 2017 at 12:52:19PM +0200, Auger Eric wrote:
> Hi Peter,
> 
> On 06/04/2017 09:08, Peter Xu wrote:
> > This is an "global" version of exising memory_region_iommu_replay() - we
> s/exising/existing
> > announce the translations to all the registered notifiers, instead of a
> > specific one.
> > 
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  include/exec/memory.h | 8 ++++++++
> >  memory.c              | 9 +++++++++
> >  2 files changed, 17 insertions(+)
> > 
> > diff --git a/include/exec/memory.h b/include/exec/memory.h
> > index 07e43da..fb7dff3 100644
> > --- a/include/exec/memory.h
> > +++ b/include/exec/memory.h
> > @@ -713,6 +713,14 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
> >                                  bool is_write);
> >  
> >  /**
> > + * memory_region_iommu_replay_all: replay existing IOMMU translations
> > + * to all the notifiers registered.
> > + *
> > + * @mr: the memory region to observe
> > + */
> > +void memory_region_iommu_replay_all(MemoryRegion *mr);
> > +
> > +/**
> >   * memory_region_unregister_iommu_notifier: unregister a notifier for
> >   * changes to IOMMU translation entries.
> >   *
> > diff --git a/memory.c b/memory.c
> > index 7496b3d..b4ed67b 100644
> > --- a/memory.c
> > +++ b/memory.c
> > @@ -1642,6 +1642,15 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
> >      }
> >  }
> >  
> > +void memory_region_iommu_replay_all(MemoryRegion *mr)
> > +{
> > +    IOMMUNotifier *notifier;
> > +
> > +    IOMMU_NOTIFIER_FOREACH(notifier, mr) {
> > +        memory_region_iommu_replay(mr, notifier, false);
> It is not fully clear to me what is the consequence of setting
> is_write=false always?

I am not quite sure either why we have the is_write parameter on
memory_region_iommu_replay(). It has been there since the first
commit that introduced the interface:

    commit a788f227ef7bd2912fcaacdfe13d13ece2998149
    Author: David Gibson <david@gibson.dropbear.id.au>
    Date:   Wed Sep 30 12:13:55 2015 +1000

    memory: Allow replay of IOMMU mapping notifications

However, imho that should be a check against page RW permissions, which
seems unnecessary during replay. That said, I am not sure whether the
patch below would be good (as a standalone one, besides the current
patch/series):

------8<------

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 6b33b9f..d008a4b 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -488,7 +488,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
         QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
 
         memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
-        memory_region_iommu_replay(giommu->iommu, &giommu->n, false);
+        memory_region_iommu_replay(giommu->iommu, &giommu->n);
 
         return;
     }
diff --git a/include/exec/memory.h b/include/exec/memory.h
index c4fc94d..5ab0151 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -725,11 +725,8 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr,
  *
  * @mr: the memory region to observe
  * @n: the notifier to which to replay iommu mappings
- * @is_write: Whether to treat the replay as a translate "write"
- *     through the iommu
  */
-void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
-                                bool is_write);
+void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n);
 
 /**
  * memory_region_iommu_replay_all: replay existing IOMMU translations
diff --git a/memory.c b/memory.c
index b782d5b..155ca1c 100644
--- a/memory.c
+++ b/memory.c
@@ -1620,8 +1620,7 @@ uint64_t memory_region_iommu_get_min_page_size(MemoryRegion *mr)
     return TARGET_PAGE_SIZE;
 }
 
-void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
-                                bool is_write)
+void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
 {
     hwaddr addr, granularity;
     IOMMUTLBEntry iotlb;
@@ -1635,7 +1634,13 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
     granularity = memory_region_iommu_get_min_page_size(mr);
 
     for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
-        iotlb = mr->iommu_ops->translate(mr, addr, is_write);
+        /*
+         * For a replay, we don't need to do permission check.
+         * Assuming a "read" operation here to make sure we will pass
+         * the check (in case we don't have write permission on the
+         * page).
+         */
+        iotlb = mr->iommu_ops->translate(mr, addr, false);
         if (iotlb.perm != IOMMU_NONE) {
             n->notify(n, &iotlb);
         }
@@ -1653,7 +1658,7 @@ void memory_region_iommu_replay_all(MemoryRegion *mr)
     IOMMUNotifier *notifier;
 
     IOMMU_NOTIFIER_FOREACH(notifier, mr) {
-        memory_region_iommu_replay(mr, notifier, false);
+        memory_region_iommu_replay(mr, notifier);
     }
 }

----->8------
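To make the reasoning concrete, here is a toy model of the translate() permission check (a standalone sketch, not QEMU code; the function and values are illustrative). It shows why replaying with is_write=false still visits read-only pages, while a "write" translate would skip them:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified permission bits, mirroring QEMU's IOMMU_* flags. */
enum {
    IOMMU_NONE = 0,
    IOMMU_RO   = 1,
    IOMMU_WO   = 2,
    IOMMU_RW   = 3,
};

/* Hypothetical stand-in for mr->iommu_ops->translate(): return the
 * page's permissions if the requested access type is allowed,
 * IOMMU_NONE otherwise. */
static int translate(int page_perm, bool is_write)
{
    int needed = is_write ? IOMMU_WO : IOMMU_RO;
    return (page_perm & needed) ? page_perm : IOMMU_NONE;
}
```

With is_write=true, a read-only page would translate to IOMMU_NONE and its mapping would never be replayed to the notifier; with is_write=false every accessible page is visited.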

Thanks,

-- peterx

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v8 0/9] VT-d: vfio enablement and misc enhances
  2017-04-06  7:08 [Qemu-devel] [PATCH v8 0/9] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (8 preceding siblings ...)
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 9/9] intel_iommu: enable remote IOTLB Peter Xu
@ 2017-04-06 11:53 ` Michael S. Tsirkin
  2017-04-06 15:25   ` Peter Xu
  2017-04-06 12:00 ` Michael S. Tsirkin
  10 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2017-04-06 11:53 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, Marcel Apfelbaum, jan.kiszka,
	jasowang, David Gibson, alex.williamson, bd.aviv

On Thu, Apr 06, 2017 at 03:08:35PM +0800, Peter Xu wrote:
> This is v8 of vt-d vfio enablement series.
> 
> v8
> - remove patches 1-9 since merged already
> - add David's r-b for all the patches
> - add Aviv's s-o-b in the last patch
> - rename iommu to iommu_dmar [Jason]
> - rename last patch subject to "remote IOTLB" [Jason]
> - pick up jason's two patches to fix vhost breakage

I only see one (6/9) - is a patch missing or misattributed?

> - let vhost leverage the new IOMMU notifier interface

Which patch does this?

> v7:
> - for the two traces patches: Change subjects. Remove vtd_err() and
>   vtd_err_nonzero_rsvd() tracers, instead using standalone trace for
>   each of the places. Don't remove any DPRINTF() if there is no
>   replacement. [Jason]
> - add r-b and a-b for Alex/David/Jason.
> - in patch "intel_iommu: renaming gpa to iova where proper", convert
>   one more place where I missed [Jason]
> - fix the place where I should use "~0ULL" not "~0" [Jason]
> - squash patch 16 into 18 [Jason]
> 
> v6:
> - do unmap in all cases when replay [Jason]
> - do global replay even if context entry is invalidated [Jason]
> - when iommu reset, send unmap to all registered notifiers [Jason]
> - use rcu read lock to protect the whole vfio_iommu_map_notify()
>   [Alex, Paolo]
> 
> v5:
> - fix patch 4 subject too long, and error spelling [Eric]
> - add ack-by for alex in patch 1 [Alex]
> - squashing patch 19/20 into patch 18 [Jason]
> - fix comments in vtd_page_walk() [Jason]
> - remove all error_report() [Jason]
> - add comment for patch 18, mention about that enabled vhost without
>   ATS as well [Jason]
> - remove skipped debug thing during page walk [Jason]
> - remove duplicated page walk trace [Jason]
> - some tunings in vtd_address_space_unmap(), to provide correct iova
>   and addr_mask. For this, I tuned this patch as well a bit:
>   "memory: add section range info for IOMMU notifier"
>   to loosen the range check
> 
> v4:
> - convert all error_report()s into traces (in the two patches that did
>   that)
> - rebased to Jason's DMAR series (master + one more patch:
>   "[PATCH V4 net-next] vhost_net: device IOTLB support")
> - let vhost use the new api iommu_notifier_init() so it won't break
>   vhost dmar [Jason]
> - touch commit message of the patch:
>   "intel_iommu: provide its own replay() callback"
>   old replay is not a dead loop, but it will just consume lots of time
>   [Jason]
> - add comment for patch:
>   "intel_iommu: do replay when context invalidate"
>   telling why replay won't be a problem even without CM=1 [Jason]
> - remove a useless comment line [Jason]
> - remove dmar_enabled parameter for vtd_switch_address_space() and
>   vtd_switch_address_space_all() [Mst, Jason]
> - merged the vfio patches in, to support unmap of big ranges at the
>   beginning ("[PATCH RFC 0/3] vfio: allow to notify unmap for very big
>   region")
> - using caching_mode instead of cache_mode_enabled, and "caching-mode"
>   instead of "cache-mode" [Kevin]
> - when receive context entry invalidation, we unmap the entire region
>   first, then replay [Alex]
> - fix commit message for patch:
>   "intel_iommu: simplify irq region translation" [Kevin]
> - handle domain/global invalidation, and notify where proper [Jason,
>   Kevin]
> 
> v3:
> - fix style error reported by patchew
> - fix comment in domain switch patch: use "IOMMU address space" rather
>   than "IOMMU region" [Kevin]
> - add ack-by for Paolo in patch:
>   "memory: add section range info for IOMMU notifier"
>   (this is seperately collected besides this thread)
> - remove 3 patches which are merged already (from Jason)
> - rebase to master b6c0897
> 
> v2:
> - change comment for "end" parameter in vtd_page_walk() [Tianyu]
> - change comment for "a iova" to "an iova" [Yi]
> - fix fault printed val for GPA address in vtd_page_walk_level (debug
>   only)
> - rebased to master (rather than Aviv's v6 series) and merged Aviv's
>   series v6: picked patch 1 (as patch 1 in this series), dropped patch
>   2, re-wrote patch 3 (as patch 17 of this series).
> - picked up two more bugfix patches from Jason's DMAR series
> - picked up the following patch as well:
>   "[PATCH v3] intel_iommu: allow dynamic switch of IOMMU region"
> 
> This RFC series is a re-work of Aviv B.D.'s vfio enablement series
> with vt-d:
> 
>   https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01452.html
> 
> Aviv has done a great job there, and what was still lacking is
> mostly the following:
> 
> (1) VFIO got duplicated IOTLB notifications due to the split VT-d
>     IOMMU memory region.
> 
> (2) VT-d still did not provide a correct replay() mechanism (e.g.,
>     when the IOMMU domain switches, things would break).
> 
> This series should have solved the above two issues.
> 
> Online repo:
> 
>   https://github.com/xzpeter/qemu/tree/vtd-vfio-enablement-v8
> 
> I would be glad to hear about any review comments for above patches.
> 
> =========
> Test Done
> =========
> 
> Build test passed for x86_64/arm/ppc64.
> 
> Simply tested with x86_64, assigning two PCI devices to a single VM
> and booting the VM using:
> 
> bin=x86_64-softmmu/qemu-system-x86_64
> $bin -M q35,accel=kvm,kernel-irqchip=split -m 1G \
>      -device intel-iommu,intremap=on,eim=off,caching-mode=on \
>      -netdev user,id=net0,hostfwd=tcp::5555-:22 \
>      -device virtio-net-pci,netdev=net0 \
>      -device vfio-pci,host=03:00.0 \
>      -device vfio-pci,host=02:00.0 \
>      -trace events=".trace.vfio" \
>      /var/lib/libvirt/images/vm1.qcow2
> 
> pxdev:bin [vtd-vfio-enablement]# cat .trace.vfio
> vtd_page_walk*
> vtd_replay*
> vtd_inv_desc*
> 
> Then, in the guest, run the following tool:
> 
>   https://github.com/xzpeter/clibs/blob/master/gpl/userspace/vfio-bind-group/vfio-bind-group.c
> 
> With parameter:
> 
>   ./vfio-bind-group 00:03.0 00:04.0
> 
> Checking the host side trace log, I can see pages being replayed and
> mapped into the 00:04.0 device address space, like:
> 
> ...
> vtd_replay_ce_valid replay valid context device 00:04.00 hi 0x401 lo 0x38fe1001
> vtd_page_walk Page walk for ce (0x401, 0x38fe1001) iova range 0x0 - 0x8000000000
> vtd_page_walk_level Page walk (base=0x38fe1000, level=3) iova range 0x0 - 0x8000000000
> vtd_page_walk_level Page walk (base=0x35d31000, level=2) iova range 0x0 - 0x40000000
> vtd_page_walk_level Page walk (base=0x34979000, level=1) iova range 0x0 - 0x200000
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x0 -> gpa 0x22dc3000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x1000 -> gpa 0x22e25000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x2000 -> gpa 0x22e12000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x3000 -> gpa 0x22e2d000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x4000 -> gpa 0x12a49000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x5000 -> gpa 0x129bb000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x6000 -> gpa 0x128db000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x7000 -> gpa 0x12a80000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x8000 -> gpa 0x12a7e000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x9000 -> gpa 0x12b22000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0xa000 -> gpa 0x12b41000 mask 0xfff perm 3
> ...
> 
> =========
> Todo List
> =========
> 
> - error reporting for the assigned devices (as Tianyu has mentioned)
> 
> - per-domain address-space: A better solution in the future may be to
>   maintain one address space per IOMMU domain in the guest (so
>   multiple devices can share the same address space if they share the
>   same IOMMU domain in the guest), rather than one address space per
>   device (which is the current vt-d implementation). However, that's
>   a step further than this series, and let's see whether we can first
>   provide a workable version of device assignment with vt-d
>   protection.
> 
> - no need to notify IOTLB (psi/gsi/global) invalidations to devices
>   with ATS enabled
> 
> - investigate the case where the guest maps a page while the mask
>   covers already-mapped pages (e.g. map 12k-16k first, then map 0-12k)
> 
> - coalesce unmap during page walk (currently, we send it once per
>   page)
> 
> - when doing PSI for unmap, can we send one notification directly
>   instead of walking over the page table?
> 
> - more to come...
> 
> Thanks,
> 
> Jason Wang (1):
>   intel_iommu: use the correct memory region for device IOTLB
>     notification
> 
> Peter Xu (8):
>   memory: add section range info for IOMMU notifier
>   memory: provide IOMMU_NOTIFIER_FOREACH macro
>   memory: provide iommu_replay_all()
>   memory: introduce memory_region_notify_one()
>   memory: add MemoryRegionIOMMUOps.replay() callback
>   intel_iommu: provide its own replay() callback
>   intel_iommu: allow dynamic switch of IOMMU region
>   intel_iommu: enable remote IOTLB
> 
>  hw/i386/intel_iommu.c          | 442 +++++++++++++++++++++++++++++++++++++++--
>  hw/i386/intel_iommu_internal.h |   1 +
>  hw/i386/trace-events           |  10 +-
>  hw/vfio/common.c               |  12 +-
>  hw/virtio/vhost.c              |  10 +-
>  include/exec/memory.h          |  49 ++++-
>  include/hw/i386/intel_iommu.h  |  10 +
>  memory.c                       |  52 ++++-
>  8 files changed, 552 insertions(+), 34 deletions(-)
> 
> -- 
> 2.7.4

* Re: [Qemu-devel] [PATCH v8 1/9] memory: add section range info for IOMMU notifier
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 1/9] memory: add section range info for IOMMU notifier Peter Xu
  2017-04-06 10:42   ` Auger Eric
@ 2017-04-06 11:54   ` Michael S. Tsirkin
  2017-04-06 15:10   ` Alex Williamson
  2 siblings, 0 replies; 34+ messages in thread
From: Michael S. Tsirkin @ 2017-04-06 11:54 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, Marcel Apfelbaum, jan.kiszka,
	jasowang, David Gibson, alex.williamson, bd.aviv

On Thu, Apr 06, 2017 at 03:08:36PM +0800, Peter Xu wrote:
> In this patch, IOMMUNotifier.{start|end} are introduced to store section
> information for a specific notifier. When a notification occurs, we not
> only check the notification type (MAP|UNMAP), but also check whether the
> notified iova range overlaps with the range of the specific IOMMU
> notifier, and skip notifiers that are not in the listened range.
> 
> When removing a region, we need to make sure we remove the correct
> VFIOGuestIOMMU by checking the IOMMUNotifier.start address as well.
> 
> This patch solves the problem that vfio-pci devices receive duplicated
> UNMAP notifications on the x86 platform when a vIOMMU is present. The
> issue is that the x86 IOMMU has a (0, 2^64-1) IOMMU region, which is
> split by the (0xfee00000, 0xfeefffff) IRQ region. AFAIK this (split
> IOMMU region) only happens on x86.
> 
> This patch also lets vhost leverage the new interface, so that vhost
> won't get duplicated cache flushes. In that sense, it's a slight
> performance improvement.
> 
> Suggested-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Michael S. Tsirkin <mst@redhat.com>

> ---
> v7->v8:
> - let vhost dmar leverage the new interface as well
> - add some more comments in commit message, mentioning what issue this
>   patch has solved
> - since touched up, removing Alex's a-b and DavidG's r-b
> ---
>  hw/vfio/common.c      | 12 +++++++++---
>  hw/virtio/vhost.c     | 10 ++++++++--
>  include/exec/memory.h | 19 ++++++++++++++++++-
>  memory.c              |  9 +++++++++
>  4 files changed, 44 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index f3ba9b9..6b33b9f 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -478,8 +478,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
>          giommu->iommu_offset = section->offset_within_address_space -
>                                 section->offset_within_region;
>          giommu->container = container;
> -        giommu->n.notify = vfio_iommu_map_notify;
> -        giommu->n.notifier_flags = IOMMU_NOTIFIER_ALL;
> +        llend = int128_add(int128_make64(section->offset_within_region),
> +                           section->size);
> +        llend = int128_sub(llend, int128_one());
> +        iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
> +                            IOMMU_NOTIFIER_ALL,
> +                            section->offset_within_region,
> +                            int128_get64(llend));
>          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>  
>          memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> @@ -550,7 +555,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
>          VFIOGuestIOMMU *giommu;
>  
>          QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> -            if (giommu->iommu == section->mr) {
> +            if (giommu->iommu == section->mr &&
> +                giommu->n.start == section->offset_within_region) {
>                  memory_region_unregister_iommu_notifier(giommu->iommu,
>                                                          &giommu->n);
>                  QLIST_REMOVE(giommu, giommu_next);
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 613494d..185b95b 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -736,14 +736,20 @@ static void vhost_iommu_region_add(MemoryListener *listener,
>      struct vhost_dev *dev = container_of(listener, struct vhost_dev,
>                                           iommu_listener);
>      struct vhost_iommu *iommu;
> +    Int128 end;
>  
>      if (!memory_region_is_iommu(section->mr)) {
>          return;
>      }
>  
>      iommu = g_malloc0(sizeof(*iommu));
> -    iommu->n.notify = vhost_iommu_unmap_notify;
> -    iommu->n.notifier_flags = IOMMU_NOTIFIER_UNMAP;
> +    end = int128_add(int128_make64(section->offset_within_region),
> +                     section->size);
> +    end = int128_sub(end, int128_one());
> +    iommu_notifier_init(&iommu->n, vhost_iommu_unmap_notify,
> +                        IOMMU_NOTIFIER_UNMAP,
> +                        section->offset_within_region,
> +                        int128_get64(end));
>      iommu->mr = section->mr;
>      iommu->iommu_offset = section->offset_within_address_space -
>                            section->offset_within_region;
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index f20b191..0840c89 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -77,13 +77,30 @@ typedef enum {
>  
>  #define IOMMU_NOTIFIER_ALL (IOMMU_NOTIFIER_MAP | IOMMU_NOTIFIER_UNMAP)
>  
> +struct IOMMUNotifier;
> +typedef void (*IOMMUNotify)(struct IOMMUNotifier *notifier,
> +                            IOMMUTLBEntry *data);
> +
>  struct IOMMUNotifier {
> -    void (*notify)(struct IOMMUNotifier *notifier, IOMMUTLBEntry *data);
> +    IOMMUNotify notify;
>      IOMMUNotifierFlag notifier_flags;
> +    /* Notify for address space range start <= addr <= end */
> +    hwaddr start;
> +    hwaddr end;
>      QLIST_ENTRY(IOMMUNotifier) node;
>  };
>  typedef struct IOMMUNotifier IOMMUNotifier;
>  
> +static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
> +                                       IOMMUNotifierFlag flags,
> +                                       hwaddr start, hwaddr end)
> +{
> +    n->notify = fn;
> +    n->notifier_flags = flags;
> +    n->start = start;
> +    n->end = end;
> +}
> +
>  /* New-style MMIO accessors can indicate that the transaction failed.
>   * A zero (MEMTX_OK) response means success; anything else is a failure
>   * of some kind. The memory subsystem will bitwise-OR together results
> diff --git a/memory.c b/memory.c
> index 4c95aaf..75ac595 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1606,6 +1606,7 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr,
>  
>      /* We need to register for at least one bitfield */
>      assert(n->notifier_flags != IOMMU_NOTIFIER_NONE);
> +    assert(n->start <= n->end);
>      QLIST_INSERT_HEAD(&mr->iommu_notify, n, node);
>      memory_region_update_iommu_notify_flags(mr);
>  }
> @@ -1667,6 +1668,14 @@ void memory_region_notify_iommu(MemoryRegion *mr,
>      }
>  
>      QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
> +        /*
> +         * Skip the notification if the notification does not overlap
> +         * with registered range.
> +         */
> +        if (iommu_notifier->start > entry.iova + entry.addr_mask + 1 ||
> +            iommu_notifier->end < entry.iova) {
> +            continue;
> +        }
>          if (iommu_notifier->notifier_flags & request_flags) {
>              iommu_notifier->notify(iommu_notifier, &entry);
>          }
> -- 
> 2.7.4


* Re: [Qemu-devel] [PATCH v8 2/9] memory: provide IOMMU_NOTIFIER_FOREACH macro
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 2/9] memory: provide IOMMU_NOTIFIER_FOREACH macro Peter Xu
  2017-04-06 10:45   ` Auger Eric
@ 2017-04-06 11:54   ` Michael S. Tsirkin
  1 sibling, 0 replies; 34+ messages in thread
From: Michael S. Tsirkin @ 2017-04-06 11:54 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, Marcel Apfelbaum, jan.kiszka,
	jasowang, David Gibson, alex.williamson, bd.aviv

On Thu, Apr 06, 2017 at 03:08:37PM +0800, Peter Xu wrote:
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Michael S. Tsirkin <mst@redhat.com>

> ---
>  include/exec/memory.h | 3 +++
>  memory.c              | 4 ++--
>  2 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 0840c89..07e43da 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -239,6 +239,9 @@ struct MemoryRegion {
>      IOMMUNotifierFlag iommu_notify_flags;
>  };
>  
> +#define IOMMU_NOTIFIER_FOREACH(n, mr) \
> +    QLIST_FOREACH((n), &(mr)->iommu_notify, node)
> +
>  /**
>   * MemoryListener: callbacks structure for updates to the physical memory map
>   *
> diff --git a/memory.c b/memory.c
> index 75ac595..7496b3d 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1583,7 +1583,7 @@ static void memory_region_update_iommu_notify_flags(MemoryRegion *mr)
>      IOMMUNotifierFlag flags = IOMMU_NOTIFIER_NONE;
>      IOMMUNotifier *iommu_notifier;
>  
> -    QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
> +    IOMMU_NOTIFIER_FOREACH(iommu_notifier, mr) {
>          flags |= iommu_notifier->notifier_flags;
>      }
>  
> @@ -1667,7 +1667,7 @@ void memory_region_notify_iommu(MemoryRegion *mr,
>          request_flags = IOMMU_NOTIFIER_UNMAP;
>      }
>  
> -    QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
> +    IOMMU_NOTIFIER_FOREACH(iommu_notifier, mr) {
>          /*
>           * Skip the notification if the notification does not overlap
>           * with registered range.
> -- 
> 2.7.4

* Re: [Qemu-devel] [PATCH v8 3/9] memory: provide iommu_replay_all()
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 3/9] memory: provide iommu_replay_all() Peter Xu
  2017-04-06 10:52   ` Auger Eric
@ 2017-04-06 11:55   ` Michael S. Tsirkin
  1 sibling, 0 replies; 34+ messages in thread
From: Michael S. Tsirkin @ 2017-04-06 11:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, Marcel Apfelbaum, jan.kiszka,
	jasowang, David Gibson, alex.williamson, bd.aviv

On Thu, Apr 06, 2017 at 03:08:38PM +0800, Peter Xu wrote:
> This is a "global" version of the existing memory_region_iommu_replay():
> we announce the translations to all the registered notifiers, instead
> of to a specific one.
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Michael S. Tsirkin <mst@redhat.com>

> ---
>  include/exec/memory.h | 8 ++++++++
>  memory.c              | 9 +++++++++
>  2 files changed, 17 insertions(+)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 07e43da..fb7dff3 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -713,6 +713,14 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
>                                  bool is_write);
>  
>  /**
> + * memory_region_iommu_replay_all: replay existing IOMMU translations
> + * to all the notifiers registered.
> + *
> + * @mr: the memory region to observe
> + */
> +void memory_region_iommu_replay_all(MemoryRegion *mr);
> +
> +/**
>   * memory_region_unregister_iommu_notifier: unregister a notifier for
>   * changes to IOMMU translation entries.
>   *
> diff --git a/memory.c b/memory.c
> index 7496b3d..b4ed67b 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1642,6 +1642,15 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
>      }
>  }
>  
> +void memory_region_iommu_replay_all(MemoryRegion *mr)
> +{
> +    IOMMUNotifier *notifier;
> +
> +    IOMMU_NOTIFIER_FOREACH(notifier, mr) {
> +        memory_region_iommu_replay(mr, notifier, false);
> +    }
> +}
> +
>  void memory_region_unregister_iommu_notifier(MemoryRegion *mr,
>                                               IOMMUNotifier *n)
>  {
> -- 
> 2.7.4

* Re: [Qemu-devel] [PATCH v8 4/9] memory: introduce memory_region_notify_one()
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 4/9] memory: introduce memory_region_notify_one() Peter Xu
  2017-04-06 10:54   ` Auger Eric
@ 2017-04-06 11:55   ` Michael S. Tsirkin
  1 sibling, 0 replies; 34+ messages in thread
From: Michael S. Tsirkin @ 2017-04-06 11:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, Marcel Apfelbaum, jan.kiszka,
	jasowang, David Gibson, alex.williamson, bd.aviv

On Thu, Apr 06, 2017 at 03:08:39PM +0800, Peter Xu wrote:
> Generalize the notify logic in memory_region_notify_iommu() into a
> single function. This can be further used in customized replay()
> functions for IOMMUs.
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Michael S. Tsirkin <mst@redhat.com>

> ---
>  include/exec/memory.h | 15 +++++++++++++++
>  memory.c              | 40 ++++++++++++++++++++++++----------------
>  2 files changed, 39 insertions(+), 16 deletions(-)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index fb7dff3..055b3a8 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -688,6 +688,21 @@ void memory_region_notify_iommu(MemoryRegion *mr,
>                                  IOMMUTLBEntry entry);
>  
>  /**
> + * memory_region_notify_one: notify a change in an IOMMU translation
> + *                           entry to a single notifier
> + *
> + * This works just like memory_region_notify_iommu(), but it only
> + * notifies a specific notifier, not all of them.
> + *
> + * @notifier: the notifier to be notified
> + * @entry: the new entry in the IOMMU translation table.  The entry
> + *         replaces all old entries for the same virtual I/O address range.
> + *         Deleted entries have .@perm == 0.
> + */
> +void memory_region_notify_one(IOMMUNotifier *notifier,
> +                              IOMMUTLBEntry *entry);
> +
> +/**
>   * memory_region_register_iommu_notifier: register a notifier for changes to
>   * IOMMU translation entries.
>   *
> diff --git a/memory.c b/memory.c
> index b4ed67b..ded4bf1 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1662,32 +1662,40 @@ void memory_region_unregister_iommu_notifier(MemoryRegion *mr,
>      memory_region_update_iommu_notify_flags(mr);
>  }
>  
> -void memory_region_notify_iommu(MemoryRegion *mr,
> -                                IOMMUTLBEntry entry)
> +void memory_region_notify_one(IOMMUNotifier *notifier,
> +                              IOMMUTLBEntry *entry)
>  {
> -    IOMMUNotifier *iommu_notifier;
>      IOMMUNotifierFlag request_flags;
>  
> -    assert(memory_region_is_iommu(mr));
> +    /*
> +     * Skip the notification if the notification does not overlap
> +     * with registered range.
> +     */
> +    if (notifier->start > entry->iova + entry->addr_mask + 1 ||
> +        notifier->end < entry->iova) {
> +        return;
> +    }
>  
> -    if (entry.perm & IOMMU_RW) {
> +    if (entry->perm & IOMMU_RW) {
>          request_flags = IOMMU_NOTIFIER_MAP;
>      } else {
>          request_flags = IOMMU_NOTIFIER_UNMAP;
>      }
>  
> +    if (notifier->notifier_flags & request_flags) {
> +        notifier->notify(notifier, entry);
> +    }
> +}
> +
> +void memory_region_notify_iommu(MemoryRegion *mr,
> +                                IOMMUTLBEntry entry)
> +{
> +    IOMMUNotifier *iommu_notifier;
> +
> +    assert(memory_region_is_iommu(mr));
> +
>      IOMMU_NOTIFIER_FOREACH(iommu_notifier, mr) {
> -        /*
> -         * Skip the notification if the notification does not overlap
> -         * with registered range.
> -         */
> -        if (iommu_notifier->start > entry.iova + entry.addr_mask + 1 ||
> -            iommu_notifier->end < entry.iova) {
> -            continue;
> -        }
> -        if (iommu_notifier->notifier_flags & request_flags) {
> -            iommu_notifier->notify(iommu_notifier, &entry);
> -        }
> +        memory_region_notify_one(iommu_notifier, &entry);
>      }
>  }
>  
> -- 
> 2.7.4
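The perm-to-event mapping that memory_region_notify_one() factors out can be sketched standalone (the constants below are simplified, illustrative versions of QEMU's flags):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified flag values for the sketch only. */
enum {
    IOMMU_NONE           = 0,  /* entry perm: no access -> an unmap */
    IOMMU_RW             = 3,
    IOMMU_NOTIFIER_MAP   = 1,
    IOMMU_NOTIFIER_UNMAP = 2,
};

/* A non-empty perm is a MAP event; perm == IOMMU_NONE is an UNMAP. */
static int request_flag(int perm)
{
    return perm ? IOMMU_NOTIFIER_MAP : IOMMU_NOTIFIER_UNMAP;
}

/* Deliver only if the notifier registered for this event type. */
static bool should_notify(int notifier_flags, int perm)
{
    return (notifier_flags & request_flag(perm)) != 0;
}
```

A vhost notifier registered with IOMMU_NOTIFIER_UNMAP only, for example, sees invalidations but not new mappings.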

* Re: [Qemu-devel] [PATCH v8 5/9] memory: add MemoryRegionIOMMUOps.replay() callback
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 5/9] memory: add MemoryRegionIOMMUOps.replay() callback Peter Xu
  2017-04-06 10:58   ` Auger Eric
@ 2017-04-06 11:57   ` Michael S. Tsirkin
  1 sibling, 0 replies; 34+ messages in thread
From: Michael S. Tsirkin @ 2017-04-06 11:57 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, Marcel Apfelbaum, jan.kiszka,
	jasowang, David Gibson, alex.williamson, bd.aviv

On Thu, Apr 06, 2017 at 03:08:40PM +0800, Peter Xu wrote:
> Originally we have one memory_region_iommu_replay() function, which
> implements the default behavior of replaying the translations of the
> whole IOMMU region. However, on some platforms like x86, we may want
> our own replay logic for IOMMU regions. This patch adds one more hook
> to IOMMUOps for the callback; it overrides the default if set.
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Michael S. Tsirkin <mst@redhat.com>

> ---
>  include/exec/memory.h | 2 ++
>  memory.c              | 6 ++++++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 055b3a8..c0280b7 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
>      void (*notify_flag_changed)(MemoryRegion *iommu,
>                                  IOMMUNotifierFlag old_flags,
>                                  IOMMUNotifierFlag new_flags);
> +    /* Set this up to provide customized IOMMU replay function */
> +    void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
>  };
>  
>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> diff --git a/memory.c b/memory.c
> index ded4bf1..b782d5b 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1626,6 +1626,12 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
>      hwaddr addr, granularity;
>      IOMMUTLBEntry iotlb;
>  
> +    /* If the IOMMU has its own replay callback, override */
> +    if (mr->iommu_ops->replay) {
> +        mr->iommu_ops->replay(mr, n);
> +        return;
> +    }
> +
>      granularity = memory_region_iommu_get_min_page_size(mr);
>  
>      for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
> -- 
> 2.7.4
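The override introduced above is the usual optional-callback hook pattern. A minimal standalone model (illustrative names; not the real MemoryRegion types):

```c
#include <assert.h>
#include <stddef.h>

typedef struct IOMMU IOMMU;

/* Illustrative ops table: replay is optional; NULL means "use the
 * generic page-walk replay". */
typedef struct {
    void (*replay)(IOMMU *mr, int *events);
} IOMMUOps;

struct IOMMU {
    const IOMMUOps *ops;
};

/* Generic fallback: stands in for the page-walk loop in memory.c. */
static void default_replay(IOMMU *mr, int *events)
{
    (void)mr;
    *events = 1;
}

/* Custom hook a platform IOMMU (e.g. vt-d) could install. */
static void custom_replay(IOMMU *mr, int *events)
{
    (void)mr;
    *events = 2;
}

/* Mirrors memory_region_iommu_replay(): the hook overrides the
 * default when set. */
static void iommu_replay(IOMMU *mr, int *events)
{
    if (mr->ops && mr->ops->replay) {
        mr->ops->replay(mr, events);
        return;
    }
    default_replay(mr, events);
}
```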

* Re: [Qemu-devel] [PATCH v8 6/9] intel_iommu: use the correct memory region for device IOTLB notification
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 6/9] intel_iommu: use the correct memory region for device IOTLB notification Peter Xu
@ 2017-04-06 11:57   ` Michael S. Tsirkin
  0 siblings, 0 replies; 34+ messages in thread
From: Michael S. Tsirkin @ 2017-04-06 11:57 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, Marcel Apfelbaum, jan.kiszka,
	jasowang, David Gibson, alex.williamson, bd.aviv, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

On Thu, Apr 06, 2017 at 03:08:41PM +0800, Peter Xu wrote:
> From: Jason Wang <jasowang@redhat.com>
> 
> We have a specific memory region for DMAR now, so it's wrong to
> trigger the notifier with the root region.
> 
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Michael S. Tsirkin <mst@redhat.com>

> ---
>  hw/i386/intel_iommu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 22d8226..2412df4 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1457,7 +1457,7 @@ static bool vtd_process_device_iotlb_desc(IntelIOMMUState *s,
>      entry.iova = addr;
>      entry.perm = IOMMU_NONE;
>      entry.translated_addr = 0;
> -    memory_region_notify_iommu(entry.target_as->root, entry);
> +    memory_region_notify_iommu(&vtd_dev_as->iommu, entry);
>  
>  done:
>      return true;
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v8 7/9] intel_iommu: provide its own replay() callback
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 7/9] intel_iommu: provide its own replay() callback Peter Xu
@ 2017-04-06 11:57   ` Michael S. Tsirkin
  0 siblings, 0 replies; 34+ messages in thread
From: Michael S. Tsirkin @ 2017-04-06 11:57 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, Marcel Apfelbaum, jan.kiszka,
	jasowang, David Gibson, alex.williamson, bd.aviv

On Thu, Apr 06, 2017 at 03:08:42PM +0800, Peter Xu wrote:
> The default replay() does not work for VT-d, since VT-d has a huge
> default memory region covering the address range 0-(2^64-1). Walking
> it normally consumes a lot of time (and looks like a dead loop).
> 
> The solution is simple - we don't walk over all the regions. Instead,
> we jump over a region when we find that its page directories are
> empty. This greatly reduces the time needed to walk the whole region.
> 
> To achieve this, we provide a page walk helper that invokes a hook
> function for each page we are interested in. vtd_page_walk_level() is
> the core of the page walk logic. Its interface is designed to suit
> further use cases, e.g., invalidating a range of addresses.
> 
> Reviewed-by: Jason Wang <jasowang@redhat.com>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Michael S. Tsirkin <mst@redhat.com>

> ---
>  hw/i386/intel_iommu.c | 182 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  hw/i386/trace-events  |   7 ++
>  include/exec/memory.h |   2 +
>  3 files changed, 186 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 2412df4..7af4e22 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -595,6 +595,22 @@ static inline uint32_t vtd_get_agaw_from_context_entry(VTDContextEntry *ce)
>      return 30 + (ce->hi & VTD_CONTEXT_ENTRY_AW) * 9;
>  }
>  
> +static inline uint64_t vtd_iova_limit(VTDContextEntry *ce)
> +{
> +    uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
> +    return 1ULL << MIN(ce_agaw, VTD_MGAW);
> +}
> +
> +/* Return true if IOVA passes range check, otherwise false. */
> +static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce)
> +{
> +    /*
> +     * Check if @iova is above 2^X-1, where X is the minimum of MGAW
> +     * in CAP_REG and AW in context-entry.
> +     */
> +    return !(iova & ~(vtd_iova_limit(ce) - 1));
> +}
> +
>  static const uint64_t vtd_paging_entry_rsvd_field[] = {
>      [0] = ~0ULL,
>      /* For not large page */
> @@ -630,13 +646,9 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
>      uint32_t level = vtd_get_level_from_context_entry(ce);
>      uint32_t offset;
>      uint64_t slpte;
> -    uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
>      uint64_t access_right_check;
>  
> -    /* Check if @iova is above 2^X-1, where X is the minimum of MGAW
> -     * in CAP_REG and AW in context-entry.
> -     */
> -    if (iova & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
> +    if (!vtd_iova_range_check(iova, ce)) {
>          VTD_DPRINTF(GENERAL, "error: iova 0x%"PRIx64 " exceeds limits", iova);
>          return -VTD_FR_ADDR_BEYOND_MGAW;
>      }
> @@ -684,6 +696,134 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
>      }
>  }
>  
> +typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
> +
> +/**
> + * vtd_page_walk_level - walk over specific level for IOVA range
> + *
> + * @addr: base GPA addr to start the walk
> + * @start: IOVA range start address
> + * @end: IOVA range end address (start <= addr < end)
> + * @hook_fn: hook function to call for each detected page
> + * @private: private data to be passed into hook func
> + * @read: whether parent level has read permission
> + * @write: whether parent level has write permission
> + * @notify_unmap: whether we should notify invalid entries
> + */
> +static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> +                               uint64_t end, vtd_page_walk_hook hook_fn,
> +                               void *private, uint32_t level,
> +                               bool read, bool write, bool notify_unmap)
> +{
> +    bool read_cur, write_cur, entry_valid;
> +    uint32_t offset;
> +    uint64_t slpte;
> +    uint64_t subpage_size, subpage_mask;
> +    IOMMUTLBEntry entry;
> +    uint64_t iova = start;
> +    uint64_t iova_next;
> +    int ret = 0;
> +
> +    trace_vtd_page_walk_level(addr, level, start, end);
> +
> +    subpage_size = 1ULL << vtd_slpt_level_shift(level);
> +    subpage_mask = vtd_slpt_level_page_mask(level);
> +
> +    while (iova < end) {
> +        iova_next = (iova & subpage_mask) + subpage_size;
> +
> +        offset = vtd_iova_level_offset(iova, level);
> +        slpte = vtd_get_slpte(addr, offset);
> +
> +        if (slpte == (uint64_t)-1) {
> +            trace_vtd_page_walk_skip_read(iova, iova_next);
> +            goto next;
> +        }
> +
> +        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> +            trace_vtd_page_walk_skip_reserve(iova, iova_next);
> +            goto next;
> +        }
> +
> +        /* Permissions are stacked with parents' */
> +        read_cur = read && (slpte & VTD_SL_R);
> +        write_cur = write && (slpte & VTD_SL_W);
> +
> +        /*
> +         * As long as we have either read/write permission, this is a
> +         * valid entry. The rule works for both page entries and page
> +         * table entries.
> +         */
> +        entry_valid = read_cur | write_cur;
> +
> +        if (vtd_is_last_slpte(slpte, level)) {
> +            entry.target_as = &address_space_memory;
> +            entry.iova = iova & subpage_mask;
> +            /* NOTE: this is only meaningful if entry_valid == true */
> +            entry.translated_addr = vtd_get_slpte_addr(slpte);
> +            entry.addr_mask = ~subpage_mask;
> +            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
> +            if (!entry_valid && !notify_unmap) {
> +                trace_vtd_page_walk_skip_perm(iova, iova_next);
> +                goto next;
> +            }
> +            trace_vtd_page_walk_one(level, entry.iova, entry.translated_addr,
> +                                    entry.addr_mask, entry.perm);
> +            if (hook_fn) {
> +                ret = hook_fn(&entry, private);
> +                if (ret < 0) {
> +                    return ret;
> +                }
> +            }
> +        } else {
> +            if (!entry_valid) {
> +                trace_vtd_page_walk_skip_perm(iova, iova_next);
> +                goto next;
> +            }
> +            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte), iova,
> +                                      MIN(iova_next, end), hook_fn, private,
> +                                      level - 1, read_cur, write_cur,
> +                                      notify_unmap);
> +            if (ret < 0) {
> +                return ret;
> +            }
> +        }
> +
> +next:
> +        iova = iova_next;
> +    }
> +
> +    return 0;
> +}
> +
> +/**
> + * vtd_page_walk - walk specific IOVA range, and call the hook
> + *
> + * @ce: context entry to walk upon
> + * @start: IOVA address to start the walk
> + * @end: IOVA range end address (start <= addr < end)
> + * @hook_fn: the hook to be called for each detected area
> + * @private: private data for the hook function
> + */
> +static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
> +                         vtd_page_walk_hook hook_fn, void *private)
> +{
> +    dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
> +    uint32_t level = vtd_get_level_from_context_entry(ce);
> +
> +    if (!vtd_iova_range_check(start, ce)) {
> +        return -VTD_FR_ADDR_BEYOND_MGAW;
> +    }
> +
> +    if (!vtd_iova_range_check(end, ce)) {
> +        /* Fix end so that it reaches the maximum */
> +        end = vtd_iova_limit(ce);
> +    }
> +
> +    return vtd_page_walk_level(addr, start, end, hook_fn, private,
> +                               level, true, true, false);
> +}
> +
>  /* Map a device to its corresponding domain (context-entry) */
>  static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>                                      uint8_t devfn, VTDContextEntry *ce)
> @@ -2402,6 +2542,37 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>      return vtd_dev_as;
>  }
>  
> +static int vtd_replay_hook(IOMMUTLBEntry *entry, void *private)
> +{
> +    memory_region_notify_one((IOMMUNotifier *)private, entry);
> +    return 0;
> +}
> +
> +static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
> +{
> +    VTDAddressSpace *vtd_as = container_of(mr, VTDAddressSpace, iommu);
> +    IntelIOMMUState *s = vtd_as->iommu_state;
> +    uint8_t bus_n = pci_bus_num(vtd_as->bus);
> +    VTDContextEntry ce;
> +
> +    if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
> +        /*
> +         * Scanned a valid context entry, walk over the pages and
> +         * notify when needed.
> +         */
> +        trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
> +                                  PCI_FUNC(vtd_as->devfn),
> +                                  VTD_CONTEXT_ENTRY_DID(ce.hi),
> +                                  ce.hi, ce.lo);
> +        vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n);
> +    } else {
> +        trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
> +                                    PCI_FUNC(vtd_as->devfn));
> +    }
> +
> +    return;
> +}
> +
>  /* Do the initialization. It will also be called when reset, so pay
>   * attention when adding new initialization stuff.
>   */
> @@ -2416,6 +2587,7 @@ static void vtd_init(IntelIOMMUState *s)
>  
>      s->iommu_ops.translate = vtd_iommu_translate;
>      s->iommu_ops.notify_flag_changed = vtd_iommu_notify_flag_changed;
> +    s->iommu_ops.replay = vtd_iommu_replay;
>      s->root = 0;
>      s->root_extended = false;
>      s->dmar_enabled = false;
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index baed874..f725bca 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -30,6 +30,13 @@ vtd_iotlb_cc_hit(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32
>  vtd_iotlb_cc_update(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen1, uint32_t gen2) "IOTLB context update bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32" -> gen %"PRIu32
>  vtd_iotlb_reset(const char *reason) "IOTLB reset (reason: %s)"
>  vtd_fault_disabled(void) "Fault processing disabled for context entry"
> +vtd_replay_ce_valid(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t domain, uint64_t hi, uint64_t lo) "replay valid context device %02"PRIx8":%02"PRIx8".%02"PRIx8" domain 0x%"PRIx16" hi 0x%"PRIx64" lo 0x%"PRIx64
> +vtd_replay_ce_invalid(uint8_t bus, uint8_t dev, uint8_t fn) "replay invalid context device %02"PRIx8":%02"PRIx8".%02"PRIx8
> +vtd_page_walk_level(uint64_t addr, uint32_t level, uint64_t start, uint64_t end) "walk (base=0x%"PRIx64", level=%"PRIu32") iova range 0x%"PRIx64" - 0x%"PRIx64
> +vtd_page_walk_one(uint32_t level, uint64_t iova, uint64_t gpa, uint64_t mask, int perm) "detected page level 0x%"PRIx32" iova 0x%"PRIx64" -> gpa 0x%"PRIx64" mask 0x%"PRIx64" perm %d"
> +vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
> +vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
> +vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
>  
>  # hw/i386/amd_iommu.c
>  amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index c0280b7..c4fc94d 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -55,6 +55,8 @@ typedef enum {
>      IOMMU_RW   = 3,
>  } IOMMUAccessFlags;
>  
> +#define IOMMU_ACCESS_FLAG(r, w) (((r) ? IOMMU_RO : 0) | ((w) ? IOMMU_WO : 0))
> +
>  struct IOMMUTLBEntry {
>      AddressSpace    *target_as;
>      hwaddr           iova;
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v8 8/9] intel_iommu: allow dynamic switch of IOMMU region
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 8/9] intel_iommu: allow dynamic switch of IOMMU region Peter Xu
@ 2017-04-06 11:58   ` Michael S. Tsirkin
  0 siblings, 0 replies; 34+ messages in thread
From: Michael S. Tsirkin @ 2017-04-06 11:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, Marcel Apfelbaum, jan.kiszka,
	jasowang, David Gibson, alex.williamson, bd.aviv

On Thu, Apr 06, 2017 at 03:08:43PM +0800, Peter Xu wrote:
> This is preparation work to finally enable dynamic switching of VT-d
> protection ON and OFF. The old VT-d code uses a static IOMMU address
> space, which does not satisfy vfio-pci device listeners.
> 
> Let me explain.
> 
> vfio-pci devices depend on the memory region listener and IOMMU replay
> mechanism to make sure the device mapping is coherent with the guest
> even if there are domain switches. And there are two kinds of domain
> switches:
> 
>   (1) switch from domain A -> B
>   (2) switch from domain A -> no domain (e.g., turn DMAR off)
> 
> Case (1) is handled by the context entry invalidation handling in the
> VT-d replay logic. What the replay function should do here is replay
> the existing page mappings in domain B.
> 
> However, for case (2) we don't want to replay any domain mappings - we
> just need the default GPA->HPA mappings (the address_space_memory
> mapping). This patch handles case (2) by building up the mapping
> automatically, leveraging the vfio-pci memory listeners.
> 
> Another important thing this patch does is separate IR (Interrupt
> Remapping) from DMAR (DMA Remapping). The IR region should not depend
> on the DMAR region (as it did before this patch). It should be a
> standalone region that can be activated without DMAR (which is common
> behavior of the Linux kernel: by default it enables IR while leaving
> DMAR disabled).
> 
> Reviewed-by: Jason Wang <jasowang@redhat.com>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Michael S. Tsirkin <mst@redhat.com>

> ---
>  hw/i386/intel_iommu.c         | 81 ++++++++++++++++++++++++++++++++++++++++---
>  hw/i386/trace-events          |  2 +-
>  include/hw/i386/intel_iommu.h |  2 ++
>  3 files changed, 79 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 7af4e22..f7dec82 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1291,9 +1291,49 @@ static void vtd_handle_gcmd_sirtp(IntelIOMMUState *s)
>      vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_IRTPS);
>  }
>  
> +static void vtd_switch_address_space(VTDAddressSpace *as)
> +{
> +    assert(as);
> +
> +    trace_vtd_switch_address_space(pci_bus_num(as->bus),
> +                                   VTD_PCI_SLOT(as->devfn),
> +                                   VTD_PCI_FUNC(as->devfn),
> +                                   as->iommu_state->dmar_enabled);
> +
> +    /* Turn off first then on the other */
> +    if (as->iommu_state->dmar_enabled) {
> +        memory_region_set_enabled(&as->sys_alias, false);
> +        memory_region_set_enabled(&as->iommu, true);
> +    } else {
> +        memory_region_set_enabled(&as->iommu, false);
> +        memory_region_set_enabled(&as->sys_alias, true);
> +    }
> +}
> +
> +static void vtd_switch_address_space_all(IntelIOMMUState *s)
> +{
> +    GHashTableIter iter;
> +    VTDBus *vtd_bus;
> +    int i;
> +
> +    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
> +    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
> +        for (i = 0; i < X86_IOMMU_PCI_DEVFN_MAX; i++) {
> +            if (!vtd_bus->dev_as[i]) {
> +                continue;
> +            }
> +            vtd_switch_address_space(vtd_bus->dev_as[i]);
> +        }
> +    }
> +}
> +
>  /* Handle Translation Enable/Disable */
>  static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
>  {
> +    if (s->dmar_enabled == en) {
> +        return;
> +    }
> +
>      VTD_DPRINTF(CSR, "Translation Enable %s", (en ? "on" : "off"));
>  
>      if (en) {
> @@ -1308,6 +1348,8 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
>          /* Ok - report back to driver */
>          vtd_set_clear_mask_long(s, DMAR_GSTS_REG, VTD_GSTS_TES, 0);
>      }
> +
> +    vtd_switch_address_space_all(s);
>  }
>  
>  /* Handle Interrupt Remap Enable/Disable */
> @@ -2529,15 +2571,44 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>          vtd_dev_as->devfn = (uint8_t)devfn;
>          vtd_dev_as->iommu_state = s;
>          vtd_dev_as->context_cache_entry.context_cache_gen = 0;
> +
> +        /*
> +         * Memory region relationships looks like (Address range shows
> +         * only lower 32 bits to make it short in length...):
> +         *
> +         * |-----------------+-------------------+----------|
> +         * | Name            | Address range     | Priority |
> +         * |-----------------+-------------------+----------+
> +         * | vtd_root        | 00000000-ffffffff |        0 |
> +         * |  intel_iommu    | 00000000-ffffffff |        1 |
> +         * |  vtd_sys_alias  | 00000000-ffffffff |        1 |
> +         * |  intel_iommu_ir | fee00000-feefffff |       64 |
> +         * |-----------------+-------------------+----------|
> +         *
> +         * We enable/disable DMAR by toggling enablement of the
> +         * vtd_sys_alias and intel_iommu regions. The IR region is
> +         * always enabled.
> +         */
>          memory_region_init_iommu(&vtd_dev_as->iommu, OBJECT(s),
> -                                 &s->iommu_ops, "intel_iommu", UINT64_MAX);
> +                                 &s->iommu_ops, "intel_iommu_dmar",
> +                                 UINT64_MAX);
> +        memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
> +                                 "vtd_sys_alias", get_system_memory(),
> +                                 0, memory_region_size(get_system_memory()));
>          memory_region_init_io(&vtd_dev_as->iommu_ir, OBJECT(s),
>                                &vtd_mem_ir_ops, s, "intel_iommu_ir",
>                                VTD_INTERRUPT_ADDR_SIZE);
> -        memory_region_add_subregion(&vtd_dev_as->iommu, VTD_INTERRUPT_ADDR_FIRST,
> -                                    &vtd_dev_as->iommu_ir);
> -        address_space_init(&vtd_dev_as->as,
> -                           &vtd_dev_as->iommu, name);
> +        memory_region_init(&vtd_dev_as->root, OBJECT(s),
> +                           "vtd_root", UINT64_MAX);
> +        memory_region_add_subregion_overlap(&vtd_dev_as->root,
> +                                            VTD_INTERRUPT_ADDR_FIRST,
> +                                            &vtd_dev_as->iommu_ir, 64);
> +        address_space_init(&vtd_dev_as->as, &vtd_dev_as->root, name);
> +        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
> +                                            &vtd_dev_as->sys_alias, 1);
> +        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
> +                                            &vtd_dev_as->iommu, 1);
> +        vtd_switch_address_space(vtd_dev_as);
>      }
>      return vtd_dev_as;
>  }
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index f725bca..3c3a167 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -4,7 +4,6 @@
>  x86_iommu_iec_notify(bool global, uint32_t index, uint32_t mask) "Notify IEC invalidation: global=%d index=%" PRIu32 " mask=%" PRIu32
>  
>  # hw/i386/intel_iommu.c
> -vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
>  vtd_inv_desc(const char *type, uint64_t hi, uint64_t lo) "invalidate desc type %s high 0x%"PRIx64" low 0x%"PRIx64
>  vtd_inv_desc_invalid(uint64_t hi, uint64_t lo) "invalid inv desc hi 0x%"PRIx64" lo 0x%"PRIx64
>  vtd_inv_desc_cc_domain(uint16_t domain) "context invalidate domain 0x%"PRIx16
> @@ -37,6 +36,7 @@ vtd_page_walk_one(uint32_t level, uint64_t iova, uint64_t gpa, uint64_t mask, in
>  vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
>  vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
>  vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
> +vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
>  
>  # hw/i386/amd_iommu.c
>  amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index fe645aa..8f212a1 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -83,6 +83,8 @@ struct VTDAddressSpace {
>      uint8_t devfn;
>      AddressSpace as;
>      MemoryRegion iommu;
> +    MemoryRegion root;
> +    MemoryRegion sys_alias;
>      MemoryRegion iommu_ir;      /* Interrupt region: 0xfeeXXXXX */
>      IntelIOMMUState *iommu_state;
>      VTDContextCacheEntry context_cache_entry;
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v8 9/9] intel_iommu: enable remote IOTLB
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 9/9] intel_iommu: enable remote IOTLB Peter Xu
@ 2017-04-06 11:58   ` Michael S. Tsirkin
  0 siblings, 0 replies; 34+ messages in thread
From: Michael S. Tsirkin @ 2017-04-06 11:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, Marcel Apfelbaum, jan.kiszka,
	jasowang, David Gibson, alex.williamson, bd.aviv, Aviv Ben-David

On Thu, Apr 06, 2017 at 03:08:44PM +0800, Peter Xu wrote:
> This patch is based on Aviv Ben-David (<bd.aviv@gmail.com>)'s patch
> upstream:
> 
>   "IOMMU: enable intel_iommu map and unmap notifiers"
>   https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01453.html
> 
> However, I removed/fixed some content and added my own code.
> 
> Instead of calling translate() on every page for IOTLB invalidations
> (which is slower), we walk the pages only when needed and notify via
> a hook function.
> 
> This patch enables vfio devices for VT-d emulation.
> 
> And, since we already have vhost DMAR support via device-iotlb, a
> natural benefit of this patch is that VT-d-enabled vhost can now work
> even without the ATS capability, though more testing is needed.
> 
> Signed-off-by: Aviv Ben-David <bdaviv@cs.technion.ac.il>
> Reviewed-by: Jason Wang <jasowang@redhat.com>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Michael S. Tsirkin <mst@redhat.com>

> ---
>  hw/i386/intel_iommu.c          | 191 ++++++++++++++++++++++++++++++++++++++---
>  hw/i386/intel_iommu_internal.h |   1 +
>  hw/i386/trace-events           |   1 +
>  include/hw/i386/intel_iommu.h  |   8 ++
>  4 files changed, 188 insertions(+), 13 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index f7dec82..02f047c 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -806,7 +806,8 @@ next:
>   * @private: private data for the hook function
>   */
>  static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
> -                         vtd_page_walk_hook hook_fn, void *private)
> +                         vtd_page_walk_hook hook_fn, void *private,
> +                         bool notify_unmap)
>  {
>      dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
>      uint32_t level = vtd_get_level_from_context_entry(ce);
> @@ -821,7 +822,7 @@ static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
>      }
>  
>      return vtd_page_walk_level(addr, start, end, hook_fn, private,
> -                               level, true, true, false);
> +                               level, true, true, notify_unmap);
>  }
>  
>  /* Map a device to its corresponding domain (context-entry) */
> @@ -1038,6 +1039,15 @@ static void vtd_interrupt_remap_table_setup(IntelIOMMUState *s)
>                  s->intr_root, s->intr_size);
>  }
>  
> +static void vtd_iommu_replay_all(IntelIOMMUState *s)
> +{
> +    IntelIOMMUNotifierNode *node;
> +
> +    QLIST_FOREACH(node, &s->notifiers_list, next) {
> +        memory_region_iommu_replay_all(&node->vtd_as->iommu);
> +    }
> +}
> +
>  static void vtd_context_global_invalidate(IntelIOMMUState *s)
>  {
>      trace_vtd_inv_desc_cc_global();
> @@ -1045,6 +1055,14 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
>      if (s->context_cache_gen == VTD_CONTEXT_CACHE_GEN_MAX) {
>          vtd_reset_context_cache(s);
>      }
> +    /*
> +     * From VT-d spec 6.5.2.1, a global context entry invalidation
> +     * should be followed by an IOTLB global invalidation, so we
> +     * should be safe even without this. However, let's replay the
> +     * region as well to be safer, and revisit this when we need
> +     * finer tuning of the VT-d emulation code.
> +     */
> +    vtd_iommu_replay_all(s);
>  }
>  
>  
> @@ -1111,6 +1129,16 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
>                  trace_vtd_inv_desc_cc_device(bus_n, VTD_PCI_SLOT(devfn_it),
>                                               VTD_PCI_FUNC(devfn_it));
>                  vtd_as->context_cache_entry.context_cache_gen = 0;
> +                /*
> +                 * A device is moving out of (or into) a domain,
> +                 * so a replay() is suitable here to notify all
> +                 * the IOMMU_NOTIFIER_MAP registrants about this
> +                 * change. This does no harm even if no such
> +                 * notifier is registered - the IOMMU notification
> +                 * framework will skip MAP notifications in that
> +                 * case.
> +                 */
> +                memory_region_iommu_replay_all(&vtd_as->iommu);
>              }
>          }
>      }
> @@ -1152,12 +1180,53 @@ static void vtd_iotlb_global_invalidate(IntelIOMMUState *s)
>  {
>      trace_vtd_iotlb_reset("global invalidation recved");
>      vtd_reset_iotlb(s);
> +    vtd_iommu_replay_all(s);
>  }
>  
>  static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
>  {
> +    IntelIOMMUNotifierNode *node;
> +    VTDContextEntry ce;
> +    VTDAddressSpace *vtd_as;
> +
>      g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_domain,
>                                  &domain_id);
> +
> +    QLIST_FOREACH(node, &s->notifiers_list, next) {
> +        vtd_as = node->vtd_as;
> +        if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> +                                      vtd_as->devfn, &ce) &&
> +            domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> +            memory_region_iommu_replay_all(&vtd_as->iommu);
> +        }
> +    }
> +}
> +
> +static int vtd_page_invalidate_notify_hook(IOMMUTLBEntry *entry,
> +                                           void *private)
> +{
> +    memory_region_notify_iommu((MemoryRegion *)private, *entry);
> +    return 0;
> +}
> +
> +static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
> +                                           uint16_t domain_id, hwaddr addr,
> +                                           uint8_t am)
> +{
> +    IntelIOMMUNotifierNode *node;
> +    VTDContextEntry ce;
> +    int ret;
> +
> +    QLIST_FOREACH(node, &(s->notifiers_list), next) {
> +        VTDAddressSpace *vtd_as = node->vtd_as;
> +        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> +                                       vtd_as->devfn, &ce);
> +        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> +            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
> +                          vtd_page_invalidate_notify_hook,
> +                          (void *)&vtd_as->iommu, true);
> +        }
> +    }
>  }
>  
>  static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
> @@ -1170,6 +1239,7 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
>      info.addr = addr;
>      info.mask = ~((1 << am) - 1);
>      g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
> +    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
>  }
>  
>  /* Flush IOTLB
> @@ -2187,15 +2257,33 @@ static void vtd_iommu_notify_flag_changed(MemoryRegion *iommu,
>                                            IOMMUNotifierFlag new)
>  {
>      VTDAddressSpace *vtd_as = container_of(iommu, VTDAddressSpace, iommu);
> +    IntelIOMMUState *s = vtd_as->iommu_state;
> +    IntelIOMMUNotifierNode *node = NULL;
> +    IntelIOMMUNotifierNode *next_node = NULL;
>  
> -    if (new & IOMMU_NOTIFIER_MAP) {
> -        error_report("Device at bus %s addr %02x.%d requires iommu "
> -                     "notifier which is currently not supported by "
> -                     "intel-iommu emulation",
> -                     vtd_as->bus->qbus.name, PCI_SLOT(vtd_as->devfn),
> -                     PCI_FUNC(vtd_as->devfn));
> +    if (!s->caching_mode && new & IOMMU_NOTIFIER_MAP) {
> +        error_report("We need to set cache_mode=1 for intel-iommu to enable "
> +                     "device assignment with IOMMU protection.");
>          exit(1);
>      }
> +
> +    if (old == IOMMU_NOTIFIER_NONE) {
> +        node = g_malloc0(sizeof(*node));
> +        node->vtd_as = vtd_as;
> +        QLIST_INSERT_HEAD(&s->notifiers_list, node, next);
> +        return;
> +    }
> +
> +    /* update notifier node with new flags */
> +    QLIST_FOREACH_SAFE(node, &s->notifiers_list, next, next_node) {
> +        if (node->vtd_as == vtd_as) {
> +            if (new == IOMMU_NOTIFIER_NONE) {
> +                QLIST_REMOVE(node, next);
> +                g_free(node);
> +            }
> +            return;
> +        }
> +    }
>  }
>  
>  static const VMStateDescription vtd_vmstate = {
> @@ -2613,6 +2701,74 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>      return vtd_dev_as;
>  }
>  
> +/* Unmap the whole range in the notifier's scope. */
> +static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
> +{
> +    IOMMUTLBEntry entry;
> +    hwaddr size;
> +    hwaddr start = n->start;
> +    hwaddr end = n->end;
> +
> +    /*
> +     * Note: all the code in this function assumes that IOVA bits are
> +     * no more than VTD_MGAW bits (restricted by the VT-d spec);
> +     * otherwise we would need to consider 64-bit overflow.
> +     */
> +
> +    if (end > VTD_ADDRESS_SIZE) {
> +        /*
> +         * No need to unmap regions bigger than the whole VT-d
> +         * supported address space size
> +         */
> +        end = VTD_ADDRESS_SIZE;
> +    }
> +
> +    assert(start <= end);
> +    size = end - start;
> +
> +    if (ctpop64(size) != 1) {
> +        /*
> +         * This size cannot form a correct mask. Enlarge it to the
> +         * smallest power of two that can.
> +         */
> +        int n = 64 - clz64(size);
> +        if (n > VTD_MGAW) {
> +            /* should not happen, but in case it happens, limit it */
> +            n = VTD_MGAW;
> +        }
> +        size = 1ULL << n;
> +    }
> +
> +    entry.target_as = &address_space_memory;
> +    /* Adjust iova for the size */
> +    entry.iova = n->start & ~(size - 1);
> +    /* This field is meaningless for unmap */
> +    entry.translated_addr = 0;
> +    entry.perm = IOMMU_NONE;
> +    entry.addr_mask = size - 1;
> +
> +    trace_vtd_as_unmap_whole(pci_bus_num(as->bus),
> +                             VTD_PCI_SLOT(as->devfn),
> +                             VTD_PCI_FUNC(as->devfn),
> +                             entry.iova, size);
> +
> +    memory_region_notify_one(n, &entry);
> +}
> +
> +static void vtd_address_space_unmap_all(IntelIOMMUState *s)
> +{
> +    IntelIOMMUNotifierNode *node;
> +    VTDAddressSpace *vtd_as;
> +    IOMMUNotifier *n;
> +
> +    QLIST_FOREACH(node, &s->notifiers_list, next) {
> +        vtd_as = node->vtd_as;
> +        IOMMU_NOTIFIER_FOREACH(n, &vtd_as->iommu) {
> +            vtd_address_space_unmap(vtd_as, n);
> +        }
> +    }
> +}
> +
>  static int vtd_replay_hook(IOMMUTLBEntry *entry, void *private)
>  {
>      memory_region_notify_one((IOMMUNotifier *)private, entry);
> @@ -2626,16 +2782,19 @@ static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
>      uint8_t bus_n = pci_bus_num(vtd_as->bus);
>      VTDContextEntry ce;
>  
> +    /*
> +     * The replay can be triggered by either an invalidation or a newly
> +     * created entry. Either way, we release the existing mappings first
> +     * (which means flushing caches for UNMAP-only registered notifiers).
> +     */
> +    vtd_address_space_unmap(vtd_as, n);
> +
>      if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
> -        /*
> -         * Scanned a valid context entry, walk over the pages and
> -         * notify when needed.
> -         */
>          trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
>                                    PCI_FUNC(vtd_as->devfn),
>                                    VTD_CONTEXT_ENTRY_DID(ce.hi),
>                                    ce.hi, ce.lo);
> -        vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n);
> +        vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false);
>      } else {
>          trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
>                                      PCI_FUNC(vtd_as->devfn));
> @@ -2754,6 +2913,11 @@ static void vtd_reset(DeviceState *dev)
>  
>      VTD_DPRINTF(GENERAL, "");
>      vtd_init(s);
> +
> +    /*
> +     * When the device is reset, throw away all mappings and external caches
> +     */
> +    vtd_address_space_unmap_all(s);
>  }
>  
>  static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
> @@ -2817,6 +2981,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
>          return;
>      }
>  
> +    QLIST_INIT(&s->notifiers_list);
>      memset(s->vtd_as_by_bus_num, 0, sizeof(s->vtd_as_by_bus_num));
>      memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
>                            "intel_iommu", DMAR_REG_SIZE);
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 4104121..29d6707 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -197,6 +197,7 @@
>  #define VTD_DOMAIN_ID_MASK          ((1UL << VTD_DOMAIN_ID_SHIFT) - 1)
>  #define VTD_CAP_ND                  (((VTD_DOMAIN_ID_SHIFT - 4) / 2) & 7ULL)
>  #define VTD_MGAW                    39  /* Maximum Guest Address Width */
> +#define VTD_ADDRESS_SIZE            (1ULL << VTD_MGAW)
>  #define VTD_CAP_MGAW                (((VTD_MGAW - 1) & 0x3fULL) << 16)
>  #define VTD_MAMV                    18ULL
>  #define VTD_CAP_MAMV                (VTD_MAMV << 48)
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index 3c3a167..04a6980 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -37,6 +37,7 @@ vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"P
>  vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
>  vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
>  vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
> +vtd_as_unmap_whole(uint8_t bus, uint8_t slot, uint8_t fn, uint64_t iova, uint64_t size) "Device %02x:%02x.%x start 0x%"PRIx64" size 0x%"PRIx64
>  
>  # hw/i386/amd_iommu.c
>  amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 8f212a1..3e51876 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -63,6 +63,7 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
>  typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
>  typedef struct VTDIrq VTDIrq;
>  typedef struct VTD_MSIMessage VTD_MSIMessage;
> +typedef struct IntelIOMMUNotifierNode IntelIOMMUNotifierNode;
>  
>  /* Context-Entry */
>  struct VTDContextEntry {
> @@ -249,6 +250,11 @@ struct VTD_MSIMessage {
>  /* When IR is enabled, all MSI/MSI-X data bits should be zero */
>  #define VTD_IR_MSI_DATA          (0)
>  
> +struct IntelIOMMUNotifierNode {
> +    VTDAddressSpace *vtd_as;
> +    QLIST_ENTRY(IntelIOMMUNotifierNode) next;
> +};
> +
>  /* The iommu (DMAR) device state struct */
>  struct IntelIOMMUState {
>      X86IOMMUState x86_iommu;
> @@ -286,6 +292,8 @@ struct IntelIOMMUState {
>      MemoryRegionIOMMUOps iommu_ops;
>      GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
>      VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
> +    /* list of registered notifiers */
> +    QLIST_HEAD(, IntelIOMMUNotifierNode) notifiers_list;
>  
>      /* interrupt remapping */
>      bool intr_enabled;              /* Whether guest enabled IR */
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH v8 0/9] VT-d: vfio enablement and misc enhances
  2017-04-06  7:08 [Qemu-devel] [PATCH v8 0/9] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (9 preceding siblings ...)
  2017-04-06 11:53 ` [Qemu-devel] [PATCH v8 0/9] VT-d: vfio enablement and misc enhances Michael S. Tsirkin
@ 2017-04-06 12:00 ` Michael S. Tsirkin
  2017-04-06 15:27   ` Peter Xu
  10 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2017-04-06 12:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, Marcel Apfelbaum, jan.kiszka,
	jasowang, David Gibson, alex.williamson, bd.aviv

On Thu, Apr 06, 2017 at 03:08:35PM +0800, Peter Xu wrote:
> This is v8 of vt-d vfio enablement series.
> 
> v8
> - remove patches 1-9 since merged already
> - add David's r-b for all the patches
> - add Aviv's s-o-b in the last patch
> - rename iommu to iommu_dmar [Jason]
> - rename last patch subject to "remote IOTLB" [Jason]
> - pick up jason's two patches to fix vhost breakage
> - let vhost leverage the new IOMMU notifier interface

Looks good to me except I'm still wondering why there's a single patch
by Jason.  Marcel - you mentioned that you intend to try to open and
maintain pci-next.  Would you like to pick this one up?  Otherwise, pls
repost after 2.10.


> v7:
> - for the two traces patches: Change subjects. Remove vtd_err() and
>   vtd_err_nonzero_rsvd() tracers, instead using standalone trace for
>   each of the places. Don't remove any DPRINTF() if there is no
>   replacement. [Jason]
> - add r-b and a-b for Alex/David/Jason.
> - in patch "intel_iommu: renaming gpa to iova where proper", convert
>   one more place where I missed [Jason]
> - fix the place where I should use "~0ULL" not "~0" [Jason]
> - squash patch 16 into 18 [Jason]
> 
> v6:
> - do unmap in all cases when replay [Jason]
> - do global replay even if context entry is invalidated [Jason]
> - when iommu reset, send unmap to all registered notifiers [Jason]
> - use rcu read lock to protect the whole vfio_iommu_map_notify()
>   [Alex, Paolo]
> 
> v5:
> - fix patch 4 subject too long, and error spelling [Eric]
> - add ack-by for alex in patch 1 [Alex]
> - squashing patch 19/20 into patch 18 [Jason]
> - fix comments in vtd_page_walk() [Jason]
> - remove all error_report() [Jason]
> - add comment for patch 18, mentioning that it enables vhost without
>   ATS as well [Jason]
> - remove skipped debug thing during page walk [Jason]
> - remove duplicated page walk trace [Jason]
> - some tunings in vtd_address_space_unmap(), to provide correct iova
>   and addr_mask. For this, I tuned this patch as well a bit:
>   "memory: add section range info for IOMMU notifier"
>   to loosen the range check
> 
> v4:
> - convert all error_report()s into traces (in the two patches that did
>   that)
> - rebased to Jason's DMAR series (master + one more patch:
>   "[PATCH V4 net-next] vhost_net: device IOTLB support")
> - let vhost use the new api iommu_notifier_init() so it won't break
>   vhost dmar [Jason]
> - touch commit message of the patch:
>   "intel_iommu: provide its own replay() callback"
>   old replay is not a dead loop, but it will just consume lots of time
>   [Jason]
> - add comment for patch:
>   "intel_iommu: do replay when context invalidate"
>   telling why replay won't be a problem even without CM=1 [Jason]
> - remove a useless comment line [Jason]
> - remove dmar_enabled parameter for vtd_switch_address_space() and
>   vtd_switch_address_space_all() [Mst, Jason]
> - merged the vfio patches in, to support unmap of big ranges at the
>   beginning ("[PATCH RFC 0/3] vfio: allow to notify unmap for very big
>   region")
> - using caching_mode instead of cache_mode_enabled, and "caching-mode"
>   instead of "cache-mode" [Kevin]
> - when receive context entry invalidation, we unmap the entire region
>   first, then replay [Alex]
> - fix commit message for patch:
>   "intel_iommu: simplify irq region translation" [Kevin]
> - handle domain/global invalidation, and notify where proper [Jason,
>   Kevin]
> 
> v3:
> - fix style error reported by patchew
> - fix comment in domain switch patch: use "IOMMU address space" rather
>   than "IOMMU region" [Kevin]
> - add ack-by for Paolo in patch:
>   "memory: add section range info for IOMMU notifier"
>   (this was collected separately, outside this thread)
> - remove 3 patches which are merged already (from Jason)
> - rebase to master b6c0897
> 
> v2:
> - change comment for "end" parameter in vtd_page_walk() [Tianyu]
> - change comment for "a iova" to "an iova" [Yi]
> - fix fault printed val for GPA address in vtd_page_walk_level (debug
>   only)
> - rebased to master (rather than Aviv's v6 series) and merged Aviv's
>   series v6: picked patch 1 (as patch 1 in this series), dropped patch
>   2, re-wrote patch 3 (as patch 17 of this series).
> - picked up two more bugfix patches from Jason's DMAR series
> - picked up the following patch as well:
>   "[PATCH v3] intel_iommu: allow dynamic switch of IOMMU region"
> 
> This RFC series is a re-work for Aviv B.D.'s vfio enablement series
> with vt-d:
> 
>   https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01452.html
> 
> Aviv has done a great job there, and what we still lack there is
> mostly the following:
> 
> (1) VFIO got duplicated IOTLB notifications due to the split VT-d
>     IOMMU memory region.
> 
> (2) VT-d still didn't provide a correct replay() mechanism (e.g.,
>     when the IOMMU domain switches, things would break).
> 
> This series should have solved the above two issues.
> 
> Online repo:
> 
>   https://github.com/xzpeter/qemu/tree/vtd-vfio-enablement-v8
> 
> I would be glad to hear about any review comments for above patches.
> 
> =========
> Test Done
> =========
> 
> Build test passed for x86_64/arm/ppc64.
> 
> Simply tested with x86_64, assigning two PCI devices to a single VM,
> booting the VM using:
> 
> bin=x86_64-softmmu/qemu-system-x86_64
> $bin -M q35,accel=kvm,kernel-irqchip=split -m 1G \
>      -device intel-iommu,intremap=on,eim=off,caching-mode=on \
>      -netdev user,id=net0,hostfwd=tcp::5555-:22 \
>      -device virtio-net-pci,netdev=net0 \
>      -device vfio-pci,host=03:00.0 \
>      -device vfio-pci,host=02:00.0 \
>      -trace events=".trace.vfio" \
>      /var/lib/libvirt/images/vm1.qcow2
> 
> pxdev:bin [vtd-vfio-enablement]# cat .trace.vfio
> vtd_page_walk*
> vtd_replay*
> vtd_inv_desc*
> 
> Then, in the guest, run the following tool:
> 
>   https://github.com/xzpeter/clibs/blob/master/gpl/userspace/vfio-bind-group/vfio-bind-group.c
> 
> With parameter:
> 
>   ./vfio-bind-group 00:03.0 00:04.0
> 
> Checking the host-side trace log, I can see pages being replayed and
> mapped in the 00:04.0 device address space, like:
> 
> ...
> vtd_replay_ce_valid replay valid context device 00:04.00 hi 0x401 lo 0x38fe1001
> vtd_page_walk Page walk for ce (0x401, 0x38fe1001) iova range 0x0 - 0x8000000000
> vtd_page_walk_level Page walk (base=0x38fe1000, level=3) iova range 0x0 - 0x8000000000
> vtd_page_walk_level Page walk (base=0x35d31000, level=2) iova range 0x0 - 0x40000000
> vtd_page_walk_level Page walk (base=0x34979000, level=1) iova range 0x0 - 0x200000
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x0 -> gpa 0x22dc3000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x1000 -> gpa 0x22e25000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x2000 -> gpa 0x22e12000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x3000 -> gpa 0x22e2d000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x4000 -> gpa 0x12a49000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x5000 -> gpa 0x129bb000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x6000 -> gpa 0x128db000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x7000 -> gpa 0x12a80000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x8000 -> gpa 0x12a7e000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x9000 -> gpa 0x12b22000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0xa000 -> gpa 0x12b41000 mask 0xfff perm 3
> ...
> 
> =========
> Todo List
> =========
> 
> - error reporting for the assigned devices (as Tianyu has mentioned)
> 
> - per-domain address space: a better solution in the future may be to
>   maintain one address space per IOMMU domain in the guest (so
>   multiple devices can share the same address space if they share the
>   same IOMMU domain in the guest), rather than one address space per
>   device (which is the current implementation of vt-d). However that's
>   a step further than this series, and let's see whether we can first
>   provide a workable version of device assignment with vt-d
>   protection.
> 
> - don't need to notify IOTLB (psi/gsi/global) invalidations to devices
>   that have ATS enabled
> 
> - investigate the case where the guest maps a page while the mask
>   covers already-mapped pages (e.g. map 12k-16k first, then map 0-12k)
> 
> - coalesce unmap during page walk (currently, we send it once per
>   page)
> 
> - when doing PSI for unmap, can we send one notification directly
>   instead of walking over the page table?
> 
> - more to come...
> 
> Thanks,
> 
> Jason Wang (1):
>   intel_iommu: use the correct memory region for device IOTLB
>     notification
> 
> Peter Xu (8):
>   memory: add section range info for IOMMU notifier
>   memory: provide IOMMU_NOTIFIER_FOREACH macro
>   memory: provide iommu_replay_all()
>   memory: introduce memory_region_notify_one()
>   memory: add MemoryRegionIOMMUOps.replay() callback
>   intel_iommu: provide its own replay() callback
>   intel_iommu: allow dynamic switch of IOMMU region
>   intel_iommu: enable remote IOTLB
> 
>  hw/i386/intel_iommu.c          | 442 +++++++++++++++++++++++++++++++++++++++--
>  hw/i386/intel_iommu_internal.h |   1 +
>  hw/i386/trace-events           |  10 +-
>  hw/vfio/common.c               |  12 +-
>  hw/virtio/vhost.c              |  10 +-
>  include/exec/memory.h          |  49 ++++-
>  include/hw/i386/intel_iommu.h  |  10 +
>  memory.c                       |  52 ++++-
>  8 files changed, 552 insertions(+), 34 deletions(-)
> 
> -- 
> 2.7.4


* Re: [Qemu-devel] [PATCH v8 1/9] memory: add section range info for IOMMU notifier
  2017-04-06  7:08 ` [Qemu-devel] [PATCH v8 1/9] memory: add section range info for IOMMU notifier Peter Xu
  2017-04-06 10:42   ` Auger Eric
  2017-04-06 11:54   ` Michael S. Tsirkin
@ 2017-04-06 15:10   ` Alex Williamson
  2 siblings, 0 replies; 34+ messages in thread
From: Alex Williamson @ 2017-04-06 15:10 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, Marcel Apfelbaum,
	jan.kiszka, jasowang, David Gibson, bd.aviv

On Thu,  6 Apr 2017 15:08:36 +0800
Peter Xu <peterx@redhat.com> wrote:

> In this patch, IOMMUNotifier.{start|end} are introduced to store section
> information for a specific notifier. When notification occurs, we not
> only check the notification type (MAP|UNMAP), but also check whether the
> notified iova range overlaps with the range of specific IOMMU notifier,
> and skip those notifiers if not in the listened range.
> 
> When removing a region, we need to make sure we remove the correct
> VFIOGuestIOMMU by checking the IOMMUNotifier.start address as well.
> 
> This patch is solving the problem that vfio-pci devices receive
> duplicated UNMAP notification on x86 platform when vIOMMU is there. The
> issue is that x86 IOMMU has a (0, 2^64-1) IOMMU region, which is
> split by the (0xfee00000, 0xfeefffff) IRQ region. AFAIK
> this (split IOMMU region) only happens on x86.
> 
> This patch also helps vhost leverage the new interface, so that vhost
> won't get duplicated cache flushes. In that sense, it's a slight
> performance improvement.
> 
> Suggested-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> v7->v8:
> - let vhost dmar leverage the new interface as well
> - add some more comments in commit message, mentioning what issue this
>   patch has solved
> - since touched up, removing Alex's a-b and DavidG's r-b
> ---
>  hw/vfio/common.c      | 12 +++++++++---
>  hw/virtio/vhost.c     | 10 ++++++++--
>  include/exec/memory.h | 19 ++++++++++++++++++-
>  memory.c              |  9 +++++++++
>  4 files changed, 44 insertions(+), 6 deletions(-)


Acked-by: Alex Williamson <alex.williamson@redhat.com>

> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index f3ba9b9..6b33b9f 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -478,8 +478,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
>          giommu->iommu_offset = section->offset_within_address_space -
>                                 section->offset_within_region;
>          giommu->container = container;
> -        giommu->n.notify = vfio_iommu_map_notify;
> -        giommu->n.notifier_flags = IOMMU_NOTIFIER_ALL;
> +        llend = int128_add(int128_make64(section->offset_within_region),
> +                           section->size);
> +        llend = int128_sub(llend, int128_one());
> +        iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
> +                            IOMMU_NOTIFIER_ALL,
> +                            section->offset_within_region,
> +                            int128_get64(llend));
>          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>  
>          memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> @@ -550,7 +555,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
>          VFIOGuestIOMMU *giommu;
>  
>          QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> -            if (giommu->iommu == section->mr) {
> +            if (giommu->iommu == section->mr &&
> +                giommu->n.start == section->offset_within_region) {
>                  memory_region_unregister_iommu_notifier(giommu->iommu,
>                                                          &giommu->n);
>                  QLIST_REMOVE(giommu, giommu_next);
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 613494d..185b95b 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -736,14 +736,20 @@ static void vhost_iommu_region_add(MemoryListener *listener,
>      struct vhost_dev *dev = container_of(listener, struct vhost_dev,
>                                           iommu_listener);
>      struct vhost_iommu *iommu;
> +    Int128 end;
>  
>      if (!memory_region_is_iommu(section->mr)) {
>          return;
>      }
>  
>      iommu = g_malloc0(sizeof(*iommu));
> -    iommu->n.notify = vhost_iommu_unmap_notify;
> -    iommu->n.notifier_flags = IOMMU_NOTIFIER_UNMAP;
> +    end = int128_add(int128_make64(section->offset_within_region),
> +                     section->size);
> +    end = int128_sub(end, int128_one());
> +    iommu_notifier_init(&iommu->n, vhost_iommu_unmap_notify,
> +                        IOMMU_NOTIFIER_UNMAP,
> +                        section->offset_within_region,
> +                        int128_get64(end));
>      iommu->mr = section->mr;
>      iommu->iommu_offset = section->offset_within_address_space -
>                            section->offset_within_region;
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index f20b191..0840c89 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -77,13 +77,30 @@ typedef enum {
>  
>  #define IOMMU_NOTIFIER_ALL (IOMMU_NOTIFIER_MAP | IOMMU_NOTIFIER_UNMAP)
>  
> +struct IOMMUNotifier;
> +typedef void (*IOMMUNotify)(struct IOMMUNotifier *notifier,
> +                            IOMMUTLBEntry *data);
> +
>  struct IOMMUNotifier {
> -    void (*notify)(struct IOMMUNotifier *notifier, IOMMUTLBEntry *data);
> +    IOMMUNotify notify;
>      IOMMUNotifierFlag notifier_flags;
> +    /* Notify for address space range start <= addr <= end */
> +    hwaddr start;
> +    hwaddr end;
>      QLIST_ENTRY(IOMMUNotifier) node;
>  };
>  typedef struct IOMMUNotifier IOMMUNotifier;
>  
> +static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
> +                                       IOMMUNotifierFlag flags,
> +                                       hwaddr start, hwaddr end)
> +{
> +    n->notify = fn;
> +    n->notifier_flags = flags;
> +    n->start = start;
> +    n->end = end;
> +}
> +
>  /* New-style MMIO accessors can indicate that the transaction failed.
>   * A zero (MEMTX_OK) response means success; anything else is a failure
>   * of some kind. The memory subsystem will bitwise-OR together results
> diff --git a/memory.c b/memory.c
> index 4c95aaf..75ac595 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1606,6 +1606,7 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr,
>  
>      /* We need to register for at least one bitfield */
>      assert(n->notifier_flags != IOMMU_NOTIFIER_NONE);
> +    assert(n->start <= n->end);
>      QLIST_INSERT_HEAD(&mr->iommu_notify, n, node);
>      memory_region_update_iommu_notify_flags(mr);
>  }
> @@ -1667,6 +1668,14 @@ void memory_region_notify_iommu(MemoryRegion *mr,
>      }
>  
>      QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
> +        /*
> +         * Skip the notification if the notification does not overlap
> +         * with registered range.
> +         */
> +        if (iommu_notifier->start > entry.iova + entry.addr_mask + 1 ||
> +            iommu_notifier->end < entry.iova) {
> +            continue;
> +        }
>          if (iommu_notifier->notifier_flags & request_flags) {
>              iommu_notifier->notify(iommu_notifier, &entry);
>          }


* Re: [Qemu-devel] [PATCH v8 0/9] VT-d: vfio enablement and misc enhances
  2017-04-06 11:53 ` [Qemu-devel] [PATCH v8 0/9] VT-d: vfio enablement and misc enhances Michael S. Tsirkin
@ 2017-04-06 15:25   ` Peter Xu
  0 siblings, 0 replies; 34+ messages in thread
From: Peter Xu @ 2017-04-06 15:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, tianyu.lan, kevin.tian, Marcel Apfelbaum, jan.kiszka,
	jasowang, David Gibson, alex.williamson, bd.aviv

On Thu, Apr 06, 2017 at 02:53:46PM +0300, Michael S. Tsirkin wrote:
> On Thu, Apr 06, 2017 at 03:08:35PM +0800, Peter Xu wrote:
> > This is v8 of vt-d vfio enablement series.
> > 
> > v8
> > - remove patches 1-9 since merged already
> > - add David's r-b for all the patches
> > - add Aviv's s-o-b in the last patch
> > - rename iommu to iommu_dmar [Jason]
> > - rename last patch subject to "remote IOTLB" [Jason]
> > - pick up jason's two patches to fix vhost breakage
> 
> I only see one (6/9) - is a patch missing or misattributed?

Oh sorry, I should have said "Jason's one patch". The other patch has already
been merged upstream, which is 375f74f47 ("vhost: generalize iommu
memory region").

> 
> > - let vhost leverage the new IOMMU notifier interface
> 
> Which patch does this?

The first one ("memory: add section range info for IOMMU notifier").

Thanks,

-- peterx


* Re: [Qemu-devel] [PATCH v8 0/9] VT-d: vfio enablement and misc enhances
  2017-04-06 12:00 ` Michael S. Tsirkin
@ 2017-04-06 15:27   ` Peter Xu
  0 siblings, 0 replies; 34+ messages in thread
From: Peter Xu @ 2017-04-06 15:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, tianyu.lan, kevin.tian, Marcel Apfelbaum, jan.kiszka,
	jasowang, David Gibson, alex.williamson, bd.aviv

On Thu, Apr 06, 2017 at 03:00:28PM +0300, Michael S. Tsirkin wrote:
> On Thu, Apr 06, 2017 at 03:08:35PM +0800, Peter Xu wrote:
> > This is v8 of vt-d vfio enablement series.
> > 
> > v8
> > - remove patches 1-9 since merged already
> > - add David's r-b for all the patches
> > - add Aviv's s-o-b in the last patch
> > - rename iommu to iommu_dmar [Jason]
> > - rename last patch subject to "remote IOTLB" [Jason]
> > - pick up jason's two patches to fix vhost breakage
> > - let vhost leverage the new IOMMU notifier interface
> 
> Looks good to me except I'm still wondering why there's a single patch
> by Jason.

Answered in the other thread (the other one has been picked up
already).

> Marcel - you mentioned that you intend to try to open and
> maintain pci-next.  Would you like to pick this one up?  Otherwise, pls
> repost after 2.10.

Thanks,

-- peterx


* Re: [Qemu-devel] [PATCH v8 3/9] memory: provide iommu_replay_all()
  2017-04-06 11:46     ` Peter Xu
@ 2017-04-07  4:17       ` Peter Xu
  0 siblings, 0 replies; 34+ messages in thread
From: Peter Xu @ 2017-04-07  4:17 UTC (permalink / raw)
  To: Auger Eric
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, Marcel Apfelbaum, bd.aviv, David Gibson

On Thu, Apr 06, 2017 at 07:46:47PM +0800, Peter Xu wrote:
> On Thu, Apr 06, 2017 at 12:52:19PM +0200, Auger Eric wrote:
> > Hi Peter,
> > 
> > On 06/04/2017 09:08, Peter Xu wrote:

[...]

> > > +void memory_region_iommu_replay_all(MemoryRegion *mr)
> > > +{
> > > +    IOMMUNotifier *notifier;
> > > +
> > > +    IOMMU_NOTIFIER_FOREACH(notifier, mr) {
> > > +        memory_region_iommu_replay(mr, notifier, false);
> > It is not fully clear to me what is the consequence of setting
> > is_write=false always?
> 
> I am not quite sure either why we have the is_write parameter on
> memory_region_iommu_replay(). It has been there since the first
> commit that introduced the interface:
> 
>     commit a788f227ef7bd2912fcaacdfe13d13ece2998149
>     Author: David Gibson <david@gibson.dropbear.id.au>
>     Date:   Wed Sep 30 12:13:55 2015 +1000
> 
>     memory: Allow replay of IOMMU mapping notifications
> 
> However, IMHO that should be a check against page RW permissions,
> which seems unnecessary during replay. That said, I'm not sure whether
> the patch below would be good (as a standalone one besides the current
> patch/series):
> 
> ------8<------
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 6b33b9f..d008a4b 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -488,7 +488,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
>          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>  
>          memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> -        memory_region_iommu_replay(giommu->iommu, &giommu->n, false);
> +        memory_region_iommu_replay(giommu->iommu, &giommu->n);
>  
>          return;
>      }
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index c4fc94d..5ab0151 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -725,11 +725,8 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr,
>   *
>   * @mr: the memory region to observe
>   * @n: the notifier to which to replay iommu mappings
> - * @is_write: Whether to treat the replay as a translate "write"
> - *     through the iommu
>   */
> -void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
> -                                bool is_write);
> +void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n);
>  
>  /**
>   * memory_region_iommu_replay_all: replay existing IOMMU translations
> diff --git a/memory.c b/memory.c
> index b782d5b..155ca1c 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1620,8 +1620,7 @@ uint64_t memory_region_iommu_get_min_page_size(MemoryRegion *mr)
>      return TARGET_PAGE_SIZE;
>  }
>  
> -void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
> -                                bool is_write)
> +void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
>  {
>      hwaddr addr, granularity;
>      IOMMUTLBEntry iotlb;
> @@ -1635,7 +1634,13 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
>      granularity = memory_region_iommu_get_min_page_size(mr);
>  
>      for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
> -        iotlb = mr->iommu_ops->translate(mr, addr, is_write);
> +        /*
> +         * For a replay, we don't need to do permission check.
> +         * Assuming a "read" operation here to make sure we will pass
> +         * the check (in case we don't have write permission on the
> +         * page).
> +         */
> +        iotlb = mr->iommu_ops->translate(mr, addr, false);
>          if (iotlb.perm != IOMMU_NONE) {
>              n->notify(n, &iotlb);
>          }
> @@ -1653,7 +1658,7 @@ void memory_region_iommu_replay_all(MemoryRegion *mr)
>      IOMMUNotifier *notifier;
>  
>      IOMMU_NOTIFIER_FOREACH(notifier, mr) {
> -        memory_region_iommu_replay(mr, notifier, false);
> +        memory_region_iommu_replay(mr, notifier);
>      }
>  }
> 
> ----->8------

This follow-up patch assumes that all pages have read permission,
which may not be true, e.g., for write-only pages. A better way
(after a short discussion with David Gibson) would be to convert the
is_write parameter of iommu_ops.translate() to an IOMMUAccessFlags,
so that we can pass IOMMU_NONE here (meaning: don't check permissions
for this page translation). Further, we could then remove the
is_write parameter from memory_region_iommu_replay() entirely.

In any case, I'll move this change out of the current series and send
a fresh post after 2.10. Thanks,

-- peterx
