* [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances
@ 2017-01-20 13:08 Peter Xu
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 01/20] vfio: trace map/unmap for notify as well Peter Xu
                   ` (20 more replies)
  0 siblings, 21 replies; 75+ messages in thread
From: Peter Xu @ 2017-01-20 13:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

This is v4 of the VT-d vfio enablement series.

Sorry that v4 grew to 20 patches. The newly added patches (which
are quite necessary) are:

[01/20] vfio: trace map/unmap for notify as well
[02/20] vfio: introduce vfio_get_vaddr()
[03/20] vfio: allow to notify unmap for very large region

  These patches come from the RFC series:

  "[PATCH RFC 0/3] vfio: allow to notify unmap for very big region"

  They are required by patch [19/20].

[11/20] memory: provide IOMMU_NOTIFIER_FOREACH macro

  A helper only.

[19/20] intel_iommu: unmap existing pages before replay

  This addresses Alex's concern that there might be existing mappings
  in the previous domain when replay happens.

[20/20] intel_iommu: replay even with DSI/GLOBAL inv desc

  This solves Jason/Kevin's concern by handling DSI/GLOBAL
  invalidations as well.

Each individual patch has a more detailed explanation in its own
commit message; please refer to them.

Here I kept patches 19/20 separate rather than squashing them into
patch 18, for easier modification and review. I prefer keeping them
separate so that each problem can be seen on its own; after all,
patch 18 survives in most use cases. Please let me know if we want to
squash them in some way, and I can respin when necessary.

Besides the big items, there are lots of tiny tweaks as well. Here's
the changelog.

v4:
- convert all error_report()s into traces (in the two patches that did
  that)
- rebased to Jason's DMAR series (master + one more patch:
  "[PATCH V4 net-next] vhost_net: device IOTLB support")
- let vhost use the new api iommu_notifier_init() so it won't break
  vhost dmar [Jason]
- touch commit message of the patch:
  "intel_iommu: provide its own replay() callback"
  old replay is not a dead loop, but it will just consume lots of time
  [Jason]
- add comment for patch:
  "intel_iommu: do replay when context invalidate"
  telling why replay won't be a problem even without CM=1 [Jason]
- remove a useless comment line [Jason]
- remove dmar_enabled parameter for vtd_switch_address_space() and
  vtd_switch_address_space_all() [Mst, Jason]
- merged the vfio patches in, to support unmap of big ranges at the
  beginning ("[PATCH RFC 0/3] vfio: allow to notify unmap for very big
  region")
- using caching_mode instead of cache_mode_enabled, and "caching-mode"
  instead of "cache-mode" [Kevin]
- when receiving a context entry invalidation, unmap the entire region
  first, then replay [Alex]
- fix commit message for patch:
  "intel_iommu: simplify irq region translation" [Kevin]
- handle domain/global invalidation, and notify where proper [Jason,
  Kevin]

v3:
- fix style error reported by patchew
- fix comment in domain switch patch: use "IOMMU address space" rather
  than "IOMMU region" [Kevin]
- add ack-by for Paolo in patch:
  "memory: add section range info for IOMMU notifier"
  (this was collected separately, outside of this thread)
- remove 3 patches which are merged already (from Jason)
- rebase to master b6c0897

v2:
- change comment for "end" parameter in vtd_page_walk() [Tianyu]
- change comment for "a iova" to "an iova" [Yi]
- fix fault printed val for GPA address in vtd_page_walk_level (debug
  only)
- rebased to master (rather than Aviv's v6 series) and merged Aviv's
  series v6: picked patch 1 (as patch 1 in this series), dropped patch
  2, re-wrote patch 3 (as patch 17 of this series).
- picked up two more bugfix patches from Jason's DMAR series
- picked up the following patch as well:
  "[PATCH v3] intel_iommu: allow dynamic switch of IOMMU region"

This RFC series is a rework of Aviv B.D.'s vfio enablement series
for VT-d:

  https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01452.html

Aviv has done a great job there; what we still lack is mostly the
following:

(1) VFIO gets duplicated IOTLB notifications due to the split VT-d
    IOMMU memory region.

(2) VT-d still does not provide a correct replay() mechanism (e.g.,
    when the IOMMU domain switches, things will break).

This series should solve the above two issues.
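
To give a rough feel of what "replay" means here, below is a tiny
standalone sketch (illustration only, not code from this series; the
real hooks are the MemoryRegionIOMMUOps.replay() callback and the
VT-d page walker added by the later patches, and every type and
helper below is a made-up stand-in): when a device ends up in a new
IOMMU domain, walk the guest IO page table over the whole iova range
and fire a MAP notification for each present entry, so that the vfio
shadow mappings can be rebuilt.

/*
 * Conceptual illustration only -- all types and helpers below are
 * hypothetical stand-ins, not QEMU code.
 */
#include <inttypes.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

typedef struct {
    uint64_t iova;          /* guest IO virtual address */
    uint64_t translated;    /* guest physical address it maps to */
    bool     present;
} IOMMUEntry;

/* Stand-in for walking the guest's second-level page table for one page. */
static IOMMUEntry guest_pgtable_lookup(uint64_t iova)
{
    /* Pretend only the first four pages are mapped, 1:1 plus an offset. */
    return (IOMMUEntry){ iova, iova + 0x22dc3000ULL, iova < 4 * PAGE_SIZE };
}

/* Stand-in for an IOMMU MAP notification delivered to vfio. */
static void notify_map(const IOMMUEntry *e)
{
    printf("replay MAP iova 0x%" PRIx64 " -> gpa 0x%" PRIx64 "\n",
           e->iova, e->translated);
}

/* "Replay": rebuild the shadow mappings for an entire iova range. */
static void replay_range(uint64_t start, uint64_t end)
{
    for (uint64_t iova = start; iova < end; iova += PAGE_SIZE) {
        IOMMUEntry e = guest_pgtable_lookup(iova);
        if (e.present) {
            notify_map(&e);
        }
    }
}

int main(void)
{
    replay_range(0, 16 * PAGE_SIZE);
    return 0;
}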

Online repo:

  https://github.com/xzpeter/qemu/tree/vtd-vfio-enablement-v4

I would be glad to hear any review comments on the above patches.

=========
Test Done
=========

Build test passed for x86_64/arm/ppc64.

Simple tests were done on x86_64, assigning two PCI devices to a
single VM and booting the VM using:

bin=x86_64-softmmu/qemu-system-x86_64
$bin -M q35,accel=kvm,kernel-irqchip=split -m 1G \
     -device intel-iommu,intremap=on,eim=off,caching-mode=on \
     -netdev user,id=net0,hostfwd=tcp::5555-:22 \
     -device virtio-net-pci,netdev=net0 \
     -device vfio-pci,host=03:00.0 \
     -device vfio-pci,host=02:00.0 \
     -trace events=".trace.vfio" \
     /var/lib/libvirt/images/vm1.qcow2

pxdev:bin [vtd-vfio-enablement]# cat .trace.vfio
vtd_page_walk*
vtd_replay*
vtd_inv_desc*

Then, in the guest, run the following tool:

  https://github.com/xzpeter/clibs/blob/master/gpl/userspace/vfio-bind-group/vfio-bind-group.c

With the parameters:

  ./vfio-bind-group 00:03.0 00:04.0

Checking the host-side trace log, I can see pages being replayed and
mapped into the 00:04.0 device address space, like:

...
vtd_replay_ce_valid replay valid context device 00:04.00 hi 0x401 lo 0x38fe1001
vtd_page_walk Page walk for ce (0x401, 0x38fe1001) iova range 0x0 - 0x8000000000
vtd_page_walk_level Page walk (base=0x38fe1000, level=3) iova range 0x0 - 0x8000000000
vtd_page_walk_level Page walk (base=0x35d31000, level=2) iova range 0x0 - 0x40000000
vtd_page_walk_level Page walk (base=0x34979000, level=1) iova range 0x0 - 0x200000
vtd_page_walk_one Page walk detected map level 0x1 iova 0x0 -> gpa 0x22dc3000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x1000 -> gpa 0x22e25000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x2000 -> gpa 0x22e12000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x3000 -> gpa 0x22e2d000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x4000 -> gpa 0x12a49000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x5000 -> gpa 0x129bb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x6000 -> gpa 0x128db000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x7000 -> gpa 0x12a80000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x8000 -> gpa 0x12a7e000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x9000 -> gpa 0x12b22000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0xa000 -> gpa 0x12b41000 mask 0xfff perm 3
...

=========
Todo List
=========

- error reporting for the assigned devices (as Tianyu has mentioned)

- per-domain address-space: a better solution in the future may be to
  maintain one address space per IOMMU domain in the guest (so that
  multiple devices can share the same address space if they share the
  same IOMMU domain in the guest), rather than one address space per
  device (which is the current VT-d implementation). However, that is
  a step beyond this series; let's first see whether we can provide a
  workable version of device assignment with VT-d protection. (A rough
  sketch of the idea is shown after this list.)

- more to come...
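
As a rough illustration of that per-domain idea (purely hypothetical;
none of the names below exist in QEMU or in this series), devices
that resolve to the same VT-d domain id would look up and share a
single per-domain address space object instead of each owning one:

/*
 * Hypothetical per-domain address space lookup -- illustration only,
 * none of these types exist in QEMU or in this series.
 */
#include <stdint.h>
#include <stdlib.h>

typedef struct DomainAS {
    uint16_t domain_id;      /* VT-d domain identifier from the context entry */
    int      refcount;       /* number of assigned devices attached to it */
    /* per-domain IOMMU memory region, notifier list, ... would live here */
    struct DomainAS *next;
} DomainAS;

static DomainAS *domain_as_list;

/* Get (or create) the shared address space for a guest IOMMU domain. */
static DomainAS *domain_as_get(uint16_t domain_id)
{
    DomainAS *d;

    for (d = domain_as_list; d; d = d->next) {
        if (d->domain_id == domain_id) {
            d->refcount++;
            return d;
        }
    }
    d = calloc(1, sizeof(*d));
    d->domain_id = domain_id;
    d->refcount = 1;
    d->next = domain_as_list;
    domain_as_list = d;
    return d;
}

/* Drop a device's reference; free the address space when unused. */
static void domain_as_put(DomainAS *d)
{
    if (--d->refcount == 0) {
        DomainAS **pp = &domain_as_list;
        while (*pp != d) {
            pp = &(*pp)->next;
        }
        *pp = d->next;
        free(d);
    }
}

int main(void)
{
    /* Two devices in guest domain 1 share one address space object. */
    DomainAS *dev_a = domain_as_get(1);
    DomainAS *dev_b = domain_as_get(1);
    domain_as_put(dev_b);
    domain_as_put(dev_a);
    return 0;
}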

Thanks,

Aviv Ben-David (1):
  IOMMU: add option to enable VTD_CAP_CM to vIOMMU capility exposoed to
    guest

Peter Xu (19):
  vfio: trace map/unmap for notify as well
  vfio: introduce vfio_get_vaddr()
  vfio: allow to notify unmap for very large region
  intel_iommu: simplify irq region translation
  intel_iommu: renaming gpa to iova where proper
  intel_iommu: fix trace for inv desc handling
  intel_iommu: fix trace for addr translation
  intel_iommu: vtd_slpt_level_shift check level
  memory: add section range info for IOMMU notifier
  memory: provide IOMMU_NOTIFIER_FOREACH macro
  memory: provide iommu_replay_all()
  memory: introduce memory_region_notify_one()
  memory: add MemoryRegionIOMMUOps.replay() callback
  intel_iommu: provide its own replay() callback
  intel_iommu: do replay when context invalidate
  intel_iommu: allow dynamic switch of IOMMU region
  intel_iommu: enable vfio devices
  intel_iommu: unmap existing pages before replay
  intel_iommu: replay even with DSI/GLOBAL inv desc

 hw/i386/intel_iommu.c          | 674 +++++++++++++++++++++++++++++++----------
 hw/i386/intel_iommu_internal.h |   2 +
 hw/i386/trace-events           |  30 ++
 hw/vfio/common.c               |  68 +++--
 hw/vfio/trace-events           |   2 +-
 hw/virtio/vhost.c              |   4 +-
 include/exec/memory.h          |  49 ++-
 include/hw/i386/intel_iommu.h  |  12 +
 memory.c                       |  47 ++-
 9 files changed, 696 insertions(+), 192 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH RFC v4 01/20] vfio: trace map/unmap for notify as well
  2017-01-20 13:08 [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances Peter Xu
@ 2017-01-20 13:08 ` Peter Xu
  2017-01-23 18:20   ` Alex Williamson
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 02/20] vfio: introduce vfio_get_vaddr() Peter Xu
                   ` (19 subsequent siblings)
  20 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2017-01-20 13:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

We trace its range, but we don't know whether it's a MAP or an UNMAP.
Let's dump that as well.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/vfio/common.c     | 3 ++-
 hw/vfio/trace-events | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 801578b..174f351 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -305,7 +305,8 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     void *vaddr;
     int ret;
 
-    trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
+    trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? "UNMAP" : "MAP",
+                                iova, iova + iotlb->addr_mask);
 
     if (iotlb->target_as != &address_space_memory) {
         error_report("Wrong target AS \"%s\", only system memory is allowed",
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index ef81609..7ae8233 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -84,7 +84,7 @@ vfio_pci_igd_lpc_bridge_enabled(const char *name) "%s"
 # hw/vfio/common.c
 vfio_region_write(const char *name, int index, uint64_t addr, uint64_t data, unsigned size) " (%s:region%d+0x%"PRIx64", 0x%"PRIx64 ", %d)"
 vfio_region_read(char *name, int index, uint64_t addr, unsigned size, uint64_t data) " (%s:region%d+0x%"PRIx64", %d) = 0x%"PRIx64
-vfio_iommu_map_notify(uint64_t iova_start, uint64_t iova_end) "iommu map @ %"PRIx64" - %"PRIx64
+vfio_iommu_map_notify(const char *op, uint64_t iova_start, uint64_t iova_end) "iommu %s @ %"PRIx64" - %"PRIx64
 vfio_listener_region_add_skip(uint64_t start, uint64_t end) "SKIPPING region_add %"PRIx64" - %"PRIx64
 vfio_listener_region_add_iommu(uint64_t start, uint64_t end) "region_add [iommu] %"PRIx64" - %"PRIx64
 vfio_listener_region_add_ram(uint64_t iova_start, uint64_t iova_end, void *vaddr) "region_add [ram] %"PRIx64" - %"PRIx64" [%p]"
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH RFC v4 02/20] vfio: introduce vfio_get_vaddr()
  2017-01-20 13:08 [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances Peter Xu
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 01/20] vfio: trace map/unmap for notify as well Peter Xu
@ 2017-01-20 13:08 ` Peter Xu
  2017-01-23 18:49   ` Alex Williamson
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 03/20] vfio: allow to notify unmap for very large region Peter Xu
                   ` (18 subsequent siblings)
  20 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2017-01-20 13:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

A cleanup for vfio_iommu_map_notify(). There should be no functional
change; this just makes the function shorter and easier to understand.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/vfio/common.c | 58 +++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 38 insertions(+), 20 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 174f351..ce55dff 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -294,25 +294,14 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
            section->offset_within_address_space & (1ULL << 63);
 }
 
-static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
+static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
+                           bool *read_only)
 {
-    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
-    VFIOContainer *container = giommu->container;
-    hwaddr iova = iotlb->iova + giommu->iommu_offset;
     MemoryRegion *mr;
     hwaddr xlat;
     hwaddr len = iotlb->addr_mask + 1;
-    void *vaddr;
-    int ret;
-
-    trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? "UNMAP" : "MAP",
-                                iova, iova + iotlb->addr_mask);
-
-    if (iotlb->target_as != &address_space_memory) {
-        error_report("Wrong target AS \"%s\", only system memory is allowed",
-                     iotlb->target_as->name ? iotlb->target_as->name : "none");
-        return;
-    }
+    bool ret = false;
+    bool writable = iotlb->perm & IOMMU_WO;
 
     /*
      * The IOMMU TLB entry we have just covers translation through
@@ -322,12 +311,13 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     rcu_read_lock();
     mr = address_space_translate(&address_space_memory,
                                  iotlb->translated_addr,
-                                 &xlat, &len, iotlb->perm & IOMMU_WO);
+                                 &xlat, &len, writable);
     if (!memory_region_is_ram(mr)) {
         error_report("iommu map to non memory area %"HWADDR_PRIx"",
                      xlat);
         goto out;
     }
+
     /*
      * Translation truncates length to the IOMMU page size,
      * check that it did not truncate too much.
@@ -337,11 +327,41 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
         goto out;
     }
 
+    *vaddr = memory_region_get_ram_ptr(mr) + xlat;
+    *read_only = !writable || mr->readonly;
+    ret = true;
+
+out:
+    rcu_read_unlock();
+    return ret;
+}
+
+static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
+{
+    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
+    VFIOContainer *container = giommu->container;
+    hwaddr iova = iotlb->iova + giommu->iommu_offset;
+    bool read_only;
+    void *vaddr;
+    int ret;
+
+    trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? "UNMAP" : "MAP",
+                                iova, iova + iotlb->addr_mask);
+
+    if (iotlb->target_as != &address_space_memory) {
+        error_report("Wrong target AS \"%s\", only system memory is allowed",
+                     iotlb->target_as->name ? iotlb->target_as->name : "none");
+        return;
+    }
+
+    if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
+        return;
+    }
+
     if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
-        vaddr = memory_region_get_ram_ptr(mr) + xlat;
         ret = vfio_dma_map(container, iova,
                            iotlb->addr_mask + 1, vaddr,
-                           !(iotlb->perm & IOMMU_WO) || mr->readonly);
+                           read_only);
         if (ret) {
             error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx", %p) = %d (%m)",
@@ -357,8 +377,6 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
                          iotlb->addr_mask + 1, ret);
         }
     }
-out:
-    rcu_read_unlock();
 }
 
 static void vfio_listener_region_add(MemoryListener *listener,
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH RFC v4 03/20] vfio: allow to notify unmap for very large region
  2017-01-20 13:08 [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances Peter Xu
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 01/20] vfio: trace map/unmap for notify as well Peter Xu
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 02/20] vfio: introduce vfio_get_vaddr() Peter Xu
@ 2017-01-20 13:08 ` Peter Xu
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 04/20] IOMMU: add option to enable VTD_CAP_CM to vIOMMU capility exposoed to guest Peter Xu
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 75+ messages in thread
From: Peter Xu @ 2017-01-20 13:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

The Linux vfio driver supports VFIO_IOMMU_UNMAP_DMA on a very big
region. This can be leveraged by the QEMU IOMMU implementation to
clean up existing page mappings for an entire iova address space (by
notifying with an IOTLB entry that has an extremely large addr_mask).
However, the current vfio_iommu_map_notify() does not allow that: it
makes sure that the translated addresses in the IOTLB entry fall into
a RAM range.

The check makes sense, but it is only meaningful for map operations
and means little for unmap operations.

This patch moves the check into the map logic only, so that we get
faster unmap handling (no need to translate again), and we can also
better support unmapping a very big region even when it covers
non-RAM ranges or even non-existing ranges.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/vfio/common.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index ce55dff..4d90844 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -354,11 +354,10 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
         return;
     }
 
-    if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
-        return;
-    }
-
     if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
+        if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
+            return;
+        }
         ret = vfio_dma_map(container, iova,
                            iotlb->addr_mask + 1, vaddr,
                            read_only);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH RFC v4 04/20] IOMMU: add option to enable VTD_CAP_CM to vIOMMU capility exposoed to guest
  2017-01-20 13:08 [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (2 preceding siblings ...)
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 03/20] vfio: allow to notify unmap for very large region Peter Xu
@ 2017-01-20 13:08 ` Peter Xu
  2017-01-22  2:51   ` [Qemu-devel] [PATCH RFC v4.1 04/20] intel_iommu: add "caching-mode" option Peter Xu
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 05/20] intel_iommu: simplify irq region translation Peter Xu
                   ` (16 subsequent siblings)
  20 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2017-01-20 13:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

From: Aviv Ben-David <bd.aviv@gmail.com>

This capability asks the guest to invalidate cache before each map operation.
We can use this invalidation to trap map operations in the hypervisor.

Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
[using "caching-mode" instead of "cache-mode" to align with spec]
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c          | 5 +++++
 hw/i386/intel_iommu_internal.h | 1 +
 include/hw/i386/intel_iommu.h  | 2 ++
 3 files changed, 8 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index ec62239..e58f1de 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2107,6 +2107,7 @@ static Property vtd_properties[] = {
     DEFINE_PROP_ON_OFF_AUTO("eim", IntelIOMMUState, intr_eim,
                             ON_OFF_AUTO_AUTO),
     DEFINE_PROP_BOOL("x-buggy-eim", IntelIOMMUState, buggy_eim, false),
+    DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -2488,6 +2489,10 @@ static void vtd_init(IntelIOMMUState *s)
         s->ecap |= VTD_ECAP_DT;
     }
 
+    if (s->caching_mode) {
+        s->cap |= VTD_CAP_CM;
+    }
+
     vtd_reset_context_cache(s);
     vtd_reset_iotlb(s);
 
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 356f188..4104121 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -202,6 +202,7 @@
 #define VTD_CAP_MAMV                (VTD_MAMV << 48)
 #define VTD_CAP_PSI                 (1ULL << 39)
 #define VTD_CAP_SLLPS               ((1ULL << 34) | (1ULL << 35))
+#define VTD_CAP_CM                  (1ULL << 7)
 
 /* Supported Adjusted Guest Address Widths */
 #define VTD_CAP_SAGAW_SHIFT         8
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 405c9d1..fe645aa 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -257,6 +257,8 @@ struct IntelIOMMUState {
     uint8_t womask[DMAR_REG_SIZE];  /* WO (write only - read returns 0) */
     uint32_t version;
 
+    bool caching_mode;          /* RO - is cap CM enabled? */
+
     dma_addr_t root;                /* Current root table pointer */
     bool root_extended;             /* Type of root table (extended or not) */
     bool dmar_enabled;              /* Set if DMA remapping is enabled */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH RFC v4 05/20] intel_iommu: simplify irq region translation
  2017-01-20 13:08 [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (3 preceding siblings ...)
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 04/20] IOMMU: add option to enable VTD_CAP_CM to vIOMMU capility exposoed to guest Peter Xu
@ 2017-01-20 13:08 ` Peter Xu
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 06/20] intel_iommu: renaming gpa to iova where proper Peter Xu
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 75+ messages in thread
From: Peter Xu @ 2017-01-20 13:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

Now that we have a standalone memory region for MSI, all irq region
requests should be redirected there. Clean up the block by replacing
it with an assertion.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 28 ++++++----------------------
 1 file changed, 6 insertions(+), 22 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index e58f1de..55b8ff4 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -818,28 +818,12 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     bool writes = true;
     VTDIOTLBEntry *iotlb_entry;
 
-    /* Check if the request is in interrupt address range */
-    if (vtd_is_interrupt_addr(addr)) {
-        if (is_write) {
-            /* FIXME: since we don't know the length of the access here, we
-             * treat Non-DWORD length write requests without PASID as
-             * interrupt requests, too. Withoud interrupt remapping support,
-             * we just use 1:1 mapping.
-             */
-            VTD_DPRINTF(MMU, "write request to interrupt address "
-                        "gpa 0x%"PRIx64, addr);
-            entry->iova = addr & VTD_PAGE_MASK_4K;
-            entry->translated_addr = addr & VTD_PAGE_MASK_4K;
-            entry->addr_mask = ~VTD_PAGE_MASK_4K;
-            entry->perm = IOMMU_WO;
-            return;
-        } else {
-            VTD_DPRINTF(GENERAL, "error: read request from interrupt address "
-                        "gpa 0x%"PRIx64, addr);
-            vtd_report_dmar_fault(s, source_id, addr, VTD_FR_READ, is_write);
-            return;
-        }
-    }
+    /*
+     * We have standalone memory region for interrupt addresses, we
+     * should never receive translation requests in this region.
+     */
+    assert(!vtd_is_interrupt_addr(addr));
+
     /* Try to fetch slpte form IOTLB */
     iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
     if (iotlb_entry) {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH RFC v4 06/20] intel_iommu: renaming gpa to iova where proper
  2017-01-20 13:08 [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (4 preceding siblings ...)
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 05/20] intel_iommu: simplify irq region translation Peter Xu
@ 2017-01-20 13:08 ` Peter Xu
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 07/20] intel_iommu: fix trace for inv desc handling Peter Xu
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 75+ messages in thread
From: Peter Xu @ 2017-01-20 13:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

There are lots of places in the current intel_iommu.c code that name
"iova" as "gpa". It is really confusing to use the name "gpa" in
these places (it is easily understood as "Guest Physical Address",
while it is not). To make the code (much) easier to read, I decided
to do the renaming once and for all.

No functional change is made, only literal ones.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 36 ++++++++++++++++++------------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 55b8ff4..b934b56 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -259,7 +259,7 @@ static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
     uint64_t *key = g_malloc(sizeof(*key));
     uint64_t gfn = vtd_get_iotlb_gfn(addr, level);
 
-    VTD_DPRINTF(CACHE, "update iotlb sid 0x%"PRIx16 " gpa 0x%"PRIx64
+    VTD_DPRINTF(CACHE, "update iotlb sid 0x%"PRIx16 " iova 0x%"PRIx64
                 " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr, slpte,
                 domain_id);
     if (g_hash_table_size(s->iotlb) >= VTD_IOTLB_MAX_SIZE) {
@@ -575,12 +575,12 @@ static uint64_t vtd_get_slpte(dma_addr_t base_addr, uint32_t index)
     return slpte;
 }
 
-/* Given a gpa and the level of paging structure, return the offset of current
- * level.
+/* Given an iova and the level of paging structure, return the offset
+ * of current level.
  */
-static inline uint32_t vtd_gpa_level_offset(uint64_t gpa, uint32_t level)
+static inline uint32_t vtd_iova_level_offset(uint64_t iova, uint32_t level)
 {
-    return (gpa >> vtd_slpt_level_shift(level)) &
+    return (iova >> vtd_slpt_level_shift(level)) &
             ((1ULL << VTD_SL_LEVEL_BITS) - 1);
 }
 
@@ -628,10 +628,10 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
     }
 }
 
-/* Given the @gpa, get relevant @slptep. @slpte_level will be the last level
+/* Given the @iova, get relevant @slptep. @slpte_level will be the last level
  * of the translation, can be used for deciding the size of large page.
  */
-static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
+static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
                             uint64_t *slptep, uint32_t *slpte_level,
                             bool *reads, bool *writes)
 {
@@ -642,11 +642,11 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
     uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
     uint64_t access_right_check;
 
-    /* Check if @gpa is above 2^X-1, where X is the minimum of MGAW in CAP_REG
-     * and AW in context-entry.
+    /* Check if @iova is above 2^X-1, where X is the minimum of MGAW
+     * in CAP_REG and AW in context-entry.
      */
-    if (gpa & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
-        VTD_DPRINTF(GENERAL, "error: gpa 0x%"PRIx64 " exceeds limits", gpa);
+    if (iova & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
+        VTD_DPRINTF(GENERAL, "error: iova 0x%"PRIx64 " exceeds limits", iova);
         return -VTD_FR_ADDR_BEYOND_MGAW;
     }
 
@@ -654,13 +654,13 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
     access_right_check = is_write ? VTD_SL_W : VTD_SL_R;
 
     while (true) {
-        offset = vtd_gpa_level_offset(gpa, level);
+        offset = vtd_iova_level_offset(iova, level);
         slpte = vtd_get_slpte(addr, offset);
 
         if (slpte == (uint64_t)-1) {
             VTD_DPRINTF(GENERAL, "error: fail to access second-level paging "
-                        "entry at level %"PRIu32 " for gpa 0x%"PRIx64,
-                        level, gpa);
+                        "entry at level %"PRIu32 " for iova 0x%"PRIx64,
+                        level, iova);
             if (level == vtd_get_level_from_context_entry(ce)) {
                 /* Invalid programming of context-entry */
                 return -VTD_FR_CONTEXT_ENTRY_INV;
@@ -672,8 +672,8 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
         *writes = (*writes) && (slpte & VTD_SL_W);
         if (!(slpte & access_right_check)) {
             VTD_DPRINTF(GENERAL, "error: lack of %s permission for "
-                        "gpa 0x%"PRIx64 " slpte 0x%"PRIx64,
-                        (is_write ? "write" : "read"), gpa, slpte);
+                        "iova 0x%"PRIx64 " slpte 0x%"PRIx64,
+                        (is_write ? "write" : "read"), iova, slpte);
             return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
         }
         if (vtd_slpte_nonzero_rsvd(slpte, level)) {
@@ -827,7 +827,7 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     /* Try to fetch slpte form IOTLB */
     iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
     if (iotlb_entry) {
-        VTD_DPRINTF(CACHE, "hit iotlb sid 0x%"PRIx16 " gpa 0x%"PRIx64
+        VTD_DPRINTF(CACHE, "hit iotlb sid 0x%"PRIx16 " iova 0x%"PRIx64
                     " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr,
                     iotlb_entry->slpte, iotlb_entry->domain_id);
         slpte = iotlb_entry->slpte;
@@ -2025,7 +2025,7 @@ static IOMMUTLBEntry vtd_iommu_translate(MemoryRegion *iommu, hwaddr addr,
                            is_write, &ret);
     VTD_DPRINTF(MMU,
                 "bus %"PRIu8 " slot %"PRIu8 " func %"PRIu8 " devfn %"PRIu8
-                " gpa 0x%"PRIx64 " hpa 0x%"PRIx64, pci_bus_num(vtd_as->bus),
+                " iova 0x%"PRIx64 " hpa 0x%"PRIx64, pci_bus_num(vtd_as->bus),
                 VTD_PCI_SLOT(vtd_as->devfn), VTD_PCI_FUNC(vtd_as->devfn),
                 vtd_as->devfn, addr, ret.translated_addr);
     return ret;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH RFC v4 07/20] intel_iommu: fix trace for inv desc handling
  2017-01-20 13:08 [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (5 preceding siblings ...)
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 06/20] intel_iommu: renaming gpa to iova where proper Peter Xu
@ 2017-01-20 13:08 ` Peter Xu
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 08/20] intel_iommu: fix trace for addr translation Peter Xu
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 75+ messages in thread
From: Peter Xu @ 2017-01-20 13:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

The VT-d code is still using the static DEBUG_INTEL_IOMMU macro.
That's not good; we should end the days when we need to recompile the
code before getting useful debugging information for VT-d. Time to
switch to the trace system.

This is the first patch to do it.

Generally, my rules are:

- for the old GENERAL-typed messages, use trace_vtd_err*() in general;

- for the non-GENERAL-typed messages, convert them into specific
  trace_*() calls;

- for useless DPRINTFs, remove them.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 97 +++++++++++++++++++++------------------------------
 hw/i386/trace-events  | 15 ++++++++
 2 files changed, 54 insertions(+), 58 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index b934b56..343a2ad 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -35,6 +35,7 @@
 #include "sysemu/kvm.h"
 #include "hw/i386/apic_internal.h"
 #include "kvm_i386.h"
+#include "trace.h"
 
 /*#define DEBUG_INTEL_IOMMU*/
 #ifdef DEBUG_INTEL_IOMMU
@@ -474,22 +475,19 @@ static void vtd_handle_inv_queue_error(IntelIOMMUState *s)
 /* Set the IWC field and try to generate an invalidation completion interrupt */
 static void vtd_generate_completion_event(IntelIOMMUState *s)
 {
-    VTD_DPRINTF(INV, "completes an invalidation wait command with "
-                "Interrupt Flag");
     if (vtd_get_long_raw(s, DMAR_ICS_REG) & VTD_ICS_IWC) {
-        VTD_DPRINTF(INV, "there is a previous interrupt condition to be "
-                    "serviced by software, "
-                    "new invalidation event is not generated");
+        trace_vtd_inv_desc_wait_irq("One pending, skip current");
         return;
     }
     vtd_set_clear_mask_long(s, DMAR_ICS_REG, 0, VTD_ICS_IWC);
     vtd_set_clear_mask_long(s, DMAR_IECTL_REG, 0, VTD_IECTL_IP);
     if (vtd_get_long_raw(s, DMAR_IECTL_REG) & VTD_IECTL_IM) {
-        VTD_DPRINTF(INV, "IM filed in IECTL_REG is set, new invalidation "
-                    "event is not generated");
+        trace_vtd_inv_desc_wait_irq("IM in IECTL_REG is set, "
+                                    "new event not generated");
         return;
     } else {
         /* Generate the interrupt event */
+        trace_vtd_inv_desc_wait_irq("Generating complete event");
         vtd_generate_interrupt(s, DMAR_IEADDR_REG, DMAR_IEDATA_REG);
         vtd_set_clear_mask_long(s, DMAR_IECTL_REG, VTD_IECTL_IP, 0);
     }
@@ -923,6 +921,7 @@ static void vtd_interrupt_remap_table_setup(IntelIOMMUState *s)
 
 static void vtd_context_global_invalidate(IntelIOMMUState *s)
 {
+    trace_vtd_inv_desc_cc_global();
     s->context_cache_gen++;
     if (s->context_cache_gen == VTD_CONTEXT_CACHE_GEN_MAX) {
         vtd_reset_context_cache(s);
@@ -962,9 +961,11 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
     uint16_t mask;
     VTDBus *vtd_bus;
     VTDAddressSpace *vtd_as;
-    uint16_t devfn;
+    uint8_t bus_n, devfn;
     uint16_t devfn_it;
 
+    trace_vtd_inv_desc_cc_devices(source_id, func_mask);
+
     switch (func_mask & 3) {
     case 0:
         mask = 0;   /* No bits in the SID field masked */
@@ -980,16 +981,16 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
         break;
     }
     mask = ~mask;
-    VTD_DPRINTF(INV, "device-selective invalidation source 0x%"PRIx16
-                    " mask %"PRIu16, source_id, mask);
-    vtd_bus = vtd_find_as_from_bus_num(s, VTD_SID_TO_BUS(source_id));
+
+    bus_n = VTD_SID_TO_BUS(source_id);
+    vtd_bus = vtd_find_as_from_bus_num(s, bus_n);
     if (vtd_bus) {
         devfn = VTD_SID_TO_DEVFN(source_id);
         for (devfn_it = 0; devfn_it < X86_IOMMU_PCI_DEVFN_MAX; ++devfn_it) {
             vtd_as = vtd_bus->dev_as[devfn_it];
             if (vtd_as && ((devfn_it & mask) == (devfn & mask))) {
-                VTD_DPRINTF(INV, "invalidate context-cahce of devfn 0x%"PRIx16,
-                            devfn_it);
+                trace_vtd_inv_desc_cc_device(bus_n, VTD_PCI_SLOT(devfn_it),
+                                             VTD_PCI_FUNC(devfn_it));
                 vtd_as->context_cache_entry.context_cache_gen = 0;
             }
         }
@@ -1302,9 +1303,7 @@ static bool vtd_process_wait_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
 {
     if ((inv_desc->hi & VTD_INV_DESC_WAIT_RSVD_HI) ||
         (inv_desc->lo & VTD_INV_DESC_WAIT_RSVD_LO)) {
-        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in Invalidation "
-                    "Wait Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
-                    inv_desc->hi, inv_desc->lo);
+        trace_vtd_err_nonzero_reserved("invalidation wait desc");
         return false;
     }
     if (inv_desc->lo & VTD_INV_DESC_WAIT_SW) {
@@ -1316,21 +1315,18 @@ static bool vtd_process_wait_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
 
         /* FIXME: need to be masked with HAW? */
         dma_addr_t status_addr = inv_desc->hi;
-        VTD_DPRINTF(INV, "status data 0x%x, status addr 0x%"PRIx64,
-                    status_data, status_addr);
+        trace_vtd_inv_desc_wait_sw(status_addr, status_data);
         status_data = cpu_to_le32(status_data);
         if (dma_memory_write(&address_space_memory, status_addr, &status_data,
                              sizeof(status_data))) {
-            VTD_DPRINTF(GENERAL, "error: fail to perform a coherent write");
+            trace_vtd_err("Invalidate Desc Wait status write failed");
             return false;
         }
     } else if (inv_desc->lo & VTD_INV_DESC_WAIT_IF) {
         /* Interrupt flag */
-        VTD_DPRINTF(INV, "Invalidation Wait Descriptor interrupt completion");
         vtd_generate_completion_event(s);
     } else {
-        VTD_DPRINTF(GENERAL, "error: invalid Invalidation Wait Descriptor: "
-                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, inv_desc->hi, inv_desc->lo);
+        trace_vtd_err("invalid Invalidation Wait Descriptor");
         return false;
     }
     return true;
@@ -1339,30 +1335,30 @@ static bool vtd_process_wait_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
 static bool vtd_process_context_cache_desc(IntelIOMMUState *s,
                                            VTDInvDesc *inv_desc)
 {
+    uint16_t sid, fmask;
+
     if ((inv_desc->lo & VTD_INV_DESC_CC_RSVD) || inv_desc->hi) {
-        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in Context-cache "
-                    "Invalidate Descriptor");
+        trace_vtd_err_nonzero_reserved("Context-cache invalidation desc");
         return false;
     }
     switch (inv_desc->lo & VTD_INV_DESC_CC_G) {
     case VTD_INV_DESC_CC_DOMAIN:
-        VTD_DPRINTF(INV, "domain-selective invalidation domain 0x%"PRIx16,
-                    (uint16_t)VTD_INV_DESC_CC_DID(inv_desc->lo));
+        trace_vtd_inv_desc_cc_domain(
+            (uint16_t)VTD_INV_DESC_CC_DID(inv_desc->lo));
         /* Fall through */
     case VTD_INV_DESC_CC_GLOBAL:
-        VTD_DPRINTF(INV, "global invalidation");
         vtd_context_global_invalidate(s);
         break;
 
     case VTD_INV_DESC_CC_DEVICE:
-        vtd_context_device_invalidate(s, VTD_INV_DESC_CC_SID(inv_desc->lo),
-                                      VTD_INV_DESC_CC_FM(inv_desc->lo));
+        sid = VTD_INV_DESC_CC_SID(inv_desc->lo);
+        fmask = VTD_INV_DESC_CC_FM(inv_desc->lo);
+        vtd_context_device_invalidate(s, sid, fmask);
         break;
 
     default:
-        VTD_DPRINTF(GENERAL, "error: invalid granularity in Context-cache "
-                    "Invalidate Descriptor hi 0x%"PRIx64  " lo 0x%"PRIx64,
-                    inv_desc->hi, inv_desc->lo);
+        trace_vtd_err("invalid granularity in Context-cache "
+                      "Invalidate Descriptor");
         return false;
     }
     return true;
@@ -1376,22 +1372,19 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
 
     if ((inv_desc->lo & VTD_INV_DESC_IOTLB_RSVD_LO) ||
         (inv_desc->hi & VTD_INV_DESC_IOTLB_RSVD_HI)) {
-        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in IOTLB "
-                    "Invalidate Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
-                    inv_desc->hi, inv_desc->lo);
+        trace_vtd_err_nonzero_reserved("IOTLB invalidation desc");
         return false;
     }
 
     switch (inv_desc->lo & VTD_INV_DESC_IOTLB_G) {
     case VTD_INV_DESC_IOTLB_GLOBAL:
-        VTD_DPRINTF(INV, "global invalidation");
+        trace_vtd_inv_desc_iotlb_global();
         vtd_iotlb_global_invalidate(s);
         break;
 
     case VTD_INV_DESC_IOTLB_DOMAIN:
         domain_id = VTD_INV_DESC_IOTLB_DID(inv_desc->lo);
-        VTD_DPRINTF(INV, "domain-selective invalidation domain 0x%"PRIx16,
-                    domain_id);
+        trace_vtd_inv_desc_iotlb_domain(domain_id);
         vtd_iotlb_domain_invalidate(s, domain_id);
         break;
 
@@ -1399,20 +1392,16 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
         domain_id = VTD_INV_DESC_IOTLB_DID(inv_desc->lo);
         addr = VTD_INV_DESC_IOTLB_ADDR(inv_desc->hi);
         am = VTD_INV_DESC_IOTLB_AM(inv_desc->hi);
-        VTD_DPRINTF(INV, "page-selective invalidation domain 0x%"PRIx16
-                    " addr 0x%"PRIx64 " mask %"PRIu8, domain_id, addr, am);
+        trace_vtd_inv_desc_iotlb_pages(domain_id, addr, am);
         if (am > VTD_MAMV) {
-            VTD_DPRINTF(GENERAL, "error: supported max address mask value is "
-                        "%"PRIu8, (uint8_t)VTD_MAMV);
+            trace_vtd_err("IOTLB page inv desc addr mask overflow");
             return false;
         }
         vtd_iotlb_page_invalidate(s, domain_id, addr, am);
         break;
 
     default:
-        VTD_DPRINTF(GENERAL, "error: invalid granularity in IOTLB Invalidate "
-                    "Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
-                    inv_desc->hi, inv_desc->lo);
+        trace_vtd_err("invalid granularity in IOTLB inv desc");
         return false;
     }
     return true;
@@ -1492,7 +1481,6 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
     VTDInvDesc inv_desc;
     uint8_t desc_type;
 
-    VTD_DPRINTF(INV, "iq head %"PRIu16, s->iq_head);
     if (!vtd_get_inv_desc(s->iq, s->iq_head, &inv_desc)) {
         s->iq_last_desc_type = VTD_INV_DESC_NONE;
         return false;
@@ -1503,33 +1491,28 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
 
     switch (desc_type) {
     case VTD_INV_DESC_CC:
-        VTD_DPRINTF(INV, "Context-cache Invalidate Descriptor hi 0x%"PRIx64
-                    " lo 0x%"PRIx64, inv_desc.hi, inv_desc.lo);
+        trace_vtd_inv_desc("context-cache", inv_desc.hi, inv_desc.lo);
         if (!vtd_process_context_cache_desc(s, &inv_desc)) {
             return false;
         }
         break;
 
     case VTD_INV_DESC_IOTLB:
-        VTD_DPRINTF(INV, "IOTLB Invalidate Descriptor hi 0x%"PRIx64
-                    " lo 0x%"PRIx64, inv_desc.hi, inv_desc.lo);
+        trace_vtd_inv_desc("iotlb", inv_desc.hi, inv_desc.lo);
         if (!vtd_process_iotlb_desc(s, &inv_desc)) {
             return false;
         }
         break;
 
     case VTD_INV_DESC_WAIT:
-        VTD_DPRINTF(INV, "Invalidation Wait Descriptor hi 0x%"PRIx64
-                    " lo 0x%"PRIx64, inv_desc.hi, inv_desc.lo);
+        trace_vtd_inv_desc("wait", inv_desc.hi, inv_desc.lo);
         if (!vtd_process_wait_desc(s, &inv_desc)) {
             return false;
         }
         break;
 
     case VTD_INV_DESC_IEC:
-        VTD_DPRINTF(INV, "Invalidation Interrupt Entry Cache "
-                    "Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
-                    inv_desc.hi, inv_desc.lo);
+        trace_vtd_inv_desc("iec", inv_desc.hi, inv_desc.lo);
         if (!vtd_process_inv_iec_desc(s, &inv_desc)) {
             return false;
         }
@@ -1544,9 +1527,7 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
         break;
 
     default:
-        VTD_DPRINTF(GENERAL, "error: unkonw Invalidation Descriptor type "
-                    "hi 0x%"PRIx64 " lo 0x%"PRIx64 " type %"PRIu8,
-                    inv_desc.hi, inv_desc.lo, desc_type);
+        trace_vtd_err("Unkonw Invalidation Descriptor type");
         return false;
     }
     s->iq_head++;
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index d2b4973..eea3e84 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -10,6 +10,21 @@ xen_pv_mmio_write(uint64_t addr) "WARNING: write to Xen PV Device MMIO space (ad
 # hw/i386/x86-iommu.c
 x86_iommu_iec_notify(bool global, uint32_t index, uint32_t mask) "Notify IEC invalidation: global=%d index=%" PRIu32 " mask=%" PRIu32
 
+# hw/i386/intel_iommu.c
+vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
+vtd_inv_desc(const char *type, uint64_t hi, uint64_t lo) "invalidate desc type %s high 0x%"PRIx64" low 0x%"PRIx64
+vtd_inv_desc_cc_domain(uint16_t domain) "context invalidate domain 0x%"PRIx16
+vtd_inv_desc_cc_global(void) "context invalidate globally"
+vtd_inv_desc_cc_device(uint8_t bus, uint8_t dev, uint8_t fn) "context invalidate device %02"PRIx8":%02"PRIx8".%02"PRIx8
+vtd_inv_desc_cc_devices(uint16_t sid, uint16_t fmask) "context invalidate devices sid 0x%"PRIx16" fmask 0x%"PRIx16
+vtd_inv_desc_iotlb_global(void) "iotlb invalidate global"
+vtd_inv_desc_iotlb_domain(uint16_t domain) "iotlb invalidate whole domain 0x%"PRIx16
+vtd_inv_desc_iotlb_pages(uint16_t domain, uint64_t addr, uint8_t mask) "iotlb invalidate domain 0x%"PRIx16" addr 0x%"PRIx64" mask 0x%"PRIx8
+vtd_inv_desc_wait_sw(uint64_t addr, uint32_t data) "wait invalidate status write addr 0x%"PRIx64" data 0x%"PRIx32
+vtd_inv_desc_wait_irq(const char *msg) "%s"
+vtd_err_nonzero_reserved(const char *msg) "Non-zero reserved field in %s"
+vtd_err(const char *msg) "%s"
+
 # hw/i386/amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
 amdvi_cache_update(uint16_t domid, uint8_t bus, uint8_t slot, uint8_t func, uint64_t gpa, uint64_t txaddr) " update iotlb domid 0x%"PRIx16" devid: %02x:%02x.%x gpa 0x%"PRIx64" hpa 0x%"PRIx64
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH RFC v4 08/20] intel_iommu: fix trace for addr translation
  2017-01-20 13:08 [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (6 preceding siblings ...)
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 07/20] intel_iommu: fix trace for inv desc handling Peter Xu
@ 2017-01-20 13:08 ` Peter Xu
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 09/20] intel_iommu: vtd_slpt_level_shift check level Peter Xu
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 75+ messages in thread
From: Peter Xu @ 2017-01-20 13:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

Another patch to convert the DPRINTF() stuff. This one focuses on
the address translation path and caching.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 84 ++++++++++++++++++++-------------------------------
 hw/i386/trace-events  |  7 +++++
 2 files changed, 39 insertions(+), 52 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 343a2ad..2c13b7b 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -260,11 +260,9 @@ static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
     uint64_t *key = g_malloc(sizeof(*key));
     uint64_t gfn = vtd_get_iotlb_gfn(addr, level);
 
-    VTD_DPRINTF(CACHE, "update iotlb sid 0x%"PRIx16 " iova 0x%"PRIx64
-                " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr, slpte,
-                domain_id);
+    trace_vtd_iotlb_page_update(source_id, addr, slpte, domain_id);
     if (g_hash_table_size(s->iotlb) >= VTD_IOTLB_MAX_SIZE) {
-        VTD_DPRINTF(CACHE, "iotlb exceeds size limit, forced to reset");
+        trace_vtd_iotlb_reset("iotlb exceeds size limit");
         vtd_reset_iotlb(s);
     }
 
@@ -505,8 +503,7 @@ static int vtd_get_root_entry(IntelIOMMUState *s, uint8_t index,
 
     addr = s->root + index * sizeof(*re);
     if (dma_memory_read(&address_space_memory, addr, re, sizeof(*re))) {
-        VTD_DPRINTF(GENERAL, "error: fail to access root-entry at 0x%"PRIx64
-                    " + %"PRIu8, s->root, index);
+        trace_vtd_err("Fail to access root-entry");
         re->val = 0;
         return -VTD_FR_ROOT_TABLE_INV;
     }
@@ -525,14 +522,12 @@ static int vtd_get_context_entry_from_root(VTDRootEntry *root, uint8_t index,
     dma_addr_t addr;
 
     if (!vtd_root_entry_present(root)) {
-        VTD_DPRINTF(GENERAL, "error: root-entry is not present");
+        trace_vtd_err("Root-entry is not present");
         return -VTD_FR_ROOT_ENTRY_P;
     }
     addr = (root->val & VTD_ROOT_ENTRY_CTP) + index * sizeof(*ce);
     if (dma_memory_read(&address_space_memory, addr, ce, sizeof(*ce))) {
-        VTD_DPRINTF(GENERAL, "error: fail to access context-entry at 0x%"PRIx64
-                    " + %"PRIu8,
-                    (uint64_t)(root->val & VTD_ROOT_ENTRY_CTP), index);
+        trace_vtd_err("Fail to access context-entry");
         return -VTD_FR_CONTEXT_TABLE_INV;
     }
     ce->lo = le64_to_cpu(ce->lo);
@@ -644,7 +639,7 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
      * in CAP_REG and AW in context-entry.
      */
     if (iova & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
-        VTD_DPRINTF(GENERAL, "error: iova 0x%"PRIx64 " exceeds limits", iova);
+        trace_vtd_err("IOVA exceeds limits");
         return -VTD_FR_ADDR_BEYOND_MGAW;
     }
 
@@ -656,9 +651,7 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
         slpte = vtd_get_slpte(addr, offset);
 
         if (slpte == (uint64_t)-1) {
-            VTD_DPRINTF(GENERAL, "error: fail to access second-level paging "
-                        "entry at level %"PRIu32 " for iova 0x%"PRIx64,
-                        level, iova);
+            trace_vtd_err("Fail to access second-level paging entry");
             if (level == vtd_get_level_from_context_entry(ce)) {
                 /* Invalid programming of context-entry */
                 return -VTD_FR_CONTEXT_ENTRY_INV;
@@ -669,15 +662,11 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
         *reads = (*reads) && (slpte & VTD_SL_R);
         *writes = (*writes) && (slpte & VTD_SL_W);
         if (!(slpte & access_right_check)) {
-            VTD_DPRINTF(GENERAL, "error: lack of %s permission for "
-                        "iova 0x%"PRIx64 " slpte 0x%"PRIx64,
-                        (is_write ? "write" : "read"), iova, slpte);
+            trace_vtd_err("Lack of permission for page");
             return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
         }
         if (vtd_slpte_nonzero_rsvd(slpte, level)) {
-            VTD_DPRINTF(GENERAL, "error: non-zero reserved field in second "
-                        "level paging entry level %"PRIu32 " slpte 0x%"PRIx64,
-                        level, slpte);
+            trace_vtd_err_nonzero_reserved("second level paging entry");
             return -VTD_FR_PAGING_ENTRY_RSVD;
         }
 
@@ -704,12 +693,11 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
     }
 
     if (!vtd_root_entry_present(&re)) {
-        VTD_DPRINTF(GENERAL, "error: root-entry #%"PRIu8 " is not present",
-                    bus_num);
+        /* Not error - it's okay we don't have root entry. */
+        trace_vtd_re_not_present(bus_num);
         return -VTD_FR_ROOT_ENTRY_P;
     } else if (re.rsvd || (re.val & VTD_ROOT_ENTRY_RSVD)) {
-        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in root-entry "
-                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, re.rsvd, re.val);
+        trace_vtd_err_nonzero_reserved("Root entry");
         return -VTD_FR_ROOT_ENTRY_RSVD;
     }
 
@@ -719,22 +707,17 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
     }
 
     if (!vtd_context_entry_present(ce)) {
-        VTD_DPRINTF(GENERAL,
-                    "error: context-entry #%"PRIu8 "(bus #%"PRIu8 ") "
-                    "is not present", devfn, bus_num);
+        /* Not error - it's okay we don't have context entry. */
+        trace_vtd_ce_not_present(bus_num, devfn);
         return -VTD_FR_CONTEXT_ENTRY_P;
     } else if ((ce->hi & VTD_CONTEXT_ENTRY_RSVD_HI) ||
                (ce->lo & VTD_CONTEXT_ENTRY_RSVD_LO)) {
-        VTD_DPRINTF(GENERAL,
-                    "error: non-zero reserved field in context-entry "
-                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, ce->hi, ce->lo);
+        trace_vtd_err_nonzero_reserved("Context entry");
         return -VTD_FR_CONTEXT_ENTRY_RSVD;
     }
     /* Check if the programming of context-entry is valid */
     if (!vtd_is_level_supported(s, vtd_get_level_from_context_entry(ce))) {
-        VTD_DPRINTF(GENERAL, "error: unsupported Address Width value in "
-                    "context-entry hi 0x%"PRIx64 " lo 0x%"PRIx64,
-                    ce->hi, ce->lo);
+        trace_vtd_err("Unsupported Address Width value in context-entry");
         return -VTD_FR_CONTEXT_ENTRY_INV;
     } else {
         switch (ce->lo & VTD_CONTEXT_ENTRY_TT) {
@@ -743,9 +726,7 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
         case VTD_CONTEXT_TT_DEV_IOTLB:
             break;
         default:
-            VTD_DPRINTF(GENERAL, "error: unsupported Translation Type in "
-                        "context-entry hi 0x%"PRIx64 " lo 0x%"PRIx64,
-                        ce->hi, ce->lo);
+            trace_vtd_err("Unsupported Translation Type in context-entry");
             return -VTD_FR_CONTEXT_ENTRY_INV;
         }
     }
@@ -825,9 +806,8 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     /* Try to fetch slpte form IOTLB */
     iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
     if (iotlb_entry) {
-        VTD_DPRINTF(CACHE, "hit iotlb sid 0x%"PRIx16 " iova 0x%"PRIx64
-                    " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr,
-                    iotlb_entry->slpte, iotlb_entry->domain_id);
+        trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->slpte,
+                                 iotlb_entry->domain_id);
         slpte = iotlb_entry->slpte;
         reads = iotlb_entry->read_flags;
         writes = iotlb_entry->write_flags;
@@ -836,10 +816,9 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     }
     /* Try to fetch context-entry from cache first */
     if (cc_entry->context_cache_gen == s->context_cache_gen) {
-        VTD_DPRINTF(CACHE, "hit context-cache bus %d devfn %d "
-                    "(hi %"PRIx64 " lo %"PRIx64 " gen %"PRIu32 ")",
-                    bus_num, devfn, cc_entry->context_entry.hi,
-                    cc_entry->context_entry.lo, cc_entry->context_cache_gen);
+        trace_vtd_iotlb_cc_hit(bus_num, devfn, cc_entry->context_entry.hi,
+                               cc_entry->context_entry.lo,
+                               cc_entry->context_cache_gen);
         ce = cc_entry->context_entry;
         is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
     } else {
@@ -848,19 +827,18 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         if (ret_fr) {
             ret_fr = -ret_fr;
             if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
-                VTD_DPRINTF(FLOG, "fault processing is disabled for DMA "
-                            "requests through this context-entry "
-                            "(with FPD Set)");
+                trace_vtd_err("Fault processing is disabled for DMA "
+                              "requests through this context-entry "
+                              "(with FPD Set)");
             } else {
                 vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
             }
             return;
         }
         /* Update context-cache */
-        VTD_DPRINTF(CACHE, "update context-cache bus %d devfn %d "
-                    "(hi %"PRIx64 " lo %"PRIx64 " gen %"PRIu32 "->%"PRIu32 ")",
-                    bus_num, devfn, ce.hi, ce.lo,
-                    cc_entry->context_cache_gen, s->context_cache_gen);
+        trace_vtd_iotlb_cc_update(bus_num, devfn, ce.hi, ce.lo,
+                                  cc_entry->context_cache_gen,
+                                  s->context_cache_gen);
         cc_entry->context_entry = ce;
         cc_entry->context_cache_gen = s->context_cache_gen;
     }
@@ -870,8 +848,9 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     if (ret_fr) {
         ret_fr = -ret_fr;
         if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
-            VTD_DPRINTF(FLOG, "fault processing is disabled for DMA requests "
-                        "through this context-entry (with FPD Set)");
+            trace_vtd_err("Fault processing is disabled for DMA "
+                          "requests through this context-entry "
+                          "(with FPD Set)");
         } else {
             vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
         }
@@ -1031,6 +1010,7 @@ static uint64_t vtd_context_cache_invalidate(IntelIOMMUState *s, uint64_t val)
 
 static void vtd_iotlb_global_invalidate(IntelIOMMUState *s)
 {
+    trace_vtd_iotlb_reset("global invalidation recved");
     vtd_reset_iotlb(s);
 }
 
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index eea3e84..a273980 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -24,6 +24,13 @@ vtd_inv_desc_wait_sw(uint64_t addr, uint32_t data) "wait invalidate status write
 vtd_inv_desc_wait_irq(const char *msg) "%s"
 vtd_err_nonzero_reserved(const char *msg) "Non-zero reserved field in %s"
 vtd_err(const char *msg) "%s"
+vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
+vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
+vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
+vtd_iotlb_page_update(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page update sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
+vtd_iotlb_cc_hit(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen) "IOTLB context hit bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32
+vtd_iotlb_cc_update(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen1, uint32_t gen2) "IOTLB context update bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32" -> gen %"PRIu32
+vtd_iotlb_reset(const char *reason) "IOTLB reset (reason: %s)"
 
 # hw/i386/amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH RFC v4 09/20] intel_iommu: vtd_slpt_level_shift check level
  2017-01-20 13:08 [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (7 preceding siblings ...)
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 08/20] intel_iommu: fix trace for addr translation Peter Xu
@ 2017-01-20 13:08 ` Peter Xu
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 10/20] memory: add section range info for IOMMU notifier Peter Xu
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 75+ messages in thread
From: Peter Xu @ 2017-01-20 13:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

This helps with debugging when an incorrect level is passed in.
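
For reference, with VTD_PAGE_SHIFT_4K (12) and VTD_SL_LEVEL_BITS (9) the
valid levels map to:

    level 1 -> shift 12  (4KiB)
    level 2 -> shift 21  (2MiB)
    level 3 -> shift 30  (1GiB)
    level 4 -> shift 39  (512GiB)

A level of 0 would make (level - 1) wrap around as an unsigned value and
produce a bogus shift, which the new assertion catches early.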

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 2c13b7b..6f5f68a 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -168,6 +168,7 @@ static gboolean vtd_hash_remove_by_domain(gpointer key, gpointer value,
 /* The shift of an addr for a certain level of paging structure */
 static inline uint32_t vtd_slpt_level_shift(uint32_t level)
 {
+    assert(level != 0);
     return VTD_PAGE_SHIFT_4K + (level - 1) * VTD_SL_LEVEL_BITS;
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH RFC v4 10/20] memory: add section range info for IOMMU notifier
  2017-01-20 13:08 [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (8 preceding siblings ...)
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 09/20] intel_iommu: vtd_slpt_level_shift check level Peter Xu
@ 2017-01-20 13:08 ` Peter Xu
  2017-01-23 19:12   ` Alex Williamson
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 11/20] memory: provide IOMMU_NOTIFIER_FOREACH macro Peter Xu
                   ` (10 subsequent siblings)
  20 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2017-01-20 13:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

In this patch, IOMMUNotifier.{start|end} are introduced to store section
information for a specific notifier. When a notification occurs, we not
only check the notification type (MAP|UNMAP), but also check whether the
notified iova is in the range of that IOMMU notifier, and skip notifiers
whose listened range does not cover it.

When removing a region, we need to make sure we remove the correct
VFIOGuestIOMMU by checking the IOMMUNotifier.start address as well.
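
As a usage sketch (names like my_map_notify/iommu_mr/offset are only for
illustration), a listener that only cares about a 4GiB window would now
register its notifier like this:

    IOMMUNotifier n;

    /* Only get notified for iova within [offset, offset + 4G - 1] */
    iommu_notifier_init(&n, my_map_notify, IOMMU_NOTIFIER_ALL,
                        offset, offset + (4ULL << 30) - 1);
    memory_region_register_iommu_notifier(iommu_mr, &n);

memory_region_notify_iommu() will then skip this notifier for any entry
whose iova falls outside [n.start, n.end].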

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
changelog (start from vt-d vfio enablement series v3):
v4:
- introduce memory_region_iommu_notifier_init() [Jason]
---
 hw/vfio/common.c      | 12 +++++++++---
 hw/virtio/vhost.c     |  4 ++--
 include/exec/memory.h | 19 ++++++++++++++++++-
 memory.c              |  5 ++++-
 4 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 4d90844..49dc035 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -471,8 +471,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
         giommu->iommu_offset = section->offset_within_address_space -
                                section->offset_within_region;
         giommu->container = container;
-        giommu->n.notify = vfio_iommu_map_notify;
-        giommu->n.notifier_flags = IOMMU_NOTIFIER_ALL;
+        llend = int128_add(int128_make64(section->offset_within_region),
+                           section->size);
+        llend = int128_sub(llend, int128_one());
+        iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
+                            IOMMU_NOTIFIER_ALL,
+                            section->offset_within_region,
+                            int128_get64(llend));
         QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
 
         memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
@@ -543,7 +548,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
         VFIOGuestIOMMU *giommu;
 
         QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
-            if (giommu->iommu == section->mr) {
+            if (giommu->iommu == section->mr &&
+                giommu->n.start == section->offset_within_region) {
                 memory_region_unregister_iommu_notifier(giommu->iommu,
                                                         &giommu->n);
                 QLIST_REMOVE(giommu, giommu_next);
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 9cacf55..cc99c6a 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1242,8 +1242,8 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
         .priority = 10
     };
 
-    hdev->n.notify = vhost_iommu_unmap_notify;
-    hdev->n.notifier_flags = IOMMU_NOTIFIER_UNMAP;
+    iommu_notifier_init(&hdev->n, vhost_iommu_unmap_notify,
+                        IOMMU_NOTIFIER_UNMAP, 0, ~0ULL);
 
     if (hdev->migration_blocker == NULL) {
         if (!(hdev->features & (0x1ULL << VHOST_F_LOG_ALL))) {
diff --git a/include/exec/memory.h b/include/exec/memory.h
index bec9756..ae4c9a9 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -81,13 +81,30 @@ typedef enum {
 
 #define IOMMU_NOTIFIER_ALL (IOMMU_NOTIFIER_MAP | IOMMU_NOTIFIER_UNMAP)
 
+struct IOMMUNotifier;
+typedef void (*IOMMUNotify)(struct IOMMUNotifier *notifier,
+                            IOMMUTLBEntry *data);
+
 struct IOMMUNotifier {
-    void (*notify)(struct IOMMUNotifier *notifier, IOMMUTLBEntry *data);
+    IOMMUNotify notify;
     IOMMUNotifierFlag notifier_flags;
+    /* Notify for address space range start <= addr <= end */
+    hwaddr start;
+    hwaddr end;
     QLIST_ENTRY(IOMMUNotifier) node;
 };
 typedef struct IOMMUNotifier IOMMUNotifier;
 
+static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
+                                       IOMMUNotifierFlag flags,
+                                       hwaddr start, hwaddr end)
+{
+    n->notify = fn;
+    n->notifier_flags = flags;
+    n->start = start;
+    n->end = end;
+}
+
 /* New-style MMIO accessors can indicate that the transaction failed.
  * A zero (MEMTX_OK) response means success; anything else is a failure
  * of some kind. The memory subsystem will bitwise-OR together results
diff --git a/memory.c b/memory.c
index 2bfc37f..89104b1 100644
--- a/memory.c
+++ b/memory.c
@@ -1610,6 +1610,7 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr,
 
     /* We need to register for at least one bitfield */
     assert(n->notifier_flags != IOMMU_NOTIFIER_NONE);
+    assert(n->start <= n->end);
     QLIST_INSERT_HEAD(&mr->iommu_notify, n, node);
     memory_region_update_iommu_notify_flags(mr);
 }
@@ -1671,7 +1672,9 @@ void memory_region_notify_iommu(MemoryRegion *mr,
     }
 
     QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
-        if (iommu_notifier->notifier_flags & request_flags) {
+        if (iommu_notifier->notifier_flags & request_flags &&
+            iommu_notifier->start <= entry.iova &&
+            iommu_notifier->end >= entry.iova) {
             iommu_notifier->notify(iommu_notifier, &entry);
         }
     }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH RFC v4 11/20] memory: provide IOMMU_NOTIFIER_FOREACH macro
  2017-01-20 13:08 [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (9 preceding siblings ...)
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 10/20] memory: add section range info for IOMMU notifier Peter Xu
@ 2017-01-20 13:08 ` Peter Xu
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 12/20] memory: provide iommu_replay_all() Peter Xu
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 75+ messages in thread
From: Peter Xu @ 2017-01-20 13:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 3 +++
 memory.c              | 4 ++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index ae4c9a9..f0cb631 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -243,6 +243,9 @@ struct MemoryRegion {
     IOMMUNotifierFlag iommu_notify_flags;
 };
 
+#define IOMMU_NOTIFIER_FOREACH(n, mr) \
+    QLIST_FOREACH((n), &(mr)->iommu_notify, node)
+
 /**
  * MemoryListener: callbacks structure for updates to the physical memory map
  *
diff --git a/memory.c b/memory.c
index 89104b1..d1ee3e0 100644
--- a/memory.c
+++ b/memory.c
@@ -1587,7 +1587,7 @@ static void memory_region_update_iommu_notify_flags(MemoryRegion *mr)
     IOMMUNotifierFlag flags = IOMMU_NOTIFIER_NONE;
     IOMMUNotifier *iommu_notifier;
 
-    QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
+    IOMMU_NOTIFIER_FOREACH(iommu_notifier, mr) {
         flags |= iommu_notifier->notifier_flags;
     }
 
@@ -1671,7 +1671,7 @@ void memory_region_notify_iommu(MemoryRegion *mr,
         request_flags = IOMMU_NOTIFIER_UNMAP;
     }
 
-    QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
+    IOMMU_NOTIFIER_FOREACH(iommu_notifier, mr) {
         if (iommu_notifier->notifier_flags & request_flags &&
             iommu_notifier->start <= entry.iova &&
             iommu_notifier->end >= entry.iova) {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH RFC v4 12/20] memory: provide iommu_replay_all()
  2017-01-20 13:08 [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (10 preceding siblings ...)
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 11/20] memory: provide IOMMU_NOTIFIER_FOREACH macro Peter Xu
@ 2017-01-20 13:08 ` Peter Xu
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 13/20] memory: introduce memory_region_notify_one() Peter Xu
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 75+ messages in thread
From: Peter Xu @ 2017-01-20 13:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

This is a "global" version of the existing memory_region_iommu_replay() -
we announce the translations to all the registered notifiers, instead of
a specific one.
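
A later patch in this series uses it as below (sketch), to rebuild the
mappings for every notifier registered on a vIOMMU region after a
guest-visible change:

    /* Equivalent to calling memory_region_iommu_replay(mr, n, false)
     * for each notifier n registered on this memory region */
    memory_region_iommu_replay_all(&vtd_as->iommu);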

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 8 ++++++++
 memory.c              | 9 +++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index f0cb631..885c05f 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -711,6 +711,14 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
                                 bool is_write);
 
 /**
+ * memory_region_iommu_replay_all: replay existing IOMMU translations
+ * to all the notifiers registered.
+ *
+ * @mr: the memory region to observe
+ */
+void memory_region_iommu_replay_all(MemoryRegion *mr);
+
+/**
  * memory_region_unregister_iommu_notifier: unregister a notifier for
  * changes to IOMMU translation entries.
  *
diff --git a/memory.c b/memory.c
index d1ee3e0..068666a 100644
--- a/memory.c
+++ b/memory.c
@@ -1646,6 +1646,15 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
     }
 }
 
+void memory_region_iommu_replay_all(MemoryRegion *mr)
+{
+    IOMMUNotifier *notifier;
+
+    IOMMU_NOTIFIER_FOREACH(notifier, mr) {
+        memory_region_iommu_replay(mr, notifier, false);
+    }
+}
+
 void memory_region_unregister_iommu_notifier(MemoryRegion *mr,
                                              IOMMUNotifier *n)
 {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH RFC v4 13/20] memory: introduce memory_region_notify_one()
  2017-01-20 13:08 [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (11 preceding siblings ...)
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 12/20] memory: provide iommu_replay_all() Peter Xu
@ 2017-01-20 13:08 ` Peter Xu
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 14/20] memory: add MemoryRegionIOMMUOps.replay() callback Peter Xu
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 75+ messages in thread
From: Peter Xu @ 2017-01-20 13:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

Generalize the notify logic in memory_region_notify_iommu() into a
single per-notifier function. This can be further used in customized
replay() functions for IOMMUs.
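
For example (sketch; a later patch in this series uses exactly this
pattern), a per-device replay hook can forward each translation it finds
to a single notifier:

    static int my_replay_hook(IOMMUTLBEntry *entry, void *private)
    {
        /* Notify only the notifier passed in as private data */
        memory_region_notify_one((IOMMUNotifier *)private, entry);
        return 0;
    }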

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 15 +++++++++++++++
 memory.c              | 29 ++++++++++++++++++-----------
 2 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 885c05f..75371e9 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -686,6 +686,21 @@ void memory_region_notify_iommu(MemoryRegion *mr,
                                 IOMMUTLBEntry entry);
 
 /**
+ * memory_region_notify_one: notify a change in an IOMMU translation
+ *                           entry to a single notifier
+ *
+ * This works just like memory_region_notify_iommu(), but it only
+ * notifies a specific notifier, not all of them.
+ *
+ * @notifier: the notifier to be notified
+ * @entry: the new entry in the IOMMU translation table.  The entry
+ *         replaces all old entries for the same virtual I/O address range.
+ *         Deleted entries have .@perm == 0.
+ */
+void memory_region_notify_one(IOMMUNotifier *notifier,
+                              IOMMUTLBEntry *entry);
+
+/**
  * memory_region_register_iommu_notifier: register a notifier for changes to
  * IOMMU translation entries.
  *
diff --git a/memory.c b/memory.c
index 068666a..a4affda 100644
--- a/memory.c
+++ b/memory.c
@@ -1666,26 +1666,33 @@ void memory_region_unregister_iommu_notifier(MemoryRegion *mr,
     memory_region_update_iommu_notify_flags(mr);
 }
 
-void memory_region_notify_iommu(MemoryRegion *mr,
-                                IOMMUTLBEntry entry)
+void memory_region_notify_one(IOMMUNotifier *notifier,
+                              IOMMUTLBEntry *entry)
 {
-    IOMMUNotifier *iommu_notifier;
     IOMMUNotifierFlag request_flags;
 
-    assert(memory_region_is_iommu(mr));
-
-    if (entry.perm & IOMMU_RW) {
+    if (entry->perm & IOMMU_RW) {
         request_flags = IOMMU_NOTIFIER_MAP;
     } else {
         request_flags = IOMMU_NOTIFIER_UNMAP;
     }
 
+    if (notifier->notifier_flags & request_flags &&
+        notifier->start <= entry->iova &&
+        notifier->end >= entry->iova) {
+        notifier->notify(notifier, entry);
+    }
+}
+
+void memory_region_notify_iommu(MemoryRegion *mr,
+                                IOMMUTLBEntry entry)
+{
+    IOMMUNotifier *iommu_notifier;
+
+    assert(memory_region_is_iommu(mr));
+
     IOMMU_NOTIFIER_FOREACH(iommu_notifier, mr) {
-        if (iommu_notifier->notifier_flags & request_flags &&
-            iommu_notifier->start <= entry.iova &&
-            iommu_notifier->end >= entry.iova) {
-            iommu_notifier->notify(iommu_notifier, &entry);
-        }
+        memory_region_notify_one(iommu_notifier, &entry);
     }
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH RFC v4 14/20] memory: add MemoryRegionIOMMUOps.replay() callback
  2017-01-20 13:08 [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (12 preceding siblings ...)
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 13/20] memory: introduce memory_region_notify_one() Peter Xu
@ 2017-01-20 13:08 ` Peter Xu
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 15/20] intel_iommu: provide its own replay() callback Peter Xu
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 75+ messages in thread
From: Peter Xu @ 2017-01-20 13:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

Originally we have one memory_region_iommu_replay() function, which is
the default behavior to replay the translations of the whole IOMMU
region. However, on some platforms like x86, we may want our own replay
logic for IOMMU regions. This patch adds one more hook to
MemoryRegionIOMMUOps for that callback, and it overrides the default if set.
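
A sketch of how an IOMMU model would use it (my_iommu_replay is an
illustrative name):

    /* At vIOMMU init time, install the customized replay */
    s->iommu_ops.replay = my_iommu_replay;

When set, memory_region_iommu_replay() calls mr->iommu_ops->replay(mr, n)
instead of walking the whole region page by page.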

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 2 ++
 memory.c              | 6 ++++++
 2 files changed, 8 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 75371e9..bb4e654 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -195,6 +195,8 @@ struct MemoryRegionIOMMUOps {
     void (*notify_flag_changed)(MemoryRegion *iommu,
                                 IOMMUNotifierFlag old_flags,
                                 IOMMUNotifierFlag new_flags);
+    /* Set this up to provide customized IOMMU replay function */
+    void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
 };
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
diff --git a/memory.c b/memory.c
index a4affda..169ead6 100644
--- a/memory.c
+++ b/memory.c
@@ -1630,6 +1630,12 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
     hwaddr addr, granularity;
     IOMMUTLBEntry iotlb;
 
+    /* If the IOMMU has its own replay callback, override */
+    if (mr->iommu_ops->replay) {
+        mr->iommu_ops->replay(mr, n);
+        return;
+    }
+
     granularity = memory_region_iommu_get_min_page_size(mr);
 
     for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH RFC v4 15/20] intel_iommu: provide its own replay() callback
  2017-01-20 13:08 [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (13 preceding siblings ...)
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 14/20] memory: add MemoryRegionIOMMUOps.replay() callback Peter Xu
@ 2017-01-20 13:08 ` Peter Xu
  2017-01-22  7:56   ` Jason Wang
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 16/20] intel_iommu: do replay when context invalidate Peter Xu
                   ` (5 subsequent siblings)
  20 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2017-01-20 13:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

The default replay() doesn't work for VT-d since VT-d has a huge default
memory region which covers the address range 0-(2^64-1). Walking that
page by page normally consumes a lot of time (it looks like a dead loop).

The solution is simple - we don't walk over all the regions. Instead, we
jump over a region as soon as we find that its page directories are empty.
This greatly reduces the time needed to walk the whole region.

To achieve this, we provide a page walk helper which invokes a
corresponding hook function whenever it finds a page we are interested in.
vtd_page_walk_level() is the core logic for the page walking. Its
interface is designed to suit further use cases, e.g., invalidating a
range of addresses.
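
As a rough illustration of the saving: with 4KiB pages and 9 bits per
level, one non-present level-2 entry lets the walk skip a 2MiB range and
one non-present level-3 entry a 1GiB range (262,144 4KiB pages) in a
single step, instead of translating each page in it individually.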

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 216 ++++++++++++++++++++++++++++++++++++++++++++++++--
 hw/i386/trace-events  |   7 ++
 include/exec/memory.h |   2 +
 3 files changed, 220 insertions(+), 5 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 6f5f68a..f9c5142 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -598,6 +598,22 @@ static inline uint32_t vtd_get_agaw_from_context_entry(VTDContextEntry *ce)
     return 30 + (ce->hi & VTD_CONTEXT_ENTRY_AW) * 9;
 }
 
+static inline uint64_t vtd_iova_limit(VTDContextEntry *ce)
+{
+    uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
+    return 1ULL << MIN(ce_agaw, VTD_MGAW);
+}
+
+/* Return true if IOVA passes range check, otherwise false. */
+static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce)
+{
+    /*
+     * Check if @iova is above 2^X-1, where X is the minimum of MGAW
+     * in CAP_REG and AW in context-entry.
+     */
+    return !(iova & ~(vtd_iova_limit(ce) - 1));
+}
+
 static const uint64_t vtd_paging_entry_rsvd_field[] = {
     [0] = ~0ULL,
     /* For not large page */
@@ -633,13 +649,9 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
     uint32_t level = vtd_get_level_from_context_entry(ce);
     uint32_t offset;
     uint64_t slpte;
-    uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
     uint64_t access_right_check;
 
-    /* Check if @iova is above 2^X-1, where X is the minimum of MGAW
-     * in CAP_REG and AW in context-entry.
-     */
-    if (iova & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
+    if (!vtd_iova_range_check(iova, ce)) {
         trace_vtd_err("IOVA exceeds limits");
         return -VTD_FR_ADDR_BEYOND_MGAW;
     }
@@ -681,6 +693,168 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
     }
 }
 
+typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
+
+/**
+ * vtd_page_walk_level - walk over specific level for IOVA range
+ *
+ * @addr: base GPA addr to start the walk
+ * @start: IOVA range start address
+ * @end: IOVA range end address (start <= addr < end)
+ * @hook_fn: hook func to be called when detected page
+ * @private: private data to be passed into hook func
+ * @read: whether parent level has read permission
+ * @write: whether parent level has write permission
+ * @skipped: accumulated skipped ranges
+ * @notify_unmap: whether we should notify invalid entries
+ */
+static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
+                               uint64_t end, vtd_page_walk_hook hook_fn,
+                               void *private, uint32_t level,
+                               bool read, bool write, uint64_t *skipped,
+                               bool notify_unmap)
+{
+    bool read_cur, write_cur, entry_valid;
+    uint32_t offset;
+    uint64_t slpte;
+    uint64_t subpage_size, subpage_mask;
+    IOMMUTLBEntry entry;
+    uint64_t iova = start;
+    uint64_t iova_next;
+    uint64_t skipped_local = 0;
+    int ret = 0;
+
+    trace_vtd_page_walk_level(addr, level, start, end);
+
+    subpage_size = 1ULL << vtd_slpt_level_shift(level);
+    subpage_mask = vtd_slpt_level_page_mask(level);
+
+    while (iova < end) {
+        iova_next = (iova & subpage_mask) + subpage_size;
+
+        offset = vtd_iova_level_offset(iova, level);
+        slpte = vtd_get_slpte(addr, offset);
+
+        /*
+         * When one of the following cases happens, we assume the whole
+         * range is invalid:
+         *
+         * 1. failed to read the slpte block
+         * 2. slpte reserved area is non-zero
+         * 3. neither read nor write flag is set
+         */
+
+        if (slpte == (uint64_t)-1) {
+            trace_vtd_page_walk_skip_read(iova, iova_next);
+            skipped_local++;
+            goto next;
+        }
+
+        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
+            trace_vtd_page_walk_skip_reserve(iova, iova_next);
+            skipped_local++;
+            goto next;
+        }
+
+        /* Permissions are stacked with parents' */
+        read_cur = read && (slpte & VTD_SL_R);
+        write_cur = write && (slpte & VTD_SL_W);
+
+        /*
+         * As long as we have either read/write permission, this is
+         * a valid entry. The rule works for both page or page tables.
+         */
+        entry_valid = read_cur | write_cur;
+
+        if (vtd_is_last_slpte(slpte, level)) {
+            entry.target_as = &address_space_memory;
+            entry.iova = iova & subpage_mask;
+            /*
+             * This might be a meaningless addr if (!read_cur &&
+             * !write_cur), but after all this field will be
+             * meaningless in that case, so let's share the code to
+             * generate the IOTLBs no matter whether it's a MAP or UNMAP
+             */
+            entry.translated_addr = vtd_get_slpte_addr(slpte);
+            entry.addr_mask = ~subpage_mask;
+            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
+            if (!entry_valid && !notify_unmap) {
+                trace_vtd_page_walk_skip_perm(iova, iova_next);
+                skipped_local++;
+                goto next;
+            }
+            trace_vtd_page_walk_one(level, entry.iova, entry.translated_addr,
+                                    entry.addr_mask, entry.perm);
+            if (hook_fn) {
+                ret = hook_fn(&entry, private);
+                if (ret < 0) {
+                    error_report("Detected error in page walk hook "
+                                 "function, stop walk.");
+                    return ret;
+                }
+            }
+        } else {
+            if (!entry_valid) {
+                trace_vtd_page_walk_skip_perm(iova, iova_next);
+                skipped_local++;
+                goto next;
+            }
+            trace_vtd_page_walk_level(vtd_get_slpte_addr(slpte), level - 1,
+                                      iova, MIN(iova_next, end));
+            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte), iova,
+                                      MIN(iova_next, end), hook_fn, private,
+                                      level - 1, read_cur, write_cur,
+                                      &skipped_local, notify_unmap);
+            if (ret < 0) {
+                error_report("Detected page walk error on addr 0x%"PRIx64
+                             " level %"PRIu32", stop walk.", addr, level - 1);
+                return ret;
+            }
+        }
+
+next:
+        iova = iova_next;
+    }
+
+    if (skipped) {
+        *skipped += skipped_local;
+    }
+
+    return 0;
+}
+
+/**
+ * vtd_page_walk - walk specific IOVA range, and call the hook
+ *
+ * @ce: context entry to walk upon
+ * @start: IOVA address to start the walk
+ * @end: IOVA range end address (start <= addr < end)
+ * @hook_fn: the hook that to be called for each detected area
+ * @private: private data for the hook function
+ */
+static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
+                         vtd_page_walk_hook hook_fn, void *private)
+{
+    dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
+    uint32_t level = vtd_get_level_from_context_entry(ce);
+
+    if (!vtd_iova_range_check(start, ce)) {
+        error_report("IOVA start 0x%"PRIx64 " end 0x%"PRIx64" exceeds limits",
+                     start, end);
+        return -VTD_FR_ADDR_BEYOND_MGAW;
+    }
+
+    if (!vtd_iova_range_check(end, ce)) {
+        /* Fix end so that it reaches the maximum */
+        end = vtd_iova_limit(ce);
+    }
+
+    trace_vtd_page_walk_level(addr, level, start, end);
+
+    return vtd_page_walk_level(addr, start, end, hook_fn, private,
+                               level, true, true, NULL, false);
+}
+
 /* Map a device to its corresponding domain (context-entry) */
 static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
                                     uint8_t devfn, VTDContextEntry *ce)
@@ -2395,6 +2569,37 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
     return vtd_dev_as;
 }
 
+static int vtd_replay_hook(IOMMUTLBEntry *entry, void *private)
+{
+    memory_region_notify_one((IOMMUNotifier *)private, entry);
+    return 0;
+}
+
+static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
+{
+    VTDAddressSpace *vtd_as = container_of(mr, VTDAddressSpace, iommu);
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    uint8_t bus_n = pci_bus_num(vtd_as->bus);
+    VTDContextEntry ce;
+
+    if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
+        /*
+         * Scanned a valid context entry, walk over the pages and
+         * notify when needed.
+         */
+        trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
+                                  PCI_FUNC(vtd_as->devfn),
+                                  VTD_CONTEXT_ENTRY_DID(ce.hi),
+                                  ce.hi, ce.lo);
+        vtd_page_walk(&ce, 0, ~0, vtd_replay_hook, (void *)n);
+    } else {
+        trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
+                                    PCI_FUNC(vtd_as->devfn));
+    }
+
+    return;
+}
+
 /* Do the initialization. It will also be called when reset, so pay
  * attention when adding new initialization stuff.
  */
@@ -2409,6 +2614,7 @@ static void vtd_init(IntelIOMMUState *s)
 
     s->iommu_ops.translate = vtd_iommu_translate;
     s->iommu_ops.notify_flag_changed = vtd_iommu_notify_flag_changed;
+    s->iommu_ops.replay = vtd_iommu_replay;
     s->root = 0;
     s->root_extended = false;
     s->dmar_enabled = false;
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index a273980..a3e1a9d 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -31,6 +31,13 @@ vtd_iotlb_page_update(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t doma
 vtd_iotlb_cc_hit(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen) "IOTLB context hit bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32
 vtd_iotlb_cc_update(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen1, uint32_t gen2) "IOTLB context update bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32" -> gen %"PRIu32
 vtd_iotlb_reset(const char *reason) "IOTLB reset (reason: %s)"
+vtd_replay_ce_valid(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t domain, uint64_t hi, uint64_t lo) "replay valid context device %02"PRIx8":%02"PRIx8".%02"PRIx8" domain 0x%"PRIx16" hi 0x%"PRIx64" lo 0x%"PRIx64
+vtd_replay_ce_invalid(uint8_t bus, uint8_t dev, uint8_t fn) "replay invalid context device %02"PRIx8":%02"PRIx8".%02"PRIx8
+vtd_page_walk_level(uint64_t addr, uint32_t level, uint64_t start, uint64_t end) "walk (base=0x%"PRIx64", level=%"PRIu32") iova range 0x%"PRIx64" - 0x%"PRIx64
+vtd_page_walk_one(uint32_t level, uint64_t iova, uint64_t gpa, uint64_t mask, int perm) "detected page level 0x%"PRIx32" iova 0x%"PRIx64" -> gpa 0x%"PRIx64" mask 0x%"PRIx64" perm %d"
+vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
+vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
+vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
 
 # hw/i386/amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
diff --git a/include/exec/memory.h b/include/exec/memory.h
index bb4e654..9fd3232 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -59,6 +59,8 @@ typedef enum {
     IOMMU_RW   = 3,
 } IOMMUAccessFlags;
 
+#define IOMMU_ACCESS_FLAG(r, w) (((r) ? IOMMU_RO : 0) | ((w) ? IOMMU_WO : 0))
+
 struct IOMMUTLBEntry {
     AddressSpace    *target_as;
     hwaddr           iova;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH RFC v4 16/20] intel_iommu: do replay when context invalidate
  2017-01-20 13:08 [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (14 preceding siblings ...)
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 15/20] intel_iommu: provide its own replay() callback Peter Xu
@ 2017-01-20 13:08 ` Peter Xu
  2017-01-23 10:36   ` Jason Wang
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 17/20] intel_iommu: allow dynamic switch of IOMMU region Peter Xu
                   ` (4 subsequent siblings)
  20 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2017-01-20 13:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

Before this patch we only invalidate the context cache when we receive
context entry invalidations. However it's possible that the invalidation
also implies a domain switch (only if caching mode is enabled for the
vIOMMU). In that case we need to notify all the registered components
about the new mapping.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index f9c5142..4b08b4d 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1146,6 +1146,16 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
                 trace_vtd_inv_desc_cc_device(bus_n, VTD_PCI_SLOT(devfn_it),
                                              VTD_PCI_FUNC(devfn_it));
                 vtd_as->context_cache_entry.context_cache_gen = 0;
+                /*
+                 * So a device is moving out of (or moving into) a
+                 * domain; a replay() is suitable here to notify all the
+                 * registered IOMMU_NOTIFIER_MAP notifiers about this
+                 * change.  This does no harm even if we have no such
+                 * notifier registered - the IOMMU notification
+                 * framework will simply skip MAP notifications in
+                 * that case.
+                 */
+                memory_region_iommu_replay_all(&vtd_as->iommu);
             }
         }
     }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH RFC v4 17/20] intel_iommu: allow dynamic switch of IOMMU region
  2017-01-20 13:08 [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (15 preceding siblings ...)
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 16/20] intel_iommu: do replay when context invalidate Peter Xu
@ 2017-01-20 13:08 ` Peter Xu
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 18/20] intel_iommu: enable vfio devices Peter Xu
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 75+ messages in thread
From: Peter Xu @ 2017-01-20 13:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

This is preparation work to finally enable dynamic switching ON/OFF of
VT-d protection. The old VT-d code uses a static IOMMU address space,
and that won't satisfy vfio-pci device listeners.

Let me explain.

vfio-pci devices depend on the memory region listener and IOMMU replay
mechanism to make sure the device mapping is coherent with the guest
even if there are domain switches. And there are two kinds of domain
switches:

  (1) switch from domain A -> B
  (2) switch from domain A -> no domain (e.g., turn DMAR off)

Case (1) is handled by the context entry invalidation handling together
with the VT-d replay logic. What the replay function should do here is
replay the existing page mappings in domain B.

However for case (2), we don't want to replay any domain mappings - we
just need the default GPA->HPA mappings (the address_space_memory
mapping). And this patch helps on case (2) to build up the mapping
automatically by leveraging the vfio-pci memory listeners.

Another important thing that this patch does is to separate IR
(Interrupt Remapping) from DMAR (DMA Remapping). The IR region should not
depend on the DMAR region (as it did before this patch). It should be a
standalone region which can be activated even without DMAR (which is a
common behavior of the Linux kernel - by default it enables IR while
leaving DMAR disabled).
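
As an illustration of case (2) (not literal code), when the guest turns
DMAR off we flip which of the two overlapping subregions is enabled, and
the vfio memory listener then sees something like:

    region_del(intel_iommu)      -> the IOMMU notifier goes away
    region_add(vtd_sys_alias)    -> static GPA->HPA mappings are built

and the reverse happens when DMAR is turned back on.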

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c         | 78 ++++++++++++++++++++++++++++++++++++++++---
 hw/i386/trace-events          |  2 +-
 include/hw/i386/intel_iommu.h |  2 ++
 3 files changed, 77 insertions(+), 5 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 4b08b4d..83a2e1f 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1336,9 +1336,49 @@ static void vtd_handle_gcmd_sirtp(IntelIOMMUState *s)
     vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_IRTPS);
 }
 
+static void vtd_switch_address_space(VTDAddressSpace *as)
+{
+    assert(as);
+
+    trace_vtd_switch_address_space(pci_bus_num(as->bus),
+                                   VTD_PCI_SLOT(as->devfn),
+                                   VTD_PCI_FUNC(as->devfn),
+                                   as->iommu_state->dmar_enabled);
+
+    /* Turn off first then on the other */
+    if (as->iommu_state->dmar_enabled) {
+        memory_region_set_enabled(&as->sys_alias, false);
+        memory_region_set_enabled(&as->iommu, true);
+    } else {
+        memory_region_set_enabled(&as->iommu, false);
+        memory_region_set_enabled(&as->sys_alias, true);
+    }
+}
+
+static void vtd_switch_address_space_all(IntelIOMMUState *s)
+{
+    GHashTableIter iter;
+    VTDBus *vtd_bus;
+    int i;
+
+    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
+    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
+        for (i = 0; i < X86_IOMMU_PCI_DEVFN_MAX; i++) {
+            if (!vtd_bus->dev_as[i]) {
+                continue;
+            }
+            vtd_switch_address_space(vtd_bus->dev_as[i]);
+        }
+    }
+}
+
 /* Handle Translation Enable/Disable */
 static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
 {
+    if (s->dmar_enabled == en) {
+        return;
+    }
+
     VTD_DPRINTF(CSR, "Translation Enable %s", (en ? "on" : "off"));
 
     if (en) {
@@ -1353,6 +1393,8 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
         /* Ok - report back to driver */
         vtd_set_clear_mask_long(s, DMAR_GSTS_REG, VTD_GSTS_TES, 0);
     }
+
+    vtd_switch_address_space_all(s);
 }
 
 /* Handle Interrupt Remap Enable/Disable */
@@ -2566,15 +2608,43 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
         vtd_dev_as->devfn = (uint8_t)devfn;
         vtd_dev_as->iommu_state = s;
         vtd_dev_as->context_cache_entry.context_cache_gen = 0;
+
+        /*
+         * Memory region relationships looks like (Address range shows
+         * only lower 32 bits to make it short in length...):
+         *
+         * |-----------------+-------------------+----------|
+         * | Name            | Address range     | Priority |
+         * |-----------------+-------------------+----------+
+         * | vtd_root        | 00000000-ffffffff |        0 |
+         * |  intel_iommu    | 00000000-ffffffff |        1 |
+         * |  vtd_sys_alias  | 00000000-ffffffff |        1 |
+         * |  intel_iommu_ir | fee00000-feefffff |       64 |
+         * |-----------------+-------------------+----------|
+         *
+         * We enable/disable DMAR by switching enablement for
+         * vtd_sys_alias and intel_iommu regions. IR region is always
+         * enabled.
+         */
         memory_region_init_iommu(&vtd_dev_as->iommu, OBJECT(s),
                                  &s->iommu_ops, "intel_iommu", UINT64_MAX);
+        memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
+                                 "vtd_sys_alias", get_system_memory(),
+                                 0, memory_region_size(get_system_memory()));
         memory_region_init_io(&vtd_dev_as->iommu_ir, OBJECT(s),
                               &vtd_mem_ir_ops, s, "intel_iommu_ir",
                               VTD_INTERRUPT_ADDR_SIZE);
-        memory_region_add_subregion(&vtd_dev_as->iommu, VTD_INTERRUPT_ADDR_FIRST,
-                                    &vtd_dev_as->iommu_ir);
-        address_space_init(&vtd_dev_as->as,
-                           &vtd_dev_as->iommu, name);
+        memory_region_init(&vtd_dev_as->root, OBJECT(s),
+                           "vtd_root", UINT64_MAX);
+        memory_region_add_subregion_overlap(&vtd_dev_as->root,
+                                            VTD_INTERRUPT_ADDR_FIRST,
+                                            &vtd_dev_as->iommu_ir, 64);
+        address_space_init(&vtd_dev_as->as, &vtd_dev_as->root, name);
+        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
+                                            &vtd_dev_as->sys_alias, 1);
+        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
+                                            &vtd_dev_as->iommu, 1);
+        vtd_switch_address_space(vtd_dev_as);
     }
     return vtd_dev_as;
 }
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index a3e1a9d..bd57b0a 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -11,7 +11,6 @@ xen_pv_mmio_write(uint64_t addr) "WARNING: write to Xen PV Device MMIO space (ad
 x86_iommu_iec_notify(bool global, uint32_t index, uint32_t mask) "Notify IEC invalidation: global=%d index=%" PRIu32 " mask=%" PRIu32
 
 # hw/i386/intel_iommu.c
-vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
 vtd_inv_desc(const char *type, uint64_t hi, uint64_t lo) "invalidate desc type %s high 0x%"PRIx64" low 0x%"PRIx64
 vtd_inv_desc_cc_domain(uint16_t domain) "context invalidate domain 0x%"PRIx16
 vtd_inv_desc_cc_global(void) "context invalidate globally"
@@ -38,6 +37,7 @@ vtd_page_walk_one(uint32_t level, uint64_t iova, uint64_t gpa, uint64_t mask, in
 vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
 vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
 vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
+vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
 
 # hw/i386/amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index fe645aa..8f212a1 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -83,6 +83,8 @@ struct VTDAddressSpace {
     uint8_t devfn;
     AddressSpace as;
     MemoryRegion iommu;
+    MemoryRegion root;
+    MemoryRegion sys_alias;
     MemoryRegion iommu_ir;      /* Interrupt region: 0xfeeXXXXX */
     IntelIOMMUState *iommu_state;
     VTDContextCacheEntry context_cache_entry;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH RFC v4 18/20] intel_iommu: enable vfio devices
  2017-01-20 13:08 [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (16 preceding siblings ...)
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 17/20] intel_iommu: allow dynamic switch of IOMMU region Peter Xu
@ 2017-01-20 13:08 ` Peter Xu
  2017-01-22  8:08   ` Jason Wang
  2017-01-23  2:01   ` Jason Wang
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 19/20] intel_iommu: unmap existing pages before replay Peter Xu
                   ` (2 subsequent siblings)
  20 siblings, 2 replies; 75+ messages in thread
From: Peter Xu @ 2017-01-20 13:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

This patch is based on Aviv Ben-David (<bd.aviv@gmail.com>)'s patch
upstream:

  "IOMMU: enable intel_iommu map and unmap notifiers"
  https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01453.html

However I removed/fixed some content, and added my own code.

Instead of calling translate() on every page for IOTLB invalidations
(which is slower), we walk the pages when needed and notify via a hook
function.

This patch enables vfio devices for VT-d emulation.
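
For example (a sketch of the arithmetic involved), a page-selective IOTLB
invalidation carries an address and an address mask "am", covering
(1 << am) 4KiB pages:

    am = 2, addr = 0x100000  ->  walk range [0x100000, 0x104000)

vtd_page_walk() is run over that range with notify_unmap set, so the
registered notifiers receive UNMAPs for entries that disappeared and MAPs
for entries that are (still) present.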

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c         | 66 +++++++++++++++++++++++++++++++++++++------
 include/hw/i386/intel_iommu.h |  8 ++++++
 2 files changed, 65 insertions(+), 9 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 83a2e1f..7cbf057 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -833,7 +833,8 @@ next:
  * @private: private data for the hook function
  */
 static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
-                         vtd_page_walk_hook hook_fn, void *private)
+                         vtd_page_walk_hook hook_fn, void *private,
+                         bool notify_unmap)
 {
     dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
     uint32_t level = vtd_get_level_from_context_entry(ce);
@@ -852,7 +853,7 @@ static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
     trace_vtd_page_walk_level(addr, level, start, end);
 
     return vtd_page_walk_level(addr, start, end, hook_fn, private,
-                               level, true, true, NULL, false);
+                               level, true, true, NULL, notify_unmap);
 }
 
 /* Map a device to its corresponding domain (context-entry) */
@@ -1205,6 +1206,33 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
                                 &domain_id);
 }
 
+static int vtd_page_invalidate_notify_hook(IOMMUTLBEntry *entry,
+                                           void *private)
+{
+    memory_region_notify_iommu((MemoryRegion *)private, *entry);
+    return 0;
+}
+
+static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
+                                           uint16_t domain_id, hwaddr addr,
+                                           uint8_t am)
+{
+    IntelIOMMUNotifierNode *node;
+    VTDContextEntry ce;
+    int ret;
+
+    QLIST_FOREACH(node, &(s->notifiers_list), next) {
+        VTDAddressSpace *vtd_as = node->vtd_as;
+        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
+                                       vtd_as->devfn, &ce);
+        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
+            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
+                          vtd_page_invalidate_notify_hook,
+                          (void *)&vtd_as->iommu, true);
+        }
+    }
+}
+
 static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
                                       hwaddr addr, uint8_t am)
 {
@@ -1215,6 +1243,7 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
     info.addr = addr;
     info.mask = ~((1 << am) - 1);
     g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
+    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
 }
 
 /* Flush IOTLB
@@ -2224,15 +2253,33 @@ static void vtd_iommu_notify_flag_changed(MemoryRegion *iommu,
                                           IOMMUNotifierFlag new)
 {
     VTDAddressSpace *vtd_as = container_of(iommu, VTDAddressSpace, iommu);
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    IntelIOMMUNotifierNode *node = NULL;
+    IntelIOMMUNotifierNode *next_node = NULL;
 
-    if (new & IOMMU_NOTIFIER_MAP) {
-        error_report("Device at bus %s addr %02x.%d requires iommu "
-                     "notifier which is currently not supported by "
-                     "intel-iommu emulation",
-                     vtd_as->bus->qbus.name, PCI_SLOT(vtd_as->devfn),
-                     PCI_FUNC(vtd_as->devfn));
+    if (!s->caching_mode && new & IOMMU_NOTIFIER_MAP) {
+        error_report("We need to set cache_mode=1 for intel-iommu to enable "
+                     "device assignment with IOMMU protection.");
         exit(1);
     }
+
+    if (old == IOMMU_NOTIFIER_NONE) {
+        node = g_malloc0(sizeof(*node));
+        node->vtd_as = vtd_as;
+        QLIST_INSERT_HEAD(&s->notifiers_list, node, next);
+        return;
+    }
+
+    /* update notifier node with new flags */
+    QLIST_FOREACH_SAFE(node, &s->notifiers_list, next, next_node) {
+        if (node->vtd_as == vtd_as) {
+            if (new == IOMMU_NOTIFIER_NONE) {
+                QLIST_REMOVE(node, next);
+                g_free(node);
+            }
+            return;
+        }
+    }
 }
 
 static const VMStateDescription vtd_vmstate = {
@@ -2671,7 +2718,7 @@ static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
                                   PCI_FUNC(vtd_as->devfn),
                                   VTD_CONTEXT_ENTRY_DID(ce.hi),
                                   ce.hi, ce.lo);
-        vtd_page_walk(&ce, 0, ~0, vtd_replay_hook, (void *)n);
+        vtd_page_walk(&ce, 0, ~0, vtd_replay_hook, (void *)n, false);
     } else {
         trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
                                     PCI_FUNC(vtd_as->devfn));
@@ -2853,6 +2900,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
         return;
     }
 
+    QLIST_INIT(&s->notifiers_list);
     memset(s->vtd_as_by_bus_num, 0, sizeof(s->vtd_as_by_bus_num));
     memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
                           "intel_iommu", DMAR_REG_SIZE);
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 8f212a1..3e51876 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -63,6 +63,7 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
 typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
 typedef struct VTDIrq VTDIrq;
 typedef struct VTD_MSIMessage VTD_MSIMessage;
+typedef struct IntelIOMMUNotifierNode IntelIOMMUNotifierNode;
 
 /* Context-Entry */
 struct VTDContextEntry {
@@ -249,6 +250,11 @@ struct VTD_MSIMessage {
 /* When IR is enabled, all MSI/MSI-X data bits should be zero */
 #define VTD_IR_MSI_DATA          (0)
 
+struct IntelIOMMUNotifierNode {
+    VTDAddressSpace *vtd_as;
+    QLIST_ENTRY(IntelIOMMUNotifierNode) next;
+};
+
 /* The iommu (DMAR) device state struct */
 struct IntelIOMMUState {
     X86IOMMUState x86_iommu;
@@ -286,6 +292,8 @@ struct IntelIOMMUState {
     MemoryRegionIOMMUOps iommu_ops;
     GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
     VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
+    /* list of registered notifiers */
+    QLIST_HEAD(, IntelIOMMUNotifierNode) notifiers_list;
 
     /* interrupt remapping */
     bool intr_enabled;              /* Whether guest enabled IR */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH RFC v4 19/20] intel_iommu: unmap existing pages before replay
  2017-01-20 13:08 [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (17 preceding siblings ...)
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 18/20] intel_iommu: enable vfio devices Peter Xu
@ 2017-01-20 13:08 ` Peter Xu
  2017-01-22  8:13   ` Jason Wang
  2017-01-23 10:40   ` Jason Wang
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 20/20] intel_iommu: replay even with DSI/GLOBAL inv desc Peter Xu
  2017-01-23 15:55 ` [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances Michael S. Tsirkin
  20 siblings, 2 replies; 75+ messages in thread
From: Peter Xu @ 2017-01-20 13:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

The previous replay works for a domain switch only if the original domain
does not have mapped pages. For example, if we switch domain from A to B,
it will only work if A has no existing mapping. If there is one, then
there's a problem - the current replay doesn't make sure the old mappings
are cleared before replaying the new ones.

This patch lets the replay work even if the original domain A has existing
mappings.

The idea is, when we replay, we unmap the whole address space first, no
matter what. Then, we replay the region and rebuild the pages.

We are leveraging the feature of the vfio driver that allows unmapping a
very big range, even if it is bigger than the mapped area. It'll free all
the mapped pages within the range. Here, we chose (0, 1ULL << VTD_MGAW)
as the range to make sure every mapped page is unmapped.
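
As a worked example (assuming the 39-bit VTD_MGAW below), a notifier
registered for the whole 64-bit space gets its range clipped and rounded
so it can be expressed as a single IOTLB entry mask:

    n->start = 0, n->end = ~0ULL
    end  clipped to 1ULL << 39
    size = 1ULL << 39            (already a power of two)
    entry.iova = 0, entry.addr_mask = (1ULL << 39) - 1, perm = IOMMU_NONE

so one UNMAP notification is enough to tear down every mapping the vfio
driver holds for that address space.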

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c          | 64 ++++++++++++++++++++++++++++++++++++++++--
 hw/i386/intel_iommu_internal.h |  1 +
 hw/i386/trace-events           |  1 +
 3 files changed, 64 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 7cbf057..a038651 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2696,6 +2696,63 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
     return vtd_dev_as;
 }
 
+/*
+ * Unmap the whole range in the notifier's scope. If we have recorded
+ * any high watermark (VTDAddressSpace.iova_max), we use it to limit
+ * the n->end as well.
+ */
+static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
+{
+    IOMMUTLBEntry entry;
+    hwaddr size;
+    hwaddr start = n->start;
+    hwaddr end = n->end;
+
+    /*
+     * Note: all the code in this function assumes that the IOVA
+     * bits are no more than VTD_MGAW bits (which is restricted by the
+     * VT-d spec); otherwise we would need to consider 64-bit overflow.
+     */
+
+    if (end > VTD_ADDRESS_SIZE) {
+        /*
+         * No need to unmap regions that are bigger than the whole
+         * VT-d supported address space size
+         */
+        end = VTD_ADDRESS_SIZE;
+    }
+
+    assert(start <= end);
+    size = end - start;
+
+    if (ctpop64(size) != 1) {
+        /*
+         * The size is not a power of two, so it cannot form a valid
+         * mask. Enlarge it to the smallest power of two covering it.
+         */
+        int n = 64 - clz64(size);
+        if (n > VTD_MGAW) {
+            /* should not happen, but in case it happens, limit it */
+            trace_vtd_err("Address space unmap found size too big");
+            n = VTD_MGAW;
+        }
+        size = 1ULL << n;
+    }
+
+    entry.target_as = &address_space_memory;
+    entry.iova = n->start;
+    entry.translated_addr = 0;  /* useless for unmap */
+    entry.perm = IOMMU_NONE;
+    entry.addr_mask = size - 1;
+
+    trace_vtd_as_unmap_whole(pci_bus_num(as->bus),
+                             VTD_PCI_SLOT(as->devfn),
+                             VTD_PCI_FUNC(as->devfn),
+                             entry.iova, size);
+
+    memory_region_notify_one(n, &entry);
+}
+
 static int vtd_replay_hook(IOMMUTLBEntry *entry, void *private)
 {
     memory_region_notify_one((IOMMUNotifier *)private, entry);
@@ -2711,13 +2768,16 @@ static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
 
     if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
         /*
-         * Scanned a valid context entry, walk over the pages and
-         * notify when needed.
+         * Scanned a valid context entry. First make sure all existing
+         * mappings in the old domain are removed by sending UNMAP to
+         * the notifier, then walk over the page table and notify the
+         * entries that are mapped in the new domain.
          */
         trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
                                   PCI_FUNC(vtd_as->devfn),
                                   VTD_CONTEXT_ENTRY_DID(ce.hi),
                                   ce.hi, ce.lo);
+        vtd_address_space_unmap(vtd_as, n);
         vtd_page_walk(&ce, 0, ~0, vtd_replay_hook, (void *)n, false);
     } else {
         trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 4104121..29d6707 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -197,6 +197,7 @@
 #define VTD_DOMAIN_ID_MASK          ((1UL << VTD_DOMAIN_ID_SHIFT) - 1)
 #define VTD_CAP_ND                  (((VTD_DOMAIN_ID_SHIFT - 4) / 2) & 7ULL)
 #define VTD_MGAW                    39  /* Maximum Guest Address Width */
+#define VTD_ADDRESS_SIZE            (1ULL << VTD_MGAW)
 #define VTD_CAP_MGAW                (((VTD_MGAW - 1) & 0x3fULL) << 16)
 #define VTD_MAMV                    18ULL
 #define VTD_CAP_MAMV                (VTD_MAMV << 48)
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index bd57b0a..ef725ca 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -38,6 +38,7 @@ vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"P
 vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
 vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
 vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
+vtd_as_unmap_whole(uint8_t bus, uint8_t slot, uint8_t fn, uint64_t iova, uint64_t size) "Device %02x:%02x.%x start 0x%"PRIx64" size 0x%"PRIx64
 
 # hw/i386/amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH RFC v4 20/20] intel_iommu: replay even with DSI/GLOBAL inv desc
  2017-01-20 13:08 [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (18 preceding siblings ...)
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 19/20] intel_iommu: unmap existing pages before replay Peter Xu
@ 2017-01-20 13:08 ` Peter Xu
  2017-01-23 15:55 ` [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances Michael S. Tsirkin
  20 siblings, 0 replies; 75+ messages in thread
From: Peter Xu @ 2017-01-20 13:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

We were capturing context entry invalidations to trap IOMMU mapping
changes. This patch listens to domain/global invalidation requests too.

We need this because the guest operating system might send one
domain/global invalidation instead of several PSIs in some cases. To
handle that, we'd better replay the corresponding regions for these
invalidations as well, even if it costs a bit of performance.

An example in Linux (4.10.0) Intel IOMMU driver:

    /*
     * Fallback to domain selective flush if no PSI support or the size is
     * too big.
     * PSI requires page size to be 2 ^ x, and the base address is naturally
     * aligned to the size
     */
    if (!cap_pgsel_inv(iommu->cap) || mask > cap_max_amask_val(iommu->cap))
        iommu->flush.flush_iotlb(iommu, did, 0, 0,
                        DMA_TLB_DSI_FLUSH);
    else
        iommu->flush.flush_iotlb(iommu, did, addr | ih, mask,
                        DMA_TLB_PSI_FLUSH);

Without this, when the above DSI flush happens, the shadow mappings
might end up out of sync with the guest.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index a038651..e958f53 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1196,14 +1196,33 @@ static uint64_t vtd_context_cache_invalidate(IntelIOMMUState *s, uint64_t val)
 
 static void vtd_iotlb_global_invalidate(IntelIOMMUState *s)
 {
+    IntelIOMMUNotifierNode *node;
+
     trace_vtd_iotlb_reset("global invalidation recved");
     vtd_reset_iotlb(s);
+
+    QLIST_FOREACH(node, &s->notifiers_list, next) {
+        memory_region_iommu_replay_all(&node->vtd_as->iommu);
+    }
 }
 
 static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
 {
+    IntelIOMMUNotifierNode *node;
+    VTDContextEntry ce;
+    VTDAddressSpace *vtd_as;
+
     g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_domain,
                                 &domain_id);
+
+    QLIST_FOREACH(node, &s->notifiers_list, next) {
+        vtd_as = node->vtd_as;
+        if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
+                                      vtd_as->devfn, &ce) &&
+            domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
+            memory_region_iommu_replay_all(&vtd_as->iommu);
+        }
+    }
 }
 
 static int vtd_page_invalidate_notify_hook(IOMMUTLBEntry *entry,
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH RFC v4.1 04/20] intel_iommu: add "caching-mode" option
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 04/20] IOMMU: add option to enable VTD_CAP_CM to vIOMMU capility exposoed to guest Peter Xu
@ 2017-01-22  2:51   ` Peter Xu
  0 siblings, 0 replies; 75+ messages in thread
From: Peter Xu @ 2017-01-22  2:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv, eblake

From: Aviv Ben-David <bd.aviv@gmail.com>

This capability asks the guest to invalidate its cache before each map
operation. We can use these invalidations to trap map operations in the
hypervisor.

Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
[peterx: using "caching-mode" instead of "cache-mode" to align with spec]
[peterx: re-write the subject to make it short and clear]
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c          | 5 +++++
 hw/i386/intel_iommu_internal.h | 1 +
 include/hw/i386/intel_iommu.h  | 2 ++
 3 files changed, 8 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index ec62239..e58f1de 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2107,6 +2107,7 @@ static Property vtd_properties[] = {
     DEFINE_PROP_ON_OFF_AUTO("eim", IntelIOMMUState, intr_eim,
                             ON_OFF_AUTO_AUTO),
     DEFINE_PROP_BOOL("x-buggy-eim", IntelIOMMUState, buggy_eim, false),
+    DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -2488,6 +2489,10 @@ static void vtd_init(IntelIOMMUState *s)
         s->ecap |= VTD_ECAP_DT;
     }
 
+    if (s->caching_mode) {
+        s->cap |= VTD_CAP_CM;
+    }
+
     vtd_reset_context_cache(s);
     vtd_reset_iotlb(s);
 
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 356f188..4104121 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -202,6 +202,7 @@
 #define VTD_CAP_MAMV                (VTD_MAMV << 48)
 #define VTD_CAP_PSI                 (1ULL << 39)
 #define VTD_CAP_SLLPS               ((1ULL << 34) | (1ULL << 35))
+#define VTD_CAP_CM                  (1ULL << 7)
 
 /* Supported Adjusted Guest Address Widths */
 #define VTD_CAP_SAGAW_SHIFT         8
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 405c9d1..fe645aa 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -257,6 +257,8 @@ struct IntelIOMMUState {
     uint8_t womask[DMAR_REG_SIZE];  /* WO (write only - read returns 0) */
     uint32_t version;
 
+    bool caching_mode;          /* RO - is cap CM enabled? */
+
     dma_addr_t root;                /* Current root table pointer */
     bool root_extended;             /* Type of root table (extended or not) */
     bool dmar_enabled;              /* Set if DMA remapping is enabled */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 15/20] intel_iommu: provide its own replay() callback
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 15/20] intel_iommu: provide its own replay() callback Peter Xu
@ 2017-01-22  7:56   ` Jason Wang
  2017-01-22  8:51     ` Peter Xu
  0 siblings, 1 reply; 75+ messages in thread
From: Jason Wang @ 2017-01-22  7:56 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, alex.williamson, bd.aviv



On 2017-01-20 21:08, Peter Xu wrote:
> The default replay() don't work for VT-d since vt-d will have a huge
> default memory region which covers address range 0-(2^64-1). This will
> normally consumes a lot of time (which looks like a dead loop).
>
> The solution is simple - we don't walk over all the regions. Instead, we
> jump over the regions when we found that the page directories are empty.
> It'll greatly reduce the time to walk the whole region.
>
> To achieve this, we provided a page walk helper to do that, invoking
> corresponding hook function when we found an page we are interested in.
> vtd_page_walk_level() is the core logic for the page walking. It's
> interface is designed to suite further use case, e.g., to invalidate a
> range of addresses.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   hw/i386/intel_iommu.c | 216 ++++++++++++++++++++++++++++++++++++++++++++++++--
>   hw/i386/trace-events  |   7 ++
>   include/exec/memory.h |   2 +
>   3 files changed, 220 insertions(+), 5 deletions(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 6f5f68a..f9c5142 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -598,6 +598,22 @@ static inline uint32_t vtd_get_agaw_from_context_entry(VTDContextEntry *ce)
>       return 30 + (ce->hi & VTD_CONTEXT_ENTRY_AW) * 9;
>   }
>   
> +static inline uint64_t vtd_iova_limit(VTDContextEntry *ce)
> +{
> +    uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
> +    return 1ULL << MIN(ce_agaw, VTD_MGAW);
> +}
> +
> +/* Return true if IOVA passes range check, otherwise false. */
> +static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce)
> +{
> +    /*
> +     * Check if @iova is above 2^X-1, where X is the minimum of MGAW
> +     * in CAP_REG and AW in context-entry.
> +     */
> +    return !(iova & ~(vtd_iova_limit(ce) - 1));
> +}
> +
>   static const uint64_t vtd_paging_entry_rsvd_field[] = {
>       [0] = ~0ULL,
>       /* For not large page */
> @@ -633,13 +649,9 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
>       uint32_t level = vtd_get_level_from_context_entry(ce);
>       uint32_t offset;
>       uint64_t slpte;
> -    uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
>       uint64_t access_right_check;
>   
> -    /* Check if @iova is above 2^X-1, where X is the minimum of MGAW
> -     * in CAP_REG and AW in context-entry.
> -     */
> -    if (iova & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
> +    if (!vtd_iova_range_check(iova, ce)) {
>           trace_vtd_err("IOVA exceeds limits");
>           return -VTD_FR_ADDR_BEYOND_MGAW;
>       }
> @@ -681,6 +693,168 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
>       }
>   }
>   
> +typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
> +
> +/**
> + * vtd_page_walk_level - walk over specific level for IOVA range
> + *
> + * @addr: base GPA addr to start the walk
> + * @start: IOVA range start address
> + * @end: IOVA range end address (start <= addr < end)
> + * @hook_fn: hook func to be called when detected page
> + * @private: private data to be passed into hook func
> + * @read: whether parent level has read permission
> + * @write: whether parent level has write permission
> + * @skipped: accumulated skipped ranges

What is this parameter used for? It looks like it is never used in
this series.

> + * @notify_unmap: whether we should notify invalid entries
> + */
> +static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> +                               uint64_t end, vtd_page_walk_hook hook_fn,
> +                               void *private, uint32_t level,
> +                               bool read, bool write, uint64_t *skipped,
> +                               bool notify_unmap)
> +{
> +    bool read_cur, write_cur, entry_valid;
> +    uint32_t offset;
> +    uint64_t slpte;
> +    uint64_t subpage_size, subpage_mask;
> +    IOMMUTLBEntry entry;
> +    uint64_t iova = start;
> +    uint64_t iova_next;
> +    uint64_t skipped_local = 0;
> +    int ret = 0;
> +
> +    trace_vtd_page_walk_level(addr, level, start, end);
> +
> +    subpage_size = 1ULL << vtd_slpt_level_shift(level);
> +    subpage_mask = vtd_slpt_level_page_mask(level);
> +
> +    while (iova < end) {
> +        iova_next = (iova & subpage_mask) + subpage_size;
> +
> +        offset = vtd_iova_level_offset(iova, level);
> +        slpte = vtd_get_slpte(addr, offset);
> +
> +        /*
> +         * When one of the following case happens, we assume the whole
> +         * range is invalid:
> +         *
> +         * 1. read block failed

I don't get the meaning (and I don't see any code related to this comment).

> +         * 3. reserved area non-zero
> +         * 2. both read & write flag are not set

Should this be 1, 2, 3? Also, the above comment conflicts with the code,
at least when notify_unmap is true.

> +         */
> +
> +        if (slpte == (uint64_t)-1) {

If this is true, vtd_slpte_nonzero_rsvd(slpte) should be true too I think?

> +            trace_vtd_page_walk_skip_read(iova, iova_next);
> +            skipped_local++;
> +            goto next;
> +        }
> +
> +        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> +            trace_vtd_page_walk_skip_reserve(iova, iova_next);
> +            skipped_local++;
> +            goto next;
> +        }
> +
> +        /* Permissions are stacked with parents' */
> +        read_cur = read && (slpte & VTD_SL_R);
> +        write_cur = write && (slpte & VTD_SL_W);
> +
> +        /*
> +         * As long as we have either read/write permission, this is
> +         * a valid entry. The rule works for both page or page tables.
> +         */
> +        entry_valid = read_cur | write_cur;
> +
> +        if (vtd_is_last_slpte(slpte, level)) {
> +            entry.target_as = &address_space_memory;
> +            entry.iova = iova & subpage_mask;
> +            /*
> +             * This might be meaningless addr if (!read_cur &&
> +             * !write_cur), but after all this field will be
> +             * meaningless in that case, so let's share the code to
> +             * generate the IOTLBs no matter it's an MAP or UNMAP
> +             */
> +            entry.translated_addr = vtd_get_slpte_addr(slpte);
> +            entry.addr_mask = ~subpage_mask;
> +            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
> +            if (!entry_valid && !notify_unmap) {
> +                trace_vtd_page_walk_skip_perm(iova, iova_next);
> +                skipped_local++;
> +                goto next;
> +            }

In which case should we care about unmap here (better with a comment,
I think)?

> +            trace_vtd_page_walk_one(level, entry.iova, entry.translated_addr,
> +                                    entry.addr_mask, entry.perm);
> +            if (hook_fn) {
> +                ret = hook_fn(&entry, private);

For better performance, we could try to merge adjacent mappings here. I 
think both vfio and vhost support this and it can save a lot of ioctls.

> +                if (ret < 0) {
> +                    error_report("Detected error in page walk hook "
> +                                 "function, stop walk.");
> +                    return ret;
> +                }
> +            }
> +        } else {
> +            if (!entry_valid) {
> +                trace_vtd_page_walk_skip_perm(iova, iova_next);
> +                skipped_local++;
> +                goto next;
> +            }
> +            trace_vtd_page_walk_level(vtd_get_slpte_addr(slpte), level - 1,
> +                                      iova, MIN(iova_next, end));

This looks duplicated?

> +            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte), iova,
> +                                      MIN(iova_next, end), hook_fn, private,
> +                                      level - 1, read_cur, write_cur,
> +                                      &skipped_local, notify_unmap);
> +            if (ret < 0) {
> +                error_report("Detected page walk error on addr 0x%"PRIx64
> +                             " level %"PRIu32", stop walk.", addr, level - 1);

This is guest triggered, so better to use a debug macro or tracepoint.

> +                return ret;
> +            }
> +        }
> +
> +next:
> +        iova = iova_next;
> +    }
> +
> +    if (skipped) {
> +        *skipped += skipped_local;
> +    }
> +
> +    return 0;
> +}
> +
> +/**
> + * vtd_page_walk - walk specific IOVA range, and call the hook
> + *
> + * @ce: context entry to walk upon
> + * @start: IOVA address to start the walk
> + * @end: IOVA range end address (start <= addr < end)
> + * @hook_fn: the hook that to be called for each detected area
> + * @private: private data for the hook function
> + */
> +static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
> +                         vtd_page_walk_hook hook_fn, void *private)
> +{
> +    dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
> +    uint32_t level = vtd_get_level_from_context_entry(ce);
> +
> +    if (!vtd_iova_range_check(start, ce)) {
> +        error_report("IOVA start 0x%"PRIx64 " end 0x%"PRIx64" exceeds limits",
> +                     start, end);

This is guest triggered, so better to use a debug macro or tracepoint.

> +        return -VTD_FR_ADDR_BEYOND_MGAW;
> +    }
> +
> +    if (!vtd_iova_range_check(end, ce)) {
> +        /* Fix end so that it reaches the maximum */
> +        end = vtd_iova_limit(ce);
> +    }
> +
> +    trace_vtd_page_walk_level(addr, level, start, end);

Duplicated with the tracepoint in vtd_page_walk_level() too?

> +
> +    return vtd_page_walk_level(addr, start, end, hook_fn, private,
> +                               level, true, true, NULL, false);
> +}
> +
>   /* Map a device to its corresponding domain (context-entry) */
>   static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>                                       uint8_t devfn, VTDContextEntry *ce)
> @@ -2395,6 +2569,37 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>       return vtd_dev_as;
>   }
>   
> +static int vtd_replay_hook(IOMMUTLBEntry *entry, void *private)
> +{
> +    memory_region_notify_one((IOMMUNotifier *)private, entry);
> +    return 0;
> +}
> +
> +static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
> +{
> +    VTDAddressSpace *vtd_as = container_of(mr, VTDAddressSpace, iommu);
> +    IntelIOMMUState *s = vtd_as->iommu_state;
> +    uint8_t bus_n = pci_bus_num(vtd_as->bus);
> +    VTDContextEntry ce;
> +
> +    if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
> +        /*
> +         * Scanned a valid context entry, walk over the pages and
> +         * notify when needed.
> +         */
> +        trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
> +                                  PCI_FUNC(vtd_as->devfn),
> +                                  VTD_CONTEXT_ENTRY_DID(ce.hi),
> +                                  ce.hi, ce.lo);
> +        vtd_page_walk(&ce, 0, ~0, vtd_replay_hook, (void *)n);

~0ULL?

> +    } else {
> +        trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
> +                                    PCI_FUNC(vtd_as->devfn));
> +    }
> +
> +    return;
> +}
> +
>   /* Do the initialization. It will also be called when reset, so pay
>    * attention when adding new initialization stuff.
>    */
> @@ -2409,6 +2614,7 @@ static void vtd_init(IntelIOMMUState *s)
>   
>       s->iommu_ops.translate = vtd_iommu_translate;
>       s->iommu_ops.notify_flag_changed = vtd_iommu_notify_flag_changed;
> +    s->iommu_ops.replay = vtd_iommu_replay;
>       s->root = 0;
>       s->root_extended = false;
>       s->dmar_enabled = false;
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index a273980..a3e1a9d 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -31,6 +31,13 @@ vtd_iotlb_page_update(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t doma
>   vtd_iotlb_cc_hit(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen) "IOTLB context hit bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32
>   vtd_iotlb_cc_update(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen1, uint32_t gen2) "IOTLB context update bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32" -> gen %"PRIu32
>   vtd_iotlb_reset(const char *reason) "IOTLB reset (reason: %s)"
> +vtd_replay_ce_valid(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t domain, uint64_t hi, uint64_t lo) "replay valid context device %02"PRIx8":%02"PRIx8".%02"PRIx8" domain 0x%"PRIx16" hi 0x%"PRIx64" lo 0x%"PRIx64
> +vtd_replay_ce_invalid(uint8_t bus, uint8_t dev, uint8_t fn) "replay invalid context device %02"PRIx8":%02"PRIx8".%02"PRIx8
> +vtd_page_walk_level(uint64_t addr, uint32_t level, uint64_t start, uint64_t end) "walk (base=0x%"PRIx64", level=%"PRIu32") iova range 0x%"PRIx64" - 0x%"PRIx64
> +vtd_page_walk_one(uint32_t level, uint64_t iova, uint64_t gpa, uint64_t mask, int perm) "detected page level 0x%"PRIx32" iova 0x%"PRIx64" -> gpa 0x%"PRIx64" mask 0x%"PRIx64" perm %d"
> +vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
> +vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
> +vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
>   
>   # hw/i386/amd_iommu.c
>   amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index bb4e654..9fd3232 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -59,6 +59,8 @@ typedef enum {
>       IOMMU_RW   = 3,
>   } IOMMUAccessFlags;
>   
> +#define IOMMU_ACCESS_FLAG(r, w) (((r) ? IOMMU_RO : 0) | ((w) ? IOMMU_WO : 0))
> +
>   struct IOMMUTLBEntry {
>       AddressSpace    *target_as;
>       hwaddr           iova;

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 18/20] intel_iommu: enable vfio devices
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 18/20] intel_iommu: enable vfio devices Peter Xu
@ 2017-01-22  8:08   ` Jason Wang
  2017-01-22  9:04     ` Peter Xu
  2017-01-23  2:01   ` Jason Wang
  1 sibling, 1 reply; 75+ messages in thread
From: Jason Wang @ 2017-01-22  8:08 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, alex.williamson, bd.aviv



On 2017-01-20 21:08, Peter Xu wrote:
> This patch is based on Aviv Ben-David (<bd.aviv@gmail.com>)'s patch
> upstream:
>
>    "IOMMU: enable intel_iommu map and unmap notifiers"
>    https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01453.html
>
> However I removed/fixed some content, and added my own codes.
>
> Instead of translate() every page for iotlb invalidations (which is
> slower), we walk the pages when needed and notify in a hook function.
>
> This patch enables vfio devices for VT-d emulation.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   hw/i386/intel_iommu.c         | 66 +++++++++++++++++++++++++++++++++++++------
>   include/hw/i386/intel_iommu.h |  8 ++++++
>   2 files changed, 65 insertions(+), 9 deletions(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 83a2e1f..7cbf057 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -833,7 +833,8 @@ next:
>    * @private: private data for the hook function
>    */
>   static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
> -                         vtd_page_walk_hook hook_fn, void *private)
> +                         vtd_page_walk_hook hook_fn, void *private,
> +                         bool notify_unmap)
>   {
>       dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
>       uint32_t level = vtd_get_level_from_context_entry(ce);
> @@ -852,7 +853,7 @@ static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
>       trace_vtd_page_walk_level(addr, level, start, end);
>   
>       return vtd_page_walk_level(addr, start, end, hook_fn, private,
> -                               level, true, true, NULL, false);
> +                               level, true, true, NULL, notify_unmap);
>   }
>   
>   /* Map a device to its corresponding domain (context-entry) */
> @@ -1205,6 +1206,33 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
>                                   &domain_id);
>   }
>   
> +static int vtd_page_invalidate_notify_hook(IOMMUTLBEntry *entry,
> +                                           void *private)
> +{
> +    memory_region_notify_iommu((MemoryRegion *)private, *entry);
> +    return 0;
> +}
> +
> +static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
> +                                           uint16_t domain_id, hwaddr addr,
> +                                           uint8_t am)
> +{
> +    IntelIOMMUNotifierNode *node;
> +    VTDContextEntry ce;
> +    int ret;
> +
> +    QLIST_FOREACH(node, &(s->notifiers_list), next) {
> +        VTDAddressSpace *vtd_as = node->vtd_as;
> +        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> +                                       vtd_as->devfn, &ce);
> +        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> +            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
> +                          vtd_page_invalidate_notify_hook,
> +                          (void *)&vtd_as->iommu, true);

Why not simply trigger the notifier here? (Or is this required by vfio?)

> +        }
> +    }
> +}
> +
>   static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
>                                         hwaddr addr, uint8_t am)
>   {
> @@ -1215,6 +1243,7 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
>       info.addr = addr;
>       info.mask = ~((1 << am) - 1);
>       g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
> +    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);

I think it's better to squash DSI and GLOBAL invalidation into this 
patch, otherwise the patch is buggy.

Thanks

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 19/20] intel_iommu: unmap existing pages before replay
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 19/20] intel_iommu: unmap existing pages before replay Peter Xu
@ 2017-01-22  8:13   ` Jason Wang
  2017-01-22  9:09     ` Peter Xu
  2017-01-23 10:40   ` Jason Wang
  1 sibling, 1 reply; 75+ messages in thread
From: Jason Wang @ 2017-01-22  8:13 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, alex.williamson, bd.aviv



On 2017-01-20 21:08, Peter Xu wrote:
> Previous replay works for domain switch only if the original domain does
> not have mapped pages. For example, if we switch domain from A to B, it
> will only work if A has no existing mapping. If there is, then there's
> problem - current replay didn't make sure the old mappings are cleared
> before replaying the new one.

I'm not quite sure this is needed. I thought the only thing we need to
do is stop the device's DMA during the move? Or is there an example that
would cause trouble?

Thanks

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 15/20] intel_iommu: provide its own replay() callback
  2017-01-22  7:56   ` Jason Wang
@ 2017-01-22  8:51     ` Peter Xu
  2017-01-22  9:36       ` Peter Xu
                         ` (2 more replies)
  0 siblings, 3 replies; 75+ messages in thread
From: Peter Xu @ 2017-01-22  8:51 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv

On Sun, Jan 22, 2017 at 03:56:10PM +0800, Jason Wang wrote:

[...]

> >+/**
> >+ * vtd_page_walk_level - walk over specific level for IOVA range
> >+ *
> >+ * @addr: base GPA addr to start the walk
> >+ * @start: IOVA range start address
> >+ * @end: IOVA range end address (start <= addr < end)
> >+ * @hook_fn: hook func to be called when detected page
> >+ * @private: private data to be passed into hook func
> >+ * @read: whether parent level has read permission
> >+ * @write: whether parent level has write permission
> >+ * @skipped: accumulated skipped ranges
> 
> What's the usage for this parameter? Looks like it was never used in this
> series.

This was for debugging purposes before, and I kept it in case it can be
used again one day, considering that it does not affect the overall
performance much.

> 
> >+ * @notify_unmap: whether we should notify invalid entries
> >+ */
> >+static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> >+                               uint64_t end, vtd_page_walk_hook hook_fn,
> >+                               void *private, uint32_t level,
> >+                               bool read, bool write, uint64_t *skipped,
> >+                               bool notify_unmap)
> >+{
> >+    bool read_cur, write_cur, entry_valid;
> >+    uint32_t offset;
> >+    uint64_t slpte;
> >+    uint64_t subpage_size, subpage_mask;
> >+    IOMMUTLBEntry entry;
> >+    uint64_t iova = start;
> >+    uint64_t iova_next;
> >+    uint64_t skipped_local = 0;
> >+    int ret = 0;
> >+
> >+    trace_vtd_page_walk_level(addr, level, start, end);
> >+
> >+    subpage_size = 1ULL << vtd_slpt_level_shift(level);
> >+    subpage_mask = vtd_slpt_level_page_mask(level);
> >+
> >+    while (iova < end) {
> >+        iova_next = (iova & subpage_mask) + subpage_size;
> >+
> >+        offset = vtd_iova_level_offset(iova, level);
> >+        slpte = vtd_get_slpte(addr, offset);
> >+
> >+        /*
> >+         * When one of the following case happens, we assume the whole
> >+         * range is invalid:
> >+         *
> >+         * 1. read block failed
> 
> Don't get the meaning (and don't see any code relate to this comment).

I considered the vtd_get_slpte() above a "read", so what I meant was
that if we fail to read the SLPTE for some reason, we assume the range
is invalid.

> 
> >+         * 3. reserved area non-zero
> >+         * 2. both read & write flag are not set
> 
> Should be 1,2,3? And the above comment is conflict with the code at least
> when notify_unmap is true.

Yes, okay I don't know why 3 jumped. :(

And yes, I should mention that "both read & write flags not set" only
applies to page tables here.

> 
> >+         */
> >+
> >+        if (slpte == (uint64_t)-1) {
> 
> If this is true, vtd_slpte_nonzero_rsvd(slpte) should be true too I think?

Yes, but we are doing two checks here:

- checking against -1 to make sure the slpte read succeeded
- checking against nonzero reserved fields to make sure it follows the spec

Imho we should not skip the first check here, unless removing it really
matters one day (e.g., for performance reasons? I cannot think of one
yet).

> 
> >+            trace_vtd_page_walk_skip_read(iova, iova_next);
> >+            skipped_local++;
> >+            goto next;
> >+        }
> >+
> >+        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> >+            trace_vtd_page_walk_skip_reserve(iova, iova_next);
> >+            skipped_local++;
> >+            goto next;
> >+        }
> >+
> >+        /* Permissions are stacked with parents' */
> >+        read_cur = read && (slpte & VTD_SL_R);
> >+        write_cur = write && (slpte & VTD_SL_W);
> >+
> >+        /*
> >+         * As long as we have either read/write permission, this is
> >+         * a valid entry. The rule works for both page or page tables.
> >+         */
> >+        entry_valid = read_cur | write_cur;
> >+
> >+        if (vtd_is_last_slpte(slpte, level)) {
> >+            entry.target_as = &address_space_memory;
> >+            entry.iova = iova & subpage_mask;
> >+            /*
> >+             * This might be meaningless addr if (!read_cur &&
> >+             * !write_cur), but after all this field will be
> >+             * meaningless in that case, so let's share the code to
> >+             * generate the IOTLBs no matter it's an MAP or UNMAP
> >+             */
> >+            entry.translated_addr = vtd_get_slpte_addr(slpte);
> >+            entry.addr_mask = ~subpage_mask;
> >+            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
> >+            if (!entry_valid && !notify_unmap) {
> >+                trace_vtd_page_walk_skip_perm(iova, iova_next);
> >+                skipped_local++;
> >+                goto next;
> >+            }
> 
> Under which case should we care about unmap here (better with a comment I
> think)?

When the PSIs are for invalidations, rather than for newly mapped
entries. In that case, notify_unmap will be true, and here we need to
notify IOMMU_NONE to do the cache flush or unmap.

(this page walk is not only for replaying, it is for cache flushing as
 well)

Do you have a suggestion for the comments?

> 
> >+            trace_vtd_page_walk_one(level, entry.iova, entry.translated_addr,
> >+                                    entry.addr_mask, entry.perm);
> >+            if (hook_fn) {
> >+                ret = hook_fn(&entry, private);
> 
> For better performance, we could try to merge adjacent mappings here. I
> think both vfio and vhost support this and it can save a lot of ioctls.

Looks so, and this is in my todo list.

Do you mind if I do it later, after this series is merged? I would
really appreciate it if we could get this code settled down first
(considering that this series has been dangling for half a year or
more, starting from Aviv's series), and I am just afraid this would
keep the series from converging (and I believe there are other places
that can be enhanced in the future as well).
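
(Just to show the direction - below is a rough, standalone sketch of
what such merging might look like; it is only an illustration and not
code from this series. The idea is to accumulate contiguous IOVA ranges
and fire a single notification when a gap is hit.)

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint64_t iova;   /* start of the accumulated range */
        uint64_t size;   /* 0 means nothing accumulated yet */
    } MergedRange;

    /*
     * Try to append [iova, iova + size) to the accumulated range.
     * Returns false when the new range is not contiguous, in which
     * case the caller should notify the accumulated range first and
     * then restart accumulation from the new range.
     */
    static bool merged_range_add(MergedRange *m, uint64_t iova,
                                 uint64_t size)
    {
        if (m->size == 0) {
            m->iova = iova;
            m->size = size;
            return true;
        }
        if (m->iova + m->size == iova) {
            m->size += size;
            return true;
        }
        return false;
    }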

> 
> >+                if (ret < 0) {
> >+                    error_report("Detected error in page walk hook "
> >+                                 "function, stop walk.");
> >+                    return ret;
> >+                }
> >+            }
> >+        } else {
> >+            if (!entry_valid) {
> >+                trace_vtd_page_walk_skip_perm(iova, iova_next);
> >+                skipped_local++;
> >+                goto next;
> >+            }
> >+            trace_vtd_page_walk_level(vtd_get_slpte_addr(slpte), level - 1,
> >+                                      iova, MIN(iova_next, end));
> 
> This looks duplicated?

I suppose not. The level is different.

> 
> >+            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte), iova,
> >+                                      MIN(iova_next, end), hook_fn, private,
> >+                                      level - 1, read_cur, write_cur,
> >+                                      &skipped_local, notify_unmap);
> >+            if (ret < 0) {
> >+                error_report("Detected page walk error on addr 0x%"PRIx64
> >+                             " level %"PRIu32", stop walk.", addr, level - 1);
> 
> Guest triggered, so better use debug macro or tracepoint.

Sorry. Will replace all the error_report()s in the whole series.

> 
> >+                return ret;
> >+            }
> >+        }
> >+
> >+next:
> >+        iova = iova_next;
> >+    }
> >+
> >+    if (skipped) {
> >+        *skipped += skipped_local;
> >+    }
> >+
> >+    return 0;
> >+}
> >+
> >+/**
> >+ * vtd_page_walk - walk specific IOVA range, and call the hook
> >+ *
> >+ * @ce: context entry to walk upon
> >+ * @start: IOVA address to start the walk
> >+ * @end: IOVA range end address (start <= addr < end)
> >+ * @hook_fn: the hook that to be called for each detected area
> >+ * @private: private data for the hook function
> >+ */
> >+static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
> >+                         vtd_page_walk_hook hook_fn, void *private)
> >+{
> >+    dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
> >+    uint32_t level = vtd_get_level_from_context_entry(ce);
> >+
> >+    if (!vtd_iova_range_check(start, ce)) {
> >+        error_report("IOVA start 0x%"PRIx64 " end 0x%"PRIx64" exceeds limits",
> >+                     start, end);
> 
> Guest triggered, better use debug macro or tracepoint.

Same.

> 
> >+        return -VTD_FR_ADDR_BEYOND_MGAW;
> >+    }
> >+
> >+    if (!vtd_iova_range_check(end, ce)) {
> >+        /* Fix end so that it reaches the maximum */
> >+        end = vtd_iova_limit(ce);
> >+    }
> >+
> >+    trace_vtd_page_walk_level(addr, level, start, end);
> 
> Duplicated with the tracepoint in vtd_page_walk_level() too?

Nope? This one is at the top level.

> 
> >+
> >+    return vtd_page_walk_level(addr, start, end, hook_fn, private,
> >+                               level, true, true, NULL, false);
> >+}
> >+
> >  /* Map a device to its corresponding domain (context-entry) */
> >  static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
> >                                      uint8_t devfn, VTDContextEntry *ce)
> >@@ -2395,6 +2569,37 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
> >      return vtd_dev_as;
> >  }
> >+static int vtd_replay_hook(IOMMUTLBEntry *entry, void *private)
> >+{
> >+    memory_region_notify_one((IOMMUNotifier *)private, entry);
> >+    return 0;
> >+}
> >+
> >+static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
> >+{
> >+    VTDAddressSpace *vtd_as = container_of(mr, VTDAddressSpace, iommu);
> >+    IntelIOMMUState *s = vtd_as->iommu_state;
> >+    uint8_t bus_n = pci_bus_num(vtd_as->bus);
> >+    VTDContextEntry ce;
> >+
> >+    if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
> >+        /*
> >+         * Scanned a valid context entry, walk over the pages and
> >+         * notify when needed.
> >+         */
> >+        trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
> >+                                  PCI_FUNC(vtd_as->devfn),
> >+                                  VTD_CONTEXT_ENTRY_DID(ce.hi),
> >+                                  ce.hi, ce.lo);
> >+        vtd_page_walk(&ce, 0, ~0, vtd_replay_hook, (void *)n);
> 
> ~0ULL?

Fixing up.

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 18/20] intel_iommu: enable vfio devices
  2017-01-22  8:08   ` Jason Wang
@ 2017-01-22  9:04     ` Peter Xu
  2017-01-23  1:55       ` Jason Wang
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2017-01-22  9:04 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv

On Sun, Jan 22, 2017 at 04:08:04PM +0800, Jason Wang wrote:

[...]

> >+static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
> >+                                           uint16_t domain_id, hwaddr addr,
> >+                                           uint8_t am)
> >+{
> >+    IntelIOMMUNotifierNode *node;
> >+    VTDContextEntry ce;
> >+    int ret;
> >+
> >+    QLIST_FOREACH(node, &(s->notifiers_list), next) {
> >+        VTDAddressSpace *vtd_as = node->vtd_as;
> >+        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> >+                                       vtd_as->devfn, &ce);
> >+        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> >+            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
> >+                          vtd_page_invalidate_notify_hook,
> >+                          (void *)&vtd_as->iommu, true);
> 
> Why not simply trigger the notifier here? (or is this vfio required?)

Because we may only want to notify part of the region - we have a mask
here, not an exact size.

Consider this: the guest (with caching mode) maps 12K of memory (3 4K
pages), so the mask will be extended to 16K in the guest. In that case,
we need to explicitly walk the page entries to know that the 4th page
should not be notified.
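
To illustrate with rough numbers (just an example, not taken from the
code):

    pages mapped by the guest:  3                    (12K)
    am carried in the PSI:      2                    (spans are 2^am pages)
    range we walk:              (1 << 2) * 4K = 16K

Only 3 valid SLPTEs are found within that 16K range, so the 4th page is
never notified as a new mapping.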

> 
> >+        }
> >+    }
> >+}
> >+
> >  static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
> >                                        hwaddr addr, uint8_t am)
> >  {
> >@@ -1215,6 +1243,7 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
> >      info.addr = addr;
> >      info.mask = ~((1 << am) - 1);
> >      g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
> >+    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
> 
> I think it's better to squash DSI and GLOBAL invalidation into this patch,
> otherwise the patch is buggy.

I can do this. Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 19/20] intel_iommu: unmap existing pages before replay
  2017-01-22  8:13   ` Jason Wang
@ 2017-01-22  9:09     ` Peter Xu
  2017-01-23  1:57       ` Jason Wang
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2017-01-22  9:09 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv

On Sun, Jan 22, 2017 at 04:13:32PM +0800, Jason Wang wrote:
> 
> 
> On 2017-01-20 21:08, Peter Xu wrote:
> >Previous replay works for domain switch only if the original domain does
> >not have mapped pages. For example, if we switch domain from A to B, it
> >will only work if A has no existing mapping. If there is, then there's
> >problem - current replay didn't make sure the old mappings are cleared
> >before replaying the new one.
> 
> I'm not quite sure this is needed. I thought the only thing we need to do is
> stop DMA of device during the moving? Or is there an example that will cause
> trouble?

I think this patch is essential.

Example:

- device D1 moved to domain A, domain A has no mapping
- map page P1 in domain A, so D1 will have a mapping of page P1
- create domain B with mapping P2
- move D1 from domain A to domain B

Here, if we don't unmap the existing pages in domain A (P1), then after
the switch D1 will have both P1 and P2 mapped, while domain B actually
only has P2. The mappings are then inconsistent, which is wrong.

If you (or anyone else) think this is a bug in patch 18 as well, I can
just squash both 19/20 into patch 18.

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 15/20] intel_iommu: provide its own replay() callback
  2017-01-22  8:51     ` Peter Xu
@ 2017-01-22  9:36       ` Peter Xu
  2017-01-23  1:50         ` Jason Wang
  2017-01-23  1:48       ` Jason Wang
  2017-01-23 19:33       ` Alex Williamson
  2 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2017-01-22  9:36 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv

On Sun, Jan 22, 2017 at 04:51:18PM +0800, Peter Xu wrote:
> On Sun, Jan 22, 2017 at 03:56:10PM +0800, Jason Wang wrote:
> 
> [...]
> 
> > >+/**
> > >+ * vtd_page_walk_level - walk over specific level for IOVA range
> > >+ *
> > >+ * @addr: base GPA addr to start the walk
> > >+ * @start: IOVA range start address
> > >+ * @end: IOVA range end address (start <= addr < end)
> > >+ * @hook_fn: hook func to be called when detected page
> > >+ * @private: private data to be passed into hook func
> > >+ * @read: whether parent level has read permission
> > >+ * @write: whether parent level has write permission
> > >+ * @skipped: accumulated skipped ranges
> > 
> > What's the usage for this parameter? Looks like it was never used in this
> > series.
> 
> This was for debugging purpose before, and I kept it in case one day
> it can be used again, considering that will not affect much on the
> overall performance.
> 
> > 
> > >+ * @notify_unmap: whether we should notify invalid entries
> > >+ */
> > >+static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> > >+                               uint64_t end, vtd_page_walk_hook hook_fn,
> > >+                               void *private, uint32_t level,
> > >+                               bool read, bool write, uint64_t *skipped,
> > >+                               bool notify_unmap)
> > >+{
> > >+    bool read_cur, write_cur, entry_valid;
> > >+    uint32_t offset;
> > >+    uint64_t slpte;
> > >+    uint64_t subpage_size, subpage_mask;
> > >+    IOMMUTLBEntry entry;
> > >+    uint64_t iova = start;
> > >+    uint64_t iova_next;
> > >+    uint64_t skipped_local = 0;
> > >+    int ret = 0;
> > >+
> > >+    trace_vtd_page_walk_level(addr, level, start, end);
> > >+
> > >+    subpage_size = 1ULL << vtd_slpt_level_shift(level);
> > >+    subpage_mask = vtd_slpt_level_page_mask(level);
> > >+
> > >+    while (iova < end) {
> > >+        iova_next = (iova & subpage_mask) + subpage_size;
> > >+
> > >+        offset = vtd_iova_level_offset(iova, level);
> > >+        slpte = vtd_get_slpte(addr, offset);
> > >+
> > >+        /*
> > >+         * When one of the following case happens, we assume the whole
> > >+         * range is invalid:
> > >+         *
> > >+         * 1. read block failed
> > 
> > Don't get the meaning (and don't see any code relate to this comment).
> 
> I took above vtd_get_slpte() a "read", so I was trying to mean that we
> failed to read the SLPTE due to some reason, we assume the range is
> invalid.
> 
> > 
> > >+         * 3. reserved area non-zero
> > >+         * 2. both read & write flag are not set
> > 
> > Should be 1,2,3? And the above comment is conflict with the code at least
> > when notify_unmap is true.
> 
> Yes, okay I don't know why 3 jumped. :(
> 
> And yes, I should mention that "both read & write flag not set" only
> suites for page tables here.
> 
> > 
> > >+         */
> > >+
> > >+        if (slpte == (uint64_t)-1) {
> > 
> > If this is true, vtd_slpte_nonzero_rsvd(slpte) should be true too I think?
> 
> Yes, but we are doing two checks here:
> 
> - checking against -1 to make sure slpte read success
> - checking against nonzero reserved fields to make sure it follows spec
> 
> Imho we should not skip the first check here, only if one day removing
> this may really matter (e.g., for performance reason? I cannot think
> of one yet).
> 
> > 
> > >+            trace_vtd_page_walk_skip_read(iova, iova_next);
> > >+            skipped_local++;
> > >+            goto next;
> > >+        }
> > >+
> > >+        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> > >+            trace_vtd_page_walk_skip_reserve(iova, iova_next);
> > >+            skipped_local++;
> > >+            goto next;
> > >+        }
> > >+
> > >+        /* Permissions are stacked with parents' */
> > >+        read_cur = read && (slpte & VTD_SL_R);
> > >+        write_cur = write && (slpte & VTD_SL_W);
> > >+
> > >+        /*
> > >+         * As long as we have either read/write permission, this is
> > >+         * a valid entry. The rule works for both page or page tables.
> > >+         */
> > >+        entry_valid = read_cur | write_cur;
> > >+
> > >+        if (vtd_is_last_slpte(slpte, level)) {
> > >+            entry.target_as = &address_space_memory;
> > >+            entry.iova = iova & subpage_mask;
> > >+            /*
> > >+             * This might be meaningless addr if (!read_cur &&
> > >+             * !write_cur), but after all this field will be
> > >+             * meaningless in that case, so let's share the code to
> > >+             * generate the IOTLBs no matter it's an MAP or UNMAP
> > >+             */
> > >+            entry.translated_addr = vtd_get_slpte_addr(slpte);
> > >+            entry.addr_mask = ~subpage_mask;
> > >+            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
> > >+            if (!entry_valid && !notify_unmap) {
> > >+                trace_vtd_page_walk_skip_perm(iova, iova_next);
> > >+                skipped_local++;
> > >+                goto next;
> > >+            }
> > 
> > Under which case should we care about unmap here (better with a comment I
> > think)?
> 
> When PSIs are for invalidation, rather than newly mapped entries. In
> that case, notify_unmap will be true, and here we need to notify
> IOMMU_NONE to do the cache flush or unmap.
> 
> (this page walk is not only for replaying, it is for cache flushing as
>  well)
> 
> Do you have suggestion on the comments?

Besides this one, I tried to fix the comments in this function as
below; hope this is better (I removed the 1-3 list since I think that
is clearer from the code below):

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index e958f53..f3fe8c4 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -735,15 +735,6 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
         offset = vtd_iova_level_offset(iova, level);
         slpte = vtd_get_slpte(addr, offset);

-        /*
-         * When one of the following case happens, we assume the whole
-         * range is invalid:
-         *
-         * 1. read block failed
-         * 3. reserved area non-zero
-         * 2. both read & write flag are not set
-         */
-
         if (slpte == (uint64_t)-1) {
             trace_vtd_page_walk_skip_read(iova, iova_next);
             skipped_local++;
@@ -761,20 +752,16 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
         write_cur = write && (slpte & VTD_SL_W);

         /*
-         * As long as we have either read/write permission, this is
-         * a valid entry. The rule works for both page or page tables.
+         * As long as we have either read/write permission, this is a
+         * valid entry. The rule works for both page entries and page
+         * table entries.
          */
         entry_valid = read_cur | write_cur;

         if (vtd_is_last_slpte(slpte, level)) {
             entry.target_as = &address_space_memory;
             entry.iova = iova & subpage_mask;
-            /*
-             * This might be meaningless addr if (!read_cur &&
-             * !write_cur), but after all this field will be
-             * meaningless in that case, so let's share the code to
-             * generate the IOTLBs no matter it's an MAP or UNMAP
-             */
+            /* NOTE: this is only meaningful if entry_valid == true */
             entry.translated_addr = vtd_get_slpte_addr(slpte);
             entry.addr_mask = ~subpage_mask;
             entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);

Thanks,

-- peterx

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 15/20] intel_iommu: provide its own replay() callback
  2017-01-22  8:51     ` Peter Xu
  2017-01-22  9:36       ` Peter Xu
@ 2017-01-23  1:48       ` Jason Wang
  2017-01-23  2:54         ` Peter Xu
  2017-01-23 19:33       ` Alex Williamson
  2 siblings, 1 reply; 75+ messages in thread
From: Jason Wang @ 2017-01-23  1:48 UTC (permalink / raw)
  To: Peter Xu
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel,
	alex.williamson



On 2017-01-22 16:51, Peter Xu wrote:
> On Sun, Jan 22, 2017 at 03:56:10PM +0800, Jason Wang wrote:
>
> [...]
>
>>> +/**
>>> + * vtd_page_walk_level - walk over specific level for IOVA range
>>> + *
>>> + * @addr: base GPA addr to start the walk
>>> + * @start: IOVA range start address
>>> + * @end: IOVA range end address (start <= addr < end)
>>> + * @hook_fn: hook func to be called when detected page
>>> + * @private: private data to be passed into hook func
>>> + * @read: whether parent level has read permission
>>> + * @write: whether parent level has write permission
>>> + * @skipped: accumulated skipped ranges
>> What's the usage for this parameter? Looks like it was never used in this
>> series.
> This was for debugging purpose before, and I kept it in case one day
> it can be used again, considering that will not affect much on the
> overall performance.

I think we usually do not keep debugging code outside debug macros.

>
>>> + * @notify_unmap: whether we should notify invalid entries
>>> + */
>>> +static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
>>> +                               uint64_t end, vtd_page_walk_hook hook_fn,
>>> +                               void *private, uint32_t level,
>>> +                               bool read, bool write, uint64_t *skipped,
>>> +                               bool notify_unmap)
>>> +{
>>> +    bool read_cur, write_cur, entry_valid;
>>> +    uint32_t offset;
>>> +    uint64_t slpte;
>>> +    uint64_t subpage_size, subpage_mask;
>>> +    IOMMUTLBEntry entry;
>>> +    uint64_t iova = start;
>>> +    uint64_t iova_next;
>>> +    uint64_t skipped_local = 0;
>>> +    int ret = 0;
>>> +
>>> +    trace_vtd_page_walk_level(addr, level, start, end);
>>> +
>>> +    subpage_size = 1ULL << vtd_slpt_level_shift(level);
>>> +    subpage_mask = vtd_slpt_level_page_mask(level);
>>> +
>>> +    while (iova < end) {
>>> +        iova_next = (iova & subpage_mask) + subpage_size;
>>> +
>>> +        offset = vtd_iova_level_offset(iova, level);
>>> +        slpte = vtd_get_slpte(addr, offset);
>>> +
>>> +        /*
>>> +         * When one of the following case happens, we assume the whole
>>> +         * range is invalid:
>>> +         *
>>> +         * 1. read block failed
>> Don't get the meaning (and don't see any code relate to this comment).
> I took above vtd_get_slpte() a "read", so I was trying to mean that we
> failed to read the SLPTE due to some reason, we assume the range is
> invalid.

I see, so we'd better move the comment to above vtd_get_slpte().

>
>>> +         * 3. reserved area non-zero
>>> +         * 2. both read & write flag are not set
>> Should be 1,2,3? And the above comment is conflict with the code at least
>> when notify_unmap is true.
> Yes, okay I don't know why 3 jumped. :(
>
> And yes, I should mention that "both read & write flag not set" only
> suites for page tables here.
>
>>> +         */
>>> +
>>> +        if (slpte == (uint64_t)-1) {
>> If this is true, vtd_slpte_nonzero_rsvd(slpte) should be true too I think?
> Yes, but we are doing two checks here:
>
> - checking against -1 to make sure slpte read success
> - checking against nonzero reserved fields to make sure it follows spec
>
> Imho we should not skip the first check here, only if one day removing
> this may really matter (e.g., for performance reason? I cannot think
> of one yet).

Ok. (Returning -1 does not seem great, but we can address this in the future.)

>
>>> +            trace_vtd_page_walk_skip_read(iova, iova_next);
>>> +            skipped_local++;
>>> +            goto next;
>>> +        }
>>> +
>>> +        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
>>> +            trace_vtd_page_walk_skip_reserve(iova, iova_next);
>>> +            skipped_local++;
>>> +            goto next;
>>> +        }
>>> +
>>> +        /* Permissions are stacked with parents' */
>>> +        read_cur = read && (slpte & VTD_SL_R);
>>> +        write_cur = write && (slpte & VTD_SL_W);
>>> +
>>> +        /*
>>> +         * As long as we have either read/write permission, this is
>>> +         * a valid entry. The rule works for both page or page tables.
>>> +         */
>>> +        entry_valid = read_cur | write_cur;
>>> +
>>> +        if (vtd_is_last_slpte(slpte, level)) {
>>> +            entry.target_as = &address_space_memory;
>>> +            entry.iova = iova & subpage_mask;
>>> +            /*
>>> +             * This might be meaningless addr if (!read_cur &&
>>> +             * !write_cur), but after all this field will be
>>> +             * meaningless in that case, so let's share the code to
>>> +             * generate the IOTLBs no matter it's an MAP or UNMAP
>>> +             */
>>> +            entry.translated_addr = vtd_get_slpte_addr(slpte);
>>> +            entry.addr_mask = ~subpage_mask;
>>> +            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
>>> +            if (!entry_valid && !notify_unmap) {
>>> +                trace_vtd_page_walk_skip_perm(iova, iova_next);
>>> +                skipped_local++;
>>> +                goto next;
>>> +            }
>> Under which case should we care about unmap here (better with a comment I
>> think)?
> When PSIs are for invalidation, rather than newly mapped entries. In
> that case, notify_unmap will be true, and here we need to notify
> IOMMU_NONE to do the cache flush or unmap.
>
> (this page walk is not only for replaying, it is for cache flushing as
>   well)
>
> Do you have suggestion on the comments?

I think then we'd better move this to patch 18 which will use notify_unmap.

>
>>> +            trace_vtd_page_walk_one(level, entry.iova, entry.translated_addr,
>>> +                                    entry.addr_mask, entry.perm);
>>> +            if (hook_fn) {
>>> +                ret = hook_fn(&entry, private);
>> For better performance, we could try to merge adjacent mappings here. I
>> think both vfio and vhost support this and it can save a lot of ioctls.
> Looks so, and this is in my todo list.
>
> Do you mind I do it later after this series merged? I would really
> appreciate if we can have this codes settled down first (considering
> that this series has been dangling for half a year, or more, startint
> from Aviv's series), and I am just afraid this will led to
> unconvergence of this series (and I believe there are other places
> that can be enhanced in the future as well).

Yes, let's do it on top.

>
>>> +                if (ret < 0) {
>>> +                    error_report("Detected error in page walk hook "
>>> +                                 "function, stop walk.");
>>> +                    return ret;
>>> +                }
>>> +            }
>>> +        } else {
>>> +            if (!entry_valid) {
>>> +                trace_vtd_page_walk_skip_perm(iova, iova_next);
>>> +                skipped_local++;
>>> +                goto next;
>>> +            }
>>> +            trace_vtd_page_walk_level(vtd_get_slpte_addr(slpte), level - 1,
>>> +                                      iova, MIN(iova_next, end));
>> This looks duplicated?
> I suppose not. The level is different.

Seems not? level - 1 is passed to vtd_page_walk_level() as level, which then does:

trace_vtd_page_walk_level(addr, level, start, end);


>
>>> +            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte), iova,
>>> +                                      MIN(iova_next, end), hook_fn, private,
>>> +                                      level - 1, read_cur, write_cur,
>>> +                                      &skipped_local, notify_unmap);
>>> +            if (ret < 0) {
>>> +                error_report("Detected page walk error on addr 0x%"PRIx64
>>> +                             " level %"PRIu32", stop walk.", addr, level - 1);
>> Guest triggered, so better use debug macro or tracepoint.
> Sorry. Will replace all the error_report() in the whole series.
>
>>> +                return ret;
>>> +            }
>>> +        }
>>> +
>>> +next:
>>> +        iova = iova_next;
>>> +    }
>>> +
>>> +    if (skipped) {
>>> +        *skipped += skipped_local;
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +/**
>>> + * vtd_page_walk - walk specific IOVA range, and call the hook
>>> + *
>>> + * @ce: context entry to walk upon
>>> + * @start: IOVA address to start the walk
>>> + * @end: IOVA range end address (start <= addr < end)
>>> + * @hook_fn: the hook that to be called for each detected area
>>> + * @private: private data for the hook function
>>> + */
>>> +static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
>>> +                         vtd_page_walk_hook hook_fn, void *private)
>>> +{
>>> +    dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
>>> +    uint32_t level = vtd_get_level_from_context_entry(ce);
>>> +
>>> +    if (!vtd_iova_range_check(start, ce)) {
>>> +        error_report("IOVA start 0x%"PRIx64 " end 0x%"PRIx64" exceeds limits",
>>> +                     start, end);
>> Guest triggered, better use debug macro or tracepoint.
> Same.
>
>>> +        return -VTD_FR_ADDR_BEYOND_MGAW;
>>> +    }
>>> +
>>> +    if (!vtd_iova_range_check(end, ce)) {
>>> +        /* Fix end so that it reaches the maximum */
>>> +        end = vtd_iova_limit(ce);
>>> +    }
>>> +
>>> +    trace_vtd_page_walk_level(addr, level, start, end);
>> Duplicated with the tracepoint in vtd_page_walk_level() too?
> Nop? This should be the top level.
>
>>> +
>>> +    return vtd_page_walk_level(addr, start, end, hook_fn, private,
>>> +                               level, true, true, NULL, false);
>>> +}
>>> +
>>>   /* Map a device to its corresponding domain (context-entry) */
>>>   static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>>>                                       uint8_t devfn, VTDContextEntry *ce)
>>> @@ -2395,6 +2569,37 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>>>       return vtd_dev_as;
>>>   }
>>> +static int vtd_replay_hook(IOMMUTLBEntry *entry, void *private)
>>> +{
>>> +    memory_region_notify_one((IOMMUNotifier *)private, entry);
>>> +    return 0;
>>> +}
>>> +
>>> +static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
>>> +{
>>> +    VTDAddressSpace *vtd_as = container_of(mr, VTDAddressSpace, iommu);
>>> +    IntelIOMMUState *s = vtd_as->iommu_state;
>>> +    uint8_t bus_n = pci_bus_num(vtd_as->bus);
>>> +    VTDContextEntry ce;
>>> +
>>> +    if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
>>> +        /*
>>> +         * Scanned a valid context entry, walk over the pages and
>>> +         * notify when needed.
>>> +         */
>>> +        trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
>>> +                                  PCI_FUNC(vtd_as->devfn),
>>> +                                  VTD_CONTEXT_ENTRY_DID(ce.hi),
>>> +                                  ce.hi, ce.lo);
>>> +        vtd_page_walk(&ce, 0, ~0, vtd_replay_hook, (void *)n);
>> ~0ULL?
> Fixing up.
>
> Thanks,
>
> -- peterx
>

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 15/20] intel_iommu: provide its own replay() callback
  2017-01-22  9:36       ` Peter Xu
@ 2017-01-23  1:50         ` Jason Wang
  0 siblings, 0 replies; 75+ messages in thread
From: Jason Wang @ 2017-01-23  1:50 UTC (permalink / raw)
  To: Peter Xu
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel,
	alex.williamson



On 2017年01月22日 17:36, Peter Xu wrote:
> Besides this one, I tried to fix the comments in this function as
> below, hope this is better (I removed 1-3 thing since I think that's
> clearer from below code):
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index e958f53..f3fe8c4 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -735,15 +735,6 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
>           offset = vtd_iova_level_offset(iova, level);
>           slpte = vtd_get_slpte(addr, offset);
>
> -        /*
> -         * When one of the following case happens, we assume the whole
> -         * range is invalid:
> -         *
> -         * 1. read block failed
> -         * 3. reserved area non-zero
> -         * 2. both read & write flag are not set
> -         */
> -
>           if (slpte == (uint64_t)-1) {
>               trace_vtd_page_walk_skip_read(iova, iova_next);
>               skipped_local++;
> @@ -761,20 +752,16 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
>           write_cur = write && (slpte & VTD_SL_W);
>
>           /*
> -         * As long as we have either read/write permission, this is
> -         * a valid entry. The rule works for both page or page tables.
> +         * As long as we have either read/write permission, this is a
> +         * valid entry. The rule works for both page entries and page
> +         * table entries.
>            */
>           entry_valid = read_cur | write_cur;
>
>           if (vtd_is_last_slpte(slpte, level)) {
>               entry.target_as = &address_space_memory;
>               entry.iova = iova & subpage_mask;
> -            /*
> -             * This might be meaningless addr if (!read_cur &&
> -             * !write_cur), but after all this field will be
> -             * meaningless in that case, so let's share the code to
> -             * generate the IOTLBs no matter it's an MAP or UNMAP
> -             */
> +            /* NOTE: this is only meaningful if entry_valid == true */
>               entry.translated_addr = vtd_get_slpte_addr(slpte);
>               entry.addr_mask = ~subpage_mask;
>               entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
>
> Thanks,
>
> -- peterx

I would still probably prefer to do this in patch 18.

Thanks

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 18/20] intel_iommu: enable vfio devices
  2017-01-22  9:04     ` Peter Xu
@ 2017-01-23  1:55       ` Jason Wang
  2017-01-23  3:34         ` Peter Xu
  0 siblings, 1 reply; 75+ messages in thread
From: Jason Wang @ 2017-01-23  1:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel,
	alex.williamson



On 2017年01月22日 17:04, Peter Xu wrote:
> On Sun, Jan 22, 2017 at 04:08:04PM +0800, Jason Wang wrote:
>
> [...]
>
>>> +static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
>>> +                                           uint16_t domain_id, hwaddr addr,
>>> +                                           uint8_t am)
>>> +{
>>> +    IntelIOMMUNotifierNode *node;
>>> +    VTDContextEntry ce;
>>> +    int ret;
>>> +
>>> +    QLIST_FOREACH(node, &(s->notifiers_list), next) {
>>> +        VTDAddressSpace *vtd_as = node->vtd_as;
>>> +        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>>> +                                       vtd_as->devfn, &ce);
>>> +        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
>>> +            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
>>> +                          vtd_page_invalidate_notify_hook,
>>> +                          (void *)&vtd_as->iommu, true);
>> Why not simply trigger the notifier here? (or is this vfio required?)
> Because we may only want to notify part of the region - we are with
> mask here, but not exact size.
>
> Consider this: guest (with caching mode) maps 12K memory (4K*3 pages),
> the mask will be extended to 16K in the guest. In that case, we need
> to explicitly go over the page entry to know that the 4th page should
> not be notified.

I see. Then this is required by vfio only; I think we can add a fast path
for !CM in this case by triggering the notifier directly.

Another possible issue: consider (with CM) a 16K contiguous iova range
whose last page has already been mapped. In this case, if we want to map
the first three pages, then when handling the IOTLB invalidation, am
would cover 16K, and the last page will be mapped twice. Can this lead
to any issue?

Thanks

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 19/20] intel_iommu: unmap existing pages before replay
  2017-01-22  9:09     ` Peter Xu
@ 2017-01-23  1:57       ` Jason Wang
  2017-01-23  7:30         ` Peter Xu
  0 siblings, 1 reply; 75+ messages in thread
From: Jason Wang @ 2017-01-23  1:57 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv



On 2017年01月22日 17:09, Peter Xu wrote:
> On Sun, Jan 22, 2017 at 04:13:32PM +0800, Jason Wang wrote:
>>
>> On 2017年01月20日 21:08, Peter Xu wrote:
>>> Previous replay works for domain switch only if the original domain does
>>> not have mapped pages. For example, if we switch domain from A to B, it
>>> will only work if A has no existing mapping. If there is, then there's
>>> problem - current replay didn't make sure the old mappings are cleared
>>> before replaying the new one.
>> I'm not quite sure this is needed. I thought the only thing we need to do is
>> stop DMA of device during the moving? Or is there an example that will cause
>> trouble?
> I think this patch is essential.
>
> Example:
>
> - device D1 moved to domain A, domain A has no mapping
> - map page P1 in domain A, so D1 will have a mapping of page P1
> - create domain B with mapping P2
> - move D1 from domain A to domain B
>
> Here if we don't unmap existing pages in domain A (P1),

I thought the driver should do this work instead of the device, because
only the driver knows whether or not the iova is still needed?

Thanks

>   after the
> switch, we'll have D1 with both P1/P2 mapped, while domain B actually
> only has P2. That's unaligned mapping, and it should be wrong.
>
> If you (or anyone) think this is a bug as well for patch 18, I can
> just squash both 19/20 into patch 18.
>
> Thanks,
>
> -- peterx

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 18/20] intel_iommu: enable vfio devices
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 18/20] intel_iommu: enable vfio devices Peter Xu
  2017-01-22  8:08   ` Jason Wang
@ 2017-01-23  2:01   ` Jason Wang
  2017-01-23  2:17     ` Jason Wang
  2017-01-23  3:40     ` Peter Xu
  1 sibling, 2 replies; 75+ messages in thread
From: Jason Wang @ 2017-01-23  2:01 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, alex.williamson, bd.aviv



On 2017年01月20日 21:08, Peter Xu wrote:
> This patch is based on Aviv Ben-David (<bd.aviv@gmail.com>)'s patch
> upstream:
>
>    "IOMMU: enable intel_iommu map and unmap notifiers"
>    https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01453.html
>
> However I removed/fixed some content, and added my own codes.
>
> Instead of translate() every page for iotlb invalidations (which is
> slower), we walk the pages when needed and notify in a hook function.
>
> This patch enables vfio devices for VT-d emulation.
>
> Signed-off-by: Peter Xu<peterx@redhat.com>
> ---
>   hw/i386/intel_iommu.c         | 66 +++++++++++++++++++++++++++++++++++++------
>   include/hw/i386/intel_iommu.h |  8 ++++++
>   2 files changed, 65 insertions(+), 9 deletions(-)

A good side effect of this patch is that it makes the vhost device IOTLB
work without ATS (though it may be slow). We probably need a better title :)

And I think we should block notifiers during PSI/DSI/GLOBAL for devices
with ATS enabled.

Thanks

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 18/20] intel_iommu: enable vfio devices
  2017-01-23  2:01   ` Jason Wang
@ 2017-01-23  2:17     ` Jason Wang
  2017-01-23  3:40     ` Peter Xu
  1 sibling, 0 replies; 75+ messages in thread
From: Jason Wang @ 2017-01-23  2:17 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, alex.williamson



On 2017年01月23日 10:01, Jason Wang wrote:
> On 2017年01月20日 21:08, Peter Xu wrote:
>> This patch is based on Aviv Ben-David (<bd.aviv@gmail.com>)'s patch
>> upstream:
>>
>>    "IOMMU: enable intel_iommu map and unmap notifiers"
>> https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01453.html
>>
>> However I removed/fixed some content, and added my own codes.
>>
>> Instead of translate() every page for iotlb invalidations (which is
>> slower), we walk the pages when needed and notify in a hook function.
>>
>> This patch enables vfio devices for VT-d emulation.
>>
>> Signed-off-by: Peter Xu<peterx@redhat.com>
>> ---
>>   hw/i386/intel_iommu.c         | 66 
>> +++++++++++++++++++++++++++++++++++++------
>>   include/hw/i386/intel_iommu.h |  8 ++++++
>>   2 files changed, 65 insertions(+), 9 deletions(-)
>
> A good side effect of this patch is that it makes vhost device IOTLB 
> works without ATS (though may be slow). We probably need a better 
> title :)

Probably something like "remote IOMMU/IOTLB" support.

Thanks

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 15/20] intel_iommu: provide its own replay() callback
  2017-01-23  1:48       ` Jason Wang
@ 2017-01-23  2:54         ` Peter Xu
  2017-01-23  3:12           ` Jason Wang
  2017-01-23 19:34           ` Alex Williamson
  0 siblings, 2 replies; 75+ messages in thread
From: Peter Xu @ 2017-01-23  2:54 UTC (permalink / raw)
  To: Jason Wang
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel,
	alex.williamson

On Mon, Jan 23, 2017 at 09:48:48AM +0800, Jason Wang wrote:
> 
> 
> On 2017年01月22日 16:51, Peter Xu wrote:
> >On Sun, Jan 22, 2017 at 03:56:10PM +0800, Jason Wang wrote:
> >
> >[...]
> >
> >>>+/**
> >>>+ * vtd_page_walk_level - walk over specific level for IOVA range
> >>>+ *
> >>>+ * @addr: base GPA addr to start the walk
> >>>+ * @start: IOVA range start address
> >>>+ * @end: IOVA range end address (start <= addr < end)
> >>>+ * @hook_fn: hook func to be called when detected page
> >>>+ * @private: private data to be passed into hook func
> >>>+ * @read: whether parent level has read permission
> >>>+ * @write: whether parent level has write permission
> >>>+ * @skipped: accumulated skipped ranges
> >>What's the usage for this parameter? Looks like it was never used in this
> >>series.
> >This was for debugging purpose before, and I kept it in case one day
> >it can be used again, considering that will not affect much on the
> >overall performance.
> 
> I think we usually do not keep debugging codes outside debug macros.

I'll remove it.

> 
> >
> >>>+ * @notify_unmap: whether we should notify invalid entries
> >>>+ */
> >>>+static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> >>>+                               uint64_t end, vtd_page_walk_hook hook_fn,
> >>>+                               void *private, uint32_t level,
> >>>+                               bool read, bool write, uint64_t *skipped,
> >>>+                               bool notify_unmap)
> >>>+{
> >>>+    bool read_cur, write_cur, entry_valid;
> >>>+    uint32_t offset;
> >>>+    uint64_t slpte;
> >>>+    uint64_t subpage_size, subpage_mask;
> >>>+    IOMMUTLBEntry entry;
> >>>+    uint64_t iova = start;
> >>>+    uint64_t iova_next;
> >>>+    uint64_t skipped_local = 0;
> >>>+    int ret = 0;
> >>>+
> >>>+    trace_vtd_page_walk_level(addr, level, start, end);
> >>>+
> >>>+    subpage_size = 1ULL << vtd_slpt_level_shift(level);
> >>>+    subpage_mask = vtd_slpt_level_page_mask(level);
> >>>+
> >>>+    while (iova < end) {
> >>>+        iova_next = (iova & subpage_mask) + subpage_size;
> >>>+
> >>>+        offset = vtd_iova_level_offset(iova, level);
> >>>+        slpte = vtd_get_slpte(addr, offset);
> >>>+
> >>>+        /*
> >>>+         * When one of the following case happens, we assume the whole
> >>>+         * range is invalid:
> >>>+         *
> >>>+         * 1. read block failed
> >>Don't get the meaning (and don't see any code relate to this comment).
> >I took above vtd_get_slpte() a "read", so I was trying to mean that we
> >failed to read the SLPTE due to some reason, we assume the range is
> >invalid.
> 
> I see, so we'd better move the comment above of vtd_get_slpte().

Let me remove the whole comment block... I think the code explains it
clearly even without any comment. (When people see the code checking
slpte against -1, they'll naturally think of the function above.)

> 
> >
> >>>+         * 3. reserved area non-zero
> >>>+         * 2. both read & write flag are not set
> >>Should be 1,2,3? And the above comment is conflict with the code at least
> >>when notify_unmap is true.
> >Yes, okay I don't know why 3 jumped. :(
> >
> >And yes, I should mention that "both read & write flag not set" only
> >suites for page tables here.
> >
> >>>+         */
> >>>+
> >>>+        if (slpte == (uint64_t)-1) {
> >>If this is true, vtd_slpte_nonzero_rsvd(slpte) should be true too I think?
> >Yes, but we are doing two checks here:
> >
> >- checking against -1 to make sure slpte read success
> >- checking against nonzero reserved fields to make sure it follows spec
> >
> >Imho we should not skip the first check here, only if one day removing
> >this may really matter (e.g., for performance reason? I cannot think
> >of one yet).
> 
> Ok. (return -1 seems not good, but we can address this in the future).

Thanks.

> 
> >
> >>>+            trace_vtd_page_walk_skip_read(iova, iova_next);
> >>>+            skipped_local++;
> >>>+            goto next;
> >>>+        }
> >>>+
> >>>+        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> >>>+            trace_vtd_page_walk_skip_reserve(iova, iova_next);
> >>>+            skipped_local++;
> >>>+            goto next;
> >>>+        }
> >>>+
> >>>+        /* Permissions are stacked with parents' */
> >>>+        read_cur = read && (slpte & VTD_SL_R);
> >>>+        write_cur = write && (slpte & VTD_SL_W);
> >>>+
> >>>+        /*
> >>>+         * As long as we have either read/write permission, this is
> >>>+         * a valid entry. The rule works for both page or page tables.
> >>>+         */
> >>>+        entry_valid = read_cur | write_cur;
> >>>+
> >>>+        if (vtd_is_last_slpte(slpte, level)) {
> >>>+            entry.target_as = &address_space_memory;
> >>>+            entry.iova = iova & subpage_mask;
> >>>+            /*
> >>>+             * This might be meaningless addr if (!read_cur &&
> >>>+             * !write_cur), but after all this field will be
> >>>+             * meaningless in that case, so let's share the code to
> >>>+             * generate the IOTLBs no matter it's an MAP or UNMAP
> >>>+             */
> >>>+            entry.translated_addr = vtd_get_slpte_addr(slpte);
> >>>+            entry.addr_mask = ~subpage_mask;
> >>>+            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
> >>>+            if (!entry_valid && !notify_unmap) {
> >>>+                trace_vtd_page_walk_skip_perm(iova, iova_next);
> >>>+                skipped_local++;
> >>>+                goto next;
> >>>+            }
> >>Under which case should we care about unmap here (better with a comment I
> >>think)?
> >When PSIs are for invalidation, rather than newly mapped entries. In
> >that case, notify_unmap will be true, and here we need to notify
> >IOMMU_NONE to do the cache flush or unmap.
> >
> >(this page walk is not only for replaying, it is for cache flushing as
> >  well)
> >
> >Do you have suggestion on the comments?
> 
> I think then we'd better move this to patch 18 which will use notify_unmap.

Do you mean to add some more comments in patch 18?

> 
> >
> >>>+            trace_vtd_page_walk_one(level, entry.iova, entry.translated_addr,
> >>>+                                    entry.addr_mask, entry.perm);
> >>>+            if (hook_fn) {
> >>>+                ret = hook_fn(&entry, private);
> >>For better performance, we could try to merge adjacent mappings here. I
> >>think both vfio and vhost support this and it can save a lot of ioctls.
> >Looks so, and this is in my todo list.
> >
> >Do you mind I do it later after this series merged? I would really
> >appreciate if we can have this codes settled down first (considering
> >that this series has been dangling for half a year, or more, startint
> >from Aviv's series), and I am just afraid this will led to
> >unconvergence of this series (and I believe there are other places
> >that can be enhanced in the future as well).
> 
> Yes, let's do it on top.

Thanks.

> 
> >
> >>>+                if (ret < 0) {
> >>>+                    error_report("Detected error in page walk hook "
> >>>+                                 "function, stop walk.");
> >>>+                    return ret;
> >>>+                }
> >>>+            }
> >>>+        } else {
> >>>+            if (!entry_valid) {
> >>>+                trace_vtd_page_walk_skip_perm(iova, iova_next);
> >>>+                skipped_local++;
> >>>+                goto next;
> >>>+            }
> >>>+            trace_vtd_page_walk_level(vtd_get_slpte_addr(slpte), level - 1,
> >>>+                                      iova, MIN(iova_next, end));
> >>This looks duplicated?
> >I suppose not. The level is different.
> 
> Seem not? level - 1 was passed to vtd_page_walk_level() as level which did:
> 
> trace_vtd_page_walk_level(addr, level, start, end);

Hmm yes I didn't notice that I had one at the entry. :(

Let me only keep that one.

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 15/20] intel_iommu: provide its own replay() callback
  2017-01-23  2:54         ` Peter Xu
@ 2017-01-23  3:12           ` Jason Wang
  2017-01-23  3:35             ` Peter Xu
  2017-01-23 19:34           ` Alex Williamson
  1 sibling, 1 reply; 75+ messages in thread
From: Jason Wang @ 2017-01-23  3:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel,
	alex.williamson



On 2017年01月23日 10:54, Peter Xu wrote:
>>>>> +            trace_vtd_page_walk_skip_read(iova, iova_next);
>>>>> +            skipped_local++;
>>>>> +            goto next;
>>>>> +        }
>>>>> +
>>>>> +        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
>>>>> +            trace_vtd_page_walk_skip_reserve(iova, iova_next);
>>>>> +            skipped_local++;
>>>>> +            goto next;
>>>>> +        }
>>>>> +
>>>>> +        /* Permissions are stacked with parents' */
>>>>> +        read_cur = read && (slpte & VTD_SL_R);
>>>>> +        write_cur = write && (slpte & VTD_SL_W);
>>>>> +
>>>>> +        /*
>>>>> +         * As long as we have either read/write permission, this is
>>>>> +         * a valid entry. The rule works for both page or page tables.
>>>>> +         */
>>>>> +        entry_valid = read_cur | write_cur;
>>>>> +
>>>>> +        if (vtd_is_last_slpte(slpte, level)) {
>>>>> +            entry.target_as = &address_space_memory;
>>>>> +            entry.iova = iova & subpage_mask;
>>>>> +            /*
>>>>> +             * This might be meaningless addr if (!read_cur &&
>>>>> +             * !write_cur), but after all this field will be
>>>>> +             * meaningless in that case, so let's share the code to
>>>>> +             * generate the IOTLBs no matter it's an MAP or UNMAP
>>>>> +             */
>>>>> +            entry.translated_addr = vtd_get_slpte_addr(slpte);
>>>>> +            entry.addr_mask = ~subpage_mask;
>>>>> +            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
>>>>> +            if (!entry_valid && !notify_unmap) {
>>>>> +                trace_vtd_page_walk_skip_perm(iova, iova_next);
>>>>> +                skipped_local++;
>>>>> +                goto next;
>>>>> +            }
>>>> Under which case should we care about unmap here (better with a comment I
>>>> think)?
>>> When PSIs are for invalidation, rather than newly mapped entries. In
>>> that case, notify_unmap will be true, and here we need to notify
>>> IOMMU_NONE to do the cache flush or unmap.
>>>
>>> (this page walk is not only for replaying, it is for cache flushing as
>>>   well)
>>>
>>> Do you have suggestion on the comments?
>> I think then we'd better move this to patch 18 which will use notify_unmap.
> Do you mean to add some more comment on patch 18?
>

I mean move notify_unmap and its comment to patch 18 (its real user).
But if you want to keep it in this patch, I'm also fine.

Thanks

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 18/20] intel_iommu: enable vfio devices
  2017-01-23  1:55       ` Jason Wang
@ 2017-01-23  3:34         ` Peter Xu
  2017-01-23 10:23           ` Jason Wang
  2017-01-23 18:03           ` Alex Williamson
  0 siblings, 2 replies; 75+ messages in thread
From: Peter Xu @ 2017-01-23  3:34 UTC (permalink / raw)
  To: Jason Wang
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel,
	alex.williamson

On Mon, Jan 23, 2017 at 09:55:39AM +0800, Jason Wang wrote:
> 
> 
> On 2017年01月22日 17:04, Peter Xu wrote:
> >On Sun, Jan 22, 2017 at 04:08:04PM +0800, Jason Wang wrote:
> >
> >[...]
> >
> >>>+static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
> >>>+                                           uint16_t domain_id, hwaddr addr,
> >>>+                                           uint8_t am)
> >>>+{
> >>>+    IntelIOMMUNotifierNode *node;
> >>>+    VTDContextEntry ce;
> >>>+    int ret;
> >>>+
> >>>+    QLIST_FOREACH(node, &(s->notifiers_list), next) {
> >>>+        VTDAddressSpace *vtd_as = node->vtd_as;
> >>>+        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> >>>+                                       vtd_as->devfn, &ce);
> >>>+        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> >>>+            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
> >>>+                          vtd_page_invalidate_notify_hook,
> >>>+                          (void *)&vtd_as->iommu, true);
> >>Why not simply trigger the notifier here? (or is this vfio required?)
> >Because we may only want to notify part of the region - we are with
> >mask here, but not exact size.
> >
> >Consider this: guest (with caching mode) maps 12K memory (4K*3 pages),
> >the mask will be extended to 16K in the guest. In that case, we need
> >to explicitly go over the page entry to know that the 4th page should
> >not be notified.
> 
> I see. Then it was required by vfio only, I think we can add a fast path for
> !CM in this case by triggering the notifier directly.

I noted this down (to be further investigated in my todo), but I don't
know whether this can work, due to the fact that I think it is still
legal for the guest to merge more than one PSI into one. For example, I
don't know whether the below is legal:

- guest invalidates page (0, 4k)
- guest maps new page (4k, 8k)
- guest sends a single PSI of (0, 8k)

In that case it contains both a map and an unmap, and it does not look
like it disobeys the spec either?
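
(If that is legal, I think the page walk can still cope with it; a rough
sketch of what vtd_page_walk_level() would do for such a merged PSI over
(0, 8k), assuming notify_unmap == true as in the PSI path and that the
leaf entry for the invalidated page has been cleared:

- iova 0x0: the leaf entry has neither R nor W set, so entry_valid is
  false, and since notify_unmap is set we notify with perm == IOMMU_NONE,
  i.e. an unmap;
- iova 0x1000: the entry is valid, so we notify a normal MAP with
  IOMMU_ACCESS_FLAG(read, write).

So a single PSI can end up generating both unmap and map notifications,
which is what a direct-notify fast path would miss.)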

> 
> Another possible issue is, consider (with CM) a 16K contiguous iova with the
> last page has already been mapped. In this case, if we want to map first
> three pages, when handling IOTLB invalidation, am would be 16K, then the
> last page will be mapped twice. Can this lead some issue?

I don't know whether the guest has special handling for this kind of
request.

Besides, imho to completely solve this problem we still need that
per-domain tree. Considering that currently the tree is inside vfio, I
don't see this as a big issue either. In that case, the last page
mapping request will fail (we might see one error line on QEMU stderr),
but that won't matter too much, since currently vfio tolerates that
failure (the ioctl fails, but the page is still mapped, which is what
we wanted).

(But of course the above error message could be used by an in-guest
 attacker as well, just like the general error_report() issues reported
 before, though again I would appreciate it if we can get this series
 functionally working first :)

And I should be able to emulate this behavior in the guest with a tiny
C program to make sure of it, possibly after this series if allowed.

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 15/20] intel_iommu: provide its own replay() callback
  2017-01-23  3:12           ` Jason Wang
@ 2017-01-23  3:35             ` Peter Xu
  0 siblings, 0 replies; 75+ messages in thread
From: Peter Xu @ 2017-01-23  3:35 UTC (permalink / raw)
  To: Jason Wang
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel,
	alex.williamson

On Mon, Jan 23, 2017 at 11:12:27AM +0800, Jason Wang wrote:
> 
> 
> On 2017年01月23日 10:54, Peter Xu wrote:
> >>>>>+            trace_vtd_page_walk_skip_read(iova, iova_next);
> >>>>>+            skipped_local++;
> >>>>>+            goto next;
> >>>>>+        }
> >>>>>+
> >>>>>+        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> >>>>>+            trace_vtd_page_walk_skip_reserve(iova, iova_next);
> >>>>>+            skipped_local++;
> >>>>>+            goto next;
> >>>>>+        }
> >>>>>+
> >>>>>+        /* Permissions are stacked with parents' */
> >>>>>+        read_cur = read && (slpte & VTD_SL_R);
> >>>>>+        write_cur = write && (slpte & VTD_SL_W);
> >>>>>+
> >>>>>+        /*
> >>>>>+         * As long as we have either read/write permission, this is
> >>>>>+         * a valid entry. The rule works for both page or page tables.
> >>>>>+         */
> >>>>>+        entry_valid = read_cur | write_cur;
> >>>>>+
> >>>>>+        if (vtd_is_last_slpte(slpte, level)) {
> >>>>>+            entry.target_as = &address_space_memory;
> >>>>>+            entry.iova = iova & subpage_mask;
> >>>>>+            /*
> >>>>>+             * This might be meaningless addr if (!read_cur &&
> >>>>>+             * !write_cur), but after all this field will be
> >>>>>+             * meaningless in that case, so let's share the code to
> >>>>>+             * generate the IOTLBs no matter it's an MAP or UNMAP
> >>>>>+             */
> >>>>>+            entry.translated_addr = vtd_get_slpte_addr(slpte);
> >>>>>+            entry.addr_mask = ~subpage_mask;
> >>>>>+            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
> >>>>>+            if (!entry_valid && !notify_unmap) {
> >>>>>+                trace_vtd_page_walk_skip_perm(iova, iova_next);
> >>>>>+                skipped_local++;
> >>>>>+                goto next;
> >>>>>+            }
> >>>>Under which case should we care about unmap here (better with a comment I
> >>>>think)?
> >>>When PSIs are for invalidation, rather than newly mapped entries. In
> >>>that case, notify_unmap will be true, and here we need to notify
> >>>IOMMU_NONE to do the cache flush or unmap.
> >>>
> >>>(this page walk is not only for replaying, it is for cache flushing as
> >>>  well)
> >>>
> >>>Do you have suggestion on the comments?
> >>I think then we'd better move this to patch 18 which will use notify_unmap.
> >Do you mean to add some more comment on patch 18?
> >
> 
> I mean move the unmap_nofity and its comment to patch 18 (real user). But if
> you want to keep it in the patch, I'm also fine.

(Since we discussed this on IRC :)

So I'll keep it here for now. Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 18/20] intel_iommu: enable vfio devices
  2017-01-23  2:01   ` Jason Wang
  2017-01-23  2:17     ` Jason Wang
@ 2017-01-23  3:40     ` Peter Xu
  2017-01-23 10:27       ` Jason Wang
  1 sibling, 1 reply; 75+ messages in thread
From: Peter Xu @ 2017-01-23  3:40 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv

On Mon, Jan 23, 2017 at 10:01:11AM +0800, Jason Wang wrote:
> 
> 
> On 2017年01月20日 21:08, Peter Xu wrote:
> >This patch is based on Aviv Ben-David (<bd.aviv@gmail.com>)'s patch
> >upstream:
> >
> >   "IOMMU: enable intel_iommu map and unmap notifiers"
> >   https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01453.html
> >
> >However I removed/fixed some content, and added my own codes.
> >
> >Instead of translate() every page for iotlb invalidations (which is
> >slower), we walk the pages when needed and notify in a hook function.
> >
> >This patch enables vfio devices for VT-d emulation.
> >
> >Signed-off-by: Peter Xu<peterx@redhat.com>
> >---
> >  hw/i386/intel_iommu.c         | 66 +++++++++++++++++++++++++++++++++++++------
> >  include/hw/i386/intel_iommu.h |  8 ++++++
> >  2 files changed, 65 insertions(+), 9 deletions(-)
> 
> A good side effect of this patch is that it makes vhost device IOTLB works
> without ATS (though may be slow). We probably need a better title :)

How about I mention it in the commit message at the end? Like:

"And, since we already have vhost DMAR support via device-iotlb, a
 natural benefit that this patch brings is that vt-d enabled vhost can
 live even without ATS capability now. Though more tests are needed."

> 
> And I think we should block notifiers during PSI/DSI/GLOBAL for device with
> ATS enabled.

Again, would it be okay if I note this in my todo list? :)

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 19/20] intel_iommu: unmap existing pages before replay
  2017-01-23  1:57       ` Jason Wang
@ 2017-01-23  7:30         ` Peter Xu
  2017-01-23 10:29           ` Jason Wang
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2017-01-23  7:30 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv

On Mon, Jan 23, 2017 at 09:57:23AM +0800, Jason Wang wrote:
> 
> 
> On 2017年01月22日 17:09, Peter Xu wrote:
> >On Sun, Jan 22, 2017 at 04:13:32PM +0800, Jason Wang wrote:
> >>
> >>On 2017年01月20日 21:08, Peter Xu wrote:
> >>>Previous replay works for domain switch only if the original domain does
> >>>not have mapped pages. For example, if we switch domain from A to B, it
> >>>will only work if A has no existing mapping. If there is, then there's
> >>>problem - current replay didn't make sure the old mappings are cleared
> >>>before replaying the new one.
> >>I'm not quite sure this is needed. I thought the only thing we need to do is
> >>stop DMA of device during the moving? Or is there an example that will cause
> >>trouble?
> >I think this patch is essential.
> >
> >Example:
> >
> >- device D1 moved to domain A, domain A has no mapping
> >- map page P1 in domain A, so D1 will have a mapping of page P1
> >- create domain B with mapping P2
> >- move D1 from domain A to domain B
> >
> >Here if we don't unmap existing pages in domain A (P1),
> 
> I thought driver should do this work instead of device, because only driver
> knows whether or not iova is still needed?

Do you mean "device driver" above?

I'm not sure I understood the question above, but the problem should be
there no matter which one is managing the iova?

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 18/20] intel_iommu: enable vfio devices
  2017-01-23  3:34         ` Peter Xu
@ 2017-01-23 10:23           ` Jason Wang
  2017-01-23 19:40             ` Alex Williamson
  2017-01-24  4:42             ` Peter Xu
  2017-01-23 18:03           ` Alex Williamson
  1 sibling, 2 replies; 75+ messages in thread
From: Jason Wang @ 2017-01-23 10:23 UTC (permalink / raw)
  To: Peter Xu
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel,
	alex.williamson



On 2017年01月23日 11:34, Peter Xu wrote:
> On Mon, Jan 23, 2017 at 09:55:39AM +0800, Jason Wang wrote:
>>
>> On 2017年01月22日 17:04, Peter Xu wrote:
>>> On Sun, Jan 22, 2017 at 04:08:04PM +0800, Jason Wang wrote:
>>>
>>> [...]
>>>
>>>>> +static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
>>>>> +                                           uint16_t domain_id, hwaddr addr,
>>>>> +                                           uint8_t am)
>>>>> +{
>>>>> +    IntelIOMMUNotifierNode *node;
>>>>> +    VTDContextEntry ce;
>>>>> +    int ret;
>>>>> +
>>>>> +    QLIST_FOREACH(node, &(s->notifiers_list), next) {
>>>>> +        VTDAddressSpace *vtd_as = node->vtd_as;
>>>>> +        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>>>>> +                                       vtd_as->devfn, &ce);
>>>>> +        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
>>>>> +            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
>>>>> +                          vtd_page_invalidate_notify_hook,
>>>>> +                          (void *)&vtd_as->iommu, true);
>>>> Why not simply trigger the notifier here? (or is this vfio required?)
>>> Because we may only want to notify part of the region - we are with
>>> mask here, but not exact size.
>>>
>>> Consider this: guest (with caching mode) maps 12K memory (4K*3 pages),
>>> the mask will be extended to 16K in the guest. In that case, we need
>>> to explicitly go over the page entry to know that the 4th page should
>>> not be notified.
>> I see. Then it was required by vfio only, I think we can add a fast path for
>> !CM in this case by triggering the notifier directly.
> I noted this down (to be further investigated in my todo), but I don't
> know whether this can work, due to the fact that I think it is still
> legal that guest merge more than one PSIs into one. For example, I
> don't know whether below is legal:
>
> - guest invalidate page (0, 4k)
> - guest map new page (4k, 8k)
> - guest send single PSI of (0, 8k)
>
> In that case, it contains both map/unmap, and looks like it didn't
> disobay the spec as well?

Not sure I get your meaning; do you mean just sending a single PSI
instead of two?

>
>> Another possible issue is, consider (with CM) a 16K contiguous iova with the
>> last page has already been mapped. In this case, if we want to map first
>> three pages, when handling IOTLB invalidation, am would be 16K, then the
>> last page will be mapped twice. Can this lead some issue?
> I don't know whether guest has special handling of this kind of
> request.

This seems quite usual I think? E.g. iommu_flush_iotlb_psi() does:

static void iommu_flush_iotlb_psi(struct intel_iommu *iommu,
                   struct dmar_domain *domain,
                   unsigned long pfn, unsigned int pages,
                   int ih, int map)
{
     unsigned int mask = ilog2(__roundup_pow_of_two(pages));
     uint64_t addr = (uint64_t)pfn << VTD_PAGE_SHIFT;
     u16 did = domain->iommu_did[iommu->seq_id];
...
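
(Working through the 12K example above with this code: pages == 3,
__roundup_pow_of_two(3) == 4, mask == ilog2(4) == 2, so the resulting
PSI covers 4 pages / 16K even though only 3 pages were actually mapped.)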


>
> Besides, imho to completely solve this problem, we still need that
> per-domain tree. Considering that currently the tree is inside vfio, I
> see this not a big issue as well.

Another issue I found is: with this series, VFIO_IOMMU_MAP_DMA seems to
become guest-triggerable. And since VFIO allocates its own structures to
record dma mappings, this seems to open a window for an evil guest to
exhaust host memory, which is even worse.

>   In that case, the last page mapping
> request will fail (we might see one error line from QEMU stderr),
> however that'll not affect too much since currently vfio allows that
> failure to happen (ioctl fail, but that page is still mapped, which is
> what we wanted).

That works, but it is sub-optimal and maybe even buggy.

>
> (But of course above error message can be used by an in-guest attacker
>   as well just like general error_report() issues reported before,
>   though again I will appreciate if we can have this series
>   functionally work first :)
>
> And, I should be able to emulate this behavior in guest with a tiny C
> program to make sure of it, possibly after this series if allowed.

Or through your vtd unittest :) ?

Thanks

>
> Thanks,
>
> -- peterx

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 18/20] intel_iommu: enable vfio devices
  2017-01-23  3:40     ` Peter Xu
@ 2017-01-23 10:27       ` Jason Wang
  0 siblings, 0 replies; 75+ messages in thread
From: Jason Wang @ 2017-01-23 10:27 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv



On 2017年01月23日 11:40, Peter Xu wrote:
> On Mon, Jan 23, 2017 at 10:01:11AM +0800, Jason Wang wrote:
>>
>> On 2017年01月20日 21:08, Peter Xu wrote:
>>> This patch is based on Aviv Ben-David (<bd.aviv@gmail.com>)'s patch
>>> upstream:
>>>
>>>    "IOMMU: enable intel_iommu map and unmap notifiers"
>>>    https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01453.html
>>>
>>> However I removed/fixed some content, and added my own codes.
>>>
>>> Instead of translate() every page for iotlb invalidations (which is
>>> slower), we walk the pages when needed and notify in a hook function.
>>>
>>> This patch enables vfio devices for VT-d emulation.
>>>
>>> Signed-off-by: Peter Xu<peterx@redhat.com>
>>> ---
>>>   hw/i386/intel_iommu.c         | 66 +++++++++++++++++++++++++++++++++++++------
>>>   include/hw/i386/intel_iommu.h |  8 ++++++
>>>   2 files changed, 65 insertions(+), 9 deletions(-)
>> A good side effect of this patch is that it makes vhost device IOTLB works
>> without ATS (though may be slow). We probably need a better title :)
> How about I mention it in the commit message at the end? Like:
>
> "And, since we already have vhost DMAR support via device-iotlb, a
>   natural benefit that this patch brings is that vt-d enabled vhost can
>   live even without ATS capability now. Though more tests are needed."
>

Ok for me.

>> And I think we should block notifiers during PSI/DSI/GLOBAL for device with
>> ATS enabled.
> Again, would that be okay I note this in my todo list? :)
>
> Thanks,
>
> -- peterx

Yes, on top.

Thanks

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 19/20] intel_iommu: unmap existing pages before replay
  2017-01-23  7:30         ` Peter Xu
@ 2017-01-23 10:29           ` Jason Wang
  0 siblings, 0 replies; 75+ messages in thread
From: Jason Wang @ 2017-01-23 10:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv



On 2017年01月23日 15:30, Peter Xu wrote:
> On Mon, Jan 23, 2017 at 09:57:23AM +0800, Jason Wang wrote:
>>
>> On 2017年01月22日 17:09, Peter Xu wrote:
>>> On Sun, Jan 22, 2017 at 04:13:32PM +0800, Jason Wang wrote:
>>>> On 2017年01月20日 21:08, Peter Xu wrote:
>>>>> Previous replay works for domain switch only if the original domain does
>>>>> not have mapped pages. For example, if we switch domain from A to B, it
>>>>> will only work if A has no existing mapping. If there is, then there's
>>>>> problem - current replay didn't make sure the old mappings are cleared
>>>>> before replaying the new one.
>>>> I'm not quite sure this is needed. I thought the only thing we need to do is
>>>> stop DMA of device during the moving? Or is there an example that will cause
>>>> trouble?
>>> I think this patch is essential.
>>>
>>> Example:
>>>
>>> - device D1 moved to domain A, domain A has no mapping
>>> - map page P1 in domain A, so D1 will have a mapping of page P1
>>> - create domain B with mapping P2
>>> - move D1 from domain A to domain B
>>>
>>> Here if we don't unmap existing pages in domain A (P1),
>> I thought driver should do this work instead of device, because only driver
>> knows whether or not iova is still needed?
> Do you mean "device driver" above?
>
> I don't know whether I understood the question above, but the problem
> should be there no matter which one is managing iova?
>
> Thanks,
>
> -- peterx

Yes, I misread the code; this is in fact triggered by the guest.

Thanks

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 16/20] intel_iommu: do replay when context invalidate
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 16/20] intel_iommu: do replay when context invalidate Peter Xu
@ 2017-01-23 10:36   ` Jason Wang
  2017-01-24  4:52     ` Peter Xu
  0 siblings, 1 reply; 75+ messages in thread
From: Jason Wang @ 2017-01-23 10:36 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, alex.williamson, bd.aviv



On 2017年01月20日 21:08, Peter Xu wrote:
> Before this one we only invalidate context cache when we receive context
> entry invalidations. However it's possible that the invalidation also
> contains a domain switch (only if cache-mode is enabled for vIOMMU). In
> that case we need to notify all the registered components about the new
> mapping.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   hw/i386/intel_iommu.c | 10 ++++++++++
>   1 file changed, 10 insertions(+)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index f9c5142..4b08b4d 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1146,6 +1146,16 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
>                   trace_vtd_inv_desc_cc_device(bus_n, VTD_PCI_SLOT(devfn_it),
>                                                VTD_PCI_FUNC(devfn_it));
>                   vtd_as->context_cache_entry.context_cache_gen = 0;
> +                /*
> +                 * So a device is moving out of (or moving into) a
> +                 * domain, a replay() suites here to notify all the
> +                 * IOMMU_NOTIFIER_MAP registers about this change.
> +                 * This won't bring bad even if we have no such
> +                 * notifier registered - the IOMMU notification
> +                 * framework will skip MAP notifications if that
> +                 * happened.
> +                 */
> +                memory_region_iommu_replay_all(&vtd_as->iommu);

Do the DSI and GLOBAL questions come back again here, or is there no need to care about them :) ?

Thanks

>               }
>           }
>       }

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 19/20] intel_iommu: unmap existing pages before replay
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 19/20] intel_iommu: unmap existing pages before replay Peter Xu
  2017-01-22  8:13   ` Jason Wang
@ 2017-01-23 10:40   ` Jason Wang
  2017-01-24  7:31     ` Peter Xu
  1 sibling, 1 reply; 75+ messages in thread
From: Jason Wang @ 2017-01-23 10:40 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, alex.williamson, bd.aviv



On 2017年01月20日 21:08, Peter Xu wrote:
>   static int vtd_replay_hook(IOMMUTLBEntry *entry, void *private)
>   {
>       memory_region_notify_one((IOMMUNotifier *)private, entry);
> @@ -2711,13 +2768,16 @@ static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
>   
>       if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
>           /*
> -         * Scanned a valid context entry, walk over the pages and
> -         * notify when needed.
> +         * Scanned a valid context entry, we first make sure to remove
> +         * all existing mappings in old domain, by sending UNMAP to
> +         * all the notifiers. Then, we walk over the pages and notify
> +         * with existing mapped new entries in the new domain.
>            */

A question: what if the context cache is invalidated but the device is
not moved to a new domain? Then the code here does not do anything, I
believe? I think we should move vtd_address_space_unmap() into the
context entry invalidation processing.
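
Something along these lines, perhaps (only a rough illustration of the
placement I mean, assuming the IOMMU_NOTIFIER_FOREACH helper from patch
11 and the vtd_address_space_unmap() introduced in this patch; not meant
as the actual change):

/* in vtd_context_device_invalidate(), for each affected vtd_as */
IOMMUNotifier *n;

vtd_as->context_cache_entry.context_cache_gen = 0;
/*
 * Tear down the old mappings unconditionally, even if the new
 * context entry turns out to be invalid and the replay below ends
 * up notifying nothing.
 */
IOMMU_NOTIFIER_FOREACH(n, &vtd_as->iommu) {
    vtd_address_space_unmap(vtd_as, n);
}
memory_region_iommu_replay_all(&vtd_as->iommu);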

Thanks

>           trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
>                                     PCI_FUNC(vtd_as->devfn),
>                                     VTD_CONTEXT_ENTRY_DID(ce.hi),
>                                     ce.hi, ce.lo);
> +        vtd_address_space_unmap(vtd_as, n);
>           vtd_page_walk(&ce, 0, ~0, vtd_replay_hook, (void *)n, false);
>       } else {
>           trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_intern

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances
  2017-01-20 13:08 [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (19 preceding siblings ...)
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 20/20] intel_iommu: replay even with DSI/GLOBAL inv desc Peter Xu
@ 2017-01-23 15:55 ` Michael S. Tsirkin
  2017-01-24  7:40   ` Peter Xu
  20 siblings, 1 reply; 75+ messages in thread
From: Michael S. Tsirkin @ 2017-01-23 15:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

On Fri, Jan 20, 2017 at 09:08:36PM +0800, Peter Xu wrote:
> This is v4 of vt-d vfio enablement series.
> 
> Sorry that v4 growed to 20 patches. Some newly added patches (which
> are quite necessary):
> 
> [01/20] vfio: trace map/unmap for notify as well
> [02/20] vfio: introduce vfio_get_vaddr()
> [03/20] vfio: allow to notify unmap for very large region
> 
>   Patches from RFC series:
> 
>   "[PATCH RFC 0/3] vfio: allow to notify unmap for very big region"
> 
>   Which is required by patch [19/20].
> 
> [11/20] memory: provide IOMMU_NOTIFIER_FOREACH macro
> 
>   A helper only.
> 
> [19/20] intel_iommu: unmap existing pages before replay
> 
>   This solves Alex's concern that there might have existing mappings
>   in previous domain when replay happens.
> 
> [20/20] intel_iommu: replay even with DSI/GLOBAL inv desc
> 
>   This solves Jason/Kevin's concern by handling DSI/GLOBAL
>   invalidations as well.
> 
> Each individual patch will have more detailed explanation on itself.
> Please refer to each of them.
> 
> Here I did separate work on patch 19/20 rather than squashing them
> into patch 18 for easier modification and review. I prefer we have
> them separately so we can see each problem separately, after all,
> patch 18 survives in most use cases. Please let me know if we want to
> squash them in some way. I can respin when necessary.
> 
> Besides the big things, lots of tiny tweaks as well. Here's the
> changelog.

It would be nice to add to the log:
- known issues / missing features, if any
- are there patches ready to be merged here?
  If yes, please post them without the RFC tag.


> v4:
> - convert all error_report()s into traces (in the two patches that did
>   that)
> - rebased to Jason's DMAR series (master + one more patch:
>   "[PATCH V4 net-next] vhost_net: device IOTLB support")
> - let vhost use the new api iommu_notifier_init() so it won't break
>   vhost dmar [Jason]
> - touch commit message of the patch:
>   "intel_iommu: provide its own replay() callback"
>   old replay is not a dead loop, but it will just consume lots of time
>   [Jason]
> - add comment for patch:
>   "intel_iommu: do replay when context invalidate"
>   telling why replay won't be a problem even without CM=1 [Jason]
> - remove a useless comment line [Jason]
> - remove dmar_enabled parameter for vtd_switch_address_space() and
>   vtd_switch_address_space_all() [Mst, Jason]
> - merged the vfio patches in, to support unmap of big ranges at the
>   beginning ("[PATCH RFC 0/3] vfio: allow to notify unmap for very big
>   region")
> - using caching_mode instead of cache_mode_enabled, and "caching-mode"
>   instead of "cache-mode" [Kevin]
> - when receive context entry invalidation, we unmap the entire region
>   first, then replay [Alex]
> - fix commit message for patch:
>   "intel_iommu: simplify irq region translation" [Kevin]
> - handle domain/global invalidation, and notify where proper [Jason,
>   Kevin]
> 
> v3:
> - fix style error reported by patchew
> - fix comment in domain switch patch: use "IOMMU address space" rather
>   than "IOMMU region" [Kevin]
> - add ack-by for Paolo in patch:
>   "memory: add section range info for IOMMU notifier"
>   (this is seperately collected besides this thread)
> - remove 3 patches which are merged already (from Jason)
> - rebase to master b6c0897
> 
> v2:
> - change comment for "end" parameter in vtd_page_walk() [Tianyu]
> - change comment for "a iova" to "an iova" [Yi]
> - fix fault printed val for GPA address in vtd_page_walk_level (debug
>   only)
> - rebased to master (rather than Aviv's v6 series) and merged Aviv's
>   series v6: picked patch 1 (as patch 1 in this series), dropped patch
>   2, re-wrote patch 3 (as patch 17 of this series).
> - picked up two more bugfix patches from Jason's DMAR series
> - picked up the following patch as well:
>   "[PATCH v3] intel_iommu: allow dynamic switch of IOMMU region"
> 
> This RFC series is a re-work for Aviv B.D.'s vfio enablement series
> with vt-d:
> 
>   https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01452.html
> 
> Aviv has done a great job there, and what we still lack there are
> mostly the following:
> 
> (1) VFIO got duplicated IOTLB notifications due to the split VT-d IOMMU
>     memory region.
> 
> (2) VT-d still hasn't provided a correct replay() mechanism (e.g.,
>     when the IOMMU domain switches, things will break).
> 
> This series should have solved the above two issues.
> 
> Online repo:
> 
>   https://github.com/xzpeter/qemu/tree/vtd-vfio-enablement-v4
> 
> I would be glad to hear about any review comments for above patches.
> 
> =========
> Test Done
> =========
> 
> Build test passed for x86_64/arm/ppc64.
> 
> Simply tested with x86_64, assigning two PCI devices to a single VM,
> booting the VM using:
> 
> bin=x86_64-softmmu/qemu-system-x86_64
> $bin -M q35,accel=kvm,kernel-irqchip=split -m 1G \
>      -device intel-iommu,intremap=on,eim=off,caching-mode=on \
>      -netdev user,id=net0,hostfwd=tcp::5555-:22 \
>      -device virtio-net-pci,netdev=net0 \
>      -device vfio-pci,host=03:00.0 \
>      -device vfio-pci,host=02:00.0 \
>      -trace events=".trace.vfio" \
>      /var/lib/libvirt/images/vm1.qcow2
> 
> pxdev:bin [vtd-vfio-enablement]# cat .trace.vfio
> vtd_page_walk*
> vtd_replay*
> vtd_inv_desc*
> 
> Then, in the guest, run the following tool:
> 
>   https://github.com/xzpeter/clibs/blob/master/gpl/userspace/vfio-bind-group/vfio-bind-group.c
> 
> With parameter:
> 
>   ./vfio-bind-group 00:03.0 00:04.0
> 
> Checking the host-side trace log, I can see pages replayed and mapped
> into the 00:04.0 device address space, like:
> 
> ...
> vtd_replay_ce_valid replay valid context device 00:04.00 hi 0x401 lo 0x38fe1001
> vtd_page_walk Page walk for ce (0x401, 0x38fe1001) iova range 0x0 - 0x8000000000
> vtd_page_walk_level Page walk (base=0x38fe1000, level=3) iova range 0x0 - 0x8000000000
> vtd_page_walk_level Page walk (base=0x35d31000, level=2) iova range 0x0 - 0x40000000
> vtd_page_walk_level Page walk (base=0x34979000, level=1) iova range 0x0 - 0x200000
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x0 -> gpa 0x22dc3000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x1000 -> gpa 0x22e25000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x2000 -> gpa 0x22e12000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x3000 -> gpa 0x22e2d000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x4000 -> gpa 0x12a49000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x5000 -> gpa 0x129bb000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x6000 -> gpa 0x128db000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x7000 -> gpa 0x12a80000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x8000 -> gpa 0x12a7e000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x9000 -> gpa 0x12b22000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0xa000 -> gpa 0x12b41000 mask 0xfff perm 3
> ...
> 
> =========
> Todo List
> =========
> 
> - error reporting for the assigned devices (as Tianyu has mentioned)
> 
> - per-domain address space: a better solution in the future may be to
>   maintain one address space per IOMMU domain in the guest (so
>   multiple devices can share the same address space if they share
>   the same IOMMU domain in the guest), rather than one address space
>   per device (which is the current VT-d implementation). However,
>   that's a step beyond this series; let's first see whether we can
>   provide a workable version of device assignment with VT-d
>   protection.
> 
> - more to come...
> 
> Thanks,
> 
> Aviv Ben-David (1):
>   IOMMU: add option to enable VTD_CAP_CM to vIOMMU capility exposoed to
>     guest
> 
> Peter Xu (19):
>   vfio: trace map/unmap for notify as well
>   vfio: introduce vfio_get_vaddr()
>   vfio: allow to notify unmap for very large region
>   intel_iommu: simplify irq region translation
>   intel_iommu: renaming gpa to iova where proper
>   intel_iommu: fix trace for inv desc handling
>   intel_iommu: fix trace for addr translation
>   intel_iommu: vtd_slpt_level_shift check level
>   memory: add section range info for IOMMU notifier
>   memory: provide IOMMU_NOTIFIER_FOREACH macro
>   memory: provide iommu_replay_all()
>   memory: introduce memory_region_notify_one()
>   memory: add MemoryRegionIOMMUOps.replay() callback
>   intel_iommu: provide its own replay() callback
>   intel_iommu: do replay when context invalidate
>   intel_iommu: allow dynamic switch of IOMMU region
>   intel_iommu: enable vfio devices
>   intel_iommu: unmap existing pages before replay
>   intel_iommu: replay even with DSI/GLOBAL inv desc
> 
>  hw/i386/intel_iommu.c          | 674 +++++++++++++++++++++++++++++++----------
>  hw/i386/intel_iommu_internal.h |   2 +
>  hw/i386/trace-events           |  30 ++
>  hw/vfio/common.c               |  68 +++--
>  hw/vfio/trace-events           |   2 +-
>  hw/virtio/vhost.c              |   4 +-
>  include/exec/memory.h          |  49 ++-
>  include/hw/i386/intel_iommu.h  |  12 +
>  memory.c                       |  47 ++-
>  9 files changed, 696 insertions(+), 192 deletions(-)
> 
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 18/20] intel_iommu: enable vfio devices
  2017-01-23  3:34         ` Peter Xu
  2017-01-23 10:23           ` Jason Wang
@ 2017-01-23 18:03           ` Alex Williamson
  2017-01-24  7:22             ` Peter Xu
  1 sibling, 1 reply; 75+ messages in thread
From: Alex Williamson @ 2017-01-23 18:03 UTC (permalink / raw)
  To: Peter Xu
  Cc: Jason Wang, tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel

On Mon, 23 Jan 2017 11:34:29 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Mon, Jan 23, 2017 at 09:55:39AM +0800, Jason Wang wrote:
> > 
> > 
> > On 2017年01月22日 17:04, Peter Xu wrote:  
> > >On Sun, Jan 22, 2017 at 04:08:04PM +0800, Jason Wang wrote:
> > >
> > >[...]
> > >  
> > >>>+static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
> > >>>+                                           uint16_t domain_id, hwaddr addr,
> > >>>+                                           uint8_t am)
> > >>>+{
> > >>>+    IntelIOMMUNotifierNode *node;
> > >>>+    VTDContextEntry ce;
> > >>>+    int ret;
> > >>>+
> > >>>+    QLIST_FOREACH(node, &(s->notifiers_list), next) {
> > >>>+        VTDAddressSpace *vtd_as = node->vtd_as;
> > >>>+        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> > >>>+                                       vtd_as->devfn, &ce);
> > >>>+        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> > >>>+            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
> > >>>+                          vtd_page_invalidate_notify_hook,
> > >>>+                          (void *)&vtd_as->iommu, true);  
> > >>Why not simply trigger the notifier here? (or is this vfio required?)  
> > >Because we may only want to notify part of the region - we are with
> > >mask here, but not exact size.
> > >
> > >Consider this: guest (with caching mode) maps 12K memory (4K*3 pages),
> > >the mask will be extended to 16K in the guest. In that case, we need
> > >to explicitly go over the page entry to know that the 4th page should
> > >not be notified.  
> > 
> > I see. Then it was required by vfio only, I think we can add a fast path for
> > !CM in this case by triggering the notifier directly.  
> 
> I noted this down (to be further investigated in my todo), but I don't
> know whether this can work, due to the fact that I think it is still
> legal that guest merge more than one PSIs into one. For example, I
> don't know whether below is legal:
> 
> - guest invalidate page (0, 4k)
> - guest map new page (4k, 8k)
> - guest send single PSI of (0, 8k)
> 
> In that case, it contains both map/unmap, and looks like it didn't
> disobay the spec as well?

The topic of mapping and invalidation granularity also makes me
slightly concerned with the abstraction we use for the type1 IOMMU
backend.  With the "v2" type1 configuration we currently use in QEMU,
the user may only unmap with the same minimum granularity with which
the original mapping was created.  For instance if an iommu notifier
map request gets to vfio with an 8k range, the resulting mapping can
only be removed by an invalidation covering the full range.  Trying to
bisect that original mapping by only invalidating 4k of the range will
generate an error.
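
For illustration only, here is a rough userspace sketch of that v2
behavior (names are placeholders: "container" is assumed to be an open,
already configured type1 v2 VFIO container fd and "buf" an 8k user
buffer; error handling omitted):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static void demo_partial_unmap(int container, void *buf)
{
    /* Create a single 8k mapping at iova 1M */
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uintptr_t)buf,
        .iova  = 0x100000,
        .size  = 0x2000,
    };
    ioctl(container, VFIO_IOMMU_MAP_DMA, &map);

    /* Try to remove only the first 4k of that mapping */
    struct vfio_iommu_type1_dma_unmap unmap = {
        .argsz = sizeof(unmap),
        .iova  = 0x100000,
        .size  = 0x1000,
    };
    /* With the v2 interface this ioctl fails: an unmap cannot bisect
     * an existing mapping, it has to cover the range at the
     * granularity with which the mapping was originally created. */
    ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap);
}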

I would think (but please confirm), that when we're only tracking
mappings generated by the guest OS that this works.  If the guest OS
maps with 4k pages, we get map notifies for each of those 4k pages.  If
they use 2MB pages, we get 2MB ranges and invalidations will come in
the same granularity.

An area of concern though is the replay mechanism in QEMU, I'll need to
look for it in the code, but replaying an IOMMU domain into a new
container *cannot* coalesce mappings or else it limits the granularity
with which we can later accept unmaps.  Take for instance a guest that
has mapped a contiguous 2MB range with 4K pages.  They can unmap any 4K
page within that range.  However if vfio gets a single 2MB mapping
rather than 512 4K mappings, then the host IOMMU may use a hugepage
mapping where our granularity is now 2MB.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 01/20] vfio: trace map/unmap for notify as well
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 01/20] vfio: trace map/unmap for notify as well Peter Xu
@ 2017-01-23 18:20   ` Alex Williamson
  0 siblings, 0 replies; 75+ messages in thread
From: Alex Williamson @ 2017-01-23 18:20 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, bd.aviv

On Fri, 20 Jan 2017 21:08:37 +0800
Peter Xu <peterx@redhat.com> wrote:

> We trace its range, but we don't know whether it's a MAP or an UNMAP.
> Let's dump that as well.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/vfio/common.c     | 3 ++-
>  hw/vfio/trace-events | 2 +-
>  2 files changed, 3 insertions(+), 2 deletions(-)

Acked-by: Alex Williamson <alex.williamson@redhat.com>

> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 801578b..174f351 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -305,7 +305,8 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>      void *vaddr;
>      int ret;
>  
> -    trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
> +    trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? "UNMAP" : "MAP",
> +                                iova, iova + iotlb->addr_mask);
>  
>      if (iotlb->target_as != &address_space_memory) {
>          error_report("Wrong target AS \"%s\", only system memory is allowed",
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index ef81609..7ae8233 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -84,7 +84,7 @@ vfio_pci_igd_lpc_bridge_enabled(const char *name) "%s"
>  # hw/vfio/common.c
>  vfio_region_write(const char *name, int index, uint64_t addr, uint64_t data, unsigned size) " (%s:region%d+0x%"PRIx64", 0x%"PRIx64 ", %d)"
>  vfio_region_read(char *name, int index, uint64_t addr, unsigned size, uint64_t data) " (%s:region%d+0x%"PRIx64", %d) = 0x%"PRIx64
> -vfio_iommu_map_notify(uint64_t iova_start, uint64_t iova_end) "iommu map @ %"PRIx64" - %"PRIx64
> +vfio_iommu_map_notify(const char *op, uint64_t iova_start, uint64_t iova_end) "iommu %s @ %"PRIx64" - %"PRIx64
>  vfio_listener_region_add_skip(uint64_t start, uint64_t end) "SKIPPING region_add %"PRIx64" - %"PRIx64
>  vfio_listener_region_add_iommu(uint64_t start, uint64_t end) "region_add [iommu] %"PRIx64" - %"PRIx64
>  vfio_listener_region_add_ram(uint64_t iova_start, uint64_t iova_end, void *vaddr) "region_add [ram] %"PRIx64" - %"PRIx64" [%p]"

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 02/20] vfio: introduce vfio_get_vaddr()
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 02/20] vfio: introduce vfio_get_vaddr() Peter Xu
@ 2017-01-23 18:49   ` Alex Williamson
  2017-01-24  3:28     ` Peter Xu
  0 siblings, 1 reply; 75+ messages in thread
From: Alex Williamson @ 2017-01-23 18:49 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, bd.aviv

On Fri, 20 Jan 2017 21:08:38 +0800
Peter Xu <peterx@redhat.com> wrote:

> A cleanup for vfio_iommu_map_notify(). Should have no functional change,
> just to make the function shorter and easier to understand.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/vfio/common.c | 58 +++++++++++++++++++++++++++++++++++++-------------------
>  1 file changed, 38 insertions(+), 20 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 174f351..ce55dff 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -294,25 +294,14 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>             section->offset_within_address_space & (1ULL << 63);
>  }
>  
> -static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> +static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
> +                           bool *read_only)
>  {
> -    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
> -    VFIOContainer *container = giommu->container;
> -    hwaddr iova = iotlb->iova + giommu->iommu_offset;
>      MemoryRegion *mr;
>      hwaddr xlat;
>      hwaddr len = iotlb->addr_mask + 1;
> -    void *vaddr;
> -    int ret;
> -
> -    trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? "UNMAP" : "MAP",
> -                                iova, iova + iotlb->addr_mask);
> -
> -    if (iotlb->target_as != &address_space_memory) {
> -        error_report("Wrong target AS \"%s\", only system memory is allowed",
> -                     iotlb->target_as->name ? iotlb->target_as->name : "none");
> -        return;
> -    }
> +    bool ret = false;
> +    bool writable = iotlb->perm & IOMMU_WO;
>  
>      /*
>       * The IOMMU TLB entry we have just covers translation through
> @@ -322,12 +311,13 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>      rcu_read_lock();
>      mr = address_space_translate(&address_space_memory,
>                                   iotlb->translated_addr,
> -                                 &xlat, &len, iotlb->perm & IOMMU_WO);
> +                                 &xlat, &len, writable);
>      if (!memory_region_is_ram(mr)) {
>          error_report("iommu map to non memory area %"HWADDR_PRIx"",
>                       xlat);
>          goto out;
>      }
> +
>      /*
>       * Translation truncates length to the IOMMU page size,
>       * check that it did not truncate too much.
> @@ -337,11 +327,41 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>          goto out;
>      }
>  
> +    *vaddr = memory_region_get_ram_ptr(mr) + xlat;
> +    *read_only = !writable || mr->readonly;
> +    ret = true;
> +
> +out:
> +    rcu_read_unlock();
> +    return ret;
> +}
> +
> +static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> +{
> +    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
> +    VFIOContainer *container = giommu->container;
> +    hwaddr iova = iotlb->iova + giommu->iommu_offset;
> +    bool read_only;
> +    void *vaddr;
> +    int ret;
> +
> +    trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? "UNMAP" : "MAP",
> +                                iova, iova + iotlb->addr_mask);
> +
> +    if (iotlb->target_as != &address_space_memory) {
> +        error_report("Wrong target AS \"%s\", only system memory is allowed",
> +                     iotlb->target_as->name ? iotlb->target_as->name : "none");
> +        return;
> +    }
> +
> +    if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
> +        return;
> +    }
> +
>      if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> -        vaddr = memory_region_get_ram_ptr(mr) + xlat;
>          ret = vfio_dma_map(container, iova,
>                             iotlb->addr_mask + 1, vaddr,
> -                           !(iotlb->perm & IOMMU_WO) || mr->readonly);
> +                           read_only);

Is it really valid to move the map ioctl out of the rcu read lock?
We're making use of vaddr, which is directly a property of a
MemoryRegion which may have now disappeared.  With the lock released,
could an unmap race the map resulting in the wrong ordering?  As noted
previously, there are some subtle changes here, we do the
memory_region_get_ram_ptr() translation on both map and unmap (fixed in
next patch) and then pull map out of the rcu lock.  I'm not sure the
extra function is worthwhile or really has no functional change.
Thanks,

Alex

>          if (ret) {
>              error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>                           "0x%"HWADDR_PRIx", %p) = %d (%m)",
> @@ -357,8 +377,6 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>                           iotlb->addr_mask + 1, ret);
>          }
>      }
> -out:
> -    rcu_read_unlock();
>  }
>  
>  static void vfio_listener_region_add(MemoryListener *listener,

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 10/20] memory: add section range info for IOMMU notifier
  2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 10/20] memory: add section range info for IOMMU notifier Peter Xu
@ 2017-01-23 19:12   ` Alex Williamson
  2017-01-24  7:48     ` Peter Xu
  0 siblings, 1 reply; 75+ messages in thread
From: Alex Williamson @ 2017-01-23 19:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, bd.aviv

On Fri, 20 Jan 2017 21:08:46 +0800
Peter Xu <peterx@redhat.com> wrote:

> In this patch, IOMMUNotifier.{start|end} are introduced to store section
> information for a specific notifier. When notification occurs, we not
> only check the notification type (MAP|UNMAP), but also check whether the
> notified iova is in the range of specific IOMMU notifier, and skip those
> notifiers if not in the listened range.
> 
> When removing an region, we need to make sure we removed the correct
> VFIOGuestIOMMU by checking the IOMMUNotifier.start address as well.
> 
> Suggested-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> changelog (start from vt-d vfio enablement series v3):
> v4:
> - introduce memory_region_iommu_notifier_init() [Jason]
> ---
>  hw/vfio/common.c      | 12 +++++++++---
>  hw/virtio/vhost.c     |  4 ++--
>  include/exec/memory.h | 19 ++++++++++++++++++-
>  memory.c              |  5 ++++-
>  4 files changed, 33 insertions(+), 7 deletions(-)


Acked-by: Alex Williamson <alex.williamson@redhat.com>


> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 4d90844..49dc035 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -471,8 +471,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
>          giommu->iommu_offset = section->offset_within_address_space -
>                                 section->offset_within_region;
>          giommu->container = container;
> -        giommu->n.notify = vfio_iommu_map_notify;
> -        giommu->n.notifier_flags = IOMMU_NOTIFIER_ALL;
> +        llend = int128_add(int128_make64(section->offset_within_region),
> +                           section->size);
> +        llend = int128_sub(llend, int128_one());
> +        iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
> +                            IOMMU_NOTIFIER_ALL,
> +                            section->offset_within_region,
> +                            int128_get64(llend));
>          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>  
>          memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> @@ -543,7 +548,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
>          VFIOGuestIOMMU *giommu;
>  
>          QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> -            if (giommu->iommu == section->mr) {
> +            if (giommu->iommu == section->mr &&
> +                giommu->n.start == section->offset_within_region) {
>                  memory_region_unregister_iommu_notifier(giommu->iommu,
>                                                          &giommu->n);
>                  QLIST_REMOVE(giommu, giommu_next);
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 9cacf55..cc99c6a 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -1242,8 +1242,8 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
>          .priority = 10
>      };
>  
> -    hdev->n.notify = vhost_iommu_unmap_notify;
> -    hdev->n.notifier_flags = IOMMU_NOTIFIER_UNMAP;
> +    iommu_notifier_init(&hdev->n, vhost_iommu_unmap_notify,
> +                        IOMMU_NOTIFIER_UNMAP, 0, ~0ULL);
>  
>      if (hdev->migration_blocker == NULL) {
>          if (!(hdev->features & (0x1ULL << VHOST_F_LOG_ALL))) {
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index bec9756..ae4c9a9 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -81,13 +81,30 @@ typedef enum {
>  
>  #define IOMMU_NOTIFIER_ALL (IOMMU_NOTIFIER_MAP | IOMMU_NOTIFIER_UNMAP)
>  
> +struct IOMMUNotifier;
> +typedef void (*IOMMUNotify)(struct IOMMUNotifier *notifier,
> +                            IOMMUTLBEntry *data);
> +
>  struct IOMMUNotifier {
> -    void (*notify)(struct IOMMUNotifier *notifier, IOMMUTLBEntry *data);
> +    IOMMUNotify notify;
>      IOMMUNotifierFlag notifier_flags;
> +    /* Notify for address space range start <= addr <= end */
> +    hwaddr start;
> +    hwaddr end;
>      QLIST_ENTRY(IOMMUNotifier) node;
>  };
>  typedef struct IOMMUNotifier IOMMUNotifier;
>  
> +static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
> +                                       IOMMUNotifierFlag flags,
> +                                       hwaddr start, hwaddr end)
> +{
> +    n->notify = fn;
> +    n->notifier_flags = flags;
> +    n->start = start;
> +    n->end = end;
> +}
> +
>  /* New-style MMIO accessors can indicate that the transaction failed.
>   * A zero (MEMTX_OK) response means success; anything else is a failure
>   * of some kind. The memory subsystem will bitwise-OR together results
> diff --git a/memory.c b/memory.c
> index 2bfc37f..89104b1 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1610,6 +1610,7 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr,
>  
>      /* We need to register for at least one bitfield */
>      assert(n->notifier_flags != IOMMU_NOTIFIER_NONE);
> +    assert(n->start <= n->end);
>      QLIST_INSERT_HEAD(&mr->iommu_notify, n, node);
>      memory_region_update_iommu_notify_flags(mr);
>  }
> @@ -1671,7 +1672,9 @@ void memory_region_notify_iommu(MemoryRegion *mr,
>      }
>  
>      QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
> -        if (iommu_notifier->notifier_flags & request_flags) {
> +        if (iommu_notifier->notifier_flags & request_flags &&
> +            iommu_notifier->start <= entry.iova &&
> +            iommu_notifier->end >= entry.iova) {
>              iommu_notifier->notify(iommu_notifier, &entry);
>          }
>      }

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 15/20] intel_iommu: provide its own replay() callback
  2017-01-22  8:51     ` Peter Xu
  2017-01-22  9:36       ` Peter Xu
  2017-01-23  1:48       ` Jason Wang
@ 2017-01-23 19:33       ` Alex Williamson
  2 siblings, 0 replies; 75+ messages in thread
From: Alex Williamson @ 2017-01-23 19:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: Jason Wang, qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv

On Sun, 22 Jan 2017 16:51:18 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Sun, Jan 22, 2017 at 03:56:10PM +0800, Jason Wang wrote:
> >   
> > >+            trace_vtd_page_walk_one(level, entry.iova, entry.translated_addr,
> > >+                                    entry.addr_mask, entry.perm);
> > >+            if (hook_fn) {
> > >+                ret = hook_fn(&entry, private);  
> > 
> > For better performance, we could try to merge adjacent mappings here. I
> > think both vfio and vhost support this and it can save a lot of ioctls.  
> 
> Looks so, and this is in my todo list.
> 
> Do you mind if I do it later, after this series is merged? I would
> really appreciate it if we could have this code settled down first
> (considering that this series has been dangling for half a year, or
> more, starting from Aviv's series), and I am just afraid this will
> keep the series from converging (and I believe there are other places
> that can be enhanced in the future as well).

NAK, we can't merge mappings per my comment on 18/20.  You're looking
at an entirely new or at best revised version of the vfio IOMMU
interface to do so.  vfio does not support invalidations at a smaller
granularity than the original mapping.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 15/20] intel_iommu: provide its own replay() callback
  2017-01-23  2:54         ` Peter Xu
  2017-01-23  3:12           ` Jason Wang
@ 2017-01-23 19:34           ` Alex Williamson
  2017-01-24  4:04             ` Peter Xu
  1 sibling, 1 reply; 75+ messages in thread
From: Alex Williamson @ 2017-01-23 19:34 UTC (permalink / raw)
  To: Peter Xu
  Cc: Jason Wang, tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel

On Mon, 23 Jan 2017 10:54:49 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Mon, Jan 23, 2017 at 09:48:48AM +0800, Jason Wang wrote:
> > 
> > 
> > On 2017年01月22日 16:51, Peter Xu wrote:  
> > >On Sun, Jan 22, 2017 at 03:56:10PM +0800, Jason Wang wrote:
> > >
> > >[...]
> > >  
> > >>>+/**
> > >>>+ * vtd_page_walk_level - walk over specific level for IOVA range
> > >>>+ *
> > >>>+ * @addr: base GPA addr to start the walk
> > >>>+ * @start: IOVA range start address
> > >>>+ * @end: IOVA range end address (start <= addr < end)
> > >>>+ * @hook_fn: hook func to be called when detected page
> > >>>+ * @private: private data to be passed into hook func
> > >>>+ * @read: whether parent level has read permission
> > >>>+ * @write: whether parent level has write permission
> > >>>+ * @skipped: accumulated skipped ranges  
> > >>What's the usage for this parameter? Looks like it was never used in this
> > >>series.  
> > >This was for debugging purpose before, and I kept it in case one day
> > >it can be used again, considering that will not affect much on the
> > >overall performance.  
> > 
> > I think we usually do not keep debugging codes outside debug macros.  
> 
> I'll remove it.

While you're at it, what's the value in using a void* private rather
than just passing around an IOMMUNotifier*.  Seems like unnecessary
abstraction.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 18/20] intel_iommu: enable vfio devices
  2017-01-23 10:23           ` Jason Wang
@ 2017-01-23 19:40             ` Alex Williamson
  2017-01-25  1:19               ` Jason Wang
  2017-01-24  4:42             ` Peter Xu
  1 sibling, 1 reply; 75+ messages in thread
From: Alex Williamson @ 2017-01-23 19:40 UTC (permalink / raw)
  To: Jason Wang
  Cc: Peter Xu, tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel

On Mon, 23 Jan 2017 18:23:44 +0800
Jason Wang <jasowang@redhat.com> wrote:

> On 2017年01月23日 11:34, Peter Xu wrote:
> > On Mon, Jan 23, 2017 at 09:55:39AM +0800, Jason Wang wrote:  
> >>
> >> On 2017年01月22日 17:04, Peter Xu wrote:  
> >>> On Sun, Jan 22, 2017 at 04:08:04PM +0800, Jason Wang wrote:
> >>>
> >>> [...]
> >>>  
> >>>>> +static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
> >>>>> +                                           uint16_t domain_id, hwaddr addr,
> >>>>> +                                           uint8_t am)
> >>>>> +{
> >>>>> +    IntelIOMMUNotifierNode *node;
> >>>>> +    VTDContextEntry ce;
> >>>>> +    int ret;
> >>>>> +
> >>>>> +    QLIST_FOREACH(node, &(s->notifiers_list), next) {
> >>>>> +        VTDAddressSpace *vtd_as = node->vtd_as;
> >>>>> +        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> >>>>> +                                       vtd_as->devfn, &ce);
> >>>>> +        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> >>>>> +            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
> >>>>> +                          vtd_page_invalidate_notify_hook,
> >>>>> +                          (void *)&vtd_as->iommu, true);  
> >>>> Why not simply trigger the notifier here? (or is this vfio required?)  
> >>> Because we may only want to notify part of the region - we are with
> >>> mask here, but not exact size.
> >>>
> >>> Consider this: guest (with caching mode) maps 12K memory (4K*3 pages),
> >>> the mask will be extended to 16K in the guest. In that case, we need
> >>> to explicitly go over the page entry to know that the 4th page should
> >>> not be notified.  
> >> I see. Then it was required by vfio only, I think we can add a fast path for
> >> !CM in this case by triggering the notifier directly.  
> > I noted this down (to be further investigated in my todo), but I don't
> > know whether this can work, due to the fact that I think it is still
> > legal that guest merge more than one PSIs into one. For example, I
> > don't know whether below is legal:
> >
> > - guest invalidate page (0, 4k)
> > - guest map new page (4k, 8k)
> > - guest send single PSI of (0, 8k)
> >
> > In that case, it contains both map/unmap, and looks like it didn't
> > disobay the spec as well?  
> 
> Not sure I get your meaning, you mean just send single PSI instead of two?
> 
> >  
> >> Another possible issue is, consider (with CM) a 16K contiguous iova with the
> >> last page has already been mapped. In this case, if we want to map first
> >> three pages, when handling IOTLB invalidation, am would be 16K, then the
> >> last page will be mapped twice. Can this lead some issue?  
> > I don't know whether guest has special handling of this kind of
> > request.  
> 
> This seems quite usual I think? E.g iommu_flush_iotlb_psi() did:
> 
> static void iommu_flush_iotlb_psi(struct intel_iommu *iommu,
>                    struct dmar_domain *domain,
>                    unsigned long pfn, unsigned int pages,
>                    int ih, int map)
> {
>      unsigned int mask = ilog2(__roundup_pow_of_two(pages));
>      uint64_t addr = (uint64_t)pfn << VTD_PAGE_SHIFT;
>      u16 did = domain->iommu_did[iommu->seq_id];
> ...
> 
> 
> >
> > Besides, imho to completely solve this problem, we still need that
> > per-domain tree. Considering that currently the tree is inside vfio, I
> > see this not a big issue as well.  
> 
> Another issue I found is: with this series, VFIO_IOMMU_MAP_DMA seems to
> become guest-triggerable. And since VFIO allocates its own structure to
> record DMA mappings, this seems to open a window for an evil guest to
> exhaust host memory, which is even worse.

You're thinking of pci-assign; vfio does page accounting such that a
user can only lock pages up to their locked memory limit.  Exposing the
mapping ioctl within the guest is not a different problem from exposing
the ioctl to the host user from a vfio perspective.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 02/20] vfio: introduce vfio_get_vaddr()
  2017-01-23 18:49   ` Alex Williamson
@ 2017-01-24  3:28     ` Peter Xu
  2017-01-24  4:30       ` Alex Williamson
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2017-01-24  3:28 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, bd.aviv

On Mon, Jan 23, 2017 at 11:49:05AM -0700, Alex Williamson wrote:
> On Fri, 20 Jan 2017 21:08:38 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > A cleanup for vfio_iommu_map_notify(). Should have no functional change,
> > just to make the function shorter and easier to understand.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  hw/vfio/common.c | 58 +++++++++++++++++++++++++++++++++++++-------------------
> >  1 file changed, 38 insertions(+), 20 deletions(-)
> > 
> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > index 174f351..ce55dff 100644
> > --- a/hw/vfio/common.c
> > +++ b/hw/vfio/common.c
> > @@ -294,25 +294,14 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> >             section->offset_within_address_space & (1ULL << 63);
> >  }
> >  
> > -static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> > +static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
> > +                           bool *read_only)
> >  {
> > -    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
> > -    VFIOContainer *container = giommu->container;
> > -    hwaddr iova = iotlb->iova + giommu->iommu_offset;
> >      MemoryRegion *mr;
> >      hwaddr xlat;
> >      hwaddr len = iotlb->addr_mask + 1;
> > -    void *vaddr;
> > -    int ret;
> > -
> > -    trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? "UNMAP" : "MAP",
> > -                                iova, iova + iotlb->addr_mask);
> > -
> > -    if (iotlb->target_as != &address_space_memory) {
> > -        error_report("Wrong target AS \"%s\", only system memory is allowed",
> > -                     iotlb->target_as->name ? iotlb->target_as->name : "none");
> > -        return;
> > -    }
> > +    bool ret = false;
> > +    bool writable = iotlb->perm & IOMMU_WO;
> >  
> >      /*
> >       * The IOMMU TLB entry we have just covers translation through
> > @@ -322,12 +311,13 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> >      rcu_read_lock();
> >      mr = address_space_translate(&address_space_memory,
> >                                   iotlb->translated_addr,
> > -                                 &xlat, &len, iotlb->perm & IOMMU_WO);
> > +                                 &xlat, &len, writable);
> >      if (!memory_region_is_ram(mr)) {
> >          error_report("iommu map to non memory area %"HWADDR_PRIx"",
> >                       xlat);
> >          goto out;
> >      }
> > +
> >      /*
> >       * Translation truncates length to the IOMMU page size,
> >       * check that it did not truncate too much.
> > @@ -337,11 +327,41 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> >          goto out;
> >      }
> >  
> > +    *vaddr = memory_region_get_ram_ptr(mr) + xlat;
> > +    *read_only = !writable || mr->readonly;
> > +    ret = true;
> > +
> > +out:
> > +    rcu_read_unlock();
> > +    return ret;
> > +}
> > +
> > +static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> > +{
> > +    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
> > +    VFIOContainer *container = giommu->container;
> > +    hwaddr iova = iotlb->iova + giommu->iommu_offset;
> > +    bool read_only;
> > +    void *vaddr;
> > +    int ret;
> > +
> > +    trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? "UNMAP" : "MAP",
> > +                                iova, iova + iotlb->addr_mask);
> > +
> > +    if (iotlb->target_as != &address_space_memory) {
> > +        error_report("Wrong target AS \"%s\", only system memory is allowed",
> > +                     iotlb->target_as->name ? iotlb->target_as->name : "none");
> > +        return;
> > +    }
> > +
> > +    if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
> > +        return;
> > +    }
> > +
> >      if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> > -        vaddr = memory_region_get_ram_ptr(mr) + xlat;
> >          ret = vfio_dma_map(container, iova,
> >                             iotlb->addr_mask + 1, vaddr,
> > -                           !(iotlb->perm & IOMMU_WO) || mr->readonly);
> > +                           read_only);
> 
> Is it really valid to move the map ioctl out of the rcu read lock?
> We're making use of vaddr, which is directly a property of a
> MemoryRegion which may have now disappeared.  With the lock released,
> could an unmap race the map resulting in the wrong ordering?  As noted
> previously, there are some subtle changes here, we do the
> memory_region_get_ram_ptr() translation on both map and unmap (fixed in
> next patch) and then pull map out of the rcu lock.  I'm not sure the
> extra function is worthwhile or really has no functional change.
> Thanks,

Thanks for raising this question.

IIUC this function can be triggered by three cases (this is for x86; I
suppose the rule should be the same for all platforms):

- memory hot add/remove
- a PSI (page selective invalidation) for a newly mapped IO page
- a domain switch (which needs an IOMMU replay)

IMHO all these places are protected by the BQL (both the 2nd and 3rd
cases should be invoked from VT-d IOMMU MMIO writes to the queued
invalidation registers)? And I thought the BQL should be regarded as a
write lock even stronger than an RCU read lock?

If I understand the above correctly, it looks like we should be safe
here as long as we always hold the BQL. And, if so, do we really need
RCU read protection here?

Please kindly correct me if I missed anything.
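
(If it helps, one way I could double check that assumption - just a
debugging hack on top of this series, not something to merge - is to
assert the BQL at the entry of the notifier, e.g.:

    assert(qemu_mutex_iothread_locked());

in vfio_iommu_map_notify(), and then exercise the memory hotplug, PSI
and replay paths.)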

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 15/20] intel_iommu: provide its own replay() callback
  2017-01-23 19:34           ` Alex Williamson
@ 2017-01-24  4:04             ` Peter Xu
  0 siblings, 0 replies; 75+ messages in thread
From: Peter Xu @ 2017-01-24  4:04 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Wang, tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel

On Mon, Jan 23, 2017 at 12:34:29PM -0700, Alex Williamson wrote:
> On Mon, 23 Jan 2017 10:54:49 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Mon, Jan 23, 2017 at 09:48:48AM +0800, Jason Wang wrote:
> > > 
> > > 
> > > On 2017年01月22日 16:51, Peter Xu wrote:  
> > > >On Sun, Jan 22, 2017 at 03:56:10PM +0800, Jason Wang wrote:
> > > >
> > > >[...]
> > > >  
> > > >>>+/**
> > > >>>+ * vtd_page_walk_level - walk over specific level for IOVA range
> > > >>>+ *
> > > >>>+ * @addr: base GPA addr to start the walk
> > > >>>+ * @start: IOVA range start address
> > > >>>+ * @end: IOVA range end address (start <= addr < end)
> > > >>>+ * @hook_fn: hook func to be called when detected page
> > > >>>+ * @private: private data to be passed into hook func
> > > >>>+ * @read: whether parent level has read permission
> > > >>>+ * @write: whether parent level has write permission
> > > >>>+ * @skipped: accumulated skipped ranges  
> > > >>What's the usage for this parameter? Looks like it was never used in this
> > > >>series.  
> > > >This was for debugging purpose before, and I kept it in case one day
> > > >it can be used again, considering that will not affect much on the
> > > >overall performance.  
> > > 
> > > I think we usually do not keep debugging codes outside debug macros.  
> > 
> > I'll remove it.
> 
> While you're at it, what's the value in using a void* private rather
> than just passing around an IOMMUNotifier*.  Seems like unnecessary
> abstraction.  Thanks,

When handling PSIs (in later patches of this series, not this one), we
pass in a MemoryRegion* rather than an IOMMUNotifier*:

        vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
                        vtd_page_invalidate_notify_hook,
                        (void *)&vtd_as->iommu, true);

So a void* might still be required. Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 02/20] vfio: introduce vfio_get_vaddr()
  2017-01-24  3:28     ` Peter Xu
@ 2017-01-24  4:30       ` Alex Williamson
  0 siblings, 0 replies; 75+ messages in thread
From: Alex Williamson @ 2017-01-24  4:30 UTC (permalink / raw)
  To: Peter Xu, Paolo Bonzini
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	bd.aviv, David Gibson

On Tue, 24 Jan 2017 11:28:18 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Mon, Jan 23, 2017 at 11:49:05AM -0700, Alex Williamson wrote:
> > On Fri, 20 Jan 2017 21:08:38 +0800
> > Peter Xu <peterx@redhat.com> wrote:
> >   
> > > A cleanup for vfio_iommu_map_notify(). Should have no functional change,
> > > just to make the function shorter and easier to understand.
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >  hw/vfio/common.c | 58 +++++++++++++++++++++++++++++++++++++-------------------
> > >  1 file changed, 38 insertions(+), 20 deletions(-)
> > > 
> > > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > > index 174f351..ce55dff 100644
> > > --- a/hw/vfio/common.c
> > > +++ b/hw/vfio/common.c
> > > @@ -294,25 +294,14 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> > >             section->offset_within_address_space & (1ULL << 63);
> > >  }
> > >  
> > > -static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> > > +static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
> > > +                           bool *read_only)
> > >  {
> > > -    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
> > > -    VFIOContainer *container = giommu->container;
> > > -    hwaddr iova = iotlb->iova + giommu->iommu_offset;
> > >      MemoryRegion *mr;
> > >      hwaddr xlat;
> > >      hwaddr len = iotlb->addr_mask + 1;
> > > -    void *vaddr;
> > > -    int ret;
> > > -
> > > -    trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? "UNMAP" : "MAP",
> > > -                                iova, iova + iotlb->addr_mask);
> > > -
> > > -    if (iotlb->target_as != &address_space_memory) {
> > > -        error_report("Wrong target AS \"%s\", only system memory is allowed",
> > > -                     iotlb->target_as->name ? iotlb->target_as->name : "none");
> > > -        return;
> > > -    }
> > > +    bool ret = false;
> > > +    bool writable = iotlb->perm & IOMMU_WO;
> > >  
> > >      /*
> > >       * The IOMMU TLB entry we have just covers translation through
> > > @@ -322,12 +311,13 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> > >      rcu_read_lock();
> > >      mr = address_space_translate(&address_space_memory,
> > >                                   iotlb->translated_addr,
> > > -                                 &xlat, &len, iotlb->perm & IOMMU_WO);
> > > +                                 &xlat, &len, writable);
> > >      if (!memory_region_is_ram(mr)) {
> > >          error_report("iommu map to non memory area %"HWADDR_PRIx"",
> > >                       xlat);
> > >          goto out;
> > >      }
> > > +
> > >      /*
> > >       * Translation truncates length to the IOMMU page size,
> > >       * check that it did not truncate too much.
> > > @@ -337,11 +327,41 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> > >          goto out;
> > >      }
> > >  
> > > +    *vaddr = memory_region_get_ram_ptr(mr) + xlat;
> > > +    *read_only = !writable || mr->readonly;
> > > +    ret = true;
> > > +
> > > +out:
> > > +    rcu_read_unlock();
> > > +    return ret;
> > > +}
> > > +
> > > +static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> > > +{
> > > +    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
> > > +    VFIOContainer *container = giommu->container;
> > > +    hwaddr iova = iotlb->iova + giommu->iommu_offset;
> > > +    bool read_only;
> > > +    void *vaddr;
> > > +    int ret;
> > > +
> > > +    trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? "UNMAP" : "MAP",
> > > +                                iova, iova + iotlb->addr_mask);
> > > +
> > > +    if (iotlb->target_as != &address_space_memory) {
> > > +        error_report("Wrong target AS \"%s\", only system memory is allowed",
> > > +                     iotlb->target_as->name ? iotlb->target_as->name : "none");
> > > +        return;
> > > +    }
> > > +
> > > +    if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
> > > +        return;
> > > +    }
> > > +
> > >      if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> > > -        vaddr = memory_region_get_ram_ptr(mr) + xlat;
> > >          ret = vfio_dma_map(container, iova,
> > >                             iotlb->addr_mask + 1, vaddr,
> > > -                           !(iotlb->perm & IOMMU_WO) || mr->readonly);
> > > +                           read_only);  
> > 
> > Is it really valid to move the map ioctl out of the rcu read lock?
> > We're making use of vaddr, which is directly a property of a
> > MemoryRegion which may have now disappeared.  With the lock released,
> > could an unmap race the map resulting in the wrong ordering?  As noted
> > previously, there are some subtle changes here, we do the
> > memory_region_get_ram_ptr() translation on both map and unmap (fixed in
> > next patch) and then pull map out of the rcu lock.  I'm not sure the
> > extra function is worthwhile or really has no functional change.
> > Thanks,  
> 
> Thanks for raising this question.
> 
> IIUC this function can be triggered by three cases (this is for x86; I
> suppose the rule should be the same for all platforms):
> 
> - memory hot add/remove
> - a PSI (page selective invalidation) for a newly mapped IO page
> - a domain switch (which needs an IOMMU replay)
> 
> IMHO all these places are protected by the BQL (both the 2nd and 3rd
> cases should be invoked from VT-d IOMMU MMIO writes to the queued
> invalidation registers)? And I thought the BQL should be regarded as a
> write lock even stronger than an RCU read lock?
> 
> If I understand the above correctly, it looks like we should be safe
> here as long as we always hold the BQL. And, if so, do we really need
> RCU read protection here?
> 
> Please kindly correct me if I missed anything.

Note that this code is originally used on power systems; David Gibson
will also need to review whether the SPAPR IOMMU can handle the
relaxed unmap in patch 03/20.  I suspect you're right about current
usage vs BQL, but I wonder how that plays into the long term plans for
the BQL and whether the intention of 41063e1 was to standardize all
address_space_translate() callers regardless of any one caller's BQL
usage.

Paolo, what do you think?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 18/20] intel_iommu: enable vfio devices
  2017-01-23 10:23           ` Jason Wang
  2017-01-23 19:40             ` Alex Williamson
@ 2017-01-24  4:42             ` Peter Xu
  1 sibling, 0 replies; 75+ messages in thread
From: Peter Xu @ 2017-01-24  4:42 UTC (permalink / raw)
  To: Jason Wang
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel,
	alex.williamson

On Mon, Jan 23, 2017 at 06:23:44PM +0800, Jason Wang wrote:
> 
> 
> On 2017年01月23日 11:34, Peter Xu wrote:
> >On Mon, Jan 23, 2017 at 09:55:39AM +0800, Jason Wang wrote:
> >>
> >>On 2017年01月22日 17:04, Peter Xu wrote:
> >>>On Sun, Jan 22, 2017 at 04:08:04PM +0800, Jason Wang wrote:
> >>>
> >>>[...]
> >>>
> >>>>>+static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
> >>>>>+                                           uint16_t domain_id, hwaddr addr,
> >>>>>+                                           uint8_t am)
> >>>>>+{
> >>>>>+    IntelIOMMUNotifierNode *node;
> >>>>>+    VTDContextEntry ce;
> >>>>>+    int ret;
> >>>>>+
> >>>>>+    QLIST_FOREACH(node, &(s->notifiers_list), next) {
> >>>>>+        VTDAddressSpace *vtd_as = node->vtd_as;
> >>>>>+        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> >>>>>+                                       vtd_as->devfn, &ce);
> >>>>>+        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> >>>>>+            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
> >>>>>+                          vtd_page_invalidate_notify_hook,
> >>>>>+                          (void *)&vtd_as->iommu, true);
> >>>>Why not simply trigger the notifier here? (or is this vfio required?)
> >>>Because we may only want to notify part of the region - we are with
> >>>mask here, but not exact size.
> >>>
> >>>Consider this: guest (with caching mode) maps 12K memory (4K*3 pages),
> >>>the mask will be extended to 16K in the guest. In that case, we need
> >>>to explicitly go over the page entry to know that the 4th page should
> >>>not be notified.
> >>I see. Then it was required by vfio only, I think we can add a fast path for
> >>!CM in this case by triggering the notifier directly.
> >I noted this down (to be further investigated in my todo), but I don't
> >know whether this can work, due to the fact that I think it is still
> >legal that guest merge more than one PSIs into one. For example, I
> >don't know whether below is legal:
> >
> >- guest invalidate page (0, 4k)
> >- guest map new page (4k, 8k)
> >- guest send single PSI of (0, 8k)
> >
> >In that case, it contains both map/unmap, and looks like it didn't
> >disobay the spec as well?
> 
> Not sure I get your meaning, you mean just send single PSI instead of two?

Yes, and it looks like that still doesn't violate the spec?

Actually, for now, I think the best way forward with this series is to
first let it in (so that advanced users can start to use it and play
with it). Then we can get more feedback and solve the critical issues
that matter to customers and users.

For the above, I think the per-page walk is the safest option for now.
I can investigate in the future (as I mentioned) to see whether we can
make it faster, following your suggestion. However, it would be nice to
do that after we have some real use cases for this series, so we can
make sure the enhancement won't break anything besides boosting the
performance.

But of course I would like to listen to the maintainer's opinion on
this...

> 
> >
> >>Another possible issue is, consider (with CM) a 16K contiguous iova with the
> >>last page has already been mapped. In this case, if we want to map first
> >>three pages, when handling IOTLB invalidation, am would be 16K, then the
> >>last page will be mapped twice. Can this lead some issue?
> >I don't know whether guest has special handling of this kind of
> >request.
> 
> This seems quite usual I think? E.g iommu_flush_iotlb_psi() did:
> 
> static void iommu_flush_iotlb_psi(struct intel_iommu *iommu,
>                   struct dmar_domain *domain,
>                   unsigned long pfn, unsigned int pages,
>                   int ih, int map)
> {
>     unsigned int mask = ilog2(__roundup_pow_of_two(pages));
>     uint64_t addr = (uint64_t)pfn << VTD_PAGE_SHIFT;
>     u16 did = domain->iommu_did[iommu->seq_id];
> ...

Yes, rounding up should be the only thing to do when we have an
unaligned size.
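
As a quick worked example based on the snippet quoted above: for a 12K
request, pages = 3, __roundup_pow_of_two(3) = 4, so mask = ilog2(4) = 2
and the PSI covers 1 << 2 = 4 pages, i.e. 16K - one page more than was
actually mapped, which is exactly the case where the page walk has to
filter out the extra page.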

> 
> 
> >
> >Besides, imho to completely solve this problem, we still need that
> >per-domain tree. Considering that currently the tree is inside vfio, I
> >see this not a big issue as well.
> 
> Another issue I found is: with this series, VFIO_IOMMU_MAP_DMA seems to
> become guest-triggerable. And since VFIO allocates its own structure to
> record DMA mappings, this seems to open a window for an evil guest to
> exhaust host memory, which is even worse.

(I see Alex replied in another email, so will skip this one)

> 
> >  In that case, the last page mapping
> >request will fail (we might see one error line from QEMU stderr),
> >however that'll not affect too much since currently vfio allows that
> >failure to happen (ioctl fail, but that page is still mapped, which is
> >what we wanted).
> 
> Works but sub-optimal or maybe even buggy.

Again, to finally solve this, I think we need a tree. But I don't think
that's a good idea for this series, considering that we already have
one in the kernel. And I don't see this issue as a critical blocker (if
you don't disagree), since it should work for our goal, which is either
nested device assignment or DPDK applications in general.

I think users' feedback is really important for this series. So again,
I'll request that we postpone some issues as TODOs, rather than solving
all of them in this series before the merge.

> 
> >
> >(But of course above error message can be used by an in-guest attacker
> >  as well just like general error_report() issues reported before,
> >  though again I will appreciate if we can have this series
> >  functionally work first :)
> >
> >And, I should be able to emulate this behavior in guest with a tiny C
> >program to make sure of it, possibly after this series if allowed.
> 
> Or through your vtd unittest :) ?

Yes, or easier, just write a program in a guest running Linux that
sends VFIO_IOMMU_MAP_DMA ioctl()s accordingly.

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 16/20] intel_iommu: do replay when context invalidate
  2017-01-23 10:36   ` Jason Wang
@ 2017-01-24  4:52     ` Peter Xu
  2017-01-25  3:09       ` Jason Wang
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2017-01-24  4:52 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv

On Mon, Jan 23, 2017 at 06:36:17PM +0800, Jason Wang wrote:
> 
> 
> On 2017年01月20日 21:08, Peter Xu wrote:
> >Before this one we only invalidate context cache when we receive context
> >entry invalidations. However it's possible that the invalidation also
> >contains a domain switch (only if cache-mode is enabled for vIOMMU). In
> >that case we need to notify all the registered components about the new
> >mapping.
> >
> >Signed-off-by: Peter Xu <peterx@redhat.com>
> >---
> >  hw/i386/intel_iommu.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> >diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> >index f9c5142..4b08b4d 100644
> >--- a/hw/i386/intel_iommu.c
> >+++ b/hw/i386/intel_iommu.c
> >@@ -1146,6 +1146,16 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
> >                  trace_vtd_inv_desc_cc_device(bus_n, VTD_PCI_SLOT(devfn_it),
> >                                               VTD_PCI_FUNC(devfn_it));
> >                  vtd_as->context_cache_entry.context_cache_gen = 0;
> >+                /*
> >+                 * So a device is moving out of (or moving into) a
> >+                 * domain, a replay() suites here to notify all the
> >+                 * IOMMU_NOTIFIER_MAP registers about this change.
> >+                 * This won't bring bad even if we have no such
> >+                 * notifier registered - the IOMMU notification
> >+                 * framework will skip MAP notifications if that
> >+                 * happened.
> >+                 */
> >+                memory_region_iommu_replay_all(&vtd_as->iommu);
> 
> DSI and GLOBAL questions come back again or no need to care about them :) ?

DSI/GLOBAL handling is in patch 20 (though it'll be squashed into patch
18 in my next post). Is that what you meant above?

Thanks!

-- peterx

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 18/20] intel_iommu: enable vfio devices
  2017-01-23 18:03           ` Alex Williamson
@ 2017-01-24  7:22             ` Peter Xu
  2017-01-24 16:24               ` Alex Williamson
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2017-01-24  7:22 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Wang, tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel

On Mon, Jan 23, 2017 at 11:03:08AM -0700, Alex Williamson wrote:
> On Mon, 23 Jan 2017 11:34:29 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Mon, Jan 23, 2017 at 09:55:39AM +0800, Jason Wang wrote:
> > > 
> > > 
> > > On 2017年01月22日 17:04, Peter Xu wrote:  
> > > >On Sun, Jan 22, 2017 at 04:08:04PM +0800, Jason Wang wrote:
> > > >
> > > >[...]
> > > >  
> > > >>>+static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
> > > >>>+                                           uint16_t domain_id, hwaddr addr,
> > > >>>+                                           uint8_t am)
> > > >>>+{
> > > >>>+    IntelIOMMUNotifierNode *node;
> > > >>>+    VTDContextEntry ce;
> > > >>>+    int ret;
> > > >>>+
> > > >>>+    QLIST_FOREACH(node, &(s->notifiers_list), next) {
> > > >>>+        VTDAddressSpace *vtd_as = node->vtd_as;
> > > >>>+        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> > > >>>+                                       vtd_as->devfn, &ce);
> > > >>>+        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> > > >>>+            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
> > > >>>+                          vtd_page_invalidate_notify_hook,
> > > >>>+                          (void *)&vtd_as->iommu, true);  
> > > >>Why not simply trigger the notifier here? (or is this vfio required?)  
> > > >Because we may only want to notify part of the region - we are with
> > > >mask here, but not exact size.
> > > >
> > > >Consider this: guest (with caching mode) maps 12K memory (4K*3 pages),
> > > >the mask will be extended to 16K in the guest. In that case, we need
> > > >to explicitly go over the page entry to know that the 4th page should
> > > >not be notified.  
> > > 
> > > I see. Then it was required by vfio only, I think we can add a fast path for
> > > !CM in this case by triggering the notifier directly.  
> > 
> > I noted this down (to be further investigated in my todo), but I don't
> > know whether this can work, due to the fact that I think it is still
> > legal that guest merge more than one PSIs into one. For example, I
> > don't know whether below is legal:
> > 
> > - guest invalidate page (0, 4k)
> > - guest map new page (4k, 8k)
> > - guest send single PSI of (0, 8k)
> > 
> > In that case, it contains both map/unmap, and looks like it didn't
> > disobay the spec as well?
> 
> The topic of mapping and invalidation granularity also makes me
> slightly concerned with the abstraction we use for the type1 IOMMU
> backend.  With the "v2" type1 configuration we currently use in QEMU,
> the user may only unmap with the same minimum granularity with which
> the original mapping was created.  For instance if an iommu notifier
> map request gets to vfio with an 8k range, the resulting mapping can
> only be removed by an invalidation covering the full range.  Trying to
> bisect that original mapping by only invalidating 4k of the range will
> generate an error.

I see. Then this will be a strict requirement that we cannot do
coalescing during the page walk, at least for mappings.

I didn't notice this before, but luckily the current series follows
the rule above - we are basically doing the mapping in units of
pages. Normally we should always be mapping with 4K pages; only if
the guest provides huge pages in the VT-d page table would we notify
a map with >4K, and of course that can only be 2M/1G, never other
values.

The point is, the guest should be aware of the existence of the above
huge pages, so it won't unmap (for example) a single 4K region within
a 2M huge page range. It'll either keep the huge page, or unmap the
whole huge page. In that sense, we are quite safe.

(for my own curiosity, and off topic: could I ask why we can't do
 that? e.g., we map 4K*2 pages, then we unmap the first 4K page?)
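
To make the quoted 12K-vs-16K example concrete, here is a tiny
standalone illustration (not part of the series) of how a PSI (addr,
am) pair expands into a byte range - the size formula is the same one
used in the patch hunk quoted above, the base address is made up:

/* Illustration only: a PSI carries a base address and an address mask
 * "am"; the covered size is (1 << am) pages.  Mapping 3 pages (12K)
 * forces the guest to round up to am = 2, i.e. 4 pages (16K), so the
 * page walk has to discover that the 4th page is not actually mapped. */
#include <inttypes.h>
#include <stdio.h>

#define VTD_PAGE_SIZE 4096ULL

int main(void)
{
    uint64_t addr = 0x100000;                     /* hypothetical PSI base */
    uint8_t am = 2;                               /* rounded up from 3 pages */
    uint64_t size = (1ULL << am) * VTD_PAGE_SIZE; /* 16K */

    printf("PSI covers [0x%" PRIx64 ", 0x%" PRIx64 ")\n", addr, addr + size);
    return 0;
}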

> 
> I would think (but please confirm), that when we're only tracking
> mappings generated by the guest OS that this works.  If the guest OS
> maps with 4k pages, we get map notifies for each of those 4k pages.  If
> they use 2MB pages, we get 2MB ranges and invalidations will come in
> the same granularity.

I would agree (I haven't thought of a case where this might be a
problem).

> 
> An area of concern though is the replay mechanism in QEMU, I'll need to
> look for it in the code, but replaying an IOMMU domain into a new
> container *cannot* coalesce mappings or else it limits the granularity
> with which we can later accept unmaps. Take for instance a guest that
> has mapped a contiguous 2MB range with 4K pages.  They can unmap any 4K
> page within that range.  However if vfio gets a single 2MB mapping
> rather than 512 4K mappings, then the host IOMMU may use a hugepage
> mapping where our granularity is now 2MB.  Thanks,

Is this the answer to my question above (which is for my own
curiosity)? If so, that kind of explains it.

If it's just because vfio is smart enough to automatically use huge
pages when applicable (I believe it's for performance's sake), I am
not sure whether we can introduce an ioctl() to set up the iova_pgsizes
bitmap, as long as it is a subset of the supported iova_pgsizes (from
VFIO_IOMMU_GET_INFO) - then when people want to get rid of the above
limitation, they can explicitly set iova_pgsizes to only allow 4K
pages.

But, of course, this series can live well without it at least for now.

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 19/20] intel_iommu: unmap existing pages before replay
  2017-01-23 10:40   ` Jason Wang
@ 2017-01-24  7:31     ` Peter Xu
  2017-01-25  3:11       ` Jason Wang
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2017-01-24  7:31 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv

On Mon, Jan 23, 2017 at 06:40:12PM +0800, Jason Wang wrote:
> 
> 
> On 2017年01月20日 21:08, Peter Xu wrote:
> >  static int vtd_replay_hook(IOMMUTLBEntry *entry, void *private)
> >  {
> >      memory_region_notify_one((IOMMUNotifier *)private, entry);
> >@@ -2711,13 +2768,16 @@ static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
> >      if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
> >          /*
> >-         * Scanned a valid context entry, walk over the pages and
> >-         * notify when needed.
> >+         * Scanned a valid context entry, we first make sure to remove
> >+         * all existing mappings in old domain, by sending UNMAP to
> >+         * all the notifiers. Then, we walk over the pages and notify
> >+         * with existing mapped new entries in the new domain.
> >           */
> 
> A question is what if the context cache was invalidated but the device were
> not moved to a new domain. Then the code here does not do anything I
> believe?

Yes, it'll unmap all the stuff and remap it all. I think that's my
intention - can we really avoid this?

> I think we should move vtd_address_space_unmap() in the context
> entry invalidation processing.

IMHO we need this "whole unmap" thing not only for context entry
invalidation, but for all the places that need this replay, no? For
example, when we receive a domain flush.

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances
  2017-01-23 15:55 ` [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances Michael S. Tsirkin
@ 2017-01-24  7:40   ` Peter Xu
  0 siblings, 0 replies; 75+ messages in thread
From: Peter Xu @ 2017-01-24  7:40 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, tianyu.lan, kevin.tian, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

On Mon, Jan 23, 2017 at 05:55:51PM +0200, Michael S. Tsirkin wrote:
> On Fri, Jan 20, 2017 at 09:08:36PM +0800, Peter Xu wrote:
> > This is v4 of vt-d vfio enablement series.
> > 
> > Sorry that v4 growed to 20 patches. Some newly added patches (which
> > are quite necessary):
> > 
> > [01/20] vfio: trace map/unmap for notify as well
> > [02/20] vfio: introduce vfio_get_vaddr()
> > [03/20] vfio: allow to notify unmap for very large region
> > 
> >   Patches from RFC series:
> > 
> >   "[PATCH RFC 0/3] vfio: allow to notify unmap for very big region"
> > 
> >   Which is required by patch [19/20].
> > 
> > [11/20] memory: provide IOMMU_NOTIFIER_FOREACH macro
> > 
> >   A helper only.
> > 
> > [19/20] intel_iommu: unmap existing pages before replay
> > 
> >   This solves Alex's concern that there might have existing mappings
> >   in previous domain when replay happens.
> > 
> > [20/20] intel_iommu: replay even with DSI/GLOBAL inv desc
> > 
> >   This solves Jason/Kevin's concern by handling DSI/GLOBAL
> >   invalidations as well.
> > 
> > Each individual patch will have more detailed explanation on itself.
> > Please refer to each of them.
> > 
> > Here I did separate work on patch 19/20 rather than squashing them
> > into patch 18 for easier modification and review. I prefer we have
> > them separately so we can see each problem separately, after all,
> > patch 18 survives in most use cases. Please let me know if we want to
> > squash them in some way. I can respin when necessary.
> > 
> > Besides the big things, lots of tiny tweaks as well. Here's the
> > changelog.
> 
> It would be nice to add to the log
> - known issues / missing features, if any

Sure. Will add them in next post.

> - are there patches ready to be merged here?
>   if yes pls post them without the rfc tag

The series (since v1) has passed compilation and simple functional
tests (though only with my tiny C program to torture it). Since I
have got lots of review comments, and it looks like the whole thing
is acceptable in general, I'll repost a non-RFC version with some
tweaks on top of this one.

Michael, please feel free to pick any of them which you think are
applicable and safe (e.g., the iommu cleanups).

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 10/20] memory: add section range info for IOMMU notifier
  2017-01-23 19:12   ` Alex Williamson
@ 2017-01-24  7:48     ` Peter Xu
  0 siblings, 0 replies; 75+ messages in thread
From: Peter Xu @ 2017-01-24  7:48 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, bd.aviv

On Mon, Jan 23, 2017 at 12:12:44PM -0700, Alex Williamson wrote:
> On Fri, 20 Jan 2017 21:08:46 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > In this patch, IOMMUNotifier.{start|end} are introduced to store section
> > information for a specific notifier. When notification occurs, we not
> > only check the notification type (MAP|UNMAP), but also check whether the
> > notified iova is in the range of specific IOMMU notifier, and skip those
> > notifiers if not in the listened range.
> > 
> > When removing an region, we need to make sure we removed the correct
> > VFIOGuestIOMMU by checking the IOMMUNotifier.start address as well.
> > 
> > Suggested-by: David Gibson <david@gibson.dropbear.id.au>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > changelog (start from vt-d vfio enablement series v3):
> > v4:
> > - introduce memory_region_iommu_notifier_init() [Jason]
> > ---
> >  hw/vfio/common.c      | 12 +++++++++---
> >  hw/virtio/vhost.c     |  4 ++--
> >  include/exec/memory.h | 19 ++++++++++++++++++-
> >  memory.c              |  5 ++++-
> >  4 files changed, 33 insertions(+), 7 deletions(-)
> 
> 
> Acked-by: Alex Williamson <alex.williamson@redhat.com>

Thanks for the ack!

Sorry, I want to tune this patch a bit - I'll loosen the limit on
the range check. The original patch notifies only if the iova is inside
the range (start, end), while I am tuning it to let the notification
happen as long as (iova, size) and (start, end) have any overlap.
The diff against this one would be (for your reference):

------8<-------

diff --git a/memory.c b/memory.c
index 89104b1..80ab3c1 100644
--- a/memory.c
+++ b/memory.c
@@ -1672,9 +1672,15 @@ void memory_region_notify_iommu(MemoryRegion *mr,
     }

     QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
-        if (iommu_notifier->notifier_flags & request_flags &&
-            iommu_notifier->start <= entry.iova &&
-            iommu_notifier->end >= entry.iova) {
+        /*
+         * Skip the notification if the notification does not overlap
+         * with registered range.
+         */
+        if (iommu_notifier->start > entry.iova + entry.addr_mask + 1 ||
+            iommu_notifier->end < entry.iova) {
+            continue;
+        }
+        if (iommu_notifier->notifier_flags & request_flags) {
             iommu_notifier->notify(iommu_notifier, &entry);
         }
     }

------>8-------
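
For reference, with [start, end] inclusive and entry.addr_mask marking
the last byte offset an entry covers, the standard closed-interval
overlap test can be written as the following sketch (general
illustration only, not part of the patch; the helper name is made up):

/* Two inclusive ranges overlap iff each one starts no later than the
 * other one ends. */
static inline bool notifier_range_overlaps(hwaddr start, hwaddr end,
                                           hwaddr iova, hwaddr addr_mask)
{
    return start <= iova + addr_mask && iova <= end;
}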

I'll post the complete patch along with the series's next post.

Thanks,

-- peterx

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 18/20] intel_iommu: enable vfio devices
  2017-01-24  7:22             ` Peter Xu
@ 2017-01-24 16:24               ` Alex Williamson
  2017-01-25  4:04                 ` Peter Xu
  0 siblings, 1 reply; 75+ messages in thread
From: Alex Williamson @ 2017-01-24 16:24 UTC (permalink / raw)
  To: Peter Xu
  Cc: Jason Wang, tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel

On Tue, 24 Jan 2017 15:22:15 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Mon, Jan 23, 2017 at 11:03:08AM -0700, Alex Williamson wrote:
> > On Mon, 23 Jan 2017 11:34:29 +0800
> > Peter Xu <peterx@redhat.com> wrote:
> >   
> > > On Mon, Jan 23, 2017 at 09:55:39AM +0800, Jason Wang wrote:  
> > > > 
> > > > 
> > > > On 2017年01月22日 17:04, Peter Xu wrote:    
> > > > >On Sun, Jan 22, 2017 at 04:08:04PM +0800, Jason Wang wrote:
> > > > >
> > > > >[...]
> > > > >    
> > > > >>>+static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
> > > > >>>+                                           uint16_t domain_id, hwaddr addr,
> > > > >>>+                                           uint8_t am)
> > > > >>>+{
> > > > >>>+    IntelIOMMUNotifierNode *node;
> > > > >>>+    VTDContextEntry ce;
> > > > >>>+    int ret;
> > > > >>>+
> > > > >>>+    QLIST_FOREACH(node, &(s->notifiers_list), next) {
> > > > >>>+        VTDAddressSpace *vtd_as = node->vtd_as;
> > > > >>>+        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> > > > >>>+                                       vtd_as->devfn, &ce);
> > > > >>>+        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> > > > >>>+            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
> > > > >>>+                          vtd_page_invalidate_notify_hook,
> > > > >>>+                          (void *)&vtd_as->iommu, true);    
> > > > >>Why not simply trigger the notifier here? (or is this vfio required?)    
> > > > >Because we may only want to notify part of the region - we are with
> > > > >mask here, but not exact size.
> > > > >
> > > > >Consider this: guest (with caching mode) maps 12K memory (4K*3 pages),
> > > > >the mask will be extended to 16K in the guest. In that case, we need
> > > > >to explicitly go over the page entry to know that the 4th page should
> > > > >not be notified.    
> > > > 
> > > > I see. Then it was required by vfio only, I think we can add a fast path for
> > > > !CM in this case by triggering the notifier directly.    
> > > 
> > > I noted this down (to be further investigated in my todo), but I don't
> > > know whether this can work, due to the fact that I think it is still
> > > legal that guest merge more than one PSIs into one. For example, I
> > > don't know whether below is legal:
> > > 
> > > - guest invalidate page (0, 4k)
> > > - guest map new page (4k, 8k)
> > > - guest send single PSI of (0, 8k)
> > > 
> > > In that case, it contains both map/unmap, and looks like it didn't
> > > disobay the spec as well?  
> > 
> > The topic of mapping and invalidation granularity also makes me
> > slightly concerned with the abstraction we use for the type1 IOMMU
> > backend.  With the "v2" type1 configuration we currently use in QEMU,
> > the user may only unmap with the same minimum granularity with which
> > the original mapping was created.  For instance if an iommu notifier
> > map request gets to vfio with an 8k range, the resulting mapping can
> > only be removed by an invalidation covering the full range.  Trying to
> > bisect that original mapping by only invalidating 4k of the range will
> > generate an error.  
> 
> I see. Then this will be an strict requirement that we cannot do
> coalescing during page walk, at least for mappings.
> 
> I didn't notice this before, but luckily current series is following
> the rule above - we are basically doing the mapping in the unit of
> pages. Normally, we should always be mapping with 4K pages, only if
> guest provides huge pages in the VT-d page table, would we notify map
> with >4K, though of course it can be either 2M/1G but never other
> values.
> 
> The point is, guest should be aware of the existance of the above huge
> pages, so it won't unmap (for example) a single 4k region within a 2M
> huge page range. It'll either keep the huge page, or unmap the whole
> huge page. In that sense, we are quite safe.
> 
> (for my own curiousity and out of topic: could I ask why we can't do
>  that? e.g., we map 4K*2 pages, then we unmap the first 4K page?)

You understand why we can't do this in the hugepage case, right?  A
hugepage means that at least one entire level of the page table is
missing and that in order to unmap a subsection of it, we actually need
to replace it with a new page table level, which cannot be done
atomically relative to the rest of the PTEs in that entry.  Now what if
we don't assume that hugepages are only the Intel defined 2MB & 1GB?
AMD-Vi supports effectively arbitrary power of two page table entries.
So what if we've passed a 2x 4K mapping where the physical pages were
contiguous and vfio passed it as a direct 8K mapping to the IOMMU and
the IOMMU has native support for 8K mappings.  We're in a similar
scenario to the 2MB page, just with a different page table layout.

> > I would think (but please confirm), that when we're only tracking
> > mappings generated by the guest OS that this works.  If the guest OS
> > maps with 4k pages, we get map notifies for each of those 4k pages.  If
> > they use 2MB pages, we get 2MB ranges and invalidations will come in
> > the same granularity.  
> 
> I would agree (I haven't thought of a case that this might be a
> problem).
> 
> > 
> > An area of concern though is the replay mechanism in QEMU, I'll need to
> > look for it in the code, but replaying an IOMMU domain into a new
> > container *cannot* coalesce mappings or else it limits the granularity
> > with which we can later accept unmaps. Take for instance a guest that
> > has mapped a contiguous 2MB range with 4K pages.  They can unmap any 4K
> > page within that range.  However if vfio gets a single 2MB mapping
> > rather than 512 4K mappings, then the host IOMMU may use a hugepage
> > mapping where our granularity is now 2MB.  Thanks,  
> 
> Is this the answer of my above question (which is for my own
> curiosity)? If so, that'll kind of explain.
> 
> If it's just because vfio is smart enough on automatically using huge
> pages when applicable (I believe it's for performance's sake), not
> sure whether we can introduce a ioctl() to setup the iova_pgsizes
> bitmap, as long as it is a subset of supported iova_pgsizes (from
> VFIO_IOMMU_GET_INFO) - then when people wants to get rid of above
> limitation, they can explicitly set the iova_pgsizes to only allow 4K
> pages.
> 
> But, of course, this series can live well without it at least for now.

Yes, this is part of how vfio transparently makes use of hugepages in
the IOMMU, we effectively disregard the supported page sizes bitmap
(it's useless for anything other than determining the minimum page size
anyway), and instead pass through the largest range of iovas which are
physically contiguous.  The IOMMU driver can then make use of hugepages
where available.  The VFIO_IOMMU_MAP_DMA ioctl does include a flags
field where we could appropriate a bit to indicate map with minimum
granularity, but that would not be as simple as triggering the
disable_hugepages mapping path because the type1 driver would also need
to flag the internal vfio_dma as being bisectable, if not simply
converted to multiple vfio_dma structs internally.  Thanks,

Alex
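
For reference, the user-space side of the VFIO_IOMMU_MAP_DMA path
discussed above looks roughly like this (a minimal sketch with made-up
values and no error handling; only the documented
vfio_iommu_type1_dma_map fields are used):

#include <linux/vfio.h>
#include <sys/ioctl.h>
#include <string.h>

static int map_dma(int container_fd, void *vaddr, __u64 iova, __u64 size)
{
    struct vfio_iommu_type1_dma_map map;

    memset(&map, 0, sizeof(map));
    map.argsz = sizeof(map);
    map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    map.vaddr = (__u64)(unsigned long)vaddr;
    map.iova  = iova;
    map.size  = size;

    /* The type1 backend internally maps the largest physically
     * contiguous ranges it can, letting the host IOMMU use hugepages,
     * which is why a later unmap must cover at least the granularity
     * of the original mapping. */
    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}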

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 18/20] intel_iommu: enable vfio devices
  2017-01-23 19:40             ` Alex Williamson
@ 2017-01-25  1:19               ` Jason Wang
  2017-01-25  1:31                 ` Alex Williamson
  0 siblings, 1 reply; 75+ messages in thread
From: Jason Wang @ 2017-01-25  1:19 UTC (permalink / raw)
  To: Alex Williamson
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel, Peter Xu



On 2017年01月24日 03:40, Alex Williamson wrote:
> On Mon, 23 Jan 2017 18:23:44 +0800
> Jason Wang<jasowang@redhat.com>  wrote:
>
>> On 2017年01月23日 11:34, Peter Xu wrote:
>>> On Mon, Jan 23, 2017 at 09:55:39AM +0800, Jason Wang wrote:
>>>> On 2017年01月22日 17:04, Peter Xu wrote:
>>>>> On Sun, Jan 22, 2017 at 04:08:04PM +0800, Jason Wang wrote:
>>>>>
>>>>> [...]
>>>>>   
>>>>>>> +static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
>>>>>>> +                                           uint16_t domain_id, hwaddr addr,
>>>>>>> +                                           uint8_t am)
>>>>>>> +{
>>>>>>> +    IntelIOMMUNotifierNode *node;
>>>>>>> +    VTDContextEntry ce;
>>>>>>> +    int ret;
>>>>>>> +
>>>>>>> +    QLIST_FOREACH(node, &(s->notifiers_list), next) {
>>>>>>> +        VTDAddressSpace *vtd_as = node->vtd_as;
>>>>>>> +        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>>>>>>> +                                       vtd_as->devfn, &ce);
>>>>>>> +        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
>>>>>>> +            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
>>>>>>> +                          vtd_page_invalidate_notify_hook,
>>>>>>> +                          (void *)&vtd_as->iommu, true);
>>>>>> Why not simply trigger the notifier here? (or is this vfio required?)
>>>>> Because we may only want to notify part of the region - we are with
>>>>> mask here, but not exact size.
>>>>>
>>>>> Consider this: guest (with caching mode) maps 12K memory (4K*3 pages),
>>>>> the mask will be extended to 16K in the guest. In that case, we need
>>>>> to explicitly go over the page entry to know that the 4th page should
>>>>> not be notified.
>>>> I see. Then it was required by vfio only, I think we can add a fast path for
>>>> !CM in this case by triggering the notifier directly.
>>> I noted this down (to be further investigated in my todo), but I don't
>>> know whether this can work, due to the fact that I think it is still
>>> legal that guest merge more than one PSIs into one. For example, I
>>> don't know whether below is legal:
>>>
>>> - guest invalidate page (0, 4k)
>>> - guest map new page (4k, 8k)
>>> - guest send single PSI of (0, 8k)
>>>
>>> In that case, it contains both map/unmap, and looks like it didn't
>>> disobay the spec as well?
>> Not sure I get your meaning, you mean just send single PSI instead of two?
>>
>>>   
>>>> Another possible issue is, consider (with CM) a 16K contiguous iova with the
>>>> last page has already been mapped. In this case, if we want to map first
>>>> three pages, when handling IOTLB invalidation, am would be 16K, then the
>>>> last page will be mapped twice. Can this lead some issue?
>>> I don't know whether guest has special handling of this kind of
>>> request.
>> This seems quite usual I think? E.g iommu_flush_iotlb_psi() did:
>>
>> static void iommu_flush_iotlb_psi(struct intel_iommu *iommu,
>>                     struct dmar_domain *domain,
>>                     unsigned long pfn, unsigned int pages,
>>                     int ih, int map)
>> {
>>       unsigned int mask = ilog2(__roundup_pow_of_two(pages));
>>       uint64_t addr = (uint64_t)pfn << VTD_PAGE_SHIFT;
>>       u16 did = domain->iommu_did[iommu->seq_id];
>> ...
>>
>>
>>> Besides, imho to completely solve this problem, we still need that
>>> per-domain tree. Considering that currently the tree is inside vfio, I
>>> see this not a big issue as well.
>> Another issue I found is: with this series, VFIO_IOMMU_MAP_DMA seems
>> become guest trigger-able. And since VFIO allocate its own structure to
>> record dma mapping, this seems open a window for evil guest to exhaust
>> host memory which is even worse.
> You're thinking of pci-assign, vfio does page accounting such that a
> user can only lock pages up to their locked memory limit.  Exposing the
> mapping ioctl within the guest is not a different problem from exposing
> the ioctl to the host user from a vfio perspective.  Thanks,
>
> Alex
>

Yes, but what about an evil guest that maps all iovas to the same gpa?

Thanks

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 18/20] intel_iommu: enable vfio devices
  2017-01-25  1:19               ` Jason Wang
@ 2017-01-25  1:31                 ` Alex Williamson
  2017-01-25  7:41                   ` Jason Wang
  0 siblings, 1 reply; 75+ messages in thread
From: Alex Williamson @ 2017-01-25  1:31 UTC (permalink / raw)
  To: Jason Wang
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel, Peter Xu

On Wed, 25 Jan 2017 09:19:25 +0800
Jason Wang <jasowang@redhat.com> wrote:

> On 2017年01月24日 03:40, Alex Williamson wrote:
> > On Mon, 23 Jan 2017 18:23:44 +0800
> > Jason Wang<jasowang@redhat.com>  wrote:
> >  
> >> On 2017年01月23日 11:34, Peter Xu wrote:  
> >>> On Mon, Jan 23, 2017 at 09:55:39AM +0800, Jason Wang wrote:  
> >>>> On 2017年01月22日 17:04, Peter Xu wrote:  
> >>>>> On Sun, Jan 22, 2017 at 04:08:04PM +0800, Jason Wang wrote:
> >>>>>
> >>>>> [...]
> >>>>>     
> >>>>>>> +static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
> >>>>>>> +                                           uint16_t domain_id, hwaddr addr,
> >>>>>>> +                                           uint8_t am)
> >>>>>>> +{
> >>>>>>> +    IntelIOMMUNotifierNode *node;
> >>>>>>> +    VTDContextEntry ce;
> >>>>>>> +    int ret;
> >>>>>>> +
> >>>>>>> +    QLIST_FOREACH(node, &(s->notifiers_list), next) {
> >>>>>>> +        VTDAddressSpace *vtd_as = node->vtd_as;
> >>>>>>> +        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> >>>>>>> +                                       vtd_as->devfn, &ce);
> >>>>>>> +        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> >>>>>>> +            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
> >>>>>>> +                          vtd_page_invalidate_notify_hook,
> >>>>>>> +                          (void *)&vtd_as->iommu, true);  
> >>>>>> Why not simply trigger the notifier here? (or is this vfio required?)  
> >>>>> Because we may only want to notify part of the region - we are with
> >>>>> mask here, but not exact size.
> >>>>>
> >>>>> Consider this: guest (with caching mode) maps 12K memory (4K*3 pages),
> >>>>> the mask will be extended to 16K in the guest. In that case, we need
> >>>>> to explicitly go over the page entry to know that the 4th page should
> >>>>> not be notified.  
> >>>> I see. Then it was required by vfio only, I think we can add a fast path for
> >>>> !CM in this case by triggering the notifier directly.  
> >>> I noted this down (to be further investigated in my todo), but I don't
> >>> know whether this can work, due to the fact that I think it is still
> >>> legal that guest merge more than one PSIs into one. For example, I
> >>> don't know whether below is legal:
> >>>
> >>> - guest invalidate page (0, 4k)
> >>> - guest map new page (4k, 8k)
> >>> - guest send single PSI of (0, 8k)
> >>>
> >>> In that case, it contains both map/unmap, and looks like it didn't
> >>> disobay the spec as well?  
> >> Not sure I get your meaning, you mean just send single PSI instead of two?
> >>  
> >>>     
> >>>> Another possible issue is, consider (with CM) a 16K contiguous iova with the
> >>>> last page has already been mapped. In this case, if we want to map first
> >>>> three pages, when handling IOTLB invalidation, am would be 16K, then the
> >>>> last page will be mapped twice. Can this lead some issue?  
> >>> I don't know whether guest has special handling of this kind of
> >>> request.  
> >> This seems quite usual I think? E.g iommu_flush_iotlb_psi() did:
> >>
> >> static void iommu_flush_iotlb_psi(struct intel_iommu *iommu,
> >>                     struct dmar_domain *domain,
> >>                     unsigned long pfn, unsigned int pages,
> >>                     int ih, int map)
> >> {
> >>       unsigned int mask = ilog2(__roundup_pow_of_two(pages));
> >>       uint64_t addr = (uint64_t)pfn << VTD_PAGE_SHIFT;
> >>       u16 did = domain->iommu_did[iommu->seq_id];
> >> ...
> >>
> >>  
> >>> Besides, imho to completely solve this problem, we still need that
> >>> per-domain tree. Considering that currently the tree is inside vfio, I
> >>> see this not a big issue as well.  
> >> Another issue I found is: with this series, VFIO_IOMMU_MAP_DMA seems
> >> become guest trigger-able. And since VFIO allocate its own structure to
> >> record dma mapping, this seems open a window for evil guest to exhaust
> >> host memory which is even worse.  
> > You're thinking of pci-assign, vfio does page accounting such that a
> > user can only lock pages up to their locked memory limit.  Exposing the
> > mapping ioctl within the guest is not a different problem from exposing
> > the ioctl to the host user from a vfio perspective.  Thanks,
> >
> > Alex
> >  
> 
> Yes, but what if an evil guest that maps all iovas to the same gpa?

Doesn't matter, we'd account that gpa each time it's mapped, so
effectively the locked memory limit is equal to the iova size the user
can map.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 16/20] intel_iommu: do replay when context invalidate
  2017-01-24  4:52     ` Peter Xu
@ 2017-01-25  3:09       ` Jason Wang
  2017-01-25  3:46         ` Peter Xu
  0 siblings, 1 reply; 75+ messages in thread
From: Jason Wang @ 2017-01-25  3:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel,
	alex.williamson



On 2017年01月24日 12:52, Peter Xu wrote:
> On Mon, Jan 23, 2017 at 06:36:17PM +0800, Jason Wang wrote:
>>
>> On 2017年01月20日 21:08, Peter Xu wrote:
>>> Before this one we only invalidate context cache when we receive context
>>> entry invalidations. However it's possible that the invalidation also
>>> contains a domain switch (only if cache-mode is enabled for vIOMMU). In
>>> that case we need to notify all the registered components about the new
>>> mapping.
>>>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>>   hw/i386/intel_iommu.c | 10 ++++++++++
>>>   1 file changed, 10 insertions(+)
>>>
>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>> index f9c5142..4b08b4d 100644
>>> --- a/hw/i386/intel_iommu.c
>>> +++ b/hw/i386/intel_iommu.c
>>> @@ -1146,6 +1146,16 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
>>>                   trace_vtd_inv_desc_cc_device(bus_n, VTD_PCI_SLOT(devfn_it),
>>>                                                VTD_PCI_FUNC(devfn_it));
>>>                   vtd_as->context_cache_entry.context_cache_gen = 0;
>>> +                /*
>>> +                 * So a device is moving out of (or moving into) a
>>> +                 * domain, a replay() suites here to notify all the
>>> +                 * IOMMU_NOTIFIER_MAP registers about this change.
>>> +                 * This won't bring bad even if we have no such
>>> +                 * notifier registered - the IOMMU notification
>>> +                 * framework will skip MAP notifications if that
>>> +                 * happened.
>>> +                 */
>>> +                memory_region_iommu_replay_all(&vtd_as->iommu);
>> DSI and GLOBAL questions come back again or no need to care about them :) ?
> DSI/GLOBAL hanldings are in patch 20 (though it'll be squashed into 18
> in my next post). Is that what you mean above?

Seems not, I mean DSI/GLOBAL for context cache invalidation instead of 
IOTLB :)

Thanks

>
> Thanks!
>
> -- peterx
>

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 19/20] intel_iommu: unmap existing pages before replay
  2017-01-24  7:31     ` Peter Xu
@ 2017-01-25  3:11       ` Jason Wang
  2017-01-25  4:15         ` Peter Xu
  0 siblings, 1 reply; 75+ messages in thread
From: Jason Wang @ 2017-01-25  3:11 UTC (permalink / raw)
  To: Peter Xu
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel,
	alex.williamson



On 2017年01月24日 15:31, Peter Xu wrote:
> On Mon, Jan 23, 2017 at 06:40:12PM +0800, Jason Wang wrote:
>> On 2017年01月20日 21:08, Peter Xu wrote:
>>>   static int vtd_replay_hook(IOMMUTLBEntry *entry, void *private)
>>>   {
>>>       memory_region_notify_one((IOMMUNotifier *)private, entry);
>>> @@ -2711,13 +2768,16 @@ static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
>>>       if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
>>>           /*
>>> -         * Scanned a valid context entry, walk over the pages and
>>> -         * notify when needed.
>>> +         * Scanned a valid context entry, we first make sure to remove
>>> +         * all existing mappings in old domain, by sending UNMAP to
>>> +         * all the notifiers. Then, we walk over the pages and notify
>>> +         * with existing mapped new entries in the new domain.
>>>            */
>> A question is what if the context cache was invalidated but the device were
>> not moved to a new domain. Then the code here does not do anything I
>> believe?
> Yes, it'll unmap all the stuffs and remap them all. I think that's my
> intention, and can we really avoid this?
>
>> I think we should move vtd_address_space_unmap() in the context
>> entry invalidation processing.
> IMHO we need this "whole umap" thing not only for context entry
> invalidation, but all the places that need this replay, no? For
> example, when we receive domain flush.
>
> Thanks,
>
> -- peterx
>

Consider the case where we move a device from domain A to no domain.
Looks like the current code does nothing, since it cannot get a valid
context entry during replay?

Thanks

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 16/20] intel_iommu: do replay when context invalidate
  2017-01-25  3:09       ` Jason Wang
@ 2017-01-25  3:46         ` Peter Xu
  2017-01-25  6:37           ` Tian, Kevin
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2017-01-25  3:46 UTC (permalink / raw)
  To: Jason Wang
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel,
	alex.williamson

On Wed, Jan 25, 2017 at 11:09:39AM +0800, Jason Wang wrote:
> 
> 
> On 2017年01月24日 12:52, Peter Xu wrote:
> >On Mon, Jan 23, 2017 at 06:36:17PM +0800, Jason Wang wrote:
> >>
> >>On 2017年01月20日 21:08, Peter Xu wrote:
> >>>Before this one we only invalidate context cache when we receive context
> >>>entry invalidations. However it's possible that the invalidation also
> >>>contains a domain switch (only if cache-mode is enabled for vIOMMU). In
> >>>that case we need to notify all the registered components about the new
> >>>mapping.
> >>>
> >>>Signed-off-by: Peter Xu <peterx@redhat.com>
> >>>---
> >>>  hw/i386/intel_iommu.c | 10 ++++++++++
> >>>  1 file changed, 10 insertions(+)
> >>>
> >>>diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> >>>index f9c5142..4b08b4d 100644
> >>>--- a/hw/i386/intel_iommu.c
> >>>+++ b/hw/i386/intel_iommu.c
> >>>@@ -1146,6 +1146,16 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
> >>>                  trace_vtd_inv_desc_cc_device(bus_n, VTD_PCI_SLOT(devfn_it),
> >>>                                               VTD_PCI_FUNC(devfn_it));
> >>>                  vtd_as->context_cache_entry.context_cache_gen = 0;
> >>>+                /*
> >>>+                 * So a device is moving out of (or moving into) a
> >>>+                 * domain, a replay() suites here to notify all the
> >>>+                 * IOMMU_NOTIFIER_MAP registers about this change.
> >>>+                 * This won't bring bad even if we have no such
> >>>+                 * notifier registered - the IOMMU notification
> >>>+                 * framework will skip MAP notifications if that
> >>>+                 * happened.
> >>>+                 */
> >>>+                memory_region_iommu_replay_all(&vtd_as->iommu);
> >>DSI and GLOBAL questions come back again or no need to care about them :) ?
> >DSI/GLOBAL hanldings are in patch 20 (though it'll be squashed into 18
> >in my next post). Is that what you mean above?
> 
> Seems not, I mean DSI/GLOBAL for context cache invalidation instead of IOTLB
> :)

Yes, I should possibly do the same thing for context cache global
invalidations. IIUC context global invalidation should be a superset
of iotlb invalidation, so maybe I'll add one more patch to call iotlb
invalidation in context invalidation as well. Kevin/others, please
correct me if I misunderstood the spec. Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 18/20] intel_iommu: enable vfio devices
  2017-01-24 16:24               ` Alex Williamson
@ 2017-01-25  4:04                 ` Peter Xu
  0 siblings, 0 replies; 75+ messages in thread
From: Peter Xu @ 2017-01-25  4:04 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Wang, tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel

On Tue, Jan 24, 2017 at 09:24:29AM -0700, Alex Williamson wrote:

[...]

> > I see. Then this will be an strict requirement that we cannot do
> > coalescing during page walk, at least for mappings.
> > 
> > I didn't notice this before, but luckily current series is following
> > the rule above - we are basically doing the mapping in the unit of
> > pages. Normally, we should always be mapping with 4K pages, only if
> > guest provides huge pages in the VT-d page table, would we notify map
> > with >4K, though of course it can be either 2M/1G but never other
> > values.
> > 
> > The point is, guest should be aware of the existance of the above huge
> > pages, so it won't unmap (for example) a single 4k region within a 2M
> > huge page range. It'll either keep the huge page, or unmap the whole
> > huge page. In that sense, we are quite safe.
> > 
> > (for my own curiousity and out of topic: could I ask why we can't do
> >  that? e.g., we map 4K*2 pages, then we unmap the first 4K page?)
> 
> You understand why we can't do this in the hugepage case, right?  A
> hugepage means that at least one entire level of the page table is
> missing and that in order to unmap a subsection of it, we actually need
> to replace it with a new page table level, which cannot be done
> atomically relative to the rest of the PTEs in that entry.  Now what if
> we don't assume that hugepages are only the Intel defined 2MB & 1GB?
> AMD-Vi supports effectively arbitrary power of two page table entries.
> So what if we've passed a 2x 4K mapping where the physical pages were
> contiguous and vfio passed it as a direct 8K mapping to the IOMMU and
> the IOMMU has native support for 8K mappings.  We're in a similar
> scenario as the 2MB page, different page table layout though.

Thanks for the explanation. The AMD example is clear.

> 
> > > I would think (but please confirm), that when we're only tracking
> > > mappings generated by the guest OS that this works.  If the guest OS
> > > maps with 4k pages, we get map notifies for each of those 4k pages.  If
> > > they use 2MB pages, we get 2MB ranges and invalidations will come in
> > > the same granularity.  
> > 
> > I would agree (I haven't thought of a case that this might be a
> > problem).
> > 
> > > 
> > > An area of concern though is the replay mechanism in QEMU, I'll need to
> > > look for it in the code, but replaying an IOMMU domain into a new
> > > container *cannot* coalesce mappings or else it limits the granularity
> > > with which we can later accept unmaps. Take for instance a guest that
> > > has mapped a contiguous 2MB range with 4K pages.  They can unmap any 4K
> > > page within that range.  However if vfio gets a single 2MB mapping
> > > rather than 512 4K mappings, then the host IOMMU may use a hugepage
> > > mapping where our granularity is now 2MB.  Thanks,  
> > 
> > Is this the answer of my above question (which is for my own
> > curiosity)? If so, that'll kind of explain.
> > 
> > If it's just because vfio is smart enough on automatically using huge
> > pages when applicable (I believe it's for performance's sake), not
> > sure whether we can introduce a ioctl() to setup the iova_pgsizes
> > bitmap, as long as it is a subset of supported iova_pgsizes (from
> > VFIO_IOMMU_GET_INFO) - then when people wants to get rid of above
> > limitation, they can explicitly set the iova_pgsizes to only allow 4K
> > pages.
> > 
> > But, of course, this series can live well without it at least for now.
> 
> Yes, this is part of how vfio transparently makes use of hugepages in
> the IOMMU, we effectively disregard the supported page sizes bitmap
> (it's useless for anything other than determining the minimum page size
> anyway), and instead pass through the largest range of iovas which are
> physically contiguous.  The IOMMU driver can then make use of hugepages
> where available.  The VFIO_IOMMU_MAP_DMA ioctl does include a flags
> field where we could appropriate a bit to indicate map with minimum
> granularity, but that would not be as simple as triggering the
> disable_hugepages mapping path because the type1 driver would also need
> to flag the internal vfio_dma as being bisectable, if not simply
> converted to multiple vfio_dma structs internally.  Thanks,

I see, thanks!

-- peterx

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 19/20] intel_iommu: unmap existing pages before replay
  2017-01-25  3:11       ` Jason Wang
@ 2017-01-25  4:15         ` Peter Xu
  0 siblings, 0 replies; 75+ messages in thread
From: Peter Xu @ 2017-01-25  4:15 UTC (permalink / raw)
  To: Jason Wang
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel,
	alex.williamson

On Wed, Jan 25, 2017 at 11:11:30AM +0800, Jason Wang wrote:
> 
> 
> On 2017年01月24日 15:31, Peter Xu wrote:
> >On Mon, Jan 23, 2017 at 06:40:12PM +0800, Jason Wang wrote:
> >>On 2017年01月20日 21:08, Peter Xu wrote:
> >>>  static int vtd_replay_hook(IOMMUTLBEntry *entry, void *private)
> >>>  {
> >>>      memory_region_notify_one((IOMMUNotifier *)private, entry);
> >>>@@ -2711,13 +2768,16 @@ static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
> >>>      if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
> >>>          /*
> >>>-         * Scanned a valid context entry, walk over the pages and
> >>>-         * notify when needed.
> >>>+         * Scanned a valid context entry, we first make sure to remove
> >>>+         * all existing mappings in old domain, by sending UNMAP to
> >>>+         * all the notifiers. Then, we walk over the pages and notify
> >>>+         * with existing mapped new entries in the new domain.
> >>>           */
> >>A question is what if the context cache was invalidated but the device were
> >>not moved to a new domain. Then the code here does not do anything I
> >>believe?
> >Yes, it'll unmap all the stuffs and remap them all. I think that's my
> >intention, and can we really avoid this?
> >
> >>I think we should move vtd_address_space_unmap() in the context
> >>entry invalidation processing.
> >IMHO we need this "whole umap" thing not only for context entry
> >invalidation, but all the places that need this replay, no? For
> >example, when we receive domain flush.
> >
> >Thanks,
> >
> >-- peterx
> >
> 
> Consider the case that we move device from domain A to no domain. Looks like
> current code did nothing since it can not get a valid context entry during
> replay?

Right. I should do the "whole region unmap" thing even without a valid
context entry. Will fix it in the next post. Thanks,

-- peterx
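
A possible shape of that fix, as a rough sketch reusing the helper
names already quoted in this thread (the range and argument details are
simplified; this is not the actual next-version patch):

static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
{
    VTDAddressSpace *vtd_as = container_of(mr, VTDAddressSpace, iommu);
    IntelIOMMUState *s = vtd_as->iommu_state;
    uint8_t bus_n = pci_bus_num(vtd_as->bus);
    VTDContextEntry ce;

    /* Always drop the old mappings first, even when the device no
     * longer has a valid context entry (e.g. moved from domain A to
     * no domain), so stale mappings cannot survive in the notifier. */
    vtd_address_space_unmap(vtd_as, n);

    if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
        /* Walk the new domain's page table and notify MAP entries. */
        vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false);
    }
}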

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 16/20] intel_iommu: do replay when context invalidate
  2017-01-25  3:46         ` Peter Xu
@ 2017-01-25  6:37           ` Tian, Kevin
  2017-01-25  6:44             ` Peter Xu
  0 siblings, 1 reply; 75+ messages in thread
From: Tian, Kevin @ 2017-01-25  6:37 UTC (permalink / raw)
  To: Peter Xu, Jason Wang
  Cc: Lan, Tianyu, mst, jan.kiszka, bd.aviv, qemu-devel, alex.williamson

> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Wednesday, January 25, 2017 11:46 AM
> 
> On Wed, Jan 25, 2017 at 11:09:39AM +0800, Jason Wang wrote:
> >
> >
> > On 2017年01月24日 12:52, Peter Xu wrote:
> > >On Mon, Jan 23, 2017 at 06:36:17PM +0800, Jason Wang wrote:
> > >>
> > >>On 2017年01月20日 21:08, Peter Xu wrote:
> > >>>Before this one we only invalidate context cache when we receive context
> > >>>entry invalidations. However it's possible that the invalidation also
> > >>>contains a domain switch (only if cache-mode is enabled for vIOMMU). In
> > >>>that case we need to notify all the registered components about the new
> > >>>mapping.
> > >>>
> > >>>Signed-off-by: Peter Xu <peterx@redhat.com>
> > >>>---
> > >>>  hw/i386/intel_iommu.c | 10 ++++++++++
> > >>>  1 file changed, 10 insertions(+)
> > >>>
> > >>>diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > >>>index f9c5142..4b08b4d 100644
> > >>>--- a/hw/i386/intel_iommu.c
> > >>>+++ b/hw/i386/intel_iommu.c
> > >>>@@ -1146,6 +1146,16 @@ static void
> vtd_context_device_invalidate(IntelIOMMUState *s,
> > >>>                  trace_vtd_inv_desc_cc_device(bus_n, VTD_PCI_SLOT(devfn_it),
> > >>>                                               VTD_PCI_FUNC(devfn_it));
> > >>>                  vtd_as->context_cache_entry.context_cache_gen = 0;
> > >>>+                /*
> > >>>+                 * So a device is moving out of (or moving into) a
> > >>>+                 * domain, a replay() suites here to notify all the
> > >>>+                 * IOMMU_NOTIFIER_MAP registers about this change.
> > >>>+                 * This won't bring bad even if we have no such
> > >>>+                 * notifier registered - the IOMMU notification
> > >>>+                 * framework will skip MAP notifications if that
> > >>>+                 * happened.
> > >>>+                 */
> > >>>+                memory_region_iommu_replay_all(&vtd_as->iommu);
> > >>DSI and GLOBAL questions come back again or no need to care about them :) ?
> > >DSI/GLOBAL hanldings are in patch 20 (though it'll be squashed into 18
> > >in my next post). Is that what you mean above?
> >
> > Seems not, I mean DSI/GLOBAL for context cache invalidation instead of IOTLB
> > :)
> 
> Yes, I should possibly do the same thing for context cache global
> invalidations. IIUC context global invalidation should be a superset
> of iotlb invalidation, so maybe I'll add one more patch to call iotlb
> invalidation in context invalidation as well. Kevin/others, please
> correct me if I misunderstood the spec. Thanks,
> 

Context invalidation is not a superset of iotlb invalidation. The spec
just requires software to always follow a context-cache invalidation
with a PASID-cache invalidation, followed by an IOTLB invalidation.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 16/20] intel_iommu: do replay when context invalidate
  2017-01-25  6:37           ` Tian, Kevin
@ 2017-01-25  6:44             ` Peter Xu
  2017-01-25  7:45               ` Jason Wang
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Xu @ 2017-01-25  6:44 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Wang, Lan, Tianyu, mst, jan.kiszka, bd.aviv, qemu-devel,
	alex.williamson

On Wed, Jan 25, 2017 at 06:37:23AM +0000, Tian, Kevin wrote:
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Wednesday, January 25, 2017 11:46 AM
> > 
> > On Wed, Jan 25, 2017 at 11:09:39AM +0800, Jason Wang wrote:
> > >
> > >
> > > On 2017年01月24日 12:52, Peter Xu wrote:
> > > >On Mon, Jan 23, 2017 at 06:36:17PM +0800, Jason Wang wrote:
> > > >>
> > > >>On 2017年01月20日 21:08, Peter Xu wrote:
> > > >>>Before this one we only invalidate context cache when we receive context
> > > >>>entry invalidations. However it's possible that the invalidation also
> > > >>>contains a domain switch (only if cache-mode is enabled for vIOMMU). In
> > > >>>that case we need to notify all the registered components about the new
> > > >>>mapping.
> > > >>>
> > > >>>Signed-off-by: Peter Xu <peterx@redhat.com>
> > > >>>---
> > > >>>  hw/i386/intel_iommu.c | 10 ++++++++++
> > > >>>  1 file changed, 10 insertions(+)
> > > >>>
> > > >>>diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > >>>index f9c5142..4b08b4d 100644
> > > >>>--- a/hw/i386/intel_iommu.c
> > > >>>+++ b/hw/i386/intel_iommu.c
> > > >>>@@ -1146,6 +1146,16 @@ static void
> > vtd_context_device_invalidate(IntelIOMMUState *s,
> > > >>>                  trace_vtd_inv_desc_cc_device(bus_n, VTD_PCI_SLOT(devfn_it),
> > > >>>                                               VTD_PCI_FUNC(devfn_it));
> > > >>>                  vtd_as->context_cache_entry.context_cache_gen = 0;
> > > >>>+                /*
> > > >>>+                 * So a device is moving out of (or moving into) a
> > > >>>+                 * domain, a replay() suites here to notify all the
> > > >>>+                 * IOMMU_NOTIFIER_MAP registers about this change.
> > > >>>+                 * This won't bring bad even if we have no such
> > > >>>+                 * notifier registered - the IOMMU notification
> > > >>>+                 * framework will skip MAP notifications if that
> > > >>>+                 * happened.
> > > >>>+                 */
> > > >>>+                memory_region_iommu_replay_all(&vtd_as->iommu);
> > > >>DSI and GLOBAL questions come back again or no need to care about them :) ?
> > > >DSI/GLOBAL hanldings are in patch 20 (though it'll be squashed into 18
> > > >in my next post). Is that what you mean above?
> > >
> > > Seems not, I mean DSI/GLOBAL for context cache invalidation instead of IOTLB
> > > :)
> > 
> > Yes, I should possibly do the same thing for context cache global
> > invalidations. IIUC context global invalidation should be a superset
> > of iotlb invalidation, so maybe I'll add one more patch to call iotlb
> > invalidation in context invalidation as well. Kevin/others, please
> > correct me if I misunderstood the spec. Thanks,
> > 
> 
> context invalidation is not superset of iotlb invalidation. The spec just
> requires software to always follow a context-cache invalidation with
> a PASID-cache invalidation, followed by an IOTLB invalidation.

Thanks for pointing that out. If so, it looks like the current version
suffices for this, right? (So no further change is needed for this one.)

-- peterx

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 18/20] intel_iommu: enable vfio devices
  2017-01-25  1:31                 ` Alex Williamson
@ 2017-01-25  7:41                   ` Jason Wang
  0 siblings, 0 replies; 75+ messages in thread
From: Jason Wang @ 2017-01-25  7:41 UTC (permalink / raw)
  To: Alex Williamson
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel, Peter Xu



On 2017年01月25日 09:31, Alex Williamson wrote:
> On Wed, 25 Jan 2017 09:19:25 +0800
> Jason Wang <jasowang@redhat.com> wrote:
>
>> On 2017年01月24日 03:40, Alex Williamson wrote:
>>> On Mon, 23 Jan 2017 18:23:44 +0800
>>> Jason Wang<jasowang@redhat.com>  wrote:
>>>   
>>>> On 2017年01月23日 11:34, Peter Xu wrote:
>>>>> On Mon, Jan 23, 2017 at 09:55:39AM +0800, Jason Wang wrote:
>>>>>> On 2017年01月22日 17:04, Peter Xu wrote:
>>>>>>> On Sun, Jan 22, 2017 at 04:08:04PM +0800, Jason Wang wrote:
>>>>>>>
>>>>>>> [...]
>>>>>>>      
>>>>>>>>> +static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
>>>>>>>>> +                                           uint16_t domain_id, hwaddr addr,
>>>>>>>>> +                                           uint8_t am)
>>>>>>>>> +{
>>>>>>>>> +    IntelIOMMUNotifierNode *node;
>>>>>>>>> +    VTDContextEntry ce;
>>>>>>>>> +    int ret;
>>>>>>>>> +
>>>>>>>>> +    QLIST_FOREACH(node, &(s->notifiers_list), next) {
>>>>>>>>> +        VTDAddressSpace *vtd_as = node->vtd_as;
>>>>>>>>> +        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>>>>>>>>> +                                       vtd_as->devfn, &ce);
>>>>>>>>> +        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
>>>>>>>>> +            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
>>>>>>>>> +                          vtd_page_invalidate_notify_hook,
>>>>>>>>> +                          (void *)&vtd_as->iommu, true);
>>>>>>>> Why not simply trigger the notifier here? (or is this vfio required?)
>>>>>>> Because we may only want to notify part of the region - we are with
>>>>>>> mask here, but not exact size.
>>>>>>>
>>>>>>> Consider this: guest (with caching mode) maps 12K memory (4K*3 pages),
>>>>>>> the mask will be extended to 16K in the guest. In that case, we need
>>>>>>> to explicitly go over the page entry to know that the 4th page should
>>>>>>> not be notified.
>>>>>> I see. Then it was required by vfio only, I think we can add a fast path for
>>>>>> !CM in this case by triggering the notifier directly.
>>>>> I noted this down (to be further investigated in my todo), but I don't
>>>>> know whether this can work, due to the fact that I think it is still
>>>>> legal that guest merge more than one PSIs into one. For example, I
>>>>> don't know whether below is legal:
>>>>>
>>>>> - guest invalidate page (0, 4k)
>>>>> - guest map new page (4k, 8k)
>>>>> - guest send single PSI of (0, 8k)
>>>>>
>>>>> In that case, it contains both map/unmap, and looks like it didn't
>>>>> disobay the spec as well?
>>>> Not sure I get your meaning, you mean just send single PSI instead of two?
>>>>   
>>>>>      
>>>>>> Another possible issue is, consider (with CM) a 16K contiguous iova with the
>>>>>> last page has already been mapped. In this case, if we want to map first
>>>>>> three pages, when handling IOTLB invalidation, am would be 16K, then the
>>>>>> last page will be mapped twice. Can this lead some issue?
>>>>> I don't know whether guest has special handling of this kind of
>>>>> request.
>>>> This seems quite usual I think? E.g iommu_flush_iotlb_psi() did:
>>>>
>>>> static void iommu_flush_iotlb_psi(struct intel_iommu *iommu,
>>>>                      struct dmar_domain *domain,
>>>>                      unsigned long pfn, unsigned int pages,
>>>>                      int ih, int map)
>>>> {
>>>>        unsigned int mask = ilog2(__roundup_pow_of_two(pages));
>>>>        uint64_t addr = (uint64_t)pfn << VTD_PAGE_SHIFT;
>>>>        u16 did = domain->iommu_did[iommu->seq_id];
>>>> ...
>>>>
>>>>   
>>>>> Besides, imho to completely solve this problem, we still need that
>>>>> per-domain tree. Considering that currently the tree is inside vfio, I
>>>>> see this not a big issue as well.
>>>> Another issue I found is: with this series, VFIO_IOMMU_MAP_DMA seems
>>>> become guest trigger-able. And since VFIO allocate its own structure to
>>>> record dma mapping, this seems open a window for evil guest to exhaust
>>>> host memory which is even worse.
>>> You're thinking of pci-assign, vfio does page accounting such that a
>>> user can only lock pages up to their locked memory limit.  Exposing the
>>> mapping ioctl within the guest is not a different problem from exposing
>>> the ioctl to the host user from a vfio perspective.  Thanks,
>>>
>>> Alex
>>>   
>> Yes, but what if an evil guest that maps all iovas to the same gpa?
> Doesn't matter, we'd account that gpa each time it's mapped, so
> effectively the locked memory limit is equal to the iova size the user
> can map.  Thanks,
>
> Alex

I see. Good to know this.

Thanks

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v4 16/20] intel_iommu: do replay when context invalidate
  2017-01-25  6:44             ` Peter Xu
@ 2017-01-25  7:45               ` Jason Wang
  0 siblings, 0 replies; 75+ messages in thread
From: Jason Wang @ 2017-01-25  7:45 UTC (permalink / raw)
  To: Peter Xu, Tian, Kevin
  Cc: Lan, Tianyu, mst, jan.kiszka, qemu-devel, alex.williamson, bd.aviv



On 2017年01月25日 14:44, Peter Xu wrote:
> On Wed, Jan 25, 2017 at 06:37:23AM +0000, Tian, Kevin wrote:
>>> From: Peter Xu [mailto:peterx@redhat.com]
>>> Sent: Wednesday, January 25, 2017 11:46 AM
>>>
>>> On Wed, Jan 25, 2017 at 11:09:39AM +0800, Jason Wang wrote:
>>>>
>>>> On 2017年01月24日 12:52, Peter Xu wrote:
>>>>> On Mon, Jan 23, 2017 at 06:36:17PM +0800, Jason Wang wrote:
>>>>>> On 2017年01月20日 21:08, Peter Xu wrote:
>>>>>>> Before this one we only invalidate context cache when we receive context
>>>>>>> entry invalidations. However it's possible that the invalidation also
>>>>>>> contains a domain switch (only if cache-mode is enabled for vIOMMU). In
>>>>>>> that case we need to notify all the registered components about the new
>>>>>>> mapping.
>>>>>>>
>>>>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>>>>> ---
>>>>>>>   hw/i386/intel_iommu.c | 10 ++++++++++
>>>>>>>   1 file changed, 10 insertions(+)
>>>>>>>
>>>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>>>>> index f9c5142..4b08b4d 100644
>>>>>>> --- a/hw/i386/intel_iommu.c
>>>>>>> +++ b/hw/i386/intel_iommu.c
>>>>>>> @@ -1146,6 +1146,16 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
>>>>>>>                   trace_vtd_inv_desc_cc_device(bus_n, VTD_PCI_SLOT(devfn_it),
>>>>>>>                                                VTD_PCI_FUNC(devfn_it));
>>>>>>>                   vtd_as->context_cache_entry.context_cache_gen = 0;
>>>>>>> +                /*
>>>>>>> +                 * A device is moving out of (or into) a domain, so a
>>>>>>> +                 * replay() here suits the purpose of notifying all
>>>>>>> +                 * registered IOMMU_NOTIFIER_MAP notifiers about the
>>>>>>> +                 * change. This does no harm even if no such notifier
>>>>>>> +                 * is registered - the IOMMU notification framework
>>>>>>> +                 * will simply skip the MAP notifications in that case.
>>>>>>> +                 */
>>>>>>> +                memory_region_iommu_replay_all(&vtd_as->iommu);
>>>>>> DSI and GLOBAL questions come back again or no need to care about them :) ?
>>>>> DSI/GLOBAL handling is in patch 20 (though it'll be squashed into 18
>>>>> in my next post). Is that what you mean above?
>>>> Seems not - I mean DSI/GLOBAL for context cache invalidation instead of
>>>> IOTLB :)
>>> Yes, I should possibly do the same thing for context cache global
>>> invalidations. IIUC context global invalidation should be a superset
>>> of iotlb invalidation, so maybe I'll add one more patch to call iotlb
>>> invalidation in context invalidation as well. Kevin/others, please
>>> correct me if I misunderstood the spec. Thanks,
>>>
>> Context invalidation is not a superset of iotlb invalidation. The spec
>> just requires software to always follow a context-cache invalidation with
>> a PASID-cache invalidation, followed by an IOTLB invalidation.
> Thanks for pointing that out. If so, it looks like the current version
> suffices for this, right? (so no further change is needed for this one)
>
> -- peterx
>

We cannot depend on guest or driver behavior. I still prefer to add
unmap for DSI/GLOBAL to prevent us from leaking mappings.

Thanks
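
For reference, the leak being discussed can be shown with a small standalone simulation (illustrative only - the arrays and helpers below stand in for the guest page table and the host-side shadow mappings; this is not QEMU code): a DSI/GLOBAL invalidation that only replays can never drop entries the guest has unmapped, while unmapping first and then replaying keeps the shadow state consistent.

#include <stdio.h>
#include <string.h>

#define NPAGES 4

static int guest_pt[NPAGES];   /* 1 = iova page currently mapped by the guest */
static int shadow[NPAGES];     /* 1 = iova page currently mapped on the host  */

/* Replay: walk the guest page table and notify MAP for present entries. */
static void replay(void)
{
    for (int i = 0; i < NPAGES; i++) {
        if (guest_pt[i])
            shadow[i] = 1;
    }
}

/* Unmap: notify UNMAP for the whole range before replaying. */
static void unmap_all(void)
{
    memset(shadow, 0, sizeof(shadow));
}

int main(void)
{
    guest_pt[0] = guest_pt[1] = 1;
    replay();                        /* initial shadow: pages 0 and 1     */

    guest_pt[0] = 0;                 /* guest unmaps page 0 ...           */
    guest_pt[2] = 1;                 /* ... maps page 2 ...               */
                                     /* ... then sends a DSI/GLOBAL inv.  */

    replay();                        /* replay only: page 0 is leaked     */
    printf("replay only:       page 0 still mapped = %d\n", shadow[0]);

    unmap_all();                     /* unmap first, then replay          */
    replay();
    printf("unmap then replay: page 0 still mapped = %d\n", shadow[0]);
    return 0;
}
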



Thread overview: 75+ messages
2017-01-20 13:08 [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances Peter Xu
2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 01/20] vfio: trace map/unmap for notify as well Peter Xu
2017-01-23 18:20   ` Alex Williamson
2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 02/20] vfio: introduce vfio_get_vaddr() Peter Xu
2017-01-23 18:49   ` Alex Williamson
2017-01-24  3:28     ` Peter Xu
2017-01-24  4:30       ` Alex Williamson
2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 03/20] vfio: allow to notify unmap for very large region Peter Xu
2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 04/20] IOMMU: add option to enable VTD_CAP_CM to vIOMMU capility exposoed to guest Peter Xu
2017-01-22  2:51   ` [Qemu-devel] [PATCH RFC v4.1 04/20] intel_iommu: add "caching-mode" option Peter Xu
2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 05/20] intel_iommu: simplify irq region translation Peter Xu
2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 06/20] intel_iommu: renaming gpa to iova where proper Peter Xu
2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 07/20] intel_iommu: fix trace for inv desc handling Peter Xu
2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 08/20] intel_iommu: fix trace for addr translation Peter Xu
2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 09/20] intel_iommu: vtd_slpt_level_shift check level Peter Xu
2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 10/20] memory: add section range info for IOMMU notifier Peter Xu
2017-01-23 19:12   ` Alex Williamson
2017-01-24  7:48     ` Peter Xu
2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 11/20] memory: provide IOMMU_NOTIFIER_FOREACH macro Peter Xu
2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 12/20] memory: provide iommu_replay_all() Peter Xu
2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 13/20] memory: introduce memory_region_notify_one() Peter Xu
2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 14/20] memory: add MemoryRegionIOMMUOps.replay() callback Peter Xu
2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 15/20] intel_iommu: provide its own replay() callback Peter Xu
2017-01-22  7:56   ` Jason Wang
2017-01-22  8:51     ` Peter Xu
2017-01-22  9:36       ` Peter Xu
2017-01-23  1:50         ` Jason Wang
2017-01-23  1:48       ` Jason Wang
2017-01-23  2:54         ` Peter Xu
2017-01-23  3:12           ` Jason Wang
2017-01-23  3:35             ` Peter Xu
2017-01-23 19:34           ` Alex Williamson
2017-01-24  4:04             ` Peter Xu
2017-01-23 19:33       ` Alex Williamson
2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 16/20] intel_iommu: do replay when context invalidate Peter Xu
2017-01-23 10:36   ` Jason Wang
2017-01-24  4:52     ` Peter Xu
2017-01-25  3:09       ` Jason Wang
2017-01-25  3:46         ` Peter Xu
2017-01-25  6:37           ` Tian, Kevin
2017-01-25  6:44             ` Peter Xu
2017-01-25  7:45               ` Jason Wang
2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 17/20] intel_iommu: allow dynamic switch of IOMMU region Peter Xu
2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 18/20] intel_iommu: enable vfio devices Peter Xu
2017-01-22  8:08   ` Jason Wang
2017-01-22  9:04     ` Peter Xu
2017-01-23  1:55       ` Jason Wang
2017-01-23  3:34         ` Peter Xu
2017-01-23 10:23           ` Jason Wang
2017-01-23 19:40             ` Alex Williamson
2017-01-25  1:19               ` Jason Wang
2017-01-25  1:31                 ` Alex Williamson
2017-01-25  7:41                   ` Jason Wang
2017-01-24  4:42             ` Peter Xu
2017-01-23 18:03           ` Alex Williamson
2017-01-24  7:22             ` Peter Xu
2017-01-24 16:24               ` Alex Williamson
2017-01-25  4:04                 ` Peter Xu
2017-01-23  2:01   ` Jason Wang
2017-01-23  2:17     ` Jason Wang
2017-01-23  3:40     ` Peter Xu
2017-01-23 10:27       ` Jason Wang
2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 19/20] intel_iommu: unmap existing pages before replay Peter Xu
2017-01-22  8:13   ` Jason Wang
2017-01-22  9:09     ` Peter Xu
2017-01-23  1:57       ` Jason Wang
2017-01-23  7:30         ` Peter Xu
2017-01-23 10:29           ` Jason Wang
2017-01-23 10:40   ` Jason Wang
2017-01-24  7:31     ` Peter Xu
2017-01-25  3:11       ` Jason Wang
2017-01-25  4:15         ` Peter Xu
2017-01-20 13:08 ` [Qemu-devel] [PATCH RFC v4 20/20] intel_iommu: replay even with DSI/GLOBAL inv desc Peter Xu
2017-01-23 15:55 ` [Qemu-devel] [PATCH RFC v4 00/20] VT-d: vfio enablement and misc enhances Michael S. Tsirkin
2017-01-24  7:40   ` Peter Xu
