* [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances
@ 2017-02-07  8:28 Peter Xu
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 01/17] vfio: trace map/unmap for notify as well Peter Xu
                   ` (18 more replies)
  0 siblings, 19 replies; 63+ messages in thread
From: Peter Xu @ 2017-02-07  8:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	David Gibson, alex.williamson, bd.aviv

This is v7 of vt-d vfio enablement series.

v7:
- for the two tracing patches: change the subjects. Remove the vtd_err()
  and vtd_err_nonzero_rsvd() tracers, and instead use a standalone trace
  for each of the places. Don't remove any DPRINTF() if there is no
  replacement. [Jason]
- add r-b and a-b from Alex/David/Jason.
- in patch "intel_iommu: renaming gpa to iova where proper", convert
  one more place that I missed [Jason]
- fix a place where I should use "~0ULL" rather than "~0" [Jason]
- squash patch 16 into 18 [Jason]

v6:
- do unmap in all cases when replaying [Jason]
- do a global replay even if the context entry is invalidated [Jason]
- on IOMMU reset, send unmap to all registered notifiers [Jason]
- use rcu read lock to protect the whole vfio_iommu_map_notify()
  [Alex, Paolo]

v5:
- fix patch 4's overly long subject and a spelling error [Eric]
- add Alex's ack for patch 1 [Alex]
- squash patches 19/20 into patch 18 [Jason]
- fix comments in vtd_page_walk() [Jason]
- remove all error_report() [Jason]
- add a comment for patch 18, also mentioning vhost enabled without
  ATS [Jason]
- remove the skipped-entry debug output during page walk [Jason]
- remove the duplicated page walk trace [Jason]
- some tuning in vtd_address_space_unmap() to provide the correct iova
  and addr_mask. For this, I also adjusted the patch
  "memory: add section range info for IOMMU notifier"
  a bit to loosen the range check

v4:
- convert all error_report()s into traces (in the two patches that did
  that)
- rebased to Jason's DMAR series (master + one more patch:
  "[PATCH V4 net-next] vhost_net: device IOTLB support")
- let vhost use the new API iommu_notifier_init() so it won't break
  vhost DMAR [Jason]
- touch up the commit message of the patch:
  "intel_iommu: provide its own replay() callback"
  the old replay is not a dead loop, it just consumes lots of time
  [Jason]
- add a comment for the patch:
  "intel_iommu: do replay when context invalidate"
  explaining why replay won't be a problem even without CM=1 [Jason]
- remove a useless comment line [Jason]
- remove dmar_enabled parameter for vtd_switch_address_space() and
  vtd_switch_address_space_all() [Mst, Jason]
- merge the vfio patches in, to support unmapping of big ranges from
  the start ("[PATCH RFC 0/3] vfio: allow to notify unmap for very big
  region")
- using caching_mode instead of cache_mode_enabled, and "caching-mode"
  instead of "cache-mode" [Kevin]
- when receiving a context entry invalidation, unmap the entire region
  first, then replay [Alex]
- fix commit message for patch:
  "intel_iommu: simplify irq region translation" [Kevin]
- handle domain/global invalidation, and notify where appropriate
  [Jason, Kevin]

v3:
- fix style error reported by patchew
- fix comment in domain switch patch: use "IOMMU address space" rather
  than "IOMMU region" [Kevin]
- add ack-by for Paolo in patch:
  "memory: add section range info for IOMMU notifier"
  (this is collected separately, outside of this thread)
- remove 3 patches which are merged already (from Jason)
- rebase to master b6c0897

v2:
- change comment for "end" parameter in vtd_page_walk() [Tianyu]
- change comment for "a iova" to "an iova" [Yi]
- fix the printed fault value for the GPA address in vtd_page_walk_level
  (debug only)
- rebased to master (rather than Aviv's v6 series) and merged Aviv's
  series v6: picked patch 1 (as patch 1 in this series), dropped patch
  2, re-wrote patch 3 (as patch 17 of this series).
- picked up two more bugfix patches from Jason's DMAR series
- picked up the following patch as well:
  "[PATCH v3] intel_iommu: allow dynamic switch of IOMMU region"

This series is a rework of Aviv B.D.'s vfio enablement series
with vt-d:

  https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01452.html

Aviv has done a great job there, and what was still lacking there is
mostly the following:

(1) VFIO got duplicated IOTLB notifications due to the split VT-d IOMMU
    memory region.

(2) VT-d still didn't provide a correct replay() mechanism (e.g., when
    the IOMMU domain switches, things would break).

This series should have solved the above two issues.

Online repo:

  https://github.com/xzpeter/qemu/tree/vtd-vfio-enablement-v7

I would be glad to hear any review comments on the above patches.

=========
Test Done
=========

Build test passed for x86_64/arm/ppc64.

Simple test with x86_64, assigning two PCI devices to a single VM and
booting the VM using:

bin=x86_64-softmmu/qemu-system-x86_64
$bin -M q35,accel=kvm,kernel-irqchip=split -m 1G \
     -device intel-iommu,intremap=on,eim=off,caching-mode=on \
     -netdev user,id=net0,hostfwd=tcp::5555-:22 \
     -device virtio-net-pci,netdev=net0 \
     -device vfio-pci,host=03:00.0 \
     -device vfio-pci,host=02:00.0 \
     -trace events=".trace.vfio" \
     /var/lib/libvirt/images/vm1.qcow2

pxdev:bin [vtd-vfio-enablement]# cat .trace.vfio
vtd_page_walk*
vtd_replay*
vtd_inv_desc*

Then, in the guest, run the following tool:

  https://github.com/xzpeter/clibs/blob/master/gpl/userspace/vfio-bind-group/vfio-bind-group.c

With the parameters:

  ./vfio-bind-group 00:03.0 00:04.0

Checking the host-side trace log, I can see pages being replayed and
mapped into the 00:04.0 device address space, like:

...
vtd_replay_ce_valid replay valid context device 00:04.00 hi 0x401 lo 0x38fe1001
vtd_page_walk Page walk for ce (0x401, 0x38fe1001) iova range 0x0 - 0x8000000000
vtd_page_walk_level Page walk (base=0x38fe1000, level=3) iova range 0x0 - 0x8000000000
vtd_page_walk_level Page walk (base=0x35d31000, level=2) iova range 0x0 - 0x40000000
vtd_page_walk_level Page walk (base=0x34979000, level=1) iova range 0x0 - 0x200000
vtd_page_walk_one Page walk detected map level 0x1 iova 0x0 -> gpa 0x22dc3000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x1000 -> gpa 0x22e25000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x2000 -> gpa 0x22e12000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x3000 -> gpa 0x22e2d000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x4000 -> gpa 0x12a49000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x5000 -> gpa 0x129bb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x6000 -> gpa 0x128db000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x7000 -> gpa 0x12a80000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x8000 -> gpa 0x12a7e000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x9000 -> gpa 0x12b22000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0xa000 -> gpa 0x12b41000 mask 0xfff perm 3
...

=========
Todo List
=========

- error reporting for the assigned devices (as Tianyu has mentioned)

- per-domain address space: a better solution in the future may be to
  maintain one address space per IOMMU domain in the guest (so multiple
  devices can share the same address space if they share the same IOMMU
  domain in the guest), rather than one address space per device (which
  is the current vt-d implementation; see the sketch after this list).
  However, that's a step further than this series, so let's first see
  whether we can provide a workable version of device assignment with
  vt-d protection.

- no need to notify IOTLB (psi/gsi/global) invalidations to devices
  with ATS enabled

- investigate the case where the guest maps a page whose mask covers
  already-mapped pages (e.g. map 12k-16k first, then map 0-12k)

- coalesce unmap during page walk (currently, we send it once per
  page)

- when doing PSI for unmap, see whether we can send one notification
  directly instead of walking over the page table

- more to come...
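
Regarding the per-domain address-space item above, here is a purely
hypothetical sketch of what a shared, per-domain structure might look
like (all names below are invented for illustration and are not part of
this series):

    /* Hypothetical only: key address spaces by guest domain id, so that
     * devices attached to the same guest IOMMU domain would share one
     * address space instead of having one per device. */
    typedef struct VTDDomainAddressSpace {
        uint16_t domain_id;                    /* guest domain identifier */
        MemoryRegion iommu;                    /* shared IOMMU region */
        AddressSpace as;                       /* shared address space */
        QLIST_HEAD(, VTDAddressSpace) devices; /* devices attached to it */
    } VTDDomainAddressSpace;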

Thanks,

Aviv Ben-David (1):
  intel_iommu: add "caching-mode" option

Peter Xu (16):
  vfio: trace map/unmap for notify as well
  vfio: introduce vfio_get_vaddr()
  vfio: allow to notify unmap for very large region
  intel_iommu: simplify irq region translation
  intel_iommu: renaming gpa to iova where proper
  intel_iommu: convert dbg macros to traces for inv
  intel_iommu: convert dbg macros to trace for trans
  intel_iommu: vtd_slpt_level_shift check level
  memory: add section range info for IOMMU notifier
  memory: provide IOMMU_NOTIFIER_FOREACH macro
  memory: provide iommu_replay_all()
  memory: introduce memory_region_notify_one()
  memory: add MemoryRegionIOMMUOps.replay() callback
  intel_iommu: provide its own replay() callback
  intel_iommu: allow dynamic switch of IOMMU region
  intel_iommu: enable vfio devices

 hw/i386/intel_iommu.c          | 669 +++++++++++++++++++++++++++++++----------
 hw/i386/intel_iommu_internal.h |   2 +
 hw/i386/trace-events           |  36 +++
 hw/vfio/common.c               |  77 +++--
 hw/vfio/trace-events           |   2 +-
 hw/virtio/vhost.c              |   4 +-
 include/exec/memory.h          |  49 ++-
 include/hw/i386/intel_iommu.h  |  12 +
 memory.c                       |  52 +++-
 9 files changed, 710 insertions(+), 193 deletions(-)

-- 
2.7.4


* [Qemu-devel] [PATCH v7 01/17] vfio: trace map/unmap for notify as well
  2017-02-07  8:28 [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances Peter Xu
@ 2017-02-07  8:28 ` Peter Xu
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 02/17] vfio: introduce vfio_get_vaddr() Peter Xu
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2017-02-07  8:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	David Gibson, alex.williamson, bd.aviv

We trace its range, but we don't know whether it's a MAP or UNMAP. Let's
dump that as well.
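
As a usage note (not part of the patch): once applied, the event can be
enabled at runtime with QEMU's trace option, for example:

    -trace "vfio_iommu_map_notify"

or through an events file, as done in the cover letter.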

Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/vfio/common.c     | 3 ++-
 hw/vfio/trace-events | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 801578b..174f351 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -305,7 +305,8 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     void *vaddr;
     int ret;
 
-    trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
+    trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? "UNMAP" : "MAP",
+                                iova, iova + iotlb->addr_mask);
 
     if (iotlb->target_as != &address_space_memory) {
         error_report("Wrong target AS \"%s\", only system memory is allowed",
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 8de8281..2561c6d 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -84,7 +84,7 @@ vfio_pci_igd_lpc_bridge_enabled(const char *name) "%s"
 # hw/vfio/common.c
 vfio_region_write(const char *name, int index, uint64_t addr, uint64_t data, unsigned size) " (%s:region%d+0x%"PRIx64", 0x%"PRIx64 ", %d)"
 vfio_region_read(char *name, int index, uint64_t addr, unsigned size, uint64_t data) " (%s:region%d+0x%"PRIx64", %d) = 0x%"PRIx64
-vfio_iommu_map_notify(uint64_t iova_start, uint64_t iova_end) "iommu map @ %"PRIx64" - %"PRIx64
+vfio_iommu_map_notify(const char *op, uint64_t iova_start, uint64_t iova_end) "iommu %s @ %"PRIx64" - %"PRIx64
 vfio_listener_region_add_skip(uint64_t start, uint64_t end) "SKIPPING region_add %"PRIx64" - %"PRIx64
 vfio_listener_region_add_iommu(uint64_t start, uint64_t end) "region_add [iommu] %"PRIx64" - %"PRIx64
 vfio_listener_region_add_ram(uint64_t iova_start, uint64_t iova_end, void *vaddr) "region_add [ram] %"PRIx64" - %"PRIx64" [%p]"
-- 
2.7.4


* [Qemu-devel] [PATCH v7 02/17] vfio: introduce vfio_get_vaddr()
  2017-02-07  8:28 [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances Peter Xu
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 01/17] vfio: trace map/unmap for notify as well Peter Xu
@ 2017-02-07  8:28 ` Peter Xu
  2017-02-10  1:12   ` David Gibson
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 03/17] vfio: allow to notify unmap for very large region Peter Xu
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2017-02-07  8:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	David Gibson, alex.williamson, bd.aviv

A cleanup for vfio_iommu_map_notify(). Now we will fetch vaddr even if
the operation is unmap, but it won't hurt much.

One thing to mention is that we need the RCU read lock to protect the
whole translation and map/unmap procedure.
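
For clarity, the resulting caller shape (a sketch mirroring the hunk
below) keeps the whole translate-and-map sequence under the RCU read
lock:

    rcu_read_lock();
    if (vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
        /* vaddr stays valid here because the RCU read lock is still
         * held; once vfio_dma_map() pins the pages it remains safe. */
        ...
    }
    rcu_read_unlock();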

Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/vfio/common.c | 65 +++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 45 insertions(+), 20 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 174f351..42c4790 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -294,54 +294,79 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
            section->offset_within_address_space & (1ULL << 63);
 }
 
-static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
+/* Called with rcu_read_lock held.  */
+static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
+                           bool *read_only)
 {
-    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
-    VFIOContainer *container = giommu->container;
-    hwaddr iova = iotlb->iova + giommu->iommu_offset;
     MemoryRegion *mr;
     hwaddr xlat;
     hwaddr len = iotlb->addr_mask + 1;
-    void *vaddr;
-    int ret;
-
-    trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? "UNMAP" : "MAP",
-                                iova, iova + iotlb->addr_mask);
-
-    if (iotlb->target_as != &address_space_memory) {
-        error_report("Wrong target AS \"%s\", only system memory is allowed",
-                     iotlb->target_as->name ? iotlb->target_as->name : "none");
-        return;
-    }
+    bool writable = iotlb->perm & IOMMU_WO;
 
     /*
      * The IOMMU TLB entry we have just covers translation through
      * this IOMMU to its immediate target.  We need to translate
      * it the rest of the way through to memory.
      */
-    rcu_read_lock();
     mr = address_space_translate(&address_space_memory,
                                  iotlb->translated_addr,
-                                 &xlat, &len, iotlb->perm & IOMMU_WO);
+                                 &xlat, &len, writable);
     if (!memory_region_is_ram(mr)) {
         error_report("iommu map to non memory area %"HWADDR_PRIx"",
                      xlat);
-        goto out;
+        return false;
     }
+
     /*
      * Translation truncates length to the IOMMU page size,
      * check that it did not truncate too much.
      */
     if (len & iotlb->addr_mask) {
         error_report("iommu has granularity incompatible with target AS");
+        return false;
+    }
+
+    *vaddr = memory_region_get_ram_ptr(mr) + xlat;
+    *read_only = !writable || mr->readonly;
+
+    return true;
+}
+
+static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
+{
+    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
+    VFIOContainer *container = giommu->container;
+    hwaddr iova = iotlb->iova + giommu->iommu_offset;
+    bool read_only;
+    void *vaddr;
+    int ret;
+
+    trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? "UNMAP" : "MAP",
+                                iova, iova + iotlb->addr_mask);
+
+    if (iotlb->target_as != &address_space_memory) {
+        error_report("Wrong target AS \"%s\", only system memory is allowed",
+                     iotlb->target_as->name ? iotlb->target_as->name : "none");
+        return;
+    }
+
+    rcu_read_lock();
+
+    if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
         goto out;
     }
 
     if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
-        vaddr = memory_region_get_ram_ptr(mr) + xlat;
+        /*
+         * vaddr is only valid until rcu_read_unlock(). But after
+         * vfio_dma_map has set up the mapping the pages will be
+         * pinned by the kernel. This makes sure that the RAM backend
+         * of vaddr will always be there, even if the memory object is
+         * destroyed and its backing memory munmap-ed.
+         */
         ret = vfio_dma_map(container, iova,
                            iotlb->addr_mask + 1, vaddr,
-                           !(iotlb->perm & IOMMU_WO) || mr->readonly);
+                           read_only);
         if (ret) {
             error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx", %p) = %d (%m)",
-- 
2.7.4


* [Qemu-devel] [PATCH v7 03/17] vfio: allow to notify unmap for very large region
  2017-02-07  8:28 [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances Peter Xu
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 01/17] vfio: trace map/unmap for notify as well Peter Xu
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 02/17] vfio: introduce vfio_get_vaddr() Peter Xu
@ 2017-02-07  8:28 ` Peter Xu
  2017-02-10  1:13   ` David Gibson
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 04/17] intel_iommu: add "caching-mode" option Peter Xu
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2017-02-07  8:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	David Gibson, alex.williamson, bd.aviv

The Linux vfio driver supports doing VFIO_IOMMU_UNMAP_DMA on a very big
region. This can be leveraged by the QEMU IOMMU implementation to clean
up existing page mappings for an entire iova address space (by notifying
with an IOTLB entry that has an extremely huge addr_mask). However, the
current vfio_iommu_map_notify() does not allow that: it makes sure that
every translated address in the IOTLB falls into a RAM range.

The check makes sense, but it is only meaningful for map operations and
means little for unmaps.

This patch moves the check into the map logic only, so that we get
faster unmap handling (no need to translate again), and we can also
better support unmapping a very big region even when it covers non-RAM
ranges or ranges that do not exist at all.
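
To illustrate (a sketch only; the 39-bit size below is an arbitrary
example, and vtd_as stands for some VT-d address space with its iommu
memory region), a caller could now notify one huge unmap covering a
whole iova range with a single IOTLB entry:

    IOMMUTLBEntry entry = {
        .target_as = &address_space_memory,
        .iova = 0,
        .translated_addr = 0,
        .addr_mask = (1ULL << 39) - 1, /* huge mask, not page-sized */
        .perm = IOMMU_NONE,            /* IOMMU_NONE means unmap */
    };
    memory_region_notify_iommu(&vtd_as->iommu, entry);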

Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/vfio/common.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 42c4790..f3ba9b9 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -352,11 +352,10 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
 
     rcu_read_lock();
 
-    if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
-        goto out;
-    }
-
     if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
+        if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
+            goto out;
+        }
         /*
          * vaddr is only valid until rcu_read_unlock(). But after
          * vfio_dma_map has set up the mapping the pages will be
-- 
2.7.4


* [Qemu-devel] [PATCH v7 04/17] intel_iommu: add "caching-mode" option
  2017-02-07  8:28 [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (2 preceding siblings ...)
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 03/17] vfio: allow to notify unmap for very large region Peter Xu
@ 2017-02-07  8:28 ` Peter Xu
  2017-02-10  1:14   ` David Gibson
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 05/17] intel_iommu: simplify irq region translation Peter Xu
                   ` (14 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2017-02-07  8:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	David Gibson, alex.williamson, bd.aviv

From: Aviv Ben-David <bd.aviv@gmail.com>

This capability asks the guest to invalidate cache before each map operation.
We can use this invalidation to trap map operations in the hypervisor.
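
For example, the option is used in the cover letter's test command line:

    -device intel-iommu,intremap=on,eim=off,caching-mode=on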

Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
[peterx: using "caching-mode" instead of "cache-mode" to align with spec]
[peterx: re-write the subject to make it short and clear]
Reviewed-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c          | 5 +++++
 hw/i386/intel_iommu_internal.h | 1 +
 include/hw/i386/intel_iommu.h  | 2 ++
 3 files changed, 8 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 3270fb9..50251c3 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2115,6 +2115,7 @@ static Property vtd_properties[] = {
     DEFINE_PROP_ON_OFF_AUTO("eim", IntelIOMMUState, intr_eim,
                             ON_OFF_AUTO_AUTO),
     DEFINE_PROP_BOOL("x-buggy-eim", IntelIOMMUState, buggy_eim, false),
+    DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -2496,6 +2497,10 @@ static void vtd_init(IntelIOMMUState *s)
         s->ecap |= VTD_ECAP_DT;
     }
 
+    if (s->caching_mode) {
+        s->cap |= VTD_CAP_CM;
+    }
+
     vtd_reset_context_cache(s);
     vtd_reset_iotlb(s);
 
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 356f188..4104121 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -202,6 +202,7 @@
 #define VTD_CAP_MAMV                (VTD_MAMV << 48)
 #define VTD_CAP_PSI                 (1ULL << 39)
 #define VTD_CAP_SLLPS               ((1ULL << 34) | (1ULL << 35))
+#define VTD_CAP_CM                  (1ULL << 7)
 
 /* Supported Adjusted Guest Address Widths */
 #define VTD_CAP_SAGAW_SHIFT         8
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 405c9d1..fe645aa 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -257,6 +257,8 @@ struct IntelIOMMUState {
     uint8_t womask[DMAR_REG_SIZE];  /* WO (write only - read returns 0) */
     uint32_t version;
 
+    bool caching_mode;          /* RO - is cap CM enabled? */
+
     dma_addr_t root;                /* Current root table pointer */
     bool root_extended;             /* Type of root table (extended or not) */
     bool dmar_enabled;              /* Set if DMA remapping is enabled */
-- 
2.7.4


* [Qemu-devel] [PATCH v7 05/17] intel_iommu: simplify irq region translation
  2017-02-07  8:28 [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (3 preceding siblings ...)
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 04/17] intel_iommu: add "caching-mode" option Peter Xu
@ 2017-02-07  8:28 ` Peter Xu
  2017-02-10  1:15   ` David Gibson
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 06/17] intel_iommu: renaming gpa to iova where proper Peter Xu
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2017-02-07  8:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	David Gibson, alex.williamson, bd.aviv

Now that we have a standalone memory region for MSI, all IRQ region
requests should be redirected there. Clean up the block with an
assertion instead.

Reviewed-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 28 ++++++----------------------
 1 file changed, 6 insertions(+), 22 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 50251c3..86d19bb 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -818,28 +818,12 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     bool writes = true;
     VTDIOTLBEntry *iotlb_entry;
 
-    /* Check if the request is in interrupt address range */
-    if (vtd_is_interrupt_addr(addr)) {
-        if (is_write) {
-            /* FIXME: since we don't know the length of the access here, we
-             * treat Non-DWORD length write requests without PASID as
-             * interrupt requests, too. Withoud interrupt remapping support,
-             * we just use 1:1 mapping.
-             */
-            VTD_DPRINTF(MMU, "write request to interrupt address "
-                        "gpa 0x%"PRIx64, addr);
-            entry->iova = addr & VTD_PAGE_MASK_4K;
-            entry->translated_addr = addr & VTD_PAGE_MASK_4K;
-            entry->addr_mask = ~VTD_PAGE_MASK_4K;
-            entry->perm = IOMMU_WO;
-            return;
-        } else {
-            VTD_DPRINTF(GENERAL, "error: read request from interrupt address "
-                        "gpa 0x%"PRIx64, addr);
-            vtd_report_dmar_fault(s, source_id, addr, VTD_FR_READ, is_write);
-            return;
-        }
-    }
+    /*
+     * We have standalone memory region for interrupt addresses, we
+     * should never receive translation requests in this region.
+     */
+    assert(!vtd_is_interrupt_addr(addr));
+
     /* Try to fetch slpte form IOTLB */
     iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
     if (iotlb_entry) {
-- 
2.7.4


* [Qemu-devel] [PATCH v7 06/17] intel_iommu: renaming gpa to iova where proper
  2017-02-07  8:28 [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (4 preceding siblings ...)
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 05/17] intel_iommu: simplify irq region translation Peter Xu
@ 2017-02-07  8:28 ` Peter Xu
  2017-02-10  1:17   ` David Gibson
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 07/17] intel_iommu: convert dbg macros to traces for inv Peter Xu
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2017-02-07  8:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	David Gibson, alex.williamson, bd.aviv

There are lots of places in the current intel_iommu.c code that name
"iova" as "gpa". It is really confusing to use the name "gpa" in these
places (it is easily understood as "Guest Physical Address", which it is
not). To make the code (much) easier to read, I decided to do this once
and for all.

No functional change is made, only literal ones.

Reviewed-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 44 ++++++++++++++++++++++----------------------
 1 file changed, 22 insertions(+), 22 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 86d19bb..0c94b79 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -259,7 +259,7 @@ static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
     uint64_t *key = g_malloc(sizeof(*key));
     uint64_t gfn = vtd_get_iotlb_gfn(addr, level);
 
-    VTD_DPRINTF(CACHE, "update iotlb sid 0x%"PRIx16 " gpa 0x%"PRIx64
+    VTD_DPRINTF(CACHE, "update iotlb sid 0x%"PRIx16 " iova 0x%"PRIx64
                 " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr, slpte,
                 domain_id);
     if (g_hash_table_size(s->iotlb) >= VTD_IOTLB_MAX_SIZE) {
@@ -575,12 +575,12 @@ static uint64_t vtd_get_slpte(dma_addr_t base_addr, uint32_t index)
     return slpte;
 }
 
-/* Given a gpa and the level of paging structure, return the offset of current
- * level.
+/* Given an iova and the level of paging structure, return the offset
+ * of current level.
  */
-static inline uint32_t vtd_gpa_level_offset(uint64_t gpa, uint32_t level)
+static inline uint32_t vtd_iova_level_offset(uint64_t iova, uint32_t level)
 {
-    return (gpa >> vtd_slpt_level_shift(level)) &
+    return (iova >> vtd_slpt_level_shift(level)) &
             ((1ULL << VTD_SL_LEVEL_BITS) - 1);
 }
 
@@ -628,12 +628,12 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
     }
 }
 
-/* Given the @gpa, get relevant @slptep. @slpte_level will be the last level
+/* Given the @iova, get relevant @slptep. @slpte_level will be the last level
  * of the translation, can be used for deciding the size of large page.
  */
-static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
-                            uint64_t *slptep, uint32_t *slpte_level,
-                            bool *reads, bool *writes)
+static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
+                             uint64_t *slptep, uint32_t *slpte_level,
+                             bool *reads, bool *writes)
 {
     dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
     uint32_t level = vtd_get_level_from_context_entry(ce);
@@ -642,11 +642,11 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
     uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
     uint64_t access_right_check;
 
-    /* Check if @gpa is above 2^X-1, where X is the minimum of MGAW in CAP_REG
-     * and AW in context-entry.
+    /* Check if @iova is above 2^X-1, where X is the minimum of MGAW
+     * in CAP_REG and AW in context-entry.
      */
-    if (gpa & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
-        VTD_DPRINTF(GENERAL, "error: gpa 0x%"PRIx64 " exceeds limits", gpa);
+    if (iova & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
+        VTD_DPRINTF(GENERAL, "error: iova 0x%"PRIx64 " exceeds limits", iova);
         return -VTD_FR_ADDR_BEYOND_MGAW;
     }
 
@@ -654,13 +654,13 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
     access_right_check = is_write ? VTD_SL_W : VTD_SL_R;
 
     while (true) {
-        offset = vtd_gpa_level_offset(gpa, level);
+        offset = vtd_iova_level_offset(iova, level);
         slpte = vtd_get_slpte(addr, offset);
 
         if (slpte == (uint64_t)-1) {
             VTD_DPRINTF(GENERAL, "error: fail to access second-level paging "
-                        "entry at level %"PRIu32 " for gpa 0x%"PRIx64,
-                        level, gpa);
+                        "entry at level %"PRIu32 " for iova 0x%"PRIx64,
+                        level, iova);
             if (level == vtd_get_level_from_context_entry(ce)) {
                 /* Invalid programming of context-entry */
                 return -VTD_FR_CONTEXT_ENTRY_INV;
@@ -672,8 +672,8 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
         *writes = (*writes) && (slpte & VTD_SL_W);
         if (!(slpte & access_right_check)) {
             VTD_DPRINTF(GENERAL, "error: lack of %s permission for "
-                        "gpa 0x%"PRIx64 " slpte 0x%"PRIx64,
-                        (is_write ? "write" : "read"), gpa, slpte);
+                        "iova 0x%"PRIx64 " slpte 0x%"PRIx64,
+                        (is_write ? "write" : "read"), iova, slpte);
             return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
         }
         if (vtd_slpte_nonzero_rsvd(slpte, level)) {
@@ -827,7 +827,7 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     /* Try to fetch slpte form IOTLB */
     iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
     if (iotlb_entry) {
-        VTD_DPRINTF(CACHE, "hit iotlb sid 0x%"PRIx16 " gpa 0x%"PRIx64
+        VTD_DPRINTF(CACHE, "hit iotlb sid 0x%"PRIx16 " iova 0x%"PRIx64
                     " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr,
                     iotlb_entry->slpte, iotlb_entry->domain_id);
         slpte = iotlb_entry->slpte;
@@ -867,8 +867,8 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         cc_entry->context_cache_gen = s->context_cache_gen;
     }
 
-    ret_fr = vtd_gpa_to_slpte(&ce, addr, is_write, &slpte, &level,
-                              &reads, &writes);
+    ret_fr = vtd_iova_to_slpte(&ce, addr, is_write, &slpte, &level,
+                               &reads, &writes);
     if (ret_fr) {
         ret_fr = -ret_fr;
         if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
@@ -2033,7 +2033,7 @@ static IOMMUTLBEntry vtd_iommu_translate(MemoryRegion *iommu, hwaddr addr,
                            is_write, &ret);
     VTD_DPRINTF(MMU,
                 "bus %"PRIu8 " slot %"PRIu8 " func %"PRIu8 " devfn %"PRIu8
-                " gpa 0x%"PRIx64 " hpa 0x%"PRIx64, pci_bus_num(vtd_as->bus),
+                " iova 0x%"PRIx64 " hpa 0x%"PRIx64, pci_bus_num(vtd_as->bus),
                 VTD_PCI_SLOT(vtd_as->devfn), VTD_PCI_FUNC(vtd_as->devfn),
                 vtd_as->devfn, addr, ret.translated_addr);
     return ret;
-- 
2.7.4


* [Qemu-devel] [PATCH v7 07/17] intel_iommu: convert dbg macros to traces for inv
  2017-02-07  8:28 [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (5 preceding siblings ...)
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 06/17] intel_iommu: renaming gpa to iova where proper Peter Xu
@ 2017-02-07  8:28 ` Peter Xu
  2017-02-08  2:47   ` Jason Wang
  2017-02-10  1:19   ` David Gibson
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 08/17] intel_iommu: convert dbg macros to trace for trans Peter Xu
                   ` (11 subsequent siblings)
  18 siblings, 2 replies; 63+ messages in thread
From: Peter Xu @ 2017-02-07  8:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	David Gibson, alex.williamson, bd.aviv

The VT-d code is still using the static DEBUG_INTEL_IOMMU macro. That's
not good; we should end the days when we need to recompile the code
before getting useful debugging information for vt-d. Time to switch to
the trace system. This is the first patch to do it.
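
As a usage note, the new invalidation events can be enabled with a
pattern, e.g. (matching the cover letter's .trace.vfio file):

    -trace "vtd_inv_desc*"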

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 95 +++++++++++++++++++++------------------------------
 hw/i386/trace-events  | 18 ++++++++++
 2 files changed, 56 insertions(+), 57 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 0c94b79..08e43b6 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -35,6 +35,7 @@
 #include "sysemu/kvm.h"
 #include "hw/i386/apic_internal.h"
 #include "kvm_i386.h"
+#include "trace.h"
 
 /*#define DEBUG_INTEL_IOMMU*/
 #ifdef DEBUG_INTEL_IOMMU
@@ -474,22 +475,19 @@ static void vtd_handle_inv_queue_error(IntelIOMMUState *s)
 /* Set the IWC field and try to generate an invalidation completion interrupt */
 static void vtd_generate_completion_event(IntelIOMMUState *s)
 {
-    VTD_DPRINTF(INV, "completes an invalidation wait command with "
-                "Interrupt Flag");
     if (vtd_get_long_raw(s, DMAR_ICS_REG) & VTD_ICS_IWC) {
-        VTD_DPRINTF(INV, "there is a previous interrupt condition to be "
-                    "serviced by software, "
-                    "new invalidation event is not generated");
+        trace_vtd_inv_desc_wait_irq("One pending, skip current");
         return;
     }
     vtd_set_clear_mask_long(s, DMAR_ICS_REG, 0, VTD_ICS_IWC);
     vtd_set_clear_mask_long(s, DMAR_IECTL_REG, 0, VTD_IECTL_IP);
     if (vtd_get_long_raw(s, DMAR_IECTL_REG) & VTD_IECTL_IM) {
-        VTD_DPRINTF(INV, "IM filed in IECTL_REG is set, new invalidation "
-                    "event is not generated");
+        trace_vtd_inv_desc_wait_irq("IM in IECTL_REG is set, "
+                                    "new event not generated");
         return;
     } else {
         /* Generate the interrupt event */
+        trace_vtd_inv_desc_wait_irq("Generating complete event");
         vtd_generate_interrupt(s, DMAR_IEADDR_REG, DMAR_IEDATA_REG);
         vtd_set_clear_mask_long(s, DMAR_IECTL_REG, VTD_IECTL_IP, 0);
     }
@@ -923,6 +921,7 @@ static void vtd_interrupt_remap_table_setup(IntelIOMMUState *s)
 
 static void vtd_context_global_invalidate(IntelIOMMUState *s)
 {
+    trace_vtd_inv_desc_cc_global();
     s->context_cache_gen++;
     if (s->context_cache_gen == VTD_CONTEXT_CACHE_GEN_MAX) {
         vtd_reset_context_cache(s);
@@ -962,9 +961,11 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
     uint16_t mask;
     VTDBus *vtd_bus;
     VTDAddressSpace *vtd_as;
-    uint16_t devfn;
+    uint8_t bus_n, devfn;
     uint16_t devfn_it;
 
+    trace_vtd_inv_desc_cc_devices(source_id, func_mask);
+
     switch (func_mask & 3) {
     case 0:
         mask = 0;   /* No bits in the SID field masked */
@@ -980,16 +981,16 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
         break;
     }
     mask = ~mask;
-    VTD_DPRINTF(INV, "device-selective invalidation source 0x%"PRIx16
-                    " mask %"PRIu16, source_id, mask);
-    vtd_bus = vtd_find_as_from_bus_num(s, VTD_SID_TO_BUS(source_id));
+
+    bus_n = VTD_SID_TO_BUS(source_id);
+    vtd_bus = vtd_find_as_from_bus_num(s, bus_n);
     if (vtd_bus) {
         devfn = VTD_SID_TO_DEVFN(source_id);
         for (devfn_it = 0; devfn_it < X86_IOMMU_PCI_DEVFN_MAX; ++devfn_it) {
             vtd_as = vtd_bus->dev_as[devfn_it];
             if (vtd_as && ((devfn_it & mask) == (devfn & mask))) {
-                VTD_DPRINTF(INV, "invalidate context-cahce of devfn 0x%"PRIx16,
-                            devfn_it);
+                trace_vtd_inv_desc_cc_device(bus_n, VTD_PCI_SLOT(devfn_it),
+                                             VTD_PCI_FUNC(devfn_it));
                 vtd_as->context_cache_entry.context_cache_gen = 0;
             }
         }
@@ -1302,9 +1303,7 @@ static bool vtd_process_wait_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
 {
     if ((inv_desc->hi & VTD_INV_DESC_WAIT_RSVD_HI) ||
         (inv_desc->lo & VTD_INV_DESC_WAIT_RSVD_LO)) {
-        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in Invalidation "
-                    "Wait Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
-                    inv_desc->hi, inv_desc->lo);
+        trace_vtd_inv_desc_wait_invalid(inv_desc->hi, inv_desc->lo);
         return false;
     }
     if (inv_desc->lo & VTD_INV_DESC_WAIT_SW) {
@@ -1316,21 +1315,18 @@ static bool vtd_process_wait_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
 
         /* FIXME: need to be masked with HAW? */
         dma_addr_t status_addr = inv_desc->hi;
-        VTD_DPRINTF(INV, "status data 0x%x, status addr 0x%"PRIx64,
-                    status_data, status_addr);
+        trace_vtd_inv_desc_wait_sw(status_addr, status_data);
         status_data = cpu_to_le32(status_data);
         if (dma_memory_write(&address_space_memory, status_addr, &status_data,
                              sizeof(status_data))) {
-            VTD_DPRINTF(GENERAL, "error: fail to perform a coherent write");
+            trace_vtd_inv_desc_wait_write_fail(inv_desc->hi, inv_desc->lo);
             return false;
         }
     } else if (inv_desc->lo & VTD_INV_DESC_WAIT_IF) {
         /* Interrupt flag */
-        VTD_DPRINTF(INV, "Invalidation Wait Descriptor interrupt completion");
         vtd_generate_completion_event(s);
     } else {
-        VTD_DPRINTF(GENERAL, "error: invalid Invalidation Wait Descriptor: "
-                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, inv_desc->hi, inv_desc->lo);
+        trace_vtd_inv_desc_wait_invalid(inv_desc->hi, inv_desc->lo);
         return false;
     }
     return true;
@@ -1339,30 +1335,29 @@ static bool vtd_process_wait_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
 static bool vtd_process_context_cache_desc(IntelIOMMUState *s,
                                            VTDInvDesc *inv_desc)
 {
+    uint16_t sid, fmask;
+
     if ((inv_desc->lo & VTD_INV_DESC_CC_RSVD) || inv_desc->hi) {
-        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in Context-cache "
-                    "Invalidate Descriptor");
+        trace_vtd_inv_desc_cc_invalid(inv_desc->hi, inv_desc->lo);
         return false;
     }
     switch (inv_desc->lo & VTD_INV_DESC_CC_G) {
     case VTD_INV_DESC_CC_DOMAIN:
-        VTD_DPRINTF(INV, "domain-selective invalidation domain 0x%"PRIx16,
-                    (uint16_t)VTD_INV_DESC_CC_DID(inv_desc->lo));
+        trace_vtd_inv_desc_cc_domain(
+            (uint16_t)VTD_INV_DESC_CC_DID(inv_desc->lo));
         /* Fall through */
     case VTD_INV_DESC_CC_GLOBAL:
-        VTD_DPRINTF(INV, "global invalidation");
         vtd_context_global_invalidate(s);
         break;
 
     case VTD_INV_DESC_CC_DEVICE:
-        vtd_context_device_invalidate(s, VTD_INV_DESC_CC_SID(inv_desc->lo),
-                                      VTD_INV_DESC_CC_FM(inv_desc->lo));
+        sid = VTD_INV_DESC_CC_SID(inv_desc->lo);
+        fmask = VTD_INV_DESC_CC_FM(inv_desc->lo);
+        vtd_context_device_invalidate(s, sid, fmask);
         break;
 
     default:
-        VTD_DPRINTF(GENERAL, "error: invalid granularity in Context-cache "
-                    "Invalidate Descriptor hi 0x%"PRIx64  " lo 0x%"PRIx64,
-                    inv_desc->hi, inv_desc->lo);
+        trace_vtd_inv_desc_cc_invalid(inv_desc->hi, inv_desc->lo);
         return false;
     }
     return true;
@@ -1376,22 +1371,19 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
 
     if ((inv_desc->lo & VTD_INV_DESC_IOTLB_RSVD_LO) ||
         (inv_desc->hi & VTD_INV_DESC_IOTLB_RSVD_HI)) {
-        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in IOTLB "
-                    "Invalidate Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
-                    inv_desc->hi, inv_desc->lo);
+        trace_vtd_inv_desc_iotlb_invalid(inv_desc->hi, inv_desc->lo);
         return false;
     }
 
     switch (inv_desc->lo & VTD_INV_DESC_IOTLB_G) {
     case VTD_INV_DESC_IOTLB_GLOBAL:
-        VTD_DPRINTF(INV, "global invalidation");
+        trace_vtd_inv_desc_iotlb_global();
         vtd_iotlb_global_invalidate(s);
         break;
 
     case VTD_INV_DESC_IOTLB_DOMAIN:
         domain_id = VTD_INV_DESC_IOTLB_DID(inv_desc->lo);
-        VTD_DPRINTF(INV, "domain-selective invalidation domain 0x%"PRIx16,
-                    domain_id);
+        trace_vtd_inv_desc_iotlb_domain(domain_id);
         vtd_iotlb_domain_invalidate(s, domain_id);
         break;
 
@@ -1399,20 +1391,16 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
         domain_id = VTD_INV_DESC_IOTLB_DID(inv_desc->lo);
         addr = VTD_INV_DESC_IOTLB_ADDR(inv_desc->hi);
         am = VTD_INV_DESC_IOTLB_AM(inv_desc->hi);
-        VTD_DPRINTF(INV, "page-selective invalidation domain 0x%"PRIx16
-                    " addr 0x%"PRIx64 " mask %"PRIu8, domain_id, addr, am);
+        trace_vtd_inv_desc_iotlb_pages(domain_id, addr, am);
         if (am > VTD_MAMV) {
-            VTD_DPRINTF(GENERAL, "error: supported max address mask value is "
-                        "%"PRIu8, (uint8_t)VTD_MAMV);
+            trace_vtd_inv_desc_iotlb_invalid(inv_desc->hi, inv_desc->lo);
             return false;
         }
         vtd_iotlb_page_invalidate(s, domain_id, addr, am);
         break;
 
     default:
-        VTD_DPRINTF(GENERAL, "error: invalid granularity in IOTLB Invalidate "
-                    "Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
-                    inv_desc->hi, inv_desc->lo);
+        trace_vtd_inv_desc_iotlb_invalid(inv_desc->hi, inv_desc->lo);
         return false;
     }
     return true;
@@ -1511,33 +1499,28 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
 
     switch (desc_type) {
     case VTD_INV_DESC_CC:
-        VTD_DPRINTF(INV, "Context-cache Invalidate Descriptor hi 0x%"PRIx64
-                    " lo 0x%"PRIx64, inv_desc.hi, inv_desc.lo);
+        trace_vtd_inv_desc("context-cache", inv_desc.hi, inv_desc.lo);
         if (!vtd_process_context_cache_desc(s, &inv_desc)) {
             return false;
         }
         break;
 
     case VTD_INV_DESC_IOTLB:
-        VTD_DPRINTF(INV, "IOTLB Invalidate Descriptor hi 0x%"PRIx64
-                    " lo 0x%"PRIx64, inv_desc.hi, inv_desc.lo);
+        trace_vtd_inv_desc("iotlb", inv_desc.hi, inv_desc.lo);
         if (!vtd_process_iotlb_desc(s, &inv_desc)) {
             return false;
         }
         break;
 
     case VTD_INV_DESC_WAIT:
-        VTD_DPRINTF(INV, "Invalidation Wait Descriptor hi 0x%"PRIx64
-                    " lo 0x%"PRIx64, inv_desc.hi, inv_desc.lo);
+        trace_vtd_inv_desc("wait", inv_desc.hi, inv_desc.lo);
         if (!vtd_process_wait_desc(s, &inv_desc)) {
             return false;
         }
         break;
 
     case VTD_INV_DESC_IEC:
-        VTD_DPRINTF(INV, "Invalidation Interrupt Entry Cache "
-                    "Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
-                    inv_desc.hi, inv_desc.lo);
+        trace_vtd_inv_desc("iec", inv_desc.hi, inv_desc.lo);
         if (!vtd_process_inv_iec_desc(s, &inv_desc)) {
             return false;
         }
@@ -1552,9 +1535,7 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
         break;
 
     default:
-        VTD_DPRINTF(GENERAL, "error: unkonw Invalidation Descriptor type "
-                    "hi 0x%"PRIx64 " lo 0x%"PRIx64 " type %"PRIu8,
-                    inv_desc.hi, inv_desc.lo, desc_type);
+        trace_vtd_inv_desc_invalid(inv_desc.hi, inv_desc.lo);
         return false;
     }
     s->iq_head++;
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 1cc4a10..02aeaab 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -3,6 +3,24 @@
 # hw/i386/x86-iommu.c
 x86_iommu_iec_notify(bool global, uint32_t index, uint32_t mask) "Notify IEC invalidation: global=%d index=%" PRIu32 " mask=%" PRIu32
 
+# hw/i386/intel_iommu.c
+vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
+vtd_inv_desc(const char *type, uint64_t hi, uint64_t lo) "invalidate desc type %s high 0x%"PRIx64" low 0x%"PRIx64
+vtd_inv_desc_invalid(uint64_t hi, uint64_t lo) "invalid inv desc hi 0x%"PRIx64" lo 0x%"PRIx64
+vtd_inv_desc_cc_domain(uint16_t domain) "context invalidate domain 0x%"PRIx16
+vtd_inv_desc_cc_global(void) "context invalidate globally"
+vtd_inv_desc_cc_device(uint8_t bus, uint8_t dev, uint8_t fn) "context invalidate device %02"PRIx8":%02"PRIx8".%02"PRIx8
+vtd_inv_desc_cc_devices(uint16_t sid, uint16_t fmask) "context invalidate devices sid 0x%"PRIx16" fmask 0x%"PRIx16
+vtd_inv_desc_cc_invalid(uint64_t hi, uint64_t lo) "invalid context-cache desc hi 0x%"PRIx64" lo 0x%"PRIx64
+vtd_inv_desc_iotlb_global(void) "iotlb invalidate global"
+vtd_inv_desc_iotlb_domain(uint16_t domain) "iotlb invalidate whole domain 0x%"PRIx16
+vtd_inv_desc_iotlb_pages(uint16_t domain, uint64_t addr, uint8_t mask) "iotlb invalidate domain 0x%"PRIx16" addr 0x%"PRIx64" mask 0x%"PRIx8
+vtd_inv_desc_iotlb_invalid(uint64_t hi, uint64_t lo) "invalid iotlb desc hi 0x%"PRIx64" lo 0x%"PRIx64
+vtd_inv_desc_wait_sw(uint64_t addr, uint32_t data) "wait invalidate status write addr 0x%"PRIx64" data 0x%"PRIx32
+vtd_inv_desc_wait_irq(const char *msg) "%s"
+vtd_inv_desc_wait_invalid(uint64_t hi, uint64_t lo) "invalid wait desc hi 0x%"PRIx64" lo 0x%"PRIx64
+vtd_inv_desc_wait_write_fail(uint64_t hi, uint64_t lo) "write fail for wait desc hi 0x%"PRIx64" lo 0x%"PRIx64
+
 # hw/i386/amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
 amdvi_cache_update(uint16_t domid, uint8_t bus, uint8_t slot, uint8_t func, uint64_t gpa, uint64_t txaddr) " update iotlb domid 0x%"PRIx16" devid: %02x:%02x.%x gpa 0x%"PRIx64" hpa 0x%"PRIx64
-- 
2.7.4


* [Qemu-devel] [PATCH v7 08/17] intel_iommu: convert dbg macros to trace for trans
  2017-02-07  8:28 [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (6 preceding siblings ...)
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 07/17] intel_iommu: convert dbg macros to traces for inv Peter Xu
@ 2017-02-07  8:28 ` Peter Xu
  2017-02-08  2:49   ` Jason Wang
  2017-02-10  1:20   ` David Gibson
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 09/17] intel_iommu: vtd_slpt_level_shift check level Peter Xu
                   ` (10 subsequent siblings)
  18 siblings, 2 replies; 63+ messages in thread
From: Peter Xu @ 2017-02-07  8:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	David Gibson, alex.williamson, bd.aviv

Another patch to convert the DPRINTF() calls. This one focuses on the
address translation path and caching.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 69 ++++++++++++++++++---------------------------------
 hw/i386/trace-events  | 10 ++++++++
 2 files changed, 34 insertions(+), 45 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 08e43b6..ad304f6 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -260,11 +260,9 @@ static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
     uint64_t *key = g_malloc(sizeof(*key));
     uint64_t gfn = vtd_get_iotlb_gfn(addr, level);
 
-    VTD_DPRINTF(CACHE, "update iotlb sid 0x%"PRIx16 " iova 0x%"PRIx64
-                " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr, slpte,
-                domain_id);
+    trace_vtd_iotlb_page_update(source_id, addr, slpte, domain_id);
     if (g_hash_table_size(s->iotlb) >= VTD_IOTLB_MAX_SIZE) {
-        VTD_DPRINTF(CACHE, "iotlb exceeds size limit, forced to reset");
+        trace_vtd_iotlb_reset("iotlb exceeds size limit");
         vtd_reset_iotlb(s);
     }
 
@@ -505,8 +503,7 @@ static int vtd_get_root_entry(IntelIOMMUState *s, uint8_t index,
 
     addr = s->root + index * sizeof(*re);
     if (dma_memory_read(&address_space_memory, addr, re, sizeof(*re))) {
-        VTD_DPRINTF(GENERAL, "error: fail to access root-entry at 0x%"PRIx64
-                    " + %"PRIu8, s->root, index);
+        trace_vtd_re_invalid(re->rsvd, re->val);
         re->val = 0;
         return -VTD_FR_ROOT_TABLE_INV;
     }
@@ -524,15 +521,10 @@ static int vtd_get_context_entry_from_root(VTDRootEntry *root, uint8_t index,
 {
     dma_addr_t addr;
 
-    if (!vtd_root_entry_present(root)) {
-        VTD_DPRINTF(GENERAL, "error: root-entry is not present");
-        return -VTD_FR_ROOT_ENTRY_P;
-    }
+    /* we have checked that root entry is present */
     addr = (root->val & VTD_ROOT_ENTRY_CTP) + index * sizeof(*ce);
     if (dma_memory_read(&address_space_memory, addr, ce, sizeof(*ce))) {
-        VTD_DPRINTF(GENERAL, "error: fail to access context-entry at 0x%"PRIx64
-                    " + %"PRIu8,
-                    (uint64_t)(root->val & VTD_ROOT_ENTRY_CTP), index);
+        trace_vtd_re_invalid(root->rsvd, root->val);
         return -VTD_FR_CONTEXT_TABLE_INV;
     }
     ce->lo = le64_to_cpu(ce->lo);
@@ -704,12 +696,11 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
     }
 
     if (!vtd_root_entry_present(&re)) {
-        VTD_DPRINTF(GENERAL, "error: root-entry #%"PRIu8 " is not present",
-                    bus_num);
+        /* Not error - it's okay we don't have root entry. */
+        trace_vtd_re_not_present(bus_num);
         return -VTD_FR_ROOT_ENTRY_P;
     } else if (re.rsvd || (re.val & VTD_ROOT_ENTRY_RSVD)) {
-        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in root-entry "
-                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, re.rsvd, re.val);
+        trace_vtd_re_invalid(re.rsvd, re.val);
         return -VTD_FR_ROOT_ENTRY_RSVD;
     }
 
@@ -719,22 +710,17 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
     }
 
     if (!vtd_context_entry_present(ce)) {
-        VTD_DPRINTF(GENERAL,
-                    "error: context-entry #%"PRIu8 "(bus #%"PRIu8 ") "
-                    "is not present", devfn, bus_num);
+        /* Not error - it's okay we don't have context entry. */
+        trace_vtd_ce_not_present(bus_num, devfn);
         return -VTD_FR_CONTEXT_ENTRY_P;
     } else if ((ce->hi & VTD_CONTEXT_ENTRY_RSVD_HI) ||
                (ce->lo & VTD_CONTEXT_ENTRY_RSVD_LO)) {
-        VTD_DPRINTF(GENERAL,
-                    "error: non-zero reserved field in context-entry "
-                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, ce->hi, ce->lo);
+        trace_vtd_ce_invalid(ce->hi, ce->lo);
         return -VTD_FR_CONTEXT_ENTRY_RSVD;
     }
     /* Check if the programming of context-entry is valid */
     if (!vtd_is_level_supported(s, vtd_get_level_from_context_entry(ce))) {
-        VTD_DPRINTF(GENERAL, "error: unsupported Address Width value in "
-                    "context-entry hi 0x%"PRIx64 " lo 0x%"PRIx64,
-                    ce->hi, ce->lo);
+        trace_vtd_ce_invalid(ce->hi, ce->lo);
         return -VTD_FR_CONTEXT_ENTRY_INV;
     } else {
         switch (ce->lo & VTD_CONTEXT_ENTRY_TT) {
@@ -743,9 +729,7 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
         case VTD_CONTEXT_TT_DEV_IOTLB:
             break;
         default:
-            VTD_DPRINTF(GENERAL, "error: unsupported Translation Type in "
-                        "context-entry hi 0x%"PRIx64 " lo 0x%"PRIx64,
-                        ce->hi, ce->lo);
+            trace_vtd_ce_invalid(ce->hi, ce->lo);
             return -VTD_FR_CONTEXT_ENTRY_INV;
         }
     }
@@ -825,9 +809,8 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     /* Try to fetch slpte form IOTLB */
     iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
     if (iotlb_entry) {
-        VTD_DPRINTF(CACHE, "hit iotlb sid 0x%"PRIx16 " iova 0x%"PRIx64
-                    " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr,
-                    iotlb_entry->slpte, iotlb_entry->domain_id);
+        trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->slpte,
+                                 iotlb_entry->domain_id);
         slpte = iotlb_entry->slpte;
         reads = iotlb_entry->read_flags;
         writes = iotlb_entry->write_flags;
@@ -836,10 +819,9 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     }
     /* Try to fetch context-entry from cache first */
     if (cc_entry->context_cache_gen == s->context_cache_gen) {
-        VTD_DPRINTF(CACHE, "hit context-cache bus %d devfn %d "
-                    "(hi %"PRIx64 " lo %"PRIx64 " gen %"PRIu32 ")",
-                    bus_num, devfn, cc_entry->context_entry.hi,
-                    cc_entry->context_entry.lo, cc_entry->context_cache_gen);
+        trace_vtd_iotlb_cc_hit(bus_num, devfn, cc_entry->context_entry.hi,
+                               cc_entry->context_entry.lo,
+                               cc_entry->context_cache_gen);
         ce = cc_entry->context_entry;
         is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
     } else {
@@ -848,19 +830,16 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         if (ret_fr) {
             ret_fr = -ret_fr;
             if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
-                VTD_DPRINTF(FLOG, "fault processing is disabled for DMA "
-                            "requests through this context-entry "
-                            "(with FPD Set)");
+                trace_vtd_fault_disabled();
             } else {
                 vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
             }
             return;
         }
         /* Update context-cache */
-        VTD_DPRINTF(CACHE, "update context-cache bus %d devfn %d "
-                    "(hi %"PRIx64 " lo %"PRIx64 " gen %"PRIu32 "->%"PRIu32 ")",
-                    bus_num, devfn, ce.hi, ce.lo,
-                    cc_entry->context_cache_gen, s->context_cache_gen);
+        trace_vtd_iotlb_cc_update(bus_num, devfn, ce.hi, ce.lo,
+                                  cc_entry->context_cache_gen,
+                                  s->context_cache_gen);
         cc_entry->context_entry = ce;
         cc_entry->context_cache_gen = s->context_cache_gen;
     }
@@ -870,8 +849,7 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     if (ret_fr) {
         ret_fr = -ret_fr;
         if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
-            VTD_DPRINTF(FLOG, "fault processing is disabled for DMA requests "
-                        "through this context-entry (with FPD Set)");
+            trace_vtd_fault_disabled();
         } else {
             vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
         }
@@ -1031,6 +1009,7 @@ static uint64_t vtd_context_cache_invalidate(IntelIOMMUState *s, uint64_t val)
 
 static void vtd_iotlb_global_invalidate(IntelIOMMUState *s)
 {
+    trace_vtd_iotlb_reset("global invalidation recved");
     vtd_reset_iotlb(s);
 }
 
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 02aeaab..88ad5e4 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -20,6 +20,16 @@ vtd_inv_desc_wait_sw(uint64_t addr, uint32_t data) "wait invalidate status write
 vtd_inv_desc_wait_irq(const char *msg) "%s"
 vtd_inv_desc_wait_invalid(uint64_t hi, uint64_t lo) "invalid wait desc hi 0x%"PRIx64" lo 0x%"PRIx64
 vtd_inv_desc_wait_write_fail(uint64_t hi, uint64_t lo) "write fail for wait desc hi 0x%"PRIx64" lo 0x%"PRIx64
+vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
+vtd_re_invalid(uint64_t hi, uint64_t lo) "invalid root entry hi 0x%"PRIx64" lo 0x%"PRIx64
+vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
+vtd_ce_invalid(uint64_t hi, uint64_t lo) "invalid context entry hi 0x%"PRIx64" lo 0x%"PRIx64
+vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
+vtd_iotlb_page_update(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page update sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
+vtd_iotlb_cc_hit(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen) "IOTLB context hit bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32
+vtd_iotlb_cc_update(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen1, uint32_t gen2) "IOTLB context update bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32" -> gen %"PRIu32
+vtd_iotlb_reset(const char *reason) "IOTLB reset (reason: %s)"
+vtd_fault_disabled(void) "Fault processing disabled for context entry"
 
 # hw/i386/amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [Qemu-devel] [PATCH v7 09/17] intel_iommu: vtd_slpt_level_shift check level
  2017-02-07  8:28 [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (7 preceding siblings ...)
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 08/17] intel_iommu: convert dbg macros to trace for trans Peter Xu
@ 2017-02-07  8:28 ` Peter Xu
  2017-02-10  1:20   ` David Gibson
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 10/17] memory: add section range info for IOMMU notifier Peter Xu
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2017-02-07  8:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	David Gibson, alex.williamson, bd.aviv

This helps in debugging when an incorrect level is passed in.
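
For illustration, a standalone sketch (not QEMU code; the two constants
are assumptions for this sketch) of how the shift relates to the level,
and why level 0 must be rejected:

    #include <assert.h>
    #include <inttypes.h>
    #include <stdio.h>

    #define VTD_PAGE_SHIFT_4K 12   /* assumed: 4K base page */
    #define VTD_SL_LEVEL_BITS 9    /* assumed: 9 bits per SL table level */

    static uint32_t slpt_level_shift(uint32_t level)
    {
        /* level is unsigned, so level 0 would wrap (level - 1) around */
        assert(level != 0);
        return VTD_PAGE_SHIFT_4K + (level - 1) * VTD_SL_LEVEL_BITS;
    }

    int main(void)
    {
        uint32_t level;

        for (level = 1; level <= 4; level++) {
            /* prints 12, 21, 30, 39 for levels 1..4 */
            printf("level %" PRIu32 " -> shift %" PRIu32 "\n",
                   level, slpt_level_shift(level));
        }
        return 0;
    }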

Reviewed-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index ad304f6..22d8226 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -168,6 +168,7 @@ static gboolean vtd_hash_remove_by_domain(gpointer key, gpointer value,
 /* The shift of an addr for a certain level of paging structure */
 static inline uint32_t vtd_slpt_level_shift(uint32_t level)
 {
+    assert(level != 0);
     return VTD_PAGE_SHIFT_4K + (level - 1) * VTD_SL_LEVEL_BITS;
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [Qemu-devel] [PATCH v7 10/17] memory: add section range info for IOMMU notifier
  2017-02-07  8:28 [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (8 preceding siblings ...)
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 09/17] intel_iommu: vtd_slpt_level_shift check level Peter Xu
@ 2017-02-07  8:28 ` Peter Xu
  2017-02-10  2:29   ` David Gibson
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 11/17] memory: provide IOMMU_NOTIFIER_FOREACH macro Peter Xu
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2017-02-07  8:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	David Gibson, alex.williamson, bd.aviv

In this patch, IOMMUNotifier.{start|end} are introduced to store section
information for a specific notifier. When a notification occurs, we not
only check the notification type (MAP|UNMAP), but also check whether the
notified iova range overlaps with the range of that specific IOMMU
notifier, and skip the notifiers whose listened range does not overlap
the notification.

When removing a region, we need to make sure we remove the correct
VFIOGuestIOMMU by checking the IOMMUNotifier.start address as well.
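
For illustration, a minimal sketch of how a user of the new API would
register a range-limited notifier (my_notify(), my_setup() and the 1GB
range below are made-up names for the example, not part of this patch):

    static void my_notify(IOMMUNotifier *n, IOMMUTLBEntry *entry)
    {
        /* Only invoked when the notified range overlaps
         * [n->start, n->end] and the event type matches the flags. */
    }

    static void my_setup(MemoryRegion *iommu_mr)
    {
        /* The notifier must outlive the registration, hence static. */
        static IOMMUNotifier n;

        iommu_notifier_init(&n, my_notify, IOMMU_NOTIFIER_ALL,
                            0, 0x40000000ULL - 1);
        memory_region_register_iommu_notifier(iommu_mr, &n);
    }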

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/vfio/common.c      | 12 +++++++++---
 hw/virtio/vhost.c     |  4 ++--
 include/exec/memory.h | 19 ++++++++++++++++++-
 memory.c              |  9 +++++++++
 4 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index f3ba9b9..6b33b9f 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -478,8 +478,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
         giommu->iommu_offset = section->offset_within_address_space -
                                section->offset_within_region;
         giommu->container = container;
-        giommu->n.notify = vfio_iommu_map_notify;
-        giommu->n.notifier_flags = IOMMU_NOTIFIER_ALL;
+        llend = int128_add(int128_make64(section->offset_within_region),
+                           section->size);
+        llend = int128_sub(llend, int128_one());
+        iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
+                            IOMMU_NOTIFIER_ALL,
+                            section->offset_within_region,
+                            int128_get64(llend));
         QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
 
         memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
@@ -550,7 +555,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
         VFIOGuestIOMMU *giommu;
 
         QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
-            if (giommu->iommu == section->mr) {
+            if (giommu->iommu == section->mr &&
+                giommu->n.start == section->offset_within_region) {
                 memory_region_unregister_iommu_notifier(giommu->iommu,
                                                         &giommu->n);
                 QLIST_REMOVE(giommu, giommu_next);
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index febe519..ccf8b2e 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1244,8 +1244,8 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
         .priority = 10
     };
 
-    hdev->n.notify = vhost_iommu_unmap_notify;
-    hdev->n.notifier_flags = IOMMU_NOTIFIER_UNMAP;
+    iommu_notifier_init(&hdev->n, vhost_iommu_unmap_notify,
+                        IOMMU_NOTIFIER_UNMAP, 0, ~0ULL);
 
     if (hdev->migration_blocker == NULL) {
         if (!(hdev->features & (0x1ULL << VHOST_F_LOG_ALL))) {
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 987f925..805a88a 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -77,13 +77,30 @@ typedef enum {
 
 #define IOMMU_NOTIFIER_ALL (IOMMU_NOTIFIER_MAP | IOMMU_NOTIFIER_UNMAP)
 
+struct IOMMUNotifier;
+typedef void (*IOMMUNotify)(struct IOMMUNotifier *notifier,
+                            IOMMUTLBEntry *data);
+
 struct IOMMUNotifier {
-    void (*notify)(struct IOMMUNotifier *notifier, IOMMUTLBEntry *data);
+    IOMMUNotify notify;
     IOMMUNotifierFlag notifier_flags;
+    /* Notify for address space range start <= addr <= end */
+    hwaddr start;
+    hwaddr end;
     QLIST_ENTRY(IOMMUNotifier) node;
 };
 typedef struct IOMMUNotifier IOMMUNotifier;
 
+static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
+                                       IOMMUNotifierFlag flags,
+                                       hwaddr start, hwaddr end)
+{
+    n->notify = fn;
+    n->notifier_flags = flags;
+    n->start = start;
+    n->end = end;
+}
+
 /* New-style MMIO accessors can indicate that the transaction failed.
  * A zero (MEMTX_OK) response means success; anything else is a failure
  * of some kind. The memory subsystem will bitwise-OR together results
diff --git a/memory.c b/memory.c
index 6c58373..4900bbf 100644
--- a/memory.c
+++ b/memory.c
@@ -1610,6 +1610,7 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr,
 
     /* We need to register for at least one bitfield */
     assert(n->notifier_flags != IOMMU_NOTIFIER_NONE);
+    assert(n->start <= n->end);
     QLIST_INSERT_HEAD(&mr->iommu_notify, n, node);
     memory_region_update_iommu_notify_flags(mr);
 }
@@ -1671,6 +1672,14 @@ void memory_region_notify_iommu(MemoryRegion *mr,
     }
 
     QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
+        /*
+         * Skip the notification if the notification does not overlap
+         * with registered range.
+         */
+        if (iommu_notifier->start > entry.iova + entry.addr_mask + 1 ||
+            iommu_notifier->end < entry.iova) {
+            continue;
+        }
         if (iommu_notifier->notifier_flags & request_flags) {
             iommu_notifier->notify(iommu_notifier, &entry);
         }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [Qemu-devel] [PATCH v7 11/17] memory: provide IOMMU_NOTIFIER_FOREACH macro
  2017-02-07  8:28 [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (9 preceding siblings ...)
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 10/17] memory: add section range info for IOMMU notifier Peter Xu
@ 2017-02-07  8:28 ` Peter Xu
  2017-02-10  2:30   ` David Gibson
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 12/17] memory: provide iommu_replay_all() Peter Xu
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2017-02-07  8:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	David Gibson, alex.williamson, bd.aviv

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 3 +++
 memory.c              | 4 ++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 805a88a..f76e174 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -239,6 +239,9 @@ struct MemoryRegion {
     IOMMUNotifierFlag iommu_notify_flags;
 };
 
+#define IOMMU_NOTIFIER_FOREACH(n, mr) \
+    QLIST_FOREACH((n), &(mr)->iommu_notify, node)
+
 /**
  * MemoryListener: callbacks structure for updates to the physical memory map
  *
diff --git a/memory.c b/memory.c
index 4900bbf..523c43f 100644
--- a/memory.c
+++ b/memory.c
@@ -1587,7 +1587,7 @@ static void memory_region_update_iommu_notify_flags(MemoryRegion *mr)
     IOMMUNotifierFlag flags = IOMMU_NOTIFIER_NONE;
     IOMMUNotifier *iommu_notifier;
 
-    QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
+    IOMMU_NOTIFIER_FOREACH(iommu_notifier, mr) {
         flags |= iommu_notifier->notifier_flags;
     }
 
@@ -1671,7 +1671,7 @@ void memory_region_notify_iommu(MemoryRegion *mr,
         request_flags = IOMMU_NOTIFIER_UNMAP;
     }
 
-    QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
+    IOMMU_NOTIFIER_FOREACH(iommu_notifier, mr) {
         /*
          * Skip the notification if the notification does not overlap
          * with registered range.
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [Qemu-devel] [PATCH v7 12/17] memory: provide iommu_replay_all()
  2017-02-07  8:28 [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (10 preceding siblings ...)
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 11/17] memory: provide IOMMU_NOTIFIER_FOREACH macro Peter Xu
@ 2017-02-07  8:28 ` Peter Xu
  2017-02-10  2:31   ` David Gibson
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 13/17] memory: introduce memory_region_notify_one() Peter Xu
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2017-02-07  8:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	David Gibson, alex.williamson, bd.aviv

This is a "global" version of the existing memory_region_iommu_replay()
- we announce the translations to all the registered notifiers, instead
of to a specific one.
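
A minimal usage sketch (my_global_invalidate() is a hypothetical caller,
not part of this patch):

    /* Re-announce the current translations to every notifier that is
     * registered on the IOMMU region, e.g. after a global cache flush. */
    static void my_global_invalidate(MemoryRegion *iommu_mr)
    {
        memory_region_iommu_replay_all(iommu_mr);
    }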

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 8 ++++++++
 memory.c              | 9 +++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index f76e174..606ce88 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -707,6 +707,14 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
                                 bool is_write);
 
 /**
+ * memory_region_iommu_replay_all: replay existing IOMMU translations
+ * to all the notifiers registered.
+ *
+ * @mr: the memory region to observe
+ */
+void memory_region_iommu_replay_all(MemoryRegion *mr);
+
+/**
  * memory_region_unregister_iommu_notifier: unregister a notifier for
  * changes to IOMMU translation entries.
  *
diff --git a/memory.c b/memory.c
index 523c43f..9e1bb75 100644
--- a/memory.c
+++ b/memory.c
@@ -1646,6 +1646,15 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
     }
 }
 
+void memory_region_iommu_replay_all(MemoryRegion *mr)
+{
+    IOMMUNotifier *notifier;
+
+    IOMMU_NOTIFIER_FOREACH(notifier, mr) {
+        memory_region_iommu_replay(mr, notifier, false);
+    }
+}
+
 void memory_region_unregister_iommu_notifier(MemoryRegion *mr,
                                              IOMMUNotifier *n)
 {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [Qemu-devel] [PATCH v7 13/17] memory: introduce memory_region_notify_one()
  2017-02-07  8:28 [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (11 preceding siblings ...)
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 12/17] memory: provide iommu_replay_all() Peter Xu
@ 2017-02-07  8:28 ` Peter Xu
  2017-02-10  2:33   ` David Gibson
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 14/17] memory: add MemoryRegionIOMMUOps.replay() callback Peter Xu
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2017-02-07  8:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	David Gibson, alex.williamson, bd.aviv

Generalize the notify logic in memory_region_notify_iommu() into a
single function. This can be further used in customized replay()
functions for IOMMUs.
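
A minimal sketch of the kind of per-notifier hook this enables (the hook
name is made up here; the VT-d replay hook added later in this series
follows the same pattern):

    /* Push one translation entry to a single, specific notifier instead
     * of broadcasting it to every notifier on the region. */
    static int my_replay_hook(IOMMUTLBEntry *entry, void *private)
    {
        memory_region_notify_one((IOMMUNotifier *)private, entry);
        return 0;
    }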

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 15 +++++++++++++++
 memory.c              | 40 ++++++++++++++++++++++++----------------
 2 files changed, 39 insertions(+), 16 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 606ce88..0767888 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -682,6 +682,21 @@ void memory_region_notify_iommu(MemoryRegion *mr,
                                 IOMMUTLBEntry entry);
 
 /**
+ * memory_region_notify_one: notify a change in an IOMMU translation
+ *                           entry to a single notifier
+ *
+ * This works just like memory_region_notify_iommu(), but it only
+ * notifies a specific notifier, not all of them.
+ *
+ * @notifier: the notifier to be notified
+ * @entry: the new entry in the IOMMU translation table.  The entry
+ *         replaces all old entries for the same virtual I/O address range.
+ *         Deleted entries have .@perm == 0.
+ */
+void memory_region_notify_one(IOMMUNotifier *notifier,
+                              IOMMUTLBEntry *entry);
+
+/**
  * memory_region_register_iommu_notifier: register a notifier for changes to
  * IOMMU translation entries.
  *
diff --git a/memory.c b/memory.c
index 9e1bb75..7a4f2f9 100644
--- a/memory.c
+++ b/memory.c
@@ -1666,32 +1666,40 @@ void memory_region_unregister_iommu_notifier(MemoryRegion *mr,
     memory_region_update_iommu_notify_flags(mr);
 }
 
-void memory_region_notify_iommu(MemoryRegion *mr,
-                                IOMMUTLBEntry entry)
+void memory_region_notify_one(IOMMUNotifier *notifier,
+                              IOMMUTLBEntry *entry)
 {
-    IOMMUNotifier *iommu_notifier;
     IOMMUNotifierFlag request_flags;
 
-    assert(memory_region_is_iommu(mr));
+    /*
+     * Skip the notification if the notification does not overlap
+     * with registered range.
+     */
+    if (notifier->start > entry->iova + entry->addr_mask + 1 ||
+        notifier->end < entry->iova) {
+        return;
+    }
 
-    if (entry.perm & IOMMU_RW) {
+    if (entry->perm & IOMMU_RW) {
         request_flags = IOMMU_NOTIFIER_MAP;
     } else {
         request_flags = IOMMU_NOTIFIER_UNMAP;
     }
 
+    if (notifier->notifier_flags & request_flags) {
+        notifier->notify(notifier, entry);
+    }
+}
+
+void memory_region_notify_iommu(MemoryRegion *mr,
+                                IOMMUTLBEntry entry)
+{
+    IOMMUNotifier *iommu_notifier;
+
+    assert(memory_region_is_iommu(mr));
+
     IOMMU_NOTIFIER_FOREACH(iommu_notifier, mr) {
-        /*
-         * Skip the notification if the notification does not overlap
-         * with registered range.
-         */
-        if (iommu_notifier->start > entry.iova + entry.addr_mask + 1 ||
-            iommu_notifier->end < entry.iova) {
-            continue;
-        }
-        if (iommu_notifier->notifier_flags & request_flags) {
-            iommu_notifier->notify(iommu_notifier, &entry);
-        }
+        memory_region_notify_one(iommu_notifier, &entry);
     }
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [Qemu-devel] [PATCH v7 14/17] memory: add MemoryRegionIOMMUOps.replay() callback
  2017-02-07  8:28 [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (12 preceding siblings ...)
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 13/17] memory: introduce memory_region_notify_one() Peter Xu
@ 2017-02-07  8:28 ` Peter Xu
  2017-02-10  2:34   ` David Gibson
  2017-03-27  8:35   ` Liu, Yi L
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 15/17] intel_iommu: provide its own replay() callback Peter Xu
                   ` (4 subsequent siblings)
  18 siblings, 2 replies; 63+ messages in thread
From: Peter Xu @ 2017-02-07  8:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	David Gibson, alex.williamson, bd.aviv

Originally we have one memory_region_iommu_replay() function, which
implements the default behavior of replaying the translations of the
whole IOMMU region. However, on some platforms like x86, we may want our
own replay logic for IOMMU regions. This patch adds one more hook to
MemoryRegionIOMMUOps for the callback, and it'll override the default if
set.
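
A minimal sketch of how an IOMMU model plugs in its own replay (the
names below are hypothetical; the VT-d patch later in this series sets
s->iommu_ops.replay = vtd_iommu_replay in the same way):

    /* Custom replay: walk the guest page tables and notify @n for each
     * valid mapping, instead of the default page-by-page linear scan. */
    static void my_iommu_replay(MemoryRegion *iommu, IOMMUNotifier *n)
    {
        /* ... IOMMU specific page walk ... */
    }

    static void my_iommu_ops_setup(MemoryRegionIOMMUOps *ops)
    {
        ops->replay = my_iommu_replay;
    }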

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 2 ++
 memory.c              | 6 ++++++
 2 files changed, 8 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 0767888..30b2a74 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
     void (*notify_flag_changed)(MemoryRegion *iommu,
                                 IOMMUNotifierFlag old_flags,
                                 IOMMUNotifierFlag new_flags);
+    /* Set this up to provide customized IOMMU replay function */
+    void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
 };
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
diff --git a/memory.c b/memory.c
index 7a4f2f9..9c253cc 100644
--- a/memory.c
+++ b/memory.c
@@ -1630,6 +1630,12 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
     hwaddr addr, granularity;
     IOMMUTLBEntry iotlb;
 
+    /* If the IOMMU has its own replay callback, override */
+    if (mr->iommu_ops->replay) {
+        mr->iommu_ops->replay(mr, n);
+        return;
+    }
+
     granularity = memory_region_iommu_get_min_page_size(mr);
 
     for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [Qemu-devel] [PATCH v7 15/17] intel_iommu: provide its own replay() callback
  2017-02-07  8:28 [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (13 preceding siblings ...)
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 14/17] memory: add MemoryRegionIOMMUOps.replay() callback Peter Xu
@ 2017-02-07  8:28 ` Peter Xu
  2017-02-10  2:36   ` David Gibson
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 16/17] intel_iommu: allow dynamic switch of IOMMU region Peter Xu
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2017-02-07  8:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	David Gibson, alex.williamson, bd.aviv

The default replay() doesn't work for VT-d since VT-d will have a huge
default memory region which covers address range 0-(2^64-1). This will
normally consume a lot of time (which looks like a dead loop).

The solution is simple - we don't walk over all the regions. Instead, we
jump over the regions when we find that the page directories are empty.
It'll greatly reduce the time to walk the whole region.

To achieve this, we provide a page walk helper to do that, invoking a
corresponding hook function when we find a page we are interested in.
vtd_page_walk_level() is the core logic for the page walking. Its
interface is designed to suit further use cases, e.g., invalidating a
range of addresses.
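
A standalone sketch (not QEMU code; constants are assumed values) of the
"jump over empty ranges" arithmetic used by the walker: when an entry at
some level is absent, the walk steps to the next entry boundary of that
level instead of visiting every 4K page underneath it:

    #include <inttypes.h>
    #include <stdio.h>

    #define PAGE_SHIFT_4K 12   /* assumed */
    #define LEVEL_BITS    9    /* assumed */

    int main(void)
    {
        uint32_t level = 3;              /* 1GB-sized entries */
        uint64_t iova = 0x40001000ULL;   /* inside an empty 1GB range */
        uint64_t size = 1ULL << (PAGE_SHIFT_4K + (level - 1) * LEVEL_BITS);
        uint64_t mask = ~(size - 1);
        /* next IOVA to visit if the current level-3 entry is empty */
        uint64_t iova_next = (iova & mask) + size;

        /* prints 0x80000000: one step skips the whole empty 1GB range */
        printf("0x%" PRIx64 "\n", iova_next);
        return 0;
    }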

Reviewed-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 182 ++++++++++++++++++++++++++++++++++++++++++++++++--
 hw/i386/trace-events  |   7 ++
 include/exec/memory.h |   2 +
 3 files changed, 186 insertions(+), 5 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 22d8226..f8d5713 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -595,6 +595,22 @@ static inline uint32_t vtd_get_agaw_from_context_entry(VTDContextEntry *ce)
     return 30 + (ce->hi & VTD_CONTEXT_ENTRY_AW) * 9;
 }
 
+static inline uint64_t vtd_iova_limit(VTDContextEntry *ce)
+{
+    uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
+    return 1ULL << MIN(ce_agaw, VTD_MGAW);
+}
+
+/* Return true if IOVA passes range check, otherwise false. */
+static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce)
+{
+    /*
+     * Check if @iova is above 2^X-1, where X is the minimum of MGAW
+     * in CAP_REG and AW in context-entry.
+     */
+    return !(iova & ~(vtd_iova_limit(ce) - 1));
+}
+
 static const uint64_t vtd_paging_entry_rsvd_field[] = {
     [0] = ~0ULL,
     /* For not large page */
@@ -630,13 +646,9 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
     uint32_t level = vtd_get_level_from_context_entry(ce);
     uint32_t offset;
     uint64_t slpte;
-    uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
     uint64_t access_right_check;
 
-    /* Check if @iova is above 2^X-1, where X is the minimum of MGAW
-     * in CAP_REG and AW in context-entry.
-     */
-    if (iova & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
+    if (!vtd_iova_range_check(iova, ce)) {
         VTD_DPRINTF(GENERAL, "error: iova 0x%"PRIx64 " exceeds limits", iova);
         return -VTD_FR_ADDR_BEYOND_MGAW;
     }
@@ -684,6 +696,134 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
     }
 }
 
+typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
+
+/**
+ * vtd_page_walk_level - walk over specific level for IOVA range
+ *
+ * @addr: base GPA addr to start the walk
+ * @start: IOVA range start address
+ * @end: IOVA range end address (start <= addr < end)
+ * @hook_fn: hook func to be called when detected page
+ * @private: private data to be passed into hook func
+ * @read: whether parent level has read permission
+ * @write: whether parent level has write permission
+ * @notify_unmap: whether we should notify invalid entries
+ */
+static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
+                               uint64_t end, vtd_page_walk_hook hook_fn,
+                               void *private, uint32_t level,
+                               bool read, bool write, bool notify_unmap)
+{
+    bool read_cur, write_cur, entry_valid;
+    uint32_t offset;
+    uint64_t slpte;
+    uint64_t subpage_size, subpage_mask;
+    IOMMUTLBEntry entry;
+    uint64_t iova = start;
+    uint64_t iova_next;
+    int ret = 0;
+
+    trace_vtd_page_walk_level(addr, level, start, end);
+
+    subpage_size = 1ULL << vtd_slpt_level_shift(level);
+    subpage_mask = vtd_slpt_level_page_mask(level);
+
+    while (iova < end) {
+        iova_next = (iova & subpage_mask) + subpage_size;
+
+        offset = vtd_iova_level_offset(iova, level);
+        slpte = vtd_get_slpte(addr, offset);
+
+        if (slpte == (uint64_t)-1) {
+            trace_vtd_page_walk_skip_read(iova, iova_next);
+            goto next;
+        }
+
+        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
+            trace_vtd_page_walk_skip_reserve(iova, iova_next);
+            goto next;
+        }
+
+        /* Permissions are stacked with parents' */
+        read_cur = read && (slpte & VTD_SL_R);
+        write_cur = write && (slpte & VTD_SL_W);
+
+        /*
+         * As long as we have either read/write permission, this is a
+         * valid entry. The rule works for both page entries and page
+         * table entries.
+         */
+        entry_valid = read_cur | write_cur;
+
+        if (vtd_is_last_slpte(slpte, level)) {
+            entry.target_as = &address_space_memory;
+            entry.iova = iova & subpage_mask;
+            /* NOTE: this is only meaningful if entry_valid == true */
+            entry.translated_addr = vtd_get_slpte_addr(slpte);
+            entry.addr_mask = ~subpage_mask;
+            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
+            if (!entry_valid && !notify_unmap) {
+                trace_vtd_page_walk_skip_perm(iova, iova_next);
+                goto next;
+            }
+            trace_vtd_page_walk_one(level, entry.iova, entry.translated_addr,
+                                    entry.addr_mask, entry.perm);
+            if (hook_fn) {
+                ret = hook_fn(&entry, private);
+                if (ret < 0) {
+                    return ret;
+                }
+            }
+        } else {
+            if (!entry_valid) {
+                trace_vtd_page_walk_skip_perm(iova, iova_next);
+                goto next;
+            }
+            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte), iova,
+                                      MIN(iova_next, end), hook_fn, private,
+                                      level - 1, read_cur, write_cur,
+                                      notify_unmap);
+            if (ret < 0) {
+                return ret;
+            }
+        }
+
+next:
+        iova = iova_next;
+    }
+
+    return 0;
+}
+
+/**
+ * vtd_page_walk - walk specific IOVA range, and call the hook
+ *
+ * @ce: context entry to walk upon
+ * @start: IOVA address to start the walk
+ * @end: IOVA range end address (start <= addr < end)
+ * @hook_fn: the hook that to be called for each detected area
+ * @private: private data for the hook function
+ */
+static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
+                         vtd_page_walk_hook hook_fn, void *private)
+{
+    dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
+    uint32_t level = vtd_get_level_from_context_entry(ce);
+
+    if (!vtd_iova_range_check(start, ce)) {
+        return -VTD_FR_ADDR_BEYOND_MGAW;
+    }
+
+    if (!vtd_iova_range_check(end, ce)) {
+        /* Fix end so that it reaches the maximum */
+        end = vtd_iova_limit(ce);
+    }
+
+    return vtd_page_walk_level(addr, start, end, hook_fn, private,
+                               level, true, true, false);
+}
+
 /* Map a device to its corresponding domain (context-entry) */
 static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
                                     uint8_t devfn, VTDContextEntry *ce)
@@ -2402,6 +2542,37 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
     return vtd_dev_as;
 }
 
+static int vtd_replay_hook(IOMMUTLBEntry *entry, void *private)
+{
+    memory_region_notify_one((IOMMUNotifier *)private, entry);
+    return 0;
+}
+
+static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
+{
+    VTDAddressSpace *vtd_as = container_of(mr, VTDAddressSpace, iommu);
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    uint8_t bus_n = pci_bus_num(vtd_as->bus);
+    VTDContextEntry ce;
+
+    if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
+        /*
+         * Scanned a valid context entry, walk over the pages and
+         * notify when needed.
+         */
+        trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
+                                  PCI_FUNC(vtd_as->devfn),
+                                  VTD_CONTEXT_ENTRY_DID(ce.hi),
+                                  ce.hi, ce.lo);
+        vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n);
+    } else {
+        trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
+                                    PCI_FUNC(vtd_as->devfn));
+    }
+
+    return;
+}
+
 /* Do the initialization. It will also be called when reset, so pay
  * attention when adding new initialization stuff.
  */
@@ -2416,6 +2587,7 @@ static void vtd_init(IntelIOMMUState *s)
 
     s->iommu_ops.translate = vtd_iommu_translate;
     s->iommu_ops.notify_flag_changed = vtd_iommu_notify_flag_changed;
+    s->iommu_ops.replay = vtd_iommu_replay;
     s->root = 0;
     s->root_extended = false;
     s->dmar_enabled = false;
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 88ad5e4..463db0d 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -30,6 +30,13 @@ vtd_iotlb_cc_hit(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32
 vtd_iotlb_cc_update(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen1, uint32_t gen2) "IOTLB context update bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32" -> gen %"PRIu32
 vtd_iotlb_reset(const char *reason) "IOTLB reset (reason: %s)"
 vtd_fault_disabled(void) "Fault processing disabled for context entry"
+vtd_replay_ce_valid(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t domain, uint64_t hi, uint64_t lo) "replay valid context device %02"PRIx8":%02"PRIx8".%02"PRIx8" domain 0x%"PRIx16" hi 0x%"PRIx64" lo 0x%"PRIx64
+vtd_replay_ce_invalid(uint8_t bus, uint8_t dev, uint8_t fn) "replay invalid context device %02"PRIx8":%02"PRIx8".%02"PRIx8
+vtd_page_walk_level(uint64_t addr, uint32_t level, uint64_t start, uint64_t end) "walk (base=0x%"PRIx64", level=%"PRIu32") iova range 0x%"PRIx64" - 0x%"PRIx64
+vtd_page_walk_one(uint32_t level, uint64_t iova, uint64_t gpa, uint64_t mask, int perm) "detected page level 0x%"PRIx32" iova 0x%"PRIx64" -> gpa 0x%"PRIx64" mask 0x%"PRIx64" perm %d"
+vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
+vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
+vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
 
 # hw/i386/amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 30b2a74..267f399 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -55,6 +55,8 @@ typedef enum {
     IOMMU_RW   = 3,
 } IOMMUAccessFlags;
 
+#define IOMMU_ACCESS_FLAG(r, w) (((r) ? IOMMU_RO : 0) | ((w) ? IOMMU_WO : 0))
+
 struct IOMMUTLBEntry {
     AddressSpace    *target_as;
     hwaddr           iova;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [Qemu-devel] [PATCH v7 16/17] intel_iommu: allow dynamic switch of IOMMU region
  2017-02-07  8:28 [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (14 preceding siblings ...)
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 15/17] intel_iommu: provide its own replay() callback Peter Xu
@ 2017-02-07  8:28 ` Peter Xu
  2017-02-10  2:38   ` David Gibson
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 17/17] intel_iommu: enable vfio devices Peter Xu
                   ` (2 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2017-02-07  8:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	David Gibson, alex.williamson, bd.aviv

This is preparation work to finally enable dynamic switching ON/OFF of
VT-d protection. The old VT-d code uses a static IOMMU address space,
and that won't satisfy vfio-pci device listeners.

Let me explain.

vfio-pci devices depend on the memory region listener and IOMMU replay
mechanism to make sure the device mapping is coherent with the guest
even if there are domain switches. And there are two kinds of domain
switches:

  (1) switch from domain A -> B
  (2) switch from domain A -> no domain (e.g., turn DMAR off)

Case (1) is handled by the context entry invalidation handling by the
VT-d replay logic. What the replay function should do here is to replay
the existing page mappings in domain B.

However for case (2), we don't want to replay any domain mappings - we
just need the default GPA->HPA mappings (the address_space_memory
mapping). And this patch helps with case (2) by building up the mapping
automatically, leveraging the vfio-pci memory listeners.

Another important thing that this patch does is to separate IR
(Interrupt Remapping) from DMAR (DMA Remapping). The IR region should
not depend on the DMAR region (like it did before this patch). It should
be a standalone region, and it should be able to be activated without
DMAR (which is a common behavior of the Linux kernel - by default it
enables IR while leaving DMAR disabled).
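
For readers skimming the diff, the core of the switch is just a pair of
memory_region_set_enabled() calls on two same-priority subregions of the
per-device root (the helper name here is made up; it restates
vtd_switch_address_space() below in isolation):

    static void dmar_toggle(MemoryRegion *sys_alias, MemoryRegion *iommu,
                            bool dmar_enabled)
    {
        /* Turn the old view off before turning the new one on, so the
         * device never sees both subregions enabled at the same time. */
        if (dmar_enabled) {
            memory_region_set_enabled(sys_alias, false);
            memory_region_set_enabled(iommu, true);
        } else {
            memory_region_set_enabled(iommu, false);
            memory_region_set_enabled(sys_alias, true);
        }
    }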

Reviewed-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c         | 78 ++++++++++++++++++++++++++++++++++++++++---
 hw/i386/trace-events          |  2 +-
 include/hw/i386/intel_iommu.h |  2 ++
 3 files changed, 77 insertions(+), 5 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index f8d5713..4fe161f 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1291,9 +1291,49 @@ static void vtd_handle_gcmd_sirtp(IntelIOMMUState *s)
     vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_IRTPS);
 }
 
+static void vtd_switch_address_space(VTDAddressSpace *as)
+{
+    assert(as);
+
+    trace_vtd_switch_address_space(pci_bus_num(as->bus),
+                                   VTD_PCI_SLOT(as->devfn),
+                                   VTD_PCI_FUNC(as->devfn),
+                                   as->iommu_state->dmar_enabled);
+
+    /* Turn off first then on the other */
+    if (as->iommu_state->dmar_enabled) {
+        memory_region_set_enabled(&as->sys_alias, false);
+        memory_region_set_enabled(&as->iommu, true);
+    } else {
+        memory_region_set_enabled(&as->iommu, false);
+        memory_region_set_enabled(&as->sys_alias, true);
+    }
+}
+
+static void vtd_switch_address_space_all(IntelIOMMUState *s)
+{
+    GHashTableIter iter;
+    VTDBus *vtd_bus;
+    int i;
+
+    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
+    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
+        for (i = 0; i < X86_IOMMU_PCI_DEVFN_MAX; i++) {
+            if (!vtd_bus->dev_as[i]) {
+                continue;
+            }
+            vtd_switch_address_space(vtd_bus->dev_as[i]);
+        }
+    }
+}
+
 /* Handle Translation Enable/Disable */
 static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
 {
+    if (s->dmar_enabled == en) {
+        return;
+    }
+
     VTD_DPRINTF(CSR, "Translation Enable %s", (en ? "on" : "off"));
 
     if (en) {
@@ -1308,6 +1348,8 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
         /* Ok - report back to driver */
         vtd_set_clear_mask_long(s, DMAR_GSTS_REG, VTD_GSTS_TES, 0);
     }
+
+    vtd_switch_address_space_all(s);
 }
 
 /* Handle Interrupt Remap Enable/Disable */
@@ -2529,15 +2571,43 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
         vtd_dev_as->devfn = (uint8_t)devfn;
         vtd_dev_as->iommu_state = s;
         vtd_dev_as->context_cache_entry.context_cache_gen = 0;
+
+        /*
+         * Memory region relationships looks like (Address range shows
+         * only lower 32 bits to make it short in length...):
+         *
+         * |-----------------+-------------------+----------|
+         * | Name            | Address range     | Priority |
+         * |-----------------+-------------------+----------+
+         * | vtd_root        | 00000000-ffffffff |        0 |
+         * |  intel_iommu    | 00000000-ffffffff |        1 |
+         * |  vtd_sys_alias  | 00000000-ffffffff |        1 |
+         * |  intel_iommu_ir | fee00000-feefffff |       64 |
+         * |-----------------+-------------------+----------|
+         *
+         * We enable/disable DMAR by switching enablement for
+         * vtd_sys_alias and intel_iommu regions. IR region is always
+         * enabled.
+         */
         memory_region_init_iommu(&vtd_dev_as->iommu, OBJECT(s),
                                  &s->iommu_ops, "intel_iommu", UINT64_MAX);
+        memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
+                                 "vtd_sys_alias", get_system_memory(),
+                                 0, memory_region_size(get_system_memory()));
         memory_region_init_io(&vtd_dev_as->iommu_ir, OBJECT(s),
                               &vtd_mem_ir_ops, s, "intel_iommu_ir",
                               VTD_INTERRUPT_ADDR_SIZE);
-        memory_region_add_subregion(&vtd_dev_as->iommu, VTD_INTERRUPT_ADDR_FIRST,
-                                    &vtd_dev_as->iommu_ir);
-        address_space_init(&vtd_dev_as->as,
-                           &vtd_dev_as->iommu, name);
+        memory_region_init(&vtd_dev_as->root, OBJECT(s),
+                           "vtd_root", UINT64_MAX);
+        memory_region_add_subregion_overlap(&vtd_dev_as->root,
+                                            VTD_INTERRUPT_ADDR_FIRST,
+                                            &vtd_dev_as->iommu_ir, 64);
+        address_space_init(&vtd_dev_as->as, &vtd_dev_as->root, name);
+        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
+                                            &vtd_dev_as->sys_alias, 1);
+        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
+                                            &vtd_dev_as->iommu, 1);
+        vtd_switch_address_space(vtd_dev_as);
     }
     return vtd_dev_as;
 }
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 463db0d..ebb650b 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -4,7 +4,6 @@
 x86_iommu_iec_notify(bool global, uint32_t index, uint32_t mask) "Notify IEC invalidation: global=%d index=%" PRIu32 " mask=%" PRIu32
 
 # hw/i386/intel_iommu.c
-vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
 vtd_inv_desc(const char *type, uint64_t hi, uint64_t lo) "invalidate desc type %s high 0x%"PRIx64" low 0x%"PRIx64
 vtd_inv_desc_invalid(uint64_t hi, uint64_t lo) "invalid inv desc hi 0x%"PRIx64" lo 0x%"PRIx64
 vtd_inv_desc_cc_domain(uint16_t domain) "context invalidate domain 0x%"PRIx16
@@ -37,6 +36,7 @@ vtd_page_walk_one(uint32_t level, uint64_t iova, uint64_t gpa, uint64_t mask, in
 vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
 vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
 vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
+vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
 
 # hw/i386/amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index fe645aa..8f212a1 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -83,6 +83,8 @@ struct VTDAddressSpace {
     uint8_t devfn;
     AddressSpace as;
     MemoryRegion iommu;
+    MemoryRegion root;
+    MemoryRegion sys_alias;
     MemoryRegion iommu_ir;      /* Interrupt region: 0xfeeXXXXX */
     IntelIOMMUState *iommu_state;
     VTDContextCacheEntry context_cache_entry;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [Qemu-devel] [PATCH v7 17/17] intel_iommu: enable vfio devices
  2017-02-07  8:28 [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (15 preceding siblings ...)
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 16/17] intel_iommu: allow dynamic switch of IOMMU region Peter Xu
@ 2017-02-07  8:28 ` Peter Xu
  2017-02-10  6:24   ` Jason Wang
  2017-03-16  4:05   ` Peter Xu
  2017-02-17 17:18 ` [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances Alex Williamson
  2017-02-28  7:52 ` Peter Xu
  18 siblings, 2 replies; 63+ messages in thread
From: Peter Xu @ 2017-02-07  8:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	David Gibson, alex.williamson, bd.aviv

This patch is based on Aviv Ben-David (<bd.aviv@gmail.com>)'s patch
upstream:

  "IOMMU: enable intel_iommu map and unmap notifiers"
  https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01453.html

However I removed/fixed some content, and added my own code.

Instead of calling translate() on every page for iotlb invalidations
(which is slower), we walk the pages when needed and notify in a hook
function.

This patch enables vfio devices for VT-d emulation.

And, since we already have vhost DMAR support via device-iotlb, a
natural benefit that this patch brings is that VT-d enabled vhost can
now work even without the ATS capability. Though more tests are needed.
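
One detail in the diff below worth a worked example: an IOMMUTLBEntry
describes a range with an address mask, so vtd_address_space_unmap() has
to round a non-power-of-two range up before notifying. A standalone
sketch (not QEMU code; the real code additionally clamps the result to
VTD_MGAW):

    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t start = 0x1000, end = 0x6000;  /* example: 0x5000 bytes */
        uint64_t size = end - start;
        uint64_t iova;

        if (__builtin_popcountll(size) != 1) {
            /* not a power of two: enlarge to the smallest covering size */
            size = 1ULL << (64 - __builtin_clzll(size));
        }
        /* align the start down so (iova, size - 1) forms a valid mask */
        iova = start & ~(size - 1);

        /* prints iova 0x0 size 0x8000: one notification covers the range */
        printf("iova 0x%" PRIx64 " size 0x%" PRIx64 "\n", iova, size);
        return 0;
    }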

Reviewed-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c          | 191 ++++++++++++++++++++++++++++++++++++++---
 hw/i386/intel_iommu_internal.h |   1 +
 hw/i386/trace-events           |   1 +
 include/hw/i386/intel_iommu.h  |   8 ++
 4 files changed, 188 insertions(+), 13 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 4fe161f..9b1ba1b 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -806,7 +806,8 @@ next:
  * @private: private data for the hook function
  */
 static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
-                         vtd_page_walk_hook hook_fn, void *private)
+                         vtd_page_walk_hook hook_fn, void *private,
+                         bool notify_unmap)
 {
     dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
     uint32_t level = vtd_get_level_from_context_entry(ce);
@@ -821,7 +822,7 @@ static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
     }
 
     return vtd_page_walk_level(addr, start, end, hook_fn, private,
-                               level, true, true, false);
+                               level, true, true, notify_unmap);
 }
 
 /* Map a device to its corresponding domain (context-entry) */
@@ -1038,6 +1039,15 @@ static void vtd_interrupt_remap_table_setup(IntelIOMMUState *s)
                 s->intr_root, s->intr_size);
 }
 
+static void vtd_iommu_replay_all(IntelIOMMUState *s)
+{
+    IntelIOMMUNotifierNode *node;
+
+    QLIST_FOREACH(node, &s->notifiers_list, next) {
+        memory_region_iommu_replay_all(&node->vtd_as->iommu);
+    }
+}
+
 static void vtd_context_global_invalidate(IntelIOMMUState *s)
 {
     trace_vtd_inv_desc_cc_global();
@@ -1045,6 +1055,14 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
     if (s->context_cache_gen == VTD_CONTEXT_CACHE_GEN_MAX) {
         vtd_reset_context_cache(s);
     }
+    /*
+     * From VT-d spec 6.5.2.1, a global context entry invalidation
+     * should be followed by a IOTLB global invalidation, so we should
+     * be safe even without this. Hoewever, let's replay the region as
+     * well to be safer, and go back here when we need finer tunes for
+     * VT-d emulation codes.
+     */
+    vtd_iommu_replay_all(s);
 }
 
 
@@ -1111,6 +1129,16 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
                 trace_vtd_inv_desc_cc_device(bus_n, VTD_PCI_SLOT(devfn_it),
                                              VTD_PCI_FUNC(devfn_it));
                 vtd_as->context_cache_entry.context_cache_gen = 0;
+                /*
+                 * So a device is moving out of (or moving into) a
+                 * domain, a replay() suites here to notify all the
+                 * IOMMU_NOTIFIER_MAP registers about this change.
+                 * This won't bring bad even if we have no such
+                 * notifier registered - the IOMMU notification
+                 * framework will skip MAP notifications if that
+                 * happened.
+                 */
+                memory_region_iommu_replay_all(&vtd_as->iommu);
             }
         }
     }
@@ -1152,12 +1180,53 @@ static void vtd_iotlb_global_invalidate(IntelIOMMUState *s)
 {
     trace_vtd_iotlb_reset("global invalidation recved");
     vtd_reset_iotlb(s);
+    vtd_iommu_replay_all(s);
 }
 
 static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
 {
+    IntelIOMMUNotifierNode *node;
+    VTDContextEntry ce;
+    VTDAddressSpace *vtd_as;
+
     g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_domain,
                                 &domain_id);
+
+    QLIST_FOREACH(node, &s->notifiers_list, next) {
+        vtd_as = node->vtd_as;
+        if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
+                                      vtd_as->devfn, &ce) &&
+            domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
+            memory_region_iommu_replay_all(&vtd_as->iommu);
+        }
+    }
+}
+
+static int vtd_page_invalidate_notify_hook(IOMMUTLBEntry *entry,
+                                           void *private)
+{
+    memory_region_notify_iommu((MemoryRegion *)private, *entry);
+    return 0;
+}
+
+static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
+                                           uint16_t domain_id, hwaddr addr,
+                                           uint8_t am)
+{
+    IntelIOMMUNotifierNode *node;
+    VTDContextEntry ce;
+    int ret;
+
+    QLIST_FOREACH(node, &(s->notifiers_list), next) {
+        VTDAddressSpace *vtd_as = node->vtd_as;
+        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
+                                       vtd_as->devfn, &ce);
+        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
+            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
+                          vtd_page_invalidate_notify_hook,
+                          (void *)&vtd_as->iommu, true);
+        }
+    }
 }
 
 static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
@@ -1170,6 +1239,7 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
     info.addr = addr;
     info.mask = ~((1 << am) - 1);
     g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
+    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
 }
 
 /* Flush IOTLB
@@ -2187,15 +2257,33 @@ static void vtd_iommu_notify_flag_changed(MemoryRegion *iommu,
                                           IOMMUNotifierFlag new)
 {
     VTDAddressSpace *vtd_as = container_of(iommu, VTDAddressSpace, iommu);
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    IntelIOMMUNotifierNode *node = NULL;
+    IntelIOMMUNotifierNode *next_node = NULL;
 
-    if (new & IOMMU_NOTIFIER_MAP) {
-        error_report("Device at bus %s addr %02x.%d requires iommu "
-                     "notifier which is currently not supported by "
-                     "intel-iommu emulation",
-                     vtd_as->bus->qbus.name, PCI_SLOT(vtd_as->devfn),
-                     PCI_FUNC(vtd_as->devfn));
+    if (!s->caching_mode && new & IOMMU_NOTIFIER_MAP) {
+        error_report("We need to set cache_mode=1 for intel-iommu to enable "
+                     "device assignment with IOMMU protection.");
         exit(1);
     }
+
+    if (old == IOMMU_NOTIFIER_NONE) {
+        node = g_malloc0(sizeof(*node));
+        node->vtd_as = vtd_as;
+        QLIST_INSERT_HEAD(&s->notifiers_list, node, next);
+        return;
+    }
+
+    /* update notifier node with new flags */
+    QLIST_FOREACH_SAFE(node, &s->notifiers_list, next, next_node) {
+        if (node->vtd_as == vtd_as) {
+            if (new == IOMMU_NOTIFIER_NONE) {
+                QLIST_REMOVE(node, next);
+                g_free(node);
+            }
+            return;
+        }
+    }
 }
 
 static const VMStateDescription vtd_vmstate = {
@@ -2612,6 +2700,74 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
     return vtd_dev_as;
 }
 
+/* Unmap the whole range in the notifier's scope. */
+static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
+{
+    IOMMUTLBEntry entry;
+    hwaddr size;
+    hwaddr start = n->start;
+    hwaddr end = n->end;
+
+    /*
+     * Note: all the codes in this function has a assumption that IOVA
+     * bits are no more than VTD_MGAW bits (which is restricted by
+     * VT-d spec), otherwise we need to consider overflow of 64 bits.
+     */
+
+    if (end > VTD_ADDRESS_SIZE) {
+        /*
+         * Don't need to unmap regions that is bigger than the whole
+         * VT-d supported address space size
+         */
+        end = VTD_ADDRESS_SIZE;
+    }
+
+    assert(start <= end);
+    size = end - start;
+
+    if (ctpop64(size) != 1) {
+        /*
+         * This size cannot format a correct mask. Let's enlarge it to
+         * suite the minimum available mask.
+         */
+        int n = 64 - clz64(size);
+        if (n > VTD_MGAW) {
+            /* should not happen, but in case it happens, limit it */
+            n = VTD_MGAW;
+        }
+        size = 1ULL << n;
+    }
+
+    entry.target_as = &address_space_memory;
+    /* Adjust iova for the size */
+    entry.iova = n->start & ~(size - 1);
+    /* This field is meaningless for unmap */
+    entry.translated_addr = 0;
+    entry.perm = IOMMU_NONE;
+    entry.addr_mask = size - 1;
+
+    trace_vtd_as_unmap_whole(pci_bus_num(as->bus),
+                             VTD_PCI_SLOT(as->devfn),
+                             VTD_PCI_FUNC(as->devfn),
+                             entry.iova, size);
+
+    memory_region_notify_one(n, &entry);
+}
+
+static void vtd_address_space_unmap_all(IntelIOMMUState *s)
+{
+    IntelIOMMUNotifierNode *node;
+    VTDAddressSpace *vtd_as;
+    IOMMUNotifier *n;
+
+    QLIST_FOREACH(node, &s->notifiers_list, next) {
+        vtd_as = node->vtd_as;
+        IOMMU_NOTIFIER_FOREACH(n, &vtd_as->iommu) {
+            vtd_address_space_unmap(vtd_as, n);
+        }
+    }
+}
+
 static int vtd_replay_hook(IOMMUTLBEntry *entry, void *private)
 {
     memory_region_notify_one((IOMMUNotifier *)private, entry);
@@ -2625,16 +2781,19 @@ static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
     uint8_t bus_n = pci_bus_num(vtd_as->bus);
     VTDContextEntry ce;
 
+    /*
+     * The replay can be triggered by either a invalidation or a newly
+     * created entry. No matter what, we release existing mappings
+     * (it means flushing caches for UNMAP-only registers).
+     */
+    vtd_address_space_unmap(vtd_as, n);
+
     if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
-        /*
-         * Scanned a valid context entry, walk over the pages and
-         * notify when needed.
-         */
         trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
                                   PCI_FUNC(vtd_as->devfn),
                                   VTD_CONTEXT_ENTRY_DID(ce.hi),
                                   ce.hi, ce.lo);
-        vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n);
+        vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false);
     } else {
         trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
                                     PCI_FUNC(vtd_as->devfn));
@@ -2753,6 +2912,11 @@ static void vtd_reset(DeviceState *dev)
 
     VTD_DPRINTF(GENERAL, "");
     vtd_init(s);
+
+    /*
+     * When device reset, throw away all mappings and external caches
+     */
+    vtd_address_space_unmap_all(s);
 }
 
 static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
@@ -2816,6 +2980,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
         return;
     }
 
+    QLIST_INIT(&s->notifiers_list);
     memset(s->vtd_as_by_bus_num, 0, sizeof(s->vtd_as_by_bus_num));
     memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
                           "intel_iommu", DMAR_REG_SIZE);
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 4104121..29d6707 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -197,6 +197,7 @@
 #define VTD_DOMAIN_ID_MASK          ((1UL << VTD_DOMAIN_ID_SHIFT) - 1)
 #define VTD_CAP_ND                  (((VTD_DOMAIN_ID_SHIFT - 4) / 2) & 7ULL)
 #define VTD_MGAW                    39  /* Maximum Guest Address Width */
+#define VTD_ADDRESS_SIZE            (1ULL << VTD_MGAW)
 #define VTD_CAP_MGAW                (((VTD_MGAW - 1) & 0x3fULL) << 16)
 #define VTD_MAMV                    18ULL
 #define VTD_CAP_MAMV                (VTD_MAMV << 48)
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index ebb650b..77d4373 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -37,6 +37,7 @@ vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"P
 vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
 vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
 vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
+vtd_as_unmap_whole(uint8_t bus, uint8_t slot, uint8_t fn, uint64_t iova, uint64_t size) "Device %02x:%02x.%x start 0x%"PRIx64" size 0x%"PRIx64
 
 # hw/i386/amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 8f212a1..3e51876 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -63,6 +63,7 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
 typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
 typedef struct VTDIrq VTDIrq;
 typedef struct VTD_MSIMessage VTD_MSIMessage;
+typedef struct IntelIOMMUNotifierNode IntelIOMMUNotifierNode;
 
 /* Context-Entry */
 struct VTDContextEntry {
@@ -249,6 +250,11 @@ struct VTD_MSIMessage {
 /* When IR is enabled, all MSI/MSI-X data bits should be zero */
 #define VTD_IR_MSI_DATA          (0)
 
+struct IntelIOMMUNotifierNode {
+    VTDAddressSpace *vtd_as;
+    QLIST_ENTRY(IntelIOMMUNotifierNode) next;
+};
+
 /* The iommu (DMAR) device state struct */
 struct IntelIOMMUState {
     X86IOMMUState x86_iommu;
@@ -286,6 +292,8 @@ struct IntelIOMMUState {
     MemoryRegionIOMMUOps iommu_ops;
     GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
     VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
+    /* list of registered notifiers */
+    QLIST_HEAD(, IntelIOMMUNotifierNode) notifiers_list;
 
     /* interrupt remapping */
     bool intr_enabled;              /* Whether guest enabled IR */
-- 
2.7.4
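
For readers following the notifiers_list bookkeeping above: on a full reset the
list can simply be walked and each registered notifier told to drop its
mappings. A minimal sketch of vtd_address_space_unmap_all(), assuming it lives
in intel_iommu.c next to the helpers in this patch and that each
VTDAddressSpace carries its notifier in a field here called iommu_notifier
(the field name is an assumption of the sketch):

    static void vtd_address_space_unmap_all(IntelIOMMUState *s)
    {
        IntelIOMMUNotifierNode *node;
        VTDAddressSpace *vtd_as;

        /* Walk every address space that has a notifier registered and
         * send it a full-range UNMAP via vtd_address_space_unmap(). */
        QLIST_FOREACH(node, &s->notifiers_list, next) {
            vtd_as = node->vtd_as;
            vtd_address_space_unmap(vtd_as, &vtd_as->iommu_notifier);
        }
    }

This is what the vtd_reset() hunk above relies on when it calls
vtd_address_space_unmap_all(s).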

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 07/17] intel_iommu: convert dbg macros to traces for inv
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 07/17] intel_iommu: convert dbg macros to traces for inv Peter Xu
@ 2017-02-08  2:47   ` Jason Wang
  2017-02-10  1:19   ` David Gibson
  1 sibling, 0 replies; 63+ messages in thread
From: Jason Wang @ 2017-02-08  2:47 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, alex.williamson,
	bd.aviv, David Gibson



On 2017-02-07 16:28, Peter Xu wrote:
> The VT-d code is still using the static DEBUG_INTEL_IOMMU macro. That's
> not good: we should not have to recompile the code just to get useful
> debugging information for VT-d. Time to switch to the trace system. This
> is the first patch to do it.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   hw/i386/intel_iommu.c | 95 +++++++++++++++++++++------------------------------
>   hw/i386/trace-events  | 18 ++++++++++
>   2 files changed, 56 insertions(+), 57 deletions(-)

Reviewed-by: Jason Wang <jasowang@redhat.com>

> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 0c94b79..08e43b6 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -35,6 +35,7 @@
>   #include "sysemu/kvm.h"
>   #include "hw/i386/apic_internal.h"
>   #include "kvm_i386.h"
> +#include "trace.h"
>   
>   /*#define DEBUG_INTEL_IOMMU*/
>   #ifdef DEBUG_INTEL_IOMMU
> @@ -474,22 +475,19 @@ static void vtd_handle_inv_queue_error(IntelIOMMUState *s)
>   /* Set the IWC field and try to generate an invalidation completion interrupt */
>   static void vtd_generate_completion_event(IntelIOMMUState *s)
>   {
> -    VTD_DPRINTF(INV, "completes an invalidation wait command with "
> -                "Interrupt Flag");
>       if (vtd_get_long_raw(s, DMAR_ICS_REG) & VTD_ICS_IWC) {
> -        VTD_DPRINTF(INV, "there is a previous interrupt condition to be "
> -                    "serviced by software, "
> -                    "new invalidation event is not generated");
> +        trace_vtd_inv_desc_wait_irq("One pending, skip current");
>           return;
>       }
>       vtd_set_clear_mask_long(s, DMAR_ICS_REG, 0, VTD_ICS_IWC);
>       vtd_set_clear_mask_long(s, DMAR_IECTL_REG, 0, VTD_IECTL_IP);
>       if (vtd_get_long_raw(s, DMAR_IECTL_REG) & VTD_IECTL_IM) {
> -        VTD_DPRINTF(INV, "IM filed in IECTL_REG is set, new invalidation "
> -                    "event is not generated");
> +        trace_vtd_inv_desc_wait_irq("IM in IECTL_REG is set, "
> +                                    "new event not generated");
>           return;
>       } else {
>           /* Generate the interrupt event */
> +        trace_vtd_inv_desc_wait_irq("Generating complete event");
>           vtd_generate_interrupt(s, DMAR_IEADDR_REG, DMAR_IEDATA_REG);
>           vtd_set_clear_mask_long(s, DMAR_IECTL_REG, VTD_IECTL_IP, 0);
>       }
> @@ -923,6 +921,7 @@ static void vtd_interrupt_remap_table_setup(IntelIOMMUState *s)
>   
>   static void vtd_context_global_invalidate(IntelIOMMUState *s)
>   {
> +    trace_vtd_inv_desc_cc_global();
>       s->context_cache_gen++;
>       if (s->context_cache_gen == VTD_CONTEXT_CACHE_GEN_MAX) {
>           vtd_reset_context_cache(s);
> @@ -962,9 +961,11 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
>       uint16_t mask;
>       VTDBus *vtd_bus;
>       VTDAddressSpace *vtd_as;
> -    uint16_t devfn;
> +    uint8_t bus_n, devfn;
>       uint16_t devfn_it;
>   
> +    trace_vtd_inv_desc_cc_devices(source_id, func_mask);
> +
>       switch (func_mask & 3) {
>       case 0:
>           mask = 0;   /* No bits in the SID field masked */
> @@ -980,16 +981,16 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
>           break;
>       }
>       mask = ~mask;
> -    VTD_DPRINTF(INV, "device-selective invalidation source 0x%"PRIx16
> -                    " mask %"PRIu16, source_id, mask);
> -    vtd_bus = vtd_find_as_from_bus_num(s, VTD_SID_TO_BUS(source_id));
> +
> +    bus_n = VTD_SID_TO_BUS(source_id);
> +    vtd_bus = vtd_find_as_from_bus_num(s, bus_n);
>       if (vtd_bus) {
>           devfn = VTD_SID_TO_DEVFN(source_id);
>           for (devfn_it = 0; devfn_it < X86_IOMMU_PCI_DEVFN_MAX; ++devfn_it) {
>               vtd_as = vtd_bus->dev_as[devfn_it];
>               if (vtd_as && ((devfn_it & mask) == (devfn & mask))) {
> -                VTD_DPRINTF(INV, "invalidate context-cahce of devfn 0x%"PRIx16,
> -                            devfn_it);
> +                trace_vtd_inv_desc_cc_device(bus_n, VTD_PCI_SLOT(devfn_it),
> +                                             VTD_PCI_FUNC(devfn_it));
>                   vtd_as->context_cache_entry.context_cache_gen = 0;
>               }
>           }
> @@ -1302,9 +1303,7 @@ static bool vtd_process_wait_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
>   {
>       if ((inv_desc->hi & VTD_INV_DESC_WAIT_RSVD_HI) ||
>           (inv_desc->lo & VTD_INV_DESC_WAIT_RSVD_LO)) {
> -        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in Invalidation "
> -                    "Wait Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
> -                    inv_desc->hi, inv_desc->lo);
> +        trace_vtd_inv_desc_wait_invalid(inv_desc->hi, inv_desc->lo);
>           return false;
>       }
>       if (inv_desc->lo & VTD_INV_DESC_WAIT_SW) {
> @@ -1316,21 +1315,18 @@ static bool vtd_process_wait_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
>   
>           /* FIXME: need to be masked with HAW? */
>           dma_addr_t status_addr = inv_desc->hi;
> -        VTD_DPRINTF(INV, "status data 0x%x, status addr 0x%"PRIx64,
> -                    status_data, status_addr);
> +        trace_vtd_inv_desc_wait_sw(status_addr, status_data);
>           status_data = cpu_to_le32(status_data);
>           if (dma_memory_write(&address_space_memory, status_addr, &status_data,
>                                sizeof(status_data))) {
> -            VTD_DPRINTF(GENERAL, "error: fail to perform a coherent write");
> +            trace_vtd_inv_desc_wait_write_fail(inv_desc->hi, inv_desc->lo);
>               return false;
>           }
>       } else if (inv_desc->lo & VTD_INV_DESC_WAIT_IF) {
>           /* Interrupt flag */
> -        VTD_DPRINTF(INV, "Invalidation Wait Descriptor interrupt completion");
>           vtd_generate_completion_event(s);
>       } else {
> -        VTD_DPRINTF(GENERAL, "error: invalid Invalidation Wait Descriptor: "
> -                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, inv_desc->hi, inv_desc->lo);
> +        trace_vtd_inv_desc_wait_invalid(inv_desc->hi, inv_desc->lo);
>           return false;
>       }
>       return true;
> @@ -1339,30 +1335,29 @@ static bool vtd_process_wait_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
>   static bool vtd_process_context_cache_desc(IntelIOMMUState *s,
>                                              VTDInvDesc *inv_desc)
>   {
> +    uint16_t sid, fmask;
> +
>       if ((inv_desc->lo & VTD_INV_DESC_CC_RSVD) || inv_desc->hi) {
> -        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in Context-cache "
> -                    "Invalidate Descriptor");
> +        trace_vtd_inv_desc_cc_invalid(inv_desc->hi, inv_desc->lo);
>           return false;
>       }
>       switch (inv_desc->lo & VTD_INV_DESC_CC_G) {
>       case VTD_INV_DESC_CC_DOMAIN:
> -        VTD_DPRINTF(INV, "domain-selective invalidation domain 0x%"PRIx16,
> -                    (uint16_t)VTD_INV_DESC_CC_DID(inv_desc->lo));
> +        trace_vtd_inv_desc_cc_domain(
> +            (uint16_t)VTD_INV_DESC_CC_DID(inv_desc->lo));
>           /* Fall through */
>       case VTD_INV_DESC_CC_GLOBAL:
> -        VTD_DPRINTF(INV, "global invalidation");
>           vtd_context_global_invalidate(s);
>           break;
>   
>       case VTD_INV_DESC_CC_DEVICE:
> -        vtd_context_device_invalidate(s, VTD_INV_DESC_CC_SID(inv_desc->lo),
> -                                      VTD_INV_DESC_CC_FM(inv_desc->lo));
> +        sid = VTD_INV_DESC_CC_SID(inv_desc->lo);
> +        fmask = VTD_INV_DESC_CC_FM(inv_desc->lo);
> +        vtd_context_device_invalidate(s, sid, fmask);
>           break;
>   
>       default:
> -        VTD_DPRINTF(GENERAL, "error: invalid granularity in Context-cache "
> -                    "Invalidate Descriptor hi 0x%"PRIx64  " lo 0x%"PRIx64,
> -                    inv_desc->hi, inv_desc->lo);
> +        trace_vtd_inv_desc_cc_invalid(inv_desc->hi, inv_desc->lo);
>           return false;
>       }
>       return true;
> @@ -1376,22 +1371,19 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
>   
>       if ((inv_desc->lo & VTD_INV_DESC_IOTLB_RSVD_LO) ||
>           (inv_desc->hi & VTD_INV_DESC_IOTLB_RSVD_HI)) {
> -        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in IOTLB "
> -                    "Invalidate Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
> -                    inv_desc->hi, inv_desc->lo);
> +        trace_vtd_inv_desc_iotlb_invalid(inv_desc->hi, inv_desc->lo);
>           return false;
>       }
>   
>       switch (inv_desc->lo & VTD_INV_DESC_IOTLB_G) {
>       case VTD_INV_DESC_IOTLB_GLOBAL:
> -        VTD_DPRINTF(INV, "global invalidation");
> +        trace_vtd_inv_desc_iotlb_global();
>           vtd_iotlb_global_invalidate(s);
>           break;
>   
>       case VTD_INV_DESC_IOTLB_DOMAIN:
>           domain_id = VTD_INV_DESC_IOTLB_DID(inv_desc->lo);
> -        VTD_DPRINTF(INV, "domain-selective invalidation domain 0x%"PRIx16,
> -                    domain_id);
> +        trace_vtd_inv_desc_iotlb_domain(domain_id);
>           vtd_iotlb_domain_invalidate(s, domain_id);
>           break;
>   
> @@ -1399,20 +1391,16 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
>           domain_id = VTD_INV_DESC_IOTLB_DID(inv_desc->lo);
>           addr = VTD_INV_DESC_IOTLB_ADDR(inv_desc->hi);
>           am = VTD_INV_DESC_IOTLB_AM(inv_desc->hi);
> -        VTD_DPRINTF(INV, "page-selective invalidation domain 0x%"PRIx16
> -                    " addr 0x%"PRIx64 " mask %"PRIu8, domain_id, addr, am);
> +        trace_vtd_inv_desc_iotlb_pages(domain_id, addr, am);
>           if (am > VTD_MAMV) {
> -            VTD_DPRINTF(GENERAL, "error: supported max address mask value is "
> -                        "%"PRIu8, (uint8_t)VTD_MAMV);
> +            trace_vtd_inv_desc_iotlb_invalid(inv_desc->hi, inv_desc->lo);
>               return false;
>           }
>           vtd_iotlb_page_invalidate(s, domain_id, addr, am);
>           break;
>   
>       default:
> -        VTD_DPRINTF(GENERAL, "error: invalid granularity in IOTLB Invalidate "
> -                    "Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
> -                    inv_desc->hi, inv_desc->lo);
> +        trace_vtd_inv_desc_iotlb_invalid(inv_desc->hi, inv_desc->lo);
>           return false;
>       }
>       return true;
> @@ -1511,33 +1499,28 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
>   
>       switch (desc_type) {
>       case VTD_INV_DESC_CC:
> -        VTD_DPRINTF(INV, "Context-cache Invalidate Descriptor hi 0x%"PRIx64
> -                    " lo 0x%"PRIx64, inv_desc.hi, inv_desc.lo);
> +        trace_vtd_inv_desc("context-cache", inv_desc.hi, inv_desc.lo);
>           if (!vtd_process_context_cache_desc(s, &inv_desc)) {
>               return false;
>           }
>           break;
>   
>       case VTD_INV_DESC_IOTLB:
> -        VTD_DPRINTF(INV, "IOTLB Invalidate Descriptor hi 0x%"PRIx64
> -                    " lo 0x%"PRIx64, inv_desc.hi, inv_desc.lo);
> +        trace_vtd_inv_desc("iotlb", inv_desc.hi, inv_desc.lo);
>           if (!vtd_process_iotlb_desc(s, &inv_desc)) {
>               return false;
>           }
>           break;
>   
>       case VTD_INV_DESC_WAIT:
> -        VTD_DPRINTF(INV, "Invalidation Wait Descriptor hi 0x%"PRIx64
> -                    " lo 0x%"PRIx64, inv_desc.hi, inv_desc.lo);
> +        trace_vtd_inv_desc("wait", inv_desc.hi, inv_desc.lo);
>           if (!vtd_process_wait_desc(s, &inv_desc)) {
>               return false;
>           }
>           break;
>   
>       case VTD_INV_DESC_IEC:
> -        VTD_DPRINTF(INV, "Invalidation Interrupt Entry Cache "
> -                    "Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
> -                    inv_desc.hi, inv_desc.lo);
> +        trace_vtd_inv_desc("iec", inv_desc.hi, inv_desc.lo);
>           if (!vtd_process_inv_iec_desc(s, &inv_desc)) {
>               return false;
>           }
> @@ -1552,9 +1535,7 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
>           break;
>   
>       default:
> -        VTD_DPRINTF(GENERAL, "error: unkonw Invalidation Descriptor type "
> -                    "hi 0x%"PRIx64 " lo 0x%"PRIx64 " type %"PRIu8,
> -                    inv_desc.hi, inv_desc.lo, desc_type);
> +        trace_vtd_inv_desc_invalid(inv_desc.hi, inv_desc.lo);
>           return false;
>       }
>       s->iq_head++;
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index 1cc4a10..02aeaab 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -3,6 +3,24 @@
>   # hw/i386/x86-iommu.c
>   x86_iommu_iec_notify(bool global, uint32_t index, uint32_t mask) "Notify IEC invalidation: global=%d index=%" PRIu32 " mask=%" PRIu32
>   
> +# hw/i386/intel_iommu.c
> +vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
> +vtd_inv_desc(const char *type, uint64_t hi, uint64_t lo) "invalidate desc type %s high 0x%"PRIx64" low 0x%"PRIx64
> +vtd_inv_desc_invalid(uint64_t hi, uint64_t lo) "invalid inv desc hi 0x%"PRIx64" lo 0x%"PRIx64
> +vtd_inv_desc_cc_domain(uint16_t domain) "context invalidate domain 0x%"PRIx16
> +vtd_inv_desc_cc_global(void) "context invalidate globally"
> +vtd_inv_desc_cc_device(uint8_t bus, uint8_t dev, uint8_t fn) "context invalidate device %02"PRIx8":%02"PRIx8".%02"PRIx8
> +vtd_inv_desc_cc_devices(uint16_t sid, uint16_t fmask) "context invalidate devices sid 0x%"PRIx16" fmask 0x%"PRIx16
> +vtd_inv_desc_cc_invalid(uint64_t hi, uint64_t lo) "invalid context-cache desc hi 0x%"PRIx64" lo 0x%"PRIx64
> +vtd_inv_desc_iotlb_global(void) "iotlb invalidate global"
> +vtd_inv_desc_iotlb_domain(uint16_t domain) "iotlb invalidate whole domain 0x%"PRIx16
> +vtd_inv_desc_iotlb_pages(uint16_t domain, uint64_t addr, uint8_t mask) "iotlb invalidate domain 0x%"PRIx16" addr 0x%"PRIx64" mask 0x%"PRIx8
> +vtd_inv_desc_iotlb_invalid(uint64_t hi, uint64_t lo) "invalid iotlb desc hi 0x%"PRIx64" lo 0x%"PRIx64
> +vtd_inv_desc_wait_sw(uint64_t addr, uint32_t data) "wait invalidate status write addr 0x%"PRIx64" data 0x%"PRIx32
> +vtd_inv_desc_wait_irq(const char *msg) "%s"
> +vtd_inv_desc_wait_invalid(uint64_t hi, uint64_t lo) "invalid wait desc hi 0x%"PRIx64" lo 0x%"PRIx64
> +vtd_inv_desc_wait_write_fail(uint64_t hi, uint64_t lo) "write fail for wait desc hi 0x%"PRIx64" lo 0x%"PRIx64
> +
>   # hw/i386/amd_iommu.c
>   amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
>   amdvi_cache_update(uint16_t domid, uint8_t bus, uint8_t slot, uint8_t func, uint64_t gpa, uint64_t txaddr) " update iotlb domid 0x%"PRIx16" devid: %02x:%02x.%x gpa 0x%"PRIx64" hpa 0x%"PRIx64

^ permalink raw reply	[flat|nested] 63+ messages in thread
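
Once the DPRINTF()s are converted, the new events can be turned on at run time
instead of by recompiling. With a stock trace backend something along these
lines should work (the exact -trace syntax varies between QEMU versions, so
treat this as an illustration rather than the canonical invocation):

    # enable all VT-d invalidation-related trace events
    qemu-system-x86_64 -trace 'enable=vtd_inv_desc*' ...

    # or keep the patterns in a file
    echo 'vtd_inv_desc*' > /tmp/vtd-events
    qemu-system-x86_64 -trace events=/tmp/vtd-events ...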

* Re: [Qemu-devel] [PATCH v7 08/17] intel_iommu: convert dbg macros to trace for trans
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 08/17] intel_iommu: convert dbg macros to trace for trans Peter Xu
@ 2017-02-08  2:49   ` Jason Wang
  2017-02-10  1:20   ` David Gibson
  1 sibling, 0 replies; 63+ messages in thread
From: Jason Wang @ 2017-02-08  2:49 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, alex.williamson,
	bd.aviv, David Gibson



On 2017-02-07 16:28, Peter Xu wrote:
> Another patch to convert the DPRINTF() calls. This patch focuses on the
> address translation path and caching.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   hw/i386/intel_iommu.c | 69 ++++++++++++++++++---------------------------------
>   hw/i386/trace-events  | 10 ++++++++
>   2 files changed, 34 insertions(+), 45 deletions(-)

Reviewed-by: Jason Wang <jasowang@redhat.com>

>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 08e43b6..ad304f6 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -260,11 +260,9 @@ static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
>       uint64_t *key = g_malloc(sizeof(*key));
>       uint64_t gfn = vtd_get_iotlb_gfn(addr, level);
>   
> -    VTD_DPRINTF(CACHE, "update iotlb sid 0x%"PRIx16 " iova 0x%"PRIx64
> -                " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr, slpte,
> -                domain_id);
> +    trace_vtd_iotlb_page_update(source_id, addr, slpte, domain_id);
>       if (g_hash_table_size(s->iotlb) >= VTD_IOTLB_MAX_SIZE) {
> -        VTD_DPRINTF(CACHE, "iotlb exceeds size limit, forced to reset");
> +        trace_vtd_iotlb_reset("iotlb exceeds size limit");
>           vtd_reset_iotlb(s);
>       }
>   
> @@ -505,8 +503,7 @@ static int vtd_get_root_entry(IntelIOMMUState *s, uint8_t index,
>   
>       addr = s->root + index * sizeof(*re);
>       if (dma_memory_read(&address_space_memory, addr, re, sizeof(*re))) {
> -        VTD_DPRINTF(GENERAL, "error: fail to access root-entry at 0x%"PRIx64
> -                    " + %"PRIu8, s->root, index);
> +        trace_vtd_re_invalid(re->rsvd, re->val);
>           re->val = 0;
>           return -VTD_FR_ROOT_TABLE_INV;
>       }
> @@ -524,15 +521,10 @@ static int vtd_get_context_entry_from_root(VTDRootEntry *root, uint8_t index,
>   {
>       dma_addr_t addr;
>   
> -    if (!vtd_root_entry_present(root)) {
> -        VTD_DPRINTF(GENERAL, "error: root-entry is not present");
> -        return -VTD_FR_ROOT_ENTRY_P;
> -    }
> +    /* we have checked that root entry is present */
>       addr = (root->val & VTD_ROOT_ENTRY_CTP) + index * sizeof(*ce);
>       if (dma_memory_read(&address_space_memory, addr, ce, sizeof(*ce))) {
> -        VTD_DPRINTF(GENERAL, "error: fail to access context-entry at 0x%"PRIx64
> -                    " + %"PRIu8,
> -                    (uint64_t)(root->val & VTD_ROOT_ENTRY_CTP), index);
> +        trace_vtd_re_invalid(root->rsvd, root->val);
>           return -VTD_FR_CONTEXT_TABLE_INV;
>       }
>       ce->lo = le64_to_cpu(ce->lo);
> @@ -704,12 +696,11 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>       }
>   
>       if (!vtd_root_entry_present(&re)) {
> -        VTD_DPRINTF(GENERAL, "error: root-entry #%"PRIu8 " is not present",
> -                    bus_num);
> +        /* Not error - it's okay we don't have root entry. */
> +        trace_vtd_re_not_present(bus_num);
>           return -VTD_FR_ROOT_ENTRY_P;
>       } else if (re.rsvd || (re.val & VTD_ROOT_ENTRY_RSVD)) {
> -        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in root-entry "
> -                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, re.rsvd, re.val);
> +        trace_vtd_re_invalid(re.rsvd, re.val);
>           return -VTD_FR_ROOT_ENTRY_RSVD;
>       }
>   
> @@ -719,22 +710,17 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>       }
>   
>       if (!vtd_context_entry_present(ce)) {
> -        VTD_DPRINTF(GENERAL,
> -                    "error: context-entry #%"PRIu8 "(bus #%"PRIu8 ") "
> -                    "is not present", devfn, bus_num);
> +        /* Not error - it's okay we don't have context entry. */
> +        trace_vtd_ce_not_present(bus_num, devfn);
>           return -VTD_FR_CONTEXT_ENTRY_P;
>       } else if ((ce->hi & VTD_CONTEXT_ENTRY_RSVD_HI) ||
>                  (ce->lo & VTD_CONTEXT_ENTRY_RSVD_LO)) {
> -        VTD_DPRINTF(GENERAL,
> -                    "error: non-zero reserved field in context-entry "
> -                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, ce->hi, ce->lo);
> +        trace_vtd_ce_invalid(ce->hi, ce->lo);
>           return -VTD_FR_CONTEXT_ENTRY_RSVD;
>       }
>       /* Check if the programming of context-entry is valid */
>       if (!vtd_is_level_supported(s, vtd_get_level_from_context_entry(ce))) {
> -        VTD_DPRINTF(GENERAL, "error: unsupported Address Width value in "
> -                    "context-entry hi 0x%"PRIx64 " lo 0x%"PRIx64,
> -                    ce->hi, ce->lo);
> +        trace_vtd_ce_invalid(ce->hi, ce->lo);
>           return -VTD_FR_CONTEXT_ENTRY_INV;
>       } else {
>           switch (ce->lo & VTD_CONTEXT_ENTRY_TT) {
> @@ -743,9 +729,7 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>           case VTD_CONTEXT_TT_DEV_IOTLB:
>               break;
>           default:
> -            VTD_DPRINTF(GENERAL, "error: unsupported Translation Type in "
> -                        "context-entry hi 0x%"PRIx64 " lo 0x%"PRIx64,
> -                        ce->hi, ce->lo);
> +            trace_vtd_ce_invalid(ce->hi, ce->lo);
>               return -VTD_FR_CONTEXT_ENTRY_INV;
>           }
>       }
> @@ -825,9 +809,8 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>       /* Try to fetch slpte form IOTLB */
>       iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
>       if (iotlb_entry) {
> -        VTD_DPRINTF(CACHE, "hit iotlb sid 0x%"PRIx16 " iova 0x%"PRIx64
> -                    " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr,
> -                    iotlb_entry->slpte, iotlb_entry->domain_id);
> +        trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->slpte,
> +                                 iotlb_entry->domain_id);
>           slpte = iotlb_entry->slpte;
>           reads = iotlb_entry->read_flags;
>           writes = iotlb_entry->write_flags;
> @@ -836,10 +819,9 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>       }
>       /* Try to fetch context-entry from cache first */
>       if (cc_entry->context_cache_gen == s->context_cache_gen) {
> -        VTD_DPRINTF(CACHE, "hit context-cache bus %d devfn %d "
> -                    "(hi %"PRIx64 " lo %"PRIx64 " gen %"PRIu32 ")",
> -                    bus_num, devfn, cc_entry->context_entry.hi,
> -                    cc_entry->context_entry.lo, cc_entry->context_cache_gen);
> +        trace_vtd_iotlb_cc_hit(bus_num, devfn, cc_entry->context_entry.hi,
> +                               cc_entry->context_entry.lo,
> +                               cc_entry->context_cache_gen);
>           ce = cc_entry->context_entry;
>           is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
>       } else {
> @@ -848,19 +830,16 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>           if (ret_fr) {
>               ret_fr = -ret_fr;
>               if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
> -                VTD_DPRINTF(FLOG, "fault processing is disabled for DMA "
> -                            "requests through this context-entry "
> -                            "(with FPD Set)");
> +                trace_vtd_fault_disabled();
>               } else {
>                   vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
>               }
>               return;
>           }
>           /* Update context-cache */
> -        VTD_DPRINTF(CACHE, "update context-cache bus %d devfn %d "
> -                    "(hi %"PRIx64 " lo %"PRIx64 " gen %"PRIu32 "->%"PRIu32 ")",
> -                    bus_num, devfn, ce.hi, ce.lo,
> -                    cc_entry->context_cache_gen, s->context_cache_gen);
> +        trace_vtd_iotlb_cc_update(bus_num, devfn, ce.hi, ce.lo,
> +                                  cc_entry->context_cache_gen,
> +                                  s->context_cache_gen);
>           cc_entry->context_entry = ce;
>           cc_entry->context_cache_gen = s->context_cache_gen;
>       }
> @@ -870,8 +849,7 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>       if (ret_fr) {
>           ret_fr = -ret_fr;
>           if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
> -            VTD_DPRINTF(FLOG, "fault processing is disabled for DMA requests "
> -                        "through this context-entry (with FPD Set)");
> +            trace_vtd_fault_disabled();
>           } else {
>               vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
>           }
> @@ -1031,6 +1009,7 @@ static uint64_t vtd_context_cache_invalidate(IntelIOMMUState *s, uint64_t val)
>   
>   static void vtd_iotlb_global_invalidate(IntelIOMMUState *s)
>   {
> +    trace_vtd_iotlb_reset("global invalidation recved");
>       vtd_reset_iotlb(s);
>   }
>   
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index 02aeaab..88ad5e4 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -20,6 +20,16 @@ vtd_inv_desc_wait_sw(uint64_t addr, uint32_t data) "wait invalidate status write
>   vtd_inv_desc_wait_irq(const char *msg) "%s"
>   vtd_inv_desc_wait_invalid(uint64_t hi, uint64_t lo) "invalid wait desc hi 0x%"PRIx64" lo 0x%"PRIx64
>   vtd_inv_desc_wait_write_fail(uint64_t hi, uint64_t lo) "write fail for wait desc hi 0x%"PRIx64" lo 0x%"PRIx64
> +vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
> +vtd_re_invalid(uint64_t hi, uint64_t lo) "invalid root entry hi 0x%"PRIx64" lo 0x%"PRIx64
> +vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
> +vtd_ce_invalid(uint64_t hi, uint64_t lo) "invalid context entry hi 0x%"PRIx64" lo 0x%"PRIx64
> +vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
> +vtd_iotlb_page_update(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page update sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
> +vtd_iotlb_cc_hit(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen) "IOTLB context hit bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32
> +vtd_iotlb_cc_update(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen1, uint32_t gen2) "IOTLB context update bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32" -> gen %"PRIu32
> +vtd_iotlb_reset(const char *reason) "IOTLB reset (reason: %s)"
> +vtd_fault_disabled(void) "Fault processing disabled for context entry"
>   
>   # hw/i386/amd_iommu.c
>   amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 02/17] vfio: introduce vfio_get_vaddr()
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 02/17] vfio: introduce vfio_get_vaddr() Peter Xu
@ 2017-02-10  1:12   ` David Gibson
  2017-02-10  5:50     ` Peter Xu
  0 siblings, 1 reply; 63+ messages in thread
From: David Gibson @ 2017-02-10  1:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

[-- Attachment #1: Type: text/plain, Size: 5379 bytes --]

On Tue, Feb 07, 2017 at 04:28:04PM +0800, Peter Xu wrote:
> A cleanup for vfio_iommu_map_notify(). Now we will fetch vaddr even if
> the operation is unmap, but it won't hurt much.
> 
> One thing to mention is that we need the RCU read lock to protect the
> whole translation and map/unmap procedure.
> 
> Acked-by: Alex Williamson <alex.williamson@redhat.com>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Peter Xu <peterx@redhat.com>

So, I know I reviewed this already, but looking again I'm confused.

I'm not sure how the original code ever worked: if this is an unmap
(perm == IOMMU_NONE), then I wouldn't even expect
iotlb->translated_addr to have a valid value, but we're passing it to
address_space_translate() and failing if it doesn't give us
sensible results.

> ---
>  hw/vfio/common.c | 65 +++++++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 45 insertions(+), 20 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 174f351..42c4790 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -294,54 +294,79 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>             section->offset_within_address_space & (1ULL << 63);
>  }
>  
> -static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> +/* Called with rcu_read_lock held.  */
> +static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
> +                           bool *read_only)
>  {
> -    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
> -    VFIOContainer *container = giommu->container;
> -    hwaddr iova = iotlb->iova + giommu->iommu_offset;
>      MemoryRegion *mr;
>      hwaddr xlat;
>      hwaddr len = iotlb->addr_mask + 1;
> -    void *vaddr;
> -    int ret;
> -
> -    trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? "UNMAP" : "MAP",
> -                                iova, iova + iotlb->addr_mask);
> -
> -    if (iotlb->target_as != &address_space_memory) {
> -        error_report("Wrong target AS \"%s\", only system memory is allowed",
> -                     iotlb->target_as->name ? iotlb->target_as->name : "none");
> -        return;
> -    }
> +    bool writable = iotlb->perm & IOMMU_WO;
>  
>      /*
>       * The IOMMU TLB entry we have just covers translation through
>       * this IOMMU to its immediate target.  We need to translate
>       * it the rest of the way through to memory.
>       */
> -    rcu_read_lock();
>      mr = address_space_translate(&address_space_memory,
>                                   iotlb->translated_addr,
> -                                 &xlat, &len, iotlb->perm & IOMMU_WO);
> +                                 &xlat, &len, writable);
>      if (!memory_region_is_ram(mr)) {
>          error_report("iommu map to non memory area %"HWADDR_PRIx"",
>                       xlat);
> -        goto out;
> +        return false;
>      }
> +
>      /*
>       * Translation truncates length to the IOMMU page size,
>       * check that it did not truncate too much.
>       */
>      if (len & iotlb->addr_mask) {
>          error_report("iommu has granularity incompatible with target AS");
> +        return false;
> +    }
> +
> +    *vaddr = memory_region_get_ram_ptr(mr) + xlat;
> +    *read_only = !writable || mr->readonly;
> +
> +    return true;
> +}
> +
> +static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> +{
> +    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
> +    VFIOContainer *container = giommu->container;
> +    hwaddr iova = iotlb->iova + giommu->iommu_offset;
> +    bool read_only;
> +    void *vaddr;
> +    int ret;
> +
> +    trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? "UNMAP" : "MAP",
> +                                iova, iova + iotlb->addr_mask);
> +
> +    if (iotlb->target_as != &address_space_memory) {
> +        error_report("Wrong target AS \"%s\", only system memory is allowed",
> +                     iotlb->target_as->name ? iotlb->target_as->name : "none");
> +        return;
> +    }
> +
> +    rcu_read_lock();
> +
> +    if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
>          goto out;
>      }
>  
>      if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> -        vaddr = memory_region_get_ram_ptr(mr) + xlat;
> +        /*
> +         * vaddr is only valid until rcu_read_unlock(). But after
> +         * vfio_dma_map has set up the mapping the pages will be
> +         * pinned by the kernel. This makes sure that the RAM backend
> +         * of vaddr will always be there, even if the memory object is
> +         * destroyed and its backing memory munmap-ed.
> +         */
>          ret = vfio_dma_map(container, iova,
>                             iotlb->addr_mask + 1, vaddr,
> -                           !(iotlb->perm & IOMMU_WO) || mr->readonly);
> +                           read_only);
>          if (ret) {
>              error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>                           "0x%"HWADDR_PRIx", %p) = %d (%m)",

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 03/17] vfio: allow to notify unmap for very large region
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 03/17] vfio: allow to notify unmap for very large region Peter Xu
@ 2017-02-10  1:13   ` David Gibson
  0 siblings, 0 replies; 63+ messages in thread
From: David Gibson @ 2017-02-10  1:13 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

[-- Attachment #1: Type: text/plain, Size: 2105 bytes --]

On Tue, Feb 07, 2017 at 04:28:05PM +0800, Peter Xu wrote:
> The Linux vfio driver supports VFIO_IOMMU_UNMAP_DMA for a very big
> region. This can be leveraged by the QEMU IOMMU implementation to clean up
> existing page mappings for an entire iova address space (by notifying
> with an IOTLB entry with an extremely huge addr_mask). However, the current
> vfio_iommu_map_notify() does not allow that: it makes sure that all the
> translated addresses in the IOTLB fall into the RAM range.
> 
> The check makes sense, but it should only apply to map operations; it
> means little for unmap operations.
> 
> This patch moves the check into the map logic only, so that we get
> faster unmap handling (no need to translate again), and we can then
> better support unmapping a very big region even when it covers non-RAM
> or non-existent ranges.
> 
> Acked-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Seems sensible of itself, except that I don't understand how we were
ever working before this.

> ---
>  hw/vfio/common.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 42c4790..f3ba9b9 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -352,11 +352,10 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>  
>      rcu_read_lock();
>  
> -    if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
> -        goto out;
> -    }
> -
>      if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> +        if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
> +            goto out;
> +        }
>          /*
>           * vaddr is only valid until rcu_read_unlock(). But after
>           * vfio_dma_map has set up the mapping the pages will be

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread
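
To make the "very large region" concrete: with the check relaxed, the IOMMU
side can hand vfio a single UNMAP entry that covers the whole guest address
width. A sketch of what building such an entry looks like, assuming the
VTD_ADDRESS_SIZE macro added earlier in this series and calling the registered
notifier directly (a real implementation would go through whatever
notification helper the memory API offers):

    static void vtd_notify_unmap_all(IOMMUNotifier *n)
    {
        IOMMUTLBEntry entry = {
            .target_as       = &address_space_memory,
            .iova            = 0,
            .translated_addr = 0,
            .addr_mask       = VTD_ADDRESS_SIZE - 1, /* 0 .. 2^39 - 1 */
            .perm            = IOMMU_NONE,           /* i.e. unmap */
        };

        /* The unmap path no longer tries to translate translated_addr,
         * so a dummy value here is fine. */
        n->notify(n, &entry);
    }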

* Re: [Qemu-devel] [PATCH v7 04/17] intel_iommu: add "caching-mode" option
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 04/17] intel_iommu: add "caching-mode" option Peter Xu
@ 2017-02-10  1:14   ` David Gibson
  0 siblings, 0 replies; 63+ messages in thread
From: David Gibson @ 2017-02-10  1:14 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

[-- Attachment #1: Type: text/plain, Size: 2989 bytes --]

On Tue, Feb 07, 2017 at 04:28:06PM +0800, Peter Xu wrote:
> From: Aviv Ben-David <bd.aviv@gmail.com>
> 
> This capability asks the guest to invalidate the cache before each map operation.
> We can use these invalidations to trap map operations in the hypervisor.
> 
> Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
> [peterx: using "caching-mode" instead of "cache-mode" to align with spec]
> [peterx: re-write the subject to make it short and clear]
> Reviewed-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  hw/i386/intel_iommu.c          | 5 +++++
>  hw/i386/intel_iommu_internal.h | 1 +
>  include/hw/i386/intel_iommu.h  | 2 ++
>  3 files changed, 8 insertions(+)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 3270fb9..50251c3 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -2115,6 +2115,7 @@ static Property vtd_properties[] = {
>      DEFINE_PROP_ON_OFF_AUTO("eim", IntelIOMMUState, intr_eim,
>                              ON_OFF_AUTO_AUTO),
>      DEFINE_PROP_BOOL("x-buggy-eim", IntelIOMMUState, buggy_eim, false),
> +    DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> @@ -2496,6 +2497,10 @@ static void vtd_init(IntelIOMMUState *s)
>          s->ecap |= VTD_ECAP_DT;
>      }
>  
> +    if (s->caching_mode) {
> +        s->cap |= VTD_CAP_CM;
> +    }
> +
>      vtd_reset_context_cache(s);
>      vtd_reset_iotlb(s);
>  
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 356f188..4104121 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -202,6 +202,7 @@
>  #define VTD_CAP_MAMV                (VTD_MAMV << 48)
>  #define VTD_CAP_PSI                 (1ULL << 39)
>  #define VTD_CAP_SLLPS               ((1ULL << 34) | (1ULL << 35))
> +#define VTD_CAP_CM                  (1ULL << 7)
>  
>  /* Supported Adjusted Guest Address Widths */
>  #define VTD_CAP_SAGAW_SHIFT         8
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 405c9d1..fe645aa 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -257,6 +257,8 @@ struct IntelIOMMUState {
>      uint8_t womask[DMAR_REG_SIZE];  /* WO (write only - read returns 0) */
>      uint32_t version;
>  
> +    bool caching_mode;          /* RO - is cap CM enabled? */
> +
>      dma_addr_t root;                /* Current root table pointer */
>      bool root_extended;             /* Type of root table (extended or not) */
>      bool dmar_enabled;              /* Set if DMA remapping is enabled */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread
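
As a usage note, once the property exists caching mode is enabled from the
command line together with the rest of the vIOMMU options. A typical
vfio-style invocation might look like the following (the split irqchip and the
particular host device are only illustrative, not requirements of this patch):

    qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split \
        -device intel-iommu,intremap=on,caching-mode=on \
        -device vfio-pci,host=01:00.0 \
        ...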

* Re: [Qemu-devel] [PATCH v7 05/17] intel_iommu: simplify irq region translation
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 05/17] intel_iommu: simplify irq region translation Peter Xu
@ 2017-02-10  1:15   ` David Gibson
  0 siblings, 0 replies; 63+ messages in thread
From: David Gibson @ 2017-02-10  1:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

[-- Attachment #1: Type: text/plain, Size: 2455 bytes --]

On Tue, Feb 07, 2017 at 04:28:07PM +0800, Peter Xu wrote:
> Now that we have a standalone memory region for MSI, all IRQ region
> requests should be redirected there. Clean up the block with an
> assertion instead.
> 
> Reviewed-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  hw/i386/intel_iommu.c | 28 ++++++----------------------
>  1 file changed, 6 insertions(+), 22 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 50251c3..86d19bb 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -818,28 +818,12 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>      bool writes = true;
>      VTDIOTLBEntry *iotlb_entry;
>  
> -    /* Check if the request is in interrupt address range */
> -    if (vtd_is_interrupt_addr(addr)) {
> -        if (is_write) {
> -            /* FIXME: since we don't know the length of the access here, we
> -             * treat Non-DWORD length write requests without PASID as
> -             * interrupt requests, too. Withoud interrupt remapping support,
> -             * we just use 1:1 mapping.
> -             */
> -            VTD_DPRINTF(MMU, "write request to interrupt address "
> -                        "gpa 0x%"PRIx64, addr);
> -            entry->iova = addr & VTD_PAGE_MASK_4K;
> -            entry->translated_addr = addr & VTD_PAGE_MASK_4K;
> -            entry->addr_mask = ~VTD_PAGE_MASK_4K;
> -            entry->perm = IOMMU_WO;
> -            return;
> -        } else {
> -            VTD_DPRINTF(GENERAL, "error: read request from interrupt address "
> -                        "gpa 0x%"PRIx64, addr);
> -            vtd_report_dmar_fault(s, source_id, addr, VTD_FR_READ, is_write);
> -            return;
> -        }
> -    }
> +    /*
> +     * We have standalone memory region for interrupt addresses, we
> +     * should never receive translation requests in this region.
> +     */
> +    assert(!vtd_is_interrupt_addr(addr));
> +
>      /* Try to fetch slpte form IOTLB */
>      iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
>      if (iotlb_entry) {

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread
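
For context, the interrupt address range guarded by the new assertion is the
x86 MSI window at 0xfeexxxxx. The helper being asserted on is roughly of this
shape (the macro names and values below are written from memory for
illustration and may differ from the tree):

    #define VTD_INTERRUPT_ADDR_FIRST 0xfee00000ULL
    #define VTD_INTERRUPT_ADDR_LAST  0xfeefffffULL

    static inline bool vtd_is_interrupt_addr(hwaddr addr)
    {
        /* True when the access falls inside the MSI window */
        return addr >= VTD_INTERRUPT_ADDR_FIRST &&
               addr <= VTD_INTERRUPT_ADDR_LAST;
    }

Since MSI writes now land in the standalone interrupt memory region, a
translation request in that window would indicate a bug, hence the assert().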

* Re: [Qemu-devel] [PATCH v7 06/17] intel_iommu: renaming gpa to iova where proper
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 06/17] intel_iommu: renaming gpa to iova where proper Peter Xu
@ 2017-02-10  1:17   ` David Gibson
  0 siblings, 0 replies; 63+ messages in thread
From: David Gibson @ 2017-02-10  1:17 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

[-- Attachment #1: Type: text/plain, Size: 7342 bytes --]

On Tue, Feb 07, 2017 at 04:28:08PM +0800, Peter Xu wrote:
> There are lots of places in the current intel_iommu.c code where "gpa" is
> used when "iova" is meant. Using the name "gpa" in these places is really
> confusing (it is very easily understood as "Guest Physical Address", while
> it is not). To make the code (much) easier to read, I decided to do this
> once and for all.
> 
> No functional change is made. Only literal ones.
> 
> Reviewed-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Sounds like a good idea, that would certainly confuse me.

> ---
>  hw/i386/intel_iommu.c | 44 ++++++++++++++++++++++----------------------
>  1 file changed, 22 insertions(+), 22 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 86d19bb..0c94b79 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -259,7 +259,7 @@ static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
>      uint64_t *key = g_malloc(sizeof(*key));
>      uint64_t gfn = vtd_get_iotlb_gfn(addr, level);
>  
> -    VTD_DPRINTF(CACHE, "update iotlb sid 0x%"PRIx16 " gpa 0x%"PRIx64
> +    VTD_DPRINTF(CACHE, "update iotlb sid 0x%"PRIx16 " iova 0x%"PRIx64
>                  " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr, slpte,
>                  domain_id);
>      if (g_hash_table_size(s->iotlb) >= VTD_IOTLB_MAX_SIZE) {
> @@ -575,12 +575,12 @@ static uint64_t vtd_get_slpte(dma_addr_t base_addr, uint32_t index)
>      return slpte;
>  }
>  
> -/* Given a gpa and the level of paging structure, return the offset of current
> - * level.
> +/* Given an iova and the level of paging structure, return the offset
> + * of current level.
>   */
> -static inline uint32_t vtd_gpa_level_offset(uint64_t gpa, uint32_t level)
> +static inline uint32_t vtd_iova_level_offset(uint64_t iova, uint32_t level)
>  {
> -    return (gpa >> vtd_slpt_level_shift(level)) &
> +    return (iova >> vtd_slpt_level_shift(level)) &
>              ((1ULL << VTD_SL_LEVEL_BITS) - 1);
>  }
>  
> @@ -628,12 +628,12 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
>      }
>  }
>  
> -/* Given the @gpa, get relevant @slptep. @slpte_level will be the last level
> +/* Given the @iova, get relevant @slptep. @slpte_level will be the last level
>   * of the translation, can be used for deciding the size of large page.
>   */
> -static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
> -                            uint64_t *slptep, uint32_t *slpte_level,
> -                            bool *reads, bool *writes)
> +static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
> +                             uint64_t *slptep, uint32_t *slpte_level,
> +                             bool *reads, bool *writes)
>  {
>      dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
>      uint32_t level = vtd_get_level_from_context_entry(ce);
> @@ -642,11 +642,11 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
>      uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
>      uint64_t access_right_check;
>  
> -    /* Check if @gpa is above 2^X-1, where X is the minimum of MGAW in CAP_REG
> -     * and AW in context-entry.
> +    /* Check if @iova is above 2^X-1, where X is the minimum of MGAW
> +     * in CAP_REG and AW in context-entry.
>       */
> -    if (gpa & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
> -        VTD_DPRINTF(GENERAL, "error: gpa 0x%"PRIx64 " exceeds limits", gpa);
> +    if (iova & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
> +        VTD_DPRINTF(GENERAL, "error: iova 0x%"PRIx64 " exceeds limits", iova);
>          return -VTD_FR_ADDR_BEYOND_MGAW;
>      }
>  
> @@ -654,13 +654,13 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
>      access_right_check = is_write ? VTD_SL_W : VTD_SL_R;
>  
>      while (true) {
> -        offset = vtd_gpa_level_offset(gpa, level);
> +        offset = vtd_iova_level_offset(iova, level);
>          slpte = vtd_get_slpte(addr, offset);
>  
>          if (slpte == (uint64_t)-1) {
>              VTD_DPRINTF(GENERAL, "error: fail to access second-level paging "
> -                        "entry at level %"PRIu32 " for gpa 0x%"PRIx64,
> -                        level, gpa);
> +                        "entry at level %"PRIu32 " for iova 0x%"PRIx64,
> +                        level, iova);
>              if (level == vtd_get_level_from_context_entry(ce)) {
>                  /* Invalid programming of context-entry */
>                  return -VTD_FR_CONTEXT_ENTRY_INV;
> @@ -672,8 +672,8 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
>          *writes = (*writes) && (slpte & VTD_SL_W);
>          if (!(slpte & access_right_check)) {
>              VTD_DPRINTF(GENERAL, "error: lack of %s permission for "
> -                        "gpa 0x%"PRIx64 " slpte 0x%"PRIx64,
> -                        (is_write ? "write" : "read"), gpa, slpte);
> +                        "iova 0x%"PRIx64 " slpte 0x%"PRIx64,
> +                        (is_write ? "write" : "read"), iova, slpte);
>              return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
>          }
>          if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> @@ -827,7 +827,7 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>      /* Try to fetch slpte form IOTLB */
>      iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
>      if (iotlb_entry) {
> -        VTD_DPRINTF(CACHE, "hit iotlb sid 0x%"PRIx16 " gpa 0x%"PRIx64
> +        VTD_DPRINTF(CACHE, "hit iotlb sid 0x%"PRIx16 " iova 0x%"PRIx64
>                      " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr,
>                      iotlb_entry->slpte, iotlb_entry->domain_id);
>          slpte = iotlb_entry->slpte;
> @@ -867,8 +867,8 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>          cc_entry->context_cache_gen = s->context_cache_gen;
>      }
>  
> -    ret_fr = vtd_gpa_to_slpte(&ce, addr, is_write, &slpte, &level,
> -                              &reads, &writes);
> +    ret_fr = vtd_iova_to_slpte(&ce, addr, is_write, &slpte, &level,
> +                               &reads, &writes);
>      if (ret_fr) {
>          ret_fr = -ret_fr;
>          if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
> @@ -2033,7 +2033,7 @@ static IOMMUTLBEntry vtd_iommu_translate(MemoryRegion *iommu, hwaddr addr,
>                             is_write, &ret);
>      VTD_DPRINTF(MMU,
>                  "bus %"PRIu8 " slot %"PRIu8 " func %"PRIu8 " devfn %"PRIu8
> -                " gpa 0x%"PRIx64 " hpa 0x%"PRIx64, pci_bus_num(vtd_as->bus),
> +                " iova 0x%"PRIx64 " hpa 0x%"PRIx64, pci_bus_num(vtd_as->bus),
>                  VTD_PCI_SLOT(vtd_as->devfn), VTD_PCI_FUNC(vtd_as->devfn),
>                  vtd_as->devfn, addr, ret.translated_addr);
>      return ret;

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 07/17] intel_iommu: convert dbg macros to traces for inv
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 07/17] intel_iommu: convert dbg macros to traces for inv Peter Xu
  2017-02-08  2:47   ` Jason Wang
@ 2017-02-10  1:19   ` David Gibson
  1 sibling, 0 replies; 63+ messages in thread
From: David Gibson @ 2017-02-10  1:19 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

[-- Attachment #1: Type: text/plain, Size: 15159 bytes --]

On Tue, Feb 07, 2017 at 04:28:09PM +0800, Peter Xu wrote:
> The VT-d code is still using the static DEBUG_INTEL_IOMMU macro. That's
> not good: we should not have to recompile the code just to get useful
> debugging information for VT-d. Time to switch to the trace system. This
> is the first patch to do it.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  hw/i386/intel_iommu.c | 95 +++++++++++++++++++++------------------------------
>  hw/i386/trace-events  | 18 ++++++++++
>  2 files changed, 56 insertions(+), 57 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 0c94b79..08e43b6 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -35,6 +35,7 @@
>  #include "sysemu/kvm.h"
>  #include "hw/i386/apic_internal.h"
>  #include "kvm_i386.h"
> +#include "trace.h"
>  
>  /*#define DEBUG_INTEL_IOMMU*/
>  #ifdef DEBUG_INTEL_IOMMU
> @@ -474,22 +475,19 @@ static void vtd_handle_inv_queue_error(IntelIOMMUState *s)
>  /* Set the IWC field and try to generate an invalidation completion interrupt */
>  static void vtd_generate_completion_event(IntelIOMMUState *s)
>  {
> -    VTD_DPRINTF(INV, "completes an invalidation wait command with "
> -                "Interrupt Flag");
>      if (vtd_get_long_raw(s, DMAR_ICS_REG) & VTD_ICS_IWC) {
> -        VTD_DPRINTF(INV, "there is a previous interrupt condition to be "
> -                    "serviced by software, "
> -                    "new invalidation event is not generated");
> +        trace_vtd_inv_desc_wait_irq("One pending, skip current");
>          return;
>      }
>      vtd_set_clear_mask_long(s, DMAR_ICS_REG, 0, VTD_ICS_IWC);
>      vtd_set_clear_mask_long(s, DMAR_IECTL_REG, 0, VTD_IECTL_IP);
>      if (vtd_get_long_raw(s, DMAR_IECTL_REG) & VTD_IECTL_IM) {
> -        VTD_DPRINTF(INV, "IM filed in IECTL_REG is set, new invalidation "
> -                    "event is not generated");
> +        trace_vtd_inv_desc_wait_irq("IM in IECTL_REG is set, "
> +                                    "new event not generated");
>          return;
>      } else {
>          /* Generate the interrupt event */
> +        trace_vtd_inv_desc_wait_irq("Generating complete event");
>          vtd_generate_interrupt(s, DMAR_IEADDR_REG, DMAR_IEDATA_REG);
>          vtd_set_clear_mask_long(s, DMAR_IECTL_REG, VTD_IECTL_IP, 0);
>      }
> @@ -923,6 +921,7 @@ static void vtd_interrupt_remap_table_setup(IntelIOMMUState *s)
>  
>  static void vtd_context_global_invalidate(IntelIOMMUState *s)
>  {
> +    trace_vtd_inv_desc_cc_global();
>      s->context_cache_gen++;
>      if (s->context_cache_gen == VTD_CONTEXT_CACHE_GEN_MAX) {
>          vtd_reset_context_cache(s);
> @@ -962,9 +961,11 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
>      uint16_t mask;
>      VTDBus *vtd_bus;
>      VTDAddressSpace *vtd_as;
> -    uint16_t devfn;
> +    uint8_t bus_n, devfn;
>      uint16_t devfn_it;
>  
> +    trace_vtd_inv_desc_cc_devices(source_id, func_mask);
> +
>      switch (func_mask & 3) {
>      case 0:
>          mask = 0;   /* No bits in the SID field masked */
> @@ -980,16 +981,16 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
>          break;
>      }
>      mask = ~mask;
> -    VTD_DPRINTF(INV, "device-selective invalidation source 0x%"PRIx16
> -                    " mask %"PRIu16, source_id, mask);
> -    vtd_bus = vtd_find_as_from_bus_num(s, VTD_SID_TO_BUS(source_id));
> +
> +    bus_n = VTD_SID_TO_BUS(source_id);
> +    vtd_bus = vtd_find_as_from_bus_num(s, bus_n);
>      if (vtd_bus) {
>          devfn = VTD_SID_TO_DEVFN(source_id);
>          for (devfn_it = 0; devfn_it < X86_IOMMU_PCI_DEVFN_MAX; ++devfn_it) {
>              vtd_as = vtd_bus->dev_as[devfn_it];
>              if (vtd_as && ((devfn_it & mask) == (devfn & mask))) {
> -                VTD_DPRINTF(INV, "invalidate context-cahce of devfn 0x%"PRIx16,
> -                            devfn_it);
> +                trace_vtd_inv_desc_cc_device(bus_n, VTD_PCI_SLOT(devfn_it),
> +                                             VTD_PCI_FUNC(devfn_it));
>                  vtd_as->context_cache_entry.context_cache_gen = 0;
>              }
>          }
> @@ -1302,9 +1303,7 @@ static bool vtd_process_wait_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
>  {
>      if ((inv_desc->hi & VTD_INV_DESC_WAIT_RSVD_HI) ||
>          (inv_desc->lo & VTD_INV_DESC_WAIT_RSVD_LO)) {
> -        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in Invalidation "
> -                    "Wait Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
> -                    inv_desc->hi, inv_desc->lo);
> +        trace_vtd_inv_desc_wait_invalid(inv_desc->hi, inv_desc->lo);
>          return false;
>      }
>      if (inv_desc->lo & VTD_INV_DESC_WAIT_SW) {
> @@ -1316,21 +1315,18 @@ static bool vtd_process_wait_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
>  
>          /* FIXME: need to be masked with HAW? */
>          dma_addr_t status_addr = inv_desc->hi;
> -        VTD_DPRINTF(INV, "status data 0x%x, status addr 0x%"PRIx64,
> -                    status_data, status_addr);
> +        trace_vtd_inv_desc_wait_sw(status_addr, status_data);
>          status_data = cpu_to_le32(status_data);
>          if (dma_memory_write(&address_space_memory, status_addr, &status_data,
>                               sizeof(status_data))) {
> -            VTD_DPRINTF(GENERAL, "error: fail to perform a coherent write");
> +            trace_vtd_inv_desc_wait_write_fail(inv_desc->hi, inv_desc->lo);
>              return false;
>          }
>      } else if (inv_desc->lo & VTD_INV_DESC_WAIT_IF) {
>          /* Interrupt flag */
> -        VTD_DPRINTF(INV, "Invalidation Wait Descriptor interrupt completion");
>          vtd_generate_completion_event(s);
>      } else {
> -        VTD_DPRINTF(GENERAL, "error: invalid Invalidation Wait Descriptor: "
> -                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, inv_desc->hi, inv_desc->lo);
> +        trace_vtd_inv_desc_wait_invalid(inv_desc->hi, inv_desc->lo);
>          return false;
>      }
>      return true;
> @@ -1339,30 +1335,29 @@ static bool vtd_process_wait_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
>  static bool vtd_process_context_cache_desc(IntelIOMMUState *s,
>                                             VTDInvDesc *inv_desc)
>  {
> +    uint16_t sid, fmask;
> +
>      if ((inv_desc->lo & VTD_INV_DESC_CC_RSVD) || inv_desc->hi) {
> -        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in Context-cache "
> -                    "Invalidate Descriptor");
> +        trace_vtd_inv_desc_cc_invalid(inv_desc->hi, inv_desc->lo);
>          return false;
>      }
>      switch (inv_desc->lo & VTD_INV_DESC_CC_G) {
>      case VTD_INV_DESC_CC_DOMAIN:
> -        VTD_DPRINTF(INV, "domain-selective invalidation domain 0x%"PRIx16,
> -                    (uint16_t)VTD_INV_DESC_CC_DID(inv_desc->lo));
> +        trace_vtd_inv_desc_cc_domain(
> +            (uint16_t)VTD_INV_DESC_CC_DID(inv_desc->lo));
>          /* Fall through */
>      case VTD_INV_DESC_CC_GLOBAL:
> -        VTD_DPRINTF(INV, "global invalidation");
>          vtd_context_global_invalidate(s);
>          break;
>  
>      case VTD_INV_DESC_CC_DEVICE:
> -        vtd_context_device_invalidate(s, VTD_INV_DESC_CC_SID(inv_desc->lo),
> -                                      VTD_INV_DESC_CC_FM(inv_desc->lo));
> +        sid = VTD_INV_DESC_CC_SID(inv_desc->lo);
> +        fmask = VTD_INV_DESC_CC_FM(inv_desc->lo);
> +        vtd_context_device_invalidate(s, sid, fmask);
>          break;
>  
>      default:
> -        VTD_DPRINTF(GENERAL, "error: invalid granularity in Context-cache "
> -                    "Invalidate Descriptor hi 0x%"PRIx64  " lo 0x%"PRIx64,
> -                    inv_desc->hi, inv_desc->lo);
> +        trace_vtd_inv_desc_cc_invalid(inv_desc->hi, inv_desc->lo);
>          return false;
>      }
>      return true;
> @@ -1376,22 +1371,19 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
>  
>      if ((inv_desc->lo & VTD_INV_DESC_IOTLB_RSVD_LO) ||
>          (inv_desc->hi & VTD_INV_DESC_IOTLB_RSVD_HI)) {
> -        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in IOTLB "
> -                    "Invalidate Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
> -                    inv_desc->hi, inv_desc->lo);
> +        trace_vtd_inv_desc_iotlb_invalid(inv_desc->hi, inv_desc->lo);
>          return false;
>      }
>  
>      switch (inv_desc->lo & VTD_INV_DESC_IOTLB_G) {
>      case VTD_INV_DESC_IOTLB_GLOBAL:
> -        VTD_DPRINTF(INV, "global invalidation");
> +        trace_vtd_inv_desc_iotlb_global();
>          vtd_iotlb_global_invalidate(s);
>          break;
>  
>      case VTD_INV_DESC_IOTLB_DOMAIN:
>          domain_id = VTD_INV_DESC_IOTLB_DID(inv_desc->lo);
> -        VTD_DPRINTF(INV, "domain-selective invalidation domain 0x%"PRIx16,
> -                    domain_id);
> +        trace_vtd_inv_desc_iotlb_domain(domain_id);
>          vtd_iotlb_domain_invalidate(s, domain_id);
>          break;
>  
> @@ -1399,20 +1391,16 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
>          domain_id = VTD_INV_DESC_IOTLB_DID(inv_desc->lo);
>          addr = VTD_INV_DESC_IOTLB_ADDR(inv_desc->hi);
>          am = VTD_INV_DESC_IOTLB_AM(inv_desc->hi);
> -        VTD_DPRINTF(INV, "page-selective invalidation domain 0x%"PRIx16
> -                    " addr 0x%"PRIx64 " mask %"PRIu8, domain_id, addr, am);
> +        trace_vtd_inv_desc_iotlb_pages(domain_id, addr, am);
>          if (am > VTD_MAMV) {
> -            VTD_DPRINTF(GENERAL, "error: supported max address mask value is "
> -                        "%"PRIu8, (uint8_t)VTD_MAMV);
> +            trace_vtd_inv_desc_iotlb_invalid(inv_desc->hi, inv_desc->lo);
>              return false;
>          }
>          vtd_iotlb_page_invalidate(s, domain_id, addr, am);
>          break;
>  
>      default:
> -        VTD_DPRINTF(GENERAL, "error: invalid granularity in IOTLB Invalidate "
> -                    "Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
> -                    inv_desc->hi, inv_desc->lo);
> +        trace_vtd_inv_desc_iotlb_invalid(inv_desc->hi, inv_desc->lo);
>          return false;
>      }
>      return true;
> @@ -1511,33 +1499,28 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
>  
>      switch (desc_type) {
>      case VTD_INV_DESC_CC:
> -        VTD_DPRINTF(INV, "Context-cache Invalidate Descriptor hi 0x%"PRIx64
> -                    " lo 0x%"PRIx64, inv_desc.hi, inv_desc.lo);
> +        trace_vtd_inv_desc("context-cache", inv_desc.hi, inv_desc.lo);
>          if (!vtd_process_context_cache_desc(s, &inv_desc)) {
>              return false;
>          }
>          break;
>  
>      case VTD_INV_DESC_IOTLB:
> -        VTD_DPRINTF(INV, "IOTLB Invalidate Descriptor hi 0x%"PRIx64
> -                    " lo 0x%"PRIx64, inv_desc.hi, inv_desc.lo);
> +        trace_vtd_inv_desc("iotlb", inv_desc.hi, inv_desc.lo);
>          if (!vtd_process_iotlb_desc(s, &inv_desc)) {
>              return false;
>          }
>          break;
>  
>      case VTD_INV_DESC_WAIT:
> -        VTD_DPRINTF(INV, "Invalidation Wait Descriptor hi 0x%"PRIx64
> -                    " lo 0x%"PRIx64, inv_desc.hi, inv_desc.lo);
> +        trace_vtd_inv_desc("wait", inv_desc.hi, inv_desc.lo);
>          if (!vtd_process_wait_desc(s, &inv_desc)) {
>              return false;
>          }
>          break;
>  
>      case VTD_INV_DESC_IEC:
> -        VTD_DPRINTF(INV, "Invalidation Interrupt Entry Cache "
> -                    "Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
> -                    inv_desc.hi, inv_desc.lo);
> +        trace_vtd_inv_desc("iec", inv_desc.hi, inv_desc.lo);
>          if (!vtd_process_inv_iec_desc(s, &inv_desc)) {
>              return false;
>          }
> @@ -1552,9 +1535,7 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
>          break;
>  
>      default:
> -        VTD_DPRINTF(GENERAL, "error: unkonw Invalidation Descriptor type "
> -                    "hi 0x%"PRIx64 " lo 0x%"PRIx64 " type %"PRIu8,
> -                    inv_desc.hi, inv_desc.lo, desc_type);
> +        trace_vtd_inv_desc_invalid(inv_desc.hi, inv_desc.lo);
>          return false;
>      }
>      s->iq_head++;
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index 1cc4a10..02aeaab 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -3,6 +3,24 @@
>  # hw/i386/x86-iommu.c
>  x86_iommu_iec_notify(bool global, uint32_t index, uint32_t mask) "Notify IEC invalidation: global=%d index=%" PRIu32 " mask=%" PRIu32
>  
> +# hw/i386/intel_iommu.c
> +vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
> +vtd_inv_desc(const char *type, uint64_t hi, uint64_t lo) "invalidate desc type %s high 0x%"PRIx64" low 0x%"PRIx64
> +vtd_inv_desc_invalid(uint64_t hi, uint64_t lo) "invalid inv desc hi 0x%"PRIx64" lo 0x%"PRIx64
> +vtd_inv_desc_cc_domain(uint16_t domain) "context invalidate domain 0x%"PRIx16
> +vtd_inv_desc_cc_global(void) "context invalidate globally"
> +vtd_inv_desc_cc_device(uint8_t bus, uint8_t dev, uint8_t fn) "context invalidate device %02"PRIx8":%02"PRIx8".%02"PRIx8
> +vtd_inv_desc_cc_devices(uint16_t sid, uint16_t fmask) "context invalidate devices sid 0x%"PRIx16" fmask 0x%"PRIx16
> +vtd_inv_desc_cc_invalid(uint64_t hi, uint64_t lo) "invalid context-cache desc hi 0x%"PRIx64" lo 0x%"PRIx64
> +vtd_inv_desc_iotlb_global(void) "iotlb invalidate global"
> +vtd_inv_desc_iotlb_domain(uint16_t domain) "iotlb invalidate whole domain 0x%"PRIx16
> +vtd_inv_desc_iotlb_pages(uint16_t domain, uint64_t addr, uint8_t mask) "iotlb invalidate domain 0x%"PRIx16" addr 0x%"PRIx64" mask 0x%"PRIx8
> +vtd_inv_desc_iotlb_invalid(uint64_t hi, uint64_t lo) "invalid iotlb desc hi 0x%"PRIx64" lo 0x%"PRIx64
> +vtd_inv_desc_wait_sw(uint64_t addr, uint32_t data) "wait invalidate status write addr 0x%"PRIx64" data 0x%"PRIx32
> +vtd_inv_desc_wait_irq(const char *msg) "%s"
> +vtd_inv_desc_wait_invalid(uint64_t hi, uint64_t lo) "invalid wait desc hi 0x%"PRIx64" lo 0x%"PRIx64
> +vtd_inv_desc_wait_write_fail(uint64_t hi, uint64_t lo) "write fail for wait desc hi 0x%"PRIx64" lo 0x%"PRIx64
> +
>  # hw/i386/amd_iommu.c
>  amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
>  amdvi_cache_update(uint16_t domid, uint8_t bus, uint8_t slot, uint8_t func, uint64_t gpa, uint64_t txaddr) " update iotlb domid 0x%"PRIx16" devid: %02x:%02x.%x gpa 0x%"PRIx64" hpa 0x%"PRIx64

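As a side note for anyone chasing invalidation problems with these: the
new events can be enabled without a rebuild.  A sketch, assuming QEMU was
configured with a trace backend such as "log" or "simple":

    # enable all of the new invalidation-descriptor events at startup
    $ qemu-system-x86_64 ... -trace "vtd_inv_desc*"

    # or toggle a single event from the HMP monitor at runtime
    (qemu) trace-event vtd_inv_desc_iotlb_global on
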
-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v7 08/17] intel_iommu: convert dbg macros to trace for trans
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 08/17] intel_iommu: convert dbg macros to trace for trans Peter Xu
  2017-02-08  2:49   ` Jason Wang
@ 2017-02-10  1:20   ` David Gibson
  1 sibling, 0 replies; 63+ messages in thread
From: David Gibson @ 2017-02-10  1:20 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv


On Tue, Feb 07, 2017 at 04:28:10PM +0800, Peter Xu wrote:
> Another patch to convert the DPRINTF() calls. This one focuses on the
> address translation path and caching.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  hw/i386/intel_iommu.c | 69 ++++++++++++++++++---------------------------------
>  hw/i386/trace-events  | 10 ++++++++
>  2 files changed, 34 insertions(+), 45 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 08e43b6..ad304f6 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -260,11 +260,9 @@ static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
>      uint64_t *key = g_malloc(sizeof(*key));
>      uint64_t gfn = vtd_get_iotlb_gfn(addr, level);
>  
> -    VTD_DPRINTF(CACHE, "update iotlb sid 0x%"PRIx16 " iova 0x%"PRIx64
> -                " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr, slpte,
> -                domain_id);
> +    trace_vtd_iotlb_page_update(source_id, addr, slpte, domain_id);
>      if (g_hash_table_size(s->iotlb) >= VTD_IOTLB_MAX_SIZE) {
> -        VTD_DPRINTF(CACHE, "iotlb exceeds size limit, forced to reset");
> +        trace_vtd_iotlb_reset("iotlb exceeds size limit");
>          vtd_reset_iotlb(s);
>      }
>  
> @@ -505,8 +503,7 @@ static int vtd_get_root_entry(IntelIOMMUState *s, uint8_t index,
>  
>      addr = s->root + index * sizeof(*re);
>      if (dma_memory_read(&address_space_memory, addr, re, sizeof(*re))) {
> -        VTD_DPRINTF(GENERAL, "error: fail to access root-entry at 0x%"PRIx64
> -                    " + %"PRIu8, s->root, index);
> +        trace_vtd_re_invalid(re->rsvd, re->val);
>          re->val = 0;
>          return -VTD_FR_ROOT_TABLE_INV;
>      }
> @@ -524,15 +521,10 @@ static int vtd_get_context_entry_from_root(VTDRootEntry *root, uint8_t index,
>  {
>      dma_addr_t addr;
>  
> -    if (!vtd_root_entry_present(root)) {
> -        VTD_DPRINTF(GENERAL, "error: root-entry is not present");
> -        return -VTD_FR_ROOT_ENTRY_P;
> -    }
> +    /* we have checked that root entry is present */
>      addr = (root->val & VTD_ROOT_ENTRY_CTP) + index * sizeof(*ce);
>      if (dma_memory_read(&address_space_memory, addr, ce, sizeof(*ce))) {
> -        VTD_DPRINTF(GENERAL, "error: fail to access context-entry at 0x%"PRIx64
> -                    " + %"PRIu8,
> -                    (uint64_t)(root->val & VTD_ROOT_ENTRY_CTP), index);
> +        trace_vtd_re_invalid(root->rsvd, root->val);
>          return -VTD_FR_CONTEXT_TABLE_INV;
>      }
>      ce->lo = le64_to_cpu(ce->lo);
> @@ -704,12 +696,11 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>      }
>  
>      if (!vtd_root_entry_present(&re)) {
> -        VTD_DPRINTF(GENERAL, "error: root-entry #%"PRIu8 " is not present",
> -                    bus_num);
> +        /* Not error - it's okay we don't have root entry. */
> +        trace_vtd_re_not_present(bus_num);
>          return -VTD_FR_ROOT_ENTRY_P;
>      } else if (re.rsvd || (re.val & VTD_ROOT_ENTRY_RSVD)) {
> -        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in root-entry "
> -                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, re.rsvd, re.val);
> +        trace_vtd_re_invalid(re.rsvd, re.val);
>          return -VTD_FR_ROOT_ENTRY_RSVD;
>      }
>  
> @@ -719,22 +710,17 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>      }
>  
>      if (!vtd_context_entry_present(ce)) {
> -        VTD_DPRINTF(GENERAL,
> -                    "error: context-entry #%"PRIu8 "(bus #%"PRIu8 ") "
> -                    "is not present", devfn, bus_num);
> +        /* Not error - it's okay we don't have context entry. */
> +        trace_vtd_ce_not_present(bus_num, devfn);
>          return -VTD_FR_CONTEXT_ENTRY_P;
>      } else if ((ce->hi & VTD_CONTEXT_ENTRY_RSVD_HI) ||
>                 (ce->lo & VTD_CONTEXT_ENTRY_RSVD_LO)) {
> -        VTD_DPRINTF(GENERAL,
> -                    "error: non-zero reserved field in context-entry "
> -                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, ce->hi, ce->lo);
> +        trace_vtd_ce_invalid(ce->hi, ce->lo);
>          return -VTD_FR_CONTEXT_ENTRY_RSVD;
>      }
>      /* Check if the programming of context-entry is valid */
>      if (!vtd_is_level_supported(s, vtd_get_level_from_context_entry(ce))) {
> -        VTD_DPRINTF(GENERAL, "error: unsupported Address Width value in "
> -                    "context-entry hi 0x%"PRIx64 " lo 0x%"PRIx64,
> -                    ce->hi, ce->lo);
> +        trace_vtd_ce_invalid(ce->hi, ce->lo);
>          return -VTD_FR_CONTEXT_ENTRY_INV;
>      } else {
>          switch (ce->lo & VTD_CONTEXT_ENTRY_TT) {
> @@ -743,9 +729,7 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>          case VTD_CONTEXT_TT_DEV_IOTLB:
>              break;
>          default:
> -            VTD_DPRINTF(GENERAL, "error: unsupported Translation Type in "
> -                        "context-entry hi 0x%"PRIx64 " lo 0x%"PRIx64,
> -                        ce->hi, ce->lo);
> +            trace_vtd_ce_invalid(ce->hi, ce->lo);
>              return -VTD_FR_CONTEXT_ENTRY_INV;
>          }
>      }
> @@ -825,9 +809,8 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>      /* Try to fetch slpte form IOTLB */
>      iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
>      if (iotlb_entry) {
> -        VTD_DPRINTF(CACHE, "hit iotlb sid 0x%"PRIx16 " iova 0x%"PRIx64
> -                    " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr,
> -                    iotlb_entry->slpte, iotlb_entry->domain_id);
> +        trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->slpte,
> +                                 iotlb_entry->domain_id);
>          slpte = iotlb_entry->slpte;
>          reads = iotlb_entry->read_flags;
>          writes = iotlb_entry->write_flags;
> @@ -836,10 +819,9 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>      }
>      /* Try to fetch context-entry from cache first */
>      if (cc_entry->context_cache_gen == s->context_cache_gen) {
> -        VTD_DPRINTF(CACHE, "hit context-cache bus %d devfn %d "
> -                    "(hi %"PRIx64 " lo %"PRIx64 " gen %"PRIu32 ")",
> -                    bus_num, devfn, cc_entry->context_entry.hi,
> -                    cc_entry->context_entry.lo, cc_entry->context_cache_gen);
> +        trace_vtd_iotlb_cc_hit(bus_num, devfn, cc_entry->context_entry.hi,
> +                               cc_entry->context_entry.lo,
> +                               cc_entry->context_cache_gen);
>          ce = cc_entry->context_entry;
>          is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
>      } else {
> @@ -848,19 +830,16 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>          if (ret_fr) {
>              ret_fr = -ret_fr;
>              if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
> -                VTD_DPRINTF(FLOG, "fault processing is disabled for DMA "
> -                            "requests through this context-entry "
> -                            "(with FPD Set)");
> +                trace_vtd_fault_disabled();
>              } else {
>                  vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
>              }
>              return;
>          }
>          /* Update context-cache */
> -        VTD_DPRINTF(CACHE, "update context-cache bus %d devfn %d "
> -                    "(hi %"PRIx64 " lo %"PRIx64 " gen %"PRIu32 "->%"PRIu32 ")",
> -                    bus_num, devfn, ce.hi, ce.lo,
> -                    cc_entry->context_cache_gen, s->context_cache_gen);
> +        trace_vtd_iotlb_cc_update(bus_num, devfn, ce.hi, ce.lo,
> +                                  cc_entry->context_cache_gen,
> +                                  s->context_cache_gen);
>          cc_entry->context_entry = ce;
>          cc_entry->context_cache_gen = s->context_cache_gen;
>      }
> @@ -870,8 +849,7 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>      if (ret_fr) {
>          ret_fr = -ret_fr;
>          if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
> -            VTD_DPRINTF(FLOG, "fault processing is disabled for DMA requests "
> -                        "through this context-entry (with FPD Set)");
> +            trace_vtd_fault_disabled();
>          } else {
>              vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
>          }
> @@ -1031,6 +1009,7 @@ static uint64_t vtd_context_cache_invalidate(IntelIOMMUState *s, uint64_t val)
>  
>  static void vtd_iotlb_global_invalidate(IntelIOMMUState *s)
>  {
> +    trace_vtd_iotlb_reset("global invalidation recved");
>      vtd_reset_iotlb(s);
>  }
>  
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index 02aeaab..88ad5e4 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -20,6 +20,16 @@ vtd_inv_desc_wait_sw(uint64_t addr, uint32_t data) "wait invalidate status write
>  vtd_inv_desc_wait_irq(const char *msg) "%s"
>  vtd_inv_desc_wait_invalid(uint64_t hi, uint64_t lo) "invalid wait desc hi 0x%"PRIx64" lo 0x%"PRIx64
>  vtd_inv_desc_wait_write_fail(uint64_t hi, uint64_t lo) "write fail for wait desc hi 0x%"PRIx64" lo 0x%"PRIx64
> +vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
> +vtd_re_invalid(uint64_t hi, uint64_t lo) "invalid root entry hi 0x%"PRIx64" lo 0x%"PRIx64
> +vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
> +vtd_ce_invalid(uint64_t hi, uint64_t lo) "invalid context entry hi 0x%"PRIx64" lo 0x%"PRIx64
> +vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
> +vtd_iotlb_page_update(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page update sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
> +vtd_iotlb_cc_hit(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen) "IOTLB context hit bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32
> +vtd_iotlb_cc_update(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen1, uint32_t gen2) "IOTLB context update bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32" -> gen %"PRIu32
> +vtd_iotlb_reset(const char *reason) "IOTLB reset (reason: %s)"
> +vtd_fault_disabled(void) "Fault processing disabled for context entry"
>  
>  # hw/i386/amd_iommu.c
>  amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v7 09/17] intel_iommu: vtd_slpt_level_shift check level
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 09/17] intel_iommu: vtd_slpt_level_shift check level Peter Xu
@ 2017-02-10  1:20   ` David Gibson
  0 siblings, 0 replies; 63+ messages in thread
From: David Gibson @ 2017-02-10  1:20 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv


On Tue, Feb 07, 2017 at 04:28:11PM +0800, Peter Xu wrote:
> This helps with debugging when an incorrect level is passed in.
> 
> Reviewed-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  hw/i386/intel_iommu.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index ad304f6..22d8226 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -168,6 +168,7 @@ static gboolean vtd_hash_remove_by_domain(gpointer key, gpointer value,
>  /* The shift of an addr for a certain level of paging structure */
>  static inline uint32_t vtd_slpt_level_shift(uint32_t level)
>  {
> +    assert(level != 0);
>      return VTD_PAGE_SHIFT_4K + (level - 1) * VTD_SL_LEVEL_BITS;
>  }
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v7 10/17] memory: add section range info for IOMMU notifier
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 10/17] memory: add section range info for IOMMU notifier Peter Xu
@ 2017-02-10  2:29   ` David Gibson
  0 siblings, 0 replies; 63+ messages in thread
From: David Gibson @ 2017-02-10  2:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv


On Tue, Feb 07, 2017 at 04:28:12PM +0800, Peter Xu wrote:
> In this patch, IOMMUNotifier.{start|end} are introduced to store section
> information for a specific notifier. When a notification occurs, we not
> only check the notification type (MAP|UNMAP), but also check whether the
> notified iova range overlaps with the range of the specific IOMMU
> notifier, and skip notifiers whose listened range is not touched.
> 
> When removing a region, we need to make sure we remove the correct
> VFIOGuestIOMMU by checking the IOMMUNotifier.start address as well.
> 
> Suggested-by: David Gibson <david@gibson.dropbear.id.au>
> Acked-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  hw/vfio/common.c      | 12 +++++++++---
>  hw/virtio/vhost.c     |  4 ++--
>  include/exec/memory.h | 19 ++++++++++++++++++-
>  memory.c              |  9 +++++++++
>  4 files changed, 38 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index f3ba9b9..6b33b9f 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -478,8 +478,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
>          giommu->iommu_offset = section->offset_within_address_space -
>                                 section->offset_within_region;
>          giommu->container = container;
> -        giommu->n.notify = vfio_iommu_map_notify;
> -        giommu->n.notifier_flags = IOMMU_NOTIFIER_ALL;
> +        llend = int128_add(int128_make64(section->offset_within_region),
> +                           section->size);
> +        llend = int128_sub(llend, int128_one());
> +        iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
> +                            IOMMU_NOTIFIER_ALL,
> +                            section->offset_within_region,
> +                            int128_get64(llend));
>          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>  
>          memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> @@ -550,7 +555,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
>          VFIOGuestIOMMU *giommu;
>  
>          QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> -            if (giommu->iommu == section->mr) {
> +            if (giommu->iommu == section->mr &&
> +                giommu->n.start == section->offset_within_region) {
>                  memory_region_unregister_iommu_notifier(giommu->iommu,
>                                                          &giommu->n);
>                  QLIST_REMOVE(giommu, giommu_next);
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index febe519..ccf8b2e 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -1244,8 +1244,8 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
>          .priority = 10
>      };
>  
> -    hdev->n.notify = vhost_iommu_unmap_notify;
> -    hdev->n.notifier_flags = IOMMU_NOTIFIER_UNMAP;
> +    iommu_notifier_init(&hdev->n, vhost_iommu_unmap_notify,
> +                        IOMMU_NOTIFIER_UNMAP, 0, ~0ULL);
>  
>      if (hdev->migration_blocker == NULL) {
>          if (!(hdev->features & (0x1ULL << VHOST_F_LOG_ALL))) {
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 987f925..805a88a 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -77,13 +77,30 @@ typedef enum {
>  
>  #define IOMMU_NOTIFIER_ALL (IOMMU_NOTIFIER_MAP | IOMMU_NOTIFIER_UNMAP)
>  
> +struct IOMMUNotifier;
> +typedef void (*IOMMUNotify)(struct IOMMUNotifier *notifier,
> +                            IOMMUTLBEntry *data);
> +
>  struct IOMMUNotifier {
> -    void (*notify)(struct IOMMUNotifier *notifier, IOMMUTLBEntry *data);
> +    IOMMUNotify notify;
>      IOMMUNotifierFlag notifier_flags;
> +    /* Notify for address space range start <= addr <= end */
> +    hwaddr start;
> +    hwaddr end;
>      QLIST_ENTRY(IOMMUNotifier) node;
>  };
>  typedef struct IOMMUNotifier IOMMUNotifier;
>  
> +static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
> +                                       IOMMUNotifierFlag flags,
> +                                       hwaddr start, hwaddr end)
> +{
> +    n->notify = fn;
> +    n->notifier_flags = flags;
> +    n->start = start;
> +    n->end = end;
> +}
> +
>  /* New-style MMIO accessors can indicate that the transaction failed.
>   * A zero (MEMTX_OK) response means success; anything else is a failure
>   * of some kind. The memory subsystem will bitwise-OR together results
> diff --git a/memory.c b/memory.c
> index 6c58373..4900bbf 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1610,6 +1610,7 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr,
>  
>      /* We need to register for at least one bitfield */
>      assert(n->notifier_flags != IOMMU_NOTIFIER_NONE);
> +    assert(n->start <= n->end);
>      QLIST_INSERT_HEAD(&mr->iommu_notify, n, node);
>      memory_region_update_iommu_notify_flags(mr);
>  }
> @@ -1671,6 +1672,14 @@ void memory_region_notify_iommu(MemoryRegion *mr,
>      }
>  
>      QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
> +        /*
> +         * Skip the notification if the notification does not overlap
> +         * with registered range.
> +         */
> +        if (iommu_notifier->start > entry.iova + entry.addr_mask + 1 ||
> +            iommu_notifier->end < entry.iova) {
> +            continue;
> +        }
>          if (iommu_notifier->notifier_flags & request_flags) {
>              iommu_notifier->notify(iommu_notifier, &entry);
>          }

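Just as a note for future users of the helper - registering a ranged
notifier now collapses to a few lines.  A minimal sketch (my_unmap_notify,
dev and mr below are placeholders, not anything in this series):

    static void my_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *entry)
    {
        /* react to the unmapped IOVA range described by *entry */
    }

    /* The notifier must stay alive while registered, so keep it in the
     * device state rather than on the stack.  Here we only listen to
     * UNMAP events for the first 4GiB of the region. */
    iommu_notifier_init(&dev->n, my_unmap_notify, IOMMU_NOTIFIER_UNMAP,
                        0, (1ULL << 32) - 1);
    memory_region_register_iommu_notifier(mr, &dev->n);
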
-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v7 11/17] memory: provide IOMMU_NOTIFIER_FOREACH macro
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 11/17] memory: provide IOMMU_NOTIFIER_FOREACH macro Peter Xu
@ 2017-02-10  2:30   ` David Gibson
  0 siblings, 0 replies; 63+ messages in thread
From: David Gibson @ 2017-02-10  2:30 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv


On Tue, Feb 07, 2017 at 04:28:13PM +0800, Peter Xu wrote:
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  include/exec/memory.h | 3 +++
>  memory.c              | 4 ++--
>  2 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 805a88a..f76e174 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -239,6 +239,9 @@ struct MemoryRegion {
>      IOMMUNotifierFlag iommu_notify_flags;
>  };
>  
> +#define IOMMU_NOTIFIER_FOREACH(n, mr) \
> +    QLIST_FOREACH((n), &(mr)->iommu_notify, node)
> +
>  /**
>   * MemoryListener: callbacks structure for updates to the physical memory map
>   *
> diff --git a/memory.c b/memory.c
> index 4900bbf..523c43f 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1587,7 +1587,7 @@ static void memory_region_update_iommu_notify_flags(MemoryRegion *mr)
>      IOMMUNotifierFlag flags = IOMMU_NOTIFIER_NONE;
>      IOMMUNotifier *iommu_notifier;
>  
> -    QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
> +    IOMMU_NOTIFIER_FOREACH(iommu_notifier, mr) {
>          flags |= iommu_notifier->notifier_flags;
>      }
>  
> @@ -1671,7 +1671,7 @@ void memory_region_notify_iommu(MemoryRegion *mr,
>          request_flags = IOMMU_NOTIFIER_UNMAP;
>      }
>  
> -    QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
> +    IOMMU_NOTIFIER_FOREACH(iommu_notifier, mr) {
>          /*
>           * Skip the notification if the notification does not overlap
>           * with registered range.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v7 12/17] memory: provide iommu_replay_all()
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 12/17] memory: provide iommu_replay_all() Peter Xu
@ 2017-02-10  2:31   ` David Gibson
  0 siblings, 0 replies; 63+ messages in thread
From: David Gibson @ 2017-02-10  2:31 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv


On Tue, Feb 07, 2017 at 04:28:14PM +0800, Peter Xu wrote:
> This is an "global" version of exising memory_region_iommu_replay() - we
> announce the translations to all the registered notifiers, instead of a
> specific one.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  include/exec/memory.h | 8 ++++++++
>  memory.c              | 9 +++++++++
>  2 files changed, 17 insertions(+)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index f76e174..606ce88 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -707,6 +707,14 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
>                                  bool is_write);
>  
>  /**
> + * memory_region_iommu_replay_all: replay existing IOMMU translations
> + * to all the notifiers registered.
> + *
> + * @mr: the memory region to observe
> + */
> +void memory_region_iommu_replay_all(MemoryRegion *mr);
> +
> +/**
>   * memory_region_unregister_iommu_notifier: unregister a notifier for
>   * changes to IOMMU translation entries.
>   *
> diff --git a/memory.c b/memory.c
> index 523c43f..9e1bb75 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1646,6 +1646,15 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
>      }
>  }
>  
> +void memory_region_iommu_replay_all(MemoryRegion *mr)
> +{
> +    IOMMUNotifier *notifier;
> +
> +    IOMMU_NOTIFIER_FOREACH(notifier, mr) {
> +        memory_region_iommu_replay(mr, notifier, false);
> +    }
> +}
> +
>  void memory_region_unregister_iommu_notifier(MemoryRegion *mr,
>                                               IOMMUNotifier *n)
>  {

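For illustration, the intended call-site shape is just this (mr below is a
placeholder for an IOMMU MemoryRegion whose translations were rebuilt):

    /* push the rebuilt translations to every registered notifier */
    memory_region_iommu_replay_all(mr);
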
-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v7 13/17] memory: introduce memory_region_notify_one()
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 13/17] memory: introduce memory_region_notify_one() Peter Xu
@ 2017-02-10  2:33   ` David Gibson
  0 siblings, 0 replies; 63+ messages in thread
From: David Gibson @ 2017-02-10  2:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv


On Tue, Feb 07, 2017 at 04:28:15PM +0800, Peter Xu wrote:
> Generalize the notify logic in memory_region_notify_iommu() into a
> single function. This can then be reused by customized replay()
> functions for IOMMUs.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  include/exec/memory.h | 15 +++++++++++++++
>  memory.c              | 40 ++++++++++++++++++++++++----------------
>  2 files changed, 39 insertions(+), 16 deletions(-)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 606ce88..0767888 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -682,6 +682,21 @@ void memory_region_notify_iommu(MemoryRegion *mr,
>                                  IOMMUTLBEntry entry);
>  
>  /**
> + * memory_region_notify_one: notify a change in an IOMMU translation
> + *                           entry to a single notifier
> + *
> + * This works just like memory_region_notify_iommu(), but it only
> + * notifies a specific notifier, not all of them.
> + *
> + * @notifier: the notifier to be notified
> + * @entry: the new entry in the IOMMU translation table.  The entry
> + *         replaces all old entries for the same virtual I/O address range.
> + *         Deleted entries have .@perm == 0.
> + */
> +void memory_region_notify_one(IOMMUNotifier *notifier,
> +                              IOMMUTLBEntry *entry);
> +
> +/**
>   * memory_region_register_iommu_notifier: register a notifier for changes to
>   * IOMMU translation entries.
>   *
> diff --git a/memory.c b/memory.c
> index 9e1bb75..7a4f2f9 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1666,32 +1666,40 @@ void memory_region_unregister_iommu_notifier(MemoryRegion *mr,
>      memory_region_update_iommu_notify_flags(mr);
>  }
>  
> -void memory_region_notify_iommu(MemoryRegion *mr,
> -                                IOMMUTLBEntry entry)
> +void memory_region_notify_one(IOMMUNotifier *notifier,
> +                              IOMMUTLBEntry *entry)
>  {
> -    IOMMUNotifier *iommu_notifier;
>      IOMMUNotifierFlag request_flags;
>  
> -    assert(memory_region_is_iommu(mr));
> +    /*
> +     * Skip the notification if the notification does not overlap
> +     * with registered range.
> +     */
> +    if (notifier->start > entry->iova + entry->addr_mask + 1 ||
> +        notifier->end < entry->iova) {
> +        return;
> +    }
>  
> -    if (entry.perm & IOMMU_RW) {
> +    if (entry->perm & IOMMU_RW) {
>          request_flags = IOMMU_NOTIFIER_MAP;
>      } else {
>          request_flags = IOMMU_NOTIFIER_UNMAP;
>      }
>  
> +    if (notifier->notifier_flags & request_flags) {
> +        notifier->notify(notifier, entry);
> +    }
> +}
> +
> +void memory_region_notify_iommu(MemoryRegion *mr,
> +                                IOMMUTLBEntry entry)
> +{
> +    IOMMUNotifier *iommu_notifier;
> +
> +    assert(memory_region_is_iommu(mr));
> +
>      IOMMU_NOTIFIER_FOREACH(iommu_notifier, mr) {
> -        /*
> -         * Skip the notification if the notification does not overlap
> -         * with registered range.
> -         */
> -        if (iommu_notifier->start > entry.iova + entry.addr_mask + 1 ||
> -            iommu_notifier->end < entry.iova) {
> -            continue;
> -        }
> -        if (iommu_notifier->notifier_flags & request_flags) {
> -            iommu_notifier->notify(iommu_notifier, &entry);
> -        }
> +        memory_region_notify_one(iommu_notifier, &entry);
>      }
>  }
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v7 14/17] memory: add MemoryRegionIOMMUOps.replay() callback
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 14/17] memory: add MemoryRegionIOMMUOps.replay() callback Peter Xu
@ 2017-02-10  2:34   ` David Gibson
  2017-03-27  8:35   ` Liu, Yi L
  1 sibling, 0 replies; 63+ messages in thread
From: David Gibson @ 2017-02-10  2:34 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv


On Tue, Feb 07, 2017 at 04:28:16PM +0800, Peter Xu wrote:
> Originally we have a single memory_region_iommu_replay() function, whose
> default behavior is to replay the translations of the whole IOMMU
> region. However, on some platforms like x86, we may want our own replay
> logic for IOMMU regions. This patch adds one more hook to
> MemoryRegionIOMMUOps for that callback; it overrides the default if set.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  include/exec/memory.h | 2 ++
>  memory.c              | 6 ++++++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 0767888..30b2a74 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
>      void (*notify_flag_changed)(MemoryRegion *iommu,
>                                  IOMMUNotifierFlag old_flags,
>                                  IOMMUNotifierFlag new_flags);
> +    /* Set this up to provide customized IOMMU replay function */
> +    void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
>  };
>  
>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> diff --git a/memory.c b/memory.c
> index 7a4f2f9..9c253cc 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1630,6 +1630,12 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
>      hwaddr addr, granularity;
>      IOMMUTLBEntry iotlb;
>  
> +    /* If the IOMMU has its own replay callback, override */
> +    if (mr->iommu_ops->replay) {
> +        mr->iommu_ops->replay(mr, n);
> +        return;
> +    }
> +
>      granularity = memory_region_iommu_get_min_page_size(mr);
>  
>      for (addr = 0; addr < memory_region_size(mr); addr += granularity) {

-- 
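For reference, opting in is a one-liner in the IOMMU model's init path;
this mirrors what the VT-d patch later in this series does:

    s->iommu_ops.replay = vtd_iommu_replay;
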
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v7 15/17] intel_iommu: provide its own replay() callback
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 15/17] intel_iommu: provide its own replay() callback Peter Xu
@ 2017-02-10  2:36   ` David Gibson
  0 siblings, 0 replies; 63+ messages in thread
From: David Gibson @ 2017-02-10  2:36 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv


On Tue, Feb 07, 2017 at 04:28:17PM +0800, Peter Xu wrote:
> The default replay() doesn't work for VT-d since VT-d has a huge
> default memory region which covers the address range 0-(2^64-1).
> Walking that normally consumes a lot of time (and looks like a dead loop).
> 
> The solution is simple - we don't walk over the whole region. Instead,
> we skip ahead whenever we find that a page directory entry is empty.
> This greatly reduces the time needed to walk the whole region.
> 
> To achieve this, a page walk helper is provided that invokes a
> corresponding hook function for each page we are interested in.
> vtd_page_walk_level() is the core logic for the page walking. Its
> interface is designed to suit further use cases, e.g., invalidating a
> range of addresses.
> 
> Reviewed-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

For small values of reviewed:

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

The concept is sensible and there's nothing obviously wrong.  But, I'm
not familiar enough with the VT-d page table format to check the code
in detail.

> ---
>  hw/i386/intel_iommu.c | 182 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  hw/i386/trace-events  |   7 ++
>  include/exec/memory.h |   2 +
>  3 files changed, 186 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 22d8226..f8d5713 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -595,6 +595,22 @@ static inline uint32_t vtd_get_agaw_from_context_entry(VTDContextEntry *ce)
>      return 30 + (ce->hi & VTD_CONTEXT_ENTRY_AW) * 9;
>  }
>  
> +static inline uint64_t vtd_iova_limit(VTDContextEntry *ce)
> +{
> +    uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
> +    return 1ULL << MIN(ce_agaw, VTD_MGAW);
> +}
> +
> +/* Return true if IOVA passes range check, otherwise false. */
> +static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce)
> +{
> +    /*
> +     * Check if @iova is above 2^X-1, where X is the minimum of MGAW
> +     * in CAP_REG and AW in context-entry.
> +     */
> +    return !(iova & ~(vtd_iova_limit(ce) - 1));
> +}
> +
>  static const uint64_t vtd_paging_entry_rsvd_field[] = {
>      [0] = ~0ULL,
>      /* For not large page */
> @@ -630,13 +646,9 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
>      uint32_t level = vtd_get_level_from_context_entry(ce);
>      uint32_t offset;
>      uint64_t slpte;
> -    uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
>      uint64_t access_right_check;
>  
> -    /* Check if @iova is above 2^X-1, where X is the minimum of MGAW
> -     * in CAP_REG and AW in context-entry.
> -     */
> -    if (iova & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
> +    if (!vtd_iova_range_check(iova, ce)) {
>          VTD_DPRINTF(GENERAL, "error: iova 0x%"PRIx64 " exceeds limits", iova);
>          return -VTD_FR_ADDR_BEYOND_MGAW;
>      }
> @@ -684,6 +696,134 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
>      }
>  }
>  
> +typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
> +
> +/**
> + * vtd_page_walk_level - walk over specific level for IOVA range
> + *
> + * @addr: base GPA addr to start the walk
> + * @start: IOVA range start address
> + * @end: IOVA range end address (start <= addr < end)
> + * @hook_fn: hook func to be called when detected page
> + * @private: private data to be passed into hook func
> + * @read: whether parent level has read permission
> + * @write: whether parent level has write permission
> + * @notify_unmap: whether we should notify invalid entries
> + */
> +static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> +                               uint64_t end, vtd_page_walk_hook hook_fn,
> +                               void *private, uint32_t level,
> +                               bool read, bool write, bool notify_unmap)
> +{
> +    bool read_cur, write_cur, entry_valid;
> +    uint32_t offset;
> +    uint64_t slpte;
> +    uint64_t subpage_size, subpage_mask;
> +    IOMMUTLBEntry entry;
> +    uint64_t iova = start;
> +    uint64_t iova_next;
> +    int ret = 0;
> +
> +    trace_vtd_page_walk_level(addr, level, start, end);
> +
> +    subpage_size = 1ULL << vtd_slpt_level_shift(level);
> +    subpage_mask = vtd_slpt_level_page_mask(level);
> +
> +    while (iova < end) {
> +        iova_next = (iova & subpage_mask) + subpage_size;
> +
> +        offset = vtd_iova_level_offset(iova, level);
> +        slpte = vtd_get_slpte(addr, offset);
> +
> +        if (slpte == (uint64_t)-1) {
> +            trace_vtd_page_walk_skip_read(iova, iova_next);
> +            goto next;
> +        }
> +
> +        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> +            trace_vtd_page_walk_skip_reserve(iova, iova_next);
> +            goto next;
> +        }
> +
> +        /* Permissions are stacked with parents' */
> +        read_cur = read && (slpte & VTD_SL_R);
> +        write_cur = write && (slpte & VTD_SL_W);
> +
> +        /*
> +         * As long as we have either read/write permission, this is a
> +         * valid entry. The rule works for both page entries and page
> +         * table entries.
> +         */
> +        entry_valid = read_cur | write_cur;
> +
> +        if (vtd_is_last_slpte(slpte, level)) {
> +            entry.target_as = &address_space_memory;
> +            entry.iova = iova & subpage_mask;
> +            /* NOTE: this is only meaningful if entry_valid == true */
> +            entry.translated_addr = vtd_get_slpte_addr(slpte);
> +            entry.addr_mask = ~subpage_mask;
> +            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
> +            if (!entry_valid && !notify_unmap) {
> +                trace_vtd_page_walk_skip_perm(iova, iova_next);
> +                goto next;
> +            }
> +            trace_vtd_page_walk_one(level, entry.iova, entry.translated_addr,
> +                                    entry.addr_mask, entry.perm);
> +            if (hook_fn) {
> +                ret = hook_fn(&entry, private);
> +                if (ret < 0) {
> +                    return ret;
> +                }
> +            }
> +        } else {
> +            if (!entry_valid) {
> +                trace_vtd_page_walk_skip_perm(iova, iova_next);
> +                goto next;
> +            }
> +            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte), iova,
> +                                      MIN(iova_next, end), hook_fn, private,
> +                                      level - 1, read_cur, write_cur,
> +                                      notify_unmap);
> +            if (ret < 0) {
> +                return ret;
> +            }
> +        }
> +
> +next:
> +        iova = iova_next;
> +    }
> +
> +    return 0;
> +}
> +
> +/**
> + * vtd_page_walk - walk specific IOVA range, and call the hook
> + *
> + * @ce: context entry to walk upon
> + * @start: IOVA address to start the walk
> + * @end: IOVA range end address (start <= addr < end)
> + * @hook_fn: the hook that to be called for each detected area
> + * @private: private data for the hook function
> + */
> +static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
> +                         vtd_page_walk_hook hook_fn, void *private)
> +{
> +    dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
> +    uint32_t level = vtd_get_level_from_context_entry(ce);
> +
> +    if (!vtd_iova_range_check(start, ce)) {
> +        return -VTD_FR_ADDR_BEYOND_MGAW;
> +    }
> +
> +    if (!vtd_iova_range_check(end, ce)) {
> +        /* Fix end so that it reaches the maximum */
> +        end = vtd_iova_limit(ce);
> +    }
> +
> +    return vtd_page_walk_level(addr, start, end, hook_fn, private,
> +                               level, true, true, false);
> +}
> +
>  /* Map a device to its corresponding domain (context-entry) */
>  static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>                                      uint8_t devfn, VTDContextEntry *ce)
> @@ -2402,6 +2542,37 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>      return vtd_dev_as;
>  }
>  
> +static int vtd_replay_hook(IOMMUTLBEntry *entry, void *private)
> +{
> +    memory_region_notify_one((IOMMUNotifier *)private, entry);
> +    return 0;
> +}
> +
> +static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
> +{
> +    VTDAddressSpace *vtd_as = container_of(mr, VTDAddressSpace, iommu);
> +    IntelIOMMUState *s = vtd_as->iommu_state;
> +    uint8_t bus_n = pci_bus_num(vtd_as->bus);
> +    VTDContextEntry ce;
> +
> +    if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
> +        /*
> +         * Scanned a valid context entry, walk over the pages and
> +         * notify when needed.
> +         */
> +        trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
> +                                  PCI_FUNC(vtd_as->devfn),
> +                                  VTD_CONTEXT_ENTRY_DID(ce.hi),
> +                                  ce.hi, ce.lo);
> +        vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n);
> +    } else {
> +        trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
> +                                    PCI_FUNC(vtd_as->devfn));
> +    }
> +
> +    return;
> +}
> +
>  /* Do the initialization. It will also be called when reset, so pay
>   * attention when adding new initialization stuff.
>   */
> @@ -2416,6 +2587,7 @@ static void vtd_init(IntelIOMMUState *s)
>  
>      s->iommu_ops.translate = vtd_iommu_translate;
>      s->iommu_ops.notify_flag_changed = vtd_iommu_notify_flag_changed;
> +    s->iommu_ops.replay = vtd_iommu_replay;
>      s->root = 0;
>      s->root_extended = false;
>      s->dmar_enabled = false;
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index 88ad5e4..463db0d 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -30,6 +30,13 @@ vtd_iotlb_cc_hit(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32
>  vtd_iotlb_cc_update(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen1, uint32_t gen2) "IOTLB context update bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32" -> gen %"PRIu32
>  vtd_iotlb_reset(const char *reason) "IOTLB reset (reason: %s)"
>  vtd_fault_disabled(void) "Fault processing disabled for context entry"
> +vtd_replay_ce_valid(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t domain, uint64_t hi, uint64_t lo) "replay valid context device %02"PRIx8":%02"PRIx8".%02"PRIx8" domain 0x%"PRIx16" hi 0x%"PRIx64" lo 0x%"PRIx64
> +vtd_replay_ce_invalid(uint8_t bus, uint8_t dev, uint8_t fn) "replay invalid context device %02"PRIx8":%02"PRIx8".%02"PRIx8
> +vtd_page_walk_level(uint64_t addr, uint32_t level, uint64_t start, uint64_t end) "walk (base=0x%"PRIx64", level=%"PRIu32") iova range 0x%"PRIx64" - 0x%"PRIx64
> +vtd_page_walk_one(uint32_t level, uint64_t iova, uint64_t gpa, uint64_t mask, int perm) "detected page level 0x%"PRIx32" iova 0x%"PRIx64" -> gpa 0x%"PRIx64" mask 0x%"PRIx64" perm %d"
> +vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
> +vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
> +vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
>  
>  # hw/i386/amd_iommu.c
>  amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 30b2a74..267f399 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -55,6 +55,8 @@ typedef enum {
>      IOMMU_RW   = 3,
>  } IOMMUAccessFlags;
>  
> +#define IOMMU_ACCESS_FLAG(r, w) (((r) ? IOMMU_RO : 0) | ((w) ? IOMMU_WO : 0))
> +
>  struct IOMMUTLBEntry {
>      AddressSpace    *target_as;
>      hwaddr           iova;

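A quick back-of-the-envelope check on why the walk no longer looks like a
dead loop, assuming the usual VT-d second-level constants
(VTD_PAGE_SHIFT_4K == 12, VTD_SL_LEVEL_BITS == 9) so that
vtd_slpt_level_shift(level) == 12 + (level - 1) * 9:

    level 1: subpage_size = 1ULL << 12  =   4KiB  (leaf pages)
    level 2: subpage_size = 1ULL << 21  =   2MiB
    level 3: subpage_size = 1ULL << 30  =   1GiB
    level 4: subpage_size = 1ULL << 39  = 512GiB

So a single non-present top-level entry lets vtd_page_walk_level() skip
half a terabyte of IOVA space in one loop iteration, instead of visiting
it 4KiB at a time the way the generic replay does.
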
-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v7 16/17] intel_iommu: allow dynamic switch of IOMMU region
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 16/17] intel_iommu: allow dynamic switch of IOMMU region Peter Xu
@ 2017-02-10  2:38   ` David Gibson
  0 siblings, 0 replies; 63+ messages in thread
From: David Gibson @ 2017-02-10  2:38 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv


On Tue, Feb 07, 2017 at 04:28:18PM +0800, Peter Xu wrote:
> This is preparation work to finally enable dynamic ON/OFF switching of
> VT-d protection. The old VT-d code uses a static IOMMU address space,
> and that won't satisfy the vfio-pci device listeners.
> 
> Let me explain.
> 
> vfio-pci devices depend on the memory region listener and IOMMU replay
> mechanism to make sure the device mapping is coherent with the guest
> even if there are domain switches. And there are two kinds of domain
> switches:
> 
>   (1) switch from domain A -> B
>   (2) switch from domain A -> no domain (e.g., turn DMAR off)
> 
> Case (1) is handled by the VT-d replay logic triggered on context entry
> invalidation. What the replay function should do here is replay the
> existing page mappings in domain B.
> 
> However, for case (2) we don't want to replay any domain mappings - we
> just need the default GPA->HPA mappings (the address_space_memory
> mapping). This patch builds up that mapping for case (2) automatically
> by leveraging the vfio-pci memory listeners.
> 
> Another important thing that this patch does is separate IR (Interrupt
> Remapping) from DMAR (DMA Remapping). The IR region should not depend on
> the DMAR region (as it did before this patch). It should be a standalone
> region, and it should be possible to activate it without DMAR (which is
> common behavior for the Linux kernel - by default it enables IR while
> leaving DMAR disabled).
> 
> Reviewed-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

As with the previous patch the description sounds sensible, but I
don't know VT-d well enough to review the details.  With that caveat

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  hw/i386/intel_iommu.c         | 78 ++++++++++++++++++++++++++++++++++++++++---
>  hw/i386/trace-events          |  2 +-
>  include/hw/i386/intel_iommu.h |  2 ++
>  3 files changed, 77 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index f8d5713..4fe161f 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1291,9 +1291,49 @@ static void vtd_handle_gcmd_sirtp(IntelIOMMUState *s)
>      vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_IRTPS);
>  }
>  
> +static void vtd_switch_address_space(VTDAddressSpace *as)
> +{
> +    assert(as);
> +
> +    trace_vtd_switch_address_space(pci_bus_num(as->bus),
> +                                   VTD_PCI_SLOT(as->devfn),
> +                                   VTD_PCI_FUNC(as->devfn),
> +                                   as->iommu_state->dmar_enabled);
> +
> +    /* Turn off first then on the other */
> +    if (as->iommu_state->dmar_enabled) {
> +        memory_region_set_enabled(&as->sys_alias, false);
> +        memory_region_set_enabled(&as->iommu, true);
> +    } else {
> +        memory_region_set_enabled(&as->iommu, false);
> +        memory_region_set_enabled(&as->sys_alias, true);
> +    }
> +}
> +
> +static void vtd_switch_address_space_all(IntelIOMMUState *s)
> +{
> +    GHashTableIter iter;
> +    VTDBus *vtd_bus;
> +    int i;
> +
> +    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
> +    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
> +        for (i = 0; i < X86_IOMMU_PCI_DEVFN_MAX; i++) {
> +            if (!vtd_bus->dev_as[i]) {
> +                continue;
> +            }
> +            vtd_switch_address_space(vtd_bus->dev_as[i]);
> +        }
> +    }
> +}
> +
>  /* Handle Translation Enable/Disable */
>  static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
>  {
> +    if (s->dmar_enabled == en) {
> +        return;
> +    }
> +
>      VTD_DPRINTF(CSR, "Translation Enable %s", (en ? "on" : "off"));
>  
>      if (en) {
> @@ -1308,6 +1348,8 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
>          /* Ok - report back to driver */
>          vtd_set_clear_mask_long(s, DMAR_GSTS_REG, VTD_GSTS_TES, 0);
>      }
> +
> +    vtd_switch_address_space_all(s);
>  }
>  
>  /* Handle Interrupt Remap Enable/Disable */
> @@ -2529,15 +2571,43 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>          vtd_dev_as->devfn = (uint8_t)devfn;
>          vtd_dev_as->iommu_state = s;
>          vtd_dev_as->context_cache_entry.context_cache_gen = 0;
> +
> +        /*
> +         * Memory region relationships looks like (Address range shows
> +         * only lower 32 bits to make it short in length...):
> +         *
> +         * |-----------------+-------------------+----------|
> +         * | Name            | Address range     | Priority |
> +         * |-----------------+-------------------+----------+
> +         * | vtd_root        | 00000000-ffffffff |        0 |
> +         * |  intel_iommu    | 00000000-ffffffff |        1 |
> +         * |  vtd_sys_alias  | 00000000-ffffffff |        1 |
> +         * |  intel_iommu_ir | fee00000-feefffff |       64 |
> +         * |-----------------+-------------------+----------|
> +         *
> +         * We enable/disable DMAR by switching enablement for
> +         * vtd_sys_alias and intel_iommu regions. IR region is always
> +         * enabled.
> +         */
>          memory_region_init_iommu(&vtd_dev_as->iommu, OBJECT(s),
>                                   &s->iommu_ops, "intel_iommu", UINT64_MAX);
> +        memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
> +                                 "vtd_sys_alias", get_system_memory(),
> +                                 0, memory_region_size(get_system_memory()));
>          memory_region_init_io(&vtd_dev_as->iommu_ir, OBJECT(s),
>                                &vtd_mem_ir_ops, s, "intel_iommu_ir",
>                                VTD_INTERRUPT_ADDR_SIZE);
> -        memory_region_add_subregion(&vtd_dev_as->iommu, VTD_INTERRUPT_ADDR_FIRST,
> -                                    &vtd_dev_as->iommu_ir);
> -        address_space_init(&vtd_dev_as->as,
> -                           &vtd_dev_as->iommu, name);
> +        memory_region_init(&vtd_dev_as->root, OBJECT(s),
> +                           "vtd_root", UINT64_MAX);
> +        memory_region_add_subregion_overlap(&vtd_dev_as->root,
> +                                            VTD_INTERRUPT_ADDR_FIRST,
> +                                            &vtd_dev_as->iommu_ir, 64);
> +        address_space_init(&vtd_dev_as->as, &vtd_dev_as->root, name);
> +        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
> +                                            &vtd_dev_as->sys_alias, 1);
> +        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
> +                                            &vtd_dev_as->iommu, 1);
> +        vtd_switch_address_space(vtd_dev_as);
>      }
>      return vtd_dev_as;
>  }
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index 463db0d..ebb650b 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -4,7 +4,6 @@
>  x86_iommu_iec_notify(bool global, uint32_t index, uint32_t mask) "Notify IEC invalidation: global=%d index=%" PRIu32 " mask=%" PRIu32
>  
>  # hw/i386/intel_iommu.c
> -vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
>  vtd_inv_desc(const char *type, uint64_t hi, uint64_t lo) "invalidate desc type %s high 0x%"PRIx64" low 0x%"PRIx64
>  vtd_inv_desc_invalid(uint64_t hi, uint64_t lo) "invalid inv desc hi 0x%"PRIx64" lo 0x%"PRIx64
>  vtd_inv_desc_cc_domain(uint16_t domain) "context invalidate domain 0x%"PRIx16
> @@ -37,6 +36,7 @@ vtd_page_walk_one(uint32_t level, uint64_t iova, uint64_t gpa, uint64_t mask, in
>  vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
>  vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
>  vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
> +vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
>  
>  # hw/i386/amd_iommu.c
>  amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index fe645aa..8f212a1 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -83,6 +83,8 @@ struct VTDAddressSpace {
>      uint8_t devfn;
>      AddressSpace as;
>      MemoryRegion iommu;
> +    MemoryRegion root;
> +    MemoryRegion sys_alias;
>      MemoryRegion iommu_ir;      /* Interrupt region: 0xfeeXXXXX */
>      IntelIOMMUState *iommu_state;
>      VTDContextCacheEntry context_cache_entry;

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 02/17] vfio: introduce vfio_get_vaddr()
  2017-02-10  1:12   ` David Gibson
@ 2017-02-10  5:50     ` Peter Xu
  0 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2017-02-10  5:50 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

On Fri, Feb 10, 2017 at 12:12:22PM +1100, David Gibson wrote:
> On Tue, Feb 07, 2017 at 04:28:04PM +0800, Peter Xu wrote:
> > A cleanup for vfio_iommu_map_notify(). Now we will fetch vaddr even if
> > the operation is unmap, but it won't hurt much.
> > 
> > One thing to mention is that we need the RCU read lock to protect the
> > whole translation and map/unmap procedure.
> > 
> > Acked-by: Alex Williamson <alex.williamson@redhat.com>
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> So, I know I reviewed this already, but looking again I'm confused.
> 
> I'm not sure how the original code ever worked: if this is an unmap
> (perm == IOMMU_NONE), then I wouldn't even expect
> iotlb->translated_addr to have a valid value, but we're passing it to
> address_space_translate() and failing if it doesn't give us
> sensible results.

Hmm, right.

Looks like it is just because all the callers of
memory_region_notify_iommu() happen to initialize
iotlb->translated_addr anyway (one is put_tce_emu(), the other is
rpcit_service_call()). If so, patch 3 (maybe along with this one)
becomes all the more essential, to make sure we don't rely on that
assumption.
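
To make the assumption concrete, here is a minimal sketch of an unmap
notification (this is not the actual put_tce_emu()/rpcit_service_call()
code; iova, addr_mask and iommu_mr are placeholder names):

    /*
     * Sketch only: with perm == IOMMU_NONE nothing forces
     * translated_addr to be meaningful, so consumers such as
     * vfio_iommu_map_notify() should not translate it before
     * checking whether this is a map or an unmap.
     */
    IOMMUTLBEntry entry = {
        .target_as       = &address_space_memory,
        .iova            = iova & ~addr_mask,  /* aligned down */
        .translated_addr = 0,                  /* don't-care for an unmap */
        .addr_mask       = addr_mask,          /* e.g. 0xfff for a 4K page */
        .perm            = IOMMU_NONE,
    };
    memory_region_notify_iommu(iommu_mr, entry);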

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 17/17] intel_iommu: enable vfio devices
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 17/17] intel_iommu: enable vfio devices Peter Xu
@ 2017-02-10  6:24   ` Jason Wang
  2017-03-16  4:05   ` Peter Xu
  1 sibling, 0 replies; 63+ messages in thread
From: Jason Wang @ 2017-02-10  6:24 UTC (permalink / raw)
  To: Peter Xu, qemu-devel, mst
  Cc: tianyu.lan, kevin.tian, jan.kiszka, David Gibson,
	alex.williamson, bd.aviv



On 2017-02-07 16:28, Peter Xu wrote:
> This patch is based on Aviv Ben-David (<bd.aviv@gmail.com>)'s patch
> upstream:
>
>    "IOMMU: enable intel_iommu map and unmap notifiers"
>    https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01453.html
>
> However I removed/fixed some content, and added my own code.
>
> Instead of calling translate() on every page for IOTLB invalidations
> (which is slower), we walk the pages when needed and notify in a hook
> function.
>
> This patch enables vfio devices for VT-d emulation.
>
> And, since we already have vhost DMAR support via device-iotlb, a
> natural benefit that this patch brings is that vt-d enabled vhost can
> live even without ATS capability now. Though more tests are needed.

Michael, if you want to apply this series, I would propose a better
title for this patch, e.g. "cowork with remote IOMMU/IOTLB".

Thanks

>
> Reviewed-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   hw/i386/intel_iommu.c          | 191 ++++++++++++++++++++++++++++++++++++++---
>   hw/i386/intel_iommu_internal.h |   1 +
>   hw/i386/trace-events           |   1 +
>   include/hw/i386/intel_iommu.h  |   8 ++
>   4 files changed, 188 insertions(+), 13 deletions(-)

[...]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances
  2017-02-07  8:28 [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (16 preceding siblings ...)
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 17/17] intel_iommu: enable vfio devices Peter Xu
@ 2017-02-17 17:18 ` Alex Williamson
  2017-02-20  7:47   ` Peter Xu
  2017-02-28  7:52 ` Peter Xu
  18 siblings, 1 reply; 63+ messages in thread
From: Alex Williamson @ 2017-02-17 17:18 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	David Gibson, bd.aviv

On Tue,  7 Feb 2017 16:28:02 +0800
Peter Xu <peterx@redhat.com> wrote:

> This is v7 of vt-d vfio enablement series.
[snip]
> =========
> Test Done
> =========
> 
> Build test passed for x86_64/arm/ppc64.
> 
> Simply tested with x86_64, assigning two PCI devices to a single VM,
> boot the VM using:
> 
> bin=x86_64-softmmu/qemu-system-x86_64
> $bin -M q35,accel=kvm,kernel-irqchip=split -m 1G \
>      -device intel-iommu,intremap=on,eim=off,caching-mode=on \
>      -netdev user,id=net0,hostfwd=tcp::5555-:22 \
>      -device virtio-net-pci,netdev=net0 \
>      -device vfio-pci,host=03:00.0 \
>      -device vfio-pci,host=02:00.0 \
>      -trace events=".trace.vfio" \
>      /var/lib/libvirt/images/vm1.qcow2
> 
> pxdev:bin [vtd-vfio-enablement]# cat .trace.vfio
> vtd_page_walk*
> vtd_replay*
> vtd_inv_desc*
> 
> Then, in the guest, run the following tool:
> 
>   https://github.com/xzpeter/clibs/blob/master/gpl/userspace/vfio-bind-group/vfio-bind-group.c
> 
> With parameter:
> 
>   ./vfio-bind-group 00:03.0 00:04.0
> 
> Check host side trace log, I can see pages are replayed and mapped in
> 00:04.0 device address space, like:
> 
> ...
> vtd_replay_ce_valid replay valid context device 00:04.00 hi 0x401 lo 0x38fe1001
> vtd_page_walk Page walk for ce (0x401, 0x38fe1001) iova range 0x0 - 0x8000000000
> vtd_page_walk_level Page walk (base=0x38fe1000, level=3) iova range 0x0 - 0x8000000000
> vtd_page_walk_level Page walk (base=0x35d31000, level=2) iova range 0x0 - 0x40000000
> vtd_page_walk_level Page walk (base=0x34979000, level=1) iova range 0x0 - 0x200000
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x0 -> gpa 0x22dc3000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x1000 -> gpa 0x22e25000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x2000 -> gpa 0x22e12000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x3000 -> gpa 0x22e2d000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x4000 -> gpa 0x12a49000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x5000 -> gpa 0x129bb000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x6000 -> gpa 0x128db000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x7000 -> gpa 0x12a80000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x8000 -> gpa 0x12a7e000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x9000 -> gpa 0x12b22000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0xa000 -> gpa 0x12b41000 mask 0xfff perm 3
> ...

Hi Peter,

I'm trying to make use of this, with your vtd-vfio-enablement-v7 branch
(HEAD 0c1c4e738095).  I'm assigning an 82576 PF to a VM.  It works with
iommu=pt, but if I remove that option, the device does not work and
vfio_iommu_map_notify is never called.  Any suggestions?  My
commandline is below.  Thanks,

Alex

/usr/local/bin/qemu-system-x86_64 \
        -name guest=l1,debug-threads=on -S \
        -machine pc-q35-2.9,accel=kvm,usb=off,dump-guest-core=off,kernel-irqchip=split \
        -cpu host -m 10240 -realtime mlock=off -smp 4,sockets=1,cores=2,threads=2 \
        -no-user-config -nodefaults -monitor stdio -rtc base=utc,driftfix=slew \
        -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown \
        -global ICH9-LPC.disable_s3=1 -global ICH9-LPC.disable_s4=1 \
        -boot strict=on \
        -device ioh3420,port=0x10,chassis=1,id=pci.1,bus=pcie.0,addr=0x2 \
        -device i82801b11-bridge,id=pci.2,bus=pcie.0,addr=0x1e \
        -device pci-bridge,chassis_nr=3,id=pci.3,bus=pci.2,addr=0x0 \
        -device ioh3420,port=0x18,chassis=4,id=pci.4,bus=pcie.0,addr=0x3 \
        -device ioh3420,port=0x20,chassis=5,id=pci.5,bus=pcie.0,addr=0x4 \
        -device ioh3420,port=0x28,chassis=6,id=pci.6,bus=pcie.0,addr=0x5 \
        -device ioh3420,port=0x30,chassis=7,id=pci.7,bus=pcie.0,addr=0x6 \
        -device ioh3420,port=0x38,chassis=8,id=pci.8,bus=pcie.0,addr=0x7 \
        -device ich9-usb-ehci1,id=usb,bus=pcie.0,addr=0x1d.0x7 \
        -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pcie.0,multifunction=on,addr=0x1d \
        -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pcie.0,addr=0x1d.0x1 \
        -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pcie.0,addr=0x1d.0x2 \
        -device virtio-serial-pci,id=virtio-serial0,bus=pci.4,addr=0x0 \
        -drive file=/dev/vg_s20/lv_l1,format=raw,if=none,id=drive-virtio-disk0,cache=none,aio=native \
        -device virtio-blk-pci,scsi=off,bus=pci.5,addr=0x0,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
        -netdev user,id=hostnet0 \
        -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:c2:62:30,bus=pci.1,addr=0x0 \
        -device usb-tablet,id=input0,bus=usb.0,port=1 \
        -vnc :0 -vga std \
        -device vfio-pci,host=01:00.0,id=hostdev0,bus=pci.8,addr=0x0 \
        -device intel-iommu,intremap=on,eim=off,caching-mode=on -trace events=/trace-events.txt -msg timestamp=on

# cat /trace-events.txt 
vfio_listener*
vfio_iommu*
vtd*

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances
  2017-02-17 17:18 ` [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances Alex Williamson
@ 2017-02-20  7:47   ` Peter Xu
  2017-02-20  8:17     ` Liu, Yi L
  2017-02-20 19:15     ` Alex Williamson
  0 siblings, 2 replies; 63+ messages in thread
From: Peter Xu @ 2017-02-20  7:47 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	David Gibson, bd.aviv

On Fri, Feb 17, 2017 at 10:18:35AM -0700, Alex Williamson wrote:
> On Tue,  7 Feb 2017 16:28:02 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > This is v7 of vt-d vfio enablement series.
> [snip]
> > =========
> > Test Done
> > =========
> > 
> > Build test passed for x86_64/arm/ppc64.
> > 
> > Simply tested with x86_64, assigning two PCI devices to a single VM,
> > boot the VM using:
> > 
> > bin=x86_64-softmmu/qemu-system-x86_64
> > $bin -M q35,accel=kvm,kernel-irqchip=split -m 1G \
> >      -device intel-iommu,intremap=on,eim=off,caching-mode=on \
> >      -netdev user,id=net0,hostfwd=tcp::5555-:22 \
> >      -device virtio-net-pci,netdev=net0 \
> >      -device vfio-pci,host=03:00.0 \
> >      -device vfio-pci,host=02:00.0 \
> >      -trace events=".trace.vfio" \
> >      /var/lib/libvirt/images/vm1.qcow2
> > 
> > pxdev:bin [vtd-vfio-enablement]# cat .trace.vfio
> > vtd_page_walk*
> > vtd_replay*
> > vtd_inv_desc*
> > 
> > Then, in the guest, run the following tool:
> > 
> >   https://github.com/xzpeter/clibs/blob/master/gpl/userspace/vfio-bind-group/vfio-bind-group.c
> > 
> > With parameter:
> > 
> >   ./vfio-bind-group 00:03.0 00:04.0
> > 
> > Check host side trace log, I can see pages are replayed and mapped in
> > 00:04.0 device address space, like:
> > 
> > ...
> > vtd_replay_ce_valid replay valid context device 00:04.00 hi 0x401 lo 0x38fe1001
> > vtd_page_walk Page walk for ce (0x401, 0x38fe1001) iova range 0x0 - 0x8000000000
> > vtd_page_walk_level Page walk (base=0x38fe1000, level=3) iova range 0x0 - 0x8000000000
> > vtd_page_walk_level Page walk (base=0x35d31000, level=2) iova range 0x0 - 0x40000000
> > vtd_page_walk_level Page walk (base=0x34979000, level=1) iova range 0x0 - 0x200000
> > vtd_page_walk_one Page walk detected map level 0x1 iova 0x0 -> gpa 0x22dc3000 mask 0xfff perm 3
> > vtd_page_walk_one Page walk detected map level 0x1 iova 0x1000 -> gpa 0x22e25000 mask 0xfff perm 3
> > vtd_page_walk_one Page walk detected map level 0x1 iova 0x2000 -> gpa 0x22e12000 mask 0xfff perm 3
> > vtd_page_walk_one Page walk detected map level 0x1 iova 0x3000 -> gpa 0x22e2d000 mask 0xfff perm 3
> > vtd_page_walk_one Page walk detected map level 0x1 iova 0x4000 -> gpa 0x12a49000 mask 0xfff perm 3
> > vtd_page_walk_one Page walk detected map level 0x1 iova 0x5000 -> gpa 0x129bb000 mask 0xfff perm 3
> > vtd_page_walk_one Page walk detected map level 0x1 iova 0x6000 -> gpa 0x128db000 mask 0xfff perm 3
> > vtd_page_walk_one Page walk detected map level 0x1 iova 0x7000 -> gpa 0x12a80000 mask 0xfff perm 3
> > vtd_page_walk_one Page walk detected map level 0x1 iova 0x8000 -> gpa 0x12a7e000 mask 0xfff perm 3
> > vtd_page_walk_one Page walk detected map level 0x1 iova 0x9000 -> gpa 0x12b22000 mask 0xfff perm 3
> > vtd_page_walk_one Page walk detected map level 0x1 iova 0xa000 -> gpa 0x12b41000 mask 0xfff perm 3
> > ...
> 
> Hi Peter,
> 
> I'm trying to make use of this, with your vtd-vfio-enablement-v7 branch
> (HEAD 0c1c4e738095).  I'm assigning an 82576 PF to a VM.  It works with
> iommu=pt, but if I remove that option, the device does not work and
> vfio_iommu_map_notify is never called.  Any suggestions?  My
> commandline is below.  Thanks,
> 
> Alex
> 
> /usr/local/bin/qemu-system-x86_64 \
>         -name guest=l1,debug-threads=on -S \
>         -machine pc-q35-2.9,accel=kvm,usb=off,dump-guest-core=off,kernel-irqchip=split \
>         -cpu host -m 10240 -realtime mlock=off -smp 4,sockets=1,cores=2,threads=2 \
>         -no-user-config -nodefaults -monitor stdio -rtc base=utc,driftfix=slew \
>         -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown \
>         -global ICH9-LPC.disable_s3=1 -global ICH9-LPC.disable_s4=1 \
>         -boot strict=on \
>         -device ioh3420,port=0x10,chassis=1,id=pci.1,bus=pcie.0,addr=0x2 \
>         -device i82801b11-bridge,id=pci.2,bus=pcie.0,addr=0x1e \
>         -device pci-bridge,chassis_nr=3,id=pci.3,bus=pci.2,addr=0x0 \
>         -device ioh3420,port=0x18,chassis=4,id=pci.4,bus=pcie.0,addr=0x3 \
>         -device ioh3420,port=0x20,chassis=5,id=pci.5,bus=pcie.0,addr=0x4 \
>         -device ioh3420,port=0x28,chassis=6,id=pci.6,bus=pcie.0,addr=0x5 \
>         -device ioh3420,port=0x30,chassis=7,id=pci.7,bus=pcie.0,addr=0x6 \
>         -device ioh3420,port=0x38,chassis=8,id=pci.8,bus=pcie.0,addr=0x7 \
>         -device ich9-usb-ehci1,id=usb,bus=pcie.0,addr=0x1d.0x7 \
>         -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pcie.0,multifunction=on,addr=0x1d \
>         -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pcie.0,addr=0x1d.0x1 \
>         -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pcie.0,addr=0x1d.0x2 \
>         -device virtio-serial-pci,id=virtio-serial0,bus=pci.4,addr=0x0 \
>         -drive file=/dev/vg_s20/lv_l1,format=raw,if=none,id=drive-virtio-disk0,cache=none,aio=native \
>         -device virtio-blk-pci,scsi=off,bus=pci.5,addr=0x0,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
>         -netdev user,id=hostnet0 \
>         -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:c2:62:30,bus=pci.1,addr=0x0 \
>         -device usb-tablet,id=input0,bus=usb.0,port=1 \
>         -vnc :0 -vga std \
>         -device vfio-pci,host=01:00.0,id=hostdev0,bus=pci.8,addr=0x0 \
>         -device intel-iommu,intremap=on,eim=off,caching-mode=on -trace events=/trace-events.txt -msg timestamp=on

Alex,

Thanks for testing this series.

I think I reproduced it using my 10g nic as well. What I got is:

[   23.724787] ixgbe 0000:01:00.0 enp1s0: Detected Tx Unit Hang
[   23.724787]   Tx Queue             <0>
[   23.724787]   TDH, TDT             <0>, <1>
[   23.724787]   next_to_use          <1>
[   23.724787]   next_to_clean        <0>
[   23.724787] tx_buffer_info[next_to_clean]
[   23.724787]   time_stamp           <fffbb8bb>
[   23.724787]   jiffies              <fffbc780>
[   23.729580] ixgbe 0000:01:00.0 enp1s0: tx hang 1 detected on queue 0, resetting adapter
[   23.730752] ixgbe 0000:01:00.0 enp1s0: initiating reset due to tx timeout
[   23.731768] ixgbe 0000:01:00.0 enp1s0: Reset adapter

Is this the problem you have encountered? (adapter continuously reset)

Interestingly, I found that the problem goes away after I move the
"-device intel-iommu,..." line before all the other devices.

In other words, this much shorter command line reproduces the bug:

$qemu   -machine q35,accel=kvm,kernel-irqchip=split \
        -cpu host -smp 4 -m 2048 \
        -nographic -nodefaults -serial stdio \
        -device vfio-pci,host=05:00.0,bus=pci.1 \
        -device intel-iommu,intremap=on,eim=off,caching-mode=on \
        /images/fedora-25.qcow2

While this one (with the order of the two devices switched) seems to
be okay, at least on my host:

$qemu   -machine q35,accel=kvm,kernel-irqchip=split \
        -cpu host -smp 4 -m 2048 \
        -nographic -nodefaults -serial stdio \
        -device intel-iommu,intremap=on,eim=off,caching-mode=on \
        -device vfio-pci,host=05:00.0,bus=pci.1 \
        /images/fedora-25.qcow2

So I am not sure how the realization order of these two devices
(intel-iommu, vfio-pci) affects the behavior. One thing I suspect is
that in vfio_realize(), we have:

  group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), errp);

and here we are possibly getting &address_space_memory instead of the
correct DMA address space, since the Intel IOMMU device has not yet
been initialized...
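
For reference, the reason &address_space_memory can come back is the
fallback in pci_device_iommu_address_space() in hw/pci/pci.c. Roughly
paraphrased (a sketch from memory, so helper and field names such as
pci_get_bus(), iommu_fn and iommu_opaque are approximate, not the
literal code):

    AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
    {
        PCIBus *iommu_bus = pci_get_bus(dev);

        /* Walk up until we find a bus with an IOMMU hook registered */
        while (iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
            iommu_bus = pci_get_bus(iommu_bus->parent_dev);
        }
        if (iommu_bus && iommu_bus->iommu_fn) {
            return iommu_bus->iommu_fn(iommu_bus, iommu_bus->iommu_opaque,
                                       dev->devfn);
        }
        /* No IOMMU hook on the bus (yet): plain guest memory */
        return &address_space_memory;
    }

If that suspicion is right, pci_setup_iommu() would not have run yet at
that point, so the vfio container would listen on address_space_memory
and never see the IOMMU region.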

Before I go deeper, any thoughts?

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances
  2017-02-20  7:47   ` Peter Xu
@ 2017-02-20  8:17     ` Liu, Yi L
  2017-02-20  8:32       ` Peter Xu
  2017-02-20 19:15     ` Alex Williamson
  1 sibling, 1 reply; 63+ messages in thread
From: Liu, Yi L @ 2017-02-20  8:17 UTC (permalink / raw)
  To: Peter Xu, Alex Williamson
  Cc: Lan, Tianyu, Tian, Kevin, mst, jan.kiszka, jasowang, qemu-devel,
	bd.aviv, David Gibson, Liu, Yi L

> -----Original Message-----
> From: Qemu-devel [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org]
> On Behalf Of Peter Xu
> Sent: Monday, February 20, 2017 3:48 PM
> To: Alex Williamson <alex.williamson@redhat.com>
> Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>;
> mst@redhat.com; jan.kiszka@siemens.com; jasowang@redhat.com; qemu-
> devel@nongnu.org; bd.aviv@gmail.com; David Gibson
> <david@gibson.dropbear.id.au>
> Subject: Re: [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc
> enhances
> 
> On Fri, Feb 17, 2017 at 10:18:35AM -0700, Alex Williamson wrote:
> > On Tue,  7 Feb 2017 16:28:02 +0800
> > Peter Xu <peterx@redhat.com> wrote:
> >
> > > This is v7 of vt-d vfio enablement series.
> > [snip]
> > > =========
> > > Test Done
> > > =========
> > >
> > > Build test passed for x86_64/arm/ppc64.
> > >
> > > Simply tested with x86_64, assigning two PCI devices to a single VM,
> > > boot the VM using:
> > >
> > > bin=x86_64-softmmu/qemu-system-x86_64
> > > $bin -M q35,accel=kvm,kernel-irqchip=split -m 1G \
> > >      -device intel-iommu,intremap=on,eim=off,caching-mode=on \
> > >      -netdev user,id=net0,hostfwd=tcp::5555-:22 \
> > >      -device virtio-net-pci,netdev=net0 \
> > >      -device vfio-pci,host=03:00.0 \
> > >      -device vfio-pci,host=02:00.0 \
> > >      -trace events=".trace.vfio" \
> > >      /var/lib/libvirt/images/vm1.qcow2
> > >
> > > pxdev:bin [vtd-vfio-enablement]# cat .trace.vfio
> > > vtd_page_walk*
> > > vtd_replay*
> > > vtd_inv_desc*
> > >
> > > Then, in the guest, run the following tool:
> > >
> > >
> > > https://github.com/xzpeter/clibs/blob/master/gpl/userspace/vfio-bind
> > > -group/vfio-bind-group.c
> > >
> > > With parameter:
> > >
> > >   ./vfio-bind-group 00:03.0 00:04.0
> > >
> > > Check host side trace log, I can see pages are replayed and mapped
> > > in
> > > 00:04.0 device address space, like:
> > >
> > > ...
> > > vtd_replay_ce_valid replay valid context device 00:04.00 hi 0x401 lo
> > > 0x38fe1001 vtd_page_walk Page walk for ce (0x401, 0x38fe1001) iova
> > > range 0x0 - 0x8000000000 vtd_page_walk_level Page walk
> > > (base=0x38fe1000, level=3) iova range 0x0 - 0x8000000000
> > > vtd_page_walk_level Page walk (base=0x35d31000, level=2) iova range
> > > 0x0 - 0x40000000 vtd_page_walk_level Page walk (base=0x34979000,
> > > level=1) iova range 0x0 - 0x200000 vtd_page_walk_one Page walk
> > > detected map level 0x1 iova 0x0 -> gpa 0x22dc3000 mask 0xfff perm 3
> > > vtd_page_walk_one Page walk detected map level 0x1 iova 0x1000 ->
> > > gpa 0x22e25000 mask 0xfff perm 3 vtd_page_walk_one Page walk
> > > detected map level 0x1 iova 0x2000 -> gpa 0x22e12000 mask 0xfff perm
> > > 3 vtd_page_walk_one Page walk detected map level 0x1 iova 0x3000 ->
> > > gpa 0x22e2d000 mask 0xfff perm 3 vtd_page_walk_one Page walk
> > > detected map level 0x1 iova 0x4000 -> gpa 0x12a49000 mask 0xfff perm
> > > 3 vtd_page_walk_one Page walk detected map level 0x1 iova 0x5000 ->
> > > gpa 0x129bb000 mask 0xfff perm 3 vtd_page_walk_one Page walk
> > > detected map level 0x1 iova 0x6000 -> gpa 0x128db000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x7000 -> gpa
> 0x12a80000 mask 0xfff perm 3 vtd_page_walk_one Page walk detected map
> level 0x1 iova 0x8000 -> gpa 0x12a7e000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x9000 -> gpa
> 0x12b22000 mask 0xfff perm 3 vtd_page_walk_one Page walk detected map
> level 0x1 iova 0xa000 -> gpa 0x12b41000 mask 0xfff perm 3 ...
> >
> > Hi Peter,
> >
> > I'm trying to make use of this, with your vtd-vfio-enablement-v7
> > branch (HEAD 0c1c4e738095).  I'm assigning an 82576 PF to a VM.  It
> > works with iommu=pt, but if I remove that option, the device does not
> > work and vfio_iommu_map_notify is never called.  Any suggestions?  My
> > commandline is below.  Thanks,
> >
> > Alex
> >
> > /usr/local/bin/qemu-system-x86_64 \
> >         -name guest=l1,debug-threads=on -S \
> >         -machine pc-q35-2.9,accel=kvm,usb=off,dump-guest-core=off,kernel-
> irqchip=split \
> >         -cpu host -m 10240 -realtime mlock=off -smp
> 4,sockets=1,cores=2,threads=2 \
> >         -no-user-config -nodefaults -monitor stdio -rtc base=utc,driftfix=slew \
> >         -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown \
> >         -global ICH9-LPC.disable_s3=1 -global ICH9-LPC.disable_s4=1 \
> >         -boot strict=on \
> >         -device ioh3420,port=0x10,chassis=1,id=pci.1,bus=pcie.0,addr=0x2 \
> >         -device i82801b11-bridge,id=pci.2,bus=pcie.0,addr=0x1e \
> >         -device pci-bridge,chassis_nr=3,id=pci.3,bus=pci.2,addr=0x0 \
> >         -device ioh3420,port=0x18,chassis=4,id=pci.4,bus=pcie.0,addr=0x3 \
> >         -device ioh3420,port=0x20,chassis=5,id=pci.5,bus=pcie.0,addr=0x4 \
> >         -device ioh3420,port=0x28,chassis=6,id=pci.6,bus=pcie.0,addr=0x5 \
> >         -device ioh3420,port=0x30,chassis=7,id=pci.7,bus=pcie.0,addr=0x6 \
> >         -device ioh3420,port=0x38,chassis=8,id=pci.8,bus=pcie.0,addr=0x7 \
> >         -device ich9-usb-ehci1,id=usb,bus=pcie.0,addr=0x1d.0x7 \
> >         -device ich9-usb-
> uhci1,masterbus=usb.0,firstport=0,bus=pcie.0,multifunction=on,addr=0x1d \
> >         -device ich9-usb-
> uhci2,masterbus=usb.0,firstport=2,bus=pcie.0,addr=0x1d.0x1 \
> >         -device ich9-usb-
> uhci3,masterbus=usb.0,firstport=4,bus=pcie.0,addr=0x1d.0x2 \
> >         -device virtio-serial-pci,id=virtio-serial0,bus=pci.4,addr=0x0 \
> >         -drive file=/dev/vg_s20/lv_l1,format=raw,if=none,id=drive-virtio-
> disk0,cache=none,aio=native \
> >         -device virtio-blk-pci,scsi=off,bus=pci.5,addr=0x0,drive=drive-virtio-
> disk0,id=virtio-disk0,bootindex=1 \
> >         -netdev user,id=hostnet0 \
> >         -device virtio-net-
> pci,netdev=hostnet0,id=net0,mac=52:54:00:c2:62:30,bus=pci.1,addr=0x0 \
> >         -device usb-tablet,id=input0,bus=usb.0,port=1 \
> >         -vnc :0 -vga std \
> >         -device vfio-pci,host=01:00.0,id=hostdev0,bus=pci.8,addr=0x0 \
> >         -device intel-iommu,intremap=on,eim=off,caching-mode=on -trace
> > events=/trace-events.txt -msg timestamp=on
> 
> Alex,
> 
> Thanks for testing this series.
> 
> I think I reproduced it using my 10g nic as well. What I got is:
> 
> [   23.724787] ixgbe 0000:01:00.0 enp1s0: Detected Tx Unit Hang
> [   23.724787]   Tx Queue             <0>
> [   23.724787]   TDH, TDT             <0>, <1>
> [   23.724787]   next_to_use          <1>
> [   23.724787]   next_to_clean        <0>
> [   23.724787] tx_buffer_info[next_to_clean]
> [   23.724787]   time_stamp           <fffbb8bb>
> [   23.724787]   jiffies              <fffbc780>
> [   23.729580] ixgbe 0000:01:00.0 enp1s0: tx hang 1 detected on queue 0,
> resetting adapter
> [   23.730752] ixgbe 0000:01:00.0 enp1s0: initiating reset due to tx timeout
> [   23.731768] ixgbe 0000:01:00.0 enp1s0: Reset adapter
> 
> Is this the problem you have encountered? (adapter continuously reset)
> 
> Interestingly, I found that the problem solves itself after I move the "-device
> intel-iommu,..." line before all the other devices.

I also encountered this interesting thing. Yes, it is. You must place
"-device intel-iommu" before the vfio-pci devices. If I remember correctly,
when "-device intel-iommu" is not in front of the others, vtd_realize() is
called after vfio_initfn(), so the following code snippet is never reached.
Then there is no channel between the vfio device and intel-iommu, and
anything can go wrong once that channel is missing. So better to place
"intel-iommu" in the first place ^_^

hw/vfio/common.c: vfio_listener_region_add()
    if (memory_region_is_iommu(section->mr)) {
        VFIOGuestIOMMU *giommu;

        trace_vfio_listener_region_add_iommu(iova, end);
        /*
         * FIXME: For VFIO iommu types which have KVM acceleration to
         * avoid bouncing all map/unmaps through qemu this way, this
         * would be the right place to wire that up (tell the KVM
         * device emulation the VFIO iommu handles to use).
         */
        giommu = g_malloc0(sizeof(*giommu));
        giommu->iommu = section->mr;
        giommu->iommu_offset = section->offset_within_address_space -
                               section->offset_within_region;
        giommu->container = container;
        giommu->n.notify = vfio_iommu_map_notify;
        giommu->n.notifier_flags = IOMMU_NOTIFIER_ALL;
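
(For completeness, the snippet above continues by actually wiring the
notifier up; roughly paraphrased rather than quoted verbatim:)

        QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);

        memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
        memory_region_iommu_replay(giommu->iommu, &giommu->n, false);

This registration is the "channel": without it, intel-iommu map/unmap
notifications never reach vfio_iommu_map_notify().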

Regards,
Yi L

> Or say, this will be the much shorter reproducer meet the bug:
> 
> $qemu   -machine q35,accel=kvm,kernel-irqchip=split \
>         -cpu host -smp 4 -m 2048 \
>         -nographic -nodefaults -serial stdio \
>         -device vfio-pci,host=05:00.0,bus=pci.1 \
>         -device intel-iommu,intremap=on,eim=off,caching-mode=on \
>         /images/fedora-25.qcow2
> 
> While this may possibly be okay at least on my host (switching the order of the
> two devices):
> 
> $qemu   -machine q35,accel=kvm,kernel-irqchip=split \
>         -cpu host -smp 4 -m 2048 \
>         -nographic -nodefaults -serial stdio \
>         -device intel-iommu,intremap=on,eim=off,caching-mode=on \
>         -device vfio-pci,host=05:00.0,bus=pci.1 \
>         /images/fedora-25.qcow2
> 
> So not sure how the ordering of realization of these two devices (intel-iommu,
> vfio-pci) affected the behavior. One thing I suspect is that in vfio_realize(), we
> have:
> 
>   group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev),
> errp);
> 
> while here we possibly will be getting &address_space_memory here instead of
> the correct DMA address space since Intel IOMMU device has not yet been
> inited...
> 
> Before I go deeper, any thoughts?
> 
> Thanks,
> 
> -- peterx


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances
  2017-02-20  8:17     ` Liu, Yi L
@ 2017-02-20  8:32       ` Peter Xu
  0 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2017-02-20  8:32 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Alex Williamson, Lan, Tianyu, Tian, Kevin, mst, jan.kiszka,
	jasowang, qemu-devel, bd.aviv, David Gibson

On Mon, Feb 20, 2017 at 08:17:32AM +0000, Liu, Yi L wrote:
> > -----Original Message-----
> > From: Qemu-devel [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org]
> > On Behalf Of Peter Xu
> > Sent: Monday, February 20, 2017 3:48 PM
> > To: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>;
> > mst@redhat.com; jan.kiszka@siemens.com; jasowang@redhat.com; qemu-
> > devel@nongnu.org; bd.aviv@gmail.com; David Gibson
> > <david@gibson.dropbear.id.au>
> > Subject: Re: [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc
> > enhances
> > 
> > On Fri, Feb 17, 2017 at 10:18:35AM -0700, Alex Williamson wrote:
> > > On Tue,  7 Feb 2017 16:28:02 +0800
> > > Peter Xu <peterx@redhat.com> wrote:
> > >
> > > > This is v7 of vt-d vfio enablement series.
> > > [snip]
> > > > =========
> > > > Test Done
> > > > =========
> > > >
> > > > Build test passed for x86_64/arm/ppc64.
> > > >
> > > > Simply tested with x86_64, assigning two PCI devices to a single VM,
> > > > boot the VM using:
> > > >
> > > > bin=x86_64-softmmu/qemu-system-x86_64
> > > > $bin -M q35,accel=kvm,kernel-irqchip=split -m 1G \
> > > >      -device intel-iommu,intremap=on,eim=off,caching-mode=on \
> > > >      -netdev user,id=net0,hostfwd=tcp::5555-:22 \
> > > >      -device virtio-net-pci,netdev=net0 \
> > > >      -device vfio-pci,host=03:00.0 \
> > > >      -device vfio-pci,host=02:00.0 \
> > > >      -trace events=".trace.vfio" \
> > > >      /var/lib/libvirt/images/vm1.qcow2
> > > >
> > > > pxdev:bin [vtd-vfio-enablement]# cat .trace.vfio
> > > > vtd_page_walk*
> > > > vtd_replay*
> > > > vtd_inv_desc*
> > > >
> > > > Then, in the guest, run the following tool:
> > > >
> > > >
> > > > https://github.com/xzpeter/clibs/blob/master/gpl/userspace/vfio-bind
> > > > -group/vfio-bind-group.c
> > > >
> > > > With parameter:
> > > >
> > > >   ./vfio-bind-group 00:03.0 00:04.0
> > > >
> > > > Check host side trace log, I can see pages are replayed and mapped
> > > > in
> > > > 00:04.0 device address space, like:
> > > >
> > > > ...
> > > > vtd_replay_ce_valid replay valid context device 00:04.00 hi 0x401 lo
> > > > 0x38fe1001 vtd_page_walk Page walk for ce (0x401, 0x38fe1001) iova
> > > > range 0x0 - 0x8000000000 vtd_page_walk_level Page walk
> > > > (base=0x38fe1000, level=3) iova range 0x0 - 0x8000000000
> > > > vtd_page_walk_level Page walk (base=0x35d31000, level=2) iova range
> > > > 0x0 - 0x40000000 vtd_page_walk_level Page walk (base=0x34979000,
> > > > level=1) iova range 0x0 - 0x200000 vtd_page_walk_one Page walk
> > > > detected map level 0x1 iova 0x0 -> gpa 0x22dc3000 mask 0xfff perm 3
> > > > vtd_page_walk_one Page walk detected map level 0x1 iova 0x1000 ->
> > > > gpa 0x22e25000 mask 0xfff perm 3 vtd_page_walk_one Page walk
> > > > detected map level 0x1 iova 0x2000 -> gpa 0x22e12000 mask 0xfff perm
> > > > 3 vtd_page_walk_one Page walk detected map level 0x1 iova 0x3000 ->
> > > > gpa 0x22e2d000 mask 0xfff perm 3 vtd_page_walk_one Page walk
> > > > detected map level 0x1 iova 0x4000 -> gpa 0x12a49000 mask 0xfff perm
> > > > 3 vtd_page_walk_one Page walk detected map level 0x1 iova 0x5000 ->
> > > > gpa 0x129bb000 mask 0xfff perm 3 vtd_page_walk_one Page walk
> > > > detected map level 0x1 iova 0x6000 -> gpa 0x128db000 mask 0xfff perm 3
> > vtd_page_walk_one Page walk detected map level 0x1 iova 0x7000 -> gpa
> > 0x12a80000 mask 0xfff perm 3 vtd_page_walk_one Page walk detected map
> > level 0x1 iova 0x8000 -> gpa 0x12a7e000 mask 0xfff perm 3
> > vtd_page_walk_one Page walk detected map level 0x1 iova 0x9000 -> gpa
> > 0x12b22000 mask 0xfff perm 3 vtd_page_walk_one Page walk detected map
> > level 0x1 iova 0xa000 -> gpa 0x12b41000 mask 0xfff perm 3 ...
> > >
> > > Hi Peter,
> > >
> > > I'm trying to make use of this, with your vtd-vfio-enablement-v7
> > > branch (HEAD 0c1c4e738095).  I'm assigning an 82576 PF to a VM.  It
> > > works with iommu=pt, but if I remove that option, the device does not
> > > work and vfio_iommu_map_notify is never called.  Any suggestions?  My
> > > commandline is below.  Thanks,
> > >
> > > Alex
> > >
> > > /usr/local/bin/qemu-system-x86_64 \
> > >         -name guest=l1,debug-threads=on -S \
> > >         -machine pc-q35-2.9,accel=kvm,usb=off,dump-guest-core=off,kernel-
> > irqchip=split \
> > >         -cpu host -m 10240 -realtime mlock=off -smp
> > 4,sockets=1,cores=2,threads=2 \
> > >         -no-user-config -nodefaults -monitor stdio -rtc base=utc,driftfix=slew \
> > >         -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown \
> > >         -global ICH9-LPC.disable_s3=1 -global ICH9-LPC.disable_s4=1 \
> > >         -boot strict=on \
> > >         -device ioh3420,port=0x10,chassis=1,id=pci.1,bus=pcie.0,addr=0x2 \
> > >         -device i82801b11-bridge,id=pci.2,bus=pcie.0,addr=0x1e \
> > >         -device pci-bridge,chassis_nr=3,id=pci.3,bus=pci.2,addr=0x0 \
> > >         -device ioh3420,port=0x18,chassis=4,id=pci.4,bus=pcie.0,addr=0x3 \
> > >         -device ioh3420,port=0x20,chassis=5,id=pci.5,bus=pcie.0,addr=0x4 \
> > >         -device ioh3420,port=0x28,chassis=6,id=pci.6,bus=pcie.0,addr=0x5 \
> > >         -device ioh3420,port=0x30,chassis=7,id=pci.7,bus=pcie.0,addr=0x6 \
> > >         -device ioh3420,port=0x38,chassis=8,id=pci.8,bus=pcie.0,addr=0x7 \
> > >         -device ich9-usb-ehci1,id=usb,bus=pcie.0,addr=0x1d.0x7 \
> > >         -device ich9-usb-
> > uhci1,masterbus=usb.0,firstport=0,bus=pcie.0,multifunction=on,addr=0x1d \
> > >         -device ich9-usb-
> > uhci2,masterbus=usb.0,firstport=2,bus=pcie.0,addr=0x1d.0x1 \
> > >         -device ich9-usb-
> > uhci3,masterbus=usb.0,firstport=4,bus=pcie.0,addr=0x1d.0x2 \
> > >         -device virtio-serial-pci,id=virtio-serial0,bus=pci.4,addr=0x0 \
> > >         -drive file=/dev/vg_s20/lv_l1,format=raw,if=none,id=drive-virtio-
> > disk0,cache=none,aio=native \
> > >         -device virtio-blk-pci,scsi=off,bus=pci.5,addr=0x0,drive=drive-virtio-
> > disk0,id=virtio-disk0,bootindex=1 \
> > >         -netdev user,id=hostnet0 \
> > >         -device virtio-net-
> > pci,netdev=hostnet0,id=net0,mac=52:54:00:c2:62:30,bus=pci.1,addr=0x0 \
> > >         -device usb-tablet,id=input0,bus=usb.0,port=1 \
> > >         -vnc :0 -vga std \
> > >         -device vfio-pci,host=01:00.0,id=hostdev0,bus=pci.8,addr=0x0 \
> > >         -device intel-iommu,intremap=on,eim=off,caching-mode=on -trace
> > > events=/trace-events.txt -msg timestamp=on
> > 
> > Alex,
> > 
> > Thanks for testing this series.
> > 
> > I think I reproduced it using my 10g nic as well. What I got is:
> > 
> > [   23.724787] ixgbe 0000:01:00.0 enp1s0: Detected Tx Unit Hang
> > [   23.724787]   Tx Queue             <0>
> > [   23.724787]   TDH, TDT             <0>, <1>
> > [   23.724787]   next_to_use          <1>
> > [   23.724787]   next_to_clean        <0>
> > [   23.724787] tx_buffer_info[next_to_clean]
> > [   23.724787]   time_stamp           <fffbb8bb>
> > [   23.724787]   jiffies              <fffbc780>
> > [   23.729580] ixgbe 0000:01:00.0 enp1s0: tx hang 1 detected on queue 0,
> > resetting adapter
> > [   23.730752] ixgbe 0000:01:00.0 enp1s0: initiating reset due to tx timeout
> > [   23.731768] ixgbe 0000:01:00.0 enp1s0: Reset adapter
> > 
> > Is this the problem you have encountered? (adapter continuously reset)
> > 
> > Interestingly, I found that the problem solves itself after I move the "-device
> > intel-iommu,..." line before all the other devices.
> 
> I also encountered this interesting thing. yes, it is. you must place
> "-device intel-iommu" before the vfio-pci devices. If I remember correctly, 
> if "device intel-iommu" is not in front the others, the vtd_realize is called after
> vfio_initfn, which would result in no calling of the following code snapshot.
> Then there is no channel between vfio device and intel-iommu, so everything
> is possible if such channel is gone. So better to place "intel-iommu" first place^_^
> 
> hw/vfio/common.c: vfio_listener_region_add()
>     if (memory_region_is_iommu(section->mr)) {
>         VFIOGuestIOMMU *giommu;
> 
>         trace_vfio_listener_region_add_iommu(iova, end);
>         /*
>          * FIXME: For VFIO iommu types which have KVM acceleration to
>          * avoid bouncing all map/unmaps through qemu this way, this
>          * would be the right place to wire that up (tell the KVM
>          * device emulation the VFIO iommu handles to use).
>          */
>         giommu = g_malloc0(sizeof(*giommu));
>         giommu->iommu = section->mr;
>         giommu->iommu_offset = section->offset_within_address_space -
>                                section->offset_within_region;
>         giommu->container = container;
>         giommu->n.notify = vfio_iommu_map_notify;
>         giommu->n.notifier_flags = IOMMU_NOTIFIER_ALL;

Yeah. I think that's possibly because when we specify "-device vfio-pci"
before "-device intel-iommu", we are actually listening on
&address_space_memory, and any real update on the IOMMU address space
is lost.

Imho forcing the user to add "-device intel-iommu" first might be a
little bit "tough" indeed. Not sure whether we should just provide (or
do we already have?) a way to decide the init order of the device list.

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances
  2017-02-20  7:47   ` Peter Xu
  2017-02-20  8:17     ` Liu, Yi L
@ 2017-02-20 19:15     ` Alex Williamson
  1 sibling, 0 replies; 63+ messages in thread
From: Alex Williamson @ 2017-02-20 19:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	David Gibson, bd.aviv

On Mon, 20 Feb 2017 15:47:31 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Fri, Feb 17, 2017 at 10:18:35AM -0700, Alex Williamson wrote:
> > On Tue,  7 Feb 2017 16:28:02 +0800
> > Peter Xu <peterx@redhat.com> wrote:
> >   
> > > This is v7 of vt-d vfio enablement series.  
> > [snip]  
> > > =========
> > > Test Done
> > > =========
> > > 
> > > Build test passed for x86_64/arm/ppc64.
> > > 
> > > Simply tested with x86_64, assigning two PCI devices to a single VM,
> > > boot the VM using:
> > > 
> > > bin=x86_64-softmmu/qemu-system-x86_64
> > > $bin -M q35,accel=kvm,kernel-irqchip=split -m 1G \
> > >      -device intel-iommu,intremap=on,eim=off,caching-mode=on \
> > >      -netdev user,id=net0,hostfwd=tcp::5555-:22 \
> > >      -device virtio-net-pci,netdev=net0 \
> > >      -device vfio-pci,host=03:00.0 \
> > >      -device vfio-pci,host=02:00.0 \
> > >      -trace events=".trace.vfio" \
> > >      /var/lib/libvirt/images/vm1.qcow2
> > > 
> > > pxdev:bin [vtd-vfio-enablement]# cat .trace.vfio
> > > vtd_page_walk*
> > > vtd_replay*
> > > vtd_inv_desc*
> > > 
> > > Then, in the guest, run the following tool:
> > > 
> > >   https://github.com/xzpeter/clibs/blob/master/gpl/userspace/vfio-bind-group/vfio-bind-group.c
> > > 
> > > With parameter:
> > > 
> > >   ./vfio-bind-group 00:03.0 00:04.0
> > > 
> > > Check host side trace log, I can see pages are replayed and mapped in
> > > 00:04.0 device address space, like:
> > > 
> > > ...
> > > vtd_replay_ce_valid replay valid context device 00:04.00 hi 0x401 lo 0x38fe1001
> > > vtd_page_walk Page walk for ce (0x401, 0x38fe1001) iova range 0x0 - 0x8000000000
> > > vtd_page_walk_level Page walk (base=0x38fe1000, level=3) iova range 0x0 - 0x8000000000
> > > vtd_page_walk_level Page walk (base=0x35d31000, level=2) iova range 0x0 - 0x40000000
> > > vtd_page_walk_level Page walk (base=0x34979000, level=1) iova range 0x0 - 0x200000
> > > vtd_page_walk_one Page walk detected map level 0x1 iova 0x0 -> gpa 0x22dc3000 mask 0xfff perm 3
> > > vtd_page_walk_one Page walk detected map level 0x1 iova 0x1000 -> gpa 0x22e25000 mask 0xfff perm 3
> > > vtd_page_walk_one Page walk detected map level 0x1 iova 0x2000 -> gpa 0x22e12000 mask 0xfff perm 3
> > > vtd_page_walk_one Page walk detected map level 0x1 iova 0x3000 -> gpa 0x22e2d000 mask 0xfff perm 3
> > > vtd_page_walk_one Page walk detected map level 0x1 iova 0x4000 -> gpa 0x12a49000 mask 0xfff perm 3
> > > vtd_page_walk_one Page walk detected map level 0x1 iova 0x5000 -> gpa 0x129bb000 mask 0xfff perm 3
> > > vtd_page_walk_one Page walk detected map level 0x1 iova 0x6000 -> gpa 0x128db000 mask 0xfff perm 3
> > > vtd_page_walk_one Page walk detected map level 0x1 iova 0x7000 -> gpa 0x12a80000 mask 0xfff perm 3
> > > vtd_page_walk_one Page walk detected map level 0x1 iova 0x8000 -> gpa 0x12a7e000 mask 0xfff perm 3
> > > vtd_page_walk_one Page walk detected map level 0x1 iova 0x9000 -> gpa 0x12b22000 mask 0xfff perm 3
> > > vtd_page_walk_one Page walk detected map level 0x1 iova 0xa000 -> gpa 0x12b41000 mask 0xfff perm 3
> > > ...  
> > 
> > Hi Peter,
> > 
> > I'm trying to make use of this, with your vtd-vfio-enablement-v7 branch
> > (HEAD 0c1c4e738095).  I'm assigning an 82576 PF to a VM.  It works with
> > iommu=pt, but if I remove that option, the device does not work and
> > vfio_iommu_map_notify is never called.  Any suggestions?  My
> > commandline is below.  Thanks,
> > 
> > Alex
> > 
> > /usr/local/bin/qemu-system-x86_64 \
> >         -name guest=l1,debug-threads=on -S \
> >         -machine pc-q35-2.9,accel=kvm,usb=off,dump-guest-core=off,kernel-irqchip=split \
> >         -cpu host -m 10240 -realtime mlock=off -smp 4,sockets=1,cores=2,threads=2 \
> >         -no-user-config -nodefaults -monitor stdio -rtc base=utc,driftfix=slew \
> >         -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown \
> >         -global ICH9-LPC.disable_s3=1 -global ICH9-LPC.disable_s4=1 \
> >         -boot strict=on \
> >         -device ioh3420,port=0x10,chassis=1,id=pci.1,bus=pcie.0,addr=0x2 \
> >         -device i82801b11-bridge,id=pci.2,bus=pcie.0,addr=0x1e \
> >         -device pci-bridge,chassis_nr=3,id=pci.3,bus=pci.2,addr=0x0 \
> >         -device ioh3420,port=0x18,chassis=4,id=pci.4,bus=pcie.0,addr=0x3 \
> >         -device ioh3420,port=0x20,chassis=5,id=pci.5,bus=pcie.0,addr=0x4 \
> >         -device ioh3420,port=0x28,chassis=6,id=pci.6,bus=pcie.0,addr=0x5 \
> >         -device ioh3420,port=0x30,chassis=7,id=pci.7,bus=pcie.0,addr=0x6 \
> >         -device ioh3420,port=0x38,chassis=8,id=pci.8,bus=pcie.0,addr=0x7 \
> >         -device ich9-usb-ehci1,id=usb,bus=pcie.0,addr=0x1d.0x7 \
> >         -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pcie.0,multifunction=on,addr=0x1d \
> >         -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pcie.0,addr=0x1d.0x1 \
> >         -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pcie.0,addr=0x1d.0x2 \
> >         -device virtio-serial-pci,id=virtio-serial0,bus=pci.4,addr=0x0 \
> >         -drive file=/dev/vg_s20/lv_l1,format=raw,if=none,id=drive-virtio-disk0,cache=none,aio=native \
> >         -device virtio-blk-pci,scsi=off,bus=pci.5,addr=0x0,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
> >         -netdev user,id=hostnet0 \
> >         -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:c2:62:30,bus=pci.1,addr=0x0 \
> >         -device usb-tablet,id=input0,bus=usb.0,port=1 \
> >         -vnc :0 -vga std \
> >         -device vfio-pci,host=01:00.0,id=hostdev0,bus=pci.8,addr=0x0 \
> >         -device intel-iommu,intremap=on,eim=off,caching-mode=on -trace events=/trace-events.txt -msg timestamp=on  
> 
> Alex,
> 
> Thanks for testing this series.
> 
> I think I reproduced it using my 10g nic as well. What I got is:
> 
> [   23.724787] ixgbe 0000:01:00.0 enp1s0: Detected Tx Unit Hang
> [   23.724787]   Tx Queue             <0>
> [   23.724787]   TDH, TDT             <0>, <1>
> [   23.724787]   next_to_use          <1>
> [   23.724787]   next_to_clean        <0>
> [   23.724787] tx_buffer_info[next_to_clean]
> [   23.724787]   time_stamp           <fffbb8bb>
> [   23.724787]   jiffies              <fffbc780>
> [   23.729580] ixgbe 0000:01:00.0 enp1s0: tx hang 1 detected on queue 0, resetting adapter
> [   23.730752] ixgbe 0000:01:00.0 enp1s0: initiating reset due to tx timeout
> [   23.731768] ixgbe 0000:01:00.0 enp1s0: Reset adapter
> 
> Is this the problem you have encountered? (adapter continuously reset)
> 
> Interestingly, I found that the problem solves itself after I move the
> "-device intel-iommu,..." line before all the other devices.
> 
> Or say, this will be the much shorter reproducer meet the bug:
> 
> $qemu   -machine q35,accel=kvm,kernel-irqchip=split \
>         -cpu host -smp 4 -m 2048 \
>         -nographic -nodefaults -serial stdio \
>         -device vfio-pci,host=05:00.0,bus=pci.1 \
>         -device intel-iommu,intremap=on,eim=off,caching-mode=on \
>         /images/fedora-25.qcow2
> 
> While this may possibly be okay at least on my host (switching the
> order of the two devices):
> 
> $qemu   -machine q35,accel=kvm,kernel-irqchip=split \
>         -cpu host -smp 4 -m 2048 \
>         -nographic -nodefaults -serial stdio \
>         -device intel-iommu,intremap=on,eim=off,caching-mode=on \
>         -device vfio-pci,host=05:00.0,bus=pci.1 \
>         /images/fedora-25.qcow2
> 
> So not sure how the ordering of realization of these two devices
> (intel-iommu, vfio-pci) affected the behavior. One thing I suspect is
> that in vfio_realize(), we have:
> 
>   group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), errp);
> 
> while here we possibly will be getting &address_space_memory here
> instead of the correct DMA address space since Intel IOMMU device has
> not yet been inited...
> 
> Before I go deeper, any thoughts?


Sounds like a plausible theory, and it seems confirmed by Yi.  It makes
it pretty much impossible to test using libvirt <qemu:arg> support,
which is how I derived my VM commandline.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances
  2017-02-07  8:28 [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (17 preceding siblings ...)
  2017-02-17 17:18 ` [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances Alex Williamson
@ 2017-02-28  7:52 ` Peter Xu
  18 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2017-02-28  7:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, David Gibson,
	alex.williamson, bd.aviv

On Tue, Feb 07, 2017 at 04:28:02PM +0800, Peter Xu wrote:
> This is v7 of vt-d vfio enablement series.
> 
> v7:
> - for the two traces patches: Change subjects. Remove vtd_err() and
>   vtd_err_nonzero_rsvd() tracers, instead using standalone trace for
>   each of the places. Don't remove any DPRINTF() if there is no
>   replacement. [Jason]
> - add r-b and a-b for Alex/David/Jason.
> - in patch "intel_iommu: renaming gpa to iova where proper", convert
>   one more place where I missed [Jason]
> - fix the place where I should use "~0ULL" not "~0" [Jason]
> - squash patch 16 into 18 [Jason]

Hi, Michael,

Do you plan to take patches 11-17 for 2.9 as well? Just a kind
reminder in case you do, since we are reaching soft freeze. Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 17/17] intel_iommu: enable vfio devices
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 17/17] intel_iommu: enable vfio devices Peter Xu
  2017-02-10  6:24   ` Jason Wang
@ 2017-03-16  4:05   ` Peter Xu
  2017-03-19 15:34     ` Aviv B.D.
  1 sibling, 1 reply; 63+ messages in thread
From: Peter Xu @ 2017-03-16  4:05 UTC (permalink / raw)
  To: qemu-devel, Michael S. Tsirkin, Aviv B.D.
  Cc: tianyu.lan, kevin.tian, jan.kiszka, jasowang, David Gibson,
	alex.williamson

On Tue, Feb 07, 2017 at 04:28:19PM +0800, Peter Xu wrote:
> This patch is based on Aviv Ben-David (<bd.aviv@gmail.com>)'s patch
> upstream:
> 
>   "IOMMU: enable intel_iommu map and unmap notifiers"
>   https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01453.html
> 
> However I removed/fixed some content, and added my own code.
> 
> Instead of calling translate() on every page for IOTLB invalidations
> (which is slower), we walk the pages when needed and notify in a hook
> function.
> 
> This patch enables vfio devices for VT-d emulation.
> 
> And, since we already have vhost DMAR support via device-iotlb, a
> natural benefit that this patch brings is that vt-d enabled vhost can
> live even without ATS capability now. Though more tests are needed.
> 

Hi, Michael,

If there is any possibility that this version will be merged at some
point in the future, would you please help add Aviv's sign-off to this
patch as well, right here (I think it should go before Jason's r-b):

Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>

I think we should definitely give Aviv more credit, since he did great
work on this before (and devoted lots of time to it).

(Aviv, please reply if you have other opinions, or I'll just make
 myself bold)

Thanks,

> Reviewed-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

[...]

-- peterx

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 17/17] intel_iommu: enable vfio devices
  2017-03-16  4:05   ` Peter Xu
@ 2017-03-19 15:34     ` Aviv B.D.
  2017-03-20  1:56       ` Peter Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Aviv B.D. @ 2017-03-19 15:34 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Michael S. Tsirkin, tianyu.lan, kevin.tian,
	Jan Kiszka, Jason Wang, David Gibson, Alex Williamson

Hi Peter,
Thanks, I think that I should receive credit for this patch.

Please attribute it to my Technion mail: bdaviv@cs.technion.ac.il.

The signed-off line should be:

Signed-off-by: Aviv Ben-David <bdaviv@cs.technion.ac.il>

Thanks,
Aviv.

On Thu, Mar 16, 2017 at 6:05 AM, Peter Xu <peterx@redhat.com> wrote:

> On Tue, Feb 07, 2017 at 04:28:19PM +0800, Peter Xu wrote:
> > This patch is based on Aviv Ben-David (<bd.aviv@gmail.com>)'s patch
> > upstream:
> >
> >   "IOMMU: enable intel_iommu map and unmap notifiers"
> >   https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01453.html
> >
> > However I removed/fixed some content, and added my own codes.
> >
> > Instead of translate() every page for iotlb invalidations (which is
> > slower), we walk the pages when needed and notify in a hook function.
> >
> > This patch enables vfio devices for VT-d emulation.
> >
> > And, since we already have vhost DMAR support via device-iotlb, a
> > natural benefit that this patch brings is that vt-d enabled vhost can
> > live even without ATS capability now. Though more tests are needed.
> >
>
> Hi, Michael,
>
> If there is any possiblility that this version be merged in the future
> at any point, would you please help add Aviv's sign-off into this
> patch as well right here (I think it should be before Jason's r-b):
>
> Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
>
> Since I think we should definitely give Aviv more credit since he's
> done a great work before (and devoted lots of time).
>
> (Aviv, please reply if you have other opinions, or I'll just make
>  myself bold)
>
> Thanks,
>
> > Reviewed-by: Jason Wang <jasowang@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
>
> [...]
>
> -- peterx
>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 17/17] intel_iommu: enable vfio devices
  2017-03-19 15:34     ` Aviv B.D.
@ 2017-03-20  1:56       ` Peter Xu
  2017-03-20  2:12         ` Liu, Yi L
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2017-03-20  1:56 UTC (permalink / raw)
  To: Aviv B.D.
  Cc: qemu-devel, Michael S. Tsirkin, tianyu.lan, kevin.tian,
	Jan Kiszka, Jason Wang, David Gibson, Alex Williamson

On Sun, Mar 19, 2017 at 05:34:31PM +0200, Aviv B.D. wrote:
> Hi Peter,
> Thanks, I think that I should receive credit for this patch.
> 
> Please attribute it under my technion mail: bdaviv@cs.technion.ac.il.
> 
> The signed-off line should be:
> 
> Signed-off-by: Aviv Ben-David <bdaviv@cs.technion.ac.il>

No problem. I'll update in next post. Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 17/17] intel_iommu: enable vfio devices
  2017-03-20  1:56       ` Peter Xu
@ 2017-03-20  2:12         ` Liu, Yi L
  2017-03-20  2:41           ` Peter Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Liu, Yi L @ 2017-03-20  2:12 UTC (permalink / raw)
  To: Peter Xu, Aviv B.D.
  Cc: Lan, Tianyu, Tian, Kevin, Michael S. Tsirkin, Jan Kiszka,
	Jason Wang, qemu-devel, Alex Williamson, David Gibson

Hi Peter,

What is the status of merging this series? I'm also trying to rebase my work and
preparing to send out an RFC patch.

Regards,
Yi L

> -----Original Message-----
> From: Qemu-devel [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org] On
> Behalf Of Peter Xu
> Sent: Monday, March 20, 2017 9:57 AM
> To: Aviv B.D. <bd.aviv@gmail.com>
> Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>;
> Michael S. Tsirkin <mst@redhat.com>; Jan Kiszka <jan.kiszka@siemens.com>; Jason
> Wang <jasowang@redhat.com>; qemu-devel <qemu-devel@nongnu.org>; Alex
> Williamson <alex.williamson@redhat.com>; David Gibson
> <david@gibson.dropbear.id.au>
> Subject: Re: [Qemu-devel] [PATCH v7 17/17] intel_iommu: enable vfio devices
> 
> On Sun, Mar 19, 2017 at 05:34:31PM +0200, Aviv B.D. wrote:
> > Hi Peter,
> > Thanks, I think that I should receive credit for this patch.
> >
> > Please attribute it under my technion mail: bdaviv@cs.technion.ac.il.
> >
> > The signed-off line should be:
> >
> > Signed-off-by: Aviv Ben-David <bdaviv@cs.technion.ac.il>
> 
> No problem. I'll update in next post. Thanks,
> 
> -- peterx


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 17/17] intel_iommu: enable vfio devices
  2017-03-20  2:12         ` Liu, Yi L
@ 2017-03-20  2:41           ` Peter Xu
  0 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2017-03-20  2:41 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Aviv B.D.,
	Lan, Tianyu, Tian, Kevin, Michael S. Tsirkin, Jan Kiszka,
	Jason Wang, qemu-devel, Alex Williamson, David Gibson

On Mon, Mar 20, 2017 at 02:12:15AM +0000, Liu, Yi L wrote:
> Hi Peter,
> 
> How about the merge of this series? I'm also trying to rebase my work and prepare to send
> out RFC patch.

We may need to wait until QEMU 2.10. I have no plan for content
changes in the next repost (just removing some merged patches, plus
some tweaks in commit messages), so another rebase shouldn't be too
hard.

If you already have an RFC for your work, IMHO it's okay to just post
it, mark "RFC for 2.10" in the subject, and mention that the series
depends on v7 of the vtd vfio series, so that review can start earlier.

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 14/17] memory: add MemoryRegionIOMMUOps.replay() callback
  2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 14/17] memory: add MemoryRegionIOMMUOps.replay() callback Peter Xu
  2017-02-10  2:34   ` David Gibson
@ 2017-03-27  8:35   ` Liu, Yi L
  2017-03-27  9:12     ` Peter Xu
  1 sibling, 1 reply; 63+ messages in thread
From: Liu, Yi L @ 2017-03-27  8:35 UTC (permalink / raw)
  To: alex.williamson, Peter Xu
  Cc: Lan, Tianyu, Tian, Kevin, mst, jan.kiszka, jasowang, bd.aviv,
	David Gibson, qemu-devel

> -----Original Message-----
> From: Qemu-devel [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org] On
> Behalf Of Peter Xu
> Sent: Tuesday, February 7, 2017 4:28 PM
> To: qemu-devel@nongnu.org
> Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>;
> mst@redhat.com; jan.kiszka@siemens.com; jasowang@redhat.com;
> peterx@redhat.com; alex.williamson@redhat.com; bd.aviv@gmail.com; David
> Gibson <david@gibson.dropbear.id.au>
> Subject: [Qemu-devel] [PATCH v7 14/17] memory: add
> MemoryRegionIOMMUOps.replay() callback
> 
> Originally we have one memory_region_iommu_replay() function, which is the
> default behavior to replay the translations of the whole IOMMU region. However,
> on some platform like x86, we may want our own replay logic for IOMMU regions.
> This patch add one more hook for IOMMUOps for the callback, and it'll override the
> default if set.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/exec/memory.h | 2 ++
>  memory.c              | 6 ++++++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h index
> 0767888..30b2a74 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
>      void (*notify_flag_changed)(MemoryRegion *iommu,
>                                  IOMMUNotifierFlag old_flags,
>                                  IOMMUNotifierFlag new_flags);
> +    /* Set this up to provide customized IOMMU replay function */
> +    void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
>  };
> 
>  typedef struct CoalescedMemoryRange CoalescedMemoryRange; diff --git
> a/memory.c b/memory.c index 7a4f2f9..9c253cc 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1630,6 +1630,12 @@ void memory_region_iommu_replay(MemoryRegion
> *mr, IOMMUNotifier *n,
>      hwaddr addr, granularity;
>      IOMMUTLBEntry iotlb;
> +    /* If the IOMMU has its own replay callback, override */
> +    if (mr->iommu_ops->replay) {
> +        mr->iommu_ops->replay(mr, n);
> +        return;
> +    }

Hi Alex, Peter,

Will all the other vendors (e.g. PPC, s390, ARM) add their own replay callbacks
as well? I guess it depends on whether the original replay algorithm works well
for them. Do you have such knowledge?

Regards,
Yi L

> +
>      granularity = memory_region_iommu_get_min_page_size(mr);
> 
>      for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
> --
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 14/17] memory: add MemoryRegionIOMMUOps.replay() callback
  2017-03-27  8:35   ` Liu, Yi L
@ 2017-03-27  9:12     ` Peter Xu
  2017-03-27  9:21       ` Liu, Yi L
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2017-03-27  9:12 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: alex.williamson, Lan, Tianyu, Tian, Kevin, mst, jan.kiszka,
	jasowang, bd.aviv, David Gibson, qemu-devel

On Mon, Mar 27, 2017 at 08:35:05AM +0000, Liu, Yi L wrote:
> > -----Original Message-----
> > From: Qemu-devel [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org] On
> > Behalf Of Peter Xu
> > Sent: Tuesday, February 7, 2017 4:28 PM
> > To: qemu-devel@nongnu.org
> > Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>;
> > mst@redhat.com; jan.kiszka@siemens.com; jasowang@redhat.com;
> > peterx@redhat.com; alex.williamson@redhat.com; bd.aviv@gmail.com; David
> > Gibson <david@gibson.dropbear.id.au>
> > Subject: [Qemu-devel] [PATCH v7 14/17] memory: add
> > MemoryRegionIOMMUOps.replay() callback
> > 
> > Originally we have one memory_region_iommu_replay() function, which is the
> > default behavior to replay the translations of the whole IOMMU region. However,
> > on some platform like x86, we may want our own replay logic for IOMMU regions.
> > This patch add one more hook for IOMMUOps for the callback, and it'll override the
> > default if set.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  include/exec/memory.h | 2 ++
> >  memory.c              | 6 ++++++
> >  2 files changed, 8 insertions(+)
> > 
> > diff --git a/include/exec/memory.h b/include/exec/memory.h index
> > 0767888..30b2a74 100644
> > --- a/include/exec/memory.h
> > +++ b/include/exec/memory.h
> > @@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
> >      void (*notify_flag_changed)(MemoryRegion *iommu,
> >                                  IOMMUNotifierFlag old_flags,
> >                                  IOMMUNotifierFlag new_flags);
> > +    /* Set this up to provide customized IOMMU replay function */
> > +    void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
> >  };
> > 
> >  typedef struct CoalescedMemoryRange CoalescedMemoryRange; diff --git
> > a/memory.c b/memory.c index 7a4f2f9..9c253cc 100644
> > --- a/memory.c
> > +++ b/memory.c
> > @@ -1630,6 +1630,12 @@ void memory_region_iommu_replay(MemoryRegion
> > *mr, IOMMUNotifier *n,
> >      hwaddr addr, granularity;
> >      IOMMUTLBEntry iotlb;
> > +    /* If the IOMMU has its own replay callback, override */
> > +    if (mr->iommu_ops->replay) {
> > +        mr->iommu_ops->replay(mr, n);
> > +        return;
> > +    }
> 
> Hi Alex, Peter,
> 
> Will all the other vendors(e.g. PPC, s390, ARM) add their own replay callback
> as well? I guess it depends on whether the original replay algorithm work well
> for them? Do you have such knowledge?

I guess so. At least for VT-d we added this callback because the
default replay mechanism did not work well on x86, due to the
extremely large size of its memory region.
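
For illustration, a minimal sketch of how a platform IOMMU could
install the hook might look like the following; the helper and field
names are assumptions for the example, not necessarily the exact ones
used in the series:

static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
{
    VTDAddressSpace *vtd_as = container_of(mr, VTDAddressSpace, iommu);

    /* Walk only the populated guest page-table entries and fire the
     * notifier once per mapping, instead of probing every page of a
     * huge (potentially 2^64-byte) region like the default loop.
     * vtd_page_walk() and vtd_replay_hook() are assumed helpers. */
    vtd_page_walk(vtd_as, 0, ~0ULL, vtd_replay_hook, n);
}

static const MemoryRegionIOMMUOps vtd_iommu_ops = {
    .translate           = vtd_iommu_translate,
    .notify_flag_changed = vtd_iommu_notify_flag_changed,
    .replay              = vtd_iommu_replay, /* overrides the default */
};

Thanks,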

-- peterx

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 14/17] memory: add MemoryRegionIOMMUOps.replay() callback
  2017-03-27  9:12     ` Peter Xu
@ 2017-03-27  9:21       ` Liu, Yi L
  2017-03-30 11:06         ` Liu, Yi L
  0 siblings, 1 reply; 63+ messages in thread
From: Liu, Yi L @ 2017-03-27  9:21 UTC (permalink / raw)
  To: Peter Xu
  Cc: alex.williamson, Lan, Tianyu, Tian, Kevin, mst, jan.kiszka,
	jasowang, bd.aviv, David Gibson, qemu-devel

> -----Original Message-----
> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Monday, March 27, 2017 5:12 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Cc: alex.williamson@redhat.com; Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin
> <kevin.tian@intel.com>; mst@redhat.com; jan.kiszka@siemens.com;
> jasowang@redhat.com; bd.aviv@gmail.com; David Gibson
> <david@gibson.dropbear.id.au>; qemu-devel@nongnu.org
> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
> MemoryRegionIOMMUOps.replay() callback
> 
> On Mon, Mar 27, 2017 at 08:35:05AM +0000, Liu, Yi L wrote:
> > > -----Original Message-----
> > > From: Qemu-devel
> > > [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org] On Behalf
> > > Of Peter Xu
> > > Sent: Tuesday, February 7, 2017 4:28 PM
> > > To: qemu-devel@nongnu.org
> > > Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin
> > > <kevin.tian@intel.com>; mst@redhat.com; jan.kiszka@siemens.com;
> > > jasowang@redhat.com; peterx@redhat.com; alex.williamson@redhat.com;
> > > bd.aviv@gmail.com; David Gibson <david@gibson.dropbear.id.au>
> > > Subject: [Qemu-devel] [PATCH v7 14/17] memory: add
> > > MemoryRegionIOMMUOps.replay() callback
> > >
> > > Originally we have one memory_region_iommu_replay() function, which
> > > is the default behavior to replay the translations of the whole
> > > IOMMU region. However, on some platform like x86, we may want our own
> replay logic for IOMMU regions.
> > > This patch add one more hook for IOMMUOps for the callback, and
> > > it'll override the default if set.
> > >
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >  include/exec/memory.h | 2 ++
> > >  memory.c              | 6 ++++++
> > >  2 files changed, 8 insertions(+)
> > >
> > > diff --git a/include/exec/memory.h b/include/exec/memory.h index
> > > 0767888..30b2a74 100644
> > > --- a/include/exec/memory.h
> > > +++ b/include/exec/memory.h
> > > @@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
> > >      void (*notify_flag_changed)(MemoryRegion *iommu,
> > >                                  IOMMUNotifierFlag old_flags,
> > >                                  IOMMUNotifierFlag new_flags);
> > > +    /* Set this up to provide customized IOMMU replay function */
> > > +    void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
> > >  };
> > >
> > >  typedef struct CoalescedMemoryRange CoalescedMemoryRange; diff
> > > --git a/memory.c b/memory.c index 7a4f2f9..9c253cc 100644
> > > --- a/memory.c
> > > +++ b/memory.c
> > > @@ -1630,6 +1630,12 @@ void memory_region_iommu_replay(MemoryRegion
> > > *mr, IOMMUNotifier *n,
> > >      hwaddr addr, granularity;
> > >      IOMMUTLBEntry iotlb;
> > > +    /* If the IOMMU has its own replay callback, override */
> > > +    if (mr->iommu_ops->replay) {
> > > +        mr->iommu_ops->replay(mr, n);
> > > +        return;
> > > +    }
> >
> > Hi Alex, Peter,
> >
> > Will all the other vendors(e.g. PPC, s390, ARM) add their own replay
> > callback as well? I guess it depends on whether the original replay
> > algorithm work well for them? Do you have such knowledge?
> 
> I guess so. At least for VT-d we had this callback since the default replay mechanism
> did not work well on x86 due to its extremely large memory region size. Thanks,

thx. that would make sense. 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 14/17] memory: add MemoryRegionIOMMUOps.replay() callback
  2017-03-27  9:21       ` Liu, Yi L
@ 2017-03-30 11:06         ` Liu, Yi L
  2017-03-30 11:57           ` Jason Wang
  0 siblings, 1 reply; 63+ messages in thread
From: Liu, Yi L @ 2017-03-30 11:06 UTC (permalink / raw)
  To: 'Peter Xu'
  Cc: 'alex.williamson@redhat.com',
	Lan, Tianyu, Tian, Kevin, 'mst@redhat.com',
	'jan.kiszka@siemens.com', 'jasowang@redhat.com',
	'bd.aviv@gmail.com', 'David Gibson',
	'qemu-devel@nongnu.org'

> -----Original Message-----
> From: Liu, Yi L
> Sent: Monday, March 27, 2017 5:22 PM
> To: Peter Xu <peterx@redhat.com>
> Cc: alex.williamson@redhat.com; Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin
> <kevin.tian@intel.com>; mst@redhat.com; jan.kiszka@siemens.com;
> jasowang@redhat.com; bd.aviv@gmail.com; David Gibson
> <david@gibson.dropbear.id.au>; qemu-devel@nongnu.org
> Subject: RE: [Qemu-devel] [PATCH v7 14/17] memory: add
> MemoryRegionIOMMUOps.replay() callback
> 
> > -----Original Message-----
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Monday, March 27, 2017 5:12 PM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Cc: alex.williamson@redhat.com; Lan, Tianyu <tianyu.lan@intel.com>;
> > Tian, Kevin <kevin.tian@intel.com>; mst@redhat.com;
> > jan.kiszka@siemens.com; jasowang@redhat.com; bd.aviv@gmail.com; David
> > Gibson <david@gibson.dropbear.id.au>; qemu-devel@nongnu.org
> > Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
> > MemoryRegionIOMMUOps.replay() callback
> >
> > On Mon, Mar 27, 2017 at 08:35:05AM +0000, Liu, Yi L wrote:
> > > > -----Original Message-----
> > > > From: Qemu-devel
> > > > [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org] On
> > > > Behalf Of Peter Xu
> > > > Sent: Tuesday, February 7, 2017 4:28 PM
> > > > To: qemu-devel@nongnu.org
> > > > Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin
> > > > <kevin.tian@intel.com>; mst@redhat.com; jan.kiszka@siemens.com;
> > > > jasowang@redhat.com; peterx@redhat.com;
> > > > alex.williamson@redhat.com; bd.aviv@gmail.com; David Gibson
> > > > <david@gibson.dropbear.id.au>
> > > > Subject: [Qemu-devel] [PATCH v7 14/17] memory: add
> > > > MemoryRegionIOMMUOps.replay() callback
> > > >
> > > > Originally we have one memory_region_iommu_replay() function,
> > > > which is the default behavior to replay the translations of the
> > > > whole IOMMU region. However, on some platform like x86, we may
> > > > want our own
> > replay logic for IOMMU regions.
> > > > This patch add one more hook for IOMMUOps for the callback, and
> > > > it'll override the default if set.
> > > >
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > ---
> > > >  include/exec/memory.h | 2 ++
> > > >  memory.c              | 6 ++++++
> > > >  2 files changed, 8 insertions(+)
> > > >
> > > > diff --git a/include/exec/memory.h b/include/exec/memory.h index
> > > > 0767888..30b2a74 100644
> > > > --- a/include/exec/memory.h
> > > > +++ b/include/exec/memory.h
> > > > @@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
> > > >      void (*notify_flag_changed)(MemoryRegion *iommu,
> > > >                                  IOMMUNotifierFlag old_flags,
> > > >                                  IOMMUNotifierFlag new_flags);
> > > > +    /* Set this up to provide customized IOMMU replay function */
> > > > +    void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
> > > >  };
> > > >
> > > >  typedef struct CoalescedMemoryRange CoalescedMemoryRange; diff
> > > > --git a/memory.c b/memory.c index 7a4f2f9..9c253cc 100644
> > > > --- a/memory.c
> > > > +++ b/memory.c
> > > > @@ -1630,6 +1630,12 @@ void
> > > > memory_region_iommu_replay(MemoryRegion
> > > > *mr, IOMMUNotifier *n,
> > > >      hwaddr addr, granularity;
> > > >      IOMMUTLBEntry iotlb;
> > > > +    /* If the IOMMU has its own replay callback, override */
> > > > +    if (mr->iommu_ops->replay) {
> > > > +        mr->iommu_ops->replay(mr, n);
> > > > +        return;
> > > > +    }
> > >
> > > Hi Alex, Peter,
> > >
> > > Will all the other vendors(e.g. PPC, s390, ARM) add their own replay
> > > callback as well? I guess it depends on whether the original replay
> > > algorithm work well for them? Do you have such knowledge?
> >
> > I guess so. At least for VT-d we had this callback since the default
> > replay mechanism did not work well on x86 due to its extremely large
> > memory region size. Thanks,
> 
> thx. that would make sense.

Peter,

It just came to mind that there may be a corner case here.

Intel VT-d actually has a "pt" (pass-through) mode which allows a device to use
physical addresses even when VT-d is enabled. In the kernel there is an
iommu_identity_mapping; if a device is in this map, it uses "pt" mode, so the
IOMMU driver does not build a second-level page table for it.

Back to the virtual IOVA implementation: if an assigned device is in the
iommu_identity_mapping (e.g. a VGA controller), it uses GPAs directly for DMA,
so it needs a GPA->HPA mapping in the host. However, iommu->ops.replay cannot
build that mapping when the guest second-level page table is empty.

So I think building an entire guest PA->HPA mapping before the guest kernel
boots would be recommended. Any thoughts?

Regards,
Yi L

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 14/17] memory: add MemoryRegionIOMMUOps.replay() callback
  2017-03-30 11:06         ` Liu, Yi L
@ 2017-03-30 11:57           ` Jason Wang
  2017-03-31  2:56             ` Peter Xu
  2017-03-31  5:34             ` Liu, Yi L
  0 siblings, 2 replies; 63+ messages in thread
From: Jason Wang @ 2017-03-30 11:57 UTC (permalink / raw)
  To: Liu, Yi L, 'Peter Xu'
  Cc: 'alex.williamson@redhat.com',
	Lan, Tianyu, Tian, Kevin, 'mst@redhat.com',
	'jan.kiszka@siemens.com', 'bd.aviv@gmail.com',
	'David Gibson', 'qemu-devel@nongnu.org'



On 2017年03月30日 19:06, Liu, Yi L wrote:
>> -----Original Message-----
>> From: Liu, Yi L
>> Sent: Monday, March 27, 2017 5:22 PM
>> To: Peter Xu <peterx@redhat.com>
>> Cc: alex.williamson@redhat.com; Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin
>> <kevin.tian@intel.com>; mst@redhat.com; jan.kiszka@siemens.com;
>> jasowang@redhat.com; bd.aviv@gmail.com; David Gibson
>> <david@gibson.dropbear.id.au>; qemu-devel@nongnu.org
>> Subject: RE: [Qemu-devel] [PATCH v7 14/17] memory: add
>> MemoryRegionIOMMUOps.replay() callback
>>
>>> -----Original Message-----
>>> From: Peter Xu [mailto:peterx@redhat.com]
>>> Sent: Monday, March 27, 2017 5:12 PM
>>> To: Liu, Yi L <yi.l.liu@intel.com>
>>> Cc: alex.williamson@redhat.com; Lan, Tianyu <tianyu.lan@intel.com>;
>>> Tian, Kevin <kevin.tian@intel.com>; mst@redhat.com;
>>> jan.kiszka@siemens.com; jasowang@redhat.com; bd.aviv@gmail.com; David
>>> Gibson <david@gibson.dropbear.id.au>; qemu-devel@nongnu.org
>>> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
>>> MemoryRegionIOMMUOps.replay() callback
>>>
>>> On Mon, Mar 27, 2017 at 08:35:05AM +0000, Liu, Yi L wrote:
>>>>> -----Original Message-----
>>>>> From: Qemu-devel
>>>>> [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org] On
>>>>> Behalf Of Peter Xu
>>>>> Sent: Tuesday, February 7, 2017 4:28 PM
>>>>> To: qemu-devel@nongnu.org
>>>>> Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin
>>>>> <kevin.tian@intel.com>; mst@redhat.com; jan.kiszka@siemens.com;
>>>>> jasowang@redhat.com; peterx@redhat.com;
>>>>> alex.williamson@redhat.com; bd.aviv@gmail.com; David Gibson
>>>>> <david@gibson.dropbear.id.au>
>>>>> Subject: [Qemu-devel] [PATCH v7 14/17] memory: add
>>>>> MemoryRegionIOMMUOps.replay() callback
>>>>>
>>>>> Originally we have one memory_region_iommu_replay() function,
>>>>> which is the default behavior to replay the translations of the
>>>>> whole IOMMU region. However, on some platform like x86, we may
>>>>> want our own
>>> replay logic for IOMMU regions.
>>>>> This patch add one more hook for IOMMUOps for the callback, and
>>>>> it'll override the default if set.
>>>>>
>>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>>> ---
>>>>>   include/exec/memory.h | 2 ++
>>>>>   memory.c              | 6 ++++++
>>>>>   2 files changed, 8 insertions(+)
>>>>>
>>>>> diff --git a/include/exec/memory.h b/include/exec/memory.h index
>>>>> 0767888..30b2a74 100644
>>>>> --- a/include/exec/memory.h
>>>>> +++ b/include/exec/memory.h
>>>>> @@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
>>>>>       void (*notify_flag_changed)(MemoryRegion *iommu,
>>>>>                                   IOMMUNotifierFlag old_flags,
>>>>>                                   IOMMUNotifierFlag new_flags);
>>>>> +    /* Set this up to provide customized IOMMU replay function */
>>>>> +    void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
>>>>>   };
>>>>>
>>>>>   typedef struct CoalescedMemoryRange CoalescedMemoryRange; diff
>>>>> --git a/memory.c b/memory.c index 7a4f2f9..9c253cc 100644
>>>>> --- a/memory.c
>>>>> +++ b/memory.c
>>>>> @@ -1630,6 +1630,12 @@ void
>>>>> memory_region_iommu_replay(MemoryRegion
>>>>> *mr, IOMMUNotifier *n,
>>>>>       hwaddr addr, granularity;
>>>>>       IOMMUTLBEntry iotlb;
>>>>> +    /* If the IOMMU has its own replay callback, override */
>>>>> +    if (mr->iommu_ops->replay) {
>>>>> +        mr->iommu_ops->replay(mr, n);
>>>>> +        return;
>>>>> +    }
>>>> Hi Alex, Peter,
>>>>
>>>> Will all the other vendors(e.g. PPC, s390, ARM) add their own replay
>>>> callback as well? I guess it depends on whether the original replay
>>>> algorithm work well for them? Do you have such knowledge?
>>> I guess so. At least for VT-d we had this callback since the default
>>> replay mechanism did not work well on x86 due to its extremely large
>>> memory region size. Thanks,
>> thx. that would make sense.
> Peter,
>
> Just come to mind that there may be a corner case here.
>
> Intel VT-d actually has a "pt" mode which allows device use physical address
> even when VT-d is enabled. In kernel, there is a iommu_identity_mapping.
> If a device is in this map, then it would use "pt" mode. So that IOMMU driver
> would not build second-level page table for it.

Yes, but qemu does not support ECAP_PT now, so the guest will still have a 
page table in this case.

>
> Back to the virtual IOVA implementation, if an assigned device is in the
> iommu_identity_mapping(e.g. VGA controller), it uses GPA directly to do DMA.
> So it demands a GPA->HPA mapping in host. However, the iommu->ops.replay
> is not able to build it when guest SL page table is empty.
>
> So I think building an entire guest PA->HPA mapping before guest kernel boot
> would be recommended. Any thoughts?

We plan to add PT support in 2.10. A possible rough idea is to disable the 
IOMMU DMAR region and use another region without iommu_ops; then 
vfio_listener_region_add() will just do the correct mappings.

Thanks

>
> Regards,
> Yi L

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 14/17] memory: add MemoryRegionIOMMUOps.replay() callback
  2017-03-30 11:57           ` Jason Wang
@ 2017-03-31  2:56             ` Peter Xu
  2017-03-31  4:21               ` Jason Wang
  2017-03-31  5:34             ` Liu, Yi L
  1 sibling, 1 reply; 63+ messages in thread
From: Peter Xu @ 2017-03-31  2:56 UTC (permalink / raw)
  To: Jason Wang
  Cc: Liu, Yi L, 'alex.williamson@redhat.com',
	Lan, Tianyu, Tian, Kevin, 'mst@redhat.com',
	'jan.kiszka@siemens.com', 'bd.aviv@gmail.com',
	'David Gibson', 'qemu-devel@nongnu.org'

On Thu, Mar 30, 2017 at 07:57:38PM +0800, Jason Wang wrote:
> 
> 
> On 2017年03月30日 19:06, Liu, Yi L wrote:
> >>-----Original Message-----
> >>From: Liu, Yi L
> >>Sent: Monday, March 27, 2017 5:22 PM
> >>To: Peter Xu <peterx@redhat.com>
> >>Cc: alex.williamson@redhat.com; Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin
> >><kevin.tian@intel.com>; mst@redhat.com; jan.kiszka@siemens.com;
> >>jasowang@redhat.com; bd.aviv@gmail.com; David Gibson
> >><david@gibson.dropbear.id.au>; qemu-devel@nongnu.org
> >>Subject: RE: [Qemu-devel] [PATCH v7 14/17] memory: add
> >>MemoryRegionIOMMUOps.replay() callback
> >>
> >>>-----Original Message-----
> >>>From: Peter Xu [mailto:peterx@redhat.com]
> >>>Sent: Monday, March 27, 2017 5:12 PM
> >>>To: Liu, Yi L <yi.l.liu@intel.com>
> >>>Cc: alex.williamson@redhat.com; Lan, Tianyu <tianyu.lan@intel.com>;
> >>>Tian, Kevin <kevin.tian@intel.com>; mst@redhat.com;
> >>>jan.kiszka@siemens.com; jasowang@redhat.com; bd.aviv@gmail.com; David
> >>>Gibson <david@gibson.dropbear.id.au>; qemu-devel@nongnu.org
> >>>Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
> >>>MemoryRegionIOMMUOps.replay() callback
> >>>
> >>>On Mon, Mar 27, 2017 at 08:35:05AM +0000, Liu, Yi L wrote:
> >>>>>-----Original Message-----
> >>>>>From: Qemu-devel
> >>>>>[mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org] On
> >>>>>Behalf Of Peter Xu
> >>>>>Sent: Tuesday, February 7, 2017 4:28 PM
> >>>>>To: qemu-devel@nongnu.org
> >>>>>Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin
> >>>>><kevin.tian@intel.com>; mst@redhat.com; jan.kiszka@siemens.com;
> >>>>>jasowang@redhat.com; peterx@redhat.com;
> >>>>>alex.williamson@redhat.com; bd.aviv@gmail.com; David Gibson
> >>>>><david@gibson.dropbear.id.au>
> >>>>>Subject: [Qemu-devel] [PATCH v7 14/17] memory: add
> >>>>>MemoryRegionIOMMUOps.replay() callback
> >>>>>
> >>>>>Originally we have one memory_region_iommu_replay() function,
> >>>>>which is the default behavior to replay the translations of the
> >>>>>whole IOMMU region. However, on some platform like x86, we may
> >>>>>want our own
> >>>replay logic for IOMMU regions.
> >>>>>This patch add one more hook for IOMMUOps for the callback, and
> >>>>>it'll override the default if set.
> >>>>>
> >>>>>Signed-off-by: Peter Xu <peterx@redhat.com>
> >>>>>---
> >>>>>  include/exec/memory.h | 2 ++
> >>>>>  memory.c              | 6 ++++++
> >>>>>  2 files changed, 8 insertions(+)
> >>>>>
> >>>>>diff --git a/include/exec/memory.h b/include/exec/memory.h index
> >>>>>0767888..30b2a74 100644
> >>>>>--- a/include/exec/memory.h
> >>>>>+++ b/include/exec/memory.h
> >>>>>@@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
> >>>>>      void (*notify_flag_changed)(MemoryRegion *iommu,
> >>>>>                                  IOMMUNotifierFlag old_flags,
> >>>>>                                  IOMMUNotifierFlag new_flags);
> >>>>>+    /* Set this up to provide customized IOMMU replay function */
> >>>>>+    void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
> >>>>>  };
> >>>>>
> >>>>>  typedef struct CoalescedMemoryRange CoalescedMemoryRange; diff
> >>>>>--git a/memory.c b/memory.c index 7a4f2f9..9c253cc 100644
> >>>>>--- a/memory.c
> >>>>>+++ b/memory.c
> >>>>>@@ -1630,6 +1630,12 @@ void
> >>>>>memory_region_iommu_replay(MemoryRegion
> >>>>>*mr, IOMMUNotifier *n,
> >>>>>      hwaddr addr, granularity;
> >>>>>      IOMMUTLBEntry iotlb;
> >>>>>+    /* If the IOMMU has its own replay callback, override */
> >>>>>+    if (mr->iommu_ops->replay) {
> >>>>>+        mr->iommu_ops->replay(mr, n);
> >>>>>+        return;
> >>>>>+    }
> >>>>Hi Alex, Peter,
> >>>>
> >>>>Will all the other vendors(e.g. PPC, s390, ARM) add their own replay
> >>>>callback as well? I guess it depends on whether the original replay
> >>>>algorithm work well for them? Do you have such knowledge?
> >>>I guess so. At least for VT-d we had this callback since the default
> >>>replay mechanism did not work well on x86 due to its extremely large
> >>>memory region size. Thanks,
> >>thx. that would make sense.
> >Peter,
> >
> >Just come to mind that there may be a corner case here.
> >
> >Intel VT-d actually has a "pt" mode which allows device use physical address
> >even when VT-d is enabled. In kernel, there is a iommu_identity_mapping.
> >If a device is in this map, then it would use "pt" mode. So that IOMMU driver
> >would not build second-level page table for it.
> 
> Yes, but qemu does not support ECAP_PT now, so guest will still have a page
> table in this case.
> 
> >
> >Back to the virtual IOVA implementation, if an assigned device is in the
> >iommu_identity_mapping(e.g. VGA controller), it uses GPA directly to do DMA.
> >So it demands a GPA->HPA mapping in host. However, the iommu->ops.replay
> >is not able to build it when guest SL page table is empty.
> >
> >So I think building an entire guest PA->HPA mapping before guest kernel boot
> >would be recommended. Any thoughts?
> 
> We plan to add PT in 2.10, a possible rough idea is disabled iommu dmar
> region and use another region without iommu_ops. Then
> vfio_listener_region_add() will just do the correct mappings.

Even without any new region: with patch 16/17 ("intel_iommu: allow
dynamic switch of IOMMU region"), we can just turn the IOMMU region
on/off following the device's PT bit, maybe using the new
vtd_switch_address_space() interface. That should be enough.

Again, we just need to wait until the current series is merged.

(Oh, now I see why I had an extra "on/off" parameter in previous
 versions of vtd_switch_address_space(), though it was removed.)
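
For concreteness, the switch could be little more than toggling which
sub-region of the device's address space is enabled. A sketch, with
the extra on/off parameter spelled out and with assumed field names
rather than the exact code from patch 16:

static void vtd_switch_address_space(VTDAddressSpace *as, bool use_iommu)
{
    /* Only one of the two sub-regions is visible at a time: the DMAR
     * (IOMMU) region when translation is on, and a plain alias of
     * system memory ("ram_alias" is an assumed field name) when it is
     * off.  Flipping them generates region_del/region_add events, so
     * vfio_listener_region_add() establishes the GPA->HPA mappings
     * for the flat case without any replay. */
    memory_region_set_enabled(&as->iommu, use_iommu);
    memory_region_set_enabled(&as->ram_alias, !use_iommu);
}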

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 14/17] memory: add MemoryRegionIOMMUOps.replay() callback
  2017-03-31  2:56             ` Peter Xu
@ 2017-03-31  4:21               ` Jason Wang
  2017-03-31  5:01                 ` Peter Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Jason Wang @ 2017-03-31  4:21 UTC (permalink / raw)
  To: Peter Xu
  Cc: Liu, Yi L, 'alex.williamson@redhat.com',
	Lan, Tianyu, Tian, Kevin, 'mst@redhat.com',
	'jan.kiszka@siemens.com', 'bd.aviv@gmail.com',
	'David Gibson', 'qemu-devel@nongnu.org'



On 2017年03月31日 10:56, Peter Xu wrote:
>>> Just come to mind that there may be a corner case here.
>>>
>>> Intel VT-d actually has a "pt" mode which allows device use physical address
>>> even when VT-d is enabled. In kernel, there is a iommu_identity_mapping.
>>> If a device is in this map, then it would use "pt" mode. So that IOMMU driver
>>> would not build second-level page table for it.
>> Yes, but qemu does not support ECAP_PT now, so guest will still have a page
>> table in this case.
>>
>>> Back to the virtual IOVA implementation, if an assigned device is in the
>>> iommu_identity_mapping(e.g. VGA controller), it uses GPA directly to do DMA.
>>> So it demands a GPA->HPA mapping in host. However, the iommu->ops.replay
>>> is not able to build it when guest SL page table is empty.
>>>
>>> So I think building an entire guest PA->HPA mapping before guest kernel boot
>>> would be recommended. Any thoughts?
>> We plan to add PT in 2.10, a possible rough idea is disabled iommu dmar
>> region and use another region without iommu_ops. Then
>> vfio_listener_region_add() will just do the correct mappings.
> Even without any new region. With the patch 16/17 ("intel_iommu: allow
> dynamic switch of IOMMU region"), we can just turn the IOMMU region
> on/off, following the device's PT bit, maybe using the new
> vtd_switch_address_space() interface. That should be enough.

Right. For vhost it would probably need more work, e.g. setting up static 
mappings during region_add().

>
> Again, we just need to wait until current series merged.
>
> (Oh, then I found why I had an extra "on/off" parameter in previous
>   versions in vtd_switch_address_space(), but it was removed.)

Good to know this.

Thanks

>
> Thanks,
>
> -- peterx

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 14/17] memory: add MemoryRegionIOMMUOps.replay() callback
  2017-03-31  4:21               ` Jason Wang
@ 2017-03-31  5:01                 ` Peter Xu
  2017-03-31  5:12                   ` Jason Wang
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2017-03-31  5:01 UTC (permalink / raw)
  To: Jason Wang
  Cc: Liu, Yi L, 'alex.williamson@redhat.com',
	Lan, Tianyu, Tian, Kevin, 'mst@redhat.com',
	'jan.kiszka@siemens.com', 'bd.aviv@gmail.com',
	'David Gibson', 'qemu-devel@nongnu.org'

On Fri, Mar 31, 2017 at 12:21:23PM +0800, Jason Wang wrote:
> 
> 
> On 2017年03月31日 10:56, Peter Xu wrote:
> >>>Just come to mind that there may be a corner case here.
> >>>
> >>>Intel VT-d actually has a "pt" mode which allows device use physical address
> >>>even when VT-d is enabled. In kernel, there is a iommu_identity_mapping.
> >>>If a device is in this map, then it would use "pt" mode. So that IOMMU driver
> >>>would not build second-level page table for it.
> >>Yes, but qemu does not support ECAP_PT now, so guest will still have a page
> >>table in this case.
> >>
> >>>Back to the virtual IOVA implementation, if an assigned device is in the
> >>>iommu_identity_mapping(e.g. VGA controller), it uses GPA directly to do DMA.
> >>>So it demands a GPA->HPA mapping in host. However, the iommu->ops.replay
> >>>is not able to build it when guest SL page table is empty.
> >>>
> >>>So I think building an entire guest PA->HPA mapping before guest kernel boot
> >>>would be recommended. Any thoughts?
> >>We plan to add PT in 2.10, a possible rough idea is disabled iommu dmar
> >>region and use another region without iommu_ops. Then
> >>vfio_listener_region_add() will just do the correct mappings.
> >Even without any new region. With the patch 16/17 ("intel_iommu: allow
> >dynamic switch of IOMMU region"), we can just turn the IOMMU region
> >on/off, following the device's PT bit, maybe using the new
> >vtd_switch_address_space() interface. That should be enough.
> 
> Right. For vhost it was probably need more works, e.g setting up static
> mappings during region_add().

Do we need to?

VFIO will need it for building up the shadow page table, even without a
vIOMMU. But IMHO that should not be needed by vhost, right?

-- peterx

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 14/17] memory: add MemoryRegionIOMMUOps.replay() callback
  2017-03-31  5:01                 ` Peter Xu
@ 2017-03-31  5:12                   ` Jason Wang
  2017-03-31  5:28                     ` Peter Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Jason Wang @ 2017-03-31  5:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: Liu, Yi L, 'alex.williamson@redhat.com',
	Lan, Tianyu, Tian, Kevin, 'mst@redhat.com',
	'jan.kiszka@siemens.com', 'bd.aviv@gmail.com',
	'David Gibson', 'qemu-devel@nongnu.org'



On 2017年03月31日 13:01, Peter Xu wrote:
> On Fri, Mar 31, 2017 at 12:21:23PM +0800, Jason Wang wrote:
>>
>> On 2017年03月31日 10:56, Peter Xu wrote:
>>>>> Just come to mind that there may be a corner case here.
>>>>>
>>>>> Intel VT-d actually has a "pt" mode which allows device use physical address
>>>>> even when VT-d is enabled. In kernel, there is a iommu_identity_mapping.
>>>>> If a device is in this map, then it would use "pt" mode. So that IOMMU driver
>>>>> would not build second-level page table for it.
>>>> Yes, but qemu does not support ECAP_PT now, so guest will still have a page
>>>> table in this case.
>>>>
>>>>> Back to the virtual IOVA implementation, if an assigned device is in the
>>>>> iommu_identity_mapping(e.g. VGA controller), it uses GPA directly to do DMA.
>>>>> So it demands a GPA->HPA mapping in host. However, the iommu->ops.replay
>>>>> is not able to build it when guest SL page table is empty.
>>>>>
>>>>> So I think building an entire guest PA->HPA mapping before guest kernel boot
>>>>> would be recommended. Any thoughts?
>>>> We plan to add PT in 2.10, a possible rough idea is disabled iommu dmar
>>>> region and use another region without iommu_ops. Then
>>>> vfio_listener_region_add() will just do the correct mappings.
>>> Even without any new region. With the patch 16/17 ("intel_iommu: allow
>>> dynamic switch of IOMMU region"), we can just turn the IOMMU region
>>> on/off, following the device's PT bit, maybe using the new
>>> vtd_switch_address_space() interface. That should be enough.
>> Right. For vhost it was probably need more works, e.g setting up static
>> mappings during region_add().
> Do we need to?

Not a must if we don't care about performance.

>
> VFIO will need it for building up shadow page table, even without a
> vIOMMU. But imho that should not be needed by vhost, right?

Device IOTLB will be enabled unconditionally if iommu_platform is 
specified. If we don't set up static mappings, vhost will send IOTLB miss 
requests, and the performance will be horrible in that case.

Thanks

>
> -- peterx

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 14/17] memory: add MemoryRegionIOMMUOps.replay() callback
  2017-03-31  5:12                   ` Jason Wang
@ 2017-03-31  5:28                     ` Peter Xu
  0 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2017-03-31  5:28 UTC (permalink / raw)
  To: Jason Wang
  Cc: Liu, Yi L, 'alex.williamson@redhat.com',
	Lan, Tianyu, Tian, Kevin, 'mst@redhat.com',
	'jan.kiszka@siemens.com', 'bd.aviv@gmail.com',
	'David Gibson', 'qemu-devel@nongnu.org'

On Fri, Mar 31, 2017 at 01:12:56PM +0800, Jason Wang wrote:
> 
> 
> On 2017年03月31日 13:01, Peter Xu wrote:
> >On Fri, Mar 31, 2017 at 12:21:23PM +0800, Jason Wang wrote:
> >>
> >>On 2017年03月31日 10:56, Peter Xu wrote:
> >>>>>Just come to mind that there may be a corner case here.
> >>>>>
> >>>>>Intel VT-d actually has a "pt" mode which allows device use physical address
> >>>>>even when VT-d is enabled. In kernel, there is a iommu_identity_mapping.
> >>>>>If a device is in this map, then it would use "pt" mode. So that IOMMU driver
> >>>>>would not build second-level page table for it.
> >>>>Yes, but qemu does not support ECAP_PT now, so guest will still have a page
> >>>>table in this case.
> >>>>
> >>>>>Back to the virtual IOVA implementation, if an assigned device is in the
> >>>>>iommu_identity_mapping(e.g. VGA controller), it uses GPA directly to do DMA.
> >>>>>So it demands a GPA->HPA mapping in host. However, the iommu->ops.replay
> >>>>>is not able to build it when guest SL page table is empty.
> >>>>>
> >>>>>So I think building an entire guest PA->HPA mapping before guest kernel boot
> >>>>>would be recommended. Any thoughts?
> >>>>We plan to add PT in 2.10, a possible rough idea is disabled iommu dmar
> >>>>region and use another region without iommu_ops. Then
> >>>>vfio_listener_region_add() will just do the correct mappings.
> >>>Even without any new region. With the patch 16/17 ("intel_iommu: allow
> >>>dynamic switch of IOMMU region"), we can just turn the IOMMU region
> >>>on/off, following the device's PT bit, maybe using the new
> >>>vtd_switch_address_space() interface. That should be enough.
> >>Right. For vhost it was probably need more works, e.g setting up static
> >>mappings during region_add().
> >Do we need to?
> 
> Not a must if we don't care about performance.
> 
> >
> >VFIO will need it for building up shadow page table, even without a
> >vIOMMU. But imho that should not be needed by vhost, right?
> 
> Device IOTLB will be enabled unconditionally if iommu_platform is specified.
> If we don't set static mappings, vhost will send IOTLB miss request. The
> performance will be horrible in this case.

I see, thanks. So it looks like we will need one more patch for PT
support now. :)

-- peterx

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 14/17] memory: add MemoryRegionIOMMUOps.replay() callback
  2017-03-30 11:57           ` Jason Wang
  2017-03-31  2:56             ` Peter Xu
@ 2017-03-31  5:34             ` Liu, Yi L
  2017-03-31  7:16               ` Jason Wang
  1 sibling, 1 reply; 63+ messages in thread
From: Liu, Yi L @ 2017-03-31  5:34 UTC (permalink / raw)
  To: Jason Wang, 'Peter Xu'
  Cc: 'alex.williamson@redhat.com',
	Lan, Tianyu, Tian, Kevin, 'mst@redhat.com',
	'jan.kiszka@siemens.com', 'bd.aviv@gmail.com',
	'David Gibson', 'qemu-devel@nongnu.org'

> -----Original Message-----
> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Thursday, March 30, 2017 7:58 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; 'Peter Xu' <peterx@redhat.com>
> Cc: 'alex.williamson@redhat.com' <alex.williamson@redhat.com>; Lan, Tianyu
> <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>; 'mst@redhat.com'
> <mst@redhat.com>; 'jan.kiszka@siemens.com' <jan.kiszka@siemens.com>;
> 'bd.aviv@gmail.com' <bd.aviv@gmail.com>; 'David Gibson'
> <david@gibson.dropbear.id.au>; 'qemu-devel@nongnu.org' <qemu-
> devel@nongnu.org>
> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
> MemoryRegionIOMMUOps.replay() callback
> 
> 
> 
> On 2017年03月30日 19:06, Liu, Yi L wrote:
> >> -----Original Message-----
> >> From: Liu, Yi L
> >> Sent: Monday, March 27, 2017 5:22 PM
> >> To: Peter Xu <peterx@redhat.com>
> >> Cc: alex.williamson@redhat.com; Lan, Tianyu <tianyu.lan@intel.com>;
> >> Tian, Kevin <kevin.tian@intel.com>; mst@redhat.com;
> >> jan.kiszka@siemens.com; jasowang@redhat.com; bd.aviv@gmail.com; David
> >> Gibson <david@gibson.dropbear.id.au>; qemu-devel@nongnu.org
> >> Subject: RE: [Qemu-devel] [PATCH v7 14/17] memory: add
> >> MemoryRegionIOMMUOps.replay() callback
> >>
> >>> -----Original Message-----
> >>> From: Peter Xu [mailto:peterx@redhat.com]
> >>> Sent: Monday, March 27, 2017 5:12 PM
> >>> To: Liu, Yi L <yi.l.liu@intel.com>
> >>> Cc: alex.williamson@redhat.com; Lan, Tianyu <tianyu.lan@intel.com>;
> >>> Tian, Kevin <kevin.tian@intel.com>; mst@redhat.com;
> >>> jan.kiszka@siemens.com; jasowang@redhat.com; bd.aviv@gmail.com;
> >>> David Gibson <david@gibson.dropbear.id.au>; qemu-devel@nongnu.org
> >>> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
> >>> MemoryRegionIOMMUOps.replay() callback
> >>>
> >>> On Mon, Mar 27, 2017 at 08:35:05AM +0000, Liu, Yi L wrote:
> >>>>> -----Original Message-----
> >>>>> From: Qemu-devel
> >>>>> [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org] On
> >>>>> Behalf Of Peter Xu
> >>>>> Sent: Tuesday, February 7, 2017 4:28 PM
> >>>>> To: qemu-devel@nongnu.org
> >>>>> Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin
> >>>>> <kevin.tian@intel.com>; mst@redhat.com; jan.kiszka@siemens.com;
> >>>>> jasowang@redhat.com; peterx@redhat.com;
> >>>>> alex.williamson@redhat.com; bd.aviv@gmail.com; David Gibson
> >>>>> <david@gibson.dropbear.id.au>
> >>>>> Subject: [Qemu-devel] [PATCH v7 14/17] memory: add
> >>>>> MemoryRegionIOMMUOps.replay() callback
> >>>>>
> >>>>> Originally we have one memory_region_iommu_replay() function,
> >>>>> which is the default behavior to replay the translations of the
> >>>>> whole IOMMU region. However, on some platform like x86, we may
> >>>>> want our own
> >>> replay logic for IOMMU regions.
> >>>>> This patch add one more hook for IOMMUOps for the callback, and
> >>>>> it'll override the default if set.
> >>>>>
> >>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
> >>>>> ---
> >>>>>   include/exec/memory.h | 2 ++
> >>>>>   memory.c              | 6 ++++++
> >>>>>   2 files changed, 8 insertions(+)
> >>>>>
> >>>>> diff --git a/include/exec/memory.h b/include/exec/memory.h index
> >>>>> 0767888..30b2a74 100644
> >>>>> --- a/include/exec/memory.h
> >>>>> +++ b/include/exec/memory.h
> >>>>> @@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
> >>>>>       void (*notify_flag_changed)(MemoryRegion *iommu,
> >>>>>                                   IOMMUNotifierFlag old_flags,
> >>>>>                                   IOMMUNotifierFlag new_flags);
> >>>>> +    /* Set this up to provide customized IOMMU replay function */
> >>>>> +    void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
> >>>>>   };
> >>>>>
> >>>>>   typedef struct CoalescedMemoryRange CoalescedMemoryRange; diff
> >>>>> --git a/memory.c b/memory.c index 7a4f2f9..9c253cc 100644
> >>>>> --- a/memory.c
> >>>>> +++ b/memory.c
> >>>>> @@ -1630,6 +1630,12 @@ void
> >>>>> memory_region_iommu_replay(MemoryRegion
> >>>>> *mr, IOMMUNotifier *n,
> >>>>>       hwaddr addr, granularity;
> >>>>>       IOMMUTLBEntry iotlb;
> >>>>> +    /* If the IOMMU has its own replay callback, override */
> >>>>> +    if (mr->iommu_ops->replay) {
> >>>>> +        mr->iommu_ops->replay(mr, n);
> >>>>> +        return;
> >>>>> +    }
> >>>> Hi Alex, Peter,
> >>>>
> >>>> Will all the other vendors(e.g. PPC, s390, ARM) add their own
> >>>> replay callback as well? I guess it depends on whether the original
> >>>> replay algorithm work well for them? Do you have such knowledge?
> >>> I guess so. At least for VT-d we had this callback since the default
> >>> replay mechanism did not work well on x86 due to its extremely large
> >>> memory region size. Thanks,
> >> thx. that would make sense.
> > Peter,
> >
> > Just come to mind that there may be a corner case here.
> >
> > Intel VT-d actually has a "pt" mode which allows device use physical
> > address even when VT-d is enabled. In kernel, there is a iommu_identity_mapping.
> > If a device is in this map, then it would use "pt" mode. So that IOMMU
> > driver would not build second-level page table for it.
> 
> Yes, but qemu does not support ECAP_PT now, so guest will still have a page table in
> this case.

That's true. Without ECAP_PT, the IOMMU driver would create a 1:1 map, so this
solution can work well even if a device is in the identity map.

> 
> >
> > Back to the virtual IOVA implementation, if an assigned device is in
> > the iommu_identity_mapping(e.g. VGA controller), it uses GPA directly to do DMA.
> > So it demands a GPA->HPA mapping in host. However, the
> > iommu->ops.replay is not able to build it when guest SL page table is empty.
> >
> > So I think building an entire guest PA->HPA mapping before guest
> > kernel boot would be recommended. Any thoughts?
> 
> We plan to add PT in 2.10, a possible rough idea is disabled iommu dmar region and
> use another region without iommu_ops. Then
> vfio_listener_region_add() will just do the correct mappings.

Good to know. Actually, I also need to expose ECAP_PT for vSVM, so I just came to
realize that the current replay solution may not work well once I expose ECAP_PT
to the guest.
I also have a rough idea here. The current listener in the container listens to the
address space named with devfn when virtual VT-d is added. How about adding one
more listener on the memory address space, so that this listener can build the
entire guest PA->HPA mapping? Also, the vfio notifier is registered when changes
happen in the device address space; however, I didn't check whether all the layout
changes in the memory address space happen before the first dynamic map/unmap
request from the guest. If not, this solution is not practical.

Thanks,
Yi L

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 14/17] memory: add MemoryRegionIOMMUOps.replay() callback
  2017-03-31  5:34             ` Liu, Yi L
@ 2017-03-31  7:16               ` Jason Wang
  2017-03-31  7:30                 ` Liu, Yi L
  0 siblings, 1 reply; 63+ messages in thread
From: Jason Wang @ 2017-03-31  7:16 UTC (permalink / raw)
  To: Liu, Yi L, 'Peter Xu'
  Cc: Lan, Tianyu, Tian, Kevin, 'mst@redhat.com',
	'jan.kiszka@siemens.com', 'bd.aviv@gmail.com',
	'qemu-devel@nongnu.org',
	'alex.williamson@redhat.com', 'David Gibson'



On 2017年03月31日 13:34, Liu, Yi L wrote:
>> -----Original Message-----
>> From: Jason Wang [mailto:jasowang@redhat.com]
>> Sent: Thursday, March 30, 2017 7:58 PM
>> To: Liu, Yi L <yi.l.liu@intel.com>; 'Peter Xu' <peterx@redhat.com>
>> Cc: 'alex.williamson@redhat.com' <alex.williamson@redhat.com>; Lan, Tianyu
>> <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>; 'mst@redhat.com'
>> <mst@redhat.com>; 'jan.kiszka@siemens.com' <jan.kiszka@siemens.com>;
>> 'bd.aviv@gmail.com' <bd.aviv@gmail.com>; 'David Gibson'
>> <david@gibson.dropbear.id.au>; 'qemu-devel@nongnu.org' <qemu-
>> devel@nongnu.org>
>> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
>> MemoryRegionIOMMUOps.replay() callback
>>
>>
>>
>> On 2017年03月30日 19:06, Liu, Yi L wrote:
>>>> -----Original Message-----
>>>> From: Liu, Yi L
>>>> Sent: Monday, March 27, 2017 5:22 PM
>>>> To: Peter Xu <peterx@redhat.com>
>>>> Cc: alex.williamson@redhat.com; Lan, Tianyu <tianyu.lan@intel.com>;
>>>> Tian, Kevin <kevin.tian@intel.com>; mst@redhat.com;
>>>> jan.kiszka@siemens.com; jasowang@redhat.com; bd.aviv@gmail.com; David
>>>> Gibson <david@gibson.dropbear.id.au>; qemu-devel@nongnu.org
>>>> Subject: RE: [Qemu-devel] [PATCH v7 14/17] memory: add
>>>> MemoryRegionIOMMUOps.replay() callback
>>>>
>>>>> -----Original Message-----
>>>>> From: Peter Xu [mailto:peterx@redhat.com]
>>>>> Sent: Monday, March 27, 2017 5:12 PM
>>>>> To: Liu, Yi L <yi.l.liu@intel.com>
>>>>> Cc: alex.williamson@redhat.com; Lan, Tianyu <tianyu.lan@intel.com>;
>>>>> Tian, Kevin <kevin.tian@intel.com>; mst@redhat.com;
>>>>> jan.kiszka@siemens.com; jasowang@redhat.com; bd.aviv@gmail.com;
>>>>> David Gibson <david@gibson.dropbear.id.au>; qemu-devel@nongnu.org
>>>>> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
>>>>> MemoryRegionIOMMUOps.replay() callback
>>>>>
>>>>> On Mon, Mar 27, 2017 at 08:35:05AM +0000, Liu, Yi L wrote:
>>>>>>> -----Original Message-----
>>>>>>> From: Qemu-devel
>>>>>>> [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org] On
>>>>>>> Behalf Of Peter Xu
>>>>>>> Sent: Tuesday, February 7, 2017 4:28 PM
>>>>>>> To: qemu-devel@nongnu.org
>>>>>>> Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin
>>>>>>> <kevin.tian@intel.com>; mst@redhat.com; jan.kiszka@siemens.com;
>>>>>>> jasowang@redhat.com; peterx@redhat.com;
>>>>>>> alex.williamson@redhat.com; bd.aviv@gmail.com; David Gibson
>>>>>>> <david@gibson.dropbear.id.au>
>>>>>>> Subject: [Qemu-devel] [PATCH v7 14/17] memory: add
>>>>>>> MemoryRegionIOMMUOps.replay() callback
>>>>>>>
>>>>>>> Originally we have one memory_region_iommu_replay() function,
>>>>>>> which is the default behavior to replay the translations of the
>>>>>>> whole IOMMU region. However, on some platform like x86, we may
>>>>>>> want our own
>>>>> replay logic for IOMMU regions.
>>>>>>> This patch add one more hook for IOMMUOps for the callback, and
>>>>>>> it'll override the default if set.
>>>>>>>
>>>>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>>>>> ---
>>>>>>>    include/exec/memory.h | 2 ++
>>>>>>>    memory.c              | 6 ++++++
>>>>>>>    2 files changed, 8 insertions(+)
>>>>>>>
>>>>>>> diff --git a/include/exec/memory.h b/include/exec/memory.h index
>>>>>>> 0767888..30b2a74 100644
>>>>>>> --- a/include/exec/memory.h
>>>>>>> +++ b/include/exec/memory.h
>>>>>>> @@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
>>>>>>>        void (*notify_flag_changed)(MemoryRegion *iommu,
>>>>>>>                                    IOMMUNotifierFlag old_flags,
>>>>>>>                                    IOMMUNotifierFlag new_flags);
>>>>>>> +    /* Set this up to provide customized IOMMU replay function */
>>>>>>> +    void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
>>>>>>>    };
>>>>>>>
>>>>>>>    typedef struct CoalescedMemoryRange CoalescedMemoryRange; diff
>>>>>>> --git a/memory.c b/memory.c index 7a4f2f9..9c253cc 100644
>>>>>>> --- a/memory.c
>>>>>>> +++ b/memory.c
>>>>>>> @@ -1630,6 +1630,12 @@ void
>>>>>>> memory_region_iommu_replay(MemoryRegion
>>>>>>> *mr, IOMMUNotifier *n,
>>>>>>>        hwaddr addr, granularity;
>>>>>>>        IOMMUTLBEntry iotlb;
>>>>>>> +    /* If the IOMMU has its own replay callback, override */
>>>>>>> +    if (mr->iommu_ops->replay) {
>>>>>>> +        mr->iommu_ops->replay(mr, n);
>>>>>>> +        return;
>>>>>>> +    }
>>>>>> Hi Alex, Peter,
>>>>>>
>>>>>> Will all the other vendors(e.g. PPC, s390, ARM) add their own
>>>>>> replay callback as well? I guess it depends on whether the original
>>>>>> replay algorithm work well for them? Do you have such knowledge?
>>>>> I guess so. At least for VT-d we had this callback since the default
>>>>> replay mechanism did not work well on x86 due to its extremely large
>>>>> memory region size. Thanks,
>>>> thx. that would make sense.
>>> Peter,
>>>
>>> Just come to mind that there may be a corner case here.
>>>
>>> Intel VT-d actually has a "pt" mode which allows device use physical
>>> address even when VT-d is enabled. In kernel, there is a iommu_identity_mapping.
>>> If a device is in this map, then it would use "pt" mode. So that IOMMU
>>> driver would not build second-level page table for it.
>> Yes, but qemu does not support ECAP_PT now, so guest will still have a page table in
>> this case.
> That's true. Without ECAP_PT, IOMMU driver would create a 1:1 map. So this solution
> can work well even a device is in identify_map.
>
>>> Back to the virtual IOVA implementation, if an assigned device is in
>>> the iommu_identity_mapping(e.g. VGA controller), it uses GPA directly to do DMA.
>>> So it demands a GPA->HPA mapping in host. However, the
>>> iommu->ops.replay is not able to build it when guest SL page table is empty.
>>>
>>> So I think building an entire guest PA->HPA mapping before guest
>>> kernel boot would be recommended. Any thoughts?
>> We plan to add PT in 2.10, a possible rough idea is disabled iommu dmar region and
>> use another region without iommu_ops. Then
>> vfio_listener_region_add() will just do the correct mappings.
> Good to know it. Actually, I also need to expose ECAP_PT for vSVM. So just comes to
> realize that the current replay solution may not work well when I expose ECAP_PT to guest.
> I also have a rough idea here. The current listener in container listens to address space
> named with devfn if virtual VTd is added. How about adding one more listener to listen
> memory address space. So that the listener can build entire guest PA->HPA mapping.

This is only needed for PT, so I think the current code is sufficient to do 
this. See the else part of the if (memory_region_is_iommu()) check in 
vfio_listener_region_add().
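
From memory, that path looks roughly like this (heavily trimmed; treat
it as a sketch of hw/vfio/common.c rather than an exact quote):

static void vfio_listener_region_add(MemoryListener *listener,
                                     MemoryRegionSection *section)
{
    /* ... iova/llsize computation and checks trimmed ... */

    if (memory_region_is_iommu(section->mr)) {
        /* IOMMU-backed region: register a notifier and replay the
         * existing translations through it. */
        /* ... notifier setup trimmed ... */
        memory_region_iommu_replay(giommu->iommu, &giommu->n, false);
        return;
    }

    /* Plain RAM (the "else" part): map the whole section GPA->HPA in
     * one go; this is what would cover the PT case. */
    vaddr = memory_region_get_ram_ptr(section->mr) +
            section->offset_within_region +
            (iova - section->offset_within_address_space);
    ret = vfio_dma_map(container, iova, int128_get64(llsize),
                       vaddr, section->readonly);
    /* ... error handling trimmed ... */
}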

Thanks

>   Also,
> the vfio notifier is registered when changes happen in device address space. However, I
> didn’t check if all the layout changes in memory address space happen before the first
> dynamic map/unmap request from guest. If not, this solution is not practical.
>
> Thanks,
> Yi L
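
For reference on Peter's point quoted above: the stock
memory_region_iommu_replay() walks the whole IOMMU region at page
granularity, which is why it does not scale to the 2^64-byte VT-d
region, and why it produces nothing useful when the guest second-level
page table is empty. A rough sketch (simplified from the memory.c of
that time, not verbatim):

void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
                                bool is_write)
{
    hwaddr addr, granularity;
    IOMMUTLBEntry iotlb;

    granularity = memory_region_iommu_get_min_page_size(mr);

    /* For a 2^64-byte region with 4KiB pages this loop would need on
     * the order of 2^52 translate calls, hence the per-vendor replay()
     * hook that can walk only the guest page table instead. */
    for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
        iotlb = mr->iommu_ops->translate(mr, addr, is_write);
        if (iotlb.perm != IOMMU_NONE) {
            n->notify(n, &iotlb);
        }
        if ((addr + granularity) < addr) {
            break; /* avoid wrap-around on a full-address-space region */
        }
    }
}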

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 14/17] memory: add MemoryRegionIOMMUOps.replay() callback
  2017-03-31  7:16               ` Jason Wang
@ 2017-03-31  7:30                 ` Liu, Yi L
  2017-04-01  5:00                   ` Jason Wang
  0 siblings, 1 reply; 63+ messages in thread
From: Liu, Yi L @ 2017-03-31  7:30 UTC (permalink / raw)
  To: Jason Wang, 'Peter Xu'
  Cc: Lan, Tianyu, Tian, Kevin, 'mst@redhat.com',
	'jan.kiszka@siemens.com', 'bd.aviv@gmail.com',
	'qemu-devel@nongnu.org',
	'alex.williamson@redhat.com', 'David Gibson'

> -----Original Message-----
> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Friday, March 31, 2017 3:17 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; 'Peter Xu' <peterx@redhat.com>
> Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>;
> 'mst@redhat.com' <mst@redhat.com>; 'jan.kiszka@siemens.com'
> <jan.kiszka@siemens.com>; 'bd.aviv@gmail.com' <bd.aviv@gmail.com>; 'qemu-
> devel@nongnu.org' <qemu-devel@nongnu.org>; 'alex.williamson@redhat.com'
> <alex.williamson@redhat.com>; 'David Gibson' <david@gibson.dropbear.id.au>
> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
> MemoryRegionIOMMUOps.replay() callback
> 
> 
> 
> On 2017年03月31日 13:34, Liu, Yi L wrote:
> >> -----Original Message-----
> >> From: Jason Wang [mailto:jasowang@redhat.com]
> >> Sent: Thursday, March 30, 2017 7:58 PM
> >> To: Liu, Yi L <yi.l.liu@intel.com>; 'Peter Xu' <peterx@redhat.com>
> >> Cc: 'alex.williamson@redhat.com' <alex.williamson@redhat.com>; Lan,
> >> Tianyu <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>;
> 'mst@redhat.com'
> >> <mst@redhat.com>; 'jan.kiszka@siemens.com' <jan.kiszka@siemens.com>;
> >> 'bd.aviv@gmail.com' <bd.aviv@gmail.com>; 'David Gibson'
> >> <david@gibson.dropbear.id.au>; 'qemu-devel@nongnu.org' <qemu-
> >> devel@nongnu.org>
> >> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
> >> MemoryRegionIOMMUOps.replay() callback
> >>
> >>
> >>
> >> On 2017年03月30日 19:06, Liu, Yi L wrote:
> >>>> -----Original Message-----
> >>>> From: Liu, Yi L
> >>>> Sent: Monday, March 27, 2017 5:22 PM
> >>>> To: Peter Xu <peterx@redhat.com>
> >>>> Cc: alex.williamson@redhat.com; Lan, Tianyu <tianyu.lan@intel.com>;
> >>>> Tian, Kevin <kevin.tian@intel.com>; mst@redhat.com;
> >>>> jan.kiszka@siemens.com; jasowang@redhat.com; bd.aviv@gmail.com;
> >>>> David Gibson <david@gibson.dropbear.id.au>; qemu-devel@nongnu.org
> >>>> Subject: RE: [Qemu-devel] [PATCH v7 14/17] memory: add
> >>>> MemoryRegionIOMMUOps.replay() callback
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Peter Xu [mailto:peterx@redhat.com]
> >>>>> Sent: Monday, March 27, 2017 5:12 PM
> >>>>> To: Liu, Yi L <yi.l.liu@intel.com>
> >>>>> Cc: alex.williamson@redhat.com; Lan, Tianyu
> >>>>> <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>;
> >>>>> mst@redhat.com; jan.kiszka@siemens.com; jasowang@redhat.com;
> >>>>> bd.aviv@gmail.com; David Gibson <david@gibson.dropbear.id.au>;
> >>>>> qemu-devel@nongnu.org
> >>>>> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
> >>>>> MemoryRegionIOMMUOps.replay() callback
> >>>>>
> >>>>> On Mon, Mar 27, 2017 at 08:35:05AM +0000, Liu, Yi L wrote:
> >>>>>>> -----Original Message-----
> >>>>>>> From: Qemu-devel
> >>>>>>> [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org] On
> >>>>>>> Behalf Of Peter Xu
> >>>>>>> Sent: Tuesday, February 7, 2017 4:28 PM
> >>>>>>> To: qemu-devel@nongnu.org
> >>>>>>> Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin
> >>>>>>> <kevin.tian@intel.com>; mst@redhat.com; jan.kiszka@siemens.com;
> >>>>>>> jasowang@redhat.com; peterx@redhat.com;
> >>>>>>> alex.williamson@redhat.com; bd.aviv@gmail.com; David Gibson
> >>>>>>> <david@gibson.dropbear.id.au>
> >>>>>>> Subject: [Qemu-devel] [PATCH v7 14/17] memory: add
> >>>>>>> MemoryRegionIOMMUOps.replay() callback
> >>>>>>>
> >>>>>>> Originally we have one memory_region_iommu_replay() function,
> >>>>>>> which is the default behavior to replay the translations of the
> >>>>>>> whole IOMMU region. However, on some platform like x86, we may
> >>>>>>> want our own
> >>>>> replay logic for IOMMU regions.
> >>>>>>> This patch add one more hook for IOMMUOps for the callback, and
> >>>>>>> it'll override the default if set.
> >>>>>>>
> >>>>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
> >>>>>>> ---
> >>>>>>>    include/exec/memory.h | 2 ++
> >>>>>>>    memory.c              | 6 ++++++
> >>>>>>>    2 files changed, 8 insertions(+)
> >>>>>>>
> >>>>>>> diff --git a/include/exec/memory.h b/include/exec/memory.h index
> >>>>>>> 0767888..30b2a74 100644
> >>>>>>> --- a/include/exec/memory.h
> >>>>>>> +++ b/include/exec/memory.h
> >>>>>>> @@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
> >>>>>>>        void (*notify_flag_changed)(MemoryRegion *iommu,
> >>>>>>>                                    IOMMUNotifierFlag old_flags,
> >>>>>>>                                    IOMMUNotifierFlag new_flags);
> >>>>>>> +    /* Set this up to provide customized IOMMU replay function */
> >>>>>>> +    void (*replay)(MemoryRegion *iommu, IOMMUNotifier
> >>>>>>> + *notifier);
> >>>>>>>    };
> >>>>>>>
> >>>>>>>    typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> >>>>>>> diff --git a/memory.c b/memory.c index 7a4f2f9..9c253cc 100644
> >>>>>>> --- a/memory.c
> >>>>>>> +++ b/memory.c
> >>>>>>> @@ -1630,6 +1630,12 @@ void
> >>>>>>> memory_region_iommu_replay(MemoryRegion
> >>>>>>> *mr, IOMMUNotifier *n,
> >>>>>>>        hwaddr addr, granularity;
> >>>>>>>        IOMMUTLBEntry iotlb;
> >>>>>>> +    /* If the IOMMU has its own replay callback, override */
> >>>>>>> +    if (mr->iommu_ops->replay) {
> >>>>>>> +        mr->iommu_ops->replay(mr, n);
> >>>>>>> +        return;
> >>>>>>> +    }
> >>>>>> Hi Alex, Peter,
> >>>>>>
> >>>>>> Will all the other vendors(e.g. PPC, s390, ARM) add their own
> >>>>>> replay callback as well? I guess it depends on whether the
> >>>>>> original replay algorithm work well for them? Do you have such knowledge?
> >>>>> I guess so. At least for VT-d we had this callback since the
> >>>>> default replay mechanism did not work well on x86 due to its
> >>>>> extremely large memory region size. Thanks,
> >>>> thx. that would make sense.
> >>> Peter,
> >>>
> >>> Just come to mind that there may be a corner case here.
> >>>
> >>> Intel VT-d actually has a "pt" mode which allows device use physical
> >>> address even when VT-d is enabled. In kernel, there is a
> iommu_identity_mapping.
> >>> If a device is in this map, then it would use "pt" mode. So that
> >>> IOMMU driver would not build second-level page table for it.
> >> Yes, but qemu does not support ECAP_PT now, so guest will still have
> >> a page table in this case.
> > That's true. Without ECAP_PT, IOMMU driver would create a 1:1 map. So
> > this solution can work well even a device is in identify_map.
> >
> >>> Back to the virtual IOVA implementation, if an assigned device is in
> >>> the iommu_identity_mapping(e.g. VGA controller), it uses GPA directly to do
> DMA.
> >>> So it demands a GPA->HPA mapping in host. However, the
> >>> iommu->ops.replay is not able to build it when guest SL page table is empty.
> >>>
> >>> So I think building an entire guest PA->HPA mapping before guest
> >>> kernel boot would be recommended. Any thoughts?
> >> We plan to add PT in 2.10, a possible rough idea is disabled iommu
> >> dmar region and use another region without iommu_ops. Then
> >> vfio_listener_region_add() will just do the correct mappings.
> > Good to know it. Actually, I also need to expose ECAP_PT for vSVM. So
> > just comes to realize that the current replay solution may not work well when I
> expose ECAP_PT to guest.
> > I also have a rough idea here. The current listener in container
> > listens to address space named with devfn if virtual VTd is added. How
> > about adding one more listener to listen memory address space. So that the
> listener can build entire guest PA->HPA mapping.
> 
> This is only needed for PT. So looks like current code is sufficient to do this I think.
> See the else part of if (memory_region_is_iommu()) of vfio_listener_region_add().

Jason, when the listener listens to the device address space, the "else part" may not
work even if we set mr->iommu_ops = NULL. The mr would be a non-RAM region by the
time region_add is called, since the listener is actually listening to changes in the
device address space.
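
To illustrate the point, the per-devfn address space is built roughly like below (a
simplified sketch approximating vtd_find_add_as(); the interrupt-remapping subregion
and other details are omitted, so not the exact intel_iommu.c code):

/* The AS root is the vIOMMU MemoryRegion itself, so a MemoryListener
 * attached to this AS only ever sees an IOMMU (non-RAM) section in
 * region_add(), never plain guest RAM. */
memory_region_init_iommu(&vtd_dev_as->iommu, OBJECT(s),
                         &s->iommu_ops, "intel_iommu", UINT64_MAX);
address_space_init(&vtd_dev_as->as, &vtd_dev_as->iommu, name);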

Regards,
Yi L


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 14/17] memory: add MemoryRegionIOMMUOps.replay() callback
  2017-03-31  7:30                 ` Liu, Yi L
@ 2017-04-01  5:00                   ` Jason Wang
  2017-04-01  6:39                     ` Liu, Yi L
  0 siblings, 1 reply; 63+ messages in thread
From: Jason Wang @ 2017-04-01  5:00 UTC (permalink / raw)
  To: Liu, Yi L, 'Peter Xu'
  Cc: Lan, Tianyu, Tian, Kevin, 'mst@redhat.com',
	'jan.kiszka@siemens.com', 'bd.aviv@gmail.com',
	'qemu-devel@nongnu.org',
	'alex.williamson@redhat.com', 'David Gibson'



On 2017年03月31日 15:30, Liu, Yi L wrote:
>> -----Original Message-----
>> From: Jason Wang [mailto:jasowang@redhat.com]
>> Sent: Friday, March 31, 2017 3:17 PM
>> To: Liu, Yi L <yi.l.liu@intel.com>; 'Peter Xu' <peterx@redhat.com>
>> Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>;
>> 'mst@redhat.com' <mst@redhat.com>; 'jan.kiszka@siemens.com'
>> <jan.kiszka@siemens.com>; 'bd.aviv@gmail.com' <bd.aviv@gmail.com>; 'qemu-
>> devel@nongnu.org' <qemu-devel@nongnu.org>; 'alex.williamson@redhat.com'
>> <alex.williamson@redhat.com>; 'David Gibson' <david@gibson.dropbear.id.au>
>> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
>> MemoryRegionIOMMUOps.replay() callback
>>
>>
>>
>> On 2017年03月31日 13:34, Liu, Yi L wrote:
>>>> -----Original Message-----
>>>> From: Jason Wang [mailto:jasowang@redhat.com]
>>>> Sent: Thursday, March 30, 2017 7:58 PM
>>>> To: Liu, Yi L <yi.l.liu@intel.com>; 'Peter Xu' <peterx@redhat.com>
>>>> Cc: 'alex.williamson@redhat.com' <alex.williamson@redhat.com>; Lan,
>>>> Tianyu <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>;
>> 'mst@redhat.com'
>>>> <mst@redhat.com>; 'jan.kiszka@siemens.com' <jan.kiszka@siemens.com>;
>>>> 'bd.aviv@gmail.com' <bd.aviv@gmail.com>; 'David Gibson'
>>>> <david@gibson.dropbear.id.au>; 'qemu-devel@nongnu.org' <qemu-
>>>> devel@nongnu.org>
>>>> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
>>>> MemoryRegionIOMMUOps.replay() callback
>>>>
>>>>
>>>>
>>>> On 2017年03月30日 19:06, Liu, Yi L wrote:
>>>>>> -----Original Message-----
>>>>>> From: Liu, Yi L
>>>>>> Sent: Monday, March 27, 2017 5:22 PM
>>>>>> To: Peter Xu <peterx@redhat.com>
>>>>>> Cc: alex.williamson@redhat.com; Lan, Tianyu <tianyu.lan@intel.com>;
>>>>>> Tian, Kevin <kevin.tian@intel.com>; mst@redhat.com;
>>>>>> jan.kiszka@siemens.com; jasowang@redhat.com; bd.aviv@gmail.com;
>>>>>> David Gibson <david@gibson.dropbear.id.au>; qemu-devel@nongnu.org
>>>>>> Subject: RE: [Qemu-devel] [PATCH v7 14/17] memory: add
>>>>>> MemoryRegionIOMMUOps.replay() callback
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Peter Xu [mailto:peterx@redhat.com]
>>>>>>> Sent: Monday, March 27, 2017 5:12 PM
>>>>>>> To: Liu, Yi L <yi.l.liu@intel.com>
>>>>>>> Cc: alex.williamson@redhat.com; Lan, Tianyu
>>>>>>> <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>;
>>>>>>> mst@redhat.com; jan.kiszka@siemens.com; jasowang@redhat.com;
>>>>>>> bd.aviv@gmail.com; David Gibson <david@gibson.dropbear.id.au>;
>>>>>>> qemu-devel@nongnu.org
>>>>>>> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
>>>>>>> MemoryRegionIOMMUOps.replay() callback
>>>>>>>
>>>>>>> On Mon, Mar 27, 2017 at 08:35:05AM +0000, Liu, Yi L wrote:
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Qemu-devel
>>>>>>>>> [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org] On
>>>>>>>>> Behalf Of Peter Xu
>>>>>>>>> Sent: Tuesday, February 7, 2017 4:28 PM
>>>>>>>>> To: qemu-devel@nongnu.org
>>>>>>>>> Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin
>>>>>>>>> <kevin.tian@intel.com>; mst@redhat.com; jan.kiszka@siemens.com;
>>>>>>>>> jasowang@redhat.com; peterx@redhat.com;
>>>>>>>>> alex.williamson@redhat.com; bd.aviv@gmail.com; David Gibson
>>>>>>>>> <david@gibson.dropbear.id.au>
>>>>>>>>> Subject: [Qemu-devel] [PATCH v7 14/17] memory: add
>>>>>>>>> MemoryRegionIOMMUOps.replay() callback
>>>>>>>>>
>>>>>>>>> Originally we have one memory_region_iommu_replay() function,
>>>>>>>>> which is the default behavior to replay the translations of the
>>>>>>>>> whole IOMMU region. However, on some platform like x86, we may
>>>>>>>>> want our own
>>>>>>> replay logic for IOMMU regions.
>>>>>>>>> This patch add one more hook for IOMMUOps for the callback, and
>>>>>>>>> it'll override the default if set.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>>>>>>> ---
>>>>>>>>>     include/exec/memory.h | 2 ++
>>>>>>>>>     memory.c              | 6 ++++++
>>>>>>>>>     2 files changed, 8 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/include/exec/memory.h b/include/exec/memory.h index
>>>>>>>>> 0767888..30b2a74 100644
>>>>>>>>> --- a/include/exec/memory.h
>>>>>>>>> +++ b/include/exec/memory.h
>>>>>>>>> @@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
>>>>>>>>>         void (*notify_flag_changed)(MemoryRegion *iommu,
>>>>>>>>>                                     IOMMUNotifierFlag old_flags,
>>>>>>>>>                                     IOMMUNotifierFlag new_flags);
>>>>>>>>> +    /* Set this up to provide customized IOMMU replay function */
>>>>>>>>> +    void (*replay)(MemoryRegion *iommu, IOMMUNotifier
>>>>>>>>> + *notifier);
>>>>>>>>>     };
>>>>>>>>>
>>>>>>>>>     typedef struct CoalescedMemoryRange CoalescedMemoryRange;
>>>>>>>>> diff --git a/memory.c b/memory.c index 7a4f2f9..9c253cc 100644
>>>>>>>>> --- a/memory.c
>>>>>>>>> +++ b/memory.c
>>>>>>>>> @@ -1630,6 +1630,12 @@ void
>>>>>>>>> memory_region_iommu_replay(MemoryRegion
>>>>>>>>> *mr, IOMMUNotifier *n,
>>>>>>>>>         hwaddr addr, granularity;
>>>>>>>>>         IOMMUTLBEntry iotlb;
>>>>>>>>> +    /* If the IOMMU has its own replay callback, override */
>>>>>>>>> +    if (mr->iommu_ops->replay) {
>>>>>>>>> +        mr->iommu_ops->replay(mr, n);
>>>>>>>>> +        return;
>>>>>>>>> +    }
>>>>>>>> Hi Alex, Peter,
>>>>>>>>
>>>>>>>> Will all the other vendors(e.g. PPC, s390, ARM) add their own
>>>>>>>> replay callback as well? I guess it depends on whether the
>>>>>>>> original replay algorithm work well for them? Do you have such knowledge?
>>>>>>> I guess so. At least for VT-d we had this callback since the
>>>>>>> default replay mechanism did not work well on x86 due to its
>>>>>>> extremely large memory region size. Thanks,
>>>>>> thx. that would make sense.
>>>>> Peter,
>>>>>
>>>>> Just come to mind that there may be a corner case here.
>>>>>
>>>>> Intel VT-d actually has a "pt" mode which allows device use physical
>>>>> address even when VT-d is enabled. In kernel, there is a
>> iommu_identity_mapping.
>>>>> If a device is in this map, then it would use "pt" mode. So that
>>>>> IOMMU driver would not build second-level page table for it.
>>>> Yes, but qemu does not support ECAP_PT now, so guest will still have
>>>> a page table in this case.
>>> That's true. Without ECAP_PT, IOMMU driver would create a 1:1 map. So
>>> this solution can work well even a device is in identify_map.
>>>
>>>>> Back to the virtual IOVA implementation, if an assigned device is in
>>>>> the iommu_identity_mapping(e.g. VGA controller), it uses GPA directly to do
>> DMA.
>>>>> So it demands a GPA->HPA mapping in host. However, the
>>>>> iommu->ops.replay is not able to build it when guest SL page table is empty.
>>>>>
>>>>> So I think building an entire guest PA->HPA mapping before guest
>>>>> kernel boot would be recommended. Any thoughts?
>>>> We plan to add PT in 2.10, a possible rough idea is disabled iommu
>>>> dmar region and use another region without iommu_ops. Then
>>>> vfio_listener_region_add() will just do the correct mappings.
>>> Good to know it. Actually, I also need to expose ECAP_PT for vSVM. So
>>> just comes to realize that the current replay solution may not work well when I
>> expose ECAP_PT to guest.
>>> I also have a rough idea here. The current listener in container
>>> listens to address space named with devfn if virtual VTd is added. How
>>> about adding one more listener to listen memory address space. So that the
>> listener can build entire guest PA->HPA mapping.
>>
>> This is only needed for PT. So looks like current code is sufficient to do this I think.
>> See the else part of if (memory_region_is_iommu()) of vfio_listener_region_add().
> Jason, when the listener listen to device address space, the "else part" may not work
> even we set the mr->iommu_ops = NULL. The mr would be a non-ram region when the
> time region_add is called since it is actually listen to changes from device address space.
>
> Regards,
> Yi L
>

See Peter's patch ("intel_iommu: allow dynamic switch of IOMMU region"). 
It has

+        memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
+                                 "vtd_sys_alias", get_system_memory(),
+                                 0, memory_region_size(get_system_memory()));

We can enable sys_alias when PT is used, which should work, I think.
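
Roughly something like this, I would guess (a sketch based on the switch
helper in that patch; the as_uses_pt() check is a placeholder of mine,
not code from the series):

/* Rough sketch of the address-space switch in the dynamic-switch patch.
 * Only one of the two subregions of the per-device root MR is enabled
 * at a time; extending the condition for PT is an assumption about how
 * it could be wired up. */
static void vtd_switch_address_space(VTDAddressSpace *as)
{
    bool use_iommu = as->iommu_state->dmar_enabled && !as_uses_pt(as);

    /* DMAR translation applies: expose the IOMMU region ... */
    memory_region_set_enabled(&as->iommu, use_iommu);
    /* ... otherwise expose system memory directly via vtd_sys_alias, so
     * vfio_listener_region_add() maps the whole guest PA space for PT. */
    memory_region_set_enabled(&as->sys_alias, !use_iommu);
}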

Thanks

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v7 14/17] memory: add MemoryRegionIOMMUOps.replay() callback
  2017-04-01  5:00                   ` Jason Wang
@ 2017-04-01  6:39                     ` Liu, Yi L
  0 siblings, 0 replies; 63+ messages in thread
From: Liu, Yi L @ 2017-04-01  6:39 UTC (permalink / raw)
  To: Jason Wang, 'Peter Xu'
  Cc: Lan, Tianyu, Tian, Kevin, 'mst@redhat.com',
	'jan.kiszka@siemens.com', 'bd.aviv@gmail.com',
	'qemu-devel@nongnu.org',
	'alex.williamson@redhat.com', 'David Gibson'

> -----Original Message-----
> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Saturday, April 1, 2017 1:01 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; 'Peter Xu' <peterx@redhat.com>
> Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>;
> 'mst@redhat.com' <mst@redhat.com>; 'jan.kiszka@siemens.com'
> <jan.kiszka@siemens.com>; 'bd.aviv@gmail.com' <bd.aviv@gmail.com>; 'qemu-
> devel@nongnu.org' <qemu-devel@nongnu.org>; 'alex.williamson@redhat.com'
> <alex.williamson@redhat.com>; 'David Gibson' <david@gibson.dropbear.id.au>
> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
> MemoryRegionIOMMUOps.replay() callback
> 
> 
> 
> On 2017年03月31日 15:30, Liu, Yi L wrote:
> >> -----Original Message-----
> >> From: Jason Wang [mailto:jasowang@redhat.com]
> >> Sent: Friday, March 31, 2017 3:17 PM
> >> To: Liu, Yi L <yi.l.liu@intel.com>; 'Peter Xu' <peterx@redhat.com>
> >> Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin
> >> <kevin.tian@intel.com>; 'mst@redhat.com' <mst@redhat.com>;
> 'jan.kiszka@siemens.com'
> >> <jan.kiszka@siemens.com>; 'bd.aviv@gmail.com' <bd.aviv@gmail.com>;
> >> 'qemu- devel@nongnu.org' <qemu-devel@nongnu.org>;
> 'alex.williamson@redhat.com'
> >> <alex.williamson@redhat.com>; 'David Gibson'
> >> <david@gibson.dropbear.id.au>
> >> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
> >> MemoryRegionIOMMUOps.replay() callback
> >>
> >>
> >>
> >> On 2017年03月31日 13:34, Liu, Yi L wrote:
> >>>> -----Original Message-----
> >>>> From: Jason Wang [mailto:jasowang@redhat.com]
> >>>> Sent: Thursday, March 30, 2017 7:58 PM
> >>>> To: Liu, Yi L <yi.l.liu@intel.com>; 'Peter Xu' <peterx@redhat.com>
> >>>> Cc: 'alex.williamson@redhat.com' <alex.williamson@redhat.com>; Lan,
> >>>> Tianyu <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>;
> >> 'mst@redhat.com'
> >>>> <mst@redhat.com>; 'jan.kiszka@siemens.com'
> >>>> <jan.kiszka@siemens.com>; 'bd.aviv@gmail.com' <bd.aviv@gmail.com>;
> 'David Gibson'
> >>>> <david@gibson.dropbear.id.au>; 'qemu-devel@nongnu.org' <qemu-
> >>>> devel@nongnu.org>
> >>>> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
> >>>> MemoryRegionIOMMUOps.replay() callback
> >>>>
> >>>>
> >>>>
> >>>> On 2017年03月30日 19:06, Liu, Yi L wrote:
> >>>>>> -----Original Message-----
> >>>>>> From: Liu, Yi L
> >>>>>> Sent: Monday, March 27, 2017 5:22 PM
> >>>>>> To: Peter Xu <peterx@redhat.com>
> >>>>>> Cc: alex.williamson@redhat.com; Lan, Tianyu
> >>>>>> <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>;
> >>>>>> mst@redhat.com; jan.kiszka@siemens.com; jasowang@redhat.com;
> >>>>>> bd.aviv@gmail.com; David Gibson <david@gibson.dropbear.id.au>;
> >>>>>> qemu-devel@nongnu.org
> >>>>>> Subject: RE: [Qemu-devel] [PATCH v7 14/17] memory: add
> >>>>>> MemoryRegionIOMMUOps.replay() callback
> >>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Peter Xu [mailto:peterx@redhat.com]
> >>>>>>> Sent: Monday, March 27, 2017 5:12 PM
> >>>>>>> To: Liu, Yi L <yi.l.liu@intel.com>
> >>>>>>> Cc: alex.williamson@redhat.com; Lan, Tianyu
> >>>>>>> <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>;
> >>>>>>> mst@redhat.com; jan.kiszka@siemens.com; jasowang@redhat.com;
> >>>>>>> bd.aviv@gmail.com; David Gibson <david@gibson.dropbear.id.au>;
> >>>>>>> qemu-devel@nongnu.org
> >>>>>>> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
> >>>>>>> MemoryRegionIOMMUOps.replay() callback
> >>>>>>>
> >>>>>>> On Mon, Mar 27, 2017 at 08:35:05AM +0000, Liu, Yi L wrote:
> >>>>>>>>> -----Original Message-----
> >>>>>>>>> From: Qemu-devel
> >>>>>>>>> [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org] On
> >>>>>>>>> Behalf Of Peter Xu
> >>>>>>>>> Sent: Tuesday, February 7, 2017 4:28 PM
> >>>>>>>>> To: qemu-devel@nongnu.org
> >>>>>>>>> Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin
> >>>>>>>>> <kevin.tian@intel.com>; mst@redhat.com;
> >>>>>>>>> jan.kiszka@siemens.com; jasowang@redhat.com;
> >>>>>>>>> peterx@redhat.com; alex.williamson@redhat.com;
> >>>>>>>>> bd.aviv@gmail.com; David Gibson <david@gibson.dropbear.id.au>
> >>>>>>>>> Subject: [Qemu-devel] [PATCH v7 14/17] memory: add
> >>>>>>>>> MemoryRegionIOMMUOps.replay() callback
> >>>>>>>>>
> >>>>>>>>> Originally we have one memory_region_iommu_replay() function,
> >>>>>>>>> which is the default behavior to replay the translations of
> >>>>>>>>> the whole IOMMU region. However, on some platform like x86, we
> >>>>>>>>> may want our own
> >>>>>>> replay logic for IOMMU regions.
> >>>>>>>>> This patch add one more hook for IOMMUOps for the callback,
> >>>>>>>>> and it'll override the default if set.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
> >>>>>>>>> ---
> >>>>>>>>>     include/exec/memory.h | 2 ++
> >>>>>>>>>     memory.c              | 6 ++++++
> >>>>>>>>>     2 files changed, 8 insertions(+)
> >>>>>>>>>
> >>>>>>>>> diff --git a/include/exec/memory.h b/include/exec/memory.h
> >>>>>>>>> index
> >>>>>>>>> 0767888..30b2a74 100644
> >>>>>>>>> --- a/include/exec/memory.h
> >>>>>>>>> +++ b/include/exec/memory.h
> >>>>>>>>> @@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
> >>>>>>>>>         void (*notify_flag_changed)(MemoryRegion *iommu,
> >>>>>>>>>                                     IOMMUNotifierFlag old_flags,
> >>>>>>>>>                                     IOMMUNotifierFlag
> >>>>>>>>> new_flags);
> >>>>>>>>> +    /* Set this up to provide customized IOMMU replay function */
> >>>>>>>>> +    void (*replay)(MemoryRegion *iommu, IOMMUNotifier
> >>>>>>>>> + *notifier);
> >>>>>>>>>     };
> >>>>>>>>>
> >>>>>>>>>     typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> >>>>>>>>> diff --git a/memory.c b/memory.c index 7a4f2f9..9c253cc 100644
> >>>>>>>>> --- a/memory.c
> >>>>>>>>> +++ b/memory.c
> >>>>>>>>> @@ -1630,6 +1630,12 @@ void
> >>>>>>>>> memory_region_iommu_replay(MemoryRegion
> >>>>>>>>> *mr, IOMMUNotifier *n,
> >>>>>>>>>         hwaddr addr, granularity;
> >>>>>>>>>         IOMMUTLBEntry iotlb;
> >>>>>>>>> +    /* If the IOMMU has its own replay callback, override */
> >>>>>>>>> +    if (mr->iommu_ops->replay) {
> >>>>>>>>> +        mr->iommu_ops->replay(mr, n);
> >>>>>>>>> +        return;
> >>>>>>>>> +    }
> >>>>>>>> Hi Alex, Peter,
> >>>>>>>>
> >>>>>>>> Will all the other vendors(e.g. PPC, s390, ARM) add their own
> >>>>>>>> replay callback as well? I guess it depends on whether the
> >>>>>>>> original replay algorithm work well for them? Do you have such
> knowledge?
> >>>>>>> I guess so. At least for VT-d we had this callback since the
> >>>>>>> default replay mechanism did not work well on x86 due to its
> >>>>>>> extremely large memory region size. Thanks,
> >>>>>> thx. that would make sense.
> >>>>> Peter,
> >>>>>
> >>>>> Just come to mind that there may be a corner case here.
> >>>>>
> >>>>> Intel VT-d actually has a "pt" mode which allows device use
> >>>>> physical address even when VT-d is enabled. In kernel, there is a
> >> iommu_identity_mapping.
> >>>>> If a device is in this map, then it would use "pt" mode. So that
> >>>>> IOMMU driver would not build second-level page table for it.
> >>>> Yes, but qemu does not support ECAP_PT now, so guest will still
> >>>> have a page table in this case.
> >>> That's true. Without ECAP_PT, IOMMU driver would create a 1:1 map.
> >>> So this solution can work well even a device is in identify_map.
> >>>
> >>>>> Back to the virtual IOVA implementation, if an assigned device is
> >>>>> in the iommu_identity_mapping(e.g. VGA controller), it uses GPA
> >>>>> directly to do
> >> DMA.
> >>>>> So it demands a GPA->HPA mapping in host. However, the
> >>>>> iommu->ops.replay is not able to build it when guest SL page table is empty.
> >>>>>
> >>>>> So I think building an entire guest PA->HPA mapping before guest
> >>>>> kernel boot would be recommended. Any thoughts?
> >>>> We plan to add PT in 2.10, a possible rough idea is disabled iommu
> >>>> dmar region and use another region without iommu_ops. Then
> >>>> vfio_listener_region_add() will just do the correct mappings.
> >>> Good to know it. Actually, I also need to expose ECAP_PT for vSVM.
> >>> So just comes to realize that the current replay solution may not
> >>> work well when I
> >> expose ECAP_PT to guest.
> >>> I also have a rough idea here. The current listener in container
> >>> listens to address space named with devfn if virtual VTd is added.
> >>> How about adding one more listener to listen memory address space.
> >>> So that the
> >> listener can build entire guest PA->HPA mapping.
> >>
> >> This is only needed for PT. So looks like current code is sufficient to do this I think.
> >> See the else part of if (memory_region_is_iommu()) of vfio_listener_region_add().
> > Jason, when the listener listen to device address space, the "else
> > part" may not work even we set the mr->iommu_ops = NULL. The mr would
> > be a non-ram region when the time region_add is called since it is actually listen to
> changes from device address space.
> >
> > Regards,
> > Yi L
> >
> 
> See Peter's patch ("intel_iommu: allow dynamic switch of IOMMU region").
> It has
> 
> +        memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
> +                                 "vtd_sys_alias", get_system_memory(),
> +                                 0,
> memory_region_size(get_system_memory()));
> 
> We can enable sys_alias in when PT is used which should work I think.

Great. I think it works. Thx.

Regards,
Yi L

^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2017-04-01  6:39 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-07  8:28 [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances Peter Xu
2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 01/17] vfio: trace map/unmap for notify as well Peter Xu
2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 02/17] vfio: introduce vfio_get_vaddr() Peter Xu
2017-02-10  1:12   ` David Gibson
2017-02-10  5:50     ` Peter Xu
2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 03/17] vfio: allow to notify unmap for very large region Peter Xu
2017-02-10  1:13   ` David Gibson
2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 04/17] intel_iommu: add "caching-mode" option Peter Xu
2017-02-10  1:14   ` David Gibson
2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 05/17] intel_iommu: simplify irq region translation Peter Xu
2017-02-10  1:15   ` David Gibson
2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 06/17] intel_iommu: renaming gpa to iova where proper Peter Xu
2017-02-10  1:17   ` David Gibson
2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 07/17] intel_iommu: convert dbg macros to traces for inv Peter Xu
2017-02-08  2:47   ` Jason Wang
2017-02-10  1:19   ` David Gibson
2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 08/17] intel_iommu: convert dbg macros to trace for trans Peter Xu
2017-02-08  2:49   ` Jason Wang
2017-02-10  1:20   ` David Gibson
2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 09/17] intel_iommu: vtd_slpt_level_shift check level Peter Xu
2017-02-10  1:20   ` David Gibson
2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 10/17] memory: add section range info for IOMMU notifier Peter Xu
2017-02-10  2:29   ` David Gibson
2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 11/17] memory: provide IOMMU_NOTIFIER_FOREACH macro Peter Xu
2017-02-10  2:30   ` David Gibson
2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 12/17] memory: provide iommu_replay_all() Peter Xu
2017-02-10  2:31   ` David Gibson
2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 13/17] memory: introduce memory_region_notify_one() Peter Xu
2017-02-10  2:33   ` David Gibson
2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 14/17] memory: add MemoryRegionIOMMUOps.replay() callback Peter Xu
2017-02-10  2:34   ` David Gibson
2017-03-27  8:35   ` Liu, Yi L
2017-03-27  9:12     ` Peter Xu
2017-03-27  9:21       ` Liu, Yi L
2017-03-30 11:06         ` Liu, Yi L
2017-03-30 11:57           ` Jason Wang
2017-03-31  2:56             ` Peter Xu
2017-03-31  4:21               ` Jason Wang
2017-03-31  5:01                 ` Peter Xu
2017-03-31  5:12                   ` Jason Wang
2017-03-31  5:28                     ` Peter Xu
2017-03-31  5:34             ` Liu, Yi L
2017-03-31  7:16               ` Jason Wang
2017-03-31  7:30                 ` Liu, Yi L
2017-04-01  5:00                   ` Jason Wang
2017-04-01  6:39                     ` Liu, Yi L
2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 15/17] intel_iommu: provide its own replay() callback Peter Xu
2017-02-10  2:36   ` David Gibson
2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 16/17] intel_iommu: allow dynamic switch of IOMMU region Peter Xu
2017-02-10  2:38   ` David Gibson
2017-02-07  8:28 ` [Qemu-devel] [PATCH v7 17/17] intel_iommu: enable vfio devices Peter Xu
2017-02-10  6:24   ` Jason Wang
2017-03-16  4:05   ` Peter Xu
2017-03-19 15:34     ` Aviv B.D.
2017-03-20  1:56       ` Peter Xu
2017-03-20  2:12         ` Liu, Yi L
2017-03-20  2:41           ` Peter Xu
2017-02-17 17:18 ` [Qemu-devel] [PATCH v7 00/17] VT-d: vfio enablement and misc enhances Alex Williamson
2017-02-20  7:47   ` Peter Xu
2017-02-20  8:17     ` Liu, Yi L
2017-02-20  8:32       ` Peter Xu
2017-02-20 19:15     ` Alex Williamson
2017-02-28  7:52 ` Peter Xu
