* [Qemu-devel] [PATCH v4 00/10] VT-d: PT (passthrough) mode support and misc fixes
@ 2017-05-19  3:19 Peter Xu
  2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 01/10] memory: tune last param of iommu_ops.translate() Peter Xu
                   ` (10 more replies)
  0 siblings, 11 replies; 18+ messages in thread
From: Peter Xu @ 2017-05-19  3:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S . Tsirkin, David Gibson, yi.l.liu, Marcel Apfelbaum,
	Lan Tianyu, peterx, Jason Wang

Online repo:

  https://github.com/xzpeter/qemu/tree/vtd-passthrough-misc-1

v4:
- remove patch "pc: add 2.10 machine type" since it's merged already
- remove patch "memory: fix address_space_get_iotlb_entry()" since
  the problem is fixed by a much better upstream patch:
  "exec: abstract address_space_do_translate()"
- fix issue that patchew reported
- introduce vtd_ce_type_check(), then use it in
  vtd_dev_to_context_entry() [Jason]
- make vtd_pt_enable_fast_path() return a bool instead of "SUCCESS" [Jason]
- do address space switch for dsi/global ce invalidate [Jason]
- remove detect_pt in vtd_switch_address_space() [Jason]

v3:
- add one patch to provide machine type 2.10 for pc, add r-b from Eduardo
- add r-b/a-b for David on the two memory patches
- add a-b for Paolo on the two memory patches
- remove useless if in vtd_switch_address_space() [Jason]
- check pt_supported when needed [Yi]
- one more patch to check whether dev-iotlb is supported before
  allowing such type of context entry
- enable pt fast path (squashed into current patch 10): when we find
  pt is set up on the first translation, we do an address space switch;
  when it's unset, we capture that via invalidations. [Jason]
- add compat bit for HW_COMPAT_2_9 for "pt" param [Jason]
- one vhost fix for pt (added patch 4)
- faster vhost caching (added patch 12) [Jason]

This series adds support for per-device passthrough mode in the VT-d
emulation, along with some tweaks to the existing code.

Patches 1-2: memory related cleanups.

Patches 3-7: some VT-d cleanups and fixes.

Patch 8: add support for passthrough.

Patch 9: turn pt off for machine type <=2.9, for compatibility.

Patch 10: vhost enhancement when used with passthrough, to pre-cache
          static mappings.

A simple test in PT mode with a 10G NIC reaches line speed.

Please review. Thanks.

Peter Xu (10):
  memory: tune last param of iommu_ops.translate()
  memory: remove the last param in memory_region_iommu_replay()
  x86-iommu: use DeviceClass properties
  intel_iommu: renaming context entry helpers
  intel_iommu: provide vtd_ce_get_type()
  intel_iommu: use IOMMU_ACCESS_FLAG()
  intel_iommu: allow dev-iotlb context entry conditionally
  intel_iommu: support passthrough (PT)
  intel_iommu: turn off pt before 2.9
  vhost: iommu: cache static mapping if there is

 exec.c                         |   3 +-
 hw/alpha/typhoon.c             |   2 +-
 hw/dma/rc4030.c                |   2 +-
 hw/i386/amd_iommu.c            |   4 +-
 hw/i386/intel_iommu.c          | 313 ++++++++++++++++++++++++++++++-----------
 hw/i386/intel_iommu_internal.h |   1 +
 hw/i386/trace-events           |   2 +
 hw/i386/x86-iommu.c            |  48 ++-----
 hw/pci-host/apb.c              |   2 +-
 hw/ppc/spapr_iommu.c           |   2 +-
 hw/s390x/s390-pci-bus.c        |   2 +-
 hw/s390x/s390-pci-inst.c       |   2 +-
 hw/vfio/common.c               |   2 +-
 hw/virtio/trace-events         |   4 +
 hw/virtio/vhost.c              |  49 +++++++
 include/exec/memory.h          |  15 +-
 include/hw/compat.h            |   4 +
 include/hw/i386/x86-iommu.h    |   1 +
 memory.c                       |   7 +-
 19 files changed, 319 insertions(+), 146 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH v4 01/10] memory: tune last param of iommu_ops.translate()
  2017-05-19  3:19 [Qemu-devel] [PATCH v4 00/10] VT-d: PT (passthrough) mode support and misc fixes Peter Xu
@ 2017-05-19  3:19 ` Peter Xu
  2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 02/10] memory: remove the last param in memory_region_iommu_replay() Peter Xu
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Peter Xu @ 2017-05-19  3:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S . Tsirkin, David Gibson, yi.l.liu, Marcel Apfelbaum,
	Lan Tianyu, peterx, Jason Wang, Paolo Bonzini

This patch converts the old "is_write" bool into IOMMUAccessFlags. The
difference is that "is_write" can only express read or write, but
sometimes what we really want is "none" (neither read nor write).
Replay is a good example - during replay, we should not check any RW
permission bits, since that's not an actual IO at all.

CC: Paolo Bonzini <pbonzini@redhat.com>
CC: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 exec.c                   |  3 ++-
 hw/alpha/typhoon.c       |  2 +-
 hw/dma/rc4030.c          |  2 +-
 hw/i386/amd_iommu.c      |  4 ++--
 hw/i386/intel_iommu.c    |  4 ++--
 hw/pci-host/apb.c        |  2 +-
 hw/ppc/spapr_iommu.c     |  2 +-
 hw/s390x/s390-pci-bus.c  |  2 +-
 hw/s390x/s390-pci-inst.c |  2 +-
 include/exec/memory.h    | 10 ++++++++--
 memory.c                 |  3 ++-
 11 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/exec.c b/exec.c
index 96e3ac9..0d479b2 100644
--- a/exec.c
+++ b/exec.c
@@ -485,7 +485,8 @@ static MemoryRegionSection address_space_do_translate(AddressSpace *as,
             break;
         }
 
-        iotlb = mr->iommu_ops->translate(mr, addr, is_write);
+        iotlb = mr->iommu_ops->translate(mr, addr, is_write ?
+                                         IOMMU_WO : IOMMU_RO);
         addr = ((iotlb.translated_addr & ~iotlb.addr_mask)
                 | (addr & iotlb.addr_mask));
         *plen = MIN(*plen, (addr | iotlb.addr_mask) - addr + 1);
diff --git a/hw/alpha/typhoon.c b/hw/alpha/typhoon.c
index f50f5cf..c1cf780 100644
--- a/hw/alpha/typhoon.c
+++ b/hw/alpha/typhoon.c
@@ -664,7 +664,7 @@ static bool window_translate(TyphoonWindow *win, hwaddr addr,
 /* TODO: A translation failure here ought to set PCI error codes on the
    Pchip and generate a machine check interrupt.  */
 static IOMMUTLBEntry typhoon_translate_iommu(MemoryRegion *iommu, hwaddr addr,
-                                             bool is_write)
+                                             IOMMUAccessFlags flag)
 {
     TyphoonPchip *pchip = container_of(iommu, TyphoonPchip, iommu);
     IOMMUTLBEntry ret;
diff --git a/hw/dma/rc4030.c b/hw/dma/rc4030.c
index 0080141..edf9432 100644
--- a/hw/dma/rc4030.c
+++ b/hw/dma/rc4030.c
@@ -489,7 +489,7 @@ static const MemoryRegionOps jazzio_ops = {
 };
 
 static IOMMUTLBEntry rc4030_dma_translate(MemoryRegion *iommu, hwaddr addr,
-                                          bool is_write)
+                                          IOMMUAccessFlags flag)
 {
     rc4030State *s = container_of(iommu, rc4030State, dma_mr);
     IOMMUTLBEntry ret = {
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 329058d..7b6d4ea 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -988,7 +988,7 @@ static inline bool amdvi_is_interrupt_addr(hwaddr addr)
 }
 
 static IOMMUTLBEntry amdvi_translate(MemoryRegion *iommu, hwaddr addr,
-                                     bool is_write)
+                                     IOMMUAccessFlags flag)
 {
     AMDVIAddressSpace *as = container_of(iommu, AMDVIAddressSpace, iommu);
     AMDVIState *s = as->iommu_state;
@@ -1017,7 +1017,7 @@ static IOMMUTLBEntry amdvi_translate(MemoryRegion *iommu, hwaddr addr,
         return ret;
     }
 
-    amdvi_do_translate(as, addr, is_write, &ret);
+    amdvi_do_translate(as, addr, flag & IOMMU_WO, &ret);
     trace_amdvi_translation_result(as->bus_num, PCI_SLOT(as->devfn),
             PCI_FUNC(as->devfn), addr, ret.translated_addr);
     return ret;
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 9ba2162..4a51df8 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2221,7 +2221,7 @@ static void vtd_mem_write(void *opaque, hwaddr addr,
 }
 
 static IOMMUTLBEntry vtd_iommu_translate(MemoryRegion *iommu, hwaddr addr,
-                                         bool is_write)
+                                         IOMMUAccessFlags flag)
 {
     VTDAddressSpace *vtd_as = container_of(iommu, VTDAddressSpace, iommu);
     IntelIOMMUState *s = vtd_as->iommu_state;
@@ -2243,7 +2243,7 @@ static IOMMUTLBEntry vtd_iommu_translate(MemoryRegion *iommu, hwaddr addr,
     }
 
     vtd_do_iommu_translate(vtd_as, vtd_as->bus, vtd_as->devfn, addr,
-                           is_write, &ret);
+                           flag & IOMMU_WO, &ret);
     VTD_DPRINTF(MMU,
                 "bus %"PRIu8 " slot %"PRIu8 " func %"PRIu8 " devfn %"PRIu8
                 " iova 0x%"PRIx64 " hpa 0x%"PRIx64, pci_bus_num(vtd_as->bus),
diff --git a/hw/pci-host/apb.c b/hw/pci-host/apb.c
index edc88f4..2a80f68 100644
--- a/hw/pci-host/apb.c
+++ b/hw/pci-host/apb.c
@@ -209,7 +209,7 @@ static AddressSpace *pbm_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
 
 /* Called from RCU critical section */
 static IOMMUTLBEntry pbm_translate_iommu(MemoryRegion *iommu, hwaddr addr,
-                                         bool is_write)
+                                         IOMMUAccessFlags flag)
 {
     IOMMUState *is = container_of(iommu, IOMMUState, iommu);
     hwaddr baseaddr, offset;
diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 29c80bb..0341bc0 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -111,7 +111,7 @@ static void spapr_tce_free_table(uint64_t *table, int fd, uint32_t nb_table)
 
 /* Called from RCU critical section */
 static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
-                                               bool is_write)
+                                               IOMMUAccessFlags flag)
 {
     sPAPRTCETable *tcet = container_of(iommu, sPAPRTCETable, iommu);
     uint64_t tce;
diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
index 66a6fbe..5651483 100644
--- a/hw/s390x/s390-pci-bus.c
+++ b/hw/s390x/s390-pci-bus.c
@@ -357,7 +357,7 @@ out:
 }
 
 static IOMMUTLBEntry s390_translate_iommu(MemoryRegion *mr, hwaddr addr,
-                                          bool is_write)
+                                          IOMMUAccessFlags flag)
 {
     uint64_t pte;
     uint32_t flags;
diff --git a/hw/s390x/s390-pci-inst.c b/hw/s390x/s390-pci-inst.c
index 314a9cb..8bc7c98 100644
--- a/hw/s390x/s390-pci-inst.c
+++ b/hw/s390x/s390-pci-inst.c
@@ -624,7 +624,7 @@ int rpcit_service_call(S390CPU *cpu, uint8_t r1, uint8_t r2)
 
     mr = &iommu->iommu_mr;
     while (start < end) {
-        entry = mr->iommu_ops->translate(mr, start, 0);
+        entry = mr->iommu_ops->translate(mr, start, IOMMU_NONE);
 
         if (!entry.translated_addr) {
             pbdev->state = ZPCI_FS_ERROR;
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 99e0f54..97fd0c2 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -185,8 +185,14 @@ struct MemoryRegionOps {
 typedef struct MemoryRegionIOMMUOps MemoryRegionIOMMUOps;
 
 struct MemoryRegionIOMMUOps {
-    /* Return a TLB entry that contains a given address. */
-    IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
+    /*
+     * Return a TLB entry that contains a given address. Flag should
+     * be the access permission of this translation operation. We can
+     * set flag to IOMMU_NONE to mean that we don't need any
+     * read/write permission checks, e.g. during region replay.
+     */
+    IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr,
+                               IOMMUAccessFlags flag);
     /* Returns minimum supported page size */
     uint64_t (*get_min_page_size)(MemoryRegion *iommu);
     /* Called when IOMMU Notifier flag changed */
diff --git a/memory.c b/memory.c
index b727f5e..3f0aae8 100644
--- a/memory.c
+++ b/memory.c
@@ -1625,6 +1625,7 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
 {
     hwaddr addr, granularity;
     IOMMUTLBEntry iotlb;
+    IOMMUAccessFlags flag = is_write ? IOMMU_WO : IOMMU_RO;
 
     /* If the IOMMU has its own replay callback, override */
     if (mr->iommu_ops->replay) {
@@ -1635,7 +1636,7 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
     granularity = memory_region_iommu_get_min_page_size(mr);
 
     for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
-        iotlb = mr->iommu_ops->translate(mr, addr, is_write);
+        iotlb = mr->iommu_ops->translate(mr, addr, flag);
         if (iotlb.perm != IOMMU_NONE) {
             n->notify(n, &iotlb);
         }
-- 
2.7.4


* [Qemu-devel] [PATCH v4 02/10] memory: remove the last param in memory_region_iommu_replay()
  2017-05-19  3:19 [Qemu-devel] [PATCH v4 00/10] VT-d: PT (passthrough) mode support and misc fixes Peter Xu
  2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 01/10] memory: tune last param of iommu_ops.translate() Peter Xu
@ 2017-05-19  3:19 ` Peter Xu
  2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 03/10] x86-iommu: use DeviceClass properties Peter Xu
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Peter Xu @ 2017-05-19  3:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S . Tsirkin, David Gibson, yi.l.liu, Marcel Apfelbaum,
	Lan Tianyu, peterx, Jason Wang, Paolo Bonzini

We were always passing "false" here, assuming a read operation, and we
also assumed that IOMMU translations would always have read permission.
A better permission is IOMMU_NONE, since the replay is after all not a
real read operation, but just a page table rebuilding process.

CC: David Gibson <david@gibson.dropbear.id.au>
CC: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/vfio/common.c      | 2 +-
 include/exec/memory.h | 5 +----
 memory.c              | 8 +++-----
 3 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index a8f12ee..b9abe77 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -502,7 +502,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
         QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
 
         memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
-        memory_region_iommu_replay(giommu->iommu, &giommu->n, false);
+        memory_region_iommu_replay(giommu->iommu, &giommu->n);
 
         return;
     }
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 97fd0c2..bfdc685 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -731,11 +731,8 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr,
  *
  * @mr: the memory region to observe
  * @n: the notifier to which to replay iommu mappings
- * @is_write: Whether to treat the replay as a translate "write"
- *     through the iommu
  */
-void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
-                                bool is_write);
+void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n);
 
 /**
  * memory_region_iommu_replay_all: replay existing IOMMU translations
diff --git a/memory.c b/memory.c
index 3f0aae8..0ddc4cc 100644
--- a/memory.c
+++ b/memory.c
@@ -1620,12 +1620,10 @@ uint64_t memory_region_iommu_get_min_page_size(MemoryRegion *mr)
     return TARGET_PAGE_SIZE;
 }
 
-void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
-                                bool is_write)
+void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
 {
     hwaddr addr, granularity;
     IOMMUTLBEntry iotlb;
-    IOMMUAccessFlags flag = is_write ? IOMMU_WO : IOMMU_RO;
 
     /* If the IOMMU has its own replay callback, override */
     if (mr->iommu_ops->replay) {
@@ -1636,7 +1634,7 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
     granularity = memory_region_iommu_get_min_page_size(mr);
 
     for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
-        iotlb = mr->iommu_ops->translate(mr, addr, flag);
+        iotlb = mr->iommu_ops->translate(mr, addr, IOMMU_NONE);
         if (iotlb.perm != IOMMU_NONE) {
             n->notify(n, &iotlb);
         }
@@ -1654,7 +1652,7 @@ void memory_region_iommu_replay_all(MemoryRegion *mr)
     IOMMUNotifier *notifier;
 
     IOMMU_NOTIFIER_FOREACH(notifier, mr) {
-        memory_region_iommu_replay(mr, notifier, false);
+        memory_region_iommu_replay(mr, notifier);
     }
 }
 
-- 
2.7.4


* [Qemu-devel] [PATCH v4 03/10] x86-iommu: use DeviceClass properties
  2017-05-19  3:19 [Qemu-devel] [PATCH v4 00/10] VT-d: PT (passthrough) mode support and misc fixes Peter Xu
  2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 01/10] memory: tune last param of iommu_ops.translate() Peter Xu
  2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 02/10] memory: remove the last param in memory_region_iommu_replay() Peter Xu
@ 2017-05-19  3:19 ` Peter Xu
  2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 04/10] intel_iommu: renaming context entry helpers Peter Xu
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Peter Xu @ 2017-05-19  3:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S . Tsirkin, David Gibson, yi.l.liu, Marcel Apfelbaum,
	Lan Tianyu, peterx, Jason Wang

No reason to keep tens of lines when we can do the same thing far more
concisely.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/x86-iommu.c | 47 +++++++----------------------------------------
 1 file changed, 7 insertions(+), 40 deletions(-)

diff --git a/hw/i386/x86-iommu.c b/hw/i386/x86-iommu.c
index 23dcd3f..02b8825 100644
--- a/hw/i386/x86-iommu.c
+++ b/hw/i386/x86-iommu.c
@@ -88,55 +88,22 @@ static void x86_iommu_realize(DeviceState *dev, Error **errp)
     x86_iommu_set_default(X86_IOMMU_DEVICE(dev));
 }
 
+static Property x86_iommu_properties[] = {
+    DEFINE_PROP_BOOL("intremap", X86IOMMUState, intr_supported, false),
+    DEFINE_PROP_BOOL("device-iotlb", X86IOMMUState, dt_supported, false),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
 static void x86_iommu_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
     dc->realize = x86_iommu_realize;
-}
-
-static bool x86_iommu_intremap_prop_get(Object *o, Error **errp)
-{
-    X86IOMMUState *s = X86_IOMMU_DEVICE(o);
-    return s->intr_supported;
-}
-
-static void x86_iommu_intremap_prop_set(Object *o, bool value, Error **errp)
-{
-    X86IOMMUState *s = X86_IOMMU_DEVICE(o);
-    s->intr_supported = value;
-}
-
-static bool x86_iommu_device_iotlb_prop_get(Object *o, Error **errp)
-{
-    X86IOMMUState *s = X86_IOMMU_DEVICE(o);
-    return s->dt_supported;
-}
-
-static void x86_iommu_device_iotlb_prop_set(Object *o, bool value, Error **errp)
-{
-    X86IOMMUState *s = X86_IOMMU_DEVICE(o);
-    s->dt_supported = value;
-}
-
-static void x86_iommu_instance_init(Object *o)
-{
-    X86IOMMUState *s = X86_IOMMU_DEVICE(o);
-
-    /* By default, do not support IR */
-    s->intr_supported = false;
-    object_property_add_bool(o, "intremap", x86_iommu_intremap_prop_get,
-                             x86_iommu_intremap_prop_set, NULL);
-    s->dt_supported = false;
-    object_property_add_bool(o, "device-iotlb",
-                             x86_iommu_device_iotlb_prop_get,
-                             x86_iommu_device_iotlb_prop_set,
-                             NULL);
+    dc->props = x86_iommu_properties;
 }
 
 static const TypeInfo x86_iommu_info = {
     .name          = TYPE_X86_IOMMU_DEVICE,
     .parent        = TYPE_SYS_BUS_DEVICE,
-    .instance_init = x86_iommu_instance_init,
     .instance_size = sizeof(X86IOMMUState),
     .class_init    = x86_iommu_class_init,
     .class_size    = sizeof(X86IOMMUClass),
-- 
2.7.4


* [Qemu-devel] [PATCH v4 04/10] intel_iommu: renaming context entry helpers
  2017-05-19  3:19 [Qemu-devel] [PATCH v4 00/10] VT-d: PT (passthrough) mode support and misc fixes Peter Xu
                   ` (2 preceding siblings ...)
  2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 03/10] x86-iommu: use DeviceClass properties Peter Xu
@ 2017-05-19  3:19 ` Peter Xu
  2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 05/10] intel_iommu: provide vtd_ce_get_type() Peter Xu
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Peter Xu @ 2017-05-19  3:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S . Tsirkin, David Gibson, yi.l.liu, Marcel Apfelbaum,
	Lan Tianyu, peterx, Jason Wang

The old names are too long and inconsistently ordered. Let's start
using vtd_ce_*() as the naming pattern.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 4a51df8..f06055f 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -512,7 +512,7 @@ static int vtd_get_root_entry(IntelIOMMUState *s, uint8_t index,
     return 0;
 }
 
-static inline bool vtd_context_entry_present(VTDContextEntry *context)
+static inline bool vtd_ce_present(VTDContextEntry *context)
 {
     return context->lo & VTD_CONTEXT_ENTRY_P;
 }
@@ -533,7 +533,7 @@ static int vtd_get_context_entry_from_root(VTDRootEntry *root, uint8_t index,
     return 0;
 }
 
-static inline dma_addr_t vtd_get_slpt_base_from_context(VTDContextEntry *ce)
+static inline dma_addr_t vtd_ce_get_slpt_base(VTDContextEntry *ce)
 {
     return ce->lo & VTD_CONTEXT_ENTRY_SLPTPTR;
 }
@@ -585,19 +585,19 @@ static inline bool vtd_is_level_supported(IntelIOMMUState *s, uint32_t level)
 /* Get the page-table level that hardware should use for the second-level
  * page-table walk from the Address Width field of context-entry.
  */
-static inline uint32_t vtd_get_level_from_context_entry(VTDContextEntry *ce)
+static inline uint32_t vtd_ce_get_level(VTDContextEntry *ce)
 {
     return 2 + (ce->hi & VTD_CONTEXT_ENTRY_AW);
 }
 
-static inline uint32_t vtd_get_agaw_from_context_entry(VTDContextEntry *ce)
+static inline uint32_t vtd_ce_get_agaw(VTDContextEntry *ce)
 {
     return 30 + (ce->hi & VTD_CONTEXT_ENTRY_AW) * 9;
 }
 
 static inline uint64_t vtd_iova_limit(VTDContextEntry *ce)
 {
-    uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
+    uint32_t ce_agaw = vtd_ce_get_agaw(ce);
     return 1ULL << MIN(ce_agaw, VTD_MGAW);
 }
 
@@ -642,8 +642,8 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
                              uint64_t *slptep, uint32_t *slpte_level,
                              bool *reads, bool *writes)
 {
-    dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
-    uint32_t level = vtd_get_level_from_context_entry(ce);
+    dma_addr_t addr = vtd_ce_get_slpt_base(ce);
+    uint32_t level = vtd_ce_get_level(ce);
     uint32_t offset;
     uint64_t slpte;
     uint64_t access_right_check;
@@ -664,7 +664,7 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
             VTD_DPRINTF(GENERAL, "error: fail to access second-level paging "
                         "entry at level %"PRIu32 " for iova 0x%"PRIx64,
                         level, iova);
-            if (level == vtd_get_level_from_context_entry(ce)) {
+            if (level == vtd_ce_get_level(ce)) {
                 /* Invalid programming of context-entry */
                 return -VTD_FR_CONTEXT_ENTRY_INV;
             } else {
@@ -809,8 +809,8 @@ static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
                          vtd_page_walk_hook hook_fn, void *private,
                          bool notify_unmap)
 {
-    dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
-    uint32_t level = vtd_get_level_from_context_entry(ce);
+    dma_addr_t addr = vtd_ce_get_slpt_base(ce);
+    uint32_t level = vtd_ce_get_level(ce);
 
     if (!vtd_iova_range_check(start, ce)) {
         return -VTD_FR_ADDR_BEYOND_MGAW;
@@ -851,7 +851,7 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
         return ret_fr;
     }
 
-    if (!vtd_context_entry_present(ce)) {
+    if (!vtd_ce_present(ce)) {
         /* Not error - it's okay we don't have context entry. */
         trace_vtd_ce_not_present(bus_num, devfn);
         return -VTD_FR_CONTEXT_ENTRY_P;
@@ -861,7 +861,7 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
         return -VTD_FR_CONTEXT_ENTRY_RSVD;
     }
     /* Check if the programming of context-entry is valid */
-    if (!vtd_is_level_supported(s, vtd_get_level_from_context_entry(ce))) {
+    if (!vtd_is_level_supported(s, vtd_ce_get_level(ce))) {
         trace_vtd_ce_invalid(ce->hi, ce->lo);
         return -VTD_FR_CONTEXT_ENTRY_INV;
     } else {
-- 
2.7.4


* [Qemu-devel] [PATCH v4 05/10] intel_iommu: provide vtd_ce_get_type()
  2017-05-19  3:19 [Qemu-devel] [PATCH v4 00/10] VT-d: PT (passthrough) mode support and misc fixes Peter Xu
                   ` (3 preceding siblings ...)
  2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 04/10] intel_iommu: renaming context entry helpers Peter Xu
@ 2017-05-19  3:19 ` Peter Xu
  2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 06/10] intel_iommu: use IOMMU_ACCESS_FLAG() Peter Xu
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Peter Xu @ 2017-05-19  3:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S . Tsirkin, David Gibson, yi.l.liu, Marcel Apfelbaum,
	Lan Tianyu, peterx, Jason Wang

Add a helper to fetch the VT-d context entry type.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index f06055f..b477143 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -595,6 +595,11 @@ static inline uint32_t vtd_ce_get_agaw(VTDContextEntry *ce)
     return 30 + (ce->hi & VTD_CONTEXT_ENTRY_AW) * 9;
 }
 
+static inline uint32_t vtd_ce_get_type(VTDContextEntry *ce)
+{
+    return ce->lo & VTD_CONTEXT_ENTRY_TT;
+}
+
 static inline uint64_t vtd_iova_limit(VTDContextEntry *ce)
 {
     uint32_t ce_agaw = vtd_ce_get_agaw(ce);
@@ -865,7 +870,7 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
         trace_vtd_ce_invalid(ce->hi, ce->lo);
         return -VTD_FR_CONTEXT_ENTRY_INV;
     } else {
-        switch (ce->lo & VTD_CONTEXT_ENTRY_TT) {
+        switch (vtd_ce_get_type(ce)) {
         case VTD_CONTEXT_TT_MULTI_LEVEL:
             /* fall through */
         case VTD_CONTEXT_TT_DEV_IOTLB:
-- 
2.7.4


* [Qemu-devel] [PATCH v4 06/10] intel_iommu: use IOMMU_ACCESS_FLAG()
  2017-05-19  3:19 [Qemu-devel] [PATCH v4 00/10] VT-d: PT (passthrough) mode support and misc fixes Peter Xu
                   ` (4 preceding siblings ...)
  2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 05/10] intel_iommu: provide vtd_ce_get_type() Peter Xu
@ 2017-05-19  3:19 ` Peter Xu
  2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 07/10] intel_iommu: allow dev-iotlb context entry conditionally Peter Xu
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Peter Xu @ 2017-05-19  3:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S . Tsirkin, David Gibson, yi.l.liu, Marcel Apfelbaum,
	Lan Tianyu, peterx, Jason Wang

Now that IOMMU_ACCESS_FLAG() exists, use it instead of open-coding the
permission bits.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index b477143..3240e5d 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1010,7 +1010,7 @@ out:
     entry->iova = addr & page_mask;
     entry->translated_addr = vtd_get_slpte_addr(slpte) & page_mask;
     entry->addr_mask = ~page_mask;
-    entry->perm = (writes ? 2 : 0) + (reads ? 1 : 0);
+    entry->perm = IOMMU_ACCESS_FLAG(reads, writes);
 }
 
 static void vtd_root_table_setup(IntelIOMMUState *s)
-- 
2.7.4


* [Qemu-devel] [PATCH v4 07/10] intel_iommu: allow dev-iotlb context entry conditionally
  2017-05-19  3:19 [Qemu-devel] [PATCH v4 00/10] VT-d: PT (passthrough) mode support and misc fixes Peter Xu
                   ` (5 preceding siblings ...)
  2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 06/10] intel_iommu: use IOMMU_ACCESS_FLAG() Peter Xu
@ 2017-05-19  3:19 ` Peter Xu
  2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 08/10] intel_iommu: support passthrough (PT) Peter Xu
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Peter Xu @ 2017-05-19  3:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S . Tsirkin, David Gibson, yi.l.liu, Marcel Apfelbaum,
	Lan Tianyu, peterx, Jason Wang

When device-iotlb is not supported, we should fail this check. A new
function vtd_ce_type_check() is introduced for that.

While at it, clean up vtd_dev_to_context_entry() a bit - replace the
"else if" chains with direct if checks. That makes the logic clearer.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 49 ++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 36 insertions(+), 13 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 3240e5d..aac2cc7 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -600,6 +600,26 @@ static inline uint32_t vtd_ce_get_type(VTDContextEntry *ce)
     return ce->lo & VTD_CONTEXT_ENTRY_TT;
 }
 
+/* Return true if check passed, otherwise false */
+static inline bool vtd_ce_type_check(X86IOMMUState *x86_iommu,
+                                     VTDContextEntry *ce)
+{
+    switch (vtd_ce_get_type(ce)) {
+    case VTD_CONTEXT_TT_MULTI_LEVEL:
+        /* Always supported */
+        break;
+    case VTD_CONTEXT_TT_DEV_IOTLB:
+        if (!x86_iommu->dt_supported) {
+            return false;
+        }
+        break;
+    default:
+        /* Unknown type */
+        return false;
+    }
+    return true;
+}
+
 static inline uint64_t vtd_iova_limit(VTDContextEntry *ce)
 {
     uint32_t ce_agaw = vtd_ce_get_agaw(ce);
@@ -836,6 +856,7 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
 {
     VTDRootEntry re;
     int ret_fr;
+    X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
 
     ret_fr = vtd_get_root_entry(s, bus_num, &re);
     if (ret_fr) {
@@ -846,7 +867,9 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
         /* Not error - it's okay we don't have root entry. */
         trace_vtd_re_not_present(bus_num);
         return -VTD_FR_ROOT_ENTRY_P;
-    } else if (re.rsvd || (re.val & VTD_ROOT_ENTRY_RSVD)) {
+    }
+
+    if (re.rsvd || (re.val & VTD_ROOT_ENTRY_RSVD)) {
         trace_vtd_re_invalid(re.rsvd, re.val);
         return -VTD_FR_ROOT_ENTRY_RSVD;
     }
@@ -860,26 +883,26 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
         /* Not error - it's okay we don't have context entry. */
         trace_vtd_ce_not_present(bus_num, devfn);
         return -VTD_FR_CONTEXT_ENTRY_P;
-    } else if ((ce->hi & VTD_CONTEXT_ENTRY_RSVD_HI) ||
-               (ce->lo & VTD_CONTEXT_ENTRY_RSVD_LO)) {
+    }
+
+    if ((ce->hi & VTD_CONTEXT_ENTRY_RSVD_HI) ||
+        (ce->lo & VTD_CONTEXT_ENTRY_RSVD_LO)) {
         trace_vtd_ce_invalid(ce->hi, ce->lo);
         return -VTD_FR_CONTEXT_ENTRY_RSVD;
     }
+
     /* Check if the programming of context-entry is valid */
     if (!vtd_is_level_supported(s, vtd_ce_get_level(ce))) {
         trace_vtd_ce_invalid(ce->hi, ce->lo);
         return -VTD_FR_CONTEXT_ENTRY_INV;
-    } else {
-        switch (vtd_ce_get_type(ce)) {
-        case VTD_CONTEXT_TT_MULTI_LEVEL:
-            /* fall through */
-        case VTD_CONTEXT_TT_DEV_IOTLB:
-            break;
-        default:
-            trace_vtd_ce_invalid(ce->hi, ce->lo);
-            return -VTD_FR_CONTEXT_ENTRY_INV;
-        }
     }
+
+    /* Do translation type check */
+    if (!vtd_ce_type_check(x86_iommu, ce)) {
+        trace_vtd_ce_invalid(ce->hi, ce->lo);
+        return -VTD_FR_CONTEXT_ENTRY_INV;
+    }
+
     return 0;
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH v4 08/10] intel_iommu: support passthrough (PT)
  2017-05-19  3:19 [Qemu-devel] [PATCH v4 00/10] VT-d: PT (passthrough) mode support and misc fixes Peter Xu
                   ` (6 preceding siblings ...)
  2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 07/10] intel_iommu: allow dev-iotlb context entry conditionally Peter Xu
@ 2017-05-19  3:19 ` Peter Xu
  2017-05-25 10:40   ` Liu, Yi L
  2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 09/10] intel_iommu: turn off pt before 2.9 Peter Xu
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2017-05-19  3:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S . Tsirkin, David Gibson, yi.l.liu, Marcel Apfelbaum,
	Lan Tianyu, peterx, Jason Wang

Add hardware support for VT-d device passthrough (PT) mode. Although
current Linux can live with iommu=pt even without this, this is faster
than software passthrough.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c          | 231 ++++++++++++++++++++++++++++++-----------
 hw/i386/intel_iommu_internal.h |   1 +
 hw/i386/trace-events           |   2 +
 hw/i386/x86-iommu.c            |   1 +
 include/hw/i386/x86-iommu.h    |   1 +
 5 files changed, 177 insertions(+), 59 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index aac2cc7..15610b9 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -613,6 +613,11 @@ static inline bool vtd_ce_type_check(X86IOMMUState *x86_iommu,
             return false;
         }
         break;
+    case VTD_CONTEXT_TT_PASS_THROUGH:
+        if (!x86_iommu->pt_supported) {
+            return false;
+        }
+        break;
     default:
         /* Unknown type */
         return false;
@@ -660,6 +665,29 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
     }
 }
 
+/* Find the VTD address space associated with a given bus number */
+static VTDBus *vtd_find_as_from_bus_num(IntelIOMMUState *s, uint8_t bus_num)
+{
+    VTDBus *vtd_bus = s->vtd_as_by_bus_num[bus_num];
+    if (!vtd_bus) {
+        /*
+         * Iterate over the registered buses to find the one which
+         * currently holds this bus number, and update the bus_num
+         * lookup table:
+         */
+        GHashTableIter iter;
+
+        g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
+        while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
+            if (pci_bus_num(vtd_bus->bus) == bus_num) {
+                s->vtd_as_by_bus_num[bus_num] = vtd_bus;
+                return vtd_bus;
+            }
+        }
+    }
+    return vtd_bus;
+}
+
 /* Given the @iova, get relevant @slptep. @slpte_level will be the last level
  * of the translation, can be used for deciding the size of large page.
  */
@@ -906,6 +934,91 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
     return 0;
 }
 
+/*
+ * Fetch the translation type for a specific device. Returns <0 if an
+ * error happens; otherwise returns the type to check against
+ * VTD_CONTEXT_TT_*.
+ */
+static int vtd_dev_get_trans_type(VTDAddressSpace *as)
+{
+    IntelIOMMUState *s;
+    VTDContextEntry ce;
+    int ret;
+
+    s = as->iommu_state;
+
+    ret = vtd_dev_to_context_entry(s, pci_bus_num(as->bus),
+                                   as->devfn, &ce);
+    if (ret) {
+        return ret;
+    }
+
+    return vtd_ce_get_type(&ce);
+}
+
+static bool vtd_dev_pt_enabled(VTDAddressSpace *as)
+{
+    int ret;
+
+    assert(as);
+
+    ret = vtd_dev_get_trans_type(as);
+    if (ret < 0) {
+        /*
+         * Possibly failed to parse the context entry for some reason
+         * (e.g., during init, or any guest configuration errors on
+         * context entries). We should assume PT not enabled for
+         * safety.
+         */
+        return false;
+    }
+
+    return ret == VTD_CONTEXT_TT_PASS_THROUGH;
+}
+
+/* Return whether the device is using IOMMU translation. */
+static bool vtd_switch_address_space(VTDAddressSpace *as)
+{
+    bool use_iommu;
+
+    assert(as);
+
+    use_iommu = as->iommu_state->dmar_enabled && !vtd_dev_pt_enabled(as);
+
+    trace_vtd_switch_address_space(pci_bus_num(as->bus),
+                                   VTD_PCI_SLOT(as->devfn),
+                                   VTD_PCI_FUNC(as->devfn),
+                                   use_iommu);
+
+    /* Turn off first then on the other */
+    if (use_iommu) {
+        memory_region_set_enabled(&as->sys_alias, false);
+        memory_region_set_enabled(&as->iommu, true);
+    } else {
+        memory_region_set_enabled(&as->iommu, false);
+        memory_region_set_enabled(&as->sys_alias, true);
+    }
+
+    return use_iommu;
+}
+
+static void vtd_switch_address_space_all(IntelIOMMUState *s)
+{
+    GHashTableIter iter;
+    VTDBus *vtd_bus;
+    int i;
+
+    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
+    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
+        for (i = 0; i < X86_IOMMU_PCI_DEVFN_MAX; i++) {
+            if (!vtd_bus->dev_as[i]) {
+                continue;
+            }
+            vtd_switch_address_space(vtd_bus->dev_as[i]);
+        }
+    }
+}
+
 static inline uint16_t vtd_make_source_id(uint8_t bus_num, uint8_t devfn)
 {
     return ((bus_num & 0xffUL) << 8) | (devfn & 0xffUL);
@@ -943,6 +1056,31 @@ static inline bool vtd_is_interrupt_addr(hwaddr addr)
     return VTD_INTERRUPT_ADDR_FIRST <= addr && addr <= VTD_INTERRUPT_ADDR_LAST;
 }
 
+static void vtd_pt_enable_fast_path(IntelIOMMUState *s, uint16_t source_id)
+{
+    VTDBus *vtd_bus;
+    VTDAddressSpace *vtd_as;
+    bool success = false;
+
+    vtd_bus = vtd_find_as_from_bus_num(s, VTD_SID_TO_BUS(source_id));
+    if (!vtd_bus) {
+        goto out;
+    }
+
+    vtd_as = vtd_bus->dev_as[VTD_SID_TO_DEVFN(source_id)];
+    if (!vtd_as) {
+        goto out;
+    }
+
+    if (vtd_switch_address_space(vtd_as) == false) {
+        /* We switched off IOMMU region successfully. */
+        success = true;
+    }
+
+out:
+    trace_vtd_pt_enable_fast_path(source_id, success);
+}
+
 /* Map dev to context-entry then do a paging-structures walk to do a iommu
  * translation.
  *
@@ -1014,6 +1152,30 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         cc_entry->context_cache_gen = s->context_cache_gen;
     }
 
+    /*
+     * We don't need to translate for pass-through context entries.
+     * Also, let's skip IOTLB caching for PT devices.
+     */
+    if (vtd_ce_get_type(&ce) == VTD_CONTEXT_TT_PASS_THROUGH) {
+        entry->translated_addr = entry->iova;
+        entry->addr_mask = VTD_PAGE_SIZE - 1;
+        entry->perm = IOMMU_RW;
+        trace_vtd_translate_pt(source_id, entry->iova);
+
+        /*
+         * When this happens, it means that caching mode is not
+         * enabled and this is the first passthrough translation for
+         * the device. Let's enable the fast path for passthrough.
+         *
+         * When passthrough is disabled again for the device, we can
+         * capture it via the context entry invalidation, then the
+         * IOMMU region can be swapped back.
+         */
+        vtd_pt_enable_fast_path(s, source_id);
+
+        return;
+    }
+
     ret_fr = vtd_iova_to_slpte(&ce, addr, is_write, &slpte, &level,
                                &reads, &writes);
     if (ret_fr) {
@@ -1083,6 +1245,7 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
     if (s->context_cache_gen == VTD_CONTEXT_CACHE_GEN_MAX) {
         vtd_reset_context_cache(s);
     }
+    vtd_switch_address_space_all(s);
     /*
      * From VT-d spec 6.5.2.1, a global context entry invalidation
      * should be followed by a IOTLB global invalidation, so we should
@@ -1093,29 +1256,6 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
     vtd_iommu_replay_all(s);
 }
 
-
-/* Find the VTD address space currently associated with a given bus number,
- */
-static VTDBus *vtd_find_as_from_bus_num(IntelIOMMUState *s, uint8_t bus_num)
-{
-    VTDBus *vtd_bus = s->vtd_as_by_bus_num[bus_num];
-    if (!vtd_bus) {
-        /* Iterate over the registered buses to find the one
-         * which currently hold this bus number, and update the bus_num lookup table:
-         */
-        GHashTableIter iter;
-
-        g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
-        while (g_hash_table_iter_next (&iter, NULL, (void**)&vtd_bus)) {
-            if (pci_bus_num(vtd_bus->bus) == bus_num) {
-                s->vtd_as_by_bus_num[bus_num] = vtd_bus;
-                return vtd_bus;
-            }
-        }
-    }
-    return vtd_bus;
-}
-
 /* Do a context-cache device-selective invalidation.
  * @func_mask: FM field after shifting
  */
@@ -1158,6 +1298,11 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
                                              VTD_PCI_FUNC(devfn_it));
                 vtd_as->context_cache_entry.context_cache_gen = 0;
                 /*
+                 * Switch the address space when needed, in case the
+                 * device's passthrough bit has been toggled.
+                 */
+                vtd_switch_address_space(vtd_as);
+                /*
                  * So a device is moving out of (or moving into) a
                  * domain, a replay() suites here to notify all the
                  * IOMMU_NOTIFIER_MAP registers about this change.
@@ -1389,42 +1534,6 @@ static void vtd_handle_gcmd_sirtp(IntelIOMMUState *s)
     vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_IRTPS);
 }
 
-static void vtd_switch_address_space(VTDAddressSpace *as)
-{
-    assert(as);
-
-    trace_vtd_switch_address_space(pci_bus_num(as->bus),
-                                   VTD_PCI_SLOT(as->devfn),
-                                   VTD_PCI_FUNC(as->devfn),
-                                   as->iommu_state->dmar_enabled);
-
-    /* Turn off first then on the other */
-    if (as->iommu_state->dmar_enabled) {
-        memory_region_set_enabled(&as->sys_alias, false);
-        memory_region_set_enabled(&as->iommu, true);
-    } else {
-        memory_region_set_enabled(&as->iommu, false);
-        memory_region_set_enabled(&as->sys_alias, true);
-    }
-}
-
-static void vtd_switch_address_space_all(IntelIOMMUState *s)
-{
-    GHashTableIter iter;
-    VTDBus *vtd_bus;
-    int i;
-
-    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
-    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
-        for (i = 0; i < X86_IOMMU_PCI_DEVFN_MAX; i++) {
-            if (!vtd_bus->dev_as[i]) {
-                continue;
-            }
-            vtd_switch_address_space(vtd_bus->dev_as[i]);
-        }
-    }
-}
-
 /* Handle Translation Enable/Disable */
 static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
 {
@@ -2872,6 +2981,10 @@ static void vtd_init(IntelIOMMUState *s)
         s->ecap |= VTD_ECAP_DT;
     }
 
+    if (x86_iommu->pt_supported) {
+        s->ecap |= VTD_ECAP_PT;
+    }
+
     if (s->caching_mode) {
         s->cap |= VTD_CAP_CM;
     }
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 29d6707..0e73a65 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -187,6 +187,7 @@
 /* Interrupt Remapping support */
 #define VTD_ECAP_IR                 (1ULL << 3)
 #define VTD_ECAP_EIM                (1ULL << 4)
+#define VTD_ECAP_PT                 (1ULL << 6)
 #define VTD_ECAP_MHMV               (15ULL << 20)
 
 /* CAP_REG */
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 04a6980..72556da 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -38,6 +38,8 @@ vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"P
 vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
 vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
 vtd_as_unmap_whole(uint8_t bus, uint8_t slot, uint8_t fn, uint64_t iova, uint64_t size) "Device %02x:%02x.%x start 0x%"PRIx64" size 0x%"PRIx64
+vtd_translate_pt(uint16_t sid, uint64_t addr) "source id 0x%"PRIx16", iova 0x%"PRIx64
+vtd_pt_enable_fast_path(uint16_t sid, bool success) "sid 0x%"PRIx16" %d"
 
 # hw/i386/amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
diff --git a/hw/i386/x86-iommu.c b/hw/i386/x86-iommu.c
index 02b8825..293caf8 100644
--- a/hw/i386/x86-iommu.c
+++ b/hw/i386/x86-iommu.c
@@ -91,6 +91,7 @@ static void x86_iommu_realize(DeviceState *dev, Error **errp)
 static Property x86_iommu_properties[] = {
     DEFINE_PROP_BOOL("intremap", X86IOMMUState, intr_supported, false),
     DEFINE_PROP_BOOL("device-iotlb", X86IOMMUState, dt_supported, false),
+    DEFINE_PROP_BOOL("pt", X86IOMMUState, pt_supported, true),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/include/hw/i386/x86-iommu.h b/include/hw/i386/x86-iommu.h
index 361c07c..ef89c0c 100644
--- a/include/hw/i386/x86-iommu.h
+++ b/include/hw/i386/x86-iommu.h
@@ -74,6 +74,7 @@ struct X86IOMMUState {
     SysBusDevice busdev;
     bool intr_supported;        /* Whether vIOMMU supports IR */
     bool dt_supported;          /* Whether vIOMMU supports DT */
+    bool pt_supported;          /* Whether vIOMMU supports pass-through */
     IommuType type;             /* IOMMU type - AMD/Intel     */
     QLIST_HEAD(, IEC_Notifier) iec_notifiers; /* IEC notify list */
 };
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH v4 09/10] intel_iommu: turn off pt before 2.9
  2017-05-19  3:19 [Qemu-devel] [PATCH v4 00/10] VT-d: PT (passthrough) mode support and misc fixes Peter Xu
                   ` (7 preceding siblings ...)
  2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 08/10] intel_iommu: support passthrough (PT) Peter Xu
@ 2017-05-19  3:19 ` Peter Xu
  2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 10/10] vhost: iommu: cache static mapping if there is Peter Xu
  2017-05-25  8:16 ` [Qemu-devel] [PATCH v4 00/10] VT-d: PT (passthrough) mode support and misc fixes Jason Wang
  10 siblings, 0 replies; 18+ messages in thread
From: Peter Xu @ 2017-05-19  3:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S . Tsirkin, David Gibson, yi.l.liu, Marcel Apfelbaum,
	Lan Tianyu, peterx, Jason Wang

Turn the "pt" property off for machine types 2.9 and older, so that
guest-visible hardware stays compatible across QEMU versions.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/hw/compat.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/hw/compat.h b/include/hw/compat.h
index 55b1765..4c53d60 100644
--- a/include/hw/compat.h
+++ b/include/hw/compat.h
@@ -6,6 +6,10 @@
         .driver   = "pci-bridge",\
         .property = "shpc",\
         .value    = "off",\
+    },{\
+        .driver   = "intel-iommu",\
+        .property = "pt",\
+        .value    = "off",\
     },
 
 #define HW_COMPAT_2_8 \
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH v4 10/10] vhost: iommu: cache static mapping if there is
  2017-05-19  3:19 [Qemu-devel] [PATCH v4 00/10] VT-d: PT (passthrough) mode support and misc fixes Peter Xu
                   ` (8 preceding siblings ...)
  2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 09/10] intel_iommu: turn off pt before 2.9 Peter Xu
@ 2017-05-19  3:19 ` Peter Xu
  2017-05-19 16:55   ` Michael S. Tsirkin
  2017-05-25  8:16 ` [Qemu-devel] [PATCH v4 00/10] VT-d: PT (passthrough) mode support and misc fixes Jason Wang
  10 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2017-05-19  3:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S . Tsirkin, David Gibson, yi.l.liu, Marcel Apfelbaum,
	Lan Tianyu, peterx, Jason Wang

This patch pre-heats the vhost IOTLB cache when passthrough mode is
enabled.

Sometimes, even if the user specified iommu_platform for vhost devices,
the IOMMU might still be disabled. One such case is passthrough mode in
the VT-d implementation. We can detect this by observing iommu_list: if
it is empty, IOMMU translation is disabled, and we can pre-heat the
translation (it will be a static mapping then) by first invalidating the
whole IOTLB, then caching the existing memory ranges into the vhost
backend IOTLB using a 1:1 mapping.
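The 1:1 pre-heat loop boils down to inserting one identity entry per
guest memory region. A hedged, self-contained sketch with hypothetical
types (the real code goes through vhost_ops->vhost_update_device_iotlb):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-ins for vhost_memory_region / an IOTLB entry. */
typedef struct {
    uint64_t guest_phys_addr;
    uint64_t size;
} Region;

typedef struct {
    uint64_t iova, target, size;
} IotlbEntry;

/* With translation disabled, each region maps 1:1, so every region can
 * be cached into the device IOTLB up front -- one entry per region. */
static size_t preheat(const Region *regs, size_t n, IotlbEntry *out)
{
    for (size_t i = 0; i < n; i++) {
        out[i].iova = regs[i].guest_phys_addr;    /* identity mapping */
        out[i].target = regs[i].guest_phys_addr;
        out[i].size = regs[i].size;
    }
    return n;   /* cached entry count == region count */
}
```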

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/virtio/trace-events |  4 ++++
 hw/virtio/vhost.c      | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index 1f7a7c1..54dcbb3 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -24,3 +24,7 @@ virtio_balloon_handle_output(const char *name, uint64_t gpa) "section name: %s g
 virtio_balloon_get_config(uint32_t num_pages, uint32_t actual) "num_pages: %d actual: %d"
 virtio_balloon_set_config(uint32_t actual, uint32_t oldactual) "actual: %d oldactual: %d"
 virtio_balloon_to_target(uint64_t target, uint32_t num_pages) "balloon target: %"PRIx64" num_pages: %d"
+
+# hw/virtio/vhost.c
+vhost_iommu_commit(void) ""
+vhost_iommu_static_preheat(void) ""
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 03a46a7..8069135 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -27,6 +27,7 @@
 #include "hw/virtio/virtio-access.h"
 #include "migration/blocker.h"
 #include "sysemu/dma.h"
+#include "trace.h"
 
 /* enabled until disconnected backend stabilizes */
 #define _VHOST_DEBUG 1
@@ -730,6 +731,11 @@ static void vhost_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     }
 }
 
+static bool vhost_iommu_mr_enabled(struct vhost_dev *dev)
+{
+    return !QLIST_EMPTY(&dev->iommu_list);
+}
+
 static void vhost_iommu_region_add(MemoryListener *listener,
                                    MemoryRegionSection *section)
 {
@@ -782,6 +788,48 @@ static void vhost_iommu_region_del(MemoryListener *listener,
     }
 }
 
+static void vhost_iommu_commit(MemoryListener *listener)
+{
+    struct vhost_dev *dev = container_of(listener, struct vhost_dev,
+                                         iommu_listener);
+    struct vhost_memory_region *r;
+    int i;
+
+    trace_vhost_iommu_commit();
+
+    if (!vhost_iommu_mr_enabled(dev)) {
+        /*
+         * This means iommu_platform is enabled, but the iommu memory
+         * region is disabled, e.g., when device passthrough is set up.
+         * Then, no translation is needed anymore.
+         *
+         * Let's first invalidate the whole IOTLB, then pre-heat the
+         * static mapping by looping over the vhost memory ranges.
+         */
+
+        if (dev->vhost_ops->vhost_invalidate_device_iotlb(dev, 0,
+                                                          UINT64_MAX)) {
+            error_report("%s: flush existing IOTLB failed", __func__);
+            return;
+        }
+
+        for (i = 0; i < dev->mem->nregions; i++) {
+            r = &dev->mem->regions[i];
+            /* Vhost regions are writable RAM, so IOMMU_RW is suitable. */
+            if (dev->vhost_ops->vhost_update_device_iotlb(dev,
+                                                          r->guest_phys_addr,
+                                                          r->userspace_addr,
+                                                          r->memory_size,
+                                                          IOMMU_RW)) {
+                error_report("%s: pre-heat static mapping failed", __func__);
+                return;
+            }
+        }
+
+        trace_vhost_iommu_static_preheat();
+    }
+}
+
 static void vhost_region_nop(MemoryListener *listener,
                              MemoryRegionSection *section)
 {
@@ -1298,6 +1346,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
     hdev->iommu_listener = (MemoryListener) {
         .region_add = vhost_iommu_region_add,
         .region_del = vhost_iommu_region_del,
+        .commit = vhost_iommu_commit,
     };
 
     if (hdev->migration_blocker == NULL) {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 10/10] vhost: iommu: cache static mapping if there is
  2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 10/10] vhost: iommu: cache static mapping if there is Peter Xu
@ 2017-05-19 16:55   ` Michael S. Tsirkin
  2017-05-22  2:30     ` Jason Wang
  2017-05-22  2:42     ` Peter Xu
  0 siblings, 2 replies; 18+ messages in thread
From: Michael S. Tsirkin @ 2017-05-19 16:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, David Gibson, yi.l.liu, Marcel Apfelbaum, Lan Tianyu,
	Jason Wang

On Fri, May 19, 2017 at 11:19:49AM +0800, Peter Xu wrote:
> This patch pre-heat vhost iotlb cache when passthrough mode enabled.
> 
> Sometimes, even if user specified iommu_platform for vhost devices,
> IOMMU might still be disabled. One case is passthrough mode in VT-d
> implementation. We can detect this by observing iommu_list. If it's
> empty, it means IOMMU translation is disabled, then we can actually
> pre-heat the translation (it'll be static mapping then) by first
> invalidating all IOTLB, then cache existing memory ranges into vhost
> backend iotlb using 1:1 mapping.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

I don't really understand. Is this a performance optimization?
Can you post some #s please?

Also, if it's PT, can't we bypass iommu altogether? That would be
even faster ...

> ---
>  hw/virtio/trace-events |  4 ++++
>  hw/virtio/vhost.c      | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 53 insertions(+)
> 
> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> index 1f7a7c1..54dcbb3 100644
> --- a/hw/virtio/trace-events
> +++ b/hw/virtio/trace-events
> @@ -24,3 +24,7 @@ virtio_balloon_handle_output(const char *name, uint64_t gpa) "section name: %s g
>  virtio_balloon_get_config(uint32_t num_pages, uint32_t actual) "num_pages: %d actual: %d"
>  virtio_balloon_set_config(uint32_t actual, uint32_t oldactual) "actual: %d oldactual: %d"
>  virtio_balloon_to_target(uint64_t target, uint32_t num_pages) "balloon target: %"PRIx64" num_pages: %d"
> +
> +# hw/virtio/vhost.c
> +vhost_iommu_commit(void) ""
> +vhost_iommu_static_preheat(void) ""
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 03a46a7..8069135 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -27,6 +27,7 @@
>  #include "hw/virtio/virtio-access.h"
>  #include "migration/blocker.h"
>  #include "sysemu/dma.h"
> +#include "trace.h"
>  
>  /* enabled until disconnected backend stabilizes */
>  #define _VHOST_DEBUG 1
> @@ -730,6 +731,11 @@ static void vhost_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>      }
>  }
>  
> +static bool vhost_iommu_mr_enabled(struct vhost_dev *dev)
> +{
> +    return !QLIST_EMPTY(&dev->iommu_list);
> +}
> +
>  static void vhost_iommu_region_add(MemoryListener *listener,
>                                     MemoryRegionSection *section)
>  {
> @@ -782,6 +788,48 @@ static void vhost_iommu_region_del(MemoryListener *listener,
>      }
>  }
>  
> +static void vhost_iommu_commit(MemoryListener *listener)
> +{
> +    struct vhost_dev *dev = container_of(listener, struct vhost_dev,
> +                                         iommu_listener);
> +    struct vhost_memory_region *r;
> +    int i;
> +
> +    trace_vhost_iommu_commit();
> +
> +    if (!vhost_iommu_mr_enabled(dev)) {
> +        /*
> +        * This means iommu_platform is enabled, however iommu memory
> +        * region is disabled, e.g., when device passthrough is setup.
> +        * Then, no translation is needed any more.
> +        *
> +        * Let's first invalidate the whole IOTLB, then pre-heat the
> +        * static mapping by looping over vhost memory ranges.
> +        */
> +
> +        if (dev->vhost_ops->vhost_invalidate_device_iotlb(dev, 0,
> +                                                          UINT64_MAX)) {
> +            error_report("%s: flush existing IOTLB failed", __func__);
> +            return;
> +        }
> +
> +        for (i = 0; i < dev->mem->nregions; i++) {
> +            r = &dev->mem->regions[i];
> +            /* Vhost regions are writable RAM, so IOMMU_RW suites. */
> +            if (dev->vhost_ops->vhost_update_device_iotlb(dev,
> +                                                          r->guest_phys_addr,
> +                                                          r->userspace_addr,
> +                                                          r->memory_size,
> +                                                          IOMMU_RW)) {
> +                error_report("%s: pre-heat static mapping failed", __func__);
> +                return;
> +            }
> +        }
> +
> +        trace_vhost_iommu_static_preheat();
> +    }
> +}
> +
>  static void vhost_region_nop(MemoryListener *listener,
>                               MemoryRegionSection *section)
>  {
> @@ -1298,6 +1346,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
>      hdev->iommu_listener = (MemoryListener) {
>          .region_add = vhost_iommu_region_add,
>          .region_del = vhost_iommu_region_del,
> +        .commit = vhost_iommu_commit,
>      };
>  
>      if (hdev->migration_blocker == NULL) {
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 10/10] vhost: iommu: cache static mapping if there is
  2017-05-19 16:55   ` Michael S. Tsirkin
@ 2017-05-22  2:30     ` Jason Wang
  2017-05-22  2:42     ` Peter Xu
  1 sibling, 0 replies; 18+ messages in thread
From: Jason Wang @ 2017-05-22  2:30 UTC (permalink / raw)
  To: Michael S. Tsirkin, Peter Xu
  Cc: qemu-devel, David Gibson, yi.l.liu, Marcel Apfelbaum, Lan Tianyu



On 2017-05-20 00:55, Michael S. Tsirkin wrote:
> On Fri, May 19, 2017 at 11:19:49AM +0800, Peter Xu wrote:
>> This patch pre-heat vhost iotlb cache when passthrough mode enabled.
>>
>> Sometimes, even if user specified iommu_platform for vhost devices,
>> IOMMU might still be disabled. One case is passthrough mode in VT-d
>> implementation. We can detect this by observing iommu_list. If it's
>> empty, it means IOMMU translation is disabled, then we can actually
>> pre-heat the translation (it'll be static mapping then) by first
>> invalidating all IOTLB, then cache existing memory ranges into vhost
>> backend iotlb using 1:1 mapping.
>>
>> Signed-off-by: Peter Xu<peterx@redhat.com>
> I don't really understand. Is this a performance optimization?
> Can you post some #s please?
>
> Also, if it's PT, can't we bypass iommu altogether?

The problem is that a device can be moved between domains, which means
we need a new notifier to notify vhost to enable or disable IOMMU_PLATFORM.

> That would be
> even faster ...
>

It should be the same (except for the first access in non-CM mode), since
we pass and use vhost_memory_regions just as in the non-iommu case.

Thanks

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 10/10] vhost: iommu: cache static mapping if there is
  2017-05-19 16:55   ` Michael S. Tsirkin
  2017-05-22  2:30     ` Jason Wang
@ 2017-05-22  2:42     ` Peter Xu
  2017-05-25 18:14       ` Michael S. Tsirkin
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Xu @ 2017-05-22  2:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, David Gibson, yi.l.liu, Marcel Apfelbaum, Lan Tianyu,
	Jason Wang

On Fri, May 19, 2017 at 07:55:26PM +0300, Michael S. Tsirkin wrote:
> On Fri, May 19, 2017 at 11:19:49AM +0800, Peter Xu wrote:
> > This patch pre-heat vhost iotlb cache when passthrough mode enabled.
> > 
> > Sometimes, even if user specified iommu_platform for vhost devices,
> > IOMMU might still be disabled. One case is passthrough mode in VT-d
> > implementation. We can detect this by observing iommu_list. If it's
> > empty, it means IOMMU translation is disabled, then we can actually
> > pre-heat the translation (it'll be static mapping then) by first
> > invalidating all IOTLB, then cache existing memory ranges into vhost
> > backend iotlb using 1:1 mapping.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> I don't really understand. Is this a performance optimization?
> Can you post some #s please?

Yes, it is. Vhost can work even without this patch, but it should be
faster with it.

As mentioned in the commit message and in comment [1] below, this patch
pre-heats the cache for vhost. Currently the cache entries depend on the
system memory ranges (dev->mem->nregions), and that count should be far
smaller than vhost's cache capacity (currently statically defined as
max_iotlb_entries=2048 in the kernel). With the current patch, these
cache entries can cover the whole DMA range that PT mode allows, so we
won't have any cache misses then.
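To illustrate the coverage argument (a hedged sketch, not QEMU or vhost
code; the 2048 figure is the kernel's max_iotlb_entries): once one entry
per memory region is pre-cached, any IOVA inside those regions hits:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t iova, size; } CachedRange;

/* With PT, one cached entry per region (a handful of regions, far
 * below the 2048-entry IOTLB limit) covers every valid DMA address,
 * so lookups over the pre-heated cache never miss. */
static bool iotlb_hit(const CachedRange *c, size_t n, uint64_t iova)
{
    for (size_t i = 0; i < n; i++) {
        if (iova >= c[i].iova && iova - c[i].iova < c[i].size) {
            return true;
        }
    }
    return false;
}
```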

For the comments, do you have any better suggestions besides the commit
message and [1]?

> 
> Also, if it's PT, can't we bypass iommu altogether? That would be
> even faster ...

Yes, but I don't yet know a good way to do it... Any suggestions are
welcome as well.

Btw, do you have any comment on other patches besides this one? Since
this patch can really be isolated from the whole PT support series.

Thanks,

> 
> > ---
> >  hw/virtio/trace-events |  4 ++++
> >  hw/virtio/vhost.c      | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 53 insertions(+)
> > 
> > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > index 1f7a7c1..54dcbb3 100644
> > --- a/hw/virtio/trace-events
> > +++ b/hw/virtio/trace-events
> > @@ -24,3 +24,7 @@ virtio_balloon_handle_output(const char *name, uint64_t gpa) "section name: %s g
> >  virtio_balloon_get_config(uint32_t num_pages, uint32_t actual) "num_pages: %d actual: %d"
> >  virtio_balloon_set_config(uint32_t actual, uint32_t oldactual) "actual: %d oldactual: %d"
> >  virtio_balloon_to_target(uint64_t target, uint32_t num_pages) "balloon target: %"PRIx64" num_pages: %d"
> > +
> > +# hw/virtio/vhost.c
> > +vhost_iommu_commit(void) ""
> > +vhost_iommu_static_preheat(void) ""
> > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> > index 03a46a7..8069135 100644
> > --- a/hw/virtio/vhost.c
> > +++ b/hw/virtio/vhost.c
> > @@ -27,6 +27,7 @@
> >  #include "hw/virtio/virtio-access.h"
> >  #include "migration/blocker.h"
> >  #include "sysemu/dma.h"
> > +#include "trace.h"
> >  
> >  /* enabled until disconnected backend stabilizes */
> >  #define _VHOST_DEBUG 1
> > @@ -730,6 +731,11 @@ static void vhost_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> >      }
> >  }
> >  
> > +static bool vhost_iommu_mr_enabled(struct vhost_dev *dev)
> > +{
> > +    return !QLIST_EMPTY(&dev->iommu_list);
> > +}
> > +
> >  static void vhost_iommu_region_add(MemoryListener *listener,
> >                                     MemoryRegionSection *section)
> >  {
> > @@ -782,6 +788,48 @@ static void vhost_iommu_region_del(MemoryListener *listener,
> >      }
> >  }
> >  
> > +static void vhost_iommu_commit(MemoryListener *listener)
> > +{
> > +    struct vhost_dev *dev = container_of(listener, struct vhost_dev,
> > +                                         iommu_listener);
> > +    struct vhost_memory_region *r;
> > +    int i;
> > +
> > +    trace_vhost_iommu_commit();
> > +
> > +    if (!vhost_iommu_mr_enabled(dev)) {
> > +        /*
> > +        * This means iommu_platform is enabled, however iommu memory
> > +        * region is disabled, e.g., when device passthrough is setup.
> > +        * Then, no translation is needed any more.
> > +        *
> > +        * Let's first invalidate the whole IOTLB, then pre-heat the
> > +        * static mapping by looping over vhost memory ranges.
> > +        */

[1]

> > +
> > +        if (dev->vhost_ops->vhost_invalidate_device_iotlb(dev, 0,
> > +                                                          UINT64_MAX)) {
> > +            error_report("%s: flush existing IOTLB failed", __func__);
> > +            return;
> > +        }
> > +
> > +        for (i = 0; i < dev->mem->nregions; i++) {
> > +            r = &dev->mem->regions[i];
> > +            /* Vhost regions are writable RAM, so IOMMU_RW suites. */
> > +            if (dev->vhost_ops->vhost_update_device_iotlb(dev,
> > +                                                          r->guest_phys_addr,
> > +                                                          r->userspace_addr,
> > +                                                          r->memory_size,
> > +                                                          IOMMU_RW)) {
> > +                error_report("%s: pre-heat static mapping failed", __func__);
> > +                return;
> > +            }
> > +        }
> > +
> > +        trace_vhost_iommu_static_preheat();
> > +    }
> > +}
> > +
> >  static void vhost_region_nop(MemoryListener *listener,
> >                               MemoryRegionSection *section)
> >  {
> > @@ -1298,6 +1346,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
> >      hdev->iommu_listener = (MemoryListener) {
> >          .region_add = vhost_iommu_region_add,
> >          .region_del = vhost_iommu_region_del,
> > +        .commit = vhost_iommu_commit,
> >      };
> >  
> >      if (hdev->migration_blocker == NULL) {
> > -- 
> > 2.7.4

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v4 00/10] VT-d: PT (passthrough) mode support and misc fixes
  2017-05-19  3:19 [Qemu-devel] [PATCH v4 00/10] VT-d: PT (passthrough) mode support and misc fixes Peter Xu
                   ` (9 preceding siblings ...)
  2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 10/10] vhost: iommu: cache static mapping if there is Peter Xu
@ 2017-05-25  8:16 ` Jason Wang
  10 siblings, 0 replies; 18+ messages in thread
From: Jason Wang @ 2017-05-25  8:16 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Michael S . Tsirkin, David Gibson, yi.l.liu, Marcel Apfelbaum,
	Lan Tianyu



On 2017-05-19 11:19, Peter Xu wrote:
> Online repo:
>
>    https://github.com/xzpeter/qemu/tree/vtd-passthrough-misc-1
>
> v4
> - remove patch "pc: add 2.10 machine type" since it's merged already
> - remove patch "memory: fix address_space_get_iotlb_entry()" since
>    the problem is fixed by a much better upstream patch:
>    "exec: abstract address_space_do_translate()"
> - fix issue that patchew reported
> - introduce vtd_ce_type_check(), then use it in
>    vtd_dev_to_context_entry() [Jason]
> - vtd_pt_enable_fast_path() don't use "SUCCESS" but bool [Jason]
> - do address space switch for dsi/global ce invalidate [Jason]
> - remove detect_pt in vtd_switch_address_space() [Jason]
>
> v3:
> - add one patch to provide machine type 2.10 for pc, add r-b for Eduardo
> - add r-b/a-b for David on the two memory patches
> - add a-b for Paolo on the two memory patches
> - remove useless if in vtd_switch_address_space() [Jason]
> - check pt_supported when needed [Yi]
> - one more patch to check whether dev-iotlb is supported before
>    allowing such type of context entry
> - enable pt fast path (squashed into current patch 10). when we found
>    pt is setup on the first translation, we do address space switch.
>    When it's unset, we can capture it via invalidations. [Jason]
> - add compat bit for HW_COMPAT_2_9 for "pt" param [Jason]
> - one vhost fix for pt (added patch 4)
> - faster vhost caching (added patch 12) [Jason]
>
> This series adds support for per-device passthrough mode for VT-d
> emulation, along with some tweaks to existing code.
>
> Patches 1-2: memory related cleanups.
>
> Patches 3-7: some VT-d cleanups and fixes.
>
> Patch 8: add support for passthrough.
>
> Patch 9: turn pt off for machine type <=2.9, for compatibility.
>
> Patch 10: vhost enhancement when used with passthrough, to pre-cache
>            static mappings.
>
> A simple test with PT mode using 10g nic is line speed.
>
> Please review. Thanks.
>
> Peter Xu (10):
>    memory: tune last param of iommu_ops.translate()
>    memory: remove the last param in memory_region_iommu_replay()
>    x86-iommu: use DeviceClass properties
>    intel_iommu: renaming context entry helpers
>    intel_iommu: provide vtd_ce_get_type()
>    intel_iommu: use IOMMU_ACCESS_FLAG()
>    intel_iommu: allow dev-iotlb context entry conditionally
>    intel_iommu: support passthrough (PT)
>    intel_iommu: turn off pt before 2.9
>    vhost: iommu: cache static mapping if there is
>
>   exec.c                         |   3 +-
>   hw/alpha/typhoon.c             |   2 +-
>   hw/dma/rc4030.c                |   2 +-
>   hw/i386/amd_iommu.c            |   4 +-
>   hw/i386/intel_iommu.c          | 313 ++++++++++++++++++++++++++++++-----------
>   hw/i386/intel_iommu_internal.h |   1 +
>   hw/i386/trace-events           |   2 +
>   hw/i386/x86-iommu.c            |  48 ++-----
>   hw/pci-host/apb.c              |   2 +-
>   hw/ppc/spapr_iommu.c           |   2 +-
>   hw/s390x/s390-pci-bus.c        |   2 +-
>   hw/s390x/s390-pci-inst.c       |   2 +-
>   hw/vfio/common.c               |   2 +-
>   hw/virtio/trace-events         |   4 +
>   hw/virtio/vhost.c              |  49 +++++++
>   include/exec/memory.h          |  15 +-
>   include/hw/compat.h            |   4 +
>   include/hw/i386/x86-iommu.h    |   1 +
>   memory.c                       |   7 +-
>   19 files changed, 319 insertions(+), 146 deletions(-)
>

For the series

Reviewed-by: Jason Wang <jasowang@redhat.com>

Thanks


* Re: [Qemu-devel] [PATCH v4 08/10] intel_iommu: support passthrough (PT)
  2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 08/10] intel_iommu: support passthrough (PT) Peter Xu
@ 2017-05-25 10:40   ` Liu, Yi L
  0 siblings, 0 replies; 18+ messages in thread
From: Liu, Yi L @ 2017-05-25 10:40 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Lan Tianyu, yi.l.liu, Michael S . Tsirkin,
	Jason Wang, Marcel Apfelbaum, David Gibson

On Fri, May 19, 2017 at 11:19:47AM +0800, Peter Xu wrote:

Reviewed-by: Liu, Yi L <yi.l.liu@linux.intel.com>

Regards,
Yi L

> Hardware support for VT-d device passthrough. Although current Linux can
> live with iommu=pt even without this, but this is faster than when using
> software passthrough.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/i386/intel_iommu.c          | 231 ++++++++++++++++++++++++++++++-----------
>  hw/i386/intel_iommu_internal.h |   1 +
>  hw/i386/trace-events           |   2 +
>  hw/i386/x86-iommu.c            |   1 +
>  include/hw/i386/x86-iommu.h    |   1 +
>  5 files changed, 177 insertions(+), 59 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index aac2cc7..15610b9 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -613,6 +613,11 @@ static inline bool vtd_ce_type_check(X86IOMMUState *x86_iommu,
>              return false;
>          }
>          break;
> +    case VTD_CONTEXT_TT_PASS_THROUGH:
> +        if (!x86_iommu->pt_supported) {
> +            return false;
> +        }
> +        break;
>      default:
>          /* Unknwon type */
>          return false;
> @@ -660,6 +665,29 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
>      }
>  }
>  
> +/* Find the VTD address space associated with a given bus number */
> +static VTDBus *vtd_find_as_from_bus_num(IntelIOMMUState *s, uint8_t bus_num)
> +{
> +    VTDBus *vtd_bus = s->vtd_as_by_bus_num[bus_num];
> +    if (!vtd_bus) {
> +        /*
> +         * Iterate over the registered buses to find the one which
> +         * currently hold this bus number, and update the bus_num
> +         * lookup table:
> +         */
> +        GHashTableIter iter;
> +
> +        g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
> +        while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
> +            if (pci_bus_num(vtd_bus->bus) == bus_num) {
> +                s->vtd_as_by_bus_num[bus_num] = vtd_bus;
> +                return vtd_bus;
> +            }
> +        }
> +    }
> +    return vtd_bus;
> +}
> +
>  /* Given the @iova, get relevant @slptep. @slpte_level will be the last level
>   * of the translation, can be used for deciding the size of large page.
>   */
> @@ -906,6 +934,91 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>      return 0;
>  }
>  
> +/*
> + * Fetch translation type for specific device. Returns <0 if error
> + * happens, otherwise return the shifted type to check against
> + * VTD_CONTEXT_TT_*.
> + */
> +static int vtd_dev_get_trans_type(VTDAddressSpace *as)
> +{
> +    IntelIOMMUState *s;
> +    VTDContextEntry ce;
> +    int ret;
> +
> +    s = as->iommu_state;
> +
> +    ret = vtd_dev_to_context_entry(s, pci_bus_num(as->bus),
> +                                   as->devfn, &ce);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return vtd_ce_get_type(&ce);
> +}
> +
> +static bool vtd_dev_pt_enabled(VTDAddressSpace *as)
> +{
> +    int ret;
> +
> +    assert(as);
> +
> +    ret = vtd_dev_get_trans_type(as);
> +    if (ret < 0) {
> +        /*
> +         * Possibly failed to parse the context entry for some reason
> +         * (e.g., during init, or any guest configuration errors on
> +         * context entries). We should assume PT not enabled for
> +         * safety.
> +         */
> +        return false;
> +    }
> +
> +    return ret == VTD_CONTEXT_TT_PASS_THROUGH;
> +}
> +
> +/* Return whether the device is using IOMMU translation. */
> +static bool vtd_switch_address_space(VTDAddressSpace *as)
> +{
> +    bool use_iommu;
> +
> +    assert(as);
> +
> +    use_iommu = as->iommu_state->dmar_enabled & !vtd_dev_pt_enabled(as);
> +
> +    trace_vtd_switch_address_space(pci_bus_num(as->bus),
> +                                   VTD_PCI_SLOT(as->devfn),
> +                                   VTD_PCI_FUNC(as->devfn),
> +                                   use_iommu);
> +
> +    /* Turn off first then on the other */
> +    if (use_iommu) {
> +        memory_region_set_enabled(&as->sys_alias, false);
> +        memory_region_set_enabled(&as->iommu, true);
> +    } else {
> +        memory_region_set_enabled(&as->iommu, false);
> +        memory_region_set_enabled(&as->sys_alias, true);
> +    }
> +
> +    return use_iommu;
> +}
> +
> +static void vtd_switch_address_space_all(IntelIOMMUState *s)
> +{
> +    GHashTableIter iter;
> +    VTDBus *vtd_bus;
> +    int i;
> +
> +    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
> +    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
> +        for (i = 0; i < X86_IOMMU_PCI_DEVFN_MAX; i++) {
> +            if (!vtd_bus->dev_as[i]) {
> +                continue;
> +            }
> +            vtd_switch_address_space(vtd_bus->dev_as[i]);
> +        }
> +    }
> +}
> +
>  static inline uint16_t vtd_make_source_id(uint8_t bus_num, uint8_t devfn)
>  {
>      return ((bus_num & 0xffUL) << 8) | (devfn & 0xffUL);
> @@ -943,6 +1056,31 @@ static inline bool vtd_is_interrupt_addr(hwaddr addr)
>      return VTD_INTERRUPT_ADDR_FIRST <= addr && addr <= VTD_INTERRUPT_ADDR_LAST;
>  }
>  
> +static void vtd_pt_enable_fast_path(IntelIOMMUState *s, uint16_t source_id)
> +{
> +    VTDBus *vtd_bus;
> +    VTDAddressSpace *vtd_as;
> +    bool success = false;
> +
> +    vtd_bus = vtd_find_as_from_bus_num(s, VTD_SID_TO_BUS(source_id));
> +    if (!vtd_bus) {
> +        goto out;
> +    }
> +
> +    vtd_as = vtd_bus->dev_as[VTD_SID_TO_DEVFN(source_id)];
> +    if (!vtd_as) {
> +        goto out;
> +    }
> +
> +    if (vtd_switch_address_space(vtd_as) == false) {
> +        /* We switched off IOMMU region successfully. */
> +        success = true;
> +    }
> +
> +out:
> +    trace_vtd_pt_enable_fast_path(source_id, success);
> +}
> +
>  /* Map dev to context-entry then do a paging-structures walk to do a iommu
>   * translation.
>   *
> @@ -1014,6 +1152,30 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>          cc_entry->context_cache_gen = s->context_cache_gen;
>      }
>  
> +    /*
> +     * We don't need to translate for pass-through context entries.
> +     * Also, let's ignore IOTLB caching as well for PT devices.
> +     */
> +    if (vtd_ce_get_type(&ce) == VTD_CONTEXT_TT_PASS_THROUGH) {
> +        entry->translated_addr = entry->iova;
> +        entry->addr_mask = VTD_PAGE_SIZE - 1;
> +        entry->perm = IOMMU_RW;
> +        trace_vtd_translate_pt(source_id, entry->iova);
> +
> +        /*
> +         * When this happens, it means firstly caching-mode is not
> +         * enabled, and this is the first passthrough translation for
> +         * the device. Let's enable the fast path for passthrough.
> +         *
> +         * When passthrough is disabled again for the device, we can
> +         * capture it via the context entry invalidation, then the
> +         * IOMMU region can be swapped back.
> +         */
> +        vtd_pt_enable_fast_path(s, source_id);
> +
> +        return;
> +    }
> +
>      ret_fr = vtd_iova_to_slpte(&ce, addr, is_write, &slpte, &level,
>                                 &reads, &writes);
>      if (ret_fr) {
> @@ -1083,6 +1245,7 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
>      if (s->context_cache_gen == VTD_CONTEXT_CACHE_GEN_MAX) {
>          vtd_reset_context_cache(s);
>      }
> +    vtd_switch_address_space_all(s);
>      /*
>       * From VT-d spec 6.5.2.1, a global context entry invalidation
>       * should be followed by a IOTLB global invalidation, so we should
> @@ -1093,29 +1256,6 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
>      vtd_iommu_replay_all(s);
>  }
>  
> -
> -/* Find the VTD address space currently associated with a given bus number,
> - */
> -static VTDBus *vtd_find_as_from_bus_num(IntelIOMMUState *s, uint8_t bus_num)
> -{
> -    VTDBus *vtd_bus = s->vtd_as_by_bus_num[bus_num];
> -    if (!vtd_bus) {
> -        /* Iterate over the registered buses to find the one
> -         * which currently hold this bus number, and update the bus_num lookup table:
> -         */
> -        GHashTableIter iter;
> -
> -        g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
> -        while (g_hash_table_iter_next (&iter, NULL, (void**)&vtd_bus)) {
> -            if (pci_bus_num(vtd_bus->bus) == bus_num) {
> -                s->vtd_as_by_bus_num[bus_num] = vtd_bus;
> -                return vtd_bus;
> -            }
> -        }
> -    }
> -    return vtd_bus;
> -}
> -
>  /* Do a context-cache device-selective invalidation.
>   * @func_mask: FM field after shifting
>   */
> @@ -1158,6 +1298,11 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
>                                               VTD_PCI_FUNC(devfn_it));
>                  vtd_as->context_cache_entry.context_cache_gen = 0;
>                  /*
> +                 * Do switch address space when needed, in case if the
> +                 * device passthrough bit is switched.
> +                 */
> +                vtd_switch_address_space(vtd_as);
> +                /*
>                   * So a device is moving out of (or moving into) a
>                   * domain, a replay() suites here to notify all the
>                   * IOMMU_NOTIFIER_MAP registers about this change.
> @@ -1389,42 +1534,6 @@ static void vtd_handle_gcmd_sirtp(IntelIOMMUState *s)
>      vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_IRTPS);
>  }
>  
> -static void vtd_switch_address_space(VTDAddressSpace *as)
> -{
> -    assert(as);
> -
> -    trace_vtd_switch_address_space(pci_bus_num(as->bus),
> -                                   VTD_PCI_SLOT(as->devfn),
> -                                   VTD_PCI_FUNC(as->devfn),
> -                                   as->iommu_state->dmar_enabled);
> -
> -    /* Turn off first then on the other */
> -    if (as->iommu_state->dmar_enabled) {
> -        memory_region_set_enabled(&as->sys_alias, false);
> -        memory_region_set_enabled(&as->iommu, true);
> -    } else {
> -        memory_region_set_enabled(&as->iommu, false);
> -        memory_region_set_enabled(&as->sys_alias, true);
> -    }
> -}
> -
> -static void vtd_switch_address_space_all(IntelIOMMUState *s)
> -{
> -    GHashTableIter iter;
> -    VTDBus *vtd_bus;
> -    int i;
> -
> -    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
> -    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
> -        for (i = 0; i < X86_IOMMU_PCI_DEVFN_MAX; i++) {
> -            if (!vtd_bus->dev_as[i]) {
> -                continue;
> -            }
> -            vtd_switch_address_space(vtd_bus->dev_as[i]);
> -        }
> -    }
> -}
> -
>  /* Handle Translation Enable/Disable */
>  static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
>  {
> @@ -2872,6 +2981,10 @@ static void vtd_init(IntelIOMMUState *s)
>          s->ecap |= VTD_ECAP_DT;
>      }
>  
> +    if (x86_iommu->pt_supported) {
> +        s->ecap |= VTD_ECAP_PT;
> +    }
> +
>      if (s->caching_mode) {
>          s->cap |= VTD_CAP_CM;
>      }
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 29d6707..0e73a65 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -187,6 +187,7 @@
>  /* Interrupt Remapping support */
>  #define VTD_ECAP_IR                 (1ULL << 3)
>  #define VTD_ECAP_EIM                (1ULL << 4)
> +#define VTD_ECAP_PT                 (1ULL << 6)
>  #define VTD_ECAP_MHMV               (15ULL << 20)
>  
>  /* CAP_REG */
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index 04a6980..72556da 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -38,6 +38,8 @@ vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"P
>  vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
>  vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
>  vtd_as_unmap_whole(uint8_t bus, uint8_t slot, uint8_t fn, uint64_t iova, uint64_t size) "Device %02x:%02x.%x start 0x%"PRIx64" size 0x%"PRIx64
> +vtd_translate_pt(uint16_t sid, uint64_t addr) "source id 0x%"PRIu16", iova 0x%"PRIx64
> +vtd_pt_enable_fast_path(uint16_t sid, bool success) "sid 0x%"PRIu16" %d"
>  
>  # hw/i386/amd_iommu.c
>  amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
> diff --git a/hw/i386/x86-iommu.c b/hw/i386/x86-iommu.c
> index 02b8825..293caf8 100644
> --- a/hw/i386/x86-iommu.c
> +++ b/hw/i386/x86-iommu.c
> @@ -91,6 +91,7 @@ static void x86_iommu_realize(DeviceState *dev, Error **errp)
>  static Property x86_iommu_properties[] = {
>      DEFINE_PROP_BOOL("intremap", X86IOMMUState, intr_supported, false),
>      DEFINE_PROP_BOOL("device-iotlb", X86IOMMUState, dt_supported, false),
> +    DEFINE_PROP_BOOL("pt", X86IOMMUState, pt_supported, true),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> diff --git a/include/hw/i386/x86-iommu.h b/include/hw/i386/x86-iommu.h
> index 361c07c..ef89c0c 100644
> --- a/include/hw/i386/x86-iommu.h
> +++ b/include/hw/i386/x86-iommu.h
> @@ -74,6 +74,7 @@ struct X86IOMMUState {
>      SysBusDevice busdev;
>      bool intr_supported;        /* Whether vIOMMU supports IR */
>      bool dt_supported;          /* Whether vIOMMU supports DT */
> +    bool pt_supported;          /* Whether vIOMMU supports pass-through */
>      IommuType type;             /* IOMMU type - AMD/Intel     */
>      QLIST_HEAD(, IEC_Notifier) iec_notifiers; /* IEC notify list */
>  };
> -- 
> 2.7.4
> 
> 


* Re: [Qemu-devel] [PATCH v4 10/10] vhost: iommu: cache static mapping if there is
  2017-05-22  2:42     ` Peter Xu
@ 2017-05-25 18:14       ` Michael S. Tsirkin
  2017-05-29  4:29         ` Peter Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2017-05-25 18:14 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, David Gibson, yi.l.liu, Marcel Apfelbaum, Lan Tianyu,
	Jason Wang

On Mon, May 22, 2017 at 10:42:00AM +0800, Peter Xu wrote:
> On Fri, May 19, 2017 at 07:55:26PM +0300, Michael S. Tsirkin wrote:
> > On Fri, May 19, 2017 at 11:19:49AM +0800, Peter Xu wrote:
> > > This patch pre-heat vhost iotlb cache when passthrough mode enabled.
> > > 
> > > Sometimes, even if user specified iommu_platform for vhost devices,
> > > IOMMU might still be disabled. One case is passthrough mode in VT-d
> > > implementation. We can detect this by observing iommu_list. If it's
> > > empty, it means IOMMU translation is disabled, then we can actually
> > > pre-heat the translation (it'll be static mapping then) by first
> > > invalidating all IOTLB, then cache existing memory ranges into vhost
> > > backend iotlb using 1:1 mapping.
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > 
> > I don't really understand. Is this a performance optimization?
> > Can you post some #s please?
> 
> Yes, it is. Vhost can work even without this patch, but it should be
> faster when with this patch.

You'll have to include perf testing numbers then.

> As mentioned in the commit message and below comment [1], this patch
> pre-heat the cache for vhost. Currently the cache entries depends on
> the system memory ranges (dev->mem->nregions), and it should be far
> smaller than vhost's cache count (currently it is statically defined
> as max_iotlb_entries=2048 in kernel). If with current patch, these
> cache entries can cover the whole possible DMA ranges that PT mode
> would allow, so we won't have any cache miss then.
> 
> For the comments, do you have any better suggestion besides commit
> message and [1]?
> 
> > 
> > Also, if it's PT, can't we bypass iommu altogether? That would be
> > even faster ...
> 
> Yes, but I don't yet know a good way to do it... Any suggestion is
> welcomed as well.
> 
> Btw, do you have any comment on other patches besides this one? Since
> this patch can really be isolated from the whole PT support series.
> 
> Thanks,

I've applied the rest of the series.

> > 
> > > ---
> > >  hw/virtio/trace-events |  4 ++++
> > >  hw/virtio/vhost.c      | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 53 insertions(+)
> > > 
> > > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > > index 1f7a7c1..54dcbb3 100644
> > > --- a/hw/virtio/trace-events
> > > +++ b/hw/virtio/trace-events
> > > @@ -24,3 +24,7 @@ virtio_balloon_handle_output(const char *name, uint64_t gpa) "section name: %s g
> > >  virtio_balloon_get_config(uint32_t num_pages, uint32_t actual) "num_pages: %d actual: %d"
> > >  virtio_balloon_set_config(uint32_t actual, uint32_t oldactual) "actual: %d oldactual: %d"
> > >  virtio_balloon_to_target(uint64_t target, uint32_t num_pages) "balloon target: %"PRIx64" num_pages: %d"
> > > +
> > > +# hw/virtio/vhost.c
> > > +vhost_iommu_commit(void) ""
> > > +vhost_iommu_static_preheat(void) ""
> > > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> > > index 03a46a7..8069135 100644
> > > --- a/hw/virtio/vhost.c
> > > +++ b/hw/virtio/vhost.c
> > > @@ -27,6 +27,7 @@
> > >  #include "hw/virtio/virtio-access.h"
> > >  #include "migration/blocker.h"
> > >  #include "sysemu/dma.h"
> > > +#include "trace.h"
> > >  
> > >  /* enabled until disconnected backend stabilizes */
> > >  #define _VHOST_DEBUG 1
> > > @@ -730,6 +731,11 @@ static void vhost_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> > >      }
> > >  }
> > >  
> > > +static bool vhost_iommu_mr_enabled(struct vhost_dev *dev)
> > > +{
> > > +    return !QLIST_EMPTY(&dev->iommu_list);
> > > +}
> > > +
> > >  static void vhost_iommu_region_add(MemoryListener *listener,
> > >                                     MemoryRegionSection *section)
> > >  {
> > > @@ -782,6 +788,48 @@ static void vhost_iommu_region_del(MemoryListener *listener,
> > >      }
> > >  }
> > >  
> > > +static void vhost_iommu_commit(MemoryListener *listener)
> > > +{
> > > +    struct vhost_dev *dev = container_of(listener, struct vhost_dev,
> > > +                                         iommu_listener);
> > > +    struct vhost_memory_region *r;
> > > +    int i;
> > > +
> > > +    trace_vhost_iommu_commit();
> > > +
> > > +    if (!vhost_iommu_mr_enabled(dev)) {
> > > +        /*
> > > +        * This means iommu_platform is enabled, however iommu memory
> > > +        * region is disabled, e.g., when device passthrough is setup.
> > > +        * Then, no translation is needed any more.
> > > +        *
> > > +        * Let's first invalidate the whole IOTLB, then pre-heat the
> > > +        * static mapping by looping over vhost memory ranges.
> > > +        */
> 
> [1]
> 
> > > +
> > > +        if (dev->vhost_ops->vhost_invalidate_device_iotlb(dev, 0,
> > > +                                                          UINT64_MAX)) {
> > > +            error_report("%s: flush existing IOTLB failed", __func__);
> > > +            return;
> > > +        }
> > > +
> > > +        for (i = 0; i < dev->mem->nregions; i++) {
> > > +            r = &dev->mem->regions[i];
> > > +            /* Vhost regions are writable RAM, so IOMMU_RW suites. */
> > > +            if (dev->vhost_ops->vhost_update_device_iotlb(dev,
> > > +                                                          r->guest_phys_addr,
> > > +                                                          r->userspace_addr,
> > > +                                                          r->memory_size,
> > > +                                                          IOMMU_RW)) {
> > > +                error_report("%s: pre-heat static mapping failed", __func__);
> > > +                return;
> > > +            }
> > > +        }
> > > +
> > > +        trace_vhost_iommu_static_preheat();
> > > +    }
> > > +}
> > > +
> > >  static void vhost_region_nop(MemoryListener *listener,
> > >                               MemoryRegionSection *section)
> > >  {
> > > @@ -1298,6 +1346,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
> > >      hdev->iommu_listener = (MemoryListener) {
> > >          .region_add = vhost_iommu_region_add,
> > >          .region_del = vhost_iommu_region_del,
> > > +        .commit = vhost_iommu_commit,
> > >      };
> > >  
> > >      if (hdev->migration_blocker == NULL) {
> > > -- 
> > > 2.7.4
> 
> -- 
> Peter Xu


* Re: [Qemu-devel] [PATCH v4 10/10] vhost: iommu: cache static mapping if there is
  2017-05-25 18:14       ` Michael S. Tsirkin
@ 2017-05-29  4:29         ` Peter Xu
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Xu @ 2017-05-29  4:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, David Gibson, yi.l.liu, Marcel Apfelbaum, Lan Tianyu,
	Jason Wang

On Thu, May 25, 2017 at 09:14:55PM +0300, Michael S. Tsirkin wrote:
> On Mon, May 22, 2017 at 10:42:00AM +0800, Peter Xu wrote:
> > On Fri, May 19, 2017 at 07:55:26PM +0300, Michael S. Tsirkin wrote:
> > > On Fri, May 19, 2017 at 11:19:49AM +0800, Peter Xu wrote:
> > > > This patch pre-heat vhost iotlb cache when passthrough mode enabled.
> > > > 
> > > > Sometimes, even if user specified iommu_platform for vhost devices,
> > > > IOMMU might still be disabled. One case is passthrough mode in VT-d
> > > > implementation. We can detect this by observing iommu_list. If it's
> > > > empty, it means IOMMU translation is disabled, then we can actually
> > > > pre-heat the translation (it'll be static mapping then) by first
> > > > invalidating all IOTLB, then cache existing memory ranges into vhost
> > > > backend iotlb using 1:1 mapping.
> > > > 
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > 
> > > I don't really understand. Is this a performance optimization?
> > > Can you post some #s please?
> > 
> > Yes, it is. Vhost can work even without this patch, but it should be
> > faster when with this patch.
> 
> You'll have to include perf testing numbers then.

My mistake for not comparing the numbers earlier; it seemed so obvious
to me that this patch would help.

However, some simple streaming tests show that this patch didn't boost
performance at all. I added some traces in the vhost kernel to see
what is happening; please see below.

Without this patch, boot with iommu=pt, I see IOTLB cache insertion
like this:

vhost_process_iotlb_msg: iotlb new: 1 (0x17879b240-0x17fffffff)
vhost_process_iotlb_msg: iotlb new: 2 (0x17879d240-0x17fffffff)
vhost_process_iotlb_msg: iotlb new: 3 (0x17879a000-0x17fffffff)
vhost_process_iotlb_msg: iotlb new: 4 (0x178570000-0x17fffffff)
vhost_process_iotlb_msg: iotlb new: 5 (0x178532606-0x17fffffff)
vhost_process_iotlb_msg: iotlb new: 6 (0x177bad0e2-0x17fffffff)
vhost_process_iotlb_msg: iotlb new: 7 (0x1768560e2-0x17fffffff)

(Note: each entry has the form (ADDR-0x17fffffff); as ADDR generally
 moves lower, the cached range grows until it covers all addresses that
 vhost needs for DMA, after which there are no more cache misses.
 0x17fffffff is the upper limit for a guest with 4G of memory.)

Whereas with this patch applied (this is expected):

vhost_process_iotlb_msg: iotlb new: 1 (0x100000000-0x17fffffff)
vhost_process_iotlb_msg: iotlb new: 2 (0x0-0x9ffff)
vhost_process_iotlb_msg: iotlb new: 3 (0xc0000-0x7fffffff)

(Note: this is one entry per RAM memory region)

This explains why performance didn't really change even before this
patch was applied: when iommu=pt is on,
address_space_get_iotlb_entry() can return IOTLB entries bigger than
the page size (in the code, plen decides the page mask, and plen is
limited only by the memory region sizes when PT is enabled). So by the
7th cache-miss IOTLB request, the cached range is big enough to cover
the rest of the DMA addresses.

My preference is that we still apply this patch even if there is no
performance gain in the simple streaming test. Reasons:

- the old code's good performance depends on the implementation of
  address_space_get_iotlb_entry(), which may change in the future

- with the patch applied, we are 100% sure that we won't miss the
  cache, which we cannot guarantee without it. Without the patch, we
  may still encounter cache misses (e.g., accessing an address below
  0x1768560e2 after the 7th cache miss in the test above), which can
  delay the cache-missed IO.

>
> > As mentioned in the commit message and below comment [1], this patch
> > pre-heat the cache for vhost. Currently the cache entries depends on
> > the system memory ranges (dev->mem->nregions), and it should be far
> > smaller than vhost's cache count (currently it is statically defined
> > as max_iotlb_entries=2048 in kernel). If with current patch, these
> > cache entries can cover the whole possible DMA ranges that PT mode
> > would allow, so we won't have any cache miss then.
> > 
> > For the comments, do you have any better suggestion besides commit
> > message and [1]?
> > 
> > > 
> > > Also, if it's PT, can't we bypass iommu altogether? That would be
> > > even faster ...
> > 
> > Yes, but I don't yet know a good way to do it... Any suggestion is
> > welcomed as well.
> > 
> > Btw, do you have any comment on other patches besides this one? Since
> > this patch can really be isolated from the whole PT support series.
> > 
> > Thanks,
> 
> I've applied the rest of the series.

Thank you very much.

-- 
Peter Xu


Thread overview: 18+ messages
2017-05-19  3:19 [Qemu-devel] [PATCH v4 00/10] VT-d: PT (passthrough) mode support and misc fixes Peter Xu
2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 01/10] memory: tune last param of iommu_ops.translate() Peter Xu
2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 02/10] memory: remove the last param in memory_region_iommu_replay() Peter Xu
2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 03/10] x86-iommu: use DeviceClass properties Peter Xu
2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 04/10] intel_iommu: renaming context entry helpers Peter Xu
2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 05/10] intel_iommu: provide vtd_ce_get_type() Peter Xu
2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 06/10] intel_iommu: use IOMMU_ACCESS_FLAG() Peter Xu
2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 07/10] intel_iommu: allow dev-iotlb context entry conditionally Peter Xu
2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 08/10] intel_iommu: support passthrough (PT) Peter Xu
2017-05-25 10:40   ` Liu, Yi L
2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 09/10] intel_iommu: turn off pt before 2.9 Peter Xu
2017-05-19  3:19 ` [Qemu-devel] [PATCH v4 10/10] vhost: iommu: cache static mapping if there is Peter Xu
2017-05-19 16:55   ` Michael S. Tsirkin
2017-05-22  2:30     ` Jason Wang
2017-05-22  2:42     ` Peter Xu
2017-05-25 18:14       ` Michael S. Tsirkin
2017-05-29  4:29         ` Peter Xu
2017-05-25  8:16 ` [Qemu-devel] [PATCH v4 00/10] VT-d: PT (passthrough) mode support and misc fixes Jason Wang
