* [Qemu-devel] [PATCH RFC v3 00/14] VT-d: vfio enablement and misc enhancements
@ 2017-01-13  3:06 Peter Xu
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 01/14] IOMMU: add option to enable VTD_CAP_CM to vIOMMU capability exposed to guest Peter Xu
                   ` (14 more replies)
  0 siblings, 15 replies; 93+ messages in thread
From: Peter Xu @ 2017-01-13  3:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

v3:
- fix style error reported by patchew
- fix comment in domain switch patch: use "IOMMU address space" rather
  than "IOMMU region" [Kevin]
- add Acked-by from Paolo in patch:
  "memory: add section range info for IOMMU notifier"
  (this was collected separately, outside this thread)
- remove 3 patches which have already been merged (from Jason)
- rebase to master b6c0897

v2:
- change comment for "end" parameter in vtd_page_walk() [Tianyu]
- change comment for "a iova" to "an iova" [Yi]
- fix the printed fault value for the GPA address in vtd_page_walk_level
  (debug only)
- rebased to master (rather than Aviv's v6 series) and merged Aviv's
  series v6: picked patch 1 (as patch 1 in this series), dropped patch
  2, re-wrote patch 3 (as patch 17 of this series).
- picked up two more bugfix patches from Jason's DMAR series
- picked up the following patch as well:
  "[PATCH v3] intel_iommu: allow dynamic switch of IOMMU region"

This RFC series is a re-work of Aviv B.D.'s vfio enablement series
for VT-d:

  https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01452.html

Aviv has done a great job there, and what was still missing is mostly
the following:

(1) VFIO got duplicated IOTLB notifications due to the split VT-d IOMMU
    memory region.

(2) VT-d still didn't provide a correct replay() mechanism (e.g., when
    the IOMMU domain switches, things would break).

This series should solve both of the above issues.

Online repo:

  https://github.com/xzpeter/qemu/tree/vtd-vfio-enablement-v2

I would be glad to hear any review comments on the above patches.

=========
Test Done
=========

Build test passed for x86_64/arm/ppc64.

Simple test with x86_64: assign two PCI devices to a single VM, then
boot the VM using:

bin=x86_64-softmmu/qemu-system-x86_64
$bin -M q35,accel=kvm,kernel-irqchip=split -m 1G \
     -device intel-iommu,intremap=on,eim=off,cache-mode=on \
     -netdev user,id=net0,hostfwd=tcp::5555-:22 \
     -device virtio-net-pci,netdev=net0 \
     -device vfio-pci,host=03:00.0 \
     -device vfio-pci,host=02:00.0 \
     -trace events=".trace.vfio" \
     /var/lib/libvirt/images/vm1.qcow2

pxdev:bin [vtd-vfio-enablement]# cat .trace.vfio
vtd_page_walk*
vtd_replay*
vtd_inv_desc*

Then, in the guest, run the following tool:

  https://github.com/xzpeter/clibs/blob/master/gpl/userspace/vfio-bind-group/vfio-bind-group.c

With parameters:

  ./vfio-bind-group 00:03.0 00:04.0

Checking the host-side trace log, I can see the pages being replayed
and mapped into the 00:04.0 device address space, like:

...
vtd_replay_ce_valid replay valid context device 00:04.00 hi 0x401 lo 0x38fe1001
vtd_page_walk Page walk for ce (0x401, 0x38fe1001) iova range 0x0 - 0x8000000000
vtd_page_walk_level Page walk (base=0x38fe1000, level=3) iova range 0x0 - 0x8000000000
vtd_page_walk_level Page walk (base=0x35d31000, level=2) iova range 0x0 - 0x40000000
vtd_page_walk_level Page walk (base=0x34979000, level=1) iova range 0x0 - 0x200000
vtd_page_walk_one Page walk detected map level 0x1 iova 0x0 -> gpa 0x22dc3000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x1000 -> gpa 0x22e25000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x2000 -> gpa 0x22e12000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x3000 -> gpa 0x22e2d000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x4000 -> gpa 0x12a49000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x5000 -> gpa 0x129bb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x6000 -> gpa 0x128db000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x7000 -> gpa 0x12a80000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x8000 -> gpa 0x12a7e000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x9000 -> gpa 0x12b22000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0xa000 -> gpa 0x12b41000 mask 0xfff perm 3
...

=========
Todo List
=========

- error reporting for the assigned devices (as Tianyu has mentioned)

- per-domain address space: a better solution in the future may be to
  maintain one address space per IOMMU domain in the guest (so that
  multiple devices can share a single address space when they are in
  the same IOMMU domain in the guest), rather than one address space
  per device (which is VT-d's current implementation). However, that's
  a step beyond this series; let's first see whether we can provide a
  workable version of device assignment with VT-d protection. A rough
  sketch of the idea follows this list.

- more to come...
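
As a rough sketch of the per-domain idea above (purely illustrative
and not part of this series; the domain_as_map field and the helper
name are made up):

  /* Hypothetical: one VTDAddressSpace per guest IOMMU domain,
   * keyed by domain id instead of by PCI devfn. */
  static VTDAddressSpace *vtd_find_as_by_domain(IntelIOMMUState *s,
                                                uint16_t domain_id)
  {
      VTDAddressSpace *as;

      as = g_hash_table_lookup(s->domain_as_map,
                               GUINT_TO_POINTER(domain_id));
      if (!as) {
          as = g_new0(VTDAddressSpace, 1);
          g_hash_table_insert(s->domain_as_map,
                              GUINT_TO_POINTER(domain_id), as);
      }
      return as;
  }

Devices that the guest puts into the same IOMMU domain would then
resolve to the same address space, so mappings are shared instead of
being duplicated per device.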

Thanks,

Aviv Ben-David (1):
  IOMMU: add option to enable VTD_CAP_CM to vIOMMU capability exposed to
    guest

Peter Xu (13):
  intel_iommu: simplify irq region translation
  intel_iommu: renaming gpa to iova where proper
  intel_iommu: fix trace for inv desc handling
  intel_iommu: fix trace for addr translation
  intel_iommu: vtd_slpt_level_shift check level
  memory: add section range info for IOMMU notifier
  memory: provide iommu_replay_all()
  memory: introduce memory_region_notify_one()
  memory: add MemoryRegionIOMMUOps.replay() callback
  intel_iommu: provide its own replay() callback
  intel_iommu: do replay when context invalidate
  intel_iommu: allow dynamic switch of IOMMU region
  intel_iommu: enable vfio devices

 hw/i386/intel_iommu.c          | 589 +++++++++++++++++++++++++++++++----------
 hw/i386/intel_iommu_internal.h |   1 +
 hw/i386/trace-events           |  28 ++
 hw/vfio/common.c               |   7 +-
 include/exec/memory.h          |  30 +++
 include/hw/i386/intel_iommu.h  |  12 +
 memory.c                       |  42 ++-
 7 files changed, 557 insertions(+), 152 deletions(-)

-- 
2.7.4


* [Qemu-devel] [PATCH RFC v3 01/14] IOMMU: add option to enable VTD_CAP_CM to vIOMMU capability exposed to guest
  2017-01-13  3:06 [Qemu-devel] [PATCH RFC v3 00/14] VT-d: vfio enablement and misc enhancements Peter Xu
@ 2017-01-13  3:06 ` Peter Xu
  2017-01-20  8:32   ` Tian, Kevin
  2017-01-20 15:42   ` Eric Blake
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 02/14] intel_iommu: simplify irq region translation Peter Xu
                   ` (13 subsequent siblings)
  14 siblings, 2 replies; 93+ messages in thread
From: Peter Xu @ 2017-01-13  3:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

From: Aviv Ben-David <bd.aviv@gmail.com>

This capability asks the guest to invalidate caches before each map
operation. We can use these invalidations to trap map operations in
the hypervisor.
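
With this applied, the capability can be turned on from the command
line via the new property, e.g. (the same switch used by the test
command in the cover letter):

  -device intel-iommu,intremap=on,eim=off,cache-mode=on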

Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c          | 5 +++++
 hw/i386/intel_iommu_internal.h | 1 +
 include/hw/i386/intel_iommu.h  | 2 ++
 3 files changed, 8 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index ec62239..2868e37 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2107,6 +2107,7 @@ static Property vtd_properties[] = {
     DEFINE_PROP_ON_OFF_AUTO("eim", IntelIOMMUState, intr_eim,
                             ON_OFF_AUTO_AUTO),
     DEFINE_PROP_BOOL("x-buggy-eim", IntelIOMMUState, buggy_eim, false),
+    DEFINE_PROP_BOOL("cache-mode", IntelIOMMUState, cache_mode_enabled, FALSE),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -2488,6 +2489,10 @@ static void vtd_init(IntelIOMMUState *s)
         s->ecap |= VTD_ECAP_DT;
     }
 
+    if (s->cache_mode_enabled) {
+        s->cap |= VTD_CAP_CM;
+    }
+
     vtd_reset_context_cache(s);
     vtd_reset_iotlb(s);
 
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 356f188..4104121 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -202,6 +202,7 @@
 #define VTD_CAP_MAMV                (VTD_MAMV << 48)
 #define VTD_CAP_PSI                 (1ULL << 39)
 #define VTD_CAP_SLLPS               ((1ULL << 34) | (1ULL << 35))
+#define VTD_CAP_CM                  (1ULL << 7)
 
 /* Supported Adjusted Guest Address Widths */
 #define VTD_CAP_SAGAW_SHIFT         8
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 405c9d1..749eef9 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -257,6 +257,8 @@ struct IntelIOMMUState {
     uint8_t womask[DMAR_REG_SIZE];  /* WO (write only - read returns 0) */
     uint32_t version;
 
+    bool cache_mode_enabled;        /* RO - is cap CM enabled? */
+
     dma_addr_t root;                /* Current root table pointer */
     bool root_extended;             /* Type of root table (extended or not) */
     bool dmar_enabled;              /* Set if DMA remapping is enabled */
-- 
2.7.4


* [Qemu-devel] [PATCH RFC v3 02/14] intel_iommu: simplify irq region translation
  2017-01-13  3:06 [Qemu-devel] [PATCH RFC v3 00/14] VT-d: vfio enablement and misc enhancements Peter Xu
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 01/14] IOMMU: add option to enable VTD_CAP_CM to vIOMMU capability exposed to guest Peter Xu
@ 2017-01-13  3:06 ` Peter Xu
  2017-01-20  8:22   ` Tian, Kevin
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 03/14] intel_iommu: renaming gpa to iova where proper Peter Xu
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-13  3:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

Before we had interrupt remapping, we needed to bypass interrupt write
requests. That's no longer necessary: interrupt remapping is now
supported, and all IRQ region requests should be redirected there.
Clean up the block with an assertion instead.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 28 ++++++----------------------
 1 file changed, 6 insertions(+), 22 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 2868e37..77d467a 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -818,28 +818,12 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     bool writes = true;
     VTDIOTLBEntry *iotlb_entry;
 
-    /* Check if the request is in interrupt address range */
-    if (vtd_is_interrupt_addr(addr)) {
-        if (is_write) {
-            /* FIXME: since we don't know the length of the access here, we
-             * treat Non-DWORD length write requests without PASID as
-             * interrupt requests, too. Withoud interrupt remapping support,
-             * we just use 1:1 mapping.
-             */
-            VTD_DPRINTF(MMU, "write request to interrupt address "
-                        "gpa 0x%"PRIx64, addr);
-            entry->iova = addr & VTD_PAGE_MASK_4K;
-            entry->translated_addr = addr & VTD_PAGE_MASK_4K;
-            entry->addr_mask = ~VTD_PAGE_MASK_4K;
-            entry->perm = IOMMU_WO;
-            return;
-        } else {
-            VTD_DPRINTF(GENERAL, "error: read request from interrupt address "
-                        "gpa 0x%"PRIx64, addr);
-            vtd_report_dmar_fault(s, source_id, addr, VTD_FR_READ, is_write);
-            return;
-        }
-    }
+    /*
+     * We have a standalone memory region for interrupt addresses; we
+     * should never receive translation requests in this region.
+     */
+    assert(!vtd_is_interrupt_addr(addr));
+
     /* Try to fetch slpte form IOTLB */
     iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
     if (iotlb_entry) {
-- 
2.7.4


* [Qemu-devel] [PATCH RFC v3 03/14] intel_iommu: renaming gpa to iova where proper
  2017-01-13  3:06 [Qemu-devel] [PATCH RFC v3 00/14] VT-d: vfio enablement and misc enhancements Peter Xu
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 01/14] IOMMU: add option to enable VTD_CAP_CM to vIOMMU capability exposed to guest Peter Xu
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 02/14] intel_iommu: simplify irq region translation Peter Xu
@ 2017-01-13  3:06 ` Peter Xu
  2017-01-20  8:27   ` Tian, Kevin
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 04/14] intel_iommu: fix trace for inv desc handling Peter Xu
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-13  3:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

There are many places in the current intel_iommu.c code where an "iova"
is named "gpa". Using the name "gpa" in these places is really
confusing, since it is easily read as "Guest Physical Address", which
it is not. To make the code (much) easier to read, I decided to fix
this once and for all.

No functional change is made, only renames.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 36 ++++++++++++++++++------------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 77d467a..275c3db 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -259,7 +259,7 @@ static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
     uint64_t *key = g_malloc(sizeof(*key));
     uint64_t gfn = vtd_get_iotlb_gfn(addr, level);
 
-    VTD_DPRINTF(CACHE, "update iotlb sid 0x%"PRIx16 " gpa 0x%"PRIx64
+    VTD_DPRINTF(CACHE, "update iotlb sid 0x%"PRIx16 " iova 0x%"PRIx64
                 " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr, slpte,
                 domain_id);
     if (g_hash_table_size(s->iotlb) >= VTD_IOTLB_MAX_SIZE) {
@@ -575,12 +575,12 @@ static uint64_t vtd_get_slpte(dma_addr_t base_addr, uint32_t index)
     return slpte;
 }
 
-/* Given a gpa and the level of paging structure, return the offset of current
- * level.
+/* Given an iova and the level of paging structure, return the offset
+ * of current level.
  */
-static inline uint32_t vtd_gpa_level_offset(uint64_t gpa, uint32_t level)
+static inline uint32_t vtd_iova_level_offset(uint64_t iova, uint32_t level)
 {
-    return (gpa >> vtd_slpt_level_shift(level)) &
+    return (iova >> vtd_slpt_level_shift(level)) &
             ((1ULL << VTD_SL_LEVEL_BITS) - 1);
 }
 
@@ -628,10 +628,10 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
     }
 }
 
-/* Given the @gpa, get relevant @slptep. @slpte_level will be the last level
+/* Given the @iova, get relevant @slptep. @slpte_level will be the last level
  * of the translation, can be used for deciding the size of large page.
  */
-static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
+static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
                             uint64_t *slptep, uint32_t *slpte_level,
                             bool *reads, bool *writes)
 {
@@ -642,11 +642,11 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
     uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
     uint64_t access_right_check;
 
-    /* Check if @gpa is above 2^X-1, where X is the minimum of MGAW in CAP_REG
-     * and AW in context-entry.
+    /* Check if @iova is above 2^X-1, where X is the minimum of MGAW
+     * in CAP_REG and AW in context-entry.
      */
-    if (gpa & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
-        VTD_DPRINTF(GENERAL, "error: gpa 0x%"PRIx64 " exceeds limits", gpa);
+    if (iova & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
+        VTD_DPRINTF(GENERAL, "error: iova 0x%"PRIx64 " exceeds limits", iova);
         return -VTD_FR_ADDR_BEYOND_MGAW;
     }
 
@@ -654,13 +654,13 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
     access_right_check = is_write ? VTD_SL_W : VTD_SL_R;
 
     while (true) {
-        offset = vtd_gpa_level_offset(gpa, level);
+        offset = vtd_iova_level_offset(iova, level);
         slpte = vtd_get_slpte(addr, offset);
 
         if (slpte == (uint64_t)-1) {
             VTD_DPRINTF(GENERAL, "error: fail to access second-level paging "
-                        "entry at level %"PRIu32 " for gpa 0x%"PRIx64,
-                        level, gpa);
+                        "entry at level %"PRIu32 " for iova 0x%"PRIx64,
+                        level, iova);
             if (level == vtd_get_level_from_context_entry(ce)) {
                 /* Invalid programming of context-entry */
                 return -VTD_FR_CONTEXT_ENTRY_INV;
@@ -672,8 +672,8 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
         *writes = (*writes) && (slpte & VTD_SL_W);
         if (!(slpte & access_right_check)) {
             VTD_DPRINTF(GENERAL, "error: lack of %s permission for "
-                        "gpa 0x%"PRIx64 " slpte 0x%"PRIx64,
-                        (is_write ? "write" : "read"), gpa, slpte);
+                        "iova 0x%"PRIx64 " slpte 0x%"PRIx64,
+                        (is_write ? "write" : "read"), iova, slpte);
             return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
         }
         if (vtd_slpte_nonzero_rsvd(slpte, level)) {
@@ -827,7 +827,7 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     /* Try to fetch slpte form IOTLB */
     iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
     if (iotlb_entry) {
-        VTD_DPRINTF(CACHE, "hit iotlb sid 0x%"PRIx16 " gpa 0x%"PRIx64
+        VTD_DPRINTF(CACHE, "hit iotlb sid 0x%"PRIx16 " iova 0x%"PRIx64
                     " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr,
                     iotlb_entry->slpte, iotlb_entry->domain_id);
         slpte = iotlb_entry->slpte;
@@ -2025,7 +2025,7 @@ static IOMMUTLBEntry vtd_iommu_translate(MemoryRegion *iommu, hwaddr addr,
                            is_write, &ret);
     VTD_DPRINTF(MMU,
                 "bus %"PRIu8 " slot %"PRIu8 " func %"PRIu8 " devfn %"PRIu8
-                " gpa 0x%"PRIx64 " hpa 0x%"PRIx64, pci_bus_num(vtd_as->bus),
+                " iova 0x%"PRIx64 " hpa 0x%"PRIx64, pci_bus_num(vtd_as->bus),
                 VTD_PCI_SLOT(vtd_as->devfn), VTD_PCI_FUNC(vtd_as->devfn),
                 vtd_as->devfn, addr, ret.translated_addr);
     return ret;
-- 
2.7.4


* [Qemu-devel] [PATCH RFC v3 04/14] intel_iommu: fix trace for inv desc handling
  2017-01-13  3:06 [Qemu-devel] [PATCH RFC v3 00/14] VT-d: vfio enablement and misc enhancements Peter Xu
                   ` (2 preceding siblings ...)
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 03/14] intel_iommu: renaming gpa to iova where proper Peter Xu
@ 2017-01-13  3:06 ` Peter Xu
  2017-01-13  7:46   ` Jason Wang
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 05/14] intel_iommu: fix trace for addr translation Peter Xu
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-13  3:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

The VT-d code still uses the static DEBUG_INTEL_IOMMU macro. That's not
good: we should end the days when we need to recompile the code before
getting useful debugging information for VT-d. Time to switch to the
trace system.

This is the first patch to do it.

Generally, my rules are:

- for the old GENERAL-typed messages, use error_report() directly where
  it applies. Those are things that shouldn't happen, and we should
  print such errors in all cases, even without debugging and tracing
  enabled.

- for the non-GENERAL-typed messages, remove the VTD_DPRINTF()s that
  look rarely used, and convert the remaining ones into trace_*().

- useless DPRINTFs are simply removed.
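
Once converted, these tracepoints can be enabled at runtime instead of
at build time, e.g. (mirroring the cover letter's test setup; the exact
-trace syntax here is assumed):

  $bin ... -trace "vtd_inv_desc*"        # enable a pattern directly
  $bin ... -trace events=.trace.vfio     # or read patterns from a file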

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 98 ++++++++++++++++++++++++---------------------------
 hw/i386/trace-events  | 13 +++++++
 2 files changed, 59 insertions(+), 52 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 275c3db..459e575 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -35,6 +35,7 @@
 #include "sysemu/kvm.h"
 #include "hw/i386/apic_internal.h"
 #include "kvm_i386.h"
+#include "trace.h"
 
 /*#define DEBUG_INTEL_IOMMU*/
 #ifdef DEBUG_INTEL_IOMMU
@@ -474,22 +475,19 @@ static void vtd_handle_inv_queue_error(IntelIOMMUState *s)
 /* Set the IWC field and try to generate an invalidation completion interrupt */
 static void vtd_generate_completion_event(IntelIOMMUState *s)
 {
-    VTD_DPRINTF(INV, "completes an invalidation wait command with "
-                "Interrupt Flag");
     if (vtd_get_long_raw(s, DMAR_ICS_REG) & VTD_ICS_IWC) {
-        VTD_DPRINTF(INV, "there is a previous interrupt condition to be "
-                    "serviced by software, "
-                    "new invalidation event is not generated");
+        trace_vtd_inv_desc_wait_irq("One pending, skip current");
         return;
     }
     vtd_set_clear_mask_long(s, DMAR_ICS_REG, 0, VTD_ICS_IWC);
     vtd_set_clear_mask_long(s, DMAR_IECTL_REG, 0, VTD_IECTL_IP);
     if (vtd_get_long_raw(s, DMAR_IECTL_REG) & VTD_IECTL_IM) {
-        VTD_DPRINTF(INV, "IM filed in IECTL_REG is set, new invalidation "
-                    "event is not generated");
+        trace_vtd_inv_desc_wait_irq("IM in IECTL_REG is set, "
+                                    "new event not generated");
         return;
     } else {
         /* Generate the interrupt event */
+        trace_vtd_inv_desc_wait_irq("Generating complete event");
         vtd_generate_interrupt(s, DMAR_IEADDR_REG, DMAR_IEDATA_REG);
         vtd_set_clear_mask_long(s, DMAR_IECTL_REG, VTD_IECTL_IP, 0);
     }
@@ -923,6 +921,7 @@ static void vtd_interrupt_remap_table_setup(IntelIOMMUState *s)
 
 static void vtd_context_global_invalidate(IntelIOMMUState *s)
 {
+    trace_vtd_inv_desc_cc_global();
     s->context_cache_gen++;
     if (s->context_cache_gen == VTD_CONTEXT_CACHE_GEN_MAX) {
         vtd_reset_context_cache(s);
@@ -962,9 +961,11 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
     uint16_t mask;
     VTDBus *vtd_bus;
     VTDAddressSpace *vtd_as;
-    uint16_t devfn;
+    uint8_t bus_n, devfn;
     uint16_t devfn_it;
 
+    trace_vtd_inv_desc_cc_devices(source_id, func_mask);
+
     switch (func_mask & 3) {
     case 0:
         mask = 0;   /* No bits in the SID field masked */
@@ -980,16 +981,16 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
         break;
     }
     mask = ~mask;
-    VTD_DPRINTF(INV, "device-selective invalidation source 0x%"PRIx16
-                    " mask %"PRIu16, source_id, mask);
-    vtd_bus = vtd_find_as_from_bus_num(s, VTD_SID_TO_BUS(source_id));
+
+    bus_n = VTD_SID_TO_BUS(source_id);
+    vtd_bus = vtd_find_as_from_bus_num(s, bus_n);
     if (vtd_bus) {
         devfn = VTD_SID_TO_DEVFN(source_id);
         for (devfn_it = 0; devfn_it < X86_IOMMU_PCI_DEVFN_MAX; ++devfn_it) {
             vtd_as = vtd_bus->dev_as[devfn_it];
             if (vtd_as && ((devfn_it & mask) == (devfn & mask))) {
-                VTD_DPRINTF(INV, "invalidate context-cahce of devfn 0x%"PRIx16,
-                            devfn_it);
+                trace_vtd_inv_desc_cc_device(bus_n, (devfn_it >> 3) & 0x1f,
+                                             devfn_it & 7);
                 vtd_as->context_cache_entry.context_cache_gen = 0;
             }
         }
@@ -1302,7 +1303,7 @@ static bool vtd_process_wait_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
 {
     if ((inv_desc->hi & VTD_INV_DESC_WAIT_RSVD_HI) ||
         (inv_desc->lo & VTD_INV_DESC_WAIT_RSVD_LO)) {
-        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in Invalidation "
+        error_report("Non-zero reserved field in Invalidation "
                     "Wait Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
                     inv_desc->hi, inv_desc->lo);
         return false;
@@ -1316,21 +1317,20 @@ static bool vtd_process_wait_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
 
         /* FIXME: need to be masked with HAW? */
         dma_addr_t status_addr = inv_desc->hi;
-        VTD_DPRINTF(INV, "status data 0x%x, status addr 0x%"PRIx64,
-                    status_data, status_addr);
+        trace_vtd_inv_desc_wait_sw(status_addr, status_data);
         status_data = cpu_to_le32(status_data);
         if (dma_memory_write(&address_space_memory, status_addr, &status_data,
                              sizeof(status_data))) {
-            VTD_DPRINTF(GENERAL, "error: fail to perform a coherent write");
+            error_report("Invalidate Desc Wait status write failed");
             return false;
         }
     } else if (inv_desc->lo & VTD_INV_DESC_WAIT_IF) {
         /* Interrupt flag */
-        VTD_DPRINTF(INV, "Invalidation Wait Descriptor interrupt completion");
         vtd_generate_completion_event(s);
     } else {
-        VTD_DPRINTF(GENERAL, "error: invalid Invalidation Wait Descriptor: "
-                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, inv_desc->hi, inv_desc->lo);
+        error_report("invalid Invalidation Wait Descriptor: "
+                     "hi 0x%"PRIx64 " lo 0x%"PRIx64,
+                     inv_desc->hi, inv_desc->lo);
         return false;
     }
     return true;
@@ -1339,30 +1339,32 @@ static bool vtd_process_wait_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
 static bool vtd_process_context_cache_desc(IntelIOMMUState *s,
                                            VTDInvDesc *inv_desc)
 {
+    uint16_t sid, fmask;
+
     if ((inv_desc->lo & VTD_INV_DESC_CC_RSVD) || inv_desc->hi) {
-        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in Context-cache "
-                    "Invalidate Descriptor");
+        error_report("non-zero reserved field in Context-cache "
+                     "Invalidate Descriptor");
         return false;
     }
     switch (inv_desc->lo & VTD_INV_DESC_CC_G) {
     case VTD_INV_DESC_CC_DOMAIN:
-        VTD_DPRINTF(INV, "domain-selective invalidation domain 0x%"PRIx16,
-                    (uint16_t)VTD_INV_DESC_CC_DID(inv_desc->lo));
+        trace_vtd_inv_desc_cc_domain(
+            (uint16_t)VTD_INV_DESC_CC_DID(inv_desc->lo));
         /* Fall through */
     case VTD_INV_DESC_CC_GLOBAL:
-        VTD_DPRINTF(INV, "global invalidation");
         vtd_context_global_invalidate(s);
         break;
 
     case VTD_INV_DESC_CC_DEVICE:
-        vtd_context_device_invalidate(s, VTD_INV_DESC_CC_SID(inv_desc->lo),
-                                      VTD_INV_DESC_CC_FM(inv_desc->lo));
+        sid = VTD_INV_DESC_CC_SID(inv_desc->lo);
+        fmask = VTD_INV_DESC_CC_FM(inv_desc->lo);
+        vtd_context_device_invalidate(s, sid, fmask);
         break;
 
     default:
-        VTD_DPRINTF(GENERAL, "error: invalid granularity in Context-cache "
-                    "Invalidate Descriptor hi 0x%"PRIx64  " lo 0x%"PRIx64,
-                    inv_desc->hi, inv_desc->lo);
+        error_report("invalid granularity in Context-cache "
+                     "Invalidate Descriptor hi 0x%"PRIx64" lo 0x%"PRIx64,
+                     inv_desc->hi, inv_desc->lo);
         return false;
     }
     return true;
@@ -1376,7 +1378,7 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
 
     if ((inv_desc->lo & VTD_INV_DESC_IOTLB_RSVD_LO) ||
         (inv_desc->hi & VTD_INV_DESC_IOTLB_RSVD_HI)) {
-        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in IOTLB "
+        error_report("non-zero reserved field in IOTLB "
                     "Invalidate Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
                     inv_desc->hi, inv_desc->lo);
         return false;
@@ -1384,14 +1386,13 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
 
     switch (inv_desc->lo & VTD_INV_DESC_IOTLB_G) {
     case VTD_INV_DESC_IOTLB_GLOBAL:
-        VTD_DPRINTF(INV, "global invalidation");
+        trace_vtd_inv_desc_iotlb_global();
         vtd_iotlb_global_invalidate(s);
         break;
 
     case VTD_INV_DESC_IOTLB_DOMAIN:
         domain_id = VTD_INV_DESC_IOTLB_DID(inv_desc->lo);
-        VTD_DPRINTF(INV, "domain-selective invalidation domain 0x%"PRIx16,
-                    domain_id);
+        trace_vtd_inv_desc_iotlb_domain(domain_id);
         vtd_iotlb_domain_invalidate(s, domain_id);
         break;
 
@@ -1399,18 +1400,17 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
         domain_id = VTD_INV_DESC_IOTLB_DID(inv_desc->lo);
         addr = VTD_INV_DESC_IOTLB_ADDR(inv_desc->hi);
         am = VTD_INV_DESC_IOTLB_AM(inv_desc->hi);
-        VTD_DPRINTF(INV, "page-selective invalidation domain 0x%"PRIx16
-                    " addr 0x%"PRIx64 " mask %"PRIu8, domain_id, addr, am);
+        trace_vtd_inv_desc_iotlb_pages(domain_id, addr, am);
         if (am > VTD_MAMV) {
-            VTD_DPRINTF(GENERAL, "error: supported max address mask value is "
-                        "%"PRIu8, (uint8_t)VTD_MAMV);
+            error_report("supported max address mask value is %"PRIu8,
+                         (uint8_t)VTD_MAMV);
             return false;
         }
         vtd_iotlb_page_invalidate(s, domain_id, addr, am);
         break;
 
     default:
-        VTD_DPRINTF(GENERAL, "error: invalid granularity in IOTLB Invalidate "
+        error_report("invalid granularity in IOTLB Invalidate "
                     "Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
                     inv_desc->hi, inv_desc->lo);
         return false;
@@ -1492,7 +1492,6 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
     VTDInvDesc inv_desc;
     uint8_t desc_type;
 
-    VTD_DPRINTF(INV, "iq head %"PRIu16, s->iq_head);
     if (!vtd_get_inv_desc(s->iq, s->iq_head, &inv_desc)) {
         s->iq_last_desc_type = VTD_INV_DESC_NONE;
         return false;
@@ -1503,33 +1502,28 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
 
     switch (desc_type) {
     case VTD_INV_DESC_CC:
-        VTD_DPRINTF(INV, "Context-cache Invalidate Descriptor hi 0x%"PRIx64
-                    " lo 0x%"PRIx64, inv_desc.hi, inv_desc.lo);
+        trace_vtd_inv_desc("context-cache", inv_desc.hi, inv_desc.lo);
         if (!vtd_process_context_cache_desc(s, &inv_desc)) {
             return false;
         }
         break;
 
     case VTD_INV_DESC_IOTLB:
-        VTD_DPRINTF(INV, "IOTLB Invalidate Descriptor hi 0x%"PRIx64
-                    " lo 0x%"PRIx64, inv_desc.hi, inv_desc.lo);
+        trace_vtd_inv_desc("iotlb", inv_desc.hi, inv_desc.lo);
         if (!vtd_process_iotlb_desc(s, &inv_desc)) {
             return false;
         }
         break;
 
     case VTD_INV_DESC_WAIT:
-        VTD_DPRINTF(INV, "Invalidation Wait Descriptor hi 0x%"PRIx64
-                    " lo 0x%"PRIx64, inv_desc.hi, inv_desc.lo);
+        trace_vtd_inv_desc("wait", inv_desc.hi, inv_desc.lo);
         if (!vtd_process_wait_desc(s, &inv_desc)) {
             return false;
         }
         break;
 
     case VTD_INV_DESC_IEC:
-        VTD_DPRINTF(INV, "Invalidation Interrupt Entry Cache "
-                    "Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
-                    inv_desc.hi, inv_desc.lo);
+        trace_vtd_inv_desc("iec", inv_desc.hi, inv_desc.lo);
         if (!vtd_process_inv_iec_desc(s, &inv_desc)) {
             return false;
         }
@@ -1544,9 +1538,9 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
         break;
 
     default:
-        VTD_DPRINTF(GENERAL, "error: unkonw Invalidation Descriptor type "
-                    "hi 0x%"PRIx64 " lo 0x%"PRIx64 " type %"PRIu8,
-                    inv_desc.hi, inv_desc.lo, desc_type);
+        error_report("Unkonw Invalidation Descriptor type "
+                     "hi 0x%"PRIx64" lo 0x%"PRIx64" type %"PRIu8,
+                     inv_desc.hi, inv_desc.lo, desc_type);
         return false;
     }
     s->iq_head++;
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index d2b4973..fba81f4 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -10,6 +10,19 @@ xen_pv_mmio_write(uint64_t addr) "WARNING: write to Xen PV Device MMIO space (ad
 # hw/i386/x86-iommu.c
 x86_iommu_iec_notify(bool global, uint32_t index, uint32_t mask) "Notify IEC invalidation: global=%d index=%" PRIu32 " mask=%" PRIu32
 
+# hw/i386/intel_iommu.c
+vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
+vtd_inv_desc(const char *type, uint64_t hi, uint64_t lo) "invalidate desc type %s high 0x%"PRIx64" low 0x%"PRIx64
+vtd_inv_desc_cc_domain(uint16_t domain) "context invalidate domain 0x%"PRIx16
+vtd_inv_desc_cc_global(void) "context invalidate globally"
+vtd_inv_desc_cc_device(uint8_t bus, uint8_t dev, uint8_t fn) "context invalidate device %02"PRIx8":%02"PRIx8".%02"PRIx8
+vtd_inv_desc_cc_devices(uint16_t sid, uint16_t fmask) "context invalidate devices sid 0x%"PRIx16" fmask 0x%"PRIx16
+vtd_inv_desc_iotlb_global(void) "iotlb invalidate global"
+vtd_inv_desc_iotlb_domain(uint16_t domain) "iotlb invalidate whole domain 0x%"PRIx16
+vtd_inv_desc_iotlb_pages(uint16_t domain, uint64_t addr, uint8_t mask) "iotlb invalidate domain 0x%"PRIx16" addr 0x%"PRIx64" mask 0x%"PRIx8
+vtd_inv_desc_wait_sw(uint64_t addr, uint32_t data) "wait invalidate status write addr 0x%"PRIx64" data 0x%"PRIx32
+vtd_inv_desc_wait_irq(const char *msg) "%s"
+
 # hw/i386/amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
 amdvi_cache_update(uint16_t domid, uint8_t bus, uint8_t slot, uint8_t func, uint64_t gpa, uint64_t txaddr) " update iotlb domid 0x%"PRIx16" devid: %02x:%02x.%x gpa 0x%"PRIx64" hpa 0x%"PRIx64
-- 
2.7.4


* [Qemu-devel] [PATCH RFC v3 05/14] intel_iommu: fix trace for addr translation
  2017-01-13  3:06 [Qemu-devel] [PATCH RFC v3 00/14] VT-d: vfio enablement and misc enhancements Peter Xu
                   ` (3 preceding siblings ...)
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 04/14] intel_iommu: fix trace for inv desc handling Peter Xu
@ 2017-01-13  3:06 ` Peter Xu
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 06/14] intel_iommu: vtd_slpt_level_shift check level Peter Xu
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 93+ messages in thread
From: Peter Xu @ 2017-01-13  3:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

Another patch to convert the DPRINTF() calls. This one focuses on the
address translation path and caching.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 84 +++++++++++++++++++++++++--------------------------
 hw/i386/trace-events  |  7 +++++
 2 files changed, 48 insertions(+), 43 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 459e575..b4166e0 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -260,11 +260,9 @@ static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
     uint64_t *key = g_malloc(sizeof(*key));
     uint64_t gfn = vtd_get_iotlb_gfn(addr, level);
 
-    VTD_DPRINTF(CACHE, "update iotlb sid 0x%"PRIx16 " iova 0x%"PRIx64
-                " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr, slpte,
-                domain_id);
+    trace_vtd_iotlb_page_update(source_id, addr, slpte, domain_id);
     if (g_hash_table_size(s->iotlb) >= VTD_IOTLB_MAX_SIZE) {
-        VTD_DPRINTF(CACHE, "iotlb exceeds size limit, forced to reset");
+        trace_vtd_iotlb_reset("iotlb exceeds size limit");
         vtd_reset_iotlb(s);
     }
 
@@ -505,8 +503,8 @@ static int vtd_get_root_entry(IntelIOMMUState *s, uint8_t index,
 
     addr = s->root + index * sizeof(*re);
     if (dma_memory_read(&address_space_memory, addr, re, sizeof(*re))) {
-        VTD_DPRINTF(GENERAL, "error: fail to access root-entry at 0x%"PRIx64
-                    " + %"PRIu8, s->root, index);
+        error_report("Fail to access root-entry at 0x%"PRIx64
+                     " index %"PRIu8, s->root, index);
         re->val = 0;
         return -VTD_FR_ROOT_TABLE_INV;
     }
@@ -525,13 +523,12 @@ static int vtd_get_context_entry_from_root(VTDRootEntry *root, uint8_t index,
     dma_addr_t addr;
 
     if (!vtd_root_entry_present(root)) {
-        VTD_DPRINTF(GENERAL, "error: root-entry is not present");
+        error_report("Root-entry is not present");
         return -VTD_FR_ROOT_ENTRY_P;
     }
     addr = (root->val & VTD_ROOT_ENTRY_CTP) + index * sizeof(*ce);
     if (dma_memory_read(&address_space_memory, addr, ce, sizeof(*ce))) {
-        VTD_DPRINTF(GENERAL, "error: fail to access context-entry at 0x%"PRIx64
-                    " + %"PRIu8,
+        error_report("Fail to access context-entry at 0x%"PRIx64" ind %"PRIu8,
                     (uint64_t)(root->val & VTD_ROOT_ENTRY_CTP), index);
         return -VTD_FR_CONTEXT_TABLE_INV;
     }
@@ -644,7 +641,7 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
      * in CAP_REG and AW in context-entry.
      */
     if (iova & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
-        VTD_DPRINTF(GENERAL, "error: iova 0x%"PRIx64 " exceeds limits", iova);
+        error_report("IOVA 0x%"PRIx64 " exceeds limits", iova);
         return -VTD_FR_ADDR_BEYOND_MGAW;
     }
 
@@ -656,7 +653,7 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
         slpte = vtd_get_slpte(addr, offset);
 
         if (slpte == (uint64_t)-1) {
-            VTD_DPRINTF(GENERAL, "error: fail to access second-level paging "
+            error_report("Fail to access second-level paging "
                         "entry at level %"PRIu32 " for iova 0x%"PRIx64,
                         level, iova);
             if (level == vtd_get_level_from_context_entry(ce)) {
@@ -669,13 +666,13 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
         *reads = (*reads) && (slpte & VTD_SL_R);
         *writes = (*writes) && (slpte & VTD_SL_W);
         if (!(slpte & access_right_check)) {
-            VTD_DPRINTF(GENERAL, "error: lack of %s permission for "
-                        "iova 0x%"PRIx64 " slpte 0x%"PRIx64,
-                        (is_write ? "write" : "read"), iova, slpte);
+            error_report("Lack of %s permission for iova 0x%"PRIx64
+                         " slpte 0x%"PRIx64,
+                         (is_write ? "write" : "read"), iova, slpte);
             return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
         }
         if (vtd_slpte_nonzero_rsvd(slpte, level)) {
-            VTD_DPRINTF(GENERAL, "error: non-zero reserved field in second "
+            error_report("Non-zero reserved field in second "
                         "level paging entry level %"PRIu32 " slpte 0x%"PRIx64,
                         level, slpte);
             return -VTD_FR_PAGING_ENTRY_RSVD;
@@ -704,12 +701,13 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
     }
 
     if (!vtd_root_entry_present(&re)) {
-        VTD_DPRINTF(GENERAL, "error: root-entry #%"PRIu8 " is not present",
-                    bus_num);
+        /* Not an error - it's okay not to have a root entry. */
+        trace_vtd_re_not_present(bus_num);
         return -VTD_FR_ROOT_ENTRY_P;
     } else if (re.rsvd || (re.val & VTD_ROOT_ENTRY_RSVD)) {
-        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in root-entry "
-                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, re.rsvd, re.val);
+        error_report("Non-zero reserved field in root-entry bus_num %d "
+                     "hi 0x%"PRIx64 " lo 0x%"PRIx64,
+                     bus_num, re.rsvd, re.val);
         return -VTD_FR_ROOT_ENTRY_RSVD;
     }
 
@@ -719,22 +717,20 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
     }
 
     if (!vtd_context_entry_present(ce)) {
-        VTD_DPRINTF(GENERAL,
-                    "error: context-entry #%"PRIu8 "(bus #%"PRIu8 ") "
-                    "is not present", devfn, bus_num);
+        /* Not an error - it's okay not to have a context entry. */
+        trace_vtd_ce_not_present(bus_num, devfn);
         return -VTD_FR_CONTEXT_ENTRY_P;
     } else if ((ce->hi & VTD_CONTEXT_ENTRY_RSVD_HI) ||
                (ce->lo & VTD_CONTEXT_ENTRY_RSVD_LO)) {
-        VTD_DPRINTF(GENERAL,
-                    "error: non-zero reserved field in context-entry "
+        error_report("Non-zero reserved field in context-entry"
                     "hi 0x%"PRIx64 " lo 0x%"PRIx64, ce->hi, ce->lo);
         return -VTD_FR_CONTEXT_ENTRY_RSVD;
     }
     /* Check if the programming of context-entry is valid */
     if (!vtd_is_level_supported(s, vtd_get_level_from_context_entry(ce))) {
-        VTD_DPRINTF(GENERAL, "error: unsupported Address Width value in "
-                    "context-entry hi 0x%"PRIx64 " lo 0x%"PRIx64,
-                    ce->hi, ce->lo);
+        error_report("Unsupported Address Width value in "
+                     "context-entry hi 0x%"PRIx64 " lo 0x%"PRIx64,
+                     ce->hi, ce->lo);
         return -VTD_FR_CONTEXT_ENTRY_INV;
     } else {
         switch (ce->lo & VTD_CONTEXT_ENTRY_TT) {
@@ -746,6 +742,6 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
-            VTD_DPRINTF(GENERAL, "error: unsupported Translation Type in "
-                        "context-entry hi 0x%"PRIx64 " lo 0x%"PRIx64,
-                        ce->hi, ce->lo);
+            error_report("Unsupported Translation Type in "
+                         "context-entry hi 0x%"PRIx64 " lo 0x%"PRIx64,
+                         ce->hi, ce->lo);
             return -VTD_FR_CONTEXT_ENTRY_INV;
         }
     }
@@ -825,9 +824,8 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     /* Try to fetch slpte form IOTLB */
     iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
     if (iotlb_entry) {
-        VTD_DPRINTF(CACHE, "hit iotlb sid 0x%"PRIx16 " iova 0x%"PRIx64
-                    " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr,
-                    iotlb_entry->slpte, iotlb_entry->domain_id);
+        trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->slpte,
+                                 iotlb_entry->domain_id);
         slpte = iotlb_entry->slpte;
         reads = iotlb_entry->read_flags;
         writes = iotlb_entry->write_flags;
@@ -836,10 +834,9 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     }
     /* Try to fetch context-entry from cache first */
     if (cc_entry->context_cache_gen == s->context_cache_gen) {
-        VTD_DPRINTF(CACHE, "hit context-cache bus %d devfn %d "
-                    "(hi %"PRIx64 " lo %"PRIx64 " gen %"PRIu32 ")",
-                    bus_num, devfn, cc_entry->context_entry.hi,
-                    cc_entry->context_entry.lo, cc_entry->context_cache_gen);
+        trace_vtd_iotlb_cc_hit(bus_num, devfn, cc_entry->context_entry.hi,
+                               cc_entry->context_entry.lo,
+                               cc_entry->context_cache_gen);
         ce = cc_entry->context_entry;
         is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
     } else {
@@ -848,19 +845,18 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         if (ret_fr) {
             ret_fr = -ret_fr;
             if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
-                VTD_DPRINTF(FLOG, "fault processing is disabled for DMA "
-                            "requests through this context-entry "
-                            "(with FPD Set)");
+                error_report("Fault processing is disabled for DMA "
+                             "requests through this context-entry "
+                             "(with FPD Set)");
             } else {
                 vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
             }
             return;
         }
         /* Update context-cache */
-        VTD_DPRINTF(CACHE, "update context-cache bus %d devfn %d "
-                    "(hi %"PRIx64 " lo %"PRIx64 " gen %"PRIu32 "->%"PRIu32 ")",
-                    bus_num, devfn, ce.hi, ce.lo,
-                    cc_entry->context_cache_gen, s->context_cache_gen);
+        trace_vtd_iotlb_cc_update(bus_num, devfn, ce.hi, ce.lo,
+                                  cc_entry->context_cache_gen,
+                                  s->context_cache_gen);
         cc_entry->context_entry = ce;
         cc_entry->context_cache_gen = s->context_cache_gen;
     }
@@ -870,8 +866,9 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     if (ret_fr) {
         ret_fr = -ret_fr;
         if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
-            VTD_DPRINTF(FLOG, "fault processing is disabled for DMA requests "
-                        "through this context-entry (with FPD Set)");
+            error_report("Fault processing is disabled for DMA "
+                         "requests through this context-entry "
+                         "(with FPD Set)");
         } else {
             vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
         }
@@ -1031,6 +1028,7 @@ static uint64_t vtd_context_cache_invalidate(IntelIOMMUState *s, uint64_t val)
 
 static void vtd_iotlb_global_invalidate(IntelIOMMUState *s)
 {
+    trace_vtd_iotlb_reset("global invalidation recved");
     vtd_reset_iotlb(s);
 }
 
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index fba81f4..eba9bf2 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -22,6 +22,13 @@ vtd_inv_desc_iotlb_domain(uint16_t domain) "iotlb invalidate whole domain 0x%"PR
 vtd_inv_desc_iotlb_pages(uint16_t domain, uint64_t addr, uint8_t mask) "iotlb invalidate domain 0x%"PRIx16" addr 0x%"PRIx64" mask 0x%"PRIx8
 vtd_inv_desc_wait_sw(uint64_t addr, uint32_t data) "wait invalidate status write addr 0x%"PRIx64" data 0x%"PRIx32
 vtd_inv_desc_wait_irq(const char *msg) "%s"
+vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
+vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
+vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
+vtd_iotlb_page_update(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page update sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
+vtd_iotlb_cc_hit(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen) "IOTLB context hit bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32
+vtd_iotlb_cc_update(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen1, uint32_t gen2) "IOTLB context update bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32" -> gen %"PRIu32
+vtd_iotlb_reset(const char *reason) "IOTLB reset (reason: %s)"
 
 # hw/i386/amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
-- 
2.7.4


* [Qemu-devel] [PATCH RFC v3 06/14] intel_iommu: vtd_slpt_level_shift check level
  2017-01-13  3:06 [Qemu-devel] [PATCH RFC v3 00/14] VT-d: vfio enablement and misc enhancements Peter Xu
                   ` (4 preceding siblings ...)
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 05/14] intel_iommu: fix trace for addr translation Peter Xu
@ 2017-01-13  3:06 ` Peter Xu
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 07/14] memory: add section range info for IOMMU notifier Peter Xu
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 93+ messages in thread
From: Peter Xu @ 2017-01-13  3:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

This helps debug cases where an incorrect level is passed in.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index b4166e0..b4019d0 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -168,6 +168,7 @@ static gboolean vtd_hash_remove_by_domain(gpointer key, gpointer value,
 /* The shift of an addr for a certain level of paging structure */
 static inline uint32_t vtd_slpt_level_shift(uint32_t level)
 {
+    assert(level != 0);
     return VTD_PAGE_SHIFT_4K + (level - 1) * VTD_SL_LEVEL_BITS;
 }
 
-- 
2.7.4


* [Qemu-devel] [PATCH RFC v3 07/14] memory: add section range info for IOMMU notifier
  2017-01-13  3:06 [Qemu-devel] [PATCH RFC v3 00/14] VT-d: vfio enablement and misc enhancements Peter Xu
                   ` (5 preceding siblings ...)
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 06/14] intel_iommu: vtd_slpt_level_shift check level Peter Xu
@ 2017-01-13  3:06 ` Peter Xu
  2017-01-13  7:55   ` Jason Wang
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 08/14] memory: provide iommu_replay_all() Peter Xu
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-13  3:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

In this patch, IOMMUNotifier.{start|end} are introduced to store
section information for a specific notifier. When a notification
occurs, we not only check the notification type (MAP|UNMAP), but also
check whether the notified iova falls within the range of the specific
IOMMU notifier, and skip notifiers whose registered range does not
cover it.

When removing a region, we need to make sure we remove the correct
VFIOGuestIOMMU by checking the IOMMUNotifier.start address as well.
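
For illustration, here is a minimal sketch of how a consumer registers
a range-limited notifier after this patch (it mirrors the VFIO hunk
below; my_notify is a made-up callback):

    static IOMMUNotifier n;   /* must outlive the registration */

    n.notify = my_notify;                  /* hypothetical callback */
    n.notifier_flags = IOMMU_NOTIFIER_ALL; /* both MAP and UNMAP */
    n.start = section->offset_within_region;
    n.end = n.start + int128_get64(section->size) - 1; /* inclusive */
    memory_region_register_iommu_notifier(mr, &n);

Notifications whose iova falls outside [n.start, n.end] are then
skipped for this notifier.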

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/vfio/common.c      | 7 ++++++-
 include/exec/memory.h | 3 +++
 memory.c              | 4 +++-
 3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 801578b..6f648da 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -455,6 +455,10 @@ static void vfio_listener_region_add(MemoryListener *listener,
         giommu->container = container;
         giommu->n.notify = vfio_iommu_map_notify;
         giommu->n.notifier_flags = IOMMU_NOTIFIER_ALL;
+        giommu->n.start = section->offset_within_region;
+        llend = int128_add(int128_make64(giommu->n.start), section->size);
+        llend = int128_sub(llend, int128_one());
+        giommu->n.end = int128_get64(llend);
         QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
 
         memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
@@ -525,7 +529,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
         VFIOGuestIOMMU *giommu;
 
         QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
-            if (giommu->iommu == section->mr) {
+            if (giommu->iommu == section->mr &&
+                giommu->n.start == section->offset_within_region) {
                 memory_region_unregister_iommu_notifier(giommu->iommu,
                                                         &giommu->n);
                 QLIST_REMOVE(giommu, giommu_next);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index bec9756..7649e74 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -84,6 +84,9 @@ typedef enum {
 struct IOMMUNotifier {
     void (*notify)(struct IOMMUNotifier *notifier, IOMMUTLBEntry *data);
     IOMMUNotifierFlag notifier_flags;
+    /* Notify for address space range start <= addr <= end */
+    hwaddr start;
+    hwaddr end;
     QLIST_ENTRY(IOMMUNotifier) node;
 };
 typedef struct IOMMUNotifier IOMMUNotifier;
diff --git a/memory.c b/memory.c
index 2bfc37f..e88bb54 100644
--- a/memory.c
+++ b/memory.c
@@ -1671,7 +1671,9 @@ void memory_region_notify_iommu(MemoryRegion *mr,
     }
 
     QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
-        if (iommu_notifier->notifier_flags & request_flags) {
+        if (iommu_notifier->notifier_flags & request_flags &&
+            iommu_notifier->start <= entry.iova &&
+            iommu_notifier->end >= entry.iova) {
             iommu_notifier->notify(iommu_notifier, &entry);
         }
     }
-- 
2.7.4


* [Qemu-devel] [PATCH RFC v3 08/14] memory: provide iommu_replay_all()
  2017-01-13  3:06 [Qemu-devel] [PATCH RFC v3 00/14] VT-d: vfio enablement and misc enhancements Peter Xu
                   ` (6 preceding siblings ...)
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 07/14] memory: add section range info for IOMMU notifier Peter Xu
@ 2017-01-13  3:06 ` Peter Xu
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 09/14] memory: introduce memory_region_notify_one() Peter Xu
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 93+ messages in thread
From: Peter Xu @ 2017-01-13  3:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

This is an "global" version of exising memory_region_iommu_replay() - we
announce the translations to all the registered notifiers, instead of a
specific one.
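
As a usage sketch (the real caller appears later in this series, when
the vIOMMU replays mappings after a context invalidation):

    /* e.g., after the guest switched or invalidated an IOMMU domain */
    memory_region_iommu_replay_all(&vtd_as->iommu);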

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 8 ++++++++
 memory.c              | 9 +++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 7649e74..2233f99 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -694,6 +694,14 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
                                 bool is_write);
 
 /**
+ * memory_region_iommu_replay_all: replay existing IOMMU translations
+ * to all the notifiers registered.
+ *
+ * @mr: the memory region to observe
+ */
+void memory_region_iommu_replay_all(MemoryRegion *mr);
+
+/**
  * memory_region_unregister_iommu_notifier: unregister a notifier for
  * changes to IOMMU translation entries.
  *
diff --git a/memory.c b/memory.c
index e88bb54..df62bd1 100644
--- a/memory.c
+++ b/memory.c
@@ -1645,6 +1645,15 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
     }
 }
 
+void memory_region_iommu_replay_all(MemoryRegion *mr)
+{
+    IOMMUNotifier *notifier;
+
+    QLIST_FOREACH(notifier, &mr->iommu_notify, node) {
+        memory_region_iommu_replay(mr, notifier, false);
+    }
+}
+
 void memory_region_unregister_iommu_notifier(MemoryRegion *mr,
                                              IOMMUNotifier *n)
 {
-- 
2.7.4


* [Qemu-devel] [PATCH RFC v3 09/14] memory: introduce memory_region_notify_one()
  2017-01-13  3:06 [Qemu-devel] [PATCH RFC v3 00/14] VT-d: vfio enablement and misc enhancements Peter Xu
                   ` (7 preceding siblings ...)
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 08/14] memory: provide iommu_replay_all() Peter Xu
@ 2017-01-13  3:06 ` Peter Xu
  2017-01-13  7:58   ` Jason Wang
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 10/14] memory: add MemoryRegionIOMMUOps.replay() callback Peter Xu
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-13  3:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

Generalize the notify logic in memory_region_notify_iommu() into a
helper that notifies a single notifier. This can then be reused in
customized replay() functions for IOMMUs.
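
For example, a customized replay() could walk the guest page table and
feed each valid mapping to a single notifier, roughly like this (an
illustrative sketch, not code from this series; iova, gpa and n are
placeholders):

    IOMMUTLBEntry entry = {
        .target_as = &address_space_memory,
        .iova = iova,               /* guest IOVA being mapped */
        .translated_addr = gpa,     /* target guest-physical address */
        .addr_mask = 0xfff,         /* 4K page */
        .perm = IOMMU_RW,
    };

    memory_region_notify_one(n, &entry);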

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 15 +++++++++++++++
 memory.c              | 29 ++++++++++++++++++-----------
 2 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 2233f99..f367e54 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -669,6 +669,21 @@ void memory_region_notify_iommu(MemoryRegion *mr,
                                 IOMMUTLBEntry entry);
 
 /**
+ * memory_region_notify_one: notify a change in an IOMMU translation
+ *                           entry to a single notifier
+ *
+ * This works just like memory_region_notify_iommu(), but it only
+ * notifies a specific notifier, not all of them.
+ *
+ * @notifier: the notifier to be notified
+ * @entry: the new entry in the IOMMU translation table.  The entry
+ *         replaces all old entries for the same virtual I/O address range.
+ *         Deleted entries have .@perm == 0.
+ */
+void memory_region_notify_one(IOMMUNotifier *notifier,
+                              IOMMUTLBEntry *entry);
+
+/**
  * memory_region_register_iommu_notifier: register a notifier for changes to
  * IOMMU translation entries.
  *
diff --git a/memory.c b/memory.c
index df62bd1..6e4c872 100644
--- a/memory.c
+++ b/memory.c
@@ -1665,26 +1665,33 @@ void memory_region_unregister_iommu_notifier(MemoryRegion *mr,
     memory_region_update_iommu_notify_flags(mr);
 }
 
-void memory_region_notify_iommu(MemoryRegion *mr,
-                                IOMMUTLBEntry entry)
+void memory_region_notify_one(IOMMUNotifier *notifier,
+                              IOMMUTLBEntry *entry)
 {
-    IOMMUNotifier *iommu_notifier;
     IOMMUNotifierFlag request_flags;
 
-    assert(memory_region_is_iommu(mr));
-
-    if (entry.perm & IOMMU_RW) {
+    if (entry->perm & IOMMU_RW) {
         request_flags = IOMMU_NOTIFIER_MAP;
     } else {
         request_flags = IOMMU_NOTIFIER_UNMAP;
     }
 
+    if (notifier->notifier_flags & request_flags &&
+        notifier->start <= entry->iova &&
+        notifier->end >= entry->iova) {
+        notifier->notify(notifier, entry);
+    }
+}
+
+void memory_region_notify_iommu(MemoryRegion *mr,
+                                IOMMUTLBEntry entry)
+{
+    IOMMUNotifier *iommu_notifier;
+
+    assert(memory_region_is_iommu(mr));
+
     QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
-        if (iommu_notifier->notifier_flags & request_flags &&
-            iommu_notifier->start <= entry.iova &&
-            iommu_notifier->end >= entry.iova) {
-            iommu_notifier->notify(iommu_notifier, &entry);
-        }
+        memory_region_notify_one(iommu_notifier, &entry);
     }
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [Qemu-devel] [PATCH RFC v3 10/14] memory: add MemoryRegionIOMMUOps.replay() callback
  2017-01-13  3:06 [Qemu-devel] [PATCH RFC v3 00/14] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (8 preceding siblings ...)
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 09/14] memory: introduce memory_region_notify_one() Peter Xu
@ 2017-01-13  3:06 ` Peter Xu
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 11/14] intel_iommu: provide its own replay() callback Peter Xu
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 93+ messages in thread
From: Peter Xu @ 2017-01-13  3:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

Originally we have a single memory_region_iommu_replay() function, which
implements the default behavior: replay the translations of the whole
IOMMU region. However, on some platforms like x86, we may want our own
replay logic for IOMMU regions. This patch adds one more hook to
MemoryRegionIOMMUOps for the callback; when set, it overrides the
default.
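
A minimal sketch of how an IOMMU model would opt in (names below are
illustrative; VT-d wires this up for real in the next patch):

    /* Custom walk over only the valid mappings, notifying @n for each */
    static void my_iommu_replay(MemoryRegion *iommu, IOMMUNotifier *n)
    {
        /* platform-specific page table walk goes here */
    }

    /* ... then at init time, with "s" as the model's state struct: */
    /* s->iommu_ops.replay = my_iommu_replay; */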

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 2 ++
 memory.c              | 6 ++++++
 2 files changed, 8 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index f367e54..cff6958 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -181,6 +181,8 @@ struct MemoryRegionIOMMUOps {
     void (*notify_flag_changed)(MemoryRegion *iommu,
                                 IOMMUNotifierFlag old_flags,
                                 IOMMUNotifierFlag new_flags);
+    /* Set this up to provide customized IOMMU replay function */
+    void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
 };
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
diff --git a/memory.c b/memory.c
index 6e4c872..609ac67 100644
--- a/memory.c
+++ b/memory.c
@@ -1629,6 +1629,12 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
     hwaddr addr, granularity;
     IOMMUTLBEntry iotlb;
 
+    /* If the IOMMU has its own replay callback, override */
+    if (mr->iommu_ops->replay) {
+        mr->iommu_ops->replay(mr, n);
+        return;
+    }
+
     granularity = memory_region_iommu_get_min_page_size(mr);
 
     for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [Qemu-devel] [PATCH RFC v3 11/14] intel_iommu: provide its own replay() callback
  2017-01-13  3:06 [Qemu-devel] [PATCH RFC v3 00/14] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (9 preceding siblings ...)
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 10/14] memory: add MemoryRegionIOMMUOps.replay() callback Peter Xu
@ 2017-01-13  3:06 ` Peter Xu
  2017-01-13  9:26   ` Jason Wang
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 12/14] intel_iommu: do replay when context invalidate Peter Xu
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-13  3:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

The default replay() doesn't work for VT-d, since VT-d has a huge
default memory region which covers the address range 0-(2^64-1). This
will normally bring a dead loop when the guest starts.

The solution is simple - we don't walk over all the regions. Instead, we
jump over regions whose page directories are found to be empty. This
greatly reduces the time needed to walk the whole region.

To achieve this, we provide a page walk helper which invokes a
corresponding hook function whenever a page of interest is found.
vtd_page_walk_level() is the core logic for the page walking. Its
interface is designed to suit further use cases, e.g., invalidating a
range of addresses.
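
For illustration only (this counting hook is hypothetical, not part of
the patch), a caller could count the mappings present in a context
entry like this:

    static int vtd_count_maps_hook(IOMMUTLBEntry *entry, void *private)
    {
        (*(uint64_t *)private)++;  /* one hit per detected mapping */
        return 0;                  /* a negative return stops the walk */
    }

    /* uint64_t count = 0;
     * vtd_page_walk(&ce, 0, ~0, vtd_count_maps_hook, &count); */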

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 212 ++++++++++++++++++++++++++++++++++++++++++++++++--
 hw/i386/trace-events  |   8 ++
 include/exec/memory.h |   2 +
 3 files changed, 217 insertions(+), 5 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index b4019d0..59bf683 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -600,6 +600,22 @@ static inline uint32_t vtd_get_agaw_from_context_entry(VTDContextEntry *ce)
     return 30 + (ce->hi & VTD_CONTEXT_ENTRY_AW) * 9;
 }
 
+static inline uint64_t vtd_iova_limit(VTDContextEntry *ce)
+{
+    uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
+    return 1ULL << MIN(ce_agaw, VTD_MGAW);
+}
+
+/* Return true if IOVA passes range check, otherwise false. */
+static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce)
+{
+    /*
+     * Check if @iova is above 2^X-1, where X is the minimum of MGAW
+     * in CAP_REG and AW in context-entry.
+     */
+    return !(iova & ~(vtd_iova_limit(ce) - 1));
+}
+
 static const uint64_t vtd_paging_entry_rsvd_field[] = {
     [0] = ~0ULL,
     /* For not large page */
@@ -635,13 +651,9 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
     uint32_t level = vtd_get_level_from_context_entry(ce);
     uint32_t offset;
     uint64_t slpte;
-    uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
     uint64_t access_right_check;
 
-    /* Check if @iova is above 2^X-1, where X is the minimum of MGAW
-     * in CAP_REG and AW in context-entry.
-     */
-    if (iova & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
+    if (!vtd_iova_range_check(iova, ce)) {
         error_report("IOVA 0x%"PRIx64 " exceeds limits", iova);
         return -VTD_FR_ADDR_BEYOND_MGAW;
     }
@@ -689,6 +701,166 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
     }
 }
 
+typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
+
+/**
+ * vtd_page_walk_level - walk over specific level for IOVA range
+ *
+ * @addr: base GPA addr to start the walk
+ * @start: IOVA range start address
+ * @end: IOVA range end address (start <= addr < end)
+ * @hook_fn: hook func to be called for each detected page
+ * @private: private data to be passed into hook func
+ * @read: whether parent level has read permission
+ * @write: whether parent level has write permission
+ * @skipped: accumulated skipped ranges
+ * @notify_unmap: whether we should notify invalid entries
+ */
+static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
+                               uint64_t end, vtd_page_walk_hook hook_fn,
+                               void *private, uint32_t level,
+                               bool read, bool write, uint64_t *skipped,
+                               bool notify_unmap)
+{
+    bool read_cur, write_cur, entry_valid;
+    uint32_t offset;
+    uint64_t slpte;
+    uint64_t subpage_size, subpage_mask;
+    IOMMUTLBEntry entry;
+    uint64_t iova = start;
+    uint64_t iova_next;
+    uint64_t skipped_local = 0;
+    int ret = 0;
+
+    trace_vtd_page_walk_level(addr, level, start, end);
+
+    subpage_size = 1ULL << vtd_slpt_level_shift(level);
+    subpage_mask = vtd_slpt_level_page_mask(level);
+
+    while (iova < end) {
+        iova_next = (iova & subpage_mask) + subpage_size;
+
+        offset = vtd_iova_level_offset(iova, level);
+        slpte = vtd_get_slpte(addr, offset);
+
+        /*
+         * When one of the following cases happens, we assume the whole
+         * range is invalid:
+         *
+         * 1. failed to read the page block
+         * 2. the reserved area is non-zero
+         * 3. both the read & write flags are unset
+         */
+
+        if (slpte == (uint64_t)-1) {
+            trace_vtd_page_walk_skip_read(iova, iova_next);
+            skipped_local++;
+            goto next;
+        }
+
+        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
+            trace_vtd_page_walk_skip_reserve(iova, iova_next);
+            skipped_local++;
+            goto next;
+        }
+
+        /* Permissions are stacked with parents' */
+        read_cur = read && (slpte & VTD_SL_R);
+        write_cur = write && (slpte & VTD_SL_W);
+
+        /*
+         * As long as we have either read or write permission, this is
+         * a valid entry. The rule works for both pages and page tables.
+         */
+        entry_valid = read_cur | write_cur;
+
+        if (vtd_is_last_slpte(slpte, level)) {
+            entry.target_as = &address_space_memory;
+            entry.iova = iova & subpage_mask;
+            /*
+             * The addr might be meaningless if (!read_cur &&
+             * !write_cur), but then this field is meaningless
+             * anyway in that case, so let's share the code to
+             * generate the IOTLBs no matter it's a MAP or UNMAP
+             */
+            entry.translated_addr = vtd_get_slpte_addr(slpte);
+            entry.addr_mask = ~subpage_mask;
+            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
+            if (!entry_valid && !notify_unmap) {
+                trace_vtd_page_walk_skip_perm(iova, iova_next);
+                skipped_local++;
+                goto next;
+            }
+            trace_vtd_page_walk_one(level, entry.iova, entry.translated_addr,
+                                    entry.addr_mask, entry.perm);
+            if (hook_fn) {
+                ret = hook_fn(&entry, private);
+                if (ret < 0) {
+                    error_report("Detected error in page walk hook "
+                                 "function, stop walk.");
+                    return ret;
+                }
+            }
+        } else {
+            if (!entry_valid) {
+                trace_vtd_page_walk_skip_perm(iova, iova_next);
+                skipped_local++;
+                goto next;
+            }
+            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte), iova,
+                                      MIN(iova_next, end), hook_fn, private,
+                                      level - 1, read_cur, write_cur,
+                                      &skipped_local, notify_unmap);
+            if (ret < 0) {
+                error_report("Detected page walk error on addr 0x%"PRIx64
+                             " level %"PRIu32", stop walk.", addr, level - 1);
+                return ret;
+            }
+        }
+
+next:
+        iova = iova_next;
+    }
+
+    if (skipped) {
+        *skipped += skipped_local;
+    }
+
+    return 0;
+}
+
+/**
+ * vtd_page_walk - walk specific IOVA range, and call the hook
+ *
+ * @ce: context entry to walk upon
+ * @start: IOVA address to start the walk
+ * @end: IOVA range end address (start <= addr < end)
+ * @hook_fn: the hook to be called for each detected area
+ * @private: private data for the hook function
+ */
+static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
+                         vtd_page_walk_hook hook_fn, void *private)
+{
+    dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
+    uint32_t level = vtd_get_level_from_context_entry(ce);
+
+    if (!vtd_iova_range_check(start, ce)) {
+        error_report("IOVA start 0x%"PRIx64 " end 0x%"PRIx64" exceeds limits",
+                     start, end);
+        return -VTD_FR_ADDR_BEYOND_MGAW;
+    }
+
+    if (!vtd_iova_range_check(end, ce)) {
+        /* Fix end so that it reaches the maximum */
+        end = vtd_iova_limit(ce);
+    }
+
+    trace_vtd_page_walk(ce->hi, ce->lo, start, end);
+
+    return vtd_page_walk_level(addr, start, end, hook_fn, private,
+                               level, true, true, NULL, false);
+}
+
 /* Map a device to its corresponding domain (context-entry) */
 static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
                                     uint8_t devfn, VTDContextEntry *ce)
@@ -2426,6 +2598,35 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
     return vtd_dev_as;
 }
 
+static int vtd_replay_hook(IOMMUTLBEntry *entry, void *private)
+{
+    memory_region_notify_one((IOMMUNotifier *)private, entry);
+    return 0;
+}
+
+static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
+{
+    VTDAddressSpace *vtd_as = container_of(mr, VTDAddressSpace, iommu);
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    uint8_t bus_n = pci_bus_num(vtd_as->bus);
+    VTDContextEntry ce;
+
+    if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
+        /*
+         * Scanned a valid context entry, walk over the pages and
+         * notify when needed.
+         */
+        trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
+                                  PCI_FUNC(vtd_as->devfn), ce.hi, ce.lo);
+        vtd_page_walk(&ce, 0, ~0, vtd_replay_hook, (void *)n);
+    } else {
+        trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
+                                    PCI_FUNC(vtd_as->devfn));
+    }
+
+    return;
+}
+
 /* Do the initialization. It will also be called when reset, so pay
  * attention when adding new initialization stuff.
  */
@@ -2440,6 +2641,7 @@ static void vtd_init(IntelIOMMUState *s)
 
     s->iommu_ops.translate = vtd_iommu_translate;
     s->iommu_ops.notify_flag_changed = vtd_iommu_notify_flag_changed;
+    s->iommu_ops.replay = vtd_iommu_replay;
     s->root = 0;
     s->root_extended = false;
     s->dmar_enabled = false;
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index eba9bf2..92d210d 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -29,6 +29,14 @@ vtd_iotlb_page_update(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t doma
 vtd_iotlb_cc_hit(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen) "IOTLB context hit bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32
 vtd_iotlb_cc_update(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen1, uint32_t gen2) "IOTLB context update bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32" -> gen %"PRIu32
 vtd_iotlb_reset(const char *reason) "IOTLB reset (reason: %s)"
+vtd_replay_ce_valid(uint8_t bus, uint8_t dev, uint8_t fn, uint64_t hi, uint64_t lo) "replay valid context device %02"PRIx8":%02"PRIx8".%02"PRIx8" hi 0x%"PRIx64" lo 0x%"PRIx64
+vtd_replay_ce_invalid(uint8_t bus, uint8_t dev, uint8_t fn) "replay invalid context device %02"PRIx8":%02"PRIx8".%02"PRIx8
+vtd_page_walk(uint64_t hi, uint64_t lo, uint64_t start, uint64_t end) "Page walk for ce (0x%"PRIx64", 0x%"PRIx64") iova range 0x%"PRIx64" - 0x%"PRIx64
+vtd_page_walk_level(uint64_t addr, uint32_t level, uint64_t start, uint64_t end) "Page walk (base=0x%"PRIx64", level=%"PRIu32") iova range 0x%"PRIx64" - 0x%"PRIx64
+vtd_page_walk_one(uint32_t level, uint64_t iova, uint64_t gpa, uint64_t mask, int perm) "Page walk detected map level 0x%"PRIx32" iova 0x%"PRIx64" -> gpa 0x%"PRIx64" mask 0x%"PRIx64" perm %d"
+vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
+vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
+vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
 
 # hw/i386/amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
diff --git a/include/exec/memory.h b/include/exec/memory.h
index cff6958..49664f4 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -59,6 +59,8 @@ typedef enum {
     IOMMU_RW   = 3,
 } IOMMUAccessFlags;
 
+#define IOMMU_ACCESS_FLAG(r, w) (((r) ? IOMMU_RO : 0) | ((w) ? IOMMU_WO : 0))
+
 struct IOMMUTLBEntry {
     AddressSpace    *target_as;
     hwaddr           iova;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [Qemu-devel] [PATCH RFC v3 12/14] intel_iommu: do replay when context invalidate
  2017-01-13  3:06 [Qemu-devel] [PATCH RFC v3 00/14] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (10 preceding siblings ...)
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 11/14] intel_iommu: provide its own replay() callback Peter Xu
@ 2017-01-13  3:06 ` Peter Xu
  2017-01-16  5:53   ` Jason Wang
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 13/14] intel_iommu: allow dynamic switch of IOMMU region Peter Xu
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-13  3:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

Before this patch we only invalidated the context cache when we received
context entry invalidations. However, it's possible that the invalidation
also implies a domain switch (only if cache-mode is enabled for the
vIOMMU). In that case we need to notify all the registered components
about the new mappings.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 59bf683..fd75112 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1162,6 +1162,7 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
                 trace_vtd_inv_desc_cc_device(bus_n, (devfn_it >> 3) & 0x1f,
                                              devfn_it & 3);
                 vtd_as->context_cache_entry.context_cache_gen = 0;
+                memory_region_iommu_replay_all(&vtd_as->iommu);
             }
         }
     }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [Qemu-devel] [PATCH RFC v3 13/14] intel_iommu: allow dynamic switch of IOMMU region
  2017-01-13  3:06 [Qemu-devel] [PATCH RFC v3 00/14] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (11 preceding siblings ...)
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 12/14] intel_iommu: do replay when context invalidate Peter Xu
@ 2017-01-13  3:06 ` Peter Xu
  2017-01-16  6:20   ` Jason Wang
  2017-01-16 19:53   ` Alex Williamson
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio devices Peter Xu
  2017-01-13 15:58 ` [Qemu-devel] [PATCH RFC v3 00/14] VT-d: vfio enablement and misc enhances Michael S. Tsirkin
  14 siblings, 2 replies; 93+ messages in thread
From: Peter Xu @ 2017-01-13  3:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

This is preparation work to finally enable dynamic ON/OFF switching of
VT-d protection. The old VT-d code uses a static IOMMU address space,
and that won't satisfy the vfio-pci device listeners.

Let me explain.

vfio-pci devices depend on the memory region listener and IOMMU replay
mechanism to make sure the device mapping is coherent with the guest
even if there are domain switches. And there are two kinds of domain
switches:

  (1) switch from domain A -> B
  (2) switch from domain A -> no domain (e.g., turn DMAR off)

Case (1) is handled by the context entry invalidation handling, via the
VT-d replay logic. What the replay function should do here is replay
the existing page mappings in domain B.

However, for case (2) we don't want to replay any domain mappings - we
just need the default GPA->HPA mappings (the address_space_memory
mapping). This patch handles case (2) by building up those mappings
automatically, leveraging the vfio-pci memory listeners.

Another important thing this patch does is separate IR (Interrupt
Remapping) from DMAR (DMA Remapping). The IR region should not depend
on the DMAR region (as it did before this patch). It should be a
standalone region that can be activated without DMAR (which is common
behavior for the Linux kernel - by default it enables IR while leaving
DMAR disabled).

Signed-off-by: Peter Xu <peterx@redhat.com>
---
v3:
- fix another trivial style issue patchew reported but I missed in v2

v2:
- fix issues reported by patchew
- switch domain by enable/disable memory regions [David]
- provide vtd_switch_address_space{_all}()
- provide a better comment on the memory regions

Test done: with the intel-iommu device, boot a VM with and without the
"intel_iommu=on" guest kernel parameter.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c         | 78 ++++++++++++++++++++++++++++++++++++++++---
 hw/i386/trace-events          |  2 +-
 include/hw/i386/intel_iommu.h |  2 ++
 3 files changed, 77 insertions(+), 5 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index fd75112..2596f11 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1343,9 +1343,49 @@ static void vtd_handle_gcmd_sirtp(IntelIOMMUState *s)
     vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_IRTPS);
 }
 
+static void vtd_switch_address_space(VTDAddressSpace *as, bool iommu_enabled)
+{
+    assert(as);
+
+    trace_vtd_switch_address_space(pci_bus_num(as->bus),
+                                   VTD_PCI_SLOT(as->devfn),
+                                   VTD_PCI_FUNC(as->devfn),
+                                   iommu_enabled);
+
+    /* Turn one region off first, then turn the other on */
+    if (iommu_enabled) {
+        memory_region_set_enabled(&as->sys_alias, false);
+        memory_region_set_enabled(&as->iommu, true);
+    } else {
+        memory_region_set_enabled(&as->iommu, false);
+        memory_region_set_enabled(&as->sys_alias, true);
+    }
+}
+
+static void vtd_switch_address_space_all(IntelIOMMUState *s, bool enabled)
+{
+    GHashTableIter iter;
+    VTDBus *vtd_bus;
+    int i;
+
+    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
+    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
+        for (i = 0; i < X86_IOMMU_PCI_DEVFN_MAX; i++) {
+            if (!vtd_bus->dev_as[i]) {
+                continue;
+            }
+            vtd_switch_address_space(vtd_bus->dev_as[i], enabled);
+        }
+    }
+}
+
 /* Handle Translation Enable/Disable */
 static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
 {
+    if (s->dmar_enabled == en) {
+        return;
+    }
+
     VTD_DPRINTF(CSR, "Translation Enable %s", (en ? "on" : "off"));
 
     if (en) {
@@ -1360,6 +1400,8 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
         /* Ok - report back to driver */
         vtd_set_clear_mask_long(s, DMAR_GSTS_REG, VTD_GSTS_TES, 0);
     }
+
+    vtd_switch_address_space_all(s, en);
 }
 
 /* Handle Interrupt Remap Enable/Disable */
@@ -2586,15 +2628,43 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
         vtd_dev_as->devfn = (uint8_t)devfn;
         vtd_dev_as->iommu_state = s;
         vtd_dev_as->context_cache_entry.context_cache_gen = 0;
+
+        /*
+         * Memory region relationships look like this (address ranges
+         * show only the lower 32 bits for brevity):
+         *
+         * |-----------------+-------------------+----------|
+         * | Name            | Address range     | Priority |
+         * |-----------------+-------------------+----------+
+         * | vtd_root        | 00000000-ffffffff |        0 |
+         * |  intel_iommu    | 00000000-ffffffff |        1 |
+         * |  vtd_sys_alias  | 00000000-ffffffff |        1 |
+         * |  intel_iommu_ir | fee00000-feefffff |       64 |
+         * |-----------------+-------------------+----------|
+         *
+         * We enable/disable DMAR by switching enablement for
+         * vtd_sys_alias and intel_iommu regions. IR region is always
+         * enabled.
+         */
         memory_region_init_iommu(&vtd_dev_as->iommu, OBJECT(s),
                                  &s->iommu_ops, "intel_iommu", UINT64_MAX);
+        memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
+                                 "vtd_sys_alias", get_system_memory(),
+                                 0, memory_region_size(get_system_memory()));
         memory_region_init_io(&vtd_dev_as->iommu_ir, OBJECT(s),
                               &vtd_mem_ir_ops, s, "intel_iommu_ir",
                               VTD_INTERRUPT_ADDR_SIZE);
-        memory_region_add_subregion(&vtd_dev_as->iommu, VTD_INTERRUPT_ADDR_FIRST,
-                                    &vtd_dev_as->iommu_ir);
-        address_space_init(&vtd_dev_as->as,
-                           &vtd_dev_as->iommu, name);
+        memory_region_init(&vtd_dev_as->root, OBJECT(s),
+                           "vtd_root", UINT64_MAX);
+        memory_region_add_subregion_overlap(&vtd_dev_as->root,
+                                            VTD_INTERRUPT_ADDR_FIRST,
+                                            &vtd_dev_as->iommu_ir, 64);
+        address_space_init(&vtd_dev_as->as, &vtd_dev_as->root, name);
+        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
+                                            &vtd_dev_as->sys_alias, 1);
+        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
+                                            &vtd_dev_as->iommu, 1);
+        vtd_switch_address_space(vtd_dev_as, s->dmar_enabled);
     }
     return vtd_dev_as;
 }
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 92d210d..beaef61 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -11,7 +11,6 @@ xen_pv_mmio_write(uint64_t addr) "WARNING: write to Xen PV Device MMIO space (ad
 x86_iommu_iec_notify(bool global, uint32_t index, uint32_t mask) "Notify IEC invalidation: global=%d index=%" PRIu32 " mask=%" PRIu32
 
 # hw/i386/intel_iommu.c
-vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
 vtd_inv_desc(const char *type, uint64_t hi, uint64_t lo) "invalidate desc type %s high 0x%"PRIx64" low 0x%"PRIx64
 vtd_inv_desc_cc_domain(uint16_t domain) "context invalidate domain 0x%"PRIx16
 vtd_inv_desc_cc_global(void) "context invalidate globally"
@@ -37,6 +36,7 @@ vtd_page_walk_one(uint32_t level, uint64_t iova, uint64_t gpa, uint64_t mask, in
 vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
 vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
 vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
+vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
 
 # hw/i386/amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 749eef9..9c3f6c0 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -83,6 +83,8 @@ struct VTDAddressSpace {
     uint8_t devfn;
     AddressSpace as;
     MemoryRegion iommu;
+    MemoryRegion root;
+    MemoryRegion sys_alias;
     MemoryRegion iommu_ir;      /* Interrupt region: 0xfeeXXXXX */
     IntelIOMMUState *iommu_state;
     VTDContextCacheEntry context_cache_entry;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio devices
  2017-01-13  3:06 [Qemu-devel] [PATCH RFC v3 00/14] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (12 preceding siblings ...)
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 13/14] intel_iommu: allow dynamic switch of IOMMU region Peter Xu
@ 2017-01-13  3:06 ` Peter Xu
  2017-01-16  6:30   ` Jason Wang
  2017-01-13 15:58 ` [Qemu-devel] [PATCH RFC v3 00/14] VT-d: vfio enablement and misc enhances Michael S. Tsirkin
  14 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-13  3:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, peterx,
	alex.williamson, bd.aviv

This patch is based on Aviv Ben-David (<bd.aviv@gmail.com>)'s patch
upstream:

  "IOMMU: enable intel_iommu map and unmap notifiers"
  https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01453.html

However, I removed/fixed some content and added my own code.

Instead of calling translate() on every page for IOTLB invalidations
(which is slower), we walk the pages when needed and notify via a hook
function.

This patch enables vfio devices for VT-d emulation.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c         | 68 +++++++++++++++++++++++++++++++++++++------
 include/hw/i386/intel_iommu.h |  8 +++++
 2 files changed, 67 insertions(+), 9 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 2596f11..104200b 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -839,7 +839,8 @@ next:
  * @private: private data for the hook function
  */
 static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
-                         vtd_page_walk_hook hook_fn, void *private)
+                         vtd_page_walk_hook hook_fn, void *private,
+                         bool notify_unmap)
 {
     dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
     uint32_t level = vtd_get_level_from_context_entry(ce);
@@ -858,7 +859,7 @@ static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
     trace_vtd_page_walk(ce->hi, ce->lo, start, end);
 
     return vtd_page_walk_level(addr, start, end, hook_fn, private,
-                               level, true, true, NULL, false);
+                               level, true, true, NULL, notify_unmap);
 }
 
 /* Map a device to its corresponding domain (context-entry) */
@@ -1212,6 +1213,34 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
                                 &domain_id);
 }
 
+static int vtd_page_invalidate_notify_hook(IOMMUTLBEntry *entry,
+                                           void *private)
+{
+    memory_region_notify_iommu((MemoryRegion *)private, *entry);
+    return 0;
+}
+
+static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
+                                           uint16_t domain_id, hwaddr addr,
+                                           uint8_t am)
+{
+    IntelIOMMUNotifierNode *node;
+    VTDContextEntry ce;
+    int ret;
+
+    QLIST_FOREACH(node, &(s->notifiers_list), next) {
+        VTDAddressSpace *vtd_as = node->vtd_as;
+        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
+                                       vtd_as->devfn, &ce);
+        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
+            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
+                          vtd_page_invalidate_notify_hook,
+                          (void *)&vtd_as->iommu, true);
+        }
+    }
+}
+
+
 static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
                                       hwaddr addr, uint8_t am)
 {
@@ -1222,6 +1251,7 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
     info.addr = addr;
     info.mask = ~((1 << am) - 1);
     g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
+    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
 }
 
 /* Flush IOTLB
@@ -2244,15 +2274,34 @@ static void vtd_iommu_notify_flag_changed(MemoryRegion *iommu,
                                           IOMMUNotifierFlag new)
 {
     VTDAddressSpace *vtd_as = container_of(iommu, VTDAddressSpace, iommu);
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    IntelIOMMUNotifierNode *node = NULL;
+    IntelIOMMUNotifierNode *next_node = NULL;
 
-    if (new & IOMMU_NOTIFIER_MAP) {
-        error_report("Device at bus %s addr %02x.%d requires iommu "
-                     "notifier which is currently not supported by "
-                     "intel-iommu emulation",
-                     vtd_as->bus->qbus.name, PCI_SLOT(vtd_as->devfn),
-                     PCI_FUNC(vtd_as->devfn));
+    if (!s->cache_mode_enabled && new & IOMMU_NOTIFIER_MAP) {
+        error_report("We need to set cache_mode=1 for intel-iommu to enable "
+                     "device assignment with IOMMU protection.");
         exit(1);
     }
+
+    /* Add a new node if no mapping existed before this call */
+    if (old == IOMMU_NOTIFIER_NONE) {
+        node = g_malloc0(sizeof(*node));
+        node->vtd_as = vtd_as;
+        QLIST_INSERT_HEAD(&s->notifiers_list, node, next);
+        return;
+    }
+
+    /* update notifier node with new flags */
+    QLIST_FOREACH_SAFE(node, &s->notifiers_list, next, next_node) {
+        if (node->vtd_as == vtd_as) {
+            if (new == IOMMU_NOTIFIER_NONE) {
+                QLIST_REMOVE(node, next);
+                g_free(node);
+            }
+            return;
+        }
+    }
 }
 
 static const VMStateDescription vtd_vmstate = {
@@ -2689,7 +2738,7 @@ static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
          */
         trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
                                   PCI_FUNC(vtd_as->devfn), ce.hi, ce.lo);
-        vtd_page_walk(&ce, 0, ~0, vtd_replay_hook, (void *)n);
+        vtd_page_walk(&ce, 0, ~0, vtd_replay_hook, (void *)n, false);
     } else {
         trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
                                     PCI_FUNC(vtd_as->devfn));
@@ -2871,6 +2920,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
         return;
     }
 
+    QLIST_INIT(&s->notifiers_list);
     memset(s->vtd_as_by_bus_num, 0, sizeof(s->vtd_as_by_bus_num));
     memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
                           "intel_iommu", DMAR_REG_SIZE);
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 9c3f6c0..832cfc9 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -63,6 +63,7 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
 typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
 typedef struct VTDIrq VTDIrq;
 typedef struct VTD_MSIMessage VTD_MSIMessage;
+typedef struct IntelIOMMUNotifierNode IntelIOMMUNotifierNode;
 
 /* Context-Entry */
 struct VTDContextEntry {
@@ -249,6 +250,11 @@ struct VTD_MSIMessage {
 /* When IR is enabled, all MSI/MSI-X data bits should be zero */
 #define VTD_IR_MSI_DATA          (0)
 
+struct IntelIOMMUNotifierNode {
+    VTDAddressSpace *vtd_as;
+    QLIST_ENTRY(IntelIOMMUNotifierNode) next;
+};
+
 /* The iommu (DMAR) device state struct */
 struct IntelIOMMUState {
     X86IOMMUState x86_iommu;
@@ -286,6 +292,8 @@ struct IntelIOMMUState {
     MemoryRegionIOMMUOps iommu_ops;
     GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
     VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
+    /* list of registered notifiers */
+    QLIST_HEAD(, IntelIOMMUNotifierNode) notifiers_list;
 
     /* interrupt remapping */
     bool intr_enabled;              /* Whether guest enabled IR */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 04/14] intel_iommu: fix trace for inv desc handling
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 04/14] intel_iommu: fix trace for inv desc handling Peter Xu
@ 2017-01-13  7:46   ` Jason Wang
  2017-01-13  9:13     ` Peter Xu
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Wang @ 2017-01-13  7:46 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, alex.williamson, bd.aviv



On 2017-01-13 11:06, Peter Xu wrote:
> VT-d code is still using the static DEBUG_INTEL_IOMMU macro. That's
> not good, and we should end the days when we need to recompile the
> code before getting useful debugging information for VT-d. Time to
> switch to the trace system.
>
> This is the first patch to do it.
>
> Generally, my rule is:
>
> - for the old GENERAL-typed messages, I use error_report() directly where
>    it applies. Those are things that shouldn't happen, and we should print
>    those errors in all cases, even without enabling debug and tracing.

Looks like some of these are guest-triggerable. If yes, let's try not to
use error_report(), to avoid being flooded.

Thanks

>
> - for the non-GENERAL-typed messages, remove those VTD_DPRINTF()s that
>    look rarely used, and convert the remaining lines into trace_*().
>
> - for useless DPRINTFs, I removed them.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 07/14] memory: add section range info for IOMMU notifier
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 07/14] memory: add section range info for IOMMU notifier Peter Xu
@ 2017-01-13  7:55   ` Jason Wang
  2017-01-13  9:23     ` Peter Xu
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Wang @ 2017-01-13  7:55 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, alex.williamson, bd.aviv



On 2017-01-13 11:06, Peter Xu wrote:
> In this patch, IOMMUNotifier.{start|end} are introduced to store section
> information for a specific notifier. When notification occurs, we not
> only check the notification type (MAP|UNMAP), but also check whether the
> notified iova is in the range of the specific IOMMU notifier, and skip
> those notifiers if not in the listened range.
>
> When removing a region, we need to make sure we removed the correct
> VFIOGuestIOMMU by checking the IOMMUNotifier.start address as well.
>
> Suggested-by: David Gibson <david@gibson.dropbear.id.au>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   hw/vfio/common.c      | 7 ++++++-
>   include/exec/memory.h | 3 +++
>   memory.c              | 4 +++-
>   3 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 801578b..6f648da 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -455,6 +455,10 @@ static void vfio_listener_region_add(MemoryListener *listener,
>           giommu->container = container;
>           giommu->n.notify = vfio_iommu_map_notify;
>           giommu->n.notifier_flags = IOMMU_NOTIFIER_ALL;
> +        giommu->n.start = section->offset_within_region;
> +        llend = int128_add(int128_make64(giommu->n.start), section->size);
> +        llend = int128_sub(llend, int128_one());
> +        giommu->n.end = int128_get64(llend);
>           QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>   
>           memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> @@ -525,7 +529,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
>           VFIOGuestIOMMU *giommu;
>   
>           QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> -            if (giommu->iommu == section->mr) {
> +            if (giommu->iommu == section->mr &&
> +                giommu->n.start == section->offset_within_region) {
>                   memory_region_unregister_iommu_notifier(giommu->iommu,
>                                                           &giommu->n);
>                   QLIST_REMOVE(giommu, giommu_next);
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index bec9756..7649e74 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -84,6 +84,9 @@ typedef enum {
>   struct IOMMUNotifier {
>       void (*notify)(struct IOMMUNotifier *notifier, IOMMUTLBEntry *data);
>       IOMMUNotifierFlag notifier_flags;
> +    /* Notify for address space range start <= addr <= end */
> +    hwaddr start;
> +    hwaddr end;
>       QLIST_ENTRY(IOMMUNotifier) node;
>   };
>   typedef struct IOMMUNotifier IOMMUNotifier;
> diff --git a/memory.c b/memory.c
> index 2bfc37f..e88bb54 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1671,7 +1671,9 @@ void memory_region_notify_iommu(MemoryRegion *mr,
>       }
>   
>       QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
> -        if (iommu_notifier->notifier_flags & request_flags) {
> +        if (iommu_notifier->notifier_flags & request_flags &&
> +            iommu_notifier->start <= entry.iova &&
> +            iommu_notifier->end >= entry.iova) {
>               iommu_notifier->notify(iommu_notifier, &entry);
>           }
>       }

This seems to break the vhost device IOTLB. How about keeping the old
behavior somehow?

Thanks

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 09/14] memory: introduce memory_region_notify_one()
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 09/14] memory: introduce memory_region_notify_one() Peter Xu
@ 2017-01-13  7:58   ` Jason Wang
  2017-01-16  7:08     ` Peter Xu
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Wang @ 2017-01-13  7:58 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, alex.williamson, bd.aviv



On 2017-01-13 11:06, Peter Xu wrote:
> Generalize the notify logic in memory_region_notify_iommu() into a
> single function. This can further be used in customized replay()
> functions for IOMMUs.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   include/exec/memory.h | 15 +++++++++++++++
>   memory.c              | 29 ++++++++++++++++++-----------
>   2 files changed, 33 insertions(+), 11 deletions(-)
>
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 2233f99..f367e54 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -669,6 +669,21 @@ void memory_region_notify_iommu(MemoryRegion *mr,
>                                   IOMMUTLBEntry entry);
>   
>   /**
> + * memory_region_notify_one: notify a change in an IOMMU translation
> + *                           entry to a single notifier
> + *
> + * This works just like memory_region_notify_iommu(), but it only
> + * notifies a specific notifier, not all of them.
> + *
> + * @notifier: the notifier to be notified
> + * @entry: the new entry in the IOMMU translation table.  The entry
> + *         replaces all old entries for the same virtual I/O address range.
> + *         Deleted entries have .@perm == 0.
> + */
> +void memory_region_notify_one(IOMMUNotifier *notifier,
> +                              IOMMUTLBEntry *entry);
> +
> +/**
>    * memory_region_register_iommu_notifier: register a notifier for changes to
>    * IOMMU translation entries.
>    *
> diff --git a/memory.c b/memory.c
> index df62bd1..6e4c872 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1665,26 +1665,33 @@ void memory_region_unregister_iommu_notifier(MemoryRegion *mr,
>       memory_region_update_iommu_notify_flags(mr);
>   }
>   
> -void memory_region_notify_iommu(MemoryRegion *mr,
> -                                IOMMUTLBEntry entry)
> +void memory_region_notify_one(IOMMUNotifier *notifier,
> +                              IOMMUTLBEntry *entry)
>   {
> -    IOMMUNotifier *iommu_notifier;
>       IOMMUNotifierFlag request_flags;
>   
> -    assert(memory_region_is_iommu(mr));
> -
> -    if (entry.perm & IOMMU_RW) {
> +    if (entry->perm & IOMMU_RW) {
>           request_flags = IOMMU_NOTIFIER_MAP;
>       } else {
>           request_flags = IOMMU_NOTIFIER_UNMAP;
>       }

Nit: you can keep this outside the loop.

Thanks

>   
> +    if (notifier->notifier_flags & request_flags &&
> +        notifier->start <= entry->iova &&
> +        notifier->end >= entry->iova) {
> +        notifier->notify(notifier, entry);
> +    }
> +}
> +
> +void memory_region_notify_iommu(MemoryRegion *mr,
> +                                IOMMUTLBEntry entry)
> +{
> +    IOMMUNotifier *iommu_notifier;
> +
> +    assert(memory_region_is_iommu(mr));
> +
>       QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
> -        if (iommu_notifier->notifier_flags & request_flags &&
> -            iommu_notifier->start <= entry.iova &&
> -            iommu_notifier->end >= entry.iova) {
> -            iommu_notifier->notify(iommu_notifier, &entry);
> -        }
> +        memory_region_notify_one(iommu_notifier, &entry);
>       }
>   }
>   

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 04/14] intel_iommu: fix trace for inv desc handling
  2017-01-13  7:46   ` Jason Wang
@ 2017-01-13  9:13     ` Peter Xu
  2017-01-13  9:33       ` Jason Wang
  0 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-13  9:13 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv

On Fri, Jan 13, 2017 at 03:46:31PM +0800, Jason Wang wrote:
> 
> 
> On 2017-01-13 11:06, Peter Xu wrote:
> >VT-d code is still using the static DEBUG_INTEL_IOMMU macro. That's
> >not good, and we should end the days when we need to recompile the
> >code before getting useful debugging information for VT-d. Time to
> >switch to the trace system.
> >
> >This is the first patch to do it.
> >
> >Generally, my rule is:
> >
> >- for the old GENERAL-typed messages, I use error_report() directly where
> >   it applies. Those are things that shouldn't happen, and we should print
> >   those errors in all cases, even without enabling debug and tracing.
> 
> > Looks like some of these are guest-triggerable. If yes, let's try not to
> > use error_report(), to avoid being flooded.

Yes, it's intended. Most of the error_report()s in this patch can be
triggered by the guest, but only by illegal guest behaviors (e.g.,
non-zero reserved fields, illegal descriptors, etc.). In that sense,
shall we keep them even if the guest can trigger them? People will
never see them if they are running generic, well-behaved kernels. More
importantly, these error_report()s can be good hints when the guest
encounters issues, for better debugging and triaging.

Actually we have such usage in existing QEMU as well. For example,
when we maintain the DMA mappings in vfio-pci, it's possible that the
shadow page table is mapped illegally for some reason (depending on
the guest as well - maybe not the guest kernel but, say, DPDK
applications inside the guest), and the map() can fail. Here we have:

    ret = vfio_dma_map(container, iova,
                        iotlb->addr_mask + 1, vaddr,
                        !(iotlb->perm & IOMMU_WO) || mr->readonly);
    if (ret) {
        error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
                        "0x%"HWADDR_PRIx", %p) = %d (%m)",
                        container, iova,
                        iotlb->addr_mask + 1, vaddr, ret);
    }

This plays the same role here - we will never see these lines if the
guest behaves normally, and they will be useful when bad things
happen.

So I would slightly prefer that we keep these error_report()s for now,
as long as they won't flood the screen for most users. (During the
time I played with this series, none of them jumped out. :)

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 07/14] memory: add section range info for IOMMU notifier
  2017-01-13  7:55   ` Jason Wang
@ 2017-01-13  9:23     ` Peter Xu
  2017-01-13  9:37       ` Jason Wang
  0 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-13  9:23 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv

On Fri, Jan 13, 2017 at 03:55:22PM +0800, Jason Wang wrote:
> 
> 
> On 2017-01-13 11:06, Peter Xu wrote:
> >In this patch, IOMMUNotifier.{start|end} are introduced to store section
> >information for a specific notifier. When notification occurs, we not
> >only check the notification type (MAP|UNMAP), but also check whether the
> >notified iova is in the range of the specific IOMMU notifier, and skip
> >those notifiers if not in the listened range.
> >
> >When removing a region, we need to make sure we removed the correct
> >VFIOGuestIOMMU by checking the IOMMUNotifier.start address as well.
> >
> >Suggested-by: David Gibson <david@gibson.dropbear.id.au>
> >Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> >Acked-by: Paolo Bonzini <pbonzini@redhat.com>
> >Signed-off-by: Peter Xu <peterx@redhat.com>
> >---
> >  hw/vfio/common.c      | 7 ++++++-
> >  include/exec/memory.h | 3 +++
> >  memory.c              | 4 +++-
> >  3 files changed, 12 insertions(+), 2 deletions(-)
> >
> >diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >index 801578b..6f648da 100644
> >--- a/hw/vfio/common.c
> >+++ b/hw/vfio/common.c
> >@@ -455,6 +455,10 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >          giommu->container = container;
> >          giommu->n.notify = vfio_iommu_map_notify;
> >          giommu->n.notifier_flags = IOMMU_NOTIFIER_ALL;
> >+        giommu->n.start = section->offset_within_region;
> >+        llend = int128_add(int128_make64(giommu->n.start), section->size);
> >+        llend = int128_sub(llend, int128_one());
> >+        giommu->n.end = int128_get64(llend);
> >          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
> >          memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> >@@ -525,7 +529,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >          VFIOGuestIOMMU *giommu;
> >          QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> >-            if (giommu->iommu == section->mr) {
> >+            if (giommu->iommu == section->mr &&
> >+                giommu->n.start == section->offset_within_region) {
> >                  memory_region_unregister_iommu_notifier(giommu->iommu,
> >                                                          &giommu->n);
> >                  QLIST_REMOVE(giommu, giommu_next);
> >diff --git a/include/exec/memory.h b/include/exec/memory.h
> >index bec9756..7649e74 100644
> >--- a/include/exec/memory.h
> >+++ b/include/exec/memory.h
> >@@ -84,6 +84,9 @@ typedef enum {
> >  struct IOMMUNotifier {
> >      void (*notify)(struct IOMMUNotifier *notifier, IOMMUTLBEntry *data);
> >      IOMMUNotifierFlag notifier_flags;
> >+    /* Notify for address space range start <= addr <= end */
> >+    hwaddr start;
> >+    hwaddr end;
> >      QLIST_ENTRY(IOMMUNotifier) node;
> >  };
> >  typedef struct IOMMUNotifier IOMMUNotifier;
> >diff --git a/memory.c b/memory.c
> >index 2bfc37f..e88bb54 100644
> >--- a/memory.c
> >+++ b/memory.c
> >@@ -1671,7 +1671,9 @@ void memory_region_notify_iommu(MemoryRegion *mr,
> >      }
> >      QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
> >-        if (iommu_notifier->notifier_flags & request_flags) {
> >+        if (iommu_notifier->notifier_flags & request_flags &&
> >+            iommu_notifier->start <= entry.iova &&
> >+            iommu_notifier->end >= entry.iova) {
> >              iommu_notifier->notify(iommu_notifier, &entry);
> >          }
> >      }
> 
> This seems to break the vhost device IOTLB. How about keeping the old
> behavior somehow?

Thanks for pointing that out. How about I squash the following into this patch?

--------8<--------
diff --git a/memory.c b/memory.c
index e88bb54..6de02dd 100644
--- a/memory.c
+++ b/memory.c
@@ -1608,8 +1608,14 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr,
         return;
     }
 
+    if (n->start == 0 && n->end == 0) {
+        /* If these are not specified, we listen to the whole range */
+        n->end = (hwaddr)(-1);
+    }
+
     /* We need to register for at least one bitfield */
     assert(n->notifier_flags != IOMMU_NOTIFIER_NONE);
+    assert(n->start <= n->end);
     QLIST_INSERT_HEAD(&mr->iommu_notify, n, node);
     memory_region_update_iommu_notify_flags(mr);
 }
-------->8--------

-- peterx

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 11/14] intel_iommu: provide its own replay() callback
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 11/14] intel_iommu: provide its own replay() callback Peter Xu
@ 2017-01-13  9:26   ` Jason Wang
  2017-01-16  7:31     ` Peter Xu
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Wang @ 2017-01-13  9:26 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, alex.williamson, bd.aviv



On 2017-01-13 11:06, Peter Xu wrote:
> The default replay() doesn't work for VT-d, since VT-d has a huge
> default memory region which covers the address range 0-(2^64-1). This
> will normally bring a dead loop when the guest starts.

I think it just takes too much time rather than being a dead loop?

>
> The solution is simple - we don't walk over all the regions. Instead, we
> jump over regions whose page directories are found to be empty. This
> greatly reduces the time needed to walk the whole region.

Yes, the problem is that memory_region_iommu_replay() is not smart because:

- it doesn't understand large pages
- it tries to go over all possible iovas

So I'm thinking of introducing something like iommu_ops->iova_iterate(),
which would (see the rough sketch after this list):

1) accept a start iova and return the next existing map
2) understand large pages
3) skip unmapped iovas
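
Rough sketch of such an interface (everything below is illustrative;
nothing is implemented yet):

    /* A possible new MemoryRegionIOMMUOps member: fill @entry with the
     * first existing mapping whose iova is at or above @iova (the
     * entry's addr_mask can describe a large page); return 0 on
     * success, or a negative value when no mapping remains.  The
     * caller then continues from entry->iova + entry->addr_mask + 1. */
    int (*iova_iterate)(MemoryRegion *iommu, hwaddr iova,
                        IOMMUTLBEntry *entry);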

>
> To achieve this, we provide a page walk helper which invokes a
> corresponding hook function whenever a page of interest is found.
> vtd_page_walk_level() is the core logic for the page walking. Its
> interface is designed to suit further use cases, e.g., invalidating a
> range of addresses.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

For intel iommu, since we intercept all maps and unmaps, a trickier
idea is that we could record the mappings internally in something like
an rbtree, which could then be iterated during replay (see the sketch
below). This saves guest IO page table traversal, but the drawback is
that it may not survive an OOM attacker.
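
Something like this, perhaps (a rough sketch using glib's GTree keyed
by iova; all names are illustrative):

    typedef struct ShadowMap {
        hwaddr iova;            /* key */
        IOMMUTLBEntry entry;    /* last MAP notification for this iova */
    } ShadowMap;

    static gint shadow_map_cmp(gconstpointer a, gconstpointer b, gpointer d)
    {
        const ShadowMap *ma = a, *mb = b;
        return ma->iova < mb->iova ? -1 : ma->iova > mb->iova;
    }

    /* On MAP: insert a ShadowMap into the tree; on UNMAP: remove it;
     * on replay: g_tree_foreach(tree, notify_cb, notifier), which never
     * touches the guest IO page tables. */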

Thanks

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 04/14] intel_iommu: fix trace for inv desc handling
  2017-01-13  9:13     ` Peter Xu
@ 2017-01-13  9:33       ` Jason Wang
  0 siblings, 0 replies; 93+ messages in thread
From: Jason Wang @ 2017-01-13  9:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv



On 2017-01-13 17:13, Peter Xu wrote:
> On Fri, Jan 13, 2017 at 03:46:31PM +0800, Jason Wang wrote:
>>
>> On 2017-01-13 11:06, Peter Xu wrote:
>>> VT-d code is still using the static DEBUG_INTEL_IOMMU macro. That's
>>> not good, and we should end the days when we need to recompile the
>>> code before getting useful debugging information for VT-d. Time to
>>> switch to the trace system.
>>>
>>> This is the first patch to do it.
>>>
>>> Generally, my rule is:
>>>
>>> - for the old GENERAL-typed messages, I use error_report() directly
>>>    where it applies. Those are things that shouldn't happen, and we
>>>    should print those errors in all cases, even without enabling debug
>>>    and tracing.
>> Looks like some are guest-triggerable. If yes, let's try not to use
>> error_report(), to avoid being flooded.
> Yes, it's intended. Most of the error_report()s in this patch can be
> triggered by the guest, but only by illegal guest behaviors (e.g.,
> non-zero reserved fields, or illegal descriptors, etc.). In that
> sense, shall we keep them even though the guest can trigger them? People
> will never see them if they are running generic, well-behaved kernels.
> More importantly, these error_report()s can be good hints when the guest
> encounters issues, for better debugging and triaging.
>
> Actually we have such usage in existing QEMU as well. For example,
> when we maintain the DMA mapping in vfio-pci, it's possible that the
> shadow page table is mapped illegally for some reason (that depends
> on the guest as well - it may not be the guest kernel, but DPDK
> applications inside the guest), and the map() can fail. Here we have:
>
>      ret = vfio_dma_map(container, iova,
>                          iotlb->addr_mask + 1, vaddr,
>                          !(iotlb->perm & IOMMU_WO) || mr->readonly);
>      if (ret) {
>          error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>                          "0x%"HWADDR_PRIx", %p) = %d (%m)",
>                          container, iova,
>                          iotlb->addr_mask + 1, vaddr, ret);
>      }
>
> Which I think is playing the same role here - we will never see these
> lines if the guest is normal, and these lines will be useful when bad
> things happen.
>
> So I would slightly prefer that we keep these error_report()s for now,
> as long as they won't flood the screen for most users. (during the
> time I played with this series, none of them jumped out :)

I think the point is just surviving malicious guests. So we need to 
avoid guest-triggerable things like this: consider that if we redirect 
stderr to a log file, a malicious guest may exhaust the disk space, 
which is a DoS. So we'd better avoid them.
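
One way to keep the hints while bounding what a guest can make us log -
a minimal sketch only, assuming no such helper exists in the tree yet:

    #include <stdbool.h>
    #include "qemu/error-report.h"

    /* Print a guest-triggerable warning at most once, so a malicious
     * guest cannot grow the log file without bound. */
    static void vtd_report_once(const char *msg)
    {
        static bool printed;

        if (!printed) {
            printed = true;
            error_report("%s", msg);    /* first occurrence only */
        }
    }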

Thanks

>
> Thanks,
>
> -- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 07/14] memory: add section range info for IOMMU notifier
  2017-01-13  9:23     ` Peter Xu
@ 2017-01-13  9:37       ` Jason Wang
  2017-01-13 10:22         ` Peter Xu
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Wang @ 2017-01-13  9:37 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv



On 2017-01-13 17:23, Peter Xu wrote:
> On Fri, Jan 13, 2017 at 03:55:22PM +0800, Jason Wang wrote:
>>
>> On 2017-01-13 11:06, Peter Xu wrote:
>>> In this patch, IOMMUNotifier.{start|end} are introduced to store section
>>> information for a specific notifier. When notification occurs, we not
>>> only check the notification type (MAP|UNMAP), but also check whether the
>>> notified iova is in the range of the specific IOMMU notifier, and skip
>>> those notifiers if it is not in the listened range.
>>>
>>> When removing a region, we need to make sure we remove the correct
>>> VFIOGuestIOMMU by checking the IOMMUNotifier.start address as well.
>>>
>>> Suggested-by: David Gibson <david@gibson.dropbear.id.au>
>>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>>> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>>   hw/vfio/common.c      | 7 ++++++-
>>>   include/exec/memory.h | 3 +++
>>>   memory.c              | 4 +++-
>>>   3 files changed, 12 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>> index 801578b..6f648da 100644
>>> --- a/hw/vfio/common.c
>>> +++ b/hw/vfio/common.c
>>> @@ -455,6 +455,10 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>>           giommu->container = container;
>>>           giommu->n.notify = vfio_iommu_map_notify;
>>>           giommu->n.notifier_flags = IOMMU_NOTIFIER_ALL;
>>> +        giommu->n.start = section->offset_within_region;
>>> +        llend = int128_add(int128_make64(giommu->n.start), section->size);
>>> +        llend = int128_sub(llend, int128_one());
>>> +        giommu->n.end = int128_get64(llend);
>>>           QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>>>           memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
>>> @@ -525,7 +529,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>>           VFIOGuestIOMMU *giommu;
>>>           QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
>>> -            if (giommu->iommu == section->mr) {
>>> +            if (giommu->iommu == section->mr &&
>>> +                giommu->n.start == section->offset_within_region) {
>>>                   memory_region_unregister_iommu_notifier(giommu->iommu,
>>>                                                           &giommu->n);
>>>                   QLIST_REMOVE(giommu, giommu_next);
>>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>>> index bec9756..7649e74 100644
>>> --- a/include/exec/memory.h
>>> +++ b/include/exec/memory.h
>>> @@ -84,6 +84,9 @@ typedef enum {
>>>   struct IOMMUNotifier {
>>>       void (*notify)(struct IOMMUNotifier *notifier, IOMMUTLBEntry *data);
>>>       IOMMUNotifierFlag notifier_flags;
>>> +    /* Notify for address space range start <= addr <= end */
>>> +    hwaddr start;
>>> +    hwaddr end;
>>>       QLIST_ENTRY(IOMMUNotifier) node;
>>>   };
>>>   typedef struct IOMMUNotifier IOMMUNotifier;
>>> diff --git a/memory.c b/memory.c
>>> index 2bfc37f..e88bb54 100644
>>> --- a/memory.c
>>> +++ b/memory.c
>>> @@ -1671,7 +1671,9 @@ void memory_region_notify_iommu(MemoryRegion *mr,
>>>       }
>>>       QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
>>> -        if (iommu_notifier->notifier_flags & request_flags) {
>>> +        if (iommu_notifier->notifier_flags & request_flags &&
>>> +            iommu_notifier->start <= entry.iova &&
>>> +            iommu_notifier->end >= entry.iova) {
>>>               iommu_notifier->notify(iommu_notifier, &entry);
>>>           }
>>>       }
>> This seems to break the vhost device IOTLB. How about keeping the old
>> behavior somehow?
> Thanks for pointing that out. How about I squash this into this patch?
>
> --------8<--------
> diff --git a/memory.c b/memory.c
> index e88bb54..6de02dd 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1608,8 +1608,14 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr,
>           return;
>       }
>   
> +    if (n->start == 0 && n->end == 0) {
> +        /* If these are not specified, we listen to the whole range */
> +        n->end = (hwaddr)(-1);
> +    }
> +
>       /* We need to register for at least one bitfield */
>       assert(n->notifier_flags != IOMMU_NOTIFIER_NONE);
> +    assert(n->start <= n->end);
>       QLIST_INSERT_HEAD(&mr->iommu_notify, n, node);
>       memory_region_update_iommu_notify_flags(mr);
>   }
> -------->8--------
>
> -- peterx

This should work, or you can introduce a 
memory_region_iommu_notifier_init() to force the user to explicitly 
initialize start and end.
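
For illustration, such a helper might look like this (a sketch only;
the name and parameter list are suggestions, not a settled API):

    static inline void memory_region_iommu_notifier_init(IOMMUNotifier *n,
                                                         IOMMUNotifierFlag flags,
                                                         hwaddr start,
                                                         hwaddr end)
    {
        assert(start <= end);      /* catch inverted ranges at init time */
        n->notifier_flags = flags;
        n->start = start;
        n->end = end;
    }

Callers would then always state the range they listen to, and the
implicit whole-range default at register time could eventually go away.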

Thanks

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 07/14] memory: add section range info for IOMMU notifier
  2017-01-13  9:37       ` Jason Wang
@ 2017-01-13 10:22         ` Peter Xu
  0 siblings, 0 replies; 93+ messages in thread
From: Peter Xu @ 2017-01-13 10:22 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv

On Fri, Jan 13, 2017 at 05:37:43PM +0800, Jason Wang wrote:

[...]

> >>>diff --git a/memory.c b/memory.c
> >>>index 2bfc37f..e88bb54 100644
> >>>--- a/memory.c
> >>>+++ b/memory.c
> >>>@@ -1671,7 +1671,9 @@ void memory_region_notify_iommu(MemoryRegion *mr,
> >>>      }
> >>>      QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
> >>>-        if (iommu_notifier->notifier_flags & request_flags) {
> >>>+        if (iommu_notifier->notifier_flags & request_flags &&
> >>>+            iommu_notifier->start <= entry.iova &&
> >>>+            iommu_notifier->end >= entry.iova) {
> >>>              iommu_notifier->notify(iommu_notifier, &entry);
> >>>          }
> >>>      }
> >>This seems to break the vhost device IOTLB. How about keeping the old
> >>behavior somehow?
> >Thanks for pointing that out. How about I squash this into this patch?
> >
> >--------8<--------
> >diff --git a/memory.c b/memory.c
> >index e88bb54..6de02dd 100644
> >--- a/memory.c
> >+++ b/memory.c
> >@@ -1608,8 +1608,14 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr,
> >          return;
> >      }
> >+    if (n->start == 0 && n->end == 0) {
> >+        /* If these are not specified, we listen to the whole range */
> >+        n->end = (hwaddr)(-1);
> >+    }
> >+
> >      /* We need to register for at least one bitfield */
> >      assert(n->notifier_flags != IOMMU_NOTIFIER_NONE);
> >+    assert(n->start <= n->end);
> >      QLIST_INSERT_HEAD(&mr->iommu_notify, n, node);
> >      memory_region_update_iommu_notify_flags(mr);
> >  }
> >-------->8--------
> >
> >-- peterx
> 
> This should work, or you can introduce a memory_region_iommu_notifier_init()
> to force the user to explicitly initialize start and end.

Hmm, this sounds better, considering that IOMMUNotifier is getting
more fields to be initialized. Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 00/14] VT-d: vfio enablement and misc enhances
  2017-01-13  3:06 [Qemu-devel] [PATCH RFC v3 00/14] VT-d: vfio enablement and misc enhances Peter Xu
                   ` (13 preceding siblings ...)
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio devices Peter Xu
@ 2017-01-13 15:58 ` Michael S. Tsirkin
  2017-01-14  2:59   ` Peter Xu
  14 siblings, 1 reply; 93+ messages in thread
From: Michael S. Tsirkin @ 2017-01-13 15:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

On Fri, Jan 13, 2017 at 11:06:26AM +0800, Peter Xu wrote:
> v3:
> - fix style error reported by patchew
> - fix comment in domain switch patch: use "IOMMU address space" rather
>   than "IOMMU region" [Kevin]
> - add ack-by for Paolo in patch:
>   "memory: add section range info for IOMMU notifier"
>   (this is seperately collected besides this thread)
> - remove 3 patches which are merged already (from Jason)
> - rebase to master b6c0897

So 1-6 look like nice cleanups to me. Should I merge them now?

> v2:
> - change comment for "end" parameter in vtd_page_walk() [Tianyu]
> - change comment for "a iova" to "an iova" [Yi]
> - fix fault printed val for GPA address in vtd_page_walk_level (debug
>   only)
> - rebased to master (rather than Aviv's v6 series) and merged Aviv's
>   series v6: picked patch 1 (as patch 1 in this series), dropped patch
>   2, re-wrote patch 3 (as patch 17 of this series).
> - picked up two more bugfix patches from Jason's DMAR series
> - picked up the following patch as well:
>   "[PATCH v3] intel_iommu: allow dynamic switch of IOMMU region"
> 
> This RFC series is a re-work for Aviv B.D.'s vfio enablement series
> with vt-d:
> 
>   https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01452.html
> 
> Aviv has done a great job there, and what we still lack there are
> mostly the following:
> 
> (1) VFIO got duplicated IOTLB notifications due to splitted VT-d IOMMU
>     memory region.
> 
> (2) VT-d still haven't provide a correct replay() mechanism (e.g.,
>     when IOMMU domain switches, things will broke).
> 
> This series should have solved the above two issues.
> 
> Online repo:
> 
>   https://github.com/xzpeter/qemu/tree/vtd-vfio-enablement-v2
> 
> I would be glad to hear about any review comments for above patches.
> 
> =========
> Test Done
> =========
> 
> Build test passed for x86_64/arm/ppc64.
> 
> Simply tested with x86_64, assigning two PCI devices to a single VM,
> boot the VM using:
> 
> bin=x86_64-softmmu/qemu-system-x86_64
> $bin -M q35,accel=kvm,kernel-irqchip=split -m 1G \
>      -device intel-iommu,intremap=on,eim=off,cache-mode=on \
>      -netdev user,id=net0,hostfwd=tcp::5555-:22 \
>      -device virtio-net-pci,netdev=net0 \
>      -device vfio-pci,host=03:00.0 \
>      -device vfio-pci,host=02:00.0 \
>      -trace events=".trace.vfio" \
>      /var/lib/libvirt/images/vm1.qcow2
> 
> pxdev:bin [vtd-vfio-enablement]# cat .trace.vfio
> vtd_page_walk*
> vtd_replay*
> vtd_inv_desc*
> 
> Then, in the guest, run the following tool:
> 
>   https://github.com/xzpeter/clibs/blob/master/gpl/userspace/vfio-bind-group/vfio-bind-group.c
> 
> With parameter:
> 
>   ./vfio-bind-group 00:03.0 00:04.0
> 
> Check host side trace log, I can see pages are replayed and mapped in
> 00:04.0 device address space, like:
> 
> ...
> vtd_replay_ce_valid replay valid context device 00:04.00 hi 0x401 lo 0x38fe1001
> vtd_page_walk Page walk for ce (0x401, 0x38fe1001) iova range 0x0 - 0x8000000000
> vtd_page_walk_level Page walk (base=0x38fe1000, level=3) iova range 0x0 - 0x8000000000
> vtd_page_walk_level Page walk (base=0x35d31000, level=2) iova range 0x0 - 0x40000000
> vtd_page_walk_level Page walk (base=0x34979000, level=1) iova range 0x0 - 0x200000
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x0 -> gpa 0x22dc3000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x1000 -> gpa 0x22e25000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x2000 -> gpa 0x22e12000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x3000 -> gpa 0x22e2d000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x4000 -> gpa 0x12a49000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x5000 -> gpa 0x129bb000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x6000 -> gpa 0x128db000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x7000 -> gpa 0x12a80000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x8000 -> gpa 0x12a7e000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x9000 -> gpa 0x12b22000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0xa000 -> gpa 0x12b41000 mask 0xfff perm 3
> ...
> 
> =========
> Todo List
> =========
> 
> - error reporting for the assigned devices (as Tianyu has mentioned)
> 
> - per-domain address-space: A better solution in the future may be -
>   we maintain one address space per IOMMU domain in the guest (so
>   multiple devices can share a same address space if they are sharing
>   the same IOMMU domains in the guest), rather than one address space
>   per device (which is current implementation of vt-d). However that's
>   a step further than this series, and let's see whether we can first
>   provide a workable version of device assignment with vt-d
>   protection.
> 
> - more to come...
> 
> Thanks,
> 
> Aviv Ben-David (1):
>   IOMMU: add option to enable VTD_CAP_CM to vIOMMU capility exposoed to
>     guest
> 
> Peter Xu (13):
>   intel_iommu: simplify irq region translation
>   intel_iommu: renaming gpa to iova where proper
>   intel_iommu: fix trace for inv desc handling
>   intel_iommu: fix trace for addr translation
>   intel_iommu: vtd_slpt_level_shift check level
>   memory: add section range info for IOMMU notifier
>   memory: provide iommu_replay_all()
>   memory: introduce memory_region_notify_one()
>   memory: add MemoryRegionIOMMUOps.replay() callback
>   intel_iommu: provide its own replay() callback
>   intel_iommu: do replay when context invalidate
>   intel_iommu: allow dynamic switch of IOMMU region
>   intel_iommu: enable vfio devices
> 
>  hw/i386/intel_iommu.c          | 589 +++++++++++++++++++++++++++++++----------
>  hw/i386/intel_iommu_internal.h |   1 +
>  hw/i386/trace-events           |  28 ++
>  hw/vfio/common.c               |   7 +-
>  include/exec/memory.h          |  30 +++
>  include/hw/i386/intel_iommu.h  |  12 +
>  memory.c                       |  42 ++-
>  7 files changed, 557 insertions(+), 152 deletions(-)
> 
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 00/14] VT-d: vfio enablement and misc enhances
  2017-01-13 15:58 ` [Qemu-devel] [PATCH RFC v3 00/14] VT-d: vfio enablement and misc enhances Michael S. Tsirkin
@ 2017-01-14  2:59   ` Peter Xu
  2017-01-17 15:07     ` Michael S. Tsirkin
  0 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-14  2:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, tianyu.lan, kevin.tian, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

On Fri, Jan 13, 2017 at 05:58:02PM +0200, Michael S. Tsirkin wrote:
> On Fri, Jan 13, 2017 at 11:06:26AM +0800, Peter Xu wrote:
> > v3:
> > - fix style error reported by patchew
> > - fix comment in domain switch patch: use "IOMMU address space" rather
> >   than "IOMMU region" [Kevin]
> > - add ack-by for Paolo in patch:
> >   "memory: add section range info for IOMMU notifier"
> >   (this is seperately collected besides this thread)
> > - remove 3 patches which are merged already (from Jason)
> > - rebase to master b6c0897
> 
> So 1-6 look like nice cleanups to me. Should I merge them now?

That'll be great if you'd like to merge them. Then I can further
shorten this series for the next post.

Regarding the error_report() issue that Jason mentioned, I can
touch them up in the future when needed - after all, most of the patch
content is about converting DPRINTF()s into traces.

Thanks!

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 12/14] intel_iommu: do replay when context invalidate
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 12/14] intel_iommu: do replay when context invalidate Peter Xu
@ 2017-01-16  5:53   ` Jason Wang
  2017-01-16  7:43     ` Peter Xu
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Wang @ 2017-01-16  5:53 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, alex.williamson, bd.aviv



On 2017-01-13 11:06, Peter Xu wrote:
> Before this one, we only invalidate the context cache when we receive
> context entry invalidations. However it's possible that the invalidation
> also contains a domain switch (only if cache-mode is enabled for the
> vIOMMU).

So let's check for CM before replaying?

>   In
> that case we need to notify all the registered components about the new
> mapping.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   hw/i386/intel_iommu.c | 1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 59bf683..fd75112 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1162,6 +1162,7 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
>                   trace_vtd_inv_desc_cc_device(bus_n, (devfn_it >> 3) & 0x1f,
>                                                devfn_it & 3);
>                   vtd_as->context_cache_entry.context_cache_gen = 0;
> +                memory_region_iommu_replay_all(&vtd_as->iommu);
>               }
>           }
>       }

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 13/14] intel_iommu: allow dynamic switch of IOMMU region
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 13/14] intel_iommu: allow dynamic switch of IOMMU region Peter Xu
@ 2017-01-16  6:20   ` Jason Wang
  2017-01-16  7:50     ` Peter Xu
  2017-01-16 19:53   ` Alex Williamson
  1 sibling, 1 reply; 93+ messages in thread
From: Jason Wang @ 2017-01-16  6:20 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, alex.williamson, bd.aviv



On 2017-01-13 11:06, Peter Xu wrote:
> This is preparation work to finally enable dynamic switching ON/OFF of
> VT-d protection. The old VT-d code uses a static IOMMU address space,
> and that won't satisfy vfio-pci device listeners.
>
> Let me explain.
>
> vfio-pci devices depend on the memory region listener and IOMMU replay
> mechanism to make sure the device mapping is coherent with the guest
> even if there are domain switches. And there are two kinds of domain
> switches:
>
>    (1) switch from domain A -> B
>    (2) switch from domain A -> no domain (e.g., turn DMAR off)
>
> Case (1) is handled by the context entry invalidation handling in the
> VT-d replay logic. What the replay function should do here is to replay
> the existing page mappings in domain B.
>
> However for case (2), we don't want to replay any domain mappings - we
> just need the default GPA->HPA mappings (the address_space_memory
> mapping). And this patch helps on case (2) to build up the mapping
> automatically by leveraging the vfio-pci memory listeners.
>
> Another important thing that this patch does is to separate
> IR (Interrupt Remapping) from DMAR (DMA Remapping). The IR region should
> not depend on the DMAR region (like before this patch). It should be a
> standalone region, and it should be able to be activated without
> DMAR (which is a common behavior of the Linux kernel - by default it
> enables IR while leaving DMAR disabled).
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> v3:
> - fix another trivial style issue patchew reported but I missed in v2
>
> v2:
> - fix issues reported by patchew
> - switch domain by enable/disable memory regions [David]
> - provide vtd_switch_address_space{_all}()
> - provide a better comment on the memory regions
>
> test done: with intel_iommu device, boot vm with/without
> "intel_iommu=on" parameter.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   hw/i386/intel_iommu.c         | 78 ++++++++++++++++++++++++++++++++++++++++---
>   hw/i386/trace-events          |  2 +-
>   include/hw/i386/intel_iommu.h |  2 ++
>   3 files changed, 77 insertions(+), 5 deletions(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index fd75112..2596f11 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1343,9 +1343,49 @@ static void vtd_handle_gcmd_sirtp(IntelIOMMUState *s)
>       vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_IRTPS);
>   }
>   
> +static void vtd_switch_address_space(VTDAddressSpace *as, bool iommu_enabled)

Looks like you can check s->dmar_enabled here?

> +{
> +    assert(as);
> +
> +    trace_vtd_switch_address_space(pci_bus_num(as->bus),
> +                                   VTD_PCI_SLOT(as->devfn),
> +                                   VTD_PCI_FUNC(as->devfn),
> +                                   iommu_enabled);
> +
> +    /* Turn off first then on the other */
> +    if (iommu_enabled) {
> +        memory_region_set_enabled(&as->sys_alias, false);
> +        memory_region_set_enabled(&as->iommu, true);
> +    } else {
> +        memory_region_set_enabled(&as->iommu, false);
> +        memory_region_set_enabled(&as->sys_alias, true);
> +    }
> +}
> +
> +static void vtd_switch_address_space_all(IntelIOMMUState *s, bool enabled)
> +{
> +    GHashTableIter iter;
> +    VTDBus *vtd_bus;
> +    int i;
> +
> +    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
> +    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
> +        for (i = 0; i < X86_IOMMU_PCI_DEVFN_MAX; i++) {
> +            if (!vtd_bus->dev_as[i]) {
> +                continue;
> +            }
> +            vtd_switch_address_space(vtd_bus->dev_as[i], enabled);
> +        }
> +    }
> +}
> +
>   /* Handle Translation Enable/Disable */
>   static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
>   {
> +    if (s->dmar_enabled == en) {
> +        return;
> +    }
> +
>       VTD_DPRINTF(CSR, "Translation Enable %s", (en ? "on" : "off"));
>   
>       if (en) {
> @@ -1360,6 +1400,8 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
>           /* Ok - report back to driver */
>           vtd_set_clear_mask_long(s, DMAR_GSTS_REG, VTD_GSTS_TES, 0);
>       }
> +
> +    vtd_switch_address_space_all(s, en);
>   }
>   
>   /* Handle Interrupt Remap Enable/Disable */
> @@ -2586,15 +2628,43 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>           vtd_dev_as->devfn = (uint8_t)devfn;
>           vtd_dev_as->iommu_state = s;
>           vtd_dev_as->context_cache_entry.context_cache_gen = 0;
> +
> +        /*
> +         * Memory region relationships looks like (Address range shows
> +         * only lower 32 bits to make it short in length...):
> +         *
> +         * |-----------------+-------------------+----------|
> +         * | Name            | Address range     | Priority |
> +         * |-----------------+-------------------+----------+
> +         * | vtd_root        | 00000000-ffffffff |        0 |
> +         * |  intel_iommu    | 00000000-ffffffff |        1 |
> +         * |  vtd_sys_alias  | 00000000-ffffffff |        1 |
> +         * |  intel_iommu_ir | fee00000-feefffff |       64 |
> +         * |-----------------+-------------------+----------|
> +         *
> +         * We enable/disable DMAR by switching enablement for
> +         * vtd_sys_alias and intel_iommu regions. IR region is always
> +         * enabled.
> +         */
>           memory_region_init_iommu(&vtd_dev_as->iommu, OBJECT(s),
>                                    &s->iommu_ops, "intel_iommu", UINT64_MAX);

Then it's better to name this as "intel_iommu_dmar"?

> +        memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
> +                                 "vtd_sys_alias", get_system_memory(),
> +                                 0, memory_region_size(get_system_memory()));
>           memory_region_init_io(&vtd_dev_as->iommu_ir, OBJECT(s),
>                                 &vtd_mem_ir_ops, s, "intel_iommu_ir",
>                                 VTD_INTERRUPT_ADDR_SIZE);
> -        memory_region_add_subregion(&vtd_dev_as->iommu, VTD_INTERRUPT_ADDR_FIRST,
> -                                    &vtd_dev_as->iommu_ir);
> -        address_space_init(&vtd_dev_as->as,
> -                           &vtd_dev_as->iommu, name);
> +        memory_region_init(&vtd_dev_as->root, OBJECT(s),
> +                           "vtd_root", UINT64_MAX);
> +        memory_region_add_subregion_overlap(&vtd_dev_as->root,
> +                                            VTD_INTERRUPT_ADDR_FIRST,
> +                                            &vtd_dev_as->iommu_ir, 64);
> +        address_space_init(&vtd_dev_as->as, &vtd_dev_as->root, name);
> +        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
> +                                            &vtd_dev_as->sys_alias, 1);
> +        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
> +                                            &vtd_dev_as->iommu, 1);
> +        vtd_switch_address_space(vtd_dev_as, s->dmar_enabled);
>       }
>       return vtd_dev_as;
>   }
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index 92d210d..beaef61 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -11,7 +11,6 @@ xen_pv_mmio_write(uint64_t addr) "WARNING: write to Xen PV Device MMIO space (ad
>   x86_iommu_iec_notify(bool global, uint32_t index, uint32_t mask) "Notify IEC invalidation: global=%d index=%" PRIu32 " mask=%" PRIu32
>   
>   # hw/i386/intel_iommu.c
> -vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
>   vtd_inv_desc(const char *type, uint64_t hi, uint64_t lo) "invalidate desc type %s high 0x%"PRIx64" low 0x%"PRIx64
>   vtd_inv_desc_cc_domain(uint16_t domain) "context invalidate domain 0x%"PRIx16
>   vtd_inv_desc_cc_global(void) "context invalidate globally"
> @@ -37,6 +36,7 @@ vtd_page_walk_one(uint32_t level, uint64_t iova, uint64_t gpa, uint64_t mask, in
>   vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
>   vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
>   vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
> +vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
>   
>   # hw/i386/amd_iommu.c
>   amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 749eef9..9c3f6c0 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -83,6 +83,8 @@ struct VTDAddressSpace {
>       uint8_t devfn;
>       AddressSpace as;
>       MemoryRegion iommu;
> +    MemoryRegion root;
> +    MemoryRegion sys_alias;
>       MemoryRegion iommu_ir;      /* Interrupt region: 0xfeeXXXXX */
>       IntelIOMMUState *iommu_state;
>       VTDContextCacheEntry context_cache_entry;

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio devices
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio devices Peter Xu
@ 2017-01-16  6:30   ` Jason Wang
  2017-01-16  9:18     ` Peter Xu
  2017-01-16  9:20     ` Peter Xu
  0 siblings, 2 replies; 93+ messages in thread
From: Jason Wang @ 2017-01-16  6:30 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, alex.williamson, bd.aviv



On 2017-01-13 11:06, Peter Xu wrote:
> This patch is based on Aviv Ben-David (<bd.aviv@gmail.com>)'s patch
> upstream:
>
>    "IOMMU: enable intel_iommu map and unmap notifiers"
>    https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01453.html
>
> However I removed/fixed some content, and added my own code.
>
> Instead of calling translate() on every page for IOTLB invalidations
> (which is slower), we walk the pages when needed and notify in a hook
> function.
>
> This patch enables vfio devices for VT-d emulation.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   hw/i386/intel_iommu.c         | 68 +++++++++++++++++++++++++++++++++++++------
>   include/hw/i386/intel_iommu.h |  8 +++++
>   2 files changed, 67 insertions(+), 9 deletions(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 2596f11..104200b 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -839,7 +839,8 @@ next:
>    * @private: private data for the hook function
>    */
>   static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
> -                         vtd_page_walk_hook hook_fn, void *private)
> +                         vtd_page_walk_hook hook_fn, void *private,
> +                         bool notify_unmap)
>   {
>       dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
>       uint32_t level = vtd_get_level_from_context_entry(ce);
> @@ -858,7 +859,7 @@ static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
>       trace_vtd_page_walk(ce->hi, ce->lo, start, end);
>   
>       return vtd_page_walk_level(addr, start, end, hook_fn, private,
> -                               level, true, true, NULL, false);
> +                               level, true, true, NULL, notify_unmap);
>   }
>   
>   /* Map a device to its corresponding domain (context-entry) */
> @@ -1212,6 +1213,34 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
>                                   &domain_id);
>   }
>   
> +static int vtd_page_invalidate_notify_hook(IOMMUTLBEntry *entry,
> +                                           void *private)
> +{
> +    memory_region_notify_iommu((MemoryRegion *)private, *entry);
> +    return 0;
> +}
> +
> +static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
> +                                           uint16_t domain_id, hwaddr addr,
> +                                           uint8_t am)
> +{
> +    IntelIOMMUNotifierNode *node;
> +    VTDContextEntry ce;
> +    int ret;
> +
> +    QLIST_FOREACH(node, &(s->notifiers_list), next) {
> +        VTDAddressSpace *vtd_as = node->vtd_as;
> +        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> +                                       vtd_as->devfn, &ce);
> +        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> +            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
> +                          vtd_page_invalidate_notify_hook,
> +                          (void *)&vtd_as->iommu, true);
> +        }
> +    }
> +}
> +
> +
>   static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
>                                         hwaddr addr, uint8_t am)
>   {
> @@ -1222,6 +1251,7 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
>       info.addr = addr;
>       info.mask = ~((1 << am) - 1);
>       g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
> +    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);

Is the case of GLOBAL or DSI flush missed, or do we not care about it at all?

Thanks

>   }
>   
>   /* Flush IOTLB
> @@ -2244,15 +2274,34 @@ static void vtd_iommu_notify_flag_changed(MemoryRegion *iommu,
>                                             IOMMUNotifierFlag new)
>   {
>       VTDAddressSpace *vtd_as = container_of(iommu, VTDAddressSpace, iommu);
> +    IntelIOMMUState *s = vtd_as->iommu_state;
> +    IntelIOMMUNotifierNode *node = NULL;
> +    IntelIOMMUNotifierNode *next_node = NULL;
>   
> -    if (new & IOMMU_NOTIFIER_MAP) {
> -        error_report("Device at bus %s addr %02x.%d requires iommu "
> -                     "notifier which is currently not supported by "
> -                     "intel-iommu emulation",
> -                     vtd_as->bus->qbus.name, PCI_SLOT(vtd_as->devfn),
> -                     PCI_FUNC(vtd_as->devfn));
> +    if (!s->cache_mode_enabled && new & IOMMU_NOTIFIER_MAP) {
> +        error_report("We need to set cache_mode=1 for intel-iommu to enable "
> +                     "device assignment with IOMMU protection.");
>           exit(1);
>       }
> +
> +    /* Add new ndoe if no mapping was exising before this call */

"node"?

> +    if (old == IOMMU_NOTIFIER_NONE) {
> +        node = g_malloc0(sizeof(*node));
> +        node->vtd_as = vtd_as;
> +        QLIST_INSERT_HEAD(&s->notifiers_list, node, next);
> +        return;
> +    }
> +
> +    /* update notifier node with new flags */
> +    QLIST_FOREACH_SAFE(node, &s->notifiers_list, next, next_node) {
> +        if (node->vtd_as == vtd_as) {
> +            if (new == IOMMU_NOTIFIER_NONE) {
> +                QLIST_REMOVE(node, next);
> +                g_free(node);
> +            }
> +            return;
> +        }
> +    }
>   }
>   
>   static const VMStateDescription vtd_vmstate = {
> @@ -2689,7 +2738,7 @@ static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
>            */
>           trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
>                                     PCI_FUNC(vtd_as->devfn), ce.hi, ce.lo);
> -        vtd_page_walk(&ce, 0, ~0, vtd_replay_hook, (void *)n);
> +        vtd_page_walk(&ce, 0, ~0, vtd_replay_hook, (void *)n, false);
>       } else {
>           trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
>                                       PCI_FUNC(vtd_as->devfn));
> @@ -2871,6 +2920,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
>           return;
>       }
>   
> +    QLIST_INIT(&s->notifiers_list);
>       memset(s->vtd_as_by_bus_num, 0, sizeof(s->vtd_as_by_bus_num));
>       memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
>                             "intel_iommu", DMAR_REG_SIZE);
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 9c3f6c0..832cfc9 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -63,6 +63,7 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
>   typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
>   typedef struct VTDIrq VTDIrq;
>   typedef struct VTD_MSIMessage VTD_MSIMessage;
> +typedef struct IntelIOMMUNotifierNode IntelIOMMUNotifierNode;
>   
>   /* Context-Entry */
>   struct VTDContextEntry {
> @@ -249,6 +250,11 @@ struct VTD_MSIMessage {
>   /* When IR is enabled, all MSI/MSI-X data bits should be zero */
>   #define VTD_IR_MSI_DATA          (0)
>   
> +struct IntelIOMMUNotifierNode {
> +    VTDAddressSpace *vtd_as;
> +    QLIST_ENTRY(IntelIOMMUNotifierNode) next;
> +};
> +
>   /* The iommu (DMAR) device state struct */
>   struct IntelIOMMUState {
>       X86IOMMUState x86_iommu;
> @@ -286,6 +292,8 @@ struct IntelIOMMUState {
>       MemoryRegionIOMMUOps iommu_ops;
>       GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
>       VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
> +    /* list of registered notifiers */
> +    QLIST_HEAD(, IntelIOMMUNotifierNode) notifiers_list;
>   
>       /* interrupt remapping */
>       bool intr_enabled;              /* Whether guest enabled IR */

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 09/14] memory: introduce memory_region_notify_one()
  2017-01-13  7:58   ` Jason Wang
@ 2017-01-16  7:08     ` Peter Xu
  2017-01-16  7:38       ` Jason Wang
  0 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-16  7:08 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv

On Fri, Jan 13, 2017 at 03:58:59PM +0800, Jason Wang wrote:
> 
> 
> On 2017-01-13 11:06, Peter Xu wrote:
> >Generalizing the notify logic in memory_region_notify_iommu() into a
> >single function. This can be further used in customized replay()
> >functions for IOMMUs.
> >
> >Signed-off-by: Peter Xu <peterx@redhat.com>
> >---
> >  include/exec/memory.h | 15 +++++++++++++++
> >  memory.c              | 29 ++++++++++++++++++-----------
> >  2 files changed, 33 insertions(+), 11 deletions(-)
> >
> >diff --git a/include/exec/memory.h b/include/exec/memory.h
> >index 2233f99..f367e54 100644
> >--- a/include/exec/memory.h
> >+++ b/include/exec/memory.h
> >@@ -669,6 +669,21 @@ void memory_region_notify_iommu(MemoryRegion *mr,
> >                                  IOMMUTLBEntry entry);
> >  /**
> >+ * memory_region_notify_one: notify a change in an IOMMU translation
> >+ *                           entry to a single notifier
> >+ *
> >+ * This works just like memory_region_notify_iommu(), but it only
> >+ * notifies a specific notifier, not all of them.
> >+ *
> >+ * @notifier: the notifier to be notified
> >+ * @entry: the new entry in the IOMMU translation table.  The entry
> >+ *         replaces all old entries for the same virtual I/O address range.
> >+ *         Deleted entries have .@perm == 0.
> >+ */
> >+void memory_region_notify_one(IOMMUNotifier *notifier,
> >+                              IOMMUTLBEntry *entry);
> >+
> >+/**
> >   * memory_region_register_iommu_notifier: register a notifier for changes to
> >   * IOMMU translation entries.
> >   *
> >diff --git a/memory.c b/memory.c
> >index df62bd1..6e4c872 100644
> >--- a/memory.c
> >+++ b/memory.c
> >@@ -1665,26 +1665,33 @@ void memory_region_unregister_iommu_notifier(MemoryRegion *mr,
> >      memory_region_update_iommu_notify_flags(mr);
> >  }
> >-void memory_region_notify_iommu(MemoryRegion *mr,
> >-                                IOMMUTLBEntry entry)
> >+void memory_region_notify_one(IOMMUNotifier *notifier,
> >+                              IOMMUTLBEntry *entry)
> >  {
> >-    IOMMUNotifier *iommu_notifier;
> >      IOMMUNotifierFlag request_flags;
> >-    assert(memory_region_is_iommu(mr));
> >-
> >-    if (entry.perm & IOMMU_RW) {
> >+    if (entry->perm & IOMMU_RW) {
> >          request_flags = IOMMU_NOTIFIER_MAP;
> >      } else {
> >          request_flags = IOMMU_NOTIFIER_UNMAP;
> >      }
> 
> Nit: you can keep this outside the loop.

Yes, but this function is used in vtd_replay_hook() as well in a later
patch. If I keep the above outside the loop (IIUC you mean the loop in
memory_region_notify_iommu()), I'll need to set it up as well in the
future vtd_replay_hook(), and that'll be slightly awkward.
Considering that the notification will only happen on mapping changes,
I'd prefer to keep the interface clean, as this patch does.

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 11/14] intel_iommu: provide its own replay() callback
  2017-01-13  9:26   ` Jason Wang
@ 2017-01-16  7:31     ` Peter Xu
  2017-01-16  7:47       ` Jason Wang
  0 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-16  7:31 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv

On Fri, Jan 13, 2017 at 05:26:06PM +0800, Jason Wang wrote:
> 
> 
> On 2017-01-13 11:06, Peter Xu wrote:
> >The default replay() doesn't work for VT-d, since VT-d has a huge
> >default memory region which covers the address range 0-(2^64-1). This
> >will normally bring a dead loop when the guest starts.
> 
> I think it just takes too much time, rather than being a dead loop?

Hmm, I can touch up the commit message above to make it more precise.

> 
> >
> >The solution is simple - we don't walk over all the regions. Instead, we
> >jump over the regions when we find that the page directories are empty.
> >It'll greatly reduce the time to walk the whole region.
> 
> Yes, the problem is that memory_region_iommu_replay() is not smart because:
> 
> - It doesn't understand large pages
> - It tries to go over all possible iovas
> 
> So I'm thinking of introducing something like iommu_ops->iova_iterate() which
> 
> 1) accepts a start iova and returns the next existing map
> 2) understands large pages
> 3) skips unmapped iovas

Though I haven't tested with huge pages yet, this patch should solve
both of the above issues, no? I don't know whether you went over the
page walk logic - it should support huge pages, and it will skip
unmapped iova ranges (at least that's my goal with this patch). In
that case, it looks like this patch is solving the same problem? :)
(though without introducing an iova_iterate() interface)

Please correct me if I misunderstood it.

> 
> >
> >To achieve this, we provide a page walk helper to do that, invoking a
> >corresponding hook function when we find a page we are interested in.
> >vtd_page_walk_level() is the core logic for the page walking. Its
> >interface is designed to suit further use cases, e.g., invalidating a
> >range of addresses.
> >
> >Signed-off-by: Peter Xu<peterx@redhat.com>
> 
> For intel iommu, since we intercept all map and unmap, a trickier idea is
> that we can record the mappings internally in something like an rbtree which
> could be iterated during replay. This saves possible guest IO page table
> traversal, but the drawback is that it may not survive an OOM attacker.

I think the problem is that we need this rbtree per guest-iommu-domain
(because mappings can be different per domain). In that case, I fail
to understand how the tree can help here. :(

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 09/14] memory: introduce memory_region_notify_one()
  2017-01-16  7:08     ` Peter Xu
@ 2017-01-16  7:38       ` Jason Wang
  0 siblings, 0 replies; 93+ messages in thread
From: Jason Wang @ 2017-01-16  7:38 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv



On 2017-01-16 15:08, Peter Xu wrote:
> On Fri, Jan 13, 2017 at 03:58:59PM +0800, Jason Wang wrote:
>>
>> On 2017-01-13 11:06, Peter Xu wrote:
>>> Generalizing the notify logic in memory_region_notify_iommu() into a
>>> single function. This can be further used in customized replay()
>>> functions for IOMMUs.
>>>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>>   include/exec/memory.h | 15 +++++++++++++++
>>>   memory.c              | 29 ++++++++++++++++++-----------
>>>   2 files changed, 33 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>>> index 2233f99..f367e54 100644
>>> --- a/include/exec/memory.h
>>> +++ b/include/exec/memory.h
>>> @@ -669,6 +669,21 @@ void memory_region_notify_iommu(MemoryRegion *mr,
>>>                                   IOMMUTLBEntry entry);
>>>   /**
>>> + * memory_region_notify_one: notify a change in an IOMMU translation
>>> + *                           entry to a single notifier
>>> + *
>>> + * This works just like memory_region_notify_iommu(), but it only
>>> + * notifies a specific notifier, not all of them.
>>> + *
>>> + * @notifier: the notifier to be notified
>>> + * @entry: the new entry in the IOMMU translation table.  The entry
>>> + *         replaces all old entries for the same virtual I/O address range.
>>> + *         Deleted entries have .@perm == 0.
>>> + */
>>> +void memory_region_notify_one(IOMMUNotifier *notifier,
>>> +                              IOMMUTLBEntry *entry);
>>> +
>>> +/**
>>>    * memory_region_register_iommu_notifier: register a notifier for changes to
>>>    * IOMMU translation entries.
>>>    *
>>> diff --git a/memory.c b/memory.c
>>> index df62bd1..6e4c872 100644
>>> --- a/memory.c
>>> +++ b/memory.c
>>> @@ -1665,26 +1665,33 @@ void memory_region_unregister_iommu_notifier(MemoryRegion *mr,
>>>       memory_region_update_iommu_notify_flags(mr);
>>>   }
>>> -void memory_region_notify_iommu(MemoryRegion *mr,
>>> -                                IOMMUTLBEntry entry)
>>> +void memory_region_notify_one(IOMMUNotifier *notifier,
>>> +                              IOMMUTLBEntry *entry)
>>>   {
>>> -    IOMMUNotifier *iommu_notifier;
>>>       IOMMUNotifierFlag request_flags;
>>> -    assert(memory_region_is_iommu(mr));
>>> -
>>> -    if (entry.perm & IOMMU_RW) {
>>> +    if (entry->perm & IOMMU_RW) {
>>>           request_flags = IOMMU_NOTIFIER_MAP;
>>>       } else {
>>>           request_flags = IOMMU_NOTIFIER_UNMAP;
>>>       }
>> Nit: you can keep this outside the loop.
> Yes, but this function is used in vtd_replay_hook() as well in a later
> patch. If I keep the above outside the loop (IIUC you mean the loop in
> memory_region_notify_iommu()), I'll need to set it up as well in the
> future vtd_replay_hook(), and that'll be slightly awkward.
> Considering that the notification will only happen on mapping changes,
> I'd prefer to keep the interface clean, as this patch does.
>
> Thanks,
>
> -- peterx

Ok, I see.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 12/14] intel_iommu: do replay when context invalidate
  2017-01-16  5:53   ` Jason Wang
@ 2017-01-16  7:43     ` Peter Xu
  2017-01-16  7:52       ` Jason Wang
  0 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-16  7:43 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv

On Mon, Jan 16, 2017 at 01:53:54PM +0800, Jason Wang wrote:
> 
> 
> On 2017-01-13 11:06, Peter Xu wrote:
> >Before this one, we only invalidate the context cache when we receive
> >context entry invalidations. However it's possible that the invalidation
> >also contains a domain switch (only if cache-mode is enabled for the
> >vIOMMU).
> 
> So let's check for CM before replaying?

When CM is not set, there should be no device that needs
IOMMU_NOTIFIER_MAP notifies. So IMHO it won't hurt if we replay here
(the notifier_list will only contain UNMAP notifiers at most, and
sending UNMAP to those devices should not affect them at all).

If we check CM before replaying, it'll be faster when the guest changes
the iommu domain for a specific device. But after all, this kind of
operation is extremely rare, while if we check the CM bit, we have an
"assumption" in the code that MAP depends on CM. In that case, to make
the code cleaner, I'd slightly prefer not to check it here. What do you
think?

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 11/14] intel_iommu: provide its own replay() callback
  2017-01-16  7:31     ` Peter Xu
@ 2017-01-16  7:47       ` Jason Wang
  2017-01-16  7:59         ` Peter Xu
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Wang @ 2017-01-16  7:47 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv



On 2017-01-16 15:31, Peter Xu wrote:
> On Fri, Jan 13, 2017 at 05:26:06PM +0800, Jason Wang wrote:
>>
>> On 2017-01-13 11:06, Peter Xu wrote:
>>> The default replay() doesn't work for VT-d, since VT-d has a huge
>>> default memory region which covers the address range 0-(2^64-1). This
>>> will normally bring a dead loop when the guest starts.
>> I think it just takes too much time, rather than being a dead loop?
> Hmm, I can touch up the commit message above to make it more precise.
>
>>> The solution is simple - we don't walk over all the regions. Instead, we
>>> jump over the regions when we find that the page directories are empty.
>>> It'll greatly reduce the time to walk the whole region.
>> Yes, the problem is that memory_region_iommu_replay() is not smart because:
>>
>> - It doesn't understand large pages
>> - It tries to go over all possible iovas
>>
>> So I'm thinking of introducing something like iommu_ops->iova_iterate() which
>>
>> 1) accepts a start iova and returns the next existing map
>> 2) understands large pages
>> 3) skips unmapped iovas
> Though I haven't tested with huge pages yet, this patch should solve
> both of the above issues, no? I don't know whether you went over the
> page walk logic - it should support huge pages, and it will skip
> unmapped iova ranges (at least that's my goal with this patch). In
> that case, it looks like this patch is solving the same problem? :)
> (though without introducing an iova_iterate() interface)
>
> Please correct me if I misunderstood it.

Kind of :) I'm fine with this patch, but just want to:

- reuse most of the code in the patch
- keep the current memory_region_iommu_replay() logic

So what I'm suggesting is just a slight change of API which lets the 
caller decide what to do with each range of iova. So it could be 
reused for other things besides replaying.

But if you'd like to keep this patch as is, I don't object.

>
>>> To achieve this, we provide a page walk helper to do that, invoking a
>>> corresponding hook function when we find a page we are interested in.
>>> vtd_page_walk_level() is the core logic for the page walking. Its
>>> interface is designed to suit further use cases, e.g., invalidating a
>>> range of addresses.
>>>
>>> Signed-off-by: Peter Xu<peterx@redhat.com>
>> For intel iommu, since we intercept all map and unmap, a trickier idea is
>> that we can record the mappings internally in something like an rbtree which
>> could be iterated during replay. This saves possible guest IO page table
>> traversal, but the drawback is that it may not survive an OOM attacker.
> I think the problem is that we need this rbtree per guest-iommu-domain
> (because mappings can be different per domain). In that case, I fail
> to understand how the tree can help here. :(

Right, I see.

Thanks

>
> Thanks,
>
> -- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 13/14] intel_iommu: allow dynamic switch of IOMMU region
  2017-01-16  6:20   ` Jason Wang
@ 2017-01-16  7:50     ` Peter Xu
  2017-01-16  8:01       ` Jason Wang
  0 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-16  7:50 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv

On Mon, Jan 16, 2017 at 02:20:31PM +0800, Jason Wang wrote:

[...]

> >diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> >index fd75112..2596f11 100644
> >--- a/hw/i386/intel_iommu.c
> >+++ b/hw/i386/intel_iommu.c
> >@@ -1343,9 +1343,49 @@ static void vtd_handle_gcmd_sirtp(IntelIOMMUState *s)
> >      vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_IRTPS);
> >  }
> >+static void vtd_switch_address_space(VTDAddressSpace *as, bool iommu_enabled)
> 
> Looks like you can check s->dmar_enabled here?

Yes, we need to check the old state in case we don't need a switch at
all. Actually I checked it...

> 
> >+{
> >+    assert(as);
> >+
> >+    trace_vtd_switch_address_space(pci_bus_num(as->bus),
> >+                                   VTD_PCI_SLOT(as->devfn),
> >+                                   VTD_PCI_FUNC(as->devfn),
> >+                                   iommu_enabled);
> >+
> >+    /* Turn off first then on the other */
> >+    if (iommu_enabled) {
> >+        memory_region_set_enabled(&as->sys_alias, false);
> >+        memory_region_set_enabled(&as->iommu, true);
> >+    } else {
> >+        memory_region_set_enabled(&as->iommu, false);
> >+        memory_region_set_enabled(&as->sys_alias, true);
> >+    }
> >+}

[...]

> >  /* Handle Translation Enable/Disable */
> >  static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
> >  {
> >+    if (s->dmar_enabled == en) {
> >+        return;
> >+    }
> >+

... here :) ... and ...

[...]

> >+        memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
> >+                                 "vtd_sys_alias", get_system_memory(),
> >+                                 0, memory_region_size(get_system_memory()));
> >          memory_region_init_io(&vtd_dev_as->iommu_ir, OBJECT(s),
> >                                &vtd_mem_ir_ops, s, "intel_iommu_ir",
> >                                VTD_INTERRUPT_ADDR_SIZE);
> >-        memory_region_add_subregion(&vtd_dev_as->iommu, VTD_INTERRUPT_ADDR_FIRST,
> >-                                    &vtd_dev_as->iommu_ir);
> >-        address_space_init(&vtd_dev_as->as,
> >-                           &vtd_dev_as->iommu, name);
> >+        memory_region_init(&vtd_dev_as->root, OBJECT(s),
> >+                           "vtd_root", UINT64_MAX);
> >+        memory_region_add_subregion_overlap(&vtd_dev_as->root,
> >+                                            VTD_INTERRUPT_ADDR_FIRST,
> >+                                            &vtd_dev_as->iommu_ir, 64);
> >+        address_space_init(&vtd_dev_as->as, &vtd_dev_as->root, name);
> >+        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
> >+                                            &vtd_dev_as->sys_alias, 1);
> >+        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
> >+                                            &vtd_dev_as->iommu, 1);
> >+        vtd_switch_address_space(vtd_dev_as, s->dmar_enabled);

... here I also used vtd_switch_address_space() to set up the initial
state of the regions (in order to share code). So how about I rename
vtd_switch_address_space() to something like vtd_setup_address_space(),
to avoid misunderstanding?

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 12/14] intel_iommu: do replay when context invalidate
  2017-01-16  7:43     ` Peter Xu
@ 2017-01-16  7:52       ` Jason Wang
  2017-01-16  8:02         ` Peter Xu
  2017-01-16  8:18         ` Peter Xu
  0 siblings, 2 replies; 93+ messages in thread
From: Jason Wang @ 2017-01-16  7:52 UTC (permalink / raw)
  To: Peter Xu
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel,
	alex.williamson



On 2017-01-16 15:43, Peter Xu wrote:
> On Mon, Jan 16, 2017 at 01:53:54PM +0800, Jason Wang wrote:
>>
>> On 2017-01-13 11:06, Peter Xu wrote:
>>> Before this one, we only invalidate the context cache when we receive
>>> context entry invalidations. However it's possible that the invalidation
>>> also contains a domain switch (only if cache-mode is enabled for the
>>> vIOMMU).
>> So let's check for CM before replaying?
> When CM is not set, there should be no device that needs
> IOMMU_NOTIFIER_MAP notifies. So IMHO it won't hurt if we replay here
> (the notifier_list will only contain UNMAP notifiers at most, and
> sending UNMAP to those devices should not affect them at all).
>
> If we check CM before replaying, it'll be faster when the guest changes
> the iommu domain for a specific device. But after all, this kind of
> operation is extremely rare, while if we check the CM bit, we have an
> "assumption" in the code that MAP depends on CM. In that case, to make
> the code cleaner, I'd slightly prefer not to check it here. What do you
> think?

Ok, I think maybe it's better to add a comment here.

Thanks

>
> Thanks,
>
> -- peterx
>


* Re: [Qemu-devel] [PATCH RFC v3 11/14] intel_iommu: provide its own replay() callback
  2017-01-16  7:47       ` Jason Wang
@ 2017-01-16  7:59         ` Peter Xu
  2017-01-16  8:03           ` Jason Wang
  0 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-16  7:59 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv

On Mon, Jan 16, 2017 at 03:47:08PM +0800, Jason Wang wrote:
> 
> 
> On 2017年01月16日 15:31, Peter Xu wrote:
> >On Fri, Jan 13, 2017 at 05:26:06PM +0800, Jason Wang wrote:
> >>
> >>On 2017年01月13日 11:06, Peter Xu wrote:
> >>>The default replay() doesn't work for VT-d since VT-d will have a huge
> >>>default memory region which covers the address range 0-(2^64-1). This
> >>>will normally bring a dead loop when the guest starts.
> >>I think it just takes too much time instead of a dead loop?
> >Hmm, I can touch up the commit message above to make it more precise.
> >
> >>>The solution is simple - we don't walk over all the regions. Instead, we
> >>>jump over the regions when we find that the page directories are empty.
> >>>It'll greatly reduce the time to walk the whole region.
> >>Yes, the problem is that memory_region_iommu_replay() is not smart because:
> >>
> >>- It doesn't understand large pages
> >>- It tries to go over all possible iovas
> >>
> >>So I'm thinking to introduce something like iommu_ops->iova_iterate() which
> >>
> >>1) accepts a start iova and returns the next existing map
> >>2) understands large pages
> >>3) skips unmapped iovas
> >I haven't tested with huge pages yet, but this patch should solve
> >both of the above issues? I don't know whether you went over the page
> >walk logic - it should support huge pages, and it will skip
> >unmapped iova ranges (at least that's my goal for this patch). In
> >that case, it looks like this patch is solving the same problem? :)
> >(though without introducing an iova_iterate() interface)
> >
> >Please correct me if I misunderstood it.
> 
> Kind of :) I'm fine with this patch, but just want to:
> 
> - reuse most of the code in the patch
> - keep the current memory_region_iommu_replay() logic
> 
> So what I'm suggesting is just a slight change of API which can let the
> caller decide what it needs to do with each range of iova. So it could be
> reused for other things besides replaying.
> 
> But if you'd like to keep this patch as is, I don't object.

I see. Then I understand what you mean here. I had the same thought
before; that's why I exposed vtd_page_walk() with a hook. If you
check the page walk function comment:

/**
 * vtd_page_walk - walk specific IOVA range, and call the hook
 *
 * @ce: context entry to walk upon
 * @start: IOVA address to start the walk
 * @end: IOVA range end address (start <= addr < end)
 * @hook_fn: the hook to be called for each detected area
 * @private: private data for the hook function
 */

So I didn't implement the notification in the page walk at all - only
in the hook_fn. If any caller is interested in doing something other
than the notification, we can simply export the page walk interface
and let it provide its own "hook_fn", which will then be triggered for
each valid page (no matter whether it's a huge or small one).

If we can have a more general interface in the future - no matter
whether we call it iova_iterate() or something else (I'd prefer the
hook way to do it, so maybe a common page walker with a hook
function), we can do it simply (at least for the Intel platform) based
on this vtd_page_walk thing.

Thanks,

-- peterx


* Re: [Qemu-devel] [PATCH RFC v3 13/14] intel_iommu: allow dynamic switch of IOMMU region
  2017-01-16  7:50     ` Peter Xu
@ 2017-01-16  8:01       ` Jason Wang
  2017-01-16  8:12         ` Peter Xu
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Wang @ 2017-01-16  8:01 UTC (permalink / raw)
  To: Peter Xu
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel,
	alex.williamson



On 2017年01月16日 15:50, Peter Xu wrote:
> On Mon, Jan 16, 2017 at 02:20:31PM +0800, Jason Wang wrote:
>
> [...]
>
>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>> index fd75112..2596f11 100644
>>> --- a/hw/i386/intel_iommu.c
>>> +++ b/hw/i386/intel_iommu.c
>>> @@ -1343,9 +1343,49 @@ static void vtd_handle_gcmd_sirtp(IntelIOMMUState *s)
>>>       vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_IRTPS);
>>>   }
>>> +static void vtd_switch_address_space(VTDAddressSpace *as, bool iommu_enabled)
>> Looks like you can check s->dmar_enabled here?
> Yes, we need to check old state in case we don't need a switch at all.
> Actually I checked it...
>

I mean, is there a chance that iommu_enabled (a better name would be
dmar_enabled) is not equal to s->dmar_enabled? Looks not.

vtd_handle_gcmd_te() did:

     ...
     if (en) {
         s->dmar_enabled = true;
         /* Ok - report back to driver */
         vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_TES);
     } else {
         s->dmar_enabled = false;
     ...

You can call vtd_switch_address_space_all(s, en) after this, which will
call this function. And the other caller, as you've pointed out, already
calls this through s->dmar_enabled. So en here is always s->dmar_enabled?

Thanks


* Re: [Qemu-devel] [PATCH RFC v3 12/14] intel_iommu: do replay when context invalidate
  2017-01-16  7:52       ` Jason Wang
@ 2017-01-16  8:02         ` Peter Xu
  2017-01-16  8:18         ` Peter Xu
  1 sibling, 0 replies; 93+ messages in thread
From: Peter Xu @ 2017-01-16  8:02 UTC (permalink / raw)
  To: Jason Wang
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel,
	alex.williamson

On Mon, Jan 16, 2017 at 03:52:10PM +0800, Jason Wang wrote:
> 
> 
> On 2017年01月16日 15:43, Peter Xu wrote:
> >On Mon, Jan 16, 2017 at 01:53:54PM +0800, Jason Wang wrote:
> >>
> >>On 2017年01月13日 11:06, Peter Xu wrote:
> >>>Before this one, we only invalidated the context cache when we received
> >>>context entry invalidations. However, it's possible that the invalidation
> >>>also implies a domain switch (only if cache-mode is enabled for the vIOMMU).
> >>So let's check for CM before replaying?
> >When CM is not set, there should be no devices that need
> >IOMMU_NOTIFIER_MAP notifications. So IMHO it won't hurt if we replay
> >here (so the notifier_list will contain UNMAP notifiers at most, and
> >sending UNMAP to those devices should not affect them at all).
> >
> >If we check CM before replay, it'll be faster when the guest changes
> >the iommu domain for a specific device. But after all, this kind of
> >operation is extremely rare, while if we check the CM bit, we have an
> >"assumption" in the code that MAP depends on CM. In that case, to make
> >the code cleaner, I'd slightly prefer not to check it here. What do
> >you think?
> 
> Ok, I think maybe it's better to add a comment here.

Will do it. Thanks!

-- peterx


* Re: [Qemu-devel] [PATCH RFC v3 11/14] intel_iommu: provide its own replay() callback
  2017-01-16  7:59         ` Peter Xu
@ 2017-01-16  8:03           ` Jason Wang
  2017-01-16  8:06             ` Peter Xu
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Wang @ 2017-01-16  8:03 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv



On 2017年01月16日 15:59, Peter Xu wrote:
> On Mon, Jan 16, 2017 at 03:47:08PM +0800, Jason Wang wrote:
>>
>> On 2017年01月16日 15:31, Peter Xu wrote:
>>> On Fri, Jan 13, 2017 at 05:26:06PM +0800, Jason Wang wrote:
>>>> On 2017年01月13日 11:06, Peter Xu wrote:
>>>>> The default replay() doesn't work for VT-d since VT-d will have a huge
>>>>> default memory region which covers the address range 0-(2^64-1). This
>>>>> will normally bring a dead loop when the guest starts.
>>>> I think it just takes too much time instead of a dead loop?
>>> Hmm, I can touch up the commit message above to make it more precise.
>>>
>>>>> The solution is simple - we don't walk over all the regions. Instead, we
>>>>> jump over the regions when we find that the page directories are empty.
>>>>> It'll greatly reduce the time to walk the whole region.
>>>> Yes, the problem is that memory_region_iommu_replay() is not smart because:
>>>>
>>>> - It doesn't understand large pages
>>>> - It tries to go over all possible iovas
>>>>
>>>> So I'm thinking to introduce something like iommu_ops->iova_iterate() which
>>>>
>>>> 1) accepts a start iova and returns the next existing map
>>>> 2) understands large pages
>>>> 3) skips unmapped iovas
>>> I haven't tested with huge pages yet, but this patch should solve
>>> both of the above issues? I don't know whether you went over the page
>>> walk logic - it should support huge pages, and it will skip
>>> unmapped iova ranges (at least that's my goal for this patch). In
>>> that case, it looks like this patch is solving the same problem? :)
>>> (though without introducing an iova_iterate() interface)
>>>
>>> Please correct me if I misunderstood it.
>> Kind of :) I'm fine with this patch, but just want to:
>>
>> - reuse most of the code in the patch
>> - keep the current memory_region_iommu_replay() logic
>>
>> So what I'm suggesting is just a slight change of API which can let the
>> caller decide what it needs to do with each range of iova. So it could be
>> reused for other things besides replaying.
>>
>> But if you'd like to keep this patch as is, I don't object.
> I see. Then I understand what you mean here. I had the same thought
> before; that's why I exposed vtd_page_walk() with a hook. If you
> check the page walk function comment:
>
> /**
>   * vtd_page_walk - walk specific IOVA range, and call the hook
>   *
>   * @ce: context entry to walk upon
>   * @start: IOVA address to start the walk
>   * @end: IOVA range end address (start <= addr < end)
>   * @hook_fn: the hook to be called for each detected area
>   * @private: private data for the hook function
>   */
>
> So I didn't implement the notification in the page walk at all - only
> in the hook_fn. If any caller is interested in doing something other
> than the notification, we can simply export the page walk interface
> and let it provide its own "hook_fn", which will then be triggered for
> each valid page (no matter whether it's a huge or small one).
>
> If we can have a more general interface in the future - no matter
> whether we call it iova_iterate() or something else (I'd prefer the
> hook way to do it, so maybe a common page walker with a hook
> function), we can do it simply (at least for the Intel platform) based
> on this vtd_page_walk thing.
>
> Thanks,
>
> -- peterx

Yes, but the problem is that hook_fn is only visible inside the intel
iommu code.

Thanks


* Re: [Qemu-devel] [PATCH RFC v3 11/14] intel_iommu: provide its own replay() callback
  2017-01-16  8:03           ` Jason Wang
@ 2017-01-16  8:06             ` Peter Xu
  2017-01-16  8:23               ` Jason Wang
  0 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-16  8:06 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv

On Mon, Jan 16, 2017 at 04:03:22PM +0800, Jason Wang wrote:

[...]

> >>>I haven't tested with huge pages yet, but this patch should solve
> >>>both of the above issues? I don't know whether you went over the page
> >>>walk logic - it should support huge pages, and it will skip
> >>>unmapped iova ranges (at least that's my goal for this patch). In
> >>>that case, it looks like this patch is solving the same problem? :)
> >>>(though without introducing an iova_iterate() interface)
> >>>
> >>>Please correct me if I misunderstood it.
> >>Kind of :) I'm fine with this patch, but just want to:
> >>
> >>- reuse most of the code in the patch
> >>- keep the current memory_region_iommu_replay() logic
> >>
> >>So what I'm suggesting is just a slight change of API which can let the
> >>caller decide what it needs to do with each range of iova. So it could be
> >>reused for other things besides replaying.
> >>
> >>But if you'd like to keep this patch as is, I don't object.
> >I see. Then I understand what you mean here. I had the same thought
> >before; that's why I exposed vtd_page_walk() with a hook. If you
> >check the page walk function comment:
> >
> >/**
> >  * vtd_page_walk - walk specific IOVA range, and call the hook
> >  *
> >  * @ce: context entry to walk upon
> >  * @start: IOVA address to start the walk
> >  * @end: IOVA range end address (start <= addr < end)
> >  * @hook_fn: the hook to be called for each detected area
> >  * @private: private data for the hook function
> >  */
> >
> >So I didn't implement the notification in the page walk at all - only
> >in the hook_fn. If any caller is interested in doing something other
> >than the notification, we can simply export the page walk interface
> >and let it provide its own "hook_fn", which will then be triggered for
> >each valid page (no matter whether it's a huge or small one).
> >
> >If we can have a more general interface in the future - no matter
> >whether we call it iova_iterate() or something else (I'd prefer the
> >hook way to do it, so maybe a common page walker with a hook
> >function), we can do it simply (at least for the Intel platform) based
> >on this vtd_page_walk thing.
> >
> >Thanks,
> >
> >-- peterx
> 
> Yes, but the problem is that hook_fn is only visible inside the intel
> iommu code.

Right.

Btw, do we have an existing issue that could leverage this interface
besides replay?

-- peterx


* Re: [Qemu-devel] [PATCH RFC v3 13/14] intel_iommu: allow dynamic switch of IOMMU region
  2017-01-16  8:01       ` Jason Wang
@ 2017-01-16  8:12         ` Peter Xu
  2017-01-16  8:25           ` Jason Wang
  0 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-16  8:12 UTC (permalink / raw)
  To: Jason Wang
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel,
	alex.williamson

On Mon, Jan 16, 2017 at 04:01:00PM +0800, Jason Wang wrote:
> 
> 
> On 2017年01月16日 15:50, Peter Xu wrote:
> >On Mon, Jan 16, 2017 at 02:20:31PM +0800, Jason Wang wrote:
> >
> >[...]
> >
> >>>diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> >>>index fd75112..2596f11 100644
> >>>--- a/hw/i386/intel_iommu.c
> >>>+++ b/hw/i386/intel_iommu.c
> >>>@@ -1343,9 +1343,49 @@ static void vtd_handle_gcmd_sirtp(IntelIOMMUState *s)
> >>>      vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_IRTPS);
> >>>  }
> >>>+static void vtd_switch_address_space(VTDAddressSpace *as, bool iommu_enabled)
> >>Looks like you can check s->dmar_enabled here?
> >Yes, we need to check old state in case we don't need a switch at all.
> >Actually I checked it...
> >
> 
> I mean, is there a chance that iommu_enabled (a better name would be
> dmar_enabled) is not equal to s->dmar_enabled? Looks not.
> 
> vtd_handle_gcmd_te() did:
> 
>     ...
>     if (en) {
>         s->dmar_enabled = true;
>         /* Ok - report back to driver */
>         vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_TES);
>     } else {
>         s->dmar_enabled = false;
>     ...
> 
> You can call vtd_switch_address_space_all(s, en) after this, which will
> call this function. And the other caller, as you've pointed out, already
> calls this through s->dmar_enabled. So en here is always s->dmar_enabled?

Hmm, yes...

(I would still prefer keeping this parameter for readability.
 Though, I do prefer your suggestion to rename it to dmar_enabled.)

-- peterx


* Re: [Qemu-devel] [PATCH RFC v3 12/14] intel_iommu: do replay when context invalidate
  2017-01-16  7:52       ` Jason Wang
  2017-01-16  8:02         ` Peter Xu
@ 2017-01-16  8:18         ` Peter Xu
  2017-01-16  8:28           ` Jason Wang
  1 sibling, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-16  8:18 UTC (permalink / raw)
  To: Jason Wang
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel,
	alex.williamson

On Mon, Jan 16, 2017 at 03:52:10PM +0800, Jason Wang wrote:
> 
> 
> On 2017年01月16日 15:43, Peter Xu wrote:
> >On Mon, Jan 16, 2017 at 01:53:54PM +0800, Jason Wang wrote:
> >>
> >>On 2017年01月13日 11:06, Peter Xu wrote:
> >>>Before this one, we only invalidated the context cache when we received
> >>>context entry invalidations. However, it's possible that the invalidation
> >>>also implies a domain switch (only if cache-mode is enabled for the vIOMMU).
> >>So let's check for CM before replaying?
> >When CM is not set, there should be no devices that need
> >IOMMU_NOTIFIER_MAP notifications. So IMHO it won't hurt if we replay
> >here (so the notifier_list will contain UNMAP notifiers at most, and
> >sending UNMAP to those devices should not affect them at all).
> >
> >If we check CM before replay, it'll be faster when the guest changes
> >the iommu domain for a specific device. But after all, this kind of
> >operation is extremely rare, while if we check the CM bit, we have an
> >"assumption" in the code that MAP depends on CM. In that case, to make
> >the code cleaner, I'd slightly prefer not to check it here. What do
> >you think?
> 
> Ok, I think maybe it's better to add a comment here.

How about this?

+                /*
+                 * So a device is moving out of (or moving into) a
+                 * domain; a replay() suits here to notify all the
+                 * registered IOMMU_NOTIFIER_MAP notifiers about this
+                 * change. This won't do any harm even if we have no
+                 * such notifier registered - the IOMMU notification
+                 * framework will skip MAP notifications in that case.
+                 */
                 memory_region_iommu_replay_all(&vtd_as->iommu);

Thanks,

-- peterx


* Re: [Qemu-devel] [PATCH RFC v3 11/14] intel_iommu: provide its own replay() callback
  2017-01-16  8:06             ` Peter Xu
@ 2017-01-16  8:23               ` Jason Wang
  0 siblings, 0 replies; 93+ messages in thread
From: Jason Wang @ 2017-01-16  8:23 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv



On 2017年01月16日 16:06, Peter Xu wrote:
> On Mon, Jan 16, 2017 at 04:03:22PM +0800, Jason Wang wrote:
>
> [...]
>
>>>>> I haven't tested with huge pages yet, but this patch should solve
>>>>> both of the above issues? I don't know whether you went over the page
>>>>> walk logic - it should support huge pages, and it will skip
>>>>> unmapped iova ranges (at least that's my goal for this patch). In
>>>>> that case, it looks like this patch is solving the same problem? :)
>>>>> (though without introducing an iova_iterate() interface)
>>>>>
>>>>> Please correct me if I misunderstood it.
>>>> Kind of :) I'm fine with this patch, but just want to:
>>>>
>>>> - reuse most of the code in the patch
>>>> - keep the current memory_region_iommu_replay() logic
>>>>
>>>> So what I'm suggesting is just a slight change of API which can let the
>>>> caller decide what it needs to do with each range of iova. So it could be
>>>> reused for other things besides replaying.
>>>>
>>>> But if you'd like to keep this patch as is, I don't object.
>>> I see. Then I understand what you mean here. I had the same thought
>>> before; that's why I exposed vtd_page_walk() with a hook. If you
>>> check the page walk function comment:
>>>
>>> /**
>>>   * vtd_page_walk - walk specific IOVA range, and call the hook
>>>   *
>>>   * @ce: context entry to walk upon
>>>   * @start: IOVA address to start the walk
>>>   * @end: IOVA range end address (start <= addr < end)
>>>   * @hook_fn: the hook to be called for each detected area
>>>   * @private: private data for the hook function
>>>   */
>>>
>>> So I didn't implement the notification in the page walk at all - only
>>> in the hook_fn. If any caller is interested in doing something other
>>> than the notification, we can simply export the page walk interface
>>> and let it provide its own "hook_fn", which will then be triggered for
>>> each valid page (no matter whether it's a huge or small one).
>>>
>>> If we can have a more general interface in the future - no matter
>>> whether we call it iova_iterate() or something else (I'd prefer the
>>> hook way to do it, so maybe a common page walker with a hook
>>> function), we can do it simply (at least for the Intel platform) based
>>> on this vtd_page_walk thing.
>>>
>>> Thanks,
>>>
>>> -- peterx
>> Yes, but the problem is that hook_fn is only visible inside the intel
>> iommu code.
> Right.
>
> Btw, do we have an existing issue that could leverage this interface
> besides replay?
>
> -- peterx

Seems not, so I'm fine with the current code; I just wanted to show the
possibility of reusing it in the future.

Thanks


* Re: [Qemu-devel] [PATCH RFC v3 13/14] intel_iommu: allow dynamic switch of IOMMU region
  2017-01-16  8:12         ` Peter Xu
@ 2017-01-16  8:25           ` Jason Wang
  2017-01-16  8:32             ` Peter Xu
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Wang @ 2017-01-16  8:25 UTC (permalink / raw)
  To: Peter Xu
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel,
	alex.williamson



On 2017年01月16日 16:12, Peter Xu wrote:
> On Mon, Jan 16, 2017 at 04:01:00PM +0800, Jason Wang wrote:
>>
>> On 2017年01月16日 15:50, Peter Xu wrote:
>>> On Mon, Jan 16, 2017 at 02:20:31PM +0800, Jason Wang wrote:
>>>
>>> [...]
>>>
>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>>> index fd75112..2596f11 100644
>>>>> --- a/hw/i386/intel_iommu.c
>>>>> +++ b/hw/i386/intel_iommu.c
>>>>> @@ -1343,9 +1343,49 @@ static void vtd_handle_gcmd_sirtp(IntelIOMMUState *s)
>>>>>       vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_IRTPS);
>>>>>   }
>>>>> +static void vtd_switch_address_space(VTDAddressSpace *as, bool iommu_enabled)
>>>> Looks like you can check s->dmar_enabled here?
>>> Yes, we need to check old state in case we don't need a switch at all.
>>> Actually I checked it...
>>>
>> I mean, is there a chance that iommu_enabled (a better name would be
>> dmar_enabled) is not equal to s->dmar_enabled? Looks not.
>>
>> vtd_handle_gcmd_te() did:
>>
>>      ...
>>      if (en) {
>>          s->dmar_enabled = true;
>>          /* Ok - report back to driver */
>>          vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_TES);
>>      } else {
>>          s->dmar_enabled = false;
>>      ...
>>
>> You can call vtd_switch_address_space_all(s, en) after this, which will
>> call this function. And the other caller, as you've pointed out, already
>> calls this through s->dmar_enabled. So en here is always s->dmar_enabled?
> Hmm, yes...
>
> (I would still prefer keeping this parameter for readability.
>   Though, I do prefer your suggestion to rename it to dmar_enabled.)
>
> -- peterx

I think this does not give more readability :) Maybe I was wrong; let's
leave this for the maintainer.

Thanks :)


* Re: [Qemu-devel] [PATCH RFC v3 12/14] intel_iommu: do replay when context invalidate
  2017-01-16  8:18         ` Peter Xu
@ 2017-01-16  8:28           ` Jason Wang
  0 siblings, 0 replies; 93+ messages in thread
From: Jason Wang @ 2017-01-16  8:28 UTC (permalink / raw)
  To: Peter Xu
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel,
	alex.williamson



On 2017年01月16日 16:18, Peter Xu wrote:
> On Mon, Jan 16, 2017 at 03:52:10PM +0800, Jason Wang wrote:
>>
>> On 2017年01月16日 15:43, Peter Xu wrote:
>>> On Mon, Jan 16, 2017 at 01:53:54PM +0800, Jason Wang wrote:
>>>> On 2017年01月13日 11:06, Peter Xu wrote:
>>>>> Before this one, we only invalidated the context cache when we received
>>>>> context entry invalidations. However, it's possible that the invalidation
>>>>> also implies a domain switch (only if cache-mode is enabled for the vIOMMU).
>>>> So let's check for CM before replaying?
>>> When CM is not set, there should be no devices that need
>>> IOMMU_NOTIFIER_MAP notifications. So IMHO it won't hurt if we replay
>>> here (so the notifier_list will contain UNMAP notifiers at most, and
>>> sending UNMAP to those devices should not affect them at all).
>>>
>>> If we check CM before replay, it'll be faster when the guest changes
>>> the iommu domain for a specific device. But after all, this kind of
>>> operation is extremely rare, while if we check the CM bit, we have an
>>> "assumption" in the code that MAP depends on CM. In that case, to make
>>> the code cleaner, I'd slightly prefer not to check it here. What do
>>> you think?
>> Ok, I think maybe it's better to add a comment here.
> How about this?
>
> +                /*
> +                 * So a device is moving out of (or moving into) a
> +                 * domain; a replay() suits here to notify all the
> +                 * registered IOMMU_NOTIFIER_MAP notifiers about this
> +                 * change. This won't do any harm even if we have no
> +                 * such notifier registered - the IOMMU notification
> +                 * framework will skip MAP notifications in that case.
> +                 */
>                   memory_region_iommu_replay_all(&vtd_as->iommu);
>
> Thanks,
>
> -- peterx

I'm fine with this.

Thanks


* Re: [Qemu-devel] [PATCH RFC v3 13/14] intel_iommu: allow dynamic switch of IOMMU region
  2017-01-16  8:25           ` Jason Wang
@ 2017-01-16  8:32             ` Peter Xu
  2017-01-16 16:25               ` Michael S. Tsirkin
  0 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-16  8:32 UTC (permalink / raw)
  To: Jason Wang
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, qemu-devel,
	alex.williamson

On Mon, Jan 16, 2017 at 04:25:35PM +0800, Jason Wang wrote:
> 
> 
> On 2017年01月16日 16:12, Peter Xu wrote:
> >On Mon, Jan 16, 2017 at 04:01:00PM +0800, Jason Wang wrote:
> >>
> >>On 2017年01月16日 15:50, Peter Xu wrote:
> >>>On Mon, Jan 16, 2017 at 02:20:31PM +0800, Jason Wang wrote:
> >>>
> >>>[...]
> >>>
> >>>>>diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> >>>>>index fd75112..2596f11 100644
> >>>>>--- a/hw/i386/intel_iommu.c
> >>>>>+++ b/hw/i386/intel_iommu.c
> >>>>>@@ -1343,9 +1343,49 @@ static void vtd_handle_gcmd_sirtp(IntelIOMMUState *s)
> >>>>>      vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_IRTPS);
> >>>>>  }
> >>>>>+static void vtd_switch_address_space(VTDAddressSpace *as, bool iommu_enabled)
> >>>>Looks like you can check s->dmar_enabled here?
> >>>Yes, we need to check old state in case we don't need a switch at all.
> >>>Actually I checked it...
> >>>
> >>I mean, is there a chance that iommu_enabled (a better name would be
> >>dmar_enabled) is not equal to s->dmar_enabled? Looks not.
> >>
> >>vtd_handle_gcmd_te() did:
> >>
> >>     ...
> >>     if (en) {
> >>         s->dmar_enabled = true;
> >>         /* Ok - report back to driver */
> >>         vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_TES);
> >>     } else {
> >>         s->dmar_enabled = false;
> >>     ...
> >>
> >>You can call vtd_switch_address_space_all(s, en) after this, which will
> >>call this function. And the other caller, as you've pointed out, already
> >>calls this through s->dmar_enabled. So en here is always s->dmar_enabled?
> >Hmm, yes...
> >
> >(I would still prefer keeping this parameter for readability.
> >  Though, I do prefer your suggestion to rename it to dmar_enabled.)
> >
> >-- peterx
> 
> I think this does not give more readability :) Maybe I was wrong; let's
> leave this for the maintainer.
> 
> Thanks :)

Thanks for reviewing this series so fast!

I have no strong opinion either. Maybe you are right. :-)

Michael, please let me know if you dislike this, so I can remove this
parameter (it equals to as->iommu_state->dmar_enabled).

Thanks,

-- peterx


* Re: [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio devices
  2017-01-16  6:30   ` Jason Wang
@ 2017-01-16  9:18     ` Peter Xu
  2017-01-16  9:54       ` Jason Wang
  2017-01-16  9:20     ` Peter Xu
  1 sibling, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-16  9:18 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv

On Mon, Jan 16, 2017 at 02:30:20PM +0800, Jason Wang wrote:
> 
> 
> On 2017年01月13日 11:06, Peter Xu wrote:
> >This patch is based on Aviv Ben-David (<bd.aviv@gmail.com>)'s patch
> >upstream:
> >
> >   "IOMMU: enable intel_iommu map and unmap notifiers"
> >   https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01453.html
> >
> >However I removed/fixed some content, and added my own code.
> >
> >Instead of calling translate() on every page for iotlb invalidations
> >(which is slower), we walk the pages when needed and notify in a hook
> >function.
> >
> >This patch enables vfio devices for VT-d emulation.
> >
> >Signed-off-by: Peter Xu <peterx@redhat.com>
> >---
> >  hw/i386/intel_iommu.c         | 68 +++++++++++++++++++++++++++++++++++++------
> >  include/hw/i386/intel_iommu.h |  8 +++++
> >  2 files changed, 67 insertions(+), 9 deletions(-)
> >
> >diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> >index 2596f11..104200b 100644
> >--- a/hw/i386/intel_iommu.c
> >+++ b/hw/i386/intel_iommu.c
> >@@ -839,7 +839,8 @@ next:
> >   * @private: private data for the hook function
> >   */
> >  static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
> >-                         vtd_page_walk_hook hook_fn, void *private)
> >+                         vtd_page_walk_hook hook_fn, void *private,
> >+                         bool notify_unmap)
> >  {
> >      dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
> >      uint32_t level = vtd_get_level_from_context_entry(ce);
> >@@ -858,7 +859,7 @@ static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
> >      trace_vtd_page_walk(ce->hi, ce->lo, start, end);
> >      return vtd_page_walk_level(addr, start, end, hook_fn, private,
> >-                               level, true, true, NULL, false);
> >+                               level, true, true, NULL, notify_unmap);
> >  }
> >  /* Map a device to its corresponding domain (context-entry) */
> >@@ -1212,6 +1213,34 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
> >                                  &domain_id);
> >  }
> >+static int vtd_page_invalidate_notify_hook(IOMMUTLBEntry *entry,
> >+                                           void *private)
> >+{
> >+    memory_region_notify_iommu((MemoryRegion *)private, *entry);
> >+    return 0;
> >+}
> >+
> >+static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
> >+                                           uint16_t domain_id, hwaddr addr,
> >+                                           uint8_t am)
> >+{
> >+    IntelIOMMUNotifierNode *node;
> >+    VTDContextEntry ce;
> >+    int ret;
> >+
> >+    QLIST_FOREACH(node, &(s->notifiers_list), next) {
> >+        VTDAddressSpace *vtd_as = node->vtd_as;
> >+        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> >+                                       vtd_as->devfn, &ce);
> >+        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> >+            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
> >+                          vtd_page_invalidate_notify_hook,
> >+                          (void *)&vtd_as->iommu, true);
> >+        }
> >+    }
> >+}
> >+
> >+
> >  static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
> >                                        hwaddr addr, uint8_t am)
> >  {
> >@@ -1222,6 +1251,7 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
> >      info.addr = addr;
> >      info.mask = ~((1 << am) - 1);
> >      g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
> >+    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
> 
> Is the case of a GLOBAL or DSI flush missed, or do we not care about it
> at all?

IMHO we don't. For device assignment, since we have CM=1 here,
we should get explicit page invalidations even if the guest sends
global/domain invalidations.

Thanks,

-- peterx


* Re: [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio devices
  2017-01-16  6:30   ` Jason Wang
  2017-01-16  9:18     ` Peter Xu
@ 2017-01-16  9:20     ` Peter Xu
  1 sibling, 0 replies; 93+ messages in thread
From: Peter Xu @ 2017-01-16  9:20 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv

On Mon, Jan 16, 2017 at 02:30:20PM +0800, Jason Wang wrote:

[...]

> >  }
> >  /* Flush IOTLB
> >@@ -2244,15 +2274,34 @@ static void vtd_iommu_notify_flag_changed(MemoryRegion *iommu,
> >                                            IOMMUNotifierFlag new)
> >  {
> >      VTDAddressSpace *vtd_as = container_of(iommu, VTDAddressSpace, iommu);
> >+    IntelIOMMUState *s = vtd_as->iommu_state;
> >+    IntelIOMMUNotifierNode *node = NULL;
> >+    IntelIOMMUNotifierNode *next_node = NULL;
> >-    if (new & IOMMU_NOTIFIER_MAP) {
> >-        error_report("Device at bus %s addr %02x.%d requires iommu "
> >-                     "notifier which is currently not supported by "
> >-                     "intel-iommu emulation",
> >-                     vtd_as->bus->qbus.name, PCI_SLOT(vtd_as->devfn),
> >-                     PCI_FUNC(vtd_as->devfn));
> >+    if (!s->cache_mode_enabled && new & IOMMU_NOTIFIER_MAP) {
> >+        error_report("We need to set cache_mode=1 for intel-iommu to enable "
> >+                     "device assignment with IOMMU protection.");
> >          exit(1);
> >      }
> >+
> >+    /* Add new ndoe if no mapping was exising before this call */
> 
> "node"?

Sorry I missed this one - let me just remove the above comment, since
it only describes what the code below does.

Thanks,

> 
> >+    if (old == IOMMU_NOTIFIER_NONE) {
> >+        node = g_malloc0(sizeof(*node));
> >+        node->vtd_as = vtd_as;
> >+        QLIST_INSERT_HEAD(&s->notifiers_list, node, next);
> >+        return;
> >+    }

-- peterx


* Re: [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio devices
  2017-01-16  9:18     ` Peter Xu
@ 2017-01-16  9:54       ` Jason Wang
  2017-01-17 14:45         ` Peter Xu
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Wang @ 2017-01-16  9:54 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv



On 2017年01月16日 17:18, Peter Xu wrote:
>>>   static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
>>>                                         hwaddr addr, uint8_t am)
>>>   {
>>> @@ -1222,6 +1251,7 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
>>>       info.addr = addr;
>>>       info.mask = ~((1 << am) - 1);
>>>       g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
>>> +    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
>> Is the case of a GLOBAL or DSI flush missed, or do we not care about
>> it at all?
> IMHO we don't. For device assignment, since we have CM=1 here,
> we should get explicit page invalidations even if the guest sends
> global/domain invalidations.
>
> Thanks,
>
> -- peterx

Is this required by the spec? Btw, it looks to me that both DSI and
GLOBAL are indeed explicit flushes.

I just had a quick look through the driver code and found something
interesting in intel_iommu_flush_iotlb_psi():

...
     /*
      * Fallback to domain selective flush if no PSI support or the size is
      * too big.
      * PSI requires page size to be 2 ^ x, and the base address is
      * naturally aligned to the size
      */
     if (!cap_pgsel_inv(iommu->cap) || mask > cap_max_amask_val(iommu->cap))
         iommu->flush.flush_iotlb(iommu, did, 0, 0,
                         DMA_TLB_DSI_FLUSH);
     else
         iommu->flush.flush_iotlb(iommu, did, addr | ih, mask,
                         DMA_TLB_PSI_FLUSH);
...

It looks like DSI_FLUSH is possible even with CM on.

And in flush_unmaps():

...
         /* In caching mode, global flushes turn emulation expensive */
         if (!cap_caching_mode(iommu->cap))
             iommu->flush.flush_iotlb(iommu, 0, 0, 0,
                      DMA_TLB_GLOBAL_FLUSH);
...

If I understand the comments correctly, GLOBAL is ok for CM too (though
the code did not do it for performance reasons).
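
(To make it concrete - a rough sketch only, reusing the names from the
patch above and leaving aside how the stale mappings would get
invalidated - handling a DSI with CM=1 would mean fanning it out to
every notifier-registered device in that domain, something like:)

    /* Hypothetical handler: on a domain-selective invalidation,
     * replay every notifier-registered device in that domain. */
    static void vtd_iotlb_domain_invalidate_notify(IntelIOMMUState *s,
                                                   uint16_t domain_id)
    {
        IntelIOMMUNotifierNode *node;
        VTDContextEntry ce;

        QLIST_FOREACH(node, &s->notifiers_list, next) {
            VTDAddressSpace *vtd_as = node->vtd_as;

            if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
                                          vtd_as->devfn, &ce) &&
                domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
                memory_region_iommu_replay_all(&vtd_as->iommu);
            }
        }
    }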

Thanks


* Re: [Qemu-devel] [PATCH RFC v3 13/14] intel_iommu: allow dynamic switch of IOMMU region
  2017-01-16  8:32             ` Peter Xu
@ 2017-01-16 16:25               ` Michael S. Tsirkin
  2017-01-17 14:53                 ` Peter Xu
  0 siblings, 1 reply; 93+ messages in thread
From: Michael S. Tsirkin @ 2017-01-16 16:25 UTC (permalink / raw)
  To: Peter Xu
  Cc: Jason Wang, tianyu.lan, kevin.tian, jan.kiszka, bd.aviv,
	qemu-devel, alex.williamson

On Mon, Jan 16, 2017 at 04:32:24PM +0800, Peter Xu wrote:
> On Mon, Jan 16, 2017 at 04:25:35PM +0800, Jason Wang wrote:
> > 
> > 
> > On 2017年01月16日 16:12, Peter Xu wrote:
> > >On Mon, Jan 16, 2017 at 04:01:00PM +0800, Jason Wang wrote:
> > >>
> > >>On 2017年01月16日 15:50, Peter Xu wrote:
> > >>>On Mon, Jan 16, 2017 at 02:20:31PM +0800, Jason Wang wrote:
> > >>>
> > >>>[...]
> > >>>
> > >>>>>diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > >>>>>index fd75112..2596f11 100644
> > >>>>>--- a/hw/i386/intel_iommu.c
> > >>>>>+++ b/hw/i386/intel_iommu.c
> > >>>>>@@ -1343,9 +1343,49 @@ static void vtd_handle_gcmd_sirtp(IntelIOMMUState *s)
> > >>>>>      vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_IRTPS);
> > >>>>>  }
> > >>>>>+static void vtd_switch_address_space(VTDAddressSpace *as, bool iommu_enabled)
> > >>>>Looks like you can check s->dmar_enabled here?
> > >>>Yes, we need to check old state in case we don't need a switch at all.
> > >>>Actually I checked it...
> > >>>
> > >>I mean, is there a chance that iommu_enabled (a better name would be
> > >>dmar_enabled) is not equal to s->dmar_enabled? Looks not.
> > >>
> > >>vtd_handle_gcmd_te() did:
> > >>
> > >>     ...
> > >>     if (en) {
> > >>         s->dmar_enabled = true;
> > >>         /* Ok - report back to driver */
> > >>         vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_TES);
> > >>     } else {
> > >>         s->dmar_enabled = false;
> > >>     ...
> > >>
> > >>You can call vtd_switch_address_space_all(s, en) after this, which will
> > >>call this function. And the other caller, as you've pointed out, already
> > >>calls this through s->dmar_enabled. So en here is always s->dmar_enabled?
> > >Hmm, yes...
> > >
> > >(I would still prefer keeping this parameter for readability.
> > >  Though, I do prefer your suggestion to rename it to dmar_enabled.)
> > >
> > >-- peterx
> > 
> > I think this does not give more readability :) Maybe I was wrong; let's
> > leave this for the maintainer.
> > 
> > Thanks :)
> 
> Thanks for reviewing this series so fast!
> 
> I have no strong opinion either. Maybe you are right. :-)
> 
> Michael, please let me know if you dislike this, so I can remove this
> parameter (it equals to as->iommu_state->dmar_enabled).
> 
> Thanks,
> 
> -- peterx

I prefer not to duplicate data, yes.


* Re: [Qemu-devel] [PATCH RFC v3 13/14] intel_iommu: allow dynamic switch of IOMMU region
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 13/14] intel_iommu: allow dynamic switch of IOMMU region Peter Xu
  2017-01-16  6:20   ` Jason Wang
@ 2017-01-16 19:53   ` Alex Williamson
  2017-01-17 14:00     ` Peter Xu
  1 sibling, 1 reply; 93+ messages in thread
From: Alex Williamson @ 2017-01-16 19:53 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, bd.aviv

On Fri, 13 Jan 2017 11:06:39 +0800
Peter Xu <peterx@redhat.com> wrote:

> This is preparation work to finally enable dynamic ON/OFF switching of
> VT-d protection. The old VT-d code uses a static IOMMU address space,
> and that won't satisfy the vfio-pci device listeners.
> 
> Let me explain.
> 
> vfio-pci devices depend on the memory region listener and IOMMU replay
> mechanism to make sure the device mapping is coherent with the guest
> even if there are domain switches. And there are two kinds of domain
> switches:
> 
>   (1) switch from domain A -> B
>   (2) switch from domain A -> no domain (e.g., turn DMAR off)
> 
> Case (1) is handled by the context entry invalidation handling in the
> VT-d replay logic. What the replay function should do here is to replay
> the existing page mappings in domain B.

There's really 2 steps here, right?  Invalidate A, replay B.  I think
the code handles this, but I want to make sure.  We don't want to end
up with a superset of both A & B.

On the invalidation, a future optimization when disabling an entire
memory region might also be to invalidate the entire range at once
rather than each individual mapping within the range, which I think is
what happens now, right?

> However for case (2), we don't want to replay any domain mappings - we
> just need the default GPA->HPA mappings (the address_space_memory
> mapping). And this patch helps case (2) by building up that mapping
> automatically, leveraging the vfio-pci memory listeners.

Have you thought about using this address space switching to emulate
ecap.PT?  i.e. advertise hardware-based passthrough so that the guest
doesn't need to waste pagetable entries for a direct-mapped, static
identity domain.

Otherwise the series looks pretty good to me.  Thanks,

Alex

> Another important thing that this patch does is to separate
> IR (Interrupt Remapping) from DMAR (DMA Remapping). The IR region should
> not depend on the DMAR region (as it did before this patch). It should be
> a standalone region, and it should be able to be activated without
> DMAR (which is common behavior of the Linux kernel - by default it
> enables IR while leaving DMAR disabled).
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> v3:
> - fix another trivial style issue patchew reported but I missed in v2
> 
> v2:
> - fix issues reported by patchew
> - switch domain by enable/disable memory regions [David]
> - provide vtd_switch_address_space{_all}()
> - provide a better comment on the memory regions
> 
> test done: with intel_iommu device, boot vm with/without
> "intel_iommu=on" parameter.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/i386/intel_iommu.c         | 78 ++++++++++++++++++++++++++++++++++++++++---
>  hw/i386/trace-events          |  2 +-
>  include/hw/i386/intel_iommu.h |  2 ++
>  3 files changed, 77 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index fd75112..2596f11 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1343,9 +1343,49 @@ static void vtd_handle_gcmd_sirtp(IntelIOMMUState *s)
>      vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_IRTPS);
>  }
>  
> +static void vtd_switch_address_space(VTDAddressSpace *as, bool iommu_enabled)
> +{
> +    assert(as);
> +
> +    trace_vtd_switch_address_space(pci_bus_num(as->bus),
> +                                   VTD_PCI_SLOT(as->devfn),
> +                                   VTD_PCI_FUNC(as->devfn),
> +                                   iommu_enabled);
> +
> +    /* Turn off first then on the other */
> +    if (iommu_enabled) {
> +        memory_region_set_enabled(&as->sys_alias, false);
> +        memory_region_set_enabled(&as->iommu, true);
> +    } else {
> +        memory_region_set_enabled(&as->iommu, false);
> +        memory_region_set_enabled(&as->sys_alias, true);
> +    }
> +}
> +
> +static void vtd_switch_address_space_all(IntelIOMMUState *s, bool enabled)
> +{
> +    GHashTableIter iter;
> +    VTDBus *vtd_bus;
> +    int i;
> +
> +    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
> +    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
> +        for (i = 0; i < X86_IOMMU_PCI_DEVFN_MAX; i++) {
> +            if (!vtd_bus->dev_as[i]) {
> +                continue;
> +            }
> +            vtd_switch_address_space(vtd_bus->dev_as[i], enabled);
> +        }
> +    }
> +}
> +
>  /* Handle Translation Enable/Disable */
>  static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
>  {
> +    if (s->dmar_enabled == en) {
> +        return;
> +    }
> +
>      VTD_DPRINTF(CSR, "Translation Enable %s", (en ? "on" : "off"));
>  
>      if (en) {
> @@ -1360,6 +1400,8 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
>          /* Ok - report back to driver */
>          vtd_set_clear_mask_long(s, DMAR_GSTS_REG, VTD_GSTS_TES, 0);
>      }
> +
> +    vtd_switch_address_space_all(s, en);
>  }
>  
>  /* Handle Interrupt Remap Enable/Disable */
> @@ -2586,15 +2628,43 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>          vtd_dev_as->devfn = (uint8_t)devfn;
>          vtd_dev_as->iommu_state = s;
>          vtd_dev_as->context_cache_entry.context_cache_gen = 0;
> +
> +        /*
> +         * Memory region relationships looks like (Address range shows
> +         * only lower 32 bits to make it short in length...):
> +         *
> +         * |-----------------+-------------------+----------|
> +         * | Name            | Address range     | Priority |
> +         * |-----------------+-------------------+----------+
> +         * | vtd_root        | 00000000-ffffffff |        0 |
> +         * |  intel_iommu    | 00000000-ffffffff |        1 |
> +         * |  vtd_sys_alias  | 00000000-ffffffff |        1 |
> +         * |  intel_iommu_ir | fee00000-feefffff |       64 |
> +         * |-----------------+-------------------+----------|
> +         *
> +         * We enable/disable DMAR by switching enablement for
> +         * vtd_sys_alias and intel_iommu regions. IR region is always
> +         * enabled.
> +         */
>          memory_region_init_iommu(&vtd_dev_as->iommu, OBJECT(s),
>                                   &s->iommu_ops, "intel_iommu", UINT64_MAX);
> +        memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
> +                                 "vtd_sys_alias", get_system_memory(),
> +                                 0, memory_region_size(get_system_memory()));
>          memory_region_init_io(&vtd_dev_as->iommu_ir, OBJECT(s),
>                                &vtd_mem_ir_ops, s, "intel_iommu_ir",
>                                VTD_INTERRUPT_ADDR_SIZE);
> -        memory_region_add_subregion(&vtd_dev_as->iommu, VTD_INTERRUPT_ADDR_FIRST,
> -                                    &vtd_dev_as->iommu_ir);
> -        address_space_init(&vtd_dev_as->as,
> -                           &vtd_dev_as->iommu, name);
> +        memory_region_init(&vtd_dev_as->root, OBJECT(s),
> +                           "vtd_root", UINT64_MAX);
> +        memory_region_add_subregion_overlap(&vtd_dev_as->root,
> +                                            VTD_INTERRUPT_ADDR_FIRST,
> +                                            &vtd_dev_as->iommu_ir, 64);
> +        address_space_init(&vtd_dev_as->as, &vtd_dev_as->root, name);
> +        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
> +                                            &vtd_dev_as->sys_alias, 1);
> +        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
> +                                            &vtd_dev_as->iommu, 1);
> +        vtd_switch_address_space(vtd_dev_as, s->dmar_enabled);
>      }
>      return vtd_dev_as;
>  }
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index 92d210d..beaef61 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -11,7 +11,6 @@ xen_pv_mmio_write(uint64_t addr) "WARNING: write to Xen PV Device MMIO space (ad
>  x86_iommu_iec_notify(bool global, uint32_t index, uint32_t mask) "Notify IEC invalidation: global=%d index=%" PRIu32 " mask=%" PRIu32
>  
>  # hw/i386/intel_iommu.c
> -vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
>  vtd_inv_desc(const char *type, uint64_t hi, uint64_t lo) "invalidate desc type %s high 0x%"PRIx64" low 0x%"PRIx64
>  vtd_inv_desc_cc_domain(uint16_t domain) "context invalidate domain 0x%"PRIx16
>  vtd_inv_desc_cc_global(void) "context invalidate globally"
> @@ -37,6 +36,7 @@ vtd_page_walk_one(uint32_t level, uint64_t iova, uint64_t gpa, uint64_t mask, in
>  vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
>  vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
>  vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
> +vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
>  
>  # hw/i386/amd_iommu.c
>  amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 749eef9..9c3f6c0 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -83,6 +83,8 @@ struct VTDAddressSpace {
>      uint8_t devfn;
>      AddressSpace as;
>      MemoryRegion iommu;
> +    MemoryRegion root;
> +    MemoryRegion sys_alias;
>      MemoryRegion iommu_ir;      /* Interrupt region: 0xfeeXXXXX */
>      IntelIOMMUState *iommu_state;
>      VTDContextCacheEntry context_cache_entry;


* Re: [Qemu-devel] [PATCH RFC v3 13/14] intel_iommu: allow dynamic switch of IOMMU region
  2017-01-16 19:53   ` Alex Williamson
@ 2017-01-17 14:00     ` Peter Xu
  2017-01-17 15:46       ` Alex Williamson
  0 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-17 14:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, bd.aviv

On Mon, Jan 16, 2017 at 12:53:57PM -0700, Alex Williamson wrote:
> On Fri, 13 Jan 2017 11:06:39 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > This is preparation work to finally enable dynamic ON/OFF switching of
> > VT-d protection. The old VT-d code uses a static IOMMU address space,
> > and that won't satisfy the vfio-pci device listeners.
> > 
> > Let me explain.
> > 
> > vfio-pci devices depend on the memory region listener and IOMMU replay
> > mechanism to make sure the device mapping is coherent with the guest
> > even if there are domain switches. And there are two kinds of domain
> > switches:
> > 
> >   (1) switch from domain A -> B
> >   (2) switch from domain A -> no domain (e.g., turn DMAR off)
> > 
> > Case (1) is handled by the context entry invalidation handling in the
> > VT-d replay logic. What the replay function should do here is to replay
> > the existing page mappings in domain B.
> 
> There's really 2 steps here, right?  Invalidate A, replay B.  I think
> the code handles this, but I want to make sure.  We don't want to end
> up with a superset of both A & B.

First of all, this discussion goes beyond this patch's scope, since
this patch only handles the case when the guest disables DMAR
entirely.

Then, my understanding of the above question: when we do an A -> B
domain switch, the guest will not send specific context entry
invalidations for A, but will for sure send one when the context entry
is ready for B. In that sense, IMO we don't have a clear "two steps",
only one, which is the latter "replay B". We do the correct unmaps
based on the PSIs (page-selective invalidations) of A when the guest
unmaps the pages in A.

So, for the use case of nested device assignment (which is the goal of
this series for now):

- L1 guest puts devices D1,D2,... of the L2 guest into domain A
- L1 guest maps the L2 memory into the L1 address space (L2GPA -> L1GPA)
- ... (L2 guest runs, until it stops running)
- L1 guest unmaps all the pages in domain A
- L1 guest moves devices D1,D2,... of the L2 guest outside domain A

This series should work for the above, since before any device leaves
its domain, the domain will be clean, with all of its pages already
unmapped.

However, if we have the following scenario (which I don't know whether
this is achievable):

- guest iommu domain A has device D1, D2
- guest iommu domain B has device D3
- move device D2 from domain A into B

Here, when D2 moves from A to B, IIUC our current Linux IOMMU driver
code will not send any PSIs (page-selective invalidations) for D2 or
domain A, because domain A still has a device in it; the guest should
only send a context entry invalidation for device D2, telling us that
D2 has switched to domain B. In that case, I am not sure whether the
current series can work properly, and IMHO we may need to have domain
knowledge in the VT-d emulation code (which we don't have yet) in the
future to further support this kind of domain switch.

> 
> On the invalidation, a future optimization when disabling an entire
> memory region might also be to invalidate the entire range at once
> rather than each individual mapping within the range, which I think is
> what happens now, right?

Right. IIUC this can be an enhancement to the current page walk logic -
we can coalesce contiguous IOTLB entries with the same properties and
notify only once for the whole coalesced range.

Noted in my todo list.
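
(Roughly what I have in mind - a hypothetical sketch with made-up
names, only for the UNMAP side where translated_addr does not matter;
a real version would also need to split the merged range back into
power-of-two chunks, since addr_mask is a mask:)

    /* Accumulate contiguous entries that carry the same permission
     * and send one notification for the whole merged range. */
    typedef struct VTDCoalesced {
        hwaddr start;           /* pending merged range: [start, end) */
        hwaddr end;
        IOMMUAccessFlags perm;
        bool valid;             /* is there a pending range? */
    } VTDCoalesced;

    static void vtd_coalesce_add(VTDCoalesced *c, IOMMUTLBEntry *entry,
                                 MemoryRegion *mr)
    {
        hwaddr size = entry->addr_mask + 1;

        /* Adjacent to the pending range with the same permission:
         * just extend it instead of notifying. */
        if (c->valid && entry->perm == c->perm && entry->iova == c->end) {
            c->end += size;
            return;
        }

        if (c->valid) {
            /* flush the pending range as a single notification */
            IOMMUTLBEntry merged = {
                .iova = c->start,
                .addr_mask = c->end - c->start - 1,
                .perm = c->perm,
            };
            memory_region_notify_iommu(mr, merged);
        }

        /* start a new pending range from the current entry */
        c->start = entry->iova;
        c->end = entry->iova + size;
        c->perm = entry->perm;
        c->valid = true;
    }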

> 
> > However for case (2), we don't want to replay any domain mappings - we
> > just need the default GPA->HPA mappings (the address_space_memory
> > mapping). And this patch helps case (2) by building up that mapping
> > automatically, leveraging the vfio-pci memory listeners.
> 
> Have you thought about using this address space switching to emulate
> ecap.PT?  i.e. advertise hardware-based passthrough so that the guest
> doesn't need to waste pagetable entries for a direct-mapped, static
> identity domain.

Kind of. Currently we still don't have iommu=pt support in the emulated
code. We could achieve that by leveraging this patch.
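
(For the record, a very rough sketch of that direction - nothing that
exists in this series; the ecap bit number and the TT encoding below
come from the VT-d spec, while VTD_ECAP_PT is an assumed macro name:)

    /* Sketch: advertise hardware pass-through support in the
     * extended capability register (ecap bit 6 per the VT-d spec) */
    s->ecap |= VTD_ECAP_PT;            /* assumed: (1ULL << 6) */

    /* ... and when a context entry requests pass-through (TT field,
     * bits 3:2, holds 10b), back the device with the system memory
     * alias instead of the IOMMU region, reusing the switching that
     * this patch introduces: */
    if (((ce.lo >> 2) & 3) == 2) {     /* TT == 10b: pass-through */
        vtd_switch_address_space(vtd_as, false);
    }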

> 
> Otherwise the series looks pretty good to me.  Thanks,

Your review comment is really important to me. Thanks!

I'll see whether we can get to a consensus on the above issue, then
repost with the existing fixes.

Thanks,

-- peterx


* Re: [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio devices
  2017-01-16  9:54       ` Jason Wang
@ 2017-01-17 14:45         ` Peter Xu
  2017-01-18  3:10           ` Jason Wang
  0 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-17 14:45 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv

On Mon, Jan 16, 2017 at 05:54:55PM +0800, Jason Wang wrote:
> 
> 
> On 2017年01月16日 17:18, Peter Xu wrote:
> >>>  static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
> >>>                                        hwaddr addr, uint8_t am)
> >>>  {
> >>>@@ -1222,6 +1251,7 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
> >>>      info.addr = addr;
> >>>      info.mask = ~((1 << am) - 1);
> >>>      g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
> >>>+    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
> >>Is the case of a GLOBAL or DSI flush missed, or do we not care about
> >>it at all?
> >IMHO we don't. For device assignment, since we have CM=1 here,
> >we should get explicit page invalidations even if the guest sends
> >global/domain invalidations.
> >
> >Thanks,
> >
> >-- peterx
> 
> Is this required by the spec?

I think not. IMO the spec is very coarse-grained in describing cache
mode...

> Btw, it looks to me that both DSI and GLOBAL are
> indeed explicit flushes.

Actually, when cache mode is on, it is unclear to me how we should
treat domain/global invalidations, at least from the spec (as
mentioned earlier). My understanding is that they are not "explicit
flushes", which IMHO should mean only page-selective IOTLB
invalidations.

> 
> I just had a quick look through the driver code and found something
> interesting in intel_iommu_flush_iotlb_psi():
> 
> ...
>     /*
>      * Fallback to domain selective flush if no PSI support or the size is
>      * too big.
>      * PSI requires page size to be 2 ^ x, and the base address is naturally
>      * aligned to the size
>      */
>     if (!cap_pgsel_inv(iommu->cap) || mask > cap_max_amask_val(iommu->cap))
>         iommu->flush.flush_iotlb(iommu, did, 0, 0,
>                         DMA_TLB_DSI_FLUSH);
>     else
>         iommu->flush.flush_iotlb(iommu, did, addr | ih, mask,
>                         DMA_TLB_PSI_FLUSH);
> ...

I think this is interesting... and I doubt its correctness with
cache mode enabled.

If so (sending a domain invalidation instead of a big range of page
invalidations), how should we capture which pages are unmapped in the
emulated IOMMU?

> 
> It looks like DSI_FLUSH is possible even for CM on.
> 
> And in flush_unmaps():
> 
> ...
>         /* In caching mode, global flushes turn emulation expensive */
>         if (!cap_caching_mode(iommu->cap))
>             iommu->flush.flush_iotlb(iommu, 0, 0, 0,
>                      DMA_TLB_GLOBAL_FLUSH);
> ...
> 
> If I understand the comments correctly, GLOBAL is ok for CM too (though the
> code did not do it for performance reason).

I think it should be okay to send a global flush with CM, but I am not
sure whether we should notify anything when we receive it. Hmm, anyway,
I think I need some more reading to make sure I understand the whole
thing correctly. :)

For example, when I see this commit:

commit 78d5f0f500e6ba8f6cfd0673475ff4d941d705a2
Author: Nadav Amit <nadav.amit@gmail.com>
Date:   Thu Apr 8 23:00:41 2010 +0300

    intel-iommu: Avoid global flushes with caching mode.
    
    While it may be efficient on real hardware, emulation of global
    invalidations is very expensive as all shadow entries must be examined.
    This patch changes the behaviour when caching mode is enabled (which is
    the case when IOMMU emulation takes place). In this case, page specific
    invalidation is used instead.

Before I ask someone else besides qemu-devel, I am curious whether
there is existing VT-d emulation code (outside QEMU, of course) that I
can use as a reference? A quick search didn't answer me.

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 13/14] intel_iommu: allow dynamic switch of IOMMU region
  2017-01-16 16:25               ` Michael S. Tsirkin
@ 2017-01-17 14:53                 ` Peter Xu
  0 siblings, 0 replies; 93+ messages in thread
From: Peter Xu @ 2017-01-17 14:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, tianyu.lan, kevin.tian, jan.kiszka, bd.aviv,
	qemu-devel, alex.williamson

On Mon, Jan 16, 2017 at 06:25:32PM +0200, Michael S. Tsirkin wrote:

[...]

> > > I think this does not give more readability :) May I was wrong, let leave
> > > this for maintainer.
> > > 
> > > Thanks :)
> > 
> > Thanks for reviewing this series so fast!
> > 
> > I have no strong opinion as well. Maybe you are right. :-)
> > 
> > Michael, please let me know if you dislike this, so I can remove this
> > parameter (it equals to as->iommu_state->dmar_enabled).
> > 
> > Thanks,
> > 
> > -- peterx
> 
> I prefer not to duplicate data, yes.

Let me remove it then. Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 00/14] VT-d: vfio enablement and misc enhances
  2017-01-14  2:59   ` Peter Xu
@ 2017-01-17 15:07     ` Michael S. Tsirkin
  2017-01-18  7:34       ` Peter Xu
  0 siblings, 1 reply; 93+ messages in thread
From: Michael S. Tsirkin @ 2017-01-17 15:07 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

On Sat, Jan 14, 2017 at 10:59:58AM +0800, Peter Xu wrote:
> On Fri, Jan 13, 2017 at 05:58:02PM +0200, Michael S. Tsirkin wrote:
> > On Fri, Jan 13, 2017 at 11:06:26AM +0800, Peter Xu wrote:
> > > v3:
> > > - fix style error reported by patchew
> > > - fix comment in domain switch patch: use "IOMMU address space" rather
> > >   than "IOMMU region" [Kevin]
> > > - add ack-by for Paolo in patch:
> > >   "memory: add section range info for IOMMU notifier"
> > >   (this is seperately collected besides this thread)
> > > - remove 3 patches which are merged already (from Jason)
> > > - rebase to master b6c0897
> > 
> > So 1-6 look like nice cleanups to me. Should I merge them now?
> 
> That'll be great if you'd like to merge them. Then I can further
> shorten this series for the next post.
> 
> Regarding to the error_report() issue that Jason has mentioned, I can
> touch them up in the future when needed - after all, most of the patch
> content are about converting DPRINT()s into traces.
> 
> Thanks!
> 
> -- peterx

I think I agree with Jason, it's best not to have guest behaviour
trigger error_report. So pls address and I'll merge.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 13/14] intel_iommu: allow dynamic switch of IOMMU region
  2017-01-17 14:00     ` Peter Xu
@ 2017-01-17 15:46       ` Alex Williamson
  2017-01-18  7:49         ` Peter Xu
  0 siblings, 1 reply; 93+ messages in thread
From: Alex Williamson @ 2017-01-17 15:46 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, bd.aviv

On Tue, 17 Jan 2017 22:00:00 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Mon, Jan 16, 2017 at 12:53:57PM -0700, Alex Williamson wrote:
> > On Fri, 13 Jan 2017 11:06:39 +0800
> > Peter Xu <peterx@redhat.com> wrote:
> >   
> > > This is preparation work to finally enabled dynamic switching ON/OFF for
> > > VT-d protection. The old VT-d codes is using static IOMMU address space,
> > > and that won't satisfy vfio-pci device listeners.
> > > 
> > > Let me explain.
> > > 
> > > vfio-pci devices depend on the memory region listener and IOMMU replay
> > > mechanism to make sure the device mapping is coherent with the guest
> > > even if there are domain switches. And there are two kinds of domain
> > > switches:
> > > 
> > >   (1) switch from domain A -> B
> > >   (2) switch from domain A -> no domain (e.g., turn DMAR off)
> > > 
> > > Case (1) is handled by the context entry invalidation handling by the
> > > VT-d replay logic. What the replay function should do here is to replay
> > > the existing page mappings in domain B.  
> > 
> > There's really 2 steps here, right?  Invalidate A, replay B.  I think
> > the code handles this, but I want to make sure.  We don't want to end
> > up with a superset of both A & B.  
> 
> First of all, this discussion should be beyond this patch's scope,
> since this patch is currently only handling the case when guest
> disables DMAR in general.
> 
> Then, my understanding for above question: when we do A -> B domain
> switch, guest will not send specific context entry invalidations for
> A, but will for sure send one when context entry is ready for B. In
> that sense, IMO we don't have a clear "two steps", only one, which is
> the latter "replay B". We do correct unmap based on the PSIs
> (page-selective invalidations) of A when guest unmaps the pages in A.
> 
> So, for the use case of nested device assignment (which is the goal of
> this series for now):
> 
> - L1 guest put device D1,D2,... of L2 guest into domain A
> - L1 guest map the L2 memory into L1 address space (L2GPA -> L1GPA)
> - ... (L2 guest runs, until it stops running)
> - L1 guest unmap all the pages in domain A
> - L1 guest move device D1,D2,... of L2 guest outside domain A
> 
> This series should work for above, since before any device leaves its
> domain, the domain will be clean and without unmapped pages.
> 
> However, if we have the following scenario (which I don't know whether
> this's achievable):
> 
> - guest iommu domain A has device D1, D2
> - guest iommu domain B has device D3
> - move device D2 from domain A into B
> 
> Here when D2 move from A to B, IIUC our current Linux IOMMU driver
> code will not send any PSI (page-selected invalidations) for D2 or
> domain A because domain A still has device in it, guest should only
> send a context entry invalidation for device D2, telling that D2 has
> switched to domain B. In that case, I am not sure whether current
> series can work properly, and IMHO we may need to have the domain
> knowledge in VT-d emulation code (while we don't have it yet) in the
> future to further support this kind of domain switches.

This is a serious issue that needs to be resolved.  The context entry
invalidation when D2 is switched from A->B must unmap anything from
domain A before the replay of domain B.  Your example is easily
achieved, for instance what if domain A is the SI (static identity)
domain for the L1 guest, domain B is the device assignment domain for
the L2 guest with current device D3.  The user hot adds device D2 into
the L2 guest moving it from the L1 SI domain to the device assignment
domain.  vfio will not override existing mappings on replay, it will
error, giving the L2 guest a device with access to the static identity
mappings of the L1 host.  This isn't acceptable.
 
> > On the invalidation, a future optimization when disabling an entire
> > memory region might also be to invalidate the entire range at once
> > rather than each individual mapping within the range, which I think is
> > what happens now, right?  
> 
> Right. IIUC this can be an enhancement to current page walk logic - we
> can coalesce continuous IOTLB with same property and notify only once
> for these coalesced entries.
> 
> Noted in my todo list.

A context entry invalidation as in the example above might make use of
this to skip any sort of page walk logic and simply invalidate the
entire address space.

> >   
> > > However for case (2), we don't want to replay any domain mappings - we
> > > just need the default GPA->HPA mappings (the address_space_memory
> > > mapping). And this patch helps on case (2) to build up the mapping
> > > automatically by leveraging the vfio-pci memory listeners.  
> > 
> > Have you thought about using this address space switching to emulate
> > ecap.PT?  ie. advertise hardware based passthrough so that the guest
> > doesn't need to waste pagetable entries for a direct mapped, static
> > identity domain.  
> 
> Kind of. Currently we still don't have iommu=pt for the emulated code.
> We can achieve that by leveraging this patch.

Well, we have iommu=pt, but the L1 guest will implement this as a fully
populated SI domain rather than as a bit in the context entry to do
hardware direct translation.  Given the mapping overhead through vfio,
the L1 guest will always want to use iommu=pt as dynamic mapping
performance is going to be horrid.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio devices
  2017-01-17 14:45         ` Peter Xu
@ 2017-01-18  3:10           ` Jason Wang
  2017-01-18  8:11             ` Peter Xu
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Wang @ 2017-01-18  3:10 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv



On 2017年01月17日 22:45, Peter Xu wrote:
> On Mon, Jan 16, 2017 at 05:54:55PM +0800, Jason Wang wrote:
>>
>> On 2017年01月16日 17:18, Peter Xu wrote:
>>>>>   static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
>>>>>                                         hwaddr addr, uint8_t am)
>>>>>   {
>>>>> @@ -1222,6 +1251,7 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
>>>>>       info.addr = addr;
>>>>>       info.mask = ~((1 << am) - 1);
>>>>>       g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
>>>>> +    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
>>>> Is the case of GLOBAL or DSI flush missed, or we don't care it at all?
>>> IMHO we don't. For device assignment, since we are having CM=1 here,
>>> we should have explicit page invalidations even if guest sends
>>> global/domain invalidations.
>>>
>>> Thanks,
>>>
>>> -- peterx
>> Is this spec required?
> I think not. IMO the spec is very coarse grained on describing cache
> mode...
>
>> Btw, it looks to me that both DSI and GLOBAL are
>> indeed explicit flush.
> Actually when cache mode is on, it is unclear to me on how we should
> treat domain/global invalidations, at least from the spec (as
> mentioned earlier). My understanding is that they are not "explicit
> flushes", which IMHO should only mean page selective IOTLB
> invalidations.

Probably not, at least from the performance point of view. DSI and
global should be more efficient in some cases.

>
>> Just have a quick go through on driver codes and find this something
>> interesting in intel_iommu_flush_iotlb_psi():
>>
>> ...
>>      /*
>>       * Fallback to domain selective flush if no PSI support or the size is
>>       * too big.
>>       * PSI requires page size to be 2 ^ x, and the base address is naturally
>>       * aligned to the size
>>       */
>>      if (!cap_pgsel_inv(iommu->cap) || mask > cap_max_amask_val(iommu->cap))
>>          iommu->flush.flush_iotlb(iommu, did, 0, 0,
>>                          DMA_TLB_DSI_FLUSH);
>>      else
>>          iommu->flush.flush_iotlb(iommu, did, addr | ih, mask,
>>                          DMA_TLB_PSI_FLUSH);
>> ...
> I think this is interesting... and I doubt its correctness while with
> cache mode enabled.
>
> If so (sending domain invalidation instead of a big range of page
> invalidations), how should we capture which pages are unmapped in
> emulated IOMMU?

We don't need to track individual pages here, since all pages for a
specific domain were unmapped, I believe?

>
>> It looks like DSI_FLUSH is possible even for CM on.
>>
>> And in flush_unmaps():
>>
>> ...
>>          /* In caching mode, global flushes turn emulation expensive */
>>          if (!cap_caching_mode(iommu->cap))
>>              iommu->flush.flush_iotlb(iommu, 0, 0, 0,
>>                       DMA_TLB_GLOBAL_FLUSH);
>> ...
>>
>> If I understand the comments correctly, GLOBAL is ok for CM too (though the
>> code did not do it for performance reason).
> I think it should be okay to send global flush for CM, but not sure
> whether we should notify anything when we receive it. Hmm, anyway, I
> think I need some more reading to make sure I understand the whole
> thing correctly. :)
>
> For example, when I see this commit:
>
> commit 78d5f0f500e6ba8f6cfd0673475ff4d941d705a2
> Author: Nadav Amit <nadav.amit@gmail.com>
> Date:   Thu Apr 8 23:00:41 2010 +0300
>
>      intel-iommu: Avoid global flushes with caching mode.
>      
>      While it may be efficient on real hardware, emulation of global
>      invalidations is very expensive as all shadow entries must be examined.
>      This patch changes the behaviour when caching mode is enabled (which is
>      the case when IOMMU emulation takes place). In this case, page specific
>      invalidation is used instead.
>
> Before I ask someone else besides qemu-devel, I am curious about
> whether there is existing VT-d emulation code (outside QEMU, of
> course) so that I can have a reference?

Yes, there is. The author of this patch - Nadav - has done lots of
research on emulated IOMMUs. See the following papers:

https://hal.inria.fr/inria-00493752/document
http://www.cse.iitd.ac.in/~sbansal/csl862-virt/readings/vIOMMU.pdf

Thanks

> Quick search didn't answer me.
>
> Thanks,
>
> -- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 00/14] VT-d: vfio enablement and misc enhances
  2017-01-17 15:07     ` Michael S. Tsirkin
@ 2017-01-18  7:34       ` Peter Xu
  0 siblings, 0 replies; 93+ messages in thread
From: Peter Xu @ 2017-01-18  7:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, tianyu.lan, kevin.tian, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

On Tue, Jan 17, 2017 at 05:07:27PM +0200, Michael S. Tsirkin wrote:
> On Sat, Jan 14, 2017 at 10:59:58AM +0800, Peter Xu wrote:
> > On Fri, Jan 13, 2017 at 05:58:02PM +0200, Michael S. Tsirkin wrote:
> > > On Fri, Jan 13, 2017 at 11:06:26AM +0800, Peter Xu wrote:
> > > > v3:
> > > > - fix style error reported by patchew
> > > > - fix comment in domain switch patch: use "IOMMU address space" rather
> > > >   than "IOMMU region" [Kevin]
> > > > - add ack-by for Paolo in patch:
> > > >   "memory: add section range info for IOMMU notifier"
> > > >   (this is seperately collected besides this thread)
> > > > - remove 3 patches which are merged already (from Jason)
> > > > - rebase to master b6c0897
> > > 
> > > So 1-6 look like nice cleanups to me. Should I merge them now?
> > 
> > That'll be great if you'd like to merge them. Then I can further
> > shorten this series for the next post.
> > 
> > Regarding to the error_report() issue that Jason has mentioned, I can
> > touch them up in the future when needed - after all, most of the patch
> > content are about converting DPRINT()s into traces.
> > 
> > Thanks!
> > 
> > -- peterx
> 
> I think I agree with Jason, it's best not to have guest behaviour
> trigger error_report. So pls address and I'll merge.

Will fix. Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 13/14] intel_iommu: allow dynamic switch of IOMMU region
  2017-01-17 15:46       ` Alex Williamson
@ 2017-01-18  7:49         ` Peter Xu
  2017-01-19  8:20           ` Peter Xu
  0 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-18  7:49 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, bd.aviv

On Tue, Jan 17, 2017 at 08:46:04AM -0700, Alex Williamson wrote:
> On Tue, 17 Jan 2017 22:00:00 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Mon, Jan 16, 2017 at 12:53:57PM -0700, Alex Williamson wrote:
> > > On Fri, 13 Jan 2017 11:06:39 +0800
> > > Peter Xu <peterx@redhat.com> wrote:
> > >   
> > > > This is preparation work to finally enabled dynamic switching ON/OFF for
> > > > VT-d protection. The old VT-d codes is using static IOMMU address space,
> > > > and that won't satisfy vfio-pci device listeners.
> > > > 
> > > > Let me explain.
> > > > 
> > > > vfio-pci devices depend on the memory region listener and IOMMU replay
> > > > mechanism to make sure the device mapping is coherent with the guest
> > > > even if there are domain switches. And there are two kinds of domain
> > > > switches:
> > > > 
> > > >   (1) switch from domain A -> B
> > > >   (2) switch from domain A -> no domain (e.g., turn DMAR off)
> > > > 
> > > > Case (1) is handled by the context entry invalidation handling by the
> > > > VT-d replay logic. What the replay function should do here is to replay
> > > > the existing page mappings in domain B.  
> > > 
> > > There's really 2 steps here, right?  Invalidate A, replay B.  I think
> > > the code handles this, but I want to make sure.  We don't want to end
> > > up with a superset of both A & B.  
> > 
> > First of all, this discussion should be beyond this patch's scope,
> > since this patch is currently only handling the case when guest
> > disables DMAR in general.
> > 
> > Then, my understanding for above question: when we do A -> B domain
> > switch, guest will not send specific context entry invalidations for
> > A, but will for sure send one when context entry is ready for B. In
> > that sense, IMO we don't have a clear "two steps", only one, which is
> > the latter "replay B". We do correct unmap based on the PSIs
> > (page-selective invalidations) of A when guest unmaps the pages in A.
> > 
> > So, for the use case of nested device assignment (which is the goal of
> > this series for now):
> > 
> > - L1 guest put device D1,D2,... of L2 guest into domain A
> > - L1 guest map the L2 memory into L1 address space (L2GPA -> L1GPA)
> > - ... (L2 guest runs, until it stops running)
> > - L1 guest unmap all the pages in domain A
> > - L1 guest move device D1,D2,... of L2 guest outside domain A
> > 
> > This series should work for above, since before any device leaves its
> > domain, the domain will be clean and without unmapped pages.
> > 
> > However, if we have the following scenario (which I don't know whether
> > this's achievable):
> > 
> > - guest iommu domain A has device D1, D2
> > - guest iommu domain B has device D3
> > - move device D2 from domain A into B
> > 
> > Here when D2 move from A to B, IIUC our current Linux IOMMU driver
> > code will not send any PSI (page-selected invalidations) for D2 or
> > domain A because domain A still has device in it, guest should only
> > send a context entry invalidation for device D2, telling that D2 has
> > switched to domain B. In that case, I am not sure whether current
> > series can work properly, and IMHO we may need to have the domain
> > knowledge in VT-d emulation code (while we don't have it yet) in the
> > future to further support this kind of domain switches.
> 
> This is a serious issue that needs to be resolved.  The context entry
> invalidation when D2 is switched from A->B must unmap anything from
> domain A before the replay of domain B.  Your example is easily
> achieved, for instance what if domain A is the SI (static identity)
> domain for the L1 guest, domain B is the device assignment domain for
> the L2 guest with current device D3.  The user hot adds device D2 into
> the L2 guest moving it from the L1 SI domain to the device assignment
> domain.  vfio will not override existing mappings on replay, it will
> error, giving the L2 guest a device with access to the static identity
> mappings of the L1 host.  This isn't acceptable.
>  
> > > On the invalidation, a future optimization when disabling an entire
> > > memory region might also be to invalidate the entire range at once
> > > rather than each individual mapping within the range, which I think is
> > > what happens now, right?  
> > 
> > Right. IIUC this can be an enhancement to current page walk logic - we
> > can coalesce continuous IOTLB with same property and notify only once
> > for these coalesced entries.
> > 
> > Noted in my todo list.
> 
> A context entry invalidation as in the example above might make use of
> this to skip any sort of page walk logic, simply invalidate the entire
> address space.

Alex, I got one more thing to ask:

I was trying to invalidate the entire address space by sending a big
IOTLB notification to vfio-pci, which looks like:

  IOMMUTLBEntry entry = {
      .target_as = &address_space_memory,
      .iova = 0,
      .translated_addr = 0,
      .addr_mask = ((hwaddr)1 << 63) - 1, /* 64-bit shift, not a 32-bit 1 */
      .perm = IOMMU_NONE,     /* UNMAP */
  };

Then I feed this entry to the vfio-pci IOMMU notifier.

However, this was blocked in vfio_iommu_map_notify(), with error:

  qemu-system-x86_64: iommu has granularity incompatible with target AS

Since we have:

  /*
   * The IOMMU TLB entry we have just covers translation through
   * this IOMMU to its immediate target.  We need to translate
   * it the rest of the way through to memory.
   */
  rcu_read_lock();
  mr = address_space_translate(&address_space_memory,
                               iotlb->translated_addr,
                               &xlat, &len, iotlb->perm & IOMMU_WO);
  if (!memory_region_is_ram(mr)) {
      error_report("iommu map to non memory area %"HWADDR_PRIx"",
                   xlat);
      goto out;
  }
  /*
   * Translation truncates length to the IOMMU page size,
   * check that it did not truncate too much.
   */
  if (len & iotlb->addr_mask) {
      error_report("iommu has granularity incompatible with target AS");
      goto out;
  }

In my case len == 0xa0000 (that's the translation result), and
iotlb->addr_mask == ((hwaddr)1 << 63) - 1. So it looks like the
translation above split the big region, and a simple big UNMAP doesn't
work for me.

Do you have any suggestions on how I can solve this? In what cases do
we need the above address_space_translate()?
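
One direction I am considering is to split the big UNMAP into aligned
power-of-two chunks, so that each notification carries a valid
addr_mask. A rough, untested sketch (ctz64() and pow2floor() are from
qemu/host-utils.h; the notifier invocation is assumed to match this
series):

static void vtd_unmap_chunked(IOMMUNotifier *n, hwaddr start, hwaddr end)
{
    while (start < end) {
        /* Largest power of two that keeps "start" aligned... */
        uint64_t align = start ? (1ULL << ctz64(start)) : (1ULL << 63);
        /* ...and that does not overshoot the remaining length. */
        uint64_t size = MIN(align, pow2floor(end - start));
        IOMMUTLBEntry entry = {
            .target_as = &address_space_memory,
            .iova = start,
            .translated_addr = 0,
            .addr_mask = size - 1,
            .perm = IOMMU_NONE,     /* UNMAP */
        };
        n->notify(n, &entry);
        start += size;
    }
}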

> 
> > >   
> > > > However for case (2), we don't want to replay any domain mappings - we
> > > > just need the default GPA->HPA mappings (the address_space_memory
> > > > mapping). And this patch helps on case (2) to build up the mapping
> > > > automatically by leveraging the vfio-pci memory listeners.  
> > > 
> > > Have you thought about using this address space switching to emulate
> > > ecap.PT?  ie. advertise hardware based passthrough so that the guest
> > > doesn't need to waste pagetable entries for a direct mapped, static
> > > identity domain.  
> > 
> > Kind of. Currently we still don't have iommu=pt for the emulated code.
> > We can achieve that by leveraging this patch.
> 
> Well, we have iommu=pt, but the L1 guest will implement this as a fully
> populated SI domain rather than as a bit in the context entry to do
> hardware direct translation.  Given the mapping overhead through vfio,
> the L1 guest will always want to use iommu=pt as dynamic mapping
> performance is going to be horrid.  Thanks,

I see, so we have iommu=pt in the guest even though the VT-d emulation
does not provide that bit. Anyway, supporting ecap.PT is on my todo
list.

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio devices
  2017-01-18  3:10           ` Jason Wang
@ 2017-01-18  8:11             ` Peter Xu
  2017-01-18  8:36               ` Jason Wang
  0 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-18  8:11 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv

On Wed, Jan 18, 2017 at 11:10:53AM +0800, Jason Wang wrote:
> 
> 
> On 2017年01月17日 22:45, Peter Xu wrote:
> >On Mon, Jan 16, 2017 at 05:54:55PM +0800, Jason Wang wrote:
> >>
> >>On 2017年01月16日 17:18, Peter Xu wrote:
> >>>>>  static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
> >>>>>                                        hwaddr addr, uint8_t am)
> >>>>>  {
> >>>>>@@ -1222,6 +1251,7 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
> >>>>>      info.addr = addr;
> >>>>>      info.mask = ~((1 << am) - 1);
> >>>>>      g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
> >>>>>+    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
> >>>>Is the case of GLOBAL or DSI flush missed, or we don't care it at all?
> >>>IMHO we don't. For device assignment, since we are having CM=1 here,
> >>>we should have explicit page invalidations even if guest sends
> >>>global/domain invalidations.
> >>>
> >>>Thanks,
> >>>
> >>>-- peterx
> >>Is this spec required?
> >I think not. IMO the spec is very coarse grained on describing cache
> >mode...
> >
> >>Btw, it looks to me that both DSI and GLOBAL are
> >>indeed explicit flush.
> >Actually when cache mode is on, it is unclear to me on how we should
> >treat domain/global invalidations, at least from the spec (as
> >mentioned earlier). My understanding is that they are not "explicit
> >flushes", which IMHO should only mean page selective IOTLB
> >invalidations.
> 
> Probably not, at least from the view of performance. DSI and global should
> be more efficient in some cases.

I agree with you that DSI/GLOBAL flushes are more efficient in some
ways. But IMHO that does not mean these invalidations are "explicit
invalidations", and I doubt whether cache mode has to cooperate with
them.

But here I should add one more thing besides PSI - context entry
invalidation should be one of the "explicit invalidations" as well,
which we need to handle just like PSI when cache mode is on.

> 
> >
> >>Just have a quick go through on driver codes and find this something
> >>interesting in intel_iommu_flush_iotlb_psi():
> >>
> >>...
> >>     /*
> >>      * Fallback to domain selective flush if no PSI support or the size is
> >>      * too big.
> >>      * PSI requires page size to be 2 ^ x, and the base address is naturally
> >>      * aligned to the size
> >>      */
> >>     if (!cap_pgsel_inv(iommu->cap) || mask > cap_max_amask_val(iommu->cap))
> >>         iommu->flush.flush_iotlb(iommu, did, 0, 0,
> >>                         DMA_TLB_DSI_FLUSH);
> >>     else
> >>         iommu->flush.flush_iotlb(iommu, did, addr | ih, mask,
> >>                         DMA_TLB_PSI_FLUSH);
> >>...
> >I think this is interesting... and I doubt its correctness while with
> >cache mode enabled.
> >
> >If so (sending domain invalidation instead of a big range of page
> >invalidations), how should we capture which pages are unmapped in
> >emulated IOMMU?
> 
> We don't need to track individual pages here, since all pages for a specific
> domain were unmapped I believe?

IMHO this might not be the correct behavior.

If we receive one domain-selective invalidation, I agree that we should
invalidate the IOTLB cache for all the devices inside the domain.
However, when cache mode is on, we should depend on the PSIs to
unmap each page (unless we want to unmap the whole address space, in
which case it's very possible that the guest is just unmapping a range,
not the entire space). If we convert several PSIs into one big DSI,
IMHO we will leave pages mapped/unmapped when we should have
unmapped/mapped them.

> 
> >
> >>It looks like DSI_FLUSH is possible even for CM on.
> >>
> >>And in flush_unmaps():
> >>
> >>...
> >>         /* In caching mode, global flushes turn emulation expensive */
> >>         if (!cap_caching_mode(iommu->cap))
> >>             iommu->flush.flush_iotlb(iommu, 0, 0, 0,
> >>                      DMA_TLB_GLOBAL_FLUSH);
> >>...
> >>
> >>If I understand the comments correctly, GLOBAL is ok for CM too (though the
> >>code did not do it for performance reason).
> >I think it should be okay to send global flush for CM, but not sure
> >whether we should notify anything when we receive it. Hmm, anyway, I
> >think I need some more reading to make sure I understand the whole
> >thing correctly. :)
> >
> >For example, when I see this commit:
> >
> >commit 78d5f0f500e6ba8f6cfd0673475ff4d941d705a2
> >Author: Nadav Amit <nadav.amit@gmail.com>
> >Date:   Thu Apr 8 23:00:41 2010 +0300
> >
> >     intel-iommu: Avoid global flushes with caching mode.
> >     While it may be efficient on real hardware, emulation of global
> >     invalidations is very expensive as all shadow entries must be examined.
> >     This patch changes the behaviour when caching mode is enabled (which is
> >     the case when IOMMU emulation takes place). In this case, page specific
> >     invalidation is used instead.
> >
> >Before I ask someone else besides qemu-devel, I am curious about
> >whether there is existing VT-d emulation code (outside QEMU, of
> >course) so that I can have a reference?
> 
> Yes, it has. The author of this patch - Nadav has done lots of research on
> emulated IOMMU. See following papers:
> 
> https://hal.inria.fr/inria-00493752/document
> http://www.cse.iitd.ac.in/~sbansal/csl862-virt/readings/vIOMMU.pdf

Thanks for these good materials. I will google the author for sure
next time. :)

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio devices
  2017-01-18  8:11             ` Peter Xu
@ 2017-01-18  8:36               ` Jason Wang
  2017-01-18  8:46                 ` Peter Xu
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Wang @ 2017-01-18  8:36 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv



On 2017年01月18日 16:11, Peter Xu wrote:
> On Wed, Jan 18, 2017 at 11:10:53AM +0800, Jason Wang wrote:
>>
>> On 2017年01月17日 22:45, Peter Xu wrote:
>>> On Mon, Jan 16, 2017 at 05:54:55PM +0800, Jason Wang wrote:
>>>> On 2017年01月16日 17:18, Peter Xu wrote:
>>>>>>>   static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
>>>>>>>                                         hwaddr addr, uint8_t am)
>>>>>>>   {
>>>>>>> @@ -1222,6 +1251,7 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
>>>>>>>       info.addr = addr;
>>>>>>>       info.mask = ~((1 << am) - 1);
>>>>>>>       g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
>>>>>>> +    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
>>>>>> Is the case of GLOBAL or DSI flush missed, or we don't care it at all?
>>>>> IMHO we don't. For device assignment, since we are having CM=1 here,
>>>>> we should have explicit page invalidations even if guest sends
>>>>> global/domain invalidations.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -- peterx
>>>> Is this spec required?
>>> I think not. IMO the spec is very coarse grained on describing cache
>>> mode...
>>>
>>>> Btw, it looks to me that both DSI and GLOBAL are
>>>> indeed explicit flush.
>>> Actually when cache mode is on, it is unclear to me on how we should
>>> treat domain/global invalidations, at least from the spec (as
>>> mentioned earlier). My understanding is that they are not "explicit
>>> flushes", which IMHO should only mean page selective IOTLB
>>> invalidations.
>> Probably not, at least from the view of performance. DSI and global should
>> be more efficient in some cases.
> I agree with you that DSI/GLOBAL flushes are more efficient in some
> way. But IMHO that does not mean these invalidations are "explicit
> invalidations", and I suspect whether cache mode has to coop with it.

Well, the spec does not forbid DSI/GLOBAL with CM, and the driver code
has used them for almost ten years. I can hardly believe it's wrong.

>
> But here I should add one more thing besides PSI - context entry
> invalidation should be one of "the explicit invalidations" as well,
> which we need to handle just like PSI when cache mode is on.
>
>>>> Just have a quick go through on driver codes and find this something
>>>> interesting in intel_iommu_flush_iotlb_psi():
>>>>
>>>> ...
>>>>      /*
>>>>       * Fallback to domain selective flush if no PSI support or the size is
>>>>       * too big.
>>>>       * PSI requires page size to be 2 ^ x, and the base address is naturally
>>>>       * aligned to the size
>>>>       */
>>>>      if (!cap_pgsel_inv(iommu->cap) || mask > cap_max_amask_val(iommu->cap))
>>>>          iommu->flush.flush_iotlb(iommu, did, 0, 0,
>>>>                          DMA_TLB_DSI_FLUSH);
>>>>      else
>>>>          iommu->flush.flush_iotlb(iommu, did, addr | ih, mask,
>>>>                          DMA_TLB_PSI_FLUSH);
>>>> ...
>>> I think this is interesting... and I doubt its correctness while with
>>> cache mode enabled.
>>>
>>> If so (sending domain invalidation instead of a big range of page
>>> invalidations), how should we capture which pages are unmapped in
>>> emulated IOMMU?
>> We don't need to track individual pages here, since all pages for a specific
>> domain were unmapped I believe?
> IMHO this might not be the correct behavior.
>
> If we receive one domain specific invalidation, I agree that we should
> invalidate the IOTLB cache for all the devices inside the domain.
> However, when cache mode is on, we should be depending on the PSIs to
> unmap each page (unless we want to unmap the whole address space, in
> this case it's very possible that the guest is just unmapping a range,
> not the entire space). If we convert several PSIs into one big DSI,
> IMHO we will leave those pages mapped/unmapped while we should
> unmap/map them.

I'm confused - do you have an example of this? (I fail to understand
why DSI can't work; at least an implementation can convert a DSI into
several PSIs internally.)

Thanks

>
>>>> It looks like DSI_FLUSH is possible even for CM on.
>>>>
>>>> And in flush_unmaps():
>>>>
>>>> ...
>>>>          /* In caching mode, global flushes turn emulation expensive */
>>>>          if (!cap_caching_mode(iommu->cap))
>>>>              iommu->flush.flush_iotlb(iommu, 0, 0, 0,
>>>>                       DMA_TLB_GLOBAL_FLUSH);
>>>> ...
>>>>
>>>> If I understand the comments correctly, GLOBAL is ok for CM too (though the
>>>> code did not do it for performance reason).
>>> I think it should be okay to send global flush for CM, but not sure
>>> whether we should notify anything when we receive it. Hmm, anyway, I
>>> think I need some more reading to make sure I understand the whole
>>> thing correctly. :)
>>>
>>> For example, when I see this commit:
>>>
>>> commit 78d5f0f500e6ba8f6cfd0673475ff4d941d705a2
>>> Author: Nadav Amit <nadav.amit@gmail.com>
>>> Date:   Thu Apr 8 23:00:41 2010 +0300
>>>
>>>      intel-iommu: Avoid global flushes with caching mode.
>>>      While it may be efficient on real hardware, emulation of global
>>>      invalidations is very expensive as all shadow entries must be examined.
>>>      This patch changes the behaviour when caching mode is enabled (which is
>>>      the case when IOMMU emulation takes place). In this case, page specific
>>>      invalidation is used instead.
>>>
>>> Before I ask someone else besides qemu-devel, I am curious about
>>> whether there is existing VT-d emulation code (outside QEMU, of
>>> course) so that I can have a reference?
>> Yes, it has. The author of this patch - Nadav has done lots of research on
>> emulated IOMMU. See following papers:
>>
>> https://hal.inria.fr/inria-00493752/document
>> http://www.cse.iitd.ac.in/~sbansal/csl862-virt/readings/vIOMMU.pdf
> Thanks for these good materials. I will google the author for sure
> next time. :)
>
> -- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio devices
  2017-01-18  8:36               ` Jason Wang
@ 2017-01-18  8:46                 ` Peter Xu
  2017-01-18  9:38                   ` Tian, Kevin
  0 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-18  8:46 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka,
	alex.williamson, bd.aviv

On Wed, Jan 18, 2017 at 04:36:05PM +0800, Jason Wang wrote:
> 
> 
> On 2017年01月18日 16:11, Peter Xu wrote:
> >On Wed, Jan 18, 2017 at 11:10:53AM +0800, Jason Wang wrote:
> >>
> >>On 2017年01月17日 22:45, Peter Xu wrote:
> >>>On Mon, Jan 16, 2017 at 05:54:55PM +0800, Jason Wang wrote:
> >>>>On 2017年01月16日 17:18, Peter Xu wrote:
> >>>>>>>  static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
> >>>>>>>                                        hwaddr addr, uint8_t am)
> >>>>>>>  {
> >>>>>>>@@ -1222,6 +1251,7 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
> >>>>>>>      info.addr = addr;
> >>>>>>>      info.mask = ~((1 << am) - 1);
> >>>>>>>      g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
> >>>>>>>+    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
> >>>>>>Is the case of GLOBAL or DSI flush missed, or we don't care it at all?
> >>>>>IMHO we don't. For device assignment, since we are having CM=1 here,
> >>>>>we should have explicit page invalidations even if guest sends
> >>>>>global/domain invalidations.
> >>>>>
> >>>>>Thanks,
> >>>>>
> >>>>>-- peterx
> >>>>Is this spec required?
> >>>I think not. IMO the spec is very coarse grained on describing cache
> >>>mode...
> >>>
> >>>>Btw, it looks to me that both DSI and GLOBAL are
> >>>>indeed explicit flush.
> >>>Actually when cache mode is on, it is unclear to me on how we should
> >>>treat domain/global invalidations, at least from the spec (as
> >>>mentioned earlier). My understanding is that they are not "explicit
> >>>flushes", which IMHO should only mean page selective IOTLB
> >>>invalidations.
> >>Probably not, at least from the view of performance. DSI and global should
> >>be more efficient in some cases.
> >I agree with you that DSI/GLOBAL flushes are more efficient in some
> >way. But IMHO that does not mean these invalidations are "explicit
> >invalidations", and I suspect whether cache mode has to coop with it.
> 
> Well, the spec does not forbid DSI/GLOBAL with CM and the driver codes had
> used them for almost ten years. I can hardly believe it's wrong.

I think we have a misunderstanding here. :)

I never thought we should not send DSI/GLOBAL invalidations with cache
mode. I just think we should not do anything special when we receive
these signals, even with cache mode on.

In the spec, "explicit invalidation" is mentioned in the cache mode
chapter:

    The Caching Mode (CM) field in Capability Register indicates if
    the hardware implementation caches not-present or erroneous
    translation-structure entries. When the CM field is reported as
    Set, any software updates to any remapping structures (including
    updates to not-present entries or present entries whose
    programming resulted in translation faults) requires explicit
    invalidation of the caches.

And I thought we were discussing what "explicit invalidation" means,
as mentioned above.

> 
> >
> >But here I should add one more thing besides PSI - context entry
> >invalidation should be one of "the explicit invalidations" as well,
> >which we need to handle just like PSI when cache mode is on.
> >
> >>>>Just have a quick go through on driver codes and find this something
> >>>>interesting in intel_iommu_flush_iotlb_psi():
> >>>>
> >>>>...
> >>>>     /*
> >>>>      * Fallback to domain selective flush if no PSI support or the size is
> >>>>      * too big.
> >>>>      * PSI requires page size to be 2 ^ x, and the base address is naturally
> >>>>      * aligned to the size
> >>>>      */
> >>>>     if (!cap_pgsel_inv(iommu->cap) || mask > cap_max_amask_val(iommu->cap))
> >>>>         iommu->flush.flush_iotlb(iommu, did, 0, 0,
> >>>>                         DMA_TLB_DSI_FLUSH);
> >>>>     else
> >>>>         iommu->flush.flush_iotlb(iommu, did, addr | ih, mask,
> >>>>                         DMA_TLB_PSI_FLUSH);
> >>>>...
> >>>I think this is interesting... and I doubt its correctness while with
> >>>cache mode enabled.
> >>>
> >>>If so (sending domain invalidation instead of a big range of page
> >>>invalidations), how should we capture which pages are unmapped in
> >>>emulated IOMMU?
> >>We don't need to track individual pages here, since all pages for a specific
> >>domain were unmapped I believe?
> >IMHO this might not be the correct behavior.
> >
> >If we receive one domain specific invalidation, I agree that we should
> >invalidate the IOTLB cache for all the devices inside the domain.
> >However, when cache mode is on, we should be depending on the PSIs to
> >unmap each page (unless we want to unmap the whole address space, in
> >this case it's very possible that the guest is just unmapping a range,
> >not the entire space). If we convert several PSIs into one big DSI,
> >IMHO we will leave those pages mapped/unmapped while we should
> >unmap/map them.
> 
> Confused, do you have an example for this? (I fail to understand why DSI
> can't work, at least implementation can convert DSI to several PSIs
> internally).

That's how I understand it. It might be wrong. Btw, could you
elaborate a bit on how we can convert a DSI into several PSIs?

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio devices
  2017-01-18  8:46                 ` Peter Xu
@ 2017-01-18  9:38                   ` Tian, Kevin
  2017-01-18 10:06                     ` Jason Wang
                                       ` (2 more replies)
  0 siblings, 3 replies; 93+ messages in thread
From: Tian, Kevin @ 2017-01-18  9:38 UTC (permalink / raw)
  To: Peter Xu, Jason Wang
  Cc: qemu-devel, Lan, Tianyu, mst, jan.kiszka, alex.williamson,
	bd.aviv, Raj, Ashok

> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Wednesday, January 18, 2017 4:46 PM
> 
> On Wed, Jan 18, 2017 at 04:36:05PM +0800, Jason Wang wrote:
> >
> >
> > On 2017年01月18日 16:11, Peter Xu wrote:
> > >On Wed, Jan 18, 2017 at 11:10:53AM +0800, Jason Wang wrote:
> > >>
> > >>On 2017年01月17日 22:45, Peter Xu wrote:
> > >>>On Mon, Jan 16, 2017 at 05:54:55PM +0800, Jason Wang wrote:
> > >>>>On 2017年01月16日 17:18, Peter Xu wrote:
> > >>>>>>>  static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t
> domain_id,
> > >>>>>>>                                        hwaddr addr, uint8_t am)
> > >>>>>>>  {
> > >>>>>>>@@ -1222,6 +1251,7 @@ static void
> vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
> > >>>>>>>      info.addr = addr;
> > >>>>>>>      info.mask = ~((1 << am) - 1);
> > >>>>>>>      g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page,
> &info);
> > >>>>>>>+    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
> > >>>>>>Is the case of GLOBAL or DSI flush missed, or we don't care it at all?
> > >>>>>IMHO we don't. For device assignment, since we are having CM=1 here,
> > >>>>>we should have explicit page invalidations even if guest sends
> > >>>>>global/domain invalidations.
> > >>>>>
> > >>>>>Thanks,
> > >>>>>
> > >>>>>-- peterx
> > >>>>Is this spec required?
> > >>>I think not. IMO the spec is very coarse grained on describing cache
> > >>>mode...
> > >>>
> > >>>>Btw, it looks to me that both DSI and GLOBAL are
> > >>>>indeed explicit flush.
> > >>>Actually when cache mode is on, it is unclear to me on how we should
> > >>>treat domain/global invalidations, at least from the spec (as
> > >>>mentioned earlier). My understanding is that they are not "explicit
> > >>>flushes", which IMHO should only mean page selective IOTLB
> > >>>invalidations.
> > >>Probably not, at least from the view of performance. DSI and global should
> > >>be more efficient in some cases.
> > >I agree with you that DSI/GLOBAL flushes are more efficient in some
> > >way. But IMHO that does not mean these invalidations are "explicit
> > >invalidations", and I suspect whether cache mode has to coop with it.
> >
> > Well, the spec does not forbid DSI/GLOBAL with CM and the driver codes had
> > used them for almost ten years. I can hardly believe it's wrong.
> 
> I think we have misunderstanding here. :)
> 
> I never thought we should not send DSI/GLOBAL invalidations with cache
> mode. I just think we should not do anything special even if we have
> cache mode on when we receive these signals.
> 
> In the spec, "explicit invalidation" is mentioned in the cache mode
> chapter:
> 
>     The Caching Mode (CM) field in Capability Register indicates if
>     the hardware implementation caches not-present or erroneous
>     translation-structure entries. When the CM field is reported as
>     Set, any software updates to any remapping structures (including
>     updates to not-present entries or present entries whose
>     programming resulted in translation faults) requires explicit
>     invalidation of the caches.
> 
> And I thought we were discussion about "what is explicit invalidation"
> mentioned above.

Check 6.5.3.1 Implicit Invalidation on Page Requests

	In addition to the explicit invalidation through invalidation commands 
	(see Section 6.5.1 and Section 6.5.2) or through deferred invalidation 
	messages (see Section 6.5.4), identified above, Page Requests from 
	endpoint devices invalidate entries in the IOTLBs and paging-structure 
	caches.

My impression is that the above indirectly defines the invalidation
commands (PSI/DSI/GLOBAL) as explicit invalidations, because they are
explicitly issued by the driver. Then section 6.5.3.1 further describes
implicit invalidations caused by other VT-d operations.

I will check with the VT-d spec owner to clarify.

> 
> >
> > >
> > >But here I should add one more thing besides PSI - context entry
> > >invalidation should be one of "the explicit invalidations" as well,
> > >which we need to handle just like PSI when cache mode is on.
> > >
> > >>>>Just have a quick go through on driver codes and find this something
> > >>>>interesting in intel_iommu_flush_iotlb_psi():
> > >>>>
> > >>>>...
> > >>>>     /*
> > >>>>      * Fallback to domain selective flush if no PSI support or the size is
> > >>>>      * too big.
> > >>>>      * PSI requires page size to be 2 ^ x, and the base address is naturally
> > >>>>      * aligned to the size
> > >>>>      */
> > >>>>     if (!cap_pgsel_inv(iommu->cap) || mask >
> cap_max_amask_val(iommu->cap))
> > >>>>         iommu->flush.flush_iotlb(iommu, did, 0, 0,
> > >>>>                         DMA_TLB_DSI_FLUSH);
> > >>>>     else
> > >>>>         iommu->flush.flush_iotlb(iommu, did, addr | ih, mask,
> > >>>>                         DMA_TLB_PSI_FLUSH);
> > >>>>...
> > >>>I think this is interesting... and I doubt its correctness while with
> > >>>cache mode enabled.
> > >>>
> > >>>If so (sending domain invalidation instead of a big range of page
> > >>>invalidations), how should we capture which pages are unmapped in
> > >>>emulated IOMMU?
> > >>We don't need to track individual pages here, since all pages for a specific
> > >>domain were unmapped I believe?
> > >IMHO this might not be the correct behavior.
> > >
> > >If we receive one domain specific invalidation, I agree that we should
> > >invalidate the IOTLB cache for all the devices inside the domain.
> > >However, when cache mode is on, we should be depending on the PSIs to
> > >unmap each page (unless we want to unmap the whole address space, in
> > >this case it's very possible that the guest is just unmapping a range,
> > >not the entire space). If we convert several PSIs into one big DSI,
> > >IMHO we will leave those pages mapped/unmapped while we should
> > >unmap/map them.
> >
> > Confused, do you have an example for this? (I fail to understand why DSI
> > can't work, at least implementation can convert DSI to several PSIs
> > internally).
> 
> That's how I understand it. It might be wrong. Btw, could you
> elaborate a bit on how can we convert a DSI into several PSIs?
> 
> Thanks,

If my understanding above is correct, there is nothing wrong with the
above IOMMU driver code - actually it makes sense on bare metal
when CM is disabled.

But yes, DSI/GLOBAL is far less efficient than PSI when CM is enabled.
We rely on cache invalidations to indirectly capture remapping-structure
changes. PSI provides accurate info, while DSI/GLOBAL doesn't. To
emulate the correct behavior of DSI/GLOBAL, we have to pretend that
the whole address space (iova=0, size=2^agaw) needs to be unmapped
(for GLOBAL it further means multiple address spaces).
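
Conceptually the emulation could look like the sketch below (purely
illustrative - every helper and field name here is made up; the real
code would need the per-device notifier infrastructure from this
series):

static void vtd_emulate_dsi(IntelIOMMUState *s, uint16_t domain_id)
{
    VTDAddressSpace *as;

    /* For each assigned device in the domain: drop everything in
     * [0, 2^agaw), then rebuild from the current page tables. */
    QLIST_FOREACH(as, &s->vtd_as_list, next) {        /* assumed list */
        if (vtd_as_domain_id(s, as) != domain_id) {   /* made-up helper */
            continue;
        }
        vtd_unmap_iova_range(as, 0, 1ULL << s->agaw); /* made-up helper */
        vtd_replay_mappings(as);                      /* made-up helper */
    }
}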

Though not efficient, that doesn't mean it's wrong, since the guest
driver follows the spec. We can ask for a Linux IOMMU driver change
(CC Ashok) to not use the above optimization when cache mode is
enabled, but anyway we need to emulate correct DSI/GLOBAL behavior to
follow the spec, because:

- even when a driver fix is in place, old versions still have this
logic;

- there is still the scenario where the guest IOMMU driver does want to
invalidate the whole address space, e.g. when changing a context
entry. Asking the guest driver to use PSI for such a purpose is another
bad thing.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio devices
  2017-01-18  9:38                   ` Tian, Kevin
@ 2017-01-18 10:06                     ` Jason Wang
  2017-01-19  3:32                       ` Peter Xu
  2017-01-19  3:16                     ` Peter Xu
  2017-01-19  6:44                     ` Liu, Yi L
  2 siblings, 1 reply; 93+ messages in thread
From: Jason Wang @ 2017-01-18 10:06 UTC (permalink / raw)
  To: Tian, Kevin, Peter Xu
  Cc: Lan, Tianyu, Raj, Ashok, mst, jan.kiszka, bd.aviv, qemu-devel,
	alex.williamson



On 2017年01月18日 17:38, Tian, Kevin wrote:
>> From: Peter Xu [mailto:peterx@redhat.com]
>> Sent: Wednesday, January 18, 2017 4:46 PM
>>
>> On Wed, Jan 18, 2017 at 04:36:05PM +0800, Jason Wang wrote:
>>>
>>> On 2017年01月18日 16:11, Peter Xu wrote:
>>>> On Wed, Jan 18, 2017 at 11:10:53AM +0800, Jason Wang wrote:
>>>>> On 2017年01月17日 22:45, Peter Xu wrote:
>>>>>> On Mon, Jan 16, 2017 at 05:54:55PM +0800, Jason Wang wrote:
>>>>>>> On 2017年01月16日 17:18, Peter Xu wrote:
>>>>>>>>>>   static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t
>> domain_id,
>>>>>>>>>>                                         hwaddr addr, uint8_t am)
>>>>>>>>>>   {
>>>>>>>>>> @@ -1222,6 +1251,7 @@ static void
>> vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
>>>>>>>>>>       info.addr = addr;
>>>>>>>>>>       info.mask = ~((1 << am) - 1);
>>>>>>>>>>       g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page,
>> &info);
>>>>>>>>>> +    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
>>>>>>>>> Is the case of GLOBAL or DSI flush missed, or we don't care it at all?
>>>>>>>> IMHO we don't. For device assignment, since we are having CM=1 here,
>>>>>>>> we should have explicit page invalidations even if guest sends
>>>>>>>> global/domain invalidations.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> -- peterx
>>>>>>> Is this spec required?
>>>>>> I think not. IMO the spec is very coarse grained on describing cache
>>>>>> mode...
>>>>>>
>>>>>>> Btw, it looks to me that both DSI and GLOBAL are
>>>>>>> indeed explicit flush.
>>>>>> Actually when cache mode is on, it is unclear to me on how we should
>>>>>> treat domain/global invalidations, at least from the spec (as
>>>>>> mentioned earlier). My understanding is that they are not "explicit
>>>>>> flushes", which IMHO should only mean page selective IOTLB
>>>>>> invalidations.
>>>>> Probably not, at least from the view of performance. DSI and global should
>>>>> be more efficient in some cases.
>>>> I agree with you that DSI/GLOBAL flushes are more efficient in some
>>>> way. But IMHO that does not mean these invalidations are "explicit
>>>> invalidations", and I suspect whether cache mode has to coop with it.
>>> Well, the spec does not forbid DSI/GLOBAL with CM and the driver codes had
>>> used them for almost ten years. I can hardly believe it's wrong.
>> I think we have misunderstanding here. :)
>>
>> I never thought we should not send DSI/GLOBAL invalidations with cache
>> mode. I just think we should not do anything special even if we have
>> cache mode on when we receive these signals.
>>
>> In the spec, "explicit invalidation" is mentioned in the cache mode
>> chapter:
>>
>>      The Caching Mode (CM) field in Capability Register indicates if
>>      the hardware implementation caches not-present or erroneous
>>      translation-structure entries. When the CM field is reported as
>>      Set, any software updates to any remapping structures (including
>>      updates to not-present entries or present entries whose
>>      programming resulted in translation faults) requires explicit
>>      invalidation of the caches.
>>
>> And I thought we were discussion about "what is explicit invalidation"
>> mentioned above.
> Check 6.5.3.1 Implicit Invalidation on Page Requests
>
> 	In addition to the explicit invalidation through invalidation commands
> 	(see Section 6.5.1 and Section 6.5.2) or through deferred invalidation
> 	messages (see Section 6.5.4), identified above, Page Requests from
> 	endpoint devices invalidate entries in the IOTLBs and paging-structure
> 	caches.
>
> My impression is that above indirectly defines invalidation commands (
> PSI/DSI/GLOBAL) as explicit invalidation, because they are explicitly
> issued by driver. Then section 6.5.3.1 further describes implicit
> invalidations caused by other VT-d operations.
>
> I will check with VT-d spec owner to clarify.

Good to hear from you.

So I think we should implement DSI and GLOBAL for vfio in this case. We
can first try to implement them through the current VFIO API, which
accepts a range of iova. If that's not possible, let's discuss other
possible solutions.
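
For example, a whole-range unmap already maps naturally onto the
existing type1 ioctl. The ioctl and struct below are the real VFIO
API; the helper wrapping them is only an illustration:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Unmap a whole iova range with one VFIO call instead of one call
 * per guest page. */
static int vfio_unmap_range(int container, uint64_t iova, uint64_t size)
{
    struct vfio_iommu_type1_dma_unmap unmap = {
        .argsz = sizeof(unmap),
        .flags = 0,
        .iova  = iova,
        .size  = size,
    };

    return ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap);
}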

>
>>>> But here I should add one more thing besides PSI - context entry
>>>> invalidation should be one of "the explicit invalidations" as well,
>>>> which we need to handle just like PSI when cache mode is on.
>>>>
>>>>>>> Just have a quick go through on driver codes and find this something
>>>>>>> interesting in intel_iommu_flush_iotlb_psi():
>>>>>>>
>>>>>>> ...
>>>>>>>      /*
>>>>>>>       * Fallback to domain selective flush if no PSI support or the size is
>>>>>>>       * too big.
>>>>>>>       * PSI requires page size to be 2 ^ x, and the base address is naturally
>>>>>>>       * aligned to the size
>>>>>>>       */
>>>>>>>      if (!cap_pgsel_inv(iommu->cap) || mask >
>> cap_max_amask_val(iommu->cap))
>>>>>>>          iommu->flush.flush_iotlb(iommu, did, 0, 0,
>>>>>>>                          DMA_TLB_DSI_FLUSH);
>>>>>>>      else
>>>>>>>          iommu->flush.flush_iotlb(iommu, did, addr | ih, mask,
>>>>>>>                          DMA_TLB_PSI_FLUSH);
>>>>>>> ...
>>>>>> I think this is interesting... and I doubt its correctness while with
>>>>>> cache mode enabled.
>>>>>>
>>>>>> If so (sending domain invalidation instead of a big range of page
>>>>>> invalidations), how should we capture which pages are unmapped in
>>>>>> emulated IOMMU?
>>>>> We don't need to track individual pages here, since all pages for a specific
>>>>> domain were unmapped I believe?
>>>> IMHO this might not be the correct behavior.
>>>>
>>>> If we receive one domain specific invalidation, I agree that we should
>>>> invalidate the IOTLB cache for all the devices inside the domain.
>>>> However, when cache mode is on, we should be depending on the PSIs to
>>>> unmap each page (unless we want to unmap the whole address space, in
>>>> this case it's very possible that the guest is just unmapping a range,
>>>> not the entire space). If we convert several PSIs into one big DSI,
>>>> IMHO we will leave those pages mapped/unmapped while we should
>>>> unmap/map them.
>>> Confused, do you have an example for this? (I fail to understand why DSI
>>> can't work, at least implementation can convert DSI to several PSIs
>>> internally).
>> That's how I understand it. It might be wrong. Btw, could you
>> elaborate a bit on how can we convert a DSI into several PSIs?
>>
>> Thanks,
> If my understanding above is correct, there is nothing wrong with
> above IOMMU driver code - actually it makes sense on bare metal
> when CM is disabled.
>
> But yes, DSI/GLOBAL is far less efficient than PSI when CM is enabled.
> We rely on cache invalidations to indirectly capture remapping structure
> change. PSI provides accurate info, while DSI/GLOBAL doesn't. To
> emulate correct behavior of DSI/GLOBAL, we have to pretend that
> the whole address space (iova=0, size=agaw) needs to be unmapped
> (for GLOBAL it further means multiple address spaces)

Maybe a trick to get accurate info is a virtual Device IOTLB.

>
> Though not efficient, it doesn't mean it's wrong since guest driver
> follows spec. We can ask for linux IOMMU driver change (CC Ashok)
> to not use above optimization when cache mode is enabled, but
> anyway we need emulate correct DSI/GLOBAL behavior to follow
> spec, because:
>
> - even when driver fix is in place, old version still has this logic;
>
> - there is still scenario where guest IOMMU driver does want to
> invalidate the whole address space, e.g. when changing context
> entry. Asking guest driver to use PSI for such purpose is another
> bad thing.
>
> Thanks
> Kevin

Agree.

Thanks

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio devices
  2017-01-18  9:38                   ` Tian, Kevin
  2017-01-18 10:06                     ` Jason Wang
@ 2017-01-19  3:16                     ` Peter Xu
  2017-01-19  6:22                       ` Tian, Kevin
  2017-01-19  6:44                     ` Liu, Yi L
  2 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-19  3:16 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Wang, qemu-devel, Lan, Tianyu, mst, jan.kiszka,
	alex.williamson, bd.aviv, Raj, Ashok

On Wed, Jan 18, 2017 at 09:38:55AM +0000, Tian, Kevin wrote:
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Wednesday, January 18, 2017 4:46 PM
> > 
> > On Wed, Jan 18, 2017 at 04:36:05PM +0800, Jason Wang wrote:
> > >
> > >
> > > On 2017年01月18日 16:11, Peter Xu wrote:
> > > >On Wed, Jan 18, 2017 at 11:10:53AM +0800, Jason Wang wrote:
> > > >>
> > > >>On 2017年01月17日 22:45, Peter Xu wrote:
> > > >>>On Mon, Jan 16, 2017 at 05:54:55PM +0800, Jason Wang wrote:
> > > >>>>On 2017年01月16日 17:18, Peter Xu wrote:
> > > >>>>>>>  static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t
> > domain_id,
> > > >>>>>>>                                        hwaddr addr, uint8_t am)
> > > >>>>>>>  {
> > > >>>>>>>@@ -1222,6 +1251,7 @@ static void
> > vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
> > > >>>>>>>      info.addr = addr;
> > > >>>>>>>      info.mask = ~((1 << am) - 1);
> > > >>>>>>>      g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page,
> > &info);
> > > >>>>>>>+    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
> > > >>>>>>Is the case of GLOBAL or DSI flush missed, or we don't care it at all?
> > > >>>>>IMHO we don't. For device assignment, since we are having CM=1 here,
> > > >>>>>we should have explicit page invalidations even if guest sends
> > > >>>>>global/domain invalidations.
> > > >>>>>
> > > >>>>>Thanks,
> > > >>>>>
> > > >>>>>-- peterx
> > > >>>>Is this spec required?
> > > >>>I think not. IMO the spec is very coarse grained on describing cache
> > > >>>mode...
> > > >>>
> > > >>>>Btw, it looks to me that both DSI and GLOBAL are
> > > >>>>indeed explicit flush.
> > > >>>Actually when cache mode is on, it is unclear to me on how we should
> > > >>>treat domain/global invalidations, at least from the spec (as
> > > >>>mentioned earlier). My understanding is that they are not "explicit
> > > >>>flushes", which IMHO should only mean page selective IOTLB
> > > >>>invalidations.
> > > >>Probably not, at least from the view of performance. DSI and global should
> > > >>be more efficient in some cases.
> > > >I agree with you that DSI/GLOBAL flushes are more efficient in some
> > > >way. But IMHO that does not mean these invalidations are "explicit
> > > >invalidations", and I suspect whether cache mode has to coop with it.
> > >
> > > Well, the spec does not forbid DSI/GLOBAL with CM and the driver codes had
> > > used them for almost ten years. I can hardly believe it's wrong.
> > 
> > I think we have misunderstanding here. :)
> > 
> > I never thought we should not send DSI/GLOBAL invalidations with cache
> > mode. I just think we should not do anything special even if we have
> > cache mode on when we receive these signals.
> > 
> > In the spec, "explicit invalidation" is mentioned in the cache mode
> > chapter:
> > 
> >     The Caching Mode (CM) field in Capability Register indicates if
> >     the hardware implementation caches not-present or erroneous
> >     translation-structure entries. When the CM field is reported as
> >     Set, any software updates to any remapping structures (including
> >     updates to not-present entries or present entries whose
> >     programming resulted in translation faults) requires explicit
> >     invalidation of the caches.
> > 
> > And I thought we were discussion about "what is explicit invalidation"
> > mentioned above.
> 
> Check 6.5.3.1 Implicit Invalidation on Page Requests
> 
> 	In addition to the explicit invalidation through invalidation commands 
> 	(see Section 6.5.1 and Section 6.5.2) or through deferred invalidation 
> 	messages (see Section 6.5.4), identified above, Page Requests from 
> 	endpoint devices invalidate entries in the IOTLBs and paging-structure 
> 	caches.
> 
> My impression is that above indirectly defines invalidation commands (
> PSI/DSI/GLOBAL) as explicit invalidation, because they are explicitly
> issued by driver. Then section 6.5.3.1 further describes implicit
> invalidations caused by other VT-d operations.
> 
> I will check with VT-d spec owner to clarify.

The spec text above is clear to me. So now I agree that both DSI and
GLOBAL iotlb invalidations are explicit invalidations.

> 
> > 
> > >
> > > >
> > > >But here I should add one more thing besides PSI - context entry
> > > >invalidation should be one of "the explicit invalidations" as well,
> > > >which we need to handle just like PSI when cache mode is on.
> > > >
> > > >>>>Just have a quick go through on driver codes and find this something
> > > >>>>interesting in intel_iommu_flush_iotlb_psi():
> > > >>>>
> > > >>>>...
> > > >>>>     /*
> > > >>>>      * Fallback to domain selective flush if no PSI support or the size is
> > > >>>>      * too big.
> > > >>>>      * PSI requires page size to be 2 ^ x, and the base address is naturally
> > > >>>>      * aligned to the size
> > > >>>>      */
> > > >>>>     if (!cap_pgsel_inv(iommu->cap) || mask >
> > cap_max_amask_val(iommu->cap))
> > > >>>>         iommu->flush.flush_iotlb(iommu, did, 0, 0,
> > > >>>>                         DMA_TLB_DSI_FLUSH);
> > > >>>>     else
> > > >>>>         iommu->flush.flush_iotlb(iommu, did, addr | ih, mask,
> > > >>>>                         DMA_TLB_PSI_FLUSH);
> > > >>>>...
> > > >>>I think this is interesting... and I doubt its correctness while with
> > > >>>cache mode enabled.
> > > >>>
> > > >>>If so (sending domain invalidation instead of a big range of page
> > > >>>invalidations), how should we capture which pages are unmapped in
> > > >>>emulated IOMMU?
> > > >>We don't need to track individual pages here, since all pages for a specific
> > > >>domain were unmapped I believe?
> > > >IMHO this might not be the correct behavior.
> > > >
> > > >If we receive one domain specific invalidation, I agree that we should
> > > >invalidate the IOTLB cache for all the devices inside the domain.
> > > >However, when cache mode is on, we should be depending on the PSIs to
> > > >unmap each page (unless we want to unmap the whole address space, in
> > > >this case it's very possible that the guest is just unmapping a range,
> > > >not the entire space). If we convert several PSIs into one big DSI,
> > > >IMHO we will leave those pages mapped/unmapped while we should
> > > >unmap/map them.
> > >
> > > Confused, do you have an example for this? (I fail to understand why DSI
> > > can't work, at least implementation can convert DSI to several PSIs
> > > internally).
> > 
> > That's how I understand it. It might be wrong. Btw, could you
> > elaborate a bit on how can we convert a DSI into several PSIs?
> > 
> > Thanks,
> 
> If my understanding above is correct, there is nothing wrong with 
> above IOMMU driver code - actually it makes sense on bare metal
> when CM is disabled.
> 
> But yes, DSI/GLOBAL is far less efficient than PSI when CM is enabled.
> We rely on cache invalidations to indirectly capture remapping structure 
> change. PSI provides accurate info, while DSI/GLOBAL doesn't. To 
> emulate correct behavior of DSI/GLOBAL, we have to pretend that
> the whole address space (iova=0, size=agaw) needs to be unmapped
> (for GLOBAL it further means multiple address spaces)
> 
> Though not efficient, it doesn't mean it's wrong since guest driver
> follows spec. We can ask for linux IOMMU driver change (CC Ashok)
> to not use above optimization when cache mode is enabled, but 
> anyway we need emulate correct DSI/GLOBAL behavior to follow
> spec, because:
> 
> - even when driver fix is in place, old version still has this logic;
> 
> - there is still scenario where guest IOMMU driver does want to
> invalidate the whole address space, e.g. when changing context
> entry. Asking guest driver to use PSI for such purpose is another
> bad thing.

Thanks for the thorough explanation. It did answer my question above.

Btw, I never meant to ask the guest IOMMU driver to send PSIs instead
of context entry invalidations, considering that the series is using
context entry invalidations to replay the region. But I admit I may
have misunderstood the spec a bit. :-)

I'll consider this issue in the next post, and handle domain/global
invalidations properly (though it might be slower).
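
For illustration, a rough sketch of what that handling could look like
on the emulated VT-d side (a hypothetical function, not the actual next
post; the helpers marked below do not exist in this series):

    static void vtd_iotlb_domain_invalidate_notify(IntelIOMMUState *s,
                                                   uint16_t domain_id)
    {
        VTDAddressSpace *vtd_as;
        VTDContextEntry ce;

        /* assumed: a list of address spaces with registered notifiers */
        QLIST_FOREACH(vtd_as, &s->notifier_list, next) {
            if (vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
                                         vtd_as->devfn, &ce)) {
                continue;                /* no valid context entry */
            }
            if (VTD_CONTEXT_ENTRY_DID(ce.hi) != domain_id) {
                continue;                /* device not in this domain */
            }
            /* Pretend the whole iova space (iova=0, size=agaw) was
             * invalidated: unmap it all, then re-walk the guest page
             * table to re-establish the present mappings. */
            vtd_address_space_unmap_all(vtd_as);    /* assumed helper */
            vtd_address_space_replay(vtd_as, &ce);  /* assumed helper */
        }
    }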

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio devices
  2017-01-18 10:06                     ` Jason Wang
@ 2017-01-19  3:32                       ` Peter Xu
  2017-01-19  3:36                         ` Jason Wang
  0 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-19  3:32 UTC (permalink / raw)
  To: Jason Wang
  Cc: Tian, Kevin, Lan, Tianyu, Raj, Ashok, mst, jan.kiszka, bd.aviv,
	qemu-devel, alex.williamson

On Wed, Jan 18, 2017 at 06:06:57PM +0800, Jason Wang wrote:

[...]

> So I think we should implement DSI and GLOBAL for vfio in this case. We can
> first try to implement it through current VFIO API which can accepts a range
> of iova. If not possible, let's discuss for other possible solutions.

Do you mean VFIO_IOMMU_UNMAP_DMA here?

[...]

> >If my understanding above is correct, there is nothing wrong with
> >above IOMMU driver code - actually it makes sense on bare metal
> >when CM is disabled.
> >
> >But yes, DSI/GLOBAL is far less efficient than PSI when CM is enabled.
> >We rely on cache invalidations to indirectly capture remapping structure
> >change. PSI provides accurate info, while DSI/GLOBAL doesn't. To
> >emulate correct behavior of DSI/GLOBAL, we have to pretend that
> >the whole address space (iova=0, size=agaw) needs to be unmapped
> >(for GLOBAL it further means multiple address spaces)
> 
> Maybe a trick to have accurate info is virtual Device IOTLB.

Could you elaborate a bit on this?

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio devices
  2017-01-19  3:32                       ` Peter Xu
@ 2017-01-19  3:36                         ` Jason Wang
  0 siblings, 0 replies; 93+ messages in thread
From: Jason Wang @ 2017-01-19  3:36 UTC (permalink / raw)
  To: Peter Xu
  Cc: Tian, Kevin, Lan, Tianyu, Raj, Ashok, mst, jan.kiszka, bd.aviv,
	qemu-devel, alex.williamson



On 2017年01月19日 11:32, Peter Xu wrote:
> On Wed, Jan 18, 2017 at 06:06:57PM +0800, Jason Wang wrote:
>
> [...]
>
>> So I think we should implement DSI and GLOBAL for vfio in this case. We can
>> first try to implement it through current VFIO API which can accepts a range
>> of iova. If not possible, let's discuss for other possible solutions.
> Do you mean VFIO_IOMMU_UNMAP_DMA here?
>
> [...]

Yes.

>
>>> If my understanding above is correct, there is nothing wrong with
>>> above IOMMU driver code - actually it makes sense on bare metal
>>> when CM is disabled.
>>>
>>> But yes, DSI/GLOBAL is far less efficient than PSI when CM is enabled.
>>> We rely on cache invalidations to indirectly capture remapping structure
>>> change. PSI provides accurate info, while DSI/GLOBAL doesn't. To
>>> emulate correct behavior of DSI/GLOBAL, we have to pretend that
>>> the whole address space (iova=0, size=agaw) needs to be unmapped
>>> (for GLOBAL it further means multiple address spaces)
>> Maybe a trick to have accurate info is virtual Device IOTLB.
> Could you elaborate a bit on this?
>
> Thanks,
>
> -- peterx

I think the trick is that if the guest knows the device has a device 
IOTLB, it will explicitly flush with an accurate iova range.
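
For example (assuming the intel-iommu "device-iotlb" and virtio-pci
"ats" properties -- option names may differ across QEMU versions):

    -device intel-iommu,intremap=on,cache-mode=on,device-iotlb=on \
    -device virtio-net-pci,netdev=net0,ats=on

The guest driver then has to invalidate the device IOTLB with the exact
iova range, instead of relying on a coarse IOTLB flush.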

Thanks

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio devices
  2017-01-19  3:16                     ` Peter Xu
@ 2017-01-19  6:22                       ` Tian, Kevin
  2017-01-19  9:38                         ` Peter Xu
  0 siblings, 1 reply; 93+ messages in thread
From: Tian, Kevin @ 2017-01-19  6:22 UTC (permalink / raw)
  To: Peter Xu
  Cc: Jason Wang, qemu-devel, Lan, Tianyu, mst, jan.kiszka,
	alex.williamson, bd.aviv, Raj, Ashok

> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Thursday, January 19, 2017 11:17 AM
> 
> On Wed, Jan 18, 2017 at 09:38:55AM +0000, Tian, Kevin wrote:
> > > From: Peter Xu [mailto:peterx@redhat.com]
> > > Sent: Wednesday, January 18, 2017 4:46 PM
> > >
> > > On Wed, Jan 18, 2017 at 04:36:05PM +0800, Jason Wang wrote:
> > > >
> > > >
> > > > On 2017年01月18日 16:11, Peter Xu wrote:
> > > > >On Wed, Jan 18, 2017 at 11:10:53AM +0800, Jason Wang wrote:
> > > > >>
> > > > >>On 2017年01月17日 22:45, Peter Xu wrote:
> > > > >>>On Mon, Jan 16, 2017 at 05:54:55PM +0800, Jason Wang wrote:
> > > > >>>>On 2017年01月16日 17:18, Peter Xu wrote:
> > > > >>>>>>>  static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t
> > > domain_id,
> > > > >>>>>>>                                        hwaddr addr, uint8_t am)
> > > > >>>>>>>  {
> > > > >>>>>>>@@ -1222,6 +1251,7 @@ static void
> > > vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
> > > > >>>>>>>      info.addr = addr;
> > > > >>>>>>>      info.mask = ~((1 << am) - 1);
> > > > >>>>>>>      g_hash_table_foreach_remove(s->iotlb,
> vtd_hash_remove_by_page,
> > > &info);
> > > > >>>>>>>+    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
> > > > >>>>>>Is the case of GLOBAL or DSI flush missed, or we don't care it at all?
> > > > >>>>>IMHO we don't. For device assignment, since we are having CM=1 here,
> > > > >>>>>we should have explicit page invalidations even if guest sends
> > > > >>>>>global/domain invalidations.
> > > > >>>>>
> > > > >>>>>Thanks,
> > > > >>>>>
> > > > >>>>>-- peterx
> > > > >>>>Is this spec required?
> > > > >>>I think not. IMO the spec is very coarse grained on describing cache
> > > > >>>mode...
> > > > >>>
> > > > >>>>Btw, it looks to me that both DSI and GLOBAL are
> > > > >>>>indeed explicit flush.
> > > > >>>Actually when cache mode is on, it is unclear to me on how we should
> > > > >>>treat domain/global invalidations, at least from the spec (as
> > > > >>>mentioned earlier). My understanding is that they are not "explicit
> > > > >>>flushes", which IMHO should only mean page selective IOTLB
> > > > >>>invalidations.
> > > > >>Probably not, at least from the view of performance. DSI and global should
> > > > >>be more efficient in some cases.
> > > > >I agree with you that DSI/GLOBAL flushes are more efficient in some
> > > > >way. But IMHO that does not mean these invalidations are "explicit
> > > > >invalidations", and I suspect whether cache mode has to coop with it.
> > > >
> > > > Well, the spec does not forbid DSI/GLOBAL with CM and the driver codes had
> > > > used them for almost ten years. I can hardly believe it's wrong.
> > >
> > > I think we have misunderstanding here. :)
> > >
> > > I never thought we should not send DSI/GLOBAL invalidations with cache
> > > mode. I just think we should not do anything special even if we have
> > > cache mode on when we receive these signals.
> > >
> > > In the spec, "explicit invalidation" is mentioned in the cache mode
> > > chapter:
> > >
> > >     The Caching Mode (CM) field in Capability Register indicates if
> > >     the hardware implementation caches not-present or erroneous
> > >     translation-structure entries. When the CM field is reported as
> > >     Set, any software updates to any remapping structures (including
> > >     updates to not-present entries or present entries whose
> > >     programming resulted in translation faults) requires explicit
> > >     invalidation of the caches.
> > >
> > > And I thought we were discussion about "what is explicit invalidation"
> > > mentioned above.
> >
> > Check 6.5.3.1 Implicit Invalidation on Page Requests
> >
> > 	In addition to the explicit invalidation through invalidation commands
> > 	(see Section 6.5.1 and Section 6.5.2) or through deferred invalidation
> > 	messages (see Section 6.5.4), identified above, Page Requests from
> > 	endpoint devices invalidate entries in the IOTLBs and paging-structure
> > 	caches.
> >
> > My impression is that above indirectly defines invalidation commands (
> > PSI/DSI/GLOBAL) as explicit invalidation, because they are explicitly
> > issued by driver. Then section 6.5.3.1 further describes implicit
> > invalidations caused by other VT-d operations.
> >
> > I will check with VT-d spec owner to clarify.
> 
> Above spec is clear to me. So now I agree that both DSI/GLOBAL iotlb
> invalidations are explicit invalidations.
> 

Still copying the response from the spec owner here :-)

	Explicit invalidation is anytime software is explicitly invalidating something (
	through any descriptor) as opposed to something hardware does implicitly.  
	The only time hardware does implicit invalidation is during the handling of a page 
	request (recoverable page-fault) from an endpoint device.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio devices
  2017-01-18  9:38                   ` Tian, Kevin
  2017-01-18 10:06                     ` Jason Wang
  2017-01-19  3:16                     ` Peter Xu
@ 2017-01-19  6:44                     ` Liu, Yi L
  2017-01-19  7:02                       ` Jason Wang
  2017-01-19  7:02                       ` Peter Xu
  2 siblings, 2 replies; 93+ messages in thread
From: Liu, Yi L @ 2017-01-19  6:44 UTC (permalink / raw)
  To: Tian, Kevin, Peter Xu, Jason Wang
  Cc: Lan, Tianyu, Raj, Ashok, mst, jan.kiszka, bd.aviv, qemu-devel,
	alex.williamson, Liu, Yi L

> -----Original Message-----
> From: Qemu-devel [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org]
> On Behalf Of Tian, Kevin
> Sent: Wednesday, January 18, 2017 5:39 PM
> To: Peter Xu <peterx@redhat.com>; Jason Wang <jasowang@redhat.com>
> Cc: Lan, Tianyu <tianyu.lan@intel.com>; Raj, Ashok <ashok.raj@intel.com>;
> mst@redhat.com; jan.kiszka@siemens.com; bd.aviv@gmail.com; qemu-
> devel@nongnu.org; alex.williamson@redhat.com
> Subject: Re: [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio
> devices
> 
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Wednesday, January 18, 2017 4:46 PM
> >
> > On Wed, Jan 18, 2017 at 04:36:05PM +0800, Jason Wang wrote:
> > >
> > >
> > > On 2017年01月18日 16:11, Peter Xu wrote:
> > > >On Wed, Jan 18, 2017 at 11:10:53AM +0800, Jason Wang wrote:
> > > >>
> > > >>On 2017年01月17日 22:45, Peter Xu wrote:
> > > >>>On Mon, Jan 16, 2017 at 05:54:55PM +0800, Jason Wang wrote:
> > > >>>>On 2017年01月16日 17:18, Peter Xu wrote:
> > > >>>>>>>  static void vtd_iotlb_page_invalidate(IntelIOMMUState *s,
> > > >>>>>>> uint16_t
> > domain_id,
> > > >>>>>>>                                        hwaddr addr, uint8_t
> > > >>>>>>>am)
> > > >>>>>>>  {
> > > >>>>>>>@@ -1222,6 +1251,7 @@ static void
> > vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
> > > >>>>>>>      info.addr = addr;
> > > >>>>>>>      info.mask = ~((1 << am) - 1);
> > > >>>>>>>      g_hash_table_foreach_remove(s->iotlb,
> > > >>>>>>> vtd_hash_remove_by_page,
> > &info);
> > > >>>>>>>+    vtd_iotlb_page_invalidate_notify(s, domain_id, addr,
> > > >>>>>>>+ am);
> > > >>>>>>Is the case of GLOBAL or DSI flush missed, or we don't care it at all?
> > > >>>>>IMHO we don't. For device assignment, since we are having CM=1
> > > >>>>>here, we should have explicit page invalidations even if guest
> > > >>>>>sends global/domain invalidations.
> > > >>>>>
> > > >>>>>Thanks,
> > > >>>>>
> > > >>>>>-- peterx
> > > >>>>Is this spec required?
> > > >>>I think not. IMO the spec is very coarse grained on describing
> > > >>>cache mode...
> > > >>>
> > > >>>>Btw, it looks to me that both DSI and GLOBAL are indeed explicit
> > > >>>>flush.
> > > >>>Actually when cache mode is on, it is unclear to me on how we
> > > >>>should treat domain/global invalidations, at least from the spec
> > > >>>(as mentioned earlier). My understanding is that they are not
> > > >>>"explicit flushes", which IMHO should only mean page selective
> > > >>>IOTLB invalidations.
> > > >>Probably not, at least from the view of performance. DSI and
> > > >>global should be more efficient in some cases.
> > > >I agree with you that DSI/GLOBAL flushes are more efficient in some
> > > >way. But IMHO that does not mean these invalidations are "explicit
> > > >invalidations", and I suspect whether cache mode has to coop with it.
> > >
> > > Well, the spec does not forbid DSI/GLOBAL with CM and the driver
> > > codes had used them for almost ten years. I can hardly believe it's wrong.
> >
> > I think we have misunderstanding here. :)
> >
> > I never thought we should not send DSI/GLOBAL invalidations with cache
> > mode. I just think we should not do anything special even if we have
> > cache mode on when we receive these signals.
> >
> > In the spec, "explicit invalidation" is mentioned in the cache mode
> > chapter:
> >
> >     The Caching Mode (CM) field in Capability Register indicates if
> >     the hardware implementation caches not-present or erroneous
> >     translation-structure entries. When the CM field is reported as
> >     Set, any software updates to any remapping structures (including
> >     updates to not-present entries or present entries whose
> >     programming resulted in translation faults) requires explicit
> >     invalidation of the caches.
> >
> > And I thought we were discussion about "what is explicit invalidation"
> > mentioned above.
> 
> Check 6.5.3.1 Implicit Invalidation on Page Requests
> 
> 	In addition to the explicit invalidation through invalidation commands
> 	(see Section 6.5.1 and Section 6.5.2) or through deferred invalidation
> 	messages (see Section 6.5.4), identified above, Page Requests from
> 	endpoint devices invalidate entries in the IOTLBs and paging-structure
> 	caches.
> 
> My impression is that above indirectly defines invalidation commands (
> PSI/DSI/GLOBAL) as explicit invalidation, because they are explicitly issued by
> driver. Then section 6.5.3.1 further describes implicit invalidations caused by
> other VT-d operations.
> 
> I will check with VT-d spec owner to clarify.
> 
> >
> > >
> > > >
> > > >But here I should add one more thing besides PSI - context entry
> > > >invalidation should be one of "the explicit invalidations" as well,
> > > >which we need to handle just like PSI when cache mode is on.
> > > >
> > > >>>>Just have a quick go through on driver codes and find this
> > > >>>>something interesting in intel_iommu_flush_iotlb_psi():
> > > >>>>
> > > >>>>...
> > > >>>>     /*
> > > >>>>      * Fallback to domain selective flush if no PSI support or the size is
> > > >>>>      * too big.
> > > >>>>      * PSI requires page size to be 2 ^ x, and the base address is
> naturally
> > > >>>>      * aligned to the size
> > > >>>>      */
> > > >>>>     if (!cap_pgsel_inv(iommu->cap) || mask >
> > cap_max_amask_val(iommu->cap))
> > > >>>>         iommu->flush.flush_iotlb(iommu, did, 0, 0,
> > > >>>>                         DMA_TLB_DSI_FLUSH);
> > > >>>>     else
> > > >>>>         iommu->flush.flush_iotlb(iommu, did, addr | ih, mask,
> > > >>>>                         DMA_TLB_PSI_FLUSH); ...
> > > >>>I think this is interesting... and I doubt its correctness while
> > > >>>with cache mode enabled.
> > > >>>
> > > >>>If so (sending domain invalidation instead of a big range of page
> > > >>>invalidations), how should we capture which pages are unmapped in
> > > >>>emulated IOMMU?
> > > >>We don't need to track individual pages here, since all pages for
> > > >>a specific domain were unmapped I believe?
> > > >IMHO this might not be the correct behavior.
> > > >
> > > >If we receive one domain specific invalidation, I agree that we
> > > >should invalidate the IOTLB cache for all the devices inside the domain.
> > > >However, when cache mode is on, we should be depending on the PSIs
> > > >to unmap each page (unless we want to unmap the whole address
> > > >space, in this case it's very possible that the guest is just
> > > >unmapping a range, not the entire space). If we convert several
> > > >PSIs into one big DSI, IMHO we will leave those pages
> > > >mapped/unmapped while we should unmap/map them.
> > >
> > > Confused, do you have an example for this? (I fail to understand why
> > > DSI can't work, at least implementation can convert DSI to several
> > > PSIs internally).
> >
> > That's how I understand it. It might be wrong. Btw, could you
> > elaborate a bit on how can we convert a DSI into several PSIs?
> >
> > Thanks,
> 
> If my understanding above is correct, there is nothing wrong with above
> IOMMU driver code - actually it makes sense on bare metal when CM is
> disabled.
> 
> But yes, DSI/GLOBAL is far less efficient than PSI when CM is enabled.
> We rely on cache invalidations to indirectly capture remapping structure change.
> PSI provides accurate info, while DSI/GLOBAL doesn't. To emulate correct
> behavior of DSI/GLOBAL, we have to pretend that the whole address space
> (iova=0, size=agaw) needs to be unmapped (for GLOBAL it further means
> multiple address spaces)
> 
> Though not efficient, it doesn't mean it's wrong since guest driver follows spec.
> We can ask for linux IOMMU driver change (CC Ashok) to not use above
> optimization when cache mode is enabled, but anyway we need emulate correct
> DSI/GLOBAL behavior to follow spec, because:
> 
> - even when driver fix is in place, old version still has this logic;
> 
> - there is still scenario where guest IOMMU driver does want to invalidate the
> whole address space, e.g. when changing context entry. Asking guest driver to
> use PSI for such purpose is another bad thing.

Hi Kevin/Peter/Jason,

I agree we should think about DSI/GLOBAL. Hereby, I guess there may be a chance to
ignore the DSI/GLOBAL flush if the following assumption is correct.

It seems that all DSI/GLOBAL flushes always come after a context entry invalidation.

With this assumption, I remember Peter added memory_replay in context invalidation.
This memory_replay would walk the guest second-level page table and do the maps. So
the second-level page table in the host should have the latest mapping info. The
guest IOMMU driver would issue a DSI/GLOBAL flush after changing the context. Since
the mapping info has already been updated in the host, there is no need to handle
this DSI/GLOBAL flush.

So gentlemen, please help judge whether the assumption is correct. If it is, then
Peter's patch may just work without any special processing of DSI/GLOBAL flushes.
 
Regards,
Yi L 

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio devices
  2017-01-19  6:44                     ` Liu, Yi L
@ 2017-01-19  7:02                       ` Jason Wang
  2017-01-19  7:02                       ` Peter Xu
  1 sibling, 0 replies; 93+ messages in thread
From: Jason Wang @ 2017-01-19  7:02 UTC (permalink / raw)
  To: Liu, Yi L, Tian, Kevin, Peter Xu
  Cc: Lan, Tianyu, Raj, Ashok, mst, jan.kiszka, bd.aviv, qemu-devel,
	alex.williamson



On 2017年01月19日 14:44, Liu, Yi L wrote:
>> -----Original Message-----
>> From: Qemu-devel [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org]
>> On Behalf Of Tian, Kevin
>> Sent: Wednesday, January 18, 2017 5:39 PM
>> To: Peter Xu <peterx@redhat.com>; Jason Wang <jasowang@redhat.com>
>> Cc: Lan, Tianyu <tianyu.lan@intel.com>; Raj, Ashok <ashok.raj@intel.com>;
>> mst@redhat.com; jan.kiszka@siemens.com; bd.aviv@gmail.com; qemu-
>> devel@nongnu.org; alex.williamson@redhat.com
>> Subject: Re: [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio
>> devices
>>
>>> From: Peter Xu [mailto:peterx@redhat.com]
>>> Sent: Wednesday, January 18, 2017 4:46 PM
>>>
>>> On Wed, Jan 18, 2017 at 04:36:05PM +0800, Jason Wang wrote:
>>>>
>>>> On 2017年01月18日 16:11, Peter Xu wrote:
>>>>> On Wed, Jan 18, 2017 at 11:10:53AM +0800, Jason Wang wrote:
>>>>>> On 2017年01月17日 22:45, Peter Xu wrote:
>>>>>>> On Mon, Jan 16, 2017 at 05:54:55PM +0800, Jason Wang wrote:
>>>>>>>> On 2017年01月16日 17:18, Peter Xu wrote:
>>>>>>>>>>>   static void vtd_iotlb_page_invalidate(IntelIOMMUState *s,
>>>>>>>>>>> uint16_t
>>> domain_id,
>>>>>>>>>>>                                         hwaddr addr, uint8_t
>>>>>>>>>>> am)
>>>>>>>>>>>   {
>>>>>>>>>>> @@ -1222,6 +1251,7 @@ static void
>>> vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
>>>>>>>>>>>       info.addr = addr;
>>>>>>>>>>>       info.mask = ~((1 << am) - 1);
>>>>>>>>>>>       g_hash_table_foreach_remove(s->iotlb,
>>>>>>>>>>> vtd_hash_remove_by_page,
>>> &info);
>>>>>>>>>>> +    vtd_iotlb_page_invalidate_notify(s, domain_id, addr,
>>>>>>>>>>> + am);
>>>>>>>>>> Is the case of GLOBAL or DSI flush missed, or we don't care it at all?
>>>>>>>>> IMHO we don't. For device assignment, since we are having CM=1
>>>>>>>>> here, we should have explicit page invalidations even if guest
>>>>>>>>> sends global/domain invalidations.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> -- peterx
>>>>>>>> Is this spec required?
>>>>>>> I think not. IMO the spec is very coarse grained on describing
>>>>>>> cache mode...
>>>>>>>
>>>>>>>> Btw, it looks to me that both DSI and GLOBAL are indeed explicit
>>>>>>>> flush.
>>>>>>> Actually when cache mode is on, it is unclear to me on how we
>>>>>>> should treat domain/global invalidations, at least from the spec
>>>>>>> (as mentioned earlier). My understanding is that they are not
>>>>>>> "explicit flushes", which IMHO should only mean page selective
>>>>>>> IOTLB invalidations.
>>>>>> Probably not, at least from the view of performance. DSI and
>>>>>> global should be more efficient in some cases.
>>>>> I agree with you that DSI/GLOBAL flushes are more efficient in some
>>>>> way. But IMHO that does not mean these invalidations are "explicit
>>>>> invalidations", and I suspect whether cache mode has to coop with it.
>>>> Well, the spec does not forbid DSI/GLOBAL with CM and the driver
>>>> codes had used them for almost ten years. I can hardly believe it's wrong.
>>> I think we have misunderstanding here. :)
>>>
>>> I never thought we should not send DSI/GLOBAL invalidations with cache
>>> mode. I just think we should not do anything special even if we have
>>> cache mode on when we receive these signals.
>>>
>>> In the spec, "explicit invalidation" is mentioned in the cache mode
>>> chapter:
>>>
>>>      The Caching Mode (CM) field in Capability Register indicates if
>>>      the hardware implementation caches not-present or erroneous
>>>      translation-structure entries. When the CM field is reported as
>>>      Set, any software updates to any remapping structures (including
>>>      updates to not-present entries or present entries whose
>>>      programming resulted in translation faults) requires explicit
>>>      invalidation of the caches.
>>>
>>> And I thought we were discussion about "what is explicit invalidation"
>>> mentioned above.
>> Check 6.5.3.1 Implicit Invalidation on Page Requests
>>
>> 	In addition to the explicit invalidation through invalidation commands
>> 	(see Section 6.5.1 and Section 6.5.2) or through deferred invalidation
>> 	messages (see Section 6.5.4), identified above, Page Requests from
>> 	endpoint devices invalidate entries in the IOTLBs and paging-structure
>> 	caches.
>>
>> My impression is that above indirectly defines invalidation commands (
>> PSI/DSI/GLOBAL) as explicit invalidation, because they are explicitly issued by
>> driver. Then section 6.5.3.1 further describes implicit invalidations caused by
>> other VT-d operations.
>>
>> I will check with VT-d spec owner to clarify.
>>
>>>>> But here I should add one more thing besides PSI - context entry
>>>>> invalidation should be one of "the explicit invalidations" as well,
>>>>> which we need to handle just like PSI when cache mode is on.
>>>>>
>>>>>>>> Just have a quick go through on driver codes and find this
>>>>>>>> something interesting in intel_iommu_flush_iotlb_psi():
>>>>>>>>
>>>>>>>> ...
>>>>>>>>      /*
>>>>>>>>       * Fallback to domain selective flush if no PSI support or the size is
>>>>>>>>       * too big.
>>>>>>>>       * PSI requires page size to be 2 ^ x, and the base address is
>> naturally
>>>>>>>>       * aligned to the size
>>>>>>>>       */
>>>>>>>>      if (!cap_pgsel_inv(iommu->cap) || mask >
>>> cap_max_amask_val(iommu->cap))
>>>>>>>>          iommu->flush.flush_iotlb(iommu, did, 0, 0,
>>>>>>>>                          DMA_TLB_DSI_FLUSH);
>>>>>>>>      else
>>>>>>>>          iommu->flush.flush_iotlb(iommu, did, addr | ih, mask,
>>>>>>>>                          DMA_TLB_PSI_FLUSH); ...
>>>>>>> I think this is interesting... and I doubt its correctness while
>>>>>>> with cache mode enabled.
>>>>>>>
>>>>>>> If so (sending domain invalidation instead of a big range of page
>>>>>>> invalidations), how should we capture which pages are unmapped in
>>>>>>> emulated IOMMU?
>>>>>> We don't need to track individual pages here, since all pages for
>>>>>> a specific domain were unmapped I believe?
>>>>> IMHO this might not be the correct behavior.
>>>>>
>>>>> If we receive one domain specific invalidation, I agree that we
>>>>> should invalidate the IOTLB cache for all the devices inside the domain.
>>>>> However, when cache mode is on, we should be depending on the PSIs
>>>>> to unmap each page (unless we want to unmap the whole address
>>>>> space, in this case it's very possible that the guest is just
>>>>> unmapping a range, not the entire space). If we convert several
>>>>> PSIs into one big DSI, IMHO we will leave those pages
>>>>> mapped/unmapped while we should unmap/map them.
>>>> Confused, do you have an example for this? (I fail to understand why
>>>> DSI can't work, at least implementation can convert DSI to several
>>>> PSIs internally).
>>> That's how I understand it. It might be wrong. Btw, could you
>>> elaborate a bit on how can we convert a DSI into several PSIs?
>>>
>>> Thanks,
>> If my understanding above is correct, there is nothing wrong with above
>> IOMMU driver code - actually it makes sense on bare metal when CM is
>> disabled.
>>
>> But yes, DSI/GLOBAL is far less efficient than PSI when CM is enabled.
>> We rely on cache invalidations to indirectly capture remapping structure change.
>> PSI provides accurate info, while DSI/GLOBAL doesn't. To emulate correct
>> behavior of DSI/GLOBAL, we have to pretend that the whole address space
>> (iova=0, size=agaw) needs to be unmapped (for GLOBAL it further means
>> multiple address spaces)
>>
>> Though not efficient, it doesn't mean it's wrong since guest driver follows spec.
>> We can ask for linux IOMMU driver change (CC Ashok) to not use above
>> optimization when cache mode is enabled, but anyway we need emulate correct
>> DSI/GLOBAL behavior to follow spec, because:
>>
>> - even when driver fix is in place, old version still has this logic;
>>
>> - there is still scenario where guest IOMMU driver does want to invalidate the
>> whole address space, e.g. when changing context entry. Asking guest driver to
>> use PSI for such purpose is another bad thing.
> Hi Kevin/Peter/Jason,
>
> I agree we should think DSI/GLOBAL. Herby, I guess there may be a chance to ignore
> DSI/GLOBAL flush if the following assumption is correct.
>
> It seems like that all DSI/GLOBAL flush would always be after a context entry invalidation.

Well, it looks like at least for DSI, the flush could also happen simply 
because the size is too big?

>
> With this assumption, I remember Peter added memory_replay in context invalidation.
> This memory_replay would walk guest second-level page table and do map. So the
> second-level page table in host should be able to get the latest mapping info. Guest
> IOMMU driver would issue an DSI/GLOBAL flush after changing context. Since the
> mapping info has updated in host, then there is no need to deal this DSI/GLOBAL flush.
>
> So gentlemen, pls help judge if the assumption is correct. If it is correct, then Peter's patch
> may just work without special process against DSI/GLOBAL flush.
>   
> Regards,
> Yi L

Even if this may be the usual case, I think we'd better not make the 
code depend on (usual) guest behaviors.

Thanks

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio devices
  2017-01-19  6:44                     ` Liu, Yi L
  2017-01-19  7:02                       ` Jason Wang
@ 2017-01-19  7:02                       ` Peter Xu
  1 sibling, 0 replies; 93+ messages in thread
From: Peter Xu @ 2017-01-19  7:02 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Tian, Kevin, Jason Wang, Lan, Tianyu, Raj, Ashok, mst,
	jan.kiszka, bd.aviv, qemu-devel, alex.williamson

On Thu, Jan 19, 2017 at 06:44:06AM +0000, Liu, Yi L wrote:
> > -----Original Message-----
> > From: Qemu-devel [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org]
> > On Behalf Of Tian, Kevin
> > Sent: Wednesday, January 18, 2017 5:39 PM
> > To: Peter Xu <peterx@redhat.com>; Jason Wang <jasowang@redhat.com>
> > Cc: Lan, Tianyu <tianyu.lan@intel.com>; Raj, Ashok <ashok.raj@intel.com>;
> > mst@redhat.com; jan.kiszka@siemens.com; bd.aviv@gmail.com; qemu-
> > devel@nongnu.org; alex.williamson@redhat.com
> > Subject: Re: [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio
> > devices
> > 
> > > From: Peter Xu [mailto:peterx@redhat.com]
> > > Sent: Wednesday, January 18, 2017 4:46 PM
> > >
> > > On Wed, Jan 18, 2017 at 04:36:05PM +0800, Jason Wang wrote:
> > > >
> > > >
> > > > On 2017年01月18日 16:11, Peter Xu wrote:
> > > > >On Wed, Jan 18, 2017 at 11:10:53AM +0800, Jason Wang wrote:
> > > > >>
> > > > >>On 2017年01月17日 22:45, Peter Xu wrote:
> > > > >>>On Mon, Jan 16, 2017 at 05:54:55PM +0800, Jason Wang wrote:
> > > > >>>>On 2017年01月16日 17:18, Peter Xu wrote:
> > > > >>>>>>>  static void vtd_iotlb_page_invalidate(IntelIOMMUState *s,
> > > > >>>>>>> uint16_t
> > > domain_id,
> > > > >>>>>>>                                        hwaddr addr, uint8_t
> > > > >>>>>>>am)
> > > > >>>>>>>  {
> > > > >>>>>>>@@ -1222,6 +1251,7 @@ static void
> > > vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
> > > > >>>>>>>      info.addr = addr;
> > > > >>>>>>>      info.mask = ~((1 << am) - 1);
> > > > >>>>>>>      g_hash_table_foreach_remove(s->iotlb,
> > > > >>>>>>> vtd_hash_remove_by_page,
> > > &info);
> > > > >>>>>>>+    vtd_iotlb_page_invalidate_notify(s, domain_id, addr,
> > > > >>>>>>>+ am);
> > > > >>>>>>Is the case of GLOBAL or DSI flush missed, or we don't care it at all?
> > > > >>>>>IMHO we don't. For device assignment, since we are having CM=1
> > > > >>>>>here, we should have explicit page invalidations even if guest
> > > > >>>>>sends global/domain invalidations.
> > > > >>>>>
> > > > >>>>>Thanks,
> > > > >>>>>
> > > > >>>>>-- peterx
> > > > >>>>Is this spec required?
> > > > >>>I think not. IMO the spec is very coarse grained on describing
> > > > >>>cache mode...
> > > > >>>
> > > > >>>>Btw, it looks to me that both DSI and GLOBAL are indeed explicit
> > > > >>>>flush.
> > > > >>>Actually when cache mode is on, it is unclear to me on how we
> > > > >>>should treat domain/global invalidations, at least from the spec
> > > > >>>(as mentioned earlier). My understanding is that they are not
> > > > >>>"explicit flushes", which IMHO should only mean page selective
> > > > >>>IOTLB invalidations.
> > > > >>Probably not, at least from the view of performance. DSI and
> > > > >>global should be more efficient in some cases.
> > > > >I agree with you that DSI/GLOBAL flushes are more efficient in some
> > > > >way. But IMHO that does not mean these invalidations are "explicit
> > > > >invalidations", and I suspect whether cache mode has to coop with it.
> > > >
> > > > Well, the spec does not forbid DSI/GLOBAL with CM and the driver
> > > > codes had used them for almost ten years. I can hardly believe it's wrong.
> > >
> > > I think we have misunderstanding here. :)
> > >
> > > I never thought we should not send DSI/GLOBAL invalidations with cache
> > > mode. I just think we should not do anything special even if we have
> > > cache mode on when we receive these signals.
> > >
> > > In the spec, "explicit invalidation" is mentioned in the cache mode
> > > chapter:
> > >
> > >     The Caching Mode (CM) field in Capability Register indicates if
> > >     the hardware implementation caches not-present or erroneous
> > >     translation-structure entries. When the CM field is reported as
> > >     Set, any software updates to any remapping structures (including
> > >     updates to not-present entries or present entries whose
> > >     programming resulted in translation faults) requires explicit
> > >     invalidation of the caches.
> > >
> > > And I thought we were discussion about "what is explicit invalidation"
> > > mentioned above.
> > 
> > Check 6.5.3.1 Implicit Invalidation on Page Requests
> > 
> > 	In addition to the explicit invalidation through invalidation commands
> > 	(see Section 6.5.1 and Section 6.5.2) or through deferred invalidation
> > 	messages (see Section 6.5.4), identified above, Page Requests from
> > 	endpoint devices invalidate entries in the IOTLBs and paging-structure
> > 	caches.
> > 
> > My impression is that above indirectly defines invalidation commands (
> > PSI/DSI/GLOBAL) as explicit invalidation, because they are explicitly issued by
> > driver. Then section 6.5.3.1 further describes implicit invalidations caused by
> > other VT-d operations.
> > 
> > I will check with VT-d spec owner to clarify.
> > 
> > >
> > > >
> > > > >
> > > > >But here I should add one more thing besides PSI - context entry
> > > > >invalidation should be one of "the explicit invalidations" as well,
> > > > >which we need to handle just like PSI when cache mode is on.
> > > > >
> > > > >>>>Just have a quick go through on driver codes and find this
> > > > >>>>something interesting in intel_iommu_flush_iotlb_psi():
> > > > >>>>
> > > > >>>>...
> > > > >>>>     /*
> > > > >>>>      * Fallback to domain selective flush if no PSI support or the size is
> > > > >>>>      * too big.
> > > > >>>>      * PSI requires page size to be 2 ^ x, and the base address is
> > naturally
> > > > >>>>      * aligned to the size
> > > > >>>>      */
> > > > >>>>     if (!cap_pgsel_inv(iommu->cap) || mask >
> > > cap_max_amask_val(iommu->cap))
> > > > >>>>         iommu->flush.flush_iotlb(iommu, did, 0, 0,
> > > > >>>>                         DMA_TLB_DSI_FLUSH);
> > > > >>>>     else
> > > > >>>>         iommu->flush.flush_iotlb(iommu, did, addr | ih, mask,
> > > > >>>>                         DMA_TLB_PSI_FLUSH); ...
> > > > >>>I think this is interesting... and I doubt its correctness while
> > > > >>>with cache mode enabled.
> > > > >>>
> > > > >>>If so (sending domain invalidation instead of a big range of page
> > > > >>>invalidations), how should we capture which pages are unmapped in
> > > > >>>emulated IOMMU?
> > > > >>We don't need to track individual pages here, since all pages for
> > > > >>a specific domain were unmapped I believe?
> > > > >IMHO this might not be the correct behavior.
> > > > >
> > > > >If we receive one domain specific invalidation, I agree that we
> > > > >should invalidate the IOTLB cache for all the devices inside the domain.
> > > > >However, when cache mode is on, we should be depending on the PSIs
> > > > >to unmap each page (unless we want to unmap the whole address
> > > > >space, in this case it's very possible that the guest is just
> > > > >unmapping a range, not the entire space). If we convert several
> > > > >PSIs into one big DSI, IMHO we will leave those pages
> > > > >mapped/unmapped while we should unmap/map them.
> > > >
> > > > Confused, do you have an example for this? (I fail to understand why
> > > > DSI can't work, at least implementation can convert DSI to several
> > > > PSIs internally).
> > >
> > > That's how I understand it. It might be wrong. Btw, could you
> > > elaborate a bit on how can we convert a DSI into several PSIs?
> > >
> > > Thanks,
> > 
> > If my understanding above is correct, there is nothing wrong with above
> > IOMMU driver code - actually it makes sense on bare metal when CM is
> > disabled.
> > 
> > But yes, DSI/GLOBAL is far less efficient than PSI when CM is enabled.
> > We rely on cache invalidations to indirectly capture remapping structure change.
> > PSI provides accurate info, while DSI/GLOBAL doesn't. To emulate correct
> > behavior of DSI/GLOBAL, we have to pretend that the whole address space
> > (iova=0, size=agaw) needs to be unmapped (for GLOBAL it further means
> > multiple address spaces)
> > 
> > Though not efficient, it doesn't mean it's wrong since guest driver follows spec.
> > We can ask for linux IOMMU driver change (CC Ashok) to not use above
> > optimization when cache mode is enabled, but anyway we need emulate correct
> > DSI/GLOBAL behavior to follow spec, because:
> > 
> > - even when driver fix is in place, old version still has this logic;
> > 
> > - there is still scenario where guest IOMMU driver does want to invalidate the
> > whole address space, e.g. when changing context entry. Asking guest driver to
> > use PSI for such purpose is another bad thing.
> 
> Hi Kevin/Peter/Jason,
> 
> I agree we should think DSI/GLOBAL. Herby, I guess there may be a chance to ignore
> DSI/GLOBAL flush if the following assumption is correct.
> 
> It seems like that all DSI/GLOBAL flush would always be after a context entry invalidation. 
> 
> With this assumption, I remember Peter added memory_replay in context invalidation.
> This memory_replay would walk guest second-level page table and do map. So the
> second-level page table in host should be able to get the latest mapping info. Guest
> IOMMU driver would issue an DSI/GLOBAL flush after changing context. Since the
> mapping info has updated in host, then there is no need to deal this DSI/GLOBAL flush.
> 
> So gentlemen, pls help judge if the assumption is correct. If it is correct, then Peter's patch
> may just work without special process against DSI/GLOBAL flush.

Actually the above is exactly what I thought before (I think I may not
have explained it clearly though :).

But I won't disagree on strictly following the spec, as Jason/Kevin
have suggested. The problem is whether the spec is "strict enough to
be strictly followed", especially the caching mode part... :(

For example, logically it is legal for a guest to send multiple PSIs
for a single page. Without caching mode, that never hurts, since these
PSIs just make the IOMMU flush its cache multiple times, which is
fine. With caching mode, however, multiple PSIs mean multiple
UNMAPs. That's a problem.

To solve it, we would need a per-device tree in QEMU to maintain the
IOVA address space, just like what vfio has done per-domain inside the
kernel. Then when we see the 2nd to Nth unmap, we ignore it. But I
guess that's overkill. A better way may be to just restrict the guest
to sending the invalidation once per entry update, but I guess we
don't have such a requirement in the spec now.
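
A hypothetical sketch of that per-device tracking with GLib's GTree
(not part of this series; the tree would live in each VTDAddressSpace,
created once with g_tree_new(), and hwaddr comes from QEMU headers):

    #include <glib.h>

    /* Order iovas numerically; keys are the iova values themselves,
     * stuffed into pointers. */
    static gint iova_cmp(gconstpointer a, gconstpointer b)
    {
        return a < b ? -1 : (a > b ? 1 : 0);
    }

    /* On a MAP notification: remember that this page is mapped. */
    static void track_map(GTree *mapped, hwaddr iova)
    {
        g_tree_insert(mapped, (gpointer)(uintptr_t)iova,
                      GINT_TO_POINTER(1));
    }

    /* On an UNMAP notification: return true only for the first unmap
     * of a page, so duplicate PSIs are not forwarded to vfio twice. */
    static bool track_unmap(GTree *mapped, hwaddr iova)
    {
        if (!g_tree_lookup(mapped, (gpointer)(uintptr_t)iova)) {
            return false;                 /* already unmapped: ignore */
        }
        g_tree_remove(mapped, (gpointer)(uintptr_t)iova);
        return true;                      /* first unmap: forward it */
    }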

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 13/14] intel_iommu: allow dynamic switch of IOMMU region
  2017-01-18  7:49         ` Peter Xu
@ 2017-01-19  8:20           ` Peter Xu
  0 siblings, 0 replies; 93+ messages in thread
From: Peter Xu @ 2017-01-19  8:20 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang, bd.aviv

On Wed, Jan 18, 2017 at 03:49:44PM +0800, Peter Xu wrote:

[...]

> I was trying to invalidate the entire address space by sending a big
> IOTLB notification to vfio-pci, which looks like:
> 
>   IOMMUTLBEntry entry = {
>       .target_as = &address_space_memory,
>       .iova = 0,
>       .translated_addr = 0,
>       .addr_mask = (1 << 63) - 1,
>       .perm = IOMMU_NONE,     /* UNMAP */
>   };
> 
> Then I feed this entry to vfio-pci IOMMU notifier.
> 
> However, this was blocked in vfio_iommu_map_notify(), with error:
> 
>   qemu-system-x86_64: iommu has granularity incompatible with target AS
> 
> Since we have:
> 
>   /*
>    * The IOMMU TLB entry we have just covers translation through
>    * this IOMMU to its immediate target.  We need to translate
>    * it the rest of the way through to memory.
>    */
>   rcu_read_lock();
>   mr = address_space_translate(&address_space_memory,
>                                iotlb->translated_addr,
>                                &xlat, &len, iotlb->perm & IOMMU_WO);
>   if (!memory_region_is_ram(mr)) {
>       error_report("iommu map to non memory area %"HWADDR_PRIx"",
>                    xlat);
>       goto out;
>   }
>   /*
>    * Translation truncates length to the IOMMU page size,
>    * check that it did not truncate too much.
>    */
>   if (len & iotlb->addr_mask) {
>       error_report("iommu has granularity incompatible with target AS");
>       goto out;
>   }
> 
> In my case len == 0xa0000 (that's the translation result), and
> iotlb->addr_mask == (1<<63)-1. So looks like the translation above
> splitted the big region and a simple big UNMAP doesn't work for me.
> 
> Do you have any suggestion on how I can solve this? In what case will
> we need the above address_space_translate()?

Hmm... it should be checking that the translated address range is RAM.

However, with this check in place, IOMMU notifiers won't be able to
leverage the vfio driver's ability to unmap a very big region.

IMHO the check is only meaningful for map operations. I'll try to post
an RFC patch for vfio-pci to allow unmapping very big regions, to see
whether that's a workable approach.
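
Roughly the direction I have in mind (a simplified sketch against
hw/vfio/common.c, not the actual RFC patch):

    /* Only translate through address_space_memory and check granularity
     * on the MAP path; let an arbitrarily large UNMAP pass through. */
    static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
    {
        VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);

        if ((iotlb->perm & IOMMU_RW) == IOMMU_NONE) {
            /* UNMAP: no host translation needed, no granularity check */
            vfio_dma_unmap(giommu->container, iotlb->iova,
                           iotlb->addr_mask + 1);
            return;
        }

        /* MAP: keep the existing translate + granularity check here,
         * then call vfio_dma_map() as before. */
    }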

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio devices
  2017-01-19  6:22                       ` Tian, Kevin
@ 2017-01-19  9:38                         ` Peter Xu
  0 siblings, 0 replies; 93+ messages in thread
From: Peter Xu @ 2017-01-19  9:38 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Wang, qemu-devel, Lan, Tianyu, mst, jan.kiszka,
	alex.williamson, bd.aviv, Raj, Ashok

On Thu, Jan 19, 2017 at 06:22:48AM +0000, Tian, Kevin wrote:

[...]

> still copy response from spec owner here:-)
> 
> 	Explicit invalidation is anytime software is explicitly invalidating something (
> 	through any descriptor) as opposed to something hardware does implicitly.  
> 	The only time hardware does implicit invalidation is during the handling of a page 
> 	request (recoverable page-fault) from an endpoint device.

Thanks for the confirmation!

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 02/14] intel_iommu: simplify irq region translation
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 02/14] intel_iommu: simplify irq region translation Peter Xu
@ 2017-01-20  8:22   ` Tian, Kevin
  2017-01-20  9:05     ` Peter Xu
  0 siblings, 1 reply; 93+ messages in thread
From: Tian, Kevin @ 2017-01-20  8:22 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Lan, Tianyu, mst, jan.kiszka, jasowang, alex.williamson, bd.aviv

> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Friday, January 13, 2017 11:06 AM
> 
> Before we have int-remap, we need to bypass interrupt write requests.
> That's not necessary now - we have supported int-remap, and all the irq
> region requests should be redirected there. Cleaning up the block with
> an assertion instead.

This comment is not accurate. According to the code, the reason you
can do this simplification is that we now have a standalone memory
region for interrupt addresses. It has nothing to do with int-remap,
which can be disabled by the guest... Maybe the standalone region was
added while developing int-remap, but functionally the two are not
related. :-)
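
For reference, the standalone region is wired up in vtd_find_add_as()
roughly like this (paraphrased; see intel_iommu.c for the exact code):

    /* The interrupt window (0xfee00000 - 0xfeefffff) is a dedicated I/O
     * region layered on top of the IOMMU region, so DMA translation
     * requests in that range never reach vtd_do_iommu_translate(). */
    memory_region_init_iommu(&vtd_dev_as->iommu, OBJECT(s), &s->iommu_ops,
                             "intel_iommu", UINT64_MAX);
    memory_region_init_io(&vtd_dev_as->iommu_ir, OBJECT(s), &vtd_mem_ir_ops,
                          s, "intel_iommu_ir", VTD_INTERRUPT_ADDR_SIZE);
    memory_region_add_subregion(&vtd_dev_as->iommu, VTD_INTERRUPT_ADDR_FIRST,
                                &vtd_dev_as->iommu_ir);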

> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/i386/intel_iommu.c | 28 ++++++----------------------
>  1 file changed, 6 insertions(+), 22 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 2868e37..77d467a 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -818,28 +818,12 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as,
> PCIBus *bus,
>      bool writes = true;
>      VTDIOTLBEntry *iotlb_entry;
> 
> -    /* Check if the request is in interrupt address range */
> -    if (vtd_is_interrupt_addr(addr)) {
> -        if (is_write) {
> -            /* FIXME: since we don't know the length of the access here, we
> -             * treat Non-DWORD length write requests without PASID as
> -             * interrupt requests, too. Withoud interrupt remapping support,
> -             * we just use 1:1 mapping.
> -             */
> -            VTD_DPRINTF(MMU, "write request to interrupt address "
> -                        "gpa 0x%"PRIx64, addr);
> -            entry->iova = addr & VTD_PAGE_MASK_4K;
> -            entry->translated_addr = addr & VTD_PAGE_MASK_4K;
> -            entry->addr_mask = ~VTD_PAGE_MASK_4K;
> -            entry->perm = IOMMU_WO;
> -            return;
> -        } else {
> -            VTD_DPRINTF(GENERAL, "error: read request from interrupt address "
> -                        "gpa 0x%"PRIx64, addr);
> -            vtd_report_dmar_fault(s, source_id, addr, VTD_FR_READ, is_write);
> -            return;
> -        }
> -    }
> +    /*
> +     * We have standalone memory region for interrupt addresses, we
> +     * should never receive translation requests in this region.
> +     */
> +    assert(!vtd_is_interrupt_addr(addr));
> +
>      /* Try to fetch slpte form IOTLB */
>      iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
>      if (iotlb_entry) {
> --
> 2.7.4

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 03/14] intel_iommu: renaming gpa to iova where proper
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 03/14] intel_iommu: renaming gpa to iova where proper Peter Xu
@ 2017-01-20  8:27   ` Tian, Kevin
  2017-01-20  9:23     ` Peter Xu
  0 siblings, 1 reply; 93+ messages in thread
From: Tian, Kevin @ 2017-01-20  8:27 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Lan, Tianyu, mst, jan.kiszka, jasowang, alex.williamson, bd.aviv

> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Friday, January 13, 2017 11:06 AM
> 
> There are lots of places in current intel_iommu.c codes that named
> "iova" as "gpa". It is really confusing to use a name "gpa" in these
> places (which is very easily to be understood as "Guest Physical
> Address", while it's not). To make the codes (much) easier to be read, I
> decided to do this once and for all.
> 
> No functional change is made. Only literal ones.

Looking at the VT-d spec (3.2 Domains and Address Translation):

	Remapping hardware treats the address in inbound requests as DMA 
	Address. Depending on the software usage model, the DMA address 
	space may be the Guest-Physical Address (GPA) space of the virtual 
	machine to which the device is assigned, or application Virtual Address 
	(VA) space defined by the PASID assigned to an application, or some 
	abstract I/O virtual address (IOVA) space defined by software.

	For simplicity, this document refers to address in requests-without-
	PASID as GPA, and address in requests-with-PASID as Virtual Address 
	(VA) (or Guest Virtual Address (GVA), if such request is from a device 
	assigned to a virtual machine). The translated address is referred to as 
	HPA.

It would add more readability if a similar comment were added in this
file - you can say iova is chosen to represent the address in
requests-without-PASID.
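
For illustration, such a file-level comment might look like the sketch
below (the wording is illustrative only, not part of the patch):

    /*
     * Naming convention in this file (following VT-d spec 3.2): we
     * use "iova" for the input address of a request-without-PASID,
     * i.e. the DMA address before translation (the spec calls it GPA
     * only in the device-assignment usage), and "hpa" for the
     * translated output address.
     */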

> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/i386/intel_iommu.c | 36 ++++++++++++++++++------------------
>  1 file changed, 18 insertions(+), 18 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 77d467a..275c3db 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -259,7 +259,7 @@ static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t
> source_id,
>      uint64_t *key = g_malloc(sizeof(*key));
>      uint64_t gfn = vtd_get_iotlb_gfn(addr, level);
> 
> -    VTD_DPRINTF(CACHE, "update iotlb sid 0x%"PRIx16 " gpa 0x%"PRIx64
> +    VTD_DPRINTF(CACHE, "update iotlb sid 0x%"PRIx16 " iova 0x%"PRIx64
>                  " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr, slpte,
>                  domain_id);
>      if (g_hash_table_size(s->iotlb) >= VTD_IOTLB_MAX_SIZE) {
> @@ -575,12 +575,12 @@ static uint64_t vtd_get_slpte(dma_addr_t base_addr, uint32_t
> index)
>      return slpte;
>  }
> 
> -/* Given a gpa and the level of paging structure, return the offset of current
> - * level.
> +/* Given an iova and the level of paging structure, return the offset
> + * of current level.
>   */
> -static inline uint32_t vtd_gpa_level_offset(uint64_t gpa, uint32_t level)
> +static inline uint32_t vtd_iova_level_offset(uint64_t iova, uint32_t level)
>  {
> -    return (gpa >> vtd_slpt_level_shift(level)) &
> +    return (iova >> vtd_slpt_level_shift(level)) &
>              ((1ULL << VTD_SL_LEVEL_BITS) - 1);
>  }
> 
> @@ -628,10 +628,10 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t
> level)
>      }
>  }
> 
> -/* Given the @gpa, get relevant @slptep. @slpte_level will be the last level
> +/* Given the @iova, get relevant @slptep. @slpte_level will be the last level
>   * of the translation, can be used for deciding the size of large page.
>   */
> -static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa, bool is_write,
> +static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
>                              uint64_t *slptep, uint32_t *slpte_level,
>                              bool *reads, bool *writes)
>  {
> @@ -642,11 +642,11 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa,
> bool is_write,
>      uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
>      uint64_t access_right_check;
> 
> -    /* Check if @gpa is above 2^X-1, where X is the minimum of MGAW in CAP_REG
> -     * and AW in context-entry.
> +    /* Check if @iova is above 2^X-1, where X is the minimum of MGAW
> +     * in CAP_REG and AW in context-entry.
>       */
> -    if (gpa & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
> -        VTD_DPRINTF(GENERAL, "error: gpa 0x%"PRIx64 " exceeds limits", gpa);
> +    if (iova & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
> +        VTD_DPRINTF(GENERAL, "error: iova 0x%"PRIx64 " exceeds limits", iova);
>          return -VTD_FR_ADDR_BEYOND_MGAW;
>      }
> 
> @@ -654,13 +654,13 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa,
> bool is_write,
>      access_right_check = is_write ? VTD_SL_W : VTD_SL_R;
> 
>      while (true) {
> -        offset = vtd_gpa_level_offset(gpa, level);
> +        offset = vtd_iova_level_offset(iova, level);
>          slpte = vtd_get_slpte(addr, offset);
> 
>          if (slpte == (uint64_t)-1) {
>              VTD_DPRINTF(GENERAL, "error: fail to access second-level paging "
> -                        "entry at level %"PRIu32 " for gpa 0x%"PRIx64,
> -                        level, gpa);
> +                        "entry at level %"PRIu32 " for iova 0x%"PRIx64,
> +                        level, iova);
>              if (level == vtd_get_level_from_context_entry(ce)) {
>                  /* Invalid programming of context-entry */
>                  return -VTD_FR_CONTEXT_ENTRY_INV;
> @@ -672,8 +672,8 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa,
> bool is_write,
>          *writes = (*writes) && (slpte & VTD_SL_W);
>          if (!(slpte & access_right_check)) {
>              VTD_DPRINTF(GENERAL, "error: lack of %s permission for "
> -                        "gpa 0x%"PRIx64 " slpte 0x%"PRIx64,
> -                        (is_write ? "write" : "read"), gpa, slpte);
> +                        "iova 0x%"PRIx64 " slpte 0x%"PRIx64,
> +                        (is_write ? "write" : "read"), iova, slpte);
>              return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
>          }
>          if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> @@ -827,7 +827,7 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as,
> PCIBus *bus,
>      /* Try to fetch slpte form IOTLB */
>      iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
>      if (iotlb_entry) {
> -        VTD_DPRINTF(CACHE, "hit iotlb sid 0x%"PRIx16 " gpa 0x%"PRIx64
> +        VTD_DPRINTF(CACHE, "hit iotlb sid 0x%"PRIx16 " iova 0x%"PRIx64
>                      " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr,
>                      iotlb_entry->slpte, iotlb_entry->domain_id);
>          slpte = iotlb_entry->slpte;
> @@ -2025,7 +2025,7 @@ static IOMMUTLBEntry vtd_iommu_translate(MemoryRegion
> *iommu, hwaddr addr,
>                             is_write, &ret);
>      VTD_DPRINTF(MMU,
>                  "bus %"PRIu8 " slot %"PRIu8 " func %"PRIu8 " devfn %"PRIu8
> -                " gpa 0x%"PRIx64 " hpa 0x%"PRIx64, pci_bus_num(vtd_as->bus),
> +                " iova 0x%"PRIx64 " hpa 0x%"PRIx64, pci_bus_num(vtd_as->bus),
>                  VTD_PCI_SLOT(vtd_as->devfn), VTD_PCI_FUNC(vtd_as->devfn),
>                  vtd_as->devfn, addr, ret.translated_addr);
>      return ret;
> --
> 2.7.4
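
As a self-contained illustration of the level-offset math above
(assuming VT-d's usual 4 KiB page shift of 12 and 9 bits per paging
level - a sketch, independent of the QEMU code):

    #include <stdint.h>
    #include <stdio.h>

    #define VTD_PAGE_SHIFT_4K 12
    #define VTD_SL_LEVEL_BITS 9

    /* Shift of the given paging-structure level (1 = 4K leaf). */
    static uint32_t slpt_level_shift(uint32_t level)
    {
        return VTD_PAGE_SHIFT_4K + (level - 1) * VTD_SL_LEVEL_BITS;
    }

    /* 9-bit table index of @iova at the given level. */
    static uint32_t iova_level_offset(uint64_t iova, uint32_t level)
    {
        return (iova >> slpt_level_shift(level)) &
               ((1ULL << VTD_SL_LEVEL_BITS) - 1);
    }

    int main(void)
    {
        uint64_t iova = 0x12345000ULL;

        /* Prints offsets 0x0, 0x91 and 0x145 for levels 3, 2, 1. */
        for (uint32_t level = 3; level >= 1; level--) {
            printf("level %u offset 0x%x\n",
                   level, iova_level_offset(iova, level));
        }
        return 0;
    }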

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 01/14] IOMMU: add option to enable VTD_CAP_CM to vIOMMU capility exposoed to guest
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 01/14] IOMMU: add option to enable VTD_CAP_CM to vIOMMU capility exposoed to guest Peter Xu
@ 2017-01-20  8:32   ` Tian, Kevin
  2017-01-20  8:54     ` Peter Xu
  2017-01-20 15:42   ` Eric Blake
  1 sibling, 1 reply; 93+ messages in thread
From: Tian, Kevin @ 2017-01-20  8:32 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Lan, Tianyu, mst, jan.kiszka, jasowang, alex.williamson, bd.aviv

> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Friday, January 13, 2017 11:06 AM
> 
> From: Aviv Ben-David <bd.aviv@gmail.com>
> 
> This capability asks the guest to invalidate cache before each map operation.
> We can use this invalidation to trap map operations in the hypervisor.
> 
> Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/i386/intel_iommu.c          | 5 +++++
>  hw/i386/intel_iommu_internal.h | 1 +
>  include/hw/i386/intel_iommu.h  | 2 ++
>  3 files changed, 8 insertions(+)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index ec62239..2868e37 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -2107,6 +2107,7 @@ static Property vtd_properties[] = {
>      DEFINE_PROP_ON_OFF_AUTO("eim", IntelIOMMUState, intr_eim,
>                              ON_OFF_AUTO_AUTO),
>      DEFINE_PROP_BOOL("x-buggy-eim", IntelIOMMUState, buggy_eim, false),
> +    DEFINE_PROP_BOOL("cache-mode", IntelIOMMUState, cache_mode_enabled,
> FALSE),
>      DEFINE_PROP_END_OF_LIST(),
>  };
> 
> @@ -2488,6 +2489,10 @@ static void vtd_init(IntelIOMMUState *s)
>          s->ecap |= VTD_ECAP_DT;
>      }
> 
> +    if (s->cache_mode_enabled) {
> +        s->cap |= VTD_CAP_CM;
> +    }
> +

I think some of my old comments have not been answered:

1) Better to use caching_mode, to follow the spec

2) Does it make sense to automatically set this flag if any VFIO device
has been statically assigned when starting QEMU? Also, on the hot-add
device path, some check of caching mode is required. If it is not set,
should we fail the hot-add operation? I don't think there is a physical
platform where some devices are behind the IOMMU while others are not.
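
For 1), the property definition would then presumably look something
like this (field name illustrative):

    DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode,
                     false),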

>      vtd_reset_context_cache(s);
>      vtd_reset_iotlb(s);
> 
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 356f188..4104121 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -202,6 +202,7 @@
>  #define VTD_CAP_MAMV                (VTD_MAMV << 48)
>  #define VTD_CAP_PSI                 (1ULL << 39)
>  #define VTD_CAP_SLLPS               ((1ULL << 34) | (1ULL << 35))
> +#define VTD_CAP_CM                  (1ULL << 7)
> 
>  /* Supported Adjusted Guest Address Widths */
>  #define VTD_CAP_SAGAW_SHIFT         8
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 405c9d1..749eef9 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -257,6 +257,8 @@ struct IntelIOMMUState {
>      uint8_t womask[DMAR_REG_SIZE];  /* WO (write only - read returns 0) */
>      uint32_t version;
> 
> +    bool cache_mode_enabled;        /* RO - is cap CM enabled? */
> +
>      dma_addr_t root;                /* Current root table pointer */
>      bool root_extended;             /* Type of root table (extended or not) */
>      bool dmar_enabled;              /* Set if DMA remapping is enabled */
> --
> 2.7.4
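
For reference, Caching Mode is bit 7 of the VT-d capability register;
a guest driver would discover it roughly as in this self-contained
sketch (names illustrative):

    #include <stdint.h>
    #include <stdbool.h>

    #define VTD_CAP_CM (1ULL << 7)   /* Caching Mode, CAP_REG bit 7 */

    /*
     * With CM set, the guest must invalidate the IOTLB (and context
     * cache) even when changing a not-present mapping to present,
     * which is exactly the hook the hypervisor uses to trap map
     * operations.
     */
    static bool vtd_has_caching_mode(uint64_t cap_reg)
    {
        return (cap_reg & VTD_CAP_CM) != 0;
    }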

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 01/14] IOMMU: add option to enable VTD_CAP_CM to vIOMMU capility exposoed to guest
  2017-01-20  8:32   ` Tian, Kevin
@ 2017-01-20  8:54     ` Peter Xu
  2017-01-20  8:59       ` Tian, Kevin
  0 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-20  8:54 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: qemu-devel, Lan, Tianyu, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

On Fri, Jan 20, 2017 at 08:32:06AM +0000, Tian, Kevin wrote:
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Friday, January 13, 2017 11:06 AM
> > 
> > From: Aviv Ben-David <bd.aviv@gmail.com>
> > 
> > This capability asks the guest to invalidate cache before each map operation.
> > We can use this invalidation to trap map operations in the hypervisor.
> > 
> > Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  hw/i386/intel_iommu.c          | 5 +++++
> >  hw/i386/intel_iommu_internal.h | 1 +
> >  include/hw/i386/intel_iommu.h  | 2 ++
> >  3 files changed, 8 insertions(+)
> > 
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index ec62239..2868e37 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -2107,6 +2107,7 @@ static Property vtd_properties[] = {
> >      DEFINE_PROP_ON_OFF_AUTO("eim", IntelIOMMUState, intr_eim,
> >                              ON_OFF_AUTO_AUTO),
> >      DEFINE_PROP_BOOL("x-buggy-eim", IntelIOMMUState, buggy_eim, false),
> > +    DEFINE_PROP_BOOL("cache-mode", IntelIOMMUState, cache_mode_enabled,
> > FALSE),
> >      DEFINE_PROP_END_OF_LIST(),
> >  };
> > 
> > @@ -2488,6 +2489,10 @@ static void vtd_init(IntelIOMMUState *s)
> >          s->ecap |= VTD_ECAP_DT;
> >      }
> > 
> > +    if (s->cache_mode_enabled) {
> > +        s->cap |= VTD_CAP_CM;
> > +    }
> > +
> 
> I think some of my old comments have not been answered:
> 
> 1) Better to use caching_mode, to follow the spec

Sure.

> 
> 2) Does it make sense to automatically set this flag if any VFIO device
> has been statically assigned when starting QEMU?

I'm okay with both, considering that people using this flag will
mostly be advanced users. So I would like to hear others' opinions.

> Also, on the hot-add
> device path, some check of caching mode is required. If it is not set,
> should we fail the hot-add operation? I don't think there is a physical
> platform where some devices are behind the IOMMU while others are not.

Could you explain in what case we would fail a hot-plug?

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 01/14] IOMMU: add option to enable VTD_CAP_CM to vIOMMU capility exposoed to guest
  2017-01-20  8:54     ` Peter Xu
@ 2017-01-20  8:59       ` Tian, Kevin
  2017-01-20  9:11         ` Peter Xu
  0 siblings, 1 reply; 93+ messages in thread
From: Tian, Kevin @ 2017-01-20  8:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Lan, Tianyu, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Friday, January 20, 2017 4:55 PM
> 
> On Fri, Jan 20, 2017 at 08:32:06AM +0000, Tian, Kevin wrote:
> > > From: Peter Xu [mailto:peterx@redhat.com]
> > > Sent: Friday, January 13, 2017 11:06 AM
> > >
> > > From: Aviv Ben-David <bd.aviv@gmail.com>
> > >
> > > This capability asks the guest to invalidate cache before each map operation.
> > > We can use this invalidation to trap map operations in the hypervisor.
> > >
> > > Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >  hw/i386/intel_iommu.c          | 5 +++++
> > >  hw/i386/intel_iommu_internal.h | 1 +
> > >  include/hw/i386/intel_iommu.h  | 2 ++
> > >  3 files changed, 8 insertions(+)
> > >
> > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > index ec62239..2868e37 100644
> > > --- a/hw/i386/intel_iommu.c
> > > +++ b/hw/i386/intel_iommu.c
> > > @@ -2107,6 +2107,7 @@ static Property vtd_properties[] = {
> > >      DEFINE_PROP_ON_OFF_AUTO("eim", IntelIOMMUState, intr_eim,
> > >                              ON_OFF_AUTO_AUTO),
> > >      DEFINE_PROP_BOOL("x-buggy-eim", IntelIOMMUState, buggy_eim, false),
> > > +    DEFINE_PROP_BOOL("cache-mode", IntelIOMMUState, cache_mode_enabled,
> > > FALSE),
> > >      DEFINE_PROP_END_OF_LIST(),
> > >  };
> > >
> > > @@ -2488,6 +2489,10 @@ static void vtd_init(IntelIOMMUState *s)
> > >          s->ecap |= VTD_ECAP_DT;
> > >      }
> > >
> > > +    if (s->cache_mode_enabled) {
> > > +        s->cap |= VTD_CAP_CM;
> > > +    }
> > > +
> >
> > I think some of my old comments have not been answered:
> >
> > 1) Better to use caching_mode, to follow the spec
> 
> Sure.
> 
> >
> > 2) Does it make sense to automatically set this flag if any VFIO device
> > has been statically assigned when starting QEMU?
> 
> I'm okay with both, considering that people using this flag will
> mostly be advanced users. So I would like to hear others' opinions.
> 
> > Also, on the hot-add
> > device path, some check of caching mode is required. If it is not set,
> > should we fail the hot-add operation? I don't think there is a physical
> > platform where some devices are behind the IOMMU while others are not.
> 
> Could you explain in what case we would fail a hot-plug?
> 

The user enables intel-iommu but doesn't set caching mode.

Then later the user hot-adds a PCI device to the VM. The guest will
assume the newly assigned device is also behind the default vIOMMU,
and thus will set up IOVA mappings for it, which are then broken...

Thanks
Kevin

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 02/14] intel_iommu: simplify irq region translation
  2017-01-20  8:22   ` Tian, Kevin
@ 2017-01-20  9:05     ` Peter Xu
  2017-01-20  9:15       ` Tian, Kevin
  0 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-20  9:05 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: qemu-devel, Lan, Tianyu, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

On Fri, Jan 20, 2017 at 08:22:14AM +0000, Tian, Kevin wrote:
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Friday, January 13, 2017 11:06 AM
> > 
> > Before we have int-remap, we need to bypass interrupt write requests.
> > That's not necessary now - we have supported int-remap, and all the irq
> > region requests should be redirected there. Cleaning up the block with
> > an assertion instead.
> 
> This comment is not accurate. According to the code, the reason you
> can do such a simplification is that we now have a standalone memory
> region for interrupt addresses. It should have nothing to do
> with int-remap, which can be disabled by the guest... Maybe the
> standalone region was added while developing int-remap, but
> functionally they are not related. :-)

IMHO the above commit message is fairly clear. :-)

But sure, I can add some more emphasis, like:

  "Before we have int-remap memory region, ..."

Do you think it's okay? Or do you have a better suggestion?

(Just to mention that even if the guest disables IR, the MSI region
 will still be there.)
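
For reference, a minimal sketch of how such a standalone MSI region
can be carved out with QEMU's generic memory API (the field and ops
names below are illustrative, not the actual ones in this series):

    /*
     * Overlap the 1 MiB interrupt window 0xFEE00000-0xFEEFFFFF with a
     * higher-priority subregion, so accesses to it are handled by the
     * IR code and never reach the IOMMU translation region at all.
     */
    memory_region_init_io(&vtd_as->msi_mr, OBJECT(s), &vtd_msi_ops, s,
                          "vtd-ir-msi", 0x100000);
    memory_region_add_subregion_overlap(&vtd_as->root_mr, 0xFEE00000ULL,
                                        &vtd_as->msi_mr, 1);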

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 01/14] IOMMU: add option to enable VTD_CAP_CM to vIOMMU capility exposoed to guest
  2017-01-20  8:59       ` Tian, Kevin
@ 2017-01-20  9:11         ` Peter Xu
  2017-01-20  9:20           ` Tian, Kevin
  0 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-20  9:11 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: qemu-devel, Lan, Tianyu, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

On Fri, Jan 20, 2017 at 08:59:01AM +0000, Tian, Kevin wrote:

[...]

> > > Also, on the hot-add
> > > device path, some check of caching mode is required. If it is not set,
> > > should we fail the hot-add operation? I don't think there is a physical
> > > platform where some devices are behind the IOMMU while others are not.
> > 
> > Could you explain in what case we would fail a hot-plug?
> > 
> 
> The user enables intel-iommu but doesn't set caching mode.
> 
> Then later the user hot-adds a PCI device to the VM. The guest will
> assume the newly assigned device is also behind the default vIOMMU,
> and thus will set up IOVA mappings for it, which are then broken...

Is the newly added device a vfio-pci device? If so, we should hit
this, and the VM will stop working:

    if (!s->cache_mode_enabled && new & IOMMU_NOTIFIER_MAP) {
        error_report("We need to set cache_mode=1 for intel-iommu to enable "
                     "device assignment with IOMMU protection.");
        exit(1);
    }

I admit this is not user-friendly, and a better way may be to
disallow the hot-plug in that case, telling the user about the error,
rather than crashing the VM. But I think that can be a patch outside
this series, considering (again) that this only affects advanced
users.
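
A hypothetical shape of that friendlier failure could be (names are
illustrative, not the actual hot-plug path):

    /*
     * Sketch: let the hot-plug path report an error back to the user
     * instead of calling exit(1) from the notifier registration.
     */
    static int vtd_check_caching_mode(IntelIOMMUState *s, Error **errp)
    {
        if (!s->cache_mode_enabled) {
            error_setg(errp, "intel-iommu: device assignment requires "
                       "cache-mode=on");
            return -1;
        }
        return 0;
    }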

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 02/14] intel_iommu: simplify irq region translation
  2017-01-20  9:05     ` Peter Xu
@ 2017-01-20  9:15       ` Tian, Kevin
  2017-01-20  9:27         ` Peter Xu
  0 siblings, 1 reply; 93+ messages in thread
From: Tian, Kevin @ 2017-01-20  9:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Lan, Tianyu, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Friday, January 20, 2017 5:05 PM
> 
> On Fri, Jan 20, 2017 at 08:22:14AM +0000, Tian, Kevin wrote:
> > > From: Peter Xu [mailto:peterx@redhat.com]
> > > Sent: Friday, January 13, 2017 11:06 AM
> > >
> > > Before we have int-remap, we need to bypass interrupt write requests.
> > > That's not necessary now - we have supported int-remap, and all the irq
> > > region requests should be redirected there. Cleaning up the block with
> > > an assertion instead.
> >
> > This comment is not accurate. According to the code, the reason you
> > can do such a simplification is that we now have a standalone memory
> > region for interrupt addresses. It should have nothing to do
> > with int-remap, which can be disabled by the guest... Maybe the
> > standalone region was added while developing int-remap, but
> > functionally they are not related. :-)
> 
> IMHO the above commit message is fairly clear. :-)
> 
> But sure, I can add some more emphasis, like:
> 
>   "Before we have int-remap memory region, ..."
> 
> Do you think it's okay? Or do you have a better suggestion?
> 
> (Just to mention that even if the guest disables IR, the MSI region
>  will still be there.)
> 

My opinion is simple - this patch has nothing to do with int-remap.
It's not necessary, not because we support int-remap, but because
we have a standalone memory region for interrupt addresses, as you
described in the code. :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 01/14] IOMMU: add option to enable VTD_CAP_CM to vIOMMU capility exposoed to guest
  2017-01-20  9:11         ` Peter Xu
@ 2017-01-20  9:20           ` Tian, Kevin
  2017-01-20  9:30             ` Peter Xu
  0 siblings, 1 reply; 93+ messages in thread
From: Tian, Kevin @ 2017-01-20  9:20 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Lan, Tianyu, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Friday, January 20, 2017 5:12 PM
> 
> On Fri, Jan 20, 2017 at 08:59:01AM +0000, Tian, Kevin wrote:
> 
> [...]
> 
> > > > Also, on the hot-add
> > > > device path, some check of caching mode is required. If it is not set,
> > > > should we fail the hot-add operation? I don't think there is a physical
> > > > platform where some devices are behind the IOMMU while others are not.
> > >
> > > Could you explain in what case we would fail a hot-plug?
> > >
> >
> > The user enables intel-iommu but doesn't set caching mode.
> > 
> > Then later the user hot-adds a PCI device to the VM. The guest will
> > assume the newly assigned device is also behind the default vIOMMU,
> > and thus will set up IOVA mappings for it, which are then broken...
> 
> Is the newly added device a vfio-pci device? If so, we should hit
> this, and the VM will stop working:
> 
>     if (!s->cache_mode_enabled && new & IOMMU_NOTIFIER_MAP) {
>         error_report("We need to set cache_mode=1 for intel-iommu to enable "
>                      "device assignment with IOMMU protection.");
>         exit(1);
>     }

Sorry, I didn't find this code. In which code path is it hit?

> 
> I admit this is not user-friendly, and a better way may be to
> disallow the hot-plug in that case, telling the user about the error,
> rather than crashing the VM. But I think that can be a patch outside
> this series, considering (again) that this only affects advanced
> users.
> 

Crashing the VM is bad... but anyway, I'll leave it to the maintainer
to decide whether they'd like it fixed now or later. :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 03/14] intel_iommu: renaming gpa to iova where proper
  2017-01-20  8:27   ` Tian, Kevin
@ 2017-01-20  9:23     ` Peter Xu
  2017-01-20  9:41       ` Tian, Kevin
  0 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-20  9:23 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: qemu-devel, Lan, Tianyu, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

On Fri, Jan 20, 2017 at 08:27:31AM +0000, Tian, Kevin wrote:
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Friday, January 13, 2017 11:06 AM
> > 
> > There are lots of places in current intel_iommu.c codes that named
> > "iova" as "gpa". It is really confusing to use a name "gpa" in these
> > places (which is very easily to be understood as "Guest Physical
> > Address", while it's not). To make the codes (much) easier to be read, I
> > decided to do this once and for all.
> > 
> > No functional change is made. Only literal ones.
> 
> Looking at the VT-d spec (3.2 Domains and Address Translation):
> 
> 	Remapping hardware treats the address in inbound requests as DMA 
> 	Address. Depending on the software usage model, the DMA address 
> 	space may be the Guest-Physical Address (GPA) space of the virtual 
> 	machine to which the device is assigned, or application Virtual Address 
> 	(VA) space defined by the PASID assigned to an application, or some 
> 	abstract I/O virtual address (IOVA) space defined by software.
> 
> 	For simplicity, this document refers to address in requests-without-
> 	PASID as GPA, and address in requests-with-PASID as Virtual Address 
> 	(VA) (or Guest Virtual Address (GVA), if such request is from a device 
> 	assigned to a virtual machine). The translated address is referred to as 
> 	HPA.
> 
> It would add more readability if a similar comment were added in this
> file - you can say iova is chosen to represent the address in
> requests-without-PASID.

I agree that the code will be more readable if we can explain all the
bits in detail.

However, this patch is not adding comments, but "fixing" an existing
naming problem, which is very misleading to people reading the code
for the first time. The places touched in this patch were using the
names incorrectly, so I just corrected them. So even if we want more
comments explaining things, IMHO it'll be nicer to use a separate
patch, rather than squashing everything into a single one.

If you don't disagree, I'd like to keep this single patch as it is, to
make sure this series can converge soon (I'm sorry that I'll be
extending this series a bit; I hope that won't extend the review
process for it too much). After that, we can add more documentation if
needed.

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 02/14] intel_iommu: simplify irq region translation
  2017-01-20  9:15       ` Tian, Kevin
@ 2017-01-20  9:27         ` Peter Xu
  2017-01-20  9:52           ` Tian, Kevin
  0 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-20  9:27 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: qemu-devel, Lan, Tianyu, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

On Fri, Jan 20, 2017 at 09:15:27AM +0000, Tian, Kevin wrote:
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Friday, January 20, 2017 5:05 PM
> > 
> > On Fri, Jan 20, 2017 at 08:22:14AM +0000, Tian, Kevin wrote:
> > > > From: Peter Xu [mailto:peterx@redhat.com]
> > > > Sent: Friday, January 13, 2017 11:06 AM
> > > >
> > > > Before we have int-remap, we need to bypass interrupt write requests.
> > > > That's not necessary now - we have supported int-remap, and all the irq
> > > > region requests should be redirected there. Cleaning up the block with
> > > > an assertion instead.
> > >
> > > This comment is not accurate. According to the code, the reason you
> > > can do such a simplification is that we now have a standalone memory
> > > region for interrupt addresses. It should have nothing to do
> > > with int-remap, which can be disabled by the guest... Maybe the
> > > standalone region was added while developing int-remap, but
> > > functionally they are not related. :-)
> > 
> > IMHO the above commit message is fairly clear. :-)
> > 
> > But sure, I can add some more emphasis, like:
> > 
> >   "Before we have int-remap memory region, ..."
> > 
> > Do you think it's okay? Or do you have a better suggestion?
> > 
> > (Just to mention that even if the guest disables IR, the MSI region
> >  will still be there.)
> > 
> 
> My opinion is simple - this patch has nothing to do with int-remap.
> It's not necessary, not because we support int-remap, but because
> we have a standalone memory region for interrupt addresses, as you
> described in the code. :-)

I really think they are the same thing...

How about this:

    Now that we have a standalone memory region for MSI, all the irq
    region requests should be redirected there. Cleaning up the block
    with an assertion instead.

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 01/14] IOMMU: add option to enable VTD_CAP_CM to vIOMMU capility exposoed to guest
  2017-01-20  9:20           ` Tian, Kevin
@ 2017-01-20  9:30             ` Peter Xu
  0 siblings, 0 replies; 93+ messages in thread
From: Peter Xu @ 2017-01-20  9:30 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: qemu-devel, Lan, Tianyu, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

On Fri, Jan 20, 2017 at 09:20:01AM +0000, Tian, Kevin wrote:
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Friday, January 20, 2017 5:12 PM
> > 
> > On Fri, Jan 20, 2017 at 08:59:01AM +0000, Tian, Kevin wrote:
> > 
> > [...]
> > 
> > > > > Also, on the hot-add
> > > > > device path, some check of caching mode is required. If it is not set,
> > > > > should we fail the hot-add operation? I don't think there is a physical
> > > > > platform where some devices are behind the IOMMU while others are not.
> > > >
> > > > Could you explain in what case we would fail a hot-plug?
> > > >
> > >
> > > The user enables intel-iommu but doesn't set caching mode.
> > >
> > > Then later the user hot-adds a PCI device to the VM. The guest will
> > > assume the newly assigned device is also behind the default vIOMMU,
> > > and thus will set up IOVA mappings for it, which are then broken...
> > 
> > Is the newly added device a vfio-pci device? If so, we should hit
> > this, and the VM will stop working:
> > 
> >     if (!s->cache_mode_enabled && new & IOMMU_NOTIFIER_MAP) {
> >         error_report("We need to set cache_mode=1 for intel-iommu to enable "
> >                      "device assignment with IOMMU protection.");
> >         exit(1);
> >     }
> 
> Sorry, I didn't find this code. In which code path is it hit?

It's in patch 14/14 of this series.

> 
> > 
> > I admit this is not user-friendly, and a better way may be to
> > disallow the hot-plug in that case, telling the user about the error,
> > rather than crashing the VM. But I think that can be a patch outside
> > this series, considering (again) that this only affects advanced
> > users.
> > 
> 
> Crashing the VM is bad... but anyway, I'll leave it to the maintainer
> to decide whether they'd like it fixed now or later. :-)

Sure. Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 03/14] intel_iommu: renaming gpa to iova where proper
  2017-01-20  9:23     ` Peter Xu
@ 2017-01-20  9:41       ` Tian, Kevin
  0 siblings, 0 replies; 93+ messages in thread
From: Tian, Kevin @ 2017-01-20  9:41 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Lan, Tianyu, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Friday, January 20, 2017 5:24 PM
> 
> On Fri, Jan 20, 2017 at 08:27:31AM +0000, Tian, Kevin wrote:
> > > From: Peter Xu [mailto:peterx@redhat.com]
> > > Sent: Friday, January 13, 2017 11:06 AM
> > >
> > > There are lots of places in current intel_iommu.c codes that named
> > > "iova" as "gpa". It is really confusing to use a name "gpa" in these
> > > places (which is very easily to be understood as "Guest Physical
> > > Address", while it's not). To make the codes (much) easier to be read, I
> > > decided to do this once and for all.
> > >
> > > No functional change is made. Only literal ones.
> >
> > Looking at the VT-d spec (3.2 Domains and Address Translation):
> >
> > 	Remapping hardware treats the address in inbound requests as DMA
> > 	Address. Depending on the software usage model, the DMA address
> > 	space may be the Guest-Physical Address (GPA) space of the virtual
> > 	machine to which the device is assigned, or application Virtual Address
> > 	(VA) space defined by the PASID assigned to an application, or some
> > 	abstract I/O virtual address (IOVA) space defined by software.
> >
> > 	For simplicity, this document refers to address in requests-without-
> > 	PASID as GPA, and address in requests-with-PASID as Virtual Address
> > 	(VA) (or Guest Virtual Address (GVA), if such request is from a device
> > 	assigned to a virtual machine). The translated address is referred to as
> > 	HPA.
> >
> > It would add more readability if a similar comment were added in this
> > file - you can say iova is chosen to represent the address in
> > requests-without-PASID.
> 
> I agree that the code will be more readable if we can explain all the
> bits in detail.
> 
> However, this patch is not adding comments, but "fixing" an existing
> naming problem, which is very misleading to people reading the code
> for the first time. The places touched in this patch were using the
> names incorrectly, so I just corrected them. So even if we want more
> comments explaining things, IMHO it'll be nicer to use a separate
> patch, rather than squashing everything into a single one.
> 
> If you don't disagree, I'd like to keep this single patch as it is, to
> make sure this series can converge soon (I'm sorry that I'll be
> extending this series a bit; I hope that won't extend the review
> process for it too much). After that, we can add more documentation if
> needed.
> 

Fine with me.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 02/14] intel_iommu: simplify irq region translation
  2017-01-20  9:27         ` Peter Xu
@ 2017-01-20  9:52           ` Tian, Kevin
  2017-01-20 10:04             ` Peter Xu
  0 siblings, 1 reply; 93+ messages in thread
From: Tian, Kevin @ 2017-01-20  9:52 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Lan, Tianyu, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Friday, January 20, 2017 5:28 PM
> 
> On Fri, Jan 20, 2017 at 09:15:27AM +0000, Tian, Kevin wrote:
> > > From: Peter Xu [mailto:peterx@redhat.com]
> > > Sent: Friday, January 20, 2017 5:05 PM
> > >
> > > On Fri, Jan 20, 2017 at 08:22:14AM +0000, Tian, Kevin wrote:
> > > > > From: Peter Xu [mailto:peterx@redhat.com]
> > > > > Sent: Friday, January 13, 2017 11:06 AM
> > > > >
> > > > > Before we have int-remap, we need to bypass interrupt write requests.
> > > > > That's not necessary now - we have supported int-remap, and all the irq
> > > > > region requests should be redirected there. Cleaning up the block with
> > > > > an assertion instead.
> > > >
> > > > This comment is not accurate. According to the code, the reason you
> > > > can do such a simplification is that we now have a standalone memory
> > > > region for interrupt addresses. It should have nothing to do
> > > > with int-remap, which can be disabled by the guest... Maybe the
> > > > standalone region was added while developing int-remap, but
> > > > functionally they are not related. :-)
> > >
> > > IMHO the above commit message is fairly clear. :-)
> > >
> > > But sure, I can add some more emphasis, like:
> > >
> > >   "Before we have int-remap memory region, ..."
> > >
> > > Do you think it's okay? Or do you have a better suggestion?
> > >
> > > (Just to mention that even if the guest disables IR, the MSI region
> > >  still will be there.)
> > >
> >
> > My opinion is simple - this patch has nothing to do with int-remap.
> > It's not necessary, not because we support int-remap, but because
> > we have a standalone memory region for interrupt addresses, as you
> > described in the code. :-)
> 
> I really think they are the same thing...
> 
> How about this:
> 
>     Now that we have a standalone memory region for MSI, all the irq
>     region requests should be redirected there. Cleaning up the block
>     with an assertion instead.
> 

btw, what if the guest sets up a valid mapping at 0xFEEx_xxxx in
its remapping structure, which is then programmed into a virtual
device as a DMA destination? Then, when emulating that virtual DMA,
vtd_do_iommu_translate should simply return (maybe throwing out
a warning for diagnostic purposes) instead of asserting here.

VT-d spec defines as below:

	Software must ensure the second-level paging-structure entries 
	are programmed not to remap input addresses to the interrupt 
	address range. Hardware behavior is undefined for memory 
	requests remapped to the interrupt address range.

I don't think "hardware behavior is undefined" is the same as "assert
and thus kill the VM"...

Thanks
Kevin

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 02/14] intel_iommu: simplify irq region translation
  2017-01-20  9:52           ` Tian, Kevin
@ 2017-01-20 10:04             ` Peter Xu
  2017-01-22  4:42               ` Tian, Kevin
  0 siblings, 1 reply; 93+ messages in thread
From: Peter Xu @ 2017-01-20 10:04 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: qemu-devel, Lan, Tianyu, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

On Fri, Jan 20, 2017 at 09:52:01AM +0000, Tian, Kevin wrote:

[...]

> btw, what if the guest sets up a valid mapping at 0xFEEx_xxxx in
> its remapping structure, which is then programmed into a virtual
> device as a DMA destination? Then, when emulating that virtual DMA,
> vtd_do_iommu_translate should simply return (maybe throwing out
> a warning for diagnostic purposes) instead of asserting here.
> 
> VT-d spec defines as below:
> 
> 	Software must ensure the second-level paging-structure entries 
> 	are programmed not to remap input addresses to the interrupt 
> 	address range. Hardware behavior is undefined for memory 
> 	requests remapped to the interrupt address range.

Thanks for this reference. That's something I was curious about.

> 
> I don't think "hardware behavior is undefined" is the same as "assert
> and thus kill the VM"...

I don't think it will kill the VM. Now that we have the MSI region,
it should just use that IR region for everything
(read/write/translate). So IIUC, when anyone sets up an IOVA mapping
within the range 0xfeexxxxx, a DMA will trigger an interrupt (rather
than a memory move), but in most cases the interrupt will be illegal,
since the data will be invalid (e.g., non-zero reserved bits, or an
SID verification failure); it should then trigger a vIOMMU fault
(though IR fault reporting is still incomplete - that's my next thing
to do after this series).
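
For reference, the decoding/validation applied to a write that hits
0xFEEx_xxxx in remappable format looks roughly like the sketch below
(field layout per the VT-d spec; function names are illustrative):

    #include <stdint.h>
    #include <stdbool.h>

    /*
     * Remappable MSI address: bits 31:20 = 0xFEE, bit 4 = interrupt
     * format (1 = remappable), handle[14:0] = bits 19:5 and
     * handle[15] = bit 2. Requests with reserved bits set, or whose
     * source-id fails verification, are rejected with a fault.
     */
    static bool msi_addr_is_remappable(uint32_t addr)
    {
        return (addr >> 20) == 0xFEE && (addr & (1u << 4));
    }

    static uint16_t msi_addr_irte_index(uint32_t addr)
    {
        return ((addr >> 5) & 0x7FFF) | (((addr >> 2) & 0x1) << 15);
    }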

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 01/14] IOMMU: add option to enable VTD_CAP_CM to vIOMMU capility exposoed to guest
  2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 01/14] IOMMU: add option to enable VTD_CAP_CM to vIOMMU capility exposoed to guest Peter Xu
  2017-01-20  8:32   ` Tian, Kevin
@ 2017-01-20 15:42   ` Eric Blake
  2017-01-22  2:32     ` Peter Xu
  1 sibling, 1 reply; 93+ messages in thread
From: Eric Blake @ 2017-01-20 15:42 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

On 01/12/2017 09:06 PM, Peter Xu wrote:
> From: Aviv Ben-David <bd.aviv@gmail.com>

Long subject line; please try to keep it around 60 or so characters
(look at 'git shortlog -30' for comparison).  Also, fix the typos:
s/capility exposoed/capability exposed/

> 
> This capability asks the guest to invalidate cache before each map operation.
> We can use this invalidation to trap map operations in the hypervisor.
> 
> Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 01/14] IOMMU: add option to enable VTD_CAP_CM to vIOMMU capility exposoed to guest
  2017-01-20 15:42   ` Eric Blake
@ 2017-01-22  2:32     ` Peter Xu
  0 siblings, 0 replies; 93+ messages in thread
From: Peter Xu @ 2017-01-22  2:32 UTC (permalink / raw)
  To: Eric Blake
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

On Fri, Jan 20, 2017 at 09:42:25AM -0600, Eric Blake wrote:
> On 01/12/2017 09:06 PM, Peter Xu wrote:
> > From: Aviv Ben-David <bd.aviv@gmail.com>
> 
> Long subject line; please try to keep it around 60 or so characters
> (look at 'git shortlog -30' for comparison).  Also, fix the typos:
> s/capility exposoed/capability exposed/

Will fix this and repost this single patch as v4.1, based on the v4 series.

Thanks!

> 
> > 
> > This capability asks the guest to invalidate cache before each map operation.
> > We can use this invalidation to trap map operations in the hypervisor.
> > 
> > Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> 
> -- 
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 02/14] intel_iommu: simplify irq region translation
  2017-01-20 10:04             ` Peter Xu
@ 2017-01-22  4:42               ` Tian, Kevin
  2017-01-22  4:50                 ` Peter Xu
  0 siblings, 1 reply; 93+ messages in thread
From: Tian, Kevin @ 2017-01-22  4:42 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Lan, Tianyu, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Friday, January 20, 2017 6:04 PM
> 
> On Fri, Jan 20, 2017 at 09:52:01AM +0000, Tian, Kevin wrote:
> 
> [...]
> 
> > btw, what if the guest sets up a valid mapping at 0xFEEx_xxxx in
> > its remapping structure, which is then programmed into a virtual
> > device as a DMA destination? Then, when emulating that virtual DMA,
> > vtd_do_iommu_translate should simply return (maybe throwing out
> > a warning for diagnostic purposes) instead of asserting here.
> >
> > VT-d spec defines as below:
> >
> > 	Software must ensure the second-level paging-structure entries
> > 	are programmed not to remap input addresses to the interrupt
> > 	address range. Hardware behavior is undefined for memory
> > 	requests remapped to the interrupt address range.
> 
> Thanks for this reference. That's something I was curious about.
> 
> >
> > I don't think "hardware behavior is undefined" is the same as "assert
> > and thus kill the VM"...
> 
> I don't think it will kill the VM. Now that we have the MSI region,
> it should just use that IR region for everything
> (read/write/translate). So IIUC, when anyone sets up an IOVA mapping
> within the range 0xfeexxxxx, a DMA will trigger an interrupt (rather
> than a memory move), but in most cases the interrupt will be illegal,
> since the data will be invalid (e.g., non-zero reserved bits, or an
> SID verification failure); it should then trigger a vIOMMU fault
> (though IR fault reporting is still incomplete - that's my next thing
> to do after this series).
> 

Yes, you're right here. Sorry for bothering you with my wrong understanding. :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH RFC v3 02/14] intel_iommu: simplify irq region translation
  2017-01-22  4:42               ` Tian, Kevin
@ 2017-01-22  4:50                 ` Peter Xu
  0 siblings, 0 replies; 93+ messages in thread
From: Peter Xu @ 2017-01-22  4:50 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: qemu-devel, Lan, Tianyu, mst, jan.kiszka, jasowang,
	alex.williamson, bd.aviv

On Sun, Jan 22, 2017 at 04:42:13AM +0000, Tian, Kevin wrote:
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Friday, January 20, 2017 6:04 PM
> > 
> > On Fri, Jan 20, 2017 at 09:52:01AM +0000, Tian, Kevin wrote:
> > 
> > [...]
> > 
> > > btw, what if the guest sets up a valid mapping at 0xFEEx_xxxx in
> > > its remapping structure, which is then programmed into a virtual
> > > device as a DMA destination? Then, when emulating that virtual DMA,
> > > vtd_do_iommu_translate should simply return (maybe throwing out
> > > a warning for diagnostic purposes) instead of asserting here.
> > >
> > > VT-d spec defines as below:
> > >
> > > 	Software must ensure the second-level paging-structure entries
> > > 	are programmed not to remap input addresses to the interrupt
> > > 	address range. Hardware behavior is undefined for memory
> > > 	requests remapped to the interrupt address range.
> > 
> > Thanks for this reference. That's something I was curious about.
> > 
> > >
> > > I don't think "hardware behavior is undefined" is the same as "assert
> > > and thus kill the VM"...
> > 
> > I don't think it will kill the VM. Now that we have the MSI region,
> > it should just use that IR region for everything
> > (read/write/translate). So IIUC, when anyone sets up an IOVA mapping
> > within the range 0xfeexxxxx, a DMA will trigger an interrupt (rather
> > than a memory move), but in most cases the interrupt will be illegal,
> > since the data will be invalid (e.g., non-zero reserved bits, or an
> > SID verification failure); it should then trigger a vIOMMU fault
> > (though IR fault reporting is still incomplete - that's my next thing
> > to do after this series).
> > 
> 
> Yes, you're right here. Sorry for bothering you with my wrong understanding. :-)

No problem at all.

Looking forward to your further comments on v4. :-)

-- peterx

^ permalink raw reply	[flat|nested] 93+ messages in thread

end of thread, other threads:[~2017-01-22  4:50 UTC | newest]

Thread overview: 93+ messages
2017-01-13  3:06 [Qemu-devel] [PATCH RFC v3 00/14] VT-d: vfio enablement and misc enhances Peter Xu
2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 01/14] IOMMU: add option to enable VTD_CAP_CM to vIOMMU capility exposoed to guest Peter Xu
2017-01-20  8:32   ` Tian, Kevin
2017-01-20  8:54     ` Peter Xu
2017-01-20  8:59       ` Tian, Kevin
2017-01-20  9:11         ` Peter Xu
2017-01-20  9:20           ` Tian, Kevin
2017-01-20  9:30             ` Peter Xu
2017-01-20 15:42   ` Eric Blake
2017-01-22  2:32     ` Peter Xu
2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 02/14] intel_iommu: simplify irq region translation Peter Xu
2017-01-20  8:22   ` Tian, Kevin
2017-01-20  9:05     ` Peter Xu
2017-01-20  9:15       ` Tian, Kevin
2017-01-20  9:27         ` Peter Xu
2017-01-20  9:52           ` Tian, Kevin
2017-01-20 10:04             ` Peter Xu
2017-01-22  4:42               ` Tian, Kevin
2017-01-22  4:50                 ` Peter Xu
2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 03/14] intel_iommu: renaming gpa to iova where proper Peter Xu
2017-01-20  8:27   ` Tian, Kevin
2017-01-20  9:23     ` Peter Xu
2017-01-20  9:41       ` Tian, Kevin
2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 04/14] intel_iommu: fix trace for inv desc handling Peter Xu
2017-01-13  7:46   ` Jason Wang
2017-01-13  9:13     ` Peter Xu
2017-01-13  9:33       ` Jason Wang
2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 05/14] intel_iommu: fix trace for addr translation Peter Xu
2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 06/14] intel_iommu: vtd_slpt_level_shift check level Peter Xu
2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 07/14] memory: add section range info for IOMMU notifier Peter Xu
2017-01-13  7:55   ` Jason Wang
2017-01-13  9:23     ` Peter Xu
2017-01-13  9:37       ` Jason Wang
2017-01-13 10:22         ` Peter Xu
2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 08/14] memory: provide iommu_replay_all() Peter Xu
2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 09/14] memory: introduce memory_region_notify_one() Peter Xu
2017-01-13  7:58   ` Jason Wang
2017-01-16  7:08     ` Peter Xu
2017-01-16  7:38       ` Jason Wang
2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 10/14] memory: add MemoryRegionIOMMUOps.replay() callback Peter Xu
2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 11/14] intel_iommu: provide its own replay() callback Peter Xu
2017-01-13  9:26   ` Jason Wang
2017-01-16  7:31     ` Peter Xu
2017-01-16  7:47       ` Jason Wang
2017-01-16  7:59         ` Peter Xu
2017-01-16  8:03           ` Jason Wang
2017-01-16  8:06             ` Peter Xu
2017-01-16  8:23               ` Jason Wang
2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 12/14] intel_iommu: do replay when context invalidate Peter Xu
2017-01-16  5:53   ` Jason Wang
2017-01-16  7:43     ` Peter Xu
2017-01-16  7:52       ` Jason Wang
2017-01-16  8:02         ` Peter Xu
2017-01-16  8:18         ` Peter Xu
2017-01-16  8:28           ` Jason Wang
2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 13/14] intel_iommu: allow dynamic switch of IOMMU region Peter Xu
2017-01-16  6:20   ` Jason Wang
2017-01-16  7:50     ` Peter Xu
2017-01-16  8:01       ` Jason Wang
2017-01-16  8:12         ` Peter Xu
2017-01-16  8:25           ` Jason Wang
2017-01-16  8:32             ` Peter Xu
2017-01-16 16:25               ` Michael S. Tsirkin
2017-01-17 14:53                 ` Peter Xu
2017-01-16 19:53   ` Alex Williamson
2017-01-17 14:00     ` Peter Xu
2017-01-17 15:46       ` Alex Williamson
2017-01-18  7:49         ` Peter Xu
2017-01-19  8:20           ` Peter Xu
2017-01-13  3:06 ` [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio devices Peter Xu
2017-01-16  6:30   ` Jason Wang
2017-01-16  9:18     ` Peter Xu
2017-01-16  9:54       ` Jason Wang
2017-01-17 14:45         ` Peter Xu
2017-01-18  3:10           ` Jason Wang
2017-01-18  8:11             ` Peter Xu
2017-01-18  8:36               ` Jason Wang
2017-01-18  8:46                 ` Peter Xu
2017-01-18  9:38                   ` Tian, Kevin
2017-01-18 10:06                     ` Jason Wang
2017-01-19  3:32                       ` Peter Xu
2017-01-19  3:36                         ` Jason Wang
2017-01-19  3:16                     ` Peter Xu
2017-01-19  6:22                       ` Tian, Kevin
2017-01-19  9:38                         ` Peter Xu
2017-01-19  6:44                     ` Liu, Yi L
2017-01-19  7:02                       ` Jason Wang
2017-01-19  7:02                       ` Peter Xu
2017-01-16  9:20     ` Peter Xu
2017-01-13 15:58 ` [Qemu-devel] [PATCH RFC v3 00/14] VT-d: vfio enablement and misc enhances Michael S. Tsirkin
2017-01-14  2:59   ` Peter Xu
2017-01-17 15:07     ` Michael S. Tsirkin
2017-01-18  7:34       ` Peter Xu
