* [Qemu-devel] [RFC PATCH qemu v3 0/4] vfio: SPAPR IOMMU v2 (memory preregistration support)
@ 2015-07-14 12:21 Alexey Kardashevskiy
  2015-07-14 12:21 ` [Qemu-devel] [RFC PATCH qemu v3 1/4] memory: Add reporting of supported page sizes Alexey Kardashevskiy
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-14 12:21 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Crosthwaite, Alexey Kardashevskiy, Michael Roth,
	Alex Williamson, qemu-ppc, Paolo Bonzini, David Gibson

Yet another try; the whole patchset has been reworked.

Here are a few patches that prepare the existing memory listener for handling
memory preregistration for SPAPR guests running on POWER8.

This used to be part of the DDW patchset but is now separated out as requested.


Please comment. Thanks!

Changes:
v3:
* removed incorrect "vfio: Skip PCI BARs in memory listener"
* removed page size changes from quirks as they did not completely fix
the crashes happening on POWER8 (only total removal helps there)
* added "memory: Add reporting of supported page sizes"




Alexey Kardashevskiy (4):
  memory: Add reporting of supported page sizes
  vfio: Use different page size for different IOMMU types
  vfio: Store IOMMU type in container
  vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)

 hw/ppc/spapr_iommu.c          |   8 ++
 hw/vfio/common.c              | 245 ++++++++++++++++++++++++++++++++++--------
 include/exec/memory.h         |  11 ++
 include/hw/vfio/vfio-common.h |   7 ++
 memory.c                      |   9 ++
 trace-events                  |   2 +
 6 files changed, 235 insertions(+), 47 deletions(-)

-- 
2.4.0.rc3.8.gfb3e7d5


* [Qemu-devel] [RFC PATCH qemu v3 1/4] memory: Add reporting of supported page sizes
  2015-07-14 12:21 [Qemu-devel] [RFC PATCH qemu v3 0/4] vfio: SPAPR IOMMU v2 (memory preregistration support) Alexey Kardashevskiy
@ 2015-07-14 12:21 ` Alexey Kardashevskiy
  2015-07-15 18:26   ` Alex Williamson
  2015-07-14 12:21 ` [Qemu-devel] [RFC PATCH qemu v3 2/4] vfio: Use different page size for different IOMMU types Alexey Kardashevskiy
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-14 12:21 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Crosthwaite, Alexey Kardashevskiy, Michael Roth,
	Alex Williamson, qemu-ppc, Paolo Bonzini, David Gibson

Every IOMMU has some granularity which MemoryRegionIOMMUOps::translate
uses when translating; however, this information is not available outside
the translate context for various checks.

This adds a get_page_sizes callback to MemoryRegionIOMMUOps and
a wrapper for it so that IOMMU users (such as VFIO) can query the actual
page size(s) used by an IOMMU.

A shift of TARGET_PAGE_BITS is used as the fallback.
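
To illustrate the intended use (a sketch only; the helper name
vfio_iommu_page_mask and its exact shape are made up here), a VFIO-side
caller could turn the returned bitmap into an alignment mask like this:

#include "qemu/osdep.h"
#include "qemu/host-utils.h"    /* ctz64() */
#include "exec/memory.h"

/* Sketch: derive an alignment mask from the smallest page size
 * advertised by an IOMMU memory region. */
static hwaddr vfio_iommu_page_mask(MemoryRegion *iommu_mr)
{
    /* Bitmap of supported page sizes; for the sPAPR TCE table below
     * this is a single bit, 1ULL << tcet->page_shift. */
    uint64_t sizes = memory_region_iommu_get_page_sizes(iommu_mr);
    /* The lowest set bit is the smallest supported page size. */
    uint64_t min_size = 1ULL << ctz64(sizes);

    return ~(min_size - 1);     /* e.g. ~0xFFFFULL for 64K pages */
}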

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/spapr_iommu.c  |  8 ++++++++
 include/exec/memory.h | 11 +++++++++++
 memory.c              |  9 +++++++++
 3 files changed, 28 insertions(+)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index f61504e..a2572c4 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -104,6 +104,13 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
     return ret;
 }
 
+static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
+{
+    sPAPRTCETable *tcet = container_of(iommu, sPAPRTCETable, iommu);
+
+    return 1ULL << tcet->page_shift;
+}
+
 static int spapr_tce_table_post_load(void *opaque, int version_id)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
@@ -135,6 +142,7 @@ static const VMStateDescription vmstate_spapr_tce_table = {
 
 static MemoryRegionIOMMUOps spapr_iommu_ops = {
     .translate = spapr_tce_translate_iommu,
+    .get_page_sizes = spapr_tce_get_page_sizes,
 };
 
 static int spapr_tce_table_realize(DeviceState *dev)
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 1394715..9ca74e3 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -152,6 +152,8 @@ typedef struct MemoryRegionIOMMUOps MemoryRegionIOMMUOps;
 struct MemoryRegionIOMMUOps {
     /* Return a TLB entry that contains a given address. */
     IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
+    /* Returns supported page sizes */
+    uint64_t (*get_page_sizes)(MemoryRegion *iommu);
 };
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
@@ -552,6 +554,15 @@ static inline bool memory_region_is_romd(MemoryRegion *mr)
 bool memory_region_is_iommu(MemoryRegion *mr);
 
 /**
+ * memory_region_iommu_get_page_sizes: get supported page sizes in an iommu
+ *
+ * Returns %bitmap of supported page sizes for an iommu.
+ *
+ * @mr: the memory region being queried
+ */
+uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr);
+
+/**
  * memory_region_notify_iommu: notify a change in an IOMMU translation entry.
  *
  * @mr: the memory region that was changed
diff --git a/memory.c b/memory.c
index 5a0cc66..0732763 100644
--- a/memory.c
+++ b/memory.c
@@ -1413,6 +1413,15 @@ bool memory_region_is_iommu(MemoryRegion *mr)
     return mr->iommu_ops;
 }
 
+uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr)
+{
+    assert(memory_region_is_iommu(mr));
+    if (mr->iommu_ops && mr->iommu_ops->get_page_sizes) {
+        return mr->iommu_ops->get_page_sizes(mr);
+    }
+    return 1ULL << TARGET_PAGE_BITS;
+}
+
 void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n)
 {
     notifier_list_add(&mr->iommu_notify, n);
-- 
2.4.0.rc3.8.gfb3e7d5


* [Qemu-devel] [RFC PATCH qemu v3 2/4] vfio: Use different page size for different IOMMU types
  2015-07-14 12:21 [Qemu-devel] [RFC PATCH qemu v3 0/4] vfio: SPAPR IOMMU v2 (memory preregistration support) Alexey Kardashevskiy
  2015-07-14 12:21 ` [Qemu-devel] [RFC PATCH qemu v3 1/4] memory: Add reporting of supported page sizes Alexey Kardashevskiy
@ 2015-07-14 12:21 ` Alexey Kardashevskiy
  2015-07-15 18:26   ` Alex Williamson
  2015-07-14 12:21 ` [Qemu-devel] [RFC PATCH qemu v3 3/4] vfio: Store IOMMU type in container Alexey Kardashevskiy
  2015-07-14 12:21 ` [Qemu-devel] [RFC PATCH qemu v3 4/4] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
  3 siblings, 1 reply; 22+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-14 12:21 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Crosthwaite, Alexey Kardashevskiy, Michael Roth,
	Alex Williamson, qemu-ppc, Paolo Bonzini, David Gibson

The existing memory listener is called on the RAM or PCI address space,
which implies potentially different page sizes. Instead of guessing
what page size should be used, this replaces the single IOMMU memory
listener with two, one per supported IOMMU type; the listener callbacks
call the existing helpers with a known page size.

For now, Type1 uses qemu_real_host_page_mask and sPAPR uses the page size
from the IOMMU.

As memory_region_iommu_get_page_sizes() asserts on non-IOMMU regions,
this adds a memory_region_is_iommu() check to the SPAPR IOMMU listener to
skip non-IOMMU regions (such as the MSIX window), which duplicates
vfio_listener_skipped_section() a bit.
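
For reference, here is the mask arithmetic the reworked helpers rely on,
with an illustrative 64K IOMMU page size (the values are examples only):

/* Illustrative values only: how page_mask is used for a 64K IOMMU page. */
uint64_t page_size = 0x10000;            /* 64K */
hwaddr page_mask = ~(page_size - 1);     /* 0xffffffffffff0000 */
hwaddr offset = 0x12345;                 /* deliberately unaligned */

/* ~page_mask + 1 recovers the page size (0x10000), so ROUND_UP()
 * aligns the IOVA up to the next IOMMU page boundary, here 0x20000. */
hwaddr iova = ROUND_UP(offset, ~page_mask + 1);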

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/vfio/common.c | 95 ++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 76 insertions(+), 19 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 85ee9b0..aad41e1 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -312,11 +312,11 @@ out:
     rcu_read_unlock();
 }
 
-static void vfio_listener_region_add(MemoryListener *listener,
+static void vfio_listener_region_add(VFIOContainer *container,
+                                     hwaddr page_mask,
+                                     MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
-    VFIOContainer *container = container_of(listener, VFIOContainer,
-                                            iommu_data.type1.listener);
     hwaddr iova, end;
     Int128 llend;
     void *vaddr;
@@ -330,16 +330,16 @@ static void vfio_listener_region_add(MemoryListener *listener,
         return;
     }
 
-    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
-                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
+    if (unlikely((section->offset_within_address_space & ~page_mask) !=
+                 (section->offset_within_region & ~page_mask))) {
         error_report("%s received unaligned region", __func__);
         return;
     }
 
-    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+    iova = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
     llend = int128_make64(section->offset_within_address_space);
     llend = int128_add(llend, section->size);
-    llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
+    llend = int128_and(llend, int128_exts64(page_mask));
 
     if (int128_ge(int128_make64(iova), llend)) {
         return;
@@ -416,11 +416,11 @@ static void vfio_listener_region_add(MemoryListener *listener,
     }
 }
 
-static void vfio_listener_region_del(MemoryListener *listener,
+static void vfio_listener_region_del(VFIOContainer *container,
+                                     hwaddr page_mask,
+                                     MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
-    VFIOContainer *container = container_of(listener, VFIOContainer,
-                                            iommu_data.type1.listener);
     hwaddr iova, end;
     int ret;
 
@@ -432,8 +432,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
         return;
     }
 
-    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
-                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
+    if (unlikely((section->offset_within_address_space & ~page_mask) !=
+                 (section->offset_within_region & ~page_mask))) {
         error_report("%s received unaligned region", __func__);
         return;
     }
@@ -459,9 +459,9 @@ static void vfio_listener_region_del(MemoryListener *listener,
          */
     }
 
-    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+    iova = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
     end = (section->offset_within_address_space + int128_get64(section->size)) &
-          TARGET_PAGE_MASK;
+          page_mask;
 
     if (iova >= end) {
         return;
@@ -478,9 +478,66 @@ static void vfio_listener_region_del(MemoryListener *listener,
     }
 }
 
-static const MemoryListener vfio_memory_listener = {
-    .region_add = vfio_listener_region_add,
-    .region_del = vfio_listener_region_del,
+static void vfio_type1_iommu_listener_region_add(MemoryListener *listener,
+                                                 MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            iommu_data.type1.listener);
+
+    vfio_listener_region_add(container, qemu_real_host_page_mask, listener,
+                             section);
+}
+
+
+static void vfio_type1_iommu_listener_region_del(MemoryListener *listener,
+                                                 MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            iommu_data.type1.listener);
+
+    vfio_listener_region_del(container, qemu_real_host_page_mask, listener,
+                             section);
+}
+
+static const MemoryListener vfio_type1_iommu_listener = {
+    .region_add = vfio_type1_iommu_listener_region_add,
+    .region_del = vfio_type1_iommu_listener_region_del,
+};
+
+static void vfio_spapr_iommu_listener_region_add(MemoryListener *listener,
+                                     MemoryRegionSection *section)
+{
+    VFIOContainer *container;
+    hwaddr page_mask;
+
+    if (!memory_region_is_iommu(section->mr)) {
+        return;
+    }
+    container = container_of(listener, VFIOContainer,
+                             iommu_data.type1.listener);
+    page_mask = ~(memory_region_iommu_get_page_sizes(section->mr) - 1);
+    vfio_listener_region_add(container, page_mask, listener, section);
+}
+
+
+static void vfio_spapr_iommu_listener_region_del(MemoryListener *listener,
+                                     MemoryRegionSection *section)
+{
+    VFIOContainer *container;
+    hwaddr page_mask;
+
+    if (!memory_region_is_iommu(section->mr)) {
+        return;
+    }
+    container = container_of(listener, VFIOContainer,
+                             iommu_data.type1.listener);
+    page_mask = ~(memory_region_iommu_get_page_sizes(section->mr) - 1);
+    vfio_listener_region_del(container, page_mask, listener, section);
+}
+
+static const MemoryListener vfio_spapr_iommu_listener = {
+    .region_add = vfio_spapr_iommu_listener_region_add,
+    .region_del = vfio_spapr_iommu_listener_region_del,
 };
 
 static void vfio_listener_release(VFIOContainer *container)
@@ -684,7 +741,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             goto free_container_exit;
         }
 
-        container->iommu_data.type1.listener = vfio_memory_listener;
+        container->iommu_data.type1.listener = vfio_type1_iommu_listener;
         container->iommu_data.release = vfio_listener_release;
 
         memory_listener_register(&container->iommu_data.type1.listener,
@@ -724,7 +781,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             goto free_container_exit;
         }
 
-        container->iommu_data.type1.listener = vfio_memory_listener;
+        container->iommu_data.type1.listener = vfio_spapr_iommu_listener;
         container->iommu_data.release = vfio_listener_release;
 
         memory_listener_register(&container->iommu_data.type1.listener,
-- 
2.4.0.rc3.8.gfb3e7d5


* [Qemu-devel] [RFC PATCH qemu v3 3/4] vfio: Store IOMMU type in container
  2015-07-14 12:21 [Qemu-devel] [RFC PATCH qemu v3 0/4] vfio: SPAPR IOMMU v2 (memory preregistration support) Alexey Kardashevskiy
  2015-07-14 12:21 ` [Qemu-devel] [RFC PATCH qemu v3 1/4] memory: Add reporting of supported page sizes Alexey Kardashevskiy
  2015-07-14 12:21 ` [Qemu-devel] [RFC PATCH qemu v3 2/4] vfio: Use different page size for different IOMMU types Alexey Kardashevskiy
@ 2015-07-14 12:21 ` Alexey Kardashevskiy
  2015-07-15 18:26   ` Alex Williamson
  2015-07-14 12:21 ` [Qemu-devel] [RFC PATCH qemu v3 4/4] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
  3 siblings, 1 reply; 22+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-14 12:21 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Crosthwaite, Alexey Kardashevskiy, Michael Roth,
	Alex Williamson, qemu-ppc, Paolo Bonzini, David Gibson

So far we have managed not to store the IOMMU type anywhere, but since
we are going to implement different behavior for different IOMMU types
in the same memory listener, we need to know the IOMMU type after
initialization.

This adds an IOMMU type to VFIOContainer and initializes it. It also
adds SPAPR IOMMU data to the iommu_data union; for now it only includes
the existing Type1 data struct. Since zero is not used for any type,
no additional initialization is necessary for VFIOContainer::type.

This reworks vfio_listener_region_add() in order to prepare it to
handle RAM regions on IOMMUs other than Type1/Type1v2.

This should cause no behavioral change.
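
A note on the resulting layout (this just mirrors the vfio-common.h hunk
below): VFIOSPAPR embeds VFIOType1 as its first member and shares the
union with it, so iommu_data.type1 and iommu_data.spapr.common alias the
same storage and the existing common code can keep using the type1 fields:

typedef struct VFIOSPAPR {
    VFIOType1 common;    /* overlays iommu_data.type1 exactly */
} VFIOSPAPR;

/* For any container:
 *   &container->iommu_data.type1 == &container->iommu_data.spapr.common
 * so e.g. a listener stored via spapr.common.listener can still be
 * unregistered through type1.listener. */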

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v3:
* folded vfio_listener_region_add() change into this

v2:
* added VFIOContainer::iommu_data::spapr
---
 hw/vfio/common.c              | 55 ++++++++++++++++++++++++++-----------------
 include/hw/vfio/vfio-common.h |  6 +++++
 2 files changed, 39 insertions(+), 22 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index aad41e1..6982b8f 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -393,26 +393,36 @@ static void vfio_listener_region_add(VFIOContainer *container,
             section->offset_within_region +
             (iova - section->offset_within_address_space);
 
-    trace_vfio_listener_region_add_ram(iova, end - 1, vaddr);
+    switch (container->iommu_data.type) {
+    case VFIO_TYPE1_IOMMU:
+    case VFIO_TYPE1v2_IOMMU:
+        trace_vfio_listener_region_add_ram(iova, end - 1, vaddr);
 
-    ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
-    if (ret) {
-        error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
-                     "0x%"HWADDR_PRIx", %p) = %d (%m)",
-                     container, iova, end - iova, vaddr, ret);
+        ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
+        if (ret) {
+            error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
+                         "0x%"HWADDR_PRIx", %p) = %d (%m)",
+                         container, iova, end - iova, vaddr, ret);
+            goto error_exit;
+        }
+        break;
+    }
+
+    return;
+
+error_exit:
 
-        /*
-         * On the initfn path, store the first error in the container so we
-         * can gracefully fail.  Runtime, there's not much we can do other
-         * than throw a hardware error.
-         */
-        if (!container->iommu_data.type1.initialized) {
-            if (!container->iommu_data.type1.error) {
-                container->iommu_data.type1.error = ret;
-            }
-        } else {
-            hw_error("vfio: DMA mapping failed, unable to continue");
+    /*
+     * On the initfn path, store the first error in the container so we
+     * can gracefully fail.  Runtime, there's not much we can do other
+     * than throw a hardware error.
+     */
+    if (!container->iommu_data.type1.initialized) {
+        if (!container->iommu_data.type1.error) {
+            container->iommu_data.type1.error = ret;
         }
+    } else {
+        hw_error("vfio: DMA mapping failed, unable to continue");
     }
 }
 
@@ -733,8 +743,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             goto free_container_exit;
         }
 
-        ret = ioctl(fd, VFIO_SET_IOMMU,
-                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
+        container->iommu_data.type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
+        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_data.type);
         if (ret) {
             error_report("vfio: failed to set iommu for container: %m");
             ret = -errno;
@@ -762,7 +772,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             ret = -errno;
             goto free_container_exit;
         }
-        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
+        container->iommu_data.type = VFIO_SPAPR_TCE_IOMMU;
+        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_data.type);
         if (ret) {
             error_report("vfio: failed to set iommu for container: %m");
             ret = -errno;
@@ -781,10 +792,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             goto free_container_exit;
         }
 
-        container->iommu_data.type1.listener = vfio_spapr_iommu_listener;
+        container->iommu_data.spapr.common.listener = vfio_spapr_iommu_listener;
         container->iommu_data.release = vfio_listener_release;
 
-        memory_listener_register(&container->iommu_data.type1.listener,
+        memory_listener_register(&container->iommu_data.spapr.common.listener,
                                  container->space->as);
 
     } else {
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 59a321d..135ea64 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -70,13 +70,19 @@ typedef struct VFIOType1 {
     bool initialized;
 } VFIOType1;
 
+typedef struct VFIOSPAPR {
+    VFIOType1 common;
+} VFIOSPAPR;
+
 typedef struct VFIOContainer {
     VFIOAddressSpace *space;
     int fd; /* /dev/vfio/vfio, empowered by the attached groups */
     struct {
         /* enable abstraction to support various iommu backends */
+        unsigned type;
         union {
             VFIOType1 type1;
+            VFIOSPAPR spapr;
         };
         void (*release)(struct VFIOContainer *);
     } iommu_data;
-- 
2.4.0.rc3.8.gfb3e7d5


* [Qemu-devel] [RFC PATCH qemu v3 4/4] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-14 12:21 [Qemu-devel] [RFC PATCH qemu v3 0/4] vfio: SPAPR IOMMU v2 (memory preregistration support) Alexey Kardashevskiy
                   ` (2 preceding siblings ...)
  2015-07-14 12:21 ` [Qemu-devel] [RFC PATCH qemu v3 3/4] vfio: Store IOMMU type in container Alexey Kardashevskiy
@ 2015-07-14 12:21 ` Alexey Kardashevskiy
  2015-07-16  5:11   ` David Gibson
  3 siblings, 1 reply; 22+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-14 12:21 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Crosthwaite, Alexey Kardashevskiy, Michael Roth,
	Alex Williamson, qemu-ppc, Paolo Bonzini, David Gibson

This makes use of the new "memory registering" feature. The idea is
to give userspace the ability to notify the host kernel about pages
which are going to be used for DMA. Having this information, the host
kernel can pin them all once per user process, do the locked-pages
accounting (once) and not spend time on doing that at runtime, where
possible failures cannot always be handled nicely.

This adds a guest RAM memory listener which notifies a VFIO container
about memory which needs to be pinned/unpinned. VFIO MMIO regions
(i.e. "skip dump" regions) are skipped.

The feature is only enabled for SPAPR IOMMU v2 and requires the
corresponding host kernel changes. Since v2 does not need/support
VFIO_IOMMU_ENABLE, this is not called when v2 is detected and enabled.

This does not change the guest-visible interface.
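
For reference, the preregistration itself boils down to one ioctl per RAM
section; a minimal sketch of what the new listener does on region_add
(error handling trimmed, vaddr/size stand for the section's host mapping):

struct vfio_iommu_spapr_register_memory reg = {
    .argsz = sizeof(reg),
    .flags = 0,
    .vaddr = (uint64_t)(uintptr_t)vaddr,   /* host address backing the RAM */
    .size = size,                          /* section size */
};

/* Pins the pages and does the locked-pages accounting once, up front. */
if (ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg)) {
    /* Registration failed; DMA mappings of this range will fail later. */
}
/* region_del does the reverse via VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY. */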

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v3:
* new RAM listener skips BARs (i.e. "skip dump" regions)

v2:
* added another listener for RAM
---
 hw/vfio/common.c              | 99 +++++++++++++++++++++++++++++++++++++++----
 include/hw/vfio/vfio-common.h |  1 +
 trace-events                  |  2 +
 3 files changed, 94 insertions(+), 8 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 6982b8f..d78a83f 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -406,6 +406,19 @@ static void vfio_listener_region_add(VFIOContainer *container,
             goto error_exit;
         }
         break;
+
+    case VFIO_SPAPR_TCE_v2_IOMMU: {
+        struct vfio_iommu_spapr_register_memory reg = {
+            .argsz = sizeof(reg),
+            .flags = 0,
+            .vaddr = (uint64_t) vaddr,
+            .size = end - iova
+        };
+
+        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
+        trace_vfio_ram_register(reg.vaddr, reg.size, ret ? -errno : 0);
+        break;
+    }
     }
 
     return;
@@ -486,6 +499,26 @@ static void vfio_listener_region_del(VFIOContainer *container,
                      "0x%"HWADDR_PRIx") = %d (%m)",
                      container, iova, end - iova, ret);
     }
+
+    switch (container->iommu_data.type) {
+    case VFIO_SPAPR_TCE_v2_IOMMU:
+        if (!memory_region_is_iommu(section->mr)) {
+            void *vaddr = memory_region_get_ram_ptr(section->mr) +
+                section->offset_within_region +
+                (iova - section->offset_within_address_space);
+            struct vfio_iommu_spapr_register_memory reg = {
+                .argsz = sizeof(reg),
+                .flags = 0,
+                .vaddr = (uint64_t) vaddr,
+                .size = end - iova
+            };
+
+            ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY,
+                        &reg);
+            trace_vfio_ram_unregister(reg.vaddr, reg.size, ret ? -errno : 0);
+        }
+        break;
+    }
 }
 
 static void vfio_type1_iommu_listener_region_add(MemoryListener *listener,
@@ -550,8 +583,42 @@ static const MemoryListener vfio_spapr_iommu_listener = {
     .region_del = vfio_spapr_iommu_listener_region_del,
 };
 
+static void vfio_spapr_ram_listener_region_add(MemoryListener *listener,
+                                               MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            iommu_data.spapr.ram_listener);
+
+    if (memory_region_is_skip_dump(section->mr)) {
+        return;
+    }
+    vfio_listener_region_add(container, qemu_real_host_page_mask, listener,
+                             section);
+}
+
+static void vfio_spapr_ram_listener_region_del(MemoryListener *listener,
+                                               MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            iommu_data.spapr.ram_listener);
+
+    if (memory_region_is_skip_dump(section->mr)) {
+        return;
+    }
+    vfio_listener_region_del(container, qemu_real_host_page_mask, listener,
+                             section);
+}
+
+static const MemoryListener vfio_spapr_ram_listener = {
+    .region_add = vfio_spapr_ram_listener_region_add,
+    .region_del = vfio_spapr_ram_listener_region_del,
+};
+
 static void vfio_listener_release(VFIOContainer *container)
 {
+    if (container->iommu_data.type == VFIO_SPAPR_TCE_v2_IOMMU) {
+        memory_listener_unregister(&container->iommu_data.spapr.ram_listener);
+    }
     memory_listener_unregister(&container->iommu_data.type1.listener);
 }
 
@@ -765,14 +832,18 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
 
         container->iommu_data.type1.initialized = true;
 
-    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
+    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
+               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
+        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
+
         ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
         if (ret) {
             error_report("vfio: failed to set group container: %m");
             ret = -errno;
             goto free_container_exit;
         }
-        container->iommu_data.type = VFIO_SPAPR_TCE_IOMMU;
+        container->iommu_data.type =
+            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
         ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_data.type);
         if (ret) {
             error_report("vfio: failed to set iommu for container: %m");
@@ -785,18 +856,30 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
          * when container fd is closed so we do not call it explicitly
          * in this file.
          */
-        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
-        if (ret) {
-            error_report("vfio: failed to enable container: %m");
-            ret = -errno;
-            goto free_container_exit;
+        if (!v2) {
+            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
+            if (ret) {
+                error_report("vfio: failed to enable container: %m");
+                ret = -errno;
+                goto free_container_exit;
+            }
         }
 
         container->iommu_data.spapr.common.listener = vfio_spapr_iommu_listener;
         container->iommu_data.release = vfio_listener_release;
-
         memory_listener_register(&container->iommu_data.spapr.common.listener,
                                  container->space->as);
+        if (v2) {
+            container->iommu_data.spapr.ram_listener = vfio_spapr_ram_listener;
+            memory_listener_register(&container->iommu_data.spapr.ram_listener,
+                                     &address_space_memory);
+            if (container->iommu_data.spapr.common.error) {
+                error_report("vfio: RAM memory listener initialization failed for container");
+                goto listener_release_exit;
+            }
+
+            container->iommu_data.spapr.common.initialized = true;
+        }
 
     } else {
         error_report("vfio: No available IOMMU models");
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 135ea64..8edd572 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -72,6 +72,7 @@ typedef struct VFIOType1 {
 
 typedef struct VFIOSPAPR {
     VFIOType1 common;
+    MemoryListener ram_listener;
 } VFIOSPAPR;
 
 typedef struct VFIOContainer {
diff --git a/trace-events b/trace-events
index d24d80a..f859ad0 100644
--- a/trace-events
+++ b/trace-events
@@ -1582,6 +1582,8 @@ vfio_disconnect_container(int fd) "close container->fd=%d"
 vfio_put_group(int fd) "close group->fd=%d"
 vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions, unsigned int num_irqs) "Device %s flags: %u, regions: %u, irqs: %u"
 vfio_put_base_device(int fd) "close vdev->fd=%d"
+vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
+vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
 
 # hw/vfio/platform.c
 vfio_platform_populate_regions(int region_index, unsigned long flag, unsigned long size, int fd, unsigned long offset) "- region %d flags = 0x%lx, size = 0x%lx, fd= %d, offset = 0x%lx"
-- 
2.4.0.rc3.8.gfb3e7d5


* Re: [Qemu-devel] [RFC PATCH qemu v3 3/4] vfio: Store IOMMU type in container
  2015-07-14 12:21 ` [Qemu-devel] [RFC PATCH qemu v3 3/4] vfio: Store IOMMU type in container Alexey Kardashevskiy
@ 2015-07-15 18:26   ` Alex Williamson
  0 siblings, 0 replies; 22+ messages in thread
From: Alex Williamson @ 2015-07-15 18:26 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Peter Crosthwaite, Michael Roth, qemu-devel, qemu-ppc,
	Paolo Bonzini, David Gibson

On Tue, 2015-07-14 at 22:21 +1000, Alexey Kardashevskiy wrote:
> So far we were managing not to have an IOMMU type stored anywhere but
> since we are going to implement different behavior for different IOMMU
> types in the same memory listener, we need to know IOMMU type after
> initialization.
> 
> This adds an IOMMU type into VFIOContainer and initializes it. This
> adds SPAPR IOMMU data into the iommu_data union; for now it only includes
> the existing Type1 data struct. Since zero is not used for any type,
> no additional initialization is necessary for VFIOContainer::type.
> 
> This reworks vfio_listener_region_add() in order to prepare it to
> handle RAM regions on IOMMUs other than Type1/Type1v2.
> 
> This should cause no behavioral change.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v3:
> * folded vfio_listener_region_add() change into this
> 
> v2:
> * added VFIOContainer::iommu_data::spapr
> ---
>  hw/vfio/common.c              | 55 ++++++++++++++++++++++++++-----------------
>  include/hw/vfio/vfio-common.h |  6 +++++
>  2 files changed, 39 insertions(+), 22 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index aad41e1..6982b8f 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -393,26 +393,36 @@ static void vfio_listener_region_add(VFIOContainer *container,
>              section->offset_within_region +
>              (iova - section->offset_within_address_space);
>  
> -    trace_vfio_listener_region_add_ram(iova, end - 1, vaddr);
> +    switch (container->iommu_data.type) {
> +    case VFIO_TYPE1_IOMMU:
> +    case VFIO_TYPE1v2_IOMMU:
> +        trace_vfio_listener_region_add_ram(iova, end - 1, vaddr);
>  
> -    ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
> -    if (ret) {
> -        error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> -                     "0x%"HWADDR_PRIx", %p) = %d (%m)",
> -                     container, iova, end - iova, vaddr, ret);
> +        ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
> +        if (ret) {
> +            error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> +                         "0x%"HWADDR_PRIx", %p) = %d (%m)",
> +                         container, iova, end - iova, vaddr, ret);
> +            goto error_exit;
> +        }
> +        break;
> +    }
> +
> +    return;
> +
> +error_exit:
>  
> -        /*
> -         * On the initfn path, store the first error in the container so we
> -         * can gracefully fail.  Runtime, there's not much we can do other
> -         * than throw a hardware error.
> -         */
> -        if (!container->iommu_data.type1.initialized) {
> -            if (!container->iommu_data.type1.error) {
> -                container->iommu_data.type1.error = ret;
> -            }
> -        } else {
> -            hw_error("vfio: DMA mapping failed, unable to continue");
> +    /*
> +     * On the initfn path, store the first error in the container so we
> +     * can gracefully fail.  Runtime, there's not much we can do other
> +     * than throw a hardware error.
> +     */
> +    if (!container->iommu_data.type1.initialized) {
> +        if (!container->iommu_data.type1.error) {
> +            container->iommu_data.type1.error = ret;

If only the type1 error path makes use of this, why isn't it in the
type1 code path exclusively?  If it's out here in the common code, it
should be common (which could be achieved with an error callback on the
new VFIOMemoryListener).
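
For instance (rough sketch, untested):

struct VFIOMemoryListener {
    MemoryListener listener;
    VFIOContainer *container;
    /* Called by the shared map/unmap code on failure so that each
     * IOMMU type decides whether to record the error or abort. */
    void (*error)(VFIOContainer *container, int ret);
};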

>          }
> +    } else {
> +        hw_error("vfio: DMA mapping failed, unable to continue");
>      }
>  }
>  
> @@ -733,8 +743,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              goto free_container_exit;
>          }
>  
> -        ret = ioctl(fd, VFIO_SET_IOMMU,
> -                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
> +        container->iommu_data.type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_data.type);
>          if (ret) {
>              error_report("vfio: failed to set iommu for container: %m");
>              ret = -errno;
> @@ -762,7 +772,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              ret = -errno;
>              goto free_container_exit;
>          }
> -        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
> +        container->iommu_data.type = VFIO_SPAPR_TCE_IOMMU;
> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_data.type);
>          if (ret) {
>              error_report("vfio: failed to set iommu for container: %m");
>              ret = -errno;
> @@ -781,10 +792,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              goto free_container_exit;
>          }
>  
> -        container->iommu_data.type1.listener = vfio_spapr_iommu_listener;
> +        container->iommu_data.spapr.common.listener = vfio_spapr_iommu_listener;
>          container->iommu_data.release = vfio_listener_release;
>  
> -        memory_listener_register(&container->iommu_data.type1.listener,
> +        memory_listener_register(&container->iommu_data.spapr.common.listener,
>                                   container->space->as);
>  
>      } else {
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 59a321d..135ea64 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -70,13 +70,19 @@ typedef struct VFIOType1 {
>      bool initialized;
>  } VFIOType1;
>  
> +typedef struct VFIOSPAPR {
> +    VFIOType1 common;
> +} VFIOSPAPR;
> +
>  typedef struct VFIOContainer {
>      VFIOAddressSpace *space;
>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>      struct {
>          /* enable abstraction to support various iommu backends */
> +        unsigned type;
>          union {
>              VFIOType1 type1;
> +            VFIOSPAPR spapr;
>          };
>          void (*release)(struct VFIOContainer *);
>      } iommu_data;


* Re: [Qemu-devel] [RFC PATCH qemu v3 2/4] vfio: Use different page size for different IOMMU types
  2015-07-14 12:21 ` [Qemu-devel] [RFC PATCH qemu v3 2/4] vfio: Use different page size for different IOMMU types Alexey Kardashevskiy
@ 2015-07-15 18:26   ` Alex Williamson
  2015-07-16  1:26     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 22+ messages in thread
From: Alex Williamson @ 2015-07-15 18:26 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Peter Crosthwaite, Michael Roth, qemu-devel, qemu-ppc,
	Paolo Bonzini, David Gibson

On Tue, 2015-07-14 at 22:21 +1000, Alexey Kardashevskiy wrote:
> The existing memory listener is called on RAM or PCI address space
> which implies potentially different page size. Instead of guessing
> what page size should be used, this replaces a single IOMMU memory
> listener by two, one per supported IOMMU type; listener callbacks
> call the existing helpers with a known page size.
> 
> For now, Type1 uses qemu_real_host_page_mask, sPAPR uses the page size
> from IOMMU.
> 
> As memory_region_iommu_get_page_sizes() asserts on non-IOMMU regions,
> this adds memory_region_is_iommu() to the SPAPR IOMMU listener to skip
> non IOMMU regions (which is an MSIX window) which duplicates
> vfio_listener_skipped_section() a bit.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  hw/vfio/common.c | 95 ++++++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 76 insertions(+), 19 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 85ee9b0..aad41e1 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -312,11 +312,11 @@ out:
>      rcu_read_unlock();
>  }
>  
> -static void vfio_listener_region_add(MemoryListener *listener,
> +static void vfio_listener_region_add(VFIOContainer *container,
> +                                     hwaddr page_mask,

Should memory_region_iommu_get_page_sizes() return a hwaddr?

> +                                     MemoryListener *listener,
>                                       MemoryRegionSection *section)
>  {
> -    VFIOContainer *container = container_of(listener, VFIOContainer,
> -                                            iommu_data.type1.listener);
>      hwaddr iova, end;
>      Int128 llend;
>      void *vaddr;
> @@ -330,16 +330,16 @@ static void vfio_listener_region_add(MemoryListener *listener,
>          return;
>      }
>  
> -    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
> -                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
> +    if (unlikely((section->offset_within_address_space & ~page_mask) !=
> +                 (section->offset_within_region & ~page_mask))) {
>          error_report("%s received unaligned region", __func__);
>          return;
>      }
>  
> -    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> +    iova = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
>      llend = int128_make64(section->offset_within_address_space);
>      llend = int128_add(llend, section->size);
> -    llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
> +    llend = int128_and(llend, int128_exts64(page_mask));
>  
>      if (int128_ge(int128_make64(iova), llend)) {
>          return;
> @@ -416,11 +416,11 @@ static void vfio_listener_region_add(MemoryListener *listener,
>      }
>  }
>  
> -static void vfio_listener_region_del(MemoryListener *listener,
> +static void vfio_listener_region_del(VFIOContainer *container,
> +                                     hwaddr page_mask,
> +                                     MemoryListener *listener,
>                                       MemoryRegionSection *section)
>  {
> -    VFIOContainer *container = container_of(listener, VFIOContainer,
> -                                            iommu_data.type1.listener);
>      hwaddr iova, end;
>      int ret;
>  
> @@ -432,8 +432,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
>          return;
>      }
>  
> -    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
> -                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
> +    if (unlikely((section->offset_within_address_space & ~page_mask) !=
> +                 (section->offset_within_region & ~page_mask))) {
>          error_report("%s received unaligned region", __func__);
>          return;
>      }
> @@ -459,9 +459,9 @@ static void vfio_listener_region_del(MemoryListener *listener,
>           */
>      }
>  
> -    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> +    iova = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
>      end = (section->offset_within_address_space + int128_get64(section->size)) &
> -          TARGET_PAGE_MASK;
> +          page_mask;
>  
>      if (iova >= end) {
>          return;
> @@ -478,9 +478,66 @@ static void vfio_listener_region_del(MemoryListener *listener,
>      }
>  }
>  
> -static const MemoryListener vfio_memory_listener = {
> -    .region_add = vfio_listener_region_add,
> -    .region_del = vfio_listener_region_del,
> +static void vfio_type1_iommu_listener_region_add(MemoryListener *listener,
> +                                                 MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> +                                            iommu_data.type1.listener);
> +
> +    vfio_listener_region_add(container, qemu_real_host_page_mask, listener,
> +                             section);
> +}
> +
> +
> +static void vfio_type1_iommu_listener_region_del(MemoryListener *listener,
> +                                                 MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> +                                            iommu_data.type1.listener);
> +
> +    vfio_listener_region_del(container, qemu_real_host_page_mask, listener,
> +                             section);
> +}
> +
> +static const MemoryListener vfio_type1_iommu_listener = {
> +    .region_add = vfio_type1_iommu_listener_region_add,
> +    .region_del = vfio_type1_iommu_listener_region_del,
> +};
> +
> +static void vfio_spapr_iommu_listener_region_add(MemoryListener *listener,
> +                                     MemoryRegionSection *section)
> +{
> +    VFIOContainer *container;
> +    hwaddr page_mask;
> +
> +    if (!memory_region_is_iommu(section->mr)) {
> +        return;
> +    }
> +    container = container_of(listener, VFIOContainer,
> +                             iommu_data.type1.listener);
> +    page_mask = ~(memory_region_iommu_get_page_sizes(section->mr) - 1);
> +    vfio_listener_region_add(container, page_mask, listener, section);
> +}
> +
> +
> +static void vfio_spapr_iommu_listener_region_del(MemoryListener *listener,
> +                                     MemoryRegionSection *section)
> +{
> +    VFIOContainer *container;
> +    hwaddr page_mask;
> +
> +    if (!memory_region_is_iommu(section->mr)) {
> +        return;
> +    }
> +    container = container_of(listener, VFIOContainer,
> +                             iommu_data.type1.listener);
> +    page_mask = ~(memory_region_iommu_get_page_sizes(section->mr) - 1);
> +    vfio_listener_region_del(container, page_mask, listener, section);
> +}
> +
> +static const MemoryListener vfio_spapr_iommu_listener = {
> +    .region_add = vfio_spapr_iommu_listener_region_add,
> +    .region_del = vfio_spapr_iommu_listener_region_del,
>  };


Rather than creating all these wrappers, why don't we create a structure
that gives us callbacks we can use for a shared function?  If we had:

struct VFIOMemoryListener {
	struct MemoryListener listener;
	bool (*filter)(MemoryRegionSection *section);
	hwaddr (*page_size)(MemoryRegionSection *section);
	VFIOContainer *container;
};

Then there would be no reason for you to have separate wrappers for
spapr.  VFIOType1 would have a single VFIOMemoryListener; for spapr you
could have an array of two.
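
Then the shared add/del code just dereferences the callbacks, something
like (untested sketch):

static void vfio_listener_region_add(MemoryListener *listener,
                                     MemoryRegionSection *section)
{
    struct VFIOMemoryListener *vlistener =
        container_of(listener, struct VFIOMemoryListener, listener);
    hwaddr page_mask;

    if (vlistener->filter && !vlistener->filter(section)) {
        return;
    }
    page_mask = ~(vlistener->page_size(section) - 1);
    /* ... existing common code, using vlistener->container and
     * page_mask instead of TARGET_PAGE_MASK ... */
}
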
>  
>  static void vfio_listener_release(VFIOContainer *container)
> @@ -684,7 +741,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              goto free_container_exit;
>          }
>  
> -        container->iommu_data.type1.listener = vfio_memory_listener;
> +        container->iommu_data.type1.listener = vfio_type1_iommu_listener;
>          container->iommu_data.release = vfio_listener_release;
>  
>          memory_listener_register(&container->iommu_data.type1.listener,
> @@ -724,7 +781,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              goto free_container_exit;
>          }
>  
> -        container->iommu_data.type1.listener = vfio_memory_listener;
> +        container->iommu_data.type1.listener = vfio_spapr_iommu_listener;
>          container->iommu_data.release = vfio_listener_release;
>  
>          memory_listener_register(&container->iommu_data.type1.listener,


* Re: [Qemu-devel] [RFC PATCH qemu v3 1/4] memory: Add reporting of supported page sizes
  2015-07-14 12:21 ` [Qemu-devel] [RFC PATCH qemu v3 1/4] memory: Add reporting of supported page sizes Alexey Kardashevskiy
@ 2015-07-15 18:26   ` Alex Williamson
  2015-07-16  1:12     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 22+ messages in thread
From: Alex Williamson @ 2015-07-15 18:26 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Peter Crosthwaite, Michael Roth, qemu-devel, qemu-ppc,
	Paolo Bonzini, David Gibson

On Tue, 2015-07-14 at 22:21 +1000, Alexey Kardashevskiy wrote:
> Every IOMMU has some granularity which MemoryRegionIOMMUOps::translate
> uses when translating, however this information is not available outside
> the translate context for various checks.
> 
> This adds a get_page_sizes callback to MemoryRegionIOMMUOps and
> a wrapper for it so IOMMU users (such as VFIO) can know the actual
> page size(s) used by an IOMMU.
> 
> TARGET_PAGE_BITS shift is used as fallback.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  hw/ppc/spapr_iommu.c  |  8 ++++++++
>  include/exec/memory.h | 11 +++++++++++
>  memory.c              |  9 +++++++++
>  3 files changed, 28 insertions(+)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index f61504e..a2572c4 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -104,6 +104,13 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
>      return ret;
>  }
>  
> +static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
> +{
> +    sPAPRTCETable *tcet = container_of(iommu, sPAPRTCETable, iommu);
> +
> +    return 1ULL << tcet->page_shift;
> +}
> +
>  static int spapr_tce_table_post_load(void *opaque, int version_id)
>  {
>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> @@ -135,6 +142,7 @@ static const VMStateDescription vmstate_spapr_tce_table = {
>  
>  static MemoryRegionIOMMUOps spapr_iommu_ops = {
>      .translate = spapr_tce_translate_iommu,
> +    .get_page_sizes = spapr_tce_get_page_sizes,
>  };
>  
>  static int spapr_tce_table_realize(DeviceState *dev)
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 1394715..9ca74e3 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -152,6 +152,8 @@ typedef struct MemoryRegionIOMMUOps MemoryRegionIOMMUOps;
>  struct MemoryRegionIOMMUOps {
>      /* Return a TLB entry that contains a given address. */
>      IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
> +    /* Returns supported page sizes */
> +    uint64_t (*get_page_sizes)(MemoryRegion *iommu);
>  };
>  
>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> @@ -552,6 +554,15 @@ static inline bool memory_region_is_romd(MemoryRegion *mr)
>  bool memory_region_is_iommu(MemoryRegion *mr);
>  
>  /**
> + * memory_region_iommu_get_page_sizes: get supported page sizes in an iommu
> + *
> + * Returns %bitmap of supported page sizes for an iommu.
> + *
> + * @mr: the memory region being queried
> + */
> +uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr);
> +
> +/**
>   * memory_region_notify_iommu: notify a change in an IOMMU translation entry.
>   *
>   * @mr: the memory region that was changed
> diff --git a/memory.c b/memory.c
> index 5a0cc66..0732763 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1413,6 +1413,15 @@ bool memory_region_is_iommu(MemoryRegion *mr)
>      return mr->iommu_ops;
>  }
>  
> +uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr)
> +{
> +    assert(memory_region_is_iommu(mr));
> +    if (mr->iommu_ops && mr->iommu_ops->get_page_sizes) {
> +        return mr->iommu_ops->get_page_sizes(mr);
> +    }
> +    return 1ULL << TARGET_PAGE_BITS;

At this rate we're adding TARGET_PAGE_foo faster than Peter can remove
them.

> +}
> +
>  void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n)
>  {
>      notifier_list_add(&mr->iommu_notify, n);


* Re: [Qemu-devel] [RFC PATCH qemu v3 1/4] memory: Add reporting of supported page sizes
  2015-07-15 18:26   ` Alex Williamson
@ 2015-07-16  1:12     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 22+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-16  1:12 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Peter Crosthwaite, Michael Roth, qemu-devel, qemu-ppc,
	Paolo Bonzini, David Gibson

On 07/16/2015 04:26 AM, Alex Williamson wrote:
> On Tue, 2015-07-14 at 22:21 +1000, Alexey Kardashevskiy wrote:
>> Every IOMMU has some granularity which MemoryRegionIOMMUOps::translate
>> uses when translating, however this information is not available outside
>> the translate context for various checks.
>>
>> This adds a get_page_sizes callback to MemoryRegionIOMMUOps and
>> a wrapper for it so IOMMU users (such as VFIO) can know the actual
>> page size(s) used by an IOMMU.
>>
>> TARGET_PAGE_BITS shift is used as fallback.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>   hw/ppc/spapr_iommu.c  |  8 ++++++++
>>   include/exec/memory.h | 11 +++++++++++
>>   memory.c              |  9 +++++++++
>>   3 files changed, 28 insertions(+)
>>
>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>> index f61504e..a2572c4 100644
>> --- a/hw/ppc/spapr_iommu.c
>> +++ b/hw/ppc/spapr_iommu.c
>> @@ -104,6 +104,13 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
>>       return ret;
>>   }
>>
>> +static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
>> +{
>> +    sPAPRTCETable *tcet = container_of(iommu, sPAPRTCETable, iommu);
>> +
>> +    return 1ULL << tcet->page_shift;
>> +}
>> +
>>   static int spapr_tce_table_post_load(void *opaque, int version_id)
>>   {
>>       sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
>> @@ -135,6 +142,7 @@ static const VMStateDescription vmstate_spapr_tce_table = {
>>
>>   static MemoryRegionIOMMUOps spapr_iommu_ops = {
>>       .translate = spapr_tce_translate_iommu,
>> +    .get_page_sizes = spapr_tce_get_page_sizes,
>>   };
>>
>>   static int spapr_tce_table_realize(DeviceState *dev)
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index 1394715..9ca74e3 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -152,6 +152,8 @@ typedef struct MemoryRegionIOMMUOps MemoryRegionIOMMUOps;
>>   struct MemoryRegionIOMMUOps {
>>       /* Return a TLB entry that contains a given address. */
>>       IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
>> +    /* Returns supported page sizes */
>> +    uint64_t (*get_page_sizes)(MemoryRegion *iommu);
>>   };
>>
>>   typedef struct CoalescedMemoryRange CoalescedMemoryRange;
>> @@ -552,6 +554,15 @@ static inline bool memory_region_is_romd(MemoryRegion *mr)
>>   bool memory_region_is_iommu(MemoryRegion *mr);
>>
>>   /**
>> + * memory_region_iommu_get_page_sizes: get supported page sizes in an iommu
>> + *
>> + * Returns %bitmap of supported page sizes for an iommu.
>> + *
>> + * @mr: the memory region being queried
>> + */
>> +uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr);
>> +
>> +/**
>>    * memory_region_notify_iommu: notify a change in an IOMMU translation entry.
>>    *
>>    * @mr: the memory region that was changed
>> diff --git a/memory.c b/memory.c
>> index 5a0cc66..0732763 100644
>> --- a/memory.c
>> +++ b/memory.c
>> @@ -1413,6 +1413,15 @@ bool memory_region_is_iommu(MemoryRegion *mr)
>>       return mr->iommu_ops;
>>   }
>>
>> +uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr)
>> +{
>> +    assert(memory_region_is_iommu(mr));
>> +    if (mr->iommu_ops && mr->iommu_ops->get_page_sizes) {
>> +        return mr->iommu_ops->get_page_sizes(mr);
>> +    }
>> +    return 1ULL << TARGET_PAGE_BITS;
>
> At this rate we're adding TARGET_PAGE_foo faster than Peter can remove
> them.


Well, would returning zero be better? Or qemu_real_host_page_size?



-- 
Alexey


* Re: [Qemu-devel] [RFC PATCH qemu v3 2/4] vfio: Use different page size for different IOMMU types
  2015-07-15 18:26   ` Alex Williamson
@ 2015-07-16  1:26     ` Alexey Kardashevskiy
  2015-07-16  2:51       ` Alex Williamson
  0 siblings, 1 reply; 22+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-16  1:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Peter Crosthwaite, Michael Roth, qemu-devel, qemu-ppc,
	Paolo Bonzini, David Gibson

On 07/16/2015 04:26 AM, Alex Williamson wrote:
> On Tue, 2015-07-14 at 22:21 +1000, Alexey Kardashevskiy wrote:
>> The existing memory listener is called on RAM or PCI address space
>> which implies potentially different page size. Instead of guessing
>> what page size should be used, this replaces a single IOMMU memory
>> listener by two, one per supported IOMMU type; listener callbacks
>> call the existing helpers with a known page size.
>>
>> For now, Type1 uses qemu_real_host_page_mask, sPAPR uses the page size
>> from IOMMU.
>>
>> As memory_region_iommu_get_page_sizes() asserts on non-IOMMU regions,
>> this adds memory_region_is_iommu() to the SPAPR IOMMU listener to skip
>> non IOMMU regions (which is an MSIX window) which duplicates
>> vfio_listener_skipped_section() a bit.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>   hw/vfio/common.c | 95 ++++++++++++++++++++++++++++++++++++++++++++------------
>>   1 file changed, 76 insertions(+), 19 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 85ee9b0..aad41e1 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -312,11 +312,11 @@ out:
>>       rcu_read_unlock();
>>   }
>>
>> -static void vfio_listener_region_add(MemoryListener *listener,
>> +static void vfio_listener_region_add(VFIOContainer *container,
>> +                                     hwaddr page_mask,
>
> Should memory_region_iommu_get_page_sizes() return a hwaddr?


I do not think so; memory.c uses uint64_t for masks.



>> +                                     MemoryListener *listener,
>>                                        MemoryRegionSection *section)
>>   {
>> -    VFIOContainer *container = container_of(listener, VFIOContainer,
>> -                                            iommu_data.type1.listener);
>>       hwaddr iova, end;
>>       Int128 llend;
>>       void *vaddr;
>> @@ -330,16 +330,16 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>           return;
>>       }
>>
>> -    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
>> -                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
>> +    if (unlikely((section->offset_within_address_space & ~page_mask) !=
>> +                 (section->offset_within_region & ~page_mask))) {
>>           error_report("%s received unaligned region", __func__);
>>           return;
>>       }
>>
>> -    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
>> +    iova = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
>>       llend = int128_make64(section->offset_within_address_space);
>>       llend = int128_add(llend, section->size);
>> -    llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
>> +    llend = int128_and(llend, int128_exts64(page_mask));
>>
>>       if (int128_ge(int128_make64(iova), llend)) {
>>           return;
>> @@ -416,11 +416,11 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>       }
>>   }
>>
>> -static void vfio_listener_region_del(MemoryListener *listener,
>> +static void vfio_listener_region_del(VFIOContainer *container,
>> +                                     hwaddr page_mask,
>> +                                     MemoryListener *listener,
>>                                        MemoryRegionSection *section)
>>   {
>> -    VFIOContainer *container = container_of(listener, VFIOContainer,
>> -                                            iommu_data.type1.listener);
>>       hwaddr iova, end;
>>       int ret;
>>
>> @@ -432,8 +432,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>           return;
>>       }
>>
>> -    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
>> -                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
>> +    if (unlikely((section->offset_within_address_space & ~page_mask) !=
>> +                 (section->offset_within_region & ~page_mask))) {
>>           error_report("%s received unaligned region", __func__);
>>           return;
>>       }
>> @@ -459,9 +459,9 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>            */
>>       }
>>
>> -    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
>> +    iova = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
>>       end = (section->offset_within_address_space + int128_get64(section->size)) &
>> -          TARGET_PAGE_MASK;
>> +          page_mask;
>>
>>       if (iova >= end) {
>>           return;
>> @@ -478,9 +478,66 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>       }
>>   }
>>
>> -static const MemoryListener vfio_memory_listener = {
>> -    .region_add = vfio_listener_region_add,
>> -    .region_del = vfio_listener_region_del,
>> +static void vfio_type1_iommu_listener_region_add(MemoryListener *listener,
>> +                                                 MemoryRegionSection *section)
>> +{
>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>> +                                            iommu_data.type1.listener);
>> +
>> +    vfio_listener_region_add(container, qemu_real_host_page_mask, listener,
>> +                             section);
>> +}
>> +
>> +
>> +static void vfio_type1_iommu_listener_region_del(MemoryListener *listener,
>> +                                                 MemoryRegionSection *section)
>> +{
>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>> +                                            iommu_data.type1.listener);
>> +
>> +    vfio_listener_region_del(container, qemu_real_host_page_mask, listener,
>> +                             section);
>> +}
>> +
>> +static const MemoryListener vfio_type1_iommu_listener = {
>> +    .region_add = vfio_type1_iommu_listener_region_add,
>> +    .region_del = vfio_type1_iommu_listener_region_del,
>> +};
>> +
>> +static void vfio_spapr_iommu_listener_region_add(MemoryListener *listener,
>> +                                     MemoryRegionSection *section)
>> +{
>> +    VFIOContainer *container;
>> +    hwaddr page_mask;
>> +
>> +    if (!memory_region_is_iommu(section->mr)) {
>> +        return;
>> +    }
>> +    container = container_of(listener, VFIOContainer,
>> +                             iommu_data.type1.listener);
>> +    page_mask = ~(memory_region_iommu_get_page_sizes(section->mr) - 1);
>> +    vfio_listener_region_add(container, page_mask, listener, section);
>> +}
>> +
>> +
>> +static void vfio_spapr_iommu_listener_region_del(MemoryListener *listener,
>> +                                     MemoryRegionSection *section)
>> +{
>> +    VFIOContainer *container;
>> +    hwaddr page_mask;
>> +
>> +    if (!memory_region_is_iommu(section->mr)) {
>> +        return;
>> +    }
>> +    container = container_of(listener, VFIOContainer,
>> +                             iommu_data.type1.listener);
>> +    page_mask = ~(memory_region_iommu_get_page_sizes(section->mr) - 1);
>> +    vfio_listener_region_del(container, page_mask, listener, section);
>> +}
>> +
>> +static const MemoryListener vfio_spapr_iommu_listener = {
>> +    .region_add = vfio_spapr_iommu_listener_region_add,
>> +    .region_del = vfio_spapr_iommu_listener_region_del,
>>   };
>
>
> Rather than creating all these wrappers, why don't we create a structure
> that gives us callbacks we can use for a shared function?  If we had:
>
> struct VFIOMemoryListener {
> 	struct MemoryListener listener;
> 	bool (*filter)(MemoryRegionSection *section);
> 	hwaddr (page_size)(MemoryRegionSection *section);
> 	VFIOContainer *container;
> }
>
> Then there would be no reason for you to have separate wrappers for
> spapr.  VFIOType1 would have a single VFIOMemoryListener, you could have
> an array of two.

Sorry, I am missing the point here...

I cannot just have an array of these - I will have to store an address 
space per listener in the container to implement filter(), so I'd rather 
store just an address space and not add filter() at all. And I will still 
need "memory: Add reporting of supported page sizes" - or there is no way 
to implement it all. So I still end up having 6 callbacks in 
hw/vfio/common.c. What does this win for us?


>>
>>   static void vfio_listener_release(VFIOContainer *container)
>> @@ -684,7 +741,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>               goto free_container_exit;
>>           }
>>
>> -        container->iommu_data.type1.listener = vfio_memory_listener;
>> +        container->iommu_data.type1.listener = vfio_type1_iommu_listener;
>>           container->iommu_data.release = vfio_listener_release;
>>
>>           memory_listener_register(&container->iommu_data.type1.listener,
>> @@ -724,7 +781,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>               goto free_container_exit;
>>           }
>>
>> -        container->iommu_data.type1.listener = vfio_memory_listener;
>> +        container->iommu_data.type1.listener = vfio_spapr_iommu_listener;
>>           container->iommu_data.release = vfio_listener_release;
>>
>>           memory_listener_register(&container->iommu_data.type1.listener,
>
>
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC PATCH qemu v3 2/4] vfio: Use different page size for different IOMMU types
  2015-07-16  1:26     ` Alexey Kardashevskiy
@ 2015-07-16  2:51       ` Alex Williamson
  2015-07-16  3:31         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 22+ messages in thread
From: Alex Williamson @ 2015-07-16  2:51 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Peter Crosthwaite, Michael Roth, qemu-devel, qemu-ppc,
	Paolo Bonzini, David Gibson

On Thu, 2015-07-16 at 11:26 +1000, Alexey Kardashevskiy wrote:
> On 07/16/2015 04:26 AM, Alex Williamson wrote:
> > On Tue, 2015-07-14 at 22:21 +1000, Alexey Kardashevskiy wrote:
> >> The existing memory listener is called on RAM or PCI address space
> >> which implies potentially different page size. Instead of guessing
> >> what page size should be used, this replaces a single IOMMU memory
> >> listener by two, one per supported IOMMU type; listener callbacks
> >> call the existing helpers with a known page size.
> >>
> >> For now, Type1 uses qemu_real_host_page_mask, sPAPR uses the page size
> >> from IOMMU.
> >>
> >> As memory_region_iommu_get_page_sizes() asserts on non-IOMMU regions,
> >> this adds memory_region_is_iommu() to the SPAPR IOMMU listener to skip
> >> non IOMMU regions (which is an MSIX window) which duplicates
> >> vfio_listener_skipped_section() a bit.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >>   hw/vfio/common.c | 95 ++++++++++++++++++++++++++++++++++++++++++++------------
> >>   1 file changed, 76 insertions(+), 19 deletions(-)
> >>
> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> index 85ee9b0..aad41e1 100644
> >> --- a/hw/vfio/common.c
> >> +++ b/hw/vfio/common.c
> >> @@ -312,11 +312,11 @@ out:
> >>       rcu_read_unlock();
> >>   }
> >>
> >> -static void vfio_listener_region_add(MemoryListener *listener,
> >> +static void vfio_listener_region_add(VFIOContainer *container,
> >> +                                     hwaddr page_mask,
> >
> > Should memory_region_iommu_get_page_sizes() return a hwaddr?
> 
> 
> I do not think so, memory.c uses uint64_t for masks.
> 
> 
> 
> >> +                                     MemoryListener *listener,
> >>                                        MemoryRegionSection *section)
> >>   {
> >> -    VFIOContainer *container = container_of(listener, VFIOContainer,
> >> -                                            iommu_data.type1.listener);
> >>       hwaddr iova, end;
> >>       Int128 llend;
> >>       void *vaddr;
> >> @@ -330,16 +330,16 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >>           return;
> >>       }
> >>
> >> -    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
> >> -                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
> >> +    if (unlikely((section->offset_within_address_space & ~page_mask) !=
> >> +                 (section->offset_within_region & ~page_mask))) {
> >>           error_report("%s received unaligned region", __func__);
> >>           return;
> >>       }
> >>
> >> -    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> >> +    iova = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
> >>       llend = int128_make64(section->offset_within_address_space);
> >>       llend = int128_add(llend, section->size);
> >> -    llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
> >> +    llend = int128_and(llend, int128_exts64(page_mask));
> >>
> >>       if (int128_ge(int128_make64(iova), llend)) {
> >>           return;
> >> @@ -416,11 +416,11 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >>       }
> >>   }
> >>
> >> -static void vfio_listener_region_del(MemoryListener *listener,
> >> +static void vfio_listener_region_del(VFIOContainer *container,
> >> +                                     hwaddr page_mask,
> >> +                                     MemoryListener *listener,
> >>                                        MemoryRegionSection *section)
> >>   {
> >> -    VFIOContainer *container = container_of(listener, VFIOContainer,
> >> -                                            iommu_data.type1.listener);
> >>       hwaddr iova, end;
> >>       int ret;
> >>
> >> @@ -432,8 +432,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >>           return;
> >>       }
> >>
> >> -    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
> >> -                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
> >> +    if (unlikely((section->offset_within_address_space & ~page_mask) !=
> >> +                 (section->offset_within_region & ~page_mask))) {
> >>           error_report("%s received unaligned region", __func__);
> >>           return;
> >>       }
> >> @@ -459,9 +459,9 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >>            */
> >>       }
> >>
> >> -    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> >> +    iova = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
> >>       end = (section->offset_within_address_space + int128_get64(section->size)) &
> >> -          TARGET_PAGE_MASK;
> >> +          page_mask;
> >>
> >>       if (iova >= end) {
> >>           return;
> >> @@ -478,9 +478,66 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >>       }
> >>   }
> >>
> >> -static const MemoryListener vfio_memory_listener = {
> >> -    .region_add = vfio_listener_region_add,
> >> -    .region_del = vfio_listener_region_del,
> >> +static void vfio_type1_iommu_listener_region_add(MemoryListener *listener,
> >> +                                                 MemoryRegionSection *section)
> >> +{
> >> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> >> +                                            iommu_data.type1.listener);
> >> +
> >> +    vfio_listener_region_add(container, qemu_real_host_page_mask, listener,
> >> +                             section);
> >> +}
> >> +
> >> +
> >> +static void vfio_type1_iommu_listener_region_del(MemoryListener *listener,
> >> +                                                 MemoryRegionSection *section)
> >> +{
> >> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> >> +                                            iommu_data.type1.listener);
> >> +
> >> +    vfio_listener_region_del(container, qemu_real_host_page_mask, listener,
> >> +                             section);
> >> +}
> >> +
> >> +static const MemoryListener vfio_type1_iommu_listener = {
> >> +    .region_add = vfio_type1_iommu_listener_region_add,
> >> +    .region_del = vfio_type1_iommu_listener_region_del,
> >> +};
> >> +
> >> +static void vfio_spapr_iommu_listener_region_add(MemoryListener *listener,
> >> +                                     MemoryRegionSection *section)
> >> +{
> >> +    VFIOContainer *container;
> >> +    hwaddr page_mask;
> >> +
> >> +    if (!memory_region_is_iommu(section->mr)) {
> >> +        return;
> >> +    }
> >> +    container = container_of(listener, VFIOContainer,
> >> +                             iommu_data.type1.listener);
> >> +    page_mask = ~(memory_region_iommu_get_page_sizes(section->mr) - 1);
> >> +    vfio_listener_region_add(container, page_mask, listener, section);
> >> +}
> >> +
> >> +
> >> +static void vfio_spapr_iommu_listener_region_del(MemoryListener *listener,
> >> +                                     MemoryRegionSection *section)
> >> +{
> >> +    VFIOContainer *container;
> >> +    hwaddr page_mask;
> >> +
> >> +    if (!memory_region_is_iommu(section->mr)) {
> >> +        return;
> >> +    }
> >> +    container = container_of(listener, VFIOContainer,
> >> +                             iommu_data.type1.listener);
> >> +    page_mask = ~(memory_region_iommu_get_page_sizes(section->mr) - 1);
> >> +    vfio_listener_region_del(container, page_mask, listener, section);
> >> +}
> >> +
> >> +static const MemoryListener vfio_spapr_iommu_listener = {
> >> +    .region_add = vfio_spapr_iommu_listener_region_add,
> >> +    .region_del = vfio_spapr_iommu_listener_region_del,
> >>   };
> >
> >
> > Rather than creating all these wrappers, why don't we create a structure
> > that gives us callbacks we can use for a shared function?  If we had:
> >
> > struct VFIOMemoryListener {
> > 	struct MemoryListener listener;
> > 	bool (*filter)(MemoryRegionSection *section);
> > 	hwaddr (page_size)(MemoryRegionSection *section);
> > 	VFIOContainer *container;
> > }
> >
> > Then there would be no reason for you to have separate wrappers for
> > spapr.  VFIOType1 would have a single VFIOMemoryListener, you could have
> > an array of two.
> 
> Sorry, I am missing the point here...
> 
> I cannot just have an array of these - I will have to store an address 
> spaces per a listener in a container to implement filter() so I'd rather 
> store just an address space and not add filter() at all. And I will still 

You're not storing an address space now.

> need "memory: Add reporting of supported page sizes" - or there is no way 

Just like you do now.

> to implement it all. So I still end up having 6 callbacks in 
> hw/vfio/common.c. What does this win for us?

Nothing you're suggesting adds any more callbacks.

The current filter is vfio_listener_skipped_section().  The iommu filter
is simply:

bool vfio_iommu_listener_skipped_section(mr)
{
  return vfio_listener_skipped_section(mr) || !memory_region_is_iommu(mr);
}

Notice how I didn't have to call it spapr because it's probably what any
guest iommu is going to do...

The v2 spapr filter is:

bool vfio_nodump_listener_skipped_section(mr)
{
  return vfio_listener_skipped_section(mr) || memory_region_is_skip_dump(mr);
}

Notice again that we can use generic functions and not make everything
spapr specific.

The page_size callbacks are similarly trivial, the iommu one obviously
calls memory_region_iommu_get_page_sizes, the other one uses
~qemu_real_host_page_mask.  Again, nothing spapr specific.

The container field of course points back to the container, and now
vfio_listener_region_add() simply does:

VFIOMemoryListener *vlistener = container_of(listener, VFIOMemoryListener, listener);
VFIOContainer *container = vlistener->container;
hwaddr page_mask = vlistener->page_size(mr);

if (vlistener->filter && vlistener->filter(mr)) {
    return;
}
...
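
Assembled in one place, a rough sketch of the idea (all type, field and
helper names here are assumptions pieced together from this thread, not
code from the series):

typedef struct VFIOMemoryListener {
    MemoryListener listener;
    bool (*filter)(MemoryRegionSection *section);      /* true => skip it   */
    hwaddr (*page_size)(MemoryRegionSection *section); /* returns page mask */
    VFIOContainer *container;
} VFIOMemoryListener;

static bool vfio_iommu_listener_skipped_section(MemoryRegionSection *section)
{
    return vfio_listener_skipped_section(section) ||
           !memory_region_is_iommu(section->mr);
}

static bool vfio_nodump_listener_skipped_section(MemoryRegionSection *section)
{
    return vfio_listener_skipped_section(section) ||
           memory_region_is_skip_dump(section->mr);
}

static hwaddr vfio_host_page_mask(MemoryRegionSection *section)
{
    return qemu_real_host_page_mask;
}

static hwaddr vfio_iommu_page_mask(MemoryRegionSection *section)
{
    return ~(memory_region_iommu_get_page_sizes(section->mr) - 1);
}

static void vfio_listener_region_add(MemoryListener *listener,
                                     MemoryRegionSection *section)
{
    VFIOMemoryListener *vlistener = container_of(listener,
                                                 VFIOMemoryListener, listener);
    VFIOContainer *container = vlistener->container;
    hwaddr page_mask = vlistener->page_size(section);

    if (vlistener->filter && vlistener->filter(section)) {
        return;
    }
    /* ... the existing add/del logic, parameterized by container
     *     and page_mask, goes here ... */
}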

And suddenly we get to piece together a handful of _generic_ helper
functions into a VFIOMemoryListener and we're not adding a new pair of
wrappers for everything and duplicating trivial setup.  Thanks,

Alex

> >>
> >>   static void vfio_listener_release(VFIOContainer *container)
> >> @@ -684,7 +741,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>               goto free_container_exit;
> >>           }
> >>
> >> -        container->iommu_data.type1.listener = vfio_memory_listener;
> >> +        container->iommu_data.type1.listener = vfio_type1_iommu_listener;
> >>           container->iommu_data.release = vfio_listener_release;
> >>
> >>           memory_listener_register(&container->iommu_data.type1.listener,
> >> @@ -724,7 +781,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>               goto free_container_exit;
> >>           }
> >>
> >> -        container->iommu_data.type1.listener = vfio_memory_listener;
> >> +        container->iommu_data.type1.listener = vfio_spapr_iommu_listener;
> >>           container->iommu_data.release = vfio_listener_release;
> >>
> >>           memory_listener_register(&container->iommu_data.type1.listener,
> >
> >
> >
> 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC PATCH qemu v3 2/4] vfio: Use different page size for different IOMMU types
  2015-07-16  2:51       ` Alex Williamson
@ 2015-07-16  3:31         ` Alexey Kardashevskiy
  0 siblings, 0 replies; 22+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-16  3:31 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Peter Crosthwaite, Michael Roth, qemu-devel, qemu-ppc,
	Paolo Bonzini, David Gibson

On 07/16/2015 12:51 PM, Alex Williamson wrote:
> On Thu, 2015-07-16 at 11:26 +1000, Alexey Kardashevskiy wrote:
>> On 07/16/2015 04:26 AM, Alex Williamson wrote:
>>> On Tue, 2015-07-14 at 22:21 +1000, Alexey Kardashevskiy wrote:
>>>> The existing memory listener is called on RAM or PCI address space
>>>> which implies potentially different page size. Instead of guessing
>>>> what page size should be used, this replaces a single IOMMU memory
>>>> listener by two, one per supported IOMMU type; listener callbacks
>>>> call the existing helpers with a known page size.
>>>>
>>>> For now, Type1 uses qemu_real_host_page_mask, sPAPR uses the page size
>>>> from IOMMU.
>>>>
>>>> As memory_region_iommu_get_page_sizes() asserts on non-IOMMU regions,
>>>> this adds memory_region_is_iommu() to the SPAPR IOMMU listener to skip
>>>> non IOMMU regions (which is an MSIX window) which duplicates
>>>> vfio_listener_skipped_section() a bit.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>>    hw/vfio/common.c | 95 ++++++++++++++++++++++++++++++++++++++++++++------------
>>>>    1 file changed, 76 insertions(+), 19 deletions(-)
>>>>
>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>> index 85ee9b0..aad41e1 100644
>>>> --- a/hw/vfio/common.c
>>>> +++ b/hw/vfio/common.c
>>>> @@ -312,11 +312,11 @@ out:
>>>>        rcu_read_unlock();
>>>>    }
>>>>
>>>> -static void vfio_listener_region_add(MemoryListener *listener,
>>>> +static void vfio_listener_region_add(VFIOContainer *container,
>>>> +                                     hwaddr page_mask,
>>>
>>> Should memory_region_iommu_get_page_sizes() return a hwaddr?
>>
>>
>> I do not think so, memory.c uses uint64_t for masks.
>>
>>
>>
>>>> +                                     MemoryListener *listener,
>>>>                                         MemoryRegionSection *section)
>>>>    {
>>>> -    VFIOContainer *container = container_of(listener, VFIOContainer,
>>>> -                                            iommu_data.type1.listener);
>>>>        hwaddr iova, end;
>>>>        Int128 llend;
>>>>        void *vaddr;
>>>> @@ -330,16 +330,16 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>>>            return;
>>>>        }
>>>>
>>>> -    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
>>>> -                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
>>>> +    if (unlikely((section->offset_within_address_space & ~page_mask) !=
>>>> +                 (section->offset_within_region & ~page_mask))) {
>>>>            error_report("%s received unaligned region", __func__);
>>>>            return;
>>>>        }
>>>>
>>>> -    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
>>>> +    iova = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
>>>>        llend = int128_make64(section->offset_within_address_space);
>>>>        llend = int128_add(llend, section->size);
>>>> -    llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
>>>> +    llend = int128_and(llend, int128_exts64(page_mask));
>>>>
>>>>        if (int128_ge(int128_make64(iova), llend)) {
>>>>            return;
>>>> @@ -416,11 +416,11 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>>>        }
>>>>    }
>>>>
>>>> -static void vfio_listener_region_del(MemoryListener *listener,
>>>> +static void vfio_listener_region_del(VFIOContainer *container,
>>>> +                                     hwaddr page_mask,
>>>> +                                     MemoryListener *listener,
>>>>                                         MemoryRegionSection *section)
>>>>    {
>>>> -    VFIOContainer *container = container_of(listener, VFIOContainer,
>>>> -                                            iommu_data.type1.listener);
>>>>        hwaddr iova, end;
>>>>        int ret;
>>>>
>>>> @@ -432,8 +432,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>>>            return;
>>>>        }
>>>>
>>>> -    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
>>>> -                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
>>>> +    if (unlikely((section->offset_within_address_space & ~page_mask) !=
>>>> +                 (section->offset_within_region & ~page_mask))) {
>>>>            error_report("%s received unaligned region", __func__);
>>>>            return;
>>>>        }
>>>> @@ -459,9 +459,9 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>>>             */
>>>>        }
>>>>
>>>> -    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
>>>> +    iova = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
>>>>        end = (section->offset_within_address_space + int128_get64(section->size)) &
>>>> -          TARGET_PAGE_MASK;
>>>> +          page_mask;
>>>>
>>>>        if (iova >= end) {
>>>>            return;
>>>> @@ -478,9 +478,66 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>>>        }
>>>>    }
>>>>
>>>> -static const MemoryListener vfio_memory_listener = {
>>>> -    .region_add = vfio_listener_region_add,
>>>> -    .region_del = vfio_listener_region_del,
>>>> +static void vfio_type1_iommu_listener_region_add(MemoryListener *listener,
>>>> +                                                 MemoryRegionSection *section)
>>>> +{
>>>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>>>> +                                            iommu_data.type1.listener);
>>>> +
>>>> +    vfio_listener_region_add(container, qemu_real_host_page_mask, listener,
>>>> +                             section);
>>>> +}
>>>> +
>>>> +
>>>> +static void vfio_type1_iommu_listener_region_del(MemoryListener *listener,
>>>> +                                                 MemoryRegionSection *section)
>>>> +{
>>>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>>>> +                                            iommu_data.type1.listener);
>>>> +
>>>> +    vfio_listener_region_del(container, qemu_real_host_page_mask, listener,
>>>> +                             section);
>>>> +}
>>>> +
>>>> +static const MemoryListener vfio_type1_iommu_listener = {
>>>> +    .region_add = vfio_type1_iommu_listener_region_add,
>>>> +    .region_del = vfio_type1_iommu_listener_region_del,
>>>> +};
>>>> +
>>>> +static void vfio_spapr_iommu_listener_region_add(MemoryListener *listener,
>>>> +                                     MemoryRegionSection *section)
>>>> +{
>>>> +    VFIOContainer *container;
>>>> +    hwaddr page_mask;
>>>> +
>>>> +    if (!memory_region_is_iommu(section->mr)) {
>>>> +        return;
>>>> +    }
>>>> +    container = container_of(listener, VFIOContainer,
>>>> +                             iommu_data.type1.listener);
>>>> +    page_mask = ~(memory_region_iommu_get_page_sizes(section->mr) - 1);
>>>> +    vfio_listener_region_add(container, page_mask, listener, section);
>>>> +}
>>>> +
>>>> +
>>>> +static void vfio_spapr_iommu_listener_region_del(MemoryListener *listener,
>>>> +                                     MemoryRegionSection *section)
>>>> +{
>>>> +    VFIOContainer *container;
>>>> +    hwaddr page_mask;
>>>> +
>>>> +    if (!memory_region_is_iommu(section->mr)) {
>>>> +        return;
>>>> +    }
>>>> +    container = container_of(listener, VFIOContainer,
>>>> +                             iommu_data.type1.listener);
>>>> +    page_mask = ~(memory_region_iommu_get_page_sizes(section->mr) - 1);
>>>> +    vfio_listener_region_del(container, page_mask, listener, section);
>>>> +}
>>>> +
>>>> +static const MemoryListener vfio_spapr_iommu_listener = {
>>>> +    .region_add = vfio_spapr_iommu_listener_region_add,
>>>> +    .region_del = vfio_spapr_iommu_listener_region_del,
>>>>    };
>>>
>>>
>>> Rather than creating all these wrappers, why don't we create a structure
>>> that gives us callbacks we can use for a shared function?  If we had:
>>>
>>> struct VFIOMemoryListener {
>>> 	struct MemoryListener listener;
>>> 	bool (*filter)(MemoryRegionSection *section);
>>> 	hwaddr (page_size)(MemoryRegionSection *section);
>>> 	VFIOContainer *container;
>>> }
>>>
>>> Then there would be no reason for you to have separate wrappers for
>>> spapr.  VFIOType1 would have a single VFIOMemoryListener, you could have
>>> an array of two.
>>
>> Sorry, I am missing the point here...
>>
>> I cannot just have an array of these - I will have to store an address
>> spaces per a listener in a container to implement filter() so I'd rather
>> store just an address space and not add filter() at all. And I will still
>
> You're not storing an address space now.


I am, indirectly, when registering the listener with a filter which is never 
NULL and which is passed to the listener. If not for that, the whole thing 
would be simpler. The rest is quite clear now - you are right.



>> need "memory: Add reporting of supported page sizes" - or there is no way
>
> Just like you do now.
>
>> to implement it all. So I still end up having 6 callbacks in
>> hw/vfio/common.c. What does this win for us?
>
> Nothing you're suggesting adds any more callbacks.
>
> The current filter is vfio_listener_skipped_section().  The iommu filter
> is simply:
>
> bool vfio_iommu_listener_skipped_section(mr)
> {
>    return vfio_listener_skipped_section(mr) || !memory_region_is_iommu(mr);
> }
>
> Notice how I didn't have to call it spapr because it's probably what any
> guest iommu is going to do...
>
> The v2 spapr filter is:
>
> bool vfio_nodump_listener_skipped_section(mr)
> {
>    return vfio_listener_skipped_section(mr) || memory_region_is_skip_dump(mr);
> }
>
> Notice again that we can use generic functions and not make everything
> spapr specific.


Ok, we add 2 (not 3) as vfio_listener_skipped_section() is still there.
You got +1.


> The page_size callbacks are similarly trivial, the iommu one obviously
> calls memory_region_iommu_get_page_sizes, the other one uses
> ~qemu_real_host_page_mask.  Again, nothing spapr specific.


These are trivial but I need all 3 - for the type1 IOMMU (real page size), 
the sPAPR IOMMU (TCE page size) and sPAPR memory pre-reg (real page size).
Ok, I can reuse the 1st for the 3rd and get one less. +2.


> The container of course points back to the container and now
> vfio_listener_region_add() simply does:
>
> VFIOMemoryListener vlistener = container_of(listener, VFIOMemoryListener, listener);
> VFIOContainer = vlistener->container;
> hwaddr page_mask = vlistener->page_size(mr);
>
> if (vlistener->filter && vlistener->filter(mr)) {
>      return
> }
> ...
>
> And suddenly we get to piece together a handful of _generic_ helper
> functions into a VFIOMemoryListener and we're not adding a new pair of
> wrappers for everything and duplicating trivial setup.  Thanks,

Ok, we are saving 2 callbacks and making the others smaller. Makes sense.
Thanks for chewing through (right word here?) this stuff for me :)


>
> Alex
>
>>>>
>>>>    static void vfio_listener_release(VFIOContainer *container)
>>>> @@ -684,7 +741,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>>>                goto free_container_exit;
>>>>            }
>>>>
>>>> -        container->iommu_data.type1.listener = vfio_memory_listener;
>>>> +        container->iommu_data.type1.listener = vfio_type1_iommu_listener;
>>>>            container->iommu_data.release = vfio_listener_release;
>>>>
>>>>            memory_listener_register(&container->iommu_data.type1.listener,
>>>> @@ -724,7 +781,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>>>                goto free_container_exit;
>>>>            }
>>>>
>>>> -        container->iommu_data.type1.listener = vfio_memory_listener;
>>>> +        container->iommu_data.type1.listener = vfio_spapr_iommu_listener;
>>>>            container->iommu_data.release = vfio_listener_release;
>>>>
>>>>            memory_listener_register(&container->iommu_data.type1.listener,
>>>
>>>
>>>
>>
>>
>
>
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC PATCH qemu v3 4/4] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-14 12:21 ` [Qemu-devel] [RFC PATCH qemu v3 4/4] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
@ 2015-07-16  5:11   ` David Gibson
  2015-07-16 14:44     ` Alex Williamson
  2015-07-17  7:13     ` Alexey Kardashevskiy
  0 siblings, 2 replies; 22+ messages in thread
From: David Gibson @ 2015-07-16  5:11 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Peter Crosthwaite, qemu-devel, Michael Roth, Alex Williamson,
	qemu-ppc, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 3821 bytes --]

On Tue, Jul 14, 2015 at 10:21:54PM +1000, Alexey Kardashevskiy wrote:
> This makes use of the new "memory registering" feature. The idea is
> to provide the userspace ability to notify the host kernel about pages
> which are going to be used for DMA. Having this information, the host
> kernel can pin them all once per user process, do locked pages
> accounting (once) and not spent time on doing that in real time with
> possible failures which cannot be handled nicely in some cases.
> 
> This adds a guest RAM memory listener which notifies a VFIO container
> about memory which needs to be pinned/unpinned. VFIO MMIO regions
> (i.e. "skip dump" regions) are skipped.
> 
> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> not call it when v2 is detected and enabled.
> 
> This does not change the guest visible interface.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

I've looked at this in more depth now, and attempting to unify the
pre-reg and mapping listeners like this can't work - they need to be
listening on different address spaces:  mapping actions need to be
listening on the PCI address space, whereas the pre-reg needs to be
listening on address_space_memory.  For x86 - for now - those end up
being the same thing, but on Power they're not.

We do need to be clear about what differences are due to the presence
of a guest IOMMU versus which are due to arch or underlying IOMMU
type.  For now Power has a guest IOMMU and x86 doesn't, but that could
well change in future: we could well implement the guest side IOMMU
for x86 in future (or x86 could invent a paravirt IOMMU interface).
On the other side, BenH's experimental powernv machine type could
introduce Power machines without a guest side IOMMU (or at least an
optional guest side IOMMU).

The quick and dirty approach here is:
   1. Leave the main listener as is
   2. Add a new pre-reg notifier to the spapr iommu specific code,
      which listens on address_space_memory, *not* the PCI space
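
For concreteness, a sketch of what that second step might look like (the
container field and callback names are assumptions; the ioctl and its
structure are the SPAPR v2 memory-registering interface the patch relies on):

static void vfio_prereg_listener_region_add(MemoryListener *listener,
                                            MemoryRegionSection *section)
{
    VFIOContainer *container = container_of(listener, VFIOContainer,
                                            iommu_data.spapr.prereg_listener);
    struct vfio_iommu_spapr_register_memory reg;

    if (!memory_region_is_ram(section->mr) ||
        memory_region_is_skip_dump(section->mr)) {
        return;
    }

    reg.argsz = sizeof(reg);
    reg.flags = 0;
    reg.vaddr = (uintptr_t)memory_region_get_ram_ptr(section->mr) +
                section->offset_within_region;
    reg.size = int128_get64(section->size);

    if (ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg)) {
        error_report("vfio: RAM pre-registration failed: %m");
    }
}

/* region_del would mirror this with VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY.
 * The listener is registered against guest RAM, not the PCI space: */
memory_listener_register(&container->iommu_data.spapr.prereg_listener,
                         &address_space_memory);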

The more generally correct approach, which allows for more complex
IOMMU arrangements and the possibility of new IOMMU types with pre-reg
is:
   1. Have the core implement both a mapping listener and a pre-reg
      listener (optionally enabled by a per-iommu-type flag).
      Basically the first one sees what *is* mapped, the second sees
      what *could* be mapped.

   2. As now, the mapping listener listens on PCI address space, if
      RAM blocks are added, immediately map them into the host IOMMU,
      if guest IOMMU blocks appear register a notifier which will
      mirror guest IOMMU mappings to the host IOMMU (this is what we
      do now).

   3. The pre-reg listener also listens on the PCI address space.  RAM
      blocks added are pre-registered immediately.  But, if guest
      IOMMU blocks are added, instead of registering a guest-iommu
      notifier, we register another listener on the *target* AS of the
      guest IOMMU, same callbacks as this one.  In practice that
      target AS will almost always resolve to address_space_memory,
      but this can at least in theory handle crazy guest setups with
      multiple layers of IOMMU.

   4. Have to ensure that the pre-reg callbacks always happen before
      the mapping calls.  For a system with an IOMMU backend which
      requires pre-registration, but doesn't have a guest IOMMU, we
      need to pre-reg, then host-iommu-map RAM blocks that appear in
      PCI address space.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC PATCH qemu v3 4/4] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-16  5:11   ` David Gibson
@ 2015-07-16 14:44     ` Alex Williamson
  2015-07-17  5:20       ` David Gibson
  2015-07-17  7:13     ` Alexey Kardashevskiy
  1 sibling, 1 reply; 22+ messages in thread
From: Alex Williamson @ 2015-07-16 14:44 UTC (permalink / raw)
  To: David Gibson
  Cc: Peter Crosthwaite, Alexey Kardashevskiy, qemu-devel,
	Michael Roth, qemu-ppc, Paolo Bonzini

On Thu, 2015-07-16 at 15:11 +1000, David Gibson wrote:
> On Tue, Jul 14, 2015 at 10:21:54PM +1000, Alexey Kardashevskiy wrote:
> > This makes use of the new "memory registering" feature. The idea is
> > to provide the userspace ability to notify the host kernel about pages
> > which are going to be used for DMA. Having this information, the host
> > kernel can pin them all once per user process, do locked pages
> > accounting (once) and not spent time on doing that in real time with
> > possible failures which cannot be handled nicely in some cases.
> > 
> > This adds a guest RAM memory listener which notifies a VFIO container
> > about memory which needs to be pinned/unpinned. VFIO MMIO regions
> > (i.e. "skip dump" regions) are skipped.
> > 
> > The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> > are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> > not call it when v2 is detected and enabled.
> > 
> > This does not change the guest visible interface.
> > 
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> 
> I've looked at this in more depth now, and attempting to unify the
> pre-reg and mapping listeners like this can't work - they need to be
> listening on different address spaces:  mapping actions need to be
> listening on the PCI address space, whereas the pre-reg needs to be
> listening on address_space_memory.  For x86 - for now - those end up
> being the same thing, but on Power they're not.
> 
> We do need to be clear about what differences are due to the presence
> of a guest IOMMU versus which are due to arch or underlying IOMMU
> type.  For now Power has a guest IOMMU and x86 doesn't, but that could
> well change in future: we could well implement the guest side IOMMU
> for x86 in future (or x86 could invent a paravirt IOMMU interface).
> On the other side, BenH's experimental powernv machine type could
> introduce Power machines without a guest side IOMMU (or at least an
> optional guest side IOMMU).
> 
> The quick and dirty approach here is:
>    1. Leave the main listener as is
>    2. Add a new pre-reg notifier to the spapr iommu specific code,
>       which listens on address_space_memory, *not* the PCI space

It is dirty and that's exactly what I've been advising Alexey against
because we have entirely too much dirty spapr specific code that doesn't
need to be spapr specific.  I don't see why a separate address space
matters; that's done at the point of registering the listener and so far
doesn't play much of a role in the actual listener behavior, just which
regions it sees.
> 
> The more generally correct approach, which allows for more complex
> IOMMU arrangements and the possibility of new IOMMU types with pre-reg
> is:
>    1. Have the core implement both a mapping listener and a pre-reg
>       listener (optionally enabled by a per-iommu-type flag).
>       Basically the first one sees what *is* mapped, the second sees
>       what *could* be mapped.

This just sounds like different address spaces, address_space_memory vs
address_space_physical_memory.

> 
>    2. As now, the mapping listener listens on PCI address space, if
>       RAM blocks are added, immediately map them into the host IOMMU,
>       if guest IOMMU blocks appear register a notifier which will
>       mirror guest IOMMU mappings to the host IOMMU (this is what we
>       do now).

Right, this is done now, nothing new required.

>    3. The pre-reg listener also listens on the PCI address space.  RAM
>       blocks added are pre-registered immediately.  But, if guest
>       IOMMU blocks are added, instead of registering a guest-iommu
>       notifier, we register another listener on the *target* AS of the
>       guest IOMMU, same callbacks as this one.  In practice that
>       target AS will almost always resolve to address_space_memory,
>       but this can at least in theory handle crazy guest setups with
>       multiple layers of IOMMU.
> 
>    4. Have to ensure that the pre-reg callbacks always happen before
>       the mapping calls.  For a system with an IOMMU backend which
>       requires pre-registration, but doesn't have a guest IOMMU, we
>       need to pre-reg, then host-iommu-map RAM blocks that appear in
>       PCI address space.

Both of these just seem like details of the iommu-type (pre-reg)
specific parts of the listener; I'm not understanding what's
fundamentally different about it that we can't do it now, within a
single listener, nor do I see how it's all that different from x86.  x86
mappings are more dynamic, but that's largely in legacy space for x86
due to some bizarre compatibility things.  x86 also tries to map mmio
regions for p2p, but these are just minor details to mappings that are
largely the same.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC PATCH qemu v3 4/4] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-16 14:44     ` Alex Williamson
@ 2015-07-17  5:20       ` David Gibson
  2015-07-17 18:25         ` Alex Williamson
  0 siblings, 1 reply; 22+ messages in thread
From: David Gibson @ 2015-07-17  5:20 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Peter Crosthwaite, Alexey Kardashevskiy, qemu-devel,
	Michael Roth, qemu-ppc, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 6964 bytes --]

On Thu, Jul 16, 2015 at 08:44:59AM -0600, Alex Williamson wrote:
> On Thu, 2015-07-16 at 15:11 +1000, David Gibson wrote:
> > On Tue, Jul 14, 2015 at 10:21:54PM +1000, Alexey Kardashevskiy wrote:
> > > This makes use of the new "memory registering" feature. The idea is
> > > to provide the userspace ability to notify the host kernel about pages
> > > which are going to be used for DMA. Having this information, the host
> > > kernel can pin them all once per user process, do locked pages
> > > accounting (once) and not spent time on doing that in real time with
> > > possible failures which cannot be handled nicely in some cases.
> > > 
> > > This adds a guest RAM memory listener which notifies a VFIO container
> > > about memory which needs to be pinned/unpinned. VFIO MMIO regions
> > > (i.e. "skip dump" regions) are skipped.
> > > 
> > > The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> > > are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> > > not call it when v2 is detected and enabled.
> > > 
> > > This does not change the guest visible interface.
> > > 
> > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > 
> > I've looked at this in more depth now, and attempting to unify the
> > pre-reg and mapping listeners like this can't work - they need to be
> > listening on different address spaces:  mapping actions need to be
> > listening on the PCI address space, whereas the pre-reg needs to be
> > listening on address_space_memory.  For x86 - for now - those end up
> > being the same thing, but on Power they're not.
> > 
> > We do need to be clear about what differences are due to the presence
> > of a guest IOMMU versus which are due to arch or underlying IOMMU
> > type.  For now Power has a guest IOMMU and x86 doesn't, but that could
> > well change in future: we could well implement the guest side IOMMU
> > for x86 in future (or x86 could invent a paravirt IOMMU interface).
> > On the other side, BenH's experimental powernv machine type could
> > introduce Power machines without a guest side IOMMU (or at least an
> > optional guest side IOMMU).
> > 
> > The quick and dirty approach here is:
> >    1. Leave the main listener as is
> >    2. Add a new pre-reg notifier to the spapr iommu specific code,
> >       which listens on address_space_memory, *not* the PCI space
> 
> It is dirty and that's exactly what I've been advising Alexey against
> because we have entirely too much dirty spapr specific code that doesn't
> need to be spapr specific.  I don't see why separate address space
> matters, that's done at the point of registering the listener and so far
> doesn't play much a role in the actual listener behavior, just which
> regions it sees.

Well, there are two parts to this - the different address spaces mean
they need to be different listener instances.  They also need
different callbacks - or at least parameterized callback behaviour
because they do different things (one maps, the other preregs).

So I really don't see any sense in which these can be accomplished by
the same listener.  *Maybe* they could share some region walking code,
but I'm not sure there's going to be anything of significant size here.

> > The more generally correct approach, which allows for more complex
> > IOMMU arrangements and the possibility of new IOMMU types with pre-reg
> > is:
> >    1. Have the core implement both a mapping listener and a pre-reg
> >       listener (optionally enabled by a per-iommu-type flag).
> >       Basically the first one sees what *is* mapped, the second sees
> >       what *could* be mapped.
> 
> This just sounds like different address spaces, address_space_memory vs
> address_space_physical_memory

Um.. what?  I'm not even sure what you mean by
address_space_physical_memory (I see no such thing in the source).

The idea was that the (top level) pre-reg listener would spawn more
listeners for any AS which could get (partially) mapped into the PCI
addres space.

But.. I looked a bit closer and realised this scheme doesn't actually
work.  IOMMU memory regions don't actually have a fixed target AS
property (by which I mean the address space the IOMMU translates
*into* rather than translates from - address_space_memory in most
cases).  Instead any individual IOMMU mapping can point to a different
AS supplied in the IOMMUTLB structure.
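
(For reference, the per-mapping target AS is carried in IOMMUTLBEntry -
reproduced approximately here, so treat the exact layout as an assumption:)

struct IOMMUTLBEntry {
    AddressSpace    *target_as;        /* AS this mapping translates into */
    hwaddr           iova;
    hwaddr           translated_addr;
    hwaddr           addr_mask;        /* 0xfff = 4k translation */
    IOMMUAccessFlags perm;
};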

> >    2. As now, the mapping listener listens on PCI address space, if
> >       RAM blocks are added, immediately map them into the host IOMMU,
> >       if guest IOMMU blocks appear register a notifier which will
> >       mirror guest IOMMU mappings to the host IOMMU (this is what we
> >       do now).
> 
> Right, this is done now, nothing new required.

Yes, I was just spelling that out for comparison with the other part.

> >    3. The pre-reg listener also listens on the PCI address space.  RAM
> >       blocks added are pre-registered immediately.  But, if guest
> >       IOMMU blocks are added, instead of registering a guest-iommu
> >       notifier, we register another listener on the *target* AS of the
> >       guest IOMMU, same callbacks as this one.  In practice that
> >       target AS will almost always resolve to address_space_memory,
> >       but this can at least in theory handle crazy guest setups with
> >       multiple layers of IOMMU.
> > 
> >    4. Have to ensure that the pre-reg callbacks always happen before
> >       the mapping calls.  For a system with an IOMMU backend which
> >       requires pre-registration, but doesn't have a guest IOMMU, we
> >       need to pre-reg, then host-iommu-map RAM blocks that appear in
> >       PCI address space.

> Both of these just seem like details of the iommu-type (pre-reg)
> specific parts of the listener, I'm not understanding what's
> fundamentally different about it that we can't do it now, within a
> single listener, nor do I see how it's all that different from x86.

So, we have two different address spaces to listen on (PCI and
address_space_memory), so we need two different listener instances
registered, clearly.

Those two different listeners need to do different things.  1) maps
memory blocks into VFIO and sets up a notifier to mirror guest IOMMU
mappings into VFIO.  2) preregisters memory blocks, and it would be a
bug if it ever saw a guest IOMMU.

So, what's left to share?

> x86
> mappings are more dynamic, but that's largely in legacy space for x86
> due to some bizarre compatibility things.  x86 also tries to map mmio
> regions for p2p, but these are just minor details to mappings that are
> largely the same.  Thanks,
> 
> Alex
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC PATCH qemu v3 4/4] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-16  5:11   ` David Gibson
  2015-07-16 14:44     ` Alex Williamson
@ 2015-07-17  7:13     ` Alexey Kardashevskiy
  2015-07-17 13:39       ` David Gibson
  1 sibling, 1 reply; 22+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-17  7:13 UTC (permalink / raw)
  To: David Gibson
  Cc: Peter Crosthwaite, qemu-devel, Michael Roth, Alex Williamson,
	qemu-ppc, Paolo Bonzini

On 07/16/2015 03:11 PM, David Gibson wrote:
> On Tue, Jul 14, 2015 at 10:21:54PM +1000, Alexey Kardashevskiy wrote:
>> This makes use of the new "memory registering" feature. The idea is
>> to provide the userspace ability to notify the host kernel about pages
>> which are going to be used for DMA. Having this information, the host
>> kernel can pin them all once per user process, do locked pages
>> accounting (once) and not spent time on doing that in real time with
>> possible failures which cannot be handled nicely in some cases.
>>
>> This adds a guest RAM memory listener which notifies a VFIO container
>> about memory which needs to be pinned/unpinned. VFIO MMIO regions
>> (i.e. "skip dump" regions) are skipped.
>>
>> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
>> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
>> not call it when v2 is detected and enabled.
>>
>> This does not change the guest visible interface.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>
> I've looked at this in more depth now, and attempting to unify the
> pre-reg and mapping listeners like this can't work - they need to be
> listening on different address spaces:  mapping actions need to be
> listening on the PCI address space, whereas the pre-reg needs to be
> listening on address_space_memory.  For x86 - for now - those end up
> being the same thing, but on Power they're not.
>
> We do need to be clear about what differences are due to the presence
> of a guest IOMMU versus which are due to arch or underlying IOMMU
> type.  For now Power has a guest IOMMU and x86 doesn't, but that could
> well change in future: we could well implement the guest side IOMMU
> for x86 in future (or x86 could invent a paravirt IOMMU interface).
> On the other side, BenH's experimental powernv machine type could
> introduce Power machines without a guest side IOMMU (or at least an
> optional guest side IOMMU).
>
> The quick and dirty approach here is:
>     1. Leave the main listener as is
>     2. Add a new pre-reg notifier to the spapr iommu specific code,
>        which listens on address_space_memory, *not* the PCI space
>
> The more generally correct approach, which allows for more complex
> IOMMU arrangements and the possibility of new IOMMU types with pre-reg
> is:
>     1. Have the core implement both a mapping listener and a pre-reg
>        listener (optionally enabled by a per-iommu-type flag).
>        Basically the first one sees what *is* mapped, the second sees
>        what *could* be mapped.
>
>     2. As now, the mapping listener listens on PCI address space, if
>        RAM blocks are added, immediately map them into the host IOMMU,
>        if guest IOMMU blocks appear register a notifier which will
>        mirror guest IOMMU mappings to the host IOMMU (this is what we
>        do now).
>
>     3. The pre-reg listener also listens on the PCI address space.  RAM
>        blocks added are pre-registered immediately.


PCI address space listeners won't be notified about RAM blocks on sPAPR.


> But, if guest
>        IOMMU blocks are added, instead of registering a guest-iommu
>        notifier,

"guest-iommu notifier" is the one called via memory_region_notify_iommu() 
from H_PUT_TCE? "Instead" implies dropping it - how can this work?


> we register another listener on the *target* AS of the
>        guest IOMMU, same callbacks as this one.  In practice that
>        target AS will almost always resolve to address_space_memory,
>        but this can at least in theory handle crazy guest setups with
>        multiple layers of IOMMU.
>
>     4. Have to ensure that the pre-reg callbacks always happen before
>        the mapping calls.  For a system with an IOMMU backend which
>        requires pre-registration, but doesn't have a guest IOMMU, we
>        need to pre-reg, then host-iommu-map RAM blocks that appear in
>        PCI address space.




-- 
Alexey

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC PATCH qemu v3 4/4] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-17  7:13     ` Alexey Kardashevskiy
@ 2015-07-17 13:39       ` David Gibson
  2015-07-17 15:47         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 22+ messages in thread
From: David Gibson @ 2015-07-17 13:39 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Peter Crosthwaite, qemu-devel, Michael Roth, Alex Williamson,
	qemu-ppc, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 5194 bytes --]

On Fri, Jul 17, 2015 at 05:13:37PM +1000, Alexey Kardashevskiy wrote:
> On 07/16/2015 03:11 PM, David Gibson wrote:
> >On Tue, Jul 14, 2015 at 10:21:54PM +1000, Alexey Kardashevskiy wrote:
> >>This makes use of the new "memory registering" feature. The idea is
> >>to provide the userspace ability to notify the host kernel about pages
> >>which are going to be used for DMA. Having this information, the host
> >>kernel can pin them all once per user process, do locked pages
> >>accounting (once) and not spent time on doing that in real time with
> >>possible failures which cannot be handled nicely in some cases.
> >>
> >>This adds a guest RAM memory listener which notifies a VFIO container
> >>about memory which needs to be pinned/unpinned. VFIO MMIO regions
> >>(i.e. "skip dump" regions) are skipped.
> >>
> >>The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> >>are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> >>not call it when v2 is detected and enabled.
> >>
> >>This does not change the guest visible interface.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >
> >I've looked at this in more depth now, and attempting to unify the
> >pre-reg and mapping listeners like this can't work - they need to be
> >listening on different address spaces:  mapping actions need to be
> >listening on the PCI address space, whereas the pre-reg needs to be
> >listening on address_space_memory.  For x86 - for now - those end up
> >being the same thing, but on Power they're not.
> >
> >We do need to be clear about what differences are due to the presence
> >of a guest IOMMU versus which are due to arch or underlying IOMMU
> >type.  For now Power has a guest IOMMU and x86 doesn't, but that could
> >well change in future: we could well implement the guest side IOMMU
> >for x86 in future (or x86 could invent a paravirt IOMMU interface).
> >On the other side, BenH's experimental powernv machine type could
> >introduce Power machines without a guest side IOMMU (or at least an
> >optional guest side IOMMU).
> >
> >The quick and dirty approach here is:
> >    1. Leave the main listener as is
> >    2. Add a new pre-reg notifier to the spapr iommu specific code,
> >       which listens on address_space_memory, *not* the PCI space
> >
> >The more generally correct approach, which allows for more complex
> >IOMMU arrangements and the possibility of new IOMMU types with pre-reg
> >is:
> >    1. Have the core implement both a mapping listener and a pre-reg
> >       listener (optionally enabled by a per-iommu-type flag).
> >       Basically the first one sees what *is* mapped, the second sees
> >       what *could* be mapped.
> >
> >    2. As now, the mapping listener listens on PCI address space, if
> >       RAM blocks are added, immediately map them into the host IOMMU,
> >       if guest IOMMU blocks appear register a notifier which will
> >       mirror guest IOMMU mappings to the host IOMMU (this is what we
> >       do now).
> >
> >    3. The pre-reg listener also listens on the PCI address space.  RAM
> >       blocks added are pre-registered immediately.
> 
> 
> PCI address space listeners won't be notified about RAM blocks on sPAPR.

Sure they will - if any RAM blocks were mapped directly into PCI
address space, the listener would be notified.  It's just that no RAM
blocks are directly mapped into PCI space, only partially mapped in
via IOMMU blocks.

But the idea is this scheme could handle a platform that has both a
"bypass" DMA window which maps directly onto a block of ram and an
IOMMU controlled DMA window.  Or one which could have either setup
depending on circumstances (which is probably true of BenH's "powernv"
machine type).

> >But, if guest
> >       IOMMU blocks are added, instead of registering a guest-iommu
> >       notifier,
> 
> "guest-iommu notifier" is the one called via memory_region_notify_iommu()
> from H_PUT_TCE? "Instead" implies dropping it, how can this work?

Because the other listener - the mapping listener at (2) handles that
part.  The pre-reg listener doesn't.

But as noted in my other mail, this whole scheme doesn't work without a
way to discover an IOMMU region's target AS in advance, which doesn't
currently exist.

> >we register another listener on the *target* AS of the
> >       guest IOMMU, same callbacks as this one.  In practice that
> >       target AS will almost always resolve to address_space_memory,
> >       but this can at least in theory handle crazy guest setups with
> >       multiple layers of IOMMU.
> >
> >    4. Have to ensure that the pre-reg callbacks always happen before
> >       the mapping calls.  For a system with an IOMMU backend which
> >       requires pre-registration, but doesn't have a guest IOMMU, we
> >       need to pre-reg, then host-iommu-map RAM blocks that appear in
> >       PCI address space.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC PATCH qemu v3 4/4] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-17 13:39       ` David Gibson
@ 2015-07-17 15:47         ` Alexey Kardashevskiy
  2015-07-18 15:17           ` David Gibson
  0 siblings, 1 reply; 22+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-17 15:47 UTC (permalink / raw)
  To: David Gibson
  Cc: Peter Crosthwaite, qemu-devel, Michael Roth, Alex Williamson,
	qemu-ppc, Paolo Bonzini

On 07/17/2015 11:39 PM, David Gibson wrote:
> On Fri, Jul 17, 2015 at 05:13:37PM +1000, Alexey Kardashevskiy wrote:
>> On 07/16/2015 03:11 PM, David Gibson wrote:
>>> On Tue, Jul 14, 2015 at 10:21:54PM +1000, Alexey Kardashevskiy wrote:
>>>> This makes use of the new "memory registering" feature. The idea is
>>>> to provide the userspace ability to notify the host kernel about pages
>>>> which are going to be used for DMA. Having this information, the host
>>>> kernel can pin them all once per user process, do locked pages
>>>> accounting (once) and not spend time on doing that in real time with
>>>> possible failures which cannot be handled nicely in some cases.
>>>>
>>>> This adds a guest RAM memory listener which notifies a VFIO container
>>>> about memory which needs to be pinned/unpinned. VFIO MMIO regions
>>>> (i.e. "skip dump" regions) are skipped.
>>>>
>>>> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
>>>> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
>>>> not call it when v2 is detected and enabled.
>>>>
>>>> This does not change the guest visible interface.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>
>>> I've looked at this in more depth now, and attempting to unify the
>>> pre-reg and mapping listeners like this can't work - they need to be
>>> listening on different address spaces:  mapping actions need to be
>>> listening on the PCI address space, whereas the pre-reg needs to be
>>> listening on address_space_memory.  For x86 - for now - those end up
>>> being the same thing, but on Power they're not.
>>>
>>> We do need to be clear about what differences are due to the presence
>>> of a guest IOMMU versus which are due to arch or underlying IOMMU
>>> type.  For now Power has a guest IOMMU and x86 doesn't, but that could
>>> well change in future: we could well implement the guest side IOMMU
>>> for x86 in future (or x86 could invent a paravirt IOMMU interface).
>>> On the other side, BenH's experimental powernv machine type could
>>> introduce Power machines without a guest side IOMMU (or at least an
>>> optional guest side IOMMU).
>>>
>>> The quick and dirty approach here is:
>>>     1. Leave the main listener as is
>>>     2. Add a new pre-reg notifier to the spapr iommu specific code,
>>>        which listens on address_space_memory, *not* the PCI space
>>>
>>> The more generally correct approach, which allows for more complex
>>> IOMMU arrangements and the possibility of new IOMMU types with pre-reg
>>> is:
>>>     1. Have the core implement both a mapping listener and a pre-reg
>>>        listener (optionally enabled by a per-iommu-type flag).
>>>        Basically the first one sees what *is* mapped, the second sees
>>>        what *could* be mapped.
>>>
>>>     2. As now, the mapping listener listens on PCI address space, if
>>>        RAM blocks are added, immediately map them into the host IOMMU,
>>>        if guest IOMMU blocks appear register a notifier which will
>>>        mirror guest IOMMU mappings to the host IOMMU (this is what we
>>>        do now).
>>>
>>>     3. The pre-reg listener also listens on the PCI address space.  RAM
>>>        blocks added are pre-registered immediately.
>>
>>
>> PCI address space listeners won't be notified about RAM blocks on sPAPR.
>
> Sure they will - if any RAM blocks were mapped directly into PCI
> address space, the listener would be notified.  It's just that no RAM
> blocks are directly mapped into PCI space, only partially mapped in
> via IOMMU blocks.

Right. No RAM blocks are mapped. So on *sPAPR* the PCI AS listener won't be
notified about *RAM*. But you say "they will", so I am missing something here.


> But the idea is this scheme could handle a platform that has both a
> "bypass" DMA window which maps directly onto a block of ram and an
> IOMMU controlled DMA window.  Or one which could have either setup
> depending on circumstances (which is probably true of BenH's "powernv"
> machine type).
>
>>> But, if guest
>>>        IOMMU blocks are added, instead of registering a guest-iommu
>>>        notifier,
>>
>> "guest-iommu notifier" is the one called via memory_region_notify_iommu()
>> from H_PUT_TCE? "Instead" implies dropping it, how can this work?
>
> Because the other listener - the mapping listener at (2) handles that
> part.  The pre-reg listener doesn't.


My bad, #2 included notifiers, right.


> But as noted in my other mail, this whole scheme doesn't work without a
> way to discover an IOMMU region's target AS in advance, which doesn't
> currently exist.

We can add an AS to the IOMMU MR now, it is a few lines of code :)
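
[For illustration only -- a minimal sketch of the kind of hook being
suggested, mirroring the shape of the page-size callback from patch 1/4.
The get_target_as callback and its wrapper are assumptions, not something
in this series or in upstream QEMU:]

    #include <assert.h>
    #include "exec/memory.h"
    #include "exec/address-spaces.h"

    /* Hypothetical MemoryRegionIOMMUOps callback: report the address
     * space this IOMMU's IO PTEs point into, defaulting to guest RAM. */
    AddressSpace *memory_region_iommu_get_target_as(MemoryRegion *mr)
    {
        assert(memory_region_is_iommu(mr));
        if (mr->iommu_ops && mr->iommu_ops->get_target_as) {
            return mr->iommu_ops->get_target_as(mr);
        }
        return &address_space_memory;
    }

    /* spapr side: TCEs only ever target guest physical memory today */
    static AddressSpace *spapr_tce_get_target_as(MemoryRegion *iommu)
    {
        return &address_space_memory;
    }

[The catch David raises later in the thread is that this is a semantic
change, not just plumbing: today each individual mapping can name its own
target AS.]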


>>> we register another listener on the *target* AS of the
>>>        guest IOMMU, same callbacks as this one.  In practice that
>>>        target AS will almost always resolve to address_space_memory,
>>>        but this can at least in theory handle crazy guest setups with
>>>        multiple layers of IOMMU.
>>>
>>>     4. Have to ensure that the pre-reg callbacks always happen before
>>>        the mapping calls.  For a system with an IOMMU backend which
>>>        requires pre-registration, but doesn't have a guest IOMMU, we
>>>        need to pre-reg, then host-iommu-map RAM blocks that appear in
>>>        PCI address space.
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC PATCH qemu v3 4/4] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-17  5:20       ` David Gibson
@ 2015-07-17 18:25         ` Alex Williamson
  2015-07-18 15:05           ` David Gibson
  0 siblings, 1 reply; 22+ messages in thread
From: Alex Williamson @ 2015-07-17 18:25 UTC (permalink / raw)
  To: David Gibson
  Cc: Peter Crosthwaite, Alexey Kardashevskiy, qemu-devel,
	Michael Roth, qemu-ppc, Paolo Bonzini

On Fri, 2015-07-17 at 15:20 +1000, David Gibson wrote:
> On Thu, Jul 16, 2015 at 08:44:59AM -0600, Alex Williamson wrote:
> > On Thu, 2015-07-16 at 15:11 +1000, David Gibson wrote:
> > > On Tue, Jul 14, 2015 at 10:21:54PM +1000, Alexey Kardashevskiy wrote:
> > > > This makes use of the new "memory registering" feature. The idea is
> > > > to provide the userspace ability to notify the host kernel about pages
> > > > which are going to be used for DMA. Having this information, the host
> > > > kernel can pin them all once per user process, do locked pages
> > > > accounting (once) and not spend time on doing that in real time with
> > > > possible failures which cannot be handled nicely in some cases.
> > > > 
> > > > This adds a guest RAM memory listener which notifies a VFIO container
> > > > about memory which needs to be pinned/unpinned. VFIO MMIO regions
> > > > (i.e. "skip dump" regions) are skipped.
> > > > 
> > > > The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> > > > are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> > > > not call it when v2 is detected and enabled.
> > > > 
> > > > This does not change the guest visible interface.
> > > > 
> > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > > 
> > > I've looked at this in more depth now, and attempting to unify the
> > > pre-reg and mapping listeners like this can't work - they need to be
> > > listening on different address spaces:  mapping actions need to be
> > > listening on the PCI address space, whereas the pre-reg needs to be
> > > listening on address_space_memory.  For x86 - for now - those end up
> > > being the same thing, but on Power they're not.
> > > 
> > > We do need to be clear about what differences are due to the presence
> > > of a guest IOMMU versus which are due to arch or underlying IOMMU
> > > type.  For now Power has a guest IOMMU and x86 doesn't, but that could
> > > well change in future: we could well implement the guest side IOMMU
> > > for x86 in future (or x86 could invent a paravirt IOMMU interface).
> > > On the other side, BenH's experimental powernv machine type could
> > > introduce Power machines without a guest side IOMMU (or at least an
> > > optional guest side IOMMU).
> > > 
> > > The quick and dirty approach here is:
> > >    1. Leave the main listener as is
> > >    2. Add a new pre-reg notifier to the spapr iommu specific code,
> > >       which listens on address_space_memory, *not* the PCI space
> > 
> > It is dirty and that's exactly what I've been advising Alexey against
> > because we have entirely too much dirty spapr specific code that doesn't
> > need to be spapr specific.  I don't see why separate address space
> > matters, that's done at the point of registering the listener and so far
> > doesn't play much of a role in the actual listener behavior, just which
> > regions it sees.
> 
> Well, there's two parts to this - the different address spaces means
> they need to be different listener instances.  They also need
> different callbacks - or at least parameterized callback behaviour
> because they do different things (one maps, the other preregs).

Yep, and Alexey's doing that with a switch statement, which seems like
it's helping to share setup w/o being too ugly.  Whether it's a "map" or
a "pre-reg", we don't really care, it's just an ioctl specific to the
vfio-iommu type with some setup.  Much of that setup is not specific to
the vfio-iommu type and can be shared between guest-iommu type,
transparent or visible.  That's the part where I do not want spapr
specific setup hiding what should be generic code for anybody that has a
guest-visible iommu versus guest-transparent iommu.
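
[A loose sketch of that shape for a listener watching guest RAM: the
section walking and checks stay common, and only the final ioctl depends
on the container's IOMMU type.  Names are illustrative; iommu_type is the
field patch 3/4 proposes to store in the container, and the register
ioctl is the v2 pre-registration uapi on the kernel side:]

    #include <sys/ioctl.h>
    #include <linux/vfio.h>
    #include "hw/vfio/vfio-common.h"

    static int vfio_ram_section_callout(VFIOContainer *container,
                                        hwaddr iova, void *vaddr,
                                        uint64_t size, bool readonly)
    {
        switch (container->iommu_type) {
        case VFIO_TYPE1_IOMMU: {
            /* guest RAM is directly the DMA space: map it into the host IOMMU */
            struct vfio_iommu_type1_dma_map map = {
                .argsz = sizeof(map),
                .flags = VFIO_DMA_MAP_FLAG_READ |
                         (readonly ? 0 : VFIO_DMA_MAP_FLAG_WRITE),
                .vaddr = (uintptr_t)vaddr,
                .iova = iova,
                .size = size,
            };
            return ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map);
        }
        case VFIO_SPAPR_TCE_v2_IOMMU: {
            /* v2 pre-reg: pin and account the pages once, up front */
            struct vfio_iommu_spapr_register_memory reg = {
                .argsz = sizeof(reg),
                .vaddr = (uintptr_t)vaddr,
                .size = size,
            };
            return ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY,
                         &reg);
        }
        default:
            /* spapr v1: nothing to do for plain RAM, its mappings arrive
             * via the guest IOMMU notifier on the PCI address space */
            return 0;
        }
    }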

> So I really don't see any sense in which these can be accomplished by
> the same listener.  *Maybe* they could share some region walking code,
> but I'm not sure there's going to be anything of significant size here.

We're already sharing a listener for type1 and spapr windows, spapr-v2
is a lot like type1, just different ioctls and therefore slightly
different setup.

> > > The more generally correct approach, which allows for more complex
> > > IOMMU arrangements and the possibility of new IOMMU types with pre-reg
> > > is:
> > >    1. Have the core implement both a mapping listener and a pre-reg
> > >       listener (optionally enabled by a per-iommu-type flag).
> > >       Basically the first one sees what *is* mapped, the second sees
> > >       what *could* be mapped.
> > 
> > This just sounds like different address spaces, address_space_memory vs
> > address_space_physical_memory
> 
> Um.. what?  I'm not even sure what you mean by
> address_space_physical_memory (I see no such thing in the source).

It doesn't exist, it's just my attempt to interpret "*is* mapped" vs
"*could* be mapped".

> The idea was that the (top level) pre-reg listener would spawn more
> listeners for any AS which could get (partially) mapped into the PCI
> address space.
> 
> But.. I looked a bit closer and realised this scheme doesn't actually
> work.  IOMMU memory regions don't actually have a fixed target AS
> property (by which I mean the address space the IOMMU translates
> *into* rather than translates from - address_space_memory in most
> cases).  Instead any individual IOMMU mapping can point to a different
> AS supplied in the IOMMUTLB structure.

Why does that imply separate listeners?  We already do this in the
current listener.  This all sounds like things that are not unique to
spapr, only the actual ioctls are unique.

> > >    2. As now, the mapping listener listens on PCI address space, if
> > >       RAM blocks are added, immediately map them into the host IOMMU,
> > >       if guest IOMMU blocks appear register a notifier which will
> > >       mirror guest IOMMU mappings to the host IOMMU (this is what we
> > >       do now).
> > 
> > Right, this is done now, nothing new required.
> 
> Yes, I was just spelling that out for comparison with the other part.
> 
> > >    3. The pre-reg listener also listens on the PCI address space.  RAM
> > >       blocks added are pre-registered immediately.  But, if guest
> > >       IOMMU blocks are added, instead of registering a guest-iommu
> > >       notifier, we register another listener on the *target* AS of the
> > >       guest IOMMU, same callbacks as this one.  In practice that
> > >       target AS will almost always resolve to address_space_memory,
> > >       but this can at least in theory handle crazy guest setups with
> > >       multiple layers of IOMMU.
> > > 
> > >    4. Have to ensure that the pre-reg callbacks always happen before
> > >       the mapping calls.  For a system with an IOMMU backend which
> > >       requires pre-registration, but doesn't have a guest IOMMU, we
> > >       need to pre-reg, then host-iommu-map RAM blocks that appear in
> > >       PCI address space.
> 
> > Both of these just seem like details of the iommu-type (pre-reg)
> > specific parts of the listener, I'm not understanding what's
> > fundamentally different about it that we can't do it now, within a
> > single listener, nor do I see how it's all that different from x86.
> 
> So, we have two different address spaces to listen on (PCI and
> address_space_memory) so we need two different listener instances
> registered, clearly.

Yes, Alexey's current proposal has two listener instances, that's fine.

> Those two different listeners need to do different things.  1) maps
> memory blocks into VFIO and sets up a notifier to mirror guest IOMMU
> mappings into VFIO.  2) preregisters memory blocks, and it would be a
> bug if it ever saw a guest IOMMU.
> 
> So, what's left to share?

Well, 2) is exactly what type1 needs and 1) seems like what any
guest-visible IOMMU needs.  Look at the code we have now, there's some
small alignment testing, if iommu, else type1.  Maybe the code can be
clarified slightly by duplicating the testing into separate listeners,
but unless there's something fundamentally different between type1 and
spapr-v2, I think that the vfio-iommu callouts should just be
modularized and the listener shared.  Otherwise we're just splitting the
code for convenience at the cost of code sharing and maintenance.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC PATCH qemu v3 4/4] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-17 18:25         ` Alex Williamson
@ 2015-07-18 15:05           ` David Gibson
  2015-07-19 15:04             ` Alex Williamson
  0 siblings, 1 reply; 22+ messages in thread
From: David Gibson @ 2015-07-18 15:05 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Peter Crosthwaite, Alexey Kardashevskiy, qemu-devel,
	Michael Roth, qemu-ppc, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 11248 bytes --]

On Fri, Jul 17, 2015 at 12:25:31PM -0600, Alex Williamson wrote:
> On Fri, 2015-07-17 at 15:20 +1000, David Gibson wrote:
> > On Thu, Jul 16, 2015 at 08:44:59AM -0600, Alex Williamson wrote:
> > > On Thu, 2015-07-16 at 15:11 +1000, David Gibson wrote:
> > > > On Tue, Jul 14, 2015 at 10:21:54PM +1000, Alexey Kardashevskiy wrote:
> > > > > This makes use of the new "memory registering" feature. The idea is
> > > > > to provide the userspace ability to notify the host kernel about pages
> > > > > which are going to be used for DMA. Having this information, the host
> > > > > kernel can pin them all once per user process, do locked pages
> > > > > accounting (once) and not spend time on doing that in real time with
> > > > > possible failures which cannot be handled nicely in some cases.
> > > > > 
> > > > > This adds a guest RAM memory listener which notifies a VFIO container
> > > > > about memory which needs to be pinned/unpinned. VFIO MMIO regions
> > > > > (i.e. "skip dump" regions) are skipped.
> > > > > 
> > > > > The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> > > > > are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> > > > > not call it when v2 is detected and enabled.
> > > > > 
> > > > > This does not change the guest visible interface.
> > > > > 
> > > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > > > 
> > > > I've looked at this in more depth now, and attempting to unify the
> > > > pre-reg and mapping listeners like this can't work - they need to be
> > > > listening on different address spaces:  mapping actions need to be
> > > > listening on the PCI address space, whereas the pre-reg needs to be
> > > > listening on address_space_memory.  For x86 - for now - those end up
> > > > being the same thing, but on Power they're not.
> > > > 
> > > > We do need to be clear about what differences are due to the presence
> > > > of a guest IOMMU versus which are due to arch or underlying IOMMU
> > > > type.  For now Power has a guest IOMMU and x86 doesn't, but that could
> > > > well change in future: we could well implement the guest side IOMMU
> > > > for x86 in future (or x86 could invent a paravirt IOMMU interface).
> > > > On the other side, BenH's experimental powernv machine type could
> > > > introduce Power machines without a guest side IOMMU (or at least an
> > > > optional guest side IOMMU).
> > > > 
> > > > The quick and dirty approach here is:
> > > >    1. Leave the main listener as is
> > > >    2. Add a new pre-reg notifier to the spapr iommu specific code,
> > > >       which listens on address_space_memory, *not* the PCI space
> > > 
> > > It is dirty and that's exactly what I've been advising Alexey against
> > > because we have entirely too much dirty spapr specific code that doesn't
> > > need to be spapr specific.  I don't see why separate address space
> > > matters, that's done at the point of registering the listener and so far
> > > doesn't play much of a role in the actual listener behavior, just which
> > > regions it sees.
> > 
> > Well, there's two parts to this - the different address spaces means
> > they need to be different listener instances.  They also need
> > different callbacks - or at least parameterized callback behaviour
> > because they do different things (one maps, the other preregs).

Ugh.  I'm pretty sure we're talking at cross purposes here, but I
haven't figured out where the disconnect is.  So I'm not sure my
comments below will be coherent, but let's see.

> Yep, and Alexey's doing that with a switch statement, which seems like
> it's helping to share setup w/o being too ugly.

Which switch are you referring to?  There isn't one in
vfio_listener_region_add().  I'm talking about the current upstream
code here, not with Alexey's ddw patches - my whole point here is I
think the current draft is taking the wrong approach.

> Whether it's a "map" or
> a "pre-reg", we don't really care, it's just an ioctl specific to the
> vfio-iommu type with some setup.

So in this case the operation isn't controlled by the IOMMU type, it's
controlled by which listener instance we're being called from.  spapr
tce v2 needs both pre-reg and map operations, but it needs them on
different address spaces.

Obviously you can reduce things to one set of listener callbacks if
you parameterize it enough - I'm just not seeing that it will be
anything but an unnecessary multiplex if we do.

> Much of that setup is not specific to
> the vfio-iommu type and can be shared between guest-iommu type,
> transparent or visible.  That's the part were I do not want spapr
> specific setup hiding what should be generic code for anybody that has a
> guest-visible iommu versus guest-transparent iommu.

Right, I'm trying to achieve the same goal

> > So I really don't see any sense in which these can be accomplished by
> > the same listener.  *Maybe* they could share some region walking code,
> > but I'm not sure there's going to be anything of significant size here.
> 
> We're already sharing a listener for type1 and spapr windows, spapr-v2
> is a lot like type1, just different ioctls and therefore slightly
> different setup.

No, it's not different ioctl()s, it's *extra* ioctl()s.  v2 still needs
all the same map ioctl()s that v1 does, but it *also* needs pre-reg on
a different set of regions.

> > > > The more generally correct approach, which allows for more complex
> > > > IOMMU arrangements and the possibility of new IOMMU types with pre-reg
> > > > is:
> > > >    1. Have the core implement both a mapping listener and a pre-reg
> > > >       listener (optionally enabled by a per-iommu-type flag).
> > > >       Basically the first one sees what *is* mapped, the second sees
> > > >       what *could* be mapped.
> > > 
> > > This just sounds like different address spaces, address_space_memory vs
> > > address_space_physical_memory
> > 
> > Um.. what?  I'm not even sure what you mean by
> > address_space_physical_memory (I see no such thing in the source).
> 
> It doesn't exist, it's just my attempt to interpret "*is* mapped" vs
> "*could* be mapped".

Um.. ok.  So address_space_memory is already physical address space.
"is mapped" is the PCI address space - that's true by definition on
all platforms.  The platform difference is just whether the PCI
address space is the same as address_space_memory or not.

"could be mapped" is address_space_memory on all current platforms.  I
can think of theoretical cases where it wouldn't be, but they're
pretty contrived.
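
[In code terms that is just two listener instances, one per question --
a skeleton only, with the callback tables elided and the field names
(listener, prereg_listener, iommu_type) assumed rather than taken from
the series:]

    #include <linux/vfio.h>
    #include "exec/memory.h"
    #include "exec/address-spaces.h"
    #include "hw/vfio/vfio-common.h"

    /* callback tables defined elsewhere in the sketch (bodies elided) */
    extern const MemoryListener vfio_mapping_listener, vfio_prereg_listener;

    static void vfio_setup_listeners(VFIOContainer *container,
                                     AddressSpace *pci_as)
    {
        /* "is mapped": watch the PCI address space and act on what is
         * actually there -- map RAM sections, hook guest IOMMU sections. */
        container->listener = vfio_mapping_listener;
        memory_listener_register(&container->listener, pci_as);

        /* "could be mapped": watch address_space_memory and pre-register
         * RAM up front, only for backends that need it (spapr TCE v2). */
        if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
            container->prereg_listener = vfio_prereg_listener;
            memory_listener_register(&container->prereg_listener,
                                     &address_space_memory);
        }
    }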

> > The idea was that the (top level) pre-reg listener would spawn more
> > listeners for any AS which could get (partially) mapped into the PCI
> > address space.
> > 
> > But.. I looked a bit closer and realised this scheme doesn't actually
> > work.  IOMMU memory regions don't actually have a fixed target AS
> > property (by which I mean the address space the IOMMU translates
> > *into* rather than translates from - address_space_memory in most
> > cases).  Instead any individual IOMMU mapping can point to a different
> > AS supplied in the IOMMUTLB structure.
> 
> Why does that imply separate listeners?

It doesn't, I'm pointing out a different problem with the scheme I was
suggesting.   Essentially what 3&4 were supposed to be doing was
procedurally determining the "could be mapped" AS in a platform
independent way.

> We already do this in the
> current listener.

No, not quite.  The current listener looks at the IOMMU's source
address space - the space of IOVAs.  I'm saying here that pre-reg /
the "could be mapped" path needs to look at the target address space -
the space which IO PTEs point into.
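
[The same distinction in terms of the existing structures, roughly as in
include/exec/memory.h with comments added: an IOMMU MemoryRegion lives in
the source (IOVA) address space that today's listener walks, and the
target is only named per mapping:]

    struct IOMMUTLBEntry {
        AddressSpace    *target_as;       /* where this one IO PTE points */
        hwaddr           iova;            /* source side: the space of IOVAs */
        hwaddr           translated_addr;
        hwaddr           addr_mask;       /* 0xfff for a 4K translation */
        IOMMUAccessFlags perm;
    };

[There is no per-MemoryRegion equivalent of target_as for a pre-reg
listener to key off, which is the gap being discussed here.]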

> This all sounds like things that are not unique to
> spapr, only the actual ioctls are unique.

I'm not suggesting any spapr uniqueness per se here, the only
difference I'm envisaging is presence / absence of pre-reg.  In fact
even the ioctl()s are the same - or would be assuming any future
pre-reg IOMMU type re-uses the same pre-reg ioctl()s.

> > > >    2. As now, the mapping listener listens on PCI address space, if
> > > >       RAM blocks are added, immediately map them into the host IOMMU,
> > > >       if guest IOMMU blocks appear register a notifier which will
> > > >       mirror guest IOMMU mappings to the host IOMMU (this is what we
> > > >       do now).
> > > 
> > > Right, this is done now, nothing new required.
> > 
> > Yes, I was just spelling that out for comparison with the other part.
> > 
> > > >    3. The pre-reg listener also listens on the PCI address space.  RAM
> > > >       blocks added are pre-registered immediately.  But, if guest
> > > >       IOMMU blocks are added, instead of registering a guest-iommu
> > > >       notifier, we register another listener on the *target* AS of the
> > > >       guest IOMMU, same callbacks as this one.  In practice that
> > > >       target AS will almost always resolve to address_space_memory,
> > > >       but this can at least in theory handle crazy guest setups with
> > > >       multiple layers of IOMMU.
> > > > 
> > > >    4. Have to ensure that the pre-reg callbacks always happen before
> > > >       the mapping calls.  For a system with an IOMMU backend which
> > > >       requires pre-registration, but doesn't have a guest IOMMU, we
> > > >       need to pre-reg, then host-iommu-map RAM blocks that appear in
> > > >       PCI address space.
> > 
> > > Both of these just seem like details of the iommu-type (pre-reg)
> > > specific parts of the listener, I'm not understanding what's
> > > fundamentally different about it that we can't do it now, within a
> > > single listener, nor do I see how it's all that different from x86.
> > 
> > So, we have two different address spaces to listen on (PCI and
> > > address_space_memory) so we need two different listener instances
> > registered, clearly.
> 
> Yes, Alexey's current proposal has two listener instances, that's fine.
> 
> > Those two different listeners need to do different things.  1) maps
> > memory blocks into VFIO and sets up a notifier to mirror guest IOMMU
> > mappings into VFIO.  2) preregisters memory blocks, and it would be a
> > bug if it ever saw a guest IOMMU.
> > 
> > So, what's left to share?
> 
> Well, 2) is exactly what type1 needs and 1) seems like what any
> guest-visible IOMMU needs.  Look at the code we have now, there's some
> small alignment testing, if iommu, else type1.  Maybe the code can be
> clarified slightly by duplicating the testing into separate listeners,
> but unless there's something fundamentally different between type1 and
> spapr-v2, I think that that the vfio-iommu callouts should just be
> modularized and the listener shared.  Otherwise we're just splitting the
> code for convenience at the cost of code sharing and maintenance.
> Thanks,
> 
> Alex
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC PATCH qemu v3 4/4] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-17 15:47         ` Alexey Kardashevskiy
@ 2015-07-18 15:17           ` David Gibson
  0 siblings, 0 replies; 22+ messages in thread
From: David Gibson @ 2015-07-18 15:17 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Peter Crosthwaite, qemu-devel, Michael Roth, Alex Williamson,
	qemu-ppc, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 6297 bytes --]

On Sat, Jul 18, 2015 at 01:47:34AM +1000, Alexey Kardashevskiy wrote:
> On 07/17/2015 11:39 PM, David Gibson wrote:
> >On Fri, Jul 17, 2015 at 05:13:37PM +1000, Alexey Kardashevskiy wrote:
> >>On 07/16/2015 03:11 PM, David Gibson wrote:
> >>>On Tue, Jul 14, 2015 at 10:21:54PM +1000, Alexey Kardashevskiy wrote:
> >>>>This makes use of the new "memory registering" feature. The idea is
> >>>>to provide the userspace ability to notify the host kernel about pages
> >>>>which are going to be used for DMA. Having this information, the host
> >>>>kernel can pin them all once per user process, do locked pages
> >>>>accounting (once) and not spend time on doing that in real time with
> >>>>possible failures which cannot be handled nicely in some cases.
> >>>>
> >>>>This adds a guest RAM memory listener which notifies a VFIO container
> >>>>about memory which needs to be pinned/unpinned. VFIO MMIO regions
> >>>>(i.e. "skip dump" regions) are skipped.
> >>>>
> >>>>The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> >>>>are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> >>>>not call it when v2 is detected and enabled.
> >>>>
> >>>>This does not change the guest visible interface.
> >>>>
> >>>>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>
> >>>I've looked at this in more depth now, and attempting to unify the
> >>>pre-reg and mapping listeners like this can't work - they need to be
> >>>listening on different address spaces:  mapping actions need to be
> >>>listening on the PCI address space, whereas the pre-reg needs to be
> >>>listening on address_space_memory.  For x86 - for now - those end up
> >>>being the same thing, but on Power they're not.
> >>>
> >>>We do need to be clear about what differences are due to the presence
> >>>of a guest IOMMU versus which are due to arch or underlying IOMMU
> >>>type.  For now Power has a guest IOMMU and x86 doesn't, but that could
> >>>well change in future: we could well implement the guest side IOMMU
> >>>for x86 in future (or x86 could invent a paravirt IOMMU interface).
> >>>On the other side, BenH's experimental powernv machine type could
> >>>introduce Power machines without a guest side IOMMU (or at least an
> >>>optional guest side IOMMU).
> >>>
> >>>The quick and dirty approach here is:
> >>>    1. Leave the main listener as is
> >>>    2. Add a new pre-reg notifier to the spapr iommu specific code,
> >>>       which listens on address_space_memory, *not* the PCI space
> >>>
> >>>The more generally correct approach, which allows for more complex
> >>>IOMMU arrangements and the possibility of new IOMMU types with pre-reg
> >>>is:
> >>>    1. Have the core implement both a mapping listener and a pre-reg
> >>>       listener (optionally enabled by a per-iommu-type flag).
> >>>       Basically the first one sees what *is* mapped, the second sees
> >>>       what *could* be mapped.
> >>>
> >>>    2. As now, the mapping listener listens on PCI address space, if
> >>>       RAM blocks are added, immediately map them into the host IOMMU,
> >>>       if guest IOMMU blocks appear register a notifier which will
> >>>       mirror guest IOMMU mappings to the host IOMMU (this is what we
> >>>       do now).
> >>>
> >>>    3. The pre-reg listener also listens on the PCI address space.  RAM
> >>>       blocks added are pre-registered immediately.
> >>
> >>
> >>PCI address space listeners won't be notified about RAM blocks on sPAPR.
> >
> >Sure they will - if any RAM blocks were mapped directly into PCI
> >address space, the listener would be notified.  It's just that no RAM
> >blocks are directly mapped into PCI space, only partially mapped in
> >via IOMMU blocks.
> 
> Right. No RAM blocks are mapped. So on *sPAPR* the PCI AS listener won't be
> notified about *RAM*. But you say "they will", so I am missing something here.

So, I'm thinking more generally than just the existing PAPR and x86
cases.

The current listener structure on the PCI address space is correct by
construction for all platforms.  It handles the no guest IOMMU case
where RAM is mapped directly into PCI, it handles the guest iommu
case.  It handles the case where there isn't a guest IOMMU as such,
but some or all of RAM is mapped into PCI space at an offset.  It
handles the case where a platform has a "bypass window" which has all
(or a large block) of RAM mapped with a simple offset, and another
window that has a paged IOMMU.

So what I'm saying is that the listener will see RAM blocks that
are actually mapped into PCI address space.  That's exactly what we
need for the actual mapping calls.

But for pre-reg we need to see RAM blocks that *might* be mapped into
the bus, and we want to see them early.
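
[What that boils down to for the pre-reg callback -- a sketch assuming a
dedicated prereg_listener field and the spapr v2 register ioctl: it runs
against address_space_memory, only ever handles plain RAM, and registers
that RAM before anything tries to map it:]

    #include <sys/ioctl.h>
    #include <linux/vfio.h>
    #include "qemu/osdep.h"
    #include "hw/vfio/vfio-common.h"

    static void vfio_prereg_region_add(MemoryListener *listener,
                                       MemoryRegionSection *section)
    {
        VFIOContainer *container = container_of(listener, VFIOContainer,
                                                prereg_listener);
        struct vfio_iommu_spapr_register_memory reg = { .argsz = sizeof(reg) };

        /* Seeing a guest IOMMU here would be a bug: this listener watches
         * address_space_memory, not the PCI address space. */
        g_assert(!memory_region_is_iommu(section->mr));

        /* Not guest RAM, or a VFIO MMIO BAR marked "skip dump": ignore. */
        if (!memory_region_is_ram(section->mr) ||
            memory_region_is_skip_dump(section->mr)) {
            return;
        }

        reg.vaddr = (uintptr_t)(memory_region_get_ram_ptr(section->mr) +
                                section->offset_within_region);
        reg.size = int128_get64(section->size);
        ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
    }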

> >But the idea is this scheme could handle a platform that has both a
> >"bypass" DMA window which maps directly onto a block of ram and an
> >IOMMU controlled DMA window.  Or one which could have either setup
> >depending on circumstances (which is probably true of BenH's "powernv"
> >machine type).
> >
> >>>But, if guest
> >>>       IOMMU blocks are added, instead of registering a guest-iommu
> >>>       notifier,
> >>
> >>"guest-iommu notifier" is the one called via memory_region_notify_iommu()
> >>from H_PUT_TCE? "Instead" implies dropping it, how can this work?
> >
> >Because the other listener - the mapping listener at (2) handles that
> >part.  The pre-reg listener doesn't.
> 
> 
> My bad, #2 included notifiers, right.
> 
> 
> >But as noted in my other mail, this whole scheme doesn't work without a
> >way to discover an IOMMU region's target AS in advance, which doesn't
> >currently exist.
> 
> We can add an AS to the IOMMU MR now, it is a few lines of code :)

Not really, because it changes the semantics of what's possible with a
guest IOMMU.  The current code allows the IOMMU to map in individual
pages from multiple target pools.  That seems an unlikely thing to do
in practice, but I think it requires some thought before doing so.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [RFC PATCH qemu v3 4/4] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-18 15:05           ` David Gibson
@ 2015-07-19 15:04             ` Alex Williamson
  0 siblings, 0 replies; 22+ messages in thread
From: Alex Williamson @ 2015-07-19 15:04 UTC (permalink / raw)
  To: David Gibson
  Cc: Peter Crosthwaite, Alexey Kardashevskiy, qemu-devel,
	Michael Roth, qemu-ppc, Paolo Bonzini

On Sun, 2015-07-19 at 01:05 +1000, David Gibson wrote:
> On Fri, Jul 17, 2015 at 12:25:31PM -0600, Alex Williamson wrote:
> > On Fri, 2015-07-17 at 15:20 +1000, David Gibson wrote:
> > > On Thu, Jul 16, 2015 at 08:44:59AM -0600, Alex Williamson wrote:
> > > > On Thu, 2015-07-16 at 15:11 +1000, David Gibson wrote:
> > > > > On Tue, Jul 14, 2015 at 10:21:54PM +1000, Alexey Kardashevskiy wrote:
> > > > > > This makes use of the new "memory registering" feature. The idea is
> > > > > > to provide the userspace ability to notify the host kernel about pages
> > > > > > which are going to be used for DMA. Having this information, the host
> > > > > > kernel can pin them all once per user process, do locked pages
> > > > > > accounting (once) and not spend time on doing that in real time with
> > > > > > possible failures which cannot be handled nicely in some cases.
> > > > > > 
> > > > > > This adds a guest RAM memory listener which notifies a VFIO container
> > > > > > about memory which needs to be pinned/unpinned. VFIO MMIO regions
> > > > > > (i.e. "skip dump" regions) are skipped.
> > > > > > 
> > > > > > The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> > > > > > are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> > > > > > not call it when v2 is detected and enabled.
> > > > > > 
> > > > > > This does not change the guest visible interface.
> > > > > > 
> > > > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > > > > 
> > > > > I've looked at this in more depth now, and attempting to unify the
> > > > > pre-reg and mapping listeners like this can't work - they need to be
> > > > > listening on different address spaces:  mapping actions need to be
> > > > > listening on the PCI address space, whereas the pre-reg needs to be
> > > > > listening on address_space_memory.  For x86 - for now - those end up
> > > > > being the same thing, but on Power they're not.
> > > > > 
> > > > > We do need to be clear about what differences are due to the presence
> > > > > of a guest IOMMU versus which are due to arch or underlying IOMMU
> > > > > type.  For now Power has a guest IOMMU and x86 doesn't, but that could
> > > > > well change in future: we could well implement the guest side IOMMU
> > > > > for x86 in future (or x86 could invent a paravirt IOMMU interface).
> > > > > On the other side, BenH's experimental powernv machine type could
> > > > > introduce Power machines without a guest side IOMMU (or at least an
> > > > > optional guest side IOMMU).
> > > > > 
> > > > > The quick and dirty approach here is:
> > > > >    1. Leave the main listener as is
> > > > >    2. Add a new pre-reg notifier to the spapr iommu specific code,
> > > > >       which listens on address_space_memory, *not* the PCI space
> > > > 
> > > > It is dirty and that's exactly what I've been advising Alexey against
> > > > because we have entirely too much dirty spapr specific code that doesn't
> > > > need to be spapr specific.  I don't see why separate address space
> > > > matters, that's done at the point of registering the listener and so far
> > > > doesn't play much of a role in the actual listener behavior, just which
> > > > regions it sees.
> > > 
> > > Well, there's two parts to this - the different address spaces means
> > > they need to be different listener instances.  They also need
> > > different callbacks - or at least parameterized callback behaviour
> > > because they do different things (one maps, the other preregs).
> 
> Ugh.  I'm pretty sure we're talking at cross purposes here, but I
> haven't figured out where the disconnect is.  So I'm not sure my
> comments below will be coherent, but let's see.
> 
> > Yep, and Alexey's doing that with a switch statement, which seems like
> > it's helping to share setup w/o being too ugly.
> 
> Which switch are you referring to?  There isn't one in
> vfio_listener_region_add().  I'm talking about the current upstream
> code here, not with Alexey's ddw patches - my whole point here is I
> think the current draft is taking the wrong approach.

I'm not sure how we're going to improve on the current draft or disagree
on its direction, if we're not going to take into account the changes
that it makes, such as the added switch statements.  In my opinion, the
series here is starting to head in a workable direction and with the
suggestions made, I think we can eliminate a lot of the repetitive
wrappers and get good re-use of code.  From my perspective, the comments
here are saying we can't do what we think we're already doing due to
some subtle difference in terminology that I'm unable to translate into
a practical difference in code.  Perhaps you could point out
specifically what doesn't work in the current proposal or at least stub
out with pseudo code the proposal you'd like to see.  Thanks,

Alex

> > Whether it's a "map" or
> > a "pre-reg", we don't really care, it's just an ioctl specific to the
> > vfio-iommu type with some setup.
> 
> So in this case the operation isn't controlled by the IOMMU type, it's
> controlled by which listener instance we're being called from.  spapr
> tce v2 needs both pre-reg and map operations, but it needs them on
> different address spaces.
> 
> Obviously you can reduce things to one set of listener callbacks if
> you parameterize it enough - I'm just not seeing that it will be
> anything but an unnecessary multiplex if we do.
> 
> > Much of that setup is not specific to
> > the vfio-iommu type and can be shared between guest-iommu type,
> > transparent or visible.  That's the part were I do not want spapr
> > specific setup hiding what should be generic code for anybody that has a
> > guest-visible iommu versus guest-transparent iommu.
> 
> Right, I'm trying to achieve the same goal
> 
> > > So I really don't see any sense in which these can be accomplished by
> > > the same listener.  *Maybe* they could share some region walking code,
> > > but I'm not sure there's going to be anything of significant size here.
> > 
> > We're already sharing a listener for type1 and spapr windows, spapr-v2
> > is a lot like type1, just different ioctls and therefore slightly
> > different setup.
> 
> No, it's not different ioctl()s, it's *extra* ioctl()s.  v2 still needs
> all the same map ioctl()s that v1 does, but it *also* needs pre-reg on
> a different set of regions.
> 
> > > > > The more generally correct approach, which allows for more complex
> > > > > IOMMU arrangements and the possibility of new IOMMU types with pre-reg
> > > > > is:
> > > > >    1. Have the core implement both a mapping listener and a pre-reg
> > > > >       listener (optionally enabled by a per-iommu-type flag).
> > > > >       Basically the first one sees what *is* mapped, the second sees
> > > > >       what *could* be mapped.
> > > > 
> > > > This just sounds like different address spaces, address_space_memory vs
> > > > address_space_physical_memory
> > > 
> > > Um.. what?  I'm not even sure what you mean by
> > > address_space_physical_memory (I see no such thing in the source).
> > 
> > It doesn't exist, it's just my attempt to interpret "*is* mapped" vs
> > "*could* be mapped".
> 
> Um.. ok.  So address_space_memory is already physical address space.
> "is mapped" is the PCI address space - that's true by definition on
> all platforms.  The platform difference is just whether the PCI
> address space is the same as address_space_memory or not.
> 
> "could be mapped" is address_space_memory on all current platforms.  I
> can think of theoretical cases where it wouldn't be, but they're
> pretty contrived.
> 
> > > The idea was that the (top level) pre-reg listener would spawn more
> > > listeners for any AS which could get (partially) mapped into the PCI
> > > address space.
> > > 
> > > But.. I looked a bit closer and realised this scheme doesn't actually
> > > work.  IOMMU memory regions don't actually have a fixed target AS
> > > property (by which I mean the address space the IOMMU translates
> > > *into* rather than translates from - address_space_memory in most
> > > cases).  Instead any individual IOMMU mapping can point to a different
> > > AS supplied in the IOMMUTLB structure.
> > 
> > Why does that imply separate listeners?
> 
> > It doesn't, I'm pointing out a different problem with the scheme I was
> suggesting.   Essentially what 3&4 were supposed to be doing was
> procedurally determining the "could be mapped" AS in a platform
> independent way.
> 
> > We already do this in the
> > current listener.
> 
> No, not quite.  The current listener looks at the IOMMU's source
> address space - the space of IOVAs.  I'm saying here that pre-reg /
> the "could be mapped" path needs to look at the target address space -
> the space which IO PTEs point into.
> 
> > This all sounds like things that are not unique to
> > spapr, only the actual ioctls are unique.
> 
> I'm not suggesting any spapr uniqueness per se here, the only
> > difference I'm envisaging is presence / absence of pre-reg.  In fact
> even the ioctl()s are the same - or would be assuming any future
> pre-reg IOMMU type re-uses the same pre-reg ioctl()s.
> 
> > > > >    2. As now, the mapping listener listens on PCI address space, if
> > > > >       RAM blocks are added, immediately map them into the host IOMMU,
> > > > >       if guest IOMMU blocks appear register a notifier which will
> > > > >       mirror guest IOMMU mappings to the host IOMMU (this is what we
> > > > >       do now).
> > > > 
> > > > Right, this is done now, nothing new required.
> > > 
> > > Yes, I was just spelling that out for comparison with the other part.
> > > 
> > > > >    3. The pre-reg listener also listens on the PCI address space.  RAM
> > > > >       blocks added are pre-registered immediately.  But, if guest
> > > > >       IOMMU blocks are added, instead of registering a guest-iommu
> > > > >       notifier, we register another listener on the *target* AS of the
> > > > >       guest IOMMU, same callbacks as this one.  In practice that
> > > > >       target AS will almost always resolve to address_space_memory,
> > > > >       but this can at least in theory handle crazy guest setups with
> > > > >       multiple layers of IOMMU.
> > > > > 
> > > > >    4. Have to ensure that the pre-reg callbacks always happen before
> > > > >       the mapping calls.  For a system with an IOMMU backend which
> > > > >       requires pre-registration, but doesn't have a guest IOMMU, we
> > > > >       need to pre-reg, then host-iommu-map RAM blocks that appear in
> > > > >       PCI address space.
> > > 
> > > > Both of these just seem like details of the iommu-type (pre-reg)
> > > > specific parts of the listener, I'm not understanding what's
> > > > fundamentally different about it that we can't do it now, within a
> > > > single listener, nor do I see how it's all that different from x86.
> > > 
> > > So, we have two different address spaces to listen on (PCI and
> > > address_space_memory) so we need two different listener instances
> > > registered, clearly.
> > 
> > Yes, Alexey's current proposal has two listener instances, that's fine.
> > 
> > > Those two different listeners need to do different things.  1) maps
> > > memory blocks into VFIO and sets up a notifier to mirror guest IOMMU
> > > mappings into VFIO.  2) preregisters memory blocks, and it would be a
> > > bug if it ever saw a guest IOMMU.
> > > 
> > > So, what's left to share?
> > 
> > Well, 2) is exactly what type1 needs and 1) seems like what any
> > guest-visible IOMMU needs.  Look at the code we have now, there's some
> > small alignment testing, if iommu, else type1.  Maybe the code can be
> > clarified slightly by duplicating the testing into separate listeners,
> > but unless there's something fundamentally different between type1 and
> > spapr-v2, I think that that the vfio-iommu callouts should just be
> > modularized and the listener shared.  Otherwise we're just splitting the
> > code for convenience at the cost of code sharing and maintenance.
> > Thanks,
> > 
> > Alex
> > 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread

Thread overview: 22+ messages
2015-07-14 12:21 [Qemu-devel] [RFC PATCH qemu v3 0/4] vfio: SPAPR IOMMU v2 (memory preregistration support) Alexey Kardashevskiy
2015-07-14 12:21 ` [Qemu-devel] [RFC PATCH qemu v3 1/4] memory: Add reporting of supported page sizes Alexey Kardashevskiy
2015-07-15 18:26   ` Alex Williamson
2015-07-16  1:12     ` Alexey Kardashevskiy
2015-07-14 12:21 ` [Qemu-devel] [RFC PATCH qemu v3 2/4] vfio: Use different page size for different IOMMU types Alexey Kardashevskiy
2015-07-15 18:26   ` Alex Williamson
2015-07-16  1:26     ` Alexey Kardashevskiy
2015-07-16  2:51       ` Alex Williamson
2015-07-16  3:31         ` Alexey Kardashevskiy
2015-07-14 12:21 ` [Qemu-devel] [RFC PATCH qemu v3 3/4] vfio: Store IOMMU type in container Alexey Kardashevskiy
2015-07-15 18:26   ` Alex Williamson
2015-07-14 12:21 ` [Qemu-devel] [RFC PATCH qemu v3 4/4] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
2015-07-16  5:11   ` David Gibson
2015-07-16 14:44     ` Alex Williamson
2015-07-17  5:20       ` David Gibson
2015-07-17 18:25         ` Alex Williamson
2015-07-18 15:05           ` David Gibson
2015-07-19 15:04             ` Alex Williamson
2015-07-17  7:13     ` Alexey Kardashevskiy
2015-07-17 13:39       ` David Gibson
2015-07-17 15:47         ` Alexey Kardashevskiy
2015-07-18 15:17           ` David Gibson
