* [Qemu-devel] [PATCH qemu v18 0/5] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
@ 2016-06-21  1:14 Alexey Kardashevskiy
  2016-06-21  1:14 ` [Qemu-devel] [PATCH qemu v18 1/5] memory: Add reporting of supported page sizes Alexey Kardashevskiy
                   ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-21  1:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, qemu-ppc, David Gibson, Alex Williamson

Each Partitionable Endpoint (IOMMU group) has an address range on a PCI bus
where devices are allowed to do DMA. These ranges are called DMA windows.
By default, there is a single DMA window, 1 or 2GB big, mapped at zero
on a PCI bus.

PAPR defines a DDW RTAS API which allows pseries guests
to query the hypervisor about DDW support and capabilities (page size mask
for now). A pseries guest may request additional DMA windows (on top of
the default one) using this RTAS API.
The existing pseries Linux guests request an additional window as big as
the guest RAM and map the entire guest window, which effectively creates
a direct mapping of the guest memory to a PCI bus.
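
To give a feel for the sizes involved, here is a minimal sketch of the
arithmetic behind "a window as big as the guest RAM". It is not code from
this series: the helper names are hypothetical, and rounding the window up
to a power of two is an assumption made for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Round v up to the next power of two (v must be non-zero and <= 2^63). */
static uint64_t round_up_pow2(uint64_t v)
{
    uint64_t p = 1;

    assert(v != 0 && v <= (UINT64_C(1) << 63));
    while (p < v) {
        p <<= 1;
    }
    return p;
}

/*
 * Hypothetical helper: number of TCEs (one per IOMMU page) needed to
 * map an entire window of window_size bytes with 1 << page_shift pages.
 */
static uint64_t ddw_tce_count(uint64_t window_size, unsigned page_shift)
{
    return window_size >> page_shift;
}
```

With these definitions, a 3GB guest would get a 4GB window, which needs
1048576 TCEs with 4K IOMMU pages but only 65536 with 64K pages.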

This patchset reworks PPC64 IOMMU code and adds necessary structures
to support big windows on pseries.

This patchset is based on David's ppc-for-2.7 branch, sha1 8dc2e5e.

This patchset does not contain guest view table reallocation as
it is not a part of DDW and it depends on
"memory: Add MemoryRegionIOMMUOps.notify_started/stopped callbacks"
which went to the VFIO tree; I will repost the spapr part when the
prerequisite is merged into David's queue.

Please comment. Thanks!


Alexey Kardashevskiy (5):
  memory: Add reporting of supported page sizes
  vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2)
  vfio: Add host side DMA window capabilities
  vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)
  spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)

 hw/ppc/Makefile.objs          |   1 +
 hw/ppc/spapr.c                |   7 +-
 hw/ppc/spapr_iommu.c          |   8 ++
 hw/ppc/spapr_pci.c            |  77 ++++++++---
 hw/ppc/spapr_rtas_ddw.c       | 295 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/Makefile.objs         |   1 +
 hw/vfio/common.c              | 180 ++++++++++++++++++++------
 hw/vfio/spapr.c               | 210 ++++++++++++++++++++++++++++++
 include/exec/memory.h         |  19 ++-
 include/hw/pci-host/spapr.h   |   8 +-
 include/hw/ppc/spapr.h        |  16 ++-
 include/hw/vfio/vfio-common.h |  20 ++-
 memory.c                      |  16 ++-
 trace-events                  |  12 ++
 14 files changed, 800 insertions(+), 70 deletions(-)
 create mode 100644 hw/ppc/spapr_rtas_ddw.c
 create mode 100644 hw/vfio/spapr.c

-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v18 1/5] memory: Add reporting of supported page sizes
  2016-06-21  1:14 [Qemu-devel] [PATCH qemu v18 0/5] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
@ 2016-06-21  1:14 ` Alexey Kardashevskiy
  2016-06-21  6:16   ` David Gibson
  2016-06-21  1:14 ` [Qemu-devel] [PATCH qemu v18 2/5] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2) Alexey Kardashevskiy
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 23+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-21  1:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, qemu-ppc, David Gibson, Alex Williamson

Every IOMMU has some granularity which MemoryRegionIOMMUOps::translate
uses when translating, however this information is not available outside
the translate context for various checks.

This adds a get_min_page_size callback to MemoryRegionIOMMUOps and
a wrapper for it so IOMMU users (such as VFIO) can know the minimum
actual page size supported by an IOMMU.

As an IOMMU MR represents a guest IOMMU, this uses TARGET_PAGE_SIZE
as the fallback.
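
The optional-callback-with-fallback pattern described above can be sketched
in isolation as follows. This is a standalone illustration, not the QEMU
code: the struct and function names are made up, and FALLBACK_PAGE_SIZE
merely stands in for TARGET_PAGE_SIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FALLBACK_PAGE_SIZE UINT64_C(4096)  /* stand-in for TARGET_PAGE_SIZE */

struct iommu_region;

/* Hypothetical ops table; the callback is optional and may be NULL. */
struct iommu_ops {
    uint64_t (*get_min_page_size)(struct iommu_region *r);
};

struct iommu_region {
    const struct iommu_ops *ops;
    unsigned page_shift;
};

/* Example callback: derive the page size from the table's page shift. */
static uint64_t shift_based_min_page_size(struct iommu_region *r)
{
    return UINT64_C(1) << r->page_shift;
}

/* Wrapper: ask the IOMMU if it can answer, otherwise use the fallback. */
static uint64_t region_min_page_size(struct iommu_region *r)
{
    assert(r != NULL);
    if (r->ops && r->ops->get_min_page_size) {
        return r->ops->get_min_page_size(r);
    }
    return FALLBACK_PAGE_SIZE;
}
```

Callers only ever use the wrapper, so an IOMMU implementation that does not
provide the callback keeps working with the default page size.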

This removes vfio_container_granularity() and uses the new helper in
memory_region_iommu_replay() when replaying IOMMU mappings on a newly
added IOMMU memory region.
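
For illustration only, the replay walk can be modelled with a toy IOMMU
that keeps one "mapped" flag per minimum-size page. The toy_iommu type and
toy_replay() are hypothetical; the real code calls the translate callback
and a notifier instead of filling an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy IOMMU: one permission flag per minimum-size page. */
struct toy_iommu {
    const bool *mapped;
    uint64_t size;           /* region size in bytes */
    uint64_t min_page_size;  /* replay granularity in bytes */
};

/*
 * Walk the region in min_page_size steps and record the start address
 * of every mapped page, up to max entries. Returns the number recorded.
 */
static unsigned toy_replay(const struct toy_iommu *mmu,
                           uint64_t *replayed, unsigned max)
{
    unsigned n = 0;

    assert(mmu->min_page_size != 0);
    for (uint64_t addr = 0; addr < mmu->size; addr += mmu->min_page_size) {
        if (mmu->mapped[addr / mmu->min_page_size] && n < max) {
            replayed[n++] = addr;
        }
    }
    return n;
}
```

Stepping by the IOMMU's own minimum page size means no mapped page is
skipped, which is the reason for taking the granularity from the IOMMU
rather than from the caller.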

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
---
Changes:
v18:
* fixed unnecessary line wrap
* s/get_page_sizes/get_min_page_size/

v16:
* used memory_region_iommu_get_page_sizes() instead of
mr->iommu_ops->get_page_sizes() in memory_region_iommu_replay()

v15:
* s/qemu_real_host_page_size/TARGET_PAGE_SIZE/ in memory_region_iommu_get_page_sizes

v14:
* removed vfio_container_granularity(), changed memory_region_iommu_replay()

v4:
* s/1<<TARGET_PAGE_BITS/qemu_real_host_page_size/
---
 hw/ppc/spapr_iommu.c  |  8 ++++++++
 hw/vfio/common.c      |  9 +--------
 include/exec/memory.h | 19 +++++++++++++++----
 memory.c              | 16 +++++++++++++---
 4 files changed, 37 insertions(+), 15 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index a3cc572..e230bac 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -149,6 +149,13 @@ static void spapr_tce_table_pre_save(void *opaque)
                                tcet->bus_offset, tcet->page_shift);
 }
 
+static uint64_t spapr_tce_get_min_page_size(MemoryRegion *iommu)
+{
+    sPAPRTCETable *tcet = container_of(iommu, sPAPRTCETable, iommu);
+
+    return 1ULL << tcet->page_shift;
+}
+
 static int spapr_tce_table_post_load(void *opaque, int version_id)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
@@ -228,6 +235,7 @@ static const VMStateDescription vmstate_spapr_tce_table = {
 
 static MemoryRegionIOMMUOps spapr_iommu_ops = {
     .translate = spapr_tce_translate_iommu,
+    .get_min_page_size = spapr_tce_get_min_page_size,
 };
 
 static int spapr_tce_table_realize(DeviceState *dev)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 1898f1f..27cc159 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -321,11 +321,6 @@ out:
     rcu_read_unlock();
 }
 
-static hwaddr vfio_container_granularity(VFIOContainer *container)
-{
-    return (hwaddr)1 << ctz64(container->iova_pgsizes);
-}
-
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
@@ -392,9 +387,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
         QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
 
         memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
-        memory_region_iommu_replay(giommu->iommu, &giommu->n,
-                                   vfio_container_granularity(container),
-                                   false);
+        memory_region_iommu_replay(giommu->iommu, &giommu->n, false);
 
         return;
     }
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 4ab6800..e3829f7 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -151,6 +151,8 @@ typedef struct MemoryRegionIOMMUOps MemoryRegionIOMMUOps;
 struct MemoryRegionIOMMUOps {
     /* Return a TLB entry that contains a given address. */
     IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
+    /* Returns minimum supported page size */
+    uint64_t (*get_min_page_size)(MemoryRegion *iommu);
 };
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
@@ -573,6 +575,16 @@ static inline bool memory_region_is_iommu(MemoryRegion *mr)
 
 
 /**
+ * memory_region_iommu_get_min_page_size: get minimum supported page size
+ * for an iommu
+ *
+ * Returns minimum supported page size for an iommu.
+ *
+ * @mr: the memory region being queried
+ */
+uint64_t memory_region_iommu_get_min_page_size(MemoryRegion *mr);
+
+/**
  * memory_region_notify_iommu: notify a change in an IOMMU translation entry.
  *
  * @mr: the memory region that was changed
@@ -596,16 +608,15 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n);
 
 /**
  * memory_region_iommu_replay: replay existing IOMMU translations to
- * a notifier
+ * a notifier with the minimum page granularity returned by
+ * mr->iommu_ops->get_min_page_size().
  *
  * @mr: the memory region to observe
  * @n: the notifier to which to replay iommu mappings
- * @granularity: Minimum page granularity to replay notifications for
  * @is_write: Whether to treat the replay as a translate "write"
  *     through the iommu
  */
-void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n,
-                                hwaddr granularity, bool is_write);
+void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write);
 
 /**
  * memory_region_unregister_iommu_notifier: unregister a notifier for
diff --git a/memory.c b/memory.c
index 8ba496d..fe44e29 100644
--- a/memory.c
+++ b/memory.c
@@ -1502,12 +1502,22 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n)
     notifier_list_add(&mr->iommu_notify, n);
 }
 
-void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n,
-                                hwaddr granularity, bool is_write)
+uint64_t memory_region_iommu_get_min_page_size(MemoryRegion *mr)
 {
-    hwaddr addr;
+    assert(memory_region_is_iommu(mr));
+    if (mr->iommu_ops && mr->iommu_ops->get_min_page_size) {
+        return mr->iommu_ops->get_min_page_size(mr);
+    }
+    return TARGET_PAGE_SIZE;
+}
+
+void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write)
+{
+    hwaddr addr, granularity;
     IOMMUTLBEntry iotlb;
 
+    granularity = (hwaddr)1 << ctz64(memory_region_iommu_get_min_page_size(mr));
+
     for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
         iotlb = mr->iommu_ops->translate(mr, addr, is_write);
         if (iotlb.perm != IOMMU_NONE) {
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v18 2/5] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2)
  2016-06-21  1:14 [Qemu-devel] [PATCH qemu v18 0/5] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
  2016-06-21  1:14 ` [Qemu-devel] [PATCH qemu v18 1/5] memory: Add reporting of supported page sizes Alexey Kardashevskiy
@ 2016-06-21  1:14 ` Alexey Kardashevskiy
  2016-06-21  6:46   ` David Gibson
  2016-06-22 16:49   ` Alex Williamson
  2016-06-21  1:14 ` [Qemu-devel] [PATCH qemu v18 3/5] vfio: Add host side DMA window capabilities Alexey Kardashevskiy
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 23+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-21  1:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, qemu-ppc, David Gibson, Alex Williamson

This makes use of the new "memory registering" feature. The idea is
to give userspace the ability to notify the host kernel about pages
which are going to be used for DMA. Having this information, the host
kernel can pin them all once per user process, do locked pages
accounting (once) and not spend time on doing that in real time with
possible failures which cannot be handled nicely in some cases.

This adds a prereg memory listener which listens on address_space_memory
and notifies a VFIO container about memory which needs to be
pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.

The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
not call it when v2 is detected and enabled.

This enforces guest RAM blocks to be host page size aligned; however
this is not new as KVM already requires memory slots to be host page
size aligned.
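
The alignment rule boils down to three mask tests. Below is a standalone
sketch of that check; section_is_page_aligned() is a hypothetical name, and
the host page size is passed in explicitly here instead of coming from
qemu_real_host_page_mask.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if the section start, its offset within the region and
 * its size are all multiples of host_page_size (a power of two).
 */
static bool section_is_page_aligned(uint64_t gpa, uint64_t offset_in_region,
                                    uint64_t size, uint64_t host_page_size)
{
    /* Low bits clear, high bits set: same shape as a host page mask. */
    const uint64_t page_mask = ~(host_page_size - 1);

    return !((gpa & ~page_mask) ||
             (offset_in_region & ~page_mask) ||
             (size & ~page_mask));
}
```

On a host with 64K pages, a section starting at 0x1000 or with a size of
0x21000 fails the check, while the same section is fine on a 4K host.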

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v18:
* made a copy of listener trace points in spapr.c
* fixed cleanup in vfio_connect_container
* removed assert in vfio_prereg_listener_region_add()
* created "prereg" copy of traces

v17:
* s/prereg\.c/spapr.c/
* s/vfio_prereg_gpa_to_ua/vfio_prereg_gpa_to_vaddr/
* vfio_prereg_listener_skipped_section does hw_error() on IOMMUs

v16:
* switched to 64bit math everywhere as there is no chance to see
region_add on RAM blocks even remotely close to 1<<64 bytes.

v15:
* banned unaligned sections
* added a vfio_prereg_gpa_to_ua() helper

v14:
* s/free_container_exit/listener_release_exit/g
* added "if memory_region_is_iommu()" to vfio_prereg_listener_skipped_section
---
 hw/vfio/Makefile.objs         |   1 +
 hw/vfio/common.c              |  42 ++++++++++---
 hw/vfio/spapr.c               | 139 ++++++++++++++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-common.h |   4 ++
 trace-events                  |   6 ++
 5 files changed, 182 insertions(+), 10 deletions(-)
 create mode 100644 hw/vfio/spapr.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index ceddbb8..c25e32b 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
 obj-$(CONFIG_SOFTMMU) += platform.o
 obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
 obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
+obj-$(CONFIG_SOFTMMU) += spapr.o
 endif
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 27cc159..22be48b 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -502,6 +502,9 @@ static const MemoryListener vfio_memory_listener = {
 static void vfio_listener_release(VFIOContainer *container)
 {
     memory_listener_unregister(&container->listener);
+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+        memory_listener_unregister(&container->prereg_listener);
+    }
 }
 
 static struct vfio_info_cap_header *
@@ -860,8 +863,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             goto free_container_exit;
         }
 
-        ret = ioctl(fd, VFIO_SET_IOMMU,
-                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
+        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
+        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
         if (ret) {
             error_report("vfio: failed to set iommu for container: %m");
             ret = -errno;
@@ -886,8 +889,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
         if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
             container->iova_pgsizes = info.iova_pgsizes;
         }
-    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
+    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
+               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
         struct vfio_iommu_spapr_tce_info info;
+        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
 
         ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
         if (ret) {
@@ -895,7 +900,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             ret = -errno;
             goto free_container_exit;
         }
-        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
+        container->iommu_type =
+            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
+        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
         if (ret) {
             error_report("vfio: failed to set iommu for container: %m");
             ret = -errno;
@@ -907,11 +914,23 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
          * when container fd is closed so we do not call it explicitly
          * in this file.
          */
-        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
-        if (ret) {
-            error_report("vfio: failed to enable container: %m");
-            ret = -errno;
-            goto free_container_exit;
+        if (!v2) {
+            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
+            if (ret) {
+                error_report("vfio: failed to enable container: %m");
+                ret = -errno;
+                goto free_container_exit;
+            }
+        } else {
+            container->prereg_listener = vfio_prereg_listener;
+
+            memory_listener_register(&container->prereg_listener,
+                                     &address_space_memory);
+            if (container->error) {
+                memory_listener_unregister(&container->prereg_listener);
+                error_report("vfio: RAM memory listener initialization failed for container");
+                goto free_container_exit;
+            }
         }
 
         /*
@@ -924,7 +943,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
         if (ret) {
             error_report("vfio: VFIO_IOMMU_SPAPR_TCE_GET_INFO failed: %m");
             ret = -errno;
-            goto free_container_exit;
+            if (v2) {
+                memory_listener_unregister(&container->prereg_listener);
+            }
+            goto listener_release_exit;
         }
         container->min_iova = info.dma32_window_start;
         container->max_iova = container->min_iova + info.dma32_window_size - 1;
diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
new file mode 100644
index 0000000..5c29bec
--- /dev/null
+++ b/hw/vfio/spapr.c
@@ -0,0 +1,139 @@
+/*
+ * DMA memory preregistration
+ *
+ * Authors:
+ *  Alexey Kardashevskiy <aik@ozlabs.ru>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "cpu.h"
+#include <sys/ioctl.h>
+#include <linux/vfio.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "hw/hw.h"
+#include "qemu/error-report.h"
+#include "trace.h"
+
+static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
+{
+    if (memory_region_is_iommu(section->mr)) {
+        hw_error("Cannot possibly preregister IOMMU memory");
+    }
+
+    return !memory_region_is_ram(section->mr) ||
+            memory_region_is_skip_dump(section->mr);
+}
+
+static void *vfio_prereg_gpa_to_vaddr(MemoryRegionSection *section, hwaddr gpa)
+{
+    return memory_region_get_ram_ptr(section->mr) +
+        section->offset_within_region +
+        (gpa - section->offset_within_address_space);
+}
+
+static void vfio_prereg_listener_region_add(MemoryListener *listener,
+                                            MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            prereg_listener);
+    const hwaddr gpa = section->offset_within_address_space;
+    hwaddr end;
+    int ret;
+    hwaddr page_mask = qemu_real_host_page_mask;
+    struct vfio_iommu_spapr_register_memory reg = {
+        .argsz = sizeof(reg),
+        .flags = 0,
+    };
+
+    if (vfio_prereg_listener_skipped_section(section)) {
+        trace_vfio_prereg_listener_region_add_skip(
+                section->offset_within_address_space,
+                section->offset_within_address_space +
+                int128_get64(int128_sub(section->size, int128_one())));
+        return;
+    }
+
+    if (unlikely((section->offset_within_address_space & ~page_mask) ||
+                 (section->offset_within_region & ~page_mask) ||
+                 (int128_get64(section->size) & ~page_mask))) {
+        error_report("%s received unaligned region", __func__);
+        return;
+    }
+
+    end = section->offset_within_address_space + int128_get64(section->size);
+    if (gpa >= end) {
+        return;
+    }
+
+    memory_region_ref(section->mr);
+
+    reg.vaddr = (__u64) vfio_prereg_gpa_to_vaddr(section, gpa);
+    reg.size = end - gpa;
+
+    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
+    trace_vfio_prereg_register(reg.vaddr, reg.size, ret ? -errno : 0);
+    if (ret) {
+        /*
+         * On the initfn path, store the first error in the container so we
+         * can gracefully fail.  Runtime, there's not much we can do other
+         * than throw a hardware error.
+         */
+        if (!container->initialized) {
+            if (!container->error) {
+                container->error = ret;
+            }
+        } else {
+            hw_error("vfio: Memory registering failed, unable to continue");
+        }
+    }
+}
+
+static void vfio_prereg_listener_region_del(MemoryListener *listener,
+                                            MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            prereg_listener);
+    const hwaddr gpa = section->offset_within_address_space;
+    hwaddr end;
+    int ret;
+    hwaddr page_mask = qemu_real_host_page_mask;
+    struct vfio_iommu_spapr_register_memory reg = {
+        .argsz = sizeof(reg),
+        .flags = 0,
+    };
+
+    if (vfio_prereg_listener_skipped_section(section)) {
+        trace_vfio_prereg_listener_region_del_skip(
+                section->offset_within_address_space,
+                section->offset_within_address_space +
+                int128_get64(int128_sub(section->size, int128_one())));
+        return;
+    }
+
+    if (unlikely((section->offset_within_address_space & ~page_mask) ||
+                 (section->offset_within_region & ~page_mask) ||
+                 (int128_get64(section->size) & ~page_mask))) {
+        error_report("%s received unaligned region", __func__);
+        return;
+    }
+
+    end = section->offset_within_address_space + int128_get64(section->size);
+    if (gpa >= end) {
+        return;
+    }
+
+    reg.vaddr = (__u64) vfio_prereg_gpa_to_vaddr(section, gpa);
+    reg.size = end - gpa;
+
+    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
+    trace_vfio_prereg_unregister(reg.vaddr, reg.size, ret ? -errno : 0);
+}
+
+const MemoryListener vfio_prereg_listener = {
+    .region_add = vfio_prereg_listener_region_add,
+    .region_del = vfio_prereg_listener_region_del,
+};
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 0610377..405c3b2 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -73,6 +73,8 @@ typedef struct VFIOContainer {
     VFIOAddressSpace *space;
     int fd; /* /dev/vfio/vfio, empowered by the attached groups */
     MemoryListener listener;
+    MemoryListener prereg_listener;
+    unsigned iommu_type;
     int error;
     bool initialized;
     /*
@@ -158,4 +160,6 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
 int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
                              uint32_t subtype, struct vfio_region_info **info);
 #endif
+extern const MemoryListener vfio_prereg_listener;
+
 #endif /* !HW_VFIO_VFIO_COMMON_H */
diff --git a/trace-events b/trace-events
index da0d060..0b1583f 100644
--- a/trace-events
+++ b/trace-events
@@ -1770,6 +1770,12 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
 vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
 
+# hw/vfio/spapr.c
+vfio_prereg_listener_region_add_skip(uint64_t start, uint64_t end) "SKIPPING region_add %"PRIx64" - %"PRIx64
+vfio_prereg_listener_region_del_skip(uint64_t start, uint64_t end) "SKIPPING region_del %"PRIx64" - %"PRIx64
+vfio_prereg_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
+vfio_prereg_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
+
 # hw/vfio/platform.c
 vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
 vfio_platform_realize(char *name, char *compat) "vfio device %s, compat = %s"
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v18 3/5] vfio: Add host side DMA window capabilities
  2016-06-21  1:14 [Qemu-devel] [PATCH qemu v18 0/5] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
  2016-06-21  1:14 ` [Qemu-devel] [PATCH qemu v18 1/5] memory: Add reporting of supported page sizes Alexey Kardashevskiy
  2016-06-21  1:14 ` [Qemu-devel] [PATCH qemu v18 2/5] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2) Alexey Kardashevskiy
@ 2016-06-21  1:14 ` Alexey Kardashevskiy
  2016-06-21  6:50   ` David Gibson
  2016-06-21  1:14 ` [Qemu-devel] [PATCH qemu v18 4/5] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2) Alexey Kardashevskiy
  2016-06-21  1:14 ` [Qemu-devel] [PATCH qemu v18 5/5] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
  4 siblings, 1 reply; 23+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-21  1:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, qemu-ppc, David Gibson, Alex Williamson

There are going to be multiple IOMMUs per container. This moves
the single host IOMMU parameter set to a list of VFIOHostDMAWindow.

This should cause no behavioral change and will be used later by
the SPAPR TCE IOMMU v2 which will also add a vfio_host_win_del() helper.
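
As a rough model of the two checks a window list needs (no overlap when
adding, containment when mapping), consider the sketch below. It uses a
plain array instead of a QLIST, and host_win, host_win_overlaps() and
host_win_find() are hypothetical names, not the ones in this patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One host DMA window; min_iova and max_iova are both inclusive. */
struct host_win {
    uint64_t min_iova;
    uint64_t max_iova;
    uint64_t iova_pgsizes;
};

/* True if [min_iova, max_iova] shares at least one address with w. */
static bool host_win_overlaps(const struct host_win *w,
                              uint64_t min_iova, uint64_t max_iova)
{
    return w->min_iova <= max_iova && min_iova <= w->max_iova;
}

/* Return the window that fully contains [iova, end], or NULL if none. */
static const struct host_win *host_win_find(const struct host_win *wins,
                                            size_t n,
                                            uint64_t iova, uint64_t end)
{
    for (size_t i = 0; i < n; i++) {
        if (wins[i].min_iova <= iova && end <= wins[i].max_iova) {
            return &wins[i];
        }
    }
    return NULL;
}
```

Comparing inclusive bounds directly avoids computing a window length, which
would wrap to zero for a window spanning the whole 64-bit IOVA space.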

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v18:
* vfio_host_win_add() checks for non-overlapping windows instead of calling
vfio_host_win_lookup() which checks for inclusion
* inlined vfio_host_win_lookup() as I ended up using it just once
* put VFIOHostDMAWindow::max_iova in new line in include/hw/vfio/vfio-common.h

v17:
* vfio_host_win_add() uses vfio_host_win_lookup() for overlap check and
aborts if any found instead of returning an error (as recovery is not
possible anyway)
* hw_error() when overlapped iommu is detected

v16:
* adjusted commit log with changes from v15

v15:
* s/vfio_host_iommu_add/vfio_host_win_add/
* s/VFIOHostIOMMU/VFIOHostDMAWindow/
---
 hw/vfio/common.c              | 60 +++++++++++++++++++++++++++++++------------
 include/hw/vfio/vfio-common.h | 10 ++++++--
 2 files changed, 52 insertions(+), 18 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 22be48b..b53a1db 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -28,6 +28,7 @@
 #include "exec/memory.h"
 #include "hw/hw.h"
 #include "qemu/error-report.h"
+#include "qemu/range.h"
 #include "sysemu/kvm.h"
 #ifdef CONFIG_KVM
 #include "linux/kvm.h"
@@ -241,6 +242,29 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
     return -errno;
 }
 
+static void vfio_host_win_add(VFIOContainer *container,
+                              hwaddr min_iova, hwaddr max_iova,
+                              uint64_t iova_pgsizes)
+{
+    VFIOHostDMAWindow *hostwin;
+
+    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
+        if (ranges_overlap(hostwin->min_iova,
+                           hostwin->max_iova - hostwin->min_iova + 1,
+                           min_iova,
+                           max_iova - min_iova + 1)) {
+            hw_error("%s: Overlapped IOMMU are not enabled", __func__);
+        }
+    }
+
+    hostwin = g_malloc0(sizeof(*hostwin));
+
+    hostwin->min_iova = min_iova;
+    hostwin->max_iova = max_iova;
+    hostwin->iova_pgsizes = iova_pgsizes;
+    QLIST_INSERT_HEAD(&container->hostwin_list, hostwin, hostwin_next);
+}
+
 static bool vfio_listener_skipped_section(MemoryRegionSection *section)
 {
     return (!memory_region_is_ram(section->mr) &&
@@ -329,6 +353,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
     Int128 llend, llsize;
     void *vaddr;
     int ret;
+    VFIOHostDMAWindow *hostwin;
+    bool hostwin_found;
 
     if (vfio_listener_skipped_section(section)) {
         trace_vfio_listener_region_add_skip(
@@ -354,7 +380,15 @@ static void vfio_listener_region_add(MemoryListener *listener,
     }
     end = int128_get64(int128_sub(llend, int128_one()));
 
-    if ((iova < container->min_iova) || (end > container->max_iova)) {
+    hostwin_found = false;
+    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
+        if (hostwin->min_iova <= iova && end <= hostwin->max_iova) {
+            hostwin_found = true;
+            break;
+        }
+    }
+
+    if (!hostwin_found) {
         error_report("vfio: IOMMU container %p can't map guest IOVA region"
                      " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
                      container, iova, end);
@@ -369,10 +403,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
 
         trace_vfio_listener_region_add_iommu(iova, end);
         /*
-         * FIXME: We should do some checking to see if the
-         * capabilities of the host VFIO IOMMU are adequate to model
-         * the guest IOMMU
-         *
          * FIXME: For VFIO iommu types which have KVM acceleration to
          * avoid bouncing all map/unmaps through qemu this way, this
          * would be the right place to wire that up (tell the KVM
@@ -878,17 +908,14 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
          * existing Type1 IOMMUs generally support any IOVA we're
          * going to actually try in practice.
          */
-        container->min_iova = 0;
-        container->max_iova = (hwaddr)-1;
-
-        /* Assume just 4K IOVA page size */
-        container->iova_pgsizes = 0x1000;
         info.argsz = sizeof(info);
         ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
         /* Ignore errors */
-        if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
-            container->iova_pgsizes = info.iova_pgsizes;
+        if (ret || !(info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
+            /* Assume 4k IOVA page size */
+            info.iova_pgsizes = 4096;
         }
+        vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes);
     } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
                ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
         struct vfio_iommu_spapr_tce_info info;
@@ -948,11 +975,12 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             }
             goto listener_release_exit;
         }
-        container->min_iova = info.dma32_window_start;
-        container->max_iova = container->min_iova + info.dma32_window_size - 1;
 
-        /* Assume just 4K IOVA pages for now */
-        container->iova_pgsizes = 0x1000;
+        /* The default table uses 4K pages */
+        vfio_host_win_add(container, info.dma32_window_start,
+                          info.dma32_window_start +
+                          info.dma32_window_size - 1,
+                          0x1000);
     } else {
         error_report("vfio: No available IOMMU models");
         ret = -EINVAL;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 405c3b2..b1f3e92 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -82,9 +82,8 @@ typedef struct VFIOContainer {
      * contiguous IOVA window.  We may need to generalize that in
      * future
      */
-    hwaddr min_iova, max_iova;
-    uint64_t iova_pgsizes;
     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
+    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
     QLIST_HEAD(, VFIOGroup) group_list;
     QLIST_ENTRY(VFIOContainer) next;
 } VFIOContainer;
@@ -97,6 +96,13 @@ typedef struct VFIOGuestIOMMU {
     QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
 } VFIOGuestIOMMU;
 
+typedef struct VFIOHostDMAWindow {
+    hwaddr min_iova;
+    hwaddr max_iova;
+    uint64_t iova_pgsizes;
+    QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next;
+} VFIOHostDMAWindow;
+
 typedef struct VFIODeviceOps VFIODeviceOps;
 
 typedef struct VFIODevice {
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v18 4/5] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)
  2016-06-21  1:14 [Qemu-devel] [PATCH qemu v18 0/5] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
                   ` (2 preceding siblings ...)
  2016-06-21  1:14 ` [Qemu-devel] [PATCH qemu v18 3/5] vfio: Add host side DMA window capabilities Alexey Kardashevskiy
@ 2016-06-21  1:14 ` Alexey Kardashevskiy
  2016-06-22  1:29   ` David Gibson
  2016-06-22 14:38   ` Laurent Vivier
  2016-06-21  1:14 ` [Qemu-devel] [PATCH qemu v18 5/5] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
  4 siblings, 2 replies; 23+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-21  1:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, qemu-ppc, David Gibson, Alex Williamson

The new VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
This adds the ability for VFIO common code to dynamically allocate/remove
DMA windows in the host kernel when a new VFIO container is added/removed.

This adds a helper, called from vfio_listener_region_add, which issues the
VFIO_IOMMU_SPAPR_TCE_CREATE ioctl and adds the newly created window to
the container's host window list; the opposite action is taken in
vfio_listener_region_del.
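
For illustration only, the window bookkeeping can be sketched outside of
QEMU as below. This is a simplified stand-in, not the code from this patch:
HostWin, win_overlaps(), host_win_add() and host_win_del() are hypothetical
names, a plain singly linked list replaces QLIST, and the overlap check is
folded into the add helper, whereas the patch performs it in
vfio_listener_region_add.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for VFIOHostDMAWindow; bounds are inclusive. */
typedef struct HostWin {
    uint64_t min_iova;
    uint64_t max_iova;
    uint64_t pgsizes;
    struct HostWin *next;
} HostWin;

/* True if the inclusive ranges [a_min, a_max] and [b_min, b_max] intersect. */
static bool win_overlaps(uint64_t a_min, uint64_t a_max,
                         uint64_t b_min, uint64_t b_max)
{
    return a_min <= b_max && b_min <= a_max;
}

/* Add a window; returns 0, or -1 on overlap or allocation failure. */
static int host_win_add(HostWin **list, uint64_t min_iova,
                        uint64_t max_iova, uint64_t pgsizes)
{
    HostWin *w;

    assert(min_iova <= max_iova);
    for (w = *list; w; w = w->next) {
        if (win_overlaps(w->min_iova, w->max_iova, min_iova, max_iova)) {
            return -1; /* intersections are not allowed for now */
        }
    }

    w = malloc(sizeof(*w));
    if (!w) {
        return -1;
    }
    w->min_iova = min_iova;
    w->max_iova = max_iova;
    w->pgsizes = pgsizes;
    w->next = *list;
    *list = w;
    return 0;
}

/* Remove a window only if both bounds match exactly; -1 if not found. */
static int host_win_del(HostWin **list, uint64_t min_iova, uint64_t max_iova)
{
    HostWin **pp;

    for (pp = list; *pp; pp = &(*pp)->next) {
        if ((*pp)->min_iova == min_iova && (*pp)->max_iova == max_iova) {
            HostWin *w = *pp;

            *pp = w->next;
            free(w);
            return 0;
        }
    }
    return -1;
}
```

The exact-match rule in host_win_del() mirrors the v18 change noted below
("vfio_host_win_del() now checks for exact window size").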

When creating a new window, this uses a heuristic to decide on the number
of TCE table levels.
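
As a rough standalone sketch of that heuristic, assuming the thresholds
given in the code comment of this patch (0..64 host pages of TCEs = 1 level;
65..4096 = 2; 4097..262144 = 3; more = 4): tce_table_levels() and its helpers
are hypothetical names, the host page size is passed in rather than taken
from getpagesize(), and the cap at 4 levels is an assumption read from the
last range in that comment.

```c
#include <assert.h>
#include <stdint.h>

/* Round v up to the next power of two (1 <= v <= 2^63). */
static uint64_t round_up_pow2(uint64_t v)
{
    uint64_t p = 1;

    assert(v >= 1 && v <= (1ULL << 63));
    while (p < v) {
        p <<= 1;
    }
    return p;
}

/* log2 of a power of two. */
static unsigned log2_pow2(uint64_t p)
{
    unsigned n = 0;

    while (p > 1) {
        p >>= 1;
        n++;
    }
    return n;
}

/*
 * Hypothetical helper: pick the number of TCE table levels.
 * window_size and host_page_size are in bytes; each TCE is 8 bytes.
 */
static unsigned tce_table_levels(uint64_t window_size, unsigned page_shift,
                                 uint64_t host_page_size)
{
    uint64_t entries, pages;
    unsigned order, levels;

    assert(page_shift < 64 && host_page_size > 0);

    entries = window_size >> page_shift;
    pages = entries * sizeof(uint64_t) / host_page_size;
    if (pages < 1) {
        pages = 1;
    }

    /* One extra level per factor of 64 in the (rounded up) page count */
    order = log2_pow2(round_up_pow2(pages));
    levels = (order == 0) ? 1 : (order - 1) / 6 + 1;

    /* Assumption: the comment lists at most 4 levels, so cap there */
    return levels > 4 ? 4 : levels;
}
```

For example, a 1GB window with 4K IOMMU pages on a host with 64K pages
needs 32 host pages of TCEs and so stays at a single level.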

This should cause no guest-visible change in behavior.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v18:
* moved trace definitions under hw/vfio/spapr.c section
* moved trace_vfio_spapr_remove_window to vfio_spapr_remove_window()
* vfio_host_win_del() now checks for exact window size
* one ctz() less in vfio_spapr_create_window()

v17:
* moved spapr window create/remove helpers to separate file
* added hw_error() if vfio_host_win_del() failed

v16:
* used memory_region_iommu_get_page_sizes() in vfio_listener_region_add()
* enforced no intersections between windows

v14:
* new to the series
---
 hw/vfio/common.c              | 79 +++++++++++++++++++++++++++++++++++++------
 hw/vfio/spapr.c               | 71 ++++++++++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-common.h |  6 ++++
 trace-events                  |  2 ++
 4 files changed, 148 insertions(+), 10 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index b53a1db..8e3466c 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -265,6 +265,21 @@ static void vfio_host_win_add(VFIOContainer *container,
     QLIST_INSERT_HEAD(&container->hostwin_list, hostwin, hostwin_next);
 }
 
+static int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
+                             hwaddr max_iova)
+{
+    VFIOHostDMAWindow *hostwin;
+
+    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
+        if (hostwin->min_iova == min_iova && hostwin->max_iova == max_iova) {
+            QLIST_REMOVE(hostwin, hostwin_next);
+            return 0;
+        }
+    }
+
+    return -1;
+}
+
 static bool vfio_listener_skipped_section(MemoryRegionSection *section)
 {
     return (!memory_region_is_ram(section->mr) &&
@@ -380,6 +395,30 @@ static void vfio_listener_region_add(MemoryListener *listener,
     }
     end = int128_get64(int128_sub(llend, int128_one()));
 
+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+        VFIOHostDMAWindow *hostwin;
+        hwaddr pgsize = 0;
+
+        /* For now intersections are not allowed, we may relax this later */
+        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
+            if (ranges_overlap(hostwin->min_iova,
+                               hostwin->max_iova - hostwin->min_iova + 1,
+                               section->offset_within_address_space,
+                               int128_get64(section->size))) {
+                goto fail;
+            }
+        }
+
+        ret = vfio_spapr_create_window(container, section, &pgsize);
+        if (ret) {
+            goto fail;
+        }
+
+        vfio_host_win_add(container, section->offset_within_address_space,
+                          section->offset_within_address_space +
+                          int128_get64(section->size) - 1, pgsize);
+    }
+
     hostwin_found = false;
     QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
         if (hostwin->min_iova <= iova && end <= hostwin->max_iova) {
@@ -522,6 +561,18 @@ static void vfio_listener_region_del(MemoryListener *listener,
                      "0x%"HWADDR_PRIx") = %d (%m)",
                      container, iova, int128_get64(llsize), ret);
     }
+
+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+        vfio_spapr_remove_window(container,
+                                 section->offset_within_address_space);
+        if (vfio_host_win_del(container,
+                              section->offset_within_address_space,
+                              section->offset_within_address_space +
+                              int128_get64(section->size) - 1) < 0) {
+            hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
+                     __func__, section->offset_within_address_space);
+        }
+    }
 }
 
 static const MemoryListener vfio_memory_listener = {
@@ -960,11 +1011,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             }
         }
 
-        /*
-         * This only considers the host IOMMU's 32-bit window.  At
-         * some point we need to add support for the optional 64-bit
-         * window and dynamic windows
-         */
         info.argsz = sizeof(info);
         ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
         if (ret) {
@@ -976,11 +1022,24 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             goto listener_release_exit;
         }
 
-        /* The default table uses 4K pages */
-        vfio_host_win_add(container, info.dma32_window_start,
-                          info.dma32_window_start +
-                          info.dma32_window_size - 1,
-                          0x1000);
+        if (v2) {
+            /*
+             * There is a default window in just created container.
+             * To make region_add/del simpler, we better remove this
+             * window now and let those iommu_listener callbacks
+             * create/remove them when needed.
+             */
+            ret = vfio_spapr_remove_window(container, info.dma32_window_start);
+            if (ret) {
+                goto free_container_exit;
+            }
+        } else {
+            /* The default table uses 4K pages */
+            vfio_host_win_add(container, info.dma32_window_start,
+                              info.dma32_window_start +
+                              info.dma32_window_size - 1,
+                              0x1000);
+        }
     } else {
         error_report("vfio: No available IOMMU models");
         ret = -EINVAL;
diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
index 5c29bec..852da0b 100644
--- a/hw/vfio/spapr.c
+++ b/hw/vfio/spapr.c
@@ -137,3 +137,74 @@ const MemoryListener vfio_prereg_listener = {
     .region_add = vfio_prereg_listener_region_add,
     .region_del = vfio_prereg_listener_region_del,
 };
+
+int vfio_spapr_create_window(VFIOContainer *container,
+                             MemoryRegionSection *section,
+                             hwaddr *pgsize)
+{
+    int ret;
+    unsigned pagesize = memory_region_iommu_get_min_page_size(section->mr);
+    unsigned entries, pages;
+    struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
+
+    /*
+     * FIXME: For VFIO iommu types which have KVM acceleration to
+     * avoid bouncing all map/unmaps through qemu this way, this
+     * would be the right place to wire that up (tell the KVM
+     * device emulation the VFIO iommu handles to use).
+     */
+    create.window_size = int128_get64(section->size);
+    create.page_shift = ctz64(pagesize);
+    /*
+     * SPAPR host supports multilevel TCE tables, there is some
+     * heuristic to decide how many levels we want for our table:
+     * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
+     */
+    entries = create.window_size >> create.page_shift;
+    pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
+    pages = pow2ceil(pages); /* Round up to a power of two */
+    create.levels = (MAX(ctz64(pages), 1) - 1) / 6 + 1;
+
+    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
+    if (ret) {
+        error_report("Failed to create a window, ret = %d (%m)", ret);
+        return -errno;
+    }
+
+    if (create.start_addr != section->offset_within_address_space) {
+        vfio_spapr_remove_window(container, create.start_addr);
+
+        error_report("Host doesn't support DMA window at %"HWADDR_PRIx
+                     ", must be %"PRIx64,
+                     section->offset_within_address_space,
+                     (uint64_t)create.start_addr);
+        return -EINVAL;
+    }
+    trace_vfio_spapr_create_window(create.page_shift,
+                                   create.window_size,
+                                   create.start_addr);
+    *pgsize = pagesize;
+
+    return 0;
+}
+
+int vfio_spapr_remove_window(VFIOContainer *container,
+                             hwaddr offset_within_address_space)
+{
+    struct vfio_iommu_spapr_tce_remove remove = {
+        .argsz = sizeof(remove),
+        .start_addr = offset_within_address_space,
+    };
+    int ret;
+
+    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
+    if (ret) {
+        error_report("Failed to remove window at %"PRIx64,
+                     remove.start_addr);
+        return -errno;
+    }
+
+    trace_vfio_spapr_remove_window(offset_within_address_space);
+
+    return 0;
+}
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index b1f3e92..07f7188 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -168,4 +168,10 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
 #endif
 extern const MemoryListener vfio_prereg_listener;
 
+int vfio_spapr_create_window(VFIOContainer *container,
+                             MemoryRegionSection *section,
+                             hwaddr *pgsize);
+int vfio_spapr_remove_window(VFIOContainer *container,
+                             hwaddr offset_within_address_space);
+
 #endif /* !HW_VFIO_VFIO_COMMON_H */
diff --git a/trace-events b/trace-events
index 0b1583f..7e94d92 100644
--- a/trace-events
+++ b/trace-events
@@ -1775,6 +1775,8 @@ vfio_prereg_listener_region_add_skip(uint64_t start, uint64_t end) "SKIPPING reg
 vfio_prereg_listener_region_del_skip(uint64_t start, uint64_t end) "SKIPPING region_del %"PRIx64" - %"PRIx64
 vfio_prereg_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
 vfio_prereg_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
+vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
+vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
 
 # hw/vfio/platform.c
 vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v18 5/5] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-06-21  1:14 [Qemu-devel] [PATCH qemu v18 0/5] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
                   ` (3 preceding siblings ...)
  2016-06-21  1:14 ` [Qemu-devel] [PATCH qemu v18 4/5] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2) Alexey Kardashevskiy
@ 2016-06-21  1:14 ` Alexey Kardashevskiy
  2016-06-22  2:35   ` David Gibson
  4 siblings, 1 reply; 23+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-21  1:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, qemu-ppc, David Gibson, Alex Williamson

This adds support for the Dynamic DMA Windows (DDW) option defined by
the SPAPR specification, which allows additional DMA window(s) to be created.

The "ddw" property is enabled by default on a PHB but for compatibility
the pseries-2.6 machine and older disable it.
This also creates a single DMA window for the older machines to
maintain backward migration.

This implements DDW for PHB with emulated and VFIO devices. Host
kernel support is required. The advertised IOMMU page sizes are 4K and
64K; 16M pages are supported but not advertised by default. To enable
them, the user has to specify the "pgsz" property for the PHB and
enable huge pages for RAM.
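
As a small sketch of how such a page size mask can be read: bit N set means
pages of size 1<<N are allowed, so the default of 4K and 64K is
(1ULL << 12) | (1ULL << 16). The macro names and page_shift_allowed() are
hypothetical; only the bit test itself mirrors the check done in the
create-window handler of this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the page sizes mentioned above */
#define PGSZ_4K    (1ULL << 12)
#define PGSZ_64K   (1ULL << 16)
#define PGSZ_16M   (1ULL << 24)

/* Default mask as described above: 4K and 64K only */
#define PGSZ_DEFAULT_MASK  (PGSZ_4K | PGSZ_64K)

/* True if a window with this page shift may be created under the mask. */
static bool page_shift_allowed(uint64_t page_size_mask, unsigned page_shift)
{
    assert(page_shift < 64);
    return (page_size_mask & (1ULL << page_shift)) != 0;
}
```

With the default mask a request for 16M pages (shift 24) is rejected; it is
accepted once the mask is PGSZ_DEFAULT_MASK | PGSZ_16M, i.e. 0x1011000.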

Existing Linux guests try to create one additional huge DMA window
with 64K or 16MB pages and map the entire guest RAM into it. If this
succeeds, the guest switches to dma_direct_ops and never calls TCE
hypercalls (H_PUT_TCE,...) again. This enables VFIO devices to use the
entire RAM and not waste time on map/unmap later. This adds a
"dma64_win_addr" property which is a bus address for the 64bit window;
by default it is set to 0x800.0000.0000.0000 (1<<59), as this is what
modern POWER8 hardware uses and it allows having emulated and VFIO
devices on the same bus.
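
The arithmetic behind a window creation request can be sketched as follows.
This is only an illustration with hypothetical helper names: the guest passes
a page shift and a window shift, the window then holds
1 << (window_shift - page_shift) TCEs, and the 64bit bus address is returned
as two 32bit cells.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of TCEs in a window.
 * Returns 0 on success, -1 if the window is smaller than one page
 * (the handler in this patch reports a parameter error in that case).
 */
static int ddw_window_tces(unsigned page_shift, unsigned window_shift,
                           uint64_t *nb_tces)
{
    assert(nb_tces && window_shift < 64);
    if (window_shift < page_shift) {
        return -1;
    }
    *nb_tces = 1ULL << (window_shift - page_shift);
    return 0;
}

/* Split a 64bit bus address into high and low 32bit cells. */
static void ddw_split_addr(uint64_t addr, uint32_t *hi, uint32_t *lo)
{
    assert(hi && lo);
    *hi = (uint32_t)(addr >> 32);
    *lo = (uint32_t)(addr & 0xffffffffULL);
}
```

For example, an 8GB window (window shift 33) with 64K pages needs 1 << 17
TCEs, and the default 64bit window address 1<<59 is returned as the cells
0x08000000 and 0.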

This adds 4 RTAS handlers:
* ibm,query-pe-dma-window
* ibm,create-pe-dma-window
* ibm,remove-pe-dma-window
* ibm,reset-pe-dma-window
These are registered from type_init() callback.
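
The argument and return counts each handler in this patch expects can be
summarised in a small table. The sketch below is standalone and hypothetical
(DdwCall, ddw_args_ok() and ddw_buid() are not part of the patch); the
counts and the way the BUID is assembled from two 32bit arguments are taken
from the handlers in spapr_rtas_ddw.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table: RTAS call name, expected nargs and nret */
typedef struct DdwCall {
    const char *name;
    uint32_t nargs;
    uint32_t nret;
} DdwCall;

static const DdwCall ddw_calls[] = {
    { "ibm,query-pe-dma-window",  3, 5 },
    { "ibm,create-pe-dma-window", 5, 4 },
    { "ibm,remove-pe-dma-window", 1, 1 },
    { "ibm,reset-pe-dma-window",  3, 1 },
};

/* True if the call is known and the counts match what the handler checks. */
static bool ddw_args_ok(const char *name, uint32_t nargs, uint32_t nret)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof(ddw_calls) / sizeof(ddw_calls[0]); i++) {
        if (strcmp(ddw_calls[i].name, name) == 0) {
            return ddw_calls[i].nargs == nargs && ddw_calls[i].nret == nret;
        }
    }
    return false;
}

/* The BUID arrives as two 32bit arguments: high word first, then low. */
static uint64_t ddw_buid(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}
```

A mismatch in either count makes the handlers return RTAS_OUT_PARAM_ERROR.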

These RTAS handlers are implemented in a separate file to avoid polluting
spapr_iommu.c with PCI.

This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v18:
* fixed bug when ddw-create rtas call was always creating window at 1<<59
offset
* update minimum supported machine version
* s/dma64_window_addr/dma64_win_addr/ to match dma_win_addr

v17:
* fixed: "query" did return non-page-shifted value when memory hotplug is enabled

v16:
* s/dma_liobn/dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS]/
* s/SPAPR_PCI_LIOBN()/dma_liobn[]/

v15:
* moved page mask filtering to PHB realize(), use "-mem-path" to know
if there are huge pages
* fixed error reporting in RTAS handlers
* max window size now accounts for hotpluggable memory boundaries
---
 hw/ppc/Makefile.objs        |   1 +
 hw/ppc/spapr.c              |   7 +-
 hw/ppc/spapr_pci.c          |  77 +++++++++---
 hw/ppc/spapr_rtas_ddw.c     | 295 ++++++++++++++++++++++++++++++++++++++++++++
 include/hw/pci-host/spapr.h |   8 +-
 include/hw/ppc/spapr.h      |  16 ++-
 trace-events                |   4 +
 7 files changed, 386 insertions(+), 22 deletions(-)
 create mode 100644 hw/ppc/spapr_rtas_ddw.c

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index 5cc6608..91a3420 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -8,6 +8,7 @@ obj-$(CONFIG_PSERIES) += spapr_cpu_core.o
 ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
 obj-y += spapr_pci_vfio.o
 endif
+obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
 # PowerPC 4xx boards
 obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
 obj-y += ppc4xx_pci.o
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 778fa25..f7cff27 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2485,7 +2485,12 @@ DEFINE_SPAPR_MACHINE(2_7, "2.7", true);
  * pseries-2.6
  */
 #define SPAPR_COMPAT_2_6 \
-    HW_COMPAT_2_6
+    HW_COMPAT_2_6 \
+    { \
+        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
+        .property = "ddw",\
+        .value    = stringify(off),\
+    },
 
 static void spapr_machine_2_6_instance_options(MachineState *machine)
 {
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 9f28fb3..0cb51dd 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -35,6 +35,7 @@
 #include "hw/ppc/spapr.h"
 #include "hw/pci-host/spapr.h"
 #include "exec/address-spaces.h"
+#include "exec/ram_addr.h"
 #include <libfdt.h>
 #include "trace.h"
 #include "qemu/error-report.h"
@@ -45,6 +46,7 @@
 #include "hw/ppc/spapr_drc.h"
 #include "sysemu/device_tree.h"
 #include "sysemu/kvm.h"
+#include "sysemu/hostmem.h"
 
 #include "hw/vfio/vfio.h"
 
@@ -1088,7 +1090,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
     int fdt_start_offset = 0, fdt_size;
 
     if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
-        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
+        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
 
         spapr_tce_set_need_vfio(tcet, true);
     }
@@ -1310,11 +1312,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     PCIBus *bus;
     uint64_t msi_window_size = 4096;
     sPAPRTCETable *tcet;
+    const unsigned windows_supported =
+        sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
 
     if (sphb->index != (uint32_t)-1) {
         hwaddr windows_base;
 
-        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn != (uint32_t)-1)
+        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
+            || (sphb->dma_liobn[1] != (uint32_t)-1 && windows_supported == 2)
             || (sphb->mem_win_addr != (hwaddr)-1)
             || (sphb->io_win_addr != (hwaddr)-1)) {
             error_setg(errp, "Either \"index\" or other parameters must"
@@ -1329,7 +1334,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         }
 
         sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
-        sphb->dma_liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
+        for (i = 0; i < windows_supported; ++i) {
+            sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
+        }
 
         windows_base = SPAPR_PCI_WINDOW_BASE
             + sphb->index * SPAPR_PCI_WINDOW_SPACING;
@@ -1342,8 +1349,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    if (sphb->dma_liobn == (uint32_t)-1) {
-        error_setg(errp, "LIOBN not specified for PHB");
+    if ((sphb->dma_liobn[0] == (uint32_t)-1) ||
+        ((sphb->dma_liobn[1] == (uint32_t)-1) && (windows_supported > 1))) {
+        error_setg(errp, "LIOBN(s) not specified for PHB");
         return;
     }
 
@@ -1461,16 +1469,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         }
     }
 
-    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
-    if (!tcet) {
-        error_setg(errp, "Unable to create TCE table for %s",
-                   sphb->dtbusname);
-        return;
+    /* DMA setup */
+    for (i = 0; i < windows_supported; ++i) {
+        tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn[i]);
+        if (!tcet) {
+            error_setg(errp, "Creating window#%d failed for %s",
+                       i, sphb->dtbusname);
+            return;
+        }
+        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
+                                            spapr_tce_get_iommu(tcet), 0);
     }
 
-    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
-                                        spapr_tce_get_iommu(tcet), 0);
-
     sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
 }
 
@@ -1487,13 +1497,19 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
 
 void spapr_phb_dma_reset(sPAPRPHBState *sphb)
 {
-    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
+    int i;
+    sPAPRTCETable *tcet;
 
-    if (tcet && tcet->nb_table) {
-        spapr_tce_table_disable(tcet);
+    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
+        tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[i]);
+
+        if (tcet && tcet->nb_table) {
+            spapr_tce_table_disable(tcet);
+        }
     }
 
     /* Register default 32bit DMA window */
+    tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[0]);
     spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
                            sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT);
 }
@@ -1515,7 +1531,8 @@ static void spapr_phb_reset(DeviceState *qdev)
 static Property spapr_phb_properties[] = {
     DEFINE_PROP_UINT32("index", sPAPRPHBState, index, -1),
     DEFINE_PROP_UINT64("buid", sPAPRPHBState, buid, -1),
-    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn, -1),
+    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn[0], -1),
+    DEFINE_PROP_UINT32("liobn64", sPAPRPHBState, dma_liobn[1], -1),
     DEFINE_PROP_UINT64("mem_win_addr", sPAPRPHBState, mem_win_addr, -1),
     DEFINE_PROP_UINT64("mem_win_size", sPAPRPHBState, mem_win_size,
                        SPAPR_PCI_MMIO_WIN_SIZE),
@@ -1527,6 +1544,11 @@ static Property spapr_phb_properties[] = {
     /* Default DMA window is 0..1GB */
     DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
     DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
+    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_win_addr,
+                       0x800000000000000ULL),
+    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
+    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
+                       (1ULL << 12) | (1ULL << 16)),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -1603,7 +1625,7 @@ static const VMStateDescription vmstate_spapr_pci = {
     .post_load = spapr_pci_post_load,
     .fields = (VMStateField[]) {
         VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),
-        VMSTATE_UINT32_EQUAL(dma_liobn, sPAPRPHBState),
+        VMSTATE_UNUSED(4), /* dma_liobn */
         VMSTATE_UINT64_EQUAL(mem_win_addr, sPAPRPHBState),
         VMSTATE_UINT64_EQUAL(mem_win_size, sPAPRPHBState),
         VMSTATE_UINT64_EQUAL(io_win_addr, sPAPRPHBState),
@@ -1779,6 +1801,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     uint32_t interrupt_map_mask[] = {
         cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
     uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
+    uint32_t ddw_applicable[] = {
+        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
+        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
+        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
+    };
+    uint32_t ddw_extensions[] = {
+        cpu_to_be32(1),
+        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
+    };
     sPAPRTCETable *tcet;
     PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
     sPAPRFDT s_fdt;
@@ -1803,6 +1834,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
     _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
 
+    /* Dynamic DMA window */
+    if (phb->ddw_enabled) {
+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
+                         sizeof(ddw_applicable)));
+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
+                         &ddw_extensions, sizeof(ddw_extensions)));
+    }
+
     /* Build the interrupt-map, this must matches what is done
      * in pci_spapr_map_irq
      */
@@ -1826,7 +1865,7 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
                      sizeof(interrupt_map)));
 
-    tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
+    tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
     if (!tcet) {
         return -1;
     }
diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
new file mode 100644
index 0000000..177dcff
--- /dev/null
+++ b/hw/ppc/spapr_rtas_ddw.c
@@ -0,0 +1,295 @@
+/*
+ * QEMU sPAPR Dynamic DMA windows support
+ *
+ * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License,
+ *  or (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "cpu.h"
+#include "qemu/error-report.h"
+#include "hw/ppc/spapr.h"
+#include "hw/pci-host/spapr.h"
+#include "trace.h"
+
+static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
+{
+    sPAPRTCETable *tcet;
+
+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
+    if (tcet && tcet->nb_table) {
+        ++*(unsigned *)opaque;
+    }
+    return 0;
+}
+
+static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
+{
+    unsigned ret = 0;
+
+    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
+
+    return ret;
+}
+
+static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
+{
+    sPAPRTCETable *tcet;
+
+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
+    if (tcet && !tcet->nb_table) {
+        *(uint32_t *)opaque = tcet->liobn;
+        return 1;
+    }
+    return 0;
+}
+
+static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
+{
+    uint32_t liobn = 0;
+
+    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
+
+    return liobn;
+}
+
+static uint32_t spapr_page_mask_to_query_mask(uint64_t page_mask)
+{
+    int i;
+    uint32_t mask = 0;
+    const struct { int shift; uint32_t mask; } masks[] = {
+        { 12, RTAS_DDW_PGSIZE_4K },
+        { 16, RTAS_DDW_PGSIZE_64K },
+        { 24, RTAS_DDW_PGSIZE_16M },
+        { 25, RTAS_DDW_PGSIZE_32M },
+        { 26, RTAS_DDW_PGSIZE_64M },
+        { 27, RTAS_DDW_PGSIZE_128M },
+        { 28, RTAS_DDW_PGSIZE_256M },
+        { 34, RTAS_DDW_PGSIZE_16G },
+    };
+
+    for (i = 0; i < ARRAY_SIZE(masks); ++i) {
+        if (page_mask & (1ULL << masks[i].shift)) {
+            mask |= masks[i].mask;
+        }
+    }
+
+    return mask;
+}
+
+static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
+                                         sPAPRMachineState *spapr,
+                                         uint32_t token, uint32_t nargs,
+                                         target_ulong args,
+                                         uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    uint64_t buid, max_window_size;
+    uint32_t avail, addr, pgmask = 0;
+    MachineState *machine = MACHINE(spapr);
+
+    if ((nargs != 3) || (nret != 5)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    /* Translate page mask to LoPAPR format */
+    pgmask = spapr_page_mask_to_query_mask(sphb->page_size_mask);
+
+    /*
+     * This is "Largest contiguous block of TCEs allocated specifically
+     * for (that is, are reserved for) this PE".
+     * Return the maximum number, i.e. the maximum RAM size in 4K pages.
+     */
+    if (machine->ram_size == machine->maxram_size) {
+        max_window_size = machine->ram_size;
+    } else {
+        MemoryHotplugState *hpms = &spapr->hotplug_memory;
+
+        max_window_size = hpms->base + memory_region_size(&hpms->mr);
+    }
+
+    avail = SPAPR_PCI_DMA_MAX_WINDOWS - spapr_phb_get_active_win_num(sphb);
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    rtas_st(rets, 1, avail);
+    rtas_st(rets, 2, max_window_size >> SPAPR_TCE_PAGE_SHIFT);
+    rtas_st(rets, 3, pgmask);
+    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
+
+    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
+                                          sPAPRMachineState *spapr,
+                                          uint32_t token, uint32_t nargs,
+                                          target_ulong args,
+                                          uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    sPAPRTCETable *tcet = NULL;
+    uint32_t addr, page_shift, window_shift, liobn;
+    uint64_t buid, win_addr;
+    int windows;
+
+    if ((nargs != 5) || (nret != 4)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    page_shift = rtas_ld(args, 3);
+    window_shift = rtas_ld(args, 4);
+    liobn = spapr_phb_get_free_liobn(sphb);
+    windows = spapr_phb_get_active_win_num(sphb);
+
+    if (!(sphb->page_size_mask & (1ULL << page_shift)) ||
+        (window_shift < page_shift)) {
+        goto param_error_exit;
+    }
+
+    if (!liobn || !sphb->ddw_enabled || windows == SPAPR_PCI_DMA_MAX_WINDOWS) {
+        goto hw_error_exit;
+    }
+
+    tcet = spapr_tce_find_by_liobn(liobn);
+    if (!tcet) {
+        goto hw_error_exit;
+    }
+
+    win_addr = (windows == 0) ? sphb->dma_win_addr : sphb->dma64_win_addr;
+    spapr_tce_table_enable(tcet, page_shift, win_addr,
+                           1ULL << (window_shift - page_shift));
+    if (!tcet->nb_table) {
+        goto hw_error_exit;
+    }
+
+    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
+                                 1ULL << window_shift, tcet->bus_offset, liobn);
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    rtas_st(rets, 1, liobn);
+    rtas_st(rets, 2, tcet->bus_offset >> 32);
+    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
+
+    return;
+
+hw_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
+                                          sPAPRMachineState *spapr,
+                                          uint32_t token, uint32_t nargs,
+                                          target_ulong args,
+                                          uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    sPAPRTCETable *tcet;
+    uint32_t liobn;
+
+    if ((nargs != 1) || (nret != 1)) {
+        goto param_error_exit;
+    }
+
+    liobn = rtas_ld(args, 0);
+    tcet = spapr_tce_find_by_liobn(liobn);
+    if (!tcet) {
+        goto param_error_exit;
+    }
+
+    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
+    if (!sphb || !sphb->ddw_enabled || !tcet->nb_table) {
+        goto param_error_exit;
+    }
+
+    spapr_tce_table_disable(tcet);
+    trace_spapr_iommu_ddw_remove(liobn);
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
+                                         sPAPRMachineState *spapr,
+                                         uint32_t token, uint32_t nargs,
+                                         target_ulong args,
+                                         uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    uint64_t buid;
+    uint32_t addr;
+
+    if ((nargs != 3) || (nret != 1)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    spapr_phb_dma_reset(sphb);
+    trace_spapr_iommu_ddw_reset(buid, addr);
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void spapr_rtas_ddw_init(void)
+{
+    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
+                        "ibm,query-pe-dma-window",
+                        rtas_ibm_query_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
+                        "ibm,create-pe-dma-window",
+                        rtas_ibm_create_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
+                        "ibm,remove-pe-dma-window",
+                        rtas_ibm_remove_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
+                        "ibm,reset-pe-dma-window",
+                        rtas_ibm_reset_pe_dma_window);
+}
+
+type_init(spapr_rtas_ddw_init)
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 7848366..92aa610 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -32,6 +32,8 @@
 #define SPAPR_PCI_HOST_BRIDGE(obj) \
     OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
 
+#define SPAPR_PCI_DMA_MAX_WINDOWS    2
+
 typedef struct sPAPRPHBState sPAPRPHBState;
 
 typedef struct spapr_pci_msi {
@@ -56,7 +58,7 @@ struct sPAPRPHBState {
     hwaddr mem_win_addr, mem_win_size, io_win_addr, io_win_size;
     MemoryRegion memwindow, iowindow, msiwindow;
 
-    uint32_t dma_liobn;
+    uint32_t dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS];
     hwaddr dma_win_addr, dma_win_size;
     AddressSpace iommu_as;
     MemoryRegion iommu_root;
@@ -71,6 +73,10 @@ struct sPAPRPHBState {
     spapr_pci_msi_mig *msi_devs;
 
     QLIST_ENTRY(sPAPRPHBState) list;
+
+    bool ddw_enabled;
+    uint64_t page_size_mask;
+    uint64_t dma64_win_addr;
 };
 
 #define SPAPR_PCI_MAX_INDEX          255
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index e1f8274..36d1748 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -416,6 +416,16 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
 #define RTAS_OUT_NOT_AUTHORIZED                 -9002
 #define RTAS_OUT_SYSPARM_PARAM_ERROR            -9999
 
+/* DDW pagesize mask values from ibm,query-pe-dma-window */
+#define RTAS_DDW_PGSIZE_4K       0x01
+#define RTAS_DDW_PGSIZE_64K      0x02
+#define RTAS_DDW_PGSIZE_16M      0x04
+#define RTAS_DDW_PGSIZE_32M      0x08
+#define RTAS_DDW_PGSIZE_64M      0x10
+#define RTAS_DDW_PGSIZE_128M     0x20
+#define RTAS_DDW_PGSIZE_256M     0x40
+#define RTAS_DDW_PGSIZE_16G      0x80
+
 /* RTAS tokens */
 #define RTAS_TOKEN_BASE      0x2000
 
@@ -457,8 +467,12 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
 #define RTAS_IBM_SET_SLOT_RESET                 (RTAS_TOKEN_BASE + 0x23)
 #define RTAS_IBM_CONFIGURE_PE                   (RTAS_TOKEN_BASE + 0x24)
 #define RTAS_IBM_SLOT_ERROR_DETAIL              (RTAS_TOKEN_BASE + 0x25)
+#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x26)
+#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x27)
+#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x28)
+#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x29)
 
-#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x26)
+#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x2A)
 
 /* RTAS ibm,get-system-parameter token values */
 #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
diff --git a/trace-events b/trace-events
index 7e94d92..5b52634 100644
--- a/trace-events
+++ b/trace-events
@@ -1435,6 +1435,10 @@ spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, un
 spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
 spapr_iommu_pre_save(uint64_t liobn, uint32_t nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
 spapr_iommu_post_load(uint64_t liobn, uint32_t pre_nb, uint32_t post_nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" => %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
+spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32
+spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, uint64_t pg_size, uint64_t req_size, uint64_t start, uint32_t liobn) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%"PRIx64", requested=0x%"PRIx64", start addr=%"PRIx64", liobn=%"PRIx32
+spapr_iommu_ddw_remove(uint32_t liobn) "liobn=%"PRIx32
+spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr) "buid=%"PRIx64" addr=%"PRIx32
 
 # hw/ppc/ppc.c
 ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
-- 
2.5.0.rc3
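
For readers following the RTAS_DDW_PGSIZE_* defines in the header hunk above, here is a minimal standalone sketch of how a host page size bitmap could be turned into the mask returned by ibm,query-pe-dma-window. The helper name page_mask_to_ddw_mask() and the convention that bit n of page_mask means "2^n byte pages supported" are assumptions made for illustration; this is not code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values copied from the include/hw/ppc/spapr.h hunk of this patch. */
#define RTAS_DDW_PGSIZE_4K       0x01
#define RTAS_DDW_PGSIZE_64K      0x02
#define RTAS_DDW_PGSIZE_16M      0x04
#define RTAS_DDW_PGSIZE_32M      0x08
#define RTAS_DDW_PGSIZE_64M      0x10
#define RTAS_DDW_PGSIZE_128M     0x20
#define RTAS_DDW_PGSIZE_256M     0x40
#define RTAS_DDW_PGSIZE_16G      0x80

/*
 * Hypothetical helper: bit n set in page_mask means pages of 2^n bytes
 * are supported. Sizes without a DDW flag are silently dropped.
 */
static uint32_t page_mask_to_ddw_mask(uint64_t page_mask)
{
    static const struct {
        unsigned shift;     /* log2 of the page size */
        uint32_t flag;      /* matching RTAS_DDW_PGSIZE_* bit */
    } map[] = {
        { 12, RTAS_DDW_PGSIZE_4K },
        { 16, RTAS_DDW_PGSIZE_64K },
        { 24, RTAS_DDW_PGSIZE_16M },
        { 25, RTAS_DDW_PGSIZE_32M },
        { 26, RTAS_DDW_PGSIZE_64M },
        { 27, RTAS_DDW_PGSIZE_128M },
        { 28, RTAS_DDW_PGSIZE_256M },
        { 34, RTAS_DDW_PGSIZE_16G },
    };
    uint32_t mask = 0;
    size_t i;

    for (i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
        if (page_mask & ((uint64_t)1 << map[i].shift)) {
            mask |= map[i].flag;
        }
    }
    return mask;
}
```

A host that supports 4K, 64K and 16M pages would then report the three lowest flag bits.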

* Re: [Qemu-devel] [PATCH qemu v18 1/5] memory: Add reporting of supported page sizes
  2016-06-21  1:14 ` [Qemu-devel] [PATCH qemu v18 1/5] memory: Add reporting of supported page sizes Alexey Kardashevskiy
@ 2016-06-21  6:16   ` David Gibson
  2016-06-21 10:23     ` Paolo Bonzini
  0 siblings, 1 reply; 23+ messages in thread
From: David Gibson @ 2016-06-21  6:16 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-devel, qemu-ppc, Alex Williamson, pbonzini

On Tue, Jun 21, 2016 at 11:14:01AM +1000, Alexey Kardashevskiy wrote:
> Every IOMMU has some granularity which MemoryRegionIOMMUOps::translate
> uses when translating, however this information is not available outside
> the translate context for various checks.
> 
> This adds a get_min_page_size callback to MemoryRegionIOMMUOps and
> a wrapper for it so IOMMU users (such as VFIO) can know the minimum
> actual page size supported by an IOMMU.
> 
> As an IOMMU MR represents a guest IOMMU, this uses TARGET_PAGE_SIZE
> as a fallback.
> 
> This removes vfio_container_granularity() and uses the new helper in
> memory_region_iommu_replay() when replaying IOMMU mappings on a newly
> added IOMMU memory region.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> Acked-by: Alex Williamson <alex.williamson@redhat.com>

One remaining nit..

[snip]
> +void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write)
> +{
> +    hwaddr addr, granularity;
>      IOMMUTLBEntry iotlb;
>  
> +    granularity = (hwaddr)1 << ctz64(memory_region_iommu_get_min_page_size(mr));

Because this is now a plain size, rather than some sort of pagemask,
you don't need the ctz64() business.
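
To illustrate, a minimal standalone sketch (not the QEMU code; ctz64_sketch() is a hypothetical stand-in for QEMU's ctz64(), and both granularity helpers are made up for this example). It assumes the callback returns a non-zero power-of-two size, in which case rebuilding the value from its lowest set bit gives back the same number:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for QEMU's ctz64(): count trailing zero bits. */
static inline int ctz64_sketch(uint64_t val)
{
    int n = 0;

    if (!val) {
        return 64;
    }
    while (!(val & 1)) {
        val >>= 1;
        n++;
    }
    return n;
}

/* What the patch computes: rebuild the size from its lowest set bit. */
static inline uint64_t granularity_via_ctz(uint64_t min_page_size)
{
    assert(min_page_size != 0);     /* a shift by 64 would be undefined */
    return (uint64_t)1 << ctz64_sketch(min_page_size);
}

/* The suggested simplification: use the size as it is. */
static inline uint64_t granularity_direct(uint64_t min_page_size)
{
    return min_page_size;
}
```

The two helpers only differ when the input is not a power of two, which a plain page size should never be.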

>      for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
>          iotlb = mr->iommu_ops->translate(mr, addr, is_write);
>          if (iotlb.perm != IOMMU_NONE) {

Paolo, are you ok for me to make that small change and take this
through my tree?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v18 2/5] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2)
  2016-06-21  1:14 ` [Qemu-devel] [PATCH qemu v18 2/5] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2) Alexey Kardashevskiy
@ 2016-06-21  6:46   ` David Gibson
  2016-06-22 16:49   ` Alex Williamson
  1 sibling, 0 replies; 23+ messages in thread
From: David Gibson @ 2016-06-21  6:46 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-devel, qemu-ppc, Alex Williamson

On Tue, Jun 21, 2016 at 11:14:02AM +1000, Alexey Kardashevskiy wrote:
> This makes use of the new "memory registering" feature. The idea is
> to give userspace the ability to notify the host kernel about pages
> which are going to be used for DMA. Having this information, the host
> kernel can pin them all once per user process, do locked pages
> accounting (once) and not spend time on doing that in real time with
> possible failures which cannot be handled nicely in some cases.
> 
> This adds a prereg memory listener which listens on address_space_memory
> and notifies a VFIO container about memory which needs to be
> pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
> 
> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> not call it when v2 is detected and enabled.
> 
> This enforces guest RAM blocks to be host page size aligned; however
> this is not new as KVM already requires memory slots to be host page
> size aligned.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Alex, want to take this through your tree, or should I take it through
mine?
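
For reference, the alignment requirement from the commit message can be sketched as below. This is a standalone illustration, assuming a power-of-two host page size; section_is_host_page_aligned() is a hypothetical helper and not something the patch adds. It mirrors the three "& ~page_mask" tests in the prereg listener callbacks:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: returns true when the
 * guest physical address, the offset within the region and the size
 * are all multiples of the host page size.
 */
static bool section_is_host_page_aligned(uint64_t gpa, uint64_t offset,
                                         uint64_t size, uint64_t page_size)
{
    uint64_t low_bits;

    /* page_size must be a non-zero power of two */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    /* page_size - 1 selects the same bits as ~qemu_real_host_page_mask */
    low_bits = page_size - 1;

    return !((gpa | offset | size) & low_bits);
}
```

With a 64K host page size, a section starting at 0x1000 would be rejected even though it is 4K aligned.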

> ---
> Changes:
> v18:
> * made a copy of listener trace points in spapr.c
> * fixed cleanup in vfio_connect_container
> * removed assert in vfio_prereg_listener_region_add()
> * created "prereg" copy of traces
> 
> v17:
> * s/prereg\.c/spapr.c/
> * s/vfio_prereg_gpa_to_ua/vfio_prereg_gpa_to_vaddr/
> * vfio_prereg_listener_skipped_section does hw_error() on IOMMUs
> 
> v16:
> * switched to 64bit math everywhere as there is no chance to see
> region_add on RAM blocks even remotely close to 1<<64bytes.
> 
> v15:
> * banned unaligned sections
> * added a vfio_prereg_gpa_to_ua() helper
> 
> v14:
> * s/free_container_exit/listener_release_exit/g
> * added "if memory_region_is_iommu()" to vfio_prereg_listener_skipped_section
> ---
>  hw/vfio/Makefile.objs         |   1 +
>  hw/vfio/common.c              |  42 ++++++++++---
>  hw/vfio/spapr.c               | 139 ++++++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |   4 ++
>  trace-events                  |   6 ++
>  5 files changed, 182 insertions(+), 10 deletions(-)
>  create mode 100644 hw/vfio/spapr.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> index ceddbb8..c25e32b 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
>  obj-$(CONFIG_SOFTMMU) += platform.o
>  obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
>  obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
> +obj-$(CONFIG_SOFTMMU) += spapr.o
>  endif
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 27cc159..22be48b 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -502,6 +502,9 @@ static const MemoryListener vfio_memory_listener = {
>  static void vfio_listener_release(VFIOContainer *container)
>  {
>      memory_listener_unregister(&container->listener);
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +        memory_listener_unregister(&container->prereg_listener);
> +    }
>  }
>  
>  static struct vfio_info_cap_header *
> @@ -860,8 +863,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              goto free_container_exit;
>          }
>  
> -        ret = ioctl(fd, VFIO_SET_IOMMU,
> -                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
> +        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>          if (ret) {
>              error_report("vfio: failed to set iommu for container: %m");
>              ret = -errno;
> @@ -886,8 +889,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>          if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
>              container->iova_pgsizes = info.iova_pgsizes;
>          }
> -    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
> +               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
>          struct vfio_iommu_spapr_tce_info info;
> +        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
>  
>          ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>          if (ret) {
> @@ -895,7 +900,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              ret = -errno;
>              goto free_container_exit;
>          }
> -        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
> +        container->iommu_type =
> +            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>          if (ret) {
>              error_report("vfio: failed to set iommu for container: %m");
>              ret = -errno;
> @@ -907,11 +914,23 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>           * when container fd is closed so we do not call it explicitly
>           * in this file.
>           */
> -        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> -        if (ret) {
> -            error_report("vfio: failed to enable container: %m");
> -            ret = -errno;
> -            goto free_container_exit;
> +        if (!v2) {
> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> +            if (ret) {
> +                error_report("vfio: failed to enable container: %m");
> +                ret = -errno;
> +                goto free_container_exit;
> +            }
> +        } else {
> +            container->prereg_listener = vfio_prereg_listener;
> +
> +            memory_listener_register(&container->prereg_listener,
> +                                     &address_space_memory);
> +            if (container->error) {
> +                memory_listener_unregister(&container->prereg_listener);
> +                error_report("vfio: RAM memory listener initialization failed for container");
> +                goto free_container_exit;
> +            }
>          }
>  
>          /*
> @@ -924,7 +943,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>          if (ret) {
>              error_report("vfio: VFIO_IOMMU_SPAPR_TCE_GET_INFO failed: %m");
>              ret = -errno;
> -            goto free_container_exit;
> +            if (v2) {
> +                memory_listener_unregister(&container->prereg_listener);
> +            }
> +            goto listener_release_exit;
>          }
>          container->min_iova = info.dma32_window_start;
>          container->max_iova = container->min_iova + info.dma32_window_size - 1;
> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
> new file mode 100644
> index 0000000..5c29bec
> --- /dev/null
> +++ b/hw/vfio/spapr.c
> @@ -0,0 +1,139 @@
> +/*
> + * DMA memory preregistration
> + *
> + * Authors:
> + *  Alexey Kardashevskiy <aik@ozlabs.ru>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "cpu.h"
> +#include <sys/ioctl.h>
> +#include <linux/vfio.h>
> +
> +#include "hw/vfio/vfio-common.h"
> +#include "hw/hw.h"
> +#include "qemu/error-report.h"
> +#include "trace.h"
> +
> +static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
> +{
> +    if (memory_region_is_iommu(section->mr)) {
> +        hw_error("Cannot possibly preregister IOMMU memory");
> +    }
> +
> +    return !memory_region_is_ram(section->mr) ||
> +            memory_region_is_skip_dump(section->mr);
> +}
> +
> +static void *vfio_prereg_gpa_to_vaddr(MemoryRegionSection *section, hwaddr gpa)
> +{
> +    return memory_region_get_ram_ptr(section->mr) +
> +        section->offset_within_region +
> +        (gpa - section->offset_within_address_space);
> +}
> +
> +static void vfio_prereg_listener_region_add(MemoryListener *listener,
> +                                            MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> +                                            prereg_listener);
> +    const hwaddr gpa = section->offset_within_address_space;
> +    hwaddr end;
> +    int ret;
> +    hwaddr page_mask = qemu_real_host_page_mask;
> +    struct vfio_iommu_spapr_register_memory reg = {
> +        .argsz = sizeof(reg),
> +        .flags = 0,
> +    };
> +
> +    if (vfio_prereg_listener_skipped_section(section)) {
> +        trace_vfio_prereg_listener_region_add_skip(
> +                section->offset_within_address_space,
> +                section->offset_within_address_space +
> +                int128_get64(int128_sub(section->size, int128_one())));
> +        return;
> +    }
> +
> +    if (unlikely((section->offset_within_address_space & ~page_mask) ||
> +                 (section->offset_within_region & ~page_mask) ||
> +                 (int128_get64(section->size) & ~page_mask))) {
> +        error_report("%s received unaligned region", __func__);
> +        return;
> +    }
> +
> +    end = section->offset_within_address_space + int128_get64(section->size);
> +    if (gpa >= end) {
> +        return;
> +    }
> +
> +    memory_region_ref(section->mr);
> +
> +    reg.vaddr = (__u64) vfio_prereg_gpa_to_vaddr(section, gpa);
> +    reg.size = end - gpa;
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
> +    trace_vfio_prereg_register(reg.vaddr, reg.size, ret ? -errno : 0);
> +    if (ret) {
> +        /*
> +         * On the initfn path, store the first error in the container so we
> +         * can gracefully fail.  Runtime, there's not much we can do other
> +         * than throw a hardware error.
> +         */
> +        if (!container->initialized) {
> +            if (!container->error) {
> +                container->error = ret;
> +            }
> +        } else {
> +            hw_error("vfio: Memory registering failed, unable to continue");
> +        }
> +    }
> +}
> +
> +static void vfio_prereg_listener_region_del(MemoryListener *listener,
> +                                            MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> +                                            prereg_listener);
> +    const hwaddr gpa = section->offset_within_address_space;
> +    hwaddr end;
> +    int ret;
> +    hwaddr page_mask = qemu_real_host_page_mask;
> +    struct vfio_iommu_spapr_register_memory reg = {
> +        .argsz = sizeof(reg),
> +        .flags = 0,
> +    };
> +
> +    if (vfio_prereg_listener_skipped_section(section)) {
> +        trace_vfio_prereg_listener_region_del_skip(
> +                section->offset_within_address_space,
> +                section->offset_within_address_space +
> +                int128_get64(int128_sub(section->size, int128_one())));
> +        return;
> +    }
> +
> +    if (unlikely((section->offset_within_address_space & ~page_mask) ||
> +                 (section->offset_within_region & ~page_mask) ||
> +                 (int128_get64(section->size) & ~page_mask))) {
> +        error_report("%s received unaligned region", __func__);
> +        return;
> +    }
> +
> +    end = section->offset_within_address_space + int128_get64(section->size);
> +    if (gpa >= end) {
> +        return;
> +    }
> +
> +    reg.vaddr = (__u64) vfio_prereg_gpa_to_vaddr(section, gpa);
> +    reg.size = end - gpa;
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
> +    trace_vfio_prereg_unregister(reg.vaddr, reg.size, ret ? -errno : 0);
> +}
> +
> +const MemoryListener vfio_prereg_listener = {
> +    .region_add = vfio_prereg_listener_region_add,
> +    .region_del = vfio_prereg_listener_region_del,
> +};
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 0610377..405c3b2 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -73,6 +73,8 @@ typedef struct VFIOContainer {
>      VFIOAddressSpace *space;
>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>      MemoryListener listener;
> +    MemoryListener prereg_listener;
> +    unsigned iommu_type;
>      int error;
>      bool initialized;
>      /*
> @@ -158,4 +160,6 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>                               uint32_t subtype, struct vfio_region_info **info);
>  #endif
> +extern const MemoryListener vfio_prereg_listener;
> +
>  #endif /* !HW_VFIO_VFIO_COMMON_H */
> diff --git a/trace-events b/trace-events
> index da0d060..0b1583f 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1770,6 +1770,12 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
>  vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
>  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
>  
> +# hw/vfio/spapr.c
> +vfio_prereg_listener_region_add_skip(uint64_t start, uint64_t end) "SKIPPING region_add %"PRIx64" - %"PRIx64
> +vfio_prereg_listener_region_del_skip(uint64_t start, uint64_t end) "SKIPPING region_del %"PRIx64" - %"PRIx64
> +vfio_prereg_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> +vfio_prereg_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> +
>  # hw/vfio/platform.c
>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
>  vfio_platform_realize(char *name, char *compat) "vfio device %s, compat = %s"

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v18 3/5] vfio: Add host side DMA window capabilities
  2016-06-21  1:14 ` [Qemu-devel] [PATCH qemu v18 3/5] vfio: Add host side DMA window capabilities Alexey Kardashevskiy
@ 2016-06-21  6:50   ` David Gibson
  2016-06-22 17:03     ` Alex Williamson
  0 siblings, 1 reply; 23+ messages in thread
From: David Gibson @ 2016-06-21  6:50 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-devel, qemu-ppc, Alex Williamson

On Tue, Jun 21, 2016 at 11:14:03AM +1000, Alexey Kardashevskiy wrote:
> There are going to be multiple IOMMUs per container. This moves
> the single host IOMMU parameter set to a list of VFIOHostDMAWindow.
> 
> This should cause no behavioral change and will be used later by
> the SPAPR TCE IOMMU v2 which will also add a vfio_host_win_del() helper.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Looks ok to me.  Again, Alex, your tree or mine?

One minor point..
[snip]
> @@ -878,17 +908,14 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>           * existing Type1 IOMMUs generally support any IOVA we're
>           * going to actually try in practice.
>           */
> -        container->min_iova = 0;
> -        container->max_iova = (hwaddr)-1;
> -
> -        /* Assume just 4K IOVA page size */
> -        container->iova_pgsizes = 0x1000;
>          info.argsz = sizeof(info);
>          ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
>          /* Ignore errors */
> -        if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
> -            container->iova_pgsizes = info.iova_pgsizes;
> +        if (ret || !(info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
> +            /* Assume 4k IOVA page size */
> +            info.iova_pgsizes = 4096;
>          }
> +        vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes);

I don't think it needs to hold this patch up, but at some point we
should work out the real range covered by the x86 IOMMU tables and put
that in here.  I'm pretty sure it won't actually be 2^64-1.
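
As a rough sketch of what that could look like, assuming the host IOMMU's address width in bits is known from somewhere (how to obtain it is not covered in this thread, and max_iova_for_width() is a hypothetical helper), the upper bound passed to vfio_host_win_add() would be derived like this:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: highest IOVA addressable with `width` bits. */
static uint64_t max_iova_for_width(unsigned width)
{
    assert(width >= 1 && width <= 64);

    if (width == 64) {
        return UINT64_MAX;      /* avoid the undefined 1 << 64 */
    }
    return ((uint64_t)1 << width) - 1;
}
```

For example, a width of 48 bits gives 0xffffffffffff rather than (hwaddr)-1; the value 48 here is only an example and is not taken from any particular host.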

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v18 1/5] memory: Add reporting of supported page sizes
  2016-06-21  6:16   ` David Gibson
@ 2016-06-21 10:23     ` Paolo Bonzini
  2016-06-22  1:13       ` David Gibson
  0 siblings, 1 reply; 23+ messages in thread
From: Paolo Bonzini @ 2016-06-21 10:23 UTC (permalink / raw)
  To: David Gibson, Alexey Kardashevskiy; +Cc: qemu-devel, qemu-ppc, Alex Williamson



On 21/06/2016 08:16, David Gibson wrote:
>>      for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
>>          iotlb = mr->iommu_ops->translate(mr, addr, is_write);
>>          if (iotlb.perm != IOMMU_NONE) {
> 
> Paolo, are you ok for me to make that small change and take this
> through my tree?

Sure.

Paolo

* Re: [Qemu-devel] [PATCH qemu v18 1/5] memory: Add reporting of supported page sizes
  2016-06-21 10:23     ` Paolo Bonzini
@ 2016-06-22  1:13       ` David Gibson
  0 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2016-06-22  1:13 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Alexey Kardashevskiy, qemu-devel, qemu-ppc, Alex Williamson

On Tue, Jun 21, 2016 at 12:23:02PM +0200, Paolo Bonzini wrote:
> 
> 
> On 21/06/2016 08:16, David Gibson wrote:
> >>      for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
> >>          iotlb = mr->iommu_ops->translate(mr, addr, is_write);
> >>          if (iotlb.perm != IOMMU_NONE) {
> > 
> > Paolo, are you ok for me to make that small change and take this
> > through my tree?
> 
> Sure.

Thanks, applied to ppc-for-2.7.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v18 4/5] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)
  2016-06-21  1:14 ` [Qemu-devel] [PATCH qemu v18 4/5] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2) Alexey Kardashevskiy
@ 2016-06-22  1:29   ` David Gibson
  2016-06-22 14:38   ` Laurent Vivier
  1 sibling, 0 replies; 23+ messages in thread
From: David Gibson @ 2016-06-22  1:29 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-devel, qemu-ppc, Alex Williamson

On Tue, Jun 21, 2016 at 11:14:04AM +1000, Alexey Kardashevskiy wrote:
> The new VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
> This adds the ability for VFIO common code to dynamically allocate/remove
> DMA windows in the host kernel when a new VFIO container is added/removed.
> 
> This adds a helper to vfio_listener_region_add which issues the
> VFIO_IOMMU_SPAPR_TCE_CREATE ioctl and adds the newly created IOMMU to
> the host IOMMU list; the opposite action is taken in
> vfio_listener_region_del.
> 
> When creating a new window, this uses a heuristic to decide on the number
> of TCE table levels.
> 
> This should cause no guest visible change in behavior.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
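
For reference, the "no intersections between windows" rule that region_add enforces can be sketched with inclusive bounds as below. This is a standalone illustration; windows_intersect() is a hypothetical helper, whereas the patch itself calls QEMU's ranges_overlap() with start/length pairs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true when [min1, max1] and [min2, max2] share
 * at least one address. Inclusive upper bounds avoid overflow for
 * windows that end at the top of the 64-bit space.
 */
static bool windows_intersect(uint64_t min1, uint64_t max1,
                              uint64_t min2, uint64_t max2)
{
    assert(min1 <= max1 && min2 <= max2);

    return !(max1 < min2 || max2 < min1);
}
```

Two adjacent windows such as [0, 0x3fffffff] and [0x40000000, 0x7fffffff] do not intersect, so both could be created.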


> ---
> Changes:
> v18:
> * moved trace definitions under hw/vfio/spapr.c section
> * moved trace_vfio_spapr_remove_window to vfio_spapr_remove_window()
> * vfio_host_win_del() now checks for exact window size
> * one ctz() less in vfio_spapr_create_window()
> 
> v17:
> * moved spapr window create/remove helpers to separate file
> * added hw_error() if vfio_host_win_del() failed
> 
> v16:
> * used memory_region_iommu_get_page_sizes() in vfio_listener_region_add()
> * enforced no intersections between windows
> 
> v14:
> * new to the series
> ---
>  hw/vfio/common.c              | 79 +++++++++++++++++++++++++++++++++++++------
>  hw/vfio/spapr.c               | 71 ++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |  6 ++++
>  trace-events                  |  2 ++
>  4 files changed, 148 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index b53a1db..8e3466c 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -265,6 +265,21 @@ static void vfio_host_win_add(VFIOContainer *container,
>      QLIST_INSERT_HEAD(&container->hostwin_list, hostwin, hostwin_next);
>  }
>  
> +static int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
> +                             hwaddr max_iova)
> +{
> +    VFIOHostDMAWindow *hostwin;
> +
> +    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
> +        if (hostwin->min_iova == min_iova && hostwin->max_iova == max_iova) {
> +            QLIST_REMOVE(hostwin, hostwin_next);
> +            return 0;
> +        }
> +    }
> +
> +    return -1;
> +}
> +
>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>  {
>      return (!memory_region_is_ram(section->mr) &&
> @@ -380,6 +395,30 @@ static void vfio_listener_region_add(MemoryListener *listener,
>      }
>      end = int128_get64(int128_sub(llend, int128_one()));
>  
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +        VFIOHostDMAWindow *hostwin;
> +        hwaddr pgsize = 0;
> +
> +        /* For now intersections are not allowed, we may relax this later */
> +        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
> +            if (ranges_overlap(hostwin->min_iova,
> +                               hostwin->max_iova - hostwin->min_iova + 1,
> +                               section->offset_within_address_space,
> +                               int128_get64(section->size))) {
> +                goto fail;
> +            }
> +        }
> +
> +        ret = vfio_spapr_create_window(container, section, &pgsize);
> +        if (ret) {
> +            goto fail;
> +        }
> +
> +        vfio_host_win_add(container, section->offset_within_address_space,
> +                          section->offset_within_address_space +
> +                          int128_get64(section->size) - 1, pgsize);
> +    }
> +
>      hostwin_found = false;
>      QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
>          if (hostwin->min_iova <= iova && end <= hostwin->max_iova) {
> @@ -522,6 +561,18 @@ static void vfio_listener_region_del(MemoryListener *listener,
>                       "0x%"HWADDR_PRIx") = %d (%m)",
>                       container, iova, int128_get64(llsize), ret);
>      }
> +
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +        vfio_spapr_remove_window(container,
> +                                 section->offset_within_address_space);
> +        if (vfio_host_win_del(container,
> +                              section->offset_within_address_space,
> +                              section->offset_within_address_space +
> +                              int128_get64(section->size) - 1) < 0) {
> +            hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
> +                     __func__, section->offset_within_address_space);
> +        }
> +    }
>  }
>  
>  static const MemoryListener vfio_memory_listener = {
> @@ -960,11 +1011,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              }
>          }
>  
> -        /*
> -         * This only considers the host IOMMU's 32-bit window.  At
> -         * some point we need to add support for the optional 64-bit
> -         * window and dynamic windows
> -         */
>          info.argsz = sizeof(info);
>          ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
>          if (ret) {
> @@ -976,11 +1022,24 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              goto listener_release_exit;
>          }
>  
> -        /* The default table uses 4K pages */
> -        vfio_host_win_add(container, info.dma32_window_start,
> -                          info.dma32_window_start +
> -                          info.dma32_window_size - 1,
> -                          0x1000);
> +        if (v2) {
> +            /*
> +             * There is a default window in just created container.
> +             * To make region_add/del simpler, we better remove this
> +             * window now and let those iommu_listener callbacks
> +             * create/remove them when needed.
> +             */
> +            ret = vfio_spapr_remove_window(container, info.dma32_window_start);
> +            if (ret) {
> +                goto free_container_exit;
> +            }
> +        } else {
> +            /* The default table uses 4K pages */
> +            vfio_host_win_add(container, info.dma32_window_start,
> +                              info.dma32_window_start +
> +                              info.dma32_window_size - 1,
> +                              0x1000);
> +        }
>      } else {
>          error_report("vfio: No available IOMMU models");
>          ret = -EINVAL;
> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
> index 5c29bec..852da0b 100644
> --- a/hw/vfio/spapr.c
> +++ b/hw/vfio/spapr.c
> @@ -137,3 +137,74 @@ const MemoryListener vfio_prereg_listener = {
>      .region_add = vfio_prereg_listener_region_add,
>      .region_del = vfio_prereg_listener_region_del,
>  };
> +
> +int vfio_spapr_create_window(VFIOContainer *container,
> +                             MemoryRegionSection *section,
> +                             hwaddr *pgsize)
> +{
> +    int ret;
> +    unsigned pagesize = memory_region_iommu_get_min_page_size(section->mr);
> +    unsigned entries, pages;
> +    struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
> +
> +    /*
> +     * FIXME: For VFIO iommu types which have KVM acceleration to
> +     * avoid bouncing all map/unmaps through qemu this way, this
> +     * would be the right place to wire that up (tell the KVM
> +     * device emulation the VFIO iommu handles to use).
> +     */
> +    create.window_size = int128_get64(section->size);
> +    create.page_shift = ctz64(pagesize);
> +    /*
> +     * SPAPR host supports multilevel TCE tables, there is some
> +     * heuristic to decide how many levels we want for our table:
> +     * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
> +     */
> +    entries = create.window_size >> create.page_shift;
> +    pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
> +    pages = MAX(pow2ceil(pages) - 1, 1); /* Round up */
> +    create.levels = ctz64(pages) / 6 + 1;
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> +    if (ret) {
> +        error_report("Failed to create a window, ret = %d (%m)", ret);
> +        return -errno;
> +    }
> +
> +    if (create.start_addr != section->offset_within_address_space) {
> +        vfio_spapr_remove_window(container, create.start_addr);
> +
> +        error_report("Host doesn't support DMA window at %"HWADDR_PRIx", must be %"PRIx64,
> +                     section->offset_within_address_space,
> +                     create.start_addr);
> +        ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> +        return -EINVAL;
> +    }
> +    trace_vfio_spapr_create_window(create.page_shift,
> +                                   create.window_size,
> +                                   create.start_addr);
> +    *pgsize = pagesize;
> +
> +    return 0;
> +}
> +
> +int vfio_spapr_remove_window(VFIOContainer *container,
> +                             hwaddr offset_within_address_space)
> +{
> +    struct vfio_iommu_spapr_tce_remove remove = {
> +        .argsz = sizeof(remove),
> +        .start_addr = offset_within_address_space,
> +    };
> +    int ret;
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> +    if (ret) {
> +        error_report("Failed to remove window at %"PRIx64,
> +                     remove.start_addr);
> +        return -errno;
> +    }
> +
> +    trace_vfio_spapr_remove_window(offset_within_address_space);
> +
> +    return 0;
> +}
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index b1f3e92..07f7188 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -168,4 +168,10 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>  #endif
>  extern const MemoryListener vfio_prereg_listener;
>  
> +int vfio_spapr_create_window(VFIOContainer *container,
> +                             MemoryRegionSection *section,
> +                             hwaddr *pgsize);
> +int vfio_spapr_remove_window(VFIOContainer *container,
> +                             hwaddr offset_within_address_space);
> +
>  #endif /* !HW_VFIO_VFIO_COMMON_H */
> diff --git a/trace-events b/trace-events
> index 0b1583f..7e94d92 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1775,6 +1775,8 @@ vfio_prereg_listener_region_add_skip(uint64_t start, uint64_t end) "SKIPPING reg
>  vfio_prereg_listener_region_del_skip(uint64_t start, uint64_t end) "SKIPPING region_del %"PRIx64" - %"PRIx64
>  vfio_prereg_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>  vfio_prereg_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> +vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
> +vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
>  
>  # hw/vfio/platform.c
>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
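
The levels heuristic in vfio_spapr_create_window() is easier to check when evaluated outside QEMU. Below is a minimal standalone sketch, not the patch's code: round_up_pow2() and count_trailing_zeros64() are hypothetical stand-ins for QEMU's pow2ceil() and ctz64(), and levels_from_table() is only one assumed way to follow the table in the comment (0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4). With these stand-ins the quoted expression appears to yield 1 for every input, because pow2ceil(pages) - 1 is odd whenever pages is at least 2.

```c
#include <stdint.h>

/* Smallest power of two >= value; stand-in for QEMU's pow2ceil(). */
static uint64_t round_up_pow2(uint64_t value)
{
    uint64_t result = 1;

    while (result < value) {
        result <<= 1;
    }
    return result;
}

/* Number of trailing zero bits; stand-in for QEMU's ctz64() (64 for 0). */
static unsigned count_trailing_zeros64(uint64_t value)
{
    unsigned count = 0;

    if (value == 0) {
        return 64;
    }
    while (!(value & 1)) {
        value >>= 1;
        count++;
    }
    return count;
}

/* The expression as it is quoted in the patch above. */
static unsigned levels_as_quoted(uint64_t pages)
{
    uint64_t rounded = round_up_pow2(pages) - 1;

    if (rounded < 1) {
        rounded = 1;
    }
    return count_trailing_zeros64(rounded) / 6 + 1;
}

/* Hypothetical alternative that follows the table in the comment. */
static unsigned levels_from_table(uint64_t pages)
{
    unsigned levels = 1;
    uint64_t capacity = 64; /* pages covered by a one-level table */

    while (pages > capacity && levels < 4) {
        capacity *= 64;
        levels++;
    }
    return levels;
}
```

Comparing the two functions at the table boundaries (64, 65, 4096, 4097, 262144, 262145) shows where they diverge.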

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v18 5/5] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-06-21  1:14 ` [Qemu-devel] [PATCH qemu v18 5/5] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
@ 2016-06-22  2:35   ` David Gibson
  2016-06-22  3:23     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 23+ messages in thread
From: David Gibson @ 2016-06-22  2:35 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-devel, qemu-ppc, Alex Williamson

On Tue, Jun 21, 2016 at 11:14:05AM +1000, Alexey Kardashevskiy wrote:
> This adds support for the Dynamic DMA Windows (DDW) option defined by
> the SPAPR specification, which allows additional DMA window(s).
> 
> The "ddw" property is enabled by default on a PHB but for compatibility
> the pseries-2.6 machine and older disable it.
> This also creates a single DMA window for the older machines to
> maintain backward migration.
> 
> This implements DDW for PHB with emulated and VFIO devices. The host
> kernel support is required. The advertised IOMMU page sizes are 4K and
> 64K; 16M pages are supported but not advertised by default. To enable
> them, the user has to specify the "pgsz" property for the PHB and
> enable huge pages for RAM.
> 
> Existing Linux guests try to create one additional huge DMA window
> with 64K or 16MB pages and map the entire guest RAM into it. If this
> succeeds, the guest switches to dma_direct_ops and never calls TCE
> hypercalls (H_PUT_TCE,...) again. This enables VFIO devices to use the
> entire RAM and not waste time on map/unmap later. This adds a
> "dma64_win_addr" property, which is the bus address of the 64bit window
> and defaults to 0x800.0000.0000.0000, as this is what modern POWER8
> hardware uses and it allows having emulated and VFIO devices on the
> same bus.
> 
> This adds 4 RTAS handlers:
> * ibm,query-pe-dma-window
> * ibm,create-pe-dma-window
> * ibm,remove-pe-dma-window
> * ibm,reset-pe-dma-window
> These are registered from a type_init() callback.
> 
> These RTAS handlers are implemented in a separate file to avoid polluting
> spapr_iommu.c with PCI.
> 
> This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

A few queries below.  Not sure if they'll require code changes or just
explanation.

> ---
> Changes:
> v18:
> * fixed a bug where the ddw-create RTAS call always created the window
> at offset 1<<59
> * update minimum supported machine version
> * s/dma64_window_addr/dma64_win_addr/ to match dma_win_addr
> 
> v17:
> * fixed: "query" did return non-page-shifted value when memory hotplug is enabled
> 
> v16:
> * s/dma_liobn/dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS]/
> * s/SPAPR_PCI_LIOBN()/dma_liobn[]/
> 
> v15:
> * moved page mask filtering to PHB realize(), use "-mem-path" to know
> if there are huge pages
> * fixed error reporting in RTAS handlers
> * max window size now accounts for hotpluggable memory boundaries
> ---
>  hw/ppc/Makefile.objs        |   1 +
>  hw/ppc/spapr.c              |   7 +-
>  hw/ppc/spapr_pci.c          |  77 +++++++++---
>  hw/ppc/spapr_rtas_ddw.c     | 295 ++++++++++++++++++++++++++++++++++++++++++++
>  include/hw/pci-host/spapr.h |   8 +-
>  include/hw/ppc/spapr.h      |  16 ++-
>  trace-events                |   4 +
>  7 files changed, 386 insertions(+), 22 deletions(-)
>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
> 
> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> index 5cc6608..91a3420 100644
> --- a/hw/ppc/Makefile.objs
> +++ b/hw/ppc/Makefile.objs
> @@ -8,6 +8,7 @@ obj-$(CONFIG_PSERIES) += spapr_cpu_core.o
>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>  obj-y += spapr_pci_vfio.o
>  endif
> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>  # PowerPC 4xx boards
>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>  obj-y += ppc4xx_pci.o
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 778fa25..f7cff27 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -2485,7 +2485,12 @@ DEFINE_SPAPR_MACHINE(2_7, "2.7", true);
>   * pseries-2.6
>   */
>  #define SPAPR_COMPAT_2_6 \
> -    HW_COMPAT_2_6
> +    HW_COMPAT_2_6 \
> +    { \
> +        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
> +        .property = "ddw",\
> +        .value    = stringify(off),\
> +    },
>  
>  static void spapr_machine_2_6_instance_options(MachineState *machine)
>  {
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 9f28fb3..0cb51dd 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -35,6 +35,7 @@
>  #include "hw/ppc/spapr.h"
>  #include "hw/pci-host/spapr.h"
>  #include "exec/address-spaces.h"
> +#include "exec/ram_addr.h"
>  #include <libfdt.h>
>  #include "trace.h"
>  #include "qemu/error-report.h"
> @@ -45,6 +46,7 @@
>  #include "hw/ppc/spapr_drc.h"
>  #include "sysemu/device_tree.h"
>  #include "sysemu/kvm.h"
> +#include "sysemu/hostmem.h"
>  
>  #include "hw/vfio/vfio.h"
>  
> @@ -1088,7 +1090,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>      int fdt_start_offset = 0, fdt_size;
>  
>      if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
> -        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
> +        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
>  
>          spapr_tce_set_need_vfio(tcet, true);

Now that Alex took your notifier on/off patches, can you remove this
chunk?  If it's still necessary, don't you need to loop over all the
possible liobns, rather than just acting on liobn[0]?
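
For illustration, the "loop over all the possible liobns" variant could look roughly like the standalone sketch below. It is not QEMU code: struct tce_table and find_table() are hypothetical stand-ins for sPAPRTCETable and spapr_tce_find_by_liobn(), and MAX_WINDOWS mirrors SPAPR_PCI_DMA_MAX_WINDOWS from this patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WINDOWS 2 /* mirrors SPAPR_PCI_DMA_MAX_WINDOWS in the patch */

/* Hypothetical stand-in for sPAPRTCETable. */
struct tce_table {
    uint32_t liobn;
    bool need_vfio;
};

/* Hypothetical stand-in for spapr_tce_find_by_liobn(). */
static struct tce_table *find_table(struct tce_table *tables, size_t count,
                                    uint32_t liobn)
{
    for (size_t i = 0; i < count; i++) {
        if (tables[i].liobn == liobn) {
            return &tables[i];
        }
    }
    return NULL;
}

/*
 * Set the need_vfio flag on the table behind every LIOBN of a PHB,
 * not only the first one. Returns the number of tables updated.
 */
static unsigned set_need_vfio_all(struct tce_table *tables, size_t count,
                                  const uint32_t liobns[MAX_WINDOWS],
                                  bool need_vfio)
{
    unsigned updated = 0;

    for (int i = 0; i < MAX_WINDOWS; i++) {
        struct tce_table *table = find_table(tables, count, liobns[i]);

        if (table) {
            table->need_vfio = need_vfio;
            updated++;
        }
    }
    return updated;
}
```

LIOBNs without a table are skipped rather than treated as errors, which is an assumption made here for the case where "ddw" is off and only one window exists.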

>      }
> @@ -1310,11 +1312,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>      PCIBus *bus;
>      uint64_t msi_window_size = 4096;
>      sPAPRTCETable *tcet;
> +    const unsigned windows_supported =
> +        sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
>  
>      if (sphb->index != (uint32_t)-1) {
>          hwaddr windows_base;
>  
> -        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn != (uint32_t)-1)
> +        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
> +            || (sphb->dma_liobn[1] != (uint32_t)-1 && windows_supported == 2)
>              || (sphb->mem_win_addr != (hwaddr)-1)
>              || (sphb->io_win_addr != (hwaddr)-1)) {
>              error_setg(errp, "Either \"index\" or other parameters must"
> @@ -1329,7 +1334,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>          }
>  
>          sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
> -        sphb->dma_liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
> +        for (i = 0; i < windows_supported; ++i) {
> +            sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
> +        }
>  
>          windows_base = SPAPR_PCI_WINDOW_BASE
>              + sphb->index * SPAPR_PCI_WINDOW_SPACING;
> @@ -1342,8 +1349,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>          return;
>      }
>  
> -    if (sphb->dma_liobn == (uint32_t)-1) {
> -        error_setg(errp, "LIOBN not specified for PHB");
> +    if ((sphb->dma_liobn[0] == (uint32_t)-1) ||
> +        ((sphb->dma_liobn[1] == (uint32_t)-1) && (windows_supported > 1))) {
> +        error_setg(errp, "LIOBN(s) not specified for PHB");
>          return;
>      }
>  
> @@ -1461,16 +1469,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>          }
>      }
>  
> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
> -    if (!tcet) {
> -        error_setg(errp, "Unable to create TCE table for %s",
> -                   sphb->dtbusname);
> -        return;
> +    /* DMA setup */
> +    for (i = 0; i < windows_supported; ++i) {
> +        tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn[i]);
> +        if (!tcet) {
> +            error_setg(errp, "Creating window#%d failed for %s",
> +                       i, sphb->dtbusname);
> +            return;
> +        }
> +        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
> +                                            spapr_tce_get_iommu(tcet), 0);
>      }
>  
> -    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
> -                                        spapr_tce_get_iommu(tcet), 0);
> -
>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>  }
>  
> @@ -1487,13 +1497,19 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
>  
>  void spapr_phb_dma_reset(sPAPRPHBState *sphb)
>  {
> -    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
> +    int i;
> +    sPAPRTCETable *tcet;
>  
> -    if (tcet && tcet->nb_table) {
> -        spapr_tce_table_disable(tcet);
> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> +        tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[i]);
> +
> +        if (tcet && tcet->nb_table) {
> +            spapr_tce_table_disable(tcet);
> +        }
>      }
>  
>      /* Register default 32bit DMA window */
> +    tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[0]);
>      spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
>                             sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT);
>  }
> @@ -1515,7 +1531,8 @@ static void spapr_phb_reset(DeviceState *qdev)
>  static Property spapr_phb_properties[] = {
>      DEFINE_PROP_UINT32("index", sPAPRPHBState, index, -1),
>      DEFINE_PROP_UINT64("buid", sPAPRPHBState, buid, -1),
> -    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn, -1),
> +    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn[0], -1),
> +    DEFINE_PROP_UINT32("liobn64", sPAPRPHBState, dma_liobn[1], -1),
>      DEFINE_PROP_UINT64("mem_win_addr", sPAPRPHBState, mem_win_addr, -1),
>      DEFINE_PROP_UINT64("mem_win_size", sPAPRPHBState, mem_win_size,
>                         SPAPR_PCI_MMIO_WIN_SIZE),
> @@ -1527,6 +1544,11 @@ static Property spapr_phb_properties[] = {
>      /* Default DMA window is 0..1GB */
>      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
>      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_win_addr,
> +                       0x800000000000000ULL),
> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
> +    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
> +                       (1ULL << 12) | (1ULL << 16)),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> @@ -1603,7 +1625,7 @@ static const VMStateDescription vmstate_spapr_pci = {
>      .post_load = spapr_pci_post_load,
>      .fields = (VMStateField[]) {
>          VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),
> -        VMSTATE_UINT32_EQUAL(dma_liobn, sPAPRPHBState),
> +        VMSTATE_UNUSED(4), /* dma_liobn */

It's not obvious to me why this change is necessary.

>          VMSTATE_UINT64_EQUAL(mem_win_addr, sPAPRPHBState),
>          VMSTATE_UINT64_EQUAL(mem_win_size, sPAPRPHBState),
>          VMSTATE_UINT64_EQUAL(io_win_addr, sPAPRPHBState),
> @@ -1779,6 +1801,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      uint32_t interrupt_map_mask[] = {
>          cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
>      uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
> +    uint32_t ddw_applicable[] = {
> +        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
> +        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
> +        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
> +    };
> +    uint32_t ddw_extensions[] = {
> +        cpu_to_be32(1),
> +        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
> +    };
>      sPAPRTCETable *tcet;
>      PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
>      sPAPRFDT s_fdt;
> @@ -1803,6 +1834,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
>  
> +    /* Dynamic DMA window */
> +    if (phb->ddw_enabled) {
> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
> +                         sizeof(ddw_applicable)));
> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
> +                         &ddw_extensions, sizeof(ddw_extensions)));
> +    }
> +
>      /* Build the interrupt-map, this must matches what is done
>       * in pci_spapr_map_irq
>       */
> @@ -1826,7 +1865,7 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
>                       sizeof(interrupt_map)));
>  
> -    tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
> +    tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
>      if (!tcet) {
>          return -1;
>      }
> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
> new file mode 100644
> index 0000000..177dcff
> --- /dev/null
> +++ b/hw/ppc/spapr_rtas_ddw.c
> @@ -0,0 +1,295 @@
> +/*
> + * QEMU sPAPR Dynamic DMA windows support
> + *
> + * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License,
> + *  or (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "cpu.h"
> +#include "qemu/error-report.h"
> +#include "hw/ppc/spapr.h"
> +#include "hw/pci-host/spapr.h"
> +#include "trace.h"
> +
> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
> +{
> +    sPAPRTCETable *tcet;
> +
> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> +    if (tcet && tcet->nb_table) {
> +        ++*(unsigned *)opaque;
> +    }
> +    return 0;
> +}
> +
> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
> +{
> +    unsigned ret = 0;
> +
> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
> +
> +    return ret;
> +}
> +
> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
> +{
> +    sPAPRTCETable *tcet;
> +
> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> +    if (tcet && !tcet->nb_table) {
> +        *(uint32_t *)opaque = tcet->liobn;
> +        return 1;
> +    }
> +    return 0;
> +}
> +
> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
> +{
> +    uint32_t liobn = 0;
> +
> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
> +
> +    return liobn;
> +}
> +
> +static uint32_t spapr_page_mask_to_query_mask(uint64_t page_mask)
> +{
> +    int i;
> +    uint32_t mask = 0;
> +    const struct { int shift; uint32_t mask; } masks[] = {
> +        { 12, RTAS_DDW_PGSIZE_4K },
> +        { 16, RTAS_DDW_PGSIZE_64K },
> +        { 24, RTAS_DDW_PGSIZE_16M },
> +        { 25, RTAS_DDW_PGSIZE_32M },
> +        { 26, RTAS_DDW_PGSIZE_64M },
> +        { 27, RTAS_DDW_PGSIZE_128M },
> +        { 28, RTAS_DDW_PGSIZE_256M },
> +        { 34, RTAS_DDW_PGSIZE_16G },
> +    };
> +
> +    for (i = 0; i < ARRAY_SIZE(masks); ++i) {
> +        if (page_mask & (1ULL << masks[i].shift)) {
> +            mask |= masks[i].mask;
> +        }
> +    }
> +
> +    return mask;
> +}
> +
> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
> +                                         sPAPRMachineState *spapr,
> +                                         uint32_t token, uint32_t nargs,
> +                                         target_ulong args,
> +                                         uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    uint64_t buid, max_window_size;
> +    uint32_t avail, addr, pgmask = 0;
> +    MachineState *machine = MACHINE(spapr);
> +
> +    if ((nargs != 3) || (nret != 5)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    /* Translate page mask to LoPAPR format */
> +    pgmask = spapr_page_mask_to_query_mask(sphb->page_size_mask);
> +
> +    /*
> +     * This is "Largest contiguous block of TCEs allocated specifically
> +     * for (that is, are reserved for) this PE".
> +     * Return the maximum number as maximum supported RAM size was in 4K pages.
> +     */
> +    if (machine->ram_size == machine->maxram_size) {
> +        max_window_size = machine->ram_size;
> +    } else {
> +        MemoryHotplugState *hpms = &spapr->hotplug_memory;
> +
> +        max_window_size = hpms->base + memory_region_size(&hpms->mr);
> +    }
> +
> +    avail = SPAPR_PCI_DMA_MAX_WINDOWS - spapr_phb_get_active_win_num(sphb);
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    rtas_st(rets, 1, avail);
> +    rtas_st(rets, 2, max_window_size >> SPAPR_TCE_PAGE_SHIFT);
> +    rtas_st(rets, 3, pgmask);
> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
> +
> +    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
> +                                          sPAPRMachineState *spapr,
> +                                          uint32_t token, uint32_t nargs,
> +                                          target_ulong args,
> +                                          uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    sPAPRTCETable *tcet = NULL;
> +    uint32_t addr, page_shift, window_shift, liobn;
> +    uint64_t buid, win_addr;
> +    int windows;
> +
> +    if ((nargs != 5) || (nret != 4)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    page_shift = rtas_ld(args, 3);
> +    window_shift = rtas_ld(args, 4);
> +    liobn = spapr_phb_get_free_liobn(sphb);
> +    windows = spapr_phb_get_active_win_num(sphb);
> +
> +    if (!(sphb->page_size_mask & (1ULL << page_shift)) ||
> +        (window_shift < page_shift)) {
> +        goto param_error_exit;
> +    }
> +
> +    if (!liobn || !sphb->ddw_enabled || windows == SPAPR_PCI_DMA_MAX_WINDOWS) {
> +        goto hw_error_exit;
> +    }
> +
> +    tcet = spapr_tce_find_by_liobn(liobn);
> +    if (!tcet) {
> +        goto hw_error_exit;
> +    }
> +
> +    win_addr = (windows == 0) ? sphb->dma_win_addr : sphb->dma64_win_addr;

If the guest deletes the default 32-bit window, then requests a really
big 64-bit DMA window, will that work ok with the big window at 0
instead of the usual 64-bit window address?
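
To make the scenario concrete, here is a minimal standalone sketch of the address selection in the quoted line. pick_window_addr() is a hypothetical helper that only mirrors the ternary above; the addresses in the usage note are the property defaults from this patch.

```c
#include <stdint.h>

/*
 * Mirrors the quoted selection: when no window is active, the new
 * window goes to the 32-bit window address; otherwise it goes to the
 * 64-bit window address.
 */
static uint64_t pick_window_addr(unsigned active_windows,
                                 uint64_t dma_win_addr,
                                 uint64_t dma64_win_addr)
{
    return (active_windows == 0) ? dma_win_addr : dma64_win_addr;
}
```

With dma_win_addr = 0 and dma64_win_addr = 1ULL << 59, a create call made after the default window was removed (active_windows == 0) gets bus address 0 regardless of the requested window size, while a create call made with the default window still in place gets 1ULL << 59.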

> +    spapr_tce_table_enable(tcet, page_shift, win_addr,
> +                           1ULL << (window_shift - page_shift));
> +    if (!tcet->nb_table) {
> +        goto hw_error_exit;
> +    }
> +
> +    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
> +                                 1ULL << window_shift, tcet->bus_offset, liobn);
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    rtas_st(rets, 1, liobn);
> +    rtas_st(rets, 2, tcet->bus_offset >> 32);
> +    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
> +
> +    return;
> +
> +hw_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
> +                                          sPAPRMachineState *spapr,
> +                                          uint32_t token, uint32_t nargs,
> +                                          target_ulong args,
> +                                          uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    sPAPRTCETable *tcet;
> +    uint32_t liobn;
> +
> +    if ((nargs != 1) || (nret != 1)) {
> +        goto param_error_exit;
> +    }
> +
> +    liobn = rtas_ld(args, 0);
> +    tcet = spapr_tce_find_by_liobn(liobn);
> +    if (!tcet) {
> +        goto param_error_exit;
> +    }
> +
> +    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
> +    if (!sphb || !sphb->ddw_enabled || !tcet->nb_table) {
> +        goto param_error_exit;
> +    }
> +
> +    spapr_tce_table_disable(tcet);
> +    trace_spapr_iommu_ddw_remove(liobn);
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
> +                                         sPAPRMachineState *spapr,
> +                                         uint32_t token, uint32_t nargs,
> +                                         target_ulong args,
> +                                         uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    uint64_t buid;
> +    uint32_t addr;
> +
> +    if ((nargs != 3) || (nret != 1)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    spapr_phb_dma_reset(sphb);
> +    trace_spapr_iommu_ddw_reset(buid, addr);
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void spapr_rtas_ddw_init(void)
> +{
> +    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
> +                        "ibm,query-pe-dma-window",
> +                        rtas_ibm_query_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
> +                        "ibm,create-pe-dma-window",
> +                        rtas_ibm_create_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
> +                        "ibm,remove-pe-dma-window",
> +                        rtas_ibm_remove_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
> +                        "ibm,reset-pe-dma-window",
> +                        rtas_ibm_reset_pe_dma_window);
> +}
> +
> +type_init(spapr_rtas_ddw_init)
> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> index 7848366..92aa610 100644
> --- a/include/hw/pci-host/spapr.h
> +++ b/include/hw/pci-host/spapr.h
> @@ -32,6 +32,8 @@
>  #define SPAPR_PCI_HOST_BRIDGE(obj) \
>      OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
>  
> +#define SPAPR_PCI_DMA_MAX_WINDOWS    2
> +
>  typedef struct sPAPRPHBState sPAPRPHBState;
>  
>  typedef struct spapr_pci_msi {
> @@ -56,7 +58,7 @@ struct sPAPRPHBState {
>      hwaddr mem_win_addr, mem_win_size, io_win_addr, io_win_size;
>      MemoryRegion memwindow, iowindow, msiwindow;
>  
> -    uint32_t dma_liobn;
> +    uint32_t dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS];
>      hwaddr dma_win_addr, dma_win_size;
>      AddressSpace iommu_as;
>      MemoryRegion iommu_root;
> @@ -71,6 +73,10 @@ struct sPAPRPHBState {
>      spapr_pci_msi_mig *msi_devs;
>  
>      QLIST_ENTRY(sPAPRPHBState) list;
> +
> +    bool ddw_enabled;
> +    uint64_t page_size_mask;
> +    uint64_t dma64_win_addr;
>  };
>  
>  #define SPAPR_PCI_MAX_INDEX          255
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index e1f8274..36d1748 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -416,6 +416,16 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>  #define RTAS_OUT_NOT_AUTHORIZED                 -9002
>  #define RTAS_OUT_SYSPARM_PARAM_ERROR            -9999
>  
> +/* DDW pagesize mask values from ibm,query-pe-dma-window */
> +#define RTAS_DDW_PGSIZE_4K       0x01
> +#define RTAS_DDW_PGSIZE_64K      0x02
> +#define RTAS_DDW_PGSIZE_16M      0x04
> +#define RTAS_DDW_PGSIZE_32M      0x08
> +#define RTAS_DDW_PGSIZE_64M      0x10
> +#define RTAS_DDW_PGSIZE_128M     0x20
> +#define RTAS_DDW_PGSIZE_256M     0x40
> +#define RTAS_DDW_PGSIZE_16G      0x80
> +
>  /* RTAS tokens */
>  #define RTAS_TOKEN_BASE      0x2000
>  
> @@ -457,8 +467,12 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>  #define RTAS_IBM_SET_SLOT_RESET                 (RTAS_TOKEN_BASE + 0x23)
>  #define RTAS_IBM_CONFIGURE_PE                   (RTAS_TOKEN_BASE + 0x24)
>  #define RTAS_IBM_SLOT_ERROR_DETAIL              (RTAS_TOKEN_BASE + 0x25)
> +#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x26)
> +#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x27)
> +#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x28)
> +#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x29)
>  
> -#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x26)
> +#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x2A)
>  
>  /* RTAS ibm,get-system-parameter token values */
>  #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
> diff --git a/trace-events b/trace-events
> index 7e94d92..5b52634 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1435,6 +1435,10 @@ spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, un
>  spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
>  spapr_iommu_pre_save(uint64_t liobn, uint32_t nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
>  spapr_iommu_post_load(uint64_t liobn, uint32_t pre_nb, uint32_t post_nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" => %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
> +spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32
> +spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, uint64_t pg_size, uint64_t req_size, uint64_t start, uint32_t liobn) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%"PRIx64", requested=0x%"PRIx64", start addr=%"PRIx64", liobn=%"PRIx32
> +spapr_iommu_ddw_remove(uint32_t liobn) "liobn=%"PRIx32
> +spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr) "buid=%"PRIx64" addr=%"PRIx32
>  
>  # hw/ppc/ppc.c
>  ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH qemu v18 5/5] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-06-22  2:35   ` David Gibson
@ 2016-06-22  3:23     ` Alexey Kardashevskiy
  2016-06-22  7:01       ` David Gibson
  2016-06-22  9:44       ` Thomas Huth
  0 siblings, 2 replies; 23+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-22  3:23 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-devel, qemu-ppc, Alex Williamson

[-- Attachment #1: Type: text/plain, Size: 29147 bytes --]

On 22/06/16 12:35, David Gibson wrote:
> On Tue, Jun 21, 2016 at 11:14:05AM +1000, Alexey Kardashevskiy wrote:
>> This adds support for the Dynamic DMA Windows (DDW) option defined by
>> the SPAPR specification, which allows having additional DMA window(s).
>>
>> The "ddw" property is enabled by default on a PHB but for compatibility
>> the pseries-2.6 machine and older disable it.
>> This also creates a single DMA window for the older machines to
>> maintain backward migration.
>>
>> This implements DDW for PHB with emulated and VFIO devices. The host
>> kernel support is required. The advertised IOMMU page sizes are 4K and
>> 64K; 16M pages are supported but not advertised by default, in order to
>> enable them, the user has to specify "pgsz" property for PHB and
>> enable huge pages for RAM.
>>
>> The existing Linux guests try creating one additional huge DMA window
>> with 64K or 16MB pages and map the entire guest RAM to it. If this succeeds,
>> the guest switches to dma_direct_ops and never calls TCE hypercalls
>> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
>> and not waste time on map/unmap later. This adds a "dma64_win_addr"
>> property which is a bus address for the 64bit window and by default
>> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
>> uses and this allows having emulated and VFIO devices on the same bus.
>>
>> This adds 4 RTAS handlers:
>> * ibm,query-pe-dma-window
>> * ibm,create-pe-dma-window
>> * ibm,remove-pe-dma-window
>> * ibm,reset-pe-dma-window
>> These are registered from type_init() callback.
>>
>> These RTAS handlers are implemented in a separate file to avoid polluting
>> spapr_iommu.c with PCI.
>>
>> This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> 
> A few queries below.  Not sure if they'll require code changes or just
> explanation.
> 
>> ---
>> Changes:
>> v18:
>> * fixed bug when ddw-create rtas call was always creating window at 1<<59
>> offset
>> * update minimum supported machine version
>> * s/dma64_window_addr/dma64_win_addr/ to match dma_win_addr
>>
>> v17:
>> * fixed: "query" did return non-page-shifted value when memory hotplug is enabled
>>
>> v16:
>> * s/dma_liobn/dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS]/
>> * s/SPAPR_PCI_LIOBN()/dma_liobn[]/
>>
>> v15:
>> * moved page mask filtering to PHB realize(), use "-mem-path" to know
>> if there are huge pages
>> * fixed error reporting in RTAS handlers
>> * max window size accounts now hotpluggable memory boundaries
>> ---
>>  hw/ppc/Makefile.objs        |   1 +
>>  hw/ppc/spapr.c              |   7 +-
>>  hw/ppc/spapr_pci.c          |  77 +++++++++---
>>  hw/ppc/spapr_rtas_ddw.c     | 295 ++++++++++++++++++++++++++++++++++++++++++++
>>  include/hw/pci-host/spapr.h |   8 +-
>>  include/hw/ppc/spapr.h      |  16 ++-
>>  trace-events                |   4 +
>>  7 files changed, 386 insertions(+), 22 deletions(-)
>>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
>>
>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>> index 5cc6608..91a3420 100644
>> --- a/hw/ppc/Makefile.objs
>> +++ b/hw/ppc/Makefile.objs
>> @@ -8,6 +8,7 @@ obj-$(CONFIG_PSERIES) += spapr_cpu_core.o
>>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>>  obj-y += spapr_pci_vfio.o
>>  endif
>> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>>  # PowerPC 4xx boards
>>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>>  obj-y += ppc4xx_pci.o
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index 778fa25..f7cff27 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -2485,7 +2485,12 @@ DEFINE_SPAPR_MACHINE(2_7, "2.7", true);
>>   * pseries-2.6
>>   */
>>  #define SPAPR_COMPAT_2_6 \
>> -    HW_COMPAT_2_6
>> +    HW_COMPAT_2_6 \
>> +    { \
>> +        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
>> +        .property = "ddw",\
>> +        .value    = stringify(off),\
>> +    },
>>  
>>  static void spapr_machine_2_6_instance_options(MachineState *machine)
>>  {
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index 9f28fb3..0cb51dd 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -35,6 +35,7 @@
>>  #include "hw/ppc/spapr.h"
>>  #include "hw/pci-host/spapr.h"
>>  #include "exec/address-spaces.h"
>> +#include "exec/ram_addr.h"
>>  #include <libfdt.h>
>>  #include "trace.h"
>>  #include "qemu/error-report.h"
>> @@ -45,6 +46,7 @@
>>  #include "hw/ppc/spapr_drc.h"
>>  #include "sysemu/device_tree.h"
>>  #include "sysemu/kvm.h"
>> +#include "sysemu/hostmem.h"
>>  
>>  #include "hw/vfio/vfio.h"
>>  
>> @@ -1088,7 +1090,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>>      int fdt_start_offset = 0, fdt_size;
>>  
>>      if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
>> -        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
>> +        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
>>  
>>          spapr_tce_set_need_vfio(tcet, true);
> 
> Now that Alex took your notifier on/off patches, can you remove this
> chunk? 

It will stop compiling as dma_liobn is an array now.


> If it's still necessary, don't you need to loop over all the
> possible liobns, rather than just acting on liobn[0]?

Ah, right. Forgot about it. That was the reason why I wanted those notifier
callbacks in this series, lost it in respins. I do need a loop here which
I'll have to remove soon though.
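
A rough sketch of the kind of loop meant here, written against simplified
stand-in types rather than the real sPAPRPHBState/sPAPRTCETable. Everything
below except SPAPR_PCI_DMA_MAX_WINDOWS is a hypothetical name used only for
illustration; in the patch the lookup would be spapr_tce_find_by_liobn() and
the flag would be set via spapr_tce_set_need_vfio():

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SPAPR_PCI_DMA_MAX_WINDOWS 2

/* Hypothetical stand-in for sPAPRTCETable: only what the sketch needs. */
typedef struct {
    uint32_t liobn;
    bool need_vfio;
} tce_table;

/* Hypothetical stand-in for sPAPRPHBState. */
typedef struct {
    uint32_t dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS];
} phb_state;

/* Hypothetical lookup; returns NULL when no table has this LIOBN. */
static tce_table *find_by_liobn(tce_table *tables, size_t n, uint32_t liobn)
{
    for (size_t i = 0; i < n; i++) {
        if (tables[i].liobn == liobn) {
            return &tables[i];
        }
    }
    return NULL;
}

/* Flag every window of the PHB, not only window 0.
 * Returns the number of tables that were found and updated. */
static unsigned set_need_vfio_all(const phb_state *phb,
                                  tce_table *tables, size_t n)
{
    unsigned updated = 0;

    for (int i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; i++) {
        tce_table *t = find_by_liobn(tables, n, phb->dma_liobn[i]);

        if (t) {            /* a window may have no table (e.g. ddw=off) */
            t->need_vfio = true;
            updated++;
        }
    }
    return updated;
}
```

The NULL check matters because with "ddw" disabled only one table is created,
so the second LIOBN may not resolve to anything.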


> 
>>      }
>> @@ -1310,11 +1312,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>      PCIBus *bus;
>>      uint64_t msi_window_size = 4096;
>>      sPAPRTCETable *tcet;
>> +    const unsigned windows_supported =
>> +        sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
>>  
>>      if (sphb->index != (uint32_t)-1) {
>>          hwaddr windows_base;
>>  
>> -        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn != (uint32_t)-1)
>> +        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
>> +            || (sphb->dma_liobn[1] != (uint32_t)-1 && windows_supported == 2)
>>              || (sphb->mem_win_addr != (hwaddr)-1)
>>              || (sphb->io_win_addr != (hwaddr)-1)) {
>>              error_setg(errp, "Either \"index\" or other parameters must"
>> @@ -1329,7 +1334,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>          }
>>  
>>          sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
>> -        sphb->dma_liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
>> +        for (i = 0; i < windows_supported; ++i) {
>> +            sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
>> +        }
>>  
>>          windows_base = SPAPR_PCI_WINDOW_BASE
>>              + sphb->index * SPAPR_PCI_WINDOW_SPACING;
>> @@ -1342,8 +1349,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>          return;
>>      }
>>  
>> -    if (sphb->dma_liobn == (uint32_t)-1) {
>> -        error_setg(errp, "LIOBN not specified for PHB");
>> +    if ((sphb->dma_liobn[0] == (uint32_t)-1) ||
>> +        ((sphb->dma_liobn[1] == (uint32_t)-1) && (windows_supported > 1))) {
>> +        error_setg(errp, "LIOBN(s) not specified for PHB");
>>          return;
>>      }
>>  
>> @@ -1461,16 +1469,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>          }
>>      }
>>  
>> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
>> -    if (!tcet) {
>> -        error_setg(errp, "Unable to create TCE table for %s",
>> -                   sphb->dtbusname);
>> -        return;
>> +    /* DMA setup */
>> +    for (i = 0; i < windows_supported; ++i) {
>> +        tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn[i]);
>> +        if (!tcet) {
>> +            error_setg(errp, "Creating window#%d failed for %s",
>> +                       i, sphb->dtbusname);
>> +            return;
>> +        }
>> +        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
>> +                                            spapr_tce_get_iommu(tcet), 0);
>>      }
>>  
>> -    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
>> -                                        spapr_tce_get_iommu(tcet), 0);
>> -
>>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>>  }
>>  
>> @@ -1487,13 +1497,19 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
>>  
>>  void spapr_phb_dma_reset(sPAPRPHBState *sphb)
>>  {
>> -    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
>> +    int i;
>> +    sPAPRTCETable *tcet;
>>  
>> -    if (tcet && tcet->nb_table) {
>> -        spapr_tce_table_disable(tcet);
>> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>> +        tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[i]);
>> +
>> +        if (tcet && tcet->nb_table) {
>> +            spapr_tce_table_disable(tcet);
>> +        }
>>      }
>>  
>>      /* Register default 32bit DMA window */
>> +    tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[0]);
>>      spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
>>                             sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT);
>>  }
>> @@ -1515,7 +1531,8 @@ static void spapr_phb_reset(DeviceState *qdev)
>>  static Property spapr_phb_properties[] = {
>>      DEFINE_PROP_UINT32("index", sPAPRPHBState, index, -1),
>>      DEFINE_PROP_UINT64("buid", sPAPRPHBState, buid, -1),
>> -    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn, -1),
>> +    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn[0], -1),
>> +    DEFINE_PROP_UINT32("liobn64", sPAPRPHBState, dma_liobn[1], -1),
>>      DEFINE_PROP_UINT64("mem_win_addr", sPAPRPHBState, mem_win_addr, -1),
>>      DEFINE_PROP_UINT64("mem_win_size", sPAPRPHBState, mem_win_size,
>>                         SPAPR_PCI_MMIO_WIN_SIZE),
>> @@ -1527,6 +1544,11 @@ static Property spapr_phb_properties[] = {
>>      /* Default DMA window is 0..1GB */
>>      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
>>      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
>> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_win_addr,
>> +                       0x800000000000000ULL),
>> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
>> +    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
>> +                       (1ULL << 12) | (1ULL << 16)),
>>      DEFINE_PROP_END_OF_LIST(),
>>  };
>>  
>> @@ -1603,7 +1625,7 @@ static const VMStateDescription vmstate_spapr_pci = {
>>      .post_load = spapr_pci_post_load,
>>      .fields = (VMStateField[]) {
>>          VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),
>> -        VMSTATE_UINT32_EQUAL(dma_liobn, sPAPRPHBState),
>> +        VMSTATE_UNUSED(4), /* dma_liobn */
> 
> It's not obvious to me why this change is necessary.

It is not. But I was touching liobn and this is a proper cleanup which
needs to be done anyway as _EQUAL() macros are sort of deprecated and
rather pointless. Since I am adding a new 64bit LIOBN in this patch, should
I add it in VMSTATE as 32bit one and bump the vmstate version? Or not add
it (leaving some inconsistency)?



> 
>>          VMSTATE_UINT64_EQUAL(mem_win_addr, sPAPRPHBState),
>>          VMSTATE_UINT64_EQUAL(mem_win_size, sPAPRPHBState),
>>          VMSTATE_UINT64_EQUAL(io_win_addr, sPAPRPHBState),
>> @@ -1779,6 +1801,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>      uint32_t interrupt_map_mask[] = {
>>          cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
>>      uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
>> +    uint32_t ddw_applicable[] = {
>> +        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
>> +        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
>> +        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
>> +    };
>> +    uint32_t ddw_extensions[] = {
>> +        cpu_to_be32(1),
>> +        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
>> +    };
>>      sPAPRTCETable *tcet;
>>      PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
>>      sPAPRFDT s_fdt;
>> @@ -1803,6 +1834,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
>>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
>>  
>> +    /* Dynamic DMA window */
>> +    if (phb->ddw_enabled) {
>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
>> +                         sizeof(ddw_applicable)));
>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
>> +                         &ddw_extensions, sizeof(ddw_extensions)));
>> +    }
>> +
>>      /* Build the interrupt-map, this must matches what is done
>>       * in pci_spapr_map_irq
>>       */
>> @@ -1826,7 +1865,7 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>      _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
>>                       sizeof(interrupt_map)));
>>  
>> -    tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
>> +    tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
>>      if (!tcet) {
>>          return -1;
>>      }
>> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
>> new file mode 100644
>> index 0000000..177dcff
>> --- /dev/null
>> +++ b/hw/ppc/spapr_rtas_ddw.c
>> @@ -0,0 +1,295 @@
>> +/*
>> + * QEMU sPAPR Dynamic DMA windows support
>> + *
>> + * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
>> + *
>> + *  This program is free software; you can redistribute it and/or modify
>> + *  it under the terms of the GNU General Public License as published by
>> + *  the Free Software Foundation; either version 2 of the License,
>> + *  or (at your option) any later version.
>> + *
>> + *  This program is distributed in the hope that it will be useful,
>> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + *  GNU General Public License for more details.
>> + *
>> + *  You should have received a copy of the GNU General Public License
>> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "cpu.h"
>> +#include "qemu/error-report.h"
>> +#include "hw/ppc/spapr.h"
>> +#include "hw/pci-host/spapr.h"
>> +#include "trace.h"
>> +
>> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
>> +{
>> +    sPAPRTCETable *tcet;
>> +
>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>> +    if (tcet && tcet->nb_table) {
>> +        ++*(unsigned *)opaque;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
>> +{
>> +    unsigned ret = 0;
>> +
>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
>> +
>> +    return ret;
>> +}
>> +
>> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
>> +{
>> +    sPAPRTCETable *tcet;
>> +
>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>> +    if (tcet && !tcet->nb_table) {
>> +        *(uint32_t *)opaque = tcet->liobn;
>> +        return 1;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
>> +{
>> +    uint32_t liobn = 0;
>> +
>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
>> +
>> +    return liobn;
>> +}
>> +
>> +static uint32_t spapr_page_mask_to_query_mask(uint64_t page_mask)
>> +{
>> +    int i;
>> +    uint32_t mask = 0;
>> +    const struct { int shift; uint32_t mask; } masks[] = {
>> +        { 12, RTAS_DDW_PGSIZE_4K },
>> +        { 16, RTAS_DDW_PGSIZE_64K },
>> +        { 24, RTAS_DDW_PGSIZE_16M },
>> +        { 25, RTAS_DDW_PGSIZE_32M },
>> +        { 26, RTAS_DDW_PGSIZE_64M },
>> +        { 27, RTAS_DDW_PGSIZE_128M },
>> +        { 28, RTAS_DDW_PGSIZE_256M },
>> +        { 34, RTAS_DDW_PGSIZE_16G },
>> +    };
>> +
>> +    for (i = 0; i < ARRAY_SIZE(masks); ++i) {
>> +        if (page_mask & (1ULL << masks[i].shift)) {
>> +            mask |= masks[i].mask;
>> +        }
>> +    }
>> +
>> +    return mask;
>> +}
>> +
>> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
>> +                                         sPAPRMachineState *spapr,
>> +                                         uint32_t token, uint32_t nargs,
>> +                                         target_ulong args,
>> +                                         uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    uint64_t buid, max_window_size;
>> +    uint32_t avail, addr, pgmask = 0;
>> +    MachineState *machine = MACHINE(spapr);
>> +
>> +    if ((nargs != 3) || (nret != 5)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    /* Translate page mask to LoPAPR format */
>> +    pgmask = spapr_page_mask_to_query_mask(sphb->page_size_mask);
>> +
>> +    /*
>> +     * This is "Largest contiguous block of TCEs allocated specifically
>> +     * for (that is, are reserved for) this PE".
>> +     * Return the maximum number as maximum supported RAM size was in 4K pages.
>> +     */
>> +    if (machine->ram_size == machine->maxram_size) {
>> +        max_window_size = machine->ram_size;
>> +    } else {
>> +        MemoryHotplugState *hpms = &spapr->hotplug_memory;
>> +
>> +        max_window_size = hpms->base + memory_region_size(&hpms->mr);
>> +    }
>> +
>> +    avail = SPAPR_PCI_DMA_MAX_WINDOWS - spapr_phb_get_active_win_num(sphb);
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +    rtas_st(rets, 1, avail);
>> +    rtas_st(rets, 2, max_window_size >> SPAPR_TCE_PAGE_SHIFT);
>> +    rtas_st(rets, 3, pgmask);
>> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
>> +
>> +    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
>> +                                          sPAPRMachineState *spapr,
>> +                                          uint32_t token, uint32_t nargs,
>> +                                          target_ulong args,
>> +                                          uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    sPAPRTCETable *tcet = NULL;
>> +    uint32_t addr, page_shift, window_shift, liobn;
>> +    uint64_t buid, win_addr;
>> +    int windows;
>> +
>> +    if ((nargs != 5) || (nret != 4)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    page_shift = rtas_ld(args, 3);
>> +    window_shift = rtas_ld(args, 4);
>> +    liobn = spapr_phb_get_free_liobn(sphb);
>> +    windows = spapr_phb_get_active_win_num(sphb);
>> +
>> +    if (!(sphb->page_size_mask & (1ULL << page_shift)) ||
>> +        (window_shift < page_shift)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    if (!liobn || !sphb->ddw_enabled || windows == SPAPR_PCI_DMA_MAX_WINDOWS) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    tcet = spapr_tce_find_by_liobn(liobn);
>> +    if (!tcet) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    win_addr = (windows == 0) ? sphb->dma_win_addr : sphb->dma64_win_addr;
> 
> If the guest deletes the default 32-bit window, then requests a really
> big 64-bit DMA window, will that work ok with the big window at 0
> instead of the usual 64-bit window address?


There is no valid guest that tries that, as they all keep the 32bit window.

There was a relatively short period of time in the v3.0-ish era (sles11 did
have it and sles11sp3 did not, if I remember correctly) when the guest would
remove all windows and create one huge window, but for some reason it
expected the window not to start from zero (perhaps a pHyp implementation
detail), so it would fail. I did an experiment and removed that particular
check and it worked just fine.

Today guests always keep a 32bit window as the platform cannot tell if all
the drivers on a specific PHB will request 64bit DMA.
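
To make the scenario above concrete, here is a small standalone sketch of the
arithmetic in the create handler: which bus address a new window gets, and how
many TCEs it holds. The struct and function names are hypothetical stand-ins;
only the two expressions are taken from the patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for the PHB fields the create handler reads. */
typedef struct {
    uint64_t dma_win_addr;    /* bus address of the default 32bit window */
    uint64_t dma64_win_addr;  /* bus address used for the additional window */
} phb_windows;

/* Same selection as in the patch: with no active windows the new window
 * lands at dma_win_addr, otherwise at dma64_win_addr. */
static uint64_t pick_window_addr(const phb_windows *phb,
                                 unsigned active_windows)
{
    return active_windows == 0 ? phb->dma_win_addr : phb->dma64_win_addr;
}

/* Number of TCEs in a window of 2^window_shift bytes using pages of
 * 2^page_shift bytes. The caller must guarantee
 * window_shift >= page_shift, as the handler checks before this point. */
static uint64_t window_tce_count(uint32_t page_shift, uint32_t window_shift)
{
    return 1ULL << (window_shift - page_shift);
}
```

So a guest that removes the default window and then creates a big one would
get it at dma_win_addr (0 by default), while a guest that keeps the default
window gets the big one at dma64_win_addr.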




>> +    spapr_tce_table_enable(tcet, page_shift, win_addr,
>> +                           1ULL << (window_shift - page_shift));
>> +    if (!tcet->nb_table) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
>> +                                 1ULL << window_shift, tcet->bus_offset, liobn);
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +    rtas_st(rets, 1, liobn);
>> +    rtas_st(rets, 2, tcet->bus_offset >> 32);
>> +    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
>> +
>> +    return;
>> +
>> +hw_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
>> +                                          sPAPRMachineState *spapr,
>> +                                          uint32_t token, uint32_t nargs,
>> +                                          target_ulong args,
>> +                                          uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    sPAPRTCETable *tcet;
>> +    uint32_t liobn;
>> +
>> +    if ((nargs != 1) || (nret != 1)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    liobn = rtas_ld(args, 0);
>> +    tcet = spapr_tce_find_by_liobn(liobn);
>> +    if (!tcet) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
>> +    if (!sphb || !sphb->ddw_enabled || !tcet->nb_table) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    spapr_tce_table_disable(tcet);
>> +    trace_spapr_iommu_ddw_remove(liobn);
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
>> +                                         sPAPRMachineState *spapr,
>> +                                         uint32_t token, uint32_t nargs,
>> +                                         target_ulong args,
>> +                                         uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    uint64_t buid;
>> +    uint32_t addr;
>> +
>> +    if ((nargs != 3) || (nret != 1)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    spapr_phb_dma_reset(sphb);
>> +    trace_spapr_iommu_ddw_reset(buid, addr);
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void spapr_rtas_ddw_init(void)
>> +{
>> +    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
>> +                        "ibm,query-pe-dma-window",
>> +                        rtas_ibm_query_pe_dma_window);
>> +    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
>> +                        "ibm,create-pe-dma-window",
>> +                        rtas_ibm_create_pe_dma_window);
>> +    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
>> +                        "ibm,remove-pe-dma-window",
>> +                        rtas_ibm_remove_pe_dma_window);
>> +    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
>> +                        "ibm,reset-pe-dma-window",
>> +                        rtas_ibm_reset_pe_dma_window);
>> +}
>> +
>> +type_init(spapr_rtas_ddw_init)
>> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
>> index 7848366..92aa610 100644
>> --- a/include/hw/pci-host/spapr.h
>> +++ b/include/hw/pci-host/spapr.h
>> @@ -32,6 +32,8 @@
>>  #define SPAPR_PCI_HOST_BRIDGE(obj) \
>>      OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
>>  
>> +#define SPAPR_PCI_DMA_MAX_WINDOWS    2
>> +
>>  typedef struct sPAPRPHBState sPAPRPHBState;
>>  
>>  typedef struct spapr_pci_msi {
>> @@ -56,7 +58,7 @@ struct sPAPRPHBState {
>>      hwaddr mem_win_addr, mem_win_size, io_win_addr, io_win_size;
>>      MemoryRegion memwindow, iowindow, msiwindow;
>>  
>> -    uint32_t dma_liobn;
>> +    uint32_t dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS];
>>      hwaddr dma_win_addr, dma_win_size;
>>      AddressSpace iommu_as;
>>      MemoryRegion iommu_root;
>> @@ -71,6 +73,10 @@ struct sPAPRPHBState {
>>      spapr_pci_msi_mig *msi_devs;
>>  
>>      QLIST_ENTRY(sPAPRPHBState) list;
>> +
>> +    bool ddw_enabled;
>> +    uint64_t page_size_mask;
>> +    uint64_t dma64_win_addr;
>>  };
>>  
>>  #define SPAPR_PCI_MAX_INDEX          255
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index e1f8274..36d1748 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -416,6 +416,16 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>>  #define RTAS_OUT_NOT_AUTHORIZED                 -9002
>>  #define RTAS_OUT_SYSPARM_PARAM_ERROR            -9999
>>  
>> +/* DDW pagesize mask values from ibm,query-pe-dma-window */
>> +#define RTAS_DDW_PGSIZE_4K       0x01
>> +#define RTAS_DDW_PGSIZE_64K      0x02
>> +#define RTAS_DDW_PGSIZE_16M      0x04
>> +#define RTAS_DDW_PGSIZE_32M      0x08
>> +#define RTAS_DDW_PGSIZE_64M      0x10
>> +#define RTAS_DDW_PGSIZE_128M     0x20
>> +#define RTAS_DDW_PGSIZE_256M     0x40
>> +#define RTAS_DDW_PGSIZE_16G      0x80
>> +
>>  /* RTAS tokens */
>>  #define RTAS_TOKEN_BASE      0x2000
>>  
>> @@ -457,8 +467,12 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>>  #define RTAS_IBM_SET_SLOT_RESET                 (RTAS_TOKEN_BASE + 0x23)
>>  #define RTAS_IBM_CONFIGURE_PE                   (RTAS_TOKEN_BASE + 0x24)
>>  #define RTAS_IBM_SLOT_ERROR_DETAIL              (RTAS_TOKEN_BASE + 0x25)
>> +#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x26)
>> +#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x27)
>> +#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x28)
>> +#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x29)
>>  
>> -#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x26)
>> +#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x2A)
>>  
>>  /* RTAS ibm,get-system-parameter token values */
>>  #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
>> diff --git a/trace-events b/trace-events
>> index 7e94d92..5b52634 100644
>> --- a/trace-events
>> +++ b/trace-events
>> @@ -1435,6 +1435,10 @@ spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, un
>>  spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
>>  spapr_iommu_pre_save(uint64_t liobn, uint32_t nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
>>  spapr_iommu_post_load(uint64_t liobn, uint32_t pre_nb, uint32_t post_nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" => %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
>> +spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32
>> +spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, uint64_t pg_size, uint64_t req_size, uint64_t start, uint32_t liobn) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%"PRIx64", requested=0x%"PRIx64", start addr=%"PRIx64", liobn=%"PRIx32
>> +spapr_iommu_ddw_remove(uint32_t liobn) "liobn=%"PRIx32
>> +spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr) "buid=%"PRIx64" addr=%"PRIx32
>>  
>>  # hw/ppc/ppc.c
>>  ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
> 


-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v18 5/5] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-06-22  3:23     ` Alexey Kardashevskiy
@ 2016-06-22  7:01       ` David Gibson
  2016-06-22  8:26         ` Alexey Kardashevskiy
  2016-06-22  9:44       ` Thomas Huth
  1 sibling, 1 reply; 23+ messages in thread
From: David Gibson @ 2016-06-22  7:01 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-devel, qemu-ppc, Alex Williamson

[-- Attachment #1: Type: text/plain, Size: 26492 bytes --]

On Wed, Jun 22, 2016 at 01:23:51PM +1000, Alexey Kardashevskiy wrote:
> On 22/06/16 12:35, David Gibson wrote:
> > On Tue, Jun 21, 2016 at 11:14:05AM +1000, Alexey Kardashevskiy wrote:
> >> This adds support for the Dynamic DMA Windows (DDW) option defined by
> >> the SPAPR specification, which allows having additional DMA window(s).
> >>
> >> The "ddw" property is enabled by default on a PHB but for compatibility
> >> the pseries-2.6 machine and older disable it.
> >> This also creates a single DMA window for the older machines to
> >> maintain backward migration.
> >>
> >> This implements DDW for PHB with emulated and VFIO devices. The host
> >> kernel support is required. The advertised IOMMU page sizes are 4K and
> >> 64K; 16M pages are supported but not advertised by default, in order to
> >> enable them, the user has to specify "pgsz" property for PHB and
> >> enable huge pages for RAM.
> >>
> >> The existing Linux guests try creating one additional huge DMA window
> >> with 64K or 16MB pages and map the entire guest RAM to it. If this succeeds,
> >> the guest switches to dma_direct_ops and never calls TCE hypercalls
> >> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
> >> and not waste time on map/unmap later. This adds a "dma64_win_addr"
> >> property which is a bus address for the 64bit window and by default
> >> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
> >> uses and this allows having emulated and VFIO devices on the same bus.
> >>
> >> This adds 4 RTAS handlers:
> >> * ibm,query-pe-dma-window
> >> * ibm,create-pe-dma-window
> >> * ibm,remove-pe-dma-window
> >> * ibm,reset-pe-dma-window
> >> These are registered from type_init() callback.
> >>
> >> These RTAS handlers are implemented in a separate file to avoid polluting
> >> spapr_iommu.c with PCI.
> >>
> >> This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > 
> > A few queries below.  Not sure if they'll require code changes or just
> > explanation.
> > 
> >> ---
> >> Changes:
> >> v18:
> >> * fixed bug when ddw-create rtas call was always creating window at 1<<59
> >> offset
> >> * update minimum supported machine version
> >> * s/dma64_window_addr/dma64_win_addr/ to match dma_win_addr
> >>
> >> v17:
> >> * fixed: "query" did return non-page-shifted value when memory hotplug is enabled
> >>
> >> v16:
> >> * s/dma_liobn/dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS]/
> >> * s/SPAPR_PCI_LIOBN()/dma_liobn[]/
> >>
> >> v15:
> >> * moved page mask filtering to PHB realize(), use "-mem-path" to know
> >> if there are huge pages
> >> * fixed error reporting in RTAS handlers
> >> * max window size now accounts for hotpluggable memory boundaries
> >> ---
> >>  hw/ppc/Makefile.objs        |   1 +
> >>  hw/ppc/spapr.c              |   7 +-
> >>  hw/ppc/spapr_pci.c          |  77 +++++++++---
> >>  hw/ppc/spapr_rtas_ddw.c     | 295 ++++++++++++++++++++++++++++++++++++++++++++
> >>  include/hw/pci-host/spapr.h |   8 +-
> >>  include/hw/ppc/spapr.h      |  16 ++-
> >>  trace-events                |   4 +
> >>  7 files changed, 386 insertions(+), 22 deletions(-)
> >>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
> >>
> >> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> >> index 5cc6608..91a3420 100644
> >> --- a/hw/ppc/Makefile.objs
> >> +++ b/hw/ppc/Makefile.objs
> >> @@ -8,6 +8,7 @@ obj-$(CONFIG_PSERIES) += spapr_cpu_core.o
> >>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
> >>  obj-y += spapr_pci_vfio.o
> >>  endif
> >> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
> >>  # PowerPC 4xx boards
> >>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
> >>  obj-y += ppc4xx_pci.o
> >> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >> index 778fa25..f7cff27 100644
> >> --- a/hw/ppc/spapr.c
> >> +++ b/hw/ppc/spapr.c
> >> @@ -2485,7 +2485,12 @@ DEFINE_SPAPR_MACHINE(2_7, "2.7", true);
> >>   * pseries-2.6
> >>   */
> >>  #define SPAPR_COMPAT_2_6 \
> >> -    HW_COMPAT_2_6
> >> +    HW_COMPAT_2_6 \
> >> +    { \
> >> +        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
> >> +        .property = "ddw",\
> >> +        .value    = stringify(off),\
> >> +    },
> >>  
> >>  static void spapr_machine_2_6_instance_options(MachineState *machine)
> >>  {
> >> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> >> index 9f28fb3..0cb51dd 100644
> >> --- a/hw/ppc/spapr_pci.c
> >> +++ b/hw/ppc/spapr_pci.c
> >> @@ -35,6 +35,7 @@
> >>  #include "hw/ppc/spapr.h"
> >>  #include "hw/pci-host/spapr.h"
> >>  #include "exec/address-spaces.h"
> >> +#include "exec/ram_addr.h"
> >>  #include <libfdt.h>
> >>  #include "trace.h"
> >>  #include "qemu/error-report.h"
> >> @@ -45,6 +46,7 @@
> >>  #include "hw/ppc/spapr_drc.h"
> >>  #include "sysemu/device_tree.h"
> >>  #include "sysemu/kvm.h"
> >> +#include "sysemu/hostmem.h"
> >>  
> >>  #include "hw/vfio/vfio.h"
> >>  
> >> @@ -1088,7 +1090,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
> >>      int fdt_start_offset = 0, fdt_size;
> >>  
> >>      if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
> >> -        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
> >> +        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
> >>  
> >>          spapr_tce_set_need_vfio(tcet, true);
> > 
> > Now that Alex took your notifier on/off patches, can you remove this
> > chunk? 
> 
> It will stop compiling as dma_liobn is an array now.

Sorry, I wasn't clear.  I meant remove this whole if statement, not
just remove this hunk of the patch.

> > If it's still necessary, don't you need to loop over all the
> > possible liobns, rather than just acting on liobn[0]?
> 
> Ah, right. Forgot about it. That was the reason why I wanted those notifier
> callbacks in this series, lost it in respins. I do need a loop here which
> I'll have to remove soon though.

Ok.
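
For reference, the loop over all LIOBNs discussed above might look roughly
like the sketch below. It is only an illustration, not the patch's code: the
Sketch* types, sketch_tables[] and sketch_find_by_liobn() are hypothetical
stand-ins for sPAPRTCETable, sPAPRPHBState and spapr_tce_find_by_liobn(),
and the LIOBN values are made up. Only SPAPR_PCI_DMA_MAX_WINDOWS and the
dma_liobn[] array come from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SPAPR_PCI_DMA_MAX_WINDOWS 2

/* Hypothetical, simplified stand-ins for sPAPRTCETable and sPAPRPHBState. */
typedef struct {
    uint32_t liobn;
    bool need_vfio;
} SketchTCETable;

typedef struct {
    uint32_t dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS];
} SketchPHB;

/* Hypothetical registry; the LIOBN values are invented for the example. */
static SketchTCETable sketch_tables[] = {
    { 0x80000000u, false },
    { 0x80000001u, false },
};

/* Stand-in for spapr_tce_find_by_liobn(): linear search by LIOBN. */
static SketchTCETable *sketch_find_by_liobn(uint32_t liobn)
{
    size_t i;

    for (i = 0; i < sizeof(sketch_tables) / sizeof(sketch_tables[0]); ++i) {
        if (sketch_tables[i].liobn == liobn) {
            return &sketch_tables[i];
        }
    }
    return NULL;
}

/* Flag the table of every window, not just window 0; returns tables found. */
static unsigned sketch_phb_set_need_vfio(SketchPHB *phb, bool need_vfio)
{
    unsigned found = 0;
    size_t i;

    assert(phb != NULL);
    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
        SketchTCETable *tcet = sketch_find_by_liobn(phb->dma_liobn[i]);

        if (tcet) {
            tcet->need_vfio = need_vfio;
            ++found;
        }
    }
    return found;
}
```

The point of the sketch is only that each entry of dma_liobn[] gets the same
treatment, so a second (64-bit) window is not left out when a VFIO device is
plugged.
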
> >> @@ -1527,6 +1544,11 @@ static Property spapr_phb_properties[] = {
> >>      /* Default DMA window is 0..1GB */
> >>      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
> >>      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
> >> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_win_addr,
> >> +                       0x800000000000000ULL),
> >> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
> >> +    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
> >> +                       (1ULL << 12) | (1ULL << 16)),
> >>      DEFINE_PROP_END_OF_LIST(),
> >>  };
> >>  
> >> @@ -1603,7 +1625,7 @@ static const VMStateDescription vmstate_spapr_pci = {
> >>      .post_load = spapr_pci_post_load,
> >>      .fields = (VMStateField[]) {
> >>          VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),
> >> -        VMSTATE_UINT32_EQUAL(dma_liobn, sPAPRPHBState),
> >> +        VMSTATE_UNUSED(4), /* dma_liobn */
> > 
> > It's not obvious to me why this change is necessary.
> 
> It is not. But I was touching liobn and this is a proper cleanup which
> needs to be done anyway as _EQUAL() macros are sort of deprecated and
> rather pointless. Since I am adding a new 64bit LIOBN in this patch, should
> I add it in VMSTATE as 32bit one and bump the vmstate version? Or not add
> it (leaving some inconsistency)?

Ah, ok, I see your point.  Yeah, I guess we can drop it.

> >>          VMSTATE_UINT64_EQUAL(mem_win_addr, sPAPRPHBState),
> >>          VMSTATE_UINT64_EQUAL(mem_win_size, sPAPRPHBState),
> >>          VMSTATE_UINT64_EQUAL(io_win_addr, sPAPRPHBState),
> >> @@ -1779,6 +1801,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
> >>      uint32_t interrupt_map_mask[] = {
> >>          cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
> >>      uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
> >> +    uint32_t ddw_applicable[] = {
> >> +        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
> >> +        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
> >> +        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
> >> +    };
> >> +    uint32_t ddw_extensions[] = {
> >> +        cpu_to_be32(1),
> >> +        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
> >> +    };
> >>      sPAPRTCETable *tcet;
> >>      PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
> >>      sPAPRFDT s_fdt;
> >> @@ -1803,6 +1834,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
> >>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
> >>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
> >>  
> >> +    /* Dynamic DMA window */
> >> +    if (phb->ddw_enabled) {
> >> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
> >> +                         sizeof(ddw_applicable)));
> >> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
> >> +                         &ddw_extensions, sizeof(ddw_extensions)));
> >> +    }
> >> +
> >>      /* Build the interrupt-map, this must matches what is done
> >>       * in pci_spapr_map_irq
> >>       */
> >> @@ -1826,7 +1865,7 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
> >>      _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
> >>                       sizeof(interrupt_map)));
> >>  
> >> -    tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
> >> +    tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
> >>      if (!tcet) {
> >>          return -1;
> >>      }
> >> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
> >> new file mode 100644
> >> index 0000000..177dcff
> >> --- /dev/null
> >> +++ b/hw/ppc/spapr_rtas_ddw.c
> >> @@ -0,0 +1,295 @@
> >> +/*
> >> + * QEMU sPAPR Dynamic DMA windows support
> >> + *
> >> + * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
> >> + *
> >> + *  This program is free software; you can redistribute it and/or modify
> >> + *  it under the terms of the GNU General Public License as published by
> >> + *  the Free Software Foundation; either version 2 of the License,
> >> + *  or (at your option) any later version.
> >> + *
> >> + *  This program is distributed in the hope that it will be useful,
> >> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >> + *  GNU General Public License for more details.
> >> + *
> >> + *  You should have received a copy of the GNU General Public License
> >> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
> >> + */
> >> +
> >> +#include "qemu/osdep.h"
> >> +#include "cpu.h"
> >> +#include "qemu/error-report.h"
> >> +#include "hw/ppc/spapr.h"
> >> +#include "hw/pci-host/spapr.h"
> >> +#include "trace.h"
> >> +
> >> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
> >> +{
> >> +    sPAPRTCETable *tcet;
> >> +
> >> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> >> +    if (tcet && tcet->nb_table) {
> >> +        ++*(unsigned *)opaque;
> >> +    }
> >> +    return 0;
> >> +}
> >> +
> >> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
> >> +{
> >> +    unsigned ret = 0;
> >> +
> >> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
> >> +{
> >> +    sPAPRTCETable *tcet;
> >> +
> >> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> >> +    if (tcet && !tcet->nb_table) {
> >> +        *(uint32_t *)opaque = tcet->liobn;
> >> +        return 1;
> >> +    }
> >> +    return 0;
> >> +}
> >> +
> >> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
> >> +{
> >> +    uint32_t liobn = 0;
> >> +
> >> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
> >> +
> >> +    return liobn;
> >> +}
> >> +
> >> +static uint32_t spapr_page_mask_to_query_mask(uint64_t page_mask)
> >> +{
> >> +    int i;
> >> +    uint32_t mask = 0;
> >> +    const struct { int shift; uint32_t mask; } masks[] = {
> >> +        { 12, RTAS_DDW_PGSIZE_4K },
> >> +        { 16, RTAS_DDW_PGSIZE_64K },
> >> +        { 24, RTAS_DDW_PGSIZE_16M },
> >> +        { 25, RTAS_DDW_PGSIZE_32M },
> >> +        { 26, RTAS_DDW_PGSIZE_64M },
> >> +        { 27, RTAS_DDW_PGSIZE_128M },
> >> +        { 28, RTAS_DDW_PGSIZE_256M },
> >> +        { 34, RTAS_DDW_PGSIZE_16G },
> >> +    };
> >> +
> >> +    for (i = 0; i < ARRAY_SIZE(masks); ++i) {
> >> +        if (page_mask & (1ULL << masks[i].shift)) {
> >> +            mask |= masks[i].mask;
> >> +        }
> >> +    }
> >> +
> >> +    return mask;
> >> +}
> >> +
> >> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
> >> +                                         sPAPRMachineState *spapr,
> >> +                                         uint32_t token, uint32_t nargs,
> >> +                                         target_ulong args,
> >> +                                         uint32_t nret, target_ulong rets)
> >> +{
> >> +    sPAPRPHBState *sphb;
> >> +    uint64_t buid, max_window_size;
> >> +    uint32_t avail, addr, pgmask = 0;
> >> +    MachineState *machine = MACHINE(spapr);
> >> +
> >> +    if ((nargs != 3) || (nret != 5)) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> >> +    addr = rtas_ld(args, 0);
> >> +    sphb = spapr_pci_find_phb(spapr, buid);
> >> +    if (!sphb || !sphb->ddw_enabled) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    /* Translate page mask to LoPAPR format */
> >> +    pgmask = spapr_page_mask_to_query_mask(sphb->page_size_mask);
> >> +
> >> +    /*
> >> +     * This is "Largest contiguous block of TCEs allocated specifically
> >> +     * for (that is, are reserved for) this PE".
> >> +     * Return the maximum number as maximum supported RAM size was in 4K pages.
> >> +     */
> >> +    if (machine->ram_size == machine->maxram_size) {
> >> +        max_window_size = machine->ram_size;
> >> +    } else {
> >> +        MemoryHotplugState *hpms = &spapr->hotplug_memory;
> >> +
> >> +        max_window_size = hpms->base + memory_region_size(&hpms->mr);
> >> +    }
> >> +
> >> +    avail = SPAPR_PCI_DMA_MAX_WINDOWS - spapr_phb_get_active_win_num(sphb);
> >> +
> >> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> >> +    rtas_st(rets, 1, avail);
> >> +    rtas_st(rets, 2, max_window_size >> SPAPR_TCE_PAGE_SHIFT);
> >> +    rtas_st(rets, 3, pgmask);
> >> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
> >> +
> >> +    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
> >> +    return;
> >> +
> >> +param_error_exit:
> >> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> >> +}
> >> +
> >> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
> >> +                                          sPAPRMachineState *spapr,
> >> +                                          uint32_t token, uint32_t nargs,
> >> +                                          target_ulong args,
> >> +                                          uint32_t nret, target_ulong rets)
> >> +{
> >> +    sPAPRPHBState *sphb;
> >> +    sPAPRTCETable *tcet = NULL;
> >> +    uint32_t addr, page_shift, window_shift, liobn;
> >> +    uint64_t buid, win_addr;
> >> +    int windows;
> >> +
> >> +    if ((nargs != 5) || (nret != 4)) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> >> +    addr = rtas_ld(args, 0);
> >> +    sphb = spapr_pci_find_phb(spapr, buid);
> >> +    if (!sphb || !sphb->ddw_enabled) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    page_shift = rtas_ld(args, 3);
> >> +    window_shift = rtas_ld(args, 4);
> >> +    liobn = spapr_phb_get_free_liobn(sphb);
> >> +    windows = spapr_phb_get_active_win_num(sphb);
> >> +
> >> +    if (!(sphb->page_size_mask & (1ULL << page_shift)) ||
> >> +        (window_shift < page_shift)) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    if (!liobn || !sphb->ddw_enabled || windows == SPAPR_PCI_DMA_MAX_WINDOWS) {
> >> +        goto hw_error_exit;
> >> +    }
> >> +
> >> +    tcet = spapr_tce_find_by_liobn(liobn);
> >> +    if (!tcet) {
> >> +        goto hw_error_exit;
> >> +    }
> >> +
> >> +    win_addr = (windows == 0) ? sphb->dma_win_addr : sphb->dma64_win_addr;
> > 
> > If the guest deletes the default 32-bit window, then requests a really
> > big 64-bit DMA window, will that work ok with the big window at 0
> > instead of the usual 64-bit window address?
> 
> 
> There is no valid guest to try that as they keep 32bit window.

Right, but we should aim to work in general, not just with known
guests.

> There was a relatively short period of time in v3.0-ish era (sles11 did
> have it and sles11sp3 did not if I remember correctly) when the guest would
> remove all windows and create one huge window but for some reason it
> expected the window to start not from zero (perhaps pHyp implementation
> detail) so it would fail. I did an experiment and removed that particular
> check and it worked just fine.

You mean removed the check for non-zero address from the guest?

> Today guests always keep a 32bit window as the platform cannot tell if all
> the drivers on a specific PHB will request 64bit DMA.
> 
> 
> 
> 
> >> +    spapr_tce_table_enable(tcet, page_shift, win_addr,
> >> +                           1ULL << (window_shift - page_shift));
> >> +    if (!tcet->nb_table) {
> >> +        goto hw_error_exit;
> >> +    }
> >> +
> >> +    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
> >> +                                 1ULL << window_shift, tcet->bus_offset, liobn);
> >> +
> >> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> >> +    rtas_st(rets, 1, liobn);
> >> +    rtas_st(rets, 2, tcet->bus_offset >> 32);
> >> +    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
> >> +
> >> +    return;
> >> +
> >> +hw_error_exit:
> >> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> >> +    return;
> >> +
> >> +param_error_exit:
> >> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> >> +}
> >> +
> >> +static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
> >> +                                          sPAPRMachineState *spapr,
> >> +                                          uint32_t token, uint32_t nargs,
> >> +                                          target_ulong args,
> >> +                                          uint32_t nret, target_ulong rets)
> >> +{
> >> +    sPAPRPHBState *sphb;
> >> +    sPAPRTCETable *tcet;
> >> +    uint32_t liobn;
> >> +
> >> +    if ((nargs != 1) || (nret != 1)) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    liobn = rtas_ld(args, 0);
> >> +    tcet = spapr_tce_find_by_liobn(liobn);
> >> +    if (!tcet) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
> >> +    if (!sphb || !sphb->ddw_enabled || !tcet->nb_table) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    spapr_tce_table_disable(tcet);
> >> +    trace_spapr_iommu_ddw_remove(liobn);
> >> +
> >> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> >> +    return;
> >> +
> >> +param_error_exit:
> >> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> >> +}
> >> +
> >> +static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
> >> +                                         sPAPRMachineState *spapr,
> >> +                                         uint32_t token, uint32_t nargs,
> >> +                                         target_ulong args,
> >> +                                         uint32_t nret, target_ulong rets)
> >> +{
> >> +    sPAPRPHBState *sphb;
> >> +    uint64_t buid;
> >> +    uint32_t addr;
> >> +
> >> +    if ((nargs != 3) || (nret != 1)) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> >> +    addr = rtas_ld(args, 0);
> >> +    sphb = spapr_pci_find_phb(spapr, buid);
> >> +    if (!sphb || !sphb->ddw_enabled) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    spapr_phb_dma_reset(sphb);
> >> +    trace_spapr_iommu_ddw_reset(buid, addr);
> >> +
> >> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> >> +
> >> +    return;
> >> +
> >> +param_error_exit:
> >> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> >> +}
> >> +
> >> +static void spapr_rtas_ddw_init(void)
> >> +{
> >> +    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
> >> +                        "ibm,query-pe-dma-window",
> >> +                        rtas_ibm_query_pe_dma_window);
> >> +    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
> >> +                        "ibm,create-pe-dma-window",
> >> +                        rtas_ibm_create_pe_dma_window);
> >> +    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
> >> +                        "ibm,remove-pe-dma-window",
> >> +                        rtas_ibm_remove_pe_dma_window);
> >> +    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
> >> +                        "ibm,reset-pe-dma-window",
> >> +                        rtas_ibm_reset_pe_dma_window);
> >> +}
> >> +
> >> +type_init(spapr_rtas_ddw_init)
> >> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> >> index 7848366..92aa610 100644
> >> --- a/include/hw/pci-host/spapr.h
> >> +++ b/include/hw/pci-host/spapr.h
> >> @@ -32,6 +32,8 @@
> >>  #define SPAPR_PCI_HOST_BRIDGE(obj) \
> >>      OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
> >>  
> >> +#define SPAPR_PCI_DMA_MAX_WINDOWS    2
> >> +
> >>  typedef struct sPAPRPHBState sPAPRPHBState;
> >>  
> >>  typedef struct spapr_pci_msi {
> >> @@ -56,7 +58,7 @@ struct sPAPRPHBState {
> >>      hwaddr mem_win_addr, mem_win_size, io_win_addr, io_win_size;
> >>      MemoryRegion memwindow, iowindow, msiwindow;
> >>  
> >> -    uint32_t dma_liobn;
> >> +    uint32_t dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS];
> >>      hwaddr dma_win_addr, dma_win_size;
> >>      AddressSpace iommu_as;
> >>      MemoryRegion iommu_root;
> >> @@ -71,6 +73,10 @@ struct sPAPRPHBState {
> >>      spapr_pci_msi_mig *msi_devs;
> >>  
> >>      QLIST_ENTRY(sPAPRPHBState) list;
> >> +
> >> +    bool ddw_enabled;
> >> +    uint64_t page_size_mask;
> >> +    uint64_t dma64_win_addr;
> >>  };
> >>  
> >>  #define SPAPR_PCI_MAX_INDEX          255
> >> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >> index e1f8274..36d1748 100644
> >> --- a/include/hw/ppc/spapr.h
> >> +++ b/include/hw/ppc/spapr.h
> >> @@ -416,6 +416,16 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
> >>  #define RTAS_OUT_NOT_AUTHORIZED                 -9002
> >>  #define RTAS_OUT_SYSPARM_PARAM_ERROR            -9999
> >>  
> >> +/* DDW pagesize mask values from ibm,query-pe-dma-window */
> >> +#define RTAS_DDW_PGSIZE_4K       0x01
> >> +#define RTAS_DDW_PGSIZE_64K      0x02
> >> +#define RTAS_DDW_PGSIZE_16M      0x04
> >> +#define RTAS_DDW_PGSIZE_32M      0x08
> >> +#define RTAS_DDW_PGSIZE_64M      0x10
> >> +#define RTAS_DDW_PGSIZE_128M     0x20
> >> +#define RTAS_DDW_PGSIZE_256M     0x40
> >> +#define RTAS_DDW_PGSIZE_16G      0x80
> >> +
> >>  /* RTAS tokens */
> >>  #define RTAS_TOKEN_BASE      0x2000
> >>  
> >> @@ -457,8 +467,12 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
> >>  #define RTAS_IBM_SET_SLOT_RESET                 (RTAS_TOKEN_BASE + 0x23)
> >>  #define RTAS_IBM_CONFIGURE_PE                   (RTAS_TOKEN_BASE + 0x24)
> >>  #define RTAS_IBM_SLOT_ERROR_DETAIL              (RTAS_TOKEN_BASE + 0x25)
> >> +#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x26)
> >> +#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x27)
> >> +#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x28)
> >> +#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x29)
> >>  
> >> -#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x26)
> >> +#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x2A)
> >>  
> >>  /* RTAS ibm,get-system-parameter token values */
> >>  #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
> >> diff --git a/trace-events b/trace-events
> >> index 7e94d92..5b52634 100644
> >> --- a/trace-events
> >> +++ b/trace-events
> >> @@ -1435,6 +1435,10 @@ spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, un
> >>  spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
> >>  spapr_iommu_pre_save(uint64_t liobn, uint32_t nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
> >>  spapr_iommu_post_load(uint64_t liobn, uint32_t pre_nb, uint32_t post_nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" => %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
> >> +spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32
> >> +spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, uint64_t pg_size, uint64_t req_size, uint64_t start, uint32_t liobn) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%"PRIx64", requested=0x%"PRIx64", start addr=%"PRIx64", liobn=%"PRIx32
> >> +spapr_iommu_ddw_remove(uint32_t liobn) "liobn=%"PRIx32
> >> +spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr) "buid=%"PRIx64" addr=%"PRIx32
> >>  
> >>  # hw/ppc/ppc.c
> >>  ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
> > 
> 
> 




-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v18 5/5] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-06-22  7:01       ` David Gibson
@ 2016-06-22  8:26         ` Alexey Kardashevskiy
  0 siblings, 0 replies; 23+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-22  8:26 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-devel, qemu-ppc, Alex Williamson


On 22/06/16 17:01, David Gibson wrote:
> On Wed, Jun 22, 2016 at 01:23:51PM +1000, Alexey Kardashevskiy wrote:
>> On 22/06/16 12:35, David Gibson wrote:
>>> On Tue, Jun 21, 2016 at 11:14:05AM +1000, Alexey Kardashevskiy wrote:
>>>> This adds support for the Dynamic DMA Windows (DDW) option defined by
>>>> the SPAPR specification, which allows having additional DMA window(s).
>>>>
>>>> The "ddw" property is enabled by default on a PHB but for compatibility
>>>> the pseries-2.6 machine and older disable it.
>>>> This also creates a single DMA window for the older machines to
>>>> maintain backward migration.
>>>>
>>>> This implements DDW for PHB with emulated and VFIO devices. The host
>>>> kernel support is required. The advertised IOMMU page sizes are 4K and
>>>> 64K; 16M pages are supported but not advertised by default. To enable
>>>> them, the user has to specify the "pgsz" property for the PHB and
>>>> enable huge pages for RAM.
>>>>
>>>> Existing Linux guests try creating one additional huge DMA window
>>>> with 64K or 16MB pages and map the entire guest RAM to it. If this succeeds,
>>>> the guest switches to dma_direct_ops and never calls TCE hypercalls
>>>> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
>>>> and not waste time on map/unmap later. This adds a "dma64_win_addr"
>>>> property which is a bus address for the 64bit window and by default
>>>> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
>>>> uses and this allows having emulated and VFIO devices on the same bus.
>>>>
>>>> This adds 4 RTAS handlers:
>>>> * ibm,query-pe-dma-window
>>>> * ibm,create-pe-dma-window
>>>> * ibm,remove-pe-dma-window
>>>> * ibm,reset-pe-dma-window
>>>> These are registered from type_init() callback.
>>>>
>>>> These RTAS handlers are implemented in a separate file to avoid polluting
>>>> spapr_iommu.c with PCI.
>>>>
>>>> This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>
>>> A few queries below.  Not sure if they'll require code changes or just
>>> explanation.
>>>
>>>> ---
>>>> Changes:
>>>> v18:
>>>> * fixed bug when ddw-create rtas call was always creating window at 1<<59
>>>> offset
>>>> * update minimum supported machine version
>>>> * s/dma64_window_addr/dma64_win_addr/ to match dma_win_addr
>>>>
>>>> v17:
>>>> * fixed: "query" did return non-page-shifted value when memory hotplug is enabled
>>>>
>>>> v16:
>>>> * s/dma_liobn/dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS]/
>>>> * s/SPAPR_PCI_LIOBN()/dma_liobn[]/
>>>>
>>>> v15:
>>>> * moved page mask filtering to PHB realize(), use "-mem-path" to know
>>>> if there are huge pages
>>>> * fixed error reporting in RTAS handlers
>>>> * max window size now accounts for hotpluggable memory boundaries
>>>> ---
>>>>  hw/ppc/Makefile.objs        |   1 +
>>>>  hw/ppc/spapr.c              |   7 +-
>>>>  hw/ppc/spapr_pci.c          |  77 +++++++++---
>>>>  hw/ppc/spapr_rtas_ddw.c     | 295 ++++++++++++++++++++++++++++++++++++++++++++
>>>>  include/hw/pci-host/spapr.h |   8 +-
>>>>  include/hw/ppc/spapr.h      |  16 ++-
>>>>  trace-events                |   4 +
>>>>  7 files changed, 386 insertions(+), 22 deletions(-)
>>>>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
>>>>
>>>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>>>> index 5cc6608..91a3420 100644
>>>> --- a/hw/ppc/Makefile.objs
>>>> +++ b/hw/ppc/Makefile.objs
>>>> @@ -8,6 +8,7 @@ obj-$(CONFIG_PSERIES) += spapr_cpu_core.o
>>>>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>>>>  obj-y += spapr_pci_vfio.o
>>>>  endif
>>>> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>>>>  # PowerPC 4xx boards
>>>>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>>>>  obj-y += ppc4xx_pci.o
>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>> index 778fa25..f7cff27 100644
>>>> --- a/hw/ppc/spapr.c
>>>> +++ b/hw/ppc/spapr.c
>>>> @@ -2485,7 +2485,12 @@ DEFINE_SPAPR_MACHINE(2_7, "2.7", true);
>>>>   * pseries-2.6
>>>>   */
>>>>  #define SPAPR_COMPAT_2_6 \
>>>> -    HW_COMPAT_2_6
>>>> +    HW_COMPAT_2_6 \
>>>> +    { \
>>>> +        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
>>>> +        .property = "ddw",\
>>>> +        .value    = stringify(off),\
>>>> +    },
>>>>  
>>>>  static void spapr_machine_2_6_instance_options(MachineState *machine)
>>>>  {
>>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>>> index 9f28fb3..0cb51dd 100644
>>>> --- a/hw/ppc/spapr_pci.c
>>>> +++ b/hw/ppc/spapr_pci.c
>>>> @@ -35,6 +35,7 @@
>>>>  #include "hw/ppc/spapr.h"
>>>>  #include "hw/pci-host/spapr.h"
>>>>  #include "exec/address-spaces.h"
>>>> +#include "exec/ram_addr.h"
>>>>  #include <libfdt.h>
>>>>  #include "trace.h"
>>>>  #include "qemu/error-report.h"
>>>> @@ -45,6 +46,7 @@
>>>>  #include "hw/ppc/spapr_drc.h"
>>>>  #include "sysemu/device_tree.h"
>>>>  #include "sysemu/kvm.h"
>>>> +#include "sysemu/hostmem.h"
>>>>  
>>>>  #include "hw/vfio/vfio.h"
>>>>  
>>>> @@ -1088,7 +1090,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>>>>      int fdt_start_offset = 0, fdt_size;
>>>>  
>>>>      if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
>>>> -        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
>>>> +        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
>>>>  
>>>>          spapr_tce_set_need_vfio(tcet, true);
>>>
>>> Now that Alex took your notifier on/off patches, can you remove this
>>> chunk? 
>>
>> It will stop compiling as dma_liobn is an array now.
> 
> Sorry, I wasn't clear.  I meant remove this whole if statement, not
> just remove this hunk of the patch.


Bisectability will suffer then, which we can easily avoid if this patch is
applied on top of these:

vfio, memory: Notify IOMMU about starting/stopping listening
spapr_iommu: Realloc guest visible TCE table when starting/stopping listening

All we need is Alex to send pull req, Peter to merge it and you to rebase
ppc-for-2.7 on top of this :)

> 
>>> If it's still necessary, don't you need to loop over all the
>>> possible liobns, rather than just acting on liobn[0]?
>>
>> Ah, right. Forgot about it. That was the reason why I wanted those notifier
>> callbacks in this series, lost it in respins. I do need a loop here which
>> I'll have to remove soon though.
> 
> Ok.
>>>> @@ -1527,6 +1544,11 @@ static Property spapr_phb_properties[] = {
>>>>      /* Default DMA window is 0..1GB */
>>>>      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
>>>>      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
>>>> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_win_addr,
>>>> +                       0x800000000000000ULL),
>>>> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
>>>> +    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
>>>> +                       (1ULL << 12) | (1ULL << 16)),
>>>>      DEFINE_PROP_END_OF_LIST(),
>>>>  };
>>>>  
>>>> @@ -1603,7 +1625,7 @@ static const VMStateDescription vmstate_spapr_pci = {
>>>>      .post_load = spapr_pci_post_load,
>>>>      .fields = (VMStateField[]) {
>>>>          VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),
>>>> -        VMSTATE_UINT32_EQUAL(dma_liobn, sPAPRPHBState),
>>>> +        VMSTATE_UNUSED(4), /* dma_liobn */
>>>
>>> It's not obvious to me why this change is necessary.
>>
>> It is not. But I was touching liobn and this is a proper cleanup which
>> needs to be done anyway as _EQUAL() macros are sort of deprecated and
>> rather pointless. Since I am adding a new 64bit LIOBN in this patch, should
>> I add it in VMSTATE as 32bit one and bump the vmstate version? Or not add
>> it (leaving some inconsistency)?
> 
> Ah, ok, I see your point.  Yeah, I guess we can drop it.
> 
>>>>          VMSTATE_UINT64_EQUAL(mem_win_addr, sPAPRPHBState),
>>>>          VMSTATE_UINT64_EQUAL(mem_win_size, sPAPRPHBState),
>>>>          VMSTATE_UINT64_EQUAL(io_win_addr, sPAPRPHBState),
>>>> @@ -1779,6 +1801,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>>>      uint32_t interrupt_map_mask[] = {
>>>>          cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
>>>>      uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
>>>> +    uint32_t ddw_applicable[] = {
>>>> +        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
>>>> +        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
>>>> +        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
>>>> +    };
>>>> +    uint32_t ddw_extensions[] = {
>>>> +        cpu_to_be32(1),
>>>> +        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
>>>> +    };
>>>>      sPAPRTCETable *tcet;
>>>>      PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
>>>>      sPAPRFDT s_fdt;
>>>> @@ -1803,6 +1834,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>>>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
>>>>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
>>>>  
>>>> +    /* Dynamic DMA window */
>>>> +    if (phb->ddw_enabled) {
>>>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
>>>> +                         sizeof(ddw_applicable)));
>>>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
>>>> +                         &ddw_extensions, sizeof(ddw_extensions)));
>>>> +    }
>>>> +
>>>>      /* Build the interrupt-map, this must matches what is done
>>>>       * in pci_spapr_map_irq
>>>>       */
>>>> @@ -1826,7 +1865,7 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>>>      _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
>>>>                       sizeof(interrupt_map)));
>>>>  
>>>> -    tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
>>>> +    tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
>>>>      if (!tcet) {
>>>>          return -1;
>>>>      }
>>>> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
>>>> new file mode 100644
>>>> index 0000000..177dcff
>>>> --- /dev/null
>>>> +++ b/hw/ppc/spapr_rtas_ddw.c
>>>> @@ -0,0 +1,295 @@
>>>> +/*
>>>> + * QEMU sPAPR Dynamic DMA windows support
>>>> + *
>>>> + * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
>>>> + *
>>>> + *  This program is free software; you can redistribute it and/or modify
>>>> + *  it under the terms of the GNU General Public License as published by
>>>> + *  the Free Software Foundation; either version 2 of the License,
>>>> + *  or (at your option) any later version.
>>>> + *
>>>> + *  This program is distributed in the hope that it will be useful,
>>>> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>>> + *  GNU General Public License for more details.
>>>> + *
>>>> + *  You should have received a copy of the GNU General Public License
>>>> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
>>>> + */
>>>> +
>>>> +#include "qemu/osdep.h"
>>>> +#include "cpu.h"
>>>> +#include "qemu/error-report.h"
>>>> +#include "hw/ppc/spapr.h"
>>>> +#include "hw/pci-host/spapr.h"
>>>> +#include "trace.h"
>>>> +
>>>> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
>>>> +{
>>>> +    sPAPRTCETable *tcet;
>>>> +
>>>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>>>> +    if (tcet && tcet->nb_table) {
>>>> +        ++*(unsigned *)opaque;
>>>> +    }
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
>>>> +{
>>>> +    unsigned ret = 0;
>>>> +
>>>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
>>>> +{
>>>> +    sPAPRTCETable *tcet;
>>>> +
>>>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>>>> +    if (tcet && !tcet->nb_table) {
>>>> +        *(uint32_t *)opaque = tcet->liobn;
>>>> +        return 1;
>>>> +    }
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
>>>> +{
>>>> +    uint32_t liobn = 0;
>>>> +
>>>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
>>>> +
>>>> +    return liobn;
>>>> +}
>>>> +
>>>> +static uint32_t spapr_page_mask_to_query_mask(uint64_t page_mask)
>>>> +{
>>>> +    int i;
>>>> +    uint32_t mask = 0;
>>>> +    const struct { int shift; uint32_t mask; } masks[] = {
>>>> +        { 12, RTAS_DDW_PGSIZE_4K },
>>>> +        { 16, RTAS_DDW_PGSIZE_64K },
>>>> +        { 24, RTAS_DDW_PGSIZE_16M },
>>>> +        { 25, RTAS_DDW_PGSIZE_32M },
>>>> +        { 26, RTAS_DDW_PGSIZE_64M },
>>>> +        { 27, RTAS_DDW_PGSIZE_128M },
>>>> +        { 28, RTAS_DDW_PGSIZE_256M },
>>>> +        { 34, RTAS_DDW_PGSIZE_16G },
>>>> +    };
>>>> +
>>>> +    for (i = 0; i < ARRAY_SIZE(masks); ++i) {
>>>> +        if (page_mask & (1ULL << masks[i].shift)) {
>>>> +            mask |= masks[i].mask;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    return mask;
>>>> +}
>>>> +
>>>> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
>>>> +                                         sPAPRMachineState *spapr,
>>>> +                                         uint32_t token, uint32_t nargs,
>>>> +                                         target_ulong args,
>>>> +                                         uint32_t nret, target_ulong rets)
>>>> +{
>>>> +    sPAPRPHBState *sphb;
>>>> +    uint64_t buid, max_window_size;
>>>> +    uint32_t avail, addr, pgmask = 0;
>>>> +    MachineState *machine = MACHINE(spapr);
>>>> +
>>>> +    if ((nargs != 3) || (nret != 5)) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>>>> +    addr = rtas_ld(args, 0);
>>>> +    sphb = spapr_pci_find_phb(spapr, buid);
>>>> +    if (!sphb || !sphb->ddw_enabled) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    /* Translate page mask to LoPAPR format */
>>>> +    pgmask = spapr_page_mask_to_query_mask(sphb->page_size_mask);
>>>> +
>>>> +    /*
>>>> +     * This is "Largest contiguous block of TCEs allocated specifically
>>>> +     * for (that is, are reserved for) this PE".
>>>> +     * Return the maximum number as maximum supported RAM size was in 4K pages.
>>>> +     */
>>>> +    if (machine->ram_size == machine->maxram_size) {
>>>> +        max_window_size = machine->ram_size;
>>>> +    } else {
>>>> +        MemoryHotplugState *hpms = &spapr->hotplug_memory;
>>>> +
>>>> +        max_window_size = hpms->base + memory_region_size(&hpms->mr);
>>>> +    }
>>>> +
>>>> +    avail = SPAPR_PCI_DMA_MAX_WINDOWS - spapr_phb_get_active_win_num(sphb);
>>>> +
>>>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>>>> +    rtas_st(rets, 1, avail);
>>>> +    rtas_st(rets, 2, max_window_size >> SPAPR_TCE_PAGE_SHIFT);
>>>> +    rtas_st(rets, 3, pgmask);
>>>> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
>>>> +
>>>> +    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
>>>> +    return;
>>>> +
>>>> +param_error_exit:
>>>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>>>> +}
>>>> +
>>>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
>>>> +                                          sPAPRMachineState *spapr,
>>>> +                                          uint32_t token, uint32_t nargs,
>>>> +                                          target_ulong args,
>>>> +                                          uint32_t nret, target_ulong rets)
>>>> +{
>>>> +    sPAPRPHBState *sphb;
>>>> +    sPAPRTCETable *tcet = NULL;
>>>> +    uint32_t addr, page_shift, window_shift, liobn;
>>>> +    uint64_t buid, win_addr;
>>>> +    int windows;
>>>> +
>>>> +    if ((nargs != 5) || (nret != 4)) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>>>> +    addr = rtas_ld(args, 0);
>>>> +    sphb = spapr_pci_find_phb(spapr, buid);
>>>> +    if (!sphb || !sphb->ddw_enabled) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    page_shift = rtas_ld(args, 3);
>>>> +    window_shift = rtas_ld(args, 4);
>>>> +    liobn = spapr_phb_get_free_liobn(sphb);
>>>> +    windows = spapr_phb_get_active_win_num(sphb);
>>>> +
>>>> +    if (!(sphb->page_size_mask & (1ULL << page_shift)) ||
>>>> +        (window_shift < page_shift)) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    if (!liobn || !sphb->ddw_enabled || windows == SPAPR_PCI_DMA_MAX_WINDOWS) {
>>>> +        goto hw_error_exit;
>>>> +    }
>>>> +
>>>> +    tcet = spapr_tce_find_by_liobn(liobn);
>>>> +    if (!tcet) {
>>>> +        goto hw_error_exit;
>>>> +    }
>>>> +
>>>> +    win_addr = (windows == 0) ? sphb->dma_win_addr : sphb->dma64_win_addr;
>>>
>>> If the guest deletes the default 32-bit window, then requests a really
>>> big 64-bit DMA window, will that work ok with the big window at 0
>>> instead of the usual 64-bit window address?
>>
>>
>> There is no valid guest to try that as they keep 32bit window.
> 
> Right, but we should aim to work in general, not just with known
> guests.


Well, I did the experiment described below.

The term "in general" is vague though - if pHyp did something not exactly
as PAPR said (and guests therefore came to expect that), what behavior should
I pick for QEMU? For example, the guest did not expect a new window to start
from zero, so there must have been a reason for that: perhaps pHyp can only
allocate a single window and only at the 1<<59 offset, or nobody actually
tested it (always a possibility).
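
To make the scenario concrete, here is a minimal standalone sketch of the
address selection quoted above from rtas_ibm_create_pe_dma_window(). The
struct and function names (phb_model, pick_win_addr) are made up for
illustration; only the two default addresses and the "windows == 0" test
come from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two sPAPRPHBState fields used here. */
struct phb_model {
    uint64_t dma_win_addr;    /* default 32-bit window, 0 by default */
    uint64_t dma64_win_addr;  /* 64-bit window, 1ULL << 59 by default */
};

/*
 * Mirrors the expression in the patch: when no window is active the new
 * one is placed at dma_win_addr, otherwise at dma64_win_addr.
 */
uint64_t pick_win_addr(const struct phb_model *phb, unsigned active_windows)
{
    assert(phb != NULL);
    return active_windows == 0 ? phb->dma_win_addr : phb->dma64_win_addr;
}
```

So a guest that removes the default window first and then creates a big
one gets it at dma_win_addr (zero by default), while a guest that keeps
the default window gets the new one at dma64_win_addr.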


>> There was a relatively short period of time in v3.0-ish era (sles11 did
>> have it and sles11sp3 did not if I remember correctly) when the guest would
>> remove all windows and create one huge window but for some reason it
>> expected the window to start not from zero (perhaps pHyp implementation
>> detail) so it would fail. I did an experiment and removed that particular
>> check and it worked just fine.
> 
> You mean removed the check for non-zero address from the guest?

Yes, that one.


>> Today guests always keep a 32bit window as the platform cannot tell if all
>> the drivers on a specific PHB will request 64bit DMA.
>>
>>
>>
>>
>>>> +    spapr_tce_table_enable(tcet, page_shift, win_addr,
>>>> +                           1ULL << (window_shift - page_shift));
>>>> +    if (!tcet->nb_table) {
>>>> +        goto hw_error_exit;
>>>> +    }
>>>> +
>>>> +    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
>>>> +                                 1ULL << window_shift, tcet->bus_offset, liobn);
>>>> +
>>>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>>>> +    rtas_st(rets, 1, liobn);
>>>> +    rtas_st(rets, 2, tcet->bus_offset >> 32);
>>>> +    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
>>>> +
>>>> +    return;
>>>> +
>>>> +hw_error_exit:
>>>> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
>>>> +    return;
>>>> +
>>>> +param_error_exit:
>>>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>>>> +}
>>>> +
>>>> +static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
>>>> +                                          sPAPRMachineState *spapr,
>>>> +                                          uint32_t token, uint32_t nargs,
>>>> +                                          target_ulong args,
>>>> +                                          uint32_t nret, target_ulong rets)
>>>> +{
>>>> +    sPAPRPHBState *sphb;
>>>> +    sPAPRTCETable *tcet;
>>>> +    uint32_t liobn;
>>>> +
>>>> +    if ((nargs != 1) || (nret != 1)) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    liobn = rtas_ld(args, 0);
>>>> +    tcet = spapr_tce_find_by_liobn(liobn);
>>>> +    if (!tcet) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
>>>> +    if (!sphb || !sphb->ddw_enabled || !tcet->nb_table) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    spapr_tce_table_disable(tcet);
>>>> +    trace_spapr_iommu_ddw_remove(liobn);
>>>> +
>>>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>>>> +    return;
>>>> +
>>>> +param_error_exit:
>>>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>>>> +}
>>>> +
>>>> +static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
>>>> +                                         sPAPRMachineState *spapr,
>>>> +                                         uint32_t token, uint32_t nargs,
>>>> +                                         target_ulong args,
>>>> +                                         uint32_t nret, target_ulong rets)
>>>> +{
>>>> +    sPAPRPHBState *sphb;
>>>> +    uint64_t buid;
>>>> +    uint32_t addr;
>>>> +
>>>> +    if ((nargs != 3) || (nret != 1)) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>>>> +    addr = rtas_ld(args, 0);
>>>> +    sphb = spapr_pci_find_phb(spapr, buid);
>>>> +    if (!sphb || !sphb->ddw_enabled) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    spapr_phb_dma_reset(sphb);
>>>> +    trace_spapr_iommu_ddw_reset(buid, addr);
>>>> +
>>>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>>>> +
>>>> +    return;
>>>> +
>>>> +param_error_exit:
>>>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>>>> +}
>>>> +
>>>> +static void spapr_rtas_ddw_init(void)
>>>> +{
>>>> +    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
>>>> +                        "ibm,query-pe-dma-window",
>>>> +                        rtas_ibm_query_pe_dma_window);
>>>> +    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
>>>> +                        "ibm,create-pe-dma-window",
>>>> +                        rtas_ibm_create_pe_dma_window);
>>>> +    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
>>>> +                        "ibm,remove-pe-dma-window",
>>>> +                        rtas_ibm_remove_pe_dma_window);
>>>> +    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
>>>> +                        "ibm,reset-pe-dma-window",
>>>> +                        rtas_ibm_reset_pe_dma_window);
>>>> +}
>>>> +
>>>> +type_init(spapr_rtas_ddw_init)
>>>> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
>>>> index 7848366..92aa610 100644
>>>> --- a/include/hw/pci-host/spapr.h
>>>> +++ b/include/hw/pci-host/spapr.h
>>>> @@ -32,6 +32,8 @@
>>>>  #define SPAPR_PCI_HOST_BRIDGE(obj) \
>>>>      OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
>>>>  
>>>> +#define SPAPR_PCI_DMA_MAX_WINDOWS    2
>>>> +
>>>>  typedef struct sPAPRPHBState sPAPRPHBState;
>>>>  
>>>>  typedef struct spapr_pci_msi {
>>>> @@ -56,7 +58,7 @@ struct sPAPRPHBState {
>>>>      hwaddr mem_win_addr, mem_win_size, io_win_addr, io_win_size;
>>>>      MemoryRegion memwindow, iowindow, msiwindow;
>>>>  
>>>> -    uint32_t dma_liobn;
>>>> +    uint32_t dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS];
>>>>      hwaddr dma_win_addr, dma_win_size;
>>>>      AddressSpace iommu_as;
>>>>      MemoryRegion iommu_root;
>>>> @@ -71,6 +73,10 @@ struct sPAPRPHBState {
>>>>      spapr_pci_msi_mig *msi_devs;
>>>>  
>>>>      QLIST_ENTRY(sPAPRPHBState) list;
>>>> +
>>>> +    bool ddw_enabled;
>>>> +    uint64_t page_size_mask;
>>>> +    uint64_t dma64_win_addr;
>>>>  };
>>>>  
>>>>  #define SPAPR_PCI_MAX_INDEX          255
>>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>>>> index e1f8274..36d1748 100644
>>>> --- a/include/hw/ppc/spapr.h
>>>> +++ b/include/hw/ppc/spapr.h
>>>> @@ -416,6 +416,16 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>>>>  #define RTAS_OUT_NOT_AUTHORIZED                 -9002
>>>>  #define RTAS_OUT_SYSPARM_PARAM_ERROR            -9999
>>>>  
>>>> +/* DDW pagesize mask values from ibm,query-pe-dma-window */
>>>> +#define RTAS_DDW_PGSIZE_4K       0x01
>>>> +#define RTAS_DDW_PGSIZE_64K      0x02
>>>> +#define RTAS_DDW_PGSIZE_16M      0x04
>>>> +#define RTAS_DDW_PGSIZE_32M      0x08
>>>> +#define RTAS_DDW_PGSIZE_64M      0x10
>>>> +#define RTAS_DDW_PGSIZE_128M     0x20
>>>> +#define RTAS_DDW_PGSIZE_256M     0x40
>>>> +#define RTAS_DDW_PGSIZE_16G      0x80
>>>> +
>>>>  /* RTAS tokens */
>>>>  #define RTAS_TOKEN_BASE      0x2000
>>>>  
>>>> @@ -457,8 +467,12 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>>>>  #define RTAS_IBM_SET_SLOT_RESET                 (RTAS_TOKEN_BASE + 0x23)
>>>>  #define RTAS_IBM_CONFIGURE_PE                   (RTAS_TOKEN_BASE + 0x24)
>>>>  #define RTAS_IBM_SLOT_ERROR_DETAIL              (RTAS_TOKEN_BASE + 0x25)
>>>> +#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x26)
>>>> +#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x27)
>>>> +#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x28)
>>>> +#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x29)
>>>>  
>>>> -#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x26)
>>>> +#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x2A)
>>>>  
>>>>  /* RTAS ibm,get-system-parameter token values */
>>>>  #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
>>>> diff --git a/trace-events b/trace-events
>>>> index 7e94d92..5b52634 100644
>>>> --- a/trace-events
>>>> +++ b/trace-events
>>>> @@ -1435,6 +1435,10 @@ spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, un
>>>>  spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
>>>>  spapr_iommu_pre_save(uint64_t liobn, uint32_t nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
>>>>  spapr_iommu_post_load(uint64_t liobn, uint32_t pre_nb, uint32_t post_nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" => %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
>>>> +spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32
>>>> +spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, uint64_t pg_size, uint64_t req_size, uint64_t start, uint32_t liobn) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%"PRIx64", requested=0x%"PRIx64", start addr=%"PRIx64", liobn=%"PRIx32
>>>> +spapr_iommu_ddw_remove(uint32_t liobn) "liobn=%"PRIx32
>>>> +spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr) "buid=%"PRIx64" addr=%"PRIx32
>>>>  
>>>>  # hw/ppc/ppc.c
>>>>  ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
>>>
>>
>>
> 
> 
> 
> 


-- 
Alexey



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v18 5/5] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-06-22  3:23     ` Alexey Kardashevskiy
  2016-06-22  7:01       ` David Gibson
@ 2016-06-22  9:44       ` Thomas Huth
  2016-06-23  2:00         ` Alexey Kardashevskiy
  1 sibling, 1 reply; 23+ messages in thread
From: Thomas Huth @ 2016-06-22  9:44 UTC (permalink / raw)
  To: Alexey Kardashevskiy, David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 4214 bytes --]

On 22.06.2016 05:23, Alexey Kardashevskiy wrote:
> On 22/06/16 12:35, David Gibson wrote:
>> On Tue, Jun 21, 2016 at 11:14:05AM +1000, Alexey Kardashevskiy wrote:
>>> This adds support for Dynamic DMA Windows (DDW) option defined by
>>> the SPAPR specification which allows to have additional DMA window(s)
>>>
>>> The "ddw" property is enabled by default on a PHB but for compatibility
>>> the pseries-2.6 machine and older disable it.
>>> This also creates a single DMA window for the older machines to
>>> maintain backward migration.
>>>
>>> This implements DDW for PHB with emulated and VFIO devices. The host
>>> kernel support is required. The advertised IOMMU page sizes are 4K and
>>> 64K; 16M pages are supported but not advertised by default, in order to
>>> enable them, the user has to specify "pgsz" property for PHB and
>>> enable huge pages for RAM.
>>>
>>> The existing Linux guests try creating one additional huge DMA window
>>> with 64K or 16MB pages and map the entire guest RAM to it. If this succeeds,
>>> the guest switches to dma_direct_ops and never calls TCE hypercalls
>>> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
>>> and not waste time on map/unmap later. This adds a "dma64_win_addr"
>>> property which is a bus address for the 64bit window and by default
>>> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
>>> uses and this allows having emulated and VFIO devices on the same bus.
>>>
>>> This adds 4 RTAS handlers:
>>> * ibm,query-pe-dma-window
>>> * ibm,create-pe-dma-window
>>> * ibm,remove-pe-dma-window
>>> * ibm,reset-pe-dma-window
>>> These are registered from type_init() callback.
>>>
>>> These RTAS handlers are implemented in a separate file to avoid polluting
>>> spapr_iommu.c with PCI.
>>>
>>> This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs.
>>>
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[...]
>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>> index 9f28fb3..0cb51dd 100644
>>> --- a/hw/ppc/spapr_pci.c
>>> +++ b/hw/ppc/spapr_pci.c
[...]
>>> @@ -1515,7 +1531,8 @@ static void spapr_phb_reset(DeviceState *qdev)
>>>  static Property spapr_phb_properties[] = {
>>>      DEFINE_PROP_UINT32("index", sPAPRPHBState, index, -1),
>>>      DEFINE_PROP_UINT64("buid", sPAPRPHBState, buid, -1),
>>> -    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn, -1),
>>> +    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn[0], -1),
>>> +    DEFINE_PROP_UINT32("liobn64", sPAPRPHBState, dma_liobn[1], -1),
>>>      DEFINE_PROP_UINT64("mem_win_addr", sPAPRPHBState, mem_win_addr, -1),
>>>      DEFINE_PROP_UINT64("mem_win_size", sPAPRPHBState, mem_win_size,
>>>                         SPAPR_PCI_MMIO_WIN_SIZE),
>>> @@ -1527,6 +1544,11 @@ static Property spapr_phb_properties[] = {
>>>      /* Default DMA window is 0..1GB */
>>>      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
>>>      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
>>> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_win_addr,
>>> +                       0x800000000000000ULL),
>>> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
>>> +    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
>>> +                       (1ULL << 12) | (1ULL << 16)),
>>>      DEFINE_PROP_END_OF_LIST(),
>>>  };
>>>  
>>> @@ -1603,7 +1625,7 @@ static const VMStateDescription vmstate_spapr_pci = {
>>>      .post_load = spapr_pci_post_load,
>>>      .fields = (VMStateField[]) {
>>>          VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),
>>> -        VMSTATE_UINT32_EQUAL(dma_liobn, sPAPRPHBState),
>>> +        VMSTATE_UNUSED(4), /* dma_liobn */
>>
>> It's not obvious to me why this change is necessary.
> 
> It is not. But I was touching liobn and this is a proper cleanup which
> needs to be done anyway as _EQUAL() macros are sort of deprecated and
> rather pointless.

Not sure, but if you mark this field as unused now, is migration
backwards to an older version of QEMU still working? If not, you might
need to bump the version number, too?
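
One way to picture the concern, as a deliberately simplified standalone
model and not QEMU's actual vmstate code: the assumption is that a sender
using VMSTATE_UNUSED(4) puts four zero bytes on the wire, while an older
receiver using VMSTATE_UINT32_EQUAL compares the incoming value with its
own dma_liobn and rejects a mismatch. All helper names below are invented
for the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sender, modelled on VMSTATE_UNUSED(4): emit four zero bytes (assumption). */
void model_put_unused4(uint8_t *wire)
{
    assert(wire != NULL);
    memset(wire, 0, 4);
}

/* Sender, old format: emit the LIOBN as a big-endian 32-bit value. */
void model_put_be32(uint8_t *wire, uint32_t v)
{
    assert(wire != NULL);
    wire[0] = (uint8_t)(v >> 24);
    wire[1] = (uint8_t)(v >> 16);
    wire[2] = (uint8_t)(v >> 8);
    wire[3] = (uint8_t)v;
}

/* Receiver, modelled on VMSTATE_UINT32_EQUAL: 0 on match, -1 on mismatch. */
int model_get_be32_equal(const uint8_t *wire, uint32_t local)
{
    uint32_t v = ((uint32_t)wire[0] << 24) | ((uint32_t)wire[1] << 16) |
                 ((uint32_t)wire[2] << 8) | (uint32_t)wire[3];

    return v == local ? 0 : -1;
}

/* Receiver, modelled on VMSTATE_UNUSED(4): skip the bytes, never fail. */
int model_get_unused4(const uint8_t *wire)
{
    (void)wire;
    return 0;
}
```

In this model a stream from an old sender loads fine on a new receiver,
but a stream from a new sender fails on an old receiver unless its LIOBN
happens to be zero, which is the backward-migration case in question.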

 Thomas




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v18 4/5] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)
  2016-06-21  1:14 ` [Qemu-devel] [PATCH qemu v18 4/5] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2) Alexey Kardashevskiy
  2016-06-22  1:29   ` David Gibson
@ 2016-06-22 14:38   ` Laurent Vivier
  2016-06-23  3:59     ` Alexey Kardashevskiy
  1 sibling, 1 reply; 23+ messages in thread
From: Laurent Vivier @ 2016-06-22 14:38 UTC (permalink / raw)
  To: Alexey Kardashevskiy, qemu-devel; +Cc: Alex Williamson, qemu-ppc, David Gibson



On 21/06/2016 03:14, Alexey Kardashevskiy wrote:
> New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
> This adds ability to VFIO common code to dynamically allocate/remove
> DMA windows in the host kernel when new VFIO container is added/removed.
> 
> This adds a helper to vfio_listener_region_add which makes
> VFIO_IOMMU_SPAPR_TCE_CREATE ioctl and adds just created IOMMU into
> the host IOMMU list; the opposite action is taken in
> vfio_listener_region_del.
> 
> When creating a new window, this uses heuristic to decide on the TCE table
> levels number.
> 
> This should cause no guest visible change in behavior.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v18:
> * moved trace definitions under hw/vfio/spapr.c section
> * moved trace_vfio_spapr_remove_window to vfio_spapr_remove_window()
> * vfio_host_win_del() now checks for exact window size
> * one ctz() less in vfio_spapr_create_window()
> 
> v17:
> * moved spapr window create/remove helpers to separate file
> * added hw_error() if vfio_host_win_del() failed
> 
> v16:
> * used memory_region_iommu_get_page_sizes() in vfio_listener_region_add()
> * enforced no intersections between windows
> 
> v14:
> * new to the series
> ---
>  hw/vfio/common.c              | 79 +++++++++++++++++++++++++++++++++++++------
>  hw/vfio/spapr.c               | 71 ++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |  6 ++++
>  trace-events                  |  2 ++
>  4 files changed, 148 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index b53a1db..8e3466c 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -265,6 +265,21 @@ static void vfio_host_win_add(VFIOContainer *container,
>      QLIST_INSERT_HEAD(&container->hostwin_list, hostwin, hostwin_next);
>  }
>  
> +static int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
> +                             hwaddr max_iova)
> +{
> +    VFIOHostDMAWindow *hostwin;
> +
> +    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
> +        if (hostwin->min_iova == min_iova && hostwin->max_iova == max_iova) {
> +            QLIST_REMOVE(hostwin, hostwin_next);
> +            return 0;
> +        }
> +    }
> +
> +    return -1;
> +}
> +
>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>  {
>      return (!memory_region_is_ram(section->mr) &&
> @@ -380,6 +395,30 @@ static void vfio_listener_region_add(MemoryListener *listener,
>      }
>      end = int128_get64(int128_sub(llend, int128_one()));
>  
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +        VFIOHostDMAWindow *hostwin;
> +        hwaddr pgsize = 0;
> +
> +        /* For now intersections are not allowed, we may relax this later */
> +        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
> +            if (ranges_overlap(hostwin->min_iova,
> +                               hostwin->max_iova - hostwin->min_iova + 1,
> +                               section->offset_within_address_space,
> +                               int128_get64(section->size))) {
> +                goto fail;

ret is not initialized and it is used in "fail:".

hw/vfio/common.c: In function ‘vfio_listener_region_add’:
hw/vfio/common.c:493:30: error: ‘ret’ may be used uninitialized in this
function [-Werror=maybe-uninitialized]
             container->error = ret;
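
A reduced illustration of the warning, as a standalone sketch and not the
actual vfio_listener_region_add() code: the function and parameter names
are hypothetical, and using -1 as the error value for an overlap is an
assumption, since the patch does not say what should be reported there.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical reduction of the control flow: 'overlap' stands for a
 * ranges_overlap() hit, 'later_error' for a failure further down
 * (zero or a negative errno-style value).
 * Returns what would end up in container->error (0 means no error).
 */
int region_add_model(bool overlap, int later_error)
{
    int ret;

    assert(later_error <= 0);

    if (overlap) {
        ret = -1;   /* set before the jump so 'fail' never reads garbage */
        goto fail;
    }

    ret = later_error;
    if (ret) {
        goto fail;
    }
    return 0;

fail:
    return ret;
}
```

Without the assignment in the overlap branch, the 'fail' label would read
'ret' before anything was stored in it, which is what the compiler is
complaining about.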

Laurent

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v18 2/5] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2)
  2016-06-21  1:14 ` [Qemu-devel] [PATCH qemu v18 2/5] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2) Alexey Kardashevskiy
  2016-06-21  6:46   ` David Gibson
@ 2016-06-22 16:49   ` Alex Williamson
  1 sibling, 0 replies; 23+ messages in thread
From: Alex Williamson @ 2016-06-22 16:49 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-devel, qemu-ppc, David Gibson

On Tue, 21 Jun 2016 11:14:02 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> This makes use of the new "memory registering" feature. The idea is
> to provide the userspace ability to notify the host kernel about pages
> which are going to be used for DMA. Having this information, the host
> kernel can pin them all once per user process, do locked pages
> accounting (once) and not spend time on doing that in real time with
> possible failures which cannot be handled nicely in some cases.
> 
> This adds a prereg memory listener which listens on address_space_memory
> and notifies a VFIO container about memory which needs to be
> pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
> 
> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> not call it when v2 is detected and enabled.
> 
> This enforces guest RAM blocks to be host page size aligned; however
> this is not new as KVM already requires memory slots to be host page
> size aligned.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v18:
> * made a copy of listener trace points in spapr.c
> * fixed cleanup in vfio_connect_container
> * removed assert in vfio_prereg_listener_region_add()
> * created "prereg" copy of traces
> 
> v17:
> * s/prereg\.c/spapr.c/
> * s/vfio_prereg_gpa_to_ua/vfio_prereg_gpa_to_vaddr/
> * vfio_prereg_listener_skipped_section does hw_error() on IOMMUs
> 
> v16:
> * switched to 64bit math everywhere as there is no chance to see
> region_add on RAM blocks even remotely close to 1<<64bytes.
> 
> v15:
> * banned unaligned sections
> * added a vfio_prereg_gpa_to_ua() helper
> 
> v14:
> * s/free_container_exit/listener_release_exit/g
> * added "if memory_region_is_iommu()" to vfio_prereg_listener_skipped_section
> ---
>  hw/vfio/Makefile.objs         |   1 +
>  hw/vfio/common.c              |  42 ++++++++++---
>  hw/vfio/spapr.c               | 139 ++++++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |   4 ++
>  trace-events                  |   6 ++
>  5 files changed, 182 insertions(+), 10 deletions(-)
>  create mode 100644 hw/vfio/spapr.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> index ceddbb8..c25e32b 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
>  obj-$(CONFIG_SOFTMMU) += platform.o
>  obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
>  obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
> +obj-$(CONFIG_SOFTMMU) += spapr.o
>  endif
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 27cc159..22be48b 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -502,6 +502,9 @@ static const MemoryListener vfio_memory_listener = {
>  static void vfio_listener_release(VFIOContainer *container)
>  {
>      memory_listener_unregister(&container->listener);
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +        memory_listener_unregister(&container->prereg_listener);
> +    }
>  }
>  
>  static struct vfio_info_cap_header *
> @@ -860,8 +863,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              goto free_container_exit;
>          }
>  
> -        ret = ioctl(fd, VFIO_SET_IOMMU,
> -                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
> +        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>          if (ret) {
>              error_report("vfio: failed to set iommu for container: %m");
>              ret = -errno;
> @@ -886,8 +889,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>          if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
>              container->iova_pgsizes = info.iova_pgsizes;
>          }
> -    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
> +               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
>          struct vfio_iommu_spapr_tce_info info;
> +        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
>  
>          ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>          if (ret) {
> @@ -895,7 +900,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              ret = -errno;
>              goto free_container_exit;
>          }
> -        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
> +        container->iommu_type =
> +            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>          if (ret) {
>              error_report("vfio: failed to set iommu for container: %m");
>              ret = -errno;
> @@ -907,11 +914,23 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>           * when container fd is closed so we do not call it explicitly
>           * in this file.
>           */
> -        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> -        if (ret) {
> -            error_report("vfio: failed to enable container: %m");
> -            ret = -errno;
> -            goto free_container_exit;
> +        if (!v2) {
> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> +            if (ret) {
> +                error_report("vfio: failed to enable container: %m");
> +                ret = -errno;
> +                goto free_container_exit;
> +            }
> +        } else {
> +            container->prereg_listener = vfio_prereg_listener;
> +
> +            memory_listener_register(&container->prereg_listener,
> +                                     &address_space_memory);
> +            if (container->error) {
> +                memory_listener_unregister(&container->prereg_listener);
> +                error_report("vfio: RAM memory listener initialization failed for container");
> +                goto free_container_exit;
> +            }
>          }
>  
>          /*
> @@ -924,7 +943,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>          if (ret) {
>              error_report("vfio: VFIO_IOMMU_SPAPR_TCE_GET_INFO failed: %m");
>              ret = -errno;
> -            goto free_container_exit;
> +            if (v2) {
> +                memory_listener_unregister(&container->prereg_listener);
> +            }
> +            goto listener_release_exit;


So we changed from free_container_exit to listener_release_exit, which
adds a call to vfio_listener_release().  As in the diff above, that
unconditionally calls memory_listener_unregister(&container->listener),
which is not initialized by this point.  nak.
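
The ordering problem can be shown in a standalone sketch. This is not QEMU code: `fake_listener`, `listener_register()`, `listener_unregister()` and `connect_sketch()` are hypothetical stand-ins, and the `prereg_release_exit` label is invented for illustration (only `listener_release_exit` appears in the patch). Each exit label undoes only what was set up before the jump to it, so a failure that happens before the main listener is registered must not pass through the label that unregisters it:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a MemoryListener; not the QEMU type. */
struct fake_listener {
    bool registered;
};

static void listener_register(struct fake_listener *l)
{
    l->registered = true;
}

static void listener_unregister(struct fake_listener *l)
{
    /* Unregistering something that was never registered is the bug. */
    assert(l->registered);
    l->registered = false;
}

/*
 * fail_at selects which step fails:
 *   1 = a step before the main listener is registered,
 *   2 = a step after the main listener is registered,
 *   0 = nothing fails.
 */
static int connect_sketch(int fail_at, struct fake_listener *prereg,
                          struct fake_listener *main_listener)
{
    int ret = 0;

    listener_register(prereg);

    if (fail_at == 1) {
        ret = -1;
        goto prereg_release_exit;   /* main listener not set up yet */
    }

    listener_register(main_listener);

    if (fail_at == 2) {
        ret = -1;
        goto listener_release_exit; /* both listeners need unwinding */
    }

    return 0;

listener_release_exit:
    listener_unregister(main_listener);
prereg_release_exit:
    listener_unregister(prereg);
    return ret;
}
```

With `fail_at == 1` only the prereg listener is unwound. Jumping to `listener_release_exit` from that point instead would trip the assertion in `listener_unregister()`, which is the situation described above.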

>          }
>          container->min_iova = info.dma32_window_start;
>          container->max_iova = container->min_iova + info.dma32_window_size - 1;
> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
> new file mode 100644
> index 0000000..5c29bec
> --- /dev/null
> +++ b/hw/vfio/spapr.c
> @@ -0,0 +1,139 @@
> +/*
> + * DMA memory preregistration
> + *
> + * Authors:
> + *  Alexey Kardashevskiy <aik@ozlabs.ru>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "cpu.h"
> +#include <sys/ioctl.h>
> +#include <linux/vfio.h>
> +
> +#include "hw/vfio/vfio-common.h"
> +#include "hw/hw.h"
> +#include "qemu/error-report.h"
> +#include "trace.h"
> +
> +static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
> +{
> +    if (memory_region_is_iommu(section->mr)) {
> +        hw_error("Cannot possibly preregister IOMMU memory");
> +    }
> +
> +    return !memory_region_is_ram(section->mr) ||
> +            memory_region_is_skip_dump(section->mr);
> +}
> +
> +static void *vfio_prereg_gpa_to_vaddr(MemoryRegionSection *section, hwaddr gpa)
> +{
> +    return memory_region_get_ram_ptr(section->mr) +
> +        section->offset_within_region +
> +        (gpa - section->offset_within_address_space);
> +}
> +
> +static void vfio_prereg_listener_region_add(MemoryListener *listener,
> +                                            MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> +                                            prereg_listener);
> +    const hwaddr gpa = section->offset_within_address_space;
> +    hwaddr end;
> +    int ret;
> +    hwaddr page_mask = qemu_real_host_page_mask;
> +    struct vfio_iommu_spapr_register_memory reg = {
> +        .argsz = sizeof(reg),
> +        .flags = 0,
> +    };
> +
> +    if (vfio_prereg_listener_skipped_section(section)) {
> +        trace_vfio_prereg_listener_region_add_skip(
> +                section->offset_within_address_space,
> +                section->offset_within_address_space +
> +                int128_get64(int128_sub(section->size, int128_one())));
> +        return;
> +    }
> +
> +    if (unlikely((section->offset_within_address_space & ~page_mask) ||
> +                 (section->offset_within_region & ~page_mask) ||
> +                 (int128_get64(section->size) & ~page_mask))) {
> +        error_report("%s received unaligned region", __func__);
> +        return;
> +    }
> +
> +    end = section->offset_within_address_space + int128_get64(section->size);
> +    if (gpa >= end) {
> +        return;
> +    }
> +
> +    memory_region_ref(section->mr);
> +
> +    reg.vaddr = (__u64) vfio_prereg_gpa_to_vaddr(section, gpa);
> +    reg.size = end - gpa;
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
> +    trace_vfio_prereg_register(reg.vaddr, reg.size, ret ? -errno : 0);
> +    if (ret) {
> +        /*
> +         * On the initfn path, store the first error in the container so we
> +         * can gracefully fail.  Runtime, there's not much we can do other
> +         * than throw a hardware error.
> +         */
> +        if (!container->initialized) {
> +            if (!container->error) {
> +                container->error = ret;
> +            }
> +        } else {
> +            hw_error("vfio: Memory registering failed, unable to continue");
> +        }
> +    }
> +}
> +
> +static void vfio_prereg_listener_region_del(MemoryListener *listener,
> +                                            MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> +                                            prereg_listener);
> +    const hwaddr gpa = section->offset_within_address_space;
> +    hwaddr end;
> +    int ret;
> +    hwaddr page_mask = qemu_real_host_page_mask;
> +    struct vfio_iommu_spapr_register_memory reg = {
> +        .argsz = sizeof(reg),
> +        .flags = 0,
> +    };
> +
> +    if (vfio_prereg_listener_skipped_section(section)) {
> +        trace_vfio_prereg_listener_region_del_skip(
> +                section->offset_within_address_space,
> +                section->offset_within_address_space +
> +                int128_get64(int128_sub(section->size, int128_one())));
> +        return;
> +    }
> +
> +    if (unlikely((section->offset_within_address_space & ~page_mask) ||
> +                 (section->offset_within_region & ~page_mask) ||
> +                 (int128_get64(section->size) & ~page_mask))) {
> +        error_report("%s received unaligned region", __func__);
> +        return;
> +    }
> +
> +    end = section->offset_within_address_space + int128_get64(section->size);
> +    if (gpa >= end) {
> +        return;
> +    }
> +
> +    reg.vaddr = (__u64) vfio_prereg_gpa_to_vaddr(section, gpa);
> +    reg.size = end - gpa;
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
> +    trace_vfio_prereg_unregister(reg.vaddr, reg.size, ret ? -errno : 0);
> +}
> +
> +const MemoryListener vfio_prereg_listener = {
> +    .region_add = vfio_prereg_listener_region_add,
> +    .region_del = vfio_prereg_listener_region_del,
> +};
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 0610377..405c3b2 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -73,6 +73,8 @@ typedef struct VFIOContainer {
>      VFIOAddressSpace *space;
>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>      MemoryListener listener;
> +    MemoryListener prereg_listener;
> +    unsigned iommu_type;
>      int error;
>      bool initialized;
>      /*
> @@ -158,4 +160,6 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>                               uint32_t subtype, struct vfio_region_info **info);
>  #endif
> +extern const MemoryListener vfio_prereg_listener;
> +
>  #endif /* !HW_VFIO_VFIO_COMMON_H */
> diff --git a/trace-events b/trace-events
> index da0d060..0b1583f 100644
> --- a/trace-events
> +++ b/trace-events

This needs a respin for the trace files moving anyway.

> @@ -1770,6 +1770,12 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
>  vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
>  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
>  
> +# hw/vfio/spapr.c
> +vfio_prereg_listener_region_add_skip(uint64_t start, uint64_t end) "SKIPPING region_add %"PRIx64" - %"PRIx64
> +vfio_prereg_listener_region_del_skip(uint64_t start, uint64_t end) "SKIPPING region_del %"PRIx64" - %"PRIx64

"SKIPPING region_add/del" is a little redundant since the trace name
gets printed anyway, isn't it?

> +vfio_prereg_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> +vfio_prereg_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> +
>  # hw/vfio/platform.c
>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
>  vfio_platform_realize(char *name, char *compat) "vfio device %s, compat = %s"

* Re: [Qemu-devel] [PATCH qemu v18 3/5] vfio: Add host side DMA window capabilities
  2016-06-21  6:50   ` David Gibson
@ 2016-06-22 17:03     ` Alex Williamson
  0 siblings, 0 replies; 23+ messages in thread
From: Alex Williamson @ 2016-06-22 17:03 UTC (permalink / raw)
  To: David Gibson; +Cc: Alexey Kardashevskiy, qemu-devel, qemu-ppc

On Tue, 21 Jun 2016 16:50:17 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Tue, Jun 21, 2016 at 11:14:03AM +1000, Alexey Kardashevskiy wrote:
> > There are going to be multiple IOMMUs per a container. This moves
> > the single host IOMMU parameter set to a list of VFIOHostDMAWindow.
> > 
> > This should cause no behavioral change and will be used later by
> > the SPAPR TCE IOMMU v2 which will also add a vfio_host_win_del() helper.
> > 
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>  
> 
> Looks ok to me.  Again, Alex, your tree or mine?

I gave the previous patch a nak, it needs a respin, but this one looks
ok.  I don't currently have anything pending that would conflict with
this, afaik, so it's ok with me if you want to pull it through your
tree.  I'll ack the respin.
 
> One minor point..
> [snip]
> > @@ -878,17 +908,14 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >           * existing Type1 IOMMUs generally support any IOVA we're
> >           * going to actually try in practice.
> >           */
> > -        container->min_iova = 0;
> > -        container->max_iova = (hwaddr)-1;
> > -
> > -        /* Assume just 4K IOVA page size */
> > -        container->iova_pgsizes = 0x1000;
> >          info.argsz = sizeof(info);
> >          ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
> >          /* Ignore errors */
> > -        if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
> > -            container->iova_pgsizes = info.iova_pgsizes;
> > +        if (ret || !(info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
> > +            /* Assume 4k IOVA page size */
> > +            info.iova_pgsizes = 4096;
> >          }
> > +        vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes);  
> 
> I don't think it needs to hold this patch up, but at some point we
> should work out the real range covered by the x86 IOMMU tables and put
> that in here.  I'm pretty sure it won't actually be 2^64-1.

Between this patch, some work that Eric is doing that would allow us to
exclude the MSI range, and the capability chains that we can add to the
IOMMU_GET_INFO ioctl to describe both the extent and the reserved MSI
area, I think we're getting close to being able to do that.  On AMD I
think we do have a full 64bit address space, but VT-d definitely does
not.  Thanks,

Alex
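
For illustration, a minimal sketch of what a list of host windows buys us once real ranges are known. It is a simplified model, not the QEMU structures: `struct host_win` and `host_win_lookup()` are hypothetical names, and the windows are kept in a plain array rather than a QLIST. Keeping `max_iova` inclusive lets a window cover the whole 64-bit space without overflow:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model of a host DMA window. */
struct host_win {
    uint64_t min_iova;      /* first usable bus address */
    uint64_t max_iova;      /* last usable bus address (inclusive) */
    uint64_t iova_pgsizes;  /* bitmask of supported IOMMU page sizes */
};

/* Return the window that fully contains [iova, end], or NULL if none does. */
static const struct host_win *host_win_lookup(const struct host_win *wins,
                                              size_t n, uint64_t iova,
                                              uint64_t end)
{
    assert(iova <= end);

    for (size_t i = 0; i < n; i++) {
        if (wins[i].min_iova <= iova && end <= wins[i].max_iova) {
            return &wins[i];
        }
    }
    return NULL;
}
```

A window of `{ 0, UINT64_MAX, 0x1000 }` accepts any section, which matches what this patch registers for Type1. A narrower window would reject sections above its limit instead of failing later at map time.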

* Re: [Qemu-devel] [PATCH qemu v18 5/5] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-06-22  9:44       ` Thomas Huth
@ 2016-06-23  2:00         ` Alexey Kardashevskiy
  0 siblings, 0 replies; 23+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-23  2:00 UTC (permalink / raw)
  To: Thomas Huth, David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On 22/06/16 19:44, Thomas Huth wrote:
> On 22.06.2016 05:23, Alexey Kardashevskiy wrote:
>> On 22/06/16 12:35, David Gibson wrote:
>>> On Tue, Jun 21, 2016 at 11:14:05AM +1000, Alexey Kardashevskiy wrote:
>>>> This adds support for Dynamic DMA Windows (DDW) option defined by
>>>> the SPAPR specification which allows to have additional DMA window(s)
>>>>
>>>> The "ddw" property is enabled by default on a PHB but for compatibility
>>>> the pseries-2.6 machine and older disable it.
>>>> This also creates a single DMA window for the older machines to
>>>> maintain backward migration.
>>>>
>>>> This implements DDW for PHB with emulated and VFIO devices. The host
>>>> kernel support is required. The advertised IOMMU page sizes are 4K and
>>>> 64K; 16M pages are supported but not advertised by default, in order to
>>>> enable them, the user has to specify "pgsz" property for PHB and
>>>> enable huge pages for RAM.
>>>>
>>>> The existing linux guests try creating one additional huge DMA window
>>>> with 64K or 16MB pages and map the entire guest RAM to. If succeeded,
>>>> the guest switches to dma_direct_ops and never calls TCE hypercalls
>>>> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
>>>> and not waste time on map/unmap later. This adds a "dma64_win_addr"
>>>> property which is a bus address for the 64bit window and by default
>>>> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
>>>> uses and this allows having emulated and VFIO devices on the same bus.
>>>>
>>>> This adds 4 RTAS handlers:
>>>> * ibm,query-pe-dma-window
>>>> * ibm,create-pe-dma-window
>>>> * ibm,remove-pe-dma-window
>>>> * ibm,reset-pe-dma-window
>>>> These are registered from type_init() callback.
>>>>
>>>> These RTAS handlers are implemented in a separate file to avoid polluting
>>>> spapr_iommu.c with PCI.
>>>>
>>>> This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> [...]
>>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>>> index 9f28fb3..0cb51dd 100644
>>>> --- a/hw/ppc/spapr_pci.c
>>>> +++ b/hw/ppc/spapr_pci.c
> [...]
>>>> @@ -1515,7 +1531,8 @@ static void spapr_phb_reset(DeviceState *qdev)
>>>>  static Property spapr_phb_properties[] = {
>>>>      DEFINE_PROP_UINT32("index", sPAPRPHBState, index, -1),
>>>>      DEFINE_PROP_UINT64("buid", sPAPRPHBState, buid, -1),
>>>> -    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn, -1),
>>>> +    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn[0], -1),
>>>> +    DEFINE_PROP_UINT32("liobn64", sPAPRPHBState, dma_liobn[1], -1),
>>>>      DEFINE_PROP_UINT64("mem_win_addr", sPAPRPHBState, mem_win_addr, -1),
>>>>      DEFINE_PROP_UINT64("mem_win_size", sPAPRPHBState, mem_win_size,
>>>>                         SPAPR_PCI_MMIO_WIN_SIZE),
>>>> @@ -1527,6 +1544,11 @@ static Property spapr_phb_properties[] = {
>>>>      /* Default DMA window is 0..1GB */
>>>>      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
>>>>      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
>>>> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_win_addr,
>>>> +                       0x800000000000000ULL),
>>>> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
>>>> +    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
>>>> +                       (1ULL << 12) | (1ULL << 16)),
>>>>      DEFINE_PROP_END_OF_LIST(),
>>>>  };
>>>>  
>>>> @@ -1603,7 +1625,7 @@ static const VMStateDescription vmstate_spapr_pci = {
>>>>      .post_load = spapr_pci_post_load,
>>>>      .fields = (VMStateField[]) {
>>>>          VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),
>>>> -        VMSTATE_UINT32_EQUAL(dma_liobn, sPAPRPHBState),
>>>> +        VMSTATE_UNUSED(4), /* dma_liobn */
>>>
>>> It's not obvious to me why this change is necessary.
>>
>> It is not. But I was touching liobn and this is a proper cleanup which
>> needs to be done anyway as _EQUAL() macros are sort of deprecated and
>> rather pointless.
> 
> Not sure, but if you mark this field as unused now, is migration
> backwards to an older version of QEMU still working? If not, you might
> need to bump the version number, too?

Oh. Correct, it will fail. So I still need this field here. OK, will fix
when I resend.
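
To spell out the failure, here is a simplified model in plain C. It is not the QEMU vmstate code and all names are hypothetical. It assumes that a slot marked unused is sent as zero bytes, and that an _EQUAL-style field on the receiving side rejects any value that differs from its own; the LIOBN value used in the usage note is purely illustrative:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sender, new layout: the 4-byte slot is "unused" and sent as zeros. */
static void send_unused(uint8_t stream[4])
{
    memset(stream, 0, 4);
}

/* Sender, old layout: the slot carries the actual LIOBN (big-endian). */
static void send_liobn(uint8_t stream[4], uint32_t liobn)
{
    stream[0] = (uint8_t)(liobn >> 24);
    stream[1] = (uint8_t)(liobn >> 16);
    stream[2] = (uint8_t)(liobn >> 8);
    stream[3] = (uint8_t)liobn;
}

/* Receiver, old layout: the load fails unless the value matches its own. */
static int load_liobn_equal(const uint8_t stream[4], uint32_t expected)
{
    uint32_t v = ((uint32_t)stream[0] << 24) | ((uint32_t)stream[1] << 16) |
                 ((uint32_t)stream[2] << 8) | (uint32_t)stream[3];

    return v == expected ? 0 : -1;
}
```

Under these assumptions, `send_liobn()` followed by `load_liobn_equal()` with the same value (say 0x80000000) succeeds, while `send_unused()` followed by the same load returns -1. That is the new-to-old migration case.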



-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v18 4/5] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)
  2016-06-22 14:38   ` Laurent Vivier
@ 2016-06-23  3:59     ` Alexey Kardashevskiy
  2016-06-23  4:55       ` Alexey Kardashevskiy
  0 siblings, 1 reply; 23+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-23  3:59 UTC (permalink / raw)
  To: Laurent Vivier, qemu-devel; +Cc: Alex Williamson, qemu-ppc, David Gibson

On 23/06/16 00:38, Laurent Vivier wrote:
> 
> 
> On 21/06/2016 03:14, Alexey Kardashevskiy wrote:
>> New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
>> This adds ability to VFIO common code to dynamically allocate/remove
>> DMA windows in the host kernel when new VFIO container is added/removed.
>>
>> This adds a helper to vfio_listener_region_add which makes
>> VFIO_IOMMU_SPAPR_TCE_CREATE ioctl and adds just created IOMMU into
>> the host IOMMU list; the opposite action is taken in
>> vfio_listener_region_del.
>>
>> When creating a new window, this uses heuristic to decide on the TCE table
>> levels number.
>>
>> This should cause no guest visible change in behavior.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v18:
>> * moved trace definitions under hw/vfio/spapr.c section
>> * moved trace_vfio_spapr_remove_window to vfio_spapr_remove_window()
>> * vfio_host_win_del() now checks for exact window size
>> * one ctz() less in vfio_spapr_create_window()
>>
>> v17:
>> * moved spapr window create/remove helpers to separate file
>> * added hw_error() if vfio_host_win_del() failed
>>
>> v16:
>> * used memory_region_iommu_get_page_sizes() in vfio_listener_region_add()
>> * enforced no intersections between windows
>>
>> v14:
>> * new to the series
>> ---
>>  hw/vfio/common.c              | 79 +++++++++++++++++++++++++++++++++++++------
>>  hw/vfio/spapr.c               | 71 ++++++++++++++++++++++++++++++++++++++
>>  include/hw/vfio/vfio-common.h |  6 ++++
>>  trace-events                  |  2 ++
>>  4 files changed, 148 insertions(+), 10 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index b53a1db..8e3466c 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -265,6 +265,21 @@ static void vfio_host_win_add(VFIOContainer *container,
>>      QLIST_INSERT_HEAD(&container->hostwin_list, hostwin, hostwin_next);
>>  }
>>  
>> +static int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
>> +                             hwaddr max_iova)
>> +{
>> +    VFIOHostDMAWindow *hostwin;
>> +
>> +    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
>> +        if (hostwin->min_iova == min_iova && hostwin->max_iova == max_iova) {
>> +            QLIST_REMOVE(hostwin, hostwin_next);
>> +            return 0;
>> +        }
>> +    }
>> +
>> +    return -1;
>> +}
>> +
>>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>>  {
>>      return (!memory_region_is_ram(section->mr) &&
>> @@ -380,6 +395,30 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>      }
>>      end = int128_get64(int128_sub(llend, int128_one()));
>>  
>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>> +        VFIOHostDMAWindow *hostwin;
>> +        hwaddr pgsize = 0;
>> +
>> +        /* For now intersections are not allowed, we may relax this later */
>> +        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
>> +            if (ranges_overlap(hostwin->min_iova,
>> +                               hostwin->max_iova - hostwin->min_iova + 1,
>> +                               section->offset_within_address_space,
>> +                               int128_get64(section->size))) {
>> +                goto fail;
> 
> ret is not initialized and it is used in "fail:".
> 
> hw/vfio/common.c: In function ‘vfio_listener_region_add’:
> hw/vfio/common.c:493:30: error: ‘ret’ may be used uninitialized in this
> function [-Werror=maybe-uninitialized]
>              container->error = ret;

Oh. Thanks for reporting. I use a cross gcc and there must be something I
am doing wrong, as I do not see these warnings there, but I do see them
when compiling with a native compiler...



-- 
Alexey

* Re: [Qemu-devel] [PATCH qemu v18 4/5] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)
  2016-06-23  3:59     ` Alexey Kardashevskiy
@ 2016-06-23  4:55       ` Alexey Kardashevskiy
  0 siblings, 0 replies; 23+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-23  4:55 UTC (permalink / raw)
  To: Laurent Vivier, qemu-devel; +Cc: Alex Williamson, qemu-ppc, David Gibson

On 23/06/16 13:59, Alexey Kardashevskiy wrote:

>> ret is not initialized and it is used in "fail:".
>>
>> hw/vfio/common.c: In function ‘vfio_listener_region_add’:
>> hw/vfio/common.c:493:30: error: ‘ret’ may be used uninitialized in this
>> function [-Werror=maybe-uninitialized]
>>              container->error = ret;
> 
> Oh. Thanks for reporting. I use cross gcc and there must be something I am
> doing wrong as I do not see these warnings but I do see them when compile
> with native compiler...

Ah, figured it out: the warning only shows up with gcc -O2, and I always
configure with --enable-debug, so gcc is not getting -O2. Will pay
attention to this from now on.
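
For reference, the pattern behind the warning, reduced to a standalone sketch. The names `windows_overlap()` and `region_add_sketch()` are hypothetical, this is not the code from the patch, and `-EINVAL` is only an illustrative choice of error code. The point is that every jump to `fail` has to happen after `ret` has been assigned:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Do the inclusive ranges [a_first, a_last] and [b_first, b_last] intersect? */
static bool windows_overlap(uint64_t a_first, uint64_t a_last,
                            uint64_t b_first, uint64_t b_last)
{
    return !(a_last < b_first || b_last < a_first);
}

/* Returns 0 on success; on failure records the error code in *error. */
static int region_add_sketch(uint64_t win_first, uint64_t win_last,
                             uint64_t sec_first, uint64_t sec_last,
                             int *error)
{
    int ret;

    if (windows_overlap(win_first, win_last, sec_first, sec_last)) {
        ret = -EINVAL;  /* without this assignment, "fail" reads garbage */
        goto fail;
    }

    return 0;

fail:
    *error = ret;
    return ret;
}
```

Since the warning did not show up in the --enable-debug build, compiling at least once with -O2 before posting seems the simplest way to catch this class of problem.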


-- 
Alexey

Thread overview: 23+ messages
2016-06-21  1:14 [Qemu-devel] [PATCH qemu v18 0/5] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
2016-06-21  1:14 ` [Qemu-devel] [PATCH qemu v18 1/5] memory: Add reporting of supported page sizes Alexey Kardashevskiy
2016-06-21  6:16   ` David Gibson
2016-06-21 10:23     ` Paolo Bonzini
2016-06-22  1:13       ` David Gibson
2016-06-21  1:14 ` [Qemu-devel] [PATCH qemu v18 2/5] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2) Alexey Kardashevskiy
2016-06-21  6:46   ` David Gibson
2016-06-22 16:49   ` Alex Williamson
2016-06-21  1:14 ` [Qemu-devel] [PATCH qemu v18 3/5] vfio: Add host side DMA window capabilities Alexey Kardashevskiy
2016-06-21  6:50   ` David Gibson
2016-06-22 17:03     ` Alex Williamson
2016-06-21  1:14 ` [Qemu-devel] [PATCH qemu v18 4/5] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2) Alexey Kardashevskiy
2016-06-22  1:29   ` David Gibson
2016-06-22 14:38   ` Laurent Vivier
2016-06-23  3:59     ` Alexey Kardashevskiy
2016-06-23  4:55       ` Alexey Kardashevskiy
2016-06-21  1:14 ` [Qemu-devel] [PATCH qemu v18 5/5] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
2016-06-22  2:35   ` David Gibson
2016-06-22  3:23     ` Alexey Kardashevskiy
2016-06-22  7:01       ` David Gibson
2016-06-22  8:26         ` Alexey Kardashevskiy
2016-06-22  9:44       ` Thomas Huth
2016-06-23  2:00         ` Alexey Kardashevskiy
