* [Qemu-devel] [PATCH qemu v7 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW)
@ 2015-04-25 12:24 Alexey Kardashevskiy
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 01/14] spapr_pci: Finish making find_phb()/find_dev() public Alexey Kardashevskiy
                   ` (14 more replies)
  0 siblings, 15 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-04-25 12:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson


(cut-n-paste from kernel patchset)

Each Partitionable Endpoint (IOMMU group) has an address range on a PCI bus
where devices are allowed to do DMA. These ranges are called DMA windows.
By default, there is a single DMA window, 1 or 2GB big, mapped at zero
on a PCI bus.

PAPR defines a DDW RTAS API which allows pseries guests
to query the hypervisor about DDW support and capabilities (page size mask
for now). A pseries guest may request additional DMA windows (besides
the default one) using this RTAS API.
The existing pseries Linux guests request an additional window as big as
the guest RAM and map the entire guest window, which effectively creates
a direct mapping of the guest memory to a PCI bus.

This patchset reworks PPC64 IOMMU code and adds necessary structures
to support big windows.

Once a Linux guest discovers the presence of DDW, it does:
1. query hypervisor about number of available windows and page size masks;
2. create a window with the biggest possible page size (today 4K/64K/16M);
3. map the entire guest RAM via H_PUT_TCE* hypercalls;
4. switch dma_ops to direct_dma_ops on the selected PE.

Once this is done, H_PUT_TCE is not called anymore for 64bit devices and
the guest does not waste time on DMA map/unmap operations.
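
The sizing arithmetic behind steps 1-3 above can be sketched as below. This is
only an illustration, not code from this patchset: the DDW_PGSIZE_* bit layout
and all function names are hypothetical, and the window is assumed to be
rounded up to a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit layout of the page size mask, for illustration only. */
#define DDW_PGSIZE_4K   (1u << 0)
#define DDW_PGSIZE_64K  (1u << 1)
#define DDW_PGSIZE_16M  (1u << 2)

/* Return the largest page shift advertised in the mask, or 0 if none. */
static unsigned ddw_pick_page_shift(uint32_t mask)
{
    if (mask & DDW_PGSIZE_16M) {
        return 24;
    }
    if (mask & DDW_PGSIZE_64K) {
        return 16;
    }
    if (mask & DDW_PGSIZE_4K) {
        return 12;
    }
    return 0;
}

/* Smallest power-of-two window (as a shift) that covers ram_size bytes. */
static unsigned ddw_window_shift(uint64_t ram_size, unsigned page_shift)
{
    unsigned shift = page_shift;

    while (shift < 64 && (1ULL << shift) < ram_size) {
        shift++;
    }
    return shift;
}

/* Number of TCEs needed to map the whole window with the given page size. */
static uint64_t ddw_tce_count(uint64_t ram_size, unsigned page_shift)
{
    assert(page_shift > 0 && page_shift < 64);
    return 1ULL << (ddw_window_shift(ram_size, page_shift) - page_shift);
}

/* Usage: a guest with 4 GiB of RAM. */
void ddw_sizing_example(void)
{
    uint64_t ram = 4ULL << 30;

    assert(ddw_pick_page_shift(DDW_PGSIZE_4K | DDW_PGSIZE_64K) == 16);
    assert(ddw_tce_count(ram, 12) == 1048576); /* 4K pages */
    assert(ddw_tce_count(ram, 24) == 256);     /* 16M pages */
}
```

Bigger pages mean far fewer TCEs to set up, which is why the guest picks the
biggest page size the hypervisor offers.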

Note that 32bit devices won't use DDW and will keep using the default
DMA window so KVM optimizations will be required (to be posted later).

This patchset adds DDW support for pseries. The host kernel changes are
required, posted as:

[PATCH kernel v9 00/32] powerpc/iommu/vfio: Enable Dynamic DMA windows

This patchset is based on git://github.com/dgibson/qemu.git spapr-next branch.
This is also pushed to git@github.com:aik/qemu.git
 + a64ff6f...64ac9a4 64ac9a4 -> vfio-for-github (forced update)

Please comment. Thanks!

Changes:
v7:
* bunch of cleanups, renames after David+Thomas+Michael review
* patches are reorganized and those which do not need the host kernel headers
update are put first and can be pulled if these are good enough :)

v6:
* spapr-pci-vfio-host-bridge is now a synonym of spapr-pci-host-bridge -
same PHB can host emulated and VFIO devices
* changed patches order
* lot of small changes

v5:
* TCE tables got "enabled" state and are persistent, i.e. not recreated
every reboot
* added v2 of SPAPR_TCE_IOMMU
* fixed migration for emulated PHB with enabled DDW
* huge pile of other changes

v4:
* reimplemented the whole thing
* machine reset and ddw-reset RTAS call both remove all TCE tables and
create the default one
* IOMMU group id is not needed to use VFIO PHB anymore, multiple groups
are supported on the same VFIO container and virtual PHB

v3:
* removed "reset" from API now
* reworked machine versions
* applied multiple comments
* includes David's machine QOM rework as this patchset adds a new machine type

v2:
* tested on emulated PHB
* removed "ddw" machine property, now it is PHB property
* disabled by default
* defined "pseries-2.2" machine which enables DDW by default
* fixed reset() and reference counting




Alexey Kardashevskiy (14):
  spapr_pci: Finish making find_phb()/find_dev() public
  vmstate: Define VARRAY with VMS_ALLOC
  vfio: spapr: Move SPAPR-related code to a separate file
  spapr_pci_vfio: Enable multiple groups per container
  spapr_pci: Convert finish_realize() to
    dma_capabilities_update()+dma_init_window()
  spapr_iommu: Introduce "enabled" state for TCE table
  spapr_iommu: Add root memory region
  spapr_pci: Do complete reset of DMA config when resetting PHB
  spapr_vfio_pci: Remove redundant spapr-pci-vfio-host-bridge
  linux headers update for DDW on SPAPR
  vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  spapr: Add pseries-2.4 machine
  spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  vfio: Enable DDW ioctls to VFIO IOMMU driver

 hw/ppc/Makefile.objs          |   3 +
 hw/ppc/spapr.c                |  32 ++++-
 hw/ppc/spapr_iommu.c          | 144 +++++++++++++------
 hw/ppc/spapr_pci.c            | 208 ++++++++++++++++++----------
 hw/ppc/spapr_pci_vfio.c       | 147 ++++++++++++--------
 hw/ppc/spapr_rtas_ddw.c       | 300 ++++++++++++++++++++++++++++++++++++++++
 hw/ppc/spapr_vio.c            |   9 +-
 hw/vfio/Makefile.objs         |   1 +
 hw/vfio/common.c              | 180 +++++-------------------
 hw/vfio/spapr.c               | 312 ++++++++++++++++++++++++++++++++++++++++++
 include/hw/pci-host/spapr.h   |  49 +++++--
 include/hw/ppc/spapr.h        |  30 +++-
 include/hw/vfio/vfio-common.h |  16 +++
 include/hw/vfio/vfio.h        |   2 +-
 include/migration/vmstate.h   |  10 ++
 linux-headers/linux/vfio.h    |  88 +++++++++++-
 trace-events                  |   5 +
 17 files changed, 1188 insertions(+), 348 deletions(-)
 create mode 100644 hw/ppc/spapr_rtas_ddw.c
 create mode 100644 hw/vfio/spapr.c

-- 
2.0.0


* [Qemu-devel] [PATCH qemu v7 01/14] spapr_pci: Finish making find_phb()/find_dev() public
  2015-04-25 12:24 [Qemu-devel] [PATCH qemu v7 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
@ 2015-04-25 12:24 ` Alexey Kardashevskiy
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 02/14] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-04-25 12:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

add8aa99bfbadabee129a0b0295f7717ba100a37 already converted many of these;
however, EEH was not affected. This may be squashed into that commit.

This makes find_phb()/find_dev() public and changes their names
to spapr_pci_find_phb()/spapr_pci_find_dev() as they are going to
be used from other parts of QEMU such as VFIO DDW (dynamic DMA window)
or VFIO PCI error injection or VFIO EEH handling - in all these
cases there are RTAS calls which are addressed to BUID+config_addr
in IEEE1275 format.
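
For readers unfamiliar with this addressing, here is a minimal sketch of how
such a call's arguments might be decoded. The BUID assembly mirrors the
rtas_ld() pattern visible in the diff below; the config_addr field layout
(bus in bits 23:16, devfn in bits 15:8) is an assumption, and the helper names
are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Combine the two 32-bit RTAS arguments into one 64-bit BUID. */
static uint64_t demo_make_buid(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Assumed layout: bus number in bits 23:16 of config_addr. */
static unsigned demo_config_addr_bus(uint32_t config_addr)
{
    return (config_addr >> 16) & 0xff;
}

/* Assumed layout: device/function number in bits 15:8 of config_addr. */
static unsigned demo_config_addr_devfn(uint32_t config_addr)
{
    return (config_addr >> 8) & 0xff;
}

/* Usage: decode an example address (values chosen for illustration). */
void demo_rtas_addr_example(void)
{
    assert(demo_make_buid(0x08000000, 0x20000000) == 0x0800000020000000ULL);
    assert(demo_config_addr_bus(0x00010800) == 1);
    assert(demo_config_addr_devfn(0x00010800) == 8);
}
```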

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v7:
* updated subject as it is a follow-up for add8aa99bfbadabee129a0b0295f7717ba100a37
---
 hw/ppc/spapr_pci.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 609a8ae..52c5c73 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -426,7 +426,7 @@ static void rtas_ibm_set_eeh_option(PowerPCCPU *cpu,
     addr = rtas_ld(args, 0);
     option = rtas_ld(args, 3);
 
-    sphb = find_phb(spapr, buid);
+    sphb = spapr_pci_find_phb(spapr, buid);
     if (!sphb) {
         goto param_error_exit;
     }
@@ -461,7 +461,7 @@ static void rtas_ibm_get_config_addr_info2(PowerPCCPU *cpu,
     }
 
     buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
-    sphb = find_phb(spapr, buid);
+    sphb = spapr_pci_find_phb(spapr, buid);
     if (!sphb) {
         goto param_error_exit;
     }
@@ -479,7 +479,7 @@ static void rtas_ibm_get_config_addr_info2(PowerPCCPU *cpu,
     switch (option) {
     case RTAS_GET_PE_ADDR:
         addr = rtas_ld(args, 0);
-        pdev = find_dev(spapr, buid, addr);
+        pdev = spapr_pci_find_dev(spapr, buid, addr);
         if (!pdev) {
             goto param_error_exit;
         }
@@ -516,7 +516,7 @@ static void rtas_ibm_read_slot_reset_state2(PowerPCCPU *cpu,
     }
 
     buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
-    sphb = find_phb(spapr, buid);
+    sphb = spapr_pci_find_phb(spapr, buid);
     if (!sphb) {
         goto param_error_exit;
     }
@@ -562,7 +562,7 @@ static void rtas_ibm_set_slot_reset(PowerPCCPU *cpu,
 
     buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
     option = rtas_ld(args, 3);
-    sphb = find_phb(spapr, buid);
+    sphb = spapr_pci_find_phb(spapr, buid);
     if (!sphb) {
         goto param_error_exit;
     }
@@ -596,7 +596,7 @@ static void rtas_ibm_configure_pe(PowerPCCPU *cpu,
     }
 
     buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
-    sphb = find_phb(spapr, buid);
+    sphb = spapr_pci_find_phb(spapr, buid);
     if (!sphb) {
         goto param_error_exit;
     }
@@ -631,7 +631,7 @@ static void rtas_ibm_slot_error_detail(PowerPCCPU *cpu,
     }
 
     buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
-    sphb = find_phb(spapr, buid);
+    sphb = spapr_pci_find_phb(spapr, buid);
     if (!sphb) {
         goto param_error_exit;
     }
-- 
2.0.0


* [Qemu-devel] [PATCH qemu v7 02/14] vmstate: Define VARRAY with VMS_ALLOC
  2015-04-25 12:24 [Qemu-devel] [PATCH qemu v7 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 01/14] spapr_pci: Finish making find_phb()/find_dev() public Alexey Kardashevskiy
@ 2015-04-25 12:24 ` Alexey Kardashevskiy
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 03/14] vfio: spapr: Move SPAPR-related code to a separate file Alexey Kardashevskiy
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-04-25 12:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

This allows dynamic allocation for migrating arrays.

The existing VMSTATE_VARRAY_UINT32 requires an array to be
pre-allocated, however there are cases when the size is not known in
advance and there is no real need to enforce it.

This defines another variant of VMSTATE_VARRAY_UINT32 with the VMS_ALLOC
flag which tells the receiving side to allocate memory for the array
before receiving the data.

The first user of it is a dynamic DMA window whose existence and size
are totally dynamic.
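
The difference between the two variants can be illustrated with a standalone
sketch. It does not use the real vmstate machinery: DemoTable,
demo_table_load() and the flat buffer format are all hypothetical, and only
model the idea of "read the element count, allocate, then read the data".

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical device state: element count plus a dynamically sized array. */
typedef struct DemoTable {
    uint32_t nb;      /* entry count (capacity when pre-allocated) */
    uint64_t *table;  /* NULL until somebody allocates it */
} DemoTable;

/*
 * Load the count and then the entries from a flat buffer (host byte order,
 * for brevity). With alloc != 0 the array is allocated here, mimicking an
 * "allocate on load" flag; otherwise the caller must have pre-allocated
 * at least as many entries as the stream carries.
 * Returns 0 on success, -1 on error.
 */
static int demo_table_load(DemoTable *t, const uint8_t *buf, size_t len,
                           int alloc)
{
    uint32_t nb;

    if (len < sizeof(nb)) {
        return -1;
    }
    memcpy(&nb, buf, sizeof(nb));
    if ((len - sizeof(nb)) / sizeof(uint64_t) < nb) {
        return -1; /* stream is shorter than the advertised count */
    }
    if (alloc) {
        t->table = calloc(nb ? nb : 1, sizeof(uint64_t));
        if (!t->table) {
            return -1;
        }
    } else if (!t->table || t->nb < nb) {
        return -1; /* nothing (or too little) pre-allocated */
    }
    t->nb = nb;
    memcpy(t->table, buf + sizeof(nb), (size_t)nb * sizeof(uint64_t));
    return 0;
}

/* Usage: the destination does not know the size in advance. */
void demo_table_example(void)
{
    uint8_t buf[sizeof(uint32_t) + 2 * sizeof(uint64_t)];
    uint32_t nb = 2;
    uint64_t entries[2] = { 0x1000, 0x2000 };
    DemoTable t = { 0, NULL };

    memcpy(buf, &nb, sizeof(nb));
    memcpy(buf + sizeof(nb), entries, sizeof(entries));

    assert(demo_table_load(&t, buf, sizeof(buf), 0) == -1);
    assert(demo_table_load(&t, buf, sizeof(buf), 1) == 0);
    assert(t.nb == 2 && t.table[1] == 0x2000);
    free(t.table);
}
```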

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 include/migration/vmstate.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index bc7616a..73b9d67 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -299,6 +299,16 @@ extern const VMStateInfo vmstate_info_bitmap;
     .offset     = vmstate_offset_pointer(_state, _field, _type),     \
 }
 
+#define VMSTATE_VARRAY_UINT32_ALLOC(_field, _state, _field_num, _version, _info, _type) {\
+    .name       = (stringify(_field)),                               \
+    .version_id = (_version),                                        \
+    .num_offset = vmstate_offset_value(_state, _field_num, uint32_t),\
+    .info       = &(_info),                                          \
+    .size       = sizeof(_type),                                     \
+    .flags      = VMS_VARRAY_UINT32|VMS_POINTER|VMS_ALLOC,           \
+    .offset     = vmstate_offset_pointer(_state, _field, _type),     \
+}
+
 #define VMSTATE_VARRAY_UINT16_UNSAFE(_field, _state, _field_num, _version, _info, _type) {\
     .name       = (stringify(_field)),                               \
     .version_id = (_version),                                        \
-- 
2.0.0


* [Qemu-devel] [PATCH qemu v7 03/14] vfio: spapr: Move SPAPR-related code to a separate file
  2015-04-25 12:24 [Qemu-devel] [PATCH qemu v7 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 01/14] spapr_pci: Finish making find_phb()/find_dev() public Alexey Kardashevskiy
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 02/14] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
@ 2015-04-25 12:24 ` Alexey Kardashevskiy
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 04/14] spapr_pci_vfio: Enable multiple groups per container Alexey Kardashevskiy
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-04-25 12:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

This moves SPAPR bits to a separate file to avoid pollution of x86 code.

This enables spapr-vfio on CONFIG_SOFTMMU (not CONFIG_PSERIES) as
the config options are only visible in makefiles and not in the source code,
so there is no obvious way of implementing stubs if hw/vfio/spapr.c is
not compiled.

This is a mechanical patch.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/vfio/Makefile.objs         |   1 +
 hw/vfio/common.c              | 134 ++-----------------------
 hw/vfio/spapr.c               | 226 ++++++++++++++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-common.h |  13 +++
 4 files changed, 246 insertions(+), 128 deletions(-)
 create mode 100644 hw/vfio/spapr.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index e31f30e..b987ffb 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,4 +1,5 @@
 ifeq ($(CONFIG_LINUX), y)
 obj-$(CONFIG_SOFTMMU) += common.o
 obj-$(CONFIG_PCI) += pci.o
+obj-$(CONFIG_SOFTMMU) += spapr.o
 endif
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index b012620..3e4c685 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -190,8 +190,8 @@ const MemoryRegionOps vfio_region_ops = {
 /*
  * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
  */
-static int vfio_dma_unmap(VFIOContainer *container,
-                          hwaddr iova, ram_addr_t size)
+int vfio_dma_unmap(VFIOContainer *container,
+                   hwaddr iova, ram_addr_t size)
 {
     struct vfio_iommu_type1_dma_unmap unmap = {
         .argsz = sizeof(unmap),
@@ -208,8 +208,8 @@ static int vfio_dma_unmap(VFIOContainer *container,
     return 0;
 }
 
-static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
-                        ram_addr_t size, void *vaddr, bool readonly)
+int vfio_dma_map(VFIOContainer *container, hwaddr iova,
+                 ram_addr_t size, void *vaddr, bool readonly)
 {
     struct vfio_iommu_type1_dma_map map = {
         .argsz = sizeof(map),
@@ -238,7 +238,7 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
     return -errno;
 }
 
-static bool vfio_listener_skipped_section(MemoryRegionSection *section)
+bool vfio_listener_skipped_section(MemoryRegionSection *section)
 {
     return (!memory_region_is_ram(section->mr) &&
             !memory_region_is_iommu(section->mr)) ||
@@ -251,64 +251,6 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
            section->offset_within_address_space & (1ULL << 63);
 }
 
-static void vfio_iommu_map_notify(Notifier *n, void *data)
-{
-    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
-    VFIOContainer *container = giommu->container;
-    IOMMUTLBEntry *iotlb = data;
-    MemoryRegion *mr;
-    hwaddr xlat;
-    hwaddr len = iotlb->addr_mask + 1;
-    void *vaddr;
-    int ret;
-
-    trace_vfio_iommu_map_notify(iotlb->iova,
-                                iotlb->iova + iotlb->addr_mask);
-
-    /*
-     * The IOMMU TLB entry we have just covers translation through
-     * this IOMMU to its immediate target.  We need to translate
-     * it the rest of the way through to memory.
-     */
-    mr = address_space_translate(&address_space_memory,
-                                 iotlb->translated_addr,
-                                 &xlat, &len, iotlb->perm & IOMMU_WO);
-    if (!memory_region_is_ram(mr)) {
-        error_report("iommu map to non memory area %"HWADDR_PRIx"",
-                     xlat);
-        return;
-    }
-    /*
-     * Translation truncates length to the IOMMU page size,
-     * check that it did not truncate too much.
-     */
-    if (len & iotlb->addr_mask) {
-        error_report("iommu has granularity incompatible with target AS");
-        return;
-    }
-
-    if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
-        vaddr = memory_region_get_ram_ptr(mr) + xlat;
-        ret = vfio_dma_map(container, iotlb->iova,
-                           iotlb->addr_mask + 1, vaddr,
-                           !(iotlb->perm & IOMMU_WO) || mr->readonly);
-        if (ret) {
-            error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
-                         "0x%"HWADDR_PRIx", %p) = %d (%m)",
-                         container, iotlb->iova,
-                         iotlb->addr_mask + 1, vaddr, ret);
-        }
-    } else {
-        ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
-        if (ret) {
-            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
-                         "0x%"HWADDR_PRIx") = %d (%m)",
-                         container, iotlb->iova,
-                         iotlb->addr_mask + 1, ret);
-        }
-    }
-}
-
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
@@ -344,45 +286,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
 
     memory_region_ref(section->mr);
 
-    if (memory_region_is_iommu(section->mr)) {
-        VFIOGuestIOMMU *giommu;
-
-        trace_vfio_listener_region_add_iommu(iova,
-                    int128_get64(int128_sub(llend, int128_one())));
-        /*
-         * FIXME: We should do some checking to see if the
-         * capabilities of the host VFIO IOMMU are adequate to model
-         * the guest IOMMU
-         *
-         * FIXME: For VFIO iommu types which have KVM acceleration to
-         * avoid bouncing all map/unmaps through qemu this way, this
-         * would be the right place to wire that up (tell the KVM
-         * device emulation the VFIO iommu handles to use).
-         */
-        /*
-         * This assumes that the guest IOMMU is empty of
-         * mappings at this point.
-         *
-         * One way of doing this is:
-         * 1. Avoid sharing IOMMUs between emulated devices or different
-         * IOMMU groups.
-         * 2. Implement VFIO_IOMMU_ENABLE in the host kernel to fail if
-         * there are some mappings in IOMMU.
-         *
-         * VFIO on SPAPR does that. Other IOMMU models may do that different,
-         * they must make sure there are no existing mappings or
-         * loop through existing mappings to map them into VFIO.
-         */
-        giommu = g_malloc0(sizeof(*giommu));
-        giommu->iommu = section->mr;
-        giommu->container = container;
-        giommu->n.notify = vfio_iommu_map_notify;
-        QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
-        memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
-
-        return;
-    }
-
     /* Here we assume that memory_region_is_ram(section->mr)==true */
 
     end = int128_get64(llend);
@@ -435,27 +338,6 @@ static void vfio_listener_region_del(MemoryListener *listener,
         return;
     }
 
-    if (memory_region_is_iommu(section->mr)) {
-        VFIOGuestIOMMU *giommu;
-
-        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
-            if (giommu->iommu == section->mr) {
-                memory_region_unregister_iommu_notifier(&giommu->n);
-                QLIST_REMOVE(giommu, giommu_next);
-                g_free(giommu);
-                break;
-            }
-        }
-
-        /*
-         * FIXME: We assume the one big unmap below is adequate to
-         * remove any individual page mappings in the IOMMU which
-         * might have been copied into VFIO. This works for a page table
-         * based IOMMU where a big unmap flattens a large range of IO-PTEs.
-         * That may not be true for all IOMMU types.
-         */
-    }
-
     iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
     end = (section->offset_within_address_space + int128_get64(section->size)) &
           TARGET_PAGE_MASK;
@@ -721,11 +603,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             goto free_container_exit;
         }
 
-        container->iommu_data.type1.listener = vfio_memory_listener;
-        container->iommu_data.release = vfio_listener_release;
-
-        memory_listener_register(&container->iommu_data.type1.listener,
-                                 container->space->as);
+        spapr_memory_listener_register(container);
 
     } else {
         error_report("vfio: No available IOMMU models");
diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
new file mode 100644
index 0000000..5f79194
--- /dev/null
+++ b/hw/vfio/spapr.c
@@ -0,0 +1,226 @@
+/*
+ * QEMU sPAPR VFIO IOMMU
+ *
+ * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License,
+ *  or (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "hw/vfio/vfio-common.h"
+#include "qemu/error-report.h"
+#include "trace.h"
+
+static void vfio_iommu_map_notify(Notifier *n, void *data)
+{
+    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
+    VFIOContainer *container = giommu->container;
+    IOMMUTLBEntry *iotlb = data;
+    MemoryRegion *mr;
+    hwaddr xlat;
+    hwaddr len = iotlb->addr_mask + 1;
+    void *vaddr;
+    int ret;
+
+    trace_vfio_iommu_map_notify(iotlb->iova,
+                                iotlb->iova + iotlb->addr_mask);
+
+    /*
+     * The IOMMU TLB entry we have just covers translation through
+     * this IOMMU to its immediate target.  We need to translate
+     * it the rest of the way through to memory.
+     */
+    mr = address_space_translate(&address_space_memory,
+                                 iotlb->translated_addr,
+                                 &xlat, &len, iotlb->perm & IOMMU_WO);
+    if (!memory_region_is_ram(mr)) {
+        error_report("iommu map to non memory area %"HWADDR_PRIx,
+                     xlat);
+        return;
+    }
+    /*
+     * Translation truncates length to the IOMMU page size,
+     * check that it did not truncate too much.
+     */
+    if (len & iotlb->addr_mask) {
+        error_report("iommu has granularity incompatible with target AS");
+        return;
+    }
+
+    if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
+        vaddr = memory_region_get_ram_ptr(mr) + xlat;
+        ret = vfio_dma_map(container, iotlb->iova,
+                           iotlb->addr_mask + 1, vaddr,
+                           !(iotlb->perm & IOMMU_WO) || mr->readonly);
+        if (ret) {
+            error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
+                         "0x%"HWADDR_PRIx", %p) = %d (%m)",
+                         container, iotlb->iova,
+                         iotlb->addr_mask + 1, vaddr, ret);
+        }
+    } else {
+        ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
+        if (ret) {
+            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
+                         "0x%"HWADDR_PRIx") = %d (%m)",
+                         container, iotlb->iova,
+                         iotlb->addr_mask + 1, ret);
+        }
+    }
+}
+
+static void vfio_spapr_listener_region_add(MemoryListener *listener,
+                                     MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            iommu_data.spapr.listener);
+    hwaddr iova;
+    Int128 llend;
+    VFIOGuestIOMMU *giommu;
+
+    if (vfio_listener_skipped_section(section)) {
+        trace_vfio_listener_region_add_skip(
+            section->offset_within_address_space,
+            section->offset_within_address_space +
+            int128_get64(int128_sub(section->size, int128_one())));
+        return;
+    }
+
+    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
+                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
+        error_report("%s received unaligned region", __func__);
+        return;
+    }
+
+    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+    llend = int128_make64(section->offset_within_address_space);
+    llend = int128_add(llend, section->size);
+    llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
+
+    if (int128_ge(int128_make64(iova), llend)) {
+        return;
+    }
+
+    memory_region_ref(section->mr);
+
+    trace_vfio_listener_region_add_iommu(iova,
+         int128_get64(int128_sub(llend, int128_one())));
+    /*
+     * FIXME: We should do some checking to see if the
+     * capabilities of the host VFIO IOMMU are adequate to model
+     * the guest IOMMU
+     *
+     * FIXME: For VFIO iommu types which have KVM acceleration to
+     * avoid bouncing all map/unmaps through qemu this way, this
+     * would be the right place to wire that up (tell the KVM
+     * device emulation the VFIO iommu handles to use).
+     */
+    /*
+     * This assumes that the guest IOMMU is empty of
+     * mappings at this point.
+     *
+     * One way of doing this is:
+     * 1. Avoid sharing IOMMUs between emulated devices or different
+     * IOMMU groups.
+     * 2. Implement VFIO_IOMMU_ENABLE in the host kernel to fail if
+     * there are some mappings in IOMMU.
+     *
+     * VFIO on SPAPR does that. Other IOMMU models may do that different,
+     * they must make sure there are no existing mappings or
+     * loop through existing mappings to map them into VFIO.
+     */
+    giommu = g_malloc0(sizeof(*giommu));
+    giommu->iommu = section->mr;
+    giommu->container = container;
+    giommu->n.notify = vfio_iommu_map_notify;
+    QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
+    memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
+}
+
+static void vfio_spapr_listener_region_del(MemoryListener *listener,
+                                     MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            iommu_data.spapr.listener);
+    hwaddr iova, end;
+    int ret;
+    VFIOGuestIOMMU *giommu;
+
+    if (vfio_listener_skipped_section(section)) {
+        trace_vfio_listener_region_del_skip(
+            section->offset_within_address_space,
+            section->offset_within_address_space +
+            int128_get64(int128_sub(section->size, int128_one())));
+        return;
+    }
+
+    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
+                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
+        error_report("%s received unaligned region", __func__);
+        return;
+    }
+
+    QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
+        if (giommu->iommu == section->mr) {
+            memory_region_unregister_iommu_notifier(&giommu->n);
+            QLIST_REMOVE(giommu, giommu_next);
+            g_free(giommu);
+            break;
+        }
+    }
+
+    /*
+     * FIXME: We assume the one big unmap below is adequate to
+     * remove any individual page mappings in the IOMMU which
+     * might have been copied into VFIO. This works for a page table
+     * based IOMMU where a big unmap flattens a large range of IO-PTEs.
+     * That may not be true for all IOMMU types.
+     */
+
+    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+    end = (section->offset_within_address_space + int128_get64(section->size)) &
+        TARGET_PAGE_MASK;
+
+    if (iova >= end) {
+        return;
+    }
+
+    trace_vfio_listener_region_del(iova, end - 1);
+
+    ret = vfio_dma_unmap(container, iova, end - iova);
+    memory_region_unref(section->mr);
+    if (ret) {
+        error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
+                     "0x%"HWADDR_PRIx") = %d (%m)",
+                     container, iova, end - iova, ret);
+    }
+}
+
+static const MemoryListener vfio_spapr_memory_listener = {
+    .region_add = vfio_spapr_listener_region_add,
+    .region_del = vfio_spapr_listener_region_del,
+};
+
+static void vfio_spapr_listener_release(VFIOContainer *container)
+{
+    memory_listener_unregister(&container->iommu_data.spapr.listener);
+}
+
+void spapr_memory_listener_register(VFIOContainer *container)
+{
+    container->iommu_data.spapr.listener = vfio_spapr_memory_listener;
+    container->iommu_data.release = vfio_spapr_listener_release;
+
+    memory_listener_register(&container->iommu_data.spapr.listener,
+                             container->space->as);
+}
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 0d1fb80..06b96ad 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -69,6 +69,10 @@ typedef struct VFIOType1 {
     bool initialized;
 } VFIOType1;
 
+typedef struct VFIOSPAPR {
+    MemoryListener listener;
+} VFIOSPAPR;
+
 typedef struct VFIOContainer {
     VFIOAddressSpace *space;
     int fd; /* /dev/vfio/vfio, empowered by the attached groups */
@@ -76,6 +80,7 @@ typedef struct VFIOContainer {
         /* enable abstraction to support various iommu backends */
         union {
             VFIOType1 type1;
+            VFIOSPAPR spapr;
         };
         void (*release)(struct VFIOContainer *);
     } iommu_data;
@@ -145,4 +150,12 @@ extern const MemoryRegionOps vfio_region_ops;
 extern QLIST_HEAD(vfio_group_head, VFIOGroup) vfio_group_list;
 extern QLIST_HEAD(vfio_as_head, VFIOAddressSpace) vfio_address_spaces;
 
+extern int vfio_dma_map(VFIOContainer *container, hwaddr iova,
+                        ram_addr_t size, void *vaddr, bool readonly);
+extern int vfio_dma_unmap(VFIOContainer *container,
+                          hwaddr iova, ram_addr_t size);
+bool vfio_listener_skipped_section(MemoryRegionSection *section);
+
+extern void spapr_memory_listener_register(VFIOContainer *container);
+
 #endif /* !HW_VFIO_VFIO_COMMON_H */
-- 
2.0.0


* [Qemu-devel] [PATCH qemu v7 04/14] spapr_pci_vfio: Enable multiple groups per container
  2015-04-25 12:24 [Qemu-devel] [PATCH qemu v7 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (2 preceding siblings ...)
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 03/14] vfio: spapr: Move SPAPR-related code to a separate file Alexey Kardashevskiy
@ 2015-04-25 12:24 ` Alexey Kardashevskiy
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 05/14] spapr_pci: Convert finish_realize() to dma_capabilities_update()+dma_init_window() Alexey Kardashevskiy
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-04-25 12:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

This enables multiple IOMMU groups in one VFIO container which means
that multiple devices from different groups can share the same IOMMU
table (or tables if DDW).

This removes the group id from vfio_container_ioctl(). Kernel support
is required for this; if the host kernel does not have the support,
it will allow only one group per container. The PHB's "iommuid" property
is ignored. The ioctl is called for every container attached to
the address space. At the moment there is just one container anyway.

If there is no container attached to the address space,
vfio_container_do_ioctl() returns -1.
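
The return convention described above can be sketched outside QEMU as
follows. This is only an illustration under stated assumptions:
MockContainer, mock_ioctl() and container_list_ioctl() are hypothetical
stand-ins for VFIOContainer, ioctl() and vfio_container_do_ioctl(), and
a plain linked list replaces QLIST.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for VFIOContainer; not a QEMU type. */
typedef struct MockContainer {
    int result;                 /* value the fake ioctl returns, or -1 */
    int err;                    /* errno reported when result is -1 */
    int calls;                  /* how many times the fake ioctl ran */
    struct MockContainer *next; /* next container in the address space */
} MockContainer;

/* Fake ioctl: fails with the configured errno or returns the result. */
static int mock_ioctl(MockContainer *c)
{
    assert(c != NULL);
    c->calls++;
    if (c->result < 0) {
        errno = c->err;
        return -1;
    }
    return c->result;
}

/*
 * Issue the request to every container in the list.
 * Returns -1 if the list is empty, -errno on the first failure,
 * otherwise the result of the last successful call.
 */
static int container_list_ioctl(MockContainer *head)
{
    int ret = -1;
    MockContainer *c;

    for (c = head; c; c = c->next) {
        ret = mock_ioctl(c);
        if (ret < 0) {
            return -errno;
        }
    }
    return ret;
}
```

Note that a caller cannot tell "no container attached" (-1) apart from
a failure with errno 1 by the return value alone; the sketch keeps that
property on purpose because it mirrors the description above.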

This removes casts to sPAPRPHBVFIOState as none of the sPAPRPHBVFIOState
members is accessed here.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr_pci_vfio.c | 21 ++++++---------------
 hw/vfio/common.c        | 20 ++++++--------------
 include/hw/vfio/vfio.h  |  2 +-
 3 files changed, 13 insertions(+), 30 deletions(-)

diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
index 99a1be5..e89cbff 100644
--- a/hw/ppc/spapr_pci_vfio.c
+++ b/hw/ppc/spapr_pci_vfio.c
@@ -35,12 +35,7 @@ static void spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
     sPAPRTCETable *tcet;
     uint32_t liobn = svphb->phb.dma_liobn;
 
-    if (svphb->iommugroupid == -1) {
-        error_setg(errp, "Wrong IOMMU group ID %d", svphb->iommugroupid);
-        return;
-    }
-
-    ret = vfio_container_ioctl(&svphb->phb.iommu_as, svphb->iommugroupid,
+    ret = vfio_container_ioctl(&svphb->phb.iommu_as,
                                VFIO_CHECK_EXTENSION,
                                (void *) VFIO_SPAPR_TCE_IOMMU);
     if (ret != 1) {
@@ -49,7 +44,7 @@ static void spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
         return;
     }
 
-    ret = vfio_container_ioctl(&svphb->phb.iommu_as, svphb->iommugroupid,
+    ret = vfio_container_ioctl(&sphb->iommu_as,
                                VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
     if (ret) {
         error_setg_errno(errp, -ret,
@@ -79,7 +74,6 @@ static void spapr_phb_vfio_reset(DeviceState *qdev)
 static int spapr_phb_vfio_eeh_set_option(sPAPRPHBState *sphb,
                                          unsigned int addr, int option)
 {
-    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
     struct vfio_eeh_pe_op op = { .argsz = sizeof(op) };
     int ret;
 
@@ -116,7 +110,7 @@ static int spapr_phb_vfio_eeh_set_option(sPAPRPHBState *sphb,
         return RTAS_OUT_PARAM_ERROR;
     }
 
-    ret = vfio_container_ioctl(&svphb->phb.iommu_as, svphb->iommugroupid,
+    ret = vfio_container_ioctl(&sphb->iommu_as,
                                VFIO_EEH_PE_OP, &op);
     if (ret < 0) {
         return RTAS_OUT_HW_ERROR;
@@ -127,12 +121,11 @@ static int spapr_phb_vfio_eeh_set_option(sPAPRPHBState *sphb,
 
 static int spapr_phb_vfio_eeh_get_state(sPAPRPHBState *sphb, int *state)
 {
-    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
     struct vfio_eeh_pe_op op = { .argsz = sizeof(op) };
     int ret;
 
     op.op = VFIO_EEH_PE_GET_STATE;
-    ret = vfio_container_ioctl(&svphb->phb.iommu_as, svphb->iommugroupid,
+    ret = vfio_container_ioctl(&sphb->iommu_as,
                                VFIO_EEH_PE_OP, &op);
     if (ret < 0) {
         return RTAS_OUT_PARAM_ERROR;
@@ -144,7 +137,6 @@ static int spapr_phb_vfio_eeh_get_state(sPAPRPHBState *sphb, int *state)
 
 static int spapr_phb_vfio_eeh_reset(sPAPRPHBState *sphb, int option)
 {
-    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
     struct vfio_eeh_pe_op op = { .argsz = sizeof(op) };
     int ret;
 
@@ -162,7 +154,7 @@ static int spapr_phb_vfio_eeh_reset(sPAPRPHBState *sphb, int option)
         return RTAS_OUT_PARAM_ERROR;
     }
 
-    ret = vfio_container_ioctl(&svphb->phb.iommu_as, svphb->iommugroupid,
+    ret = vfio_container_ioctl(&sphb->iommu_as,
                                VFIO_EEH_PE_OP, &op);
     if (ret < 0) {
         return RTAS_OUT_HW_ERROR;
@@ -173,12 +165,11 @@ static int spapr_phb_vfio_eeh_reset(sPAPRPHBState *sphb, int option)
 
 static int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb)
 {
-    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
     struct vfio_eeh_pe_op op = { .argsz = sizeof(op) };
     int ret;
 
     op.op = VFIO_EEH_PE_CONFIGURE;
-    ret = vfio_container_ioctl(&svphb->phb.iommu_as, svphb->iommugroupid,
+    ret = vfio_container_ioctl(&sphb->iommu_as,
                                VFIO_EEH_PE_OP, &op);
     if (ret < 0) {
         return RTAS_OUT_PARAM_ERROR;
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 3e4c685..369e564 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -793,34 +793,26 @@ void vfio_put_base_device(VFIODevice *vbasedev)
     close(vbasedev->fd);
 }
 
-static int vfio_container_do_ioctl(AddressSpace *as, int32_t groupid,
+static int vfio_container_do_ioctl(AddressSpace *as,
                                    int req, void *param)
 {
-    VFIOGroup *group;
     VFIOContainer *container;
     int ret = -1;
+    VFIOAddressSpace *space = vfio_get_address_space(as);
 
-    group = vfio_get_group(groupid, as);
-    if (!group) {
-        error_report("vfio: group %d not registered", groupid);
-        return ret;
-    }
-
-    container = group->container;
-    if (group->container) {
+    QLIST_FOREACH(container, &space->containers, next) {
         ret = ioctl(container->fd, req, param);
         if (ret < 0) {
             error_report("vfio: failed to ioctl %d to container: ret=%d, %s",
                          _IOC_NR(req) - VFIO_BASE, ret, strerror(errno));
+            return -errno;
         }
     }
 
-    vfio_put_group(group);
-
     return ret;
 }
 
-int vfio_container_ioctl(AddressSpace *as, int32_t groupid,
+int vfio_container_ioctl(AddressSpace *as,
                          int req, void *param)
 {
     /* We allow only certain ioctls to the container */
@@ -835,5 +827,5 @@ int vfio_container_ioctl(AddressSpace *as, int32_t groupid,
         return -1;
     }
 
-    return vfio_container_do_ioctl(as, groupid, req, param);
+    return vfio_container_do_ioctl(as, req, param);
 }
diff --git a/include/hw/vfio/vfio.h b/include/hw/vfio/vfio.h
index 0b26cd8..76b5744 100644
--- a/include/hw/vfio/vfio.h
+++ b/include/hw/vfio/vfio.h
@@ -3,7 +3,7 @@
 
 #include "qemu/typedefs.h"
 
-extern int vfio_container_ioctl(AddressSpace *as, int32_t groupid,
+extern int vfio_container_ioctl(AddressSpace *as,
                                 int req, void *param);
 
 #endif
-- 
2.0.0


* [Qemu-devel] [PATCH qemu v7 05/14] spapr_pci: Convert finish_realize() to dma_capabilities_update()+dma_init_window()
  2015-04-25 12:24 [Qemu-devel] [PATCH qemu v7 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (3 preceding siblings ...)
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 04/14] spapr_pci_vfio: Enable multiple groups per container Alexey Kardashevskiy
@ 2015-04-25 12:24 ` Alexey Kardashevskiy
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-04-25 12:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

This reworks finish_realize() which used to finalize DMA setup with
an assumption that it will not change later.

The new callbacks support various window parameters such as page and
window sizes. The new callbacks return an error code rather than Error**.

This is a mechanical change so no change in behaviour is expected.
This is a part of getting rid of spapr-pci-vfio-host-bridge type.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr_pci.c          | 59 ++++++++++++++++++++++++++-------------------
 hw/ppc/spapr_pci_vfio.c     | 47 +++++++++++++++---------------------
 include/hw/pci-host/spapr.h |  8 +++++-
 3 files changed, 61 insertions(+), 53 deletions(-)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 52c5c73..8c0d2eb 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -741,6 +741,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     int i;
     PCIBus *bus;
     uint64_t msi_window_size = 4096;
+    sPAPRTCETable *tcet;
 
     if (sphb->index != (uint32_t)-1) {
         hwaddr windows_base;
@@ -880,33 +881,40 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         sphb->lsi_table[i].irq = irq;
     }
 
-    if (!info->finish_realize) {
-        error_setg(errp, "finish_realize not defined");
-        return;
-    }
-
-    info->finish_realize(sphb, errp);
-
-    sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
-}
-
-static void spapr_phb_finish_realize(sPAPRPHBState *sphb, Error **errp)
-{
-    sPAPRTCETable *tcet;
-    uint32_t nb_table;
-
-    nb_table = SPAPR_PCI_DMA32_SIZE >> SPAPR_TCE_PAGE_SHIFT;
-    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn,
-                               0, SPAPR_TCE_PAGE_SHIFT, nb_table, false);
+    info->dma_capabilities_update(sphb);
+    info->dma_init_window(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
+                          sphb->dma32_window_size);
+    tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
     if (!tcet) {
-        error_setg(errp, "Unable to create TCE table for %s",
-                   sphb->dtbusname);
-        return ;
+        error_setg(errp, "failed to create TCE table");
+        return;
     }
-
-    /* Register default 32bit DMA window */
-    memory_region_add_subregion(&sphb->iommu_root, 0,
+    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
                                 spapr_tce_get_iommu(tcet));
+
+    sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
+}
+
+static int spapr_phb_dma_capabilities_update(sPAPRPHBState *sphb)
+{
+    sphb->dma32_window_start = 0;
+    sphb->dma32_window_size = SPAPR_PCI_DMA32_SIZE;
+
+    return 0;
+}
+
+static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
+                                     uint32_t liobn, uint32_t page_shift,
+                                     uint64_t window_size)
+{
+    uint64_t bus_offset = sphb->dma32_window_start;
+    sPAPRTCETable *tcet;
+
+    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, bus_offset, page_shift,
+                               window_size >> page_shift,
+                               false);
+
+    return tcet ? 0 : -1;
 }
 
 static int spapr_phb_children_reset(Object *child, void *opaque)
@@ -1057,7 +1065,8 @@ static void spapr_phb_class_init(ObjectClass *klass, void *data)
     dc->vmsd = &vmstate_spapr_pci;
     set_bit(DEVICE_CATEGORY_BRIDGE, dc->categories);
     dc->cannot_instantiate_with_device_add_yet = false;
-    spc->finish_realize = spapr_phb_finish_realize;
+    spc->dma_capabilities_update = spapr_phb_dma_capabilities_update;
+    spc->dma_init_window = spapr_phb_dma_init_window;
 }
 
 static const TypeInfo spapr_phb_info = {
diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
index e89cbff..f1dd28c 100644
--- a/hw/ppc/spapr_pci_vfio.c
+++ b/hw/ppc/spapr_pci_vfio.c
@@ -27,43 +27,35 @@ static Property spapr_phb_vfio_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
-static void spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
+static int spapr_phb_vfio_dma_capabilities_update(sPAPRPHBState *sphb)
 {
-    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
     struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) };
     int ret;
-    sPAPRTCETable *tcet;
-    uint32_t liobn = svphb->phb.dma_liobn;
-
-    ret = vfio_container_ioctl(&svphb->phb.iommu_as,
-                               VFIO_CHECK_EXTENSION,
-                               (void *) VFIO_SPAPR_TCE_IOMMU);
-    if (ret != 1) {
-        error_setg_errno(errp, -ret,
-                         "spapr-vfio: SPAPR extension is not supported");
-        return;
-    }
 
     ret = vfio_container_ioctl(&sphb->iommu_as,
                                VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
     if (ret) {
-        error_setg_errno(errp, -ret,
-                         "spapr-vfio: get info from container failed");
-        return;
+        return ret;
     }
 
-    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, info.dma32_window_start,
-                               SPAPR_TCE_PAGE_SHIFT,
-                               info.dma32_window_size >> SPAPR_TCE_PAGE_SHIFT,
+    sphb->dma32_window_start = info.dma32_window_start;
+    sphb->dma32_window_size = info.dma32_window_size;
+
+    return ret;
+}
+
+static int spapr_phb_vfio_dma_init_window(sPAPRPHBState *sphb,
+                                          uint32_t liobn, uint32_t page_shift,
+                                          uint64_t window_size)
+{
+    uint64_t bus_offset = sphb->dma32_window_start;
+    sPAPRTCETable *tcet;
+
+    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, bus_offset, page_shift,
+                               window_size >> page_shift,
                                true);
-    if (!tcet) {
-        error_setg(errp, "spapr-vfio: failed to create VFIO TCE table");
-        return;
-    }
 
-    /* Register default 32bit DMA window */
-    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
-                                spapr_tce_get_iommu(tcet));
+    return tcet ? 0 : -1;
 }
 
 static void spapr_phb_vfio_reset(DeviceState *qdev)
@@ -185,7 +177,8 @@ static void spapr_phb_vfio_class_init(ObjectClass *klass, void *data)
 
     dc->props = spapr_phb_vfio_properties;
     dc->reset = spapr_phb_vfio_reset;
-    spc->finish_realize = spapr_phb_vfio_finish_realize;
+    spc->dma_capabilities_update = spapr_phb_vfio_dma_capabilities_update;
+    spc->dma_init_window = spapr_phb_vfio_dma_init_window;
     spc->eeh_set_option = spapr_phb_vfio_eeh_set_option;
     spc->eeh_get_state = spapr_phb_vfio_eeh_get_state;
     spc->eeh_reset = spapr_phb_vfio_eeh_reset;
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 5b497ce..3074145 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -48,7 +48,10 @@ typedef struct sPAPRPHBVFIOState sPAPRPHBVFIOState;
 struct sPAPRPHBClass {
     PCIHostBridgeClass parent_class;
 
-    void (*finish_realize)(sPAPRPHBState *sphb, Error **errp);
+    int (*dma_capabilities_update)(sPAPRPHBState *sphb);
+    int (*dma_init_window)(sPAPRPHBState *sphb,
+                           uint32_t liobn, uint32_t page_shift,
+                           uint64_t window_size);
     int (*eeh_set_option)(sPAPRPHBState *sphb, unsigned int addr, int option);
     int (*eeh_get_state)(sPAPRPHBState *sphb, int *state);
     int (*eeh_reset)(sPAPRPHBState *sphb, int option);
@@ -89,6 +92,9 @@ struct sPAPRPHBState {
     int32_t msi_devs_num;
     spapr_pci_msi_mig *msi_devs;
 
+    uint32_t dma32_window_start;
+    uint32_t dma32_window_size;
+
     QLIST_ENTRY(sPAPRPHBState) list;
 };
 
-- 
2.0.0


* [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-04-25 12:24 [Qemu-devel] [PATCH qemu v7 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (4 preceding siblings ...)
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 05/14] spapr_pci: Convert finish_realize() to dma_capabilities_update()+dma_init_window() Alexey Kardashevskiy
@ 2015-04-25 12:24 ` Alexey Kardashevskiy
  2015-05-05 12:28   ` David Gibson
  2015-05-25 15:05   ` Alexey Kardashevskiy
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 07/14] spapr_iommu: Add root memory region Alexey Kardashevskiy
                   ` (8 subsequent siblings)
  14 siblings, 2 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-04-25 12:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

Currently TCE tables are created once at start and their size never
changes. We are going to change that by introducing Dynamic DMA windows
support where the DMA configuration may change during guest execution.

This changes spapr_tce_new_table() to create an empty stub object. Only
the LIOBN is assigned at creation time. It will still be called once,
when the owner object (VIO or PHB) is created.

This introduces an "enabled" state for TCE table objects with two
helper functions - spapr_tce_table_enable()/spapr_tce_table_disable().
spapr_tce_table_enable() receives TCE table parameters and allocates
a guest view of the TCE table (in the user space or KVM).
spapr_tce_table_disable() disposes the table.
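
A minimal standalone sketch of the enable/disable life cycle described
above, assuming a userspace-only table. MockTceTable and the mock_*
helpers are hypothetical names, calloc() stands in for the real
allocation (g_malloc0() or KVM), and the zero-size check is simplified
compared to the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for sPAPRTCETable; only the LIOBN is permanent. */
typedef struct MockTceTable {
    bool enabled;
    uint32_t liobn;
    uint32_t nb_table;
    uint32_t page_shift;
    uint64_t bus_offset;
    uint64_t *table;
} MockTceTable;

/* Allocate the table and record its parameters; no-op if already enabled. */
static void mock_table_enable(MockTceTable *t, uint64_t bus_offset,
                              uint32_t page_shift, uint32_t nb_table)
{
    assert(t != NULL);
    if (t->enabled || nb_table == 0) {
        return;
    }
    t->table = calloc(nb_table, sizeof(*t->table));
    if (!t->table) {
        return;
    }
    t->bus_offset = bus_offset;
    t->page_shift = page_shift;
    t->nb_table = nb_table;
    t->enabled = true;
}

/* Free the table and clear every parameter except the LIOBN. */
static void mock_table_disable(MockTceTable *t)
{
    assert(t != NULL);
    if (!t->enabled) {
        return;
    }
    free(t->table);
    t->table = NULL;
    t->enabled = false;
    t->bus_offset = 0;
    t->page_shift = 0;
    t->nb_table = 0;
}
```

The object itself outlives both calls, which is the property migration
relies on: only the table contents and window parameters come and go.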

Follow-up patches will disable+enable tables on reset (system reset
or DDW reset).

No visible change in behaviour is expected except that the actual table
will be reallocated on every reset. We might optimize this later.

The other way to implement this would be to dynamically create/remove
the TCE table QOM objects, but this would make migration impossible
as migration expects all QOM objects to exist at the receiver,
so we have to have TCE table objects created when migration begins.

spapr_tce_table_do_enable() is separated from spapr_tce_table_enable()
as later it will be called at the sPAPRTCETable post-migration stage when
it has all the properties set after the migration.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v7:
* s'tmp[64]'tmp[32]' as we need less than 64 bytes and more than 16 bytes
and 32 is the closest power-of-two (it just looks nicer to have
power-of-two values)
* updated commit log about having spapr_tce_table_do_enable() split
from spapr_tce_table_enable()

v6:
* got rid of set_props()
---
 hw/ppc/spapr_iommu.c    | 104 +++++++++++++++++++++++++++++++-----------------
 hw/ppc/spapr_pci.c      |  16 +++++---
 hw/ppc/spapr_pci_vfio.c |  10 ++---
 hw/ppc/spapr_vio.c      |   9 ++---
 include/hw/ppc/spapr.h  |  11 ++---
 5 files changed, 93 insertions(+), 57 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index a14cdc4..a3f2b83 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -126,8 +126,47 @@ static MemoryRegionIOMMUOps spapr_iommu_ops = {
 static int spapr_tce_table_realize(DeviceState *dev)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
+
+    QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
+
+    vmstate_register(DEVICE(tcet), tcet->liobn, &vmstate_spapr_tce_table,
+                     tcet);
+
+    return 0;
+}
+
+sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn)
+{
+    sPAPRTCETable *tcet;
+    char tmp[32];
+
+    if (spapr_tce_find_by_liobn(liobn)) {
+        fprintf(stderr, "Attempted to create TCE table with duplicate"
+                " LIOBN 0x%x\n", liobn);
+        return NULL;
+    }
+
+    tcet = SPAPR_TCE_TABLE(object_new(TYPE_SPAPR_TCE_TABLE));
+    tcet->liobn = liobn;
+
+    snprintf(tmp, sizeof(tmp), "tce-table-%x", liobn);
+    object_property_add_child(OBJECT(owner), tmp, OBJECT(tcet), NULL);
+
+    object_property_set_bool(OBJECT(tcet), true, "realized", NULL);
+
+    trace_spapr_iommu_new_table(tcet->liobn, tcet, tcet->table, tcet->fd);
+
+    return tcet;
+}
+
+static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
+{
     uint64_t window_size = (uint64_t)tcet->nb_table << tcet->page_shift;
 
+    if (!tcet->nb_table) {
+        return;
+    }
+
     if (kvm_enabled() && !(window_size >> 32)) {
         tcet->table = kvmppc_create_spapr_tce(tcet->liobn,
                                               window_size,
@@ -140,65 +179,56 @@ static int spapr_tce_table_realize(DeviceState *dev)
         tcet->table = g_malloc0(table_size);
     }
 
-    trace_spapr_iommu_new_table(tcet->liobn, tcet, tcet->table, tcet->fd);
-
-    memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
+    memory_region_init_iommu(&tcet->iommu, OBJECT(tcet), &spapr_iommu_ops,
                              "iommu-spapr",
                              (uint64_t)tcet->nb_table << tcet->page_shift);
 
-    QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
-
-    vmstate_register(DEVICE(tcet), tcet->liobn, &vmstate_spapr_tce_table,
-                     tcet);
-
-    return 0;
+    tcet->enabled = true;
 }
 
-sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
-                                   uint64_t bus_offset,
-                                   uint32_t page_shift,
-                                   uint32_t nb_table,
-                                   bool vfio_accel)
+void spapr_tce_table_enable(sPAPRTCETable *tcet,
+                            uint64_t bus_offset, uint32_t page_shift,
+                            uint32_t nb_table, bool vfio_accel)
 {
-    sPAPRTCETable *tcet;
-    char tmp[64];
-
-    if (spapr_tce_find_by_liobn(liobn)) {
-        fprintf(stderr, "Attempted to create TCE table with duplicate"
-                " LIOBN 0x%x\n", liobn);
-        return NULL;
-    }
-
-    if (!nb_table) {
-        return NULL;
+    if (tcet->enabled) {
+        return;
     }
 
-    tcet = SPAPR_TCE_TABLE(object_new(TYPE_SPAPR_TCE_TABLE));
-    tcet->liobn = liobn;
     tcet->bus_offset = bus_offset;
     tcet->page_shift = page_shift;
     tcet->nb_table = nb_table;
     tcet->vfio_accel = vfio_accel;
 
-    snprintf(tmp, sizeof(tmp), "tce-table-%x", liobn);
-    object_property_add_child(OBJECT(owner), tmp, OBJECT(tcet), NULL);
-
-    object_property_set_bool(OBJECT(tcet), true, "realized", NULL);
-
-    return tcet;
+    spapr_tce_table_do_enable(tcet);
 }
 
-static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
+void spapr_tce_table_disable(sPAPRTCETable *tcet)
 {
-    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
-
-    QLIST_REMOVE(tcet, list);
+    if (!tcet->enabled) {
+        return;
+    }
 
     if (!kvm_enabled() ||
         (kvmppc_remove_spapr_tce(tcet->table, tcet->fd,
                                  tcet->nb_table) != 0)) {
+        tcet->fd = -1;
         g_free(tcet->table);
     }
+    tcet->table = NULL;
+    tcet->enabled = false;
+    tcet->bus_offset = 0;
+    tcet->page_shift = 0;
+    tcet->nb_table = 0;
+    tcet->vfio_accel = false;
+}
+
+static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
+{
+    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
+
+    QLIST_REMOVE(tcet, list);
+
+    spapr_tce_table_disable(tcet);
 }
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 8c0d2eb..c3410b8 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -881,6 +881,12 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         sphb->lsi_table[i].irq = irq;
     }
 
+    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
+    if (!tcet) {
+            error_setg(errp, "failed to create TCE table");
+            return;
+    }
+
     info->dma_capabilities_update(sphb);
     info->dma_init_window(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
                           sphb->dma32_window_size);
@@ -908,13 +914,13 @@ static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
                                      uint64_t window_size)
 {
     uint64_t bus_offset = sphb->dma32_window_start;
-    sPAPRTCETable *tcet;
+    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
 
-    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, bus_offset, page_shift,
-                               window_size >> page_shift,
-                               false);
+    spapr_tce_table_enable(tcet, bus_offset, page_shift,
+                           window_size >> page_shift,
+                           false);
 
-    return tcet ? 0 : -1;
+    return 0;
 }
 
 static int spapr_phb_children_reset(Object *child, void *opaque)
diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
index f1dd28c..a5b97d0 100644
--- a/hw/ppc/spapr_pci_vfio.c
+++ b/hw/ppc/spapr_pci_vfio.c
@@ -49,13 +49,13 @@ static int spapr_phb_vfio_dma_init_window(sPAPRPHBState *sphb,
                                           uint64_t window_size)
 {
     uint64_t bus_offset = sphb->dma32_window_start;
-    sPAPRTCETable *tcet;
+    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
 
-    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, bus_offset, page_shift,
-                               window_size >> page_shift,
-                               true);
+    spapr_tce_table_enable(tcet, bus_offset, page_shift,
+                           window_size >> page_shift,
+                           true);
 
-    return tcet ? 0 : -1;
+    return 0;
 }
 
 static void spapr_phb_vfio_reset(DeviceState *qdev)
diff --git a/hw/ppc/spapr_vio.c b/hw/ppc/spapr_vio.c
index 174033d..3e28835 100644
--- a/hw/ppc/spapr_vio.c
+++ b/hw/ppc/spapr_vio.c
@@ -479,11 +479,10 @@ static void spapr_vio_busdev_realize(DeviceState *qdev, Error **errp)
         memory_region_add_subregion_overlap(&dev->mrroot, 0, &dev->mrbypass, 1);
         address_space_init(&dev->as, &dev->mrroot, qdev->id);
 
-        dev->tcet = spapr_tce_new_table(qdev, liobn,
-                                        0,
-                                        SPAPR_TCE_PAGE_SHIFT,
-                                        pc->rtce_window_size >>
-                                        SPAPR_TCE_PAGE_SHIFT, false);
+        dev->tcet = spapr_tce_new_table(qdev, liobn);
+        spapr_tce_table_enable(dev->tcet, 0, SPAPR_TCE_PAGE_SHIFT,
+                               pc->rtce_window_size >> SPAPR_TCE_PAGE_SHIFT,
+                               false);
         dev->tcet->vdev = dev;
         memory_region_add_subregion_overlap(&dev->mrroot, 0,
                                             spapr_tce_get_iommu(dev->tcet), 2);
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 7d9ab9d..074d837 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -498,6 +498,7 @@ typedef struct sPAPRTCETable sPAPRTCETable;
 
 struct sPAPRTCETable {
     DeviceState parent;
+    bool enabled;
     uint32_t liobn;
     uint32_t nb_table;
     uint64_t bus_offset;
@@ -515,11 +516,11 @@ sPAPRTCETable *spapr_tce_find_by_liobn(uint32_t liobn);
 void spapr_events_init(sPAPREnvironment *spapr);
 void spapr_events_fdt_skel(void *fdt, uint32_t epow_irq);
 int spapr_h_cas_compose_response(target_ulong addr, target_ulong size);
-sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
-                                   uint64_t bus_offset,
-                                   uint32_t page_shift,
-                                   uint32_t nb_table,
-                                   bool vfio_accel);
+sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn);
+void spapr_tce_table_enable(sPAPRTCETable *tcet,
+                            uint64_t bus_offset, uint32_t page_shift,
+                            uint32_t nb_table, bool vfio_accel);
+void spapr_tce_table_disable(sPAPRTCETable *tcet);
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet);
 int spapr_dma_dt(void *fdt, int node_off, const char *propname,
                  uint32_t liobn, uint64_t window, uint32_t size);
-- 
2.0.0


* [Qemu-devel] [PATCH qemu v7 07/14] spapr_iommu: Add root memory region
  2015-04-25 12:24 [Qemu-devel] [PATCH qemu v7 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (5 preceding siblings ...)
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
@ 2015-04-25 12:24 ` Alexey Kardashevskiy
  2015-05-05 12:31   ` David Gibson
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 08/14] spapr_pci: Do complete reset of DMA config when resetting PHB Alexey Kardashevskiy
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-04-25 12:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

We are going to have multiple DMA windows at different offsets on
a PCI bus. For the sake of migration, we will pre-create as many TCE
table objects as there are supported windows.
So we need a way to map windows dynamically onto a PCI bus
when migration of a table is completed, but at this stage a TCE table
object does not have access to a PHB to ask it to map a DMA window
backed by the just-migrated TCE table.

This adds a "root" memory region (UINT64_MAX long) to the TCE object.
This new region is mapped on a PCI bus with overlapping enabled as
there will be one root MR per TCE table, each of them mapped at 0.
The actual IOMMU memory region is a subregion of the root region;
a TCE table enables/disables this subregion and maps it at
the specific offset inside the root MR, which is a 1:1 mapping of
a PCI address space.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/spapr_iommu.c   | 9 ++++++++-
 hw/ppc/spapr_pci.c     | 2 +-
 include/hw/ppc/spapr.h | 2 +-
 3 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index a3f2b83..245534f 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -150,6 +150,8 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn)
     tcet->liobn = liobn;
 
     snprintf(tmp, sizeof(tmp), "tce-table-%x", liobn);
+    memory_region_init(&tcet->root, OBJECT(tcet), tmp, UINT64_MAX);
+
     object_property_add_child(OBJECT(owner), tmp, OBJECT(tcet), NULL);
 
     object_property_set_bool(OBJECT(tcet), true, "realized", NULL);
@@ -183,6 +185,8 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
                              "iommu-spapr",
                              (uint64_t)tcet->nb_table << tcet->page_shift);
 
+    memory_region_add_subregion(&tcet->root, tcet->bus_offset, &tcet->iommu);
+
     tcet->enabled = true;
 }
 
@@ -208,6 +212,8 @@ void spapr_tce_table_disable(sPAPRTCETable *tcet)
         return;
     }
 
+    memory_region_del_subregion(&tcet->root, &tcet->iommu);
+
     if (!kvm_enabled() ||
         (kvmppc_remove_spapr_tce(tcet->table, tcet->fd,
                                  tcet->nb_table) != 0)) {
@@ -215,6 +221,7 @@ void spapr_tce_table_disable(sPAPRTCETable *tcet)
         g_free(tcet->table);
     }
     tcet->table = NULL;
+    object_unref(OBJECT(&tcet->iommu));
     tcet->enabled = false;
     tcet->bus_offset = 0;
     tcet->page_shift = 0;
@@ -233,7 +240,7 @@ static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
 {
-    return &tcet->iommu;
+    return &tcet->root;
 }
 
 static void spapr_tce_reset(DeviceState *dev)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index c3410b8..664687c 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -895,7 +895,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         error_setg(errp, "failed to create TCE table");
         return;
     }
-    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
+    memory_region_add_subregion(&sphb->iommu_root, 0,
                                 spapr_tce_get_iommu(tcet));
 
     sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 074d837..c8ac03f 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -507,7 +507,7 @@ struct sPAPRTCETable {
     bool bypass;
     bool vfio_accel;
     int fd;
-    MemoryRegion iommu;
+    MemoryRegion root, iommu;
     struct VIOsPAPRDevice *vdev; /* for @bypass migration compatibility only */
     QLIST_ENTRY(sPAPRTCETable) list;
 };
-- 
2.0.0


* [Qemu-devel] [PATCH qemu v7 08/14] spapr_pci: Do complete reset of DMA config when resetting PHB
  2015-04-25 12:24 [Qemu-devel] [PATCH qemu v7 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (6 preceding siblings ...)
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 07/14] spapr_iommu: Add root memory region Alexey Kardashevskiy
@ 2015-04-25 12:24 ` Alexey Kardashevskiy
  2015-05-05 12:34   ` David Gibson
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 09/14] spapr_vfio_pci: Remove redundant spapr-pci-vfio-host-bridge Alexey Kardashevskiy
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-04-25 12:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

On a system reset, the DMA configuration has to be reset too. At the moment
the reset only clears the table content. This is enough for the single table
case but with DDW, we will also have to disable all DMA windows except
the default one. Furthermore, according to sPAPR, if the guest removed
the default window and created a huge one at the same zero offset on
a PCI bus, the reset handler has to recreate the default window with
the default properties (2GB big, 4K pages).

This reworks SPAPR PHB code to disable the existing DMA window on reset
and then configure and enable the default window.
Without DDW that means that the same window will be disabled and then
enabled with no other change in behaviour.

This changes the table creation to do it in one place in PHB (VFIO PHB
just inherits the behaviour from PHB). The actual table allocation is
done from the reset handler and this is where dma_init_window() is called.

This disables all DMA windows on a PHB reset. It does not make any
difference now as there is just one DMA window but it will later with DDW
patches.
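
The reset sequence described above can be illustrated with a small
standalone model. This is only a sketch, not the QEMU code: the toy_*
names, the fixed two-window array and the constants are assumptions made
for illustration (the 2GB/4K defaults are taken from the text above).
It shows the order of operations: disable every window first, then
re-enable window 0 with the default properties.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constants for the sketch; not taken from QEMU headers. */
#define TOY_MAX_WINDOWS   2
#define TOY_DEFAULT_SHIFT 12             /* 4K pages */
#define TOY_DEFAULT_SIZE  0x80000000ULL  /* 2GB default window */

struct toy_window {
    bool enabled;
    uint64_t bus_offset;
    uint32_t page_shift;
    uint64_t nb_table;   /* number of TCE entries in the window */
};

struct toy_phb {
    struct toy_window win[TOY_MAX_WINDOWS];
};

/* Drop whatever the guest configured for this window. */
static void toy_window_disable(struct toy_window *w)
{
    w->enabled = false;
    w->bus_offset = 0;
    w->page_shift = 0;
    w->nb_table = 0;
}

/* Configure and enable a window that is currently disabled. */
static void toy_window_enable(struct toy_window *w, uint64_t bus_offset,
                              uint32_t page_shift, uint64_t window_size)
{
    assert(!w->enabled);
    w->bus_offset = bus_offset;
    w->page_shift = page_shift;
    w->nb_table = window_size >> page_shift;
    w->enabled = true;
}

/* Reset: disable all windows, then recreate the default one at offset 0. */
static void toy_phb_dma_reset(struct toy_phb *phb)
{
    size_t i;

    for (i = 0; i < TOY_MAX_WINDOWS; i++) {
        toy_window_disable(&phb->win[i]);
    }
    toy_window_enable(&phb->win[0], 0, TOY_DEFAULT_SHIFT, TOY_DEFAULT_SIZE);
}
```

With a single window the model degenerates to "disable and re-enable the
same window", which matches the no-behaviour-change claim for the non-DDW
case.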

This makes spapr_phb_dma_reset() and spapr_phb_dma_remove_window() public
as these will be used in DDW RTAS "ibm,reset-pe-dma-window" and
"ibm,remove-pe-dma-window" handlers later; the handlers will reside in
hw/ppc/spapr_rtas_ddw.c.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v7:
* s'finish_realize'dma_init_window' in the commit log
* added details (initial clause about reuse was there :) )
why exactly spapr_phb_dma_remove_window is public
---
 hw/ppc/spapr_pci.c          | 45 ++++++++++++++++++++++++++++++++++++---------
 hw/ppc/spapr_pci_vfio.c     |  6 ------
 include/hw/pci-host/spapr.h |  3 +++
 3 files changed, 39 insertions(+), 15 deletions(-)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 664687c..3d40f5b 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -736,7 +736,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     SysBusDevice *s = SYS_BUS_DEVICE(dev);
     sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(s);
     PCIHostState *phb = PCI_HOST_BRIDGE(s);
-    sPAPRPHBClass *info = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(s);
     char *namebuf;
     int i;
     PCIBus *bus;
@@ -887,14 +886,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
             return;
     }
 
-    info->dma_capabilities_update(sphb);
-    info->dma_init_window(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
-                          sphb->dma32_window_size);
-    tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
-    if (!tcet) {
-        error_setg(errp, "failed to create TCE table");
-        return;
-    }
     memory_region_add_subregion(&sphb->iommu_root, 0,
                                 spapr_tce_get_iommu(tcet));
 
@@ -923,6 +914,40 @@ static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
     return 0;
 }
 
+int spapr_phb_dma_remove_window(sPAPRPHBState *sphb,
+                                sPAPRTCETable *tcet)
+{
+    spapr_tce_table_disable(tcet);
+
+    return 0;
+}
+
+static int spapr_phb_disable_dma_windows(Object *child, void *opaque)
+{
+    sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(opaque);
+    sPAPRTCETable *tcet = (sPAPRTCETable *)
+        object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
+
+    if (tcet) {
+        spapr_phb_dma_remove_window(sphb, tcet);
+    }
+
+    return 0;
+}
+
+int spapr_phb_dma_reset(sPAPRPHBState *sphb)
+{
+    const uint32_t liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
+    sPAPRPHBClass *spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
+
+    spc->dma_capabilities_update(sphb); /* Refresh @has_vfio status */
+    object_child_foreach(OBJECT(sphb), spapr_phb_disable_dma_windows, sphb);
+    spc->dma_init_window(sphb, liobn, SPAPR_TCE_PAGE_SHIFT,
+                         sphb->dma32_window_size);
+
+    return 0;
+}
+
 static int spapr_phb_children_reset(Object *child, void *opaque)
 {
     DeviceState *dev = (DeviceState *) object_dynamic_cast(child, TYPE_DEVICE);
@@ -936,6 +961,8 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
 
 static void spapr_phb_reset(DeviceState *qdev)
 {
+    spapr_phb_dma_reset(SPAPR_PCI_HOST_BRIDGE(qdev));
+
     /* Reset the IOMMU state */
     object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL);
 }
diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
index a5b97d0..f89e053 100644
--- a/hw/ppc/spapr_pci_vfio.c
+++ b/hw/ppc/spapr_pci_vfio.c
@@ -58,11 +58,6 @@ static int spapr_phb_vfio_dma_init_window(sPAPRPHBState *sphb,
     return 0;
 }
 
-static void spapr_phb_vfio_reset(DeviceState *qdev)
-{
-    /* Do nothing */
-}
-
 static int spapr_phb_vfio_eeh_set_option(sPAPRPHBState *sphb,
                                          unsigned int addr, int option)
 {
@@ -176,7 +171,6 @@ static void spapr_phb_vfio_class_init(ObjectClass *klass, void *data)
     sPAPRPHBClass *spc = SPAPR_PCI_HOST_BRIDGE_CLASS(klass);
 
     dc->props = spapr_phb_vfio_properties;
-    dc->reset = spapr_phb_vfio_reset;
     spc->dma_capabilities_update = spapr_phb_vfio_dma_capabilities_update;
     spc->dma_init_window = spapr_phb_vfio_dma_init_window;
     spc->eeh_set_option = spapr_phb_vfio_eeh_set_option;
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 3074145..7fda78e 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -140,5 +140,8 @@ void spapr_pci_rtas_init(void);
 sPAPRPHBState *spapr_pci_find_phb(sPAPREnvironment *spapr, uint64_t buid);
 PCIDevice *spapr_pci_find_dev(sPAPREnvironment *spapr, uint64_t buid,
                               uint32_t config_addr);
+int spapr_phb_dma_remove_window(sPAPRPHBState *sphb,
+                                sPAPRTCETable *tcet);
+int spapr_phb_dma_reset(sPAPRPHBState *sphb);
 
 #endif /* __HW_SPAPR_PCI_H__ */
-- 
2.0.0


* [Qemu-devel] [PATCH qemu v7 09/14] spapr_vfio_pci: Remove redundant spapr-pci-vfio-host-bridge
  2015-04-25 12:24 [Qemu-devel] [PATCH qemu v7 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (7 preceding siblings ...)
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 08/14] spapr_pci: Do complete reset of DMA config when resetting PHB Alexey Kardashevskiy
@ 2015-04-25 12:24 ` Alexey Kardashevskiy
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 10/14] linux headers update for DDW on SPAPR Alexey Kardashevskiy
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-04-25 12:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

sPAPRTCETable is handling 2 TCE tables already:

1) guest view of the TCE table - emulated devices use only this table;

2) hardware IOMMU table - VFIO PCI devices use it for actual work but
it does not replace 1) and it is not visible to the guest.
The initialization of this table is driven by vfio-pci device,
DMA map/unmap requests are handled via MemoryListener so there is very
little to do in spapr-pci-vfio-host-bridge.

This moves VFIO bits to the generic spapr-pci-host-bridge which allows
putting emulated and VFIO devices on the same PHB. It is still possible
to create multiple PHBs and avoid sharing PHB resources for emulated and
VFIO devices.

If there is no VFIO-PCI device attached, no special ioctls will be called.
If there are some VFIO-PCI devices attached, PHB may refuse to attach
another VFIO-PCI device if a VFIO container on the host kernel side
does not support container sharing.

This changes spapr-pci-host-bridge to support properties of
spapr-pci-vfio-host-bridge. This makes spapr-pci-vfio-host-bridge type
equal to spapr-pci-host-bridge except it has an additional "iommu" property
for backward compatibility reasons.
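
The resulting dispatch pattern can be sketched as follows. This is a
simplified standalone model, not the patch itself: the toy_* names and the
return codes are hypothetical. It shows the idea of replacing the per-class
EEH callbacks with a runtime has_vfio check on the one remaining PHB type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return codes for the sketch. */
enum {
    TOY_RTAS_SUCCESS = 0,
    TOY_RTAS_PARAM_ERROR = -3,
};

struct toy_phb {
    bool has_vfio;    /* refreshed on reset in the real code */
    int eeh_option;   /* last option "sent" to the VFIO backend */
};

/* Stand-in for the VFIO-specific helper; only valid for VFIO PHBs. */
static int toy_vfio_eeh_set_option(struct toy_phb *phb, int option)
{
    assert(phb != NULL);
    phb->eeh_option = option;
    return TOY_RTAS_SUCCESS;
}

/*
 * RTAS-handler-like wrapper: a missing PHB and a PHB without VFIO devices
 * are both rejected the same way, then the helper is called directly
 * instead of going through a class method.
 */
static int toy_rtas_set_eeh_option(struct toy_phb *phb, int option)
{
    if (!phb || !phb->has_vfio) {
        return TOY_RTAS_PARAM_ERROR;
    }
    return toy_vfio_eeh_set_option(phb, option);
}
```

The trade-off is that the check moves from type (class) to state: the same
PHB can start answering EEH calls once a VFIO device is attached and the
flag has been refreshed.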

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr_pci.c          | 78 ++++++++++++++-------------------------------
 hw/ppc/spapr_pci_vfio.c     | 35 ++++----------------
 include/hw/pci-host/spapr.h | 25 ++++++---------
 3 files changed, 41 insertions(+), 97 deletions(-)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 3d40f5b..d097cce 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -413,7 +413,6 @@ static void rtas_ibm_set_eeh_option(PowerPCCPU *cpu,
                                     target_ulong rets)
 {
     sPAPRPHBState *sphb;
-    sPAPRPHBClass *spc;
     uint32_t addr, option;
     uint64_t buid;
     int ret;
@@ -427,16 +426,11 @@ static void rtas_ibm_set_eeh_option(PowerPCCPU *cpu,
     option = rtas_ld(args, 3);
 
     sphb = spapr_pci_find_phb(spapr, buid);
-    if (!sphb) {
+    if (!sphb || !sphb->has_vfio) {
         goto param_error_exit;
     }
 
-    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
-    if (!spc->eeh_set_option) {
-        goto param_error_exit;
-    }
-
-    ret = spc->eeh_set_option(sphb, addr, option);
+    ret = spapr_phb_vfio_eeh_set_option(sphb, addr, option);
     rtas_st(rets, 0, ret);
     return;
 
@@ -451,7 +445,6 @@ static void rtas_ibm_get_config_addr_info2(PowerPCCPU *cpu,
                                            target_ulong rets)
 {
     sPAPRPHBState *sphb;
-    sPAPRPHBClass *spc;
     PCIDevice *pdev;
     uint32_t addr, option;
     uint64_t buid;
@@ -462,12 +455,7 @@ static void rtas_ibm_get_config_addr_info2(PowerPCCPU *cpu,
 
     buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
     sphb = spapr_pci_find_phb(spapr, buid);
-    if (!sphb) {
-        goto param_error_exit;
-    }
-
-    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
-    if (!spc->eeh_set_option) {
+    if (!sphb || !sphb->has_vfio) {
         goto param_error_exit;
     }
 
@@ -507,7 +495,6 @@ static void rtas_ibm_read_slot_reset_state2(PowerPCCPU *cpu,
                                             target_ulong rets)
 {
     sPAPRPHBState *sphb;
-    sPAPRPHBClass *spc;
     uint64_t buid;
     int state, ret;
 
@@ -517,16 +504,11 @@ static void rtas_ibm_read_slot_reset_state2(PowerPCCPU *cpu,
 
     buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
     sphb = spapr_pci_find_phb(spapr, buid);
-    if (!sphb) {
+    if (!sphb || !sphb->has_vfio) {
         goto param_error_exit;
     }
 
-    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
-    if (!spc->eeh_get_state) {
-        goto param_error_exit;
-    }
-
-    ret = spc->eeh_get_state(sphb, &state);
+    ret = spapr_phb_vfio_eeh_get_state(sphb, &state);
     rtas_st(rets, 0, ret);
     if (ret != RTAS_OUT_SUCCESS) {
         return;
@@ -551,7 +533,6 @@ static void rtas_ibm_set_slot_reset(PowerPCCPU *cpu,
                                     target_ulong rets)
 {
     sPAPRPHBState *sphb;
-    sPAPRPHBClass *spc;
     uint32_t option;
     uint64_t buid;
     int ret;
@@ -563,16 +544,11 @@ static void rtas_ibm_set_slot_reset(PowerPCCPU *cpu,
     buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
     option = rtas_ld(args, 3);
     sphb = spapr_pci_find_phb(spapr, buid);
-    if (!sphb) {
+    if (!sphb || !sphb->has_vfio) {
         goto param_error_exit;
     }
 
-    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
-    if (!spc->eeh_reset) {
-        goto param_error_exit;
-    }
-
-    ret = spc->eeh_reset(sphb, option);
+    ret = spapr_phb_vfio_eeh_reset(sphb, option);
     rtas_st(rets, 0, ret);
     return;
 
@@ -587,7 +563,6 @@ static void rtas_ibm_configure_pe(PowerPCCPU *cpu,
                                   target_ulong rets)
 {
     sPAPRPHBState *sphb;
-    sPAPRPHBClass *spc;
     uint64_t buid;
     int ret;
 
@@ -597,16 +572,11 @@ static void rtas_ibm_configure_pe(PowerPCCPU *cpu,
 
     buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
     sphb = spapr_pci_find_phb(spapr, buid);
-    if (!sphb) {
+    if (!sphb || !sphb->has_vfio) {
         goto param_error_exit;
     }
 
-    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
-    if (!spc->eeh_configure) {
-        goto param_error_exit;
-    }
-
-    ret = spc->eeh_configure(sphb);
+    ret = spapr_phb_vfio_eeh_configure(sphb);
     rtas_st(rets, 0, ret);
     return;
 
@@ -622,7 +592,6 @@ static void rtas_ibm_slot_error_detail(PowerPCCPU *cpu,
                                        target_ulong rets)
 {
     sPAPRPHBState *sphb;
-    sPAPRPHBClass *spc;
     int option;
     uint64_t buid;
 
@@ -632,12 +601,7 @@ static void rtas_ibm_slot_error_detail(PowerPCCPU *cpu,
 
     buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
     sphb = spapr_pci_find_phb(spapr, buid);
-    if (!sphb) {
-        goto param_error_exit;
-    }
-
-    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
-    if (!spc->eeh_set_option) {
+    if (!sphb || !sphb->has_vfio) {
         goto param_error_exit;
     }
 
@@ -742,6 +706,11 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     uint64_t msi_window_size = 4096;
     sPAPRTCETable *tcet;
 
+    if ((sphb->iommugroupid != -1) &&
+        object_dynamic_cast(OBJECT(sphb), TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE)) {
+        error_report("Warning: iommugroupid shall not be used");
+    }
+
     if (sphb->index != (uint32_t)-1) {
         hwaddr windows_base;
 
@@ -894,9 +863,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
 
 static int spapr_phb_dma_capabilities_update(sPAPRPHBState *sphb)
 {
+    int ret;
+
     sphb->dma32_window_start = 0;
     sphb->dma32_window_size = SPAPR_PCI_DMA32_SIZE;
 
+    ret = spapr_phb_vfio_dma_capabilities_update(sphb);
+    sphb->has_vfio = (ret == 0);
+
     return 0;
 }
 
@@ -909,7 +883,7 @@ static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
 
     spapr_tce_table_enable(tcet, bus_offset, page_shift,
                            window_size >> page_shift,
-                           false);
+                           sphb->has_vfio);
 
     return 0;
 }
@@ -938,12 +912,11 @@ static int spapr_phb_disable_dma_windows(Object *child, void *opaque)
 int spapr_phb_dma_reset(sPAPRPHBState *sphb)
 {
     const uint32_t liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
-    sPAPRPHBClass *spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
 
-    spc->dma_capabilities_update(sphb); /* Refresh @has_vfio status */
+    spapr_phb_dma_capabilities_update(sphb); /* Refresh @has_vfio status */
     object_child_foreach(OBJECT(sphb), spapr_phb_disable_dma_windows, sphb);
-    spc->dma_init_window(sphb, liobn, SPAPR_TCE_PAGE_SHIFT,
-                         sphb->dma32_window_size);
+    spapr_phb_dma_init_window(sphb, liobn, SPAPR_TCE_PAGE_SHIFT,
+                              sphb->dma32_window_size);
 
     return 0;
 }
@@ -1089,7 +1062,6 @@ static void spapr_phb_class_init(ObjectClass *klass, void *data)
 {
     PCIHostBridgeClass *hc = PCI_HOST_BRIDGE_CLASS(klass);
     DeviceClass *dc = DEVICE_CLASS(klass);
-    sPAPRPHBClass *spc = SPAPR_PCI_HOST_BRIDGE_CLASS(klass);
 
     hc->root_bus_path = spapr_phb_root_bus_path;
     dc->realize = spapr_phb_realize;
@@ -1098,8 +1070,6 @@ static void spapr_phb_class_init(ObjectClass *klass, void *data)
     dc->vmsd = &vmstate_spapr_pci;
     set_bit(DEVICE_CATEGORY_BRIDGE, dc->categories);
     dc->cannot_instantiate_with_device_add_yet = false;
-    spc->dma_capabilities_update = spapr_phb_dma_capabilities_update;
-    spc->dma_init_window = spapr_phb_dma_init_window;
 }
 
 static const TypeInfo spapr_phb_info = {
diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
index f89e053..6f91b39 100644
--- a/hw/ppc/spapr_pci_vfio.c
+++ b/hw/ppc/spapr_pci_vfio.c
@@ -23,11 +23,11 @@
 #include "hw/vfio/vfio.h"
 
 static Property spapr_phb_vfio_properties[] = {
-    DEFINE_PROP_INT32("iommu", sPAPRPHBVFIOState, iommugroupid, -1),
+    DEFINE_PROP_INT32("iommu", sPAPRPHBState, iommugroupid, -1),
     DEFINE_PROP_END_OF_LIST(),
 };
 
-static int spapr_phb_vfio_dma_capabilities_update(sPAPRPHBState *sphb)
+int spapr_phb_vfio_dma_capabilities_update(sPAPRPHBState *sphb)
 {
     struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) };
     int ret;
@@ -44,21 +44,7 @@ static int spapr_phb_vfio_dma_capabilities_update(sPAPRPHBState *sphb)
     return ret;
 }
 
-static int spapr_phb_vfio_dma_init_window(sPAPRPHBState *sphb,
-                                          uint32_t liobn, uint32_t page_shift,
-                                          uint64_t window_size)
-{
-    uint64_t bus_offset = sphb->dma32_window_start;
-    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
-
-    spapr_tce_table_enable(tcet, bus_offset, page_shift,
-                           window_size >> page_shift,
-                           true);
-
-    return 0;
-}
-
-static int spapr_phb_vfio_eeh_set_option(sPAPRPHBState *sphb,
+int spapr_phb_vfio_eeh_set_option(sPAPRPHBState *sphb,
                                          unsigned int addr, int option)
 {
     struct vfio_eeh_pe_op op = { .argsz = sizeof(op) };
@@ -106,7 +92,7 @@ static int spapr_phb_vfio_eeh_set_option(sPAPRPHBState *sphb,
     return RTAS_OUT_SUCCESS;
 }
 
-static int spapr_phb_vfio_eeh_get_state(sPAPRPHBState *sphb, int *state)
+int spapr_phb_vfio_eeh_get_state(sPAPRPHBState *sphb, int *state)
 {
     struct vfio_eeh_pe_op op = { .argsz = sizeof(op) };
     int ret;
@@ -122,7 +108,7 @@ static int spapr_phb_vfio_eeh_get_state(sPAPRPHBState *sphb, int *state)
     return RTAS_OUT_SUCCESS;
 }
 
-static int spapr_phb_vfio_eeh_reset(sPAPRPHBState *sphb, int option)
+int spapr_phb_vfio_eeh_reset(sPAPRPHBState *sphb, int option)
 {
     struct vfio_eeh_pe_op op = { .argsz = sizeof(op) };
     int ret;
@@ -150,7 +136,7 @@ static int spapr_phb_vfio_eeh_reset(sPAPRPHBState *sphb, int option)
     return RTAS_OUT_SUCCESS;
 }
 
-static int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb)
+int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb)
 {
     struct vfio_eeh_pe_op op = { .argsz = sizeof(op) };
     int ret;
@@ -168,21 +154,14 @@ static int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb)
 static void spapr_phb_vfio_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
-    sPAPRPHBClass *spc = SPAPR_PCI_HOST_BRIDGE_CLASS(klass);
 
     dc->props = spapr_phb_vfio_properties;
-    spc->dma_capabilities_update = spapr_phb_vfio_dma_capabilities_update;
-    spc->dma_init_window = spapr_phb_vfio_dma_init_window;
-    spc->eeh_set_option = spapr_phb_vfio_eeh_set_option;
-    spc->eeh_get_state = spapr_phb_vfio_eeh_get_state;
-    spc->eeh_reset = spapr_phb_vfio_eeh_reset;
-    spc->eeh_configure = spapr_phb_vfio_eeh_configure;
 }
 
 static const TypeInfo spapr_phb_vfio_info = {
     .name          = TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE,
     .parent        = TYPE_SPAPR_PCI_HOST_BRIDGE,
-    .instance_size = sizeof(sPAPRPHBVFIOState),
+    .instance_size = sizeof(sPAPRPHBState),
     .class_init    = spapr_phb_vfio_class_init,
     .class_size    = sizeof(sPAPRPHBClass),
 };
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 7fda78e..484291c 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -47,15 +47,6 @@ typedef struct sPAPRPHBVFIOState sPAPRPHBVFIOState;
 
 struct sPAPRPHBClass {
     PCIHostBridgeClass parent_class;
-
-    int (*dma_capabilities_update)(sPAPRPHBState *sphb);
-    int (*dma_init_window)(sPAPRPHBState *sphb,
-                           uint32_t liobn, uint32_t page_shift,
-                           uint64_t window_size);
-    int (*eeh_set_option)(sPAPRPHBState *sphb, unsigned int addr, int option);
-    int (*eeh_get_state)(sPAPRPHBState *sphb, int *state);
-    int (*eeh_reset)(sPAPRPHBState *sphb, int option);
-    int (*eeh_configure)(sPAPRPHBState *sphb);
 };
 
 typedef struct spapr_pci_msi {
@@ -94,16 +85,12 @@ struct sPAPRPHBState {
 
     uint32_t dma32_window_start;
     uint32_t dma32_window_size;
+    bool has_vfio;
+    int32_t iommugroupid; /* obsolete */
 
     QLIST_ENTRY(sPAPRPHBState) list;
 };
 
-struct sPAPRPHBVFIOState {
-    sPAPRPHBState phb;
-
-    int32_t iommugroupid;
-};
-
 #define SPAPR_PCI_MAX_INDEX          255
 
 #define SPAPR_PCI_BASE_BUID          0x800000020000000ULL
@@ -144,4 +131,12 @@ int spapr_phb_dma_remove_window(sPAPRPHBState *sphb,
                                 sPAPRTCETable *tcet);
 int spapr_phb_dma_reset(sPAPRPHBState *sphb);
 
+int spapr_phb_vfio_dma_capabilities_update(sPAPRPHBState *sphb);
+int spapr_phb_vfio_eeh_set_option(sPAPRPHBState *sphb,
+                                  unsigned int addr, int option);
+int spapr_phb_vfio_eeh_get_state(sPAPRPHBState *sphb, int *state);
+int spapr_phb_vfio_eeh_reset(sPAPRPHBState *sphb, int option);
+int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb);
+
+
 #endif /* __HW_SPAPR_PCI_H__ */
-- 
2.0.0


* [Qemu-devel] [PATCH qemu v7 10/14] linux headers update for DDW on SPAPR
  2015-04-25 12:24 [Qemu-devel] [PATCH qemu v7 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (8 preceding siblings ...)
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 09/14] spapr_vfio_pci: Remove redundant spapr-pci-vfio-host-bridge Alexey Kardashevskiy
@ 2015-04-25 12:24 ` Alexey Kardashevskiy
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 11/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-04-25 12:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

Since the changes are not in upstream yet, no tag or branch is specified here.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 linux-headers/linux/vfio.h | 88 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 85 insertions(+), 3 deletions(-)

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 95ba870..ce05371 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -36,6 +36,8 @@
 /* Two-stage IOMMU */
 #define VFIO_TYPE1_NESTING_IOMMU	6	/* Implies v2 */
 
+#define VFIO_SPAPR_TCE_v2_IOMMU		7
+
 /*
  * The IOCTL interface is designed for extensibility by embedding the
  * structure length (argsz) and flags into structures passed between
@@ -441,6 +443,23 @@ struct vfio_iommu_type1_dma_unmap {
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
+ * The SPAPR TCE DDW info struct provides the information about
+ * the details of Dynamic DMA window capability.
+ *
+ * @pgsizes contains a page size bitmask, 4K/64K/16M are supported.
+ * @max_dynamic_windows_supported tells the maximum number of windows
+ * which the platform can create.
+ * @levels tells the maximum number of levels in multi-level IOMMU tables;
+ * this allows splitting a table into smaller chunks which reduces
+ * the amount of physically contiguous memory required for the table.
+ */
+struct vfio_iommu_spapr_tce_ddw_info {
+	__u64 pgsizes;			/* Bitmap of supported page sizes */
+	__u32 max_dynamic_windows_supported;
+	__u32 levels;
+};
+
+/*
  * The SPAPR TCE info struct provides the information about the PCI bus
  * address ranges available for DMA, these values are programmed into
  * the hardware so the guest has to know that information.
@@ -450,14 +469,17 @@ struct vfio_iommu_type1_dma_unmap {
  * addresses too so the window works as a filter rather than an offset
  * for IOVA addresses.
  *
- * A flag will need to be added if other page sizes are supported,
- * so as defined here, it is always 4k.
+ * Flags supported:
+ * - VFIO_IOMMU_SPAPR_INFO_DDW: informs the userspace that dynamic DMA windows
+ *   (DDW) support is present. @ddw is only supported when DDW is present.
  */
 struct vfio_iommu_spapr_tce_info {
 	__u32 argsz;
-	__u32 flags;			/* reserved for future use */
+	__u32 flags;
+#define VFIO_IOMMU_SPAPR_INFO_DDW	(1 << 0)	/* DDW supported */
 	__u32 dma32_window_start;	/* 32 bit window start (bytes) */
 	__u32 dma32_window_size;	/* 32 bit window size (bytes) */
+	struct vfio_iommu_spapr_tce_ddw_info ddw;
 };
 
 #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
@@ -493,6 +515,66 @@ struct vfio_eeh_pe_op {
 
 #define VFIO_EEH_PE_OP			_IO(VFIO_TYPE, VFIO_BASE + 21)
 
+/**
+ * VFIO_IOMMU_SPAPR_REGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 17, struct vfio_iommu_spapr_register_memory)
+ *
+ * Registers user space memory where DMA is allowed. It pins
+ * user pages and does the locked memory accounting so
+ * subsequent VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA calls
+ * get faster.
+ */
+struct vfio_iommu_spapr_register_memory {
+	__u32	argsz;
+	__u32	flags;
+	__u64	vaddr;				/* Process virtual address */
+	__u64	size;				/* Size of mapping (bytes) */
+};
+#define VFIO_IOMMU_SPAPR_REGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 17)
+
+/**
+ * VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 18, struct vfio_iommu_spapr_register_memory)
+ *
+ * Unregisters user space memory registered with
+ * VFIO_IOMMU_SPAPR_REGISTER_MEMORY.
+ * Uses vfio_iommu_spapr_register_memory for parameters.
+ */
+#define VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 18)
+
+/**
+ * VFIO_IOMMU_SPAPR_TCE_CREATE - _IOWR(VFIO_TYPE, VFIO_BASE + 19, struct vfio_iommu_spapr_tce_create)
+ *
+ * Creates an additional TCE table and programs it (sets a new DMA window)
+ * to every IOMMU group in the container. It receives page shift, window
+ * size and number of levels in the TCE table being created.
+ *
+ * It allocates and returns an offset on a PCI bus of the new DMA window.
+ */
+struct vfio_iommu_spapr_tce_create {
+	__u32 argsz;
+	__u32 flags;
+	/* in */
+	__u32 page_shift;
+	__u64 window_size;
+	__u32 levels;
+	/* out */
+	__u64 start_addr;
+};
+#define VFIO_IOMMU_SPAPR_TCE_CREATE	_IO(VFIO_TYPE, VFIO_BASE + 19)
+
+/**
+ * VFIO_IOMMU_SPAPR_TCE_REMOVE - _IOW(VFIO_TYPE, VFIO_BASE + 20, struct vfio_iommu_spapr_tce_remove)
+ *
+ * Unprograms a TCE table from all groups in the container and destroys it.
+ * It receives a PCI bus offset as a window id.
+ */
+struct vfio_iommu_spapr_tce_remove {
+	__u32 argsz;
+	__u32 flags;
+	/* in */
+	__u64 start_addr;
+};
+#define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 20)
+
 /* ***************************************************************** */
 
 #endif /* VFIO_H */
-- 
2.0.0


* [Qemu-devel] [PATCH qemu v7 11/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-04-25 12:24 [Qemu-devel] [PATCH qemu v7 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (9 preceding siblings ...)
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 10/14] linux headers update for DDW on SPAPR Alexey Kardashevskiy
@ 2015-04-25 12:24 ` Alexey Kardashevskiy
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 12/14] spapr: Add pseries-2.4 machine Alexey Kardashevskiy
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-04-25 12:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

This makes use of the new "memory registering" feature. The idea is
to give userspace the ability to notify the host kernel about pages
which are going to be used for DMA. Having this information, the host
kernel can pin them all once per user process, do locked pages
accounting (once) and not spend time on doing that in real time with
possible failures which cannot be handled nicely in some cases.

This adds a guest RAM memory listener which notifies a VFIO container
about memory which needs to be pinned/unpinned. VFIO MMIO regions
(i.e. "skip dump" regions) are skipped.

The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
not call it when v2 is detected and enabled.

This does not change the guest visible interface.
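
The per-section work the RAM listener does before calling the register or
unregister ioctl can be sketched in isolation. This is an illustrative
model only: toy_reg and toy_prepare_registration() are hypothetical names,
and a single page size is used for both the alignment check and the
round-up, which is a simplification of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors only the vaddr/size pair passed to the kernel. */
struct toy_reg {
    uint64_t vaddr;   /* process virtual address of the chunk */
    uint64_t size;    /* size in bytes, rounded up to a page */
};

/*
 * Returns 0 and fills @reg, or -1 if @offset is not page aligned
 * (such a section is skipped rather than registered).
 */
static int toy_prepare_registration(uint64_t ram_base, uint64_t offset,
                                    uint64_t size, uint64_t page_size,
                                    struct toy_reg *reg)
{
    /* The mask arithmetic below requires a power-of-two page size. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    if (offset & (page_size - 1)) {
        return -1;
    }

    reg->vaddr = ram_base + offset;
    reg->size = (size + page_size - 1) & ~(page_size - 1);
    return 0;
}
```

Rounding the size up means a partially used last page is still pinned as a
whole, which is what page-granular pinning on the host implies anyway.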

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v7:
* in vfio_spapr_ram_listener_region_del(), do unref() after ioctl()
* s'ramlistener'register_listener'

v6:
* fixed commit log (s/guest/userspace/), added note about no guest visible
change
* fixed error checking if ram registration failed
* added alignment check for section->offset_within_region

v5:
* simplified the patch
* added trace points
* added round_up() for the size
* SPAPR IOMMU v2 used
---
 hw/vfio/common.c              | 26 +++++++++----
 hw/vfio/spapr.c               | 88 ++++++++++++++++++++++++++++++++++++++++++-
 include/hw/vfio/vfio-common.h |  5 ++-
 trace-events                  |  1 +
 4 files changed, 110 insertions(+), 10 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 369e564..9e3e0b0 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -577,14 +577,18 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
 
         container->iommu_data.type1.initialized = true;
 
-    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
+    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
+               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
+        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
+
         ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
         if (ret) {
             error_report("vfio: failed to set group container: %m");
             ret = -errno;
             goto free_container_exit;
         }
-        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
+        ret = ioctl(fd, VFIO_SET_IOMMU,
+                v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU);
         if (ret) {
             error_report("vfio: failed to set iommu for container: %m");
             ret = -errno;
@@ -596,14 +600,20 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
          * when container fd is closed so we do not call it explicitly
          * in this file.
          */
-        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
-        if (ret) {
-            error_report("vfio: failed to enable container: %m");
-            ret = -errno;
-            goto free_container_exit;
+        if (!v2) {
+            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
+            if (ret) {
+                error_report("vfio: failed to enable container: %m");
+                ret = -errno;
+                goto free_container_exit;
+            }
         }
 
-        spapr_memory_listener_register(container);
+        ret = spapr_memory_listener_register(container, v2 ? 2 : 1);
+        if (ret) {
+            error_report("vfio: RAM memory listener initialization failed for container");
+            goto listener_release_exit;
+        }
 
     } else {
         error_report("vfio: No available IOMMU models");
diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
index 5f79194..62d9067 100644
--- a/hw/vfio/spapr.c
+++ b/hw/vfio/spapr.c
@@ -17,6 +17,9 @@
  *  along with this program; if not, see <http://www.gnu.org/licenses/>.
  */
 
+#include <sys/ioctl.h>
+#include <linux/vfio.h>
+
 #include "hw/vfio/vfio-common.h"
 #include "qemu/error-report.h"
 #include "trace.h"
@@ -211,16 +214,99 @@ static const MemoryListener vfio_spapr_memory_listener = {
     .region_del = vfio_spapr_listener_region_del,
 };
 
+static void vfio_ram_do_region(VFIOContainer *container,
+                              MemoryRegionSection *section, unsigned long req)
+{
+    int ret;
+    struct vfio_iommu_spapr_register_memory reg = { .argsz = sizeof(reg) };
+
+    if (!memory_region_is_ram(section->mr) ||
+        memory_region_is_skip_dump(section->mr)) {
+        return;
+    }
+
+    if (unlikely((section->offset_within_region & (getpagesize() - 1)))) {
+        error_report("%s received unaligned region", __func__);
+        return;
+    }
+
+    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
+        section->offset_within_region;
+    reg.size = ROUND_UP(int128_get64(section->size), TARGET_PAGE_SIZE);
+
+    ret = ioctl(container->fd, req, &reg);
+    trace_vfio_ram_register(_IOC_NR(req) - VFIO_BASE, reg.vaddr, reg.size,
+            ret ? -errno : 0);
+    if (!ret) {
+        return;
+    }
+
+    /*
+     * On the initfn path, store the first error in the container so we
+     * can gracefully fail.  Runtime, there's not much we can do other
+     * than throw a hardware error.
+     */
+    if (!container->iommu_data.spapr.ram_reg_initialized) {
+        if (!container->iommu_data.spapr.ram_reg_error) {
+            container->iommu_data.spapr.ram_reg_error = -errno;
+        }
+    } else {
+        hw_error("vfio: RAM registering failed, unable to continue");
+    }
+}
+
+static void vfio_spapr_ram_listener_region_add(MemoryListener *listener,
+                                               MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            iommu_data.spapr.register_listener);
+    memory_region_ref(section->mr);
+    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_REGISTER_MEMORY);
+}
+
+static void vfio_spapr_ram_listener_region_del(MemoryListener *listener,
+                                               MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            iommu_data.spapr.register_listener);
+    vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY);
+    memory_region_unref(section->mr);
+}
+
+static const MemoryListener vfio_spapr_ram_memory_listener = {
+    .region_add = vfio_spapr_ram_listener_region_add,
+    .region_del = vfio_spapr_ram_listener_region_del,
+};
+
 static void vfio_spapr_listener_release(VFIOContainer *container)
 {
     memory_listener_unregister(&container->iommu_data.spapr.listener);
 }
 
-void spapr_memory_listener_register(VFIOContainer *container)
+static void vfio_spapr_listener_release_v2(VFIOContainer *container)
+{
+    memory_listener_unregister(&container->iommu_data.spapr.register_listener);
+    vfio_spapr_listener_release(container);
+}
+
+int spapr_memory_listener_register(VFIOContainer *container, int ver)
 {
     container->iommu_data.spapr.listener = vfio_spapr_memory_listener;
     container->iommu_data.release = vfio_spapr_listener_release;
 
     memory_listener_register(&container->iommu_data.spapr.listener,
                              container->space->as);
+    if (ver < 2) {
+        return 0;
+    }
+
+    container->iommu_data.spapr.register_listener =
+            vfio_spapr_ram_memory_listener;
+    container->iommu_data.release = vfio_spapr_listener_release_v2;
+    memory_listener_register(&container->iommu_data.spapr.register_listener,
+                             &address_space_memory);
+
+    container->iommu_data.spapr.ram_reg_initialized = true;
+
+    return container->iommu_data.spapr.ram_reg_error;
 }
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 06b96ad..1e4e1f1 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -71,6 +71,9 @@ typedef struct VFIOType1 {
 
 typedef struct VFIOSPAPR {
     MemoryListener listener;
+    MemoryListener register_listener;
+    int ram_reg_error;
+    bool ram_reg_initialized;
 } VFIOSPAPR;
 
 typedef struct VFIOContainer {
@@ -156,6 +159,6 @@ extern int vfio_dma_unmap(VFIOContainer *container,
                           hwaddr iova, ram_addr_t size);
 bool vfio_listener_skipped_section(MemoryRegionSection *section);
 
-extern void spapr_memory_listener_register(VFIOContainer *container);
+extern int spapr_memory_listener_register(VFIOContainer *container, int ver);
 
 #endif /* !HW_VFIO_VFIO_COMMON_H */
diff --git a/trace-events b/trace-events
index 1231ba4..2739140 100644
--- a/trace-events
+++ b/trace-events
@@ -1563,6 +1563,7 @@ vfio_disconnect_container(int fd) "close container->fd=%d"
 vfio_put_group(int fd) "close group->fd=%d"
 vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions, unsigned int num_irqs) "Device %s flags: %u, regions: %u, irqs: %u"
 vfio_put_base_device(int fd) "close vdev->fd=%d"
+vfio_ram_register(int req, uint64_t va, uint64_t size, int ret) "req=%d va=%"PRIx64" size=%"PRIx64" ret=%d"
 
 #hw/acpi/memory_hotplug.c
 mhp_acpi_invalid_slot_selected(uint32_t slot) "0x%"PRIx32
-- 
2.0.0


* [Qemu-devel] [PATCH qemu v7 12/14] spapr: Add pseries-2.4 machine
  2015-04-25 12:24 [Qemu-devel] [PATCH qemu v7 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (10 preceding siblings ...)
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 11/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
@ 2015-04-25 12:24 ` Alexey Kardashevskiy
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 13/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-04-25 12:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

The next patch implements dynamic DMA windows and disables them by default
for older pseries machines.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 7febff7..b28209f 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1852,12 +1852,15 @@ static const TypeInfo spapr_machine_2_2_info = {
 
 static void spapr_machine_2_3_class_init(ObjectClass *oc, void *data)
 {
+    static GlobalProperty compat_props[] = {
+        SPAPR_COMPAT_2_2,
+        { /* end of list */ }
+    };
     MachineClass *mc = MACHINE_CLASS(oc);
 
     mc->name = "pseries-2.3";
     mc->desc = "pSeries Logical Partition (PAPR compliant) v2.3";
-    mc->alias = "pseries";
-    mc->is_default = 1;
+    mc->compat_props = compat_props;
 }
 
 static const TypeInfo spapr_machine_2_3_info = {
@@ -1866,12 +1869,29 @@ static const TypeInfo spapr_machine_2_3_info = {
     .class_init    = spapr_machine_2_3_class_init,
 };
 
+static void spapr_machine_2_4_class_init(ObjectClass *oc, void *data)
+{
+    MachineClass *mc = MACHINE_CLASS(oc);
+
+    mc->name = "pseries-2.4";
+    mc->desc = "pSeries Logical Partition (PAPR compliant) v2.4";
+    mc->alias = "pseries";
+    mc->is_default = 1;
+}
+
+static const TypeInfo spapr_machine_2_4_info = {
+    .name          = TYPE_SPAPR_MACHINE "2.4",
+    .parent        = TYPE_SPAPR_MACHINE,
+    .class_init    = spapr_machine_2_4_class_init,
+};
+
 static void spapr_machine_register_types(void)
 {
     type_register_static(&spapr_machine_info);
     type_register_static(&spapr_machine_2_1_info);
     type_register_static(&spapr_machine_2_2_info);
     type_register_static(&spapr_machine_2_3_info);
+    type_register_static(&spapr_machine_2_4_info);
 }
 
 type_init(spapr_machine_register_types)
-- 
2.0.0


* [Qemu-devel] [PATCH qemu v7 13/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2015-04-25 12:24 [Qemu-devel] [PATCH qemu v7 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (11 preceding siblings ...)
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 12/14] spapr: Add pseries-2.4 machine Alexey Kardashevskiy
@ 2015-04-25 12:24 ` Alexey Kardashevskiy
  2015-05-05 12:49   ` David Gibson
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 14/14] vfio: Enable DDW ioctls to VFIO IOMMU driver Alexey Kardashevskiy
  2015-05-05  9:30 ` [Qemu-devel] [PATCH qemu v7 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
  14 siblings, 1 reply; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-04-25 12:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

This adds support for the Dynamic DMA Windows (DDW) option defined by
the SPAPR specification, which allows additional DMA window(s) to be created.

This implements DDW for emulated and VFIO devices. As all TCE root regions
are mapped at 0 and are 64 bits long (and the actual tables are child
regions), this replaces memory_region_add_subregion() with
memory_region_add_subregion_overlap() to keep the QEMU memory API happy.

This reserves RTAS token numbers for DDW calls.

This implements helpers to interact with VFIO kernel interface.

This changes the TCE table migration descriptor to support dynamic
tables. From now on, a PHB creates as many stub TCE table objects as it
can possibly support, but not all of them might be initialized at the
time of migration because the guest might or might not have requested
DDW.
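
As a rough illustration of the migration handling described above, here is
a minimal standalone model of the pre_save/post_load pair. It is a sketch,
not the QEMU code: the struct and function names are hypothetical, calloc()
stands in for enabling the table, and unlike the patch it frees the
transferred buffer even when the table is not enabled.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for sPAPRTCETable; only the fields used here. */
struct tce_table_model {
    bool enabled;
    uint32_t nb_table;
    uint64_t *table;     /* live table, NULL while the window is disabled */
    uint64_t *migtable;  /* buffer allocated and filled by the incoming stream */
};

/* Source side: the stream is written from whatever the live table is. */
static void model_pre_save(struct tce_table_model *t)
{
    t->migtable = t->table;
}

/* Destination side: returns 0 on success, -1 on allocation failure. */
static int model_post_load(struct tce_table_model *t)
{
    if (!t->migtable) {
        return 0;  /* nothing was transferred for this stub table */
    }
    if (t->enabled) {
        if (!t->table) {
            /* Assumption: allocating here models enabling the table. */
            t->table = calloc(t->nb_table, sizeof(t->table[0]));
            if (!t->table) {
                return -1;
            }
        }
        memcpy(t->table, t->migtable, t->nb_table * sizeof(t->table[0]));
    }
    /* Deviation from the patch: always release the transfer buffer. */
    free(t->migtable);
    t->migtable = NULL;
    return 0;
}
```

A stub table that was never enabled goes through both hooks with NULL
pointers and is left untouched.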

The "ddw" property is enabled by default on a PHB, but for compatibility
the pseries-2.3 machine and older disable it.

This implements DDW for VFIO. Host kernel support is required.
This adds a "levels" property to the PHB to control the number of levels
in the actual TCE table allocated by the host kernel; the default value
of 0 tells QEMU to calculate the correct value. Current hardware
supports up to 5 levels.
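
For the "levels" = 0 case, the calculation can be sketched as below. This
mirrors spapr_phb_vfio_levels() from this patch, but the function name is
hypothetical and the host page size is passed as a parameter (an assumption
made so that the example does not depend on getpagesize()).

```c
#include <stdint.h>

/*
 * Sketch only: pick the number of TCE table levels from the number of
 * 8-byte TCE entries. host_page_size is a parameter here (assumption);
 * the patch itself uses getpagesize().
 */
static int ddw_levels_for_entries(uint64_t entries, uint64_t host_page_size)
{
    /* Number of host pages needed to hold a flat table of that many TCEs. */
    uint64_t pages = (entries * sizeof(uint64_t)) / host_page_size;

    if (pages <= 64) {
        return 1;
    } else if (pages <= 64 * 64) {
        return 2;
    } else if (pages <= 64 * 64 * 64) {
        return 3;
    }
    return 4;
}
```

For example, a 4GB window with 64K IOMMU pages needs 65536 TCEs, which is
512KB of table and fits in a single level on a host with 64K pages.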

Existing Linux guests try to create one additional huge DMA window
with 64K or 16MB pages and map the entire guest RAM to it. If this
succeeds, the guest switches to dma_direct_ops and never calls TCE
hypercalls (H_PUT_TCE,...) again. This enables VFIO devices to use the
entire RAM and not waste time on map/unmap later.
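
To make the sizing concrete, the arithmetic can be sketched roughly as
follows: take the largest page size present in the page size mask, round
RAM up to a power of two for the window, and count the TCEs needed. The
helper names are hypothetical; the mask layout (bit N set means 2^N byte
pages) follows sphb->page_size_mask in this patch.

```c
#include <stdint.h>

/* Largest page shift set in the mask (bit N => 2^N byte pages); 0 if empty. */
static unsigned ddw_biggest_page_shift(uint64_t page_size_mask)
{
    unsigned shift = 0;

    for (unsigned i = 0; i < 64; i++) {
        if (page_size_mask & (1ULL << i)) {
            shift = i;
        }
    }
    return shift;
}

/* Smallest shift with (1 << shift) >= ram_size; models pow2ceil(ram_size). */
static unsigned ddw_window_shift(uint64_t ram_size)
{
    unsigned shift = 0;

    while (shift < 63 && (1ULL << shift) < ram_size) {
        shift++;
    }
    return shift;
}

/* TCEs needed to cover the window; 0 if the window is smaller than a page. */
static uint64_t ddw_nb_tces(unsigned window_shift, unsigned page_shift)
{
    if (window_shift < page_shift) {
        return 0;
    }
    return 1ULL << (window_shift - page_shift);
}
```

With 6GB of guest RAM and 16MB pages this gives an 8GB window
(window_shift = 33) covered by 512 TCEs; with 64K pages the same window
needs 131072 TCEs.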

This adds 4 RTAS handlers:
* ibm,query-pe-dma-window
* ibm,create-pe-dma-window
* ibm,remove-pe-dma-window
* ibm,reset-pe-dma-window
These are registered from a type_init() callback.

These RTAS handlers are implemented in a separate file to avoid polluting
spapr_iommu.c with PCI.
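
RTAS arguments and return values are 32-bit cells, so 64-bit quantities
such as the BUID (argument cells 1 and 2) and the bus offset of a created
window (return cells 2 and 3) are split into a high and a low cell. A small
sketch of that packing, with hypothetical helper names that are not part
of QEMU:

```c
#include <stdint.h>

/* Combine a high and a low 32-bit cell into one 64-bit value. */
static uint64_t rtas_cells_to_u64(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Split a 64-bit value into its high and low 32-bit cells. */
static void rtas_u64_to_cells(uint64_t val, uint32_t *hi, uint32_t *lo)
{
    *hi = (uint32_t)(val >> 32);
    *lo = (uint32_t)(val & 0xffffffffULL);
}
```

For the default 64bit window offset SPAPR_PCI_DMA64_START
(0x8000000000000000), the high cell is 0x80000000 and the low cell is 0.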

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v6:
* rework as there is no more special device for VFIO PHB

v5:
* total rework
* enabled for machines >2.3
* fixed migration
* merged rtas handlers here

v4:
* reset handler is back in generalized form

v3:
* removed reset
* windows_num is now 1 or bigger rather than 0-based value and it is only
changed in PHB code, not in RTAS
* added page mask check in create()
* added SPAPR_PCI_DDW_MAX_WINDOWS to track how many windows are already
created

v2:
* tested on hacked emulated E1000
* implemented DDW reset on the PHB reset
* spapr_pci_ddw_remove/spapr_pci_ddw_reset are public for reuse by VFIO
---
 hw/ppc/Makefile.objs        |   3 +
 hw/ppc/spapr.c              |  10 +-
 hw/ppc/spapr_iommu.c        |  35 +++++-
 hw/ppc/spapr_pci.c          |  66 ++++++++--
 hw/ppc/spapr_pci_vfio.c     |  80 ++++++++++++
 hw/ppc/spapr_rtas_ddw.c     | 300 ++++++++++++++++++++++++++++++++++++++++++++
 include/hw/pci-host/spapr.h |  21 ++++
 include/hw/ppc/spapr.h      |  17 ++-
 trace-events                |   4 +
 9 files changed, 521 insertions(+), 15 deletions(-)
 create mode 100644 hw/ppc/spapr_rtas_ddw.c

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index 437955d..c6b344f 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -7,6 +7,9 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o
 ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
 obj-y += spapr_pci_vfio.o
 endif
+ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES), yy)
+obj-y += spapr_rtas_ddw.o
+endif
 # PowerPC 4xx boards
 obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
 obj-y += ppc4xx_pci.o
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index b28209f..fd7fdb3 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1801,7 +1801,15 @@ static const TypeInfo spapr_machine_info = {
     },
 };
 
+#define SPAPR_COMPAT_2_3 \
+        {\
+            .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
+            .property = "ddw",\
+            .value    = stringify(off),\
+        }
+
 #define SPAPR_COMPAT_2_2 \
+        SPAPR_COMPAT_2_3, \
         {\
             .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
             .property = "mem_win_size",\
@@ -1853,7 +1861,7 @@ static const TypeInfo spapr_machine_2_2_info = {
 static void spapr_machine_2_3_class_init(ObjectClass *oc, void *data)
 {
     static GlobalProperty compat_props[] = {
-        SPAPR_COMPAT_2_2,
+        SPAPR_COMPAT_2_3,
         { /* end of list */ }
     };
     MachineClass *mc = MACHINE_CLASS(oc);
diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 245534f..df4c72d 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -90,6 +90,15 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
     return ret;
 }
 
+static void spapr_tce_table_pre_save(void *opaque)
+{
+    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
+
+    tcet->migtable = tcet->table;
+}
+
+static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
+
 static int spapr_tce_table_post_load(void *opaque, int version_id)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
@@ -98,22 +107,42 @@ static int spapr_tce_table_post_load(void *opaque, int version_id)
         spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
     }
 
+    if (!tcet->migtable) {
+        return 0;
+    }
+
+    if (tcet->enabled) {
+        if (!tcet->table) {
+            tcet->enabled = false;
+            spapr_tce_table_do_enable(tcet);
+        }
+        memcpy(tcet->table, tcet->migtable,
+               tcet->nb_table * sizeof(tcet->table[0]));
+        free(tcet->migtable);
+        tcet->migtable = NULL;
+    }
+
     return 0;
 }
 
 static const VMStateDescription vmstate_spapr_tce_table = {
     .name = "spapr_iommu",
-    .version_id = 2,
+    .version_id = 3,
     .minimum_version_id = 2,
+    .pre_save = spapr_tce_table_pre_save,
     .post_load = spapr_tce_table_post_load,
     .fields      = (VMStateField []) {
         /* Sanity check */
         VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
-        VMSTATE_UINT32_EQUAL(nb_table, sPAPRTCETable),
 
         /* IOMMU state */
+        VMSTATE_BOOL_V(enabled, sPAPRTCETable, 3),
+        VMSTATE_UINT64_V(bus_offset, sPAPRTCETable, 3),
+        VMSTATE_UINT32_V(page_shift, sPAPRTCETable, 3),
+        VMSTATE_UINT32(nb_table, sPAPRTCETable),
         VMSTATE_BOOL(bypass, sPAPRTCETable),
-        VMSTATE_VARRAY_UINT32(table, sPAPRTCETable, nb_table, 0, vmstate_info_uint64, uint64_t),
+        VMSTATE_VARRAY_UINT32_ALLOC(migtable, sPAPRTCETable, nb_table, 0,
+                                    vmstate_info_uint64, uint64_t),
 
         VMSTATE_END_OF_LIST()
     },
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index d097cce..d3d8f12 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -849,15 +849,17 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         sphb->lsi_table[i].irq = irq;
     }
 
-    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
-    if (!tcet) {
-            error_setg(errp, "failed to create TCE table");
+    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
+        tcet = spapr_tce_new_table(DEVICE(sphb),
+                                   SPAPR_PCI_LIOBN(sphb->index, i));
+        if (!tcet) {
+            error_setg(errp, "spapr_tce_new_table failed");
             return;
+        }
+        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
+                                            spapr_tce_get_iommu(tcet), 0);
     }
 
-    memory_region_add_subregion(&sphb->iommu_root, 0,
-                                spapr_tce_get_iommu(tcet));
-
     sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
 }
 
@@ -867,6 +869,9 @@ static int spapr_phb_dma_capabilities_update(sPAPRPHBState *sphb)
 
     sphb->dma32_window_start = 0;
     sphb->dma32_window_size = SPAPR_PCI_DMA32_SIZE;
+    sphb->windows_supported = SPAPR_PCI_DMA_MAX_WINDOWS;
+    sphb->page_size_mask = (1 << 12) | (1 << 16) | (1 << 24);
+    sphb->dma64_window_size = pow2ceil(ram_size);
 
     ret = spapr_phb_vfio_dma_capabilities_update(sphb);
     sphb->has_vfio = (ret == 0);
@@ -874,12 +879,29 @@ static int spapr_phb_dma_capabilities_update(sPAPRPHBState *sphb)
     return 0;
 }
 
-static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
-                                     uint32_t liobn, uint32_t page_shift,
-                                     uint64_t window_size)
+int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
+                              uint32_t liobn, uint32_t page_shift,
+                              uint64_t window_size)
 {
     uint64_t bus_offset = sphb->dma32_window_start;
     sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
+    int ret = -1;
+
+    if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
+        return -1;
+    }
+
+    if (sphb->ddw_enabled) {
+        if (sphb->has_vfio) {
+            ret = spapr_phb_vfio_dma_init_window(sphb,
+                                                 page_shift, window_size,
+                                                 &bus_offset);
+        }
+
+        if (ret && SPAPR_PCI_DMA_WINDOW_NUM(liobn)) {
+            bus_offset = SPAPR_PCI_DMA64_START;
+        }
+    }
 
     spapr_tce_table_enable(tcet, bus_offset, page_shift,
                            window_size >> page_shift,
@@ -891,9 +913,14 @@ static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
 int spapr_phb_dma_remove_window(sPAPRPHBState *sphb,
                                 sPAPRTCETable *tcet)
 {
+    int ret = 0;
+
+    if (sphb->has_vfio && sphb->ddw_enabled) {
+        ret = spapr_phb_vfio_dma_remove_window(sphb, tcet);
+    }
     spapr_tce_table_disable(tcet);
 
-    return 0;
+    return ret;
 }
 
 static int spapr_phb_disable_dma_windows(Object *child, void *opaque)
@@ -950,6 +977,8 @@ static Property spapr_phb_properties[] = {
     DEFINE_PROP_UINT64("io_win_addr", sPAPRPHBState, io_win_addr, -1),
     DEFINE_PROP_UINT64("io_win_size", sPAPRPHBState, io_win_size,
                        SPAPR_PCI_IO_WIN_SIZE),
+    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
+    DEFINE_PROP_UINT8("levels", sPAPRPHBState, levels, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -1140,6 +1169,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     uint32_t interrupt_map_mask[] = {
         cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
     uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
+    uint32_t ddw_applicable[] = {
+        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
+        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
+        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
+    };
+    uint32_t ddw_extensions[] = {
+        cpu_to_be32(1),
+        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
+    };
     sPAPRTCETable *tcet;
 
     /* Start populating the FDT */
@@ -1170,6 +1208,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
     _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
 
+    /* Dynamic DMA window */
+    if (phb->ddw_enabled) {
+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
+                         sizeof(ddw_applicable)));
+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
+                         &ddw_extensions, sizeof(ddw_extensions)));
+    }
+
     /* Build the interrupt-map, this must matches what is done
      * in pci_spapr_map_irq
      */
diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
index 6f91b39..7372d91 100644
--- a/hw/ppc/spapr_pci_vfio.c
+++ b/hw/ppc/spapr_pci_vfio.c
@@ -41,6 +41,86 @@ int spapr_phb_vfio_dma_capabilities_update(sPAPRPHBState *sphb)
     sphb->dma32_window_start = info.dma32_window_start;
     sphb->dma32_window_size = info.dma32_window_size;
 
+    if (sphb->ddw_enabled && (info.flags & VFIO_IOMMU_SPAPR_INFO_DDW)) {
+        sphb->windows_supported = info.ddw.max_dynamic_windows_supported;
+        sphb->page_size_mask = info.ddw.pgsizes;
+        sphb->dma64_window_size = pow2ceil(ram_size);
+        sphb->max_levels = info.ddw.levels;
+    } else {
+        /* If VFIO_IOMMU_SPAPR_INFO_DDW is not set, disable DDW */
+        sphb->ddw_enabled = false;
+    }
+
+    return ret;
+}
+
+static int spapr_phb_vfio_levels(uint32_t entries)
+{
+    unsigned pages = (entries * sizeof(uint64_t)) / getpagesize();
+    int levels;
+
+    if (pages <= 64) {
+        levels = 1;
+    } else if (pages <= 64*64) {
+        levels = 2;
+    } else if (pages <= 64*64*64) {
+        levels = 3;
+    } else {
+        levels = 4;
+    }
+
+    return levels;
+}
+
+int spapr_phb_vfio_dma_init_window(sPAPRPHBState *sphb,
+                                   uint32_t page_shift,
+                                   uint64_t window_size,
+                                   uint64_t *bus_offset)
+{
+    int ret;
+    struct vfio_iommu_spapr_tce_create create = {
+        .argsz = sizeof(create),
+        .page_shift = page_shift,
+        .window_size = window_size,
+        .levels = sphb->levels,
+        .start_addr = 0,
+    };
+
+    /*
+     * Dynamic windows are supported, that means that there is no
+     * pre-created window and we have to create one.
+     */
+    if (!create.levels) {
+        create.levels = spapr_phb_vfio_levels(create.window_size >>
+                                              page_shift);
+    }
+
+    if (create.levels > sphb->max_levels) {
+        return -EINVAL;
+    }
+
+    ret = vfio_container_ioctl(&sphb->iommu_as,
+                               VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
+    if (ret) {
+        return ret;
+    }
+    *bus_offset = create.start_addr;
+
+    return 0;
+}
+
+int spapr_phb_vfio_dma_remove_window(sPAPRPHBState *sphb,
+                                            sPAPRTCETable *tcet)
+{
+    struct vfio_iommu_spapr_tce_remove remove = {
+        .argsz = sizeof(remove),
+        .start_addr = tcet->bus_offset
+    };
+    int ret;
+
+    ret = vfio_container_ioctl(&sphb->iommu_as,
+                               VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
+
     return ret;
 }
 
diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
new file mode 100644
index 0000000..7ab7572
--- /dev/null
+++ b/hw/ppc/spapr_rtas_ddw.c
@@ -0,0 +1,300 @@
+/*
+ * QEMU sPAPR Dynamic DMA windows support
+ *
+ * Copyright (c) 2014 Alexey Kardashevskiy, IBM Corporation.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License,
+ *  or (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/error-report.h"
+#include "hw/ppc/spapr.h"
+#include "hw/pci-host/spapr.h"
+#include "trace.h"
+
+static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
+{
+    sPAPRTCETable *tcet;
+
+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
+    if (tcet && tcet->enabled) {
+        ++*(unsigned *)opaque;
+    }
+    return 0;
+}
+
+static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
+{
+    unsigned ret = 0;
+
+    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
+
+    return ret;
+}
+
+static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
+{
+    sPAPRTCETable *tcet;
+
+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
+    if (tcet && !tcet->enabled) {
+        *(uint32_t *)opaque = tcet->liobn;
+        return 1;
+    }
+    return 0;
+}
+
+static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
+{
+    uint32_t liobn = 0;
+
+    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
+
+    return liobn;
+}
+
+static uint32_t spapr_query_mask(struct ppc_one_seg_page_size *sps,
+                                 uint64_t page_mask)
+{
+    int i, j;
+    uint32_t mask = 0;
+    const struct { int shift; uint32_t mask; } masks[] = {
+        { 12, RTAS_DDW_PGSIZE_4K },
+        { 16, RTAS_DDW_PGSIZE_64K },
+        { 24, RTAS_DDW_PGSIZE_16M },
+        { 25, RTAS_DDW_PGSIZE_32M },
+        { 26, RTAS_DDW_PGSIZE_64M },
+        { 27, RTAS_DDW_PGSIZE_128M },
+        { 28, RTAS_DDW_PGSIZE_256M },
+        { 34, RTAS_DDW_PGSIZE_16G },
+    };
+
+    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
+        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
+            if ((sps[i].page_shift == masks[j].shift) &&
+                    (page_mask & (1ULL << masks[j].shift))) {
+                mask |= masks[j].mask;
+            }
+        }
+    }
+
+    return mask;
+}
+
+static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
+                                         sPAPREnvironment *spapr,
+                                         uint32_t token, uint32_t nargs,
+                                         target_ulong args,
+                                         uint32_t nret, target_ulong rets)
+{
+    CPUPPCState *env = &cpu->env;
+    sPAPRPHBState *sphb;
+    uint64_t buid;
+    uint32_t avail, addr, pgmask = 0;
+    unsigned current;
+
+    if ((nargs != 3) || (nret != 5)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    current = spapr_phb_get_active_win_num(sphb);
+    avail = (sphb->windows_supported > current) ?
+            (sphb->windows_supported - current) : 0;
+
+    /* Work out supported page masks */
+    pgmask = spapr_query_mask(env->sps.sps, sphb->page_size_mask);
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    rtas_st(rets, 1, avail);
+
+    /*
+     * This is "Largest contiguous block of TCEs allocated specifically
+     * for (that is, are reserved for) this PE".
+     * Return the maximum number as if all RAM was in 4K pages.
+     */
+    rtas_st(rets, 2, sphb->dma64_window_size >> SPAPR_TCE_PAGE_SHIFT);
+    rtas_st(rets, 3, pgmask);
+    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
+
+    trace_spapr_iommu_ddw_query(buid, addr, avail, sphb->dma64_window_size,
+                                pgmask);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
+                                          sPAPREnvironment *spapr,
+                                          uint32_t token, uint32_t nargs,
+                                          target_ulong args,
+                                          uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    sPAPRTCETable *tcet = NULL;
+    uint32_t addr, page_shift, window_shift, liobn;
+    uint64_t buid;
+    long ret;
+
+    if ((nargs != 5) || (nret != 4)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    page_shift = rtas_ld(args, 3);
+    window_shift = rtas_ld(args, 4);
+    liobn = spapr_phb_get_free_liobn(sphb);
+
+    if (!liobn || !(sphb->page_size_mask & (1ULL << page_shift))) {
+        goto hw_error_exit;
+    }
+
+    ret = spapr_phb_dma_init_window(sphb, liobn, page_shift,
+                                    1ULL << window_shift);
+    tcet = spapr_tce_find_by_liobn(liobn);
+    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
+                                 1ULL << window_shift,
+                                 tcet ? tcet->bus_offset : 0xbaadf00d,
+                                 liobn, ret);
+    if (ret || !tcet) {
+        goto hw_error_exit;
+    }
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    rtas_st(rets, 1, liobn);
+    rtas_st(rets, 2, tcet->bus_offset >> 32);
+    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
+
+    return;
+
+hw_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
+                                          sPAPREnvironment *spapr,
+                                          uint32_t token, uint32_t nargs,
+                                          target_ulong args,
+                                          uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    sPAPRTCETable *tcet;
+    uint32_t liobn;
+    long ret;
+
+    if ((nargs != 1) || (nret != 1)) {
+        goto param_error_exit;
+    }
+
+    liobn = rtas_ld(args, 0);
+    tcet = spapr_tce_find_by_liobn(liobn);
+    if (!tcet) {
+        goto param_error_exit;
+    }
+
+    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    ret = spapr_phb_dma_remove_window(sphb, tcet);
+    trace_spapr_iommu_ddw_remove(liobn, ret);
+    if (ret) {
+        goto hw_error_exit;
+    }
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    return;
+
+hw_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
+                                         sPAPREnvironment *spapr,
+                                         uint32_t token, uint32_t nargs,
+                                         target_ulong args,
+                                         uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    uint64_t buid;
+    uint32_t addr;
+    long ret;
+
+    if ((nargs != 3) || (nret != 1)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    ret = spapr_phb_dma_reset(sphb);
+    trace_spapr_iommu_ddw_reset(buid, addr, ret);
+    if (ret) {
+        goto hw_error_exit;
+    }
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+
+    return;
+
+hw_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void spapr_rtas_ddw_init(void)
+{
+    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
+                        "ibm,query-pe-dma-window",
+                        rtas_ibm_query_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
+                        "ibm,create-pe-dma-window",
+                        rtas_ibm_create_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
+                        "ibm,remove-pe-dma-window",
+                        rtas_ibm_remove_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
+                        "ibm,reset-pe-dma-window",
+                        rtas_ibm_reset_pe_dma_window);
+}
+
+type_init(spapr_rtas_ddw_init)
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 484291c..1d2ea8d 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -87,6 +87,12 @@ struct sPAPRPHBState {
     uint32_t dma32_window_size;
     bool has_vfio;
     int32_t iommugroupid; /* obsolete */
+    bool ddw_enabled;
+    uint32_t windows_supported;
+    uint64_t page_size_mask;
+    uint64_t dma64_window_size;
+    uint8_t max_levels;
+    uint8_t levels;
 
     QLIST_ENTRY(sPAPRPHBState) list;
 };
@@ -109,6 +115,12 @@ struct sPAPRPHBState {
 
 #define SPAPR_PCI_DMA32_SIZE         0x40000000
 
+/* Default 64bit dynamic window offset */
+#define SPAPR_PCI_DMA64_START        0x8000000000000000ULL
+
+/* Maximum allowed number of DMA windows for emulated PHB */
+#define SPAPR_PCI_DMA_MAX_WINDOWS    2
+
 static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb, int pin)
 {
     return xics_get_qirq(spapr->icp, phb->lsi_table[pin].irq);
@@ -127,11 +139,20 @@ void spapr_pci_rtas_init(void);
 sPAPRPHBState *spapr_pci_find_phb(sPAPREnvironment *spapr, uint64_t buid);
 PCIDevice *spapr_pci_find_dev(sPAPREnvironment *spapr, uint64_t buid,
                               uint32_t config_addr);
+int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
+                              uint32_t liobn, uint32_t page_shift,
+                              uint64_t window_size);
 int spapr_phb_dma_remove_window(sPAPRPHBState *sphb,
                                 sPAPRTCETable *tcet);
 int spapr_phb_dma_reset(sPAPRPHBState *sphb);
 
 int spapr_phb_vfio_dma_capabilities_update(sPAPRPHBState *sphb);
+int spapr_phb_vfio_dma_init_window(sPAPRPHBState *sphb,
+                                   uint32_t page_shift,
+                                   uint64_t window_size,
+                                   uint64_t *bus_offset);
+int spapr_phb_vfio_dma_remove_window(sPAPRPHBState *sphb,
+                                     sPAPRTCETable *tcet);
 int spapr_phb_vfio_eeh_set_option(sPAPRPHBState *sphb,
                                   unsigned int addr, int option);
 int spapr_phb_vfio_eeh_get_state(sPAPRPHBState *sphb, int *state);
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index c8ac03f..873c661 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -381,6 +381,16 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
 #define RTAS_OUT_NOT_SUPPORTED      -3
 #define RTAS_OUT_NOT_AUTHORIZED     -9002
 
+/* DDW pagesize mask values from ibm,query-pe-dma-window */
+#define RTAS_DDW_PGSIZE_4K       0x01
+#define RTAS_DDW_PGSIZE_64K      0x02
+#define RTAS_DDW_PGSIZE_16M      0x04
+#define RTAS_DDW_PGSIZE_32M      0x08
+#define RTAS_DDW_PGSIZE_64M      0x10
+#define RTAS_DDW_PGSIZE_128M     0x20
+#define RTAS_DDW_PGSIZE_256M     0x40
+#define RTAS_DDW_PGSIZE_16G      0x80
+
 /* RTAS tokens */
 #define RTAS_TOKEN_BASE      0x2000
 
@@ -422,8 +432,12 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
 #define RTAS_IBM_SET_SLOT_RESET                 (RTAS_TOKEN_BASE + 0x23)
 #define RTAS_IBM_CONFIGURE_PE                   (RTAS_TOKEN_BASE + 0x24)
 #define RTAS_IBM_SLOT_ERROR_DETAIL              (RTAS_TOKEN_BASE + 0x25)
+#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x26)
+#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x27)
+#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x28)
+#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x29)
 
-#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x26)
+#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x2A)
 
 /* RTAS ibm,get-system-parameter token values */
 #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
@@ -504,6 +518,7 @@ struct sPAPRTCETable {
     uint64_t bus_offset;
     uint32_t page_shift;
     uint64_t *table;
+    uint64_t *migtable;
     bool bypass;
     bool vfio_accel;
     int fd;
diff --git a/trace-events b/trace-events
index 2739140..fd8ea7a 100644
--- a/trace-events
+++ b/trace-events
@@ -1344,6 +1344,10 @@ spapr_iommu_pci_indirect(uint64_t liobn, uint64_t ioba, uint64_t tce, uint64_t i
 spapr_iommu_pci_stuff(uint64_t liobn, uint64_t ioba, uint64_t tce_value, uint64_t npages, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcevalue=0x%"PRIx64" npages=%"PRId64" ret=%"PRId64
 spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, unsigned pgsize) "liobn=%"PRIx64" 0x%"PRIx64" -> 0x%"PRIx64" perm=%u mask=%x"
 spapr_iommu_new_table(uint64_t liobn, void *tcet, void *table, int fd) "liobn=%"PRIx64" tcet=%p table=%p fd=%d"
+spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32
+spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, unsigned long long pg_size, unsigned long long req_size, uint64_t start, uint32_t liobn, long ret) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%llx, requested=0x%llx, start addr=%"PRIx64", liobn=%"PRIx32", ret = %ld"
+spapr_iommu_ddw_remove(uint32_t liobn, long ret) "liobn=%"PRIx32", ret = %ld"
+spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr, long ret) "buid=%"PRIx64" addr=%"PRIx32", ret = %ld"
 
 # hw/ppc/ppc.c
 ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
-- 
2.0.0

* [Qemu-devel] [PATCH qemu v7 14/14] vfio: Enable DDW ioctls to VFIO IOMMU driver
  2015-04-25 12:24 [Qemu-devel] [PATCH qemu v7 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (12 preceding siblings ...)
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 13/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
@ 2015-04-25 12:24 ` Alexey Kardashevskiy
  2015-05-05 12:50   ` David Gibson
  2015-05-05  9:30 ` [Qemu-devel] [PATCH qemu v7 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
  14 siblings, 1 reply; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-04-25 12:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

This enables DDW RTAS-related ioctls in VFIO.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/vfio/common.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 9e3e0b0..f915127 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -830,6 +830,8 @@ int vfio_container_ioctl(AddressSpace *as,
     case VFIO_CHECK_EXTENSION:
     case VFIO_IOMMU_SPAPR_TCE_GET_INFO:
     case VFIO_EEH_PE_OP:
+    case VFIO_IOMMU_SPAPR_TCE_CREATE:
+    case VFIO_IOMMU_SPAPR_TCE_REMOVE:
         break;
     default:
         /* Return an error on unknown requests */
-- 
2.0.0

* Re: [Qemu-devel] [PATCH qemu v7 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW)
  2015-04-25 12:24 [Qemu-devel] [PATCH qemu v7 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (13 preceding siblings ...)
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 14/14] vfio: Enable DDW ioctls to VFIO IOMMU driver Alexey Kardashevskiy
@ 2015-05-05  9:30 ` Alexey Kardashevskiy
  14 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-05  9:30 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alex Williamson, qemu-ppc, Alexander Graf, David Gibson

On 04/25/2015 10:24 PM, Alexey Kardashevskiy wrote:
> (cut-n-paste from kernel patchset)


Anyone, ping? :)


> Each Partitionable Endpoint (IOMMU group) has an address range on a PCI bus
> where devices are allowed to do DMA. These ranges are called DMA windows.
> By default, there is a single DMA window, 1 or 2GB big, mapped at zero
> on a PCI bus.
>
> PAPR defines a DDW RTAS API which allows pseries guests
> to query the hypervisor about DDW support and capabilities (page size mask
> for now). A pseries guest may request additional DMA windows (besides
> the default one) using this RTAS API.
> The existing pseries Linux guests request an additional window as big as
> the guest RAM and map the entire guest window which effectively creates
> direct mapping of the guest memory to a PCI bus.
>
> This patchset reworks PPC64 IOMMU code and adds necessary structures
> to support big windows.
>
> Once a Linux guest discovers the presence of DDW, it does:
> 1. query hypervisor about number of available windows and page size masks;
> 2. create a window with the biggest possible page size (today 4K/64K/16M);
> 3. map the entire guest RAM via H_PUT_TCE* hypercalls;
> 4. switch dma_ops to direct_dma_ops on the selected PE.
>
> Once this is done, H_PUT_TCE is not called anymore for 64bit devices and
> the guest does not waste time on DMA map/unmap operations.
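
A rough illustration of steps 2 and 3 above, not code from this series.
The helper names ddw_largest_page_shift() and ddw_tce_count() are made up
for this sketch, and it assumes a mask where bit N set means "2^N-byte
pages are supported", which is how the emulated PHB fills page_size_mask
in patch 13/14. It picks the biggest page size and counts the TCEs needed
to cover a window:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: return the page shift of the largest page size
 * in a mask where bit N set means 2^N-byte pages are supported.
 */
static unsigned ddw_largest_page_shift(uint64_t page_size_mask)
{
    unsigned shift = 0;

    assert(page_size_mask != 0);
    while (page_size_mask > 1) {
        page_size_mask >>= 1;
        shift++;
    }
    return shift;
}

/*
 * Hypothetical helper: number of TCEs needed to map window_size bytes
 * with pages of (1 << page_shift) bytes.
 */
static uint64_t ddw_tce_count(uint64_t window_size, unsigned page_shift)
{
    return window_size >> page_shift;
}
```

With a 4K/64K/16M mask this selects 16M pages, so a 4GB window needs 256
TCEs rather than the 1048576 it would take with 4K pages.
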
>
> Note that 32bit devices won't use DDW and will keep using the default
> DMA window so KVM optimizations will be required (to be posted later).
>
> This patchset adds DDW support for pseries. The host kernel changes are
> required, posted as:
>
> [PATCH kernel v9 00/32] powerpc/iommu/vfio: Enable Dynamic DMA windows
>
> This patchset is based on git://github.com/dgibson/qemu.git spapr-next branch.
> This is also pushed to git@github.com:aik/qemu.git
>   + a64ff6f...64ac9a4 64ac9a4 -> vfio-for-github (forced update)
>
> Please comment. Thanks!
>
> Changes:
> v7:
> * bunch of cleanups, renames after David+Thomas+Michael review
> * patches are reorganized and those which do not need the host kernel headers
> update are put first and can be pulled if these are good enough :)
>
> v6:
> * spapr-pci-vfio-host-bridge is now a synonym of spapr-pci-host-bridge -
> same PHB can host emulated and VFIO devices
> * changed patches order
> * lot of small changes
>
> v5:
> * TCE tables got "enabled" state and are persistent, i.e. not recreated
> every reboot
> * added v2 of SPAPR_TCE_IOMMU
> * fixed migration for emulated PHB with enabled DDW
> * huge pile of other changes
>
> v4:
> * reimplemented the whole thing
> * machine reset and ddw-reset RTAS call both remove all TCE tables and
> create the default one
> * IOMMU group id is not needed to use VFIO PHB anymore, multiple groups
> are supported on the same VFIO container and virtual PHB
>
> v3:
> * removed "reset" from API now
> * reworked machine versions
> * applied multiple comments
> * includes David's machine QOM rework as this patchset adds a new machine type
>
> v2:
> * tested on emulated PHB
> * removed "ddw" machine property, now it is PHB property
> * disabled by default
> * defined "pseries-2.2" machine which enables DDW by default
> * fixed reset() and reference counting
>
>
>
>
> Alexey Kardashevskiy (14):
>    spapr_pci: Finish making find_phb()/find_dev() public
>    vmstate: Define VARRAY with VMS_ALLOC
>    vfio: spapr: Move SPAPR-related code to a separate file
>    spapr_pci_vfio: Enable multiple groups per container
>    spapr_pci: Convert finish_realize() to
>      dma_capabilities_update()+dma_init_window()
>    spapr_iommu: Introduce "enabled" state for TCE table
>    spapr_iommu: Add root memory region
>    spapr_pci: Do complete reset of DMA config when resetting PHB
>    spapr_vfio_pci: Remove redundant spapr-pci-vfio-host-bridge
>    linux headers update for DDW on SPAPR
>    vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
>    spapr: Add pseries-2.4 machine
>    spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
>    vfio: Enable DDW ioctls to VFIO IOMMU driver
>
>   hw/ppc/Makefile.objs          |   3 +
>   hw/ppc/spapr.c                |  32 ++++-
>   hw/ppc/spapr_iommu.c          | 144 +++++++++++++------
>   hw/ppc/spapr_pci.c            | 208 ++++++++++++++++++----------
>   hw/ppc/spapr_pci_vfio.c       | 147 ++++++++++++--------
>   hw/ppc/spapr_rtas_ddw.c       | 300 ++++++++++++++++++++++++++++++++++++++++
>   hw/ppc/spapr_vio.c            |   9 +-
>   hw/vfio/Makefile.objs         |   1 +
>   hw/vfio/common.c              | 180 +++++-------------------
>   hw/vfio/spapr.c               | 312 ++++++++++++++++++++++++++++++++++++++++++
>   include/hw/pci-host/spapr.h   |  49 +++++--
>   include/hw/ppc/spapr.h        |  30 +++-
>   include/hw/vfio/vfio-common.h |  16 +++
>   include/hw/vfio/vfio.h        |   2 +-
>   include/migration/vmstate.h   |  10 ++
>   linux-headers/linux/vfio.h    |  88 +++++++++++-
>   trace-events                  |   5 +
>   17 files changed, 1188 insertions(+), 348 deletions(-)
>   create mode 100644 hw/ppc/spapr_rtas_ddw.c
>   create mode 100644 hw/vfio/spapr.c
>


-- 
Alexey

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
@ 2015-05-05 12:28   ` David Gibson
  2015-05-25 15:05   ` Alexey Kardashevskiy
  1 sibling, 0 replies; 56+ messages in thread
From: David Gibson @ 2015-05-05 12:28 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On Sat, Apr 25, 2015 at 10:24:36PM +1000, Alexey Kardashevskiy wrote:
> Currently TCE tables are created once at start and their size never
> changes. We are going to change that by introducing Dynamic DMA windows
> support where DMA configuration may change during the guest execution.
> 
> This changes spapr_tce_new_table() to create an empty stub object. Only
> LIOBN is assigned by the time of creation. It still will be called once
> at the owner object (VIO or PHB) creation.
> 
> This introduces an "enabled" state for TCE table objects with two
> helper functions - spapr_tce_table_enable()/spapr_tce_table_disable().
> spapr_tce_table_enable() receives TCE table parameters and allocates
> a guest view of the TCE table (in the user space or KVM).
> spapr_tce_table_disable() disposes the table.
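
To make the lifecycle concrete, here is a toy model of the "enabled" state
described above. It is only a sketch under assumptions: struct toy_tcet and
the toy_tcet_*() functions are invented stand-ins, not the real
sPAPRTCETable or spapr_tce_table_enable()/spapr_tce_table_disable(), and
the KVM-backed allocation path is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-in for a TCE table object; only the fields relevant here */
struct toy_tcet {
    uint32_t liobn;       /* assigned once, when the stub is created */
    bool enabled;
    uint64_t bus_offset;
    uint32_t page_shift;
    uint32_t nb_table;
    uint64_t *table;      /* NULL while the table is disabled */
};

/* Receive the window parameters and allocate the table */
static int toy_tcet_enable(struct toy_tcet *t, uint64_t bus_offset,
                           uint32_t page_shift, uint32_t nb_table)
{
    assert(t != NULL);
    if (t->enabled || nb_table == 0) {
        return -1;
    }
    t->table = calloc(nb_table, sizeof(t->table[0]));
    if (!t->table) {
        return -1;
    }
    t->bus_offset = bus_offset;
    t->page_shift = page_shift;
    t->nb_table = nb_table;
    t->enabled = true;
    return 0;
}

/* Dispose of the table; the stub object and its LIOBN stay around */
static void toy_tcet_disable(struct toy_tcet *t)
{
    assert(t != NULL);
    if (!t->enabled) {
        return;
    }
    free(t->table);
    t->table = NULL;
    t->bus_offset = 0;
    t->page_shift = 0;
    t->nb_table = 0;
    t->enabled = false;
}
```

The point of the split is that the object (and its LIOBN) outlives any
number of enable/disable cycles, which is what keeps migration workable.
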
> 
> Follow up patches will disable+enable tables on reset (system reset
> or DDW reset).
> 
> No visible change in behaviour is expected except the actual table
> will be reallocated every reset. We might optimize this later.
> 
> The other way to implement this would be to dynamically create/remove
> the TCE table QOM objects but this would make migration impossible
> as migration expects all QOM objects to exist at the receiver
> so we have to have TCE table objects created when migration begins.
> 
> spapr_tce_table_do_enable() is separated from spapr_tce_table_enable()
> as later it will be called at the sPAPRTCETable post-migration stage when
> it has all the properties set after the migration.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v7 07/14] spapr_iommu: Add root memory region
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 07/14] spapr_iommu: Add root memory region Alexey Kardashevskiy
@ 2015-05-05 12:31   ` David Gibson
  0 siblings, 0 replies; 56+ messages in thread
From: David Gibson @ 2015-05-05 12:31 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On Sat, Apr 25, 2015 at 10:24:37PM +1000, Alexey Kardashevskiy wrote:
> We are going to have multiple DMA windows at different offsets on
> a PCI bus. For the sake of migration, we will pre-create as many TCE table
> objects as there are windows supported.
> So we need a way to map windows dynamically onto a PCI bus
> when migration of a table is completed but at this stage a TCE table
> object does not have access to a PHB to ask it to map a DMA window
> backed by just migrated TCE table.
> 
> This adds a "root" memory region (UINT64_MAX long) to the TCE object.
> This new region is mapped on a PCI bus with enabled overlapping as
> there will be one root MR per TCE table, each of them mapped at 0.
> The actual IOMMU memory region is a subregion of the root region and
> a TCE table enables/disables this subregion and maps it at
> the specific offset inside the root MR which is 1:1 mapping of
> a PCI address space.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v7 08/14] spapr_pci: Do complete reset of DMA config when resetting PHB
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 08/14] spapr_pci: Do complete reset of DMA config when resetting PHB Alexey Kardashevskiy
@ 2015-05-05 12:34   ` David Gibson
  0 siblings, 0 replies; 56+ messages in thread
From: David Gibson @ 2015-05-05 12:34 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On Sat, Apr 25, 2015 at 10:24:38PM +1000, Alexey Kardashevskiy wrote:
> On a system reset, DMA configuration has to be reset too. At the moment
> it clears the table content. This is enough for the single table case
> but with DDW, we will also have to disable all DMA windows except
> the default one. Furthermore according to sPAPR, if the guest removed
> the default window and created a huge one at the same zero offset on
> a PCI bus, the reset handler has to recreate the default window with
> the default properties (2GB big, 4K pages).
> 
> This reworks SPAPR PHB code to disable the existing DMA window on reset
> and then configure and enable the default window.
> Without DDW that means that the same window will be disabled and then
> enabled with no other change in behaviour.
> 
> This changes the table creation to do it in one place in PHB (VFIO PHB
> just inherits the behaviour from PHB). The actual table allocation is
> done from the reset handler and this is where dma_init_window() is called.
> 
> This disables all DMA windows on a PHB reset. It does not make any
> difference now as there is just one DMA window but it will later with DDW
> patches.
> 
> This makes spapr_phb_dma_reset() and spapr_phb_dma_remove_window() public
> as these will be used in DDW RTAS "ibm,reset-pe-dma-window" and
> "ibm,remove-pe-dma-window" handlers later; the handlers will reside in
> hw/ppc/spapr_rtas_ddw.c.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v7 13/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 13/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
@ 2015-05-05 12:49   ` David Gibson
  2015-06-18 11:35     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 56+ messages in thread
From: David Gibson @ 2015-05-05 12:49 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On Sat, Apr 25, 2015 at 10:24:43PM +1000, Alexey Kardashevskiy wrote:
> This adds support for the Dynamic DMA Windows (DDW) option defined by
> the SPAPR specification, which allows additional DMA window(s).
> 
> This implements DDW for emulated and VFIO devices. As all TCE root regions
> are mapped at 0 and 64bit long (and actual tables are child regions),
> this replaces memory_region_add_subregion() with _overlap() to make
> QEMU memory API happy.
> 
> This reserves RTAS token numbers for DDW calls.
> 
> This implements helpers to interact with VFIO kernel interface.
> 
> This changes the TCE table migration descriptor to support dynamic
> tables as from now on, PHB will create as many stub TCE table objects
> as PHB can possibly support but not all of them might be initialized at
> the time of migration because DDW might or might not be requested by
> the guest.
> 
> The "ddw" property is enabled by default on a PHB but for compatibility
> the pseries-2.3 machine and older disable it.
> 
> This implements DDW for VFIO. The host kernel support is required.
> This adds a "levels" property to PHB to control the number of levels
> in the actual TCE table allocated by the host kernel, 0 is the default
> value to tell QEMU to calculate the correct value. Current hardware
> supports up to 5 levels.
> 
> The existing Linux guests try creating one additional huge DMA window
> with 64K or 16MB pages and map the entire guest RAM to it. If this succeeds,
> the guest switches to dma_direct_ops and never calls TCE hypercalls
> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
> and not waste time on map/unmap later.
> 
> This adds 4 RTAS handlers:
> * ibm,query-pe-dma-window
> * ibm,create-pe-dma-window
> * ibm,remove-pe-dma-window
> * ibm,reset-pe-dma-window
> These are registered from type_init() callback.
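
For reference, a sketch of how a PHB page size mask could be turned into
the RTAS_DDW_PGSIZE_* bits this patch adds to include/hw/ppc/spapr.h. The
function toy_ddw_pgmask() is a hypothetical name and this is an assumption
about what a query handler needs to do, not a copy of the handler in
spapr_rtas_ddw.c; the input mask is assumed to use bit N for 2^N-byte
pages, as in spapr_phb_dma_capabilities_update().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values as defined by this patch in include/hw/ppc/spapr.h */
#define RTAS_DDW_PGSIZE_4K       0x01
#define RTAS_DDW_PGSIZE_64K      0x02
#define RTAS_DDW_PGSIZE_16M      0x04
#define RTAS_DDW_PGSIZE_32M      0x08
#define RTAS_DDW_PGSIZE_64M      0x10
#define RTAS_DDW_PGSIZE_128M     0x20
#define RTAS_DDW_PGSIZE_256M     0x40
#define RTAS_DDW_PGSIZE_16G      0x80

/* Hypothetical helper: translate a "bit N = 2^N bytes" mask to RTAS bits */
static uint32_t toy_ddw_pgmask(uint64_t page_size_mask)
{
    static const struct {
        unsigned shift;
        uint32_t rtas_bit;
    } map[] = {
        { 12, RTAS_DDW_PGSIZE_4K },
        { 16, RTAS_DDW_PGSIZE_64K },
        { 24, RTAS_DDW_PGSIZE_16M },
        { 25, RTAS_DDW_PGSIZE_32M },
        { 26, RTAS_DDW_PGSIZE_64M },
        { 27, RTAS_DDW_PGSIZE_128M },
        { 28, RTAS_DDW_PGSIZE_256M },
        { 34, RTAS_DDW_PGSIZE_16G },
    };
    uint32_t out = 0;
    size_t i;

    for (i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
        assert(map[i].shift < 64);   /* shift must fit a 64-bit mask */
        if (page_size_mask & (1ULL << map[i].shift)) {
            out |= map[i].rtas_bit;
        }
    }
    return out;
}
```

Page sizes that have no RTAS_DDW_PGSIZE_* counterpart are simply dropped
from the result.
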
> 
> These RTAS handlers are implemented in a separate file to avoid polluting
> spapr_iommu.c with PCI.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
> Changes:
> v6:
> * rework as there is no more special device for VFIO PHB
> 
> v5:
> * total rework
> * enabled for machines >2.3
> * fixed migration
> * merged rtas handlers here
> 
> v4:
> * reset handler is back in generalized form
> 
> v3:
> * removed reset
> * windows_num is now 1 or bigger rather than 0-based value and it is only
> changed in PHB code, not in RTAS
> * added page mask check in create()
> * added SPAPR_PCI_DDW_MAX_WINDOWS to track how many windows are already
> created
> 
> v2:
> * tested on hacked emulated E1000
> * implemented DDW reset on the PHB reset
> * spapr_pci_ddw_remove/spapr_pci_ddw_reset are public for reuse by VFIO
> ---
>  hw/ppc/Makefile.objs        |   3 +
>  hw/ppc/spapr.c              |  10 +-
>  hw/ppc/spapr_iommu.c        |  35 +++++-
>  hw/ppc/spapr_pci.c          |  66 ++++++++--
>  hw/ppc/spapr_pci_vfio.c     |  80 ++++++++++++
>  hw/ppc/spapr_rtas_ddw.c     | 300 ++++++++++++++++++++++++++++++++++++++++++++
>  include/hw/pci-host/spapr.h |  21 ++++
>  include/hw/ppc/spapr.h      |  17 ++-
>  trace-events                |   4 +
>  9 files changed, 521 insertions(+), 15 deletions(-)
>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
> 
> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> index 437955d..c6b344f 100644
> --- a/hw/ppc/Makefile.objs
> +++ b/hw/ppc/Makefile.objs
> @@ -7,6 +7,9 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o
>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>  obj-y += spapr_pci_vfio.o
>  endif
> +ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES), yy)
> +obj-y += spapr_rtas_ddw.o
> +endif
>  # PowerPC 4xx boards
>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>  obj-y += ppc4xx_pci.o
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index b28209f..fd7fdb3 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1801,7 +1801,15 @@ static const TypeInfo spapr_machine_info = {
>      },
>  };
>  
> +#define SPAPR_COMPAT_2_3 \
> +        {\
> +            .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
> +            .property = "ddw",\
> +            .value    = stringify(off),\
> +        }
> +
>  #define SPAPR_COMPAT_2_2 \
> +        SPAPR_COMPAT_2_3, \
>          {\
>              .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
>              .property = "mem_win_size",\
> @@ -1853,7 +1861,7 @@ static const TypeInfo spapr_machine_2_2_info = {
>  static void spapr_machine_2_3_class_init(ObjectClass *oc, void *data)
>  {
>      static GlobalProperty compat_props[] = {
> -        SPAPR_COMPAT_2_2,
> +        SPAPR_COMPAT_2_3,
>          { /* end of list */ }
>      };
>      MachineClass *mc = MACHINE_CLASS(oc);
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 245534f..df4c72d 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -90,6 +90,15 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
>      return ret;
>  }
>  
> +static void spapr_tce_table_pre_save(void *opaque)
> +{
> +    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> +
> +    tcet->migtable = tcet->table;
> +}
> +
> +static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
> +
>  static int spapr_tce_table_post_load(void *opaque, int version_id)
>  {
>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> @@ -98,22 +107,42 @@ static int spapr_tce_table_post_load(void *opaque, int version_id)
>          spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
>      }
>  
> +    if (!tcet->migtable) {

What's the case where migtable will be NULL?  IIUC an old->new
migration will result in the data saved for "table" being loaded into
"migtable".

So "migtable" should only be NULL, when tce->enabled is also false?


> +        return 0;
> +    }
> +
> +    if (tcet->enabled) {
> +        if (!tcet->table) {
> +            tcet->enabled = false;
> +            spapr_tce_table_do_enable(tcet);
> +        }
> +        memcpy(tcet->table, tcet->migtable,
> +               tcet->nb_table * sizeof(tcet->table[0]));
> +        free(tcet->migtable);
> +        tcet->migtable = NULL;
> +    }
> +
>      return 0;
>  }
>  
>  static const VMStateDescription vmstate_spapr_tce_table = {
>      .name = "spapr_iommu",
> -    .version_id = 2,
> +    .version_id = 3,
>      .minimum_version_id = 2,
> +    .pre_save = spapr_tce_table_pre_save,
>      .post_load = spapr_tce_table_post_load,
>      .fields      = (VMStateField []) {
>          /* Sanity check */
>          VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
> -        VMSTATE_UINT32_EQUAL(nb_table, sPAPRTCETable),
>  
>          /* IOMMU state */
> +        VMSTATE_BOOL_V(enabled, sPAPRTCETable, 3),
> +        VMSTATE_UINT64_V(bus_offset, sPAPRTCETable, 3),
> +        VMSTATE_UINT32_V(page_shift, sPAPRTCETable, 3),
> +        VMSTATE_UINT32(nb_table, sPAPRTCETable),
>          VMSTATE_BOOL(bypass, sPAPRTCETable),
> -        VMSTATE_VARRAY_UINT32(table, sPAPRTCETable, nb_table, 0, vmstate_info_uint64, uint64_t),
> +        VMSTATE_VARRAY_UINT32_ALLOC(migtable, sPAPRTCETable, nb_table, 0,
> +                                    vmstate_info_uint64, uint64_t),
>  
>          VMSTATE_END_OF_LIST()
>      },
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index d097cce..d3d8f12 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -849,15 +849,17 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>          sphb->lsi_table[i].irq = irq;
>      }
>  
> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
> -    if (!tcet) {
> -            error_setg(errp, "failed to create TCE table");
> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> +        tcet = spapr_tce_new_table(DEVICE(sphb),
> +                                   SPAPR_PCI_LIOBN(sphb->index, i));
> +        if (!tcet) {
> +            error_setg(errp, "spapr_tce_new_table failed");
>              return;
> +        }
> +        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
> +                                            spapr_tce_get_iommu(tcet), 0);
>      }
>  
> -    memory_region_add_subregion(&sphb->iommu_root, 0,
> -                                spapr_tce_get_iommu(tcet));
> -
>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>  }
>  
> @@ -867,6 +869,9 @@ static int spapr_phb_dma_capabilities_update(sPAPRPHBState *sphb)
>  
>      sphb->dma32_window_start = 0;
>      sphb->dma32_window_size = SPAPR_PCI_DMA32_SIZE;
> +    sphb->windows_supported = SPAPR_PCI_DMA_MAX_WINDOWS;
> +    sphb->page_size_mask = (1 << 12) | (1 << 16) | (1 << 24);
> +    sphb->dma64_window_size = pow2ceil(ram_size);
>  
>      ret = spapr_phb_vfio_dma_capabilities_update(sphb);
>      sphb->has_vfio = (ret == 0);
> @@ -874,12 +879,29 @@ static int spapr_phb_dma_capabilities_update(sPAPRPHBState *sphb)
>      return 0;
>  }
>  
> -static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
> -                                     uint32_t liobn, uint32_t page_shift,
> -                                     uint64_t window_size)
> +int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
> +                              uint32_t liobn, uint32_t page_shift,
> +                              uint64_t window_size)
>  {
>      uint64_t bus_offset = sphb->dma32_window_start;
>      sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
> +    int ret;
> +
> +    if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
> +        return -1;
> +    }
> +
> +    if (sphb->ddw_enabled) {
> +        if (sphb->has_vfio) {
> +            ret = spapr_phb_vfio_dma_init_window(sphb,
> +                                                 page_shift, window_size,
> +                                                 &bus_offset);
> +        }
> +
> +        if (ret && SPAPR_PCI_DMA_WINDOW_NUM(liobn)) {
> +            bus_offset = SPAPR_PCI_DMA64_START;
> +        }
> +    }
>  
>      spapr_tce_table_enable(tcet, bus_offset, page_shift,
>                             window_size >> page_shift,
> @@ -891,9 +913,14 @@ static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
>  int spapr_phb_dma_remove_window(sPAPRPHBState *sphb,
>                                  sPAPRTCETable *tcet)
>  {
> +    int ret = 0;
> +
> +    if (sphb->has_vfio && sphb->ddw_enabled) {
> +        ret = spapr_phb_vfio_dma_remove_window(sphb, tcet);
> +    }
>      spapr_tce_table_disable(tcet);
>  
> -    return 0;
> +    return ret;
>  }
>  
>  static int spapr_phb_disable_dma_windows(Object *child, void *opaque)
> @@ -950,6 +977,8 @@ static Property spapr_phb_properties[] = {
>      DEFINE_PROP_UINT64("io_win_addr", sPAPRPHBState, io_win_addr, -1),
>      DEFINE_PROP_UINT64("io_win_size", sPAPRPHBState, io_win_size,
>                         SPAPR_PCI_IO_WIN_SIZE),
> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
> +    DEFINE_PROP_UINT8("levels", sPAPRPHBState, levels, 0),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> @@ -1140,6 +1169,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      uint32_t interrupt_map_mask[] = {
>          cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
>      uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
> +    uint32_t ddw_applicable[] = {
> +        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
> +        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
> +        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
> +    };
> +    uint32_t ddw_extensions[] = {
> +        cpu_to_be32(1),
> +        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
> +    };
>      sPAPRTCETable *tcet;
>  
>      /* Start populating the FDT */
> @@ -1170,6 +1208,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
>  
> +    /* Dynamic DMA window */
> +    if (phb->ddw_enabled) {
> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
> +                         sizeof(ddw_applicable)));
> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
> +                         &ddw_extensions, sizeof(ddw_extensions)));
> +    }
> +
>      /* Build the interrupt-map, this must matches what is done
>       * in pci_spapr_map_irq
>       */
> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
> index 6f91b39..7372d91 100644
> --- a/hw/ppc/spapr_pci_vfio.c
> +++ b/hw/ppc/spapr_pci_vfio.c
> @@ -41,6 +41,86 @@ int spapr_phb_vfio_dma_capabilities_update(sPAPRPHBState *sphb)
>      sphb->dma32_window_start = info.dma32_window_start;
>      sphb->dma32_window_size = info.dma32_window_size;
>  
> +    if (sphb->ddw_enabled && (info.flags & VFIO_IOMMU_SPAPR_INFO_DDW)) {
> +        sphb->windows_supported = info.ddw.max_dynamic_windows_supported;
> +        sphb->page_size_mask = info.ddw.pgsizes;
> +        sphb->dma64_window_size = pow2ceil(ram_size);
> +        sphb->max_levels = info.ddw.levels;
> +    } else {
> +        /* If VFIO_IOMMU_SPAPR_INFO_DDW is not set, disable DDW */
> +        sphb->ddw_enabled = false;
> +    }
> +
> +    return ret;
> +}
> +
> +static int spapr_phb_vfio_levels(uint32_t entries)
> +{
> +    unsigned pages = (entries * sizeof(uint64_t)) / getpagesize();
> +    int levels;
> +
> +    if (pages <= 64) {
> +        levels = 1;
> +    } else if (pages <= 64*64) {
> +        levels = 2;
> +    } else if (pages <= 64*64*64) {
> +        levels = 3;
> +    } else {
> +        levels = 4;
> +    }
> +
> +    return levels;
> +}
> +
> +int spapr_phb_vfio_dma_init_window(sPAPRPHBState *sphb,
> +                                   uint32_t page_shift,
> +                                   uint64_t window_size,
> +                                   uint64_t *bus_offset)
> +{
> +    int ret;
> +    struct vfio_iommu_spapr_tce_create create = {
> +        .argsz = sizeof(create),
> +        .page_shift = page_shift,
> +        .window_size = window_size,
> +        .levels = sphb->levels,
> +        .start_addr = 0,
> +    };
> +
> +    /*
> +     * Dynamic windows are supported, that means that there is no
> +     * pre-created window and we have to create one.
> +     */
> +    if (!create.levels) {
> +        create.levels = spapr_phb_vfio_levels(create.window_size >>
> +                                              page_shift);
> +    }
> +
> +    if (create.levels > sphb->max_levels) {
> +        return -EINVAL;
> +    }
> +
> +    ret = vfio_container_ioctl(&sphb->iommu_as,
> +                               VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> +    if (ret) {
> +        return ret;
> +    }
> +    *bus_offset = create.start_addr;
> +
> +    return 0;
> +}
> +
> +int spapr_phb_vfio_dma_remove_window(sPAPRPHBState *sphb,
> +                                            sPAPRTCETable *tcet)
> +{
> +    struct vfio_iommu_spapr_tce_remove remove = {
> +        .argsz = sizeof(remove),
> +        .start_addr = tcet->bus_offset
> +    };
> +    int ret;
> +
> +    ret = vfio_container_ioctl(&sphb->iommu_as,
> +                               VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> +
>      return ret;
>  }
>  
> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
> new file mode 100644
> index 0000000..7ab7572
> --- /dev/null
> +++ b/hw/ppc/spapr_rtas_ddw.c
> @@ -0,0 +1,300 @@
> +/*
> + * QEMU sPAPR Dynamic DMA windows support
> + *
> + * Copyright (c) 2014 Alexey Kardashevskiy, IBM Corporation.
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License,
> + *  or (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/error-report.h"
> +#include "hw/ppc/spapr.h"
> +#include "hw/pci-host/spapr.h"
> +#include "trace.h"
> +
> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
> +{
> +    sPAPRTCETable *tcet;
> +
> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> +    if (tcet && tcet->enabled) {
> +        ++*(unsigned *)opaque;
> +    }
> +    return 0;
> +}
> +
> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
> +{
> +    unsigned ret = 0;
> +
> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
> +
> +    return ret;
> +}
> +
> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
> +{
> +    sPAPRTCETable *tcet;
> +
> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> +    if (tcet && !tcet->enabled) {
> +        *(uint32_t *)opaque = tcet->liobn;
> +        return 1;
> +    }
> +    return 0;
> +}
> +
> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
> +{
> +    uint32_t liobn = 0;
> +
> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
> +
> +    return liobn;
> +}
> +
> +static uint32_t spapr_query_mask(struct ppc_one_seg_page_size *sps,
> +                                 uint64_t page_mask)
> +{
> +    int i, j;
> +    uint32_t mask = 0;
> +    const struct { int shift; uint32_t mask; } masks[] = {
> +        { 12, RTAS_DDW_PGSIZE_4K },
> +        { 16, RTAS_DDW_PGSIZE_64K },
> +        { 24, RTAS_DDW_PGSIZE_16M },
> +        { 25, RTAS_DDW_PGSIZE_32M },
> +        { 26, RTAS_DDW_PGSIZE_64M },
> +        { 27, RTAS_DDW_PGSIZE_128M },
> +        { 28, RTAS_DDW_PGSIZE_256M },
> +        { 34, RTAS_DDW_PGSIZE_16G },
> +    };
> +
> +    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
> +        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
> +            if ((sps[i].page_shift == masks[j].shift) &&
> +                    (page_mask & (1ULL << masks[j].shift))) {
> +                mask |= masks[j].mask;
> +            }
> +        }
> +    }
> +
> +    return mask;
> +}
> +
> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
> +                                         sPAPREnvironment *spapr,
> +                                         uint32_t token, uint32_t nargs,
> +                                         target_ulong args,
> +                                         uint32_t nret, target_ulong rets)
> +{
> +    CPUPPCState *env = &cpu->env;
> +    sPAPRPHBState *sphb;
> +    uint64_t buid;
> +    uint32_t avail, addr, pgmask = 0;
> +    unsigned current;
> +
> +    if ((nargs != 3) || (nret != 5)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    current = spapr_phb_get_active_win_num(sphb);
> +    avail = (sphb->windows_supported > current) ?
> +            (sphb->windows_supported - current) : 0;
> +
> +    /* Work out supported page masks */
> +    pgmask = spapr_query_mask(env->sps.sps, sphb->page_size_mask);
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    rtas_st(rets, 1, avail);
> +
> +    /*
> +     * This is "Largest contiguous block of TCEs allocated specifically
> +     * for (that is, are reserved for) this PE".
> +     * Return the maximum number as if all RAM was in 4K pages.
> +     */
> +    rtas_st(rets, 2, sphb->dma64_window_size >> SPAPR_TCE_PAGE_SHIFT);
> +    rtas_st(rets, 3, pgmask);
> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
> +
> +    trace_spapr_iommu_ddw_query(buid, addr, avail, sphb->dma64_window_size,
> +                                pgmask);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
> +                                          sPAPREnvironment *spapr,
> +                                          uint32_t token, uint32_t nargs,
> +                                          target_ulong args,
> +                                          uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    sPAPRTCETable *tcet = NULL;
> +    uint32_t addr, page_shift, window_shift, liobn;
> +    uint64_t buid;
> +    long ret;
> +
> +    if ((nargs != 5) || (nret != 4)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    page_shift = rtas_ld(args, 3);
> +    window_shift = rtas_ld(args, 4);
> +    liobn = spapr_phb_get_free_liobn(sphb);
> +
> +    if (!liobn || !(sphb->page_size_mask & (1ULL << page_shift))) {
> +        goto hw_error_exit;
> +    }
> +
> +    ret = spapr_phb_dma_init_window(sphb, liobn, page_shift,
> +                                    1ULL << window_shift);
> +    tcet = spapr_tce_find_by_liobn(liobn);
> +    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
> +                                 1ULL << window_shift,
> +                                 tcet ? tcet->bus_offset : 0xbaadf00d,
> +                                 liobn, ret);
> +    if (ret || !tcet) {
> +        goto hw_error_exit;
> +    }
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    rtas_st(rets, 1, liobn);
> +    rtas_st(rets, 2, tcet->bus_offset >> 32);
> +    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
> +
> +    return;
> +
> +hw_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
> +                                          sPAPREnvironment *spapr,
> +                                          uint32_t token, uint32_t nargs,
> +                                          target_ulong args,
> +                                          uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    sPAPRTCETable *tcet;
> +    uint32_t liobn;
> +    long ret;
> +
> +    if ((nargs != 1) || (nret != 1)) {
> +        goto param_error_exit;
> +    }
> +
> +    liobn = rtas_ld(args, 0);
> +    tcet = spapr_tce_find_by_liobn(liobn);
> +    if (!tcet) {
> +        goto param_error_exit;
> +    }
> +
> +    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    ret = spapr_phb_dma_remove_window(sphb, tcet);
> +    trace_spapr_iommu_ddw_remove(liobn, ret);
> +    if (ret) {
> +        goto hw_error_exit;
> +    }
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    return;
> +
> +hw_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
> +                                         sPAPREnvironment *spapr,
> +                                         uint32_t token, uint32_t nargs,
> +                                         target_ulong args,
> +                                         uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    uint64_t buid;
> +    uint32_t addr;
> +    long ret;
> +
> +    if ((nargs != 3) || (nret != 1)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    ret = spapr_phb_dma_reset(sphb);
> +    trace_spapr_iommu_ddw_reset(buid, addr, ret);
> +    if (ret) {
> +        goto hw_error_exit;
> +    }
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +
> +    return;
> +
> +hw_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void spapr_rtas_ddw_init(void)
> +{
> +    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
> +                        "ibm,query-pe-dma-window",
> +                        rtas_ibm_query_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
> +                        "ibm,create-pe-dma-window",
> +                        rtas_ibm_create_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
> +                        "ibm,remove-pe-dma-window",
> +                        rtas_ibm_remove_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
> +                        "ibm,reset-pe-dma-window",
> +                        rtas_ibm_reset_pe_dma_window);
> +}
> +
> +type_init(spapr_rtas_ddw_init)
> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> index 484291c..1d2ea8d 100644
> --- a/include/hw/pci-host/spapr.h
> +++ b/include/hw/pci-host/spapr.h
> @@ -87,6 +87,12 @@ struct sPAPRPHBState {
>      uint32_t dma32_window_size;
>      bool has_vfio;
>      int32_t iommugroupid; /* obsolete */
> +    bool ddw_enabled;
> +    uint32_t windows_supported;
> +    uint64_t page_size_mask;
> +    uint64_t dma64_window_size;
> +    uint8_t max_levels;
> +    uint8_t levels;
>  
>      QLIST_ENTRY(sPAPRPHBState) list;
>  };
> @@ -109,6 +115,12 @@ struct sPAPRPHBState {
>  
>  #define SPAPR_PCI_DMA32_SIZE         0x40000000
>  
> +/* Default 64bit dynamic window offset */
> +#define SPAPR_PCI_DMA64_START        0x8000000000000000ULL
> +
> +/* Maximum allowed number of DMA windows for emulated PHB */
> +#define SPAPR_PCI_DMA_MAX_WINDOWS    2
> +
>  static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb, int pin)
>  {
>      return xics_get_qirq(spapr->icp, phb->lsi_table[pin].irq);
> @@ -127,11 +139,20 @@ void spapr_pci_rtas_init(void);
>  sPAPRPHBState *spapr_pci_find_phb(sPAPREnvironment *spapr, uint64_t buid);
>  PCIDevice *spapr_pci_find_dev(sPAPREnvironment *spapr, uint64_t buid,
>                                uint32_t config_addr);
> +int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
> +                              uint32_t liobn, uint32_t page_shift,
> +                              uint64_t window_size);
>  int spapr_phb_dma_remove_window(sPAPRPHBState *sphb,
>                                  sPAPRTCETable *tcet);
>  int spapr_phb_dma_reset(sPAPRPHBState *sphb);
>  
>  int spapr_phb_vfio_dma_capabilities_update(sPAPRPHBState *sphb);
> +int spapr_phb_vfio_dma_init_window(sPAPRPHBState *sphb,
> +                                   uint32_t page_shift,
> +                                   uint64_t window_size,
> +                                   uint64_t *bus_offset);
> +int spapr_phb_vfio_dma_remove_window(sPAPRPHBState *sphb,
> +                                     sPAPRTCETable *tcet);
>  int spapr_phb_vfio_eeh_set_option(sPAPRPHBState *sphb,
>                                    unsigned int addr, int option);
>  int spapr_phb_vfio_eeh_get_state(sPAPRPHBState *sphb, int *state);
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index c8ac03f..873c661 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -381,6 +381,16 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>  #define RTAS_OUT_NOT_SUPPORTED      -3
>  #define RTAS_OUT_NOT_AUTHORIZED     -9002
>  
> +/* DDW pagesize mask values from ibm,query-pe-dma-window */
> +#define RTAS_DDW_PGSIZE_4K       0x01
> +#define RTAS_DDW_PGSIZE_64K      0x02
> +#define RTAS_DDW_PGSIZE_16M      0x04
> +#define RTAS_DDW_PGSIZE_32M      0x08
> +#define RTAS_DDW_PGSIZE_64M      0x10
> +#define RTAS_DDW_PGSIZE_128M     0x20
> +#define RTAS_DDW_PGSIZE_256M     0x40
> +#define RTAS_DDW_PGSIZE_16G      0x80
> +
>  /* RTAS tokens */
>  #define RTAS_TOKEN_BASE      0x2000
>  
> @@ -422,8 +432,12 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>  #define RTAS_IBM_SET_SLOT_RESET                 (RTAS_TOKEN_BASE + 0x23)
>  #define RTAS_IBM_CONFIGURE_PE                   (RTAS_TOKEN_BASE + 0x24)
>  #define RTAS_IBM_SLOT_ERROR_DETAIL              (RTAS_TOKEN_BASE + 0x25)
> +#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x26)
> +#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x27)
> +#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x28)
> +#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x29)
>  
> -#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x26)
> +#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x2A)
>  
>  /* RTAS ibm,get-system-parameter token values */
>  #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
> @@ -504,6 +518,7 @@ struct sPAPRTCETable {
>      uint64_t bus_offset;
>      uint32_t page_shift;
>      uint64_t *table;
> +    uint64_t *migtable;
>      bool bypass;
>      bool vfio_accel;
>      int fd;
> diff --git a/trace-events b/trace-events
> index 2739140..fd8ea7a 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1344,6 +1344,10 @@ spapr_iommu_pci_indirect(uint64_t liobn, uint64_t ioba, uint64_t tce, uint64_t i
>  spapr_iommu_pci_stuff(uint64_t liobn, uint64_t ioba, uint64_t tce_value, uint64_t npages, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcevalue=0x%"PRIx64" npages=%"PRId64" ret=%"PRId64
>  spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, unsigned pgsize) "liobn=%"PRIx64" 0x%"PRIx64" -> 0x%"PRIx64" perm=%u mask=%x"
>  spapr_iommu_new_table(uint64_t liobn, void *tcet, void *table, int fd) "liobn=%"PRIx64" tcet=%p table=%p fd=%d"
> +spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32
> +spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, unsigned long long pg_size, unsigned long long req_size, uint64_t start, uint32_t liobn, long ret) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%llx, requested=0x%llx, start addr=%"PRIx64", liobn=%"PRIx32", ret = %ld"
> +spapr_iommu_ddw_remove(uint32_t liobn, long ret) "liobn=%"PRIx32", ret = %ld"
> +spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr, long ret) "buid=%"PRIx64" addr=%"PRIx32", ret = %ld"
>  
>  # hw/ppc/ppc.c
>  ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
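
As a side note on spapr_phb_vfio_levels() quoted above: the number of
levels depends only on how many host pages a flat TCE table would occupy,
at 8 bytes per TCE. A minimal standalone sketch of that arithmetic follows.
ddw_levels_for() and its host_page_size parameter are hypothetical (the
patch calls getpagesize() and takes the entry count directly), so treat
this as an illustration rather than the patch's code:

```c
#include <stdint.h>

/*
 * Hypothetical standalone variant of the level selection.
 * host_page_size stands in for getpagesize() so the result does not
 * depend on the machine the example runs on.
 */
static int ddw_levels_for(uint64_t window_size, uint32_t page_shift,
                          uint64_t host_page_size)
{
    /* One 8-byte TCE per IOMMU page in the window */
    uint64_t entries = window_size >> page_shift;
    /* Host pages needed to hold a flat (single-level) table */
    uint64_t pages = (entries * sizeof(uint64_t)) / host_page_size;

    if (pages <= 64) {
        return 1;
    }
    if (pages <= 64 * 64) {
        return 2;
    }
    if (pages <= 64 * 64 * 64) {
        return 3;
    }
    return 4;
}
```

With 64K host pages, a 1TB window of 4K IOMMU pages needs 2^28 TCEs, that
is 2GB of table or 32768 host pages, which lands in the 3-level bucket. The
same window with 16M IOMMU pages needs only 8 host pages and a single level.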

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v7 14/14] vfio: Enable DDW ioctls to VFIO IOMMU driver
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 14/14] vfio: Enable DDW ioctls to VFIO IOMMU driver Alexey Kardashevskiy
@ 2015-05-05 12:50   ` David Gibson
  0 siblings, 0 replies; 56+ messages in thread
From: David Gibson @ 2015-05-05 12:50 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On Sat, Apr 25, 2015 at 10:24:44PM +1000, Alexey Kardashevskiy wrote:
> This enables DDW RTAS-related ioctls in VFIO.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

This patch belongs before the last one (since the last one won't work
without it).

But otherwise,

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
  2015-05-05 12:28   ` David Gibson
@ 2015-05-25 15:05   ` Alexey Kardashevskiy
  2015-05-26  2:46     ` David Gibson
  1 sibling, 1 reply; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-25 15:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Alexander Graf, Michael Roth, qemu-devel, Alex Williamson,
	qemu-ppc, David Gibson

Hi Paolo,

I have had a conversation with Mike and it turns out I am not allowed to
create/remove memory regions dynamically (docs/memory.txt:101); otherwise
"destroying regions during reset causes assertion in RCU thread during
PHB/IOMMU unplug/unparent". Is it because the patch is just missing some
unref()/unparent() call, or is the approach totally wrong and I have to
implement subregions (on a PCI bus address space) myself if I want dynamic
DMA windows? Thanks!
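
For context, the patch quoted below keeps one TCE table object per LIOBN
alive for the lifetime of its owner and only toggles an "enabled" state on
it. A minimal standalone sketch of that lifecycle follows. The struct
tce_table_model and the tce_model_*() helpers are hypothetical stand-ins,
not QEMU API; the memory region and KVM handling are left out entirely, and
unlike the patch the enable helper returns an error code:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for sPAPRTCETable; not the QEMU type. */
struct tce_table_model {
    uint32_t liobn;       /* assigned once at creation, never reset */
    bool enabled;
    uint64_t bus_offset;
    uint32_t page_shift;
    uint32_t nb_table;
    uint64_t *table;      /* allocated only while enabled */
};

/* Allocate the table and record the window parameters. */
static int tce_model_enable(struct tce_table_model *t, uint64_t bus_offset,
                            uint32_t page_shift, uint32_t nb_table)
{
    if (t->enabled || !nb_table) {
        return -1;
    }

    t->table = calloc(nb_table, sizeof(uint64_t));
    if (!t->table) {
        return -1;
    }

    t->bus_offset = bus_offset;
    t->page_shift = page_shift;
    t->nb_table = nb_table;
    t->enabled = true;
    return 0;
}

/* Free the table and clear everything except the LIOBN. */
static void tce_model_disable(struct tce_table_model *t)
{
    if (!t->enabled) {
        return;
    }

    free(t->table);
    t->table = NULL;
    t->enabled = false;
    t->bus_offset = 0;
    t->page_shift = 0;
    t->nb_table = 0;
}
```

The object itself is never destroyed here, so a reset is just a disable
followed by an enable with the default window parameters.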




On 04/25/2015 10:24 PM, Alexey Kardashevskiy wrote:
> Currently TCE tables are created once at start and their size never
> changes. We are going to change that by introducing Dynamic DMA windows
> support, where the DMA configuration may change during guest execution.
>
> This changes spapr_tce_new_table() to create an empty stub object. Only
> the LIOBN is assigned at creation time. It will still be called once,
> when the owner object (VIO or PHB) is created.
>
> This introduces an "enabled" state for TCE table objects with two
> helper functions - spapr_tce_table_enable()/spapr_tce_table_disable().
> spapr_tce_table_enable() receives TCE table parameters and allocates
> a guest view of the TCE table (in the user space or KVM).
> spapr_tce_table_disable() disposes the table.
>
> Follow-up patches will disable+enable tables on reset (system reset
> or DDW reset).
>
> No visible change in behaviour is expected, except that the actual table
> will be reallocated on every reset. We might optimize this later.
>
> The other way to implement this would be to dynamically create/remove
> the TCE table QOM objects, but this would make migration impossible
> as migration expects all QOM objects to exist at the receiver,
> so we have to have TCE table objects created when migration begins.
>
> spapr_tce_table_do_enable() is separated from spapr_tce_table_enable()
> as later it will be called at the sPAPRTCETable post-migration stage when
> it has all the properties set after the migration.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v7:
> * s'tmp[64]'tmp[32]' as we need less than 64 bytes and more than 16 bytes
> and 32 is the closest power-of-two (it just looks nicer to have
> power-of-two values)
> * updated commit log about having spapr_tce_table_do_enable() split
> from spapr_tce_table_enable()
>
> v6:
> * got rid of set_props()
> ---
>   hw/ppc/spapr_iommu.c    | 104 +++++++++++++++++++++++++++++++-----------------
>   hw/ppc/spapr_pci.c      |  16 +++++---
>   hw/ppc/spapr_pci_vfio.c |  10 ++---
>   hw/ppc/spapr_vio.c      |   9 ++---
>   include/hw/ppc/spapr.h  |  11 ++---
>   5 files changed, 93 insertions(+), 57 deletions(-)
>
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index a14cdc4..a3f2b83 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -126,8 +126,47 @@ static MemoryRegionIOMMUOps spapr_iommu_ops = {
>   static int spapr_tce_table_realize(DeviceState *dev)
>   {
>       sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
> +
> +    QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
> +
> +    vmstate_register(DEVICE(tcet), tcet->liobn, &vmstate_spapr_tce_table,
> +                     tcet);
> +
> +    return 0;
> +}
> +
> +sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn)
> +{
> +    sPAPRTCETable *tcet;
> +    char tmp[32];
> +
> +    if (spapr_tce_find_by_liobn(liobn)) {
> +        fprintf(stderr, "Attempted to create TCE table with duplicate"
> +                " LIOBN 0x%x\n", liobn);
> +        return NULL;
> +    }
> +
> +    tcet = SPAPR_TCE_TABLE(object_new(TYPE_SPAPR_TCE_TABLE));
> +    tcet->liobn = liobn;
> +
> +    snprintf(tmp, sizeof(tmp), "tce-table-%x", liobn);
> +    object_property_add_child(OBJECT(owner), tmp, OBJECT(tcet), NULL);
> +
> +    object_property_set_bool(OBJECT(tcet), true, "realized", NULL);
> +
> +    trace_spapr_iommu_new_table(tcet->liobn, tcet, tcet->table, tcet->fd);
> +
> +    return tcet;
> +}
> +
> +static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
> +{
>       uint64_t window_size = (uint64_t)tcet->nb_table << tcet->page_shift;
>
> +    if (!tcet->nb_table) {
> +        return;
> +    }
> +
>       if (kvm_enabled() && !(window_size >> 32)) {
>           tcet->table = kvmppc_create_spapr_tce(tcet->liobn,
>                                                 window_size,
> @@ -140,65 +179,56 @@ static int spapr_tce_table_realize(DeviceState *dev)
>           tcet->table = g_malloc0(table_size);
>       }
>
> -    trace_spapr_iommu_new_table(tcet->liobn, tcet, tcet->table, tcet->fd);
> -
> -    memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
> +    memory_region_init_iommu(&tcet->iommu, OBJECT(tcet), &spapr_iommu_ops,
>                                "iommu-spapr",
>                                (uint64_t)tcet->nb_table << tcet->page_shift);
>
> -    QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
> -
> -    vmstate_register(DEVICE(tcet), tcet->liobn, &vmstate_spapr_tce_table,
> -                     tcet);
> -
> -    return 0;
> +    tcet->enabled = true;
>   }
>
> -sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
> -                                   uint64_t bus_offset,
> -                                   uint32_t page_shift,
> -                                   uint32_t nb_table,
> -                                   bool vfio_accel)
> +void spapr_tce_table_enable(sPAPRTCETable *tcet,
> +                            uint64_t bus_offset, uint32_t page_shift,
> +                            uint32_t nb_table, bool vfio_accel)
>   {
> -    sPAPRTCETable *tcet;
> -    char tmp[64];
> -
> -    if (spapr_tce_find_by_liobn(liobn)) {
> -        fprintf(stderr, "Attempted to create TCE table with duplicate"
> -                " LIOBN 0x%x\n", liobn);
> -        return NULL;
> -    }
> -
> -    if (!nb_table) {
> -        return NULL;
> +    if (tcet->enabled) {
> +        return;
>       }
>
> -    tcet = SPAPR_TCE_TABLE(object_new(TYPE_SPAPR_TCE_TABLE));
> -    tcet->liobn = liobn;
>       tcet->bus_offset = bus_offset;
>       tcet->page_shift = page_shift;
>       tcet->nb_table = nb_table;
>       tcet->vfio_accel = vfio_accel;
>
> -    snprintf(tmp, sizeof(tmp), "tce-table-%x", liobn);
> -    object_property_add_child(OBJECT(owner), tmp, OBJECT(tcet), NULL);
> -
> -    object_property_set_bool(OBJECT(tcet), true, "realized", NULL);
> -
> -    return tcet;
> +    spapr_tce_table_do_enable(tcet);
>   }
>
> -static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
> +void spapr_tce_table_disable(sPAPRTCETable *tcet)
>   {
> -    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
> -
> -    QLIST_REMOVE(tcet, list);
> +    if (!tcet->enabled) {
> +        return;
> +    }
>
>       if (!kvm_enabled() ||
>           (kvmppc_remove_spapr_tce(tcet->table, tcet->fd,
>                                    tcet->nb_table) != 0)) {
> +        tcet->fd = -1;
>           g_free(tcet->table);
>       }
> +    tcet->table = NULL;
> +    tcet->enabled = false;
> +    tcet->bus_offset = 0;
> +    tcet->page_shift = 0;
> +    tcet->nb_table = 0;
> +    tcet->vfio_accel = false;
> +}
> +
> +static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
> +{
> +    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
> +
> +    QLIST_REMOVE(tcet, list);
> +
> +    spapr_tce_table_disable(tcet);
>   }
>
>   MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 8c0d2eb..c3410b8 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -881,6 +881,12 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>           sphb->lsi_table[i].irq = irq;
>       }
>
> +    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
> +    if (!tcet) {
> +            error_setg(errp, "failed to create TCE table");
> +            return;
> +    }
> +
>       info->dma_capabilities_update(sphb);
>       info->dma_init_window(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
>                             sphb->dma32_window_size);
> @@ -908,13 +914,13 @@ static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
>                                        uint64_t window_size)
>   {
>       uint64_t bus_offset = sphb->dma32_window_start;
> -    sPAPRTCETable *tcet;
> +    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
>
> -    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, bus_offset, page_shift,
> -                               window_size >> page_shift,
> -                               false);
> +    spapr_tce_table_enable(tcet, bus_offset, page_shift,
> +                           window_size >> page_shift,
> +                           false);
>
> -    return tcet ? 0 : -1;
> +    return 0;
>   }
>
>   static int spapr_phb_children_reset(Object *child, void *opaque)
> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
> index f1dd28c..a5b97d0 100644
> --- a/hw/ppc/spapr_pci_vfio.c
> +++ b/hw/ppc/spapr_pci_vfio.c
> @@ -49,13 +49,13 @@ static int spapr_phb_vfio_dma_init_window(sPAPRPHBState *sphb,
>                                             uint64_t window_size)
>   {
>       uint64_t bus_offset = sphb->dma32_window_start;
> -    sPAPRTCETable *tcet;
> +    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
>
> -    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, bus_offset, page_shift,
> -                               window_size >> page_shift,
> -                               true);
> +    spapr_tce_table_enable(tcet, bus_offset, page_shift,
> +                           window_size >> page_shift,
> +                           true);
>
> -    return tcet ? 0 : -1;
> +    return 0;
>   }
>
>   static void spapr_phb_vfio_reset(DeviceState *qdev)
> diff --git a/hw/ppc/spapr_vio.c b/hw/ppc/spapr_vio.c
> index 174033d..3e28835 100644
> --- a/hw/ppc/spapr_vio.c
> +++ b/hw/ppc/spapr_vio.c
> @@ -479,11 +479,10 @@ static void spapr_vio_busdev_realize(DeviceState *qdev, Error **errp)
>           memory_region_add_subregion_overlap(&dev->mrroot, 0, &dev->mrbypass, 1);
>           address_space_init(&dev->as, &dev->mrroot, qdev->id);
>
> -        dev->tcet = spapr_tce_new_table(qdev, liobn,
> -                                        0,
> -                                        SPAPR_TCE_PAGE_SHIFT,
> -                                        pc->rtce_window_size >>
> -                                        SPAPR_TCE_PAGE_SHIFT, false);
> +        dev->tcet = spapr_tce_new_table(qdev, liobn);
> +        spapr_tce_table_enable(dev->tcet, 0, SPAPR_TCE_PAGE_SHIFT,
> +                               pc->rtce_window_size >> SPAPR_TCE_PAGE_SHIFT,
> +                               false);
>           dev->tcet->vdev = dev;
>           memory_region_add_subregion_overlap(&dev->mrroot, 0,
>                                               spapr_tce_get_iommu(dev->tcet), 2);
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 7d9ab9d..074d837 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -498,6 +498,7 @@ typedef struct sPAPRTCETable sPAPRTCETable;
>
>   struct sPAPRTCETable {
>       DeviceState parent;
> +    bool enabled;
>       uint32_t liobn;
>       uint32_t nb_table;
>       uint64_t bus_offset;
> @@ -515,11 +516,11 @@ sPAPRTCETable *spapr_tce_find_by_liobn(uint32_t liobn);
>   void spapr_events_init(sPAPREnvironment *spapr);
>   void spapr_events_fdt_skel(void *fdt, uint32_t epow_irq);
>   int spapr_h_cas_compose_response(target_ulong addr, target_ulong size);
> -sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
> -                                   uint64_t bus_offset,
> -                                   uint32_t page_shift,
> -                                   uint32_t nb_table,
> -                                   bool vfio_accel);
> +sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn);
> +void spapr_tce_table_enable(sPAPRTCETable *tcet,
> +                            uint64_t bus_offset, uint32_t page_shift,
> +                            uint32_t nb_table, bool vfio_accel);
> +void spapr_tce_table_disable(sPAPRTCETable *tcet);
>   MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet);
>   int spapr_dma_dt(void *fdt, int node_off, const char *propname,
>                    uint32_t liobn, uint64_t window, uint32_t size);
>
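
For readers following the split of spapr_tce_new_table() into a "create" step and an "enable" step, the lifecycle can be sketched roughly as below. This is a minimal toy model, not QEMU code: ToyTceTable and the toy_* functions are hypothetical names invented for illustration, and the error handling is an assumption. Unlike the quoted hunks, which call spapr_tce_table_enable() on whatever spapr_tce_find_by_liobn() returns and then return 0 unconditionally, the sketch reports a missing or already-enabled table to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for sPAPRTCETable; not the QEMU structure. */
typedef struct ToyTceTable {
    bool enabled;
    uint32_t liobn;
    uint64_t bus_offset;
    uint32_t page_shift;
    uint32_t nb_table;
    uint64_t *table;        /* allocated only while the window is enabled */
} ToyTceTable;

/* Phase 1: create the object; no window is configured yet. */
static ToyTceTable *toy_tce_new_table(uint32_t liobn)
{
    ToyTceTable *t = calloc(1, sizeof(*t));

    if (t) {
        t->liobn = liobn;
    }
    return t;
}

/* Phase 2: configure the window. Returns 0 on success, -1 on error. */
static int toy_tce_table_enable(ToyTceTable *t, uint64_t bus_offset,
                                uint32_t page_shift, uint32_t nb_table)
{
    if (!t || t->enabled || nb_table == 0) {
        return -1;
    }
    t->table = calloc(nb_table, sizeof(*t->table));
    if (!t->table) {
        return -1;
    }
    t->bus_offset = bus_offset;
    t->page_shift = page_shift;
    t->nb_table = nb_table;
    t->enabled = true;
    return 0;
}

/* Undo enable; the object itself stays alive and can be enabled again. */
static void toy_tce_table_disable(ToyTceTable *t)
{
    assert(t != NULL);
    if (!t->enabled) {
        return;
    }
    free(t->table);
    t->table = NULL;
    t->nb_table = 0;
    t->enabled = false;
}
```

The point of the two-phase shape is that the object (and anything that refers to it) survives a disable/enable cycle; only the window parameters and the backing table change.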


-- 
Alexey

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-25 15:05   ` Alexey Kardashevskiy
@ 2015-05-26  2:46     ` David Gibson
  2015-05-26  8:58       ` Paolo Bonzini
  0 siblings, 1 reply; 56+ messages in thread
From: David Gibson @ 2015-05-26  2:46 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alexander Graf, Michael Roth, qemu-devel, Alex Williamson,
	qemu-ppc, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 1088 bytes --]

On Tue, May 26, 2015 at 01:05:56AM +1000, Alexey Kardashevskiy wrote:
> Hi Paolo,
> 
> I have had a conversation with Mike and it turns out I am not allowed to
> create/remove memory regions dynamically (docs/memory.txt:101); otherwise
> "destroying regions during reset causes assertion in RCU thread during
> PHB/IOMMU unplug/unparent". Is it because patch just missing some
> unref()/unparent() call or it is totally wrong and I have to implement
> subregions (on a PCI bus address space) myself if I want dynamic DMA
> windows? Thanks!

So, the sentences after that one note an exception for alias and
container regions.  I think iommu regions should behave similarly - in
a sense they're just a procedurally generated collection of alias
regions.

If it's not true now that they can be unparented at any time like
alias regions, we should probably try to make it true.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-26  2:46     ` David Gibson
@ 2015-05-26  8:58       ` Paolo Bonzini
  2015-05-26  9:01         ` Alexander Graf
                           ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-05-26  8:58 UTC (permalink / raw)
  To: David Gibson, Alexey Kardashevskiy
  Cc: Michael Roth, Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On 26/05/2015 04:46, David Gibson wrote:
> On Tue, May 26, 2015 at 01:05:56AM +1000, Alexey Kardashevskiy 
> wrote:
>> Hi Paolo,
>> 
>> I have had a conversation with Mike and it turns out I am not 
>> allowed to create/remove memory regions dynamically 
>> (docs/memory.txt:101); otherwise "destroying regions during
>> reset causes assertion in RCU thread during PHB/IOMMU
>> unplug/unparent". Is it because patch just missing some
>> unref()/unparent() call or it is totally wrong and I have to
>> implement subregions (on a PCI bus address space) myself if I
>> want dynamic DMA windows? Thanks!

I'm blind, can you explain the path where that happens?

> So, the sentences after that one note an exception for alias and 
> container regions.  I think iommu regions should behave similarly
> - in a sense they're just a procedurally generated collection of 
> alias regions.

The difference is that containers and aliases are resolved at the time
the memory region tree is flattened, while IOMMU regions are resolved
at run time.

> If it's not true now that they can be unparented at any time like 
> alias regions, we should probably try to make it true.

Unfortunately it's not so easy...

Paolo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-26  8:58       ` Paolo Bonzini
@ 2015-05-26  9:01         ` Alexander Graf
  2015-05-26  9:16           ` Paolo Bonzini
  2015-05-26 10:15         ` Alexey Kardashevskiy
  2015-05-27  2:54         ` David Gibson
  2 siblings, 1 reply; 56+ messages in thread
From: Alexander Graf @ 2015-05-26  9:01 UTC (permalink / raw)
  To: Paolo Bonzini, David Gibson, Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Michael Roth



On 26.05.15 10:58, Paolo Bonzini wrote:
> On 26/05/2015 04:46, David Gibson wrote:
>> On Tue, May 26, 2015 at 01:05:56AM +1000, Alexey Kardashevskiy 
>> wrote:
>>> Hi Paolo,
>>>
>>> I have had a conversation with Mike and it turns out I am not 
>>> allowed to create/remove memory regions dynamically 
>>> (docs/memory.txt:101); otherwise "destroying regions during
>>> reset causes assertion in RCU thread during PHB/IOMMU
>>> unplug/unparent". Is it because patch just missing some
>>> unref()/unparent() call or it is totally wrong and I have to
>>> implement subregions (on a PCI bus address space) myself if I
>>> want dynamic DMA windows? Thanks!
> 
> I'm blind, can you explain the path where that happens?
> 
>> So, the sentences after that one note an exception for alias and 
>> container regions.  I think iommu regions should behave similarly
>> - in a sense they're just a procedurally generated collection of 
>> alias regions.
> 
> The difference is that containers and aliases are resolved at the time
> the memory region tree is flattened, while IOMMU regions are resolved
> at run time.

Can you please go into more detail here? What part exactly gets resolved
at run time? We don't flatten the memory regions for IOMMU accesses?

But even if we walk the regions rather than the flattened tree, I don't
see how we could end up with races when removing a device.


Alex

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-26  9:01         ` Alexander Graf
@ 2015-05-26  9:16           ` Paolo Bonzini
  0 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-05-26  9:16 UTC (permalink / raw)
  To: Alexander Graf, David Gibson, Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Andreas Färber, Michael Roth



On 26/05/2015 11:01, Alexander Graf wrote:
>>> So, the sentences after that one note an exception for alias and
>>> container regions.  I think iommu regions should behave similarly
>>> - in a sense they're just a procedurally generated collection of
>>> alias regions.
>>
>> The difference is that containers and aliases are resolved at the time
>> the memory region tree is flattened, while IOMMU regions are resolved
>> at run time.
>
> Can you please go into more detail here? What part exactly gets resolved
> at run time? We don't flatten the memory regions for IOMMU accesses?

The IOMMU is a single huge region in the FlatView, which then is
forwarded to another AddressSpace.
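
The distinction between "resolved when the tree is flattened" and "resolved at run time" can be illustrated with a small toy model. This is only a sketch under stated assumptions: ToyRange, ToyIommu and toy_resolve() are hypothetical names, not QEMU's FlatView or IOMMU API, and treating bit 0 of a table entry as a "valid" flag is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { TOY_RAM, TOY_IOMMU } ToyKind;

/* One entry of a toy flattened view. */
typedef struct ToyRange {
    uint64_t start, size;
    ToyKind kind;
    uint64_t ram_base;   /* TOY_RAM: target fixed when the view was built */
    bool (*translate)(void *opaque, uint64_t off, uint64_t *out);
    void *opaque;        /* TOY_IOMMU: state consulted on every access */
} ToyRange;

/* Toy IOMMU: a table of page entries, bit 0 set means "valid" (assumed). */
typedef struct ToyIommu {
    uint32_t page_shift;
    uint32_t nb;
    const uint64_t *tce;
} ToyIommu;

static bool toy_iommu_translate(void *opaque, uint64_t off, uint64_t *out)
{
    const ToyIommu *io = opaque;
    uint64_t idx = off >> io->page_shift;
    uint64_t mask = (1ULL << io->page_shift) - 1;

    if (idx >= io->nb || !(io->tce[idx] & 1)) {
        return false;
    }
    *out = (io->tce[idx] & ~mask) | (off & mask);
    return true;
}

/* Look up an address: RAM is a fixed offset, IOMMU asks the table now. */
static bool toy_resolve(const ToyRange *view, size_t n, uint64_t addr,
                        uint64_t *out)
{
    assert(out != NULL);
    for (size_t i = 0; i < n; i++) {
        const ToyRange *r = &view[i];

        if (addr < r->start || addr - r->start >= r->size) {
            continue;
        }
        if (r->kind == TOY_RAM) {
            *out = r->ram_base + (addr - r->start);
            return true;
        }
        return r->translate(r->opaque, addr - r->start, out);
    }
    return false;
}
```

In this model the view contains exactly one range for the whole IOMMU window. Changing a table entry changes what the next access resolves to without the view being rebuilt, whereas changing a RAM or alias mapping would require building a new view.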

> But even if we walk the regions rather than the flattened tree, I don't
> see how we could end up with races when removing a device.

The problem is that there is no guarantee that the MemoryRegion dies
immediately after object_unparent.  In fact it will wait at least one
RCU grace period, because the (RCU-protected) FlatViews hold a reference
to the device via memory_region_ref.

There is a very simple (in theory) solution: create the memory region
via object_new instead of object_initialize, using a MemoryRegion*
instead of embedding the MemoryRegion directly.  But I'm not sure how to
do that without duplicating the whole memory_region_init set of APIs.
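
The lifetime problem and the "allocate the region separately" idea can be sketched with a toy reference-counted object. This is not QEMU's QOM or RCU code: ToyRegion and the toy_region_* helpers are hypothetical, and the "reader" reference merely stands in for the reference a FlatView would hold until a grace period has elapsed.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy heap-allocated region with a plain reference count. */
typedef struct ToyRegion {
    int refcount;
} ToyRegion;

static int toy_released;   /* number of regions actually freed so far */

/* A new region starts with one reference, held by its parent. */
static ToyRegion *toy_region_new(void)
{
    ToyRegion *r = calloc(1, sizeof(*r));

    if (r) {
        r->refcount = 1;
    }
    return r;
}

static void toy_region_ref(ToyRegion *r)
{
    r->refcount++;
}

/* The region is freed only when the last reference goes away. */
static void toy_region_unref(ToyRegion *r)
{
    assert(r->refcount > 0);
    if (--r->refcount == 0) {
        toy_released++;
        free(r);
    }
}

/* Unparenting drops only the parent's reference; readers may remain. */
static void toy_region_unparent(ToyRegion *r)
{
    toy_region_unref(r);
}
```

Because each region has its own allocation, the owner can create a replacement right after unparenting the old one, and the old one is released whenever its last reader lets go. With a region embedded in a larger structure, the same sequence would reinitialize or free storage that a reader can still reach.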

Perhaps Andreas has an idea of how to improve the QOM object creation API?

Paolo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-26  8:58       ` Paolo Bonzini
  2015-05-26  9:01         ` Alexander Graf
@ 2015-05-26 10:15         ` Alexey Kardashevskiy
  2015-05-26 10:16           ` Paolo Bonzini
  2015-05-27  2:54         ` David Gibson
  2 siblings, 1 reply; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-26 10:15 UTC (permalink / raw)
  To: Paolo Bonzini, David Gibson
  Cc: Michael Roth, Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On 05/26/2015 06:58 PM, Paolo Bonzini wrote:
> On 26/05/2015 04:46, David Gibson wrote:
>> On Tue, May 26, 2015 at 01:05:56AM +1000, Alexey Kardashevskiy
>> wrote:
>>> Hi Paolo,
>>>
>>> I have had a conversation with Mike and it turns out I am not
>>> allowed to create/remove memory regions dynamically
>>> (docs/memory.txt:101); otherwise "destroying regions during
>>> reset causes assertion in RCU thread during PHB/IOMMU
>>> unplug/unparent". Is it because patch just missing some
>>> unref()/unparent() call or it is totally wrong and I have to
>>> implement subregions (on a PCI bus address space) myself if I
>>> want dynamic DMA windows? Thanks!
>
> I'm blind, can you explain the path where that happens?


Mike's "[RFC PATCH 00/15] spapr: add support for PHB hotplug" patchset 
included this patch, which added "unrealize" for spapr_phb:

[RFC PATCH 05/15] spapr_pci: add PHB unrealize

I believe I am dealing with the fixed version of this patch so I'll ask 
Mike to respin it.


>
>> So, the sentences after that one note an exception for alias and
>> container regions.  I think iommu regions should behave similarly
>> - in a sense they're just a procedurally generated collection of
>> alias regions.
>
> The difference is that containers and aliases are resolved at the time
> the memory region tree is flattened, while IOMMU regions are resolved
> at run time.


So they are not part of the flattened view, and I should be able to add/remove 
these IOMMU subregions any time I like?


>> If it's not true now that they can be unparented at any time like
>> alias regions, we should probably try to make it true.
>
> Unfortunately it's not so easy...





-- 
Alexey

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-26 10:15         ` Alexey Kardashevskiy
@ 2015-05-26 10:16           ` Paolo Bonzini
  2015-05-26 12:33             ` Alexey Kardashevskiy
  0 siblings, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2015-05-26 10:16 UTC (permalink / raw)
  To: Alexey Kardashevskiy, David Gibson
  Cc: Michael Roth, Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf



On 26/05/2015 12:15, Alexey Kardashevskiy wrote:
> There was a "[RFC PATCH 00/15] spapr: add support for PHB hotplug"
> patchset from Mike, this patch added "unrealize" for spapr_phb:
> 
> [RFC PATCH 05/15] spapr_pci: add PHB unrealize
> 
> I believe I am dealing with the fixed version of this patch so I'll ask
> Mike to respin it.
> 
> 
>>
>>> So, the sentences after that one note an exception for alias and
>>> container regions.  I think iommu regions should behave similarly
>>> - in a sense they're just a procedurally generated collection of
>>> alias regions.
>>
>> The difference is that containers and aliases are resolved at the time
>> the memory region tree is flattened, while IOMMU regions are resolved
>> at run time.
> 
> So they are not parts of flattened view and I should be able to
> add/remove these IOMMU subregions any time I like?

Yes.

Paolo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-26 10:16           ` Paolo Bonzini
@ 2015-05-26 12:33             ` Alexey Kardashevskiy
  2015-05-26 12:50               ` Paolo Bonzini
  0 siblings, 1 reply; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-26 12:33 UTC (permalink / raw)
  To: Paolo Bonzini, David Gibson
  Cc: Michael Roth, Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On 05/26/2015 08:16 PM, Paolo Bonzini wrote:
>
>
> On 26/05/2015 12:15, Alexey Kardashevskiy wrote:
>> There was a "[RFC PATCH 00/15] spapr: add support for PHB hotplug"
>> patchset from Mike, this patch added "unrealize" for spapr_phb:
>>
>> [RFC PATCH 05/15] spapr_pci: add PHB unrealize
>>
>> I believe I am dealing with the fixed version of this patch so I'll ask
>> Mike to respin it.
>>
>>
>>>
>>>> So, the sentences after that one note an exception for alias and
>>>> container regions.  I think iommu regions should behave similarly
>>>> - in a sense they're just a procedurally generated collection of
>>>> alias regions.
>>>
>>> The difference is that containers and aliases are resolved at the time
>>> the memory region tree is flattened, while IOMMU regions are resolved
>>> at run time.
>>
>> So they are not parts of flattened view and I should be able to
>> add/remove these IOMMU subregions any time I like?
>
> Yes.


I got lost here:

>>> If it's not true now that they can be unparented at any time like
>>> alias regions, we should probably try to make it true.
>>
>> Unfortunately it's not so easy...


Uff. Tricky :)

memory_region_del_subregion() is not unparenting but just a wrapped 
object_unref(), right? But since IOMMU MRs are resolved dynamically, the 
whole conversation we are having here has nothing to do with my and Mike's 
concern about what we can and cannot do with DMA windows. Is this correct?



-- 
Alexey

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-26 12:33             ` Alexey Kardashevskiy
@ 2015-05-26 12:50               ` Paolo Bonzini
  2015-05-26 13:28                 ` Alexey Kardashevskiy
  0 siblings, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2015-05-26 12:50 UTC (permalink / raw)
  To: Alexey Kardashevskiy, David Gibson
  Cc: Michael Roth, Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf



On 26/05/2015 14:33, Alexey Kardashevskiy wrote:
> 
>>>> If it's not true now that they can be unparented at any time like
>>>> alias regions, we should probably try to make it true.
>>>
>>> Unfortunately it's not so easy...
> 
> 
> Uff. Tricky :)
> 
> memory_region_del_subregion() is not unparenting but just a wrapped
> object_unref(), right?

Right.  The problematic thing to do is explicit object_unparent followed
by one of the following:

1) memory_region_init for the same memory region that has been unparented

2) g_free of some dynamically-allocated data structure that contained
the memory region.

> But since iommu MR are resolved dynamically, the
> whole conversation we are having here now has nothing to do with my&Mike
> concern what we can and cannot do with DMA windows here. Is this correct?

I don't understand what you're asking here, sorry.

Paolo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-26 12:50               ` Paolo Bonzini
@ 2015-05-26 13:28                 ` Alexey Kardashevskiy
  2015-05-26 13:31                   ` Paolo Bonzini
  0 siblings, 1 reply; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-26 13:28 UTC (permalink / raw)
  To: Paolo Bonzini, David Gibson
  Cc: Michael Roth, Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On 05/26/2015 10:50 PM, Paolo Bonzini wrote:
>
>
> On 26/05/2015 14:33, Alexey Kardashevskiy wrote:
>>
>>>>> If it's not true now that they can be unparented at any time like
>>>>> alias regions, we should probably try to make it true.
>>>>
>>>> Unfortunately it's not so easy...
>>
>>
>> Uff. Tricky :)
>>
>> memory_region_del_subregion() is not unparenting but just a wrapped
>> object_unref(), right?
>
> Right.  The problematic thing to do is explicit object_unparent followed
> by one of the following:
>
> 1) memory_region_init for the same memory region that has been unparented
>
> 2) g_free of some dynamically-allocated data structure that contained
> the memory region.
>
>> But since iommu MR are resolved dynamically, the
>> whole conversation we are having here now has nothing to do with my&Mike
>> concern what we can and cannot do with DMA windows here. Is this correct?
>
> I don't understand what you're asking here, sorry.


My initial concern was whether I can do:

memory_region_init_iommu + memory_region_add_subregion
and
memory_region_del_subregion + object_unref

outside of init/realize/unrealize/finalize.

You said I cannot unparent, but since I am not doing that (I only call 
unref()), I am fine. That's what I meant.




-- 
Alexey

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-26 13:28                 ` Alexey Kardashevskiy
@ 2015-05-26 13:31                   ` Paolo Bonzini
  2015-05-26 13:42                     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2015-05-26 13:31 UTC (permalink / raw)
  To: Alexey Kardashevskiy, David Gibson
  Cc: Michael Roth, Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf



On 26/05/2015 15:28, Alexey Kardashevskiy wrote:
> 
> My initial concern was if I can or cannot do:
> 
> memory_region_init_iommu + memory_region_add_subregion
> and
> memory_region_del_subregion + object_unref
> 
> outside of init/realize/unrealize/finalize.
> 
> You said I cannot do unparenting but as I am not doing this (and I just
> do unref()) - I am fine. That's what I meant.

Well, if you do the above you have two different bugs:

1) you leak the original child property

2) you initialize the second region on top of the first, so you have two
regions pointing to the same memory

This is even worse than unparenting :) and would have been wrong even
without the RCU changes.

Paolo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-26 13:31                   ` Paolo Bonzini
@ 2015-05-26 13:42                     ` Alexey Kardashevskiy
  2015-05-26 13:48                       ` Paolo Bonzini
  0 siblings, 1 reply; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-26 13:42 UTC (permalink / raw)
  To: Paolo Bonzini, David Gibson
  Cc: Michael Roth, Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On 05/26/2015 11:31 PM, Paolo Bonzini wrote:
>
>
> On 26/05/2015 15:28, Alexey Kardashevskiy wrote:
>>
>> My initial concern was if I can or cannot do:
>>
>> memory_region_init_iommu + memory_region_add_subregion
>> and
>> memory_region_del_subregion + object_unref
>>
>> outside of init/realize/unrealize/finalize.
>>
>> You said I cannot do unparenting but as I am not doing this (and I just
>> do unref()) - I am fine. That's what I meant.
>
> Well, if you do the above you have two different bugs:
>
> 1) you leak the original child property
>
> 2) you initialize the second region on top of the first, so you have two
> regions pointing to the same memory


The next patch of this patchset changes:
spapr_tce_table_do_enable()
	memory_region_init_iommu(&iommu)
	memory_region_add_subregion(&root, &iommu)

spapr_tce_table_disable()
	memory_region_del_subregion(&root, &iommu)
	object_unref(&iommu)

These spapr_tce_xxx functions are called at the guest's request. &root is a 
container and exists as long as sPAPRTCETable exists.


Where do I get a leaking child property here?


> This is even worse than unparenting :) and would have been wrong even
> without the RCU changes.

I believe you :) But do not understand :)


-- 
Alexey

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-26 13:42                     ` Alexey Kardashevskiy
@ 2015-05-26 13:48                       ` Paolo Bonzini
  2015-05-26 14:00                         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2015-05-26 13:48 UTC (permalink / raw)
  To: Alexey Kardashevskiy, David Gibson
  Cc: Michael Roth, Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf



On 26/05/2015 15:42, Alexey Kardashevskiy wrote:
> 
> 
> The next patch of this patchset changes:
> spapr_tce_table_do_enable()
>     memory_region_init_iommu(&iommu)
>     memory_region_add_subregion(&root, &iommu)
> 
> spapr_tce_table_disable()
>     memory_region_del_subregion(&root, &iommu)
>     object_unref(&iommu)
> 
> These spapr_tce_xxx are called by request from the guest. &root is a
> container and exists as long as sPAPRTCETable exists.
> 
> Where do I get a leaking child property here?

When you unref iommu and not unparent it.  The next
memory_region_init_iommu creates a second child property, and the first
is gone.
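
The leak described here can be made concrete with a toy model of a parent that records its children. This is only a sketch: ToyParent, ToyObj and the toy_* functions are hypothetical and much simpler than QOM child properties. The assumption being illustrated is the one stated in the thread: initialization registers the region with its owner, plain unref does not undo that registration, and unparent does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_CHILDREN 8

typedef struct ToyObj {
    int refcount;
} ToyObj;

/* Toy parent: a flat list standing in for child properties. */
typedef struct ToyParent {
    ToyObj *children[TOY_MAX_CHILDREN];
    size_t nchildren;
} ToyParent;

/* Initialize (or re-initialize) a child and register it with the parent. */
static bool toy_init_child(ToyParent *p, ToyObj *o)
{
    if (p->nchildren == TOY_MAX_CHILDREN) {
        return false;
    }
    o->refcount = 1;
    p->children[p->nchildren++] = o;
    return true;
}

/* Drops a reference but leaves the parent's registration untouched. */
static void toy_unref(ToyObj *o)
{
    assert(o->refcount > 0);
    o->refcount--;
}

/* Removes the registration and drops the parent's reference. */
static void toy_unparent(ToyParent *p, ToyObj *o)
{
    for (size_t i = 0; i < p->nchildren; i++) {
        if (p->children[i] == o) {
            p->children[i] = p->children[--p->nchildren];
            toy_unref(o);
            return;
        }
    }
}
```

Running init, unref, init on the same storage leaves two registrations that point at the same object, which matches the "two regions pointing to the same memory" situation; init, unparent, init leaves exactly one.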

What is different between the various IOMMU regions, so that you cannot
create just one?

Paolo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-26 13:48                       ` Paolo Bonzini
@ 2015-05-26 14:00                         ` Alexey Kardashevskiy
  2015-05-26 14:03                           ` Paolo Bonzini
  0 siblings, 1 reply; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-26 14:00 UTC (permalink / raw)
  To: Paolo Bonzini, David Gibson
  Cc: Michael Roth, Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On 05/26/2015 11:48 PM, Paolo Bonzini wrote:
>
>
> On 26/05/2015 15:42, Alexey Kardashevskiy wrote:
>>
>>
>> The next patch of this patchset changes:
>> spapr_tce_table_do_enable()
>>      memory_region_init_iommu(&iommu)
>>      memory_region_add_subregion(&root, &iommu)
>>
>> spapr_tce_table_disable()
>>      memory_region_del_subregion(&root, &iommu)
>>      object_unref(&iommu)
>>
>> These spapr_tce_xxx are called by request from the guest. &root is a
>> container and exists as long as sPAPRTCETable exists.
>>
>> Where do I get a leaking child property here?
>
> When you unref iommu and not unparent it.  The next
> memory_region_init_iommu creates a second child property, and the first
> is gone.

But when do I get this child property? In memory_region_add_subregion()? 
And memory_region_del_subregion() does not do the opposite thing (unparent)?


> What is different between the various IOMMU regions, so that you cannot
> create just one?

There are two DMA windows on the same PCI bus (in hardware too), at 
different offset and with a different page size.


-- 
Alexey

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-26 14:00                         ` Alexey Kardashevskiy
@ 2015-05-26 14:03                           ` Paolo Bonzini
  2015-05-26 14:17                             ` Alexey Kardashevskiy
  0 siblings, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2015-05-26 14:03 UTC (permalink / raw)
  To: Alexey Kardashevskiy, David Gibson
  Cc: Michael Roth, Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf



On 26/05/2015 16:00, Alexey Kardashevskiy wrote:
> On 05/26/2015 11:48 PM, Paolo Bonzini wrote:
>>
>>
>> On 26/05/2015 15:42, Alexey Kardashevskiy wrote:
>>>
>>>
>>> The next patch of this patchset changes:
>>> spapr_tce_table_do_enable()
>>>      memory_region_init_iommu(&iommu)
>>>      memory_region_add_subregion(&root, &iommu)
>>>
>>> spapr_tce_table_disable()
>>>      memory_region_del_subregion(&root, &iommu)
>>>      object_unref(&iommu)
>>>
>>> These spapr_tce_xxx are called by request from the guest. &root is a
>>> container and exists as long as sPAPRTCETable exists.
>>>
>>> Where do I get a leaking child property here?
>>
>> When you unref iommu and not unparent it.  The next
>> memory_region_init_iommu creates a second child property, and the first
>> is gone.
> 
> But when do I get this child property? In memory_region_add_subregion()?
> And memory_region_del_subregion() does not do the opposite thing
> (unparent)?

In memory_region_init_iommu.

>> What is different between the various IOMMU regions, so that you cannot
>> create just one?
> 
> There are two DMA windows on the same PCI bus (in hardware too), at
> different offset and with a different page size.

Why do you need different regions?  Why can't you have always the same
IOMMU regions, and either:

1) create/destroy an alias to that region

2) change the behavior of the translation function, while keeping a
single region?
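
Option 2 could look roughly like the sketch below: one long-lived region whose translation function consults a small set of window descriptors, each with its own "enabled" flag, bus offset and page size. This is a toy model under stated assumptions, not the QEMU IOMMU callback: ToyWindow, ToyIommuRegion and toy_translate() are hypothetical names, and table entries are assumed to hold page-aligned target addresses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_WINDOWS 2

/* One DMA window: its own bus offset, page size and table. */
typedef struct ToyWindow {
    bool enabled;
    uint64_t bus_offset;
    uint32_t page_shift;
    uint32_t nb_table;
    const uint64_t *tce;
} ToyWindow;

/* A single region that always exists; only the windows change. */
typedef struct ToyIommuRegion {
    ToyWindow win[TOY_MAX_WINDOWS];
} ToyIommuRegion;

/* Translate a bus address through whichever enabled window covers it. */
static bool toy_translate(const ToyIommuRegion *r, uint64_t addr,
                          uint64_t *out)
{
    assert(r != NULL && out != NULL);
    for (size_t i = 0; i < TOY_MAX_WINDOWS; i++) {
        const ToyWindow *w = &r->win[i];
        uint64_t off, idx, mask;

        if (!w->enabled || addr < w->bus_offset) {
            continue;
        }
        off = addr - w->bus_offset;
        idx = off >> w->page_shift;
        if (idx >= w->nb_table) {
            continue;
        }
        mask = (1ULL << w->page_shift) - 1;
        *out = (w->tce[idx] & ~mask) | (off & mask);
        return true;
    }
    return false;
}
```

With this shape, creating or removing a window at the guest's request only flips a flag and updates a descriptor; no region is created, unparented or freed while the device is live.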

Paolo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-26 14:03                           ` Paolo Bonzini
@ 2015-05-26 14:17                             ` Alexey Kardashevskiy
  2015-05-26 14:24                               ` Paolo Bonzini
  2015-05-26 14:36                               ` Michael Roth
  0 siblings, 2 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-26 14:17 UTC (permalink / raw)
  To: Paolo Bonzini, David Gibson
  Cc: Michael Roth, Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On 05/27/2015 12:03 AM, Paolo Bonzini wrote:
>
>
> On 26/05/2015 16:00, Alexey Kardashevskiy wrote:
>> On 05/26/2015 11:48 PM, Paolo Bonzini wrote:
>>>
>>>
>>> On 26/05/2015 15:42, Alexey Kardashevskiy wrote:
>>>>
>>>>
>>>> The next patch of this patchset changes:
>>>> spapr_tce_table_do_enable()
>>>>       memory_region_init_iommu(&iommu)
>>>>       memory_region_add_subregion(&root, &iommu)
>>>>
>>>> spapr_tce_table_disable()
>>>>       memory_region_del_subregion(&root, &iommu)
>>>>       object_unref(&iommu)
>>>>
>>>> These spapr_tce_xxx are called by request from the guest. &root is a
>>>> container and exists as long as sPAPRTCETable exists.
>>>>
>>>> Where do I get a leaking child property here?
>>>
>>> When you unref iommu and not unparent it.  The next
>>> memory_region_init_iommu creates a second child property, and the first
>>> is gone.
>>
>> But when do I get this child property? In memory_region_add_subregion()?
>> And memory_region_del_subregion() does not do the opposite thing
>> (unparent)?
>
> In memory_region_init_iommu.

Ah. So I need at least s/object_unref/object_unparent/ in my current code, 
right?


>>> What is different between the various IOMMU regions, so that you cannot
>>> create just one?
>>
>> There are two DMA windows on the same PCI bus (in hardware too), at
>> different offset and with a different page size.
>
> Why do you need different regions?  Why can't you have always the same
> IOMMU regions, and either:

They may change size. These are dynamic DMA windows: the guest may remove 
all of them and create new ones at will. Each region is backed by a 
separate TCE table with a different page size.

> 1) create/destroy an alias to that region

How does this change things compared to iommus in regard to parenting?


> 2) change the behavior of the translation function, while keeping a
> single region?

Have one sPAPRTCETable object with 0, 1 or 2 (and potentially more) actual 
TCE tables? I can do that too but I thought subregions are just natural for 
that. I even wanted to create sPAPRTCETables dynamically but this would 
break migration (because we cannot start QEMU with an additional 
sPAPRTCETable if it exists in the source which is not always the case).

Ok. I'll redo this thing again and try using fewer QOM objects...


-- 
Alexey

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-26 14:17                             ` Alexey Kardashevskiy
@ 2015-05-26 14:24                               ` Paolo Bonzini
  2015-05-26 14:55                                 ` Michael Roth
  2015-05-26 15:00                                 ` Alexey Kardashevskiy
  2015-05-26 14:36                               ` Michael Roth
  1 sibling, 2 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-05-26 14:24 UTC (permalink / raw)
  To: Alexey Kardashevskiy, David Gibson
  Cc: Michael Roth, Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf



On 26/05/2015 16:17, Alexey Kardashevskiy wrote:
> On 05/27/2015 12:03 AM, Paolo Bonzini wrote:
>>
>>
>> On 26/05/2015 16:00, Alexey Kardashevskiy wrote:
>>> On 05/26/2015 11:48 PM, Paolo Bonzini wrote:
>>>>
>>>>
>>>> On 26/05/2015 15:42, Alexey Kardashevskiy wrote:
>>>>>
>>>>>
>>>>> The next patch of this patchset changes:
>>>>> spapr_tce_table_do_enable()
>>>>>       memory_region_init_iommu(&iommu)
>>>>>       memory_region_add_subregion(&root, &iommu)
>>>>>
>>>>> spapr_tce_table_disable()
>>>>>       memory_region_del_subregion(&root, &iommu)
>>>>>       object_unref(&iommu)
>>>>>
>>>>> These spapr_tce_xxx are called by request from the guest. &root is a
>>>>> container and exists as long as sPAPRTCETable exists.
>>>>>
>>>>> Where do I get a leaking child property here?
>>>>
>>>> When you unref iommu and not unparent it.  The next
>>>> memory_region_init_iommu creates a second child property, and the first
>>>> is gone.
>>>
>>> But when do I get this child property? In memory_region_add_subregion()?
>>> And memory_region_del_subregion() does not do the opposite thing
>>> (unparent)?
>>
>> In memory_region_init_iommu.
> 
> Ah. So I need at least s/object_unref/object_unparent/ in my current
> code, right?

Yes, and then you hit the situation documented in docs/memory.txt.

>> Why do you need different regions?  Why can't you have always the same
>> IOMMU regions, and either:
> 
> They may change a size.

That's not a problem, there's memory_region_set_size for that.

> These are dynamic DMA windows, guest may remove
> all and create randomly. Each region is backed by a separate TCE table
> with different page size.

Okay.

>> 1) create/destroy an alias to that region
> 
> How does this change things compared to iommus in regard to parenting?

Aliases do not have the same restriction.  But this doesn't help your
case if you have separate TCE tables etc.

>> 2) change the behavior of the translation function, while keeping a
>> single region?
> 
> Have one sPAPRTCETable object with 0, 1 or 2 (and potentially more)
> actual TCE tables? I can do that too but I thought subregions are just
> natural for that.

They may be.  You may need more than one though.

What guest actions trigger the change?  Is it a hypercall?  If so, what
hypercall is it so I can look at the documentation?

> I even wanted to create sPAPRTCETable' dynamically but
> this would break migration (because we cannot start QEMU with an
> additional sPAPRTCETable if it exists in the source which is not always
> the case).

Creating sPAPRTCETables dynamically would be a fix as well.  You _can_
unparent the sPAPRTCETable whenever you want.  But it's not necessarily
the right solution.

Why does it break migration?  There is only one migration handler for
all htabs, I think.  Or is this a different thing than the htabs?

The sPAPRTCETable would be created in its parent device's post_load handler.

> Ok. I'll redo this thing again and try using less QOM objects...

Wait, I haven't understood the problem yet.

Paolo

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-26 14:17                             ` Alexey Kardashevskiy
  2015-05-26 14:24                               ` Paolo Bonzini
@ 2015-05-26 14:36                               ` Michael Roth
  1 sibling, 0 replies; 56+ messages in thread
From: Michael Roth @ 2015-05-26 14:36 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Paolo Bonzini, David Gibson
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

Quoting Alexey Kardashevskiy (2015-05-26 09:17:42)
> On 05/27/2015 12:03 AM, Paolo Bonzini wrote:
> >
> >
> > On 26/05/2015 16:00, Alexey Kardashevskiy wrote:
> >> On 05/26/2015 11:48 PM, Paolo Bonzini wrote:
> >>>
> >>>
> >>> On 26/05/2015 15:42, Alexey Kardashevskiy wrote:
> >>>>
> >>>>
> >>>> The next patch of this patchset changes:
> >>>> spapr_tce_table_do_enable()
> >>>>       memory_region_init_iommu(&iommu)
> >>>>       memory_region_add_subregion(&root, &iommu)
> >>>>
> >>>> spapr_tce_table_disable()
> >>>>       memory_region_del_subregion(&root, &iommu)
> >>>>       object_unref(&iommu)
> >>>>
> >>>> These spapr_tce_xxx are called by request from the guest. &root is a
> >>>> container and exists as long as sPAPRTCETable exists.
> >>>>
> >>>> Where do I get a leaking child property here?
> >>>
> >>> When you unref iommu and not unparent it.  The next
> >>> memory_region_init_iommu creates a second child property, and the first
> >>> is gone.
> >>
> >> But when do I get this child property? In memory_region_add_subregion()?
> >> And memory_region_del_subregion() does not do the opposite thing
> >> (unparent)?
> >
> > In memory_region_init_iommu.
> 
> Ah. So I need at least s/object_unref/object_unparent/ in my current code, 
> right?

I've actually tried that. In that case, FlatView still holds a
reference to the region, and when the RCU thread finally unrefs, it sees
that the MR no longer has an owner in the code below:

void memory_region_unref(MemoryRegion *mr)
{
    Object *obj = OBJECT(mr);
    if (obj && obj->parent) {
        object_unref(obj->parent);
    } else {
        object_unref(obj);
    }
}

Since the region, prior to the object_unparent(), had an owner, it
gave its ownership over to the owner (which takes a ref on it as
a child property) and unref'd itself. Now that it's been orphaned,
any attempt by RCU to call memory_region_unref() on it results in
the MR attempting to unref itself, rather than its owner, but its
ref is already 0: object_unparent() caused it to be finalized already,
so we actually end up triggering the same assertion as with
object_unref: g_assert(obj->ref > 0);

When we use object_unref() instead of object_unparent(), we hit that
same assertion when the owner gets finalized, since it still sees the
MR as a child and attempts to unref it again.
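
For illustration only, the sequence above can be replayed with a toy
model. This is not QEMU code: ToyObject and every toy_* helper are
made-up names, and the -1 return value merely stands in for the
g_assert(obj->ref > 0) failure. The sketch assumes the ref/unref
delegation has the same shape as the memory_region_unref() quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a QOM object: a refcount plus an optional owner. */
typedef struct ToyObject {
    int ref;
    struct ToyObject *parent;
} ToyObject;

static void toy_object_ref(ToyObject *obj)
{
    obj->ref++;
}

/* Returns 0 on success, -1 where the real code would hit the assertion. */
static int toy_object_unref(ToyObject *obj)
{
    if (obj->ref <= 0) {
        return -1;
    }
    obj->ref--;
    return 0;
}

/* Same shape as the quoted function: delegate to the owner if there is one. */
static void toy_region_ref(ToyObject *mr)
{
    toy_object_ref(mr->parent ? mr->parent : mr);
}

static int toy_region_unref(ToyObject *mr)
{
    return toy_object_unref(mr->parent ? mr->parent : mr);
}

/* Unparent: forget the owner and drop the child-property reference. */
static int toy_object_unparent(ToyObject *obj)
{
    obj->parent = NULL;
    return toy_object_unref(obj);
}

/* Replays a FlatView-style reference that is dropped late by RCU. */
static int toy_replay(bool unparent_first)
{
    ToyObject owner = { .ref = 1, .parent = NULL };
    ToyObject mr = { .ref = 1, .parent = &owner }; /* ref held by the owner */
    int rc;

    toy_region_ref(&mr);            /* pins the owner: owner.ref == 2 */
    if (unparent_first) {
        toy_object_unparent(&mr);   /* mr.ref drops to 0, owner forgotten */
        rc = toy_region_unref(&mr); /* late unref lands on mr itself */
    } else {
        rc = toy_region_unref(&mr); /* unref lands on the owner */
        toy_object_unparent(&mr);
    }
    return rc;
}
```

In this model the late unref only succeeds if it happens while the
region still has its owner; once the owner is forgotten, the reference
taken on the owner can no longer be given back.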

> 
> >>> What is different between the various IOMMU regions, so that you cannot
> >>> create just one?
> >>
> >> There are two DMA windows on the same PCI bus (in hardware too), at
> >> different offset and with a different page size.
> >
> > Why do you need different regions?  Why can't you have always the same
> > IOMMU regions, and either:
> 
> They may change a size. These are dynamic DMA windows, guest may remove all 
> and create randomly. Each region is backed by a separate TCE table with 
> different page size.
> 
> > 1) create/destroy an alias to that region
> 
> How does this change things compared to iommus in regard to parenting?
> 
> 
> > 2) change the behavior of the translation function, while keeping a
> > single region?
> 
> Have one sPAPRTCETable object with 0, 1 or 2 (and potentially more) actual 
> TCE tables? I can do that too but I thought subregions are just natural for 
> that. I even wanted to create sPAPRTCETable' dynamically but this would 
> break migration (because we cannot start QEMU with an additional 
> sPAPRTCETable if it exists in the source which is not always the case).
> 
> Ok. I'll redo this thing again and try using less QOM objects...
> 
> 
> -- 
> Alexey
> 

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-26 14:24                               ` Paolo Bonzini
@ 2015-05-26 14:55                                 ` Michael Roth
  2015-05-26 14:58                                   ` Paolo Bonzini
  2015-05-26 15:00                                 ` Alexey Kardashevskiy
  1 sibling, 1 reply; 56+ messages in thread
From: Michael Roth @ 2015-05-26 14:55 UTC (permalink / raw)
  To: Paolo Bonzini, Alexey Kardashevskiy, David Gibson
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

Quoting Paolo Bonzini (2015-05-26 09:24:57)
> 
> 
> On 26/05/2015 16:17, Alexey Kardashevskiy wrote:
> > On 05/27/2015 12:03 AM, Paolo Bonzini wrote:
> >>
> >>
> >> On 26/05/2015 16:00, Alexey Kardashevskiy wrote:
> >>> On 05/26/2015 11:48 PM, Paolo Bonzini wrote:
> >>>>
> >>>>
> >>>> On 26/05/2015 15:42, Alexey Kardashevskiy wrote:
> >>>>>
> >>>>>
> >>>>> The next patch of this patchset changes:
> >>>>> spapr_tce_table_do_enable()
> >>>>>       memory_region_init_iommu(&iommu)
> >>>>>       memory_region_add_subregion(&root, &iommu)
> >>>>>
> >>>>> spapr_tce_table_disable()
> >>>>>       memory_region_del_subregion(&root, &iommu)
> >>>>>       object_unref(&iommu)
> >>>>>
> >>>>> These spapr_tce_xxx are called by request from the guest. &root is a
> >>>>> container and exists as long as sPAPRTCETable exists.
> >>>>>
> >>>>> Where do I get a leaking child property here?
> >>>>
> >>>> When you unref iommu and not unparent it.  The next
> >>>> memory_region_init_iommu creates a second child property, and the first
> >>>> is gone.
> >>>
> >>> But when do I get this child property? In memory_region_add_subregion()?
> >>> And memory_region_del_subregion() does not do the opposite thing
> >>> (unparent)?
> >>
> >> In memory_region_init_iommu.
> > 
> > Ah. So I need at least s/object_unref/object_unparent/ in my current
> > code, right?
> 
> Yes, and then you hit the situation documented in docs/memory.txt.
> 
> >> Why do you need different regions?  Why can't you have always the same
> >> IOMMU regions, and either:
> > 
> > They may change a size.
> 
> That's not a problem, there's memory_region_set_size for that.

What on earth, I could've sworn I looked for this... yes I think that
would solve the issue here. mr_add/mr_del can handle the change in
offsets, set_size can deal with the change in size, and we can then
move to using an MR allocated at IOMMU creation time.

> 
> > These are dynamic DMA windows, guest may remove
> > all and create randomly. Each region is backed by a separate TCE table
> > with different page size.
> 
> Okay.
> 
> >> 1) create/destroy an alias to that region
> > 
> > How does this change things compared to iommus in regard to parenting?
> 
> Aliases do not have the same restriction.  But this doesn't help your
> case if you have separate TCE tables etc.
> 
> >> 2) change the behavior of the translation function, while keeping a
> >> single region?
> > 
> > Have one sPAPRTCETable object with 0, 1 or 2 (and potentially more)
> > actual TCE tables? I can do that too but I thought subregions are just
> > natural for that.
> 
> They may be.  You may need more than one though.
> 
> What guest actions trigger the change?  Is it a hypercall?  If so, what
> hypercall is it so I can look at the documentation?
> 
> > I even wanted to create sPAPRTCETable' dynamically but
> > this would break migration (because we cannot start QEMU with an
> > additional sPAPRTCETable if it exists in the source which is not always
> > the case).
> 
> Creating sPAPRTCETables dynamically would be a fix as well.  You _can_
> unparent the sPAPRTCETable whenever you want.  But it's not necessarily
> the right solution.

Yah, I think this would work too, simply resizing the IOMMU MR seems
more straightforward in our case though.

> 
> Why does it break migration?  There is only one migration handler for
> all htabs, I think.  Or is this a different thing than the htabs?

I think the issue was that migration expects all objects in destination
to be instantiated prior to the start of migration, so any scheme where
the IOMMU objects are created/destroyed at essentially random times
causes problems in terms of figuring out where to load in the migrated
TCE tables.

> 
> The sPAPRTCETable would be created in its parent device's post_load handler.
> 
> > Ok. I'll redo this thing again and try using less QOM objects...
> 
> Wait, I haven't understood the problem yet.

AFAIK you've given us an ideal solution using memory_region_set_size()
so we can avoid the dynamic MR creation during reset. Not sure if
there's anything else that's missing.

> 
> Paolo
> 

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-26 14:55                                 ` Michael Roth
@ 2015-05-26 14:58                                   ` Paolo Bonzini
  2015-05-26 15:49                                     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2015-05-26 14:58 UTC (permalink / raw)
  To: Michael Roth, Alexey Kardashevskiy, David Gibson
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf



On 26/05/2015 16:55, Michael Roth wrote:
> > That's not a problem, there's memory_region_set_size for that.
> 
> What on earth, I could've sworn I looked for this... yes I think that
> would solve the issue here. mr_add/mr_del can handle the change in
> offsets, set size can deal with the change and size, and we can then
> move to using an MR allocated at IOMMU creation time.

It's very little used, but that's just because it's not too common.
There's nothing wrong with it. :)

If you do del/set_size/add, you may want to put a
memory_region_transaction_{begin,commit} around the whole dance.

Paolo

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-26 14:24                               ` Paolo Bonzini
  2015-05-26 14:55                                 ` Michael Roth
@ 2015-05-26 15:00                                 ` Alexey Kardashevskiy
  2015-05-26 15:08                                   ` Paolo Bonzini
  1 sibling, 1 reply; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-26 15:00 UTC (permalink / raw)
  To: Paolo Bonzini, David Gibson
  Cc: Michael Roth, Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On 05/27/2015 12:24 AM, Paolo Bonzini wrote:
>
>
> On 26/05/2015 16:17, Alexey Kardashevskiy wrote:
>> On 05/27/2015 12:03 AM, Paolo Bonzini wrote:
>>>
>>>
>>> On 26/05/2015 16:00, Alexey Kardashevskiy wrote:
>>>> On 05/26/2015 11:48 PM, Paolo Bonzini wrote:
>>>>>
>>>>>
>>>>> On 26/05/2015 15:42, Alexey Kardashevskiy wrote:
>>>>>>
>>>>>>
>>>>>> The next patch of this patchset changes:
>>>>>> spapr_tce_table_do_enable()
>>>>>>        memory_region_init_iommu(&iommu)
>>>>>>        memory_region_add_subregion(&root, &iommu)
>>>>>>
>>>>>> spapr_tce_table_disable()
>>>>>>        memory_region_del_subregion(&root, &iommu)
>>>>>>        object_unref(&iommu)
>>>>>>
>>>>>> These spapr_tce_xxx are called by request from the guest. &root is a
>>>>>> container and exists as long as sPAPRTCETable exists.
>>>>>>
>>>>>> Where do I get a leaking child property here?
>>>>>
>>>>> When you unref iommu and not unparent it.  The next
>>>>> memory_region_init_iommu creates a second child property, and the first
>>>>> is gone.
>>>>
>>>> But when do I get this child property? In memory_region_add_subregion()?
>>>> And memory_region_del_subregion() does not do the opposite thing
>>>> (unparent)?
>>>
>>> In memory_region_init_iommu.
>>
>> Ah. So I need at least s/object_unref/object_unparent/ in my current
>> code, right?
>
> Yes, and then you hit the situation documented in docs/memory.txt.

Oh. ok.


>>> Why do you need different regions?  Why can't you have always the same
>>> IOMMU regions, and either:
>>
>> They may change a size.
>
> That's not a problem, there's memory_region_set_size for that.


It was not there when I started doing this DDW :) If so, I can keep the 
existing structure and just set size to zero instead of 
memory_region_del_subregion().


>> These are dynamic DMA windows, guest may remove
>> all and create randomly. Each region is backed by a separate TCE table
>> with different page size.
>
> Okay.
>
>>> 1) create/destroy an alias to that region
>>
>> How does this change things compared to iommus in regard to parenting?
>
> Aliases do not have the same restriction.  But this doesn't help your
> case if you have separate TCE tables etc.

I need windows to appear and disappear on a bus dynamically, that's it. The
actual sPAPRTCETable objects exist always. Aliases will do the job as far 
as I can tell.

>>> 2) change the behavior of the translation function, while keeping a
>>> single region?
>>
>> Have one sPAPRTCETable object with 0, 1 or 2 (and potentially more)
>> actual TCE tables? I can do that too but I thought subregions are just
>> natural for that.
>
> They may be.  You may need more than one though.

I fail to see when :)


> What guest actions trigger the change?  Is it a hypercall?  If so, what
> hypercall is it so I can look at the documentation?

It is a bunch of RTAS calls which are highly classified in PAPR spec :)

Linux guests do this:
1. load a driver
2. driver calls set_dma_mask()
3. if mask < 64, usual old-style &dma_iommu_ops is used; exit
4. platform code calls enable_ddw()
5. enable_ddw() looks at PHB "ddw-applicable"
6. enable_ddw() calls ibm,query-pe-dma-window (returns page mask supported)
7. enable_ddw() calls ibm,create-pe-dma-window to create actual window with 
specific size (which is entire guest RAM in the case of linux but might be 
different for the other OS) and know its bus address (rtas returns it, the 
guest does not choose it)
8. enable_ddw() calls H_PUT_TCE in a loop to map all guest RAM pages onto a 
bus and does set_dma_ops(dev, &dma_direct_ops) so H_PUT_TCE is not called 
again till guest reboot.

If any step in 5..8 fails, then &dma_iommu_ops is used.

The pseries platform expects the default DMA window (4K pages, <=2GB) to 
exist. And there is an extra ibm,remove-pe-dma-window call to remove any 
window (including default one) so a following ibm,create-pe-dma-window will 
create a new window at zero offset on a bus (as big as the guest RAM and 
page size bigger than 4K).

Aaaaand there is an extension - ibm,reset-pe-dma-window which should delete 
all windows and create the default one (kernels before v3.10 or so used to 
do this). The machine reset should do the same thing.



>> I even wanted to create sPAPRTCETable' dynamically but
>> this would break migration (because we cannot start QEMU with an
>> additional sPAPRTCETable if it exists in the source which is not always
>> the case).
>
> Creating sPAPRTCETables dynamically would be a fix as well.  You _can_
> unparent the sPAPRTCETable whenever you want.  But it's not necessarily
> the right solution.
>
> Why does it break migration?  There is only one migration handler for
> all htabs, I think.  Or is this a different thing than the htabs?


sPAPRTCETable stores the actual table and if I want it to migrate, the 
destination QEMU must have the object created-and-vmstate_register'ated. 
But the table (and class) may be absent or present on the source side so I 
need to start the destination with or without -device sPAPRTCETable, and if 
I need to create this object, I need to make it a child of a PHB and last 
time I checked - there is no command line interface for linking children.


>
> The sPAPRTCETable would be created in its parent device's post_load handler.
>
>> Ok. I'll redo this thing again and try using less QOM objects...
>
> Wait, I haven't understood the problem yet.

Oookay :)

But I started thinking that always having 2 sPAPRTCETable objects (some may
be "disabled") is not better than a single sPAPRTCETable with multiple TCE
tables...


-- 
Alexey

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-26 15:00                                 ` Alexey Kardashevskiy
@ 2015-05-26 15:08                                   ` Paolo Bonzini
  2015-05-26 15:49                                     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2015-05-26 15:08 UTC (permalink / raw)
  To: Alexey Kardashevskiy, David Gibson
  Cc: Michael Roth, Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf



On 26/05/2015 17:00, Alexey Kardashevskiy wrote:
>>>> Why do you need different regions?  Why can't you have always the same
>>>> IOMMU regions, and either:
>>>
>>> They may change a size.
>>
>> That's not a problem, there's memory_region_set_size for that.
> 
> It was not there when I started doing this DDW :) If so, I can keep the
> existing structure and just set size to zero instead of
> memory_region_del_subregion().

del/add_subregion is okay.  It's just init/unparent that is wrong.

> I need windows appear and disappear on a bus dynamically, that's it. The
> actual sPAPRTCETable objects exist always.

Great.

> Aliases will do the job as far as I can tell.

Then you can choose between init_alias/add/del/unparent(alias) and
del/set_size/add which Michael has mentioned.  The latter is probably
cleaner and faster.

> sPAPRTCETable stores the actual table and if I want it to migrate, the
> destination QEMU must have the object created-and-vmstate_register'ated.
> But the table (and class) may be absent or present on the source side so
> I need to start the destination with or without -device sPAPRTCETable,
> and if I need to create this object, I need to make it a child of a PHB
> and last time I checked - there is no command line interface for linking
> children.

Yup, understood now.

> But I started thinking that always having 2 sPAPRTCETable objects (some
> may be "disabled") it not better than a single sPAPRTCETable with
> multiple TCE tables...

Whatever works best for you.  Either is okay.

Paolo

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-26 15:08                                   ` Paolo Bonzini
@ 2015-05-26 15:49                                     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-26 15:49 UTC (permalink / raw)
  To: Paolo Bonzini, David Gibson
  Cc: Michael Roth, Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On 05/27/2015 01:08 AM, Paolo Bonzini wrote:
>
>
> On 26/05/2015 17:00, Alexey Kardashevskiy wrote:
>>>>> Why do you need different regions?  Why can't you have always the same
>>>>> IOMMU regions, and either:
>>>>
>>>> They may change a size.
>>>
>>> That's not a problem, there's memory_region_set_size for that.
>>
>> It was not there when I started doing this DDW :) If so, I can keep the
>> existing structure and just set size to zero instead of
>> memory_region_del_subregion().
>
> del/add_subregion is okay.  It's just init/unparent that is wrong.

Yup, right, my bad, I do not need init/unparent and I still need 
del/add_subregion for setting an offset.
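
Roughly, as a toy model (illustration only, not the actual patch):
ToyWindow and the toy_window_* helpers are made-up names. In QEMU the
three fields would be driven by memory_region_del_subregion(),
memory_region_set_size() and memory_region_add_subregion() on an IOMMU
MR that is initialized once and never unparented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy view of one DMA window as seen on the PCI bus. */
typedef struct {
    uint64_t offset; /* bus address of the window */
    uint64_t size;   /* 0 while the window is disabled */
    bool mapped;     /* true while it is a subregion of the root */
} ToyWindow;

/* Models del_subregion + set_size(0): the window leaves the bus. */
static void toy_window_disable(ToyWindow *w)
{
    w->mapped = false;
    w->size = 0;
}

/* Models set_size(size) + add_subregion(offset): the window reappears. */
static void toy_window_enable(ToyWindow *w, uint64_t offset, uint64_t size)
{
    assert(!w->mapped); /* must be removed before it can be re-added */
    assert(size > 0);
    w->size = size;
    w->offset = offset;
    w->mapped = true;
}
```

The object itself lives as long as its parent; only its size and its
position on the bus change when the guest removes or creates a window.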


>
>> I need windows appear and disappear on a bus dynamically, that's it. The
>> actual sPAPRTCETable objects exist always.
>
> Great.
>
>> Aliases will do the job as far as I can tell.
>
> Then you can choose between init_alias/add/del/unparent(alias) and
> del/set_size/add which Michael has mentioned.  The latter is probably
> cleaner and faster.
>
>> sPAPRTCETable stores the actual table and if I want it to migrate, the
>> destination QEMU must have the object created-and-vmstate_register'ated.
>> But the table (and class) may be absent or present on the source side so
>> I need to start the destination with or without -device sPAPRTCETable,
>> and if I need to create this object, I need to make it a child of a PHB
>> and last time I checked - there is no command line interface for linking
>> children.
>
> Yup, understood now.
>
>> But I started thinking that always having 2 sPAPRTCETable objects (some
>> may be "disabled") it not better than a single sPAPRTCETable with
>> multiple TCE tables...
>
> Whatever works best for you.  Either is okay.

Yeah... multiple sPAPRTCETables and set_size() it is then. A single 
sPAPRTCETable has a problem that if I need to migrate multiple tables - 
vmstate will look ugly (especially for backward compatibility).


-- 
Alexey

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-26 14:58                                   ` Paolo Bonzini
@ 2015-05-26 15:49                                     ` Alexey Kardashevskiy
  2015-05-26 15:51                                       ` Paolo Bonzini
  0 siblings, 1 reply; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-26 15:49 UTC (permalink / raw)
  To: Paolo Bonzini, Michael Roth, David Gibson
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On 05/27/2015 12:58 AM, Paolo Bonzini wrote:
>
>
> On 26/05/2015 16:55, Michael Roth wrote:
>>> That's not a problem, there's memory_region_set_size for that.
>>
>> What on earth, I could've sworn I looked for this... yes I think that
>> would solve the issue here. mr_add/mr_del can handle the change in
>> offsets, set size can deal with the change and size, and we can then
>> move to using an MR allocated at IOMMU creation time.
>
> It's very little used, but that's just because it's not too common.
> There's nothing wrong with it. :)
>
> If you do del/set_size/add, you may want to put a
> memory_region_transaction_{begin,commit} around the whole dance.


Here I lost you again :)
Why? These are IOMMU MRs -> they are dynamic, what will begin()/commit() 
change here?


-- 
Alexey

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-26 15:49                                     ` Alexey Kardashevskiy
@ 2015-05-26 15:51                                       ` Paolo Bonzini
  2015-05-26 23:55                                         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2015-05-26 15:51 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Michael Roth, David Gibson
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf



On 26/05/2015 17:49, Alexey Kardashevskiy wrote:
>>
>> It's very little used, but that's just because it's not too common.
>> There's nothing wrong with it. :)
>>
>> If you do del/set_size/add, you may want to put a
>> memory_region_transaction_{begin,commit} around the whole dance.
> 
> 
> Here I lost you again :)
> Why? These are IOMMU MRs -> they are dynamic, what will begin()/commit()
> change here?

If you don't add them, the memory core may create two or three different
flatviews.  With begin/commit, it will only do one change.  It's just an
optimization.
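
A toy model of the batching, for illustration only. ToyMemoryCore and
the toy_* functions are made-up names, and the model simply assumes
that every topology change outside a transaction triggers its own
rebuild; the real memory core may do fewer.

```c
#include <assert.h>
#include <stdbool.h>

/* Counts how often the flat view would be rebuilt. */
typedef struct {
    int depth;    /* nesting level of open transactions */
    bool pending; /* a change was made inside a transaction */
    int rebuilds; /* number of flat view rebuilds so far */
} ToyMemoryCore;

static void toy_transaction_begin(ToyMemoryCore *c)
{
    c->depth++;
}

static void toy_transaction_commit(ToyMemoryCore *c)
{
    assert(c->depth > 0);
    if (--c->depth == 0 && c->pending) {
        c->pending = false;
        c->rebuilds++; /* one rebuild for everything in the batch */
    }
}

/* Any topology change: del_subregion, set_size or add_subregion. */
static void toy_topology_change(ToyMemoryCore *c)
{
    if (c->depth > 0) {
        c->pending = true; /* deferred until the outermost commit */
    } else {
        c->rebuilds++;     /* rebuilt immediately */
    }
}

/* Runs del/set_size/add and reports how many rebuilds it cost. */
static int toy_del_set_size_add(bool batched)
{
    ToyMemoryCore core = { 0, false, 0 };

    if (batched) {
        toy_transaction_begin(&core);
    }
    toy_topology_change(&core); /* del */
    toy_topology_change(&core); /* set_size */
    toy_topology_change(&core); /* add */
    if (batched) {
        toy_transaction_commit(&core);
    }
    return core.rebuilds;
}
```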

Paolo

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-26 15:51                                       ` Paolo Bonzini
@ 2015-05-26 23:55                                         ` Alexey Kardashevskiy
  2015-05-27  7:05                                           ` Paolo Bonzini
  0 siblings, 1 reply; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-26 23:55 UTC (permalink / raw)
  To: Paolo Bonzini, Michael Roth, David Gibson
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On 05/27/2015 01:51 AM, Paolo Bonzini wrote:
>
>
> On 26/05/2015 17:49, Alexey Kardashevskiy wrote:
>>>
>>> It's very little used, but that's just because it's not too common.
>>> There's nothing wrong with it. :)
>>>
>>> If you do del/set_size/add, you may want to put a
>>> memory_region_transaction_{begin,commit} around the whole dance.


>> Here I lost you again :)
>> Why? These are IOMMU MRs -> they are dynamic, what will begin()/commit()
>> change here?
>
> If you don't add them, the memory core may create two or three different
> flatviews.  With begin/commit, it will only do one change.  It's just an
> optimization.



One step back :) Whole dance is what here? There are:
1) del+set_size(0)
2) set_size(not zero)+add

1) and 2) are called in different places; between those the guest gets
control, so I cannot wrap these in begin/commit.  If you suggest
begin+commit around, for example, 1), this is just a single call and there
will be one change to flatview; where may many changes happen? And only the
last one is used, right, or is there a list of them somewhere? :)



-- 
Alexey

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-26  8:58       ` Paolo Bonzini
  2015-05-26  9:01         ` Alexander Graf
  2015-05-26 10:15         ` Alexey Kardashevskiy
@ 2015-05-27  2:54         ` David Gibson
  2 siblings, 0 replies; 56+ messages in thread
From: David Gibson @ 2015-05-27  2:54 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Alexander Graf, Alexey Kardashevskiy, Michael Roth, qemu-devel,
	Alex Williamson, qemu-ppc

On Tue, May 26, 2015 at 10:58:02AM +0200, Paolo Bonzini wrote:
> 
> 
> On 26/05/2015 04:46, David Gibson wrote:
> > On Tue, May 26, 2015 at 01:05:56AM +1000, Alexey Kardashevskiy 
> > wrote:
> >> Hi Paolo,
> >> 
> >> I have had a conversation with Mike and it turns out I am not 
> >> allowed to create/remove memory regions dynamically 
> >> (docs/memory.txt:101); otherwise "destroying regions during
> >> reset causes assertion in RCU thread during PHB/IOMMU
> >> unplug/unparent". Is it because patch just missing some
> >> unref()/unparent() call or it is totally wrong and I have to
> >> implement subregions (on a PCI bus address space) myself if I
> >> want dynamic DMA windows? Thanks!
> 
> I'm blind, can you explain the path where that happens?
> 
> > So, the sentences after that one note an exception for alias and 
> > container regions.  I think iommu regions should behave similarly
> > - in a sense they're just a procedurally generated collection of 
> > alias regions.
> 
> The difference is that containers and aliases are resolved at the time
> the memory region tree is flattened, while IOMMU regions are resolved
> at run time.
> 
> > If it's not true now that they can be unparented at any time like 
> > alias regions, we should probably try to make it true.
> 
> Unfortunately it's not so easy...

Ah, yeah, drat.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-26 23:55                                         ` Alexey Kardashevskiy
@ 2015-05-27  7:05                                           ` Paolo Bonzini
  2015-07-04  1:12                                             ` Alexey Kardashevskiy
  0 siblings, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2015-05-27  7:05 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Michael Roth, David Gibson
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf



On 27/05/2015 01:55, Alexey Kardashevskiy wrote:
> One step back :) Whole dance is what here? There are:
> 1) del+set_size(0)
> 2) set_size(not zero)+add

Then no need for begin/commit. :)

Paolo

* Re: [Qemu-devel] [PATCH qemu v7 13/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2015-05-05 12:49   ` David Gibson
@ 2015-06-18 11:35     ` Alexey Kardashevskiy
  2015-06-19  1:45       ` David Gibson
  0 siblings, 1 reply; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-06-18 11:35 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On 05/05/2015 10:49 PM, David Gibson wrote:
> On Sat, Apr 25, 2015 at 10:24:43PM +1000, Alexey Kardashevskiy wrote:
>> This adds support for Dynamic DMA Windows (DDW) option defined by
>> the SPAPR specification which allows to have additional DMA window(s)
>>
>> This implements DDW for emulated and VFIO devices. As all TCE root regions
>> are mapped at 0 and 64bit long (and actual tables are child regions),
>> this replaces memory_region_add_subregion() with _overlap() to make
>> QEMU memory API happy.
>>
>> This reserves RTAS token numbers for DDW calls.
>>
>> This implements helpers to interact with VFIO kernel interface.
>>
>> This changes the TCE table migration descriptor to support dynamic
>> tables as from now on, PHB will create as many stub TCE table objects
>> as PHB can possibly support but not all of them might be initialized at
>> the time of migration because DDW might or might not be requested by
>> the guest.
>>
>> The "ddw" property is enabled by default on a PHB but for compatibility
>> the pseries-2.3 machine and older disable it.
>>
>> This implements DDW for VFIO. The host kernel support is required.
>> This adds a "levels" property to PHB to control the number of levels
>> in the actual TCE table allocated by the host kernel, 0 is the default
>> value to tell QEMU to calculate the correct value. Current hardware
>> supports up to 5 levels.
>>
>> The existing linux guests try creating one additional huge DMA window
>> with 64K or 16MB pages and map the entire guest RAM to. If succeeded,
>> the guest switches to dma_direct_ops and never calls TCE hypercalls
>> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
>> and not waste time on map/unmap later.
>>
>> This adds 4 RTAS handlers:
>> * ibm,query-pe-dma-window
>> * ibm,create-pe-dma-window
>> * ibm,remove-pe-dma-window
>> * ibm,reset-pe-dma-window
>> These are registered from type_init() callback.
>>
>> These RTAS handlers are implemented in a separate file to avoid polluting
>> spapr_iommu.c with PCI.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

I saw this and decided there are no more comments but I was wrong :)



>> ---
>> Changes:
>> v6:
>> * rework as there is no more special device for VFIO PHB
>>
>> v5:
>> * total rework
>> * enabled for machines >2.3
>> * fixed migration
>> * merged rtas handlers here
>>
>> v4:
>> * reset handler is back in generalized form
>>
>> v3:
>> * removed reset
>> * windows_num is now 1 or bigger rather than 0-based value and it is only
>> changed in PHB code, not in RTAS
>> * added page mask check in create()
>> * added SPAPR_PCI_DDW_MAX_WINDOWS to track how many windows are already
>> created
>>
>> v2:
>> * tested on hacked emulated E1000
>> * implemented DDW reset on the PHB reset
>> * spapr_pci_ddw_remove/spapr_pci_ddw_reset are public for reuse by VFIO
>> ---
>>   hw/ppc/Makefile.objs        |   3 +
>>   hw/ppc/spapr.c              |  10 +-
>>   hw/ppc/spapr_iommu.c        |  35 +++++-
>>   hw/ppc/spapr_pci.c          |  66 ++++++++--
>>   hw/ppc/spapr_pci_vfio.c     |  80 ++++++++++++
>>   hw/ppc/spapr_rtas_ddw.c     | 300 ++++++++++++++++++++++++++++++++++++++++++++
>>   include/hw/pci-host/spapr.h |  21 ++++
>>   include/hw/ppc/spapr.h      |  17 ++-
>>   trace-events                |   4 +
>>   9 files changed, 521 insertions(+), 15 deletions(-)
>>   create mode 100644 hw/ppc/spapr_rtas_ddw.c
>>
>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>> index 437955d..c6b344f 100644
>> --- a/hw/ppc/Makefile.objs
>> +++ b/hw/ppc/Makefile.objs
>> @@ -7,6 +7,9 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o
>>   ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>>   obj-y += spapr_pci_vfio.o
>>   endif
>> +ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES), yy)
>> +obj-y += spapr_rtas_ddw.o
>> +endif
>>   # PowerPC 4xx boards
>>   obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>>   obj-y += ppc4xx_pci.o
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index b28209f..fd7fdb3 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -1801,7 +1801,15 @@ static const TypeInfo spapr_machine_info = {
>>       },
>>   };
>>
>> +#define SPAPR_COMPAT_2_3 \
>> +        {\
>> +            .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
>> +            .property = "ddw",\
>> +            .value    = stringify(off),\
>> +        }
>> +
>>   #define SPAPR_COMPAT_2_2 \
>> +        SPAPR_COMPAT_2_3, \
>>           {\
>>               .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
>>               .property = "mem_win_size",\
>> @@ -1853,7 +1861,7 @@ static const TypeInfo spapr_machine_2_2_info = {
>>   static void spapr_machine_2_3_class_init(ObjectClass *oc, void *data)
>>   {
>>       static GlobalProperty compat_props[] = {
>> -        SPAPR_COMPAT_2_2,
>> +        SPAPR_COMPAT_2_3,
>>           { /* end of list */ }
>>       };
>>       MachineClass *mc = MACHINE_CLASS(oc);
>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>> index 245534f..df4c72d 100644
>> --- a/hw/ppc/spapr_iommu.c
>> +++ b/hw/ppc/spapr_iommu.c
>> @@ -90,6 +90,15 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
>>       return ret;
>>   }
>>
>> +static void spapr_tce_table_pre_save(void *opaque)
>> +{
>> +    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
>> +
>> +    tcet->migtable = tcet->table;
>> +}
>> +
>> +static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
>> +
>>   static int spapr_tce_table_post_load(void *opaque, int version_id)
>>   {
>>       sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
>> @@ -98,22 +107,42 @@ static int spapr_tce_table_post_load(void *opaque, int version_id)
>>           spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
>>       }
>>
>> +    if (!tcet->migtable) {
>
> What's the case where migtable will be NULL?  IIUC an old->new
> migration will result in the data saved for "table" being loaded into
> "migtable".
>
> So "migtable" should only be NULL, when tce->enabled is also false?


Seems to be true, and this is just an extra precaution. Remove?



-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v7 13/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2015-06-18 11:35     ` Alexey Kardashevskiy
@ 2015-06-19  1:45       ` David Gibson
  2015-06-19  6:49         ` Markus Armbruster
  0 siblings, 1 reply; 56+ messages in thread
From: David Gibson @ 2015-06-19  1:45 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf


On Thu, Jun 18, 2015 at 09:35:44PM +1000, Alexey Kardashevskiy wrote:
> On 05/05/2015 10:49 PM, David Gibson wrote:
> >On Sat, Apr 25, 2015 at 10:24:43PM +1000, Alexey Kardashevskiy wrote:
> >>This adds support for Dynamic DMA Windows (DDW) option defined by
> >>the SPAPR specification which allows to have additional DMA window(s)
> >>
> >>This implements DDW for emulated and VFIO devices. As all TCE root regions
> >>are mapped at 0 and 64bit long (and actual tables are child regions),
> >>this replaces memory_region_add_subregion() with _overlap() to make
> >>QEMU memory API happy.
> >>
> >>This reserves RTAS token numbers for DDW calls.
> >>
> >>This implements helpers to interact with VFIO kernel interface.
> >>
> >>This changes the TCE table migration descriptor to support dynamic
> >>tables as from now on, PHB will create as many stub TCE table objects
> >>as PHB can possibly support but not all of them might be initialized at
> >>the time of migration because DDW might or might not be requested by
> >>the guest.
> >>
> >>The "ddw" property is enabled by default on a PHB but for compatibility
> >>the pseries-2.3 machine and older disable it.
> >>
> >>This implements DDW for VFIO. The host kernel support is required.
> >>This adds a "levels" property to PHB to control the number of levels
> >>in the actual TCE table allocated by the host kernel, 0 is the default
> >>value to tell QEMU to calculate the correct value. Current hardware
> >>supports up to 5 levels.
> >>
> >>The existing linux guests try creating one additional huge DMA window
> >>with 64K or 16MB pages and map the entire guest RAM to. If succeeded,
> >>the guest switches to dma_direct_ops and never calls TCE hypercalls
> >>(H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
> >>and not waste time on map/unmap later.
> >>
> >>This adds 4 RTAS handlers:
> >>* ibm,query-pe-dma-window
> >>* ibm,create-pe-dma-window
> >>* ibm,remove-pe-dma-window
> >>* ibm,reset-pe-dma-window
> >>These are registered from type_init() callback.
> >>
> >>These RTAS handlers are implemented in a separate file to avoid polluting
> >>spapr_iommu.c with PCI.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >
> >Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> 
> > I saw this and decided there are no more comments but I was wrong :)

Right.  Note that if I add a Reviewed-by but also make comments, then
those comments are seeking clarification and maybe suggesting later
cleanups, but I think the problems are small enough that the patch is
still ready to go as it is.

[snip]
> >>+static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
> >>+
> >>  static int spapr_tce_table_post_load(void *opaque, int version_id)
> >>  {
> >>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> >>@@ -98,22 +107,42 @@ static int spapr_tce_table_post_load(void *opaque, int version_id)
> >>          spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
> >>      }
> >>
> >>+    if (!tcet->migtable) {
> >
> >What's the case where migtable will be NULL?  IIUC an old->new
> >migration will result in the data saved for "table" being loaded into
> >"migtable".
> >
> >So "migtable" should only be NULL, when tce->enabled is also false?
> 
> 
> > Seems to be true, and this is just an extra precaution. Remove?

Yes, remove as a cleanup, but that can be done later, I won't hold up
the main patch series for this.
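
For readers following the migtable question, here is a minimal sketch of
the invariant being discussed, written as a standalone toy model rather
than the patch's code. ToyTceTable, toy_pre_save() and toy_post_load() are
hypothetical names; only the pre_save assignment (migtable = table) comes
from the quoted hunk. Rejecting an enabled table that arrives without
entries, instead of silently skipping it, is an assumption made for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the TCE table object; not QEMU's sPAPRTCETable. */
typedef struct ToyTceTable {
    bool enabled;        /* window currently active */
    uint32_t nb_table;   /* number of entries */
    uint64_t *table;     /* live entries, NULL while disabled */
    uint64_t *migtable;  /* pointer the migration stream reads or fills */
} ToyTceTable;

/* Source side: expose the live table to the migration stream. */
void toy_pre_save(ToyTceTable *t)
{
    t->migtable = t->table;
}

/*
 * Destination side: returns 0 on success, -1 on an inconsistent stream.
 * A disabled table has nothing to adopt. An enabled table without loaded
 * entries is reported as an error (assumption for this sketch).
 */
int toy_post_load(ToyTceTable *t)
{
    if (!t->enabled) {
        return 0;
    }
    if (!t->migtable) {
        return -1;
    }
    /* In this model the destination starts without a live table. */
    assert(t->table == NULL);
    t->table = t->migtable;   /* adopt the loaded entries */
    t->migtable = NULL;
    return 0;
}
```

In this model the "migtable is NULL" branch can only be reached for an
enabled table when the stream is inconsistent, which is why dropping the
check later as a cleanup does not change the behaviour for valid streams.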

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH qemu v7 13/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2015-06-19  1:45       ` David Gibson
@ 2015-06-19  6:49         ` Markus Armbruster
  2015-06-22  2:00           ` David Gibson
  0 siblings, 1 reply; 56+ messages in thread
From: Markus Armbruster @ 2015-06-19  6:49 UTC (permalink / raw)
  To: David Gibson
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, qemu-devel,
	Alexander Graf

David Gibson <david@gibson.dropbear.id.au> writes:

> On Thu, Jun 18, 2015 at 09:35:44PM +1000, Alexey Kardashevskiy wrote:
>> On 05/05/2015 10:49 PM, David Gibson wrote:
>> >On Sat, Apr 25, 2015 at 10:24:43PM +1000, Alexey Kardashevskiy wrote:
>> >>This adds support for Dynamic DMA Windows (DDW) option defined by
>> >>the SPAPR specification which allows to have additional DMA window(s)
>> >>
>> >>This implements DDW for emulated and VFIO devices. As all TCE root regions
>> >>are mapped at 0 and 64bit long (and actual tables are child regions),
>> >>this replaces memory_region_add_subregion() with _overlap() to make
>> >>QEMU memory API happy.
>> >>
>> >>This reserves RTAS token numbers for DDW calls.
>> >>
>> >>This implements helpers to interact with VFIO kernel interface.
>> >>
>> >>This changes the TCE table migration descriptor to support dynamic
>> >>tables as from now on, PHB will create as many stub TCE table objects
>> >>as PHB can possibly support but not all of them might be initialized at
>> >>the time of migration because DDW might or might not be requested by
>> >>the guest.
>> >>
>> >>The "ddw" property is enabled by default on a PHB but for compatibility
>> >>the pseries-2.3 machine and older disable it.
>> >>
>> >>This implements DDW for VFIO. The host kernel support is required.
>> >>This adds a "levels" property to PHB to control the number of levels
>> >>in the actual TCE table allocated by the host kernel, 0 is the default
>> >>value to tell QEMU to calculate the correct value. Current hardware
>> >>supports up to 5 levels.
>> >>
>> >>The existing linux guests try creating one additional huge DMA window
>> >>with 64K or 16MB pages and map the entire guest RAM to. If succeeded,
>> >>the guest switches to dma_direct_ops and never calls TCE hypercalls
>> >>(H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
>> >>and not waste time on map/unmap later.
>> >>
>> >>This adds 4 RTAS handlers:
>> >>* ibm,query-pe-dma-window
>> >>* ibm,create-pe-dma-window
>> >>* ibm,remove-pe-dma-window
>> >>* ibm,reset-pe-dma-window
>> >>These are registered from type_init() callback.
>> >>
>> >>These RTAS handlers are implemented in a separate file to avoid polluting
>> >>spapr_iommu.c with PCI.
>> >>
>> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> >
>> >Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>> 
>> I saw this and decided there are no more comments but I was wrong :)
>
> Right.  Note that if I add a Reviewed-by but also make comments, then
> those comments are seeking clarification and maybe suggesting later
> cleanups, but I think the problems are small enough that the patch is
> still ready to go as it is.

You can help the recipient of your comments by putting your R-by behind
the last comment.

Wouldn't be necessary if people never left reams of quoted material
at the end of their replies, but that's a pipe dream :)

[...]


* Re: [Qemu-devel] [PATCH qemu v7 13/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2015-06-19  6:49         ` Markus Armbruster
@ 2015-06-22  2:00           ` David Gibson
  0 siblings, 0 replies; 56+ messages in thread
From: David Gibson @ 2015-06-22  2:00 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, qemu-devel,
	Alexander Graf


On Fri, Jun 19, 2015 at 08:49:00AM +0200, Markus Armbruster wrote:
> David Gibson <david@gibson.dropbear.id.au> writes:
> 
> > On Thu, Jun 18, 2015 at 09:35:44PM +1000, Alexey Kardashevskiy wrote:
> >> On 05/05/2015 10:49 PM, David Gibson wrote:
> >> >On Sat, Apr 25, 2015 at 10:24:43PM +1000, Alexey Kardashevskiy wrote:
> >> >>This adds support for Dynamic DMA Windows (DDW) option defined by
> >> >>the SPAPR specification which allows to have additional DMA window(s)
> >> >>
> >> >>This implements DDW for emulated and VFIO devices. As all TCE root regions
> >> >>are mapped at 0 and 64bit long (and actual tables are child regions),
> >> >>this replaces memory_region_add_subregion() with _overlap() to make
> >> >>QEMU memory API happy.
> >> >>
> >> >>This reserves RTAS token numbers for DDW calls.
> >> >>
> >> >>This implements helpers to interact with VFIO kernel interface.
> >> >>
> >> >>This changes the TCE table migration descriptor to support dynamic
> >> >>tables as from now on, PHB will create as many stub TCE table objects
> >> >>as PHB can possibly support but not all of them might be initialized at
> >> >>the time of migration because DDW might or might not be requested by
> >> >>the guest.
> >> >>
> >> >>The "ddw" property is enabled by default on a PHB but for compatibility
> >> >>the pseries-2.3 machine and older disable it.
> >> >>
> >> >>This implements DDW for VFIO. The host kernel support is required.
> >> >>This adds a "levels" property to PHB to control the number of levels
> >> >>in the actual TCE table allocated by the host kernel, 0 is the default
> >> >>value to tell QEMU to calculate the correct value. Current hardware
> >> >>supports up to 5 levels.
> >> >>
> >> >>The existing linux guests try creating one additional huge DMA window
> >> >>with 64K or 16MB pages and map the entire guest RAM to. If succeeded,
> >> >>the guest switches to dma_direct_ops and never calls TCE hypercalls
> >> >>(H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
> >> >>and not waste time on map/unmap later.
> >> >>
> >> >>This adds 4 RTAS handlers:
> >> >>* ibm,query-pe-dma-window
> >> >>* ibm,create-pe-dma-window
> >> >>* ibm,remove-pe-dma-window
> >> >>* ibm,reset-pe-dma-window
> >> >>These are registered from type_init() callback.
> >> >>
> >> >>These RTAS handlers are implemented in a separate file to avoid polluting
> >> >>spapr_iommu.c with PCI.
> >> >>
> >> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> >
> >> >Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> >> 
> >> I saw this and decided there are no more comments but I was wrong :)
> >
> > Right.  Note that if I add a Reviewed-by but also make comments, then
> > those comments are seeking clarification and maybe suggesting later
> > cleanups, but I think the problems are small enough that the patch is
> > still ready to go as it is.
> 
> You can help the recipient of your comments by putting your R-by behind
> the last comment.

Noted for future reference.

> Wouldn't be necessary if people never left reams of quoted material
> at the end of their replies, but that's a pipe dream :)

I do usually try to trim quoted material - looks like I forgot this
time though.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-05-27  7:05                                           ` Paolo Bonzini
@ 2015-07-04  1:12                                             ` Alexey Kardashevskiy
  2015-07-06  0:52                                               ` Alexey Kardashevskiy
  2015-07-06 11:16                                               ` Paolo Bonzini
  0 siblings, 2 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-04  1:12 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Michael Roth, qemu-devel, Alexander Graf, Alex Williamson,
	qemu-ppc, David Gibson

On 05/27/2015 05:05 PM, Paolo Bonzini wrote:
>
>
> On 27/05/2015 01:55, Alexey Kardashevskiy wrote:
>> One step back :) Whole dance is what here? There are:
>> 1) del+set_size(0)
>> 2) set_size(not zero)+add
>
> Then no need for begin/commit. :)

I got a new problem here - set_size(0) + set_size(not 0) do not invoke
region_del/region_add, which does not seem right. As a result,
vfio_listener_region_del() never gets called and the IOMMU MR notifier
stays in the container->giommu_list. What is the correct solution to this?

The (most likely incorrect :) ) solution is in
[PATCH qemu v9 13/13] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows
(DDW)
where I explicitly do memory_region_del_subregion() +
memory_region_add_subregion(), which seems to do the right job.

-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-07-04  1:12                                             ` Alexey Kardashevskiy
@ 2015-07-06  0:52                                               ` Alexey Kardashevskiy
  2015-07-06 11:16                                               ` Paolo Bonzini
  1 sibling, 0 replies; 56+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-06  0:52 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Michael Roth, qemu-devel, Alexander Graf, Alex Williamson,
	qemu-ppc, David Gibson

On 07/04/2015 11:12 AM, Alexey Kardashevskiy wrote:
> On 05/27/2015 05:05 PM, Paolo Bonzini wrote:
>>
>>
>> On 27/05/2015 01:55, Alexey Kardashevskiy wrote:
>>> One step back :) Whole dance is what here? There are:
>>> 1) del+set_size(0)
>>> 2) set_size(not zero)+add
>>
>> Then no need for begin/commit. :)
>
> I got a new problem here - set_size(0) + set_size(not 0) do not invoke
> region_del/region_add which does not seem right. As a result,
> vfio_listener_region_del() never gets called and the IOMMU MR notifier
> stays in the container->giommu_list. What is the correct solution to this?
>
> Incorrect (most likely :) ) solution is in
> [PATCH qemu v9 13/13] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows
> (DDW)
> where I explicitly do memory_region_del_subregion() +
> memory_region_add_subregion() which seems to do the right job.
>

Never mind, "[PATCH qemu] vfio: Unregister IOMMU notifiers when container 
is destroyed" seems to solve the problem. Thanks for listening :)
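
The idea behind that fix can be pictured with a small toy model: if the
region_del callback is never delivered, the notifier stays on the
container's list, so the teardown path walks the list and releases
whatever is left. This is only a sketch under assumptions. ToyContainer,
ToyNotifier and the toy_*() functions are hypothetical names, not the
VFIO code, and the model says nothing about why region_del was not
invoked in the first place.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not the real VFIO container or notifier types. */
typedef struct ToyNotifier {
    int region_id;
    struct ToyNotifier *next;
} ToyNotifier;

typedef struct ToyContainer {
    ToyNotifier *giommu_list;   /* notifiers registered by region_add */
} ToyContainer;

/* region_add path: remember a notifier for the region. Returns 0 or -1. */
int toy_region_add(ToyContainer *c, int region_id)
{
    ToyNotifier *n = malloc(sizeof(*n));
    if (!n) {
        return -1;
    }
    n->region_id = region_id;
    n->next = c->giommu_list;
    c->giommu_list = n;
    return 0;
}

/* region_del path: drop the notifier for one region, if present. */
void toy_region_del(ToyContainer *c, int region_id)
{
    ToyNotifier **pp = &c->giommu_list;
    while (*pp) {
        if ((*pp)->region_id == region_id) {
            ToyNotifier *dead = *pp;
            *pp = dead->next;
            free(dead);
            return;
        }
        pp = &(*pp)->next;
    }
}

/* Teardown: release notifiers region_del never saw. Returns how many. */
int toy_container_destroy(ToyContainer *c)
{
    int released = 0;
    while (c->giommu_list) {
        ToyNotifier *dead = c->giommu_list;
        c->giommu_list = dead->next;
        free(dead);
        released++;
    }
    assert(c->giommu_list == NULL);
    return released;
}
```

With two regions added and only one removed through toy_region_del(),
toy_container_destroy() releases the remaining entry, so nothing stale
outlives the container.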



-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table
  2015-07-04  1:12                                             ` Alexey Kardashevskiy
  2015-07-06  0:52                                               ` Alexey Kardashevskiy
@ 2015-07-06 11:16                                               ` Paolo Bonzini
  1 sibling, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-07-06 11:16 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Roth, qemu-devel, Alexander Graf, Alex Williamson,
	qemu-ppc, David Gibson



On 04/07/2015 03:12, Alexey Kardashevskiy wrote:
>>
>>> One step back :) Whole dance is what here? There are:
>>> 1) del+set_size(0)
>>> 2) set_size(not zero)+add
>>
>> Then no need for begin/commit. :)
> 
> I got a new problem here - set_size(0) + set_size(not 0) do not invoke
> region_del/region_add which does not seem right. As a result,
> vfio_listener_region_del() never gets called and the IOMMU MR notifier
> stays in the container->giommu_list. What is the correct solution to this?
> 
> Incorrect (most likely :) ) solution is in
> [PATCH qemu v9 13/13] spapr_pci/spapr_pci_vfio: Support Dynamic DMA
> Windows (DDW)
> where I explicitly do memory_region_del_subregion() +
> memory_region_add_subregion() which seems to do the right job.

I would like a comment there explaining the problem (i.e. what you would
have done, what you expected and what were the actual results: it's not
even that clear from this email), but it doesn't seem like an awful
thing to do.

Paolo


Thread overview: 56+ messages
2015-04-25 12:24 [Qemu-devel] [PATCH qemu v7 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 01/14] spapr_pci: Finish making find_phb()/find_dev() public Alexey Kardashevskiy
2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 02/14] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 03/14] vfio: spapr: Move SPAPR-related code to a separate file Alexey Kardashevskiy
2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 04/14] spapr_pci_vfio: Enable multiple groups per container Alexey Kardashevskiy
2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 05/14] spapr_pci: Convert finish_realize() to dma_capabilities_update()+dma_init_window() Alexey Kardashevskiy
2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 06/14] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
2015-05-05 12:28   ` David Gibson
2015-05-25 15:05   ` Alexey Kardashevskiy
2015-05-26  2:46     ` David Gibson
2015-05-26  8:58       ` Paolo Bonzini
2015-05-26  9:01         ` Alexander Graf
2015-05-26  9:16           ` Paolo Bonzini
2015-05-26 10:15         ` Alexey Kardashevskiy
2015-05-26 10:16           ` Paolo Bonzini
2015-05-26 12:33             ` Alexey Kardashevskiy
2015-05-26 12:50               ` Paolo Bonzini
2015-05-26 13:28                 ` Alexey Kardashevskiy
2015-05-26 13:31                   ` Paolo Bonzini
2015-05-26 13:42                     ` Alexey Kardashevskiy
2015-05-26 13:48                       ` Paolo Bonzini
2015-05-26 14:00                         ` Alexey Kardashevskiy
2015-05-26 14:03                           ` Paolo Bonzini
2015-05-26 14:17                             ` Alexey Kardashevskiy
2015-05-26 14:24                               ` Paolo Bonzini
2015-05-26 14:55                                 ` Michael Roth
2015-05-26 14:58                                   ` Paolo Bonzini
2015-05-26 15:49                                     ` Alexey Kardashevskiy
2015-05-26 15:51                                       ` Paolo Bonzini
2015-05-26 23:55                                         ` Alexey Kardashevskiy
2015-05-27  7:05                                           ` Paolo Bonzini
2015-07-04  1:12                                             ` Alexey Kardashevskiy
2015-07-06  0:52                                               ` Alexey Kardashevskiy
2015-07-06 11:16                                               ` Paolo Bonzini
2015-05-26 15:00                                 ` Alexey Kardashevskiy
2015-05-26 15:08                                   ` Paolo Bonzini
2015-05-26 15:49                                     ` Alexey Kardashevskiy
2015-05-26 14:36                               ` Michael Roth
2015-05-27  2:54         ` David Gibson
2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 07/14] spapr_iommu: Add root memory region Alexey Kardashevskiy
2015-05-05 12:31   ` David Gibson
2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 08/14] spapr_pci: Do complete reset of DMA config when resetting PHB Alexey Kardashevskiy
2015-05-05 12:34   ` David Gibson
2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 09/14] spapr_vfio_pci: Remove redundant spapr-pci-vfio-host-bridge Alexey Kardashevskiy
2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 10/14] linux headers update for DDW on SPAPR Alexey Kardashevskiy
2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 11/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 12/14] spapr: Add pseries-2.4 machine Alexey Kardashevskiy
2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 13/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
2015-05-05 12:49   ` David Gibson
2015-06-18 11:35     ` Alexey Kardashevskiy
2015-06-19  1:45       ` David Gibson
2015-06-19  6:49         ` Markus Armbruster
2015-06-22  2:00           ` David Gibson
2015-04-25 12:24 ` [Qemu-devel] [PATCH qemu v7 14/14] vfio: Enable DDW ioctls to VFIO IOMMU driver Alexey Kardashevskiy
2015-05-05 12:50   ` David Gibson
2015-05-05  9:30 ` [Qemu-devel] [PATCH qemu v7 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
