* [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW)
@ 2016-05-04  6:52 Alexey Kardashevskiy
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 01/19] vfio: Delay DMA address space listener release Alexey Kardashevskiy
                   ` (19 more replies)
  0 siblings, 20 replies; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-04  6:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson, Paolo Bonzini

Each Partitionable Endpoint (IOMMU group) has an address range on the PCI bus
where devices are allowed to do DMA. These ranges are called DMA windows.
By default, there is a single 1GB or 2GB DMA window mapped at zero
on the PCI bus.

PAPR defines a DDW RTAS API which allows pseries guests
to query the hypervisor about DDW support and capabilities (page size mask
for now). A pseries guest may use this RTAS API to request DMA windows
in addition to the default one.
Existing pseries Linux guests request an additional window as big as
the guest RAM and map the entire window, which effectively creates
a direct mapping of guest memory onto the PCI bus.

This patchset reworks PPC64 IOMMU code and adds necessary structures
to support big windows on pseries.

This patchset is based on the latest upstream.

This includes "vmstate: Define VARRAY with VMS_ALLOC" as it has been accepted
but has not yet been merged upstream.

Please comment. Thanks!


Paolo, I cc:ed you on this because of 02/19 and 03/19; it would be great to
get your opinion, as the rest of the series relies on them to do
vfio-pci hot _un_plug properly. Thanks!


Alexey Kardashevskiy (19):
  vfio: Delay DMA address space listener release
  memory: Call region_del() callbacks on memory listener unregistering
  memory: Fix IOMMU replay base address
  vmstate: Define VARRAY with VMS_ALLOC
  vfio: Check that IOMMU MR translates to system address space
  spapr_pci: Use correct DMA LIOBN when composing the device tree
  spapr_iommu: Move table allocation to helpers
  spapr_iommu: Introduce "enabled" state for TCE table
  spapr_iommu: Finish renaming vfio_accel to need_vfio
  spapr_iommu: Migrate full state
  spapr_iommu: Add root memory region
  spapr_pci: Reset DMA config on PHB reset
  memory: Add reporting of supported page sizes
  vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2)
  spapr_pci: Add and export DMA resetting helper
  vfio: Add host side DMA window capabilities
  spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being
    used by VFIO
  vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)
  spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)

 hw/ppc/Makefile.objs          |   1 +
 hw/ppc/spapr.c                |   5 +
 hw/ppc/spapr_iommu.c          | 228 ++++++++++++++++++++++++------
 hw/ppc/spapr_pci.c            |  96 +++++++++----
 hw/ppc/spapr_rtas_ddw.c       | 292 ++++++++++++++++++++++++++++++++++++++
 hw/ppc/spapr_vio.c            |   8 +-
 hw/vfio/Makefile.objs         |   1 +
 hw/vfio/common.c              | 319 +++++++++++++++++++++++++++++++++++-------
 hw/vfio/prereg.c              | 137 ++++++++++++++++++
 include/exec/memory.h         |  22 ++-
 include/hw/pci-host/spapr.h   |  10 +-
 include/hw/ppc/spapr.h        |  31 +++-
 include/hw/vfio/vfio-common.h |  21 ++-
 include/migration/vmstate.h   |  10 ++
 memory.c                      |  64 ++++++++-
 target-ppc/kvm_ppc.h          |   2 +-
 trace-events                  |  12 +-
 17 files changed, 1120 insertions(+), 139 deletions(-)
 create mode 100644 hw/ppc/spapr_rtas_ddw.c
 create mode 100644 hw/vfio/prereg.c

-- 
2.5.0.rc3

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [Qemu-devel] [PATCH qemu v16 01/19] vfio: Delay DMA address space listener release
  2016-05-04  6:52 [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
@ 2016-05-04  6:52 ` Alexey Kardashevskiy
  2016-05-05 22:39   ` Alex Williamson
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 02/19] memory: Call region_del() callbacks on memory listener unregistering Alexey Kardashevskiy
                   ` (18 subsequent siblings)
  19 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-04  6:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson, Paolo Bonzini

This postpones VFIO container deinitialization so that the region_del()
callbacks (called via vfio_listener_release()) can do proper cleanup
while the group is still attached to the container.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/vfio/common.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index fe5ec6a..0b40262 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -921,23 +921,31 @@ static void vfio_disconnect_container(VFIOGroup *group)
 {
     VFIOContainer *container = group->container;
 
-    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
-        error_report("vfio: error disconnecting group %d from container",
-                     group->groupid);
-    }
-
     QLIST_REMOVE(group, container_next);
+
+    if (QLIST_EMPTY(&container->group_list)) {
+        VFIOGuestIOMMU *giommu;
+
+        vfio_listener_release(container);
+
+        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
+            memory_region_unregister_iommu_notifier(&giommu->n);
+        }
+    }
+
     group->container = NULL;
+    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
+        error_report("vfio: error disconnecting group %d from container",
+                     group->groupid);
+    }
 
     if (QLIST_EMPTY(&container->group_list)) {
         VFIOAddressSpace *space = container->space;
         VFIOGuestIOMMU *giommu, *tmp;
 
-        vfio_listener_release(container);
         QLIST_REMOVE(container, next);
 
         QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
-            memory_region_unregister_iommu_notifier(&giommu->n);
             QLIST_REMOVE(giommu, giommu_next);
             g_free(giommu);
         }
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v16 02/19] memory: Call region_del() callbacks on memory listener unregistering
  2016-05-04  6:52 [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 01/19] vfio: Delay DMA address space listener release Alexey Kardashevskiy
@ 2016-05-04  6:52 ` Alexey Kardashevskiy
  2016-05-05 22:45   ` Alex Williamson
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 03/19] memory: Fix IOMMU replay base address Alexey Kardashevskiy
                   ` (17 subsequent siblings)
  19 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-04  6:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson, Paolo Bonzini

When a new memory listener is registered, listener_add_address_space()
is called, which in turn calls the region_add() callbacks of memory regions.
However, when a memory listener is unregistered, it is simply removed from
the listener chain and no region_del() callbacks are called.

This adds listener_del_address_space() and uses it in
memory_listener_unregister(). listener_add_address_space() was used as
a template with the following changes:
s/log_global_start/log_global_stop/
s/log_start/log_stop/
s/region_add/region_del/

This will allow the following patches to add/remove DMA windows
dynamically from VFIO's PCI address space's region_add()/region_del().

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 memory.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/memory.c b/memory.c
index f76f85d..f762a34 100644
--- a/memory.c
+++ b/memory.c
@@ -2185,6 +2185,49 @@ static void listener_add_address_space(MemoryListener *listener,
     flatview_unref(view);
 }
 
+static void listener_del_address_space(MemoryListener *listener,
+                                       AddressSpace *as)
+{
+    FlatView *view;
+    FlatRange *fr;
+
+    if (listener->address_space_filter
+        && listener->address_space_filter != as) {
+        return;
+    }
+
+    if (listener->begin) {
+        listener->begin(listener);
+    }
+    if (global_dirty_log) {
+        if (listener->log_global_stop) {
+            listener->log_global_stop(listener);
+        }
+    }
+
+    view = address_space_get_flatview(as);
+    FOR_EACH_FLAT_RANGE(fr, view) {
+        MemoryRegionSection section = {
+            .mr = fr->mr,
+            .address_space = as,
+            .offset_within_region = fr->offset_in_region,
+            .size = fr->addr.size,
+            .offset_within_address_space = int128_get64(fr->addr.start),
+            .readonly = fr->readonly,
+        };
+        if (fr->dirty_log_mask && listener->log_stop) {
+            listener->log_stop(listener, &section, 0, fr->dirty_log_mask);
+        }
+        if (listener->region_del) {
+            listener->region_del(listener, &section);
+        }
+    }
+    if (listener->commit) {
+        listener->commit(listener);
+    }
+    flatview_unref(view);
+}
+
 void memory_listener_register(MemoryListener *listener, AddressSpace *filter)
 {
     MemoryListener *other = NULL;
@@ -2211,6 +2254,11 @@ void memory_listener_register(MemoryListener *listener, AddressSpace *filter)
 
 void memory_listener_unregister(MemoryListener *listener)
 {
+    AddressSpace *as;
+
+    QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) {
+        listener_del_address_space(listener, as);
+    }
     QTAILQ_REMOVE(&memory_listeners, listener, link);
 }
 
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v16 03/19] memory: Fix IOMMU replay base address
  2016-05-04  6:52 [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 01/19] vfio: Delay DMA address space listener release Alexey Kardashevskiy
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 02/19] memory: Call region_del() callbacks on memory listener unregistering Alexey Kardashevskiy
@ 2016-05-04  6:52 ` Alexey Kardashevskiy
  2016-05-26  1:50   ` David Gibson
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 04/19] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
                   ` (16 subsequent siblings)
  19 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-04  6:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson, Paolo Bonzini

Since a788f227 "memory: Allow replay of IOMMU mapping notifications",
all existing IOMMU mappings are replayed when a new VFIO listener is
added. However, the base address of an IOMMU memory region (IOMMU MR)
is ignored. This is harmless for the existing user (pseries) whose
default 32-bit DMA window starts at 0, but it breaks as soon as there
is another DMA window.

This stores the IOMMU's offset_within_address_space and adjusts
the IOVA before calling vfio_dma_map/vfio_dma_unmap.

As the IOMMU notifier expects an IOVA offset rather than an absolute
address, this also adjusts the IOVA in the sPAPR H_PUT_TCE handler before
calling the notifier(s).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v15:
* accounted section->offset_within_region
* s/giommu->offset_within_address_space/giommu->iommu_offset/
---
 hw/ppc/spapr_iommu.c          |  2 +-
 hw/vfio/common.c              | 14 ++++++++------
 include/hw/vfio/vfio-common.h |  1 +
 3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 7dd4588..277f289 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -277,7 +277,7 @@ static target_ulong put_tce_emu(sPAPRTCETable *tcet, target_ulong ioba,
     tcet->table[index] = tce;
 
     entry.target_as = &address_space_memory,
-    entry.iova = ioba & page_mask;
+    entry.iova = (ioba - tcet->bus_offset) & page_mask;
     entry.translated_addr = tce & page_mask;
     entry.addr_mask = ~page_mask;
     entry.perm = spapr_tce_iommu_access_flags(tce);
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0b40262..f32cc49 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -257,14 +257,14 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
     VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
     VFIOContainer *container = giommu->container;
     IOMMUTLBEntry *iotlb = data;
+    hwaddr iova = iotlb->iova + giommu->iommu_offset;
     MemoryRegion *mr;
     hwaddr xlat;
     hwaddr len = iotlb->addr_mask + 1;
     void *vaddr;
     int ret;
 
-    trace_vfio_iommu_map_notify(iotlb->iova,
-                                iotlb->iova + iotlb->addr_mask);
+    trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
 
     /*
      * The IOMMU TLB entry we have just covers translation through
@@ -291,21 +291,21 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
 
     if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
         vaddr = memory_region_get_ram_ptr(mr) + xlat;
-        ret = vfio_dma_map(container, iotlb->iova,
+        ret = vfio_dma_map(container, iova,
                            iotlb->addr_mask + 1, vaddr,
                            !(iotlb->perm & IOMMU_WO) || mr->readonly);
         if (ret) {
             error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx", %p) = %d (%m)",
-                         container, iotlb->iova,
+                         container, iova,
                          iotlb->addr_mask + 1, vaddr, ret);
         }
     } else {
-        ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
+        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
         if (ret) {
             error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx") = %d (%m)",
-                         container, iotlb->iova,
+                         container, iova,
                          iotlb->addr_mask + 1, ret);
         }
     }
@@ -377,6 +377,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
          */
         giommu = g_malloc0(sizeof(*giommu));
         giommu->iommu = section->mr;
+        giommu->iommu_offset = section->offset_within_address_space -
+            section->offset_within_region;
         giommu->container = container;
         giommu->n.notify = vfio_iommu_map_notify;
         QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index eb0e1b0..c9b6622 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -90,6 +90,7 @@ typedef struct VFIOContainer {
 typedef struct VFIOGuestIOMMU {
     VFIOContainer *container;
     MemoryRegion *iommu;
+    hwaddr iommu_offset;
     Notifier n;
     QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
 } VFIOGuestIOMMU;
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v16 04/19] vmstate: Define VARRAY with VMS_ALLOC
  2016-05-04  6:52 [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (2 preceding siblings ...)
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 03/19] memory: Fix IOMMU replay base address Alexey Kardashevskiy
@ 2016-05-04  6:52 ` Alexey Kardashevskiy
  2016-05-27  7:54   ` Alexey Kardashevskiy
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 05/19] vfio: Check that IOMMU MR translates to system address space Alexey Kardashevskiy
                   ` (15 subsequent siblings)
  19 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-04  6:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson, Paolo Bonzini

This allows dynamic allocation for migrating arrays.

The already existing VMSTATE_VARRAY_UINT32 requires the array to be
pre-allocated; however, there are cases when the size is not known in
advance and there is no real need to enforce it.

This defines another variant of VMSTATE_VARRAY_UINT32 with a VMS_ALLOC
flag which tells the receiving side to allocate memory for the array
before receiving the data.

The first user of it is the dynamic DMA window code, where the window's
existence and size are entirely dynamic.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Thomas Huth <thuth@redhat.com>
---
 include/migration/vmstate.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 84ee355..1622638 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -386,6 +386,16 @@ extern const VMStateInfo vmstate_info_bitmap;
     .offset     = vmstate_offset_pointer(_state, _field, _type),     \
 }
 
+#define VMSTATE_VARRAY_UINT32_ALLOC(_field, _state, _field_num, _version, _info, _type) {\
+    .name       = (stringify(_field)),                               \
+    .version_id = (_version),                                        \
+    .num_offset = vmstate_offset_value(_state, _field_num, uint32_t),\
+    .info       = &(_info),                                          \
+    .size       = sizeof(_type),                                     \
+    .flags      = VMS_VARRAY_UINT32|VMS_POINTER|VMS_ALLOC,           \
+    .offset     = vmstate_offset_pointer(_state, _field, _type),     \
+}
+
 #define VMSTATE_VARRAY_UINT16_UNSAFE(_field, _state, _field_num, _version, _info, _type) {\
     .name       = (stringify(_field)),                               \
     .version_id = (_version),                                        \
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v16 05/19] vfio: Check that IOMMU MR translates to system address space
  2016-05-04  6:52 [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (3 preceding siblings ...)
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 04/19] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
@ 2016-05-04  6:52 ` Alexey Kardashevskiy
  2016-05-26  1:51   ` David Gibson
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 06/19] spapr_pci: Use correct DMA LIOBN when composing the device tree Alexey Kardashevskiy
                   ` (14 subsequent siblings)
  19 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-04  6:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson, Paolo Bonzini

At the moment an IOMMU MR only translates to the system memory address
space. However, if some new code changes this, we will need a clear
indication of why things are not working, so add a check for it.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v15:
* added some spaces

v14:
* new to the series
---
 hw/vfio/common.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index f32cc49..6d23d0f 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -266,6 +266,12 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
 
     trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
 
+    if (iotlb->target_as != &address_space_memory) {
+        error_report("Wrong target AS \"%s\", only system memory is allowed",
+                     iotlb->target_as->name ? iotlb->target_as->name : "none");
+        return;
+    }
+
     /*
      * The IOMMU TLB entry we have just covers translation through
      * this IOMMU to its immediate target.  We need to translate
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v16 06/19] spapr_pci: Use correct DMA LIOBN when composing the device tree
  2016-05-04  6:52 [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (4 preceding siblings ...)
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 05/19] vfio: Check that IOMMU MR translates to system address space Alexey Kardashevskiy
@ 2016-05-04  6:52 ` Alexey Kardashevskiy
  2016-05-26  3:17   ` David Gibson
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 07/19] spapr_iommu: Move table allocation to helpers Alexey Kardashevskiy
                   ` (13 subsequent siblings)
  19 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-04  6:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson, Paolo Bonzini

The user could have picked an LIOBN via the CLI, but the device tree
rendering code would still use the value derived from the PHB index
(the default fallback when no LIOBN is set on the CLI).

This replaces SPAPR_PCI_LIOBN() with the actual DMA LIOBN value.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v16:
* new in the series
---
 hw/ppc/spapr_pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 573e635..742d127 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1815,7 +1815,7 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
                      sizeof(interrupt_map)));
 
-    tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(phb->index, 0));
+    tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
     if (!tcet) {
         return -1;
     }
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v16 07/19] spapr_iommu: Move table allocation to helpers
  2016-05-04  6:52 [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (5 preceding siblings ...)
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 06/19] spapr_pci: Use correct DMA LIOBN when composing the device tree Alexey Kardashevskiy
@ 2016-05-04  6:52 ` Alexey Kardashevskiy
  2016-05-26  3:32   ` David Gibson
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 08/19] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
                   ` (12 subsequent siblings)
  19 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-04  6:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson, Paolo Bonzini

At the moment the presence of vfio-pci devices on a bus affects the way
the guest view of the table is allocated. If there is no vfio-pci device
on a PHB and the host kernel supports KVM acceleration of H_PUT_TCE, the
table is allocated in KVM. However, if there is a vfio-pci device and we
do not yet have KVM acceleration for it, the table has to be allocated
in userspace. The table is currently allocated once at boot time, but
subsequent patches will reallocate it.

This moves kvmppc_create_spapr_tce/g_malloc0 and their counterparts
to helpers.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr_iommu.c | 58 +++++++++++++++++++++++++++++++++++-----------------
 trace-events         |  2 +-
 2 files changed, 40 insertions(+), 20 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 277f289..8132f64 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -75,6 +75,37 @@ static IOMMUAccessFlags spapr_tce_iommu_access_flags(uint64_t tce)
     }
 }
 
+static uint64_t *spapr_tce_alloc_table(uint32_t liobn,
+                                       uint32_t page_shift,
+                                       uint32_t nb_table,
+                                       int *fd,
+                                       bool need_vfio)
+{
+    uint64_t *table = NULL;
+    uint64_t window_size = (uint64_t)nb_table << page_shift;
+
+    if (kvm_enabled() && !(window_size >> 32)) {
+        table = kvmppc_create_spapr_tce(liobn, window_size, fd, need_vfio);
+    }
+
+    if (!table) {
+        *fd = -1;
+        table = g_malloc0(nb_table * sizeof(uint64_t));
+    }
+
+    trace_spapr_iommu_new_table(liobn, table, *fd);
+
+    return table;
+}
+
+static void spapr_tce_free_table(uint64_t *table, int fd, uint32_t nb_table)
+{
+    if (!kvm_enabled() ||
+        (kvmppc_remove_spapr_tce(table, fd, nb_table) != 0)) {
+        g_free(table);
+    }
+}
+
 /* Called from RCU critical section */
 static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
                                                bool is_write)
@@ -141,21 +172,13 @@ static MemoryRegionIOMMUOps spapr_iommu_ops = {
 static int spapr_tce_table_realize(DeviceState *dev)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
-    uint64_t window_size = (uint64_t)tcet->nb_table << tcet->page_shift;
 
-    if (kvm_enabled() && !(window_size >> 32)) {
-        tcet->table = kvmppc_create_spapr_tce(tcet->liobn,
-                                              window_size,
-                                              &tcet->fd,
-                                              tcet->need_vfio);
-    }
-
-    if (!tcet->table) {
-        size_t table_size = tcet->nb_table * sizeof(uint64_t);
-        tcet->table = g_malloc0(table_size);
-    }
-
-    trace_spapr_iommu_new_table(tcet->liobn, tcet, tcet->table, tcet->fd);
+    tcet->fd = -1;
+    tcet->table = spapr_tce_alloc_table(tcet->liobn,
+                                        tcet->page_shift,
+                                        tcet->nb_table,
+                                        &tcet->fd,
+                                        tcet->need_vfio);
 
     memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
                              "iommu-spapr",
@@ -241,11 +264,8 @@ static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
 
     QLIST_REMOVE(tcet, list);
 
-    if (!kvm_enabled() ||
-        (kvmppc_remove_spapr_tce(tcet->table, tcet->fd,
-                                 tcet->nb_table) != 0)) {
-        g_free(tcet->table);
-    }
+    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
+    tcet->fd = -1;
 }
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
diff --git a/trace-events b/trace-events
index 8350743..d96d344 100644
--- a/trace-events
+++ b/trace-events
@@ -1431,7 +1431,7 @@ spapr_iommu_pci_get(uint64_t liobn, uint64_t ioba, uint64_t ret, uint64_t tce) "
 spapr_iommu_pci_indirect(uint64_t liobn, uint64_t ioba, uint64_t tce, uint64_t iobaN, uint64_t tceN, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcelist=0x%"PRIx64" iobaN=0x%"PRIx64" tceN=0x%"PRIx64" ret=%"PRId64
 spapr_iommu_pci_stuff(uint64_t liobn, uint64_t ioba, uint64_t tce_value, uint64_t npages, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcevalue=0x%"PRIx64" npages=%"PRId64" ret=%"PRId64
 spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, unsigned pgsize) "liobn=%"PRIx64" 0x%"PRIx64" -> 0x%"PRIx64" perm=%u mask=%x"
-spapr_iommu_new_table(uint64_t liobn, void *tcet, void *table, int fd) "liobn=%"PRIx64" tcet=%p table=%p fd=%d"
+spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
 
 # hw/ppc/ppc.c
 ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v16 08/19] spapr_iommu: Introduce "enabled" state for TCE table
  2016-05-04  6:52 [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (6 preceding siblings ...)
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 07/19] spapr_iommu: Move table allocation to helpers Alexey Kardashevskiy
@ 2016-05-04  6:52 ` Alexey Kardashevskiy
  2016-05-26  3:39   ` David Gibson
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 09/19] spapr_iommu: Finish renaming vfio_accel to need_vfio Alexey Kardashevskiy
                   ` (11 subsequent siblings)
  19 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-04  6:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson, Paolo Bonzini

Currently TCE tables are created once at start and their sizes never
change. We are going to change that by introducing Dynamic DMA windows
support, where the DMA configuration may change during guest execution.

This changes spapr_tce_new_table() to create an empty zero-size IOMMU
memory region (IOMMU MR). Only the LIOBN is assigned at creation time.
It will still be called once, at owner object (VIO or PHB) creation.

This introduces an "enabled" state for TCE table objects with two
helper functions - spapr_tce_table_enable()/spapr_tce_table_disable().
- spapr_tce_table_enable() receives TCE table parameters, allocates
a guest view of the TCE table (in the user space or KVM) and
sets the correct size on the IOMMU MR.
- spapr_tce_table_disable() disposes the table and resets the IOMMU MR
size.

This changes the PHB reset handler to do the default DMA initialization
instead of spapr_phb_realize(). This makes no difference now, but later,
with more than one DMA window, we will have to remove them all and
recreate the default one on a system reset.

No visible change in behaviour is expected, except that the actual table
will be reallocated on every reset. We might optimize this later.

The other way to implement this would be to dynamically create/remove
the TCE table QOM objects, but that would make migration impossible,
as the migration code expects all QOM objects to exist on the receiving
side; so we have to have the TCE table objects created by the time
migration begins.

spapr_tce_table_do_enable() is separated from spapr_tce_table_enable()
as it will later be called at the sPAPRTCETable post-migration stage,
when all the properties are already set after migration; the same is
done for spapr_tce_table_disable().

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v15:
* made adjustments after removing spapr_phb_dma_window_enable()

v14:
* added spapr_tce_table_do_disable(), will make difference in following
patch with fully dynamic table migration

---
 hw/ppc/spapr_iommu.c   | 86 ++++++++++++++++++++++++++++++++++++--------------
 hw/ppc/spapr_pci.c     |  8 +++--
 hw/ppc/spapr_vio.c     |  8 ++---
 include/hw/ppc/spapr.h | 10 +++---
 4 files changed, 75 insertions(+), 37 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 8132f64..9bcd3f6 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -17,6 +17,7 @@
  * License along with this library; if not, see <http://www.gnu.org/licenses/>.
  */
 #include "qemu/osdep.h"
+#include "qemu/error-report.h"
 #include "hw/hw.h"
 #include "sysemu/kvm.h"
 #include "hw/qdev.h"
@@ -174,15 +175,9 @@ static int spapr_tce_table_realize(DeviceState *dev)
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
 
     tcet->fd = -1;
-    tcet->table = spapr_tce_alloc_table(tcet->liobn,
-                                        tcet->page_shift,
-                                        tcet->nb_table,
-                                        &tcet->fd,
-                                        tcet->need_vfio);
-
+    tcet->need_vfio = false;
     memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
-                             "iommu-spapr",
-                             (uint64_t)tcet->nb_table << tcet->page_shift);
+                             "iommu-spapr", 0);
 
     QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
 
@@ -224,14 +219,10 @@ void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio)
     tcet->table = newtable;
 }
 
-sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
-                                   uint64_t bus_offset,
-                                   uint32_t page_shift,
-                                   uint32_t nb_table,
-                                   bool need_vfio)
+sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn)
 {
     sPAPRTCETable *tcet;
-    char tmp[64];
+    char tmp[32];
 
     if (spapr_tce_find_by_liobn(liobn)) {
         fprintf(stderr, "Attempted to create TCE table with duplicate"
@@ -239,16 +230,8 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
         return NULL;
     }
 
-    if (!nb_table) {
-        return NULL;
-    }
-
     tcet = SPAPR_TCE_TABLE(object_new(TYPE_SPAPR_TCE_TABLE));
     tcet->liobn = liobn;
-    tcet->bus_offset = bus_offset;
-    tcet->page_shift = page_shift;
-    tcet->nb_table = nb_table;
-    tcet->need_vfio = need_vfio;
 
     snprintf(tmp, sizeof(tmp), "tce-table-%x", liobn);
     object_property_add_child(OBJECT(owner), tmp, OBJECT(tcet), NULL);
@@ -258,14 +241,69 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
     return tcet;
 }
 
+static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
+{
+    if (!tcet->nb_table) {
+        return;
+    }
+
+    tcet->table = spapr_tce_alloc_table(tcet->liobn,
+                                        tcet->page_shift,
+                                        tcet->nb_table,
+                                        &tcet->fd,
+                                        tcet->need_vfio);
+
+    memory_region_set_size(&tcet->iommu,
+                           (uint64_t)tcet->nb_table << tcet->page_shift);
+
+    tcet->enabled = true;
+}
+
+void spapr_tce_table_enable(sPAPRTCETable *tcet,
+                            uint32_t page_shift, uint64_t bus_offset,
+                            uint32_t nb_table)
+{
+    if (tcet->enabled) {
+        error_report("Warning: trying to enable already enabled TCE table");
+        return;
+    }
+
+    tcet->bus_offset = bus_offset;
+    tcet->page_shift = page_shift;
+    tcet->nb_table = nb_table;
+
+    spapr_tce_table_do_enable(tcet);
+}
+
+static void spapr_tce_table_do_disable(sPAPRTCETable *tcet)
+{
+    memory_region_set_size(&tcet->iommu, 0);
+
+    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
+    tcet->fd = -1;
+    tcet->table = NULL;
+    tcet->enabled = false;
+    tcet->bus_offset = 0;
+    tcet->page_shift = 0;
+    tcet->nb_table = 0;
+}
+
+static void spapr_tce_table_disable(sPAPRTCETable *tcet)
+{
+    if (!tcet->enabled) {
+        error_report("Warning: trying to disable already disabled TCE table");
+        return;
+    }
+    spapr_tce_table_do_disable(tcet);
+}
+
 static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
 
     QLIST_REMOVE(tcet, list);
 
-    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
-    tcet->fd = -1;
+    spapr_tce_table_disable(tcet);
 }
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 742d127..beeac06 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1464,8 +1464,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     }
 
     nb_table = sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT;
-    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn,
-                               0, SPAPR_TCE_PAGE_SHIFT, nb_table, false);
+    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
     if (!tcet) {
         error_setg(errp, "Unable to create TCE table for %s",
                    sphb->dtbusname);
@@ -1473,7 +1472,10 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     }
 
     /* Register default 32bit DMA window */
-    memory_region_add_subregion(&sphb->iommu_root, sphb->dma_win_addr,
+    spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
+                           nb_table);
+
+    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
                                 spapr_tce_get_iommu(tcet));
 
     sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
diff --git a/hw/ppc/spapr_vio.c b/hw/ppc/spapr_vio.c
index 8aa021f..a7d49a0 100644
--- a/hw/ppc/spapr_vio.c
+++ b/hw/ppc/spapr_vio.c
@@ -482,11 +482,9 @@ static void spapr_vio_busdev_realize(DeviceState *qdev, Error **errp)
         memory_region_add_subregion_overlap(&dev->mrroot, 0, &dev->mrbypass, 1);
         address_space_init(&dev->as, &dev->mrroot, qdev->id);
 
-        dev->tcet = spapr_tce_new_table(qdev, liobn,
-                                        0,
-                                        SPAPR_TCE_PAGE_SHIFT,
-                                        pc->rtce_window_size >>
-                                        SPAPR_TCE_PAGE_SHIFT, false);
+        dev->tcet = spapr_tce_new_table(qdev, liobn);
+        spapr_tce_table_enable(dev->tcet, SPAPR_TCE_PAGE_SHIFT, 0,
+                               pc->rtce_window_size >> SPAPR_TCE_PAGE_SHIFT);
         dev->tcet->vdev = dev;
         memory_region_add_subregion_overlap(&dev->mrroot, 0,
                                             spapr_tce_get_iommu(dev->tcet), 2);
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 815d5ee..0140810 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -534,6 +534,7 @@ typedef struct sPAPRTCETable sPAPRTCETable;
 
 struct sPAPRTCETable {
     DeviceState parent;
+    bool enabled;
     uint32_t liobn;
     uint32_t nb_table;
     uint64_t bus_offset;
@@ -561,11 +562,10 @@ void spapr_events_fdt_skel(void *fdt, uint32_t epow_irq);
 int spapr_h_cas_compose_response(sPAPRMachineState *sm,
                                  target_ulong addr, target_ulong size,
                                  bool cpu_update, bool memory_update);
-sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
-                                   uint64_t bus_offset,
-                                   uint32_t page_shift,
-                                   uint32_t nb_table,
-                                   bool need_vfio);
+sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn);
+void spapr_tce_table_enable(sPAPRTCETable *tcet,
+                            uint32_t page_shift, uint64_t bus_offset,
+                            uint32_t nb_table);
 void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio);
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet);
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [Qemu-devel] [PATCH qemu v16 09/19] spapr_iommu: Finish renaming vfio_accel to need_vfio
  2016-05-04  6:52 [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (7 preceding siblings ...)
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 08/19] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
@ 2016-05-04  6:52 ` Alexey Kardashevskiy
  2016-05-26  3:18   ` David Gibson
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 10/19] spapr_iommu: Migrate full state Alexey Kardashevskiy
                   ` (10 subsequent siblings)
  19 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-04  6:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson, Paolo Bonzini

6a81dd17 "spapr_iommu: Rename vfio_accel parameter" renamed vfio_accel
flag everywhere but one spot was missed.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 target-ppc/kvm_ppc.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target-ppc/kvm_ppc.h b/target-ppc/kvm_ppc.h
index fc79312..3b2090e 100644
--- a/target-ppc/kvm_ppc.h
+++ b/target-ppc/kvm_ppc.h
@@ -163,7 +163,7 @@ static inline bool kvmppc_spapr_use_multitce(void)
 
 static inline void *kvmppc_create_spapr_tce(uint32_t liobn,
                                             uint32_t window_size, int *fd,
-                                            bool vfio_accel)
+                                            bool need_vfio)
 {
     return NULL;
 }
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [Qemu-devel] [PATCH qemu v16 10/19] spapr_iommu: Migrate full state
  2016-05-04  6:52 [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (8 preceding siblings ...)
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 09/19] spapr_iommu: Finish renaming vfio_accel to need_vfio Alexey Kardashevskiy
@ 2016-05-04  6:52 ` Alexey Kardashevskiy
  2016-05-26  4:01   ` David Gibson
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 11/19] spapr_iommu: Add root memory region Alexey Kardashevskiy
                   ` (9 subsequent siblings)
  19 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-04  6:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson, Paolo Bonzini

The source guest could have reallocated the default TCE table and
migrated a bigger/smaller table. This adds reallocation in post_load()
if the default table size differs between the source and the destination.

This adds @bus_offset, @page_shift and @enabled to the migration stream.
These cannot change without dynamic DMA windows, so no change in
behaviour is expected for now.
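The resize-on-post_load path can be sketched stand-alone like this (a hypothetical model of the logic in the hunk below, not QEMU code: do_disable()/do_enable() are collapsed into plain free()/calloc(), and the mig_* fields stand for what arrived in the migration stream):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    uint32_t nb_table;      /* current (destination) table size */
    uint64_t *table;
    uint32_t mig_nb_table;  /* size received in the migration stream */
    uint64_t *mig_table;    /* entries received in the migration stream */
} TceTable;

void tce_post_load(TceTable *t)
{
    if (t->nb_table != t->mig_nb_table) {
        free(t->table);                      /* do_disable() analogue */
        t->nb_table = t->mig_nb_table;
        t->table = calloc(t->nb_table, sizeof(t->table[0]));
    }
    /* copy the migrated entries into the (possibly new) table */
    memcpy(t->table, t->mig_table, t->nb_table * sizeof(t->table[0]));
    free(t->mig_table);                      /* VMS_ALLOC'ed buffer */
    t->mig_table = NULL;
}
```

Staging the stream into mig_table/mig_nb_table first, rather than loading straight into table/nb_table, is what lets the destination compare sizes and reallocate before copying.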

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v15:
* squashed "migrate full state" into this
* added missing tcet->mig_nb_table initialization in spapr_tce_table_pre_save()
* instead of bumping the version, moved extra parameters to subsection

v14:
* new to the series
---
 hw/ppc/spapr_iommu.c   | 67 ++++++++++++++++++++++++++++++++++++++++++++++++--
 include/hw/ppc/spapr.h |  2 ++
 trace-events           |  2 ++
 3 files changed, 69 insertions(+), 2 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 9bcd3f6..52b1e0d 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -137,33 +137,96 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
     return ret;
 }
 
+static void spapr_tce_table_pre_save(void *opaque)
+{
+    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
+
+    tcet->mig_table = tcet->table;
+    tcet->mig_nb_table = tcet->nb_table;
+
+    trace_spapr_iommu_pre_save(tcet->liobn, tcet->mig_nb_table,
+                               tcet->bus_offset, tcet->page_shift);
+}
+
+static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
+static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
+
 static int spapr_tce_table_post_load(void *opaque, int version_id)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
+    uint32_t old_nb_table = tcet->nb_table;
 
     if (tcet->vdev) {
         spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
     }
 
+    if (tcet->enabled) {
+        if (tcet->nb_table != tcet->mig_nb_table) {
+            if (tcet->nb_table) {
+                spapr_tce_table_do_disable(tcet);
+            }
+            tcet->nb_table = tcet->mig_nb_table;
+            spapr_tce_table_do_enable(tcet);
+        }
+
+        memcpy(tcet->table, tcet->mig_table,
+               tcet->nb_table * sizeof(tcet->table[0]));
+
+        free(tcet->mig_table);
+        tcet->mig_table = NULL;
+    } else if (tcet->table) {
+        /* Destination guest has a default table but source does not -> free */
+        spapr_tce_table_do_disable(tcet);
+    }
+
+    trace_spapr_iommu_post_load(tcet->liobn, old_nb_table, tcet->nb_table,
+                                tcet->bus_offset, tcet->page_shift);
+
     return 0;
 }
 
+static bool spapr_tce_table_ex_needed(void *opaque)
+{
+    sPAPRTCETable *tcet = opaque;
+
+    return tcet->bus_offset || tcet->page_shift != 0xC;
+}
+
+static const VMStateDescription vmstate_spapr_tce_table_ex = {
+    .name = "spapr_iommu_ex",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .needed = spapr_tce_table_ex_needed,
+    .fields = (VMStateField[]) {
+        VMSTATE_BOOL(enabled, sPAPRTCETable),
+        VMSTATE_UINT64(bus_offset, sPAPRTCETable),
+        VMSTATE_UINT32(page_shift, sPAPRTCETable),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
 static const VMStateDescription vmstate_spapr_tce_table = {
     .name = "spapr_iommu",
     .version_id = 2,
     .minimum_version_id = 2,
+    .pre_save = spapr_tce_table_pre_save,
     .post_load = spapr_tce_table_post_load,
     .fields      = (VMStateField []) {
         /* Sanity check */
         VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
-        VMSTATE_UINT32_EQUAL(nb_table, sPAPRTCETable),
 
         /* IOMMU state */
+        VMSTATE_UINT32(mig_nb_table, sPAPRTCETable),
         VMSTATE_BOOL(bypass, sPAPRTCETable),
-        VMSTATE_VARRAY_UINT32(table, sPAPRTCETable, nb_table, 0, vmstate_info_uint64, uint64_t),
+        VMSTATE_VARRAY_UINT32_ALLOC(mig_table, sPAPRTCETable, mig_nb_table, 0,
+                                    vmstate_info_uint64, uint64_t),
 
         VMSTATE_END_OF_LIST()
     },
+    .subsections = (const VMStateDescription*[]) {
+        &vmstate_spapr_tce_table_ex,
+        NULL
+    }
 };
 
 static MemoryRegionIOMMUOps spapr_iommu_ops = {
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 0140810..d36dda2 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -540,6 +540,8 @@ struct sPAPRTCETable {
     uint64_t bus_offset;
     uint32_t page_shift;
     uint64_t *table;
+    uint32_t mig_nb_table;
+    uint64_t *mig_table;
     bool bypass;
     bool need_vfio;
     int fd;
diff --git a/trace-events b/trace-events
index d96d344..dd50005 100644
--- a/trace-events
+++ b/trace-events
@@ -1432,6 +1432,8 @@ spapr_iommu_pci_indirect(uint64_t liobn, uint64_t ioba, uint64_t tce, uint64_t i
 spapr_iommu_pci_stuff(uint64_t liobn, uint64_t ioba, uint64_t tce_value, uint64_t npages, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcevalue=0x%"PRIx64" npages=%"PRId64" ret=%"PRId64
 spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, unsigned pgsize) "liobn=%"PRIx64" 0x%"PRIx64" -> 0x%"PRIx64" perm=%u mask=%x"
 spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
+spapr_iommu_pre_save(uint64_t liobn, uint32_t nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
+spapr_iommu_post_load(uint64_t liobn, uint32_t pre_nb, uint32_t post_nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" => %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
 
 # hw/ppc/ppc.c
 ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [Qemu-devel] [PATCH qemu v16 11/19] spapr_iommu: Add root memory region
  2016-05-04  6:52 [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (9 preceding siblings ...)
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 10/19] spapr_iommu: Migrate full state Alexey Kardashevskiy
@ 2016-05-04  6:52 ` Alexey Kardashevskiy
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 12/19] spapr_pci: Reset DMA config on PHB reset Alexey Kardashevskiy
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-04  6:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson, Paolo Bonzini

We are going to have multiple DMA windows at different offsets on
a PCI bus. For the sake of migration, we will have as many TCE table
objects pre-created as there are windows supported.
So we need a way to map windows dynamically onto a PCI bus
when migration of a table is completed; at that stage, however, a TCE
table object does not have access to a PHB to ask it to map a DMA window
backed by the just-migrated TCE table.

This adds a "root" memory region (UINT64_MAX long) to the TCE object.
This new region is mapped on a PCI bus with overlapping enabled as
there will be one root MR per TCE table, each of them mapped at 0.
The actual IOMMU memory region is a subregion of the root region;
a TCE table enables/disables this subregion and maps it at
the specific offset inside the root MR, which is a 1:1 mapping of
a PCI address space.
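The resulting layout can be sketched as a simple address check (a hypothetical stand-alone model, not QEMU's MemoryRegion machinery: the root region covers the whole 64-bit PCI address space and the IOMMU window is the one subregion mapped at bus_offset):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t bus_offset;  /* where the window sits inside the root MR */
    uint64_t size;        /* nb_table << page_shift; 0 when disabled */
} Window;

/* Returns true and the window-relative offset when addr falls inside
 * the enabled window; a disabled window (size == 0) never matches. */
bool window_translate(const Window *w, uint64_t addr, uint64_t *off)
{
    if (w->size && addr >= w->bus_offset &&
        addr - w->bus_offset < w->size) {
        *off = addr - w->bus_offset;
        return true;
    }
    return false;
}
```

Because the root MR is 1:1 with the PCI address space, a window can later appear at any bus offset (e.g. a DDW window high in the address space) without the PHB having to re-map anything.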

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Thomas Huth <thuth@redhat.com>
---
 hw/ppc/spapr_iommu.c   | 13 ++++++++++---
 hw/ppc/spapr_pci.c     |  6 +++---
 include/hw/ppc/spapr.h |  2 +-
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 52b1e0d..740836f 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -236,11 +236,16 @@ static MemoryRegionIOMMUOps spapr_iommu_ops = {
 static int spapr_tce_table_realize(DeviceState *dev)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
+    Object *tcetobj = OBJECT(tcet);
+    char tmp[32];
 
     tcet->fd = -1;
     tcet->need_vfio = false;
-    memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
-                             "iommu-spapr", 0);
+    snprintf(tmp, sizeof(tmp), "tce-root-%x", tcet->liobn);
+    memory_region_init(&tcet->root, tcetobj, tmp, UINT64_MAX);
+
+    snprintf(tmp, sizeof(tmp), "tce-iommu-%x", tcet->liobn);
+    memory_region_init_iommu(&tcet->iommu, tcetobj, &spapr_iommu_ops, tmp, 0);
 
     QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
 
@@ -318,6 +323,7 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
 
     memory_region_set_size(&tcet->iommu,
                            (uint64_t)tcet->nb_table << tcet->page_shift);
+    memory_region_add_subregion(&tcet->root, tcet->bus_offset, &tcet->iommu);
 
     tcet->enabled = true;
 }
@@ -340,6 +346,7 @@ void spapr_tce_table_enable(sPAPRTCETable *tcet,
 
 static void spapr_tce_table_do_disable(sPAPRTCETable *tcet)
 {
+    memory_region_del_subregion(&tcet->root, &tcet->iommu);
     memory_region_set_size(&tcet->iommu, 0);
 
     spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
@@ -371,7 +378,7 @@ static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
 {
-    return &tcet->iommu;
+    return &tcet->root;
 }
 
 static void spapr_tce_reset(DeviceState *dev)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index beeac06..e1b196d 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1471,13 +1471,13 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         return;
     }
 
+    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
+                                        spapr_tce_get_iommu(tcet), 0);
+
     /* Register default 32bit DMA window */
     spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
                            nb_table);
 
-    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
-                                spapr_tce_get_iommu(tcet));
-
     sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
 }
 
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index d36dda2..2026c69 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -545,7 +545,7 @@ struct sPAPRTCETable {
     bool bypass;
     bool need_vfio;
     int fd;
-    MemoryRegion iommu;
+    MemoryRegion root, iommu;
     struct VIOsPAPRDevice *vdev; /* for @bypass migration compatibility only */
     QLIST_ENTRY(sPAPRTCETable) list;
 };
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [Qemu-devel] [PATCH qemu v16 12/19] spapr_pci: Reset DMA config on PHB reset
  2016-05-04  6:52 [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (10 preceding siblings ...)
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 11/19] spapr_iommu: Add root memory region Alexey Kardashevskiy
@ 2016-05-04  6:52 ` Alexey Kardashevskiy
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 13/19] memory: Add reporting of supported page sizes Alexey Kardashevskiy
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-04  6:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson, Paolo Bonzini

LoPAPR dictates that during a system reset all DMA windows must be removed
and the default DMA32 window must be created; this is what the patch does.

At the moment there is just one window supported so no change in
behaviour is expected.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr_iommu.c   |  2 +-
 hw/ppc/spapr_pci.c     | 17 +++++++++++------
 include/hw/ppc/spapr.h |  1 +
 3 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 740836f..5ce2f5e 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -358,7 +358,7 @@ static void spapr_tce_table_do_disable(sPAPRTCETable *tcet)
     tcet->nb_table = 0;
 }
 
-static void spapr_tce_table_disable(sPAPRTCETable *tcet)
+void spapr_tce_table_disable(sPAPRTCETable *tcet)
 {
     if (!tcet->enabled) {
         error_report("Warning: trying to disable already disabled TCE table");
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index e1b196d..aa9201b 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1311,7 +1311,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     PCIBus *bus;
     uint64_t msi_window_size = 4096;
     sPAPRTCETable *tcet;
-    uint32_t nb_table;
 
     if (sphb->index != (uint32_t)-1) {
         hwaddr windows_base;
@@ -1463,7 +1462,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         }
     }
 
-    nb_table = sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT;
     tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
     if (!tcet) {
         error_setg(errp, "Unable to create TCE table for %s",
@@ -1474,10 +1472,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
                                         spapr_tce_get_iommu(tcet), 0);
 
-    /* Register default 32bit DMA window */
-    spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
-                           nb_table);
-
     sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
 }
 
@@ -1494,6 +1488,17 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
 
 static void spapr_phb_reset(DeviceState *qdev)
 {
+    sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
+    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
+
+    if (tcet && tcet->enabled) {
+        spapr_tce_table_disable(tcet);
+    }
+
+    /* Register default 32bit DMA window */
+    spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
+                           sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT);
+
     /* Reset the IOMMU state */
     object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL);
 
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 2026c69..f0cfd58 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -568,6 +568,7 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn);
 void spapr_tce_table_enable(sPAPRTCETable *tcet,
                             uint32_t page_shift, uint64_t bus_offset,
                             uint32_t nb_table);
+void spapr_tce_table_disable(sPAPRTCETable *tcet);
 void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio);
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet);
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [Qemu-devel] [PATCH qemu v16 13/19] memory: Add reporting of supported page sizes
  2016-05-04  6:52 [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (11 preceding siblings ...)
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 12/19] spapr_pci: Reset DMA config on PHB reset Alexey Kardashevskiy
@ 2016-05-04  6:52 ` Alexey Kardashevskiy
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 14/19] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2) Alexey Kardashevskiy
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-04  6:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson, Paolo Bonzini

Every IOMMU has some granularity which MemoryRegionIOMMUOps::translate
uses when translating; however, this information is not available outside
the translate context for various checks.

This adds a get_page_sizes callback to MemoryRegionIOMMUOps and
a wrapper for it so IOMMU users (such as VFIO) can know the actual
page size(s) used by an IOMMU.

As an IOMMU MR represents a guest IOMMU, this uses TARGET_PAGE_SIZE
as a fallback.

This removes vfio_container_granularity() and uses new helper in
memory_region_iommu_replay() when replaying IOMMU mappings on added
IOMMU memory region.
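The fallback and the way the replay granularity is derived from the page-size mask (its smallest set bit) can be sketched as follows (a hypothetical stand-alone model of the hunks below; TARGET_PAGE_SIZE is stubbed as 4K, and the ops mask is passed as a plain integer instead of going through MemoryRegionIOMMUOps):

```c
#include <assert.h>
#include <stdint.h>

#define TARGET_PAGE_SIZE 4096ULL

/* ops_mask == 0 stands for "the IOMMU does not implement
 * get_page_sizes"; fall back to the guest target page size. */
uint64_t iommu_get_page_sizes(uint64_t ops_mask)
{
    return ops_mask ? ops_mask : TARGET_PAGE_SIZE;
}

/* Replay granularity = smallest supported page size in the mask,
 * i.e. the lowest set bit (ctz64 in QEMU terms). */
uint64_t replay_granularity(uint64_t ops_mask)
{
    return (uint64_t)1 << __builtin_ctzll(iommu_get_page_sizes(ops_mask));
}
```

For sPAPR the mask is a single bit, 1ULL << page_shift, so the replay walks the window in exactly the TCE page size.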

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v16:
* used memory_region_iommu_get_page_sizes() instead of
mr->iommu_ops->get_page_sizes() in memory_region_iommu_replay()

v15:
* s/qemu_real_host_page_size/TARGET_PAGE_SIZE/ in memory_region_iommu_get_page_sizes

v14:
* removed vfio_container_granularity(), changed memory_region_iommu_replay()

v4:
* s/1<<TARGET_PAGE_BITS/qemu_real_host_page_size/
---
 hw/ppc/spapr_iommu.c  |  8 ++++++++
 hw/vfio/common.c      |  6 ------
 include/exec/memory.h | 18 ++++++++++++++----
 memory.c              | 16 +++++++++++++---
 4 files changed, 35 insertions(+), 13 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 5ce2f5e..c945dba 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -148,6 +148,13 @@ static void spapr_tce_table_pre_save(void *opaque)
                                tcet->bus_offset, tcet->page_shift);
 }
 
+static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
+{
+    sPAPRTCETable *tcet = container_of(iommu, sPAPRTCETable, iommu);
+
+    return 1ULL << tcet->page_shift;
+}
+
 static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
 static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
 
@@ -231,6 +238,7 @@ static const VMStateDescription vmstate_spapr_tce_table = {
 
 static MemoryRegionIOMMUOps spapr_iommu_ops = {
     .translate = spapr_tce_translate_iommu,
+    .get_page_sizes = spapr_tce_get_page_sizes,
 };
 
 static int spapr_tce_table_realize(DeviceState *dev)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 6d23d0f..2050040 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -319,11 +319,6 @@ out:
     rcu_read_unlock();
 }
 
-static hwaddr vfio_container_granularity(VFIOContainer *container)
-{
-    return (hwaddr)1 << ctz64(container->iova_pgsizes);
-}
-
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
@@ -391,7 +386,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
 
         memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
         memory_region_iommu_replay(giommu->iommu, &giommu->n,
-                                   vfio_container_granularity(container),
                                    false);
 
         return;
diff --git a/include/exec/memory.h b/include/exec/memory.h
index e2a3e99..a3a1703 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -149,6 +149,8 @@ typedef struct MemoryRegionIOMMUOps MemoryRegionIOMMUOps;
 struct MemoryRegionIOMMUOps {
     /* Return a TLB entry that contains a given address. */
     IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
+    /* Returns supported page sizes */
+    uint64_t (*get_page_sizes)(MemoryRegion *iommu);
 };
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
@@ -572,6 +574,15 @@ static inline bool memory_region_is_iommu(MemoryRegion *mr)
 
 
 /**
+ * memory_region_iommu_get_page_sizes: get supported page sizes in an iommu
+ *
+ * Returns %bitmap of supported page sizes for an iommu.
+ *
+ * @mr: the memory region being queried
+ */
+uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr);
+
+/**
  * memory_region_notify_iommu: notify a change in an IOMMU translation entry.
  *
  * @mr: the memory region that was changed
@@ -595,16 +606,15 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n);
 
 /**
  * memory_region_iommu_replay: replay existing IOMMU translations to
- * a notifier
+ * a notifier with the minimum page granularity returned by
+ * mr->iommu_ops->get_page_sizes().
  *
  * @mr: the memory region to observe
  * @n: the notifier to which to replay iommu mappings
- * @granularity: Minimum page granularity to replay notifications for
  * @is_write: Whether to treat the replay as a translate "write"
  *     through the iommu
  */
-void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n,
-                                hwaddr granularity, bool is_write);
+void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write);
 
 /**
  * memory_region_unregister_iommu_notifier: unregister a notifier for
diff --git a/memory.c b/memory.c
index f762a34..e673c62 100644
--- a/memory.c
+++ b/memory.c
@@ -1513,12 +1513,22 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n)
     notifier_list_add(&mr->iommu_notify, n);
 }
 
-void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n,
-                                hwaddr granularity, bool is_write)
+uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr)
 {
-    hwaddr addr;
+    assert(memory_region_is_iommu(mr));
+    if (mr->iommu_ops && mr->iommu_ops->get_page_sizes) {
+        return mr->iommu_ops->get_page_sizes(mr);
+    }
+    return TARGET_PAGE_SIZE;
+}
+
+void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write)
+{
+    hwaddr addr, granularity;
     IOMMUTLBEntry iotlb;
 
+    granularity = (hwaddr)1 << ctz64(memory_region_iommu_get_page_sizes(mr));
+
     for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
         iotlb = mr->iommu_ops->translate(mr, addr, is_write);
         if (iotlb.perm != IOMMU_NONE) {
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [Qemu-devel] [PATCH qemu v16 14/19] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2)
  2016-05-04  6:52 [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (12 preceding siblings ...)
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 13/19] memory: Add reporting of supported page sizes Alexey Kardashevskiy
@ 2016-05-04  6:52 ` Alexey Kardashevskiy
  2016-05-13 22:25   ` Alex Williamson
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 15/19] spapr_pci: Add and export DMA resetting helper Alexey Kardashevskiy
                   ` (5 subsequent siblings)
  19 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-04  6:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson, Paolo Bonzini

This makes use of the new "memory registering" feature. The idea is
to give userspace the ability to notify the host kernel about pages
which are going to be used for DMA. With this information, the host
kernel can pin them all once per user process, do locked-pages
accounting (once) and not spend time doing that at runtime, with
possible failures which cannot be handled nicely in some cases.

This adds a prereg memory listener which listens on address_space_memory
and notifies a VFIO container about memory which needs to be
pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.

As there is no per-IOMMU-type release() callback anymore, this stores
the IOMMU type in the container so vfio_listener_release() can determine
if it needs to unregister @prereg_listener.

The feature is only enabled for SPAPR IOMMU v2. Matching host kernel
changes are required. Since v2 does not need/support VFIO_IOMMU_ENABLE,
this does not call it when v2 is detected and enabled.

This requires guest RAM blocks to be host-page-size aligned; however,
this is not a new restriction as KVM already requires memory slots to
be host-page-size aligned.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v16:
* switched to 64bit math everywhere as there is no chance to see
region_add on RAM blocks even remotely close to 1<<64 bytes.

v15:
* banned unaligned sections
* added a vfio_prereg_gpa_to_ua() helper

v14:
* s/free_container_exit/listener_release_exit/g
* added "if memory_region_is_iommu()" to vfio_prereg_listener_skipped_section
---
 hw/vfio/Makefile.objs         |   1 +
 hw/vfio/common.c              |  38 +++++++++---
 hw/vfio/prereg.c              | 137 ++++++++++++++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-common.h |   4 ++
 trace-events                  |   2 +
 5 files changed, 172 insertions(+), 10 deletions(-)
 create mode 100644 hw/vfio/prereg.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index ceddbb8..5800e0e 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
 obj-$(CONFIG_SOFTMMU) += platform.o
 obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
 obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
+obj-$(CONFIG_SOFTMMU) += prereg.o
 endif
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 2050040..496eb82 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -501,6 +501,9 @@ static const MemoryListener vfio_memory_listener = {
 static void vfio_listener_release(VFIOContainer *container)
 {
     memory_listener_unregister(&container->listener);
+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+        memory_listener_unregister(&container->prereg_listener);
+    }
 }
 
 int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
@@ -808,8 +811,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             goto free_container_exit;
         }
 
-        ret = ioctl(fd, VFIO_SET_IOMMU,
-                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
+        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
+        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
         if (ret) {
             error_report("vfio: failed to set iommu for container: %m");
             ret = -errno;
@@ -834,8 +837,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
         if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
             container->iova_pgsizes = info.iova_pgsizes;
         }
-    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
+    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
+               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
         struct vfio_iommu_spapr_tce_info info;
+        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
 
         ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
         if (ret) {
@@ -843,7 +848,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             ret = -errno;
             goto free_container_exit;
         }
-        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
+        container->iommu_type =
+            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
+        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
         if (ret) {
             error_report("vfio: failed to set iommu for container: %m");
             ret = -errno;
@@ -855,11 +862,22 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
          * when container fd is closed so we do not call it explicitly
          * in this file.
          */
-        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
-        if (ret) {
-            error_report("vfio: failed to enable container: %m");
-            ret = -errno;
-            goto free_container_exit;
+        if (!v2) {
+            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
+            if (ret) {
+                error_report("vfio: failed to enable container: %m");
+                ret = -errno;
+                goto free_container_exit;
+            }
+        } else {
+            container->prereg_listener = vfio_prereg_listener;
+
+            memory_listener_register(&container->prereg_listener,
+                                     &address_space_memory);
+            if (container->error) {
+                error_report("vfio: RAM memory listener initialization failed for container");
+                goto listener_release_exit;
+            }
         }
 
         /*
@@ -872,7 +890,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
         if (ret) {
             error_report("vfio: VFIO_IOMMU_SPAPR_TCE_GET_INFO failed: %m");
             ret = -errno;
-            goto free_container_exit;
+            goto listener_release_exit;
         }
         container->min_iova = info.dma32_window_start;
         container->max_iova = container->min_iova + info.dma32_window_size - 1;
diff --git a/hw/vfio/prereg.c b/hw/vfio/prereg.c
new file mode 100644
index 0000000..d0e4728
--- /dev/null
+++ b/hw/vfio/prereg.c
@@ -0,0 +1,137 @@
+/*
+ * DMA memory preregistration
+ *
+ * Authors:
+ *  Alexey Kardashevskiy <aik@ozlabs.ru>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+#include <linux/vfio.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "hw/hw.h"
+#include "qemu/error-report.h"
+#include "trace.h"
+
+static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
+{
+    if (memory_region_is_iommu(section->mr)) {
+        error_report("Cannot possibly preregister IOMMU memory");
+        return true;
+    }
+
+    return !memory_region_is_ram(section->mr) ||
+            memory_region_is_skip_dump(section->mr);
+}
+
+static void *vfio_prereg_gpa_to_ua(MemoryRegionSection *section, hwaddr gpa)
+{
+    return memory_region_get_ram_ptr(section->mr) +
+        section->offset_within_region +
+        (gpa - section->offset_within_address_space);
+}
+
+static void vfio_prereg_listener_region_add(MemoryListener *listener,
+                                            MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            prereg_listener);
+    const hwaddr gpa = section->offset_within_address_space;
+    hwaddr end;
+    int ret;
+    hwaddr page_mask = qemu_real_host_page_mask;
+    struct vfio_iommu_spapr_register_memory reg = {
+        .argsz = sizeof(reg),
+        .flags = 0,
+    };
+
+    if (vfio_prereg_listener_skipped_section(section)) {
+        trace_vfio_listener_region_add_skip(
+                section->offset_within_address_space,
+                section->offset_within_address_space +
+                int128_get64(int128_sub(section->size, int128_one())));
+        return;
+    }
+
+    if (unlikely((section->offset_within_address_space & ~page_mask) ||
+                 (section->offset_within_region & ~page_mask) ||
+                 (int128_get64(section->size) & ~page_mask))) {
+        error_report("%s received unaligned region", __func__);
+        return;
+    }
+
+    end = section->offset_within_address_space + int128_get64(section->size);
+    g_assert(gpa < end);
+
+    memory_region_ref(section->mr);
+
+    reg.vaddr = (__u64) vfio_prereg_gpa_to_ua(section, gpa);
+    reg.size = end - gpa;
+
+    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
+    trace_vfio_ram_register(reg.vaddr, reg.size, ret ? -errno : 0);
+    if (ret) {
+        /*
+         * On the initfn path, store the first error in the container so we
+         * can gracefully fail.  Runtime, there's not much we can do other
+         * than throw a hardware error.
+         */
+        if (!container->initialized) {
+            if (!container->error) {
+                container->error = ret;
+            }
+        } else {
+            hw_error("vfio: Memory registering failed, unable to continue");
+        }
+    }
+}
+
+static void vfio_prereg_listener_region_del(MemoryListener *listener,
+                                            MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            prereg_listener);
+    const hwaddr gpa = section->offset_within_address_space;
+    hwaddr end;
+    int ret;
+    hwaddr page_mask = qemu_real_host_page_mask;
+    struct vfio_iommu_spapr_register_memory reg = {
+        .argsz = sizeof(reg),
+        .flags = 0,
+    };
+
+    if (vfio_prereg_listener_skipped_section(section)) {
+        trace_vfio_listener_region_del_skip(
+                section->offset_within_address_space,
+                section->offset_within_address_space +
+                int128_get64(int128_sub(section->size, int128_one())));
+        return;
+    }
+
+    if (unlikely((section->offset_within_address_space & ~page_mask) ||
+                 (section->offset_within_region & ~page_mask) ||
+                 (int128_get64(section->size) & ~page_mask))) {
+        error_report("%s received unaligned region", __func__);
+        return;
+    }
+
+    end = section->offset_within_address_space + int128_get64(section->size);
+    if (gpa >= end) {
+        return;
+    }
+
+    reg.vaddr = (__u64) vfio_prereg_gpa_to_ua(section, gpa);
+    reg.size = end - gpa;
+
+    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
+    trace_vfio_ram_unregister(reg.vaddr, reg.size, ret ? -errno : 0);
+}
+
+const MemoryListener vfio_prereg_listener = {
+    .region_add = vfio_prereg_listener_region_add,
+    .region_del = vfio_prereg_listener_region_del,
+};
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index c9b6622..c72e45a 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -73,6 +73,8 @@ typedef struct VFIOContainer {
     VFIOAddressSpace *space;
     int fd; /* /dev/vfio/vfio, empowered by the attached groups */
     MemoryListener listener;
+    MemoryListener prereg_listener;
+    unsigned iommu_type;
     int error;
     bool initialized;
     /*
@@ -156,4 +158,6 @@ extern QLIST_HEAD(vfio_as_head, VFIOAddressSpace) vfio_address_spaces;
 int vfio_get_region_info(VFIODevice *vbasedev, int index,
                          struct vfio_region_info **info);
 #endif
+extern const MemoryListener vfio_prereg_listener;
+
 #endif /* !HW_VFIO_VFIO_COMMON_H */
diff --git a/trace-events b/trace-events
index dd50005..d0d8615 100644
--- a/trace-events
+++ b/trace-events
@@ -1737,6 +1737,8 @@ vfio_region_mmap(const char *name, unsigned long offset, unsigned long end) "Reg
 vfio_region_exit(const char *name, int index) "Device %s, region %d"
 vfio_region_finalize(const char *name, int index) "Device %s, region %d"
 vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
+vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
+vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
 
 # hw/vfio/platform.c
 vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v16 15/19] spapr_pci: Add and export DMA resetting helper
  2016-05-04  6:52 [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (13 preceding siblings ...)
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 14/19] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2) Alexey Kardashevskiy
@ 2016-05-04  6:52 ` Alexey Kardashevskiy
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 16/19] vfio: Add host side DMA window capabilities Alexey Kardashevskiy
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-04  6:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson, Paolo Bonzini

This will be later used by the "ibm,reset-pe-dma-window" RTAS handler
which resets the DMA configuration to the defaults.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr_pci.c          | 10 ++++++++--
 include/hw/pci-host/spapr.h |  2 ++
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index aa9201b..5b9ccff 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1486,9 +1486,8 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
     return 0;
 }
 
-static void spapr_phb_reset(DeviceState *qdev)
+void spapr_phb_dma_reset(sPAPRPHBState *sphb)
 {
-    sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
     sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
 
     if (tcet && tcet->enabled) {
@@ -1498,6 +1497,13 @@ static void spapr_phb_reset(DeviceState *qdev)
     /* Register default 32bit DMA window */
     spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
                            sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT);
+}
+
+static void spapr_phb_reset(DeviceState *qdev)
+{
+    sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
+
+    spapr_phb_dma_reset(sphb);
 
     /* Reset the IOMMU state */
     object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL);
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 03ee006..7848366 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -147,4 +147,6 @@ static inline void spapr_phb_vfio_reset(DeviceState *qdev)
 }
 #endif
 
+void spapr_phb_dma_reset(sPAPRPHBState *sphb);
+
 #endif /* __HW_SPAPR_PCI_H__ */
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v16 16/19] vfio: Add host side DMA window capabilities
  2016-05-04  6:52 [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (14 preceding siblings ...)
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 15/19] spapr_pci: Add and export DMA resetting helper Alexey Kardashevskiy
@ 2016-05-04  6:52 ` Alexey Kardashevskiy
  2016-05-13 22:25   ` Alex Williamson
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 17/19] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO Alexey Kardashevskiy
                   ` (3 subsequent siblings)
  19 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-04  6:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson, Paolo Bonzini

There are going to be multiple IOMMUs per container. This moves the
single host IOMMU parameter set to a list of VFIOHostDMAWindow
structures.

This should cause no behavioral change and will be used later by
the SPAPR TCE IOMMU v2, which will also add a vfio_host_win_del() helper.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v16:
* adjusted commit log with changes from v15

v15:
* s/vfio_host_iommu_add/vfio_host_win_add/
* s/VFIOHostIOMMU/VFIOHostDMAWindow/
---
 hw/vfio/common.c              | 65 +++++++++++++++++++++++++++++++++----------
 include/hw/vfio/vfio-common.h |  9 ++++--
 2 files changed, 57 insertions(+), 17 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 496eb82..3f2fb23 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -29,6 +29,7 @@
 #include "exec/memory.h"
 #include "hw/hw.h"
 #include "qemu/error-report.h"
+#include "qemu/range.h"
 #include "sysemu/kvm.h"
 #include "trace.h"
 
@@ -239,6 +240,45 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
     return -errno;
 }
 
+static VFIOHostDMAWindow *vfio_host_win_lookup(VFIOContainer *container,
+                                               hwaddr min_iova, hwaddr max_iova)
+{
+    VFIOHostDMAWindow *hostwin;
+
+    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
+        if (hostwin->min_iova <= min_iova && max_iova <= hostwin->max_iova) {
+            return hostwin;
+        }
+    }
+
+    return NULL;
+}
+
+static int vfio_host_win_add(VFIOContainer *container,
+                             hwaddr min_iova, hwaddr max_iova,
+                             uint64_t iova_pgsizes)
+{
+    VFIOHostDMAWindow *hostwin;
+
+    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
+        if (ranges_overlap(min_iova, max_iova - min_iova + 1,
+                           hostwin->min_iova,
+                           hostwin->max_iova - hostwin->min_iova + 1)) {
+            error_report("%s: Overlapped IOMMU are not enabled", __func__);
+            return -1;
+        }
+    }
+
+    hostwin = g_malloc0(sizeof(*hostwin));
+
+    hostwin->min_iova = min_iova;
+    hostwin->max_iova = max_iova;
+    hostwin->iova_pgsizes = iova_pgsizes;
+    QLIST_INSERT_HEAD(&container->hostwin_list, hostwin, hostwin_next);
+
+    return 0;
+}
+
 static bool vfio_listener_skipped_section(MemoryRegionSection *section)
 {
     return (!memory_region_is_ram(section->mr) &&
@@ -352,7 +392,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
     }
     end = int128_get64(int128_sub(llend, int128_one()));
 
-    if ((iova < container->min_iova) || (end > container->max_iova)) {
+    if (!vfio_host_win_lookup(container, iova, end)) {
         error_report("vfio: IOMMU container %p can't map guest IOVA region"
                      " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
                      container, iova, end);
@@ -367,10 +407,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
 
         trace_vfio_listener_region_add_iommu(iova, end);
         /*
-         * FIXME: We should do some checking to see if the
-         * capabilities of the host VFIO IOMMU are adequate to model
-         * the guest IOMMU
-         *
          * FIXME: For VFIO iommu types which have KVM acceleration to
          * avoid bouncing all map/unmaps through qemu this way, this
          * would be the right place to wire that up (tell the KVM
@@ -826,16 +862,14 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
          * existing Type1 IOMMUs generally support any IOVA we're
          * going to actually try in practice.
          */
-        container->min_iova = 0;
-        container->max_iova = (hwaddr)-1;
-
-        /* Assume just 4K IOVA page size */
-        container->iova_pgsizes = 0x1000;
         info.argsz = sizeof(info);
         ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
         /* Ignore errors */
         if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
-            container->iova_pgsizes = info.iova_pgsizes;
+            vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes);
+        } else {
+            /* Assume just 4K IOVA page size */
+            vfio_host_win_add(container, 0, (hwaddr)-1, 0x1000);
         }
     } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
                ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
@@ -892,11 +926,12 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             ret = -errno;
             goto listener_release_exit;
         }
-        container->min_iova = info.dma32_window_start;
-        container->max_iova = container->min_iova + info.dma32_window_size - 1;
 
-        /* Assume just 4K IOVA pages for now */
-        container->iova_pgsizes = 0x1000;
+        /* The default table uses 4K pages */
+        vfio_host_win_add(container, info.dma32_window_start,
+                          info.dma32_window_start +
+                          info.dma32_window_size - 1,
+                          0x1000);
     } else {
         error_report("vfio: No available IOMMU models");
         ret = -EINVAL;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index c72e45a..808263b 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -82,9 +82,8 @@ typedef struct VFIOContainer {
      * contiguous IOVA window.  We may need to generalize that in
      * future
      */
-    hwaddr min_iova, max_iova;
-    uint64_t iova_pgsizes;
     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
+    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
     QLIST_HEAD(, VFIOGroup) group_list;
     QLIST_ENTRY(VFIOContainer) next;
 } VFIOContainer;
@@ -97,6 +96,12 @@ typedef struct VFIOGuestIOMMU {
     QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
 } VFIOGuestIOMMU;
 
+typedef struct VFIOHostDMAWindow {
+    hwaddr min_iova, max_iova;
+    uint64_t iova_pgsizes;
+    QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next;
+} VFIOHostDMAWindow;
+
 typedef struct VFIODeviceOps VFIODeviceOps;
 
 typedef struct VFIODevice {
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v16 17/19] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO
  2016-05-04  6:52 [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (15 preceding siblings ...)
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 16/19] vfio: Add host side DMA window capabilities Alexey Kardashevskiy
@ 2016-05-04  6:52 ` Alexey Kardashevskiy
  2016-05-13 22:26   ` Alex Williamson
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 18/19] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2) Alexey Kardashevskiy
                   ` (2 subsequent siblings)
  19 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-04  6:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson, Paolo Bonzini

The sPAPR TCE tables manage two copies when VFIO is using an IOMMU -
a guest view of the table and a hardware TCE table. If there is no
VFIO presence in the address space, just the guest view is used and
it is allocated in KVM. However, since there is no support yet for
VFIO in the KVM TCE hypercalls, when we start using VFIO we need to
move the guest view from KVM to userspace; and we need to do this
for every IOMMU on a bus with VFIO devices.

This adds vfio_start/vfio_stop callbacks in MemoryRegionIOMMUOps to
notify IOMMU about changing environment so it can reallocate the table
to/from KVM or (when available) hook the IOMMU groups with the logical
bus (LIOBN) in the KVM.

This removes explicit spapr_tce_set_need_vfio() call from PCI hotplug
path as the new callbacks do this better - they notify IOMMU at
the exact moment when the configuration is changed, and this also
includes the case of PCI hot unplug.

This postpones vfio_stop() till the end of region_del() as
vfio_dma_unmap() has to execute before VFIO support is disabled.

As there can be multiple containers attached to the same PHB/LIOBN,
this adds a wrapper with a use counter for every IOMMU MR and
stores them in a list in the VFIOAddressSpace.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v16:
* added a use counter in VFIOAddressSpace->VFIOIOMMUMR

v15:
* s/need_vfio/vfio-Users/g
---
 hw/ppc/spapr_iommu.c          | 12 ++++++++++++
 hw/ppc/spapr_pci.c            |  6 ------
 hw/vfio/common.c              | 45 ++++++++++++++++++++++++++++++++++++++++++-
 include/exec/memory.h         |  4 ++++
 include/hw/vfio/vfio-common.h |  7 +++++++
 5 files changed, 67 insertions(+), 7 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index c945dba..7af2700 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -155,6 +155,16 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
     return 1ULL << tcet->page_shift;
 }
 
+static void spapr_tce_vfio_start(MemoryRegion *iommu)
+{
+    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), true);
+}
+
+static void spapr_tce_vfio_stop(MemoryRegion *iommu)
+{
+    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), false);
+}
+
 static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
 static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
 
@@ -239,6 +249,8 @@ static const VMStateDescription vmstate_spapr_tce_table = {
 static MemoryRegionIOMMUOps spapr_iommu_ops = {
     .translate = spapr_tce_translate_iommu,
     .get_page_sizes = spapr_tce_get_page_sizes,
+    .vfio_start = spapr_tce_vfio_start,
+    .vfio_stop = spapr_tce_vfio_stop,
 };
 
 static int spapr_tce_table_realize(DeviceState *dev)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 5b9ccff..51e7d56 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1086,12 +1086,6 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
     void *fdt = NULL;
     int fdt_start_offset = 0, fdt_size;
 
-    if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
-        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
-
-        spapr_tce_set_need_vfio(tcet, true);
-    }
-
     if (dev->hotplugged) {
         fdt = create_device_tree(&fdt_size);
         fdt_start_offset = spapr_create_pci_child_dt(phb, pdev, fdt, 0);
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 3f2fb23..03daf88 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -421,6 +421,26 @@ static void vfio_listener_region_add(MemoryListener *listener,
         QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
 
         memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
+
+        if (section->mr->iommu_ops && section->mr->iommu_ops->vfio_start) {
+            VFIOIOMMUMR *iommumr;
+            bool found = false;
+
+            QLIST_FOREACH(iommumr, &container->space->iommumrs, iommumr_next) {
+                if (iommumr->iommu == section->mr) {
+                    found = true;
+                    break;
+                }
+            }
+            if (!found) {
+                iommumr = g_malloc0(sizeof(*iommumr));
+                iommumr->iommu = section->mr;
+                section->mr->iommu_ops->vfio_start(section->mr);
+                QLIST_INSERT_HEAD(&container->space->iommumrs, iommumr,
+                                  iommumr_next);
+            }
+            ++iommumr->users;
+        }
         memory_region_iommu_replay(giommu->iommu, &giommu->n,
                                    false);
 
@@ -470,6 +490,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
     hwaddr iova, end;
     Int128 llend, llsize;
     int ret;
+    MemoryRegion *iommu = NULL;
 
     if (vfio_listener_skipped_section(section)) {
         trace_vfio_listener_region_del_skip(
@@ -490,13 +511,30 @@ static void vfio_listener_region_del(MemoryListener *listener,
 
         QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
             if (giommu->iommu == section->mr) {
+                VFIOIOMMUMR *iommumr;
+
                 memory_region_unregister_iommu_notifier(&giommu->n);
+
+                QLIST_FOREACH(iommumr, &container->space->iommumrs,
+                              iommumr_next) {
+                    if (iommumr->iommu != section->mr) {
+                        continue;
+                    }
+                    --iommumr->users;
+                    if (iommumr->users) {
+                        break;
+                    }
+                    QLIST_REMOVE(iommumr, iommumr_next);
+                    g_free(iommumr);
+                    iommu = giommu->iommu;
+                    break;
+                }
+
                 QLIST_REMOVE(giommu, giommu_next);
                 g_free(giommu);
                 break;
             }
         }
-
         /*
          * FIXME: We assume the one big unmap below is adequate to
          * remove any individual page mappings in the IOMMU which
@@ -527,6 +565,10 @@ static void vfio_listener_region_del(MemoryListener *listener,
                      "0x%"HWADDR_PRIx") = %d (%m)",
                      container, iova, int128_get64(llsize), ret);
     }
+
+    if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
+        iommu->iommu_ops->vfio_stop(section->mr);
+    }
 }
 
 static const MemoryListener vfio_memory_listener = {
@@ -787,6 +829,7 @@ static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
     space = g_malloc0(sizeof(*space));
     space->as = as;
     QLIST_INIT(&space->containers);
+    QLIST_INIT(&space->iommumrs);
 
     QLIST_INSERT_HEAD(&vfio_address_spaces, space, list);
 
diff --git a/include/exec/memory.h b/include/exec/memory.h
index a3a1703..52d2c70 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -151,6 +151,10 @@ struct MemoryRegionIOMMUOps {
     IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
     /* Returns supported page sizes */
     uint64_t (*get_page_sizes)(MemoryRegion *iommu);
+    /* Called when VFIO starts using this */
+    void (*vfio_start)(MemoryRegion *iommu);
+    /* Called when VFIO stops using this */
+    void (*vfio_stop)(MemoryRegion *iommu);
 };
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 808263b..a9e6e33 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -64,9 +64,16 @@ typedef struct VFIORegion {
 typedef struct VFIOAddressSpace {
     AddressSpace *as;
     QLIST_HEAD(, VFIOContainer) containers;
+    QLIST_HEAD(, VFIOIOMMUMR) iommumrs;
     QLIST_ENTRY(VFIOAddressSpace) list;
 } VFIOAddressSpace;
 
+typedef struct VFIOIOMMUMR {
+    MemoryRegion *iommu;
+    int users;
+    QLIST_ENTRY(VFIOIOMMUMR) iommumr_next;
+} VFIOIOMMUMR;
+
 struct VFIOGroup;
 
 typedef struct VFIOContainer {
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v16 18/19] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)
  2016-05-04  6:52 [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (16 preceding siblings ...)
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 17/19] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO Alexey Kardashevskiy
@ 2016-05-04  6:52 ` Alexey Kardashevskiy
  2016-05-13 22:26   ` Alex Williamson
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 19/19] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
  2016-05-13  4:54 ` [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
  19 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-04  6:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson, Paolo Bonzini

The new VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
This adds the ability for VFIO common code to dynamically allocate/remove
DMA windows in the host kernel when a new VFIO container is added/removed.

This adds a VFIO_IOMMU_SPAPR_TCE_CREATE ioctl call to vfio_listener_region_add
and adds the just created window to the host IOMMU window list; the opposite
action is taken in vfio_listener_region_del.

When creating a new window, this uses a heuristic to decide on the number
of TCE table levels.

This should cause no guest-visible change in behavior.
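The levels heuristic maps window size to TCE entries, then to the host pages backing the TCE array, then adds one indirection level per 6 bits of that page count. A standalone sketch of the intended arithmetic follows; `levels_for_window` and its helpers are hypothetical names (the patch itself uses QEMU's pow2ceil(), ctz64() and getpagesize()), and the host page size is passed explicitly here:

```c
#include <assert.h>
#include <stdint.h>

/* Round v up to the next power of two (assumes v >= 1). */
static uint64_t pow2ceil_u64(uint64_t v)
{
    uint64_t p = 1;
    while (p < v) {
        p <<= 1;
    }
    return p;
}

/* Count trailing zero bits of a non-zero value. */
static unsigned ctz_u64(uint64_t v)
{
    unsigned n = 0;
    while (!(v & 1)) {
        v >>= 1;
        ++n;
    }
    return n;
}

/*
 * Hypothetical mirror of the heuristic: one 8-byte TCE per IOMMU page,
 * the TCE array itself is backed by host pages, and each additional
 * indirection level covers another 6 bits of the backing-page count.
 */
static unsigned levels_for_window(uint64_t window_size,
                                  unsigned page_shift,
                                  uint64_t host_page_size)
{
    uint64_t entries = window_size >> page_shift;
    uint64_t pages = (entries * sizeof(uint64_t)) / host_page_size;

    if (pages < 1) {
        pages = 1;
    }
    pages = pow2ceil_u64(pages);
    return ctz_u64(pages) / 6 + 1;
}
```

With 64K host pages, a 2GB window of 64K IOMMU pages fits in one level, a 1TB window needs two, and a 64TB window needs three.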

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v16:
* used memory_region_iommu_get_page_sizes() in vfio_listener_region_add()
* enforced no intersections between windows

v14:
* new to the series
---
 hw/vfio/common.c | 133 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
 trace-events     |   2 +
 2 files changed, 125 insertions(+), 10 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 03daf88..bd2dee8 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -240,6 +240,18 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
     return -errno;
 }
 
+static bool range_contains(hwaddr start, hwaddr end, hwaddr addr)
+{
+    return start <= addr && addr <= end;
+}
+
+static bool vfio_host_win_intersects(VFIOHostDMAWindow *hostwin,
+                                     hwaddr min_iova, hwaddr max_iova)
+{
+    return range_contains(hostwin->min_iova, hostwin->max_iova, min_iova) ||
+        range_contains(min_iova, max_iova, hostwin->min_iova);
+}
+
 static VFIOHostDMAWindow *vfio_host_win_lookup(VFIOContainer *container,
                                                hwaddr min_iova, hwaddr max_iova)
 {
@@ -279,6 +291,14 @@ static int vfio_host_win_add(VFIOContainer *container,
     return 0;
 }
 
+static void vfio_host_win_del(VFIOContainer *container, hwaddr min_iova)
+{
+    VFIOHostDMAWindow *hostwin = vfio_host_win_lookup(container, min_iova, 1);
+
+    g_assert(hostwin);
+    QLIST_REMOVE(hostwin, hostwin_next);
+}
+
 static bool vfio_listener_skipped_section(MemoryRegionSection *section)
 {
     return (!memory_region_is_ram(section->mr) &&
@@ -392,6 +412,69 @@ static void vfio_listener_region_add(MemoryListener *listener,
     }
     end = int128_get64(int128_sub(llend, int128_one()));
 
+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+        VFIOHostDMAWindow *hostwin;
+        uint64_t pagesizes = memory_region_iommu_get_page_sizes(section->mr);
+        hwaddr pagesize = (hwaddr)1 << ctz64(pagesizes);
+        uint64_t entries, pages;
+        struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
+
+        trace_vfio_listener_region_add_iommu(iova, end);
+        /*
+         * FIXME: For VFIO iommu types which have KVM acceleration to
+         * avoid bouncing all map/unmaps through qemu this way, this
+         * would be the right place to wire that up (tell the KVM
+         * device emulation the VFIO iommu handles to use).
+         */
+        create.window_size = int128_get64(section->size);
+        create.page_shift = ctz64(pagesize);
+        /*
+         * SPAPR host supports multilevel TCE tables, there is some
+         * heuristic to decide how many levels we want for our table:
+         * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
+         */
+        entries = create.window_size >> create.page_shift;
+        pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
+        pages = MAX(pow2ceil(pages), 1); /* Round up */
+        create.levels = ctz64(pages) / 6 + 1;
+
+        /* For now intersections are not allowed, we may relax this later */
+        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
+            if (vfio_host_win_intersects(hostwin,
+                    section->offset_within_address_space,
+                    section->offset_within_address_space +
+                    create.window_size - 1)) {
+                goto fail;
+            }
+        }
+
+        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
+        if (ret) {
+            error_report("Failed to create a window, ret = %d (%m)", ret);
+            goto fail;
+        }
+
+        if (create.start_addr != section->offset_within_address_space) {
+            struct vfio_iommu_spapr_tce_remove remove = {
+                .argsz = sizeof(remove),
+                .start_addr = create.start_addr
+            };
+            error_report("Host doesn't support DMA window at %"HWADDR_PRIx", must be %"PRIx64,
+                         section->offset_within_address_space,
+                         create.start_addr);
+            ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
+            ret = -EINVAL;
+            goto fail;
+        }
+        trace_vfio_spapr_create_window(create.page_shift,
+                                       create.window_size,
+                                       create.start_addr);
+
+        vfio_host_win_add(container, create.start_addr,
+                          create.start_addr + create.window_size - 1,
+                          1ULL << create.page_shift);
+    }
+
     if (!vfio_host_win_lookup(container, iova, end)) {
         error_report("vfio: IOMMU container %p can't map guest IOVA region"
                      " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
@@ -566,6 +649,22 @@ static void vfio_listener_region_del(MemoryListener *listener,
                      container, iova, int128_get64(llsize), ret);
     }
 
+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+        struct vfio_iommu_spapr_tce_remove remove = {
+            .argsz = sizeof(remove),
+            .start_addr = section->offset_within_address_space,
+        };
+        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
+        if (ret) {
+            error_report("Failed to remove window at %"PRIx64,
+                         remove.start_addr);
+        }
+
+        vfio_host_win_del(container, section->offset_within_address_space);
+
+        trace_vfio_spapr_remove_window(remove.start_addr);
+    }
+
     if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
         iommu->iommu_ops->vfio_stop(section->mr);
     }
@@ -957,11 +1056,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             }
         }
 
-        /*
-         * This only considers the host IOMMU's 32-bit window.  At
-         * some point we need to add support for the optional 64-bit
-         * window and dynamic windows
-         */
         info.argsz = sizeof(info);
         ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
         if (ret) {
@@ -970,11 +1064,30 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             goto listener_release_exit;
         }
 
-        /* The default table uses 4K pages */
-        vfio_host_win_add(container, info.dma32_window_start,
-                          info.dma32_window_start +
-                          info.dma32_window_size - 1,
-                          0x1000);
+        if (v2) {
+            /*
+             * A just created container comes with a default window.
+             * To make region_add/del simpler, remove that window now
+             * and let the memory listener callbacks create/remove
+             * windows as needed.
+             */
+            struct vfio_iommu_spapr_tce_remove remove = {
+                .argsz = sizeof(remove),
+                .start_addr = info.dma32_window_start,
+            };
+            ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
+            if (ret) {
+                error_report("vfio: VFIO_IOMMU_SPAPR_TCE_REMOVE failed: %m");
+                ret = -errno;
+                goto free_container_exit;
+            }
+        } else {
+            /* The default table uses 4K pages */
+            vfio_host_win_add(container, info.dma32_window_start,
+                              info.dma32_window_start +
+                              info.dma32_window_size - 1,
+                              0x1000);
+        }
     } else {
         error_report("vfio: No available IOMMU models");
         ret = -EINVAL;
diff --git a/trace-events b/trace-events
index d0d8615..b5419de 100644
--- a/trace-events
+++ b/trace-events
@@ -1739,6 +1739,8 @@ vfio_region_finalize(const char *name, int index) "Device %s, region %d"
 vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
 vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
 vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
+vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
+vfio_spapr_remove_window(uint64_t off) "offset=0x%"PRIx64
 
 # hw/vfio/platform.c
 vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v16 19/19] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-05-04  6:52 [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (17 preceding siblings ...)
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 18/19] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2) Alexey Kardashevskiy
@ 2016-05-04  6:52 ` Alexey Kardashevskiy
  2016-05-13  8:41   ` Bharata B Rao
  2016-05-13  4:54 ` [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
  19 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-04  6:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson, Paolo Bonzini

This adds support for the Dynamic DMA Windows (DDW) option defined by
the SPAPR specification, which allows the guest to have additional DMA
window(s).

The "ddw" property is enabled by default on a PHB, but for compatibility
the pseries-2.5 machine (TODO: update version) and older disable it.
These older machines also get a single DMA window only, to keep backward
migration working.

This implements DDW for PHBs with emulated and VFIO devices. Host kernel
support is required. The advertised IOMMU page sizes are 4K and 64K;
16M pages are supported but not advertised by default. To enable them,
the user has to set the "pgsz" property on the PHB and enable huge pages
for RAM.
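The supported page sizes are advertised by translating the PHB's "pgsz" bitmask (1ULL << page_shift per size) into the LoPAPR flags returned by ibm,query-pe-dma-window. A minimal standalone sketch of that translation, trimmed to the three sizes discussed here (`page_mask_to_query_mask` is a hypothetical name; the flag values match the RTAS_DDW_PGSIZE_* definitions added to spapr.h by this patch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* DDW pagesize flags from ibm,query-pe-dma-window (LoPAPR) */
#define RTAS_DDW_PGSIZE_4K   0x01
#define RTAS_DDW_PGSIZE_64K  0x02
#define RTAS_DDW_PGSIZE_16M  0x04

/*
 * Translate a mask of supported page sizes (bit N set means pages of
 * size 1ULL << N are supported) into the LoPAPR query-mask flags.
 */
static uint32_t page_mask_to_query_mask(uint64_t page_mask)
{
    static const struct { unsigned shift; uint32_t flag; } masks[] = {
        { 12, RTAS_DDW_PGSIZE_4K },
        { 16, RTAS_DDW_PGSIZE_64K },
        { 24, RTAS_DDW_PGSIZE_16M },
    };
    uint32_t ret = 0;
    size_t i;

    for (i = 0; i < sizeof(masks) / sizeof(masks[0]); ++i) {
        if (page_mask & (1ULL << masks[i].shift)) {
            ret |= masks[i].flag;
        }
    }
    return ret;
}
```

With the default "pgsz" of (1ULL << 12) | (1ULL << 16), the guest sees the 4K and 64K flags; adding bit 24 to the property would advertise 16M pages as well.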

Existing Linux guests try to create one additional huge DMA window with
64K or 16MB pages and map the entire guest RAM into it. If that succeeds,
the guest switches to dma_direct_ops and never calls TCE hypercalls
(H_PUT_TCE,...) again. This lets VFIO devices use the entire RAM and
avoids map/unmap overhead later. This adds a "dma64_win_addr" property,
the bus address of the 64bit window, which defaults to
0x800.0000.0000.0000 as this is what modern POWER8 hardware uses; it
also allows having emulated and VFIO devices on the same bus.

This adds 4 RTAS handlers:
* ibm,query-pe-dma-window
* ibm,create-pe-dma-window
* ibm,remove-pe-dma-window
* ibm,reset-pe-dma-window
These are registered from a type_init() callback.

These RTAS handlers are implemented in a separate file to avoid polluting
spapr_iommu.c with PCI code.

This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v16:
* s/dma_liobn/dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS]/
* s/SPAPR_PCI_LIOBN()/dma_liobn[]/

v15:
* moved page mask filtering to PHB realize(), use "-mempath" to know
if there are huge pages
* fixed error reporting in RTAS handlers
* max window size accounts now hotpluggable memory boundaries
---
 hw/ppc/Makefile.objs        |   1 +
 hw/ppc/spapr.c              |   5 +
 hw/ppc/spapr_pci.c          |  75 +++++++++---
 hw/ppc/spapr_rtas_ddw.c     | 292 ++++++++++++++++++++++++++++++++++++++++++++
 include/hw/pci-host/spapr.h |   8 +-
 include/hw/ppc/spapr.h      |  16 ++-
 trace-events                |   4 +
 7 files changed, 381 insertions(+), 20 deletions(-)
 create mode 100644 hw/ppc/spapr_rtas_ddw.c

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index c1ffc77..986b36f 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
 ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
 obj-y += spapr_pci_vfio.o
 endif
+obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
 # PowerPC 4xx boards
 obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
 obj-y += ppc4xx_pci.o
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index b69995e..0206609 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2365,6 +2365,11 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
         .driver   = "spapr-vlan", \
         .property = "use-rx-buffer-pools", \
         .value    = "off", \
+    }, \
+    {\
+        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
+        .property = "ddw",\
+        .value    = stringify(off),\
     },
 
 static void spapr_machine_2_5_instance_options(MachineState *machine)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 51e7d56..aa414f2 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -35,6 +35,7 @@
 #include "hw/ppc/spapr.h"
 #include "hw/pci-host/spapr.h"
 #include "exec/address-spaces.h"
+#include "exec/ram_addr.h"
 #include <libfdt.h>
 #include "trace.h"
 #include "qemu/error-report.h"
@@ -44,6 +45,7 @@
 #include "hw/pci/pci_bus.h"
 #include "hw/ppc/spapr_drc.h"
 #include "sysemu/device_tree.h"
+#include "sysemu/hostmem.h"
 
 #include "hw/vfio/vfio.h"
 
@@ -1305,11 +1307,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     PCIBus *bus;
     uint64_t msi_window_size = 4096;
     sPAPRTCETable *tcet;
+    const unsigned windows_supported =
+        sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
 
     if (sphb->index != (uint32_t)-1) {
         hwaddr windows_base;
 
-        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn != (uint32_t)-1)
+        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
+            || ((sphb->dma_liobn[1] != (uint32_t)-1) && (windows_supported > 1))
             || (sphb->mem_win_addr != (hwaddr)-1)
             || (sphb->io_win_addr != (hwaddr)-1)) {
             error_setg(errp, "Either \"index\" or other parameters must"
@@ -1324,7 +1329,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         }
 
         sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
-        sphb->dma_liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
+        for (i = 0; i < windows_supported; ++i) {
+            sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
+        }
 
         windows_base = SPAPR_PCI_WINDOW_BASE
             + sphb->index * SPAPR_PCI_WINDOW_SPACING;
@@ -1337,8 +1344,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    if (sphb->dma_liobn == (uint32_t)-1) {
-        error_setg(errp, "LIOBN not specified for PHB");
+    if ((sphb->dma_liobn[0] == (uint32_t)-1) ||
+        ((sphb->dma_liobn[1] == (uint32_t)-1) && (windows_supported > 1))) {
+        error_setg(errp, "LIOBN(s) not specified for PHB");
         return;
     }
 
@@ -1456,16 +1464,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         }
     }
 
-    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
-    if (!tcet) {
-        error_setg(errp, "Unable to create TCE table for %s",
-                   sphb->dtbusname);
-        return;
+    /* DMA setup */
+    for (i = 0; i < windows_supported; ++i) {
+        tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn[i]);
+        if (!tcet) {
+            error_setg(errp, "Creating window#%d failed for %s",
+                       i, sphb->dtbusname);
+            return;
+        }
+        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
+                                            spapr_tce_get_iommu(tcet), 0);
     }
 
-    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
-                                        spapr_tce_get_iommu(tcet), 0);
-
     sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
 }
 
@@ -1482,13 +1492,19 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
 
 void spapr_phb_dma_reset(sPAPRPHBState *sphb)
 {
-    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
+    int i;
+    sPAPRTCETable *tcet;
 
-    if (tcet && tcet->enabled) {
-        spapr_tce_table_disable(tcet);
+    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
+        tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[i]);
+
+        if (tcet && tcet->enabled) {
+            spapr_tce_table_disable(tcet);
+        }
     }
 
     /* Register default 32bit DMA window */
+    tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[0]);
     spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
                            sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT);
 }
@@ -1510,7 +1526,8 @@ static void spapr_phb_reset(DeviceState *qdev)
 static Property spapr_phb_properties[] = {
     DEFINE_PROP_UINT32("index", sPAPRPHBState, index, -1),
     DEFINE_PROP_UINT64("buid", sPAPRPHBState, buid, -1),
-    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn, -1),
+    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn[0], -1),
+    DEFINE_PROP_UINT32("liobn64", sPAPRPHBState, dma_liobn[1], -1),
     DEFINE_PROP_UINT64("mem_win_addr", sPAPRPHBState, mem_win_addr, -1),
     DEFINE_PROP_UINT64("mem_win_size", sPAPRPHBState, mem_win_size,
                        SPAPR_PCI_MMIO_WIN_SIZE),
@@ -1522,6 +1539,11 @@ static Property spapr_phb_properties[] = {
     /* Default DMA window is 0..1GB */
     DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
     DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
+    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
+                       0x800000000000000ULL),
+    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
+    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
+                       (1ULL << 12) | (1ULL << 16)),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -1598,7 +1620,7 @@ static const VMStateDescription vmstate_spapr_pci = {
     .post_load = spapr_pci_post_load,
     .fields = (VMStateField[]) {
         VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),
-        VMSTATE_UINT32_EQUAL(dma_liobn, sPAPRPHBState),
+        VMSTATE_UNUSED(4), /* dma_liobn */
         VMSTATE_UINT64_EQUAL(mem_win_addr, sPAPRPHBState),
         VMSTATE_UINT64_EQUAL(mem_win_size, sPAPRPHBState),
         VMSTATE_UINT64_EQUAL(io_win_addr, sPAPRPHBState),
@@ -1775,6 +1797,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     uint32_t interrupt_map_mask[] = {
         cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
     uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
+    uint32_t ddw_applicable[] = {
+        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
+        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
+        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
+    };
+    uint32_t ddw_extensions[] = {
+        cpu_to_be32(1),
+        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
+    };
     sPAPRTCETable *tcet;
     PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
     sPAPRFDT s_fdt;
@@ -1799,6 +1830,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
     _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
 
+    /* Dynamic DMA window */
+    if (phb->ddw_enabled) {
+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
+                         sizeof(ddw_applicable)));
+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
+                         &ddw_extensions, sizeof(ddw_extensions)));
+    }
+
     /* Build the interrupt-map, this must matches what is done
      * in pci_spapr_map_irq
      */
@@ -1822,7 +1861,7 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
                      sizeof(interrupt_map)));
 
-    tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
+    tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
     if (!tcet) {
         return -1;
     }
diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
new file mode 100644
index 0000000..b4e0686
--- /dev/null
+++ b/hw/ppc/spapr_rtas_ddw.c
@@ -0,0 +1,292 @@
+/*
+ * QEMU sPAPR Dynamic DMA windows support
+ *
+ * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License,
+ *  or (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/error-report.h"
+#include "hw/ppc/spapr.h"
+#include "hw/pci-host/spapr.h"
+#include "trace.h"
+
+static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
+{
+    sPAPRTCETable *tcet;
+
+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
+    if (tcet && tcet->enabled) {
+        ++*(unsigned *)opaque;
+    }
+    return 0;
+}
+
+static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
+{
+    unsigned ret = 0;
+
+    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
+
+    return ret;
+}
+
+static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
+{
+    sPAPRTCETable *tcet;
+
+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
+    if (tcet && !tcet->enabled) {
+        *(uint32_t *)opaque = tcet->liobn;
+        return 1;
+    }
+    return 0;
+}
+
+static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
+{
+    uint32_t liobn = 0;
+
+    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
+
+    return liobn;
+}
+
+static uint32_t spapr_page_mask_to_query_mask(uint64_t page_mask)
+{
+    int i;
+    uint32_t mask = 0;
+    const struct { int shift; uint32_t mask; } masks[] = {
+        { 12, RTAS_DDW_PGSIZE_4K },
+        { 16, RTAS_DDW_PGSIZE_64K },
+        { 24, RTAS_DDW_PGSIZE_16M },
+        { 25, RTAS_DDW_PGSIZE_32M },
+        { 26, RTAS_DDW_PGSIZE_64M },
+        { 27, RTAS_DDW_PGSIZE_128M },
+        { 28, RTAS_DDW_PGSIZE_256M },
+        { 34, RTAS_DDW_PGSIZE_16G },
+    };
+
+    for (i = 0; i < ARRAY_SIZE(masks); ++i) {
+        if (page_mask & (1ULL << masks[i].shift)) {
+            mask |= masks[i].mask;
+        }
+    }
+
+    return mask;
+}
+
+static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
+                                         sPAPRMachineState *spapr,
+                                         uint32_t token, uint32_t nargs,
+                                         target_ulong args,
+                                         uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    uint64_t buid, max_window_size;
+    uint32_t avail, addr, pgmask = 0;
+    MachineState *machine = MACHINE(spapr);
+
+    if ((nargs != 3) || (nret != 5)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    /* Translate page mask to LoPAPR format */
+    pgmask = spapr_page_mask_to_query_mask(sphb->page_size_mask);
+
+    /*
+     * This is the "largest contiguous block of TCEs allocated specifically
+     * for (that is, are reserved for) this PE"; return the number of
+     * IOMMU pages covering all possible RAM, hotpluggable memory included.
+     */
+    if (machine->ram_size == machine->maxram_size) {
+        max_window_size = machine->ram_size;
+    } else {
+        MemoryHotplugState *hpms = &spapr->hotplug_memory;
+
+        max_window_size = hpms->base + memory_region_size(&hpms->mr);
+    }
+
+    avail = SPAPR_PCI_DMA_MAX_WINDOWS - spapr_phb_get_active_win_num(sphb);
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    rtas_st(rets, 1, avail);
+    rtas_st(rets, 2, max_window_size >> SPAPR_TCE_PAGE_SHIFT);
+    rtas_st(rets, 3, pgmask);
+    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
+
+    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
+                                          sPAPRMachineState *spapr,
+                                          uint32_t token, uint32_t nargs,
+                                          target_ulong args,
+                                          uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    sPAPRTCETable *tcet = NULL;
+    uint32_t addr, page_shift, window_shift, liobn;
+    uint64_t buid;
+
+    if ((nargs != 5) || (nret != 4)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    page_shift = rtas_ld(args, 3);
+    window_shift = rtas_ld(args, 4);
+    liobn = spapr_phb_get_free_liobn(sphb);
+
+    if (!(sphb->page_size_mask & (1ULL << page_shift)) ||
+        (window_shift < page_shift)) {
+        goto param_error_exit;
+    }
+
+    if (!liobn || !sphb->ddw_enabled ||
+        spapr_phb_get_active_win_num(sphb) == SPAPR_PCI_DMA_MAX_WINDOWS) {
+        goto hw_error_exit;
+    }
+
+    tcet = spapr_tce_find_by_liobn(liobn);
+    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
+                                 1ULL << window_shift,
+                                 tcet ? tcet->bus_offset : 0xbaadf00d, liobn);
+    if (!tcet) {
+        goto hw_error_exit;
+    }
+
+    spapr_tce_table_enable(tcet, page_shift, sphb->dma64_window_addr,
+                           1ULL << (window_shift - page_shift));
+    if (!tcet->enabled) {
+        goto hw_error_exit;
+    }
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    rtas_st(rets, 1, liobn);
+    rtas_st(rets, 2, tcet->bus_offset >> 32);
+    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
+
+    return;
+
+hw_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
+                                          sPAPRMachineState *spapr,
+                                          uint32_t token, uint32_t nargs,
+                                          target_ulong args,
+                                          uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    sPAPRTCETable *tcet;
+    uint32_t liobn;
+
+    if ((nargs != 1) || (nret != 1)) {
+        goto param_error_exit;
+    }
+
+    liobn = rtas_ld(args, 0);
+    tcet = spapr_tce_find_by_liobn(liobn);
+    if (!tcet) {
+        goto param_error_exit;
+    }
+
+    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
+    if (!sphb || !sphb->ddw_enabled || !tcet->enabled) {
+        goto param_error_exit;
+    }
+
+    spapr_tce_table_disable(tcet);
+    trace_spapr_iommu_ddw_remove(liobn);
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
+                                         sPAPRMachineState *spapr,
+                                         uint32_t token, uint32_t nargs,
+                                         target_ulong args,
+                                         uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    uint64_t buid;
+    uint32_t addr;
+
+    if ((nargs != 3) || (nret != 1)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    spapr_phb_dma_reset(sphb);
+    trace_spapr_iommu_ddw_reset(buid, addr);
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void spapr_rtas_ddw_init(void)
+{
+    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
+                        "ibm,query-pe-dma-window",
+                        rtas_ibm_query_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
+                        "ibm,create-pe-dma-window",
+                        rtas_ibm_create_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
+                        "ibm,remove-pe-dma-window",
+                        rtas_ibm_remove_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
+                        "ibm,reset-pe-dma-window",
+                        rtas_ibm_reset_pe_dma_window);
+}
+
+type_init(spapr_rtas_ddw_init)
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 7848366..36a370e 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -32,6 +32,8 @@
 #define SPAPR_PCI_HOST_BRIDGE(obj) \
     OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
 
+#define SPAPR_PCI_DMA_MAX_WINDOWS    2
+
 typedef struct sPAPRPHBState sPAPRPHBState;
 
 typedef struct spapr_pci_msi {
@@ -56,7 +58,7 @@ struct sPAPRPHBState {
     hwaddr mem_win_addr, mem_win_size, io_win_addr, io_win_size;
     MemoryRegion memwindow, iowindow, msiwindow;
 
-    uint32_t dma_liobn;
+    uint32_t dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS];
     hwaddr dma_win_addr, dma_win_size;
     AddressSpace iommu_as;
     MemoryRegion iommu_root;
@@ -71,6 +73,10 @@ struct sPAPRPHBState {
     spapr_pci_msi_mig *msi_devs;
 
     QLIST_ENTRY(sPAPRPHBState) list;
+
+    bool ddw_enabled;
+    uint64_t page_size_mask;
+    uint64_t dma64_window_addr;
 };
 
 #define SPAPR_PCI_MAX_INDEX          255
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index f0cfd58..4a1fe6f 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -412,6 +412,16 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
 #define RTAS_OUT_NOT_AUTHORIZED                 -9002
 #define RTAS_OUT_SYSPARM_PARAM_ERROR            -9999
 
+/* DDW pagesize mask values from ibm,query-pe-dma-window */
+#define RTAS_DDW_PGSIZE_4K       0x01
+#define RTAS_DDW_PGSIZE_64K      0x02
+#define RTAS_DDW_PGSIZE_16M      0x04
+#define RTAS_DDW_PGSIZE_32M      0x08
+#define RTAS_DDW_PGSIZE_64M      0x10
+#define RTAS_DDW_PGSIZE_128M     0x20
+#define RTAS_DDW_PGSIZE_256M     0x40
+#define RTAS_DDW_PGSIZE_16G      0x80
+
 /* RTAS tokens */
 #define RTAS_TOKEN_BASE      0x2000
 
@@ -453,8 +463,12 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
 #define RTAS_IBM_SET_SLOT_RESET                 (RTAS_TOKEN_BASE + 0x23)
 #define RTAS_IBM_CONFIGURE_PE                   (RTAS_TOKEN_BASE + 0x24)
 #define RTAS_IBM_SLOT_ERROR_DETAIL              (RTAS_TOKEN_BASE + 0x25)
+#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x26)
+#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x27)
+#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x28)
+#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x29)
 
-#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x26)
+#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x2A)
 
 /* RTAS ibm,get-system-parameter token values */
 #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
diff --git a/trace-events b/trace-events
index b5419de..88a9cf4 100644
--- a/trace-events
+++ b/trace-events
@@ -1434,6 +1434,10 @@ spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, un
 spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
 spapr_iommu_pre_save(uint64_t liobn, uint32_t nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
 spapr_iommu_post_load(uint64_t liobn, uint32_t pre_nb, uint32_t post_nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" => %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
+spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32
+spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, uint64_t pg_size, uint64_t req_size, uint64_t start, uint32_t liobn) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%"PRIx64", requested=0x%"PRIx64", start addr=%"PRIx64", liobn=%"PRIx32
+spapr_iommu_ddw_remove(uint32_t liobn) "liobn=%"PRIx32
+spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr) "buid=%"PRIx64" addr=%"PRIx32
 
 # hw/ppc/ppc.c
 ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v16 01/19] vfio: Delay DMA address space listener release
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 01/19] vfio: Delay DMA address space listener release Alexey Kardashevskiy
@ 2016-05-05 22:39   ` Alex Williamson
  2016-05-13  7:16     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 69+ messages in thread
From: Alex Williamson @ 2016-05-05 22:39 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson, Paolo Bonzini

On Wed,  4 May 2016 16:52:13 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> This postpones VFIO container deinitialization to let region_del()
> callbacks (called via vfio_listener_release) do proper clean up
> while the group is still attached to the container.

Any mappings within the container should clean themselves up when the
container is deprivileged by removing the last group in the kernel.  Is
the issue that that doesn't happen, which would be a spapr vfio kernel
bug, or that our QEMU side structures get all out of whack if we let
that happen?

> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  hw/vfio/common.c | 22 +++++++++++++++-------
>  1 file changed, 15 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index fe5ec6a..0b40262 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -921,23 +921,31 @@ static void vfio_disconnect_container(VFIOGroup *group)
>  {
>      VFIOContainer *container = group->container;
>  
> -    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
> -        error_report("vfio: error disconnecting group %d from container",
> -                     group->groupid);
> -    }
> -
>      QLIST_REMOVE(group, container_next);
> +
> +    if (QLIST_EMPTY(&container->group_list)) {
> +        VFIOGuestIOMMU *giommu;
> +
> +        vfio_listener_release(container);
> +
> +        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> +            memory_region_unregister_iommu_notifier(&giommu->n);
> +        }
> +    }
> +
>      group->container = NULL;
> +    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
> +        error_report("vfio: error disconnecting group %d from container",
> +                     group->groupid);
> +    }
>  
>      if (QLIST_EMPTY(&container->group_list)) {
>          VFIOAddressSpace *space = container->space;
>          VFIOGuestIOMMU *giommu, *tmp;
>  
> -        vfio_listener_release(container);
>          QLIST_REMOVE(container, next);
>  
>          QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
> -            memory_region_unregister_iommu_notifier(&giommu->n);
>              QLIST_REMOVE(giommu, giommu_next);
>              g_free(giommu);
>          }

I'm not spotting why this is a 2-pass process vs simply moving the
existing QLIST_EMPTY cleanup above the ioctl.  Thanks,

Alex


* Re: [Qemu-devel] [PATCH qemu v16 02/19] memory: Call region_del() callbacks on memory listener unregistering
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 02/19] memory: Call region_del() callbacks on memory listener unregistering Alexey Kardashevskiy
@ 2016-05-05 22:45   ` Alex Williamson
  2016-05-26  1:48     ` David Gibson
  0 siblings, 1 reply; 69+ messages in thread
From: Alex Williamson @ 2016-05-05 22:45 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson, Paolo Bonzini

On Wed,  4 May 2016 16:52:14 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> When a new memory listener is registered, listener_add_address_space()
> is called and which in turn calls region_add() callbacks of memory regions.
> However when unregistering the memory listener, it is just removed from
> the listening chain and no region_del() is called.
> 
> This adds listener_del_address_space() and uses it in
> memory_listener_unregister(). listener_add_address_space() was used as
> a template with the following changes:
> s/log_global_start/log_global_stop/
> s/log_start/log_stop/
> s/region_add/region_del/
> 
> This will allow the following patches to add/remove DMA windows
> dynamically from VFIO's PCI address space's region_add()/region_del().

Following patch 1 comments, it would be a bug if the kernel actually
needed this to do cleanup, we must release everything if QEMU gets shot
with a SIGKILL anyway.  So what does this cleanup facilitate in QEMU?
Having QEMU trigger an unmap for each region_del is not going to be as
efficient as just dropping the container and letting the kernel handle
the cleanup all in one go.  Thanks,

Alex

> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  memory.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 48 insertions(+)
> 
> diff --git a/memory.c b/memory.c
> index f76f85d..f762a34 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -2185,6 +2185,49 @@ static void listener_add_address_space(MemoryListener *listener,
>      flatview_unref(view);
>  }
>  
> +static void listener_del_address_space(MemoryListener *listener,
> +                                       AddressSpace *as)
> +{
> +    FlatView *view;
> +    FlatRange *fr;
> +
> +    if (listener->address_space_filter
> +        && listener->address_space_filter != as) {
> +        return;
> +    }
> +
> +    if (listener->begin) {
> +        listener->begin(listener);
> +    }
> +    if (global_dirty_log) {
> +        if (listener->log_global_stop) {
> +            listener->log_global_stop(listener);
> +        }
> +    }
> +
> +    view = address_space_get_flatview(as);
> +    FOR_EACH_FLAT_RANGE(fr, view) {
> +        MemoryRegionSection section = {
> +            .mr = fr->mr,
> +            .address_space = as,
> +            .offset_within_region = fr->offset_in_region,
> +            .size = fr->addr.size,
> +            .offset_within_address_space = int128_get64(fr->addr.start),
> +            .readonly = fr->readonly,
> +        };
> +        if (fr->dirty_log_mask && listener->log_stop) {
> +            listener->log_stop(listener, &section, 0, fr->dirty_log_mask);
> +        }
> +        if (listener->region_del) {
> +            listener->region_del(listener, &section);
> +        }
> +    }
> +    if (listener->commit) {
> +        listener->commit(listener);
> +    }
> +    flatview_unref(view);
> +}
> +
>  void memory_listener_register(MemoryListener *listener, AddressSpace *filter)
>  {
>      MemoryListener *other = NULL;
> @@ -2211,6 +2254,11 @@ void memory_listener_register(MemoryListener *listener, AddressSpace *filter)
>  
>  void memory_listener_unregister(MemoryListener *listener)
>  {
> +    AddressSpace *as;
> +
> +    QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) {
> +        listener_del_address_space(listener, as);
> +    }
>      QTAILQ_REMOVE(&memory_listeners, listener, link);
>  }
>  


* Re: [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW)
  2016-05-04  6:52 [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (18 preceding siblings ...)
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 19/19] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
@ 2016-05-13  4:54 ` Alexey Kardashevskiy
  2016-05-13  5:36   ` Alex Williamson
  19 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-13  4:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson, Paolo Bonzini

Alex W,

could you please review VFIO-related chunks? Thanks!


On 05/04/2016 04:52 PM, Alexey Kardashevskiy wrote:
> Each Partitionable Endpoint (IOMMU group) has an address range on a PCI bus
> where devices are allowed to do DMA. These ranges are called DMA windows.
> By default, there is a single DMA window, 1 or 2GB big, mapped at zero
> on a PCI bus.
>
> PAPR defines a DDW RTAS API which allows pseries guests
> to query the hypervisor about DDW support and capabilities (page size mask
> for now). A pseries guest may request additional (to the default)
> DMA windows using this RTAS API.
> The existing pseries Linux guests request an additional window as big as
> the guest RAM and map the entire guest window which effectively creates
> direct mapping of the guest memory to a PCI bus.
>
> This patchset reworks PPC64 IOMMU code and adds necessary structures
> to support big windows on pseries.
>
> This patchset is based on the latest upstream.
>
> This includes "vmstate: Define VARRAY with VMS_ALLOC" as it has been accepted
> but has not been merged to upstream yet.
>
> Please comment. Thanks!
>
>
> Paolo, I did cc: you on this because of 02/19 and 03/19, would be great to
> get an opinion as the rest of the series relies on it to do
> vfio-pci hot _un_plug properly. Thanks!
>
>
> Alexey Kardashevskiy (19):
>   vfio: Delay DMA address space listener release
>   memory: Call region_del() callbacks on memory listener unregistering
>   memory: Fix IOMMU replay base address
>   vmstate: Define VARRAY with VMS_ALLOC
>   vfio: Check that IOMMU MR translates to system address space
>   spapr_pci: Use correct DMA LIOBN when composing the device tree
>   spapr_iommu: Move table allocation to helpers
>   spapr_iommu: Introduce "enabled" state for TCE table
>   spapr_iommu: Finish renaming vfio_accel to need_vfio
>   spapr_iommu: Migrate full state
>   spapr_iommu: Add root memory region
>   spapr_pci: Reset DMA config on PHB reset
>   memory: Add reporting of supported page sizes
>   vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2)
>   spapr_pci: Add and export DMA resetting helper
>   vfio: Add host side DMA window capabilities
>   spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being
>     used by VFIO
>   vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)
>   spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
>
>  hw/ppc/Makefile.objs          |   1 +
>  hw/ppc/spapr.c                |   5 +
>  hw/ppc/spapr_iommu.c          | 228 ++++++++++++++++++++++++------
>  hw/ppc/spapr_pci.c            |  96 +++++++++----
>  hw/ppc/spapr_rtas_ddw.c       | 292 ++++++++++++++++++++++++++++++++++++++
>  hw/ppc/spapr_vio.c            |   8 +-
>  hw/vfio/Makefile.objs         |   1 +
>  hw/vfio/common.c              | 319 +++++++++++++++++++++++++++++++++++-------
>  hw/vfio/prereg.c              | 137 ++++++++++++++++++
>  include/exec/memory.h         |  22 ++-
>  include/hw/pci-host/spapr.h   |  10 +-
>  include/hw/ppc/spapr.h        |  31 +++-
>  include/hw/vfio/vfio-common.h |  21 ++-
>  include/migration/vmstate.h   |  10 ++
>  memory.c                      |  64 ++++++++-
>  target-ppc/kvm_ppc.h          |   2 +-
>  trace-events                  |  12 +-
>  17 files changed, 1120 insertions(+), 139 deletions(-)
>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
>  create mode 100644 hw/vfio/prereg.c
>


-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW)
  2016-05-13  4:54 ` [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
@ 2016-05-13  5:36   ` Alex Williamson
  0 siblings, 0 replies; 69+ messages in thread
From: Alex Williamson @ 2016-05-13  5:36 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Paolo Bonzini, David Gibson, qemu-ppc, qemu-devel, Alexander Graf

On Fri, 13 May 2016 14:54:52 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> Alex W,
> 
> could you please review VFIO-related chunks? Thanks!


https://lists.nongnu.org/archive/html/qemu-devel/2016-05/msg00744.html
https://lists.nongnu.org/archive/html/qemu-devel/2016-05/msg00745.html


> On 05/04/2016 04:52 PM, Alexey Kardashevskiy wrote:
> > Each Partitionable Endpoint (IOMMU group) has an address range on a PCI bus
> > where devices are allowed to do DMA. These ranges are called DMA windows.
> > By default, there is a single DMA window, 1 or 2GB big, mapped at zero
> > on a PCI bus.
> >
> > PAPR defines a DDW RTAS API which allows pseries guests
> > to query the hypervisor about DDW support and capabilities (page size mask
> > for now). A pseries guest may request additional (to the default)
> > DMA windows using this RTAS API.
> > The existing pseries Linux guests request an additional window as big as
> > the guest RAM and map the entire guest window which effectively creates
> > direct mapping of the guest memory to a PCI bus.
> >
> > This patchset reworks PPC64 IOMMU code and adds necessary structures
> > to support big windows on pseries.
> >
> > This patchset is based on the latest upstream.
> >
> > This includes "vmstate: Define VARRAY with VMS_ALLOC" as it has been accepted
> > but has not been merged to upstream yet.
> >
> > Please comment. Thanks!
> >
> >
> > Paolo, I did cc: you on this because of 02/19 and 03/19, would be great to
> > get an opinion as the rest of the series relies on it to do
> > vfio-pci hot _un_plug properly. Thanks!
> >
> >
> > Alexey Kardashevskiy (19):
> >   vfio: Delay DMA address space listener release
> >   memory: Call region_del() callbacks on memory listener unregistering
> >   memory: Fix IOMMU replay base address
> >   vmstate: Define VARRAY with VMS_ALLOC
> >   vfio: Check that IOMMU MR translates to system address space
> >   spapr_pci: Use correct DMA LIOBN when composing the device tree
> >   spapr_iommu: Move table allocation to helpers
> >   spapr_iommu: Introduce "enabled" state for TCE table
> >   spapr_iommu: Finish renaming vfio_accel to need_vfio
> >   spapr_iommu: Migrate full state
> >   spapr_iommu: Add root memory region
> >   spapr_pci: Reset DMA config on PHB reset
> >   memory: Add reporting of supported page sizes
> >   vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2)
> >   spapr_pci: Add and export DMA resetting helper
> >   vfio: Add host side DMA window capabilities
> >   spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being
> >     used by VFIO
> >   vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)
> >   spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
> >
> >  hw/ppc/Makefile.objs          |   1 +
> >  hw/ppc/spapr.c                |   5 +
> >  hw/ppc/spapr_iommu.c          | 228 ++++++++++++++++++++++++------
> >  hw/ppc/spapr_pci.c            |  96 +++++++++----
> >  hw/ppc/spapr_rtas_ddw.c       | 292 ++++++++++++++++++++++++++++++++++++++
> >  hw/ppc/spapr_vio.c            |   8 +-
> >  hw/vfio/Makefile.objs         |   1 +
> >  hw/vfio/common.c              | 319 +++++++++++++++++++++++++++++++++++-------
> >  hw/vfio/prereg.c              | 137 ++++++++++++++++++
> >  include/exec/memory.h         |  22 ++-
> >  include/hw/pci-host/spapr.h   |  10 +-
> >  include/hw/ppc/spapr.h        |  31 +++-
> >  include/hw/vfio/vfio-common.h |  21 ++-
> >  include/migration/vmstate.h   |  10 ++
> >  memory.c                      |  64 ++++++++-
> >  target-ppc/kvm_ppc.h          |   2 +-
> >  trace-events                  |  12 +-
> >  17 files changed, 1120 insertions(+), 139 deletions(-)
> >  create mode 100644 hw/ppc/spapr_rtas_ddw.c
> >  create mode 100644 hw/vfio/prereg.c
> >  
> 
> 


* Re: [Qemu-devel] [PATCH qemu v16 01/19] vfio: Delay DMA address space listener release
  2016-05-05 22:39   ` Alex Williamson
@ 2016-05-13  7:16     ` Alexey Kardashevskiy
  2016-05-13 22:24       ` Alex Williamson
  0 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-13  7:16 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson, Paolo Bonzini

On 05/06/2016 08:39 AM, Alex Williamson wrote:
> On Wed,  4 May 2016 16:52:13 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
>> This postpones VFIO container deinitialization to let region_del()
>> callbacks (called via vfio_listener_release) do proper clean up
>> while the group is still attached to the container.
>
> Any mappings within the container should clean themselves up when the
> container is deprivileged by removing the last group in the kernel. Is
> the issue that that doesn't happen, which would be a spapr vfio kernel
> bug, or that our QEMU side structures get all out of whack if we let
> that happen?

My mailbase got corrupted, missed that.

This is mostly for "[PATCH qemu v16 17/19] spapr_iommu, vfio, memory: 
Notify IOMMU about starting/stopping being used by VFIO", I should have put 
01/19 and 02/19 right before 17/19, sorry about that.


On every reboot the spapr machine removes all (i.e. one or two) windows and 
creates the default one.

I do this with memory_region_del_subregion(iommu_mr) + 
memory_region_add_subregion(iommu_mr), which gets translated to 
VFIO_IOMMU_SPAPR_TCE_REMOVE + VFIO_IOMMU_SPAPR_TCE_CREATE via 
vfio_memory_listener if there is VFIO; no direct calls from spapr to vfio 
=> cool. During the machine reset, the VFIO device is there with the 
container and groups attached, and at some point has no windows at all.

Now to VFIO plug/unplug.

When VFIO plug happens, vfio_memory_listener is created, region_add() is 
called, the hardware window is created (via VFIO_IOMMU_SPAPR_TCE_CREATE).
Unplugging should end up doing VFIO_IOMMU_SPAPR_TCE_REMOVE somehow. If 
region_del() is not called when the container is being destroyed (as before 
this patchset), then the kernel cleans up and destroys windows when 
close(container->fd) is called or when qemu is killed (and this fd is 
naturally closed) - I hope this answers the comment on 02/19.

So far so good (right?)

However I also have a guest view of the TCE table, this is what the guest 
sees and this is what emulated PCI devices use. This guest view is either 
allocated in the KVM (so H_PUT_TCE can be handled quickly right in the host 
kernel, even in real mode) or userspace (VFIO case).

I generally want the guest view to be in the KVM. However, when I plug VFIO, 
I have to move the table to userspace. When I unplug VFIO, I want to do 
the opposite, so I need a way to tell spapr that it can move the table. 
region_del() seemed a natural way of doing this as region_add() is already 
doing the opposite part.

With this patchset, each IOMMU MR gets a usage counter, region_add() does 
+1, region_del() does -1 (yeah, not extremely optimal during reset). When 
the counter goes from 0 to 1, vfio_start() hook is called, when the counter 
becomes 0 - vfio_stop(). Note that we may have multiple VFIO containers on 
the same PHB.

Without 01/19 and 02/19, I'll have to repeat region_del()'s counter 
decrement steps in vfio_disconnect_container(). And I still cannot move 
counting from region_add() to vfio_connect_container(), so there will be 
asymmetry. I am fine with that - I am just checking here: what would be 
the best approach?


Thanks.



>
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  hw/vfio/common.c | 22 +++++++++++++++-------
>>  1 file changed, 15 insertions(+), 7 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index fe5ec6a..0b40262 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -921,23 +921,31 @@ static void vfio_disconnect_container(VFIOGroup *group)
>>  {
>>      VFIOContainer *container = group->container;
>>
>> -    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
>> -        error_report("vfio: error disconnecting group %d from container",
>> -                     group->groupid);
>> -    }
>> -
>>      QLIST_REMOVE(group, container_next);
>> +
>> +    if (QLIST_EMPTY(&container->group_list)) {
>> +        VFIOGuestIOMMU *giommu;
>> +
>> +        vfio_listener_release(container);
>> +
>> +        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
>> +            memory_region_unregister_iommu_notifier(&giommu->n);
>> +        }
>> +    }
>> +
>>      group->container = NULL;
>> +    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
>> +        error_report("vfio: error disconnecting group %d from container",
>> +                     group->groupid);
>> +    }
>>
>>      if (QLIST_EMPTY(&container->group_list)) {
>>          VFIOAddressSpace *space = container->space;
>>          VFIOGuestIOMMU *giommu, *tmp;
>>
>> -        vfio_listener_release(container);
>>          QLIST_REMOVE(container, next);
>>
>>          QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
>> -            memory_region_unregister_iommu_notifier(&giommu->n);
>>              QLIST_REMOVE(giommu, giommu_next);
>>              g_free(giommu);
>>          }
>
> I'm not spotting why this is a 2-pass process vs simply moving the
> existing QLIST_EMPTY cleanup above the ioctl.  Thanks,





-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v16 19/19] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 19/19] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
@ 2016-05-13  8:41   ` Bharata B Rao
  2016-05-13  8:49     ` Bharata B Rao
                       ` (2 more replies)
  0 siblings, 3 replies; 69+ messages in thread
From: Bharata B Rao @ 2016-05-13  8:41 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, Alexander Graf, Alex Williamson, qemu-ppc,
	Paolo Bonzini, David Gibson

On Wed, May 4, 2016 at 12:22 PM, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> This adds support for the Dynamic DMA Windows (DDW) option defined by
> the SPAPR specification, which allows having additional DMA window(s).
>
> The "ddw" property is enabled by default on a PHB but for compatibility
> the pseries-2.5 machine (TODO: update version) and older disable it.
> This also creates a single DMA window for the older machines to
> maintain backward migration.
>
> This implements DDW for PHB with emulated and VFIO devices. The host
> kernel support is required. The advertised IOMMU page sizes are 4K and
> 64K; 16M pages are supported but not advertised by default, in order to
> enable them, the user has to specify "pgsz" property for PHB and
> enable huge pages for RAM.
>
> The existing Linux guests try creating one additional huge DMA window
> with 64K or 16MB pages and map the entire guest RAM to it. If that
> succeeds, the guest switches to dma_direct_ops and never calls TCE
> hypercalls (H_PUT_TCE,...) again. This enables VFIO devices to use the
> and not waste time on map/unmap later. This adds a "dma64_win_addr"
> property which is a bus address for the 64bit window and by default
> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
> uses and this allows having emulated and VFIO devices on the same bus.
>
> This adds 4 RTAS handlers:
> * ibm,query-pe-dma-window
> * ibm,create-pe-dma-window
> * ibm,remove-pe-dma-window
> * ibm,reset-pe-dma-window
> These are registered from type_init() callback.
>
> These RTAS handlers are implemented in a separate file to avoid polluting
> spapr_iommu.c with PCI.
>
> This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v16:
> * s/dma_liobn/dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS]/
> * s/SPAPR_PCI_LIOBN()/dma_liobn[]/
>
> v15:
> * moved page mask filtering to PHB realize(), use "-mempath" to know
> if there are huge pages
> * fixed error reporting in RTAS handlers
> * max window size accounts now hotpluggable memory boundaries
> ---
>  hw/ppc/Makefile.objs        |   1 +
>  hw/ppc/spapr.c              |   5 +
>  hw/ppc/spapr_pci.c          |  75 +++++++++---
>  hw/ppc/spapr_rtas_ddw.c     | 292 ++++++++++++++++++++++++++++++++++++++++++++
>  include/hw/pci-host/spapr.h |   8 +-
>  include/hw/ppc/spapr.h      |  16 ++-
>  trace-events                |   4 +
>  7 files changed, 381 insertions(+), 20 deletions(-)
>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
>
> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> index c1ffc77..986b36f 100644
> --- a/hw/ppc/Makefile.objs
> +++ b/hw/ppc/Makefile.objs
> @@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>  obj-y += spapr_pci_vfio.o
>  endif
> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>  # PowerPC 4xx boards
>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>  obj-y += ppc4xx_pci.o
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index b69995e..0206609 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -2365,6 +2365,11 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
>          .driver   = "spapr-vlan", \
>          .property = "use-rx-buffer-pools", \
>          .value    = "off", \
> +    }, \
> +    {\
> +        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
> +        .property = "ddw",\
> +        .value    = stringify(off),\
>      },
>
>  static void spapr_machine_2_5_instance_options(MachineState *machine)
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 51e7d56..aa414f2 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -35,6 +35,7 @@
>  #include "hw/ppc/spapr.h"
>  #include "hw/pci-host/spapr.h"
>  #include "exec/address-spaces.h"
> +#include "exec/ram_addr.h"
>  #include <libfdt.h>
>  #include "trace.h"
>  #include "qemu/error-report.h"
> @@ -44,6 +45,7 @@
>  #include "hw/pci/pci_bus.h"
>  #include "hw/ppc/spapr_drc.h"
>  #include "sysemu/device_tree.h"
> +#include "sysemu/hostmem.h"
>
>  #include "hw/vfio/vfio.h"
>
> @@ -1305,11 +1307,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>      PCIBus *bus;
>      uint64_t msi_window_size = 4096;
>      sPAPRTCETable *tcet;
> +    const unsigned windows_supported =
> +        sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
>
>      if (sphb->index != (uint32_t)-1) {
>          hwaddr windows_base;
>
> -        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn != (uint32_t)-1)
> +        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
> +            || ((sphb->dma_liobn[1] != (uint32_t)-1) && (windows_supported > 1))
>              || (sphb->mem_win_addr != (hwaddr)-1)
>              || (sphb->io_win_addr != (hwaddr)-1)) {
>              error_setg(errp, "Either \"index\" or other parameters must"
> @@ -1324,7 +1329,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>          }
>
>          sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
> -        sphb->dma_liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
> +        for (i = 0; i < windows_supported; ++i) {
> +            sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
> +        }
>
>          windows_base = SPAPR_PCI_WINDOW_BASE
>              + sphb->index * SPAPR_PCI_WINDOW_SPACING;
> @@ -1337,8 +1344,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>          return;
>      }
>
> -    if (sphb->dma_liobn == (uint32_t)-1) {
> -        error_setg(errp, "LIOBN not specified for PHB");
> +    if ((sphb->dma_liobn[0] == (uint32_t)-1) ||
> +        ((sphb->dma_liobn[1] == (uint32_t)-1) && (windows_supported > 1))) {
> +        error_setg(errp, "LIOBN(s) not specified for PHB");
>          return;
>      }
>
> @@ -1456,16 +1464,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>          }
>      }
>
> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
> -    if (!tcet) {
> -        error_setg(errp, "Unable to create TCE table for %s",
> -                   sphb->dtbusname);
> -        return;
> +    /* DMA setup */
> +    for (i = 0; i < windows_supported; ++i) {
> +        tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn[i]);
> +        if (!tcet) {
> +            error_setg(errp, "Creating window#%d failed for %s",
> +                       i, sphb->dtbusname);
> +            return;
> +        }
> +        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
> +                                            spapr_tce_get_iommu(tcet), 0);
>      }
>
> -    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
> -                                        spapr_tce_get_iommu(tcet), 0);
> -
>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>  }
>
> @@ -1482,13 +1492,19 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
>
>  void spapr_phb_dma_reset(sPAPRPHBState *sphb)
>  {
> -    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
> +    int i;
> +    sPAPRTCETable *tcet;
>
> -    if (tcet && tcet->enabled) {
> -        spapr_tce_table_disable(tcet);
> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> +        tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[i]);
> +
> +        if (tcet && tcet->enabled) {
> +            spapr_tce_table_disable(tcet);
> +        }
>      }
>
>      /* Register default 32bit DMA window */
> +    tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[0]);
>      spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
>                             sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT);
>  }
> @@ -1510,7 +1526,8 @@ static void spapr_phb_reset(DeviceState *qdev)
>  static Property spapr_phb_properties[] = {
>      DEFINE_PROP_UINT32("index", sPAPRPHBState, index, -1),
>      DEFINE_PROP_UINT64("buid", sPAPRPHBState, buid, -1),
> -    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn, -1),
> +    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn[0], -1),
> +    DEFINE_PROP_UINT32("liobn64", sPAPRPHBState, dma_liobn[1], -1),
>      DEFINE_PROP_UINT64("mem_win_addr", sPAPRPHBState, mem_win_addr, -1),
>      DEFINE_PROP_UINT64("mem_win_size", sPAPRPHBState, mem_win_size,
>                         SPAPR_PCI_MMIO_WIN_SIZE),
> @@ -1522,6 +1539,11 @@ static Property spapr_phb_properties[] = {
>      /* Default DMA window is 0..1GB */
>      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
>      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
> +                       0x800000000000000ULL),
> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
> +    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
> +                       (1ULL << 12) | (1ULL << 16)),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>
> @@ -1598,7 +1620,7 @@ static const VMStateDescription vmstate_spapr_pci = {
>      .post_load = spapr_pci_post_load,
>      .fields = (VMStateField[]) {
>          VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),
> -        VMSTATE_UINT32_EQUAL(dma_liobn, sPAPRPHBState),
> +        VMSTATE_UNUSED(4), /* dma_liobn */
>          VMSTATE_UINT64_EQUAL(mem_win_addr, sPAPRPHBState),
>          VMSTATE_UINT64_EQUAL(mem_win_size, sPAPRPHBState),
>          VMSTATE_UINT64_EQUAL(io_win_addr, sPAPRPHBState),
> @@ -1775,6 +1797,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      uint32_t interrupt_map_mask[] = {
>          cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
>      uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
> +    uint32_t ddw_applicable[] = {
> +        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
> +        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
> +        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
> +    };
> +    uint32_t ddw_extensions[] = {
> +        cpu_to_be32(1),
> +        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
> +    };
>      sPAPRTCETable *tcet;
>      PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
>      sPAPRFDT s_fdt;
> @@ -1799,6 +1830,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
>
> +    /* Dynamic DMA window */
> +    if (phb->ddw_enabled) {
> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
> +                         sizeof(ddw_applicable)));
> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
> +                         &ddw_extensions, sizeof(ddw_extensions)));
> +    }
> +
>      /* Build the interrupt-map, this must matches what is done
>       * in pci_spapr_map_irq
>       */
> @@ -1822,7 +1861,7 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
>                       sizeof(interrupt_map)));
>
> -    tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
> +    tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
>      if (!tcet) {
>          return -1;
>      }
> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
> new file mode 100644
> index 0000000..b4e0686
> --- /dev/null
> +++ b/hw/ppc/spapr_rtas_ddw.c
> @@ -0,0 +1,292 @@
> +/*
> + * QEMU sPAPR Dynamic DMA windows support
> + *
> + * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License,
> + *  or (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/error-report.h"
> +#include "hw/ppc/spapr.h"
> +#include "hw/pci-host/spapr.h"
> +#include "trace.h"
> +
> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
> +{
> +    sPAPRTCETable *tcet;
> +
> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> +    if (tcet && tcet->enabled) {
> +        ++*(unsigned *)opaque;
> +    }
> +    return 0;
> +}
> +
> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
> +{
> +    unsigned ret = 0;
> +
> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
> +
> +    return ret;
> +}
> +
> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
> +{
> +    sPAPRTCETable *tcet;
> +
> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> +    if (tcet && !tcet->enabled) {
> +        *(uint32_t *)opaque = tcet->liobn;
> +        return 1;
> +    }
> +    return 0;
> +}
> +
> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
> +{
> +    uint32_t liobn = 0;
> +
> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
> +
> +    return liobn;
> +}
> +
> +static uint32_t spapr_page_mask_to_query_mask(uint64_t page_mask)
> +{
> +    int i;
> +    uint32_t mask = 0;
> +    const struct { int shift; uint32_t mask; } masks[] = {
> +        { 12, RTAS_DDW_PGSIZE_4K },
> +        { 16, RTAS_DDW_PGSIZE_64K },
> +        { 24, RTAS_DDW_PGSIZE_16M },
> +        { 25, RTAS_DDW_PGSIZE_32M },
> +        { 26, RTAS_DDW_PGSIZE_64M },
> +        { 27, RTAS_DDW_PGSIZE_128M },
> +        { 28, RTAS_DDW_PGSIZE_256M },
> +        { 34, RTAS_DDW_PGSIZE_16G },
> +    };
> +
> +    for (i = 0; i < ARRAY_SIZE(masks); ++i) {
> +        if (page_mask & (1ULL << masks[i].shift)) {
> +            mask |= masks[i].mask;
> +        }
> +    }
> +
> +    return mask;
> +}
> +
> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
> +                                         sPAPRMachineState *spapr,
> +                                         uint32_t token, uint32_t nargs,
> +                                         target_ulong args,
> +                                         uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    uint64_t buid, max_window_size;
> +    uint32_t avail, addr, pgmask = 0;
> +    MachineState *machine = MACHINE(spapr);
> +
> +    if ((nargs != 3) || (nret != 5)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    /* Translate page mask to LoPAPR format */
> +    pgmask = spapr_page_mask_to_query_mask(sphb->page_size_mask);
> +
> +    /*
> +     * This is "Largest contiguous block of TCEs allocated specifically
> +     * for (that is, are reserved for) this PE".
> +     * Return the maximum number as maximum supported RAM size was in 4K pages.
> +     */
> +    if (machine->ram_size == machine->maxram_size) {
> +        max_window_size = machine->ram_size >> SPAPR_TCE_PAGE_SHIFT;
> +    } else {
> +        MemoryHotplugState *hpms = &spapr->hotplug_memory;
> +
> +        max_window_size = hpms->base + memory_region_size(&hpms->mr);
> +    }

I guess the SPAPR_TCE_PAGE_SHIFT right shift should be applied to
max_window_size in both branches (if and else)?

> +
> +    avail = SPAPR_PCI_DMA_MAX_WINDOWS - spapr_phb_get_active_win_num(sphb);
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    rtas_st(rets, 1, avail);
> +    rtas_st(rets, 2, max_window_size);
> +    rtas_st(rets, 3, pgmask);
> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
> +
> +    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
> +                                          sPAPRMachineState *spapr,
> +                                          uint32_t token, uint32_t nargs,
> +                                          target_ulong args,
> +                                          uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    sPAPRTCETable *tcet = NULL;
> +    uint32_t addr, page_shift, window_shift, liobn;
> +    uint64_t buid;
> +
> +    if ((nargs != 5) || (nret != 4)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    page_shift = rtas_ld(args, 3);
> +    window_shift = rtas_ld(args, 4);

The kernel has a bug due to which a wrong window_shift gets returned
here. I have posted a possible fix here:
https://patchwork.ozlabs.org/patch/621497/

I have tried to work around this issue in QEMU too:
https://lists.nongnu.org/archive/html/qemu-ppc/2016-04/msg00226.html

But the above workaround involves changing the memory representation
in the DT. Hence I feel that until the guest kernel changes are
available, a simpler workaround would be to discard the window_shift
value above and recalculate the right value as below:

if (machine->ram_size == machine->maxram_size) {
    max_window_size = machine->ram_size;
} else {
    MemoryHotplugState *hpms = &spapr->hotplug_memory;

    max_window_size = hpms->base + memory_region_size(&hpms->mr);
}
window_shift = max_window_size >> SPAPR_TCE_PAGE_SHIFT;

and create the DDW based on this calculated window_shift value. Does
that sound reasonable?

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v16 19/19] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-05-13  8:41   ` Bharata B Rao
@ 2016-05-13  8:49     ` Bharata B Rao
  2016-05-16  6:25     ` Alexey Kardashevskiy
  2016-05-27  4:42     ` David Gibson
  2 siblings, 0 replies; 69+ messages in thread
From: Bharata B Rao @ 2016-05-13  8:49 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, Alexander Graf, Alex Williamson, qemu-ppc,
	Paolo Bonzini, David Gibson

On Fri, May 13, 2016 at 2:11 PM, Bharata B Rao <bharata.rao@gmail.com> wrote:
> On Wed, May 4, 2016 at 12:22 PM, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
>> +                                          sPAPRMachineState *spapr,
>> +                                          uint32_t token, uint32_t nargs,
>> +                                          target_ulong args,
>> +                                          uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    sPAPRTCETable *tcet = NULL;
>> +    uint32_t addr, page_shift, window_shift, liobn;
>> +    uint64_t buid;
>> +
>> +    if ((nargs != 5) || (nret != 4)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    page_shift = rtas_ld(args, 3);
>> +    window_shift = rtas_ld(args, 4);
>
> The kernel has a bug due to which a wrong window_shift gets returned
> here. I have posted a possible fix here:
> https://patchwork.ozlabs.org/patch/621497/
>
> I have tried to work around this issue in QEMU too:
> https://lists.nongnu.org/archive/html/qemu-ppc/2016-04/msg00226.html
>
> But the above workaround involves changing the memory representation
> in the DT. Hence I feel that until the guest kernel changes are
> available, a simpler workaround would be to discard the window_shift
> value above and recalculate the right value as below:
>
> if (machine->ram_size == machine->maxram_size) {
>     max_window_size = machine->ram_size;
> } else {
>     MemoryHotplugState *hpms = &spapr->hotplug_memory;
>
>     max_window_size = hpms->base + memory_region_size(&hpms->mr);
> }
> window_shift = max_window_size >> SPAPR_TCE_PAGE_SHIFT;
>
> and create the DDW based on this calculated window_shift value. Does
> that sound reasonable?

Sorry, I missed mentioning earlier that an incorrect DDW window size
here causes memory hotplug to fail.

>
> Regards,
> Bharata.



-- 
http://raobharata.wordpress.com/


* Re: [Qemu-devel] [PATCH qemu v16 01/19] vfio: Delay DMA address space listener release
  2016-05-13  7:16     ` Alexey Kardashevskiy
@ 2016-05-13 22:24       ` Alex Williamson
  2016-05-25  6:34         ` David Gibson
  0 siblings, 1 reply; 69+ messages in thread
From: Alex Williamson @ 2016-05-13 22:24 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson, Paolo Bonzini

On Fri, 13 May 2016 17:16:48 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 05/06/2016 08:39 AM, Alex Williamson wrote:
> > On Wed,  4 May 2016 16:52:13 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >  
> >> This postpones VFIO container deinitialization to let region_del()
> >> callbacks (called via vfio_listener_release) do proper clean up
> >> while the group is still attached to the container.  
> >
> > Any mappings within the container should clean themselves up when the
> > container is deprivileged by removing the last group in the kernel. Is
> > the issue that that doesn't happen, which would be a spapr vfio kernel
> > bug, or that our QEMU side structures get all out of whack if we let
> > that happen?  
> 
> My mailbase got corrupted, missed that.
> 
> This is mostly for "[PATCH qemu v16 17/19] spapr_iommu, vfio, memory: 
> Notify IOMMU about starting/stopping being used by VFIO", I should have put 
> 01/19 and 02/19 right before 17/19, sorry about that.

Which I object to: it's just ridiculous to have vfio start/stop
callbacks in a set of generic iommu region ops.

> On every reboot, the spapr machine removes all (i.e. one or two) windows
> and creates the default one.
> 
> I do this by memory_region_del_subregion(iommu_mr) + 
> memory_region_add_subregion(iommu_mr). Which gets translated to 
> VFIO_IOMMU_SPAPR_TCE_REMOVE + VFIO_IOMMU_SPAPR_TCE_CREATE via 
> vfio_memory_listener if there is VFIO; no direct calls from spapr to vfio 
> => cool. During the machine reset, the VFIO device is there with the   
> container and groups attached, at some point with no windows.
> 
> Now to VFIO plug/unplug.
> 
> When VFIO plug happens, vfio_memory_listener is created, region_add() is 
> called, the hardware window is created (via VFIO_IOMMU_SPAPR_TCE_CREATE).
> Unplugging should end up doing VFIO_IOMMU_SPAPR_TCE_REMOVE somehow. If 
> region_del() is not called when the container is being destroyed (as before 
> this patchset), then the kernel cleans and destroys windows when 
> close(container->fd) is called or when qemu is killed (and this fd is 
> naturally closed), I hope this answers the comment from 02/19.
> 
> So far so good (right?)
> 
> However I also have a guest view of the TCE table, this is what the guest 
> sees and this is what emulated PCI devices use. This guest view is either 
> allocated in the KVM (so H_PUT_TCE can be handled quickly right in the host 
> kernel, even in real mode) or userspace (VFIO case).
> 
> I generally want the guest view to be in the KVM. However when I plug VFIO, 
> I have to move the table to the userspace. When I unplug VFIO, I want to do 
> the opposite so I need a way to tell spapr that it can move the table. 
> region_del() seemed a natural way of doing this as region_add() is already 
> doing the opposite part.
> 
> With this patchset, each IOMMU MR gets a usage counter, region_add() does 
> +1, region_del() does -1 (yeah, not extremely optimal during reset). When 
> the counter goes from 0 to 1, vfio_start() hook is called, when the counter 
> becomes 0 - vfio_stop(). Note that we may have multiple VFIO containers on 
> the same PHB.
> 
> Without 01/19 and 02/19, I'll have to repeat region_del()'s counter
> decrement steps in vfio_disconnect_container(). And I still cannot move
> counting from region_add() to vfio_connect_container(), so there will be
> an asymmetry, which I am fine with. I am just checking here: what would
> be the best approach?


You're imposing on other iommu models (type1) that in order to release
a container we first deregister the listener, which un-plays all of
the mappings within that region.  That's inefficient when we can simply
unset the container and move on.  So you're imposing an inefficiency on
a separate vfio iommu model for the book keeping of your own.  I don't
think that's a reasonable approach.  Has it even been tested how that
affects type1 users?  When a container is closed, clearly it shouldn't
be contributing to reference counts, so it seems like there must be
other ways to handle this.

> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >>  hw/vfio/common.c | 22 +++++++++++++++-------
> >>  1 file changed, 15 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> index fe5ec6a..0b40262 100644
> >> --- a/hw/vfio/common.c
> >> +++ b/hw/vfio/common.c
> >> @@ -921,23 +921,31 @@ static void vfio_disconnect_container(VFIOGroup *group)
> >>  {
> >>      VFIOContainer *container = group->container;
> >>
> >> -    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
> >> -        error_report("vfio: error disconnecting group %d from container",
> >> -                     group->groupid);
> >> -    }
> >> -
> >>      QLIST_REMOVE(group, container_next);
> >> +
> >> +    if (QLIST_EMPTY(&container->group_list)) {
> >> +        VFIOGuestIOMMU *giommu;
> >> +
> >> +        vfio_listener_release(container);
> >> +
> >> +        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> >> +            memory_region_unregister_iommu_notifier(&giommu->n);
> >> +        }
> >> +    }
> >> +
> >>      group->container = NULL;
> >> +    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
> >> +        error_report("vfio: error disconnecting group %d from container",
> >> +                     group->groupid);
> >> +    }
> >>
> >>      if (QLIST_EMPTY(&container->group_list)) {
> >>          VFIOAddressSpace *space = container->space;
> >>          VFIOGuestIOMMU *giommu, *tmp;
> >>
> >> -        vfio_listener_release(container);
> >>          QLIST_REMOVE(container, next);
> >>
> >>          QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
> >> -            memory_region_unregister_iommu_notifier(&giommu->n);
> >>              QLIST_REMOVE(giommu, giommu_next);
> >>              g_free(giommu);
> >>          }  
> >
> > I'm not spotting why this is a 2-pass process vs simply moving the
> > existing QLIST_EMPTY cleanup above the ioctl.  Thanks,  
> 
> 
> 
> 
> 


* Re: [Qemu-devel] [PATCH qemu v16 14/19] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2)
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 14/19] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2) Alexey Kardashevskiy
@ 2016-05-13 22:25   ` Alex Williamson
  2016-05-16  1:10     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 69+ messages in thread
From: Alex Williamson @ 2016-05-13 22:25 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson, Paolo Bonzini

On Wed,  4 May 2016 16:52:26 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> This makes use of the new "memory registering" feature. The idea is
> to provide the userspace ability to notify the host kernel about pages
> which are going to be used for DMA. Having this information, the host
> kernel can pin them all once per user process, do locked pages
> accounting (once) and not spend time on doing that in real time with
> possible failures which cannot be handled nicely in some cases.
> 
> This adds a prereg memory listener which listens on address_space_memory
> and notifies a VFIO container about memory which needs to be
> pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
> 
> As there is no per-IOMMU-type release() callback anymore, this stores
> the IOMMU type in the container so vfio_listener_release() can determine
> if it needs to unregister @prereg_listener.
> 
> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> not call it when v2 is detected and enabled.
> 
> This enforces guest RAM blocks to be host page size aligned; however
> this is not new as KVM already requires memory slots to be host page
> size aligned.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v16:
> * switched to 64bit math everywhere as there is no chance to see
> region_add on RAM blocks even remotely close to 1<<64 bytes.
> 
> v15:
> * banned unaligned sections
> * added an vfio_prereg_gpa_to_ua() helper
> 
> v14:
> * s/free_container_exit/listener_release_exit/g
> * added "if memory_region_is_iommu()" to vfio_prereg_listener_skipped_section
> ---
>  hw/vfio/Makefile.objs         |   1 +
>  hw/vfio/common.c              |  38 +++++++++---
>  hw/vfio/prereg.c              | 137 ++++++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |   4 ++
>  trace-events                  |   2 +
>  5 files changed, 172 insertions(+), 10 deletions(-)
>  create mode 100644 hw/vfio/prereg.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> index ceddbb8..5800e0e 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
>  obj-$(CONFIG_SOFTMMU) += platform.o
>  obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
>  obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
> +obj-$(CONFIG_SOFTMMU) += prereg.o
>  endif
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 2050040..496eb82 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -501,6 +501,9 @@ static const MemoryListener vfio_memory_listener = {
>  static void vfio_listener_release(VFIOContainer *container)
>  {
>      memory_listener_unregister(&container->listener);
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +        memory_listener_unregister(&container->prereg_listener);
> +    }
>  }
>  
>  int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
> @@ -808,8 +811,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              goto free_container_exit;
>          }
>  
> -        ret = ioctl(fd, VFIO_SET_IOMMU,
> -                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
> +        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>          if (ret) {
>              error_report("vfio: failed to set iommu for container: %m");
>              ret = -errno;
> @@ -834,8 +837,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>          if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
>              container->iova_pgsizes = info.iova_pgsizes;
>          }
> -    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
> +               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
>          struct vfio_iommu_spapr_tce_info info;
> +        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
>  
>          ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>          if (ret) {
> @@ -843,7 +848,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              ret = -errno;
>              goto free_container_exit;
>          }
> -        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
> +        container->iommu_type =
> +            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>          if (ret) {
>              error_report("vfio: failed to set iommu for container: %m");
>              ret = -errno;
> @@ -855,11 +862,22 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>           * when container fd is closed so we do not call it explicitly
>           * in this file.
>           */
> -        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> -        if (ret) {
> -            error_report("vfio: failed to enable container: %m");
> -            ret = -errno;
> -            goto free_container_exit;
> +        if (!v2) {
> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> +            if (ret) {
> +                error_report("vfio: failed to enable container: %m");
> +                ret = -errno;
> +                goto free_container_exit;
> +            }
> +        } else {
> +            container->prereg_listener = vfio_prereg_listener;
> +
> +            memory_listener_register(&container->prereg_listener,
> +                                     &address_space_memory);
> +            if (container->error) {
> +                error_report("vfio: RAM memory listener initialization failed for container");
> +                goto listener_release_exit;
> +            }
>          }
>  
>          /*
> @@ -872,7 +890,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>          if (ret) {
>              error_report("vfio: VFIO_IOMMU_SPAPR_TCE_GET_INFO failed: %m");
>              ret = -errno;
> -            goto free_container_exit;
> +            goto listener_release_exit;
>          }
>          container->min_iova = info.dma32_window_start;
>          container->max_iova = container->min_iova + info.dma32_window_size - 1;
> diff --git a/hw/vfio/prereg.c b/hw/vfio/prereg.c
> new file mode 100644
> index 0000000..d0e4728
> --- /dev/null
> +++ b/hw/vfio/prereg.c
> @@ -0,0 +1,137 @@
> +/*
> + * DMA memory preregistration
> + *
> + * Authors:
> + *  Alexey Kardashevskiy <aik@ozlabs.ru>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include <sys/ioctl.h>
> +#include <linux/vfio.h>
> +
> +#include "hw/vfio/vfio-common.h"
> +#include "hw/hw.h"
> +#include "qemu/error-report.h"
> +#include "trace.h"
> +
> +static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
> +{
> +    if (memory_region_is_iommu(section->mr)) {
> +        error_report("Cannot possibly preregister IOMMU memory");

What is a user supposed to do with this error_report()?  Is it
continue-able?  How is it possible?  What should they do differently?

> +        return true;
> +    }
> +
> +    return !memory_region_is_ram(section->mr) ||
> +            memory_region_is_skip_dump(section->mr);
> +}
> +
> +static void *vfio_prereg_gpa_to_ua(MemoryRegionSection *section, hwaddr gpa)

What's "ua"?

> +{
> +    return memory_region_get_ram_ptr(section->mr) +
> +        section->offset_within_region +
> +        (gpa - section->offset_within_address_space);
> +}
> +
> +static void vfio_prereg_listener_region_add(MemoryListener *listener,
> +                                            MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> +                                            prereg_listener);
> +    const hwaddr gpa = section->offset_within_address_space;
> +    hwaddr end;
> +    int ret;
> +    hwaddr page_mask = qemu_real_host_page_mask;
> +    struct vfio_iommu_spapr_register_memory reg = {
> +        .argsz = sizeof(reg),
> +        .flags = 0,
> +    };

So we're just pretending that this spapr-specific code is some sort of
generic pre-registration interface?

> +
> +    if (vfio_prereg_listener_skipped_section(section)) {
> +        trace_vfio_listener_region_add_skip(
> +                section->offset_within_address_space,
> +                section->offset_within_address_space +
> +                int128_get64(int128_sub(section->size, int128_one())));
> +        return;
> +    }
> +
> +    if (unlikely((section->offset_within_address_space & ~page_mask) ||
> +                 (section->offset_within_region & ~page_mask) ||
> +                 (int128_get64(section->size) & ~page_mask))) {
> +        error_report("%s received unaligned region", __func__);
> +        return;
> +    }
> +
> +    end = section->offset_within_address_space + int128_get64(section->size);
> +    g_assert(gpa < end);
> +
> +    memory_region_ref(section->mr);
> +
> +    reg.vaddr = (__u64) vfio_prereg_gpa_to_ua(section, gpa);

Hmm, why wasn't that simply gpa_to_vaddr?

> +    reg.size = end - gpa;
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
> +    trace_vfio_ram_register(reg.vaddr, reg.size, ret ? -errno : 0);
> +    if (ret) {
> +        /*
> +         * On the initfn path, store the first error in the container so we
> +         * can gracefully fail.  Runtime, there's not much we can do other
> +         * than throw a hardware error.
> +         */
> +        if (!container->initialized) {
> +            if (!container->error) {
> +                container->error = ret;
> +            }
> +        } else {
> +            hw_error("vfio: Memory registering failed, unable to continue");
> +        }
> +    }
> +}
> +
> +static void vfio_prereg_listener_region_del(MemoryListener *listener,
> +                                            MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> +                                            prereg_listener);
> +    const hwaddr gpa = section->offset_within_address_space;
> +    hwaddr end;
> +    int ret;
> +    hwaddr page_mask = qemu_real_host_page_mask;
> +    struct vfio_iommu_spapr_register_memory reg = {
> +        .argsz = sizeof(reg),
> +        .flags = 0,
> +    };
> +
> +    if (vfio_prereg_listener_skipped_section(section)) {
> +        trace_vfio_listener_region_del_skip(
> +                section->offset_within_address_space,
> +                section->offset_within_address_space +
> +                int128_get64(int128_sub(section->size, int128_one())));
> +        return;
> +    }
> +
> +    if (unlikely((section->offset_within_address_space & ~page_mask) ||
> +                 (section->offset_within_region & ~page_mask) ||
> +                 (int128_get64(section->size) & ~page_mask))) {
> +        error_report("%s received unaligned region", __func__);
> +        return;
> +    }
> +
> +    end = section->offset_within_address_space + int128_get64(section->size);
> +    if (gpa >= end) {
> +        return;
> +    }
> +
> +    reg.vaddr = (__u64) vfio_prereg_gpa_to_ua(section, gpa);
> +    reg.size = end - gpa;
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
> +    trace_vfio_ram_unregister(reg.vaddr, reg.size, ret ? -errno : 0);
> +}
> +
> +const MemoryListener vfio_prereg_listener = {
> +    .region_add = vfio_prereg_listener_region_add,
> +    .region_del = vfio_prereg_listener_region_del,
> +};
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index c9b6622..c72e45a 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -73,6 +73,8 @@ typedef struct VFIOContainer {
>      VFIOAddressSpace *space;
>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>      MemoryListener listener;
> +    MemoryListener prereg_listener;
> +    unsigned iommu_type;
>      int error;
>      bool initialized;
>      /*
> @@ -156,4 +158,6 @@ extern QLIST_HEAD(vfio_as_head, VFIOAddressSpace) vfio_address_spaces;
>  int vfio_get_region_info(VFIODevice *vbasedev, int index,
>                           struct vfio_region_info **info);
>  #endif
> +extern const MemoryListener vfio_prereg_listener;
> +
>  #endif /* !HW_VFIO_VFIO_COMMON_H */
> diff --git a/trace-events b/trace-events
> index dd50005..d0d8615 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1737,6 +1737,8 @@ vfio_region_mmap(const char *name, unsigned long offset, unsigned long end) "Reg
>  vfio_region_exit(const char *name, int index) "Device %s, region %d"
>  vfio_region_finalize(const char *name, int index) "Device %s, region %d"
>  vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
> +vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> +vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>  
>  # hw/vfio/platform.c
>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v16 16/19] vfio: Add host side DMA window capabilities
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 16/19] vfio: Add host side DMA window capabilities Alexey Kardashevskiy
@ 2016-05-13 22:25   ` Alex Williamson
  2016-05-27  0:36     ` David Gibson
  0 siblings, 1 reply; 69+ messages in thread
From: Alex Williamson @ 2016-05-13 22:25 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson, Paolo Bonzini

On Wed,  4 May 2016 16:52:28 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> There are going to be multiple IOMMUs per container. This moves
> the single host IOMMU parameter set to a list of VFIOHostDMAWindow.
> 
> This should cause no behavioral change and will be used later by
> the SPAPR TCE IOMMU v2 which will also add a vfio_host_win_del() helper.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> Changes:
> v16:
> * adjusted commit log with changes from v15
> 
> v15:
> * s/vfio_host_iommu_add/vfio_host_win_add/
> * s/VFIOHostIOMMU/VFIOHostDMAWindow/
> ---
>  hw/vfio/common.c              | 65 +++++++++++++++++++++++++++++++++----------
>  include/hw/vfio/vfio-common.h |  9 ++++--
>  2 files changed, 57 insertions(+), 17 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 496eb82..3f2fb23 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -29,6 +29,7 @@
>  #include "exec/memory.h"
>  #include "hw/hw.h"
>  #include "qemu/error-report.h"
> +#include "qemu/range.h"
>  #include "sysemu/kvm.h"
>  #include "trace.h"
>  
> @@ -239,6 +240,45 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>      return -errno;
>  }
>  
> +static VFIOHostDMAWindow *vfio_host_win_lookup(VFIOContainer *container,
> +                                               hwaddr min_iova, hwaddr max_iova)
> +{
> +    VFIOHostDMAWindow *hostwin;
> +
> +    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
> +        if (hostwin->min_iova <= min_iova && max_iova <= hostwin->max_iova) {
> +            return hostwin;
> +        }
> +    }
> +
> +    return NULL;
> +}
> +
> +static int vfio_host_win_add(VFIOContainer *container,
> +                             hwaddr min_iova, hwaddr max_iova,
> +                             uint64_t iova_pgsizes)
> +{
> +    VFIOHostDMAWindow *hostwin;
> +
> +    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
> +        if (ranges_overlap(min_iova, max_iova - min_iova + 1,
> +                           hostwin->min_iova,
> +                           hostwin->max_iova - hostwin->min_iova + 1)) {

Why does vfio_host_win_lookup() not also use ranges_overlap()?  In
fact, why don't we call vfio_host_win_lookup here to find the conflict?

> +            error_report("%s: Overlapped IOMMU are not enabled", __func__);
> +            return -1;

Nobody here tests the return value, shouldn't this be fatal?

> +        }
> +    }
> +
> +    hostwin = g_malloc0(sizeof(*hostwin));
> +
> +    hostwin->min_iova = min_iova;
> +    hostwin->max_iova = max_iova;
> +    hostwin->iova_pgsizes = iova_pgsizes;
> +    QLIST_INSERT_HEAD(&container->hostwin_list, hostwin, hostwin_next);
> +
> +    return 0;
> +}
> +
>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>  {
>      return (!memory_region_is_ram(section->mr) &&
> @@ -352,7 +392,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
>      }
>      end = int128_get64(int128_sub(llend, int128_one()));
>  
> -    if ((iova < container->min_iova) || (end > container->max_iova)) {
> +    if (!vfio_host_win_lookup(container, iova, end)) {
>          error_report("vfio: IOMMU container %p can't map guest IOVA region"
>                       " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
>                       container, iova, end);
> @@ -367,10 +407,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
>  
>          trace_vfio_listener_region_add_iommu(iova, end);
>          /*
> -         * FIXME: We should do some checking to see if the
> -         * capabilities of the host VFIO IOMMU are adequate to model
> -         * the guest IOMMU
> -         *
>           * FIXME: For VFIO iommu types which have KVM acceleration to
>           * avoid bouncing all map/unmaps through qemu this way, this
>           * would be the right place to wire that up (tell the KVM
> @@ -826,16 +862,14 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>           * existing Type1 IOMMUs generally support any IOVA we're
>           * going to actually try in practice.
>           */
> -        container->min_iova = 0;
> -        container->max_iova = (hwaddr)-1;
> -
> -        /* Assume just 4K IOVA page size */
> -        container->iova_pgsizes = 0x1000;
>          info.argsz = sizeof(info);
>          ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
>          /* Ignore errors */
>          if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {

if (ret || !(info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
    /* Assume 4k IOVA page size */
    info.iova_pgsizes = 4096;
}

vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes);

> -            container->iova_pgsizes = info.iova_pgsizes;
> +            vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes);
> +        } else {
> +            /* Assume just 4K IOVA page size */
> +            vfio_host_win_add(container, 0, (hwaddr)-1, 0x1000);
>          }
>      } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
>                 ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
> @@ -892,11 +926,12 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              ret = -errno;
>              goto listener_release_exit;
>          }
> -        container->min_iova = info.dma32_window_start;
> -        container->max_iova = container->min_iova + info.dma32_window_size - 1;
>  
> -        /* Assume just 4K IOVA pages for now */
> -        container->iova_pgsizes = 0x1000;
> +        /* The default table uses 4K pages */
> +        vfio_host_win_add(container, info.dma32_window_start,
> +                          info.dma32_window_start +
> +                          info.dma32_window_size - 1,
> +                          0x1000);
>      } else {
>          error_report("vfio: No available IOMMU models");
>          ret = -EINVAL;
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index c72e45a..808263b 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -82,9 +82,8 @@ typedef struct VFIOContainer {
>       * contiguous IOVA window.  We may need to generalize that in
>       * future
>       */
> -    hwaddr min_iova, max_iova;
> -    uint64_t iova_pgsizes;
>      QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
> +    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>      QLIST_HEAD(, VFIOGroup) group_list;
>      QLIST_ENTRY(VFIOContainer) next;
>  } VFIOContainer;
> @@ -97,6 +96,12 @@ typedef struct VFIOGuestIOMMU {
>      QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>  } VFIOGuestIOMMU;
>  
> +typedef struct VFIOHostDMAWindow {
> +    hwaddr min_iova, max_iova;
> +    uint64_t iova_pgsizes;
> +    QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next;
> +} VFIOHostDMAWindow;
> +
>  typedef struct VFIODeviceOps VFIODeviceOps;
>  
>  typedef struct VFIODevice {

* Re: [Qemu-devel] [PATCH qemu v16 17/19] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 17/19] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO Alexey Kardashevskiy
@ 2016-05-13 22:26   ` Alex Williamson
  2016-05-16  8:35     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 69+ messages in thread
From: Alex Williamson @ 2016-05-13 22:26 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson, Paolo Bonzini

On Wed,  4 May 2016 16:52:29 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> The sPAPR TCE tables manage 2 copies when VFIO is using an IOMMU -
> a guest view of the table and a hardware TCE table. If there is no VFIO
> presence in the address space, then just the guest view is used; in
> this case, it is allocated in KVM. However, since there is no
> support yet for VFIO in KVM TCE hypercalls, when we start using VFIO
> we need to move the guest view from KVM to userspace, and we need
> to do this for every IOMMU on a bus with VFIO devices.
> 
> This adds vfio_start/vfio_stop callbacks in MemoryRegionIOMMUOps to
> notify IOMMU about changing environment so it can reallocate the table
> to/from KVM or (when available) hook the IOMMU groups with the logical
> bus (LIOBN) in the KVM.
> 
> This removes explicit spapr_tce_set_need_vfio() call from PCI hotplug
> path as the new callbacks do this better - they notify IOMMU at
> the exact moment when the configuration is changed, and this also
> includes the case of PCI hot unplug.
> 
> This postpones vfio_stop() till the end of region_del() as
> vfio_dma_unmap() has to execute before VFIO support is disabled.
> 
> As there can be multiple containers attached to the same PHB/LIOBN,
> this adds a wrapper with a use counter for every IOMMU MR and
> stores them in a list in the VFIOAddressSpace.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v16:
> * added a use counter in VFIOAddressSpace->VFIOIOMMUMR
> 
> v15:
> * s/need_vfio/vfio-Users/g
> ---
>  hw/ppc/spapr_iommu.c          | 12 ++++++++++++
>  hw/ppc/spapr_pci.c            |  6 ------
>  hw/vfio/common.c              | 45 ++++++++++++++++++++++++++++++++++++++++++-
>  include/exec/memory.h         |  4 ++++
>  include/hw/vfio/vfio-common.h |  7 +++++++
>  5 files changed, 67 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index c945dba..7af2700 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -155,6 +155,16 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
>      return 1ULL << tcet->page_shift;
>  }
>  
> +static void spapr_tce_vfio_start(MemoryRegion *iommu)
> +{
> +    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), true);
> +}
> +
> +static void spapr_tce_vfio_stop(MemoryRegion *iommu)
> +{
> +    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), false);
> +}
> +
>  static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
>  static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
>  
> @@ -239,6 +249,8 @@ static const VMStateDescription vmstate_spapr_tce_table = {
>  static MemoryRegionIOMMUOps spapr_iommu_ops = {
>      .translate = spapr_tce_translate_iommu,
>      .get_page_sizes = spapr_tce_get_page_sizes,
> +    .vfio_start = spapr_tce_vfio_start,
> +    .vfio_stop = spapr_tce_vfio_stop,
>  };
>  
>  static int spapr_tce_table_realize(DeviceState *dev)
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 5b9ccff..51e7d56 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -1086,12 +1086,6 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>      void *fdt = NULL;
>      int fdt_start_offset = 0, fdt_size;
>  
> -    if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
> -        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
> -
> -        spapr_tce_set_need_vfio(tcet, true);
> -    }
> -
>      if (dev->hotplugged) {
>          fdt = create_device_tree(&fdt_size);
>          fdt_start_offset = spapr_create_pci_child_dt(phb, pdev, fdt, 0);
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 3f2fb23..03daf88 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -421,6 +421,26 @@ static void vfio_listener_region_add(MemoryListener *listener,
>          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>  
>          memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> +
> +        if (section->mr->iommu_ops && section->mr->iommu_ops->vfio_start) {
> +            VFIOIOMMUMR *iommumr;
> +            bool found = false;
> +
> +            QLIST_FOREACH(iommumr, &container->space->iommumrs, iommumr_next) {
> +                if (iommumr->iommu == section->mr) {
> +                    found = true;
> +                    break;
> +                }
> +            }
> +            if (!found) {
> +                iommumr = g_malloc0(sizeof(*iommumr));
> +                iommumr->iommu = section->mr;
> +                section->mr->iommu_ops->vfio_start(section->mr);
> +                QLIST_INSERT_HEAD(&container->space->iommumrs, iommumr,
> +                                  iommumr_next);
> +            }
> +            ++iommumr->users;
> +        }
>          memory_region_iommu_replay(giommu->iommu, &giommu->n,
>                                     false);
>  
> @@ -470,6 +490,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
>      hwaddr iova, end;
>      Int128 llend, llsize;
>      int ret;
> +    MemoryRegion *iommu = NULL;
>  
>      if (vfio_listener_skipped_section(section)) {
>          trace_vfio_listener_region_del_skip(
> @@ -490,13 +511,30 @@ static void vfio_listener_region_del(MemoryListener *listener,
>  
>          QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
>              if (giommu->iommu == section->mr) {
> +                VFIOIOMMUMR *iommumr;
> +
>                  memory_region_unregister_iommu_notifier(&giommu->n);
> +
> +                QLIST_FOREACH(iommumr, &container->space->iommumrs,
> +                              iommumr_next) {
> +                    if (iommumr->iommu != section->mr) {
> +                        continue;
> +                    }
> +                    --iommumr->users;
> +                    if (iommumr->users) {
> +                        break;
> +                    }
> +                    QLIST_REMOVE(iommumr, iommumr_next);
> +                    g_free(iommumr);
> +                    iommu = giommu->iommu;
> +                    break;
> +                }
> +
>                  QLIST_REMOVE(giommu, giommu_next);
>                  g_free(giommu);
>                  break;
>              }
>          }
> -
>          /*
>           * FIXME: We assume the one big unmap below is adequate to
>           * remove any individual page mappings in the IOMMU which
> @@ -527,6 +565,10 @@ static void vfio_listener_region_del(MemoryListener *listener,
>                       "0x%"HWADDR_PRIx") = %d (%m)",
>                       container, iova, int128_get64(llsize), ret);
>      }
> +
> +    if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
> +        iommu->iommu_ops->vfio_stop(section->mr);
> +    }
>  }
>  
>  static const MemoryListener vfio_memory_listener = {
> @@ -787,6 +829,7 @@ static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
>      space = g_malloc0(sizeof(*space));
>      space->as = as;
>      QLIST_INIT(&space->containers);
> +    QLIST_INIT(&space->iommumrs);
>  
>      QLIST_INSERT_HEAD(&vfio_address_spaces, space, list);
>  
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index a3a1703..52d2c70 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -151,6 +151,10 @@ struct MemoryRegionIOMMUOps {
>      IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
>      /* Returns supported page sizes */
>      uint64_t (*get_page_sizes)(MemoryRegion *iommu);
> +    /* Called when VFIO starts using this */
> +    void (*vfio_start)(MemoryRegion *iommu);
> +    /* Called when VFIO stops using this */
> +    void (*vfio_stop)(MemoryRegion *iommu);

Really?  Just no.  Generic MemoryRegionIOMMUOps should have no
visibility of vfio and certainly not vfio specific callbacks.  I don't
really understand what guest view versus KVM view is doing here, but
it's clearly something to do with visibility versus acceleration of the
IOMMU tables and the callbacks, if they're even generic at all, should
be reflecting that, not some vague relation to vfio.

>  };
>  
>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 808263b..a9e6e33 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -64,9 +64,16 @@ typedef struct VFIORegion {
>  typedef struct VFIOAddressSpace {
>      AddressSpace *as;
>      QLIST_HEAD(, VFIOContainer) containers;
> +    QLIST_HEAD(, VFIOIOMMUMR) iommumrs;
>      QLIST_ENTRY(VFIOAddressSpace) list;
>  } VFIOAddressSpace;
>  
> +typedef struct VFIOIOMMUMR {
> +    MemoryRegion *iommu;
> +    int users;
> +    QLIST_ENTRY(VFIOIOMMUMR) iommumr_next;
> +} VFIOIOMMUMR;
> +
>  struct VFIOGroup;
>  
>  typedef struct VFIOContainer {

* Re: [Qemu-devel] [PATCH qemu v16 18/19] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 18/19] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2) Alexey Kardashevskiy
@ 2016-05-13 22:26   ` Alex Williamson
  2016-05-16  4:52     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 69+ messages in thread
From: Alex Williamson @ 2016-05-13 22:26 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson, Paolo Bonzini

On Wed,  4 May 2016 16:52:30 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> The new VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
> This adds the ability for VFIO common code to dynamically allocate/remove
> DMA windows in the host kernel when a new VFIO container is added/removed.
> 
> This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
> and adds just created IOMMU into the host IOMMU list; the opposite
> action is taken in vfio_listener_region_del.
> 
> When creating a new window, this uses a heuristic to decide on the number
> of TCE table levels.
> 
> This should cause no guest visible change in behavior.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v16:
> * used memory_region_iommu_get_page_sizes() in vfio_listener_region_add()
> * enforced no intersections between windows
> 
> v14:
> * new to the series
> ---
>  hw/vfio/common.c | 133 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
>  trace-events     |   2 +
>  2 files changed, 125 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 03daf88..bd2dee8 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -240,6 +240,18 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>      return -errno;
>  }
>  
> +static bool range_contains(hwaddr start, hwaddr end, hwaddr addr)
> +{
> +    return start <= addr && addr <= end;
> +}

a) If you want a "range_foo" function then put it in range.h
b) I suspect there are already range.h functions that can do this.

> +
> +static bool vfio_host_win_intersects(VFIOHostDMAWindow *hostwin,
> +                                     hwaddr min_iova, hwaddr max_iova)
> +{
> +    return range_contains(hostwin->min_iova, hostwin->min_iova, min_iova) ||
> +        range_contains(min_iova, max_iova, hostwin->min_iova);
> +}

How is this different than ranges_overlap()?

> +
>  static VFIOHostDMAWindow *vfio_host_win_lookup(VFIOContainer *container,
>                                                 hwaddr min_iova, hwaddr max_iova)
>  {
> @@ -279,6 +291,14 @@ static int vfio_host_win_add(VFIOContainer *container,
>      return 0;
>  }
>  
> +static void vfio_host_win_del(VFIOContainer *container, hwaddr min_iova)
> +{
> +    VFIOHostDMAWindow *hostwin = vfio_host_win_lookup(container, min_iova, 1);
> +
> +    g_assert(hostwin);

Handle the error please.

> +    QLIST_REMOVE(hostwin, hostwin_next);
> +}
> +
>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>  {
>      return (!memory_region_is_ram(section->mr) &&
> @@ -392,6 +412,69 @@ static void vfio_listener_region_add(MemoryListener *listener,
>      }
>      end = int128_get64(int128_sub(llend, int128_one()));
>  
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +        VFIOHostDMAWindow *hostwin;
> +        unsigned pagesizes = memory_region_iommu_get_page_sizes(section->mr);
> +        unsigned pagesize = (hwaddr)1 << ctz64(pagesizes);
> +        unsigned entries, pages;
> +        struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
> +
> +        trace_vfio_listener_region_add_iommu(iova, end);
> +        /*
> +         * FIXME: For VFIO iommu types which have KVM acceleration to
> +         * avoid bouncing all map/unmaps through qemu this way, this
> +         * would be the right place to wire that up (tell the KVM
> +         * device emulation the VFIO iommu handles to use).
> +         */
> +        create.window_size = int128_get64(section->size);
> +        create.page_shift = ctz64(pagesize);
> +        /*
> +         * SPAPR host supports multilevel TCE tables, there is some
> +         * heuristic to decide how many levels we want for our table:
> +         * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
> +         */
> +        entries = create.window_size >> create.page_shift;
> +        pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
> +        pages = MAX(pow2ceil(pages) - 1, 1); /* Round up */
> +        create.levels = ctz64(pages) / 6 + 1;
> +
> +        /* For now intersections are not allowed, we may relax this later */
> +        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
> +            if (vfio_host_win_intersects(hostwin,
> +                    section->offset_within_address_space,
> +                    section->offset_within_address_space +
> +                    create.window_size - 1)) {
> +                goto fail;
> +            }
> +        }
> +
> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> +        if (ret) {
> +            error_report("Failed to create a window, ret = %d (%m)", ret);
> +            goto fail;
> +        }
> +
> +        if (create.start_addr != section->offset_within_address_space) {
> +            struct vfio_iommu_spapr_tce_remove remove = {
> +                .argsz = sizeof(remove),
> +                .start_addr = create.start_addr
> +            };
> +            error_report("Host doesn't support DMA window at %"HWADDR_PRIx", must be %"PRIx64,
> +                         section->offset_within_address_space,
> +                         create.start_addr);
> +            ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> +            ret = -EINVAL;
> +            goto fail;
> +        }
> +        trace_vfio_spapr_create_window(create.page_shift,
> +                                       create.window_size,
> +                                       create.start_addr);
> +
> +        vfio_host_win_add(container, create.start_addr,
> +                          create.start_addr + create.window_size - 1,
> +                          1ULL << create.page_shift);
> +    }

This is a function on its own, split it out. And why not stop pretending
prereg is some sort of generic interface; let's just make a spapr
support file.

> +
>      if (!vfio_host_win_lookup(container, iova, end)) {
>          error_report("vfio: IOMMU container %p can't map guest IOVA region"
>                       " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
> @@ -566,6 +649,22 @@ static void vfio_listener_region_del(MemoryListener *listener,
>                       container, iova, int128_get64(llsize), ret);
>      }
>  
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +        struct vfio_iommu_spapr_tce_remove remove = {
> +            .argsz = sizeof(remove),
> +            .start_addr = section->offset_within_address_space,
> +        };
> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> +        if (ret) {
> +            error_report("Failed to remove window at %"PRIx64,
> +                         remove.start_addr);
> +        }
> +
> +        vfio_host_win_del(container, section->offset_within_address_space);
> +
> +        trace_vfio_spapr_remove_window(remove.start_addr);
> +    }

This would be in that spapr file too.

> +
>      if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
>          iommu->iommu_ops->vfio_stop(section->mr);
>      }
> @@ -957,11 +1056,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              }
>          }
>  
> -        /*
> -         * This only considers the host IOMMU's 32-bit window.  At
> -         * some point we need to add support for the optional 64-bit
> -         * window and dynamic windows
> -         */
>          info.argsz = sizeof(info);
>          ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
>          if (ret) {
> @@ -970,11 +1064,30 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              goto listener_release_exit;
>          }
>  
> -        /* The default table uses 4K pages */
> -        vfio_host_win_add(container, info.dma32_window_start,
> -                          info.dma32_window_start +
> -                          info.dma32_window_size - 1,
> -                          0x1000);
> +        if (v2) {
> +            /*
> +             * There is a default window in just created container.
> +             * To make region_add/del simpler, we better remove this
> +             * window now and let those iommu_listener callbacks
> +             * create/remove them when needed.
> +             */
> +            struct vfio_iommu_spapr_tce_remove remove = {
> +                .argsz = sizeof(remove),
> +                .start_addr = info.dma32_window_start,
> +            };
> +            ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> +            if (ret) {
> +                error_report("vfio: VFIO_IOMMU_SPAPR_TCE_REMOVE failed: %m");
> +                ret = -errno;
> +                goto free_container_exit;
> +            }
> +        } else {
> +            /* The default table uses 4K pages */
> +            vfio_host_win_add(container, info.dma32_window_start,
> +                              info.dma32_window_start +
> +                              info.dma32_window_size - 1,
> +                              0x1000);
> +        }
>      } else {
>          error_report("vfio: No available IOMMU models");
>          ret = -EINVAL;
> diff --git a/trace-events b/trace-events
> index d0d8615..b5419de 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1739,6 +1739,8 @@ vfio_region_finalize(const char *name, int index) "Device %s, region %d"
>  vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
>  vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>  vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> +vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
> +vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
>  
>  # hw/vfio/platform.c
>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"

* Re: [Qemu-devel] [PATCH qemu v16 14/19] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2)
  2016-05-13 22:25   ` Alex Williamson
@ 2016-05-16  1:10     ` Alexey Kardashevskiy
  2016-05-16 20:20       ` Alex Williamson
  0 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-16  1:10 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson, Paolo Bonzini

On 05/14/2016 08:25 AM, Alex Williamson wrote:
> On Wed,  4 May 2016 16:52:26 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
>> This makes use of the new "memory registering" feature. The idea is
>> to give userspace the ability to notify the host kernel about pages
>> which are going to be used for DMA. Having this information, the host
>> kernel can pin them all once per user process, do locked pages
>> accounting (once) and not spend time doing that at runtime, with
>> possible failures which cannot be handled nicely in some cases.
>>
>> This adds a prereg memory listener which listens on address_space_memory
>> and notifies a VFIO container about memory which needs to be
>> pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
>>
>> As there is no per-IOMMU-type release() callback anymore, this stores
>> the IOMMU type in the container so vfio_listener_release() can determine
>> if it needs to unregister @prereg_listener.
>>
>> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
>> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
>> not call it when v2 is detected and enabled.
>>
>> This requires guest RAM blocks to be host page size aligned; however
>> this is not new as KVM already requires memory slots to be host page
>> size aligned.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v16:
>> * switched to 64bit math everywhere as there is no chance to see
>> region_add on RAM blocks even remotely close to 1<<64bytes.
>>
>> v15:
>> * banned unaligned sections
>> * added an vfio_prereg_gpa_to_ua() helper
>>
>> v14:
>> * s/free_container_exit/listener_release_exit/g
>> * added "if memory_region_is_iommu()" to vfio_prereg_listener_skipped_section
>> ---
>>  hw/vfio/Makefile.objs         |   1 +
>>  hw/vfio/common.c              |  38 +++++++++---
>>  hw/vfio/prereg.c              | 137 ++++++++++++++++++++++++++++++++++++++++++
>>  include/hw/vfio/vfio-common.h |   4 ++
>>  trace-events                  |   2 +
>>  5 files changed, 172 insertions(+), 10 deletions(-)
>>  create mode 100644 hw/vfio/prereg.c
>>
>> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
>> index ceddbb8..5800e0e 100644
>> --- a/hw/vfio/Makefile.objs
>> +++ b/hw/vfio/Makefile.objs
>> @@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
>>  obj-$(CONFIG_SOFTMMU) += platform.o
>>  obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
>>  obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
>> +obj-$(CONFIG_SOFTMMU) += prereg.o
>>  endif
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 2050040..496eb82 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -501,6 +501,9 @@ static const MemoryListener vfio_memory_listener = {
>>  static void vfio_listener_release(VFIOContainer *container)
>>  {
>>      memory_listener_unregister(&container->listener);
>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>> +        memory_listener_unregister(&container->prereg_listener);
>> +    }
>>  }
>>
>>  int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
>> @@ -808,8 +811,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>              goto free_container_exit;
>>          }
>>
>> -        ret = ioctl(fd, VFIO_SET_IOMMU,
>> -                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
>> +        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
>> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>>          if (ret) {
>>              error_report("vfio: failed to set iommu for container: %m");
>>              ret = -errno;
>> @@ -834,8 +837,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>          if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
>>              container->iova_pgsizes = info.iova_pgsizes;
>>          }
>> -    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
>> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
>> +               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
>>          struct vfio_iommu_spapr_tce_info info;
>> +        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
>>
>>          ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>>          if (ret) {
>> @@ -843,7 +848,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>              ret = -errno;
>>              goto free_container_exit;
>>          }
>> -        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
>> +        container->iommu_type =
>> +            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
>> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>>          if (ret) {
>>              error_report("vfio: failed to set iommu for container: %m");
>>              ret = -errno;
>> @@ -855,11 +862,22 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>           * when container fd is closed so we do not call it explicitly
>>           * in this file.
>>           */
>> -        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
>> -        if (ret) {
>> -            error_report("vfio: failed to enable container: %m");
>> -            ret = -errno;
>> -            goto free_container_exit;
>> +        if (!v2) {
>> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
>> +            if (ret) {
>> +                error_report("vfio: failed to enable container: %m");
>> +                ret = -errno;
>> +                goto free_container_exit;
>> +            }
>> +        } else {
>> +            container->prereg_listener = vfio_prereg_listener;
>> +
>> +            memory_listener_register(&container->prereg_listener,
>> +                                     &address_space_memory);
>> +            if (container->error) {
>> +                error_report("vfio: RAM memory listener initialization failed for container");
>> +                goto listener_release_exit;
>> +            }
>>          }
>>
>>          /*
>> @@ -872,7 +890,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>          if (ret) {
>>              error_report("vfio: VFIO_IOMMU_SPAPR_TCE_GET_INFO failed: %m");
>>              ret = -errno;
>> -            goto free_container_exit;
>> +            goto listener_release_exit;
>>          }
>>          container->min_iova = info.dma32_window_start;
>>          container->max_iova = container->min_iova + info.dma32_window_size - 1;
>> diff --git a/hw/vfio/prereg.c b/hw/vfio/prereg.c
>> new file mode 100644
>> index 0000000..d0e4728
>> --- /dev/null
>> +++ b/hw/vfio/prereg.c
>> @@ -0,0 +1,137 @@
>> +/*
>> + * DMA memory preregistration
>> + *
>> + * Authors:
>> + *  Alexey Kardashevskiy <aik@ozlabs.ru>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.  See
>> + * the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include <sys/ioctl.h>
>> +#include <linux/vfio.h>
>> +
>> +#include "hw/vfio/vfio-common.h"
>> +#include "hw/hw.h"
>> +#include "qemu/error-report.h"
>> +#include "trace.h"
>> +
>> +static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
>> +{
>> +    if (memory_region_is_iommu(section->mr)) {
>> +        error_report("Cannot possibly preregister IOMMU memory");
>
> What is a user supposed to do with this error_report()?  Is it
> continue-able?  How is it possible?  What should they do differently?


If I remember correctly, David had theories about how this could become 
possible, though not with the existing code today. It could be assert() or 
abort() - which is better here?


>> +        return true;
>> +    }
>> +
>> +    return !memory_region_is_ram(section->mr) ||
>> +            memory_region_is_skip_dump(section->mr);
>> +}
>> +
>> +static void *vfio_prereg_gpa_to_ua(MemoryRegionSection *section, hwaddr gpa)
>
> What's "ua"?


Userspace address.


>
>> +{
>> +    return memory_region_get_ram_ptr(section->mr) +
>> +        section->offset_within_region +
>> +        (gpa - section->offset_within_address_space);
>> +}
>> +
>> +static void vfio_prereg_listener_region_add(MemoryListener *listener,
>> +                                            MemoryRegionSection *section)
>> +{
>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>> +                                            prereg_listener);
>> +    const hwaddr gpa = section->offset_within_address_space;
>> +    hwaddr end;
>> +    int ret;
>> +    hwaddr page_mask = qemu_real_host_page_mask;
>> +    struct vfio_iommu_spapr_register_memory reg = {
>> +        .argsz = sizeof(reg),
>> +        .flags = 0,
>> +    };
>
> So we're just pretending that this spapr specific code is some sort of
> generic pre-registration interface?

Yes.



>> +
>> +    if (vfio_prereg_listener_skipped_section(section)) {
>> +        trace_vfio_listener_region_add_skip(
>> +                section->offset_within_address_space,
>> +                section->offset_within_address_space +
>> +                int128_get64(int128_sub(section->size, int128_one())));
>> +        return;
>> +    }
>> +
>> +    if (unlikely((section->offset_within_address_space & ~page_mask) ||
>> +                 (section->offset_within_region & ~page_mask) ||
>> +                 (int128_get64(section->size) & ~page_mask))) {
>> +        error_report("%s received unaligned region", __func__);
>> +        return;
>> +    }
>> +
>> +    end = section->offset_within_address_space + int128_get64(section->size);
>> +    g_assert(gpa < end);
>> +
>> +    memory_region_ref(section->mr);
>> +
>> +    reg.vaddr = (__u64) vfio_prereg_gpa_to_ua(section, gpa);
>
> Hmm, why wasn't that simply gpa_to_vaddr?

I wanted to keep the prefix in all functions, even static ones, as it makes 
them easier to grep for. Bad idea?


>> +    reg.size = end - gpa;
>> +
>> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
>> +    trace_vfio_ram_register(reg.vaddr, reg.size, ret ? -errno : 0);
>> +    if (ret) {
>> +        /*
>> +         * On the initfn path, store the first error in the container so we
>> +         * can gracefully fail.  Runtime, there's not much we can do other
>> +         * than throw a hardware error.
>> +         */
>> +        if (!container->initialized) {
>> +            if (!container->error) {
>> +                container->error = ret;
>> +            }
>> +        } else {
>> +            hw_error("vfio: Memory registering failed, unable to continue");
>> +        }
>> +    }
>> +}
>> +
>> +static void vfio_prereg_listener_region_del(MemoryListener *listener,
>> +                                            MemoryRegionSection *section)
>> +{
>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>> +                                            prereg_listener);
>> +    const hwaddr gpa = section->offset_within_address_space;
>> +    hwaddr end;
>> +    int ret;
>> +    hwaddr page_mask = qemu_real_host_page_mask;
>> +    struct vfio_iommu_spapr_register_memory reg = {
>> +        .argsz = sizeof(reg),
>> +        .flags = 0,
>> +    };
>> +
>> +    if (vfio_prereg_listener_skipped_section(section)) {
>> +        trace_vfio_listener_region_del_skip(
>> +                section->offset_within_address_space,
>> +                section->offset_within_address_space +
>> +                int128_get64(int128_sub(section->size, int128_one())));
>> +        return;
>> +    }
>> +
>> +    if (unlikely((section->offset_within_address_space & ~page_mask) ||
>> +                 (section->offset_within_region & ~page_mask) ||
>> +                 (int128_get64(section->size) & ~page_mask))) {
>> +        error_report("%s received unaligned region", __func__);
>> +        return;
>> +    }
>> +
>> +    end = section->offset_within_address_space + int128_get64(section->size);
>> +    if (gpa >= end) {
>> +        return;
>> +    }
>> +
>> +    reg.vaddr = (__u64) vfio_prereg_gpa_to_ua(section, gpa);
>> +    reg.size = end - gpa;
>> +
>> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
>> +    trace_vfio_ram_unregister(reg.vaddr, reg.size, ret ? -errno : 0);
>> +}
>> +
>> +const MemoryListener vfio_prereg_listener = {
>> +    .region_add = vfio_prereg_listener_region_add,
>> +    .region_del = vfio_prereg_listener_region_del,
>> +};
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index c9b6622..c72e45a 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -73,6 +73,8 @@ typedef struct VFIOContainer {
>>      VFIOAddressSpace *space;
>>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>>      MemoryListener listener;
>> +    MemoryListener prereg_listener;
>> +    unsigned iommu_type;
>>      int error;
>>      bool initialized;
>>      /*
>> @@ -156,4 +158,6 @@ extern QLIST_HEAD(vfio_as_head, VFIOAddressSpace) vfio_address_spaces;
>>  int vfio_get_region_info(VFIODevice *vbasedev, int index,
>>                           struct vfio_region_info **info);
>>  #endif
>> +extern const MemoryListener vfio_prereg_listener;
>> +
>>  #endif /* !HW_VFIO_VFIO_COMMON_H */
>> diff --git a/trace-events b/trace-events
>> index dd50005..d0d8615 100644
>> --- a/trace-events
>> +++ b/trace-events
>> @@ -1737,6 +1737,8 @@ vfio_region_mmap(const char *name, unsigned long offset, unsigned long end) "Reg
>>  vfio_region_exit(const char *name, int index) "Device %s, region %d"
>>  vfio_region_finalize(const char *name, int index) "Device %s, region %d"
>>  vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
>> +vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>> +vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>>
>>  # hw/vfio/platform.c
>>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"


-- 
Alexey

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v16 18/19] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)
  2016-05-13 22:26   ` Alex Williamson
@ 2016-05-16  4:52     ` Alexey Kardashevskiy
  2016-05-16 20:20       ` Alex Williamson
  0 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-16  4:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson, Paolo Bonzini

On 05/14/2016 08:26 AM, Alex Williamson wrote:
> On Wed,  4 May 2016 16:52:30 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
>> New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
>> This adds the ability for VFIO common code to dynamically allocate/remove
>> DMA windows in the host kernel when a new VFIO container is added/removed.
>>
>> This adds a VFIO_IOMMU_SPAPR_TCE_CREATE ioctl call to vfio_listener_region_add
>> and adds the just-created IOMMU window to the host IOMMU list; the opposite
>> action is taken in vfio_listener_region_del.
>>
>> When creating a new window, this uses a heuristic to decide on the number
>> of TCE table levels.
>>
>> This should cause no guest visible change in behavior.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v16:
>> * used memory_region_iommu_get_page_sizes() in vfio_listener_region_add()
>> * enforced no intersections between windows
>>
>> v14:
>> * new to the series
>> ---
>>  hw/vfio/common.c | 133 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
>>  trace-events     |   2 +
>>  2 files changed, 125 insertions(+), 10 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 03daf88..bd2dee8 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -240,6 +240,18 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>      return -errno;
>>  }
>>
>> +static bool range_contains(hwaddr start, hwaddr end, hwaddr addr)
>> +{
>> +    return start <= addr && addr <= end;
>> +}
>
> a) If you want a "range_foo" function then put it in range.h
> b) I suspect there are already range.h functions that can do this.
>
>> +
>> +static bool vfio_host_win_intersects(VFIOHostDMAWindow *hostwin,
>> +                                     hwaddr min_iova, hwaddr max_iova)
>> +{
>> +    return range_contains(hostwin->min_iova, hostwin->min_iova, min_iova) ||
>> +        range_contains(min_iova, max_iova, hostwin->min_iova);
>> +}
>
> How is this different than ranges_overlap()?
>> +
>>  static VFIOHostDMAWindow *vfio_host_win_lookup(VFIOContainer *container,
>>                                                 hwaddr min_iova, hwaddr max_iova)
>>  {
>> @@ -279,6 +291,14 @@ static int vfio_host_win_add(VFIOContainer *container,
>>      return 0;
>>  }
>>
>> +static void vfio_host_win_del(VFIOContainer *container, hwaddr min_iova)
>> +{
>> +    VFIOHostDMAWindow *hostwin = vfio_host_win_lookup(container, min_iova, 1);
>> +
>> +    g_assert(hostwin);
>
> Handle the error please.

Will this be enough?

     if (!hostwin) {
         error_report("%s: Cannot delete missing window at %"HWADDR_PRIx,
                      __func__, min_iova);
         return;
     }


>
>> +    QLIST_REMOVE(hostwin, hostwin_next);
>> +}
>> +
>>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>>  {
>>      return (!memory_region_is_ram(section->mr) &&
>> @@ -392,6 +412,69 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>      }
>>      end = int128_get64(int128_sub(llend, int128_one()));
>>
>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>> +        VFIOHostDMAWindow *hostwin;
>> +        unsigned pagesizes = memory_region_iommu_get_page_sizes(section->mr);
>> +        unsigned pagesize = (hwaddr)1 << ctz64(pagesizes);
>> +        unsigned entries, pages;
>> +        struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
>> +
>> +        trace_vfio_listener_region_add_iommu(iova, end);
>> +        /*
>> +         * FIXME: For VFIO iommu types which have KVM acceleration to
>> +         * avoid bouncing all map/unmaps through qemu this way, this
>> +         * would be the right place to wire that up (tell the KVM
>> +         * device emulation the VFIO iommu handles to use).
>> +         */
>> +        create.window_size = int128_get64(section->size);
>> +        create.page_shift = ctz64(pagesize);
>> +        /*
>> +         * SPAPR host supports multilevel TCE tables, there is some
>> +         * heuristic to decide how many levels we want for our table:
>> +         * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
>> +         */
>> +        entries = create.window_size >> create.page_shift;
>> +        pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
>> +        pages = MAX(pow2ceil(pages) - 1, 1); /* Round up */
>> +        create.levels = ctz64(pages) / 6 + 1;
>> +
>> +        /* For now intersections are not allowed, we may relax this later */
>> +        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
>> +            if (vfio_host_win_intersects(hostwin,
>> +                    section->offset_within_address_space,
>> +                    section->offset_within_address_space +
>> +                    create.window_size - 1)) {
>> +                goto fail;
>> +            }
>> +        }
>> +
>> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>> +        if (ret) {
>> +            error_report("Failed to create a window, ret = %d (%m)", ret);
>> +            goto fail;
>> +        }
>> +
>> +        if (create.start_addr != section->offset_within_address_space) {
>> +            struct vfio_iommu_spapr_tce_remove remove = {
>> +                .argsz = sizeof(remove),
>> +                .start_addr = create.start_addr
>> +            };
>> +            error_report("Host doesn't support DMA window at %"HWADDR_PRIx", must be %"PRIx64,
>> +                         section->offset_within_address_space,
>> +                         create.start_addr);
>> +            ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
>> +            ret = -EINVAL;
>> +            goto fail;
>> +        }
>> +        trace_vfio_spapr_create_window(create.page_shift,
>> +                                       create.window_size,
>> +                                       create.start_addr);
>> +
>> +        vfio_host_win_add(container, create.start_addr,
>> +                          create.start_addr + create.window_size - 1,
>> +                          1ULL << create.page_shift);
>> +    }
>
> This is a function on its own, split it out and why not stop pretending
> prereg is some sort of generic interface and let's just make a spapr
> support file.


Yet another new file - spapr.c, or rename prereg.c to spapr.c and add this 
stuff there?

Also, what bits? I'd keep the VFIOHostDMAWindow business in here and move 
vfio_iommu_spapr_tce_create and the ioctl to the new place - will this be 
ok? Thanks.

>
>> +
>>      if (!vfio_host_win_lookup(container, iova, end)) {
>>          error_report("vfio: IOMMU container %p can't map guest IOVA region"
>>                       " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
>> @@ -566,6 +649,22 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>                       container, iova, int128_get64(llsize), ret);
>>      }
>>
>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>> +        struct vfio_iommu_spapr_tce_remove remove = {
>> +            .argsz = sizeof(remove),
>> +            .start_addr = section->offset_within_address_space,
>> +        };
>> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
>> +        if (ret) {
>> +            error_report("Failed to remove window at %"PRIx64,
>> +                         remove.start_addr);
>> +        }
>> +
>> +        vfio_host_win_del(container, section->offset_within_address_space);
>> +
>> +        trace_vfio_spapr_remove_window(remove.start_addr);
>> +    }
>
> This would be in that spapr file too.
>
>> +
>>      if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
>>          iommu->iommu_ops->vfio_stop(section->mr);
>>      }
>> @@ -957,11 +1056,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>              }
>>          }
>>
>> -        /*
>> -         * This only considers the host IOMMU's 32-bit window.  At
>> -         * some point we need to add support for the optional 64-bit
>> -         * window and dynamic windows
>> -         */
>>          info.argsz = sizeof(info);
>>          ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
>>          if (ret) {
>> @@ -970,11 +1064,30 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>              goto listener_release_exit;
>>          }
>>
>> -        /* The default table uses 4K pages */
>> -        vfio_host_win_add(container, info.dma32_window_start,
>> -                          info.dma32_window_start +
>> -                          info.dma32_window_size - 1,
>> -                          0x1000);
>> +        if (v2) {
>> +            /*
>> +             * There is a default window in just created container.
>> +             * To make region_add/del simpler, we better remove this
>> +             * window now and let those iommu_listener callbacks
>> +             * create/remove them when needed.
>> +             */
>> +            struct vfio_iommu_spapr_tce_remove remove = {
>> +                .argsz = sizeof(remove),
>> +                .start_addr = info.dma32_window_start,
>> +            };
>> +            ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
>> +            if (ret) {
>> +                error_report("vfio: VFIO_IOMMU_SPAPR_TCE_REMOVE failed: %m");
>> +                ret = -errno;
>> +                goto free_container_exit;
>> +            }
>> +        } else {
>> +            /* The default table uses 4K pages */
>> +            vfio_host_win_add(container, info.dma32_window_start,
>> +                              info.dma32_window_start +
>> +                              info.dma32_window_size - 1,
>> +                              0x1000);
>> +        }
>>      } else {
>>          error_report("vfio: No available IOMMU models");
>>          ret = -EINVAL;
>> diff --git a/trace-events b/trace-events
>> index d0d8615..b5419de 100644
>> --- a/trace-events
>> +++ b/trace-events
>> @@ -1739,6 +1739,8 @@ vfio_region_finalize(const char *name, int index) "Device %s, region %d"
>>  vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
>>  vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>>  vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>> +vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
>> +vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
>>
>>  # hw/vfio/platform.c
>>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v16 19/19] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-05-13  8:41   ` Bharata B Rao
  2016-05-13  8:49     ` Bharata B Rao
@ 2016-05-16  6:25     ` Alexey Kardashevskiy
  2016-05-17  5:32       ` Bharata B Rao
  2016-05-27  4:42     ` David Gibson
  2 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-16  6:25 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: qemu-devel, Alexander Graf, Alex Williamson, qemu-ppc,
	Paolo Bonzini, David Gibson

On 05/13/2016 06:41 PM, Bharata B Rao wrote:
> On Wed, May 4, 2016 at 12:22 PM, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>> This adds support for the Dynamic DMA Windows (DDW) option defined by
>> the SPAPR specification, which allows a guest to have additional DMA window(s).
>>
>> The "ddw" property is enabled by default on a PHB, but for compatibility
>> the pseries-2.5 machine (TODO: update version) and older machines disable it.
>> This also creates a single DMA window for the older machines to
>> maintain backward migration.
>>
>> This implements DDW for PHBs with emulated and VFIO devices. Host
>> kernel support is required. The advertised IOMMU page sizes are 4K and
>> 64K; 16M pages are supported but not advertised by default, in order to
>> enable them, the user has to specify "pgsz" property for PHB and
>> enable huge pages for RAM.
>>
>> The existing Linux guests try creating one additional huge DMA window
>> with 64K or 16MB pages and map the entire guest RAM to it. If that succeeds,
>> the guest switches to dma_direct_ops and never calls TCE hypercalls
>> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
>> and not waste time on map/unmap later. This adds a "dma64_win_addr"
>> property which is a bus address for the 64bit window and by default
>> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
>> uses and this allows having emulated and VFIO devices on the same bus.
>>
>> This adds 4 RTAS handlers:
>> * ibm,query-pe-dma-window
>> * ibm,create-pe-dma-window
>> * ibm,remove-pe-dma-window
>> * ibm,reset-pe-dma-window
>> These are registered from type_init() callback.
>>
>> These RTAS handlers are implemented in a separate file to avoid polluting
>> spapr_iommu.c with PCI.
>>
>> This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v16:
>> * s/dma_liobn/dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS]/
>> * s/SPAPR_PCI_LIOBN()/dma_liobn[]/
>>
>> v15:
>> * moved page mask filtering to PHB realize(), use "-mempath" to know
>> if there are huge pages
>> * fixed error reporting in RTAS handlers
>> * max window size accounts now hotpluggable memory boundaries
>> ---
>>  hw/ppc/Makefile.objs        |   1 +
>>  hw/ppc/spapr.c              |   5 +
>>  hw/ppc/spapr_pci.c          |  75 +++++++++---
>>  hw/ppc/spapr_rtas_ddw.c     | 292 ++++++++++++++++++++++++++++++++++++++++++++
>>  include/hw/pci-host/spapr.h |   8 +-
>>  include/hw/ppc/spapr.h      |  16 ++-
>>  trace-events                |   4 +
>>  7 files changed, 381 insertions(+), 20 deletions(-)
>>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
>>
>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>> index c1ffc77..986b36f 100644
>> --- a/hw/ppc/Makefile.objs
>> +++ b/hw/ppc/Makefile.objs
>> @@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
>>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>>  obj-y += spapr_pci_vfio.o
>>  endif
>> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>>  # PowerPC 4xx boards
>>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>>  obj-y += ppc4xx_pci.o
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index b69995e..0206609 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -2365,6 +2365,11 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
>>          .driver   = "spapr-vlan", \
>>          .property = "use-rx-buffer-pools", \
>>          .value    = "off", \
>> +    }, \
>> +    {\
>> +        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
>> +        .property = "ddw",\
>> +        .value    = stringify(off),\
>>      },
>>
>>  static void spapr_machine_2_5_instance_options(MachineState *machine)
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index 51e7d56..aa414f2 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -35,6 +35,7 @@
>>  #include "hw/ppc/spapr.h"
>>  #include "hw/pci-host/spapr.h"
>>  #include "exec/address-spaces.h"
>> +#include "exec/ram_addr.h"
>>  #include <libfdt.h>
>>  #include "trace.h"
>>  #include "qemu/error-report.h"
>> @@ -44,6 +45,7 @@
>>  #include "hw/pci/pci_bus.h"
>>  #include "hw/ppc/spapr_drc.h"
>>  #include "sysemu/device_tree.h"
>> +#include "sysemu/hostmem.h"
>>
>>  #include "hw/vfio/vfio.h"
>>
>> @@ -1305,11 +1307,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>      PCIBus *bus;
>>      uint64_t msi_window_size = 4096;
>>      sPAPRTCETable *tcet;
>> +    const unsigned windows_supported =
>> +        sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
>>
>>      if (sphb->index != (uint32_t)-1) {
>>          hwaddr windows_base;
>>
>> -        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn != (uint32_t)-1)
>> +        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
>> +            || ((sphb->dma_liobn[1] != (uint32_t)-1) && (windows_supported > 1))
>>              || (sphb->mem_win_addr != (hwaddr)-1)
>>              || (sphb->io_win_addr != (hwaddr)-1)) {
>>              error_setg(errp, "Either \"index\" or other parameters must"
>> @@ -1324,7 +1329,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>          }
>>
>>          sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
>> -        sphb->dma_liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
>> +        for (i = 0; i < windows_supported; ++i) {
>> +            sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
>> +        }
>>
>>          windows_base = SPAPR_PCI_WINDOW_BASE
>>              + sphb->index * SPAPR_PCI_WINDOW_SPACING;
>> @@ -1337,8 +1344,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>          return;
>>      }
>>
>> -    if (sphb->dma_liobn == (uint32_t)-1) {
>> -        error_setg(errp, "LIOBN not specified for PHB");
>> +    if ((sphb->dma_liobn[0] == (uint32_t)-1) ||
>> +        ((sphb->dma_liobn[1] == (uint32_t)-1) && (windows_supported > 1))) {
>> +        error_setg(errp, "LIOBN(s) not specified for PHB");
>>          return;
>>      }
>>
>> @@ -1456,16 +1464,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>          }
>>      }
>>
>> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
>> -    if (!tcet) {
>> -        error_setg(errp, "Unable to create TCE table for %s",
>> -                   sphb->dtbusname);
>> -        return;
>> +    /* DMA setup */
>> +    for (i = 0; i < windows_supported; ++i) {
>> +        tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn[i]);
>> +        if (!tcet) {
>> +            error_setg(errp, "Creating window#%d failed for %s",
>> +                       i, sphb->dtbusname);
>> +            return;
>> +        }
>> +        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
>> +                                            spapr_tce_get_iommu(tcet), 0);
>>      }
>>
>> -    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
>> -                                        spapr_tce_get_iommu(tcet), 0);
>> -
>>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>>  }
>>
>> @@ -1482,13 +1492,19 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
>>
>>  void spapr_phb_dma_reset(sPAPRPHBState *sphb)
>>  {
>> -    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
>> +    int i;
>> +    sPAPRTCETable *tcet;
>>
>> -    if (tcet && tcet->enabled) {
>> -        spapr_tce_table_disable(tcet);
>> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>> +        tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[i]);
>> +
>> +        if (tcet && tcet->enabled) {
>> +            spapr_tce_table_disable(tcet);
>> +        }
>>      }
>>
>>      /* Register default 32bit DMA window */
>> +    tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[0]);
>>      spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
>>                             sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT);
>>  }
>> @@ -1510,7 +1526,8 @@ static void spapr_phb_reset(DeviceState *qdev)
>>  static Property spapr_phb_properties[] = {
>>      DEFINE_PROP_UINT32("index", sPAPRPHBState, index, -1),
>>      DEFINE_PROP_UINT64("buid", sPAPRPHBState, buid, -1),
>> -    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn, -1),
>> +    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn[0], -1),
>> +    DEFINE_PROP_UINT32("liobn64", sPAPRPHBState, dma_liobn[1], -1),
>>      DEFINE_PROP_UINT64("mem_win_addr", sPAPRPHBState, mem_win_addr, -1),
>>      DEFINE_PROP_UINT64("mem_win_size", sPAPRPHBState, mem_win_size,
>>                         SPAPR_PCI_MMIO_WIN_SIZE),
>> @@ -1522,6 +1539,11 @@ static Property spapr_phb_properties[] = {
>>      /* Default DMA window is 0..1GB */
>>      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
>>      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
>> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
>> +                       0x800000000000000ULL),
>> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
>> +    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
>> +                       (1ULL << 12) | (1ULL << 16)),
>>      DEFINE_PROP_END_OF_LIST(),
>>  };
>>
>> @@ -1598,7 +1620,7 @@ static const VMStateDescription vmstate_spapr_pci = {
>>      .post_load = spapr_pci_post_load,
>>      .fields = (VMStateField[]) {
>>          VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),
>> -        VMSTATE_UINT32_EQUAL(dma_liobn, sPAPRPHBState),
>> +        VMSTATE_UNUSED(4), /* dma_liobn */
>>          VMSTATE_UINT64_EQUAL(mem_win_addr, sPAPRPHBState),
>>          VMSTATE_UINT64_EQUAL(mem_win_size, sPAPRPHBState),
>>          VMSTATE_UINT64_EQUAL(io_win_addr, sPAPRPHBState),
>> @@ -1775,6 +1797,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>      uint32_t interrupt_map_mask[] = {
>>          cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
>>      uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
>> +    uint32_t ddw_applicable[] = {
>> +        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
>> +        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
>> +        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
>> +    };
>> +    uint32_t ddw_extensions[] = {
>> +        cpu_to_be32(1),
>> +        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
>> +    };
>>      sPAPRTCETable *tcet;
>>      PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
>>      sPAPRFDT s_fdt;
>> @@ -1799,6 +1830,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
>>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
>>
>> +    /* Dynamic DMA window */
>> +    if (phb->ddw_enabled) {
>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
>> +                         sizeof(ddw_applicable)));
>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
>> +                         &ddw_extensions, sizeof(ddw_extensions)));
>> +    }
>> +
>>      /* Build the interrupt-map, this must matches what is done
>>       * in pci_spapr_map_irq
>>       */
>> @@ -1822,7 +1861,7 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>      _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
>>                       sizeof(interrupt_map)));
>>
>> -    tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
>> +    tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
>>      if (!tcet) {
>>          return -1;
>>      }
>> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
>> new file mode 100644
>> index 0000000..b4e0686
>> --- /dev/null
>> +++ b/hw/ppc/spapr_rtas_ddw.c
>> @@ -0,0 +1,292 @@
>> +/*
>> + * QEMU sPAPR Dynamic DMA windows support
>> + *
>> + * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
>> + *
>> + *  This program is free software; you can redistribute it and/or modify
>> + *  it under the terms of the GNU General Public License as published by
>> + *  the Free Software Foundation; either version 2 of the License,
>> + *  or (at your option) any later version.
>> + *
>> + *  This program is distributed in the hope that it will be useful,
>> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + *  GNU General Public License for more details.
>> + *
>> + *  You should have received a copy of the GNU General Public License
>> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/error-report.h"
>> +#include "hw/ppc/spapr.h"
>> +#include "hw/pci-host/spapr.h"
>> +#include "trace.h"
>> +
>> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
>> +{
>> +    sPAPRTCETable *tcet;
>> +
>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>> +    if (tcet && tcet->enabled) {
>> +        ++*(unsigned *)opaque;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
>> +{
>> +    unsigned ret = 0;
>> +
>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
>> +
>> +    return ret;
>> +}
>> +
>> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
>> +{
>> +    sPAPRTCETable *tcet;
>> +
>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>> +    if (tcet && !tcet->enabled) {
>> +        *(uint32_t *)opaque = tcet->liobn;
>> +        return 1;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
>> +{
>> +    uint32_t liobn = 0;
>> +
>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
>> +
>> +    return liobn;
>> +}
>> +
>> +static uint32_t spapr_page_mask_to_query_mask(uint64_t page_mask)
>> +{
>> +    int i;
>> +    uint32_t mask = 0;
>> +    const struct { int shift; uint32_t mask; } masks[] = {
>> +        { 12, RTAS_DDW_PGSIZE_4K },
>> +        { 16, RTAS_DDW_PGSIZE_64K },
>> +        { 24, RTAS_DDW_PGSIZE_16M },
>> +        { 25, RTAS_DDW_PGSIZE_32M },
>> +        { 26, RTAS_DDW_PGSIZE_64M },
>> +        { 27, RTAS_DDW_PGSIZE_128M },
>> +        { 28, RTAS_DDW_PGSIZE_256M },
>> +        { 34, RTAS_DDW_PGSIZE_16G },
>> +    };
>> +
>> +    for (i = 0; i < ARRAY_SIZE(masks); ++i) {
>> +        if (page_mask & (1ULL << masks[i].shift)) {
>> +            mask |= masks[i].mask;
>> +        }
>> +    }
>> +
>> +    return mask;
>> +}
>> +
>> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
>> +                                         sPAPRMachineState *spapr,
>> +                                         uint32_t token, uint32_t nargs,
>> +                                         target_ulong args,
>> +                                         uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    uint64_t buid, max_window_size;
>> +    uint32_t avail, addr, pgmask = 0;
>> +    MachineState *machine = MACHINE(spapr);
>> +
>> +    if ((nargs != 3) || (nret != 5)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    /* Translate page mask to LoPAPR format */
>> +    pgmask = spapr_page_mask_to_query_mask(sphb->page_size_mask);
>> +
>> +    /*
>> +     * This is "Largest contiguous block of TCEs allocated specifically
>> +     * for (that is, are reserved for) this PE".
>> +     * Return the maximum number as maximum supported RAM size was in 4K pages.
>> +     */
>> +    if (machine->ram_size == machine->maxram_size) {
>> +        max_window_size = machine->ram_size >> SPAPR_TCE_PAGE_SHIFT;
>> +    } else {
>> +        MemoryHotplugState *hpms = &spapr->hotplug_memory;
>> +
>> +        max_window_size = hpms->base + memory_region_size(&hpms->mr);
>> +    }
>
> Guess SPAPR_TCE_PAGE_SHIFT right shift should be applied to
> max_window_size in both the instances (if and else) ?

Yes it should. Thanks for noticing.
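For illustration only, the fix being agreed on above (applying the TCE page shift in both branches so the value is returned in 4K pages rather than bytes) could be sketched like this; the function name and the plain scalar parameters are stand-ins, not the actual QEMU machine/hotplug structures:

```c
#include <assert.h>
#include <stdint.h>

#define SPAPR_TCE_PAGE_SHIFT 12 /* 4K TCE pages */

/* Hypothetical standalone sketch: largest window size in 4K pages.
 * ram_size/maxram_size mimic machine->ram_size/maxram_size; the hotplug
 * parameters mimic hpms->base and memory_region_size(&hpms->mr). */
static uint64_t query_max_window_pages(uint64_t ram_size, uint64_t maxram_size,
                                       uint64_t hotplug_base,
                                       uint64_t hotplug_mr_size)
{
    uint64_t max_window_size;

    if (ram_size == maxram_size) {
        max_window_size = ram_size;
    } else {
        max_window_size = hotplug_base + hotplug_mr_size;
    }
    /* The shift applied in both branches, as noted in the review */
    return max_window_size >> SPAPR_TCE_PAGE_SHIFT;
}
```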


>
>> +
>> +    avail = SPAPR_PCI_DMA_MAX_WINDOWS - spapr_phb_get_active_win_num(sphb);
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +    rtas_st(rets, 1, avail);
>> +    rtas_st(rets, 2, max_window_size);
>> +    rtas_st(rets, 3, pgmask);
>> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
>> +
>> +    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
>> +                                          sPAPRMachineState *spapr,
>> +                                          uint32_t token, uint32_t nargs,
>> +                                          target_ulong args,
>> +                                          uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    sPAPRTCETable *tcet = NULL;
>> +    uint32_t addr, page_shift, window_shift, liobn;
>> +    uint64_t buid;
>> +
>> +    if ((nargs != 5) || (nret != 4)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    page_shift = rtas_ld(args, 3);
>> +    window_shift = rtas_ld(args, 4);
>
> Kernel has a bug due to which wrong window_shift gets returned here. I
> have posted possible fix here:
> https://patchwork.ozlabs.org/patch/621497/
>
> I have tried to work around this issue in QEMU too
> https://lists.nongnu.org/archive/html/qemu-ppc/2016-04/msg00226.html
>
> But the above work around involves changing the memory representation
> in DT.

What is wrong with this workaround?


> Hence I feel until the guest kernel changes are available, a
> simpler work around would be to discard the window_shift value above
> and recalculate the right value as below:
>
> if (machine->ram_size == machine->maxram_size) {
>     max_window_size = machine->ram_size;
> } else {
>      MemoryHotplugState *hpms = &spapr->hotplug_memory;
>      max_window_size = hpms->base + memory_region_size(&hpms->mr);
> }
> window_shift = max_window_size >> SPAPR_TCE_PAGE_SHIFT;
>
> and create DDW based on this calculated window_shift value. Does that
> sound reasonable ?

The workaround should only do that for the second window, at least, or for
the default one but with a page size of at least 64K; otherwise it is going
to waste memory (2MB for each 1GB of guest RAM).
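One illustrative reading of that condition (recompute the window shift from the RAM layout only when the requested page size is at least 64K, otherwise keep the guest-supplied value) might be sketched as below; the function, its parameters, and the log2 interpretation of window_shift are assumptions for the sketch, not the actual QEMU or RTAS interface:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical workaround sketch: for pages >= 64K, ignore the
 * guest-supplied window_shift and derive it from the maximum RAM size;
 * for 4K pages, keep the guest value so the TCE table overhead stays
 * bounded (the 2MB-per-1GB concern raised above). */
static uint32_t recalc_window_shift(uint64_t ram_size, uint64_t maxram_size,
                                    uint64_t hotplug_base,
                                    uint64_t hotplug_mr_size,
                                    uint32_t page_shift, uint32_t guest_shift)
{
    uint64_t max_window_size;
    uint32_t shift = 0;

    if (page_shift < 16) {
        return guest_shift; /* 4K pages: don't inflate the table */
    }
    if (ram_size == maxram_size) {
        max_window_size = ram_size;
    } else {
        max_window_size = hotplug_base + hotplug_mr_size;
    }
    /* Smallest shift such that (1 << shift) covers max_window_size */
    while ((1ULL << shift) < max_window_size) {
        ++shift;
    }
    return shift;
}
```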



-- 
Alexey

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v16 17/19] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO
  2016-05-13 22:26   ` Alex Williamson
@ 2016-05-16  8:35     ` Alexey Kardashevskiy
  2016-05-16 20:13       ` Alex Williamson
  0 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-16  8:35 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson, Paolo Bonzini

On 05/14/2016 08:26 AM, Alex Williamson wrote:
> On Wed,  4 May 2016 16:52:29 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
>> The sPAPR TCE tables manage 2 copies when VFIO is using an IOMMU -
>> a guest view of the table and a hardware TCE table. If there is no VFIO
>> presence in the address space, only the guest view is used, and in
>> that case it is allocated in KVM. However, since there is no
>> support yet for VFIO in the KVM TCE hypercalls, when we start using
>> VFIO we need to move the guest view from KVM to userspace; and we need
>> to do this for every IOMMU on a bus with VFIO devices.
>>
>> This adds vfio_start/vfio_stop callbacks in MemoryRegionIOMMUOps to
>> notify IOMMU about changing environment so it can reallocate the table
>> to/from KVM or (when available) hook the IOMMU groups with the logical
>> bus (LIOBN) in the KVM.
>>
>> This removes explicit spapr_tce_set_need_vfio() call from PCI hotplug
>> path as the new callbacks do this better - they notify IOMMU at
>> the exact moment when the configuration is changed, and this also
>> includes the case of PCI hot unplug.
>>
>> This postpones vfio_stop() till the end of region_del() as
>> vfio_dma_unmap() has to execute before VFIO support is disabled.
>>
>> As there can be multiple containers attached to the same PHB/LIOBN,
>> this adds a wrapper with a use counter for every IOMMU MR and
>> stores them in a list in the VFIOAddressSpace.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v16:
>> * added a use counter in VFIOAddressSpace->VFIOIOMMUMR
>>
>> v15:
>> * s/need_vfio/vfio-Users/g
>> ---
>>  hw/ppc/spapr_iommu.c          | 12 ++++++++++++
>>  hw/ppc/spapr_pci.c            |  6 ------
>>  hw/vfio/common.c              | 45 ++++++++++++++++++++++++++++++++++++++++++-
>>  include/exec/memory.h         |  4 ++++
>>  include/hw/vfio/vfio-common.h |  7 +++++++
>>  5 files changed, 67 insertions(+), 7 deletions(-)
>>
>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>> index c945dba..7af2700 100644
>> --- a/hw/ppc/spapr_iommu.c
>> +++ b/hw/ppc/spapr_iommu.c
>> @@ -155,6 +155,16 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
>>      return 1ULL << tcet->page_shift;
>>  }
>>
>> +static void spapr_tce_vfio_start(MemoryRegion *iommu)
>> +{
>> +    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), true);
>> +}
>> +
>> +static void spapr_tce_vfio_stop(MemoryRegion *iommu)
>> +{
>> +    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), false);
>> +}
>> +
>>  static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
>>  static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
>>
>> @@ -239,6 +249,8 @@ static const VMStateDescription vmstate_spapr_tce_table = {
>>  static MemoryRegionIOMMUOps spapr_iommu_ops = {
>>      .translate = spapr_tce_translate_iommu,
>>      .get_page_sizes = spapr_tce_get_page_sizes,
>> +    .vfio_start = spapr_tce_vfio_start,
>> +    .vfio_stop = spapr_tce_vfio_stop,
>>  };
>>
>>  static int spapr_tce_table_realize(DeviceState *dev)
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index 5b9ccff..51e7d56 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -1086,12 +1086,6 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>>      void *fdt = NULL;
>>      int fdt_start_offset = 0, fdt_size;
>>
>> -    if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
>> -        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
>> -
>> -        spapr_tce_set_need_vfio(tcet, true);
>> -    }
>> -
>>      if (dev->hotplugged) {
>>          fdt = create_device_tree(&fdt_size);
>>          fdt_start_offset = spapr_create_pci_child_dt(phb, pdev, fdt, 0);
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 3f2fb23..03daf88 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -421,6 +421,26 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>>
>>          memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
>> +
>> +        if (section->mr->iommu_ops && section->mr->iommu_ops->vfio_start) {
>> +            VFIOIOMMUMR *iommumr;
>> +            bool found = false;
>> +
>> +            QLIST_FOREACH(iommumr, &container->space->iommumrs, iommumr_next) {
>> +                if (iommumr->iommu == section->mr) {
>> +                    found = true;
>> +                    break;
>> +                }
>> +            }
>> +            if (!found) {
>> +                iommumr = g_malloc0(sizeof(*iommumr));
>> +                iommumr->iommu = section->mr;
>> +                section->mr->iommu_ops->vfio_start(section->mr);
>> +                QLIST_INSERT_HEAD(&container->space->iommumrs, iommumr,
>> +                                  iommumr_next);
>> +            }
>> +            ++iommumr->users;
>> +        }
>>          memory_region_iommu_replay(giommu->iommu, &giommu->n,
>>                                     false);
>>
>> @@ -470,6 +490,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>      hwaddr iova, end;
>>      Int128 llend, llsize;
>>      int ret;
>> +    MemoryRegion *iommu = NULL;
>>
>>      if (vfio_listener_skipped_section(section)) {
>>          trace_vfio_listener_region_del_skip(
>> @@ -490,13 +511,30 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>
>>          QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
>>              if (giommu->iommu == section->mr) {
>> +                VFIOIOMMUMR *iommumr;
>> +
>>                  memory_region_unregister_iommu_notifier(&giommu->n);
>> +
>> +                QLIST_FOREACH(iommumr, &container->space->iommumrs,
>> +                              iommumr_next) {
>> +                    if (iommumr->iommu != section->mr) {
>> +                        continue;
>> +                    }
>> +                    --iommumr->users;
>> +                    if (iommumr->users) {
>> +                        break;
>> +                    }
>> +                    QLIST_REMOVE(iommumr, iommumr_next);
>> +                    g_free(iommumr);
>> +                    iommu = giommu->iommu;
>> +                    break;
>> +                }
>> +
>>                  QLIST_REMOVE(giommu, giommu_next);
>>                  g_free(giommu);
>>                  break;
>>              }
>>          }
>> -
>>          /*
>>           * FIXME: We assume the one big unmap below is adequate to
>>           * remove any individual page mappings in the IOMMU which
>> @@ -527,6 +565,10 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>                       "0x%"HWADDR_PRIx") = %d (%m)",
>>                       container, iova, int128_get64(llsize), ret);
>>      }
>> +
>> +    if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
>> +        iommu->iommu_ops->vfio_stop(section->mr);
>> +    }
>>  }
>>
>>  static const MemoryListener vfio_memory_listener = {
>> @@ -787,6 +829,7 @@ static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
>>      space = g_malloc0(sizeof(*space));
>>      space->as = as;
>>      QLIST_INIT(&space->containers);
>> +    QLIST_INIT(&space->iommumrs);
>>
>>      QLIST_INSERT_HEAD(&vfio_address_spaces, space, list);
>>
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index a3a1703..52d2c70 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -151,6 +151,10 @@ struct MemoryRegionIOMMUOps {
>>      IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
>>      /* Returns supported page sizes */
>>      uint64_t (*get_page_sizes)(MemoryRegion *iommu);
>> +    /* Called when VFIO starts using this */
>> +    void (*vfio_start)(MemoryRegion *iommu);
>> +    /* Called when VFIO stops using this */
>> +    void (*vfio_stop)(MemoryRegion *iommu);
>
> Really?  Just no.  Generic MemoryRegionIOMMUOps should have no
> visibility of vfio and certainly not vfio specific callbacks.  I don't
> really understand what guest view versus KVM view is doing here, but
> it's clearly something to do with visibility versus acceleration of the
> IOMMU tables and the callbacks, if they're even generic at all, should
> be reflecting that, not some vague relation to vfio.


Is it 'no' to just having these specific callbacks in the
MemoryRegionIOMMUOps struct, or 'no' to the whole idea of notifying the MR
that VFIO has started using it?

If the former, I could inherit a QOM object from the IOMMU MR, add a QOM
interface and call it from VFIO. Manually counting vfio-pci devices on a
hotplug-enabled PHB is painful :-/


>>  };
>>
>>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 808263b..a9e6e33 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -64,9 +64,16 @@ typedef struct VFIORegion {
>>  typedef struct VFIOAddressSpace {
>>      AddressSpace *as;
>>      QLIST_HEAD(, VFIOContainer) containers;
>> +    QLIST_HEAD(, VFIOIOMMUMR) iommumrs;
>>      QLIST_ENTRY(VFIOAddressSpace) list;
>>  } VFIOAddressSpace;
>>
>> +typedef struct VFIOIOMMUMR {
>> +    MemoryRegion *iommu;
>> +    int users;
>> +    QLIST_ENTRY(VFIOIOMMUMR) iommumr_next;
>> +} VFIOIOMMUMR;
>> +
>>  struct VFIOGroup;
>>
>>  typedef struct VFIOContainer {
>


-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v16 17/19] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO
  2016-05-16  8:35     ` Alexey Kardashevskiy
@ 2016-05-16 20:13       ` Alex Williamson
  2016-05-20  8:04         ` [Qemu-devel] [RFC PATCH qemu] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping listening Alexey Kardashevskiy
  0 siblings, 1 reply; 69+ messages in thread
From: Alex Williamson @ 2016-05-16 20:13 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson, Paolo Bonzini

On Mon, 16 May 2016 18:35:14 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 05/14/2016 08:26 AM, Alex Williamson wrote:
> > On Wed,  4 May 2016 16:52:29 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >  
> >> The sPAPR TCE tables manage 2 copies when VFIO is using an IOMMU -
> >> a guest view of the table and a hardware TCE table. If there is no VFIO
> >> presence in the address space, only the guest view is used, and in
> >> that case it is allocated in KVM. However, since there is no
> >> support yet for VFIO in the KVM TCE hypercalls, when we start using
> >> VFIO we need to move the guest view from KVM to userspace; and we need
> >> to do this for every IOMMU on a bus with VFIO devices.
> >>
> >> This adds vfio_start/vfio_stop callbacks in MemoryRegionIOMMUOps to
> >> notify IOMMU about changing environment so it can reallocate the table
> >> to/from KVM or (when available) hook the IOMMU groups with the logical
> >> bus (LIOBN) in the KVM.
> >>
> >> This removes explicit spapr_tce_set_need_vfio() call from PCI hotplug
> >> path as the new callbacks do this better - they notify IOMMU at
> >> the exact moment when the configuration is changed, and this also
> >> includes the case of PCI hot unplug.
> >>
> >> This postpones vfio_stop() till the end of region_del() as
> >> vfio_dma_unmap() has to execute before VFIO support is disabled.
> >>
> >> As there can be multiple containers attached to the same PHB/LIOBN,
> >> this adds a wrapper with a use counter for every IOMMU MR and
> >> stores them in a list in the VFIOAddressSpace.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >> Changes:
> >> v16:
> >> * added a use counter in VFIOAddressSpace->VFIOIOMMUMR
> >>
> >> v15:
> >> * s/need_vfio/vfio-Users/g
> >> ---
> >>  hw/ppc/spapr_iommu.c          | 12 ++++++++++++
> >>  hw/ppc/spapr_pci.c            |  6 ------
> >>  hw/vfio/common.c              | 45 ++++++++++++++++++++++++++++++++++++++++++-
> >>  include/exec/memory.h         |  4 ++++
> >>  include/hw/vfio/vfio-common.h |  7 +++++++
> >>  5 files changed, 67 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> >> index c945dba..7af2700 100644
> >> --- a/hw/ppc/spapr_iommu.c
> >> +++ b/hw/ppc/spapr_iommu.c
> >> @@ -155,6 +155,16 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
> >>      return 1ULL << tcet->page_shift;
> >>  }
> >>
> >> +static void spapr_tce_vfio_start(MemoryRegion *iommu)
> >> +{
> >> +    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), true);
> >> +}
> >> +
> >> +static void spapr_tce_vfio_stop(MemoryRegion *iommu)
> >> +{
> >> +    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), false);
> >> +}
> >> +
> >>  static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
> >>  static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
> >>
> >> @@ -239,6 +249,8 @@ static const VMStateDescription vmstate_spapr_tce_table = {
> >>  static MemoryRegionIOMMUOps spapr_iommu_ops = {
> >>      .translate = spapr_tce_translate_iommu,
> >>      .get_page_sizes = spapr_tce_get_page_sizes,
> >> +    .vfio_start = spapr_tce_vfio_start,
> >> +    .vfio_stop = spapr_tce_vfio_stop,
> >>  };
> >>
> >>  static int spapr_tce_table_realize(DeviceState *dev)
> >> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> >> index 5b9ccff..51e7d56 100644
> >> --- a/hw/ppc/spapr_pci.c
> >> +++ b/hw/ppc/spapr_pci.c
> >> @@ -1086,12 +1086,6 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
> >>      void *fdt = NULL;
> >>      int fdt_start_offset = 0, fdt_size;
> >>
> >> -    if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
> >> -        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
> >> -
> >> -        spapr_tce_set_need_vfio(tcet, true);
> >> -    }
> >> -
> >>      if (dev->hotplugged) {
> >>          fdt = create_device_tree(&fdt_size);
> >>          fdt_start_offset = spapr_create_pci_child_dt(phb, pdev, fdt, 0);
> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> index 3f2fb23..03daf88 100644
> >> --- a/hw/vfio/common.c
> >> +++ b/hw/vfio/common.c
> >> @@ -421,6 +421,26 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >>          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
> >>
> >>          memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> >> +
> >> +        if (section->mr->iommu_ops && section->mr->iommu_ops->vfio_start) {
> >> +            VFIOIOMMUMR *iommumr;
> >> +            bool found = false;
> >> +
> >> +            QLIST_FOREACH(iommumr, &container->space->iommumrs, iommumr_next) {
> >> +                if (iommumr->iommu == section->mr) {
> >> +                    found = true;
> >> +                    break;
> >> +                }
> >> +            }
> >> +            if (!found) {
> >> +                iommumr = g_malloc0(sizeof(*iommumr));
> >> +                iommumr->iommu = section->mr;
> >> +                section->mr->iommu_ops->vfio_start(section->mr);
> >> +                QLIST_INSERT_HEAD(&container->space->iommumrs, iommumr,
> >> +                                  iommumr_next);
> >> +            }
> >> +            ++iommumr->users;
> >> +        }
> >>          memory_region_iommu_replay(giommu->iommu, &giommu->n,
> >>                                     false);
> >>
> >> @@ -470,6 +490,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >>      hwaddr iova, end;
> >>      Int128 llend, llsize;
> >>      int ret;
> >> +    MemoryRegion *iommu = NULL;
> >>
> >>      if (vfio_listener_skipped_section(section)) {
> >>          trace_vfio_listener_region_del_skip(
> >> @@ -490,13 +511,30 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >>
> >>          QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> >>              if (giommu->iommu == section->mr) {
> >> +                VFIOIOMMUMR *iommumr;
> >> +
> >>                  memory_region_unregister_iommu_notifier(&giommu->n);
> >> +
> >> +                QLIST_FOREACH(iommumr, &container->space->iommumrs,
> >> +                              iommumr_next) {
> >> +                    if (iommumr->iommu != section->mr) {
> >> +                        continue;
> >> +                    }
> >> +                    --iommumr->users;
> >> +                    if (iommumr->users) {
> >> +                        break;
> >> +                    }
> >> +                    QLIST_REMOVE(iommumr, iommumr_next);
> >> +                    g_free(iommumr);
> >> +                    iommu = giommu->iommu;
> >> +                    break;
> >> +                }
> >> +
> >>                  QLIST_REMOVE(giommu, giommu_next);
> >>                  g_free(giommu);
> >>                  break;
> >>              }
> >>          }
> >> -
> >>          /*
> >>           * FIXME: We assume the one big unmap below is adequate to
> >>           * remove any individual page mappings in the IOMMU which
> >> @@ -527,6 +565,10 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >>                       "0x%"HWADDR_PRIx") = %d (%m)",
> >>                       container, iova, int128_get64(llsize), ret);
> >>      }
> >> +
> >> +    if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
> >> +        iommu->iommu_ops->vfio_stop(section->mr);
> >> +    }
> >>  }
> >>
> >>  static const MemoryListener vfio_memory_listener = {
> >> @@ -787,6 +829,7 @@ static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
> >>      space = g_malloc0(sizeof(*space));
> >>      space->as = as;
> >>      QLIST_INIT(&space->containers);
> >> +    QLIST_INIT(&space->iommumrs);
> >>
> >>      QLIST_INSERT_HEAD(&vfio_address_spaces, space, list);
> >>
> >> diff --git a/include/exec/memory.h b/include/exec/memory.h
> >> index a3a1703..52d2c70 100644
> >> --- a/include/exec/memory.h
> >> +++ b/include/exec/memory.h
> >> @@ -151,6 +151,10 @@ struct MemoryRegionIOMMUOps {
> >>      IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
> >>      /* Returns supported page sizes */
> >>      uint64_t (*get_page_sizes)(MemoryRegion *iommu);
> >> +    /* Called when VFIO starts using this */
> >> +    void (*vfio_start)(MemoryRegion *iommu);
> >> +    /* Called when VFIO stops using this */
> >> +    void (*vfio_stop)(MemoryRegion *iommu);  
> >
> > Really?  Just no.  Generic MemoryRegionIOMMUOps should have no
> > visibility of vfio and certainly not vfio specific callbacks.  I don't
> > really understand what guest view versus KVM view is doing here, but
> > it's clearly something to do with visibility versus acceleration of the
> > IOMMU tables and the callbacks, if they're even generic at all, should
> > be reflecting that, not some vague relation to vfio.  
> 
> 
> Is it 'no' to just having these specific callbacks in the
> MemoryRegionIOMMUOps struct, or 'no' to the whole idea of notifying the MR
> that VFIO has started using it?
> 
> If the former, I could inherit a QOM object from the IOMMU MR, add a QOM
> interface and call it from VFIO. Manually counting vfio-pci devices on a
> hotplug-enabled PHB is painful :-/

So is this interface.  The idea of putting vfio specific callbacks in a
common structure is just a non-starter.  I think you're trying to
manage the visibility of the iommu to QEMU, so wouldn't it be logical
to tie this into the iommu notifier?

You already have memory_region_register_iommu_notifier() and 
memory_region_unregister_iommu_notifier() as generic entry points where
a caller registering an iommu notifier clearly cares about a QEMU-based
iommu table and de-registration clearly indicates an end of that use.
Could you not make MemoryRegionIOMMUOps callbacks around those events,
rather than tainting it with vfio specific callbacks?  If you don't want
to keep track of usage count yourself, perhaps the callbacks would only
be used when the first entry is added to the notifier list QLIST and
when the last entry is removed.  This would allow vfio/common to know
nothing about this and we wouldn't need to invoke a poorly defined ops
callback interface or duplicate reference counts on platforms that
don't have this issue.
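A minimal sketch of that suggestion, assuming a simplified notifier list (the real QEMU notifier machinery and MemoryRegionIOMMUOps names differ; the hook names and the `table_in_userspace` flag here are purely illustrative): fire a generic "started"/"stopped" hook only when the notifier list transitions between empty and non-empty, so the registration path carries the reference counting instead of VFIO:

```c
#include <assert.h>
#include <stddef.h>

typedef struct Notifier {
    struct Notifier *next;
} Notifier;

typedef struct IOMMUMemoryRegion {
    Notifier *notifiers;    /* singly-linked notifier list */
    int table_in_userspace; /* stand-in for the guest-view location */
} IOMMUMemoryRegion;

/* Hypothetical generic hooks a platform IOMMU could implement */
static void iommu_notify_started(IOMMUMemoryRegion *mr)
{
    mr->table_in_userspace = 1; /* e.g. move the guest view out of KVM */
}

static void iommu_notify_stopped(IOMMUMemoryRegion *mr)
{
    mr->table_in_userspace = 0; /* e.g. move it back into KVM */
}

static void register_iommu_notifier(IOMMUMemoryRegion *mr, Notifier *n)
{
    if (!mr->notifiers) {
        iommu_notify_started(mr); /* first user appeared */
    }
    n->next = mr->notifiers;
    mr->notifiers = n;
}

static void unregister_iommu_notifier(IOMMUMemoryRegion *mr, Notifier *n)
{
    Notifier **p;

    for (p = &mr->notifiers; *p; p = &(*p)->next) {
        if (*p == n) {
            *p = n->next;
            break;
        }
    }
    if (!mr->notifiers) {
        iommu_notify_stopped(mr); /* last user went away */
    }
}
```

With this shape, vfio/common keeps calling the existing register/unregister entry points and never sees the hooks, which matches the "vfio knows nothing about this" goal above.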


> >>  };
> >>
> >>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> >> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >> index 808263b..a9e6e33 100644
> >> --- a/include/hw/vfio/vfio-common.h
> >> +++ b/include/hw/vfio/vfio-common.h
> >> @@ -64,9 +64,16 @@ typedef struct VFIORegion {
> >>  typedef struct VFIOAddressSpace {
> >>      AddressSpace *as;
> >>      QLIST_HEAD(, VFIOContainer) containers;
> >> +    QLIST_HEAD(, VFIOIOMMUMR) iommumrs;
> >>      QLIST_ENTRY(VFIOAddressSpace) list;
> >>  } VFIOAddressSpace;
> >>
> >> +typedef struct VFIOIOMMUMR {
> >> +    MemoryRegion *iommu;
> >> +    int users;
> >> +    QLIST_ENTRY(VFIOIOMMUMR) iommumr_next;
> >> +} VFIOIOMMUMR;
> >> +
> >>  struct VFIOGroup;
> >>
> >>  typedef struct VFIOContainer {  
> >  
> 
> 


* Re: [Qemu-devel] [PATCH qemu v16 18/19] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)
  2016-05-16  4:52     ` Alexey Kardashevskiy
@ 2016-05-16 20:20       ` Alex Williamson
  2016-05-27  0:50         ` David Gibson
  2016-05-27  3:49         ` Alexey Kardashevskiy
  0 siblings, 2 replies; 69+ messages in thread
From: Alex Williamson @ 2016-05-16 20:20 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson, Paolo Bonzini

On Mon, 16 May 2016 14:52:41 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 05/14/2016 08:26 AM, Alex Williamson wrote:
> > On Wed,  4 May 2016 16:52:30 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >  
> >> The new VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
> >> This adds the ability for VFIO common code to dynamically allocate/remove
> >> DMA windows in the host kernel when a new VFIO container is added/removed.
> >>
> >> This adds a VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
> >> and adds the just-created window to the host IOMMU list; the opposite
> >> action is taken in vfio_listener_region_del.
> >>
> >> When creating a new window, this uses a heuristic to decide on the number
> >> of TCE table levels.
> >>
> >> This should cause no guest-visible change in behavior.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >> Changes:
> >> v16:
> >> * used memory_region_iommu_get_page_sizes() in vfio_listener_region_add()
> >> * enforced no intersections between windows
> >>
> >> v14:
> >> * new to the series
> >> ---
> >>  hw/vfio/common.c | 133 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
> >>  trace-events     |   2 +
> >>  2 files changed, 125 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> index 03daf88..bd2dee8 100644
> >> --- a/hw/vfio/common.c
> >> +++ b/hw/vfio/common.c
> >> @@ -240,6 +240,18 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
> >>      return -errno;
> >>  }
> >>
> >> +static bool range_contains(hwaddr start, hwaddr end, hwaddr addr)
> >> +{
> >> +    return start <= addr && addr <= end;
> >> +}  
> >
> > a) If you want a "range_foo" function then put it in range.h
> > b) I suspect there are already range.h functions that can do this.
> >  
> >> +
> >> +static bool vfio_host_win_intersects(VFIOHostDMAWindow *hostwin,
> >> +                                     hwaddr min_iova, hwaddr max_iova)
> >> +{
> >> +    return range_contains(hostwin->min_iova, hostwin->max_iova, min_iova) ||
> >> +        range_contains(min_iova, max_iova, hostwin->min_iova);
> >> +}  
> >
> > How is this different than ranges_overlap()?  
> >> +
> >>  static VFIOHostDMAWindow *vfio_host_win_lookup(VFIOContainer *container,
> >>                                                 hwaddr min_iova, hwaddr max_iova)
> >>  {
> >> @@ -279,6 +291,14 @@ static int vfio_host_win_add(VFIOContainer *container,
> >>      return 0;
> >>  }
> >>
> >> +static void vfio_host_win_del(VFIOContainer *container, hwaddr min_iova)
> >> +{
> >> +    VFIOHostDMAWindow *hostwin = vfio_host_win_lookup(container, min_iova, 1);
> >> +
> >> +    g_assert(hostwin);  
> >
> > Handle the error please.  
> 
> Will this be enough?
> 
>      if (!hostwin) {
>          error_report("%s: Cannot delete missing window at %"HWADDR_PRIx,
>                       __func__, min_iova);
>          return;
>      }

Better.  I was really thinking of returning an error to the caller, but if
the caller has no return path, perhaps this is as good as we can do.
Expect me to push back on any assert() calls added to vfio.
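For reference, a sketch showing that the quoted intersection helper reduces to the canonical closed-interval overlap test (which is essentially what range.h's ranges_overlap() computes, modulo inclusive bounds vs. start/length; the names below are local to this sketch):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t hwaddr;

/* The patch's formulation: either the other range's start falls inside
 * the window, or the window's start falls inside the other range. */
static bool win_intersects(hwaddr win_min, hwaddr win_max,
                           hwaddr min_iova, hwaddr max_iova)
{
    return (win_min <= min_iova && min_iova <= win_max) ||
           (min_iova <= win_min && win_min <= max_iova);
}

/* Canonical closed-interval overlap test: [a1,a2] and [b1,b2] overlap
 * iff each starts no later than the other ends. */
static bool overlap_inclusive(hwaddr a1, hwaddr a2, hwaddr b1, hwaddr b2)
{
    return a1 <= b2 && b1 <= a2;
}
```

For well-formed ranges (min <= max) the two predicates agree, which is why an open-coded helper in vfio code is redundant.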


> >> +    QLIST_REMOVE(hostwin, hostwin_next);
> >> +}
> >> +
> >>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> >>  {
> >>      return (!memory_region_is_ram(section->mr) &&
> >> @@ -392,6 +412,69 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >>      }
> >>      end = int128_get64(int128_sub(llend, int128_one()));
> >>
> >> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> >> +        VFIOHostDMAWindow *hostwin;
> >> +        unsigned pagesizes = memory_region_iommu_get_page_sizes(section->mr);
> >> +        unsigned pagesize = (hwaddr)1 << ctz64(pagesizes);
> >> +        unsigned entries, pages;
> >> +        struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
> >> +
> >> +        trace_vfio_listener_region_add_iommu(iova, end);
> >> +        /*
> >> +         * FIXME: For VFIO iommu types which have KVM acceleration to
> >> +         * avoid bouncing all map/unmaps through qemu this way, this
> >> +         * would be the right place to wire that up (tell the KVM
> >> +         * device emulation the VFIO iommu handles to use).
> >> +         */
> >> +        create.window_size = int128_get64(section->size);
> >> +        create.page_shift = ctz64(pagesize);
> >> +        /*
> >> +         * SPAPR host supports multilevel TCE tables, there is some
> >> +         * heuristic to decide how many levels we want for our table:
> >> +         * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
> >> +         */
> >> +        entries = create.window_size >> create.page_shift;
> >> +        pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
> >> +        pages = MAX(pow2ceil(pages) - 1, 1); /* Round up */
> >> +        create.levels = ctz64(pages) / 6 + 1;
> >> +
> >> +        /* For now intersections are not allowed, we may relax this later */
> >> +        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
> >> +            if (vfio_host_win_intersects(hostwin,
> >> +                    section->offset_within_address_space,
> >> +                    section->offset_within_address_space +
> >> +                    create.window_size - 1)) {
> >> +                goto fail;
> >> +            }
> >> +        }
> >> +
> >> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> >> +        if (ret) {
> >> +            error_report("Failed to create a window, ret = %d (%m)", ret);
> >> +            goto fail;
> >> +        }
> >> +
> >> +        if (create.start_addr != section->offset_within_address_space) {
> >> +            struct vfio_iommu_spapr_tce_remove remove = {
> >> +                .argsz = sizeof(remove),
> >> +                .start_addr = create.start_addr
> >> +            };
> >> +            error_report("Host doesn't support DMA window at %"HWADDR_PRIx", must be %"PRIx64,
> >> +                         section->offset_within_address_space,
> >> +                         create.start_addr);
> >> +            ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> >> +            ret = -EINVAL;
> >> +            goto fail;
> >> +        }
> >> +        trace_vfio_spapr_create_window(create.page_shift,
> >> +                                       create.window_size,
> >> +                                       create.start_addr);
> >> +
> >> +        vfio_host_win_add(container, create.start_addr,
> >> +                          create.start_addr + create.window_size - 1,
> >> +                          1ULL << create.page_shift);
> >> +    }  
> >
> > This is a function on its own; split it out.  And why not stop pretending
> > prereg is some sort of generic interface; let's just make a spapr
> > support file.
> 
> 
> Yet another new file - spapr.c, or rename prereg.c to spapr.c and add this 
> stuff there?

prereg.c is already spapr-specific, so I'd rename it and potentially
add this to it.

> Also, what bits? I'd keep the VFIOHostDMAWindow business in here and move 
> vfio_iommu_spapr_tce_create and the ioctl to the new place, will this be ok? 
> Thanks.

VFIOHostDMAWindow is actually generic, so keeping it here makes sense.
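For reference, the levels mapping the quoted comment describes, written out directly. This is only a sketch of the intended table, not the patch's arithmetic (which derives it from the entry count and host page size via pow2ceil/ctz64); the units are assumed to be TCE table pages:

```c
#include <stdint.h>

/* Intended mapping per the patch comment:
 * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4 */
static unsigned tce_levels(uint64_t pages)
{
    if (pages <= 64) {
        return 1;
    } else if (pages <= 4096) {
        return 2;
    } else if (pages <= 262144) {
        return 3;
    }
    return 4;
}
```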
  
> >> +
> >>      if (!vfio_host_win_lookup(container, iova, end)) {
> >>          error_report("vfio: IOMMU container %p can't map guest IOVA region"
> >>                       " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
> >> @@ -566,6 +649,22 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >>                       container, iova, int128_get64(llsize), ret);
> >>      }
> >>
> >> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> >> +        struct vfio_iommu_spapr_tce_remove remove = {
> >> +            .argsz = sizeof(remove),
> >> +            .start_addr = section->offset_within_address_space,
> >> +        };
> >> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> >> +        if (ret) {
> >> +            error_report("Failed to remove window at %"PRIx64,
> >> +                         remove.start_addr);
> >> +        }
> >> +
> >> +        vfio_host_win_del(container, section->offset_within_address_space);
> >> +
> >> +        trace_vfio_spapr_remove_window(remove.start_addr);
> >> +    }  
> >
> > This would be in that spapr file too.
> >  
> >> +
> >>      if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
> >>          iommu->iommu_ops->vfio_stop(section->mr);
> >>      }
> >> @@ -957,11 +1056,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>              }
> >>          }
> >>
> >> -        /*
> >> -         * This only considers the host IOMMU's 32-bit window.  At
> >> -         * some point we need to add support for the optional 64-bit
> >> -         * window and dynamic windows
> >> -         */
> >>          info.argsz = sizeof(info);
> >>          ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
> >>          if (ret) {
> >> @@ -970,11 +1064,30 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>              goto listener_release_exit;
> >>          }
> >>
> >> -        /* The default table uses 4K pages */
> >> -        vfio_host_win_add(container, info.dma32_window_start,
> >> -                          info.dma32_window_start +
> >> -                          info.dma32_window_size - 1,
> >> -                          0x1000);
> >> +        if (v2) {
> >> +            /*
> >> +             * There is a default window in the just-created container.
> >> +             * To make region_add/del simpler, we'd better remove this
> >> +             * window now and let the iommu_listener callbacks
> >> +             * create/remove windows when needed.
> >> +             */
> >> +            struct vfio_iommu_spapr_tce_remove remove = {
> >> +                .argsz = sizeof(remove),
> >> +                .start_addr = info.dma32_window_start,
> >> +            };
> >> +            ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> >> +            if (ret) {
> >> +                error_report("vfio: VFIO_IOMMU_SPAPR_TCE_REMOVE failed: %m");
> >> +                ret = -errno;
> >> +                goto free_container_exit;
> >> +            }
> >> +        } else {
> >> +            /* The default table uses 4K pages */
> >> +            vfio_host_win_add(container, info.dma32_window_start,
> >> +                              info.dma32_window_start +
> >> +                              info.dma32_window_size - 1,
> >> +                              0x1000);
> >> +        }
> >>      } else {
> >>          error_report("vfio: No available IOMMU models");
> >>          ret = -EINVAL;
> >> diff --git a/trace-events b/trace-events
> >> index d0d8615..b5419de 100644
> >> --- a/trace-events
> >> +++ b/trace-events
> >> @@ -1739,6 +1739,8 @@ vfio_region_finalize(const char *name, int index) "Device %s, region %d"
> >>  vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
> >>  vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> >>  vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> >> +vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
> >> +vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
> >>
> >>  # hw/vfio/platform.c
> >>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"  
> >  
> 
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v16 14/19] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2)
  2016-05-16  1:10     ` Alexey Kardashevskiy
@ 2016-05-16 20:20       ` Alex Williamson
  2016-05-26  4:53         ` David Gibson
  0 siblings, 1 reply; 69+ messages in thread
From: Alex Williamson @ 2016-05-16 20:20 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson, Paolo Bonzini

On Mon, 16 May 2016 11:10:05 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 05/14/2016 08:25 AM, Alex Williamson wrote:
> > On Wed,  4 May 2016 16:52:26 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >  
> >> This makes use of the new "memory registering" feature. The idea is
> >> to provide userspace the ability to notify the host kernel about pages
> >> which are going to be used for DMA. Having this information, the host
> >> kernel can pin them all once per user process, do locked pages
> >> accounting (once) and not spend time doing that at run time, with
> >> possible failures which cannot be handled nicely in some cases.
> >>
> >> This adds a prereg memory listener which listens on address_space_memory
> >> and notifies a VFIO container about memory which needs to be
> >> pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
> >>
> >> As there is no per-IOMMU-type release() callback anymore, this stores
> >> the IOMMU type in the container so vfio_listener_release() can determine
> >> if it needs to unregister @prereg_listener.
> >>
> >> The feature is only enabled for SPAPR IOMMU v2. Host kernel changes
> >> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> >> not call it when v2 is detected and enabled.
> >>
> >> This requires guest RAM blocks to be host page size aligned; however
> >> this is not new as KVM already requires memory slots to be host page
> >> size aligned.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >> Changes:
> >> v16:
> >> * switched to 64bit math everywhere as there is no chance to see
> >> region_add on RAM blocks even remotely close to 1<<64 bytes.
> >>
> >> v15:
> >> * banned unaligned sections
> >> * added an vfio_prereg_gpa_to_ua() helper
> >>
> >> v14:
> >> * s/free_container_exit/listener_release_exit/g
> >> * added "if memory_region_is_iommu()" to vfio_prereg_listener_skipped_section
> >> ---
> >>  hw/vfio/Makefile.objs         |   1 +
> >>  hw/vfio/common.c              |  38 +++++++++---
> >>  hw/vfio/prereg.c              | 137 ++++++++++++++++++++++++++++++++++++++++++
> >>  include/hw/vfio/vfio-common.h |   4 ++
> >>  trace-events                  |   2 +
> >>  5 files changed, 172 insertions(+), 10 deletions(-)
> >>  create mode 100644 hw/vfio/prereg.c
> >>
> >> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> >> index ceddbb8..5800e0e 100644
> >> --- a/hw/vfio/Makefile.objs
> >> +++ b/hw/vfio/Makefile.objs
> >> @@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
> >>  obj-$(CONFIG_SOFTMMU) += platform.o
> >>  obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
> >>  obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
> >> +obj-$(CONFIG_SOFTMMU) += prereg.o
> >>  endif
> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> index 2050040..496eb82 100644
> >> --- a/hw/vfio/common.c
> >> +++ b/hw/vfio/common.c
> >> @@ -501,6 +501,9 @@ static const MemoryListener vfio_memory_listener = {
> >>  static void vfio_listener_release(VFIOContainer *container)
> >>  {
> >>      memory_listener_unregister(&container->listener);
> >> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> >> +        memory_listener_unregister(&container->prereg_listener);
> >> +    }
> >>  }
> >>
> >>  int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
> >> @@ -808,8 +811,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>              goto free_container_exit;
> >>          }
> >>
> >> -        ret = ioctl(fd, VFIO_SET_IOMMU,
> >> -                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
> >> +        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
> >> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
> >>          if (ret) {
> >>              error_report("vfio: failed to set iommu for container: %m");
> >>              ret = -errno;
> >> @@ -834,8 +837,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>          if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
> >>              container->iova_pgsizes = info.iova_pgsizes;
> >>          }
> >> -    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> >> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
> >> +               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
> >>          struct vfio_iommu_spapr_tce_info info;
> >> +        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
> >>
> >>          ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
> >>          if (ret) {
> >> @@ -843,7 +848,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>              ret = -errno;
> >>              goto free_container_exit;
> >>          }
> >> -        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
> >> +        container->iommu_type =
> >> +            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
> >> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
> >>          if (ret) {
> >>              error_report("vfio: failed to set iommu for container: %m");
> >>              ret = -errno;
> >> @@ -855,11 +862,22 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>           * when container fd is closed so we do not call it explicitly
> >>           * in this file.
> >>           */
> >> -        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> >> -        if (ret) {
> >> -            error_report("vfio: failed to enable container: %m");
> >> -            ret = -errno;
> >> -            goto free_container_exit;
> >> +        if (!v2) {
> >> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> >> +            if (ret) {
> >> +                error_report("vfio: failed to enable container: %m");
> >> +                ret = -errno;
> >> +                goto free_container_exit;
> >> +            }
> >> +        } else {
> >> +            container->prereg_listener = vfio_prereg_listener;
> >> +
> >> +            memory_listener_register(&container->prereg_listener,
> >> +                                     &address_space_memory);
> >> +            if (container->error) {
> >> +                error_report("vfio: RAM memory listener initialization failed for container");
> >> +                goto listener_release_exit;
> >> +            }
> >>          }
> >>
> >>          /*
> >> @@ -872,7 +890,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>          if (ret) {
> >>              error_report("vfio: VFIO_IOMMU_SPAPR_TCE_GET_INFO failed: %m");
> >>              ret = -errno;
> >> -            goto free_container_exit;
> >> +            goto listener_release_exit;
> >>          }
> >>          container->min_iova = info.dma32_window_start;
> >>          container->max_iova = container->min_iova + info.dma32_window_size - 1;
> >> diff --git a/hw/vfio/prereg.c b/hw/vfio/prereg.c
> >> new file mode 100644
> >> index 0000000..d0e4728
> >> --- /dev/null
> >> +++ b/hw/vfio/prereg.c
> >> @@ -0,0 +1,137 @@
> >> +/*
> >> + * DMA memory preregistration
> >> + *
> >> + * Authors:
> >> + *  Alexey Kardashevskiy <aik@ozlabs.ru>
> >> + *
> >> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> >> + * the COPYING file in the top-level directory.
> >> + */
> >> +
> >> +#include "qemu/osdep.h"
> >> +#include <sys/ioctl.h>
> >> +#include <linux/vfio.h>
> >> +
> >> +#include "hw/vfio/vfio-common.h"
> >> +#include "hw/hw.h"
> >> +#include "qemu/error-report.h"
> >> +#include "trace.h"
> >> +
> >> +static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
> >> +{
> >> +    if (memory_region_is_iommu(section->mr)) {
> >> +        error_report("Cannot possibly preregister IOMMU memory");  
> >
> > What is a user supposed to do with this error_report()?  Is it
> > continue-able?  How is it possible?  What should they do differently?  
> 
> 
> If I remember correctly, David did have theories where this may be 
> possible, though not with the existing code today. Should this be assert() 
> or abort(), which is better here?

If it's a hardware configuration error, then use hw_error(), I prefer
not to add either assert() or abort() calls to vfio.

> >> +        return true;
> >> +    }
> >> +
> >> +    return !memory_region_is_ram(section->mr) ||
> >> +            memory_region_is_skip_dump(section->mr);
> >> +}
> >> +
> >> +static void *vfio_prereg_gpa_to_ua(MemoryRegionSection *section, hwaddr gpa)  
> >
> > What's "ua"?  
> 
> 
> Userspace address.

But we use it to set a vaddr below, so let's just call it vaddr.

> >  
> >> +{
> >> +    return memory_region_get_ram_ptr(section->mr) +
> >> +        section->offset_within_region +
> >> +        (gpa - section->offset_within_address_space);
> >> +}
> >> +
> >> +static void vfio_prereg_listener_region_add(MemoryListener *listener,
> >> +                                            MemoryRegionSection *section)
> >> +{
> >> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> >> +                                            prereg_listener);
> >> +    const hwaddr gpa = section->offset_within_address_space;
> >> +    hwaddr end;
> >> +    int ret;
> >> +    hwaddr page_mask = qemu_real_host_page_mask;
> >> +    struct vfio_iommu_spapr_register_memory reg = {
> >> +        .argsz = sizeof(reg),
> >> +        .flags = 0,
> >> +    };  
> >
> > So we're just pretending that this spapr-specific code is some sort of
> > generic pre-registration interface?  
> 
> Yes.

:-\

> >> +
> >> +    if (vfio_prereg_listener_skipped_section(section)) {
> >> +        trace_vfio_listener_region_add_skip(
> >> +                section->offset_within_address_space,
> >> +                section->offset_within_address_space +
> >> +                int128_get64(int128_sub(section->size, int128_one())));
> >> +        return;
> >> +    }
> >> +
> >> +    if (unlikely((section->offset_within_address_space & ~page_mask) ||
> >> +                 (section->offset_within_region & ~page_mask) ||
> >> +                 (int128_get64(section->size) & ~page_mask))) {
> >> +        error_report("%s received unaligned region", __func__);
> >> +        return;
> >> +    }
> >> +
> >> +    end = section->offset_within_address_space + int128_get64(section->size);
> >> +    g_assert(gpa < end);
> >> +
> >> +    memory_region_ref(section->mr);
> >> +
> >> +    reg.vaddr = (__u64) vfio_prereg_gpa_to_ua(section, gpa);  
> >
> > Hmm, why wasn't that simply gpa_to_vaddr?  
> 
> I wanted to keep a prefix on all functions, even static ones, as it's easier 
> to grep. Bad idea?

My question about "ua" means that it's not obvious what we're returning
based on the name of the function alone, so I would avoid such a name.
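For reference, a sketch of the translation the helper performs, whatever name it ends up with (ram_ptr below stands in for memory_region_get_ram_ptr(section->mr), and the section fields are passed in explicitly):

```c
#include <stdint.h>

typedef uint64_t hwaddr;

/* Guest physical address -> offset into the section -> host virtual
 * address backing that guest page. */
static uintptr_t gpa_to_vaddr(uintptr_t ram_ptr,
                              hwaddr offset_within_region,
                              hwaddr offset_within_address_space,
                              hwaddr gpa)
{
    return ram_ptr + offset_within_region +
           (gpa - offset_within_address_space);
}
```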

> >> +    reg.size = end - gpa;
> >> +
> >> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
> >> +    trace_vfio_ram_register(reg.vaddr, reg.size, ret ? -errno : 0);
> >> +    if (ret) {
> >> +        /*
> >> +         * On the initfn path, store the first error in the container so we
> >> +         * can gracefully fail.  Runtime, there's not much we can do other
> >> +         * than throw a hardware error.
> >> +         */
> >> +        if (!container->initialized) {
> >> +            if (!container->error) {
> >> +                container->error = ret;
> >> +            }
> >> +        } else {
> >> +            hw_error("vfio: Memory registering failed, unable to continue");
> >> +        }
> >> +    }
> >> +}
> >> +
> >> +static void vfio_prereg_listener_region_del(MemoryListener *listener,
> >> +                                            MemoryRegionSection *section)
> >> +{
> >> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> >> +                                            prereg_listener);
> >> +    const hwaddr gpa = section->offset_within_address_space;
> >> +    hwaddr end;
> >> +    int ret;
> >> +    hwaddr page_mask = qemu_real_host_page_mask;
> >> +    struct vfio_iommu_spapr_register_memory reg = {
> >> +        .argsz = sizeof(reg),
> >> +        .flags = 0,
> >> +    };
> >> +
> >> +    if (vfio_prereg_listener_skipped_section(section)) {
> >> +        trace_vfio_listener_region_del_skip(
> >> +                section->offset_within_address_space,
> >> +                section->offset_within_address_space +
> >> +                int128_get64(int128_sub(section->size, int128_one())));
> >> +        return;
> >> +    }
> >> +
> >> +    if (unlikely((section->offset_within_address_space & ~page_mask) ||
> >> +                 (section->offset_within_region & ~page_mask) ||
> >> +                 (int128_get64(section->size) & ~page_mask))) {
> >> +        error_report("%s received unaligned region", __func__);
> >> +        return;
> >> +    }
> >> +
> >> +    end = section->offset_within_address_space + int128_get64(section->size);
> >> +    if (gpa >= end) {
> >> +        return;
> >> +    }
> >> +
> >> +    reg.vaddr = (__u64) vfio_prereg_gpa_to_ua(section, gpa);
> >> +    reg.size = end - gpa;
> >> +
> >> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
> >> +    trace_vfio_ram_unregister(reg.vaddr, reg.size, ret ? -errno : 0);
> >> +}
> >> +
> >> +const MemoryListener vfio_prereg_listener = {
> >> +    .region_add = vfio_prereg_listener_region_add,
> >> +    .region_del = vfio_prereg_listener_region_del,
> >> +};
> >> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >> index c9b6622..c72e45a 100644
> >> --- a/include/hw/vfio/vfio-common.h
> >> +++ b/include/hw/vfio/vfio-common.h
> >> @@ -73,6 +73,8 @@ typedef struct VFIOContainer {
> >>      VFIOAddressSpace *space;
> >>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
> >>      MemoryListener listener;
> >> +    MemoryListener prereg_listener;
> >> +    unsigned iommu_type;
> >>      int error;
> >>      bool initialized;
> >>      /*
> >> @@ -156,4 +158,6 @@ extern QLIST_HEAD(vfio_as_head, VFIOAddressSpace) vfio_address_spaces;
> >>  int vfio_get_region_info(VFIODevice *vbasedev, int index,
> >>                           struct vfio_region_info **info);
> >>  #endif
> >> +extern const MemoryListener vfio_prereg_listener;
> >> +
> >>  #endif /* !HW_VFIO_VFIO_COMMON_H */
> >> diff --git a/trace-events b/trace-events
> >> index dd50005..d0d8615 100644
> >> --- a/trace-events
> >> +++ b/trace-events
> >> @@ -1737,6 +1737,8 @@ vfio_region_mmap(const char *name, unsigned long offset, unsigned long end) "Reg
> >>  vfio_region_exit(const char *name, int index) "Device %s, region %d"
> >>  vfio_region_finalize(const char *name, int index) "Device %s, region %d"
> >>  vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
> >> +vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> >> +vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> >>
> >>  # hw/vfio/platform.c
> >>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"  
> 
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v16 19/19] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-05-16  6:25     ` Alexey Kardashevskiy
@ 2016-05-17  5:32       ` Bharata B Rao
  2016-05-27  4:44         ` David Gibson
  0 siblings, 1 reply; 69+ messages in thread
From: Bharata B Rao @ 2016-05-17  5:32 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, Alexander Graf, Alex Williamson, qemu-ppc,
	Paolo Bonzini, David Gibson

On Mon, May 16, 2016 at 11:55 AM, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> On 05/13/2016 06:41 PM, Bharata B Rao wrote:
>>
>> On Wed, May 4, 2016 at 12:22 PM, Alexey Kardashevskiy <aik@ozlabs.ru>
>> wrote:
>
>
>>
>>> +
>>> +    avail = SPAPR_PCI_DMA_MAX_WINDOWS -
>>> spapr_phb_get_active_win_num(sphb);
>>> +
>>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>>> +    rtas_st(rets, 1, avail);
>>> +    rtas_st(rets, 2, max_window_size);
>>> +    rtas_st(rets, 3, pgmask);
>>> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
>>> +
>>> +    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size,
>>> pgmask);
>>> +    return;
>>> +
>>> +param_error_exit:
>>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>>> +}
>>> +
>>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
>>> +                                          sPAPRMachineState *spapr,
>>> +                                          uint32_t token, uint32_t
>>> nargs,
>>> +                                          target_ulong args,
>>> +                                          uint32_t nret, target_ulong
>>> rets)
>>> +{
>>> +    sPAPRPHBState *sphb;
>>> +    sPAPRTCETable *tcet = NULL;
>>> +    uint32_t addr, page_shift, window_shift, liobn;
>>> +    uint64_t buid;
>>> +
>>> +    if ((nargs != 5) || (nret != 4)) {
>>> +        goto param_error_exit;
>>> +    }
>>> +
>>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>>> +    addr = rtas_ld(args, 0);
>>> +    sphb = spapr_pci_find_phb(spapr, buid);
>>> +    if (!sphb || !sphb->ddw_enabled) {
>>> +        goto param_error_exit;
>>> +    }
>>> +
>>> +    page_shift = rtas_ld(args, 3);
>>> +    window_shift = rtas_ld(args, 4);
>>
>>
>> The kernel has a bug due to which the wrong window_shift gets returned
>> here. I have posted a possible fix here:
>> https://patchwork.ozlabs.org/patch/621497/
>>
>> I have tried to work around this issue in QEMU too
>> https://lists.nongnu.org/archive/html/qemu-ppc/2016-04/msg00226.html
>>
>> But the above workaround involves changing the memory representation
>> in the DT.
>
>
> What is wrong with this workaround?

The above workaround results in different memory representations in the
DT before and after it is applied.

Currently for -m 2G, -numa node,nodeid=0,mem=1G -numa
node,nodeid=1,mem=0.5G, we will have the following nodes in DT:

memory@0
memory@40000000
ibm,dynamic-reconfiguration-memory

ibm,dynamic-memory will have only DR LMBs:

[root@localhost ibm,dynamic-reconfiguration-memory]# hexdump ibm,dynamic-memory
0000000 0000 000a 0000 0000 8000 0000 8000 0008
0000010 0000 0000 ffff ffff 0000 0000 0000 0000
0000020 9000 0000 8000 0009 0000 0000 ffff ffff
0000030 0000 0000 0000 0000 a000 0000 8000 000a
0000040 0000 0000 ffff ffff 0000 0000 0000 0000
0000050 b000 0000 8000 000b 0000 0000 ffff ffff
0000060 0000 0000 0000 0000 c000 0000 8000 000c
0000070 0000 0000 ffff ffff 0000 0000 0000 0000
0000080 d000 0000 8000 000d 0000 0000 ffff ffff
0000090 0000 0000 0000 0000 e000 0000 8000 000e
00000a0 0000 0000 ffff ffff 0000 0000 0000 0000
00000b0 f000 0000 8000 000f 0000 0000 ffff ffff
00000c0 0000 0000 0000 0001 0000 0000 8000 0010
00000d0 0000 0000 ffff ffff 0000 0000 0000 0001
00000e0 1000 0000 8000 0011 0000 0000 ffff ffff
00000f0 0000 0000

The memory region looks like this:

memory-region: system
  0000000000000000-ffffffffffffffff (prio 0, RW): system
    0000000000000000-000000005fffffff (prio 0, RW): ppc_spapr.ram
    0000000080000000-000000011fffffff (prio 0, RW): hotplug-memory

After this workaround, all this will change like below:

memory@0
ibm,dynamic-reconfiguration-memory

All LMBs in ibm,dynamic-memory:

[root@localhost ibm,dynamic-reconfiguration-memory]# hexdump ibm,dynamic-memory

0000000 0000 0010 0000 0000 0000 0000 8000 0000
0000010 0000 0000 0000 0000 0000 0080 0000 0000
0000020 1000 0000 8000 0001 0000 0000 0000 0000
0000030 0000 0080 0000 0000 2000 0000 8000 0002
0000040 0000 0000 0000 0000 0000 0080 0000 0000
0000050 3000 0000 8000 0003 0000 0000 0000 0000
0000060 0000 0080 0000 0000 4000 0000 8000 0004
0000070 0000 0000 0000 0001 0000 0008 0000 0000
0000080 5000 0000 8000 0005 0000 0000 0000 0001
0000090 0000 0008 0000 0000 6000 0000 8000 0006
00000a0 0000 0000 ffff ffff 0000 0000 0000 0000
00000b0 7000 0000 8000 0007 0000 0000 ffff ffff
00000c0 0000 0000 0000 0000 8000 0000 8000 0008
00000d0 0000 0000 ffff ffff 0000 0000 0000 0000
00000e0 9000 0000 8000 0009 0000 0000 ffff ffff
00000f0 0000 0000 0000 0000 a000 0000 8000 000a
0000100 0000 0000 ffff ffff 0000 0000 0000 0000
0000110 b000 0000 8000 000b 0000 0000 ffff ffff
0000120 0000 0000 0000 0000 c000 0000 8000 000c
0000130 0000 0000 ffff ffff 0000 0000 0000 0000
0000140 d000 0000 8000 000d 0000 0000 ffff ffff
0000150 0000 0000 0000 0000 e000 0000 8000 000e
0000160 0000 0000 ffff ffff 0000 0000 0000 0000
0000170 f000 0000 8000 000f 0000 0000 ffff ffff
0000180 0000 0000

The hotplug memory region gets a new address range now:

memory-region: system
  0000000000000000-ffffffffffffffff (prio 0, RW): system
    0000000000000000-000000005fffffff (prio 0, RW): ppc_spapr.ram
    0000000060000000-00000000ffffffff (prio 0, RW): hotplug-memory


So when a guest that was booted with an older QEMU is migrated to a newer
QEMU that has this workaround, it will start exhibiting the above
changes after the first reboot post migration.

If the user has done memory hotplug by explicitly specifying an address
on the source, then even the migration would fail because the address
specified at the target will not be part of the hotplug-memory range.

Hence I believe we shouldn't work around this in this manner but should
have the workaround in the DDW code, where the window can be easily fixed.

>
>> Hence I feel until the guest kernel changes are available, a
>> simpler work around would be to discard the window_shift value above
>> and recalculate the right value as below:
>>
>> if (machine->ram_size == machine->maxram_size) {
>>     max_window_size = machine->ram_size;
>> } else {
>>      MemoryHotplugState *hpms = &spapr->hotplug_memory;
>>      max_window_size = hpms->base + memory_region_size(&hpms->mr);
>> }
>> window_shift = max_window_size >> SPAPR_TCE_PAGE_SHIFT;
>>
>> and create DDW based on this calculated window_shift value. Does that
>> sound reasonable ?
>
>
> The workaround should only do that for the second window, at least, or for
> the default one but with page size at least 64K; otherwise it is going to be
> a waste of memory (2MB per each 1GB of guest RAM).

Ok, will sync up with you separately to understand more about the
'two' windows here.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [Qemu-devel] [RFC PATCH qemu] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping listening
  2016-05-16 20:13       ` Alex Williamson
@ 2016-05-20  8:04         ` Alexey Kardashevskiy
  2016-05-20 15:19           ` Alex Williamson
  2016-05-27  0:43           ` David Gibson
  0 siblings, 2 replies; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-20  8:04 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alex Williamson, David Gibson,
	Paolo Bonzini

The sPAPR TCE tables manage two copies of the table when VFIO is using
an IOMMU - a guest view of the table and a hardware TCE table. If there
is no VFIO presence in the address space, then just the guest view is
used and, in this case, it is allocated in KVM. However, since there is
no support yet for VFIO in the KVM TCE hypercalls, when we start using
VFIO we need to move the guest view from KVM to userspace; and we need
to do this for every IOMMU on a bus with VFIO devices.

This adds notify_started/notify_stopped callbacks to MemoryRegionIOMMUOps
to notify the IOMMU that listeners were set/removed. This allows the IOMMU
to take the necessary steps before actual notifications happen and to do
proper cleanup when the last notifier is removed.

This implements the callbacks for the sPAPR IOMMU - notify_started()
reallocates the guest view to userspace, notify_stopped() does
the opposite.

This removes the explicit spapr_tce_set_need_vfio() call from the PCI
hotplug path as the new callbacks do this better - they notify the IOMMU
at the exact moment when the configuration is changed, and this also
covers the case of PCI hot unplug.

This adds MemoryRegion* to memory_region_unregister_iommu_notifier()
as we need iommu_ops to call notify_stopped() and Notifier* does not
store the owner.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---


Is this any better? If so, I'll repost as a part of "v17". Thanks.



---
Changes:
v17:
* replaced IOMMU users counting with simple QLIST_EMPTY()
* renamed the callbacks
* removed requirement for region_del() to be called on memory_listener_unregister()

v16:
* added a use counter in VFIOAddressSpace->VFIOIOMMUMR

v15:
* s/need_vfio/vfio-Users/g
---
 hw/ppc/spapr_iommu.c  | 12 ++++++++++++
 hw/ppc/spapr_pci.c    |  6 ------
 hw/vfio/common.c      |  5 +++--
 include/exec/memory.h |  8 +++++++-
 memory.c              | 10 +++++++++-
 5 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 73bc26b..fd38006 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -156,6 +156,16 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
     return 1ULL << tcet->page_shift;
 }
 
+static void spapr_tce_notify_started(MemoryRegion *iommu)
+{
+    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), true);
+}
+
+static void spapr_tce_notify_stopped(MemoryRegion *iommu)
+{
+    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), false);
+}
+
 static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
 static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
 
@@ -240,6 +250,8 @@ static const VMStateDescription vmstate_spapr_tce_table = {
 static MemoryRegionIOMMUOps spapr_iommu_ops = {
     .translate = spapr_tce_translate_iommu,
     .get_page_sizes = spapr_tce_get_page_sizes,
+    .notify_started = spapr_tce_notify_started,
+    .notify_stopped = spapr_tce_notify_stopped,
 };
 
 static int spapr_tce_table_realize(DeviceState *dev)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 68e77b0..7c2c622 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1089,12 +1089,6 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
     void *fdt = NULL;
     int fdt_start_offset = 0, fdt_size;
 
-    if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
-        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
-
-        spapr_tce_set_need_vfio(tcet, true);
-    }
-
     if (dev->hotplugged) {
         fdt = create_device_tree(&fdt_size);
         fdt_start_offset = spapr_create_pci_child_dt(phb, pdev, fdt, 0);
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 2e4f703..d1fa9ab 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -523,7 +523,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
 
         QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
             if (giommu->iommu == section->mr) {
-                memory_region_unregister_iommu_notifier(&giommu->n);
+                memory_region_unregister_iommu_notifier(giommu->iommu,
+                                                        &giommu->n);
                 QLIST_REMOVE(giommu, giommu_next);
                 g_free(giommu);
                 break;
@@ -1040,7 +1041,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
         QLIST_REMOVE(container, next);
 
         QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
-            memory_region_unregister_iommu_notifier(&giommu->n);
+            memory_region_unregister_iommu_notifier(giommu->iommu, &giommu->n);
             QLIST_REMOVE(giommu, giommu_next);
             g_free(giommu);
         }
diff --git a/include/exec/memory.h b/include/exec/memory.h
index bfb08d4..1c41eb6 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -151,6 +151,10 @@ struct MemoryRegionIOMMUOps {
     IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
     /* Returns supported page sizes */
     uint64_t (*get_page_sizes)(MemoryRegion *iommu);
+    /* Called when the first notifier is set */
+    void (*notify_started)(MemoryRegion *iommu);
+    /* Called when the last notifier is removed */
+    void (*notify_stopped)(MemoryRegion *iommu);
 };
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
@@ -620,9 +624,11 @@ void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write);
  * memory_region_unregister_iommu_notifier: unregister a notifier for
  * changes to IOMMU translation entries.
  *
+ * @mr: the memory region which was observed and for which notify_stopped()
+ *      needs to be called
  * @n: the notifier to be removed.
  */
-void memory_region_unregister_iommu_notifier(Notifier *n);
+void memory_region_unregister_iommu_notifier(MemoryRegion *mr, Notifier *n);
 
 /**
  * memory_region_name: get a memory region's name
diff --git a/memory.c b/memory.c
index d22cf5e..fcf978a 100644
--- a/memory.c
+++ b/memory.c
@@ -1512,6 +1512,10 @@ bool memory_region_is_logging(MemoryRegion *mr, uint8_t client)
 
 void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n)
 {
+    if (mr->iommu_ops->notify_started &&
+        QLIST_EMPTY(&mr->iommu_notify.notifiers)) {
+        mr->iommu_ops->notify_started(mr);
+    }
     notifier_list_add(&mr->iommu_notify, n);
 }
 
@@ -1545,9 +1549,13 @@ void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write)
     }
 }
 
-void memory_region_unregister_iommu_notifier(Notifier *n)
+void memory_region_unregister_iommu_notifier(MemoryRegion *mr, Notifier *n)
 {
     notifier_remove(n);
+    if (mr->iommu_ops->notify_stopped &&
+        QLIST_EMPTY(&mr->iommu_notify.notifiers)) {
+        mr->iommu_ops->notify_stopped(mr);
+    }
 }
 
 void memory_region_notify_iommu(MemoryRegion *mr,
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [Qemu-devel] [RFC PATCH qemu] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping listening
  2016-05-20  8:04         ` [Qemu-devel] [RFC PATCH qemu] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping listening Alexey Kardashevskiy
@ 2016-05-20 15:19           ` Alex Williamson
  2016-05-27  0:43           ` David Gibson
  1 sibling, 0 replies; 69+ messages in thread
From: Alex Williamson @ 2016-05-20 15:19 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-devel, qemu-ppc, David Gibson, Paolo Bonzini

On Fri, 20 May 2016 18:04:42 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> The sPAPR TCE tables manage two copies of the table when VFIO is using
> an IOMMU - a guest view of the table and a hardware TCE table. If there
> is no VFIO presence in the address space, then just the guest view is
> used and, in this case, it is allocated in KVM. However, since there is
> no support yet for VFIO in the KVM TCE hypercalls, when we start using
> VFIO we need to move the guest view from KVM to userspace; and we need
> to do this for every IOMMU on a bus with VFIO devices.
> 
> This adds notify_started/notify_stopped callbacks to MemoryRegionIOMMUOps
> to notify the IOMMU that listeners were set/removed. This allows the IOMMU
> to take the necessary steps before actual notifications happen and to do
> proper cleanup when the last notifier is removed.
> 
> This implements the callbacks for the sPAPR IOMMU - notify_started()
> reallocates the guest view to userspace, notify_stopped() does
> the opposite.
> 
> This removes the explicit spapr_tce_set_need_vfio() call from the PCI
> hotplug path as the new callbacks do this better - they notify the IOMMU
> at the exact moment when the configuration is changed, and this also
> covers the case of PCI hot unplug.
> 
> This adds MemoryRegion* to memory_region_unregister_iommu_notifier()
> as we need iommu_ops to call notify_stopped() and Notifier* does not
> store the owner.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> 
> 
> Is this any better? If so, I'll repost as a part of "v17". Thanks.

Massive improvement, IMO.  Thanks,

Alex

> ---
> Changes:
> v17:
> * replaced IOMMU users counting with simple QLIST_EMPTY()
> * renamed the callbacks
> * removed requirement for region_del() to be called on memory_listener_unregister()
> 
> v16:
> * added a use counter in VFIOAddressSpace->VFIOIOMMUMR
> 
> v15:
> * s/need_vfio/vfio-Users/g
> ---
>  hw/ppc/spapr_iommu.c  | 12 ++++++++++++
>  hw/ppc/spapr_pci.c    |  6 ------
>  hw/vfio/common.c      |  5 +++--
>  include/exec/memory.h |  8 +++++++-
>  memory.c              | 10 +++++++++-
>  5 files changed, 31 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 73bc26b..fd38006 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -156,6 +156,16 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
>      return 1ULL << tcet->page_shift;
>  }
>  
> +static void spapr_tce_notify_started(MemoryRegion *iommu)
> +{
> +    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), true);
> +}
> +
> +static void spapr_tce_notify_stopped(MemoryRegion *iommu)
> +{
> +    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), false);
> +}
> +
>  static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
>  static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
>  
> @@ -240,6 +250,8 @@ static const VMStateDescription vmstate_spapr_tce_table = {
>  static MemoryRegionIOMMUOps spapr_iommu_ops = {
>      .translate = spapr_tce_translate_iommu,
>      .get_page_sizes = spapr_tce_get_page_sizes,
> +    .notify_started = spapr_tce_notify_started,
> +    .notify_stopped = spapr_tce_notify_stopped,
>  };
>  
>  static int spapr_tce_table_realize(DeviceState *dev)
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 68e77b0..7c2c622 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -1089,12 +1089,6 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>      void *fdt = NULL;
>      int fdt_start_offset = 0, fdt_size;
>  
> -    if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
> -        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
> -
> -        spapr_tce_set_need_vfio(tcet, true);
> -    }
> -
>      if (dev->hotplugged) {
>          fdt = create_device_tree(&fdt_size);
>          fdt_start_offset = spapr_create_pci_child_dt(phb, pdev, fdt, 0);
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 2e4f703..d1fa9ab 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -523,7 +523,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
>  
>          QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
>              if (giommu->iommu == section->mr) {
> -                memory_region_unregister_iommu_notifier(&giommu->n);
> +                memory_region_unregister_iommu_notifier(giommu->iommu,
> +                                                        &giommu->n);
>                  QLIST_REMOVE(giommu, giommu_next);
>                  g_free(giommu);
>                  break;
> @@ -1040,7 +1041,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>          QLIST_REMOVE(container, next);
>  
>          QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
> -            memory_region_unregister_iommu_notifier(&giommu->n);
> +            memory_region_unregister_iommu_notifier(giommu->iommu, &giommu->n);
>              QLIST_REMOVE(giommu, giommu_next);
>              g_free(giommu);
>          }
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index bfb08d4..1c41eb6 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -151,6 +151,10 @@ struct MemoryRegionIOMMUOps {
>      IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
>      /* Returns supported page sizes */
>      uint64_t (*get_page_sizes)(MemoryRegion *iommu);
> +    /* Called when the first notifier is set */
> +    void (*notify_started)(MemoryRegion *iommu);
> +    /* Called when the last notifier is removed */
> +    void (*notify_stopped)(MemoryRegion *iommu);
>  };
>  
>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> @@ -620,9 +624,11 @@ void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write);
>   * memory_region_unregister_iommu_notifier: unregister a notifier for
>   * changes to IOMMU translation entries.
>   *
> + * @mr: the memory region which was observed and for which notify_stopped()
> + *      needs to be called
>   * @n: the notifier to be removed.
>   */
> -void memory_region_unregister_iommu_notifier(Notifier *n);
> +void memory_region_unregister_iommu_notifier(MemoryRegion *mr, Notifier *n);
>  
>  /**
>   * memory_region_name: get a memory region's name
> diff --git a/memory.c b/memory.c
> index d22cf5e..fcf978a 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1512,6 +1512,10 @@ bool memory_region_is_logging(MemoryRegion *mr, uint8_t client)
>  
>  void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n)
>  {
> +    if (mr->iommu_ops->notify_started &&
> +        QLIST_EMPTY(&mr->iommu_notify.notifiers)) {
> +        mr->iommu_ops->notify_started(mr);
> +    }
>      notifier_list_add(&mr->iommu_notify, n);
>  }
>  
> @@ -1545,9 +1549,13 @@ void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write)
>      }
>  }
>  
> -void memory_region_unregister_iommu_notifier(Notifier *n)
> +void memory_region_unregister_iommu_notifier(MemoryRegion *mr, Notifier *n)
>  {
>      notifier_remove(n);
> +    if (mr->iommu_ops->notify_stopped &&
> +        QLIST_EMPTY(&mr->iommu_notify.notifiers)) {
> +        mr->iommu_ops->notify_stopped(mr);
> +    }
>  }
>  
>  void memory_region_notify_iommu(MemoryRegion *mr,

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v16 01/19] vfio: Delay DMA address space listener release
  2016-05-13 22:24       ` Alex Williamson
@ 2016-05-25  6:34         ` David Gibson
  2016-05-25 13:59           ` Alex Williamson
  0 siblings, 1 reply; 69+ messages in thread
From: David Gibson @ 2016-05-25  6:34 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, qemu-devel, qemu-ppc, Alexander Graf,
	Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 7637 bytes --]

On Fri, May 13, 2016 at 04:24:53PM -0600, Alex Williamson wrote:
> On Fri, 13 May 2016 17:16:48 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
> > On 05/06/2016 08:39 AM, Alex Williamson wrote:
> > > On Wed,  4 May 2016 16:52:13 +1000
> > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > >  
> > >> This postpones VFIO container deinitialization to let region_del()
> > >> callbacks (called via vfio_listener_release) do proper clean up
> > >> while the group is still attached to the container.  
> > >
> > > Any mappings within the container should clean themselves up when the
> > > container is deprivleged by removing the last group in the kernel. Is
> > > the issue that that doesn't happen, which would be a spapr vfio kernel
> > > bug, or that our QEMU side structures get all out of whack if we let
> > > that happen?  
> > 
> > My mailbase got corrupted, missed that.
> > 
> > This is mostly for "[PATCH qemu v16 17/19] spapr_iommu, vfio, memory: 
> > Notify IOMMU about starting/stopping being used by VFIO", I should have put 
> > 01/19 and 02/19 right before 17/19, sorry about that.
> 
> Which I object to, it's just ridiculous to have vfio start/stop
> callbacks in a set of generic iommu region ops.

It's ugly, but I don't actually see a better way to do this (the
general concept of having vfio start/stop callbacks, that is, not the
specifics of the patches).

The fact is that how we implement the guest side IOMMU *does* need to
change depending on whether VFIO devices are present or not.  That's
due essentially to incompatibilities between a couple of kernel
mechanisms.  Which in itself is ugly, but nonetheless real.

A (usually blank) vfio on/off callback in the guest side IOMMU ops
seems like the least-bad way to handle this.

> > Every reboot the spapr machine removes all (i.e. one or two) windows and 
> > creates the default one.
> > 
> > I do this by memory_region_del_subregion(iommu_mr) + 
> > memory_region_add_subregion(iommu_mr). Which gets translated to 
> > VFIO_IOMMU_SPAPR_TCE_REMOVE + VFIO_IOMMU_SPAPR_TCE_CREATE via 
> > vfio_memory_listener if there is VFIO; no direct calls from spapr to vfio 
> > => cool. During the machine reset, the VFIO device is there with the   
> > container and groups attached, at some point with no windows.
> > 
> > Now to VFIO plug/unplug.
> > 
> > When VFIO plug happens, vfio_memory_listener is created, region_add() is 
> > called, the hardware window is created (via VFIO_IOMMU_SPAPR_TCE_CREATE).
> > Unplugging should end up doing VFIO_IOMMU_SPAPR_TCE_REMOVE somehow. If 
> > region_del() is not called when the container is being destroyed (as before 
> > this patchset), then the kernel cleans and destroys windows when 
> > close(container->fd) is called or when qemu is killed (and this fd is 
> > naturally closed), I hope this answers the comment from 02/19.
> > 
> > So far so good (right?)
> > 
> > However I also have a guest view of the TCE table, this is what the guest 
> > sees and this is what emulated PCI devices use. This guest view is either 
> > allocated in the KVM (so H_PUT_TCE can be handled quickly right in the host 
> > kernel, even in real mode) or userspace (VFIO case).
> > 
> > I generally want the guest view to be in the KVM. However when I plug VFIO, 
> > I have to move the table to the userspace. When I unplug VFIO, I want to do 
> > the opposite so I need a way to tell spapr that it can move the table. 
> > region_del() seemed a natural way of doing this as region_add() is already 
> > doing the opposite part.
> > 
> > With this patchset, each IOMMU MR gets a usage counter, region_add() does 
> > +1, region_del() does -1 (yeah, not extremely optimal during reset). When 
> > the counter goes from 0 to 1, vfio_start() hook is called, when the counter 
> > becomes 0 - vfio_stop(). Note that we may have multiple VFIO containers on 
> > the same PHB.
> > 
> > Without 01/19 and 02/19, I'll have to repeat region_del()'s counter 
> > decrement steps in vfio_disconnect_container(). And I still cannot move 
> > counting from region_add() to vfio_connect_container() so there will be 
> > asymmetry which I am fine with, I am just checking here - what would be the 
> > best approach here?
> 
> 
> You're imposing on other iommu models (type1) that in order to release
> a container we first deregister the listener, which un-plays all of
> the mappings within that region.  That's inefficient when we can simply
> unset the container and move on.  So you're imposing an inefficiency on
> a separate vfio iommu model for the book keeping of your own.  I don't
> think that's a reasonable approach.  Has it even been tested how that
> affects type1 users?  When a container is closed, clearly it shouldn't
> be contributing to reference counts, so it seems like there must be
> other ways to handle this.

My first guess is to agree, but I'll look at that more carefully when
I actually get to the patch doing that.

What I really don't understand about this one is what the
group<->container connection - an entirely host side matter - has to
do with the reference counting here, which is per *guest* side IOMMU.

> 
> > >>
> > >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > >> ---
> > >>  hw/vfio/common.c | 22 +++++++++++++++-------
> > >>  1 file changed, 15 insertions(+), 7 deletions(-)
> > >>
> > >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > >> index fe5ec6a..0b40262 100644
> > >> --- a/hw/vfio/common.c
> > >> +++ b/hw/vfio/common.c
> > >> @@ -921,23 +921,31 @@ static void vfio_disconnect_container(VFIOGroup *group)
> > >>  {
> > >>      VFIOContainer *container = group->container;
> > >>
> > >> -    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
> > >> -        error_report("vfio: error disconnecting group %d from container",
> > >> -                     group->groupid);
> > >> -    }
> > >> -
> > >>      QLIST_REMOVE(group, container_next);
> > >> +
> > >> +    if (QLIST_EMPTY(&container->group_list)) {
> > >> +        VFIOGuestIOMMU *giommu;
> > >> +
> > >> +        vfio_listener_release(container);
> > >> +
> > >> +        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> > >> +            memory_region_unregister_iommu_notifier(&giommu->n);
> > >> +        }
> > >> +    }
> > >> +
> > >>      group->container = NULL;
> > >> +    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
> > >> +        error_report("vfio: error disconnecting group %d from container",
> > >> +                     group->groupid);
> > >> +    }
> > >>
> > >>      if (QLIST_EMPTY(&container->group_list)) {
> > >>          VFIOAddressSpace *space = container->space;
> > >>          VFIOGuestIOMMU *giommu, *tmp;
> > >>
> > >> -        vfio_listener_release(container);
> > >>          QLIST_REMOVE(container, next);
> > >>
> > >>          QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
> > >> -            memory_region_unregister_iommu_notifier(&giommu->n);
> > >>              QLIST_REMOVE(giommu, giommu_next);
> > >>              g_free(giommu);
> > >>          }  
> > >
> > > I'm not spotting why this is a 2-pass process vs simply moving the
> > > existing QLIST_EMPTY cleanup above the ioctl.  Thanks,  
> > 
> > 
> > 
> > 
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v16 01/19] vfio: Delay DMA address space listener release
  2016-05-25  6:34         ` David Gibson
@ 2016-05-25 13:59           ` Alex Williamson
  2016-05-26  1:00             ` David Gibson
  0 siblings, 1 reply; 69+ messages in thread
From: Alex Williamson @ 2016-05-25 13:59 UTC (permalink / raw)
  To: David Gibson
  Cc: Alexey Kardashevskiy, qemu-devel, qemu-ppc, Alexander Graf,
	Paolo Bonzini

On Wed, 25 May 2016 16:34:37 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, May 13, 2016 at 04:24:53PM -0600, Alex Williamson wrote:
> > On Fri, 13 May 2016 17:16:48 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> > > On 05/06/2016 08:39 AM, Alex Williamson wrote:  
> > > > On Wed,  4 May 2016 16:52:13 +1000
> > > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > >    
> > > >> This postpones VFIO container deinitialization to let region_del()
> > > >> callbacks (called via vfio_listener_release) do proper clean up
> > > >> while the group is still attached to the container.    
> > > >
> > > > Any mappings within the container should clean themselves up when the
> > > > container is deprivleged by removing the last group in the kernel. Is
> > > > the issue that that doesn't happen, which would be a spapr vfio kernel
> > > > bug, or that our QEMU side structures get all out of whack if we let
> > > > that happen?    
> > > 
> > > My mailbase got corrupted, missed that.
> > > 
> > > This is mostly for "[PATCH qemu v16 17/19] spapr_iommu, vfio, memory: 
> > > Notify IOMMU about starting/stopping being used by VFIO", I should have put 
> > > 01/19 and 02/19 right before 17/19, sorry about that.  
> > 
> > Which I object to, it's just ridiculous to have vfio start/stop
> > callbacks in a set of generic iommu region ops.  
> 
> It's ugly, but I don't actually see a better way to do this (the
> general concept of having vfio start/stop callbacks, that is, not the
> specifics of the patches).
> 
> The fact is that how we implement the guest side IOMMU *does* need to
> change depending on whether VFIO devices are present or not. 

No, how the guest side iommu is implemented needs to change depending
on whether there's someone, anyone, in QEMU that cares about the iommu,
which can be determined by whether the iommu notifier has any clients.
Alexey has posted another patch that does this.

> That's
> due essentially to incompatibilities between a couple of kernel
> mechanisms.  Which in itself is ugly, but nonetheless real.
> 
> A (usually blank) vfio on/off callback in the guest side IOMMU ops
> seems like the least-bad way to handle this.

I disagree, we already call memory_region_register_iommu_notifier() to
indicate we care about the guest iommu, so the abstraction is already
there, there's absolutely no reason to make a vfio specific interface.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v16 01/19] vfio: Delay DMA address space listener release
  2016-05-25 13:59           ` Alex Williamson
@ 2016-05-26  1:00             ` David Gibson
  0 siblings, 0 replies; 69+ messages in thread
From: David Gibson @ 2016-05-26  1:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, qemu-devel, qemu-ppc, Alexander Graf,
	Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 3025 bytes --]

On Wed, May 25, 2016 at 07:59:26AM -0600, Alex Williamson wrote:
> On Wed, 25 May 2016 16:34:37 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Fri, May 13, 2016 at 04:24:53PM -0600, Alex Williamson wrote:
> > > On Fri, 13 May 2016 17:16:48 +1000
> > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > >   
> > > > On 05/06/2016 08:39 AM, Alex Williamson wrote:  
> > > > > On Wed,  4 May 2016 16:52:13 +1000
> > > > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > > >    
> > > > >> This postpones VFIO container deinitialization to let region_del()
> > > > >> callbacks (called via vfio_listener_release) do proper clean up
> > > > >> while the group is still attached to the container.    
> > > > >
> > > > > Any mappings within the container should clean themselves up when the
> > > > > container is deprivileged by removing the last group in the kernel. Is
> > > > > the issue that that doesn't happen, which would be a spapr vfio kernel
> > > > > bug, or that our QEMU side structures get all out of whack if we let
> > > > > that happen?    
> > > > 
> > > > My mailbase got corrupted, missed that.
> > > > 
> > > > This is mostly for "[PATCH qemu v16 17/19] spapr_iommu, vfio, memory: 
> > > > Notify IOMMU about starting/stopping being used by VFIO", I should have put 
> > > > 01/19 and 02/19 right before 17/19, sorry about that.  
> > > 
> > > Which I object to, it's just ridiculous to have vfio start/stop
> > > callbacks in a set of generic iommu region ops.  
> > 
> > It's ugly, but I don't actually see a better way to do this (the
> > general concept of having vfio start/stop callbacks, that is, not the
> > specifics of the patches).
> > 
> > The fact is that how we implement the guest side IOMMU *does* need to
> > change depending on whether VFIO devices are present or not. 
> 
> No, how the guest side iommu is implemented needs to change depending
> on whether there's someone, anyone, in QEMU that cares about the iommu,
> which can be determined by whether the iommu notifier has any clients.
> Alexey has posted another patch that does this.

*thinks*  ah, yes, you're right of course.  So instead we need some
hook that's triggered on transition of number of notifier listeners
from zero<->non-zero.

> > That's
> > due essentially to incompatibilities between a couple of kernel
> > mechanisms.  Which in itself is ugly, but nonetheless real.
> > 
> > A (usually blank) vfio on/off callback in the guest side IOMMU ops
> > seems like the least-bad way to handle this.
> 
> I disagree, we already call memory_region_register_iommu_notifier() to
> indicate we care about the guest iommu, so the abstraction is already
> there, there's absolutely no reason to make a vfio specific interface.
> Thanks,
> 
> Alex
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v16 02/19] memory: Call region_del() callbacks on memory listener unregistering
  2016-05-05 22:45   ` Alex Williamson
@ 2016-05-26  1:48     ` David Gibson
  0 siblings, 0 replies; 69+ messages in thread
From: David Gibson @ 2016-05-26  1:48 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, qemu-devel, qemu-ppc, Alexander Graf,
	Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 2173 bytes --]

On Thu, May 05, 2016 at 04:45:04PM -0600, Alex Williamson wrote:
> On Wed,  4 May 2016 16:52:14 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
> > When a new memory listener is registered, listener_add_address_space()
> > is called, which in turn calls the region_add() callbacks of memory regions.
> > However when unregistering the memory listener, it is just removed from
> > the listening chain and no region_del() is called.
> > 
> > This adds listener_del_address_space() and uses it in
> > memory_listener_unregister(). listener_add_address_space() was used as
> > a template with the following changes:
> > s/log_global_start/log_global_stop/
> > s/log_start/log_stop/
> > s/region_add/region_del/
> > 
> > This will allow the following patches to add/remove DMA windows
> > dynamically from VFIO's PCI address space's region_add()/region_del().
> 
> Following patch 1 comments, it would be a bug if the kernel actually
> needed this to do cleanup, we must release everything if QEMU gets shot
> with a SIGKILL anyway.  So what does this cleanup facilitate in QEMU?
> Having QEMU trigger an unmap for each region_del is not going to be as
> efficient as just dropping the container and letting the kernel handle
> the cleanup all in one go.  Thanks,

So, what the kernel does is kind of a red herring, because that's only
relevant to the specific case of the VFIO listener, whereas this is a
change to the behaviour of all memory listeners.

It seems plausible that some memory listeners could have a legitimate
reason to want cleanup region_del() calls when unregistered.  But we
know this could be expensive for other listeners, so I don't think we
should make that behaviour standard.

So I'd be thinking either a special unregister_with_delete() call, or
a standalone "delete all" helper function.
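
To illustrate the two options (hypothetical names, not the real memory.c API): a plain unregister that just unlinks the listener, versus an opt-in variant that replays region_del() for every section the listener saw before unlinking it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of the two unregister flavours discussed above. */
typedef struct ListenerSketch ListenerSketch;
struct ListenerSketch {
    void (*region_del)(ListenerSketch *l, int section);
    int sections[8];    /* sections previously seen via region_add() */
    int nsections;
    bool registered;
};

static void listener_unregister(ListenerSketch *l)
{
    /* Cheap path: just unlink; the backend (e.g. the kernel dropping a
     * container) cleans everything up in one go. */
    l->registered = false;
}

static void listener_unregister_with_delete(ListenerSketch *l)
{
    /* Opt-in path: explicit region_del() per section, only for
     * listeners that really need per-region cleanup. */
    for (int i = l->nsections - 1; i >= 0; i--) {
        l->region_del(l, l->sections[i]);
    }
    l->nsections = 0;
    listener_unregister(l);
}
```

Keeping the expensive replay behind a separate entry point means existing listeners keep the cheap path by default.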

Assuming this is still needed at all, once the other changes to the
reference counting we've discussed have been done.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v16 03/19] memory: Fix IOMMU replay base address
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 03/19] memory: Fix IOMMU replay base address Alexey Kardashevskiy
@ 2016-05-26  1:50   ` David Gibson
  0 siblings, 0 replies; 69+ messages in thread
From: David Gibson @ 2016-05-26  1:50 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, Alex Williamson, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 5159 bytes --]

On Wed, May 04, 2016 at 04:52:15PM +1000, Alexey Kardashevskiy wrote:
> Since a788f227 "memory: Allow replay of IOMMU mapping notifications"
> when a new VFIO listener is added, all existing IOMMU mappings are
> replayed. However, the base address of an IOMMU memory region
> (IOMMU MR) is ignored. This is not a problem for the existing user
> (pseries) with its default 32bit DMA window starting at 0, but it is
> if there is another DMA window.
> 
> This stores the IOMMU's offset_within_address_space and adjusts
> the IOVA before calling vfio_dma_map/vfio_dma_unmap.
> 
> As the IOMMU notifier expects an IOVA offset rather than the absolute
> address, this also adjusts the IOVA in the sPAPR H_PUT_TCE handler before
> calling the notifier(s).
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Alex, this is a real fix independent of the other stuff.  Can we apply
it ASAP?

> ---
> Changes:
> v15:
> * accounted section->offset_within_region
> * s/giommu->offset_within_address_space/giommu->iommu_offset/
> ---
>  hw/ppc/spapr_iommu.c          |  2 +-
>  hw/vfio/common.c              | 14 ++++++++------
>  include/hw/vfio/vfio-common.h |  1 +
>  3 files changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 7dd4588..277f289 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -277,7 +277,7 @@ static target_ulong put_tce_emu(sPAPRTCETable *tcet, target_ulong ioba,
>      tcet->table[index] = tce;
>  
>      entry.target_as = &address_space_memory,
> -    entry.iova = ioba & page_mask;
> +    entry.iova = (ioba - tcet->bus_offset) & page_mask;
>      entry.translated_addr = tce & page_mask;
>      entry.addr_mask = ~page_mask;
>      entry.perm = spapr_tce_iommu_access_flags(tce);
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 0b40262..f32cc49 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -257,14 +257,14 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
>      VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>      VFIOContainer *container = giommu->container;
>      IOMMUTLBEntry *iotlb = data;
> +    hwaddr iova = iotlb->iova + giommu->iommu_offset;
>      MemoryRegion *mr;
>      hwaddr xlat;
>      hwaddr len = iotlb->addr_mask + 1;
>      void *vaddr;
>      int ret;
>  
> -    trace_vfio_iommu_map_notify(iotlb->iova,
> -                                iotlb->iova + iotlb->addr_mask);
> +    trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
>  
>      /*
>       * The IOMMU TLB entry we have just covers translation through
> @@ -291,21 +291,21 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
>  
>      if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
>          vaddr = memory_region_get_ram_ptr(mr) + xlat;
> -        ret = vfio_dma_map(container, iotlb->iova,
> +        ret = vfio_dma_map(container, iova,
>                             iotlb->addr_mask + 1, vaddr,
>                             !(iotlb->perm & IOMMU_WO) || mr->readonly);
>          if (ret) {
>              error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>                           "0x%"HWADDR_PRIx", %p) = %d (%m)",
> -                         container, iotlb->iova,
> +                         container, iova,
>                           iotlb->addr_mask + 1, vaddr, ret);
>          }
>      } else {
> -        ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
> +        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
>          if (ret) {
>              error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>                           "0x%"HWADDR_PRIx") = %d (%m)",
> -                         container, iotlb->iova,
> +                         container, iova,
>                           iotlb->addr_mask + 1, ret);
>          }
>      }
> @@ -377,6 +377,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
>           */
>          giommu = g_malloc0(sizeof(*giommu));
>          giommu->iommu = section->mr;
> +        giommu->iommu_offset = section->offset_within_address_space -
> +            section->offset_within_region;
>          giommu->container = container;
>          giommu->n.notify = vfio_iommu_map_notify;
>          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index eb0e1b0..c9b6622 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -90,6 +90,7 @@ typedef struct VFIOContainer {
>  typedef struct VFIOGuestIOMMU {
>      VFIOContainer *container;
>      MemoryRegion *iommu;
> +    hwaddr iommu_offset;
>      Notifier n;
>      QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>  } VFIOGuestIOMMU;

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v16 05/19] vfio: Check that IOMMU MR translates to system address space
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 05/19] vfio: Check that IOMMU MR translates to system address space Alexey Kardashevskiy
@ 2016-05-26  1:51   ` David Gibson
  0 siblings, 0 replies; 69+ messages in thread
From: David Gibson @ 2016-05-26  1:51 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, Alex Williamson, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 1572 bytes --]

On Wed, May 04, 2016 at 04:52:17PM +1000, Alexey Kardashevskiy wrote:
> At the moment an IOMMU MR only translates to the system memory.
> However, if some new code changes this, we will need a clear indication
> of why it is not working, so here is the check.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Alex,

I think this is a reasonable sanity check regardless of what happens
with the rest of the series.  Can you apply this?

> ---
> Changes:
> v15:
> * added some spaces
> 
> v14:
> * new to the series
> ---
>  hw/vfio/common.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index f32cc49..6d23d0f 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -266,6 +266,12 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
>  
>      trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
>  
> +    if (iotlb->target_as != &address_space_memory) {
> +        error_report("Wrong target AS \"%s\", only system memory is allowed",
> +                     iotlb->target_as->name ? iotlb->target_as->name : "none");
> +        return;
> +    }
> +
>      /*
>       * The IOMMU TLB entry we have just covers translation through
>       * this IOMMU to its immediate target.  We need to translate

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v16 06/19] spapr_pci: Use correct DMA LIOBN when composing the device tree
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 06/19] spapr_pci: Use correct DMA LIOBN when composing the device tree Alexey Kardashevskiy
@ 2016-05-26  3:17   ` David Gibson
  0 siblings, 0 replies; 69+ messages in thread
From: David Gibson @ 2016-05-26  3:17 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, Alex Williamson, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 1299 bytes --]

On Wed, May 04, 2016 at 04:52:18PM +1000, Alexey Kardashevskiy wrote:
> The user could have picked LIOBN via the CLI but the device tree
> rendering code would still use the value derived from the PHB index
> (which is the default fallback if LIOBN is not set in the CLI).
> 
> This replaces SPAPR_PCI_LIOBN() with the actual DMA LIOBN value.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Applied to ppc-for-2.7.

> ---
> Changes:
> v16:
> * new in the series
> ---
>  hw/ppc/spapr_pci.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 573e635..742d127 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -1815,7 +1815,7 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
>                       sizeof(interrupt_map)));
>  
> -    tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(phb->index, 0));
> +    tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
>      if (!tcet) {
>          return -1;
>      }

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v16 09/19] spapr_iommu: Finish renaming vfio_accel to need_vfio
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 09/19] spapr_iommu: Finish renaming vfio_accel to need_vfio Alexey Kardashevskiy
@ 2016-05-26  3:18   ` David Gibson
  0 siblings, 0 replies; 69+ messages in thread
From: David Gibson @ 2016-05-26  3:18 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, Alex Williamson, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 1164 bytes --]

On Wed, May 04, 2016 at 04:52:21PM +1000, Alexey Kardashevskiy wrote:
> 6a81dd17 "spapr_iommu: Rename vfio_accel parameter" renamed vfio_accel
> flag everywhere but one spot was missed.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Applied to ppc-for-2.7.


> ---
>  target-ppc/kvm_ppc.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/target-ppc/kvm_ppc.h b/target-ppc/kvm_ppc.h
> index fc79312..3b2090e 100644
> --- a/target-ppc/kvm_ppc.h
> +++ b/target-ppc/kvm_ppc.h
> @@ -163,7 +163,7 @@ static inline bool kvmppc_spapr_use_multitce(void)
>  
>  static inline void *kvmppc_create_spapr_tce(uint32_t liobn,
>                                              uint32_t window_size, int *fd,
> -                                            bool vfio_accel)
> +                                            bool need_vfio)
>  {
>      return NULL;
>  }

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v16 07/19] spapr_iommu: Move table allocation to helpers
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 07/19] spapr_iommu: Move table allocation to helpers Alexey Kardashevskiy
@ 2016-05-26  3:32   ` David Gibson
  0 siblings, 0 replies; 69+ messages in thread
From: David Gibson @ 2016-05-26  3:32 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, Alex Williamson, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 5617 bytes --]

On Wed, May 04, 2016 at 04:52:19PM +1000, Alexey Kardashevskiy wrote:
> At the moment the presence of vfio-pci devices on a bus affects the way
> the guest view table is allocated. If there is no vfio-pci on a PHB
> and the host kernel supports KVM acceleration of H_PUT_TCE, a table
> is allocated in KVM. However, if there is vfio-pci and we do not yet
> have KVM acceleration for these, the table has to be allocated in
> userspace. At the moment the table is allocated once at boot time,
> but later patches will reallocate it.
> 
> This moves kvmppc_create_spapr_tce/g_malloc0 and their counterparts
> to helpers.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

This is a reasonable clean up on its own, so I've applied to ppc-for-2.7.

> ---
>  hw/ppc/spapr_iommu.c | 58 +++++++++++++++++++++++++++++++++++-----------------
>  trace-events         |  2 +-
>  2 files changed, 40 insertions(+), 20 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 277f289..8132f64 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -75,6 +75,37 @@ static IOMMUAccessFlags spapr_tce_iommu_access_flags(uint64_t tce)
>      }
>  }
>  
> +static uint64_t *spapr_tce_alloc_table(uint32_t liobn,
> +                                       uint32_t page_shift,
> +                                       uint32_t nb_table,
> +                                       int *fd,
> +                                       bool need_vfio)
> +{
> +    uint64_t *table = NULL;
> +    uint64_t window_size = (uint64_t)nb_table << page_shift;
> +
> +    if (kvm_enabled() && !(window_size >> 32)) {
> +        table = kvmppc_create_spapr_tce(liobn, window_size, fd, need_vfio);
> +    }
> +
> +    if (!table) {
> +        *fd = -1;
> +        table = g_malloc0(nb_table * sizeof(uint64_t));
> +    }
> +
> +    trace_spapr_iommu_new_table(liobn, table, *fd);
> +
> +    return table;
> +}
> +
> +static void spapr_tce_free_table(uint64_t *table, int fd, uint32_t nb_table)
> +{
> +    if (!kvm_enabled() ||
> +        (kvmppc_remove_spapr_tce(table, fd, nb_table) != 0)) {
> +        g_free(table);
> +    }
> +}
> +
>  /* Called from RCU critical section */
>  static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
>                                                 bool is_write)
> @@ -141,21 +172,13 @@ static MemoryRegionIOMMUOps spapr_iommu_ops = {
>  static int spapr_tce_table_realize(DeviceState *dev)
>  {
>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
> -    uint64_t window_size = (uint64_t)tcet->nb_table << tcet->page_shift;
>  
> -    if (kvm_enabled() && !(window_size >> 32)) {
> -        tcet->table = kvmppc_create_spapr_tce(tcet->liobn,
> -                                              window_size,
> -                                              &tcet->fd,
> -                                              tcet->need_vfio);
> -    }
> -
> -    if (!tcet->table) {
> -        size_t table_size = tcet->nb_table * sizeof(uint64_t);
> -        tcet->table = g_malloc0(table_size);
> -    }
> -
> -    trace_spapr_iommu_new_table(tcet->liobn, tcet, tcet->table, tcet->fd);
> +    tcet->fd = -1;
> +    tcet->table = spapr_tce_alloc_table(tcet->liobn,
> +                                        tcet->page_shift,
> +                                        tcet->nb_table,
> +                                        &tcet->fd,
> +                                        tcet->need_vfio);
>  
>      memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
>                               "iommu-spapr",
> @@ -241,11 +264,8 @@ static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
>  
>      QLIST_REMOVE(tcet, list);
>  
> -    if (!kvm_enabled() ||
> -        (kvmppc_remove_spapr_tce(tcet->table, tcet->fd,
> -                                 tcet->nb_table) != 0)) {
> -        g_free(tcet->table);
> -    }
> +    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
> +    tcet->fd = -1;
>  }
>  
>  MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
> diff --git a/trace-events b/trace-events
> index 8350743..d96d344 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1431,7 +1431,7 @@ spapr_iommu_pci_get(uint64_t liobn, uint64_t ioba, uint64_t ret, uint64_t tce) "
>  spapr_iommu_pci_indirect(uint64_t liobn, uint64_t ioba, uint64_t tce, uint64_t iobaN, uint64_t tceN, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcelist=0x%"PRIx64" iobaN=0x%"PRIx64" tceN=0x%"PRIx64" ret=%"PRId64
>  spapr_iommu_pci_stuff(uint64_t liobn, uint64_t ioba, uint64_t tce_value, uint64_t npages, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcevalue=0x%"PRIx64" npages=%"PRId64" ret=%"PRId64
>  spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, unsigned pgsize) "liobn=%"PRIx64" 0x%"PRIx64" -> 0x%"PRIx64" perm=%u mask=%x"
> -spapr_iommu_new_table(uint64_t liobn, void *tcet, void *table, int fd) "liobn=%"PRIx64" tcet=%p table=%p fd=%d"
> +spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
>  
>  # hw/ppc/ppc.c
>  ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v16 08/19] spapr_iommu: Introduce "enabled" state for TCE table
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 08/19] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
@ 2016-05-26  3:39   ` David Gibson
  2016-05-27  8:01     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 69+ messages in thread
From: David Gibson @ 2016-05-26  3:39 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, Alex Williamson, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 11445 bytes --]

On Wed, May 04, 2016 at 04:52:20PM +1000, Alexey Kardashevskiy wrote:
> Currently TCE tables are created once at start and their sizes never
> change. We are going to change that by introducing Dynamic DMA windows
> support, where the DMA configuration may change during guest execution.
> 
> This changes spapr_tce_new_table() to create an empty zero-size IOMMU
> memory region (IOMMU MR). Only the LIOBN is assigned at the time of creation.
> It will still be called once, at the owner object (VIO or PHB) creation.
> 
> This introduces an "enabled" state for TCE table objects with two
> helper functions - spapr_tce_table_enable()/spapr_tce_table_disable().
> - spapr_tce_table_enable() receives TCE table parameters, allocates
> a guest view of the TCE table (in the user space or KVM) and
> sets the correct size on the IOMMU MR.
> - spapr_tce_table_disable() disposes the table and resets the IOMMU MR
> size.
> 
> This changes the PHB reset handler to do the default DMA initialization
> instead of spapr_phb_realize(). This does not make a difference now, but
> later, with more than one DMA window, we will have to remove them all
> and create the default one on a system reset.
> 
> No visible change in behaviour is expected, except that the actual table
> will be reallocated every reset. We might optimize this later.
> 
> The other way to implement this would be to dynamically create/remove
> the TCE table QOM objects but this would make migration impossible
> as the migration code expects all QOM objects to exist at the receiver
> so we have to have TCE table objects created when migration begins.
> 
> spapr_tce_table_do_enable() is separated from spapr_tce_table_enable()
> as later it will be called at the sPAPRTCETable post-migration stage when
> it already has all the properties set after the migration; the same is
> done for spapr_tce_table_disable().
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> Changes:
> v15:
> * made adjustments after removing spapr_phb_dma_window_enable()
> 
> v14:
> * added spapr_tce_table_do_disable(), will make difference in following
> patch with fully dynamic table migration
> 
> # Conflicts:
> #	hw/ppc/spapr_pci.c
> ---
>  hw/ppc/spapr_iommu.c   | 86 ++++++++++++++++++++++++++++++++++++--------------
>  hw/ppc/spapr_pci.c     |  8 +++--
>  hw/ppc/spapr_vio.c     |  8 ++---
>  include/hw/ppc/spapr.h | 10 +++---
>  4 files changed, 75 insertions(+), 37 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 8132f64..9bcd3f6 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -17,6 +17,7 @@
>   * License along with this library; if not, see <http://www.gnu.org/licenses/>.
>   */
>  #include "qemu/osdep.h"
> +#include "qemu/error-report.h"
>  #include "hw/hw.h"
>  #include "sysemu/kvm.h"
>  #include "hw/qdev.h"
> @@ -174,15 +175,9 @@ static int spapr_tce_table_realize(DeviceState *dev)
>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
>  
>      tcet->fd = -1;
> -    tcet->table = spapr_tce_alloc_table(tcet->liobn,
> -                                        tcet->page_shift,
> -                                        tcet->nb_table,
> -                                        &tcet->fd,
> -                                        tcet->need_vfio);
> -
> +    tcet->need_vfio = false;
>      memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
> -                             "iommu-spapr",
> -                             (uint64_t)tcet->nb_table << tcet->page_shift);
> +                             "iommu-spapr", 0);
>  
>      QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
>  
> @@ -224,14 +219,10 @@ void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio)
>      tcet->table = newtable;
>  }
>  
> -sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
> -                                   uint64_t bus_offset,
> -                                   uint32_t page_shift,
> -                                   uint32_t nb_table,
> -                                   bool need_vfio)
> +sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn)
>  {
>      sPAPRTCETable *tcet;
> -    char tmp[64];
> +    char tmp[32];
>  
>      if (spapr_tce_find_by_liobn(liobn)) {
>          fprintf(stderr, "Attempted to create TCE table with duplicate"
> @@ -239,16 +230,8 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
>          return NULL;
>      }
>  
> -    if (!nb_table) {
> -        return NULL;
> -    }
> -
>      tcet = SPAPR_TCE_TABLE(object_new(TYPE_SPAPR_TCE_TABLE));
>      tcet->liobn = liobn;
> -    tcet->bus_offset = bus_offset;
> -    tcet->page_shift = page_shift;
> -    tcet->nb_table = nb_table;
> -    tcet->need_vfio = need_vfio;
>  
>      snprintf(tmp, sizeof(tmp), "tce-table-%x", liobn);
>      object_property_add_child(OBJECT(owner), tmp, OBJECT(tcet), NULL);
> @@ -258,14 +241,69 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
>      return tcet;
>  }
>  
> +static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
> +{
> +    if (!tcet->nb_table) {
> +        return;
> +    }
> +
> +    tcet->table = spapr_tce_alloc_table(tcet->liobn,
> +                                        tcet->page_shift,
> +                                        tcet->nb_table,
> +                                        &tcet->fd,
> +                                        tcet->need_vfio);
> +
> +    memory_region_set_size(&tcet->iommu,
> +                           (uint64_t)tcet->nb_table << tcet->page_shift);
> +
> +    tcet->enabled = true;
> +}
> +
> +void spapr_tce_table_enable(sPAPRTCETable *tcet,
> +                            uint32_t page_shift, uint64_t bus_offset,
> +                            uint32_t nb_table)
> +{
> +    if (tcet->enabled) {
> +        error_report("Warning: trying to enable already enabled TCE table");
> +        return;
> +    }
> +
> +    tcet->bus_offset = bus_offset;
> +    tcet->page_shift = page_shift;
> +    tcet->nb_table = nb_table;
> +
> +    spapr_tce_table_do_enable(tcet);
> +}
> +
> +static void spapr_tce_table_do_disable(sPAPRTCETable *tcet)
> +{
> +    memory_region_set_size(&tcet->iommu, 0);
> +
> +    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
> +    tcet->fd = -1;
> +    tcet->table = NULL;
> +    tcet->enabled = false;
> +    tcet->bus_offset = 0;
> +    tcet->page_shift = 0;
> +    tcet->nb_table = 0;
> +}
> +
> +static void spapr_tce_table_disable(sPAPRTCETable *tcet)
> +{
> +    if (!tcet->enabled) {
> +        error_report("Warning: trying to disable already disabled TCE table");
> +        return;
> +    }
> +    spapr_tce_table_do_disable(tcet);
> +}
> +
>  static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
>  {
>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
>  
>      QLIST_REMOVE(tcet, list);
>  
> -    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
> -    tcet->fd = -1;
> +    spapr_tce_table_disable(tcet);

This should probably be do_disable(), or you'll get a spurious error
if you start and stop a VM but don't enable the table in between, or
if the guest disables all the tables before the shutdown.

>  }
>  
>  MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 742d127..beeac06 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -1464,8 +1464,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>      }
>  
>      nb_table = sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT;
> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn,
> -                               0, SPAPR_TCE_PAGE_SHIFT, nb_table, false);
> +    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
>      if (!tcet) {
>          error_setg(errp, "Unable to create TCE table for %s",
>                     sphb->dtbusname);
> @@ -1473,7 +1472,10 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>      }
>  
>      /* Register default 32bit DMA window */
> -    memory_region_add_subregion(&sphb->iommu_root, sphb->dma_win_addr,
> +    spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
> +                           nb_table);
> +
> +    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
>                                  spapr_tce_get_iommu(tcet));
>  
>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
> diff --git a/hw/ppc/spapr_vio.c b/hw/ppc/spapr_vio.c
> index 8aa021f..a7d49a0 100644
> --- a/hw/ppc/spapr_vio.c
> +++ b/hw/ppc/spapr_vio.c
> @@ -482,11 +482,9 @@ static void spapr_vio_busdev_realize(DeviceState *qdev, Error **errp)
>          memory_region_add_subregion_overlap(&dev->mrroot, 0, &dev->mrbypass, 1);
>          address_space_init(&dev->as, &dev->mrroot, qdev->id);
>  
> -        dev->tcet = spapr_tce_new_table(qdev, liobn,
> -                                        0,
> -                                        SPAPR_TCE_PAGE_SHIFT,
> -                                        pc->rtce_window_size >>
> -                                        SPAPR_TCE_PAGE_SHIFT, false);
> +        dev->tcet = spapr_tce_new_table(qdev, liobn);
> +        spapr_tce_table_enable(dev->tcet, SPAPR_TCE_PAGE_SHIFT, 0,
> +                               pc->rtce_window_size >> SPAPR_TCE_PAGE_SHIFT);
>          dev->tcet->vdev = dev;
>          memory_region_add_subregion_overlap(&dev->mrroot, 0,
>                                              spapr_tce_get_iommu(dev->tcet), 2);
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 815d5ee..0140810 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -534,6 +534,7 @@ typedef struct sPAPRTCETable sPAPRTCETable;
>  
>  struct sPAPRTCETable {
>      DeviceState parent;
> +    bool enabled;
>      uint32_t liobn;
>      uint32_t nb_table;
>      uint64_t bus_offset;
> @@ -561,11 +562,10 @@ void spapr_events_fdt_skel(void *fdt, uint32_t epow_irq);
>  int spapr_h_cas_compose_response(sPAPRMachineState *sm,
>                                   target_ulong addr, target_ulong size,
>                                   bool cpu_update, bool memory_update);
> -sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
> -                                   uint64_t bus_offset,
> -                                   uint32_t page_shift,
> -                                   uint32_t nb_table,
> -                                   bool need_vfio);
> +sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn);
> +void spapr_tce_table_enable(sPAPRTCETable *tcet,
> +                            uint32_t page_shift, uint64_t bus_offset,
> +                            uint32_t nb_table);
>  void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio);
>  
>  MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v16 10/19] spapr_iommu: Migrate full state
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 10/19] spapr_iommu: Migrate full state Alexey Kardashevskiy
@ 2016-05-26  4:01   ` David Gibson
  2016-05-31  8:19     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 69+ messages in thread
From: David Gibson @ 2016-05-26  4:01 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, Alex Williamson, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 6660 bytes --]

On Wed, May 04, 2016 at 04:52:22PM +1000, Alexey Kardashevskiy wrote:
> The source guest could have reallocated the default TCE table and
> migrate bigger/smaller table. This adds reallocation in post_load()
> if the default table size is different on source and destination.
> 
> This adds @bus_offset, @page_shift, @enabled to the migration stream.
> These cannot change without dynamic DMA windows so no change in
> behavior is expected now.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> David Gibson <david@gibson.dropbear.id.au>
> ---
> Changes:
> v15:
> * squashed "migrate full state" into this
> * added missing tcet->mig_nb_table initialization in spapr_tce_table_pre_save()
> * instead of bumping the version, moved extra parameters to subsection
> 
> v14:
> * new to the series
> ---
>  hw/ppc/spapr_iommu.c   | 67 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  include/hw/ppc/spapr.h |  2 ++
>  trace-events           |  2 ++
>  3 files changed, 69 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 9bcd3f6..52b1e0d 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -137,33 +137,96 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
>      return ret;
>  }
>  
> +static void spapr_tce_table_pre_save(void *opaque)
> +{
> +    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> +
> +    tcet->mig_table = tcet->table;
> +    tcet->mig_nb_table = tcet->nb_table;
> +
> +    trace_spapr_iommu_pre_save(tcet->liobn, tcet->mig_nb_table,
> +                               tcet->bus_offset, tcet->page_shift);
> +}
> +
> +static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
> +static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
> +
>  static int spapr_tce_table_post_load(void *opaque, int version_id)
>  {
>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> +    uint32_t old_nb_table = tcet->nb_table;
>  
>      if (tcet->vdev) {
>          spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
>      }
>  
> +    if (tcet->enabled) {
> +        if (tcet->nb_table != tcet->mig_nb_table) {
> +            if (tcet->nb_table) {
> +                spapr_tce_table_do_disable(tcet);
> +            }
> +            tcet->nb_table = tcet->mig_nb_table;
> +            spapr_tce_table_do_enable(tcet);
> +        }
> +
> +        memcpy(tcet->table, tcet->mig_table,
> +               tcet->nb_table * sizeof(tcet->table[0]));
> +
> +        free(tcet->mig_table);
> +        tcet->mig_table = NULL;
> +    } else if (tcet->table) {
> +        /* Destination guest has a default table but source does not -> free */
> +        spapr_tce_table_do_disable(tcet);
> +    }
> +
> +    trace_spapr_iommu_post_load(tcet->liobn, old_nb_table, tcet->nb_table,
> +                                tcet->bus_offset, tcet->page_shift);
> +
>      return 0;
>  }
>  
> +static bool spapr_tce_table_ex_needed(void *opaque)
> +{
> +    sPAPRTCETable *tcet = opaque;
> +
> +    return tcet->bus_offset || tcet->page_shift != 0xC;

	|| !tcet->enabled ??

AFAICT you're assuming that the existing tcet on the destination will
be enabled prior to an incoming migration.

> +}
> +
> +static const VMStateDescription vmstate_spapr_tce_table_ex = {
> +    .name = "spapr_iommu_ex",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .needed = spapr_tce_table_ex_needed,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_BOOL(enabled, sPAPRTCETable),

..or could you encode enabled as !!mig_nb_table?
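The two alternatives raised here — also migrating the subsection when the table is disabled, or dropping the field and recovering `enabled` from the migrated table size — can be compared with a stand-alone sketch. The struct below is a hypothetical reduction of `sPAPRTCETable`, keeping only the fields the predicates read; it is not the QEMU definition:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduction of sPAPRTCETable: only the fields that the
 * subsection predicate and the suggested variants look at. */
typedef struct {
    bool enabled;
    uint64_t bus_offset;
    uint32_t page_shift;
    uint32_t mig_nb_table;
} TceState;

/* Predicate as posted in v16: the subsection travels only for a
 * non-default window (non-zero bus offset or page shift != 4K). */
static bool needed_v16(const TceState *t)
{
    return t->bus_offset || t->page_shift != 0xC;
}

/* Variant 1: also send the subsection when the table is disabled, so
 * the destination cannot wrongly assume a default enabled table. */
static bool needed_with_enabled(const TceState *t)
{
    return t->bus_offset || t->page_shift != 0xC || !t->enabled;
}

/* Variant 2: keep 'enabled' out of the stream and recover it from the
 * migrated table size instead. */
static bool enabled_from_stream(const TceState *t)
{
    return t->mig_nb_table != 0;
}
```

A disabled default table (offset 0, 4K pages) is exactly the case where the v16 predicate returns false and the destination is left guessing.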

> +        VMSTATE_UINT64(bus_offset, sPAPRTCETable),
> +        VMSTATE_UINT32(page_shift, sPAPRTCETable),
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
>  static const VMStateDescription vmstate_spapr_tce_table = {
>      .name = "spapr_iommu",
>      .version_id = 2,
>      .minimum_version_id = 2,
> +    .pre_save = spapr_tce_table_pre_save,
>      .post_load = spapr_tce_table_post_load,
>      .fields      = (VMStateField []) {
>          /* Sanity check */
>          VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
> -        VMSTATE_UINT32_EQUAL(nb_table, sPAPRTCETable),
>  
>          /* IOMMU state */
> +        VMSTATE_UINT32(mig_nb_table, sPAPRTCETable),
>          VMSTATE_BOOL(bypass, sPAPRTCETable),
> -        VMSTATE_VARRAY_UINT32(table, sPAPRTCETable, nb_table, 0, vmstate_info_uint64, uint64_t),
> +        VMSTATE_VARRAY_UINT32_ALLOC(mig_table, sPAPRTCETable, mig_nb_table, 0,
> +                                    vmstate_info_uint64, uint64_t),
>  
>          VMSTATE_END_OF_LIST()
>      },
> +    .subsections = (const VMStateDescription*[]) {
> +        &vmstate_spapr_tce_table_ex,
> +        NULL
> +    }
>  };
>  
>  static MemoryRegionIOMMUOps spapr_iommu_ops = {
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 0140810..d36dda2 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -540,6 +540,8 @@ struct sPAPRTCETable {
>      uint64_t bus_offset;
>      uint32_t page_shift;
>      uint64_t *table;
> +    uint32_t mig_nb_table;
> +    uint64_t *mig_table;
>      bool bypass;
>      bool need_vfio;
>      int fd;
> diff --git a/trace-events b/trace-events
> index d96d344..dd50005 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1432,6 +1432,8 @@ spapr_iommu_pci_indirect(uint64_t liobn, uint64_t ioba, uint64_t tce, uint64_t i
>  spapr_iommu_pci_stuff(uint64_t liobn, uint64_t ioba, uint64_t tce_value, uint64_t npages, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcevalue=0x%"PRIx64" npages=%"PRId64" ret=%"PRId64
>  spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, unsigned pgsize) "liobn=%"PRIx64" 0x%"PRIx64" -> 0x%"PRIx64" perm=%u mask=%x"
>  spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
> +spapr_iommu_pre_save(uint64_t liobn, uint32_t nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
> +spapr_iommu_post_load(uint64_t liobn, uint32_t pre_nb, uint32_t post_nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" => %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
>  
>  # hw/ppc/ppc.c
>  ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]


* Re: [Qemu-devel] [PATCH qemu v16 14/19] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2)
  2016-05-16 20:20       ` Alex Williamson
@ 2016-05-26  4:53         ` David Gibson
  0 siblings, 0 replies; 69+ messages in thread
From: David Gibson @ 2016-05-26  4:53 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, qemu-devel, qemu-ppc, Alexander Graf,
	Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 17248 bytes --]

On Mon, May 16, 2016 at 02:20:33PM -0600, Alex Williamson wrote:
> On Mon, 16 May 2016 11:10:05 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
> > On 05/14/2016 08:25 AM, Alex Williamson wrote:
> > > On Wed,  4 May 2016 16:52:26 +1000
> > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > >  
> > >> This makes use of the new "memory registering" feature. The idea is
> > >> to provide the userspace ability to notify the host kernel about pages
> > >> which are going to be used for DMA. Having this information, the host
> > >> kernel can pin them all once per user process, do locked pages
> > >> accounting (once) and not spent time on doing that in real time with
> > >> possible failures which cannot be handled nicely in some cases.
> > >>
> > >> This adds a prereg memory listener which listens on address_space_memory
> > >> and notifies a VFIO container about memory which needs to be
> > >> pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
> > >>
> > >> As there is no per-IOMMU-type release() callback anymore, this stores
> > >> the IOMMU type in the container so vfio_listener_release() can determine
> > >> if it needs to unregister @prereg_listener.
> > >>
> > >> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> > >> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> > >> not call it when v2 is detected and enabled.
> > >>
> > >> This enforces guest RAM blocks to be host page size aligned; however
> > >> this is not new as KVM already requires memory slots to be host page
> > >> size aligned.
> > >>
> > >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > >> ---
> > >> Changes:
> > >> v16:
> > >> * switched to 64bit math everywhere as there is no chance to see
> > >> region_add on RAM blocks even remotely close to 1<<64bytes.
> > >>
> > >> v15:
> > >> * banned unaligned sections
> > >> * added an vfio_prereg_gpa_to_ua() helper
> > >>
> > >> v14:
> > >> * s/free_container_exit/listener_release_exit/g
> > >> * added "if memory_region_is_iommu()" to vfio_prereg_listener_skipped_section
> > >> ---
> > >>  hw/vfio/Makefile.objs         |   1 +
> > >>  hw/vfio/common.c              |  38 +++++++++---
> > >>  hw/vfio/prereg.c              | 137 ++++++++++++++++++++++++++++++++++++++++++
> > >>  include/hw/vfio/vfio-common.h |   4 ++
> > >>  trace-events                  |   2 +
> > >>  5 files changed, 172 insertions(+), 10 deletions(-)
> > >>  create mode 100644 hw/vfio/prereg.c
> > >>
> > >> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> > >> index ceddbb8..5800e0e 100644
> > >> --- a/hw/vfio/Makefile.objs
> > >> +++ b/hw/vfio/Makefile.objs
> > >> @@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
> > >>  obj-$(CONFIG_SOFTMMU) += platform.o
> > >>  obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
> > >>  obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
> > >> +obj-$(CONFIG_SOFTMMU) += prereg.o
> > >>  endif
> > >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > >> index 2050040..496eb82 100644
> > >> --- a/hw/vfio/common.c
> > >> +++ b/hw/vfio/common.c
> > >> @@ -501,6 +501,9 @@ static const MemoryListener vfio_memory_listener = {
> > >>  static void vfio_listener_release(VFIOContainer *container)
> > >>  {
> > >>      memory_listener_unregister(&container->listener);
> > >> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> > >> +        memory_listener_unregister(&container->prereg_listener);
> > >> +    }
> > >>  }
> > >>
> > >>  int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
> > >> @@ -808,8 +811,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> > >>              goto free_container_exit;
> > >>          }
> > >>
> > >> -        ret = ioctl(fd, VFIO_SET_IOMMU,
> > >> -                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
> > >> +        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
> > >> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
> > >>          if (ret) {
> > >>              error_report("vfio: failed to set iommu for container: %m");
> > >>              ret = -errno;
> > >> @@ -834,8 +837,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> > >>          if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
> > >>              container->iova_pgsizes = info.iova_pgsizes;
> > >>          }
> > >> -    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> > >> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
> > >> +               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
> > >>          struct vfio_iommu_spapr_tce_info info;
> > >> +        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
> > >>
> > >>          ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
> > >>          if (ret) {
> > >> @@ -843,7 +848,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> > >>              ret = -errno;
> > >>              goto free_container_exit;
> > >>          }
> > >> -        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
> > >> +        container->iommu_type =
> > >> +            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
> > >> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
> > >>          if (ret) {
> > >>              error_report("vfio: failed to set iommu for container: %m");
> > >>              ret = -errno;
> > >> @@ -855,11 +862,22 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> > >>           * when container fd is closed so we do not call it explicitly
> > >>           * in this file.
> > >>           */
> > >> -        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> > >> -        if (ret) {
> > >> -            error_report("vfio: failed to enable container: %m");
> > >> -            ret = -errno;
> > >> -            goto free_container_exit;
> > >> +        if (!v2) {
> > >> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> > >> +            if (ret) {
> > >> +                error_report("vfio: failed to enable container: %m");
> > >> +                ret = -errno;
> > >> +                goto free_container_exit;
> > >> +            }
> > >> +        } else {
> > >> +            container->prereg_listener = vfio_prereg_listener;
> > >> +
> > >> +            memory_listener_register(&container->prereg_listener,
> > >> +                                     &address_space_memory);
> > >> +            if (container->error) {
> > >> +                error_report("vfio: RAM memory listener initialization failed for container");
> > >> +                goto listener_release_exit;
> > >> +            }
> > >>          }
> > >>
> > >>          /*
> > >> @@ -872,7 +890,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> > >>          if (ret) {
> > >>              error_report("vfio: VFIO_IOMMU_SPAPR_TCE_GET_INFO failed: %m");
> > >>              ret = -errno;
> > >> -            goto free_container_exit;
> > >> +            goto listener_release_exit;
> > >>          }
> > >>          container->min_iova = info.dma32_window_start;
> > >>          container->max_iova = container->min_iova + info.dma32_window_size - 1;
> > >> diff --git a/hw/vfio/prereg.c b/hw/vfio/prereg.c
> > >> new file mode 100644
> > >> index 0000000..d0e4728
> > >> --- /dev/null
> > >> +++ b/hw/vfio/prereg.c
> > >> @@ -0,0 +1,137 @@
> > >> +/*
> > >> + * DMA memory preregistration
> > >> + *
> > >> + * Authors:
> > >> + *  Alexey Kardashevskiy <aik@ozlabs.ru>
> > >> + *
> > >> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> > >> + * the COPYING file in the top-level directory.
> > >> + */
> > >> +
> > >> +#include "qemu/osdep.h"
> > >> +#include <sys/ioctl.h>
> > >> +#include <linux/vfio.h>
> > >> +
> > >> +#include "hw/vfio/vfio-common.h"
> > >> +#include "hw/hw.h"
> > >> +#include "qemu/error-report.h"
> > >> +#include "trace.h"
> > >> +
> > >> +static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
> > >> +{
> > >> +    if (memory_region_is_iommu(section->mr)) {
> > >> +        error_report("Cannot possibly preregister IOMMU memory");  
> > >
> > > What is a user supposed to do with this error_report()?  Is it
> > > continue-able?  How is it possible?  What should they do differently?  
> > 
> > 
> > If I remember correctly, David did have theories where this may be 
> > possible, not today with the existing code though. Could be assert() or 
> > abort(), what is better here?
> 
> If it's a hardware configuration error, then use hw_error(), I prefer
> not to add either assert() or abort() calls to vfio.

Personally I would have gone with assert() - hitting this is _almost_
certainly an indication of a bug in the code (an IOMMU translating
into an AS that itself has IOMMUs).  In theory a combination of a
weird platform with a multi-layered IOMMU and the wrong config options
could trigger it, but it's very unlikely.

hw_error() is fine by me too.

> > >> +        return true;
> > >> +    }
> > >> +
> > >> +    return !memory_region_is_ram(section->mr) ||
> > >> +            memory_region_is_skip_dump(section->mr);
> > >> +}
> > >> +
> > >> +static void *vfio_prereg_gpa_to_ua(MemoryRegionSection *section, hwaddr gpa)  
> > >
> > > What's "ua"?  
> > 
> > 
> > Userspace address.
> 
> But we use it to set a vaddr below, so let's just call it vaddr.
> 
> > >  
> > >> +{
> > >> +    return memory_region_get_ram_ptr(section->mr) +
> > >> +        section->offset_within_region +
> > >> +        (gpa - section->offset_within_address_space);
> > >> +}
> > >> +
> > >> +static void vfio_prereg_listener_region_add(MemoryListener *listener,
> > >> +                                            MemoryRegionSection *section)
> > >> +{
> > >> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> > >> +                                            prereg_listener);
> > >> +    const hwaddr gpa = section->offset_within_address_space;
> > >> +    hwaddr end;
> > >> +    int ret;
> > >> +    hwaddr page_mask = qemu_real_host_page_mask;
> > >> +    struct vfio_iommu_spapr_register_memory reg = {
> > >> +        .argsz = sizeof(reg),
> > >> +        .flags = 0,
> > >> +    };  
> > >
> > > So we're just pretending that this spapr specific code is some sort of
> > > generic pre-registration interface?  
> > 
> > Yes.
> 
> :-\

It's a bit of an odd mix because the actual caps and ioctls are spapr
specific, but there's nothing inherently spapr specific about the
concept of pre-registration.  It's only implemented for spapr now, but
pre-reg would probably be a good idea for performance on any platform
where the guest expects to actively manage an IOMMU with a reasonably
small window.
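Stripped of the spapr-specific ioctl names, the per-section work is just computing a (vaddr, size) pair for the RAM block; a minimal sketch of that translation, with `ram_ptr` standing in for `memory_region_get_ram_ptr()` and the struct a hypothetical reduction of `MemoryRegionSection`:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reduction of MemoryRegionSection: just the two offsets
 * the translation needs. */
typedef struct {
    uint64_t offset_within_region;
    uint64_t offset_within_address_space;
} Section;

/* Map a guest physical address inside the section to the userspace
 * virtual address that would be handed to the kernel as reg.vaddr,
 * mirroring the arithmetic in vfio_prereg_gpa_to_ua(). */
static void *gpa_to_vaddr(void *ram_ptr, const Section *s, uint64_t gpa)
{
    return (char *)ram_ptr + s->offset_within_region +
           (gpa - s->offset_within_address_space);
}
```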

> 
> > >> +
> > >> +    if (vfio_prereg_listener_skipped_section(section)) {
> > >> +        trace_vfio_listener_region_add_skip(
> > >> +                section->offset_within_address_space,
> > >> +                section->offset_within_address_space +
> > >> +                int128_get64(int128_sub(section->size, int128_one())));
> > >> +        return;
> > >> +    }
> > >> +
> > >> +    if (unlikely((section->offset_within_address_space & ~page_mask) ||
> > >> +                 (section->offset_within_region & ~page_mask) ||
> > >> +                 (int128_get64(section->size) & ~page_mask))) {
> > >> +        error_report("%s received unaligned region", __func__);
> > >> +        return;
> > >> +    }
> > >> +
> > >> +    end = section->offset_within_address_space + int128_get64(section->size);
> > >> +    g_assert(gpa < end);
> > >> +
> > >> +    memory_region_ref(section->mr);
> > >> +
> > >> +    reg.vaddr = (__u64) vfio_prereg_gpa_to_ua(section, gpa);  
> > >
> > > Hmm, why wasn't that simply gpa_to_vaddr?  
> > 
> > I wanted to keep a prefix in all functions, even if they are static, easier 
> > to grep. Bad idea?
> 
> My question about "ua" means that it's not obvious what we're returning
> based on the name of the function alone, so I would avoid such a name.
> 
> > >> +    reg.size = end - gpa;
> > >> +
> > >> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
> > >> +    trace_vfio_ram_register(reg.vaddr, reg.size, ret ? -errno : 0);
> > >> +    if (ret) {
> > >> +        /*
> > >> +         * On the initfn path, store the first error in the container so we
> > >> +         * can gracefully fail.  Runtime, there's not much we can do other
> > >> +         * than throw a hardware error.
> > >> +         */
> > >> +        if (!container->initialized) {
> > >> +            if (!container->error) {
> > >> +                container->error = ret;
> > >> +            }
> > >> +        } else {
> > >> +            hw_error("vfio: Memory registering failed, unable to continue");
> > >> +        }
> > >> +    }
> > >> +}
> > >> +
> > >> +static void vfio_prereg_listener_region_del(MemoryListener *listener,
> > >> +                                            MemoryRegionSection *section)
> > >> +{
> > >> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> > >> +                                            prereg_listener);
> > >> +    const hwaddr gpa = section->offset_within_address_space;
> > >> +    hwaddr end;
> > >> +    int ret;
> > >> +    hwaddr page_mask = qemu_real_host_page_mask;
> > >> +    struct vfio_iommu_spapr_register_memory reg = {
> > >> +        .argsz = sizeof(reg),
> > >> +        .flags = 0,
> > >> +    };
> > >> +
> > >> +    if (vfio_prereg_listener_skipped_section(section)) {
> > >> +        trace_vfio_listener_region_del_skip(
> > >> +                section->offset_within_address_space,
> > >> +                section->offset_within_address_space +
> > >> +                int128_get64(int128_sub(section->size, int128_one())));
> > >> +        return;
> > >> +    }
> > >> +
> > >> +    if (unlikely((section->offset_within_address_space & ~page_mask) ||
> > >> +                 (section->offset_within_region & ~page_mask) ||
> > >> +                 (int128_get64(section->size) & ~page_mask))) {
> > >> +        error_report("%s received unaligned region", __func__);
> > >> +        return;
> > >> +    }
> > >> +
> > >> +    end = section->offset_within_address_space + int128_get64(section->size);
> > >> +    if (gpa >= end) {
> > >> +        return;
> > >> +    }
> > >> +
> > >> +    reg.vaddr = (__u64) vfio_prereg_gpa_to_ua(section, gpa);
> > >> +    reg.size = end - gpa;
> > >> +
> > >> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
> > >> +    trace_vfio_ram_unregister(reg.vaddr, reg.size, ret ? -errno : 0);
> > >> +}
> > >> +
> > >> +const MemoryListener vfio_prereg_listener = {
> > >> +    .region_add = vfio_prereg_listener_region_add,
> > >> +    .region_del = vfio_prereg_listener_region_del,
> > >> +};
> > >> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> > >> index c9b6622..c72e45a 100644
> > >> --- a/include/hw/vfio/vfio-common.h
> > >> +++ b/include/hw/vfio/vfio-common.h
> > >> @@ -73,6 +73,8 @@ typedef struct VFIOContainer {
> > >>      VFIOAddressSpace *space;
> > >>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
> > >>      MemoryListener listener;
> > >> +    MemoryListener prereg_listener;
> > >> +    unsigned iommu_type;
> > >>      int error;
> > >>      bool initialized;
> > >>      /*
> > >> @@ -156,4 +158,6 @@ extern QLIST_HEAD(vfio_as_head, VFIOAddressSpace) vfio_address_spaces;
> > >>  int vfio_get_region_info(VFIODevice *vbasedev, int index,
> > >>                           struct vfio_region_info **info);
> > >>  #endif
> > >> +extern const MemoryListener vfio_prereg_listener;
> > >> +
> > >>  #endif /* !HW_VFIO_VFIO_COMMON_H */
> > >> diff --git a/trace-events b/trace-events
> > >> index dd50005..d0d8615 100644
> > >> --- a/trace-events
> > >> +++ b/trace-events
> > >> @@ -1737,6 +1737,8 @@ vfio_region_mmap(const char *name, unsigned long offset, unsigned long end) "Reg
> > >>  vfio_region_exit(const char *name, int index) "Device %s, region %d"
> > >>  vfio_region_finalize(const char *name, int index) "Device %s, region %d"
> > >>  vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
> > >> +vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> > >> +vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> > >>
> > >>  # hw/vfio/platform.c
> > >>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"  
> > 
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]


* Re: [Qemu-devel] [PATCH qemu v16 16/19] vfio: Add host side DMA window capabilities
  2016-05-13 22:25   ` Alex Williamson
@ 2016-05-27  0:36     ` David Gibson
  0 siblings, 0 replies; 69+ messages in thread
From: David Gibson @ 2016-05-27  0:36 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, qemu-devel, qemu-ppc, Alexander Graf,
	Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 3155 bytes --]

On Fri, May 13, 2016 at 04:25:59PM -0600, Alex Williamson wrote:
> On Wed,  4 May 2016 16:52:28 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
> > There are going to be multiple IOMMUs per a container. This moves
> > the single host IOMMU parameter set to a list of VFIOHostDMAWindow.
> > 
> > This should cause no behavioral change and will be used later by
> > the SPAPR TCE IOMMU v2 which will also add a vfio_host_win_del() helper.
> > 
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> > Changes:
> > v16:
> > * adjusted commit log with changes from v15
> > 
> > v15:
> > * s/vfio_host_iommu_add/vfio_host_win_add/
> > * s/VFIOHostIOMMU/VFIOHostDMAWindow/
> > ---
> >  hw/vfio/common.c              | 65 +++++++++++++++++++++++++++++++++----------
> >  include/hw/vfio/vfio-common.h |  9 ++++--
> >  2 files changed, 57 insertions(+), 17 deletions(-)
> > 
> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > index 496eb82..3f2fb23 100644
> > --- a/hw/vfio/common.c
> > +++ b/hw/vfio/common.c
> > @@ -29,6 +29,7 @@
> >  #include "exec/memory.h"
> >  #include "hw/hw.h"
> >  #include "qemu/error-report.h"
> > +#include "qemu/range.h"
> >  #include "sysemu/kvm.h"
> >  #include "trace.h"
> >  
> > @@ -239,6 +240,45 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
> >      return -errno;
> >  }
> >  
> > +static VFIOHostDMAWindow *vfio_host_win_lookup(VFIOContainer *container,
> > +                                               hwaddr min_iova, hwaddr max_iova)
> > +{
> > +    VFIOHostDMAWindow *hostwin;
> > +
> > +    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
> > +        if (hostwin->min_iova <= min_iova && max_iova <= hostwin->max_iova) {
> > +            return hostwin;
> > +        }
> > +    }
> > +
> > +    return NULL;
> > +}
> > +
> > +static int vfio_host_win_add(VFIOContainer *container,
> > +                             hwaddr min_iova, hwaddr max_iova,
> > +                             uint64_t iova_pgsizes)
> > +{
> > +    VFIOHostDMAWindow *hostwin;
> > +
> > +    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
> > +        if (ranges_overlap(min_iova, max_iova - min_iova + 1,
> > +                           hostwin->min_iova,
> > +                           hostwin->max_iova - hostwin->min_iova + 1)) {
> 
> Why does vfio_host_win_lookup() not also use ranges_overlap()?  In
> fact, why don't we call vfio_host_win_lookup here to find the conflict?
> 
> > +            error_report("%s: Overlapped IOMMU are not enabled", __func__);
> > +            return -1;
> 
> Nobody here tests the return value, shouldn't this be fatal?

Hm, yes.  I think hw_error() would be the right choice here.  This
would represent either a qemu programming error, or seriously
unexpected behaviour from the host kernel.
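The point about reusing one predicate for both lookup and conflict detection can be illustrated with a stand-alone sketch: a reduced window list, with the overlap test written in terms of last addresses, following the semantics (though not the exact code) of the helper in `qemu/range.h`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct HostWin {
    uint64_t min_iova;
    uint64_t max_iova;
    struct HostWin *next;
} HostWin;

/* Overlap test on inclusive [min, max] ranges, written with last
 * addresses so it cannot overflow on full-width windows. */
static bool win_overlaps(const HostWin *w, uint64_t min, uint64_t max)
{
    return !(max < w->min_iova || w->max_iova < min);
}

/* One predicate serves both callers: vfio_host_win_add() can use it to
 * find a conflicting window, and a lookup can use it (plus a
 * containment check on the result) to validate a mapping. */
static HostWin *find_overlap(HostWin *list, uint64_t min, uint64_t max)
{
    for (HostWin *w = list; w; w = w->next) {
        if (win_overlaps(w, min, max)) {
            return w;
        }
    }
    return NULL;
}
```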
-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]


* Re: [Qemu-devel] [RFC PATCH qemu] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping listening
  2016-05-20  8:04         ` [Qemu-devel] [RFC PATCH qemu] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping listening Alexey Kardashevskiy
  2016-05-20 15:19           ` Alex Williamson
@ 2016-05-27  0:43           ` David Gibson
  1 sibling, 0 replies; 69+ messages in thread
From: David Gibson @ 2016-05-27  0:43 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-devel, qemu-ppc, Alex Williamson, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 8003 bytes --]

On Fri, May 20, 2016 at 06:04:42PM +1000, Alexey Kardashevskiy wrote:
> The sPAPR TCE tables manage 2 copies when VFIO is using an IOMMU -
> a guest view of the table and a hardware TCE table. If there is no VFIO
> presense in the address space, then just the guest view is used, if

Nit: s/presense/presence/

> this is the case, it is allocated in the KVM. However since there is no
> support yet for VFIO in KVM TCE hypercalls, when we start using VFIO,
> we need to move the guest view from KVM to the userspace; and we need
> to do this for every IOMMU on a bus with VFIO devices.
> 
> This adds notify_started/notify_stopped callbacks in MemoryRegionIOMMUOps
> to notify IOMMU that listeners were set/removed. This allows IOMMU to
> take necessary steps before actual notifications happen and do proper
> cleanup when the last notifier is removed.
> 
> This implements the callbacks for the sPAPR IOMMU - notify_started()
> reallocated the guest view to the user space, notify_stopped() does
> the opposite.
> 
> This removes explicit spapr_tce_set_need_vfio() call from PCI hotplug
> path as the new callbacks do this better - they notify IOMMU at
> the exact moment when the configuration is changed, and this also
> includes the case of PCI hot unplug.
> 
> This adds MemoryRegion* to memory_region_unregister_iommu_notifier()
> as we need iommu_ops to call notify_stopped() and Notifier* does not
> store the owner.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> 
> 
> Is this any better? If so, I'll repost as a part of "v17". Thanks.

I agree with Alex that this is a much better approach.  I'm sad I
didn't think of it earlier.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
> Changes:
> v17:
> * replaced IOMMU users counting with simple QLIST_EMPTY()
> * renamed the callbacks
> * removed requirement for region_del() to be called on memory_listener_unregister()
> 
> v16:
> * added a use counter in VFIOAddressSpace->VFIOIOMMUMR
> 
> v15:
> * s/need_vfio/vfio-Users/g
> ---
>  hw/ppc/spapr_iommu.c  | 12 ++++++++++++
>  hw/ppc/spapr_pci.c    |  6 ------
>  hw/vfio/common.c      |  5 +++--
>  include/exec/memory.h |  8 +++++++-
>  memory.c              | 10 +++++++++-
>  5 files changed, 31 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 73bc26b..fd38006 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -156,6 +156,16 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
>      return 1ULL << tcet->page_shift;
>  }
>  
> +static void spapr_tce_notify_started(MemoryRegion *iommu)
> +{
> +    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), true);
> +}
> +
> +static void spapr_tce_notify_stopped(MemoryRegion *iommu)
> +{
> +    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), false);
> +}
> +

At some point we probably want to use this cleanup to remove the
"need_vfio" names inside the spapr code, but I don't think that's
reasonably within the scope of this patch.
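The first/last-notifier bookkeeping this patch adds to memory.c is worth a stand-alone sketch. The types below are simplified, hypothetical stand-ins (QEMU's real code uses NotifierList and the QLIST_EMPTY() check shown in the diff); the point is only that notify_started() fires when the notifier list goes from empty to non-empty, and notify_stopped() fires when it drains:

```c
#include <stddef.h>

typedef struct Notifier {
    struct Notifier *next;
} Notifier;

typedef struct IOMMUMemoryRegion {
    Notifier *notifiers;   /* singly linked list of registered notifiers */
    int started_calls;     /* times the "first notifier" hook fired */
    int stopped_calls;     /* times the "last notifier" hook fired */
} IOMMUMemoryRegion;

/* Stand-in for .notify_started (e.g. spapr_tce_set_need_vfio(..., true)) */
static void notify_started(IOMMUMemoryRegion *mr) { mr->started_calls++; }
/* Stand-in for .notify_stopped (e.g. spapr_tce_set_need_vfio(..., false)) */
static void notify_stopped(IOMMUMemoryRegion *mr) { mr->stopped_calls++; }

static void register_iommu_notifier(IOMMUMemoryRegion *mr, Notifier *n)
{
    if (mr->notifiers == NULL) {    /* list was empty: first user arrived */
        notify_started(mr);
    }
    n->next = mr->notifiers;
    mr->notifiers = n;
}

static void unregister_iommu_notifier(IOMMUMemoryRegion *mr, Notifier *n)
{
    Notifier **p;

    for (p = &mr->notifiers; *p != NULL; p = &(*p)->next) {
        if (*p == n) {
            *p = n->next;           /* unlink this notifier */
            break;
        }
    }
    if (mr->notifiers == NULL) {    /* list drained: last user left */
        notify_stopped(mr);
    }
}
```

In the patch itself, the equivalent empty-list test is QLIST_EMPTY(&mr->iommu_notify.notifiers), checked before adding a notifier and after removing one.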

>  static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
>  static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
>  
> @@ -240,6 +250,8 @@ static const VMStateDescription vmstate_spapr_tce_table = {
>  static MemoryRegionIOMMUOps spapr_iommu_ops = {
>      .translate = spapr_tce_translate_iommu,
>      .get_page_sizes = spapr_tce_get_page_sizes,
> +    .notify_started = spapr_tce_notify_started,
> +    .notify_stopped = spapr_tce_notify_stopped,
>  };
>  
>  static int spapr_tce_table_realize(DeviceState *dev)
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 68e77b0..7c2c622 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -1089,12 +1089,6 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>      void *fdt = NULL;
>      int fdt_start_offset = 0, fdt_size;
>  
> -    if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
> -        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
> -
> -        spapr_tce_set_need_vfio(tcet, true);
> -    }
> -
>      if (dev->hotplugged) {
>          fdt = create_device_tree(&fdt_size);
>          fdt_start_offset = spapr_create_pci_child_dt(phb, pdev, fdt, 0);
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 2e4f703..d1fa9ab 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -523,7 +523,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
>  
>          QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
>              if (giommu->iommu == section->mr) {
> -                memory_region_unregister_iommu_notifier(&giommu->n);
> +                memory_region_unregister_iommu_notifier(giommu->iommu,
> +                                                        &giommu->n);
>                  QLIST_REMOVE(giommu, giommu_next);
>                  g_free(giommu);
>                  break;
> @@ -1040,7 +1041,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>          QLIST_REMOVE(container, next);
>  
>          QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
> -            memory_region_unregister_iommu_notifier(&giommu->n);
> +            memory_region_unregister_iommu_notifier(giommu->iommu, &giommu->n);
>              QLIST_REMOVE(giommu, giommu_next);
>              g_free(giommu);
>          }
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index bfb08d4..1c41eb6 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -151,6 +151,10 @@ struct MemoryRegionIOMMUOps {
>      IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
>      /* Returns supported page sizes */
>      uint64_t (*get_page_sizes)(MemoryRegion *iommu);
> +    /* Called when the first notifier is set */
> +    void (*notify_started)(MemoryRegion *iommu);
> +    /* Called when the last notifier is removed */
> +    void (*notify_stopped)(MemoryRegion *iommu);
>  };
>  
>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> @@ -620,9 +624,11 @@ void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write);
>   * memory_region_unregister_iommu_notifier: unregister a notifier for
>   * changes to IOMMU translation entries.
>   *
> + * @mr: the memory region which was observed and for which notify_stopped()
> + *      needs to be called
>   * @n: the notifier to be removed.
>   */
> -void memory_region_unregister_iommu_notifier(Notifier *n);
> +void memory_region_unregister_iommu_notifier(MemoryRegion *mr, Notifier *n);
>  
>  /**
>   * memory_region_name: get a memory region's name
> diff --git a/memory.c b/memory.c
> index d22cf5e..fcf978a 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1512,6 +1512,10 @@ bool memory_region_is_logging(MemoryRegion *mr, uint8_t client)
>  
>  void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n)
>  {
> +    if (mr->iommu_ops->notify_started &&
> +        QLIST_EMPTY(&mr->iommu_notify.notifiers)) {
> +        mr->iommu_ops->notify_started(mr);
> +    }
>      notifier_list_add(&mr->iommu_notify, n);
>  }
>  
> @@ -1545,9 +1549,13 @@ void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write)
>      }
>  }
>  
> -void memory_region_unregister_iommu_notifier(Notifier *n)
> +void memory_region_unregister_iommu_notifier(MemoryRegion *mr, Notifier *n)
>  {
>      notifier_remove(n);
> +    if (mr->iommu_ops->notify_stopped &&
> +        QLIST_EMPTY(&mr->iommu_notify.notifiers)) {
> +        mr->iommu_ops->notify_stopped(mr);
> +    }
>  }
>  
>  void memory_region_notify_iommu(MemoryRegion *mr,

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v16 18/19] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)
  2016-05-16 20:20       ` Alex Williamson
@ 2016-05-27  0:50         ` David Gibson
  2016-05-27  3:49         ` Alexey Kardashevskiy
  1 sibling, 0 replies; 69+ messages in thread
From: David Gibson @ 2016-05-27  0:50 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, qemu-devel, qemu-ppc, Alexander Graf,
	Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 3924 bytes --]

On Mon, May 16, 2016 at 02:20:02PM -0600, Alex Williamson wrote:
> On Mon, 16 May 2016 14:52:41 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
> > On 05/14/2016 08:26 AM, Alex Williamson wrote:
> > > On Wed,  4 May 2016 16:52:30 +1000
> > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > >  
> > >> New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
> > >> This adds ability to VFIO common code to dynamically allocate/remove
> > >> DMA windows in the host kernel when new VFIO container is added/removed.
> > >>
> > >> This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
> > >> and adds just created IOMMU into the host IOMMU list; the opposite
> > >> action is taken in vfio_listener_region_del.
> > >>
> > >> When creating a new window, this uses heuristic to decide on the TCE table
> > >> levels number.
> > >>
> > >> This should cause no guest visible change in behavior.
> > >>
> > >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > >> ---
> > >> Changes:
> > >> v16:
> > >> * used memory_region_iommu_get_page_sizes() in vfio_listener_region_add()
> > >> * enforced no intersections between windows
> > >>
> > >> v14:
> > >> * new to the series
> > >> ---
> > >>  hw/vfio/common.c | 133 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
> > >>  trace-events     |   2 +
> > >>  2 files changed, 125 insertions(+), 10 deletions(-)
> > >>
> > >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > >> index 03daf88..bd2dee8 100644
> > >> --- a/hw/vfio/common.c
> > >> +++ b/hw/vfio/common.c
> > >> @@ -240,6 +240,18 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
> > >>      return -errno;
> > >>  }
> > >>
> > >> +static bool range_contains(hwaddr start, hwaddr end, hwaddr addr)
> > >> +{
> > >> +    return start <= addr && addr <= end;
> > >> +}  
> > >
> > > a) If you want a "range_foo" function then put it in range.h
> > > b) I suspect there are already range.h functions that can do this.
> > >  
> > >> +
> > >> +static bool vfio_host_win_intersects(VFIOHostDMAWindow *hostwin,
> > >> +                                     hwaddr min_iova, hwaddr max_iova)
> > >> +{
> > >> +    return range_contains(hostwin->min_iova, hostwin->min_iova, min_iova) ||
> > >> +        range_contains(min_iova, max_iova, hostwin->min_iova);
> > >> +}  
> > >
> > > How is this different than ranges_overlap()?  
> > >> +
> > >>  static VFIOHostDMAWindow *vfio_host_win_lookup(VFIOContainer *container,
> > >>                                                 hwaddr min_iova, hwaddr max_iova)
> > >>  {
> > >> @@ -279,6 +291,14 @@ static int vfio_host_win_add(VFIOContainer *container,
> > >>      return 0;
> > >>  }
> > >>
> > >> +static void vfio_host_win_del(VFIOContainer *container, hwaddr min_iova)
> > >> +{
> > >> +    VFIOHostDMAWindow *hostwin = vfio_host_win_lookup(container, min_iova, 1);
> > >> +
> > >> +    g_assert(hostwin);  
> > >
> > > Handle the error please.  
> > 
> > Will this be enough?
> > 
> >      if (!hostwin) {
> >          error_report("%s: Cannot delete missing window at %"HWADDR_PRIx,
> >                       __func__, min_iova);
> >          return;
> >      }
> 
> Better.  I was really thinking to return error to the caller, but if
> the caller has no return path, perhaps this is as good as we can do.
> Expect that I will push back on any assert() calls added to vfio.

In this case I think returning an error makes more sense.  The caller
understands the context and so can make reasonable error handling
decisions, which will probably just be a fatal error, but it still
makes more logical sense there.
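For concreteness, here is one hedged sketch of the error-returning variant being suggested. The hypothetical host_win_del() below stands in for vfio_host_win_del(); types and list handling are simplified (the real code uses QLIST macros and vfio_host_win_lookup()):

```c
#include <errno.h>
#include <stddef.h>

typedef unsigned long long hwaddr;   /* simplified stand-in */

typedef struct HostDMAWindow {
    hwaddr min_iova;
    hwaddr max_iova;
    struct HostDMAWindow *next;      /* the real code uses QLIST macros */
} HostDMAWindow;

/* Returns 0 on success, -ENOENT if no window starts at min_iova, letting
 * the caller decide whether a missing window is fatal. */
static int host_win_del(HostDMAWindow **head, hwaddr min_iova)
{
    HostDMAWindow **p;

    for (p = head; *p != NULL; p = &(*p)->next) {
        if ((*p)->min_iova == min_iova) {
            *p = (*p)->next;         /* unlink; the caller owns the memory */
            return 0;
        }
    }
    return -ENOENT;
}
```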

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v16 18/19] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)
  2016-05-16 20:20       ` Alex Williamson
  2016-05-27  0:50         ` David Gibson
@ 2016-05-27  3:49         ` Alexey Kardashevskiy
  2016-05-27  4:05           ` David Gibson
  1 sibling, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-27  3:49 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson, Paolo Bonzini

On 17/05/16 06:20, Alex Williamson wrote:
> On Mon, 16 May 2016 14:52:41 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 05/14/2016 08:26 AM, Alex Williamson wrote:
>>> On Wed,  4 May 2016 16:52:30 +1000
>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>  
>>>> New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
>>>> This adds ability to VFIO common code to dynamically allocate/remove
>>>> DMA windows in the host kernel when new VFIO container is added/removed.
>>>>
>>>> This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
>>>> and adds just created IOMMU into the host IOMMU list; the opposite
>>>> action is taken in vfio_listener_region_del.
>>>>
>>>> When creating a new window, this uses heuristic to decide on the TCE table
>>>> levels number.
>>>>
>>>> This should cause no guest visible change in behavior.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>> Changes:
>>>> v16:
>>>> * used memory_region_iommu_get_page_sizes() in vfio_listener_region_add()
>>>> * enforced no intersections between windows
>>>>
>>>> v14:
>>>> * new to the series
>>>> ---
>>>>  hw/vfio/common.c | 133 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
>>>>  trace-events     |   2 +
>>>>  2 files changed, 125 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>> index 03daf88..bd2dee8 100644
>>>> --- a/hw/vfio/common.c
>>>> +++ b/hw/vfio/common.c
>>>> @@ -240,6 +240,18 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>>>      return -errno;
>>>>  }
>>>>
>>>> +static bool range_contains(hwaddr start, hwaddr end, hwaddr addr)
>>>> +{
>>>> +    return start <= addr && addr <= end;
>>>> +}  
>>>
>>> a) If you want a "range_foo" function then put it in range.h
>>> b) I suspect there are already range.h functions that can do this.
>>>  
>>>> +
>>>> +static bool vfio_host_win_intersects(VFIOHostDMAWindow *hostwin,
>>>> +                                     hwaddr min_iova, hwaddr max_iova)
>>>> +{
>>>> +    return range_contains(hostwin->min_iova, hostwin->min_iova, min_iova) ||
>>>> +        range_contains(min_iova, max_iova, hostwin->min_iova);
>>>> +}  
>>>
>>> How is this different than ranges_overlap()?  
>>>> +
>>>>  static VFIOHostDMAWindow *vfio_host_win_lookup(VFIOContainer *container,
>>>>                                                 hwaddr min_iova, hwaddr max_iova)
>>>>  {
>>>> @@ -279,6 +291,14 @@ static int vfio_host_win_add(VFIOContainer *container,
>>>>      return 0;
>>>>  }
>>>>
>>>> +static void vfio_host_win_del(VFIOContainer *container, hwaddr min_iova)
>>>> +{
>>>> +    VFIOHostDMAWindow *hostwin = vfio_host_win_lookup(container, min_iova, 1);
>>>> +
>>>> +    g_assert(hostwin);  
>>>
>>> Handle the error please.  
>>
>> Will this be enough?
>>
>>      if (!hostwin) {
>>          error_report("%s: Cannot delete missing window at %"HWADDR_PRIx,
>>                       __func__, min_iova);
>>          return;
>>      }
> 
> Better.  I was really thinking to return error to the caller, but if
> the caller has no return path, perhaps this is as good as we can do.
> Expect that I will push back on any assert() calls added to vfio.
> 
> 
>>>> +    QLIST_REMOVE(hostwin, hostwin_next);
>>>> +}
>>>> +
>>>>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>>>>  {
>>>>      return (!memory_region_is_ram(section->mr) &&
>>>> @@ -392,6 +412,69 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>>>      }
>>>>      end = int128_get64(int128_sub(llend, int128_one()));
>>>>
>>>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>>> +        VFIOHostDMAWindow *hostwin;
>>>> +        unsigned pagesizes = memory_region_iommu_get_page_sizes(section->mr);
>>>> +        unsigned pagesize = (hwaddr)1 << ctz64(pagesizes);
>>>> +        unsigned entries, pages;
>>>> +        struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
>>>> +
>>>> +        trace_vfio_listener_region_add_iommu(iova, end);
>>>> +        /*
>>>> +         * FIXME: For VFIO iommu types which have KVM acceleration to
>>>> +         * avoid bouncing all map/unmaps through qemu this way, this
>>>> +         * would be the right place to wire that up (tell the KVM
>>>> +         * device emulation the VFIO iommu handles to use).
>>>> +         */
>>>> +        create.window_size = int128_get64(section->size);
>>>> +        create.page_shift = ctz64(pagesize);
>>>> +        /*
>>>> +         * SPAPR host supports multilevel TCE tables, there is some
>>>> +         * heuristic to decide how many levels we want for our table:
>>>> +         * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
>>>> +         */
>>>> +        entries = create.window_size >> create.page_shift;
>>>> +        pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
>>>> +        pages = MAX(pow2ceil(pages) - 1, 1); /* Round up */
>>>> +        create.levels = ctz64(pages) / 6 + 1;
>>>> +
>>>> +        /* For now intersections are not allowed, we may relax this later */
>>>> +        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
>>>> +            if (vfio_host_win_intersects(hostwin,
>>>> +                    section->offset_within_address_space,
>>>> +                    section->offset_within_address_space +
>>>> +                    create.window_size - 1)) {
>>>> +                goto fail;
>>>> +            }
>>>> +        }
>>>> +
>>>> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>>>> +        if (ret) {
>>>> +            error_report("Failed to create a window, ret = %d (%m)", ret);
>>>> +            goto fail;
>>>> +        }
>>>> +
>>>> +        if (create.start_addr != section->offset_within_address_space) {
>>>> +            struct vfio_iommu_spapr_tce_remove remove = {
>>>> +                .argsz = sizeof(remove),
>>>> +                .start_addr = create.start_addr
>>>> +            };
>>>> +            error_report("Host doesn't support DMA window at %"HWADDR_PRIx", must be %"PRIx64,
>>>> +                         section->offset_within_address_space,
>>>> +                         create.start_addr);
>>>> +            ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
>>>> +            ret = -EINVAL;
>>>> +            goto fail;
>>>> +        }
>>>> +        trace_vfio_spapr_create_window(create.page_shift,
>>>> +                                       create.window_size,
>>>> +                                       create.start_addr);
>>>> +
>>>> +        vfio_host_win_add(container, create.start_addr,
>>>> +                          create.start_addr + create.window_size - 1,
>>>> +                          1ULL << create.page_shift);
>>>> +    }  
>>>
>>> This is a function on its own, split it out and why not stop pretending
>>> prereg is some sort of generic interface and let's just make a spapr
>>> support file.  
>>
>>
>> Yet another new file - spapr.c, or rename prereg.c to spapr.c and add this 
>> stuff there?
> 
> prereg.c is already spapr specific, so I'd rename it and potentially
> add this to it.


It would help if you two decided what I should do about prereg.c vs.
spapr.c - merge or keep 2 files. Thanks.




-- 
Alexey
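On Alex's ranges_overlap() question above: note that the posted vfio_host_win_intersects() passes hostwin->min_iova for both bounds of the first range, which looks like a typo for hostwin->max_iova; with that fixed, the whole check reduces to the standard interval-overlap test. A minimal sketch with inclusive bounds:

```c
#include <stdbool.h>

typedef unsigned long long hwaddr;   /* simplified stand-in */

/* Inclusive ranges [s1, e1] and [s2, e2] overlap iff each one starts no
 * later than the other one ends.  QEMU's ranges_overlap() in
 * include/qemu/range.h expresses the same test in start/size form. */
static bool win_overlaps(hwaddr s1, hwaddr e1, hwaddr s2, hwaddr e2)
{
    return s1 <= e2 && s2 <= e1;
}
```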

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v16 18/19] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)
  2016-05-27  3:49         ` Alexey Kardashevskiy
@ 2016-05-27  4:05           ` David Gibson
  0 siblings, 0 replies; 69+ messages in thread
From: David Gibson @ 2016-05-27  4:05 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-devel, qemu-ppc, Alexander Graf, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 8238 bytes --]

On Fri, May 27, 2016 at 01:49:19PM +1000, Alexey Kardashevskiy wrote:
> On 17/05/16 06:20, Alex Williamson wrote:
> > On Mon, 16 May 2016 14:52:41 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > 
> >> On 05/14/2016 08:26 AM, Alex Williamson wrote:
> >>> On Wed,  4 May 2016 16:52:30 +1000
> >>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>  
> >>>> New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
> >>>> This adds ability to VFIO common code to dynamically allocate/remove
> >>>> DMA windows in the host kernel when new VFIO container is added/removed.
> >>>>
> >>>> This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
> >>>> and adds just created IOMMU into the host IOMMU list; the opposite
> >>>> action is taken in vfio_listener_region_del.
> >>>>
> >>>> When creating a new window, this uses heuristic to decide on the TCE table
> >>>> levels number.
> >>>>
> >>>> This should cause no guest visible change in behavior.
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>> ---
> >>>> Changes:
> >>>> v16:
> >>>> * used memory_region_iommu_get_page_sizes() in vfio_listener_region_add()
> >>>> * enforced no intersections between windows
> >>>>
> >>>> v14:
> >>>> * new to the series
> >>>> ---
> >>>>  hw/vfio/common.c | 133 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
> >>>>  trace-events     |   2 +
> >>>>  2 files changed, 125 insertions(+), 10 deletions(-)
> >>>>
> >>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>>> index 03daf88..bd2dee8 100644
> >>>> --- a/hw/vfio/common.c
> >>>> +++ b/hw/vfio/common.c
> >>>> @@ -240,6 +240,18 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
> >>>>      return -errno;
> >>>>  }
> >>>>
> >>>> +static bool range_contains(hwaddr start, hwaddr end, hwaddr addr)
> >>>> +{
> >>>> +    return start <= addr && addr <= end;
> >>>> +}  
> >>>
> >>> a) If you want a "range_foo" function then put it in range.h
> >>> b) I suspect there are already range.h functions that can do this.
> >>>  
> >>>> +
> >>>> +static bool vfio_host_win_intersects(VFIOHostDMAWindow *hostwin,
> >>>> +                                     hwaddr min_iova, hwaddr max_iova)
> >>>> +{
> >>>> +    return range_contains(hostwin->min_iova, hostwin->min_iova, min_iova) ||
> >>>> +        range_contains(min_iova, max_iova, hostwin->min_iova);
> >>>> +}  
> >>>
> >>> How is this different than ranges_overlap()?  
> >>>> +
> >>>>  static VFIOHostDMAWindow *vfio_host_win_lookup(VFIOContainer *container,
> >>>>                                                 hwaddr min_iova, hwaddr max_iova)
> >>>>  {
> >>>> @@ -279,6 +291,14 @@ static int vfio_host_win_add(VFIOContainer *container,
> >>>>      return 0;
> >>>>  }
> >>>>
> >>>> +static void vfio_host_win_del(VFIOContainer *container, hwaddr min_iova)
> >>>> +{
> >>>> +    VFIOHostDMAWindow *hostwin = vfio_host_win_lookup(container, min_iova, 1);
> >>>> +
> >>>> +    g_assert(hostwin);  
> >>>
> >>> Handle the error please.  
> >>
> >> Will this be enough?
> >>
> >>      if (!hostwin) {
> >>          error_report("%s: Cannot delete missing window at %"HWADDR_PRIx,
> >>                       __func__, min_iova);
> >>          return;
> >>      }
> > 
> > Better.  I was really thinking to return error to the caller, but if
> > the caller has no return path, perhaps this is as good as we can do.
> > Expect that I will push back on any assert() calls added to vfio.
> > 
> > 
> >>>> +    QLIST_REMOVE(hostwin, hostwin_next);
> >>>> +}
> >>>> +
> >>>>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> >>>>  {
> >>>>      return (!memory_region_is_ram(section->mr) &&
> >>>> @@ -392,6 +412,69 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >>>>      }
> >>>>      end = int128_get64(int128_sub(llend, int128_one()));
> >>>>
> >>>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> >>>> +        VFIOHostDMAWindow *hostwin;
> >>>> +        unsigned pagesizes = memory_region_iommu_get_page_sizes(section->mr);
> >>>> +        unsigned pagesize = (hwaddr)1 << ctz64(pagesizes);
> >>>> +        unsigned entries, pages;
> >>>> +        struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
> >>>> +
> >>>> +        trace_vfio_listener_region_add_iommu(iova, end);
> >>>> +        /*
> >>>> +         * FIXME: For VFIO iommu types which have KVM acceleration to
> >>>> +         * avoid bouncing all map/unmaps through qemu this way, this
> >>>> +         * would be the right place to wire that up (tell the KVM
> >>>> +         * device emulation the VFIO iommu handles to use).
> >>>> +         */
> >>>> +        create.window_size = int128_get64(section->size);
> >>>> +        create.page_shift = ctz64(pagesize);
> >>>> +        /*
> >>>> +         * SPAPR host supports multilevel TCE tables, there is some
> >>>> +         * heuristic to decide how many levels we want for our table:
> >>>> +         * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
> >>>> +         */
> >>>> +        entries = create.window_size >> create.page_shift;
> >>>> +        pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
> >>>> +        pages = MAX(pow2ceil(pages) - 1, 1); /* Round up */
> >>>> +        create.levels = ctz64(pages) / 6 + 1;
> >>>> +
> >>>> +        /* For now intersections are not allowed, we may relax this later */
> >>>> +        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
> >>>> +            if (vfio_host_win_intersects(hostwin,
> >>>> +                    section->offset_within_address_space,
> >>>> +                    section->offset_within_address_space +
> >>>> +                    create.window_size - 1)) {
> >>>> +                goto fail;
> >>>> +            }
> >>>> +        }
> >>>> +
> >>>> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> >>>> +        if (ret) {
> >>>> +            error_report("Failed to create a window, ret = %d (%m)", ret);
> >>>> +            goto fail;
> >>>> +        }
> >>>> +
> >>>> +        if (create.start_addr != section->offset_within_address_space) {
> >>>> +            struct vfio_iommu_spapr_tce_remove remove = {
> >>>> +                .argsz = sizeof(remove),
> >>>> +                .start_addr = create.start_addr
> >>>> +            };
> >>>> +            error_report("Host doesn't support DMA window at %"HWADDR_PRIx", must be %"PRIx64,
> >>>> +                         section->offset_within_address_space,
> >>>> +                         create.start_addr);
> >>>> +            ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> >>>> +            ret = -EINVAL;
> >>>> +            goto fail;
> >>>> +        }
> >>>> +        trace_vfio_spapr_create_window(create.page_shift,
> >>>> +                                       create.window_size,
> >>>> +                                       create.start_addr);
> >>>> +
> >>>> +        vfio_host_win_add(container, create.start_addr,
> >>>> +                          create.start_addr + create.window_size - 1,
> >>>> +                          1ULL << create.page_shift);
> >>>> +    }  
> >>>
> >>> This is a function on its own, split it out and why not stop pretending
> >>> prereg is some sort of generic interface and let's just make a spapr
> >>> support file.  
> >>
> >>
> >> Yet another new file - spapr.c, or rename prereg.c to spapr.c and add this 
> >> stuff there?
> > 
> > prereg.c is already spapr specific, so I'd rename it and potentially
> > add this to it.
> 
> 
> It would help if you two decided what I should do about prereg.c vs.
> spapr.c - merge or keep 2 files. Thanks.

I'm not particularly fussed either way.  So I suggest putting the
prereg stuff in spapr.c for now, and we can move it out if/when
another platform starts using the mechanism.
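Returning to the window-creation code quoted in this subthread: the level heuristic's comment maps the TCE table's page count to a level count in powers of 64 (0..64 pages = 1 level, 65..4096 = 2, and so on). That mapping, written directly from the comment rather than from the patch's pow2ceil()/ctz64() arithmetic, can be sketched and checked as:

```c
/* Levels per the comment in the patch: each additional level multiplies
 * the addressable table size by 64 (2^6), so this is effectively
 * ceil(log2(pages) / 6), clamped to at least 1.  This implements the
 * documented mapping, not the exact bit tricks the patch uses. */
static unsigned tce_levels(unsigned long pages)
{
    unsigned levels = 1;

    while (pages > 64) {
        pages = (pages + 63) / 64;   /* ceiling divide by 64 */
        levels++;
    }
    return levels;
}
```

The boundary values in the patch's comment (64/65, 4096/4097, 262144/262145) fall out of this loop directly.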

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v16 19/19] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-05-13  8:41   ` Bharata B Rao
  2016-05-13  8:49     ` Bharata B Rao
  2016-05-16  6:25     ` Alexey Kardashevskiy
@ 2016-05-27  4:42     ` David Gibson
  2 siblings, 0 replies; 69+ messages in thread
From: David Gibson @ 2016-05-27  4:42 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Alexey Kardashevskiy, qemu-devel, Alexander Graf,
	Alex Williamson, qemu-ppc, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 19828 bytes --]

On Fri, May 13, 2016 at 02:11:48PM +0530, Bharata B Rao wrote:
> On Wed, May 4, 2016 at 12:22 PM, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > This adds support for Dynamic DMA Windows (DDW) option defined by
> > the SPAPR specification which allows to have additional DMA window(s)
> >
> > The "ddw" property is enabled by default on a PHB but for compatibility
> > the pseries-2.5 machine (TODO: update version) and older disable it.
> > This also creates a single DMA window for the older machines to
> > maintain backward migration.
> >
> > This implements DDW for PHB with emulated and VFIO devices. The host
> > kernel support is required. The advertised IOMMU page sizes are 4K and
> > 64K; 16M pages are supported but not advertised by default, in order to
> > enable them, the user has to specify "pgsz" property for PHB and
> > enable huge pages for RAM.
> >
> > The existing linux guests try creating one additional huge DMA window
> > with 64K or 16MB pages and map the entire guest RAM to. If succeeded,
> > the guest switches to dma_direct_ops and never calls TCE hypercalls
> > (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
> > and not waste time on map/unmap later. This adds a "dma64_win_addr"
> > property which is a bus address for the 64bit window and by default
> > set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
> > uses and this allows having emulated and VFIO devices on the same bus.
> >
> > This adds 4 RTAS handlers:
> > * ibm,query-pe-dma-window
> > * ibm,create-pe-dma-window
> > * ibm,remove-pe-dma-window
> > * ibm,reset-pe-dma-window
> > These are registered from type_init() callback.
> >
> > These RTAS handlers are implemented in a separate file to avoid polluting
> > spapr_iommu.c with PCI.
> >
> > This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs.
> >
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > ---
> > Changes:
> > v16:
> > * s/dma_liobn/dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS]/
> > * s/SPAPR_PCI_LIOBN()/dma_liobn[]/
> >
> > v15:
> > * moved page mask filtering to PHB realize(), use "-mempath" to know
> > if there are huge pages
> > * fixed error reporting in RTAS handlers
> > * max window size accounts now hotpluggable memory boundaries
> > ---
> >  hw/ppc/Makefile.objs        |   1 +
> >  hw/ppc/spapr.c              |   5 +
> >  hw/ppc/spapr_pci.c          |  75 +++++++++---
> >  hw/ppc/spapr_rtas_ddw.c     | 292 ++++++++++++++++++++++++++++++++++++++++++++
> >  include/hw/pci-host/spapr.h |   8 +-
> >  include/hw/ppc/spapr.h      |  16 ++-
> >  trace-events                |   4 +
> >  7 files changed, 381 insertions(+), 20 deletions(-)
> >  create mode 100644 hw/ppc/spapr_rtas_ddw.c
> >
> > diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> > index c1ffc77..986b36f 100644
> > --- a/hw/ppc/Makefile.objs
> > +++ b/hw/ppc/Makefile.objs
> > @@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
> >  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
> >  obj-y += spapr_pci_vfio.o
> >  endif
> > +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
> >  # PowerPC 4xx boards
> >  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
> >  obj-y += ppc4xx_pci.o
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index b69995e..0206609 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -2365,6 +2365,11 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
> >          .driver   = "spapr-vlan", \
> >          .property = "use-rx-buffer-pools", \
> >          .value    = "off", \
> > +    }, \
> > +    {\
> > +        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
> > +        .property = "ddw",\
> > +        .value    = stringify(off),\
> >      },
> >
> >  static void spapr_machine_2_5_instance_options(MachineState *machine)
> > diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> > index 51e7d56..aa414f2 100644
> > --- a/hw/ppc/spapr_pci.c
> > +++ b/hw/ppc/spapr_pci.c
> > @@ -35,6 +35,7 @@
> >  #include "hw/ppc/spapr.h"
> >  #include "hw/pci-host/spapr.h"
> >  #include "exec/address-spaces.h"
> > +#include "exec/ram_addr.h"
> >  #include <libfdt.h>
> >  #include "trace.h"
> >  #include "qemu/error-report.h"
> > @@ -44,6 +45,7 @@
> >  #include "hw/pci/pci_bus.h"
> >  #include "hw/ppc/spapr_drc.h"
> >  #include "sysemu/device_tree.h"
> > +#include "sysemu/hostmem.h"
> >
> >  #include "hw/vfio/vfio.h"
> >
> > @@ -1305,11 +1307,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >      PCIBus *bus;
> >      uint64_t msi_window_size = 4096;
> >      sPAPRTCETable *tcet;
> > +    const unsigned windows_supported =
> > +        sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
> >
> >      if (sphb->index != (uint32_t)-1) {
> >          hwaddr windows_base;
> >
> > -        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn != (uint32_t)-1)
> > +        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
> > +            || ((sphb->dma_liobn[1] != (uint32_t)-1) && (windows_supported > 1))
> >              || (sphb->mem_win_addr != (hwaddr)-1)
> >              || (sphb->io_win_addr != (hwaddr)-1)) {
> >              error_setg(errp, "Either \"index\" or other parameters must"
> > @@ -1324,7 +1329,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >          }
> >
> >          sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
> > -        sphb->dma_liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
> > +        for (i = 0; i < windows_supported; ++i) {
> > +            sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
> > +        }
> >
> >          windows_base = SPAPR_PCI_WINDOW_BASE
> >              + sphb->index * SPAPR_PCI_WINDOW_SPACING;
> > @@ -1337,8 +1344,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >          return;
> >      }
> >
> > -    if (sphb->dma_liobn == (uint32_t)-1) {
> > -        error_setg(errp, "LIOBN not specified for PHB");
> > +    if ((sphb->dma_liobn[0] == (uint32_t)-1) ||
> > +        ((sphb->dma_liobn[1] == (uint32_t)-1) && (windows_supported > 1))) {
> > +        error_setg(errp, "LIOBN(s) not specified for PHB");
> >          return;
> >      }
> >
> > @@ -1456,16 +1464,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >          }
> >      }
> >
> > -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
> > -    if (!tcet) {
> > -        error_setg(errp, "Unable to create TCE table for %s",
> > -                   sphb->dtbusname);
> > -        return;
> > +    /* DMA setup */
> > +    for (i = 0; i < windows_supported; ++i) {
> > +        tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn[i]);
> > +        if (!tcet) {
> > +            error_setg(errp, "Creating window#%d failed for %s",
> > +                       i, sphb->dtbusname);
> > +            return;
> > +        }
> > +        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
> > +                                            spapr_tce_get_iommu(tcet), 0);
> >      }
> >
> > -    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
> > -                                        spapr_tce_get_iommu(tcet), 0);
> > -
> >      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
> >  }
> >
> > @@ -1482,13 +1492,19 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
> >
> >  void spapr_phb_dma_reset(sPAPRPHBState *sphb)
> >  {
> > -    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
> > +    int i;
> > +    sPAPRTCETable *tcet;
> >
> > -    if (tcet && tcet->enabled) {
> > -        spapr_tce_table_disable(tcet);
> > +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> > +        tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[i]);
> > +
> > +        if (tcet && tcet->enabled) {
> > +            spapr_tce_table_disable(tcet);
> > +        }
> >      }
> >
> >      /* Register default 32bit DMA window */
> > +    tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[0]);
> >      spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
> >                             sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT);
> >  }
> > @@ -1510,7 +1526,8 @@ static void spapr_phb_reset(DeviceState *qdev)
> >  static Property spapr_phb_properties[] = {
> >      DEFINE_PROP_UINT32("index", sPAPRPHBState, index, -1),
> >      DEFINE_PROP_UINT64("buid", sPAPRPHBState, buid, -1),
> > -    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn, -1),
> > +    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn[0], -1),
> > +    DEFINE_PROP_UINT32("liobn64", sPAPRPHBState, dma_liobn[1], -1),
> >      DEFINE_PROP_UINT64("mem_win_addr", sPAPRPHBState, mem_win_addr, -1),
> >      DEFINE_PROP_UINT64("mem_win_size", sPAPRPHBState, mem_win_size,
> >                         SPAPR_PCI_MMIO_WIN_SIZE),
> > @@ -1522,6 +1539,11 @@ static Property spapr_phb_properties[] = {
> >      /* Default DMA window is 0..1GB */
> >      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
> >      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
> > +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
> > +                       0x800000000000000ULL),
> > +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
> > +    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
> > +                       (1ULL << 12) | (1ULL << 16)),
> >      DEFINE_PROP_END_OF_LIST(),
> >  };
> >
> > @@ -1598,7 +1620,7 @@ static const VMStateDescription vmstate_spapr_pci = {
> >      .post_load = spapr_pci_post_load,
> >      .fields = (VMStateField[]) {
> >          VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),
> > -        VMSTATE_UINT32_EQUAL(dma_liobn, sPAPRPHBState),
> > +        VMSTATE_UNUSED(4), /* dma_liobn */
> >          VMSTATE_UINT64_EQUAL(mem_win_addr, sPAPRPHBState),
> >          VMSTATE_UINT64_EQUAL(mem_win_size, sPAPRPHBState),
> >          VMSTATE_UINT64_EQUAL(io_win_addr, sPAPRPHBState),
> > @@ -1775,6 +1797,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
> >      uint32_t interrupt_map_mask[] = {
> >          cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
> >      uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
> > +    uint32_t ddw_applicable[] = {
> > +        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
> > +        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
> > +        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
> > +    };
> > +    uint32_t ddw_extensions[] = {
> > +        cpu_to_be32(1),
> > +        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
> > +    };
> >      sPAPRTCETable *tcet;
> >      PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
> >      sPAPRFDT s_fdt;
> > @@ -1799,6 +1830,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
> >      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
> >      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
> >
> > +    /* Dynamic DMA window */
> > +    if (phb->ddw_enabled) {
> > +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
> > +                         sizeof(ddw_applicable)));
> > +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
> > +                         &ddw_extensions, sizeof(ddw_extensions)));
> > +    }
> > +
> >      /* Build the interrupt-map, this must matches what is done
> >       * in pci_spapr_map_irq
> >       */
> > @@ -1822,7 +1861,7 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
> >      _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
> >                       sizeof(interrupt_map)));
> >
> > -    tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
> > +    tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
> >      if (!tcet) {
> >          return -1;
> >      }
> > diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
> > new file mode 100644
> > index 0000000..b4e0686
> > --- /dev/null
> > +++ b/hw/ppc/spapr_rtas_ddw.c
> > @@ -0,0 +1,292 @@
> > +/*
> > + * QEMU sPAPR Dynamic DMA windows support
> > + *
> > + * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
> > + *
> > + *  This program is free software; you can redistribute it and/or modify
> > + *  it under the terms of the GNU General Public License as published by
> > + *  the Free Software Foundation; either version 2 of the License,
> > + *  or (at your option) any later version.
> > + *
> > + *  This program is distributed in the hope that it will be useful,
> > + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + *  GNU General Public License for more details.
> > + *
> > + *  You should have received a copy of the GNU General Public License
> > + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#include "qemu/osdep.h"
> > +#include "qemu/error-report.h"
> > +#include "hw/ppc/spapr.h"
> > +#include "hw/pci-host/spapr.h"
> > +#include "trace.h"
> > +
> > +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
> > +{
> > +    sPAPRTCETable *tcet;
> > +
> > +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> > +    if (tcet && tcet->enabled) {
> > +        ++*(unsigned *)opaque;
> > +    }
> > +    return 0;
> > +}
> > +
> > +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
> > +{
> > +    unsigned ret = 0;
> > +
> > +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
> > +
> > +    return ret;
> > +}
> > +
> > +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
> > +{
> > +    sPAPRTCETable *tcet;
> > +
> > +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> > +    if (tcet && !tcet->enabled) {
> > +        *(uint32_t *)opaque = tcet->liobn;
> > +        return 1;
> > +    }
> > +    return 0;
> > +}
> > +
> > +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
> > +{
> > +    uint32_t liobn = 0;
> > +
> > +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
> > +
> > +    return liobn;
> > +}
> > +
> > +static uint32_t spapr_page_mask_to_query_mask(uint64_t page_mask)
> > +{
> > +    int i;
> > +    uint32_t mask = 0;
> > +    const struct { int shift; uint32_t mask; } masks[] = {
> > +        { 12, RTAS_DDW_PGSIZE_4K },
> > +        { 16, RTAS_DDW_PGSIZE_64K },
> > +        { 24, RTAS_DDW_PGSIZE_16M },
> > +        { 25, RTAS_DDW_PGSIZE_32M },
> > +        { 26, RTAS_DDW_PGSIZE_64M },
> > +        { 27, RTAS_DDW_PGSIZE_128M },
> > +        { 28, RTAS_DDW_PGSIZE_256M },
> > +        { 34, RTAS_DDW_PGSIZE_16G },
> > +    };
> > +
> > +    for (i = 0; i < ARRAY_SIZE(masks); ++i) {
> > +        if (page_mask & (1ULL << masks[i].shift)) {
> > +            mask |= masks[i].mask;
> > +        }
> > +    }
> > +
> > +    return mask;
> > +}
> > +
> > +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
> > +                                         sPAPRMachineState *spapr,
> > +                                         uint32_t token, uint32_t nargs,
> > +                                         target_ulong args,
> > +                                         uint32_t nret, target_ulong rets)
> > +{
> > +    sPAPRPHBState *sphb;
> > +    uint64_t buid, max_window_size;
> > +    uint32_t avail, addr, pgmask = 0;
> > +    MachineState *machine = MACHINE(spapr);
> > +
> > +    if ((nargs != 3) || (nret != 5)) {
> > +        goto param_error_exit;
> > +    }
> > +
> > +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> > +    addr = rtas_ld(args, 0);
> > +    sphb = spapr_pci_find_phb(spapr, buid);
> > +    if (!sphb || !sphb->ddw_enabled) {
> > +        goto param_error_exit;
> > +    }
> > +
> > +    /* Translate page mask to LoPAPR format */
> > +    pgmask = spapr_page_mask_to_query_mask(sphb->page_size_mask);
> > +
> > +    /*
> > +     * This is "Largest contiguous block of TCEs allocated specifically
> > +     * for (that is, are reserved for) this PE".
> > +     * Return the maximum number as maximum supported RAM size was in 4K pages.
> > +     */
> > +    if (machine->ram_size == machine->maxram_size) {
> > +        max_window_size = machine->ram_size >> SPAPR_TCE_PAGE_SHIFT;
> > +    } else {
> > +        MemoryHotplugState *hpms = &spapr->hotplug_memory;
> > +
> > +        max_window_size = hpms->base + memory_region_size(&hpms->mr);
> > +    }
> 
> I guess a SPAPR_TCE_PAGE_SHIFT right shift should be applied to
> max_window_size in both branches (if and else)?
> 
> > +
> > +    avail = SPAPR_PCI_DMA_MAX_WINDOWS - spapr_phb_get_active_win_num(sphb);
> > +
> > +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> > +    rtas_st(rets, 1, avail);
> > +    rtas_st(rets, 2, max_window_size);
> > +    rtas_st(rets, 3, pgmask);
> > +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
> > +
> > +    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
> > +    return;
> > +
> > +param_error_exit:
> > +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> > +}
> > +
> > +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
> > +                                          sPAPRMachineState *spapr,
> > +                                          uint32_t token, uint32_t nargs,
> > +                                          target_ulong args,
> > +                                          uint32_t nret, target_ulong rets)
> > +{
> > +    sPAPRPHBState *sphb;
> > +    sPAPRTCETable *tcet = NULL;
> > +    uint32_t addr, page_shift, window_shift, liobn;
> > +    uint64_t buid;
> > +
> > +    if ((nargs != 5) || (nret != 4)) {
> > +        goto param_error_exit;
> > +    }
> > +
> > +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> > +    addr = rtas_ld(args, 0);
> > +    sphb = spapr_pci_find_phb(spapr, buid);
> > +    if (!sphb || !sphb->ddw_enabled) {
> > +        goto param_error_exit;
> > +    }
> > +
> > +    page_shift = rtas_ld(args, 3);
> > +    window_shift = rtas_ld(args, 4);
> 
> The kernel has a bug due to which a wrong window_shift gets returned here. I
> have posted a possible fix here:
> https://patchwork.ozlabs.org/patch/621497/
> 
> I have tried to work around this issue in QEMU too
> https://lists.nongnu.org/archive/html/qemu-ppc/2016-04/msg00226.html
> 
> But the above work around involves changing the memory representation
> in DT. Hence I feel until the guest kernel changes are available, a
> simpler work around would be to discard the window_shift value above
> and recalculate the right value as below:
> 
> if (machine->ram_size == machine->maxram_size) {
>     max_window_size = machine->ram_size;
> } else {
>      MemoryHotplugState *hpms = &spapr->hotplug_memory;
>      max_window_size = hpms->base + memory_region_size(&hpms->mr);
> }
> window_shift = max_window_size >> SPAPR_TCE_PAGE_SHIFT;
> 
> and create DDW based on this calculated window_shift value. Does that
> sound reasonable ?

No, it really doesn't.  Silently ignoring the parameters to an RTAS
call, and substituting what we think the guest meant sounds like a
terrible idea.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v16 19/19] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-05-17  5:32       ` Bharata B Rao
@ 2016-05-27  4:44         ` David Gibson
  2016-05-27  5:49           ` Bharata B Rao
  0 siblings, 1 reply; 69+ messages in thread
From: David Gibson @ 2016-05-27  4:44 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Alexey Kardashevskiy, qemu-devel, Alexander Graf,
	Alex Williamson, qemu-ppc, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 7506 bytes --]

On Tue, May 17, 2016 at 11:02:48AM +0530, Bharata B Rao wrote:
> On Mon, May 16, 2016 at 11:55 AM, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > On 05/13/2016 06:41 PM, Bharata B Rao wrote:
> >>
> >> On Wed, May 4, 2016 at 12:22 PM, Alexey Kardashevskiy <aik@ozlabs.ru>
> >> wrote:
> >
> >
> >>
> >>> +
> >>> +    avail = SPAPR_PCI_DMA_MAX_WINDOWS -
> >>> spapr_phb_get_active_win_num(sphb);
> >>> +
> >>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> >>> +    rtas_st(rets, 1, avail);
> >>> +    rtas_st(rets, 2, max_window_size);
> >>> +    rtas_st(rets, 3, pgmask);
> >>> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
> >>> +
> >>> +    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size,
> >>> pgmask);
> >>> +    return;
> >>> +
> >>> +param_error_exit:
> >>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> >>> +}
> >>> +
> >>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
> >>> +                                          sPAPRMachineState *spapr,
> >>> +                                          uint32_t token, uint32_t
> >>> nargs,
> >>> +                                          target_ulong args,
> >>> +                                          uint32_t nret, target_ulong
> >>> rets)
> >>> +{
> >>> +    sPAPRPHBState *sphb;
> >>> +    sPAPRTCETable *tcet = NULL;
> >>> +    uint32_t addr, page_shift, window_shift, liobn;
> >>> +    uint64_t buid;
> >>> +
> >>> +    if ((nargs != 5) || (nret != 4)) {
> >>> +        goto param_error_exit;
> >>> +    }
> >>> +
> >>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> >>> +    addr = rtas_ld(args, 0);
> >>> +    sphb = spapr_pci_find_phb(spapr, buid);
> >>> +    if (!sphb || !sphb->ddw_enabled) {
> >>> +        goto param_error_exit;
> >>> +    }
> >>> +
> >>> +    page_shift = rtas_ld(args, 3);
> >>> +    window_shift = rtas_ld(args, 4);
> >>
> >>
> >> The kernel has a bug due to which a wrong window_shift gets returned here. I
> >> have posted a possible fix here:
> >> https://patchwork.ozlabs.org/patch/621497/
> >>
> >> I have tried to work around this issue in QEMU too
> >> https://lists.nongnu.org/archive/html/qemu-ppc/2016-04/msg00226.html
> >>
> >> But the above work around involves changing the memory representation
> >> in DT.
> >
> >
> > What is wrong with this workaround?
> 
> The above workaround will result in different representations for
> memory in DT before and after the workaround.
> 
> Currently for -m 2G, -numa node,nodeid=0,mem=1G -numa
> node,nodeid=1,mem=0.5G, we will have the following nodes in DT:
> 
> memory@0
> memory@40000000
> ibm,dynamic-reconfiguration-memory
> 
> ibm,dynamic-memory will have only DR LMBs:
> 
> [root@localhost ibm,dynamic-reconfiguration-memory]# hexdump ibm,dynamic-memory
> 0000000 0000 000a 0000 0000 8000 0000 8000 0008
> 0000010 0000 0000 ffff ffff 0000 0000 0000 0000
> 0000020 9000 0000 8000 0009 0000 0000 ffff ffff
> 0000030 0000 0000 0000 0000 a000 0000 8000 000a
> 0000040 0000 0000 ffff ffff 0000 0000 0000 0000
> 0000050 b000 0000 8000 000b 0000 0000 ffff ffff
> 0000060 0000 0000 0000 0000 c000 0000 8000 000c
> 0000070 0000 0000 ffff ffff 0000 0000 0000 0000
> 0000080 d000 0000 8000 000d 0000 0000 ffff ffff
> 0000090 0000 0000 0000 0000 e000 0000 8000 000e
> 00000a0 0000 0000 ffff ffff 0000 0000 0000 0000
> 00000b0 f000 0000 8000 000f 0000 0000 ffff ffff
> 00000c0 0000 0000 0000 0001 0000 0000 8000 0010
> 00000d0 0000 0000 ffff ffff 0000 0000 0000 0001
> 00000e0 1000 0000 8000 0011 0000 0000 ffff ffff
> 00000f0 0000 0000
> 
> The memory region looks like this:
> 
> memory-region: system
>   0000000000000000-ffffffffffffffff (prio 0, RW): system
>     0000000000000000-000000005fffffff (prio 0, RW): ppc_spapr.ram
>     0000000080000000-000000011fffffff (prio 0, RW): hotplug-memory
> 
> After this workaround, all this will change like below:
> 
> memory@0
> ibm,dynamic-reconfiguration-memory
> 
> All LMBs in ibm,dynamic-memory:
> 
> [root@localhost ibm,dynamic-reconfiguration-memory]# hexdump ibm,dynamic-memory
> 
> 0000000 0000 0010 0000 0000 0000 0000 8000 0000
> 0000010 0000 0000 0000 0000 0000 0080 0000 0000
> 0000020 1000 0000 8000 0001 0000 0000 0000 0000
> 0000030 0000 0080 0000 0000 2000 0000 8000 0002
> 0000040 0000 0000 0000 0000 0000 0080 0000 0000
> 0000050 3000 0000 8000 0003 0000 0000 0000 0000
> 0000060 0000 0080 0000 0000 4000 0000 8000 0004
> 0000070 0000 0000 0000 0001 0000 0008 0000 0000
> 0000080 5000 0000 8000 0005 0000 0000 0000 0001
> 0000090 0000 0008 0000 0000 6000 0000 8000 0006
> 00000a0 0000 0000 ffff ffff 0000 0000 0000 0000
> 00000b0 7000 0000 8000 0007 0000 0000 ffff ffff
> 00000c0 0000 0000 0000 0000 8000 0000 8000 0008
> 00000d0 0000 0000 ffff ffff 0000 0000 0000 0000
> 00000e0 9000 0000 8000 0009 0000 0000 ffff ffff
> 00000f0 0000 0000 0000 0000 a000 0000 8000 000a
> 0000100 0000 0000 ffff ffff 0000 0000 0000 0000
> 0000110 b000 0000 8000 000b 0000 0000 ffff ffff
> 0000120 0000 0000 0000 0000 c000 0000 8000 000c
> 0000130 0000 0000 ffff ffff 0000 0000 0000 0000
> 0000140 d000 0000 8000 000d 0000 0000 ffff ffff
> 0000150 0000 0000 0000 0000 e000 0000 8000 000e
> 0000160 0000 0000 ffff ffff 0000 0000 0000 0000
> 0000170 f000 0000 8000 000f 0000 0000 ffff ffff
> 0000180 0000 0000
> 
> Hotplug memory region gets a new address range now:
> 
> memory-region: system
>   0000000000000000-ffffffffffffffff (prio 0, RW): system
>     0000000000000000-000000005fffffff (prio 0, RW): ppc_spapr.ram
>     0000000060000000-00000000ffffffff (prio 0, RW): hotplug-memory
> 
> 
> So when a guest that was booted with older QEMU is migrated to a newer
> QEMU that has this workaround, then it will start exhibiting the above
> changes after first reboot post migration.

Ok.. why is that bad?

> If the user has done memory hotplug by explicitly specifying an address at
> the source, then even migration would fail because the addr specified
> at the target will not be part of the hotplug-memory range.

Sorry, not really following the situation you're describing here.

> Hence I believe we shouldn't work around it in this manner, but have the
> workaround in the DDW code where the window can be easily fixed.
> 
> >
> >> Hence I feel until the guest kernel changes are available, a
> >> simpler work around would be to discard the window_shift value above
> >> and recalculate the right value as below:
> >>
> >> if (machine->ram_size == machine->maxram_size) {
> >>     max_window_size = machine->ram_size;
> >> } else {
> >>      MemoryHotplugState *hpms = &spapr->hotplug_memory;
> >>      max_window_size = hpms->base + memory_region_size(&hpms->mr);
> >> }
> >> window_shift = max_window_size >> SPAPR_TCE_PAGE_SHIFT;
> >>
> >> and create DDW based on this calculated window_shift value. Does that
> >> sound reasonable ?
> >
> >
> > The workaround should only do that for the second window, at least, or for
> > the default one but with page size at least 64K; otherwise it is going to be
> > a waste of memory (2MB per each 1GB of guest RAM).
> 
> Ok, will sync up with you separately to understand more about the
> 'two' windows here.
> 
> Regards,
> Bharata.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v16 19/19] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-05-27  4:44         ` David Gibson
@ 2016-05-27  5:49           ` Bharata B Rao
  2016-06-01  3:32             ` Bharata B Rao
  0 siblings, 1 reply; 69+ messages in thread
From: Bharata B Rao @ 2016-05-27  5:49 UTC (permalink / raw)
  To: David Gibson
  Cc: Alexey Kardashevskiy, qemu-devel, Alexander Graf,
	Alex Williamson, qemu-ppc, Paolo Bonzini

On Fri, May 27, 2016 at 10:14 AM, David Gibson
<david@gibson.dropbear.id.au> wrote:
> On Tue, May 17, 2016 at 11:02:48AM +0530, Bharata B Rao wrote:
>> On Mon, May 16, 2016 at 11:55 AM, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>> > On 05/13/2016 06:41 PM, Bharata B Rao wrote:
>> >>
>> >> On Wed, May 4, 2016 at 12:22 PM, Alexey Kardashevskiy <aik@ozlabs.ru>
>> >> wrote:
>> >
>> >
>> >>
>> >>> +
>> >>> +    avail = SPAPR_PCI_DMA_MAX_WINDOWS -
>> >>> spapr_phb_get_active_win_num(sphb);
>> >>> +
>> >>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> >>> +    rtas_st(rets, 1, avail);
>> >>> +    rtas_st(rets, 2, max_window_size);
>> >>> +    rtas_st(rets, 3, pgmask);
>> >>> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
>> >>> +
>> >>> +    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size,
>> >>> pgmask);
>> >>> +    return;
>> >>> +
>> >>> +param_error_exit:
>> >>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> >>> +}
>> >>> +
>> >>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
>> >>> +                                          sPAPRMachineState *spapr,
>> >>> +                                          uint32_t token, uint32_t
>> >>> nargs,
>> >>> +                                          target_ulong args,
>> >>> +                                          uint32_t nret, target_ulong
>> >>> rets)
>> >>> +{
>> >>> +    sPAPRPHBState *sphb;
>> >>> +    sPAPRTCETable *tcet = NULL;
>> >>> +    uint32_t addr, page_shift, window_shift, liobn;
>> >>> +    uint64_t buid;
>> >>> +
>> >>> +    if ((nargs != 5) || (nret != 4)) {
>> >>> +        goto param_error_exit;
>> >>> +    }
>> >>> +
>> >>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> >>> +    addr = rtas_ld(args, 0);
>> >>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> >>> +    if (!sphb || !sphb->ddw_enabled) {
>> >>> +        goto param_error_exit;
>> >>> +    }
>> >>> +
>> >>> +    page_shift = rtas_ld(args, 3);
>> >>> +    window_shift = rtas_ld(args, 4);
>> >>
>> >>
>> >> The kernel has a bug due to which a wrong window_shift gets returned here. I
>> >> have posted a possible fix here:
>> >> https://patchwork.ozlabs.org/patch/621497/
>> >>
>> >> I have tried to work around this issue in QEMU too
>> >> https://lists.nongnu.org/archive/html/qemu-ppc/2016-04/msg00226.html
>> >>
>> >> But the above work around involves changing the memory representation
>> >> in DT.
>> >
>> >
>> > What is wrong with this workaround?
>>
>> The above workaround will result in different representations for
>> memory in DT before and after the workaround.
>>
>> Currently for -m 2G, -numa node,nodeid=0,mem=1G -numa
>> node,nodeid=1,mem=0.5G, we will have the following nodes in DT:
>>
>> memory@0
>> memory@40000000
>> ibm,dynamic-reconfiguration-memory
>>
>> ibm,dynamic-memory will have only DR LMBs:
>>
>> [root@localhost ibm,dynamic-reconfiguration-memory]# hexdump ibm,dynamic-memory
>> 0000000 0000 000a 0000 0000 8000 0000 8000 0008
>> 0000010 0000 0000 ffff ffff 0000 0000 0000 0000
>> 0000020 9000 0000 8000 0009 0000 0000 ffff ffff
>> 0000030 0000 0000 0000 0000 a000 0000 8000 000a
>> 0000040 0000 0000 ffff ffff 0000 0000 0000 0000
>> 0000050 b000 0000 8000 000b 0000 0000 ffff ffff
>> 0000060 0000 0000 0000 0000 c000 0000 8000 000c
>> 0000070 0000 0000 ffff ffff 0000 0000 0000 0000
>> 0000080 d000 0000 8000 000d 0000 0000 ffff ffff
>> 0000090 0000 0000 0000 0000 e000 0000 8000 000e
>> 00000a0 0000 0000 ffff ffff 0000 0000 0000 0000
>> 00000b0 f000 0000 8000 000f 0000 0000 ffff ffff
>> 00000c0 0000 0000 0000 0001 0000 0000 8000 0010
>> 00000d0 0000 0000 ffff ffff 0000 0000 0000 0001
>> 00000e0 1000 0000 8000 0011 0000 0000 ffff ffff
>> 00000f0 0000 0000
>>
>> The memory region looks like this:
>>
>> memory-region: system
>>   0000000000000000-ffffffffffffffff (prio 0, RW): system
>>     0000000000000000-000000005fffffff (prio 0, RW): ppc_spapr.ram
>>     0000000080000000-000000011fffffff (prio 0, RW): hotplug-memory
>>
>> After this workaround, all this will change like below:
>>
>> memory@0
>> ibm,dynamic-reconfiguration-memory
>>
>> All LMBs in ibm,dynamic-memory:
>>
>> [root@localhost ibm,dynamic-reconfiguration-memory]# hexdump ibm,dynamic-memory
>>
>> 0000000 0000 0010 0000 0000 0000 0000 8000 0000
>> 0000010 0000 0000 0000 0000 0000 0080 0000 0000
>> 0000020 1000 0000 8000 0001 0000 0000 0000 0000
>> 0000030 0000 0080 0000 0000 2000 0000 8000 0002
>> 0000040 0000 0000 0000 0000 0000 0080 0000 0000
>> 0000050 3000 0000 8000 0003 0000 0000 0000 0000
>> 0000060 0000 0080 0000 0000 4000 0000 8000 0004
>> 0000070 0000 0000 0000 0001 0000 0008 0000 0000
>> 0000080 5000 0000 8000 0005 0000 0000 0000 0001
>> 0000090 0000 0008 0000 0000 6000 0000 8000 0006
>> 00000a0 0000 0000 ffff ffff 0000 0000 0000 0000
>> 00000b0 7000 0000 8000 0007 0000 0000 ffff ffff
>> 00000c0 0000 0000 0000 0000 8000 0000 8000 0008
>> 00000d0 0000 0000 ffff ffff 0000 0000 0000 0000
>> 00000e0 9000 0000 8000 0009 0000 0000 ffff ffff
>> 00000f0 0000 0000 0000 0000 a000 0000 8000 000a
>> 0000100 0000 0000 ffff ffff 0000 0000 0000 0000
>> 0000110 b000 0000 8000 000b 0000 0000 ffff ffff
>> 0000120 0000 0000 0000 0000 c000 0000 8000 000c
>> 0000130 0000 0000 ffff ffff 0000 0000 0000 0000
>> 0000140 d000 0000 8000 000d 0000 0000 ffff ffff
>> 0000150 0000 0000 0000 0000 e000 0000 8000 000e
>> 0000160 0000 0000 ffff ffff 0000 0000 0000 0000
>> 0000170 f000 0000 8000 000f 0000 0000 ffff ffff
>> 0000180 0000 0000
>>
>> Hotplug memory region gets a new address range now:
>>
>> memory-region: system
>>   0000000000000000-ffffffffffffffff (prio 0, RW): system
>>     0000000000000000-000000005fffffff (prio 0, RW): ppc_spapr.ram
>>     0000000060000000-00000000ffffffff (prio 0, RW): hotplug-memory
>>
>>
>> So when a guest that was booted with older QEMU is migrated to a newer
>> QEMU that has this workaround, then it will start exhibiting the above
>> changes after first reboot post migration.
>
> Ok.. why is that bad?
>
>> If the user has done memory hotplug by explicitly specifying an address at
>> the source, then even migration would fail because the addr specified
>> at the target will not be part of the hotplug-memory range.
>
> Sorry, not really following the situation you're describing here.

In the original case where the hotplug region was this:
0000000080000000-000000011fffffff (prio 0, RW): hotplug-memory

one could hotplug a DIMM at a specified address like this:

(qemu) object_add memory-backend-ram,id=ram0,size=256M
(qemu) device_add pc-dimm,id=dimm0,memdev=ram0,addr=0x100000000
(qemu) info mtree
0000000080000000-000000011fffffff (prio 0, RW): hotplug-memory
      0000000100000000-000000010fffffff (prio 0, RW): ram0

Now if this guest has to be migrated to a target where we have this
workaround enabled, then the target QEMU started with

-incoming ... -object memory-backend-ram,id=ram0,size=256M -device
pc-dimm,id=dimm0,memdev=ram0,addr=0x100000000

will fail because addr=0x100000000 isn't part of the hotplug-memory
region at the target.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v16 04/19] vmstate: Define VARRAY with VMS_ALLOC
  2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 04/19] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
@ 2016-05-27  7:54   ` Alexey Kardashevskiy
  2016-06-01  2:29     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-27  7:54 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-ppc, Alexander Graf, David Gibson, Alex Williamson, Paolo Bonzini

On 04/05/16 16:52, Alexey Kardashevskiy wrote:
> This allows dynamic allocation for migrating arrays.
> 
> The already existing VMSTATE_VARRAY_UINT32 requires an array to be
> pre-allocated; however, there are cases when the size is not known in
> advance and there is no real need to enforce it.
> 
> This defines another variant of VMSTATE_VARRAY_UINT32 with the VMS_ALLOC
> flag which tells the receiving side to allocate memory for the array
> before receiving the data.
> 
> The first user of it is a dynamic DMA window, whose existence and size
> are totally dynamic.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> Reviewed-by: Thomas Huth <thuth@redhat.com>


Via which tree is this going to go? pseries? Or migration?



> ---
>  include/migration/vmstate.h | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> index 84ee355..1622638 100644
> --- a/include/migration/vmstate.h
> +++ b/include/migration/vmstate.h
> @@ -386,6 +386,16 @@ extern const VMStateInfo vmstate_info_bitmap;
>      .offset     = vmstate_offset_pointer(_state, _field, _type),     \
>  }
>  
> +#define VMSTATE_VARRAY_UINT32_ALLOC(_field, _state, _field_num, _version, _info, _type) {\
> +    .name       = (stringify(_field)),                               \
> +    .version_id = (_version),                                        \
> +    .num_offset = vmstate_offset_value(_state, _field_num, uint32_t),\
> +    .info       = &(_info),                                          \
> +    .size       = sizeof(_type),                                     \
> +    .flags      = VMS_VARRAY_UINT32|VMS_POINTER|VMS_ALLOC,           \
> +    .offset     = vmstate_offset_pointer(_state, _field, _type),     \
> +}
> +
>  #define VMSTATE_VARRAY_UINT16_UNSAFE(_field, _state, _field_num, _version, _info, _type) {\
>      .name       = (stringify(_field)),                               \
>      .version_id = (_version),                                        \
> 


-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v16 08/19] spapr_iommu: Introduce "enabled" state for TCE table
  2016-05-26  3:39   ` David Gibson
@ 2016-05-27  8:01     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-27  8:01 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-devel, qemu-ppc, Alexander Graf, Alex Williamson, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 7886 bytes --]

On 26/05/16 13:39, David Gibson wrote:
> On Wed, May 04, 2016 at 04:52:20PM +1000, Alexey Kardashevskiy wrote:
>> Currently TCE tables are created once at start and their sizes never
>> change. We are going to change that by introducing Dynamic DMA windows
>> support, where the DMA configuration may change during guest execution.
>>
>> This changes spapr_tce_new_table() to create an empty zero-size IOMMU
>> memory region (IOMMU MR). Only LIOBN is assigned by the time of creation.
>> It still will be called once at the owner object (VIO or PHB) creation.
>>
>> This introduces an "enabled" state for TCE table objects with two
>> helper functions - spapr_tce_table_enable()/spapr_tce_table_disable().
>> - spapr_tce_table_enable() receives TCE table parameters, allocates
>> a guest view of the TCE table (in the user space or KVM) and
>> sets the correct size on the IOMMU MR.
>> - spapr_tce_table_disable() disposes the table and resets the IOMMU MR
>> size.
>>
>> This changes the PHB reset handler to do the default DMA initialization
>> instead of spapr_phb_realize(). This does not make a difference now, but
>> later, with more than one DMA window, we will have to remove them all
>> and create the default one on a system reset.
>>
>> No visible change in behaviour is expected except the actual table
>> will be reallocated every reset. We might optimize this later.
>>
>> The other way to implement this would be to dynamically create/remove
>> the TCE table QOM objects, but this would make migration impossible,
>> as the migration code expects all QOM objects to exist at the receiver;
>> so the TCE table objects have to exist when migration begins.
>>
>> spapr_tce_table_do_enable() is separated from spapr_tce_table_enable()
>> as later it will be called at the sPAPRTCETable post-migration stage,
>> when it already has all the properties set after the migration; the same
>> is done for spapr_tce_table_disable().
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>> ---
>> Changes:
>> v15:
>> * made adjustments after removing spapr_phb_dma_window_enable()
>>
>> v14:
>> * added spapr_tce_table_do_disable(), which will make a difference in the
>> following patch with fully dynamic table migration
>>
>> # Conflicts:
>> #	hw/ppc/spapr_pci.c
>> ---
>>  hw/ppc/spapr_iommu.c   | 86 ++++++++++++++++++++++++++++++++++++--------------
>>  hw/ppc/spapr_pci.c     |  8 +++--
>>  hw/ppc/spapr_vio.c     |  8 ++---
>>  include/hw/ppc/spapr.h | 10 +++---
>>  4 files changed, 75 insertions(+), 37 deletions(-)
>>
>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>> index 8132f64..9bcd3f6 100644
>> --- a/hw/ppc/spapr_iommu.c
>> +++ b/hw/ppc/spapr_iommu.c
>> @@ -17,6 +17,7 @@
>>   * License along with this library; if not, see <http://www.gnu.org/licenses/>.
>>   */
>>  #include "qemu/osdep.h"
>> +#include "qemu/error-report.h"
>>  #include "hw/hw.h"
>>  #include "sysemu/kvm.h"
>>  #include "hw/qdev.h"
>> @@ -174,15 +175,9 @@ static int spapr_tce_table_realize(DeviceState *dev)
>>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
>>  
>>      tcet->fd = -1;
>> -    tcet->table = spapr_tce_alloc_table(tcet->liobn,
>> -                                        tcet->page_shift,
>> -                                        tcet->nb_table,
>> -                                        &tcet->fd,
>> -                                        tcet->need_vfio);
>> -
>> +    tcet->need_vfio = false;
>>      memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
>> -                             "iommu-spapr",
>> -                             (uint64_t)tcet->nb_table << tcet->page_shift);
>> +                             "iommu-spapr", 0);
>>  
>>      QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
>>  
>> @@ -224,14 +219,10 @@ void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio)
>>      tcet->table = newtable;
>>  }
>>  
>> -sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
>> -                                   uint64_t bus_offset,
>> -                                   uint32_t page_shift,
>> -                                   uint32_t nb_table,
>> -                                   bool need_vfio)
>> +sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn)
>>  {
>>      sPAPRTCETable *tcet;
>> -    char tmp[64];
>> +    char tmp[32];
>>  
>>      if (spapr_tce_find_by_liobn(liobn)) {
>>          fprintf(stderr, "Attempted to create TCE table with duplicate"
>> @@ -239,16 +230,8 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
>>          return NULL;
>>      }
>>  
>> -    if (!nb_table) {
>> -        return NULL;
>> -    }
>> -
>>      tcet = SPAPR_TCE_TABLE(object_new(TYPE_SPAPR_TCE_TABLE));
>>      tcet->liobn = liobn;
>> -    tcet->bus_offset = bus_offset;
>> -    tcet->page_shift = page_shift;
>> -    tcet->nb_table = nb_table;
>> -    tcet->need_vfio = need_vfio;
>>  
>>      snprintf(tmp, sizeof(tmp), "tce-table-%x", liobn);
>>      object_property_add_child(OBJECT(owner), tmp, OBJECT(tcet), NULL);
>> @@ -258,14 +241,69 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
>>      return tcet;
>>  }
>>  
>> +static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
>> +{
>> +    if (!tcet->nb_table) {
>> +        return;
>> +    }
>> +
>> +    tcet->table = spapr_tce_alloc_table(tcet->liobn,
>> +                                        tcet->page_shift,
>> +                                        tcet->nb_table,
>> +                                        &tcet->fd,
>> +                                        tcet->need_vfio);
>> +
>> +    memory_region_set_size(&tcet->iommu,
>> +                           (uint64_t)tcet->nb_table << tcet->page_shift);
>> +
>> +    tcet->enabled = true;
>> +}
>> +
>> +void spapr_tce_table_enable(sPAPRTCETable *tcet,
>> +                            uint32_t page_shift, uint64_t bus_offset,
>> +                            uint32_t nb_table)
>> +{
>> +    if (tcet->enabled) {
>> +        error_report("Warning: trying to enable already enabled TCE table");
>> +        return;
>> +    }
>> +
>> +    tcet->bus_offset = bus_offset;
>> +    tcet->page_shift = page_shift;
>> +    tcet->nb_table = nb_table;
>> +
>> +    spapr_tce_table_do_enable(tcet);
>> +}
>> +
>> +static void spapr_tce_table_do_disable(sPAPRTCETable *tcet)
>> +{
>> +    memory_region_set_size(&tcet->iommu, 0);
>> +
>> +    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
>> +    tcet->fd = -1;
>> +    tcet->table = NULL;
>> +    tcet->enabled = false;
>> +    tcet->bus_offset = 0;
>> +    tcet->page_shift = 0;
>> +    tcet->nb_table = 0;
>> +}
>> +
>> +static void spapr_tce_table_disable(sPAPRTCETable *tcet)
>> +{
>> +    if (!tcet->enabled) {
>> +        error_report("Warning: trying to disable already disabled TCE table");
>> +        return;
>> +    }
>> +    spapr_tce_table_do_disable(tcet);
>> +}
>> +
>>  static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
>>  {
>>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
>>  
>>      QLIST_REMOVE(tcet, list);
>>  
>> -    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
>> -    tcet->fd = -1;
>> +    spapr_tce_table_disable(tcet);
> 
> This should probably be do_disable(), or you'll get a spurious error
> if you start and stop a VM but don't enable the table in between, or
> if the guest disables all the tables before the shutdown.


Well, I'll change this as you say (seems more correct) but since
unrealize() can only be called on PHB hotunplug and we do not support this,
this code won't execute.



-- 
Alexey


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]


* Re: [Qemu-devel] [PATCH qemu v16 10/19] spapr_iommu: Migrate full state
  2016-05-26  4:01   ` David Gibson
@ 2016-05-31  8:19     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-05-31  8:19 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-devel, qemu-ppc, Alexander Graf, Alex Williamson, Paolo Bonzini


On 26/05/16 14:01, David Gibson wrote:
> On Wed, May 04, 2016 at 04:52:22PM +1000, Alexey Kardashevskiy wrote:
>> The source guest could have reallocated the default TCE table and
>> migrated a bigger/smaller table. This adds reallocation in post_load()
>> if the default table size differs between source and destination.
>>
>> This adds @bus_offset, @page_shift, @enabled to the migration stream.
>> These cannot change without dynamic DMA windows so no change in
>> behavior is expected now.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> David Gibson <david@gibson.dropbear.id.au>
>> ---
>> Changes:
>> v15:
>> * squashed "migrate full state" into this
>> * added missing tcet->mig_nb_table initialization in spapr_tce_table_pre_save()
>> * instead of bumping the version, moved extra parameters to subsection
>>
>> v14:
>> * new to the series
>> ---
>>  hw/ppc/spapr_iommu.c   | 67 ++++++++++++++++++++++++++++++++++++++++++++++++--
>>  include/hw/ppc/spapr.h |  2 ++
>>  trace-events           |  2 ++
>>  3 files changed, 69 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>> index 9bcd3f6..52b1e0d 100644
>> --- a/hw/ppc/spapr_iommu.c
>> +++ b/hw/ppc/spapr_iommu.c
>> @@ -137,33 +137,96 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
>>      return ret;
>>  }
>>  
>> +static void spapr_tce_table_pre_save(void *opaque)
>> +{
>> +    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
>> +
>> +    tcet->mig_table = tcet->table;
>> +    tcet->mig_nb_table = tcet->nb_table;
>> +
>> +    trace_spapr_iommu_pre_save(tcet->liobn, tcet->mig_nb_table,
>> +                               tcet->bus_offset, tcet->page_shift);
>> +}
>> +
>> +static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
>> +static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
>> +
>>  static int spapr_tce_table_post_load(void *opaque, int version_id)
>>  {
>>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
>> +    uint32_t old_nb_table = tcet->nb_table;
>>  
>>      if (tcet->vdev) {
>>          spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
>>      }
>>  
>> +    if (tcet->enabled) {
>> +        if (tcet->nb_table != tcet->mig_nb_table) {
>> +            if (tcet->nb_table) {
>> +                spapr_tce_table_do_disable(tcet);
>> +            }
>> +            tcet->nb_table = tcet->mig_nb_table;
>> +            spapr_tce_table_do_enable(tcet);
>> +        }
>> +
>> +        memcpy(tcet->table, tcet->mig_table,
>> +               tcet->nb_table * sizeof(tcet->table[0]));
>> +
>> +        free(tcet->mig_table);
>> +        tcet->mig_table = NULL;
>> +    } else if (tcet->table) {
>> +        /* Destination guest has a default table but source does not -> free */
>> +        spapr_tce_table_do_disable(tcet);
>> +    }
>> +
>> +    trace_spapr_iommu_post_load(tcet->liobn, old_nb_table, tcet->nb_table,
>> +                                tcet->bus_offset, tcet->page_shift);
>> +
>>      return 0;
>>  }
>>  
>> +static bool spapr_tce_table_ex_needed(void *opaque)
>> +{
>> +    sPAPRTCETable *tcet = opaque;
>> +
>> +    return tcet->bus_offset || tcet->page_shift != 0xC;
> 
> 	|| !tcet->enabled ??
> 
> AFAICT you're assuming that the existing tcet on the destination will
> be enabled prior to an incoming migration.
> 
>> +}
>> +
>> +static const VMStateDescription vmstate_spapr_tce_table_ex = {
>> +    .name = "spapr_iommu_ex",
>> +    .version_id = 1,
>> +    .minimum_version_id = 1,
>> +    .needed = spapr_tce_table_ex_needed,
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_BOOL(enabled, sPAPRTCETable),
> 
> ..or could you encode enabled as !!mig_nb_table?

Sure.

After a closer look, I can get rid of the "enabled" field altogether and
use nb_table instead.



-- 
Alexey




* Re: [Qemu-devel] [PATCH qemu v16 04/19] vmstate: Define VARRAY with VMS_ALLOC
  2016-05-27  7:54   ` Alexey Kardashevskiy
@ 2016-06-01  2:29     ` Alexey Kardashevskiy
  2016-06-01  8:11       ` Paolo Bonzini
  0 siblings, 1 reply; 69+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-01  2:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-ppc, Alexander Graf, David Gibson, Alex Williamson, Paolo Bonzini

On 27/05/16 17:54, Alexey Kardashevskiy wrote:
> On 04/05/16 16:52, Alexey Kardashevskiy wrote:
>> This allows dynamic allocation for migrating arrays.
>>
>> The already existing VMSTATE_VARRAY_UINT32 requires an array to be
>> pre-allocated; however, there are cases where the size is not known in
>> advance and there is no real need to enforce it.
>>
>> This defines another variant of VMSTATE_VARRAY_UINT32 with the VMS_ALLOC
>> flag which tells the receiving side to allocate memory for the array
>> before receiving the data.
>>
>> The first user of it is a dynamic DMA window whose existence and size
>> are totally dynamic.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>> Reviewed-by: Thomas Huth <thuth@redhat.com>
> 
> 
> In what tree is this going to go? pseries? Or migration?


Anyone?


> 
> 
> 
>> ---
>>  include/migration/vmstate.h | 10 ++++++++++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
>> index 84ee355..1622638 100644
>> --- a/include/migration/vmstate.h
>> +++ b/include/migration/vmstate.h
>> @@ -386,6 +386,16 @@ extern const VMStateInfo vmstate_info_bitmap;
>>      .offset     = vmstate_offset_pointer(_state, _field, _type),     \
>>  }
>>  
>> +#define VMSTATE_VARRAY_UINT32_ALLOC(_field, _state, _field_num, _version, _info, _type) {\
>> +    .name       = (stringify(_field)),                               \
>> +    .version_id = (_version),                                        \
>> +    .num_offset = vmstate_offset_value(_state, _field_num, uint32_t),\
>> +    .info       = &(_info),                                          \
>> +    .size       = sizeof(_type),                                     \
>> +    .flags      = VMS_VARRAY_UINT32|VMS_POINTER|VMS_ALLOC,           \
>> +    .offset     = vmstate_offset_pointer(_state, _field, _type),     \
>> +}
>> +
>>  #define VMSTATE_VARRAY_UINT16_UNSAFE(_field, _state, _field_num, _version, _info, _type) {\
>>      .name       = (stringify(_field)),                               \
>>      .version_id = (_version),                                        \
>>
> 
> 


-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v16 19/19] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-05-27  5:49           ` Bharata B Rao
@ 2016-06-01  3:32             ` Bharata B Rao
  0 siblings, 0 replies; 69+ messages in thread
From: Bharata B Rao @ 2016-06-01  3:32 UTC (permalink / raw)
  To: David Gibson
  Cc: Alexey Kardashevskiy, qemu-devel, Alexander Graf,
	Alex Williamson, qemu-ppc, Paolo Bonzini

On Fri, May 27, 2016 at 11:19 AM, Bharata B Rao <bharata.rao@gmail.com> wrote:
> On Fri, May 27, 2016 at 10:14 AM, David Gibson
> <david@gibson.dropbear.id.au> wrote:
>> On Tue, May 17, 2016 at 11:02:48AM +0530, Bharata B Rao wrote:
>>> On Mon, May 16, 2016 at 11:55 AM, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>> > On 05/13/2016 06:41 PM, Bharata B Rao wrote:
>>> >>
>>> >> On Wed, May 4, 2016 at 12:22 PM, Alexey Kardashevskiy <aik@ozlabs.ru>
>>> >> wrote:
>>> >
>>> >
>>> >>
>>> >>> +
>>> >>> +    avail = SPAPR_PCI_DMA_MAX_WINDOWS -
>>> >>> spapr_phb_get_active_win_num(sphb);
>>> >>> +
>>> >>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>>> >>> +    rtas_st(rets, 1, avail);
>>> >>> +    rtas_st(rets, 2, max_window_size);
>>> >>> +    rtas_st(rets, 3, pgmask);
>>> >>> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
>>> >>> +
>>> >>> +    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size,
>>> >>> pgmask);
>>> >>> +    return;
>>> >>> +
>>> >>> +param_error_exit:
>>> >>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>>> >>> +}
>>> >>> +
>>> >>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
>>> >>> +                                          sPAPRMachineState *spapr,
>>> >>> +                                          uint32_t token, uint32_t
>>> >>> nargs,
>>> >>> +                                          target_ulong args,
>>> >>> +                                          uint32_t nret, target_ulong
>>> >>> rets)
>>> >>> +{
>>> >>> +    sPAPRPHBState *sphb;
>>> >>> +    sPAPRTCETable *tcet = NULL;
>>> >>> +    uint32_t addr, page_shift, window_shift, liobn;
>>> >>> +    uint64_t buid;
>>> >>> +
>>> >>> +    if ((nargs != 5) || (nret != 4)) {
>>> >>> +        goto param_error_exit;
>>> >>> +    }
>>> >>> +
>>> >>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>>> >>> +    addr = rtas_ld(args, 0);
>>> >>> +    sphb = spapr_pci_find_phb(spapr, buid);
>>> >>> +    if (!sphb || !sphb->ddw_enabled) {
>>> >>> +        goto param_error_exit;
>>> >>> +    }
>>> >>> +
>>> >>> +    page_shift = rtas_ld(args, 3);
>>> >>> +    window_shift = rtas_ld(args, 4);
>>> >>
>>> >>
>>> >> The kernel has a bug due to which a wrong window_shift gets returned
>>> >> here. I have posted a possible fix here:
>>> >> https://patchwork.ozlabs.org/patch/621497/
>>> >>
>>> >> I have tried to work around this issue in QEMU too
>>> >> https://lists.nongnu.org/archive/html/qemu-ppc/2016-04/msg00226.html
>>> >>
>>> >> But the above work around involves changing the memory representation
>>> >> in DT.
>>> >
>>> >
>>> > What is wrong with this workaround?
>>>
>>> The above workaround results in different representations of memory in
>>> the DT before and after the workaround is applied.
>>>
>>> Currently for -m 2G, -numa node,nodeid=0,mem=1G -numa
>>> node,nodeid=1,mem=0.5G, we will have the following nodes in DT:
>>>
>>> memory@0
>>> memory@40000000
>>> ibm,dynamic-reconfiguration-memory
>>>
>>> ibm,dynamic-memory will have only DR LMBs:
>>>
>>> [root@localhost ibm,dynamic-reconfiguration-memory]# hexdump ibm,dynamic-memory
>>> 0000000 0000 000a 0000 0000 8000 0000 8000 0008
>>> 0000010 0000 0000 ffff ffff 0000 0000 0000 0000
>>> 0000020 9000 0000 8000 0009 0000 0000 ffff ffff
>>> 0000030 0000 0000 0000 0000 a000 0000 8000 000a
>>> 0000040 0000 0000 ffff ffff 0000 0000 0000 0000
>>> 0000050 b000 0000 8000 000b 0000 0000 ffff ffff
>>> 0000060 0000 0000 0000 0000 c000 0000 8000 000c
>>> 0000070 0000 0000 ffff ffff 0000 0000 0000 0000
>>> 0000080 d000 0000 8000 000d 0000 0000 ffff ffff
>>> 0000090 0000 0000 0000 0000 e000 0000 8000 000e
>>> 00000a0 0000 0000 ffff ffff 0000 0000 0000 0000
>>> 00000b0 f000 0000 8000 000f 0000 0000 ffff ffff
>>> 00000c0 0000 0000 0000 0001 0000 0000 8000 0010
>>> 00000d0 0000 0000 ffff ffff 0000 0000 0000 0001
>>> 00000e0 1000 0000 8000 0011 0000 0000 ffff ffff
>>> 00000f0 0000 0000
>>>
>>> The memory region looks like this:
>>>
>>> memory-region: system
>>>   0000000000000000-ffffffffffffffff (prio 0, RW): system
>>>     0000000000000000-000000005fffffff (prio 0, RW): ppc_spapr.ram
>>>     0000000080000000-000000011fffffff (prio 0, RW): hotplug-memory
>>>
>>> After this workaround, all this will change like below:
>>>
>>> memory@0
>>> ibm,dynamic-reconfiguration-memory
>>>
>>> All LMBs in ibm,dynamic-memory:
>>>
>>> [root@localhost ibm,dynamic-reconfiguration-memory]# hexdump ibm,dynamic-memory
>>>
>>> 0000000 0000 0010 0000 0000 0000 0000 8000 0000
>>> 0000010 0000 0000 0000 0000 0000 0080 0000 0000
>>> 0000020 1000 0000 8000 0001 0000 0000 0000 0000
>>> 0000030 0000 0080 0000 0000 2000 0000 8000 0002
>>> 0000040 0000 0000 0000 0000 0000 0080 0000 0000
>>> 0000050 3000 0000 8000 0003 0000 0000 0000 0000
>>> 0000060 0000 0080 0000 0000 4000 0000 8000 0004
>>> 0000070 0000 0000 0000 0001 0000 0008 0000 0000
>>> 0000080 5000 0000 8000 0005 0000 0000 0000 0001
>>> 0000090 0000 0008 0000 0000 6000 0000 8000 0006
>>> 00000a0 0000 0000 ffff ffff 0000 0000 0000 0000
>>> 00000b0 7000 0000 8000 0007 0000 0000 ffff ffff
>>> 00000c0 0000 0000 0000 0000 8000 0000 8000 0008
>>> 00000d0 0000 0000 ffff ffff 0000 0000 0000 0000
>>> 00000e0 9000 0000 8000 0009 0000 0000 ffff ffff
>>> 00000f0 0000 0000 0000 0000 a000 0000 8000 000a
>>> 0000100 0000 0000 ffff ffff 0000 0000 0000 0000
>>> 0000110 b000 0000 8000 000b 0000 0000 ffff ffff
>>> 0000120 0000 0000 0000 0000 c000 0000 8000 000c
>>> 0000130 0000 0000 ffff ffff 0000 0000 0000 0000
>>> 0000140 d000 0000 8000 000d 0000 0000 ffff ffff
>>> 0000150 0000 0000 0000 0000 e000 0000 8000 000e
>>> 0000160 0000 0000 ffff ffff 0000 0000 0000 0000
>>> 0000170 f000 0000 8000 000f 0000 0000 ffff ffff
>>> 0000180 0000 0000
>>>
>>> Hotplug memory region gets a new address range now:
>>>
>>> memory-region: system
>>>   0000000000000000-ffffffffffffffff (prio 0, RW): system
>>>     0000000000000000-000000005fffffff (prio 0, RW): ppc_spapr.ram
>>>     0000000060000000-00000000ffffffff (prio 0, RW): hotplug-memory
>>>
>>>
>>> So when a guest that was booted with an older QEMU is migrated to a
>>> newer QEMU that has this workaround, it will start exhibiting the
>>> above changes after the first reboot post-migration.
>>
>> Ok.. why is that bad?
>>
>>> If the user has done memory hotplug by explicitly specifying an address
>>> on the source, then migration itself will fail because the addr
>>> specified at the target will not be part of the hotplug-memory range.
>>
>> Sorry, not really following the situation you're describing here.
>
> If the original case where the hotplug region was this:
> 0000000080000000-000000011fffffff (prio 0, RW): hotplug-memory
>
> one could hotplug a DIMM at a specified address like this:
>
> (qemu) object_add memory-backend-ram,id=ram0,size=256M
> (qemu) device_add pc-dimm,id=dimm0,memdev=ram0,addr=0x100000000
> (qemu) info mtree
> 0000000080000000-000000011fffffff (prio 0, RW): hotplug-memory
>       0000000100000000-000000010fffffff (prio 0, RW): ram0
>
> Now if this guest has to be migrated to a target where we have this
> workaround enabled, then the target QEMU started with
>
> -incoming ... -object memory-backend-ram,id=ram0,size=256M -device
> pc-dimm,id=dimm0,memdev=ram0,addr=0x100000000
>
> will fail because addr=0x100000000 isn't part of the hotplug-memory
> region at the target.

And I verified that libvirt indeed always updates the XML with slot and
addr explicitly for the DIMM device, and the same values are used at the
target during migration even when the user hasn't explicitly specified a
slot or addr when hotplugging the memory DIMM. So when addr is used
explicitly like this, any change in the hotplug memory region layout will
break migration.

So is that a good enough reason to put the workaround in the DDW code itself?

Regards,
Bharata.


* Re: [Qemu-devel] [PATCH qemu v16 04/19] vmstate: Define VARRAY with VMS_ALLOC
  2016-06-01  2:29     ` Alexey Kardashevskiy
@ 2016-06-01  8:11       ` Paolo Bonzini
  2016-06-02  0:43         ` David Gibson
  0 siblings, 1 reply; 69+ messages in thread
From: Paolo Bonzini @ 2016-06-01  8:11 UTC (permalink / raw)
  To: Alexey Kardashevskiy, qemu-devel
  Cc: qemu-ppc, Alexander Graf, David Gibson, Alex Williamson



On 01/06/2016 04:29, Alexey Kardashevskiy wrote:
> On 27/05/16 17:54, Alexey Kardashevskiy wrote:
>> On 04/05/16 16:52, Alexey Kardashevskiy wrote:
>>> This allows dynamic allocation for migrating arrays.
>>>
>>> The already existing VMSTATE_VARRAY_UINT32 requires an array to be
>>> pre-allocated; however, there are cases where the size is not known in
>>> advance and there is no real need to enforce it.
>>>
>>> This defines another variant of VMSTATE_VARRAY_UINT32 with the VMS_ALLOC
>>> flag which tells the receiving side to allocate memory for the array
>>> before receiving the data.
>>>
>>> The first user of it is a dynamic DMA window whose existence and size
>>> are totally dynamic.
>>>
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>>> Reviewed-by: Thomas Huth <thuth@redhat.com>
>>
>>
>> In what tree is this going to go? pseries? Or migration?
> 
> Anyone?

Go ahead, include it.

Paolo


* Re: [Qemu-devel] [PATCH qemu v16 04/19] vmstate: Define VARRAY with VMS_ALLOC
  2016-06-01  8:11       ` Paolo Bonzini
@ 2016-06-02  0:43         ` David Gibson
  0 siblings, 0 replies; 69+ messages in thread
From: David Gibson @ 2016-06-02  0:43 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Alexey Kardashevskiy, qemu-devel, qemu-ppc, Alexander Graf,
	Alex Williamson


On Wed, Jun 01, 2016 at 10:11:05AM +0200, Paolo Bonzini wrote:
> 
> 
> On 01/06/2016 04:29, Alexey Kardashevskiy wrote:
> > On 27/05/16 17:54, Alexey Kardashevskiy wrote:
> >> On 04/05/16 16:52, Alexey Kardashevskiy wrote:
> >>> This allows dynamic allocation for migrating arrays.
> >>>
> >>> The already existing VMSTATE_VARRAY_UINT32 requires an array to be
> >>> pre-allocated; however, there are cases where the size is not known in
> >>> advance and there is no real need to enforce it.
> >>>
> >>> This defines another variant of VMSTATE_VARRAY_UINT32 with the VMS_ALLOC
> >>> flag which tells the receiving side to allocate memory for the array
> >>> before receiving the data.
> >>>
> >>> The first user of it is a dynamic DMA window whose existence and size
> >>> are totally dynamic.
> >>>
> >>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> >>> Reviewed-by: Thomas Huth <thuth@redhat.com>
> >>
> >>
> >> In what tree is this going to go? pseries? Or migration?
> > 
> > Anyone?
> 
> Go ahead, include it.

I'm guessing that's an invitation to merge it via my tree, since
Alexey doesn't send direct pull requests.  I've now done so.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



end of thread, other threads:[~2016-06-02  0:46 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-04  6:52 [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 01/19] vfio: Delay DMA address space listener release Alexey Kardashevskiy
2016-05-05 22:39   ` Alex Williamson
2016-05-13  7:16     ` Alexey Kardashevskiy
2016-05-13 22:24       ` Alex Williamson
2016-05-25  6:34         ` David Gibson
2016-05-25 13:59           ` Alex Williamson
2016-05-26  1:00             ` David Gibson
2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 02/19] memory: Call region_del() callbacks on memory listener unregistering Alexey Kardashevskiy
2016-05-05 22:45   ` Alex Williamson
2016-05-26  1:48     ` David Gibson
2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 03/19] memory: Fix IOMMU replay base address Alexey Kardashevskiy
2016-05-26  1:50   ` David Gibson
2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 04/19] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
2016-05-27  7:54   ` Alexey Kardashevskiy
2016-06-01  2:29     ` Alexey Kardashevskiy
2016-06-01  8:11       ` Paolo Bonzini
2016-06-02  0:43         ` David Gibson
2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 05/19] vfio: Check that IOMMU MR translates to system address space Alexey Kardashevskiy
2016-05-26  1:51   ` David Gibson
2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 06/19] spapr_pci: Use correct DMA LIOBN when composing the device tree Alexey Kardashevskiy
2016-05-26  3:17   ` David Gibson
2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 07/19] spapr_iommu: Move table allocation to helpers Alexey Kardashevskiy
2016-05-26  3:32   ` David Gibson
2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 08/19] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
2016-05-26  3:39   ` David Gibson
2016-05-27  8:01     ` Alexey Kardashevskiy
2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 09/19] spapr_iommu: Finish renaming vfio_accel to need_vfio Alexey Kardashevskiy
2016-05-26  3:18   ` David Gibson
2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 10/19] spapr_iommu: Migrate full state Alexey Kardashevskiy
2016-05-26  4:01   ` David Gibson
2016-05-31  8:19     ` Alexey Kardashevskiy
2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 11/19] spapr_iommu: Add root memory region Alexey Kardashevskiy
2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 12/19] spapr_pci: Reset DMA config on PHB reset Alexey Kardashevskiy
2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 13/19] memory: Add reporting of supported page sizes Alexey Kardashevskiy
2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 14/19] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2) Alexey Kardashevskiy
2016-05-13 22:25   ` Alex Williamson
2016-05-16  1:10     ` Alexey Kardashevskiy
2016-05-16 20:20       ` Alex Williamson
2016-05-26  4:53         ` David Gibson
2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 15/19] spapr_pci: Add and export DMA resetting helper Alexey Kardashevskiy
2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 16/19] vfio: Add host side DMA window capabilities Alexey Kardashevskiy
2016-05-13 22:25   ` Alex Williamson
2016-05-27  0:36     ` David Gibson
2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 17/19] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO Alexey Kardashevskiy
2016-05-13 22:26   ` Alex Williamson
2016-05-16  8:35     ` Alexey Kardashevskiy
2016-05-16 20:13       ` Alex Williamson
2016-05-20  8:04         ` [Qemu-devel] [RFC PATCH qemu] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping listening Alexey Kardashevskiy
2016-05-20 15:19           ` Alex Williamson
2016-05-27  0:43           ` David Gibson
2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 18/19] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2) Alexey Kardashevskiy
2016-05-13 22:26   ` Alex Williamson
2016-05-16  4:52     ` Alexey Kardashevskiy
2016-05-16 20:20       ` Alex Williamson
2016-05-27  0:50         ` David Gibson
2016-05-27  3:49         ` Alexey Kardashevskiy
2016-05-27  4:05           ` David Gibson
2016-05-04  6:52 ` [Qemu-devel] [PATCH qemu v16 19/19] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
2016-05-13  8:41   ` Bharata B Rao
2016-05-13  8:49     ` Bharata B Rao
2016-05-16  6:25     ` Alexey Kardashevskiy
2016-05-17  5:32       ` Bharata B Rao
2016-05-27  4:44         ` David Gibson
2016-05-27  5:49           ` Bharata B Rao
2016-06-01  3:32             ` Bharata B Rao
2016-05-27  4:42     ` David Gibson
2016-05-13  4:54 ` [Qemu-devel] [PATCH qemu v16 00/19] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
2016-05-13  5:36   ` Alex Williamson
