* [Qemu-devel] [PATCH qemu v15 00/17] spapr: vfio: Enable Dynamic DMA windows (DDW)
@ 2016-04-04  9:33 Alexey Kardashevskiy
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 01/17] memory: Fix IOMMU replay base address Alexey Kardashevskiy
                   ` (16 more replies)
  0 siblings, 17 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-04-04  9:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

Each Partitionable Endpoint (IOMMU group) has an address range on a PCI bus
where devices are allowed to do DMA. These ranges are called DMA windows.
By default, there is a single DMA window, 1 or 2GB big, mapped at zero
on a PCI bus.

PAPR defines a DDW RTAS API which allows pseries guests
to query the hypervisor about DDW support and capabilities (page size mask
for now). A pseries guest may request additional DMA windows (on top of
the default one) using this RTAS API.
The existing pseries Linux guests request an additional window as big as
the guest RAM and map the entire guest window, which effectively creates
a direct mapping of the guest memory to a PCI bus.
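
As a rough illustration of the sizes involved (not part of the series; the
helper names are hypothetical), the sketch below computes how many TCEs
a window needs and how big the table is, assuming one 64-bit TCE per
IOMMU page as in the uint64_t tables used by the patches:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of TCE entries needed to cover a window. */
uint64_t ddw_nb_tces(uint64_t window_size, uint32_t page_shift)
{
    uint64_t page_size;

    assert(page_shift < 64);
    page_size = 1ULL << page_shift;
    /* Round up so that a partial trailing page still gets an entry. */
    return (window_size + page_size - 1) >> page_shift;
}

/* Hypothetical helper: table size in bytes, one 64-bit TCE per page. */
uint64_t ddw_table_bytes(uint64_t window_size, uint32_t page_shift)
{
    return ddw_nb_tces(window_size, page_shift) * sizeof(uint64_t);
}
```

With these helpers, a 2GB window with 4K pages needs 524288 entries (a 4MB
table), while bigger IOMMU pages shrink the table accordingly.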

This patchset reworks PPC64 IOMMU code and adds necessary structures
to support big windows on pseries.

This patchset is based on git://github.com/dgibson/qemu.git ppc-for-2.7 branch
and was pushed to git@github.com:aik/qemu.git vfio-v15 branch along with
a few patches on top (automatic support of huge pages and in-kernel
acceleration to be reworked).

This implements comments from v14.

This includes "vmstate: Define VARRAY with VMS_ALLOC" as the patchset needs
it; it has been posted separately but has been neither accepted
nor rejected so far.

Please comment. Thanks!


Alexey Kardashevskiy (17):
  memory: Fix IOMMU replay base address
  vmstate: Define VARRAY with VMS_ALLOC
  vfio: Check that IOMMU MR translates to system address space
  spapr_iommu: Move table allocation to helpers
  spapr_iommu: Introduce "enabled" state for TCE table
  spapr_iommu: Finish renaming vfio_accel to need_vfio
  spapr_iommu: Migrate full state
  spapr_iommu: Add root memory region
  spapr_pci: Reset DMA config on PHB reset
  memory: Add reporting of supported page sizes
  vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  spapr_pci: Add and export DMA resetting helper
  vfio: Add host side DMA window capabilities
  spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being
    used by VFIO
  spapr_pci: Get rid of dma_loibn
  vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU
  spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)

 hw/ppc/Makefile.objs          |   1 +
 hw/ppc/spapr.c                |   7 +-
 hw/ppc/spapr_iommu.c          | 242 +++++++++++++++++++++++++++-------
 hw/ppc/spapr_pci.c            |  92 +++++++++----
 hw/ppc/spapr_rtas_ddw.c       | 292 ++++++++++++++++++++++++++++++++++++++++++
 hw/ppc/spapr_vio.c            |   8 +-
 hw/vfio/Makefile.objs         |   1 +
 hw/vfio/common.c              | 243 +++++++++++++++++++++++++++++------
 hw/vfio/prereg.c              | 138 ++++++++++++++++++++
 include/exec/memory.h         |  22 +++-
 include/hw/pci-host/spapr.h   |  10 +-
 include/hw/ppc/spapr.h        |  33 +++--
 include/hw/vfio/vfio-common.h |  14 +-
 include/migration/vmstate.h   |  10 ++
 memory.c                      |  17 ++-
 target-ppc/kvm_ppc.h          |   2 +-
 trace-events                  |  12 +-
 17 files changed, 1001 insertions(+), 143 deletions(-)
 create mode 100644 hw/ppc/spapr_rtas_ddw.c
 create mode 100644 hw/vfio/prereg.c

-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v15 01/17] memory: Fix IOMMU replay base address
  2016-04-04  9:33 [Qemu-devel] [PATCH qemu v15 00/17] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
@ 2016-04-04  9:33 ` Alexey Kardashevskiy
  2016-04-05  1:34   ` David Gibson
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 02/17] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-04-04  9:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

Since a788f227 "memory: Allow replay of IOMMU mapping notifications",
when a new VFIO listener is added, all existing IOMMU mappings are
replayed. However, the base address of an IOMMU memory region (IOMMU MR)
is ignored. This is not a problem for the existing user (pseries) with
its default 32bit DMA window starting at 0, but it becomes one as soon as
there is another DMA window.

This stores the IOMMU's offset_within_address_space and adjusts
the IOVA before calling vfio_dma_map/vfio_dma_unmap.

As the IOMMU notifier expects IOVA offset rather than the absolute
address, this also adjusts IOVA in sPAPR H_PUT_TCE handler before
calling notifier(s).
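
The address arithmetic can be sketched as below. This is only an
illustration: sketch_hwaddr stands in for QEMU's hwaddr and all function
names are hypothetical, but the expressions mirror the ones in the diff.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sketch_hwaddr;   /* stand-in for QEMU's hwaddr */

/* Address-space address at which offset 0 of the IOMMU MR is visible. */
sketch_hwaddr sketch_iommu_offset(sketch_hwaddr offset_within_address_space,
                                  sketch_hwaddr offset_within_region)
{
    return offset_within_address_space - offset_within_region;
}

/* Emulation side: turn a bus address into a page-aligned window offset. */
sketch_hwaddr sketch_ioba_to_offset(sketch_hwaddr ioba,
                                    sketch_hwaddr bus_offset,
                                    sketch_hwaddr page_mask)
{
    assert(ioba >= bus_offset);
    return (ioba - bus_offset) & page_mask;
}

/* VFIO side: turn the notified offset back into an absolute IOVA. */
sketch_hwaddr sketch_offset_to_iova(sketch_hwaddr offset,
                                    sketch_hwaddr iommu_offset)
{
    return offset + iommu_offset;
}
```

For a window at bus address 0 both conversions are no-ops, which is why
the missing offset went unnoticed with only the default window.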

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v15:
* accounted section->offset_within_region
* s/giommu->offset_within_address_space/giommu->iommu_offset/
---
 hw/ppc/spapr_iommu.c          |  2 +-
 hw/vfio/common.c              | 14 ++++++++------
 include/hw/vfio/vfio-common.h |  1 +
 3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 7dd4588..277f289 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -277,7 +277,7 @@ static target_ulong put_tce_emu(sPAPRTCETable *tcet, target_ulong ioba,
     tcet->table[index] = tce;
 
     entry.target_as = &address_space_memory,
-    entry.iova = ioba & page_mask;
+    entry.iova = (ioba - tcet->bus_offset) & page_mask;
     entry.translated_addr = tce & page_mask;
     entry.addr_mask = ~page_mask;
     entry.perm = spapr_tce_iommu_access_flags(tce);
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index fb588d8..27753d8 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -257,14 +257,14 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
     VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
     VFIOContainer *container = giommu->container;
     IOMMUTLBEntry *iotlb = data;
+    hwaddr iova = iotlb->iova + giommu->iommu_offset;
     MemoryRegion *mr;
     hwaddr xlat;
     hwaddr len = iotlb->addr_mask + 1;
     void *vaddr;
     int ret;
 
-    trace_vfio_iommu_map_notify(iotlb->iova,
-                                iotlb->iova + iotlb->addr_mask);
+    trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
 
     /*
      * The IOMMU TLB entry we have just covers translation through
@@ -291,21 +291,21 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
 
     if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
         vaddr = memory_region_get_ram_ptr(mr) + xlat;
-        ret = vfio_dma_map(container, iotlb->iova,
+        ret = vfio_dma_map(container, iova,
                            iotlb->addr_mask + 1, vaddr,
                            !(iotlb->perm & IOMMU_WO) || mr->readonly);
         if (ret) {
             error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx", %p) = %d (%m)",
-                         container, iotlb->iova,
+                         container, iova,
                          iotlb->addr_mask + 1, vaddr, ret);
         }
     } else {
-        ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
+        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
         if (ret) {
             error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx") = %d (%m)",
-                         container, iotlb->iova,
+                         container, iova,
                          iotlb->addr_mask + 1, ret);
         }
     }
@@ -377,6 +377,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
          */
         giommu = g_malloc0(sizeof(*giommu));
         giommu->iommu = section->mr;
+        giommu->iommu_offset = section->offset_within_address_space -
+            section->offset_within_region;
         giommu->container = container;
         giommu->n.notify = vfio_iommu_map_notify;
         QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index eb0e1b0..c9b6622 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -90,6 +90,7 @@ typedef struct VFIOContainer {
 typedef struct VFIOGuestIOMMU {
     VFIOContainer *container;
     MemoryRegion *iommu;
+    hwaddr iommu_offset;
     Notifier n;
     QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
 } VFIOGuestIOMMU;
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v15 02/17] vmstate: Define VARRAY with VMS_ALLOC
  2016-04-04  9:33 [Qemu-devel] [PATCH qemu v15 00/17] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 01/17] memory: Fix IOMMU replay base address Alexey Kardashevskiy
@ 2016-04-04  9:33 ` Alexey Kardashevskiy
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 03/17] vfio: Check that IOMMU MR translates to system address space Alexey Kardashevskiy
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-04-04  9:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

This allows dynamic allocation for migrating arrays.

Already existing VMSTATE_VARRAY_UINT32 requires an array to be
pre-allocated, however there are cases when the size is not known in
advance and there is no real need to enforce it.

This defines another variant of VMSTATE_VARRAY_UINT32 with the VMS_ALLOC
flag which tells the receiving side to allocate memory for the array
before receiving the data.

The first user of it is a dynamic DMA window whose existence and size
are totally dynamic.
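
The idea of allocating on the receiving side can be sketched in plain C as
below. This is not the vmstate code: the structure, the function and the
stream layout (a 32-bit element count followed by the elements, in host
byte order) are assumptions made only for the illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical device state: the element count travels with the array. */
struct sketch_state {
    uint32_t nb_table;
    uint64_t *table;    /* NULL until the receiver allocates it */
};

/*
 * Hypothetical loader: reads the count, allocates the array to match and
 * only then copies the elements. Returns 0 on success, -1 on short input
 * or allocation failure; the old array is kept intact on failure.
 */
int sketch_load_varray(struct sketch_state *s, const uint8_t *buf, size_t len)
{
    uint32_t n;
    uint64_t *table;

    assert(s != NULL && buf != NULL);

    if (len < sizeof(n)) {
        return -1;
    }
    memcpy(&n, buf, sizeof(n));
    if ((len - sizeof(n)) / sizeof(uint64_t) < n) {
        return -1;
    }

    /* Allocate at least one element so that n == 0 is not an error. */
    table = calloc(n ? n : 1, sizeof(uint64_t));
    if (!table) {
        return -1;
    }
    memcpy(table, buf + sizeof(n), (size_t)n * sizeof(uint64_t));

    free(s->table);
    s->table = table;
    s->nb_table = n;
    return 0;
}
```

The point is only the ordering: the receiver does not need to know the
size in advance because the allocation happens after the count is known.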

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Thomas Huth <thuth@redhat.com>
---
 include/migration/vmstate.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 84ee355..1622638 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -386,6 +386,16 @@ extern const VMStateInfo vmstate_info_bitmap;
     .offset     = vmstate_offset_pointer(_state, _field, _type),     \
 }
 
+#define VMSTATE_VARRAY_UINT32_ALLOC(_field, _state, _field_num, _version, _info, _type) {\
+    .name       = (stringify(_field)),                               \
+    .version_id = (_version),                                        \
+    .num_offset = vmstate_offset_value(_state, _field_num, uint32_t),\
+    .info       = &(_info),                                          \
+    .size       = sizeof(_type),                                     \
+    .flags      = VMS_VARRAY_UINT32|VMS_POINTER|VMS_ALLOC,           \
+    .offset     = vmstate_offset_pointer(_state, _field, _type),     \
+}
+
 #define VMSTATE_VARRAY_UINT16_UNSAFE(_field, _state, _field_num, _version, _info, _type) {\
     .name       = (stringify(_field)),                               \
     .version_id = (_version),                                        \
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v15 03/17] vfio: Check that IOMMU MR translates to system address space
  2016-04-04  9:33 [Qemu-devel] [PATCH qemu v15 00/17] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 01/17] memory: Fix IOMMU replay base address Alexey Kardashevskiy
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 02/17] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
@ 2016-04-04  9:33 ` Alexey Kardashevskiy
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 04/17] spapr_iommu: Move table allocation to helpers Alexey Kardashevskiy
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-04-04  9:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

At the moment IOMMU MRs only translate to the system memory.
However, if some new code changes this, we will need a clear indication of
why it is not working, so here is the check.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v15:
* added some spaces

v14:
* new to the series
---
 hw/vfio/common.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 27753d8..23dd738 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -266,6 +266,12 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
 
     trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
 
+    if (iotlb->target_as != &address_space_memory) {
+        error_report("Wrong target AS \"%s\", only system memory is allowed",
+                     iotlb->target_as->name ? iotlb->target_as->name : "none");
+        return;
+    }
+
     /*
      * The IOMMU TLB entry we have just covers translation through
      * this IOMMU to its immediate target.  We need to translate
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v15 04/17] spapr_iommu: Move table allocation to helpers
  2016-04-04  9:33 [Qemu-devel] [PATCH qemu v15 00/17] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (2 preceding siblings ...)
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 03/17] vfio: Check that IOMMU MR translates to system address space Alexey Kardashevskiy
@ 2016-04-04  9:33 ` Alexey Kardashevskiy
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 05/17] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-04-04  9:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

At the moment the presence of vfio-pci devices on a bus affects the way
the guest view table is allocated. If there is no vfio-pci on a PHB
and the host kernel supports KVM acceleration of H_PUT_TCE, a table
is allocated in KVM. However, if there is vfio-pci and we do not yet have
KVM acceleration for these, the table has to be allocated by
the userspace. At the moment the table is allocated once at boot time
but later patches will reallocate it.

This moves kvmppc_create_spapr_tce/g_malloc0 and their counterparts
to helpers.
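
The allocation policy the helper keeps can be sketched outside QEMU as
below. The names are hypothetical and the accelerated allocator is passed
in as a callback purely for the illustration; the window size check
mirrors the "window_size >> 32" test in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical accelerated allocator: returns NULL if it cannot help. */
typedef uint64_t *(*sketch_accel_alloc_fn)(uint64_t window_size, int *fd);

/*
 * Try the accelerated path first, but only for windows below 4GB;
 * otherwise, or if that fails, fall back to a zeroed userspace table
 * and report fd == -1.
 */
uint64_t *sketch_alloc_table(uint32_t page_shift, uint32_t nb_table,
                             bool accel_enabled,
                             sketch_accel_alloc_fn accel_alloc, int *fd)
{
    uint64_t *table = NULL;
    uint64_t window_size = (uint64_t)nb_table << page_shift;

    assert(fd != NULL);

    if (accel_enabled && accel_alloc && !(window_size >> 32)) {
        table = accel_alloc(window_size, fd);
    }

    if (!table) {
        *fd = -1;
        table = calloc(nb_table, sizeof(uint64_t));
    }

    return table;
}
```

A caller can tell which path was taken from fd alone, which is what the
matching free helper needs to know.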

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr_iommu.c | 58 +++++++++++++++++++++++++++++++++++-----------------
 trace-events         |  2 +-
 2 files changed, 40 insertions(+), 20 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 277f289..8132f64 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -75,6 +75,37 @@ static IOMMUAccessFlags spapr_tce_iommu_access_flags(uint64_t tce)
     }
 }
 
+static uint64_t *spapr_tce_alloc_table(uint32_t liobn,
+                                       uint32_t page_shift,
+                                       uint32_t nb_table,
+                                       int *fd,
+                                       bool need_vfio)
+{
+    uint64_t *table = NULL;
+    uint64_t window_size = (uint64_t)nb_table << page_shift;
+
+    if (kvm_enabled() && !(window_size >> 32)) {
+        table = kvmppc_create_spapr_tce(liobn, window_size, fd, need_vfio);
+    }
+
+    if (!table) {
+        *fd = -1;
+        table = g_malloc0(nb_table * sizeof(uint64_t));
+    }
+
+    trace_spapr_iommu_new_table(liobn, table, *fd);
+
+    return table;
+}
+
+static void spapr_tce_free_table(uint64_t *table, int fd, uint32_t nb_table)
+{
+    if (!kvm_enabled() ||
+        (kvmppc_remove_spapr_tce(table, fd, nb_table) != 0)) {
+        g_free(table);
+    }
+}
+
 /* Called from RCU critical section */
 static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
                                                bool is_write)
@@ -141,21 +172,13 @@ static MemoryRegionIOMMUOps spapr_iommu_ops = {
 static int spapr_tce_table_realize(DeviceState *dev)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
-    uint64_t window_size = (uint64_t)tcet->nb_table << tcet->page_shift;
 
-    if (kvm_enabled() && !(window_size >> 32)) {
-        tcet->table = kvmppc_create_spapr_tce(tcet->liobn,
-                                              window_size,
-                                              &tcet->fd,
-                                              tcet->need_vfio);
-    }
-
-    if (!tcet->table) {
-        size_t table_size = tcet->nb_table * sizeof(uint64_t);
-        tcet->table = g_malloc0(table_size);
-    }
-
-    trace_spapr_iommu_new_table(tcet->liobn, tcet, tcet->table, tcet->fd);
+    tcet->fd = -1;
+    tcet->table = spapr_tce_alloc_table(tcet->liobn,
+                                        tcet->page_shift,
+                                        tcet->nb_table,
+                                        &tcet->fd,
+                                        tcet->need_vfio);
 
     memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
                              "iommu-spapr",
@@ -241,11 +264,8 @@ static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
 
     QLIST_REMOVE(tcet, list);
 
-    if (!kvm_enabled() ||
-        (kvmppc_remove_spapr_tce(tcet->table, tcet->fd,
-                                 tcet->nb_table) != 0)) {
-        g_free(tcet->table);
-    }
+    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
+    tcet->fd = -1;
 }
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
diff --git a/trace-events b/trace-events
index 0ad8a1c..62dcbba 100644
--- a/trace-events
+++ b/trace-events
@@ -1430,7 +1430,7 @@ spapr_iommu_pci_get(uint64_t liobn, uint64_t ioba, uint64_t ret, uint64_t tce) "
 spapr_iommu_pci_indirect(uint64_t liobn, uint64_t ioba, uint64_t tce, uint64_t iobaN, uint64_t tceN, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcelist=0x%"PRIx64" iobaN=0x%"PRIx64" tceN=0x%"PRIx64" ret=%"PRId64
 spapr_iommu_pci_stuff(uint64_t liobn, uint64_t ioba, uint64_t tce_value, uint64_t npages, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcevalue=0x%"PRIx64" npages=%"PRId64" ret=%"PRId64
 spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, unsigned pgsize) "liobn=%"PRIx64" 0x%"PRIx64" -> 0x%"PRIx64" perm=%u mask=%x"
-spapr_iommu_new_table(uint64_t liobn, void *tcet, void *table, int fd) "liobn=%"PRIx64" tcet=%p table=%p fd=%d"
+spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
 
 # hw/ppc/ppc.c
 ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v15 05/17] spapr_iommu: Introduce "enabled" state for TCE table
  2016-04-04  9:33 [Qemu-devel] [PATCH qemu v15 00/17] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (3 preceding siblings ...)
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 04/17] spapr_iommu: Move table allocation to helpers Alexey Kardashevskiy
@ 2016-04-04  9:33 ` Alexey Kardashevskiy
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 06/17] spapr_iommu: Finish renaming vfio_accel to need_vfio Alexey Kardashevskiy
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-04-04  9:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

Currently TCE tables are created once at start and their sizes never
change. We are going to change that by introducing Dynamic DMA windows
support where DMA configuration may change during the guest execution.

This changes spapr_tce_new_table() to create an empty zero-size IOMMU
memory region (IOMMU MR). Only the LIOBN is assigned at creation time.
It will still be called once, when the owner object (VIO or PHB) is created.

This introduces an "enabled" state for TCE table objects with two
helper functions - spapr_tce_table_enable()/spapr_tce_table_disable().
- spapr_tce_table_enable() receives TCE table parameters, allocates
a guest view of the TCE table (in the user space or KVM) and
sets the correct size on the IOMMU MR.
- spapr_tce_table_disable() disposes of the table and resets the IOMMU MR
size.
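
The resulting life cycle can be sketched as a tiny state machine. The
structure and function names are hypothetical, mr_size stands in for the
IOMMU MR size, and unlike the helpers in this patch (which return void and
print a warning) the sketch returns an error code so misuse is visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct sketch_tcet {
    bool enabled;
    uint32_t page_shift;
    uint32_t nb_table;
    uint64_t bus_offset;
    uint64_t *table;
    uint64_t mr_size;   /* stand-in for the IOMMU MR size */
};

/* Allocate the table and size the region; fails if already enabled. */
int sketch_enable(struct sketch_tcet *t, uint32_t page_shift,
                  uint64_t bus_offset, uint32_t nb_table)
{
    assert(t != NULL);

    if (t->enabled || !nb_table) {
        return -1;
    }
    t->table = calloc(nb_table, sizeof(uint64_t));
    if (!t->table) {
        return -1;
    }
    t->page_shift = page_shift;
    t->bus_offset = bus_offset;
    t->nb_table = nb_table;
    t->mr_size = (uint64_t)nb_table << page_shift;
    t->enabled = true;
    return 0;
}

/* Free the table and shrink the region to zero; fails if not enabled. */
int sketch_disable(struct sketch_tcet *t)
{
    assert(t != NULL);

    if (!t->enabled) {
        return -1;
    }
    free(t->table);
    t->table = NULL;
    t->mr_size = 0;
    t->page_shift = 0;
    t->bus_offset = 0;
    t->nb_table = 0;
    t->enabled = false;
    return 0;
}
```

The object itself outlives every enable/disable cycle, which is what lets
it exist on the receiving side before migration begins.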

This changes the PHB reset handler to do the default DMA initialization
instead of spapr_phb_realize(). This makes no difference now but
later, with more than just one DMA window, we will have to remove them all
and create the default one on a system reset.

No visible change in behaviour is expected except that the actual table
will be reallocated on every reset. We might optimize this later.

The other way to implement this would be to dynamically create/remove
the TCE table QOM objects, but this would make migration impossible
as the migration code expects all QOM objects to exist at the receiver,
so we have to have TCE table objects created when migration begins.

spapr_tce_table_do_enable() is separated from spapr_tce_table_enable()
as later it will be called at the sPAPRTCETable post-migration stage when
it already has all the properties set after the migration; the same is
done for spapr_tce_table_disable().

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v15:
* made adjustments after removing spapr_phb_dma_window_enable()

v14:
* added spapr_tce_table_do_disable(), will make a difference in a following
patch with fully dynamic table migration

# Conflicts:
#	hw/ppc/spapr_pci.c
---
 hw/ppc/spapr_iommu.c   | 86 ++++++++++++++++++++++++++++++++++++--------------
 hw/ppc/spapr_pci.c     |  8 +++--
 hw/ppc/spapr_vio.c     |  8 ++---
 include/hw/ppc/spapr.h | 10 +++---
 4 files changed, 75 insertions(+), 37 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 8132f64..9bcd3f6 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -17,6 +17,7 @@
  * License along with this library; if not, see <http://www.gnu.org/licenses/>.
  */
 #include "qemu/osdep.h"
+#include "qemu/error-report.h"
 #include "hw/hw.h"
 #include "sysemu/kvm.h"
 #include "hw/qdev.h"
@@ -174,15 +175,9 @@ static int spapr_tce_table_realize(DeviceState *dev)
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
 
     tcet->fd = -1;
-    tcet->table = spapr_tce_alloc_table(tcet->liobn,
-                                        tcet->page_shift,
-                                        tcet->nb_table,
-                                        &tcet->fd,
-                                        tcet->need_vfio);
-
+    tcet->need_vfio = false;
     memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
-                             "iommu-spapr",
-                             (uint64_t)tcet->nb_table << tcet->page_shift);
+                             "iommu-spapr", 0);
 
     QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
 
@@ -224,14 +219,10 @@ void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio)
     tcet->table = newtable;
 }
 
-sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
-                                   uint64_t bus_offset,
-                                   uint32_t page_shift,
-                                   uint32_t nb_table,
-                                   bool need_vfio)
+sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn)
 {
     sPAPRTCETable *tcet;
-    char tmp[64];
+    char tmp[32];
 
     if (spapr_tce_find_by_liobn(liobn)) {
         fprintf(stderr, "Attempted to create TCE table with duplicate"
@@ -239,16 +230,8 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
         return NULL;
     }
 
-    if (!nb_table) {
-        return NULL;
-    }
-
     tcet = SPAPR_TCE_TABLE(object_new(TYPE_SPAPR_TCE_TABLE));
     tcet->liobn = liobn;
-    tcet->bus_offset = bus_offset;
-    tcet->page_shift = page_shift;
-    tcet->nb_table = nb_table;
-    tcet->need_vfio = need_vfio;
 
     snprintf(tmp, sizeof(tmp), "tce-table-%x", liobn);
     object_property_add_child(OBJECT(owner), tmp, OBJECT(tcet), NULL);
@@ -258,14 +241,69 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
     return tcet;
 }
 
+static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
+{
+    if (!tcet->nb_table) {
+        return;
+    }
+
+    tcet->table = spapr_tce_alloc_table(tcet->liobn,
+                                        tcet->page_shift,
+                                        tcet->nb_table,
+                                        &tcet->fd,
+                                        tcet->need_vfio);
+
+    memory_region_set_size(&tcet->iommu,
+                           (uint64_t)tcet->nb_table << tcet->page_shift);
+
+    tcet->enabled = true;
+}
+
+void spapr_tce_table_enable(sPAPRTCETable *tcet,
+                            uint32_t page_shift, uint64_t bus_offset,
+                            uint32_t nb_table)
+{
+    if (tcet->enabled) {
+        error_report("Warning: trying to enable already enabled TCE table");
+        return;
+    }
+
+    tcet->bus_offset = bus_offset;
+    tcet->page_shift = page_shift;
+    tcet->nb_table = nb_table;
+
+    spapr_tce_table_do_enable(tcet);
+}
+
+static void spapr_tce_table_do_disable(sPAPRTCETable *tcet)
+{
+    memory_region_set_size(&tcet->iommu, 0);
+
+    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
+    tcet->fd = -1;
+    tcet->table = NULL;
+    tcet->enabled = false;
+    tcet->bus_offset = 0;
+    tcet->page_shift = 0;
+    tcet->nb_table = 0;
+}
+
+static void spapr_tce_table_disable(sPAPRTCETable *tcet)
+{
+    if (!tcet->enabled) {
+        error_report("Warning: trying to disable already disabled TCE table");
+        return;
+    }
+    spapr_tce_table_do_disable(tcet);
+}
+
 static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
 
     QLIST_REMOVE(tcet, list);
 
-    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
-    tcet->fd = -1;
+    spapr_tce_table_disable(tcet);
 }
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 79baa7b..46f205b 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1461,8 +1461,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     }
 
     nb_table = sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT;
-    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn,
-                               0, SPAPR_TCE_PAGE_SHIFT, nb_table, false);
+    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
     if (!tcet) {
         error_setg(errp, "Unable to create TCE table for %s",
                    sphb->dtbusname);
@@ -1470,7 +1469,10 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     }
 
     /* Register default 32bit DMA window */
-    memory_region_add_subregion(&sphb->iommu_root, sphb->dma_win_addr,
+    spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
+                           nb_table);
+
+    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
                                 spapr_tce_get_iommu(tcet));
 
     sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
diff --git a/hw/ppc/spapr_vio.c b/hw/ppc/spapr_vio.c
index 0f61a55..7f57290 100644
--- a/hw/ppc/spapr_vio.c
+++ b/hw/ppc/spapr_vio.c
@@ -481,11 +481,9 @@ static void spapr_vio_busdev_realize(DeviceState *qdev, Error **errp)
         memory_region_add_subregion_overlap(&dev->mrroot, 0, &dev->mrbypass, 1);
         address_space_init(&dev->as, &dev->mrroot, qdev->id);
 
-        dev->tcet = spapr_tce_new_table(qdev, liobn,
-                                        0,
-                                        SPAPR_TCE_PAGE_SHIFT,
-                                        pc->rtce_window_size >>
-                                        SPAPR_TCE_PAGE_SHIFT, false);
+        dev->tcet = spapr_tce_new_table(qdev, liobn);
+        spapr_tce_table_enable(dev->tcet, SPAPR_TCE_PAGE_SHIFT, 0,
+                               pc->rtce_window_size >> SPAPR_TCE_PAGE_SHIFT);
         dev->tcet->vdev = dev;
         memory_region_add_subregion_overlap(&dev->mrroot, 0,
                                             spapr_tce_get_iommu(dev->tcet), 2);
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 098d85d..75b0b55 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -539,6 +539,7 @@ typedef struct sPAPRTCETable sPAPRTCETable;
 
 struct sPAPRTCETable {
     DeviceState parent;
+    bool enabled;
     uint32_t liobn;
     uint32_t nb_table;
     uint64_t bus_offset;
@@ -566,11 +567,10 @@ void spapr_events_fdt_skel(void *fdt, uint32_t epow_irq);
 int spapr_h_cas_compose_response(sPAPRMachineState *sm,
                                  target_ulong addr, target_ulong size,
                                  bool cpu_update, bool memory_update);
-sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
-                                   uint64_t bus_offset,
-                                   uint32_t page_shift,
-                                   uint32_t nb_table,
-                                   bool need_vfio);
+sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn);
+void spapr_tce_table_enable(sPAPRTCETable *tcet,
+                            uint32_t page_shift, uint64_t bus_offset,
+                            uint32_t nb_table);
 void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio);
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet);
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v15 06/17] spapr_iommu: Finish renaming vfio_accel to need_vfio
  2016-04-04  9:33 [Qemu-devel] [PATCH qemu v15 00/17] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (4 preceding siblings ...)
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 05/17] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
@ 2016-04-04  9:33 ` Alexey Kardashevskiy
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 07/17] spapr_iommu: Migrate full state Alexey Kardashevskiy
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-04-04  9:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

6a81dd17 "spapr_iommu: Rename vfio_accel parameter" renamed vfio_accel
flag everywhere but one spot was missed.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 target-ppc/kvm_ppc.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target-ppc/kvm_ppc.h b/target-ppc/kvm_ppc.h
index fc79312..3b2090e 100644
--- a/target-ppc/kvm_ppc.h
+++ b/target-ppc/kvm_ppc.h
@@ -163,7 +163,7 @@ static inline bool kvmppc_spapr_use_multitce(void)
 
 static inline void *kvmppc_create_spapr_tce(uint32_t liobn,
                                             uint32_t window_size, int *fd,
-                                            bool vfio_accel)
+                                            bool need_vfio)
 {
     return NULL;
 }
-- 
2.5.0.rc3

* [Qemu-devel] [PATCH qemu v15 07/17] spapr_iommu: Migrate full state
  2016-04-04  9:33 [Qemu-devel] [PATCH qemu v15 00/17] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (5 preceding siblings ...)
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 06/17] spapr_iommu: Finish renaming vfio_accel to need_vfio Alexey Kardashevskiy
@ 2016-04-04  9:33 ` Alexey Kardashevskiy
  2016-04-05  5:58   ` David Gibson
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 08/17] spapr_iommu: Add root memory region Alexey Kardashevskiy
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-04-04  9:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

The source guest could have reallocated the default TCE table and may
therefore migrate a bigger or smaller table. This adds reallocation in
post_load() if the default table size differs between source and destination.

This adds @bus_offset, @page_shift, @enabled to the migration stream.
These cannot change without dynamic DMA windows so no change in
behavior is expected now.
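
For illustration only (not part of the patch), the post_load() flow and
the subsection predicate can be modelled in standalone C. The struct
tce_model type and the model_* functions are hypothetical stand-ins for
sPAPRTCETable and the QEMU helpers, and plain calloc()/free() replace
the real table allocators:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for sPAPRTCETable. */
struct tce_model {
    bool enabled;
    uint64_t bus_offset;
    uint32_t page_shift;
    uint32_t nb_table;      /* entries in the live table */
    uint64_t *table;        /* live table */
    uint32_t mig_nb_table;  /* entry count received from the source */
    uint64_t *mig_table;    /* staging buffer filled from the stream */
};

/* Same condition as spapr_tce_table_ex_needed(): the extra state is
 * only worth sending when the window is not the default one
 * (bus offset 0, 4K pages). */
static bool model_ex_needed(const struct tce_model *t)
{
    return t->bus_offset || t->page_shift != 0xC;
}

/* Resize the live table when the source used a different size, copy
 * the migrated entries over, then drop the staging buffer.
 * Returns 0 on success, -1 if the new table cannot be allocated. */
static int model_post_load(struct tce_model *t)
{
    if (t->enabled) {
        if (t->nb_table != t->mig_nb_table) {
            uint64_t *resized = NULL;

            if (t->mig_nb_table) {
                resized = calloc(t->mig_nb_table, sizeof(*resized));
                if (!resized) {
                    return -1;
                }
            }
            free(t->table);
            t->table = resized;
            t->nb_table = t->mig_nb_table;
        }
        if (t->nb_table) {
            memcpy(t->table, t->mig_table,
                   t->nb_table * sizeof(t->table[0]));
        }
    } else {
        /* The source had no table, so the destination drops its own. */
        free(t->table);
        t->table = NULL;
        t->nb_table = 0;
    }
    free(t->mig_table);
    t->mig_table = NULL;
    return 0;
}
```

The staging pair (mig_nb_table, mig_table) is what lets the stream carry
a size that differs from the destination's current nb_table; the live
table is only touched once the whole section has been received.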

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v15:
* squashed "migrate full state" into this
* added missing tcet->mig_nb_table initialization in spapr_tce_table_pre_save()
* instead of bumping the version, moved extra parameters to subsection

v14:
* new to the series
---
 hw/ppc/spapr_iommu.c   | 67 ++++++++++++++++++++++++++++++++++++++++++++++++--
 include/hw/ppc/spapr.h |  2 ++
 trace-events           |  2 ++
 3 files changed, 69 insertions(+), 2 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 9bcd3f6..52b1e0d 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -137,33 +137,96 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
     return ret;
 }
 
+static void spapr_tce_table_pre_save(void *opaque)
+{
+    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
+
+    tcet->mig_table = tcet->table;
+    tcet->mig_nb_table = tcet->nb_table;
+
+    trace_spapr_iommu_pre_save(tcet->liobn, tcet->mig_nb_table,
+                               tcet->bus_offset, tcet->page_shift);
+}
+
+static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
+static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
+
 static int spapr_tce_table_post_load(void *opaque, int version_id)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
+    uint32_t old_nb_table = tcet->nb_table;
 
     if (tcet->vdev) {
         spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
     }
 
+    if (tcet->enabled) {
+        if (tcet->nb_table != tcet->mig_nb_table) {
+            if (tcet->nb_table) {
+                spapr_tce_table_do_disable(tcet);
+            }
+            tcet->nb_table = tcet->mig_nb_table;
+            spapr_tce_table_do_enable(tcet);
+        }
+
+        memcpy(tcet->table, tcet->mig_table,
+               tcet->nb_table * sizeof(tcet->table[0]));
+
+        g_free(tcet->mig_table);
+        tcet->mig_table = NULL;
+    } else if (tcet->table) {
+        /* Destination guest has a default table but source does not -> free */
+        spapr_tce_table_do_disable(tcet);
+    }
+
+    trace_spapr_iommu_post_load(tcet->liobn, old_nb_table, tcet->nb_table,
+                                tcet->bus_offset, tcet->page_shift);
+
     return 0;
 }
 
+static bool spapr_tce_table_ex_needed(void *opaque)
+{
+    sPAPRTCETable *tcet = opaque;
+
+    return tcet->bus_offset || tcet->page_shift != 0xC;
+}
+
+static const VMStateDescription vmstate_spapr_tce_table_ex = {
+    .name = "spapr_iommu_ex",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .needed = spapr_tce_table_ex_needed,
+    .fields = (VMStateField[]) {
+        VMSTATE_BOOL(enabled, sPAPRTCETable),
+        VMSTATE_UINT64(bus_offset, sPAPRTCETable),
+        VMSTATE_UINT32(page_shift, sPAPRTCETable),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
 static const VMStateDescription vmstate_spapr_tce_table = {
     .name = "spapr_iommu",
     .version_id = 2,
     .minimum_version_id = 2,
+    .pre_save = spapr_tce_table_pre_save,
     .post_load = spapr_tce_table_post_load,
     .fields      = (VMStateField []) {
         /* Sanity check */
         VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
-        VMSTATE_UINT32_EQUAL(nb_table, sPAPRTCETable),
 
         /* IOMMU state */
+        VMSTATE_UINT32(mig_nb_table, sPAPRTCETable),
         VMSTATE_BOOL(bypass, sPAPRTCETable),
-        VMSTATE_VARRAY_UINT32(table, sPAPRTCETable, nb_table, 0, vmstate_info_uint64, uint64_t),
+        VMSTATE_VARRAY_UINT32_ALLOC(mig_table, sPAPRTCETable, mig_nb_table, 0,
+                                    vmstate_info_uint64, uint64_t),
 
         VMSTATE_END_OF_LIST()
     },
+    .subsections = (const VMStateDescription*[]) {
+        &vmstate_spapr_tce_table_ex,
+        NULL
+    }
 };
 
 static MemoryRegionIOMMUOps spapr_iommu_ops = {
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 75b0b55..c1ea49c 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -545,6 +545,8 @@ struct sPAPRTCETable {
     uint64_t bus_offset;
     uint32_t page_shift;
     uint64_t *table;
+    uint32_t mig_nb_table;
+    uint64_t *mig_table;
     bool bypass;
     bool need_vfio;
     int fd;
diff --git a/trace-events b/trace-events
index 62dcbba..4335b9b 100644
--- a/trace-events
+++ b/trace-events
@@ -1431,6 +1431,8 @@ spapr_iommu_pci_indirect(uint64_t liobn, uint64_t ioba, uint64_t tce, uint64_t i
 spapr_iommu_pci_stuff(uint64_t liobn, uint64_t ioba, uint64_t tce_value, uint64_t npages, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcevalue=0x%"PRIx64" npages=%"PRId64" ret=%"PRId64
 spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, unsigned pgsize) "liobn=%"PRIx64" 0x%"PRIx64" -> 0x%"PRIx64" perm=%u mask=%x"
 spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
+spapr_iommu_pre_save(uint64_t liobn, uint32_t nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
+spapr_iommu_post_load(uint64_t liobn, uint32_t pre_nb, uint32_t post_nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" => %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
 
 # hw/ppc/ppc.c
 ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
-- 
2.5.0.rc3

* [Qemu-devel] [PATCH qemu v15 08/17] spapr_iommu: Add root memory region
  2016-04-04  9:33 [Qemu-devel] [PATCH qemu v15 00/17] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (6 preceding siblings ...)
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 07/17] spapr_iommu: Migrate full state Alexey Kardashevskiy
@ 2016-04-04  9:33 ` Alexey Kardashevskiy
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 09/17] spapr_pci: Reset DMA config on PHB reset Alexey Kardashevskiy
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-04-04  9:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

We are going to have multiple DMA windows at different offsets on
a PCI bus. For the sake of migration, we will pre-create as many TCE
table objects as there are windows supported.
So we need a way to map windows dynamically onto a PCI bus when
migration of a table completes. However, at that stage a TCE table
object has no access to a PHB, so it cannot ask the PHB to map a DMA
window backed by the just-migrated TCE table.

This adds a "root" memory region (UINT64_MAX long) to the TCE object.
This new region is mapped on a PCI bus with overlapping enabled, as
there will be one root MR per TCE table, each of them mapped at 0.
The actual IOMMU memory region is a subregion of the root region.
A TCE table enables/disables this subregion and maps it at
the specific offset inside the root MR, which is a 1:1 mapping of
the PCI address space.
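
As a rough standalone model (not QEMU code), the effect of placing the
IOMMU subregion at bus_offset inside a root region that maps the bus 1:1
can be written as an address lookup. The struct window_model type and
window_lookup() are hypothetical names used only for this sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one TCE table: a root region that covers the
 * whole bus 1:1 and holds at most one window (the IOMMU subregion). */
struct window_model {
    bool enabled;         /* subregion currently added to the root */
    uint64_t bus_offset;  /* where the window starts on the PCI bus */
    uint64_t size;        /* nb_table << page_shift */
};

/* Returns true and stores the offset inside the window when pci_addr
 * falls into an enabled window; returns false otherwise. */
static bool window_lookup(const struct window_model *w,
                          uint64_t pci_addr, uint64_t *offset)
{
    if (!w->enabled || pci_addr < w->bus_offset) {
        return false;
    }
    if (pci_addr - w->bus_offset >= w->size) {
        return false;
    }
    *offset = pci_addr - w->bus_offset;
    return true;
}
```

Because the root is always mapped at 0, moving or disabling a window only
changes the subregion inside the root; the PHB-side mapping stays as it is.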

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Thomas Huth <thuth@redhat.com>
---
 hw/ppc/spapr_iommu.c   | 13 ++++++++++---
 hw/ppc/spapr_pci.c     |  6 +++---
 include/hw/ppc/spapr.h |  2 +-
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 52b1e0d..740836f 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -236,11 +236,16 @@ static MemoryRegionIOMMUOps spapr_iommu_ops = {
 static int spapr_tce_table_realize(DeviceState *dev)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
+    Object *tcetobj = OBJECT(tcet);
+    char tmp[32];
 
     tcet->fd = -1;
     tcet->need_vfio = false;
-    memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
-                             "iommu-spapr", 0);
+    snprintf(tmp, sizeof(tmp), "tce-root-%x", tcet->liobn);
+    memory_region_init(&tcet->root, tcetobj, tmp, UINT64_MAX);
+
+    snprintf(tmp, sizeof(tmp), "tce-iommu-%x", tcet->liobn);
+    memory_region_init_iommu(&tcet->iommu, tcetobj, &spapr_iommu_ops, tmp, 0);
 
     QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
 
@@ -318,6 +323,7 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
 
     memory_region_set_size(&tcet->iommu,
                            (uint64_t)tcet->nb_table << tcet->page_shift);
+    memory_region_add_subregion(&tcet->root, tcet->bus_offset, &tcet->iommu);
 
     tcet->enabled = true;
 }
@@ -340,6 +346,7 @@ void spapr_tce_table_enable(sPAPRTCETable *tcet,
 
 static void spapr_tce_table_do_disable(sPAPRTCETable *tcet)
 {
+    memory_region_del_subregion(&tcet->root, &tcet->iommu);
     memory_region_set_size(&tcet->iommu, 0);
 
     spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
@@ -371,7 +378,7 @@ static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
 {
-    return &tcet->iommu;
+    return &tcet->root;
 }
 
 static void spapr_tce_reset(DeviceState *dev)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 46f205b..bc1d549 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1468,13 +1468,13 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         return;
     }
 
+    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
+                                        spapr_tce_get_iommu(tcet), 0);
+
     /* Register default 32bit DMA window */
     spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
                            nb_table);
 
-    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
-                                spapr_tce_get_iommu(tcet));
-
     sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
 }
 
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index c1ea49c..e9cdfe3 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -550,7 +550,7 @@ struct sPAPRTCETable {
     bool bypass;
     bool need_vfio;
     int fd;
-    MemoryRegion iommu;
+    MemoryRegion root, iommu;
     struct VIOsPAPRDevice *vdev; /* for @bypass migration compatibility only */
     QLIST_ENTRY(sPAPRTCETable) list;
 };
-- 
2.5.0.rc3

* [Qemu-devel] [PATCH qemu v15 09/17] spapr_pci: Reset DMA config on PHB reset
  2016-04-04  9:33 [Qemu-devel] [PATCH qemu v15 00/17] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (7 preceding siblings ...)
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 08/17] spapr_iommu: Add root memory region Alexey Kardashevskiy
@ 2016-04-04  9:33 ` Alexey Kardashevskiy
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 10/17] memory: Add reporting of supported page sizes Alexey Kardashevskiy
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-04-04  9:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

LoPAPR dictates that during system reset all DMA windows must be removed
and the default DMA32 window must be created; this patch does so.

At the moment there is just one window supported so no change in
behaviour is expected.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr_iommu.c   |  2 +-
 hw/ppc/spapr_pci.c     | 17 +++++++++++------
 include/hw/ppc/spapr.h |  1 +
 3 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 740836f..5ce2f5e 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -358,7 +358,7 @@ static void spapr_tce_table_do_disable(sPAPRTCETable *tcet)
     tcet->nb_table = 0;
 }
 
-static void spapr_tce_table_disable(sPAPRTCETable *tcet)
+void spapr_tce_table_disable(sPAPRTCETable *tcet)
 {
     if (!tcet->enabled) {
         error_report("Warning: trying to disable already disabled TCE table");
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index bc1d549..f55efd7 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1308,7 +1308,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     PCIBus *bus;
     uint64_t msi_window_size = 4096;
     sPAPRTCETable *tcet;
-    uint32_t nb_table;
 
     if (sphb->index != (uint32_t)-1) {
         hwaddr windows_base;
@@ -1460,7 +1459,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         }
     }
 
-    nb_table = sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT;
     tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
     if (!tcet) {
         error_setg(errp, "Unable to create TCE table for %s",
@@ -1471,10 +1469,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
                                         spapr_tce_get_iommu(tcet), 0);
 
-    /* Register default 32bit DMA window */
-    spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
-                           nb_table);
-
     sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
 }
 
@@ -1491,6 +1485,17 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
 
 static void spapr_phb_reset(DeviceState *qdev)
 {
+    sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
+    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
+
+    if (tcet && tcet->enabled) {
+        spapr_tce_table_disable(tcet);
+    }
+
+    /* Register default 32bit DMA window */
+    spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
+                           sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT);
+
     /* Reset the IOMMU state */
     object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL);
 
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index e9cdfe3..471eb4a 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -573,6 +573,7 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn);
 void spapr_tce_table_enable(sPAPRTCETable *tcet,
                             uint32_t page_shift, uint64_t bus_offset,
                             uint32_t nb_table);
+void spapr_tce_table_disable(sPAPRTCETable *tcet);
 void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio);
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet);
-- 
2.5.0.rc3

* [Qemu-devel] [PATCH qemu v15 10/17] memory: Add reporting of supported page sizes
  2016-04-04  9:33 [Qemu-devel] [PATCH qemu v15 00/17] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (8 preceding siblings ...)
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 09/17] spapr_pci: Reset DMA config on PHB reset Alexey Kardashevskiy
@ 2016-04-04  9:33 ` Alexey Kardashevskiy
  2016-04-06  5:52   ` David Gibson
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 11/17] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-04-04  9:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

Every IOMMU has some granularity which MemoryRegionIOMMUOps::translate
uses when translating; however, this information is not available outside
the translate context for various checks.

This adds a get_page_sizes callback to MemoryRegionIOMMUOps and
a wrapper for it so IOMMU users (such as VFIO) can know the actual
page size(s) used by an IOMMU.

As an IOMMU MR represents a guest IOMMU, this uses TARGET_PAGE_SIZE
as the fallback.

This removes vfio_container_granularity() and uses the new callback in
memory_region_iommu_replay() when replaying IOMMU mappings on a newly
added IOMMU memory region.
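
To illustrate how a page size bitmap turns into a replay granularity,
here is a standalone sketch (not QEMU code). It assumes bit N of the mask
means a page size of 2^N bytes is supported; min_page_size() and
replay_steps() are hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Smallest supported page size in the bitmap. The mask must be
 * non-zero. Isolating the lowest set bit gives the same value as
 * (uint64_t)1 << ctz64(mask). */
static uint64_t min_page_size(uint64_t page_size_mask)
{
    assert(page_size_mask != 0);
    return page_size_mask & (~page_size_mask + 1);
}

/* Number of translate() calls a replay loop makes when it walks a
 * window of window_size bytes in steps of the minimum page size. */
static uint64_t replay_steps(uint64_t window_size, uint64_t page_size_mask)
{
    uint64_t granularity = min_page_size(page_size_mask);

    return window_size / granularity + (window_size % granularity != 0);
}
```

For a 1GB window with 4K pages this gives 0x40000 steps; if only 64K and
16M pages are supported, the replay walks in 64K steps instead.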

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v15:
* s/qemu_real_host_page_size/TARGET_PAGE_SIZE/ in memory_region_iommu_get_page_sizes

v14:
* removed vfio_container_granularity(), changed memory_region_iommu_replay()

v4:
* s/1<<TARGET_PAGE_BITS/qemu_real_host_page_size/
---
 hw/ppc/spapr_iommu.c  |  8 ++++++++
 hw/vfio/common.c      |  6 ------
 include/exec/memory.h | 18 ++++++++++++++----
 memory.c              | 17 ++++++++++++++---
 4 files changed, 36 insertions(+), 13 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 5ce2f5e..c945dba 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -148,6 +148,13 @@ static void spapr_tce_table_pre_save(void *opaque)
                                tcet->bus_offset, tcet->page_shift);
 }
 
+static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
+{
+    sPAPRTCETable *tcet = container_of(iommu, sPAPRTCETable, iommu);
+
+    return 1ULL << tcet->page_shift;
+}
+
 static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
 static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
 
@@ -231,6 +238,7 @@ static const VMStateDescription vmstate_spapr_tce_table = {
 
 static MemoryRegionIOMMUOps spapr_iommu_ops = {
     .translate = spapr_tce_translate_iommu,
+    .get_page_sizes = spapr_tce_get_page_sizes,
 };
 
 static int spapr_tce_table_realize(DeviceState *dev)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 23dd738..6bec419 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -319,11 +319,6 @@ out:
     rcu_read_unlock();
 }
 
-static hwaddr vfio_container_granularity(VFIOContainer *container)
-{
-    return (hwaddr)1 << ctz64(container->iova_pgsizes);
-}
-
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
@@ -391,7 +386,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
 
         memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
         memory_region_iommu_replay(giommu->iommu, &giommu->n,
-                                   vfio_container_granularity(container),
                                    false);
 
         return;
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 2de7898..eb5ce67 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -150,6 +150,8 @@ typedef struct MemoryRegionIOMMUOps MemoryRegionIOMMUOps;
 struct MemoryRegionIOMMUOps {
     /* Return a TLB entry that contains a given address. */
     IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
+    /* Returns supported page sizes */
+    uint64_t (*get_page_sizes)(MemoryRegion *iommu);
 };
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
@@ -573,6 +575,15 @@ static inline bool memory_region_is_iommu(MemoryRegion *mr)
 
 
 /**
+ * memory_region_iommu_get_page_sizes: get supported page sizes in an iommu
+ *
+ * Returns %bitmap of supported page sizes for an iommu.
+ *
+ * @mr: the memory region being queried
+ */
+uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr);
+
+/**
  * memory_region_notify_iommu: notify a change in an IOMMU translation entry.
  *
  * @mr: the memory region that was changed
@@ -596,16 +607,15 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n);
 
 /**
  * memory_region_iommu_replay: replay existing IOMMU translations to
- * a notifier
+ * a notifier with the minimum page granularity returned by
+ * mr->iommu_ops->get_page_sizes().
  *
  * @mr: the memory region to observe
  * @n: the notifier to which to replay iommu mappings
- * @granularity: Minimum page granularity to replay notifications for
  * @is_write: Whether to treat the replay as a translate "write"
  *     through the iommu
  */
-void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n,
-                                hwaddr granularity, bool is_write);
+void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write);
 
 /**
  * memory_region_unregister_iommu_notifier: unregister a notifier for
diff --git a/memory.c b/memory.c
index 95f7209..c37dbc9 100644
--- a/memory.c
+++ b/memory.c
@@ -1512,12 +1512,14 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n)
     notifier_list_add(&mr->iommu_notify, n);
 }
 
-void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n,
-                                hwaddr granularity, bool is_write)
+void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write)
 {
-    hwaddr addr;
+    hwaddr addr, granularity;
     IOMMUTLBEntry iotlb;
 
+    g_assert(mr->iommu_ops && mr->iommu_ops->get_page_sizes);
+    granularity = (hwaddr)1 << ctz64(mr->iommu_ops->get_page_sizes(mr));
+
     for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
         iotlb = mr->iommu_ops->translate(mr, addr, is_write);
         if (iotlb.perm != IOMMU_NONE) {
@@ -1544,6 +1546,15 @@ void memory_region_notify_iommu(MemoryRegion *mr,
     notifier_list_notify(&mr->iommu_notify, &entry);
 }
 
+uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr)
+{
+    assert(memory_region_is_iommu(mr));
+    if (mr->iommu_ops && mr->iommu_ops->get_page_sizes) {
+        return mr->iommu_ops->get_page_sizes(mr);
+    }
+    return TARGET_PAGE_SIZE;
+}
+
 void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client)
 {
     uint8_t mask = 1 << client;
-- 
2.5.0.rc3

* [Qemu-devel] [PATCH qemu v15 11/17] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2016-04-04  9:33 [Qemu-devel] [PATCH qemu v15 00/17] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (9 preceding siblings ...)
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 10/17] memory: Add reporting of supported page sizes Alexey Kardashevskiy
@ 2016-04-04  9:33 ` Alexey Kardashevskiy
  2016-04-06  6:05   ` David Gibson
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 12/17] spapr_pci: Add and export DMA resetting helper Alexey Kardashevskiy
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-04-04  9:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

This makes use of the new "memory registering" feature. The idea is
to give userspace the ability to notify the host kernel about pages
which are going to be used for DMA. Having this information, the host
kernel can pin them all once per user process, do locked pages
accounting (once) and not spend time doing that in real time, with
possible failures which cannot be handled nicely in some cases.

This adds a prereg memory listener which listens on address_space_memory
and notifies a VFIO container about memory which needs to be
pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
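
The address arithmetic the listener relies on can be sketched in
standalone C (a model only, not the QEMU implementation). The struct
section_model type is a hypothetical flattened view of a RAM
MemoryRegionSection, and ram_base stands in for the pointer returned by
memory_region_get_ram_ptr():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flattened view of a RAM-backed memory section. */
struct section_model {
    uint64_t ram_base;                     /* host address of the RAM block */
    uint64_t offset_within_region;         /* section start inside the block */
    uint64_t offset_within_address_space;  /* guest physical start address */
    uint64_t size;                         /* section length in bytes */
};

/* Guest physical address -> userspace address, following the same
 * formula as the vfio_prereg_gpa_to_ua() helper in this patch. */
static uint64_t model_gpa_to_ua(const struct section_model *s, uint64_t gpa)
{
    return s->ram_base + s->offset_within_region +
           (gpa - s->offset_within_address_space);
}

/* A section qualifies for registration only if its start (in both
 * views) and its size are multiples of the host page size, which is
 * assumed to be a power of two. */
static bool model_section_aligned(const struct section_model *s,
                                  uint64_t host_page_size)
{
    uint64_t low_bits = host_page_size - 1;

    return !((s->offset_within_address_space |
              s->offset_within_region |
              s->size) & low_bits);
}
```

The alignment check corresponds to the "banned unaligned sections" change
in v15: a section that fails it is reported and skipped rather than
rounded to page boundaries.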

As there is no per-IOMMU-type release() callback anymore, this stores
the IOMMU type in the container so vfio_listener_release() can decide
whether it needs to unregister @prereg_listener.

The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
not call it when v2 is detected and enabled.

This does not change the guest visible interface.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v15:
* banned unaligned sections
* added a vfio_prereg_gpa_to_ua() helper

v14:
* s/free_container_exit/listener_release_exit/g
* added "if memory_region_is_iommu()" to vfio_prereg_listener_skipped_section
---
 hw/vfio/Makefile.objs         |   1 +
 hw/vfio/common.c              |  38 +++++++++---
 hw/vfio/prereg.c              | 138 ++++++++++++++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-common.h |   4 ++
 trace-events                  |   2 +
 5 files changed, 173 insertions(+), 10 deletions(-)
 create mode 100644 hw/vfio/prereg.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index ceddbb8..5800e0e 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
 obj-$(CONFIG_SOFTMMU) += platform.o
 obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
 obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
+obj-$(CONFIG_SOFTMMU) += prereg.o
 endif
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 6bec419..3e9c579 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -493,6 +493,9 @@ static const MemoryListener vfio_memory_listener = {
 static void vfio_listener_release(VFIOContainer *container)
 {
     memory_listener_unregister(&container->listener);
+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+        memory_listener_unregister(&container->prereg_listener);
+    }
 }
 
 int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
@@ -800,8 +803,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             goto free_container_exit;
         }
 
-        ret = ioctl(fd, VFIO_SET_IOMMU,
-                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
+        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
+        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
         if (ret) {
             error_report("vfio: failed to set iommu for container: %m");
             ret = -errno;
@@ -826,8 +829,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
         if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
             container->iova_pgsizes = info.iova_pgsizes;
         }
-    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
+    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
+               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
         struct vfio_iommu_spapr_tce_info info;
+        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
 
         ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
         if (ret) {
@@ -835,7 +840,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             ret = -errno;
             goto free_container_exit;
         }
-        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
+        container->iommu_type =
+            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
+        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
         if (ret) {
             error_report("vfio: failed to set iommu for container: %m");
             ret = -errno;
@@ -847,11 +854,22 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
          * when container fd is closed so we do not call it explicitly
          * in this file.
          */
-        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
-        if (ret) {
-            error_report("vfio: failed to enable container: %m");
-            ret = -errno;
-            goto free_container_exit;
+        if (!v2) {
+            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
+            if (ret) {
+                error_report("vfio: failed to enable container: %m");
+                ret = -errno;
+                goto free_container_exit;
+            }
+        } else {
+            container->prereg_listener = vfio_prereg_listener;
+
+            memory_listener_register(&container->prereg_listener,
+                                     &address_space_memory);
+            if (container->error) {
+                error_report("vfio: RAM memory listener initialization failed for container");
+                goto listener_release_exit;
+            }
         }
 
         /*
@@ -864,7 +882,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
         if (ret) {
             error_report("vfio: VFIO_IOMMU_SPAPR_TCE_GET_INFO failed: %m");
             ret = -errno;
-            goto free_container_exit;
+            goto listener_release_exit;
         }
         container->min_iova = info.dma32_window_start;
         container->max_iova = container->min_iova + info.dma32_window_size - 1;
diff --git a/hw/vfio/prereg.c b/hw/vfio/prereg.c
new file mode 100644
index 0000000..5f7fa30
--- /dev/null
+++ b/hw/vfio/prereg.c
@@ -0,0 +1,138 @@
+/*
+ * DMA memory preregistration
+ *
+ * Authors:
+ *  Alexey Kardashevskiy <aik@ozlabs.ru>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+#include <linux/vfio.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "qemu/error-report.h"
+#include "trace.h"
+
+static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
+{
+    if (memory_region_is_iommu(section->mr)) {
+        error_report("Cannot possibly preregister IOMMU memory");
+        return true;
+    }
+
+    return !memory_region_is_ram(section->mr) ||
+            memory_region_is_skip_dump(section->mr);
+}
+
+static void *vfio_prereg_gpa_to_ua(MemoryRegionSection *section, hwaddr gpa)
+{
+    return memory_region_get_ram_ptr(section->mr) +
+        section->offset_within_region +
+        (gpa - section->offset_within_address_space);
+}
+
+static void vfio_prereg_listener_region_add(MemoryListener *listener,
+                                            MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            prereg_listener);
+    const hwaddr gpa = section->offset_within_address_space;
+    Int128 llend;
+    int ret;
+    hwaddr page_mask = qemu_real_host_page_mask;
+    struct vfio_iommu_spapr_register_memory reg = {
+        .argsz = sizeof(reg),
+        .flags = 0,
+    };
+
+    if (vfio_prereg_listener_skipped_section(section)) {
+        trace_vfio_listener_region_add_skip(
+                section->offset_within_address_space,
+                section->offset_within_address_space +
+                int128_get64(int128_sub(section->size, int128_one())));
+        return;
+    }
+
+    if (unlikely((section->offset_within_address_space & ~page_mask) ||
+                 (section->offset_within_region & ~page_mask) ||
+                 (int128_get64(section->size) & ~page_mask))) {
+        error_report("%s received unaligned region", __func__);
+        return;
+    }
+
+    llend = int128_make64(section->offset_within_address_space);
+    llend = int128_add(llend, section->size);
+
+    g_assert(!int128_ge(int128_make64(gpa), llend));
+
+    memory_region_ref(section->mr);
+
+    reg.vaddr = (__u64) vfio_prereg_gpa_to_ua(section, gpa);
+    reg.size = int128_get64(llend) - gpa;
+
+    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
+    trace_vfio_ram_register(reg.vaddr, reg.size, ret ? -errno : 0);
+    if (ret) {
+        /*
+         * On the initfn path, store the first error in the container so we
+         * can gracefully fail.  Runtime, there's not much we can do other
+         * than throw a hardware error.
+         */
+        if (!container->initialized) {
+            if (!container->error) {
+                container->error = ret;
+            }
+        } else {
+            hw_error("vfio: Memory registering failed, unable to continue");
+        }
+    }
+}
+
+static void vfio_prereg_listener_region_del(MemoryListener *listener,
+                                            MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            prereg_listener);
+    const hwaddr gpa = section->offset_within_address_space;
+    hwaddr end;
+    int ret;
+    hwaddr page_mask = qemu_real_host_page_mask;
+    struct vfio_iommu_spapr_register_memory reg = {
+        .argsz = sizeof(reg),
+        .flags = 0,
+    };
+
+    if (vfio_prereg_listener_skipped_section(section)) {
+        trace_vfio_listener_region_del_skip(
+                section->offset_within_address_space,
+                section->offset_within_address_space +
+                int128_get64(int128_sub(section->size, int128_one())));
+        return;
+    }
+
+    if (unlikely((section->offset_within_address_space & ~page_mask) ||
+                 (section->offset_within_region & ~page_mask) ||
+                 (int128_get64(section->size) & ~page_mask))) {
+        error_report("%s received unaligned region", __func__);
+        return;
+    }
+
+    end = section->offset_within_address_space + int128_get64(section->size);
+    if (gpa >= end) {
+        return;
+    }
+
+    reg.vaddr = (__u64) vfio_prereg_gpa_to_ua(section, gpa);
+    reg.size = end - gpa;
+
+    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
+    trace_vfio_ram_unregister(reg.vaddr, reg.size, ret ? -errno : 0);
+}
+
+const MemoryListener vfio_prereg_listener = {
+    .region_add = vfio_prereg_listener_region_add,
+    .region_del = vfio_prereg_listener_region_del,
+};
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index c9b6622..c72e45a 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -73,6 +73,8 @@ typedef struct VFIOContainer {
     VFIOAddressSpace *space;
     int fd; /* /dev/vfio/vfio, empowered by the attached groups */
     MemoryListener listener;
+    MemoryListener prereg_listener;
+    unsigned iommu_type;
     int error;
     bool initialized;
     /*
@@ -156,4 +158,6 @@ extern QLIST_HEAD(vfio_as_head, VFIOAddressSpace) vfio_address_spaces;
 int vfio_get_region_info(VFIODevice *vbasedev, int index,
                          struct vfio_region_info **info);
 #endif
+extern const MemoryListener vfio_prereg_listener;
+
 #endif /* !HW_VFIO_VFIO_COMMON_H */
diff --git a/trace-events b/trace-events
index 4335b9b..23ca0b9 100644
--- a/trace-events
+++ b/trace-events
@@ -1736,6 +1736,8 @@ vfio_region_mmap(const char *name, unsigned long offset, unsigned long end) "Reg
 vfio_region_exit(const char *name, int index) "Device %s, region %d"
 vfio_region_finalize(const char *name, int index) "Device %s, region %d"
 vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
+vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
+vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
 
 # hw/vfio/platform.c
 vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
-- 
2.5.0.rc3

* [Qemu-devel] [PATCH qemu v15 12/17] spapr_pci: Add and export DMA resetting helper
  2016-04-04  9:33 [Qemu-devel] [PATCH qemu v15 00/17] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (10 preceding siblings ...)
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 11/17] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
@ 2016-04-04  9:33 ` Alexey Kardashevskiy
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 13/17] vfio: Add host side DMA window capabilities Alexey Kardashevskiy
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-04-04  9:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

This will be later used by the "ibm,reset-pe-dma-window" RTAS handler
which resets the DMA configuration to the defaults.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr_pci.c          | 10 ++++++++--
 include/hw/pci-host/spapr.h |  2 ++
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index f55efd7..5497a18 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1483,9 +1483,8 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
     return 0;
 }
 
-static void spapr_phb_reset(DeviceState *qdev)
+void spapr_phb_dma_reset(sPAPRPHBState *sphb)
 {
-    sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
     sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
 
     if (tcet && tcet->enabled) {
@@ -1495,6 +1494,13 @@ static void spapr_phb_reset(DeviceState *qdev)
     /* Register default 32bit DMA window */
     spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
                            sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT);
+}
+
+static void spapr_phb_reset(DeviceState *qdev)
+{
+    sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
+
+    spapr_phb_dma_reset(sphb);
 
     /* Reset the IOMMU state */
     object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL);
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 03ee006..7848366 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -147,4 +147,6 @@ static inline void spapr_phb_vfio_reset(DeviceState *qdev)
 }
 #endif
 
+void spapr_phb_dma_reset(sPAPRPHBState *sphb);
+
 #endif /* __HW_SPAPR_PCI_H__ */
-- 
2.5.0.rc3

* [Qemu-devel] [PATCH qemu v15 13/17] vfio: Add host side DMA window capabilities
  2016-04-04  9:33 [Qemu-devel] [PATCH qemu v15 00/17] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (11 preceding siblings ...)
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 12/17] spapr_pci: Add and export DMA resetting helper Alexey Kardashevskiy
@ 2016-04-04  9:33 ` Alexey Kardashevskiy
  2016-04-06  7:10   ` David Gibson
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 14/17] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO Alexey Kardashevskiy
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-04-04  9:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

There are going to be multiple IOMMUs per container. This moves
the single host IOMMU parameter set to a list of VFIOHostDMAWindow.

This should cause no behavioral change and will be used later by
the SPAPR TCE IOMMU v2 which will also add a vfio_host_win_del() helper.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v15:
* s/vfio_host_iommu_add/vfio_host_win_add/
* s/VFIOHostIOMMU/VFIOHostDMAWindow/
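
For readers following the thread, the window-list idea described above can be
sketched in standalone C. This is only an illustration under stated
assumptions: HostWin, host_win_lookup(), host_win_overlaps() and
host_win_add() are hypothetical names, and a plain singly linked list stands
in for QEMU's QLIST macros. It is not the code from the patch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for VFIOHostDMAWindow (not the QEMU type). */
typedef struct HostWin {
    uint64_t min_iova, max_iova;   /* inclusive bounds of the window */
    uint64_t iova_pgsizes;         /* bitmask of supported IOVA page sizes */
    struct HostWin *next;
} HostWin;

/* Return the window that fully contains [min_iova, max_iova], or NULL. */
static HostWin *host_win_lookup(HostWin *head, uint64_t min_iova,
                                uint64_t max_iova)
{
    for (HostWin *w = head; w; w = w->next) {
        if (w->min_iova <= min_iova && max_iova <= w->max_iova) {
            return w;
        }
    }
    return NULL;
}

/* Two inclusive ranges overlap unless one ends before the other starts. */
static bool host_win_overlaps(const HostWin *w, uint64_t min_iova,
                              uint64_t max_iova)
{
    return !(max_iova < w->min_iova || w->max_iova < min_iova);
}

/* Add a window; returns 0 on success, -1 on overlap or allocation failure. */
static int host_win_add(HostWin **head, uint64_t min_iova, uint64_t max_iova,
                        uint64_t iova_pgsizes)
{
    for (HostWin *w = *head; w; w = w->next) {
        if (host_win_overlaps(w, min_iova, max_iova)) {
            return -1;
        }
    }

    HostWin *w = calloc(1, sizeof(*w));
    if (!w) {
        return -1;
    }
    w->min_iova = min_iova;
    w->max_iova = max_iova;
    w->iova_pgsizes = iova_pgsizes;
    w->next = *head;               /* insert at the head of the list */
    *head = w;
    return 0;
}
```

The sketch compares inclusive bounds directly instead of computing a length,
which sidesteps the wrap to 0 that `max - min + 1` would give for a window
covering the whole 64-bit space.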
---
 hw/vfio/common.c              | 65 +++++++++++++++++++++++++++++++++----------
 include/hw/vfio/vfio-common.h |  9 ++++--
 2 files changed, 57 insertions(+), 17 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 3e9c579..ea79311 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -29,6 +29,7 @@
 #include "exec/memory.h"
 #include "hw/hw.h"
 #include "qemu/error-report.h"
+#include "qemu/range.h"
 #include "sysemu/kvm.h"
 #include "trace.h"
 
@@ -239,6 +240,45 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
     return -errno;
 }
 
+static VFIOHostDMAWindow *vfio_host_win_lookup(VFIOContainer *container,
+                                               hwaddr min_iova, hwaddr max_iova)
+{
+    VFIOHostDMAWindow *hostwin;
+
+    QLIST_FOREACH(hostwin, &container->hiommu_list, hiommu_next) {
+        if (hostwin->min_iova <= min_iova && max_iova <= hostwin->max_iova) {
+            return hostwin;
+        }
+    }
+
+    return NULL;
+}
+
+static int vfio_host_win_add(VFIOContainer *container,
+                             hwaddr min_iova, hwaddr max_iova,
+                             uint64_t iova_pgsizes)
+{
+    VFIOHostDMAWindow *hostwin;
+
+    QLIST_FOREACH(hostwin, &container->hiommu_list, hiommu_next) {
+        if (ranges_overlap(min_iova, max_iova - min_iova + 1,
+                           hostwin->min_iova,
+                           hostwin->max_iova - hostwin->min_iova + 1)) {
+            error_report("%s: Overlapped IOMMU are not enabled", __func__);
+            return -1;
+        }
+    }
+
+    hostwin = g_malloc0(sizeof(*hostwin));
+
+    hostwin->min_iova = min_iova;
+    hostwin->max_iova = max_iova;
+    hostwin->iova_pgsizes = iova_pgsizes;
+    QLIST_INSERT_HEAD(&container->hiommu_list, hostwin, hiommu_next);
+
+    return 0;
+}
+
 static bool vfio_listener_skipped_section(MemoryRegionSection *section)
 {
     return (!memory_region_is_ram(section->mr) &&
@@ -352,7 +392,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
     }
     end = int128_get64(llend);
 
-    if ((iova < container->min_iova) || ((end - 1) > container->max_iova)) {
+    if (!vfio_host_win_lookup(container, iova, end - 1)) {
         error_report("vfio: IOMMU container %p can't map guest IOVA region"
                      " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
                      container, iova, end - 1);
@@ -367,10 +407,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
 
         trace_vfio_listener_region_add_iommu(iova, end - 1);
         /*
-         * FIXME: We should do some checking to see if the
-         * capabilities of the host VFIO IOMMU are adequate to model
-         * the guest IOMMU
-         *
          * FIXME: For VFIO iommu types which have KVM acceleration to
          * avoid bouncing all map/unmaps through qemu this way, this
          * would be the right place to wire that up (tell the KVM
@@ -818,16 +854,14 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
          * existing Type1 IOMMUs generally support any IOVA we're
          * going to actually try in practice.
          */
-        container->min_iova = 0;
-        container->max_iova = (hwaddr)-1;
-
-        /* Assume just 4K IOVA page size */
-        container->iova_pgsizes = 0x1000;
         info.argsz = sizeof(info);
         ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
         /* Ignore errors */
         if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
-            container->iova_pgsizes = info.iova_pgsizes;
+            vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes);
+        } else {
+            /* Assume just 4K IOVA page size */
+            vfio_host_win_add(container, 0, (hwaddr)-1, 0x1000);
         }
     } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
                ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
@@ -884,11 +918,12 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             ret = -errno;
             goto listener_release_exit;
         }
-        container->min_iova = info.dma32_window_start;
-        container->max_iova = container->min_iova + info.dma32_window_size - 1;
 
-        /* Assume just 4K IOVA pages for now */
-        container->iova_pgsizes = 0x1000;
+        /* The default table uses 4K pages */
+        vfio_host_win_add(container, info.dma32_window_start,
+                          info.dma32_window_start +
+                          info.dma32_window_size - 1,
+                          0x1000);
     } else {
         error_report("vfio: No available IOMMU models");
         ret = -EINVAL;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index c72e45a..8028bb8 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -82,9 +82,8 @@ typedef struct VFIOContainer {
      * contiguous IOVA window.  We may need to generalize that in
      * future
      */
-    hwaddr min_iova, max_iova;
-    uint64_t iova_pgsizes;
     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
+    QLIST_HEAD(, VFIOHostDMAWindow) hiommu_list;
     QLIST_HEAD(, VFIOGroup) group_list;
     QLIST_ENTRY(VFIOContainer) next;
 } VFIOContainer;
@@ -97,6 +96,12 @@ typedef struct VFIOGuestIOMMU {
     QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
 } VFIOGuestIOMMU;
 
+typedef struct VFIOHostDMAWindow {
+    hwaddr min_iova, max_iova;
+    uint64_t iova_pgsizes;
+    QLIST_ENTRY(VFIOHostDMAWindow) hiommu_next;
+} VFIOHostDMAWindow;
+
 typedef struct VFIODeviceOps VFIODeviceOps;
 
 typedef struct VFIODevice {
-- 
2.5.0.rc3

* [Qemu-devel] [PATCH qemu v15 14/17] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO
  2016-04-04  9:33 [Qemu-devel] [PATCH qemu v15 00/17] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (12 preceding siblings ...)
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 13/17] vfio: Add host side DMA window capabilities Alexey Kardashevskiy
@ 2016-04-04  9:33 ` Alexey Kardashevskiy
  2016-04-07  0:40   ` David Gibson
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 15/17] spapr_pci: Get rid of dma_loibn Alexey Kardashevskiy
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-04-04  9:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

The sPAPR TCE tables manage 2 copies when VFIO is using an IOMMU -
a guest view of the table and a hardware TCE table. If there is no VFIO
presence in the address space, then just the guest view is used; in
this case, it is allocated in KVM. However, since there is no
support yet for VFIO in KVM TCE hypercalls, when we start using VFIO,
we need to move the guest view from KVM to userspace, and we need
to do this for every IOMMU on a bus with VFIO devices.

This adds vfio_start/vfio_stop callbacks in MemoryRegionIOMMUOps to
notify the IOMMU about a changing environment so it can reallocate the
table to/from KVM or (when available) hook the IOMMU groups with the
logical bus (LIOBN) in KVM.

This removes explicit spapr_tce_set_need_vfio() call from PCI hotplug
path as the new callbacks do this better - they notify IOMMU at
the exact moment when the configuration is changed, and this also
includes the case of PCI hot unplug.

As there can be multiple containers attached to the same PHB/LIOBN,
this replaces the @need_vfio flag in sPAPRTCETable with the counter
of VFIO users.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v15:
* s/need_vfio/vfio_users/g
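
The switch from a boolean flag to a user counter can be illustrated with a
small standalone sketch. The names TceTable, tce_vfio_start() and
tce_vfio_stop(), as well as the in_userspace field, are hypothetical and only
model the behaviour described in the commit message; the real code moves an
actual table between KVM and userspace.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the relevant sPAPRTCETable state. */
typedef struct TceTable {
    int vfio_users;      /* number of VFIO containers using this table */
    bool in_userspace;   /* assumed flag: table lives in QEMU rather than KVM */
} TceTable;

/* Called when a VFIO container starts using the IOMMU. */
static void tce_vfio_start(TceTable *t)
{
    t->vfio_users++;
    if (t->vfio_users == 1) {
        /* First user: this is where the table would be moved out of KVM. */
        t->in_userspace = true;
    }
    /* Later users find the table already in userspace; nothing to do. */
}

/* Called when a VFIO container stops using the IOMMU. */
static void tce_vfio_stop(TceTable *t)
{
    assert(t->vfio_users > 0);
    t->vfio_users--;
    /*
     * Mirrors the FIXME in the patch: there is no transition back to
     * KVM-accelerated TCEs, so the table stays in userspace.
     */
}
```

With a counter, two containers attached to the same LIOBN can come and go in
any order without the second start or the first stop changing where the table
lives.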
---
 hw/ppc/spapr_iommu.c   | 30 ++++++++++++++++++++----------
 hw/ppc/spapr_pci.c     |  6 ------
 hw/vfio/common.c       |  9 +++++++++
 include/exec/memory.h  |  4 ++++
 include/hw/ppc/spapr.h |  2 +-
 5 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index c945dba..ea09414 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -155,6 +155,16 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
     return 1ULL << tcet->page_shift;
 }
 
+static void spapr_tce_vfio_start(MemoryRegion *iommu)
+{
+    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), true);
+}
+
+static void spapr_tce_vfio_stop(MemoryRegion *iommu)
+{
+    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), false);
+}
+
 static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
 static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
 
@@ -239,6 +249,8 @@ static const VMStateDescription vmstate_spapr_tce_table = {
 static MemoryRegionIOMMUOps spapr_iommu_ops = {
     .translate = spapr_tce_translate_iommu,
     .get_page_sizes = spapr_tce_get_page_sizes,
+    .vfio_start = spapr_tce_vfio_start,
+    .vfio_stop = spapr_tce_vfio_stop,
 };
 
 static int spapr_tce_table_realize(DeviceState *dev)
@@ -248,7 +260,7 @@ static int spapr_tce_table_realize(DeviceState *dev)
     char tmp[32];
 
     tcet->fd = -1;
-    tcet->need_vfio = false;
+    tcet->vfio_users = 0;
     snprintf(tmp, sizeof(tmp), "tce-root-%x", tcet->liobn);
     memory_region_init(&tcet->root, tcetobj, tmp, UINT64_MAX);
 
@@ -268,20 +280,18 @@ void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio)
     size_t table_size = tcet->nb_table * sizeof(uint64_t);
     void *newtable;
 
-    if (need_vfio == tcet->need_vfio) {
-        /* Nothing to do */
-        return;
-    }
+    tcet->vfio_users += need_vfio ? 1 : -1;
+    g_assert(tcet->vfio_users >= 0);
+    g_assert(tcet->table);
 
-    if (!need_vfio) {
+    if (!tcet->vfio_users) {
         /* FIXME: We don't support transition back to KVM accelerated
          * TCEs yet */
         return;
     }
 
-    tcet->need_vfio = true;
-
-    if (tcet->fd < 0) {
+    if (tcet->vfio_users > 1) {
+        g_assert(tcet->fd < 0);
         /* Table is already in userspace, nothing to be do */
         return;
     }
@@ -327,7 +337,7 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
                                         tcet->page_shift,
                                         tcet->nb_table,
                                         &tcet->fd,
-                                        tcet->need_vfio);
+                                        tcet->vfio_users != 0);
 
     memory_region_set_size(&tcet->iommu,
                            (uint64_t)tcet->nb_table << tcet->page_shift);
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 5497a18..f864fde 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1083,12 +1083,6 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
     void *fdt = NULL;
     int fdt_start_offset = 0, fdt_size;
 
-    if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
-        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
-
-        spapr_tce_set_need_vfio(tcet, true);
-    }
-
     if (dev->hotplugged) {
         fdt = create_device_tree(&fdt_size);
         fdt_start_offset = spapr_create_pci_child_dt(phb, pdev, fdt, 0);
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index ea79311..5e5b77c 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -421,6 +421,9 @@ static void vfio_listener_region_add(MemoryListener *listener,
         QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
 
         memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
+        if (section->mr->iommu_ops && section->mr->iommu_ops->vfio_start) {
+            section->mr->iommu_ops->vfio_start(section->mr);
+        }
         memory_region_iommu_replay(giommu->iommu, &giommu->n,
                                    false);
 
@@ -466,6 +469,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
     hwaddr iova, end;
     int ret;
+    MemoryRegion *iommu = NULL;
 
     if (vfio_listener_skipped_section(section)) {
         trace_vfio_listener_region_del_skip(
@@ -487,6 +491,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
         QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
             if (giommu->iommu == section->mr) {
                 memory_region_unregister_iommu_notifier(&giommu->n);
+                iommu = giommu->iommu;
                 QLIST_REMOVE(giommu, giommu_next);
                 g_free(giommu);
                 break;
@@ -519,6 +524,10 @@ static void vfio_listener_region_del(MemoryListener *listener,
                      "0x%"HWADDR_PRIx") = %d (%m)",
                      container, iova, end - iova, ret);
     }
+
+    if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
+        iommu->iommu_ops->vfio_stop(section->mr);
+    }
 }
 
 static const MemoryListener vfio_memory_listener = {
diff --git a/include/exec/memory.h b/include/exec/memory.h
index eb5ce67..f1de133f 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -152,6 +152,10 @@ struct MemoryRegionIOMMUOps {
     IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
     /* Returns supported page sizes */
     uint64_t (*get_page_sizes)(MemoryRegion *iommu);
+    /* Called when VFIO starts using this */
+    void (*vfio_start)(MemoryRegion *iommu);
+    /* Called when VFIO stops using this */
+    void (*vfio_stop)(MemoryRegion *iommu);
 };
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 471eb4a..5c00e38 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -548,7 +548,7 @@ struct sPAPRTCETable {
     uint32_t mig_nb_table;
     uint64_t *mig_table;
     bool bypass;
-    bool need_vfio;
+    int vfio_users;
     int fd;
     MemoryRegion root, iommu;
     struct VIOsPAPRDevice *vdev; /* for @bypass migration compatibility only */
-- 
2.5.0.rc3

* [Qemu-devel] [PATCH qemu v15 15/17] spapr_pci: Get rid of dma_loibn
  2016-04-04  9:33 [Qemu-devel] [PATCH qemu v15 00/17] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (13 preceding siblings ...)
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 14/17] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO Alexey Kardashevskiy
@ 2016-04-04  9:33 ` Alexey Kardashevskiy
  2016-04-07  0:50   ` David Gibson
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 16/17] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU Alexey Kardashevskiy
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 17/17] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
  16 siblings, 1 reply; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-04-04  9:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

We are going to have 2 DMA windows whose LIOBNs are calculated from
the PHB index and the window number using the SPAPR_PCI_LIOBN macro,
so there is no actual use for dma_liobn.

This replaces uses of dma_liobn with SPAPR_PCI_LIOBN and marks the field
as unused in the migration stream. This also renames dma_liobn to
_dma_liobn, as we have to keep the property for CLI compatibility and we
need storage for it, although it has never really been used.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v15:
* new to the series
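
As a rough illustration of deriving a LIOBN from the PHB index and the window
number, here is a standalone sketch. The bit layout below is an assumption
about how SPAPR_PCI_LIOBN is defined and should be checked against
include/hw/ppc/spapr.h; the macro is renamed EXAMPLE_PCI_LIOBN and the helper
example_default_window_liobn() is hypothetical.

```c
#include <stdint.h>

/*
 * Assumed layout (not confirmed by this thread): a fixed PCI marker bit,
 * the PHB index shifted left by 8, and the window number in the low byte.
 */
#define EXAMPLE_PCI_LIOBN(phb_index, window_num) \
    ((uint32_t)(0x80000000u | ((uint32_t)(phb_index) << 8) | \
                (uint32_t)(window_num)))

/* LIOBN of the default (window 0) DMA window of a given PHB. */
static uint32_t example_default_window_liobn(uint32_t phb_index)
{
    return EXAMPLE_PCI_LIOBN(phb_index, 0);
}
```

Because the value is a pure function of the index and window number, nothing
needs to be stored per PHB, which is the point the commit message makes.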
---
 hw/ppc/spapr_pci.c          | 17 ++++++-----------
 include/hw/pci-host/spapr.h |  2 +-
 2 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index f864fde..d4bdb27 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1306,7 +1306,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     if (sphb->index != (uint32_t)-1) {
         hwaddr windows_base;
 
-        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn != (uint32_t)-1)
+        if ((sphb->buid != (uint64_t)-1)
             || (sphb->mem_win_addr != (hwaddr)-1)
             || (sphb->io_win_addr != (hwaddr)-1)) {
             error_setg(errp, "Either \"index\" or other parameters must"
@@ -1321,7 +1321,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         }
 
         sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
-        sphb->dma_liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
 
         windows_base = SPAPR_PCI_WINDOW_BASE
             + sphb->index * SPAPR_PCI_WINDOW_SPACING;
@@ -1334,11 +1333,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    if (sphb->dma_liobn == (uint32_t)-1) {
-        error_setg(errp, "LIOBN not specified for PHB");
-        return;
-    }
-
     if (sphb->mem_win_addr == (hwaddr)-1) {
         error_setg(errp, "Memory window address not specified for PHB");
         return;
@@ -1453,7 +1447,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         }
     }
 
-    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
+    tcet = spapr_tce_new_table(DEVICE(sphb), SPAPR_PCI_LIOBN(sphb->index, 0));
     if (!tcet) {
         error_setg(errp, "Unable to create TCE table for %s",
                    sphb->dtbusname);
@@ -1479,7 +1473,8 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
 
 void spapr_phb_dma_reset(sPAPRPHBState *sphb)
 {
-    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
+    uint32_t liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
+    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
 
     if (tcet && tcet->enabled) {
         spapr_tce_table_disable(tcet);
@@ -1507,7 +1502,7 @@ static void spapr_phb_reset(DeviceState *qdev)
 static Property spapr_phb_properties[] = {
     DEFINE_PROP_UINT32("index", sPAPRPHBState, index, -1),
     DEFINE_PROP_UINT64("buid", sPAPRPHBState, buid, -1),
-    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn, -1),
+    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, _dma_liobn, -1),
     DEFINE_PROP_UINT64("mem_win_addr", sPAPRPHBState, mem_win_addr, -1),
     DEFINE_PROP_UINT64("mem_win_size", sPAPRPHBState, mem_win_size,
                        SPAPR_PCI_MMIO_WIN_SIZE),
@@ -1595,7 +1590,7 @@ static const VMStateDescription vmstate_spapr_pci = {
     .post_load = spapr_pci_post_load,
     .fields = (VMStateField[]) {
         VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),
-        VMSTATE_UINT32_EQUAL(dma_liobn, sPAPRPHBState),
+        VMSTATE_UNUSED(4), /* former dma_liobn */
         VMSTATE_UINT64_EQUAL(mem_win_addr, sPAPRPHBState),
         VMSTATE_UINT64_EQUAL(mem_win_size, sPAPRPHBState),
         VMSTATE_UINT64_EQUAL(io_win_addr, sPAPRPHBState),
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 7848366..3fca1c3 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -56,7 +56,7 @@ struct sPAPRPHBState {
     hwaddr mem_win_addr, mem_win_size, io_win_addr, io_win_size;
     MemoryRegion memwindow, iowindow, msiwindow;
 
-    uint32_t dma_liobn;
+    uint32_t _dma_liobn;
     hwaddr dma_win_addr, dma_win_size;
     AddressSpace iommu_as;
     MemoryRegion iommu_root;
-- 
2.5.0.rc3

* [Qemu-devel] [PATCH qemu v15 16/17] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU
  2016-04-04  9:33 [Qemu-devel] [PATCH qemu v15 00/17] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (14 preceding siblings ...)
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 15/17] spapr_pci: Get rid of dma_loibn Alexey Kardashevskiy
@ 2016-04-04  9:33 ` Alexey Kardashevskiy
  2016-04-07  1:10   ` David Gibson
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 17/17] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
  16 siblings, 1 reply; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-04-04  9:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

The new VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
This adds the ability for VFIO common code to dynamically allocate/remove
DMA windows in the host kernel when a new VFIO container is added/removed.

This adds the VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
and adds the just created IOMMU to the host IOMMU list; the opposite
action is taken in vfio_listener_region_del.

When creating a new window, this uses a heuristic to decide on the number
of TCE table levels.

This should cause no guest visible change in behavior.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v14:
* new to the series

---
TODO:
* export levels to PHB
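
The level-selection heuristic can be sketched independently of the patch. The
sketch below follows the range table given in the code comment (0..64 = 1;
65..4096 = 2; 4097..262144 = 3; 262145.. = 4), reading those ranges as counts
of host pages occupied by the table, which is an assumption. The names
tce_table_pages(), tce_levels_for_pages() and TCE_MAX_LEVELS are hypothetical,
and the loop is a plain restatement of the table rather than the expression
used in the diff.

```c
#include <stdint.h>

#define TCE_MAX_LEVELS 4   /* cap taken from the last range in the comment */

/* Number of host pages needed to hold one 8-byte TCE per IOMMU page. */
static uint64_t tce_table_pages(uint64_t window_size, unsigned page_shift,
                                uint64_t host_page_size)
{
    uint64_t entries = window_size >> page_shift;

    return (entries * sizeof(uint64_t)) / host_page_size;
}

/*
 * Each extra level multiplies the reachable page count by 64, so keep
 * dividing by 64 (rounding up) until the remainder fits in one level.
 */
static unsigned tce_levels_for_pages(uint64_t pages)
{
    unsigned levels = 1;

    while (pages > 64 && levels < TCE_MAX_LEVELS) {
        pages = pages / 64 + (pages % 64 != 0);
        levels++;
    }
    return levels;
}
```

For example, a 2GB window with 4K IOMMU pages on a host with 64K pages needs
64 host pages of table, which this sketch maps to a single level.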
---
 hw/vfio/common.c | 115 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
 trace-events     |   2 +
 2 files changed, 107 insertions(+), 10 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 5e5b77c..57a51df 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -279,6 +279,14 @@ static int vfio_host_win_add(VFIOContainer *container,
     return 0;
 }
 
+static void vfio_host_win_del(VFIOContainer *container, hwaddr min_iova)
+{
+    VFIOHostDMAWindow *hostwin = vfio_host_win_lookup(container, min_iova, 1);
+
+    g_assert(hostwin);
+    QLIST_REMOVE(hostwin, hiommu_next);
+}
+
 static bool vfio_listener_skipped_section(MemoryRegionSection *section)
 {
     return (!memory_region_is_ram(section->mr) &&
@@ -392,6 +400,63 @@ static void vfio_listener_region_add(MemoryListener *listener,
     }
     end = int128_get64(llend);
 
+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+        unsigned entries, pages, pagesize = qemu_real_host_page_size;
+        struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
+
+        trace_vfio_listener_region_add_iommu(iova, end - 1);
+        if (section->mr->iommu_ops) {
+            pagesize = section->mr->iommu_ops->get_page_sizes(section->mr);
+        }
+        /*
+         * FIXME: For VFIO iommu types which have KVM acceleration to
+         * avoid bouncing all map/unmaps through qemu this way, this
+         * would be the right place to wire that up (tell the KVM
+         * device emulation the VFIO iommu handles to use).
+         */
+        create.window_size = int128_get64(section->size);
+        create.page_shift = ctz64(pagesize);
+        /*
+         * SPAPR host supports multilevel TCE tables, there is some
+         * heuristic to decide how many levels we want for our table:
+         * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
+         */
+        entries = create.window_size >> create.page_shift;
+        pages = (entries * sizeof(uint64_t)) / getpagesize();
+        create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;
+
+        if (vfio_host_win_lookup(container, create.start_addr,
+                                 create.start_addr + create.window_size - 1)) {
+            goto fail;
+        }
+
+        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
+        if (ret) {
+            error_report("Failed to create a window, ret = %d (%m)", ret);
+            goto fail;
+        }
+
+        if (create.start_addr != section->offset_within_address_space) {
+            struct vfio_iommu_spapr_tce_remove remove = {
+                .argsz = sizeof(remove),
+                .start_addr = create.start_addr
+            };
+            error_report("Host doesn't support DMA window at %"HWADDR_PRIx", must be %"PRIx64,
+                         section->offset_within_address_space,
+                         create.start_addr);
+            ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
+            ret = -EINVAL;
+            goto fail;
+        }
+        trace_vfio_spapr_create_window(create.page_shift,
+                                       create.window_size,
+                                       create.start_addr);
+
+        vfio_host_win_add(container, create.start_addr,
+                          create.start_addr + create.window_size - 1,
+                          1ULL << create.page_shift);
+    }
+
     if (!vfio_host_win_lookup(container, iova, end - 1)) {
         error_report("vfio: IOMMU container %p can't map guest IOVA region"
                      " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
@@ -525,6 +590,22 @@ static void vfio_listener_region_del(MemoryListener *listener,
                      container, iova, end - iova, ret);
     }
 
+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+        struct vfio_iommu_spapr_tce_remove remove = {
+            .argsz = sizeof(remove),
+            .start_addr = section->offset_within_address_space,
+        };
+        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
+        if (ret) {
+            error_report("Failed to remove window at %"PRIx64,
+                         remove.start_addr);
+        }
+
+        vfio_host_win_del(container, section->offset_within_address_space);
+
+        trace_vfio_spapr_remove_window(remove.start_addr);
+    }
+
     if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
         iommu->iommu_ops->vfio_stop(section->mr);
     }
@@ -915,11 +996,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             }
         }
 
-        /*
-         * This only considers the host IOMMU's 32-bit window.  At
-         * some point we need to add support for the optional 64-bit
-         * window and dynamic windows
-         */
         info.argsz = sizeof(info);
         ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
         if (ret) {
@@ -928,11 +1004,30 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             goto listener_release_exit;
         }
 
-        /* The default table uses 4K pages */
-        vfio_host_win_add(container, info.dma32_window_start,
-                          info.dma32_window_start +
-                          info.dma32_window_size - 1,
-                          0x1000);
+        if (v2) {
+            /*
+             * There is a default window in a just-created container.
+             * To make region_add/del simpler, we had better remove this
+             * window now and let those iommu_listener callbacks
+             * create/remove windows when needed.
+             */
+            struct vfio_iommu_spapr_tce_remove remove = {
+                .argsz = sizeof(remove),
+                .start_addr = info.dma32_window_start,
+            };
+            ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
+            if (ret) {
+                error_report("vfio: VFIO_IOMMU_SPAPR_TCE_REMOVE failed: %m");
+                ret = -errno;
+                goto free_container_exit;
+            }
+        } else {
+            /* The default table uses 4K pages */
+            vfio_host_win_add(container, info.dma32_window_start,
+                              info.dma32_window_start +
+                              info.dma32_window_size - 1,
+                              0x1000);
+        }
     } else {
         error_report("vfio: No available IOMMU models");
         ret = -EINVAL;
diff --git a/trace-events b/trace-events
index 23ca0b9..5c651fa 100644
--- a/trace-events
+++ b/trace-events
@@ -1738,6 +1738,8 @@ vfio_region_finalize(const char *name, int index) "Device %s, region %d"
 vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
 vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
 vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
+vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
+vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
 
 # hw/vfio/platform.c
 vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v15 17/17] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-04-04  9:33 [Qemu-devel] [PATCH qemu v15 00/17] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (15 preceding siblings ...)
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 16/17] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU Alexey Kardashevskiy
@ 2016-04-04  9:33 ` Alexey Kardashevskiy
  16 siblings, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-04-04  9:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	David Gibson

This adds support for the Dynamic DMA Windows (DDW) option defined by
the SPAPR specification, which allows having additional DMA window(s).

The "ddw" property is enabled by default on a PHB but for compatibility
the pseries-2.5 machine and older disable it. This also creates a single
DMA window for the older machines to maintain backward migration.

This implements DDW for PHB with emulated and VFIO devices. The host
kernel support is required. The advertised IOMMU page sizes are 4K and
64K; if QEMU is running with huge pages enabled, this also advertises
16M pages.
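
As a rough illustration (not part of the patch), the advertised page
sizes can be viewed as a mask with one bit per page shift, which
ibm,query-pe-dma-window then reports in the compact RTAS encoding. The
sketch below assumes the RTAS_DDW_PGSIZE_* values defined later in this
patch; the function name is hypothetical:

```c
#include <stdint.h>

/* Query-mask bits, copied from the include/hw/ppc/spapr.h hunk below. */
#define RTAS_DDW_PGSIZE_4K   0x01
#define RTAS_DDW_PGSIZE_64K  0x02
#define RTAS_DDW_PGSIZE_16M  0x04

/*
 * Hypothetical helper: translate a "one bit per page shift" mask
 * (bit 12 = 4K, bit 16 = 64K, bit 24 = 16M) into the RTAS query mask.
 */
static uint32_t page_mask_to_query_mask(uint64_t page_mask)
{
    uint32_t mask = 0;

    if (page_mask & (1ULL << 12)) {
        mask |= RTAS_DDW_PGSIZE_4K;
    }
    if (page_mask & (1ULL << 16)) {
        mask |= RTAS_DDW_PGSIZE_64K;
    }
    if (page_mask & (1ULL << 24)) {
        mask |= RTAS_DDW_PGSIZE_16M;
    }
    return mask;
}
```

With the default mask (4K | 64K) this yields 0x03; adding 16M when huge
pages are in use yields 0x07.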

Existing Linux guests try to create one additional huge DMA window
with 64K or 16MB pages and map the entire guest RAM into it. If this
succeeds, the guest switches to dma_direct_ops and never calls TCE
hypercalls (H_PUT_TCE,...) again. This enables VFIO devices to use the
entire RAM and not waste time on map/unmap later. This adds a
"dma64_win_addr" property, which is the bus address of the 64bit window
and is set by default to 0x800.0000.0000.0000, as this is what modern
POWER8 hardware uses and it allows having emulated and VFIO devices on
the same bus.
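
To make the window geometry concrete, here is a minimal sketch (not
part of the patch) of the arithmetic involved, assuming the guest
rounds its RAM size up to a power of two before requesting the window.
The helper names are hypothetical; only the default window address and
the "1 << (window_shift - page_shift)" TCE count come from this patch:

```c
#include <assert.h>
#include <stdint.h>

/* Default bus address of the 64bit window (dma64_win_addr default). */
#define DMA64_WIN_ADDR 0x0800000000000000ULL

/* Hypothetical helper: smallest shift such that (1 << shift) >= size. */
static unsigned ceil_log2_u64(uint64_t size)
{
    unsigned shift = 0;

    assert(size != 0);
    while (shift < 63 && (1ULL << shift) < size) {
        shift++;
    }
    return shift;
}

/* Number of TCEs needed to cover a window of 2^window_shift bytes. */
static uint64_t ddw_nb_tces(unsigned window_shift, unsigned page_shift)
{
    assert(window_shift >= page_shift && window_shift < 64);
    return 1ULL << (window_shift - page_shift);
}

/* Last bus address covered by a window placed at the default address. */
static uint64_t ddw_last_addr(unsigned window_shift)
{
    assert(window_shift < 59);
    return DMA64_WIN_ADDR + (1ULL << window_shift) - 1;
}
```

For a 4GB guest, window_shift is 32, so a window with 64K pages needs
65536 TCEs while one with 16MB pages needs only 256.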

This adds 4 RTAS handlers:
* ibm,query-pe-dma-window
* ibm,create-pe-dma-window
* ibm,remove-pe-dma-window
* ibm,reset-pe-dma-window
These are registered from a type_init() callback.

These RTAS handlers are implemented in a separate file to avoid polluting
spapr_iommu.c with PCI.
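
For readers unfamiliar with RTAS argument passing: all arguments and
return values are 32-bit cells, so the handlers below rebuild the
64-bit BUID from two cells and split the 64-bit bus offset of a new
window across two return cells. A minimal sketch of that packing, with
hypothetical helper names (the patch does this inline with
rtas_ld()/rtas_st()):

```c
#include <stdint.h>

/* Combine two 32-bit RTAS cells (high word first) into a 64-bit BUID. */
static uint64_t buid_from_cells(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Upper 32 bits of a 64-bit value, as stored in the first return cell. */
static uint32_t hi32(uint64_t v)
{
    return (uint32_t)(v >> 32);
}

/* Lower 32 bits of a 64-bit value, as stored in the second return cell. */
static uint32_t lo32(uint64_t v)
{
    return (uint32_t)(v & 0xffffffffULL);
}
```

For the default 64bit window address 0x0800000000000000, the two
return cells would be 0x08000000 and 0.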

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v15:
* moved page mask filtering to PHB realize(), use "-mem-path" to know
if there are huge pages
* fixed error reporting in RTAS handlers
* max window size now accounts for hotpluggable memory boundaries
---
 hw/ppc/Makefile.objs        |   1 +
 hw/ppc/spapr.c              |   7 +-
 hw/ppc/spapr_pci.c          |  60 +++++++--
 hw/ppc/spapr_rtas_ddw.c     | 292 ++++++++++++++++++++++++++++++++++++++++++++
 include/hw/pci-host/spapr.h |   6 +
 include/hw/ppc/spapr.h      |  16 ++-
 trace-events                |   4 +
 7 files changed, 372 insertions(+), 14 deletions(-)
 create mode 100644 hw/ppc/spapr_rtas_ddw.c

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index c1ffc77..986b36f 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
 ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
 obj-y += spapr_pci_vfio.o
 endif
+obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
 # PowerPC 4xx boards
 obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
 obj-y += ppc4xx_pci.o
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 79a70a9..180c488 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2362,7 +2362,12 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
  * pseries-2.5
  */
 #define SPAPR_COMPAT_2_5 \
-        HW_COMPAT_2_5
+        HW_COMPAT_2_5 \
+        {\
+            .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
+            .property = "ddw",\
+            .value    = stringify(off),\
+        },
 
 static void spapr_machine_2_5_instance_options(MachineState *machine)
 {
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index d4bdb27..2276ffe 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -32,6 +32,7 @@
 #include "hw/ppc/spapr.h"
 #include "hw/pci-host/spapr.h"
 #include "exec/address-spaces.h"
+#include "exec/ram_addr.h"
 #include <libfdt.h>
 #include "trace.h"
 #include "qemu/error-report.h"
@@ -41,6 +42,7 @@
 #include "hw/pci/pci_bus.h"
 #include "hw/ppc/spapr_drc.h"
 #include "sysemu/device_tree.h"
+#include "sysemu/hostmem.h"
 
 #include "hw/vfio/vfio.h"
 
@@ -1302,6 +1304,8 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     PCIBus *bus;
     uint64_t msi_window_size = 4096;
     sPAPRTCETable *tcet;
+    const unsigned windows_supported =
+        sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
 
     if (sphb->index != (uint32_t)-1) {
         hwaddr windows_base;
@@ -1447,15 +1451,19 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         }
     }
 
-    tcet = spapr_tce_new_table(DEVICE(sphb), SPAPR_PCI_LIOBN(sphb->index, 0));
-    if (!tcet) {
-        error_setg(errp, "Unable to create TCE table for %s",
-                   sphb->dtbusname);
-        return;
-    }
+    /* DMA setup */
 
-    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
-                                        spapr_tce_get_iommu(tcet), 0);
+    for (i = 0; i < windows_supported; ++i) {
+        tcet = spapr_tce_new_table(DEVICE(sphb),
+                                   SPAPR_PCI_LIOBN(sphb->index, i));
+        if (!tcet) {
+            error_setg(errp, "Creating window#%d failed for %s",
+                       i, sphb->dtbusname);
+            return;
+        }
+        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
+                                            spapr_tce_get_iommu(tcet), 0);
+    }
 
     sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
 }
@@ -1473,14 +1481,20 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
 
 void spapr_phb_dma_reset(sPAPRPHBState *sphb)
 {
-    uint32_t liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
-    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
+    int i;
+    sPAPRTCETable *tcet;
 
-    if (tcet && tcet->enabled) {
-        spapr_tce_table_disable(tcet);
+    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
+        uint32_t liobn = SPAPR_PCI_LIOBN(sphb->index, i);
+        tcet = spapr_tce_find_by_liobn(liobn);
+
+        if (tcet && tcet->enabled) {
+            spapr_tce_table_disable(tcet);
+        }
     }
 
     /* Register default 32bit DMA window */
+    tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, 0));
     spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
                            sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT);
 }
@@ -1514,6 +1528,11 @@ static Property spapr_phb_properties[] = {
     /* Default DMA window is 0..1GB */
     DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
     DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
+    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
+                       0x800000000000000ULL),
+    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
+    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
+                       (1ULL << 12) | (1ULL << 16)),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -1767,6 +1786,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     uint32_t interrupt_map_mask[] = {
         cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
     uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
+    uint32_t ddw_applicable[] = {
+        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
+        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
+        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
+    };
+    uint32_t ddw_extensions[] = {
+        cpu_to_be32(1),
+        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
+    };
     sPAPRTCETable *tcet;
     PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
     sPAPRFDT s_fdt;
@@ -1791,6 +1819,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
     _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
 
+    /* Dynamic DMA window */
+    if (phb->ddw_enabled) {
+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
+                         sizeof(ddw_applicable)));
+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
+                         &ddw_extensions, sizeof(ddw_extensions)));
+    }
+
     /* Build the interrupt-map, this must matches what is done
      * in pci_spapr_map_irq
      */
diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
new file mode 100644
index 0000000..b4e0686
--- /dev/null
+++ b/hw/ppc/spapr_rtas_ddw.c
@@ -0,0 +1,292 @@
+/*
+ * QEMU sPAPR Dynamic DMA windows support
+ *
+ * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License,
+ *  or (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/error-report.h"
+#include "hw/ppc/spapr.h"
+#include "hw/pci-host/spapr.h"
+#include "trace.h"
+
+static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
+{
+    sPAPRTCETable *tcet;
+
+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
+    if (tcet && tcet->enabled) {
+        ++*(unsigned *)opaque;
+    }
+    return 0;
+}
+
+static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
+{
+    unsigned ret = 0;
+
+    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
+
+    return ret;
+}
+
+static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
+{
+    sPAPRTCETable *tcet;
+
+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
+    if (tcet && !tcet->enabled) {
+        *(uint32_t *)opaque = tcet->liobn;
+        return 1;
+    }
+    return 0;
+}
+
+static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
+{
+    uint32_t liobn = 0;
+
+    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
+
+    return liobn;
+}
+
+static uint32_t spapr_page_mask_to_query_mask(uint64_t page_mask)
+{
+    int i;
+    uint32_t mask = 0;
+    const struct { int shift; uint32_t mask; } masks[] = {
+        { 12, RTAS_DDW_PGSIZE_4K },
+        { 16, RTAS_DDW_PGSIZE_64K },
+        { 24, RTAS_DDW_PGSIZE_16M },
+        { 25, RTAS_DDW_PGSIZE_32M },
+        { 26, RTAS_DDW_PGSIZE_64M },
+        { 27, RTAS_DDW_PGSIZE_128M },
+        { 28, RTAS_DDW_PGSIZE_256M },
+        { 34, RTAS_DDW_PGSIZE_16G },
+    };
+
+    for (i = 0; i < ARRAY_SIZE(masks); ++i) {
+        if (page_mask & (1ULL << masks[i].shift)) {
+            mask |= masks[i].mask;
+        }
+    }
+
+    return mask;
+}
+
+static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
+                                         sPAPRMachineState *spapr,
+                                         uint32_t token, uint32_t nargs,
+                                         target_ulong args,
+                                         uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    uint64_t buid, max_window_size;
+    uint32_t avail, addr, pgmask = 0;
+    MachineState *machine = MACHINE(spapr);
+
+    if ((nargs != 3) || (nret != 5)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    /* Translate page mask to LoPAPR format */
+    pgmask = spapr_page_mask_to_query_mask(sphb->page_size_mask);
+
+    /*
+     * This is "Largest contiguous block of TCEs allocated specifically
+     * for (that is, are reserved for) this PE".
+     * Return the maximum number, i.e. the maximum RAM size in 4K pages.
+     */
+    if (machine->ram_size == machine->maxram_size) {
+        max_window_size = machine->ram_size >> SPAPR_TCE_PAGE_SHIFT;
+    } else {
+        MemoryHotplugState *hpms = &spapr->hotplug_memory;
+        uint64_t ram_end = hpms->base + memory_region_size(&hpms->mr);
+        max_window_size = ram_end >> SPAPR_TCE_PAGE_SHIFT;
+    }
+
+    avail = SPAPR_PCI_DMA_MAX_WINDOWS - spapr_phb_get_active_win_num(sphb);
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    rtas_st(rets, 1, avail);
+    rtas_st(rets, 2, max_window_size);
+    rtas_st(rets, 3, pgmask);
+    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
+
+    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
+                                          sPAPRMachineState *spapr,
+                                          uint32_t token, uint32_t nargs,
+                                          target_ulong args,
+                                          uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    sPAPRTCETable *tcet = NULL;
+    uint32_t addr, page_shift, window_shift, liobn;
+    uint64_t buid;
+
+    if ((nargs != 5) || (nret != 4)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    page_shift = rtas_ld(args, 3);
+    window_shift = rtas_ld(args, 4);
+    liobn = spapr_phb_get_free_liobn(sphb);
+
+    if (!(sphb->page_size_mask & (1ULL << page_shift)) ||
+        (window_shift < page_shift)) {
+        goto param_error_exit;
+    }
+
+    if (!liobn || !sphb->ddw_enabled ||
+        spapr_phb_get_active_win_num(sphb) == SPAPR_PCI_DMA_MAX_WINDOWS) {
+        goto hw_error_exit;
+    }
+
+    tcet = spapr_tce_find_by_liobn(liobn);
+    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
+                                 1ULL << window_shift,
+                                 tcet ? tcet->bus_offset : 0xbaadf00d, liobn);
+    if (!tcet) {
+        goto hw_error_exit;
+    }
+
+    spapr_tce_table_enable(tcet, page_shift, sphb->dma64_window_addr,
+                           1ULL << (window_shift - page_shift));
+    if (!tcet->enabled) {
+        goto hw_error_exit;
+    }
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    rtas_st(rets, 1, liobn);
+    rtas_st(rets, 2, tcet->bus_offset >> 32);
+    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
+
+    return;
+
+hw_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
+                                          sPAPRMachineState *spapr,
+                                          uint32_t token, uint32_t nargs,
+                                          target_ulong args,
+                                          uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    sPAPRTCETable *tcet;
+    uint32_t liobn;
+
+    if ((nargs != 1) || (nret != 1)) {
+        goto param_error_exit;
+    }
+
+    liobn = rtas_ld(args, 0);
+    tcet = spapr_tce_find_by_liobn(liobn);
+    if (!tcet) {
+        goto param_error_exit;
+    }
+
+    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
+    if (!sphb || !sphb->ddw_enabled || !tcet->enabled) {
+        goto param_error_exit;
+    }
+
+    spapr_tce_table_disable(tcet);
+    trace_spapr_iommu_ddw_remove(liobn);
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
+                                         sPAPRMachineState *spapr,
+                                         uint32_t token, uint32_t nargs,
+                                         target_ulong args,
+                                         uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    uint64_t buid;
+    uint32_t addr;
+
+    if ((nargs != 3) || (nret != 1)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    spapr_phb_dma_reset(sphb);
+    trace_spapr_iommu_ddw_reset(buid, addr);
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void spapr_rtas_ddw_init(void)
+{
+    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
+                        "ibm,query-pe-dma-window",
+                        rtas_ibm_query_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
+                        "ibm,create-pe-dma-window",
+                        rtas_ibm_create_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
+                        "ibm,remove-pe-dma-window",
+                        rtas_ibm_remove_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
+                        "ibm,reset-pe-dma-window",
+                        rtas_ibm_reset_pe_dma_window);
+}
+
+type_init(spapr_rtas_ddw_init)
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 3fca1c3..171fa92 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -71,6 +71,10 @@ struct sPAPRPHBState {
     spapr_pci_msi_mig *msi_devs;
 
     QLIST_ENTRY(sPAPRPHBState) list;
+
+    bool ddw_enabled;
+    uint64_t page_size_mask;
+    uint64_t dma64_window_addr;
 };
 
 #define SPAPR_PCI_MAX_INDEX          255
@@ -89,6 +93,8 @@ struct sPAPRPHBState {
 
 #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
 
+#define SPAPR_PCI_DMA_MAX_WINDOWS    2
+
 static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb, int pin)
 {
     sPAPRMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 5c00e38..7002e23 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -417,6 +417,16 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
 #define RTAS_OUT_NOT_AUTHORIZED                 -9002
 #define RTAS_OUT_SYSPARM_PARAM_ERROR            -9999
 
+/* DDW pagesize mask values from ibm,query-pe-dma-window */
+#define RTAS_DDW_PGSIZE_4K       0x01
+#define RTAS_DDW_PGSIZE_64K      0x02
+#define RTAS_DDW_PGSIZE_16M      0x04
+#define RTAS_DDW_PGSIZE_32M      0x08
+#define RTAS_DDW_PGSIZE_64M      0x10
+#define RTAS_DDW_PGSIZE_128M     0x20
+#define RTAS_DDW_PGSIZE_256M     0x40
+#define RTAS_DDW_PGSIZE_16G      0x80
+
 /* RTAS tokens */
 #define RTAS_TOKEN_BASE      0x2000
 
@@ -458,8 +468,12 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
 #define RTAS_IBM_SET_SLOT_RESET                 (RTAS_TOKEN_BASE + 0x23)
 #define RTAS_IBM_CONFIGURE_PE                   (RTAS_TOKEN_BASE + 0x24)
 #define RTAS_IBM_SLOT_ERROR_DETAIL              (RTAS_TOKEN_BASE + 0x25)
+#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x26)
+#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x27)
+#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x28)
+#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x29)
 
-#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x26)
+#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x2A)
 
 /* RTAS ibm,get-system-parameter token values */
 #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
diff --git a/trace-events b/trace-events
index 5c651fa..5afcb2d 100644
--- a/trace-events
+++ b/trace-events
@@ -1433,6 +1433,10 @@ spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, un
 spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
 spapr_iommu_pre_save(uint64_t liobn, uint32_t nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
 spapr_iommu_post_load(uint64_t liobn, uint32_t pre_nb, uint32_t post_nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" => %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
+spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32
+spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, uint64_t pg_size, uint64_t req_size, uint64_t start, uint32_t liobn) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%"PRIx64", requested=0x%"PRIx64", start addr=%"PRIx64", liobn=%"PRIx32
+spapr_iommu_ddw_remove(uint32_t liobn) "liobn=%"PRIx32
+spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr) "buid=%"PRIx64" addr=%"PRIx32
 
 # hw/ppc/ppc.c
 ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
-- 
2.5.0.rc3


* Re: [Qemu-devel] [PATCH qemu v15 01/17] memory: Fix IOMMU replay base address
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 01/17] memory: Fix IOMMU replay base address Alexey Kardashevskiy
@ 2016-04-05  1:34   ` David Gibson
  0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2016-04-05  1:34 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On Mon, Apr 04, 2016 at 07:33:30PM +1000, Alexey Kardashevskiy wrote:
> Since a788f227 "memory: Allow replay of IOMMU mapping notifications"
> when new VFIO listener is added, all existing IOMMU mappings are
> replayed. However there is a problem that the base address of
> an IOMMU memory region (IOMMU MR) is ignored which is not a problem
> for the existing user (which is pseries) with its default 32bit DMA
> window starting at 0 but it is if there is another DMA window.
> 
> This stores the IOMMU's offset_within_address_space and adjusts
> the IOVA before calling vfio_dma_map/vfio_dma_unmap.
> 
> As the IOMMU notifier expects IOVA offset rather than the absolute
> address, this also adjusts IOVA in sPAPR H_PUT_TCE handler before
> calling notifier(s).
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
> Changes:
> v15:
> * accounted section->offset_within_region
> * s/giommu->offset_within_address_space/giommu->iommu_offset/
> ---
>  hw/ppc/spapr_iommu.c          |  2 +-
>  hw/vfio/common.c              | 14 ++++++++------
>  include/hw/vfio/vfio-common.h |  1 +
>  3 files changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 7dd4588..277f289 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -277,7 +277,7 @@ static target_ulong put_tce_emu(sPAPRTCETable *tcet, target_ulong ioba,
>      tcet->table[index] = tce;
>  
>      entry.target_as = &address_space_memory,
> -    entry.iova = ioba & page_mask;
> +    entry.iova = (ioba - tcet->bus_offset) & page_mask;
>      entry.translated_addr = tce & page_mask;
>      entry.addr_mask = ~page_mask;
>      entry.perm = spapr_tce_iommu_access_flags(tce);
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index fb588d8..27753d8 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -257,14 +257,14 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
>      VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>      VFIOContainer *container = giommu->container;
>      IOMMUTLBEntry *iotlb = data;
> +    hwaddr iova = iotlb->iova + giommu->iommu_offset;
>      MemoryRegion *mr;
>      hwaddr xlat;
>      hwaddr len = iotlb->addr_mask + 1;
>      void *vaddr;
>      int ret;
>  
> -    trace_vfio_iommu_map_notify(iotlb->iova,
> -                                iotlb->iova + iotlb->addr_mask);
> +    trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
>  
>      /*
>       * The IOMMU TLB entry we have just covers translation through
> @@ -291,21 +291,21 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
>  
>      if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
>          vaddr = memory_region_get_ram_ptr(mr) + xlat;
> -        ret = vfio_dma_map(container, iotlb->iova,
> +        ret = vfio_dma_map(container, iova,
>                             iotlb->addr_mask + 1, vaddr,
>                             !(iotlb->perm & IOMMU_WO) || mr->readonly);
>          if (ret) {
>              error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>                           "0x%"HWADDR_PRIx", %p) = %d (%m)",
> -                         container, iotlb->iova,
> +                         container, iova,
>                           iotlb->addr_mask + 1, vaddr, ret);
>          }
>      } else {
> -        ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
> +        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
>          if (ret) {
>              error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>                           "0x%"HWADDR_PRIx") = %d (%m)",
> -                         container, iotlb->iova,
> +                         container, iova,
>                           iotlb->addr_mask + 1, ret);
>          }
>      }
> @@ -377,6 +377,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
>           */
>          giommu = g_malloc0(sizeof(*giommu));
>          giommu->iommu = section->mr;
> +        giommu->iommu_offset = section->offset_within_address_space -
> +            section->offset_within_region;
>          giommu->container = container;
>          giommu->n.notify = vfio_iommu_map_notify;
>          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index eb0e1b0..c9b6622 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -90,6 +90,7 @@ typedef struct VFIOContainer {
>  typedef struct VFIOGuestIOMMU {
>      VFIOContainer *container;
>      MemoryRegion *iommu;
> +    hwaddr iommu_offset;
>      Notifier n;
>      QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>  } VFIOGuestIOMMU;

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v15 07/17] spapr_iommu: Migrate full state
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 07/17] spapr_iommu: Migrate full state Alexey Kardashevskiy
@ 2016-04-05  5:58   ` David Gibson
  0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2016-04-05  5:58 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On Mon, Apr 04, 2016 at 07:33:36PM +1000, Alexey Kardashevskiy wrote:
> The source guest could have reallocated the default TCE table and
> migrated a bigger/smaller table. This adds reallocation in post_load()
> if the default table size is different on source and destination.
> 
> This adds @bus_offset, @page_shift, @enabled to the migration stream.
> These cannot change without dynamic DMA windows so no change in
> behavior is expected now.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

The mig_table stuff is kind of ugly, but I don't know of any better
way to do it with our current migration infrastructure.
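
For readers following along, the staging approach can be sketched outside
QEMU roughly as below. This is only an illustration: struct tce_table and
tce_table_post_load() are hypothetical stand-ins, plain calloc()/free()
replace the real enable/disable helpers, and the migration stream is
assumed to have already filled mig_table and mig_nb_table.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the migrated parts of sPAPRTCETable. */
struct tce_table {
    uint32_t nb_table;      /* number of entries in the live table */
    uint64_t *table;        /* live table used by the device model */
    uint32_t mig_nb_table;  /* number of entries sent by the source */
    uint64_t *mig_table;    /* staging buffer filled from the stream */
};

/*
 * Resize the live table to the incoming size if needed, copy the staged
 * entries into it, then drop the staging buffer.  Returns 0 on success.
 */
static int tce_table_post_load(struct tce_table *t)
{
    assert(t->mig_table != NULL || t->mig_nb_table == 0);

    if (t->nb_table != t->mig_nb_table) {
        free(t->table);
        t->table = calloc(t->mig_nb_table, sizeof(t->table[0]));
        if (t->mig_nb_table && !t->table) {
            t->nb_table = 0;
            return -1;
        }
        t->nb_table = t->mig_nb_table;
    }

    if (t->nb_table) {
        memcpy(t->table, t->mig_table, t->nb_table * sizeof(t->table[0]));
    }

    free(t->mig_table);
    t->mig_table = NULL;
    return 0;
}
```

The separate mig_* pair exists because the destination cannot know the
final table size until the stream has been read, so the live table can
only be resized afterwards.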

> ---
> Changes:
> v15:
> * squashed "migrate full state" into this
> * added missing tcet->mig_nb_table initialization in spapr_tce_table_pre_save()
> * instead of bumping the version, moved extra parameters to subsection
> 
> v14:
> * new to the series
> ---
>  hw/ppc/spapr_iommu.c   | 67 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  include/hw/ppc/spapr.h |  2 ++
>  trace-events           |  2 ++
>  3 files changed, 69 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 9bcd3f6..52b1e0d 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -137,33 +137,96 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
>      return ret;
>  }
>  
> +static void spapr_tce_table_pre_save(void *opaque)
> +{
> +    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> +
> +    tcet->mig_table = tcet->table;
> +    tcet->mig_nb_table = tcet->nb_table;
> +
> +    trace_spapr_iommu_pre_save(tcet->liobn, tcet->mig_nb_table,
> +                               tcet->bus_offset, tcet->page_shift);
> +}
> +
> +static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
> +static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
> +
>  static int spapr_tce_table_post_load(void *opaque, int version_id)
>  {
>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> +    uint32_t old_nb_table = tcet->nb_table;
>  
>      if (tcet->vdev) {
>          spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
>      }
>  
> +    if (tcet->enabled) {
> +        if (tcet->nb_table != tcet->mig_nb_table) {
> +            if (tcet->nb_table) {
> +                spapr_tce_table_do_disable(tcet);
> +            }
> +            tcet->nb_table = tcet->mig_nb_table;
> +            spapr_tce_table_do_enable(tcet);
> +        }
> +
> +        memcpy(tcet->table, tcet->mig_table,
> +               tcet->nb_table * sizeof(tcet->table[0]));
> +
> +        free(tcet->mig_table);
> +        tcet->mig_table = NULL;
> +    } else if (tcet->table) {
> +        /* Destination guest has a default table but source does not -> free */
> +        spapr_tce_table_do_disable(tcet);
> +    }
> +
> +    trace_spapr_iommu_post_load(tcet->liobn, old_nb_table, tcet->nb_table,
> +                                tcet->bus_offset, tcet->page_shift);
> +
>      return 0;
>  }
>  
> +static bool spapr_tce_table_ex_needed(void *opaque)
> +{
> +    sPAPRTCETable *tcet = opaque;
> +
> +    return tcet->bus_offset || tcet->page_shift != 0xC;
> +}
> +
> +static const VMStateDescription vmstate_spapr_tce_table_ex = {
> +    .name = "spapr_iommu_ex",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .needed = spapr_tce_table_ex_needed,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_BOOL(enabled, sPAPRTCETable),
> +        VMSTATE_UINT64(bus_offset, sPAPRTCETable),
> +        VMSTATE_UINT32(page_shift, sPAPRTCETable),
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
>  static const VMStateDescription vmstate_spapr_tce_table = {
>      .name = "spapr_iommu",
>      .version_id = 2,
>      .minimum_version_id = 2,
> +    .pre_save = spapr_tce_table_pre_save,
>      .post_load = spapr_tce_table_post_load,
>      .fields      = (VMStateField []) {
>          /* Sanity check */
>          VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
> -        VMSTATE_UINT32_EQUAL(nb_table, sPAPRTCETable),
>  
>          /* IOMMU state */
> +        VMSTATE_UINT32(mig_nb_table, sPAPRTCETable),
>          VMSTATE_BOOL(bypass, sPAPRTCETable),
> -        VMSTATE_VARRAY_UINT32(table, sPAPRTCETable, nb_table, 0, vmstate_info_uint64, uint64_t),
> +        VMSTATE_VARRAY_UINT32_ALLOC(mig_table, sPAPRTCETable, mig_nb_table, 0,
> +                                    vmstate_info_uint64, uint64_t),
>  
>          VMSTATE_END_OF_LIST()
>      },
> +    .subsections = (const VMStateDescription*[]) {
> +        &vmstate_spapr_tce_table_ex,
> +        NULL
> +    }
>  };
>  
>  static MemoryRegionIOMMUOps spapr_iommu_ops = {
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 75b0b55..c1ea49c 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -545,6 +545,8 @@ struct sPAPRTCETable {
>      uint64_t bus_offset;
>      uint32_t page_shift;
>      uint64_t *table;
> +    uint32_t mig_nb_table;
> +    uint64_t *mig_table;
>      bool bypass;
>      bool need_vfio;
>      int fd;
> diff --git a/trace-events b/trace-events
> index 62dcbba..4335b9b 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1431,6 +1431,8 @@ spapr_iommu_pci_indirect(uint64_t liobn, uint64_t ioba, uint64_t tce, uint64_t i
>  spapr_iommu_pci_stuff(uint64_t liobn, uint64_t ioba, uint64_t tce_value, uint64_t npages, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcevalue=0x%"PRIx64" npages=%"PRId64" ret=%"PRId64
>  spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, unsigned pgsize) "liobn=%"PRIx64" 0x%"PRIx64" -> 0x%"PRIx64" perm=%u mask=%x"
>  spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
> +spapr_iommu_pre_save(uint64_t liobn, uint32_t nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
> +spapr_iommu_post_load(uint64_t liobn, uint32_t pre_nb, uint32_t post_nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" => %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
>  
>  # hw/ppc/ppc.c
>  ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"

* Re: [Qemu-devel] [PATCH qemu v15 10/17] memory: Add reporting of supported page sizes
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 10/17] memory: Add reporting of supported page sizes Alexey Kardashevskiy
@ 2016-04-06  5:52   ` David Gibson
  0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2016-04-06  5:52 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On Mon, Apr 04, 2016 at 07:33:39PM +1000, Alexey Kardashevskiy wrote:
> Every IOMMU has some granularity which MemoryRegionIOMMUOps::translate
> uses when translating, however this information is not available outside
> the translate context for various checks.
> 
> This adds a get_page_sizes callback to MemoryRegionIOMMUOps and
> a wrapper for it so IOMMU users (such as VFIO) can know the actual
> page size(s) used by an IOMMU.
> 
> As IOMMU MR represents a guest IOMMU, this uses TARGET_PAGE_SIZE
> as fallback.
> 
> This removes vfio_container_granularity() and uses new callback in
> memory_region_iommu_replay() when replaying IOMMU mappings on added
> IOMMU memory region.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

.. with the exception of one nit:

[snip]
> diff --git a/memory.c b/memory.c
> index 95f7209..c37dbc9 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1512,12 +1512,14 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n)
>      notifier_list_add(&mr->iommu_notify, n);
>  }
>  
> -void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n,
> -                                hwaddr granularity, bool is_write)
> +void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write)
>  {
> -    hwaddr addr;
> +    hwaddr addr, granularity;
>      IOMMUTLBEntry iotlb;
>  
> +    g_assert(mr->iommu_ops && mr->iommu_ops->get_page_sizes);
> +    granularity = (hwaddr)1 << ctz64(mr->iommu_ops->get_page_sizes(mr));

So here, replay requires that the get_page_sizes() callback be
populated.  However, if you move memory_region_iommu_get_page_sizes()
just above this, you can use that and have the replay fall back to
TARGET_PAGE_SIZE as well.
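
Roughly what I mean, as a standalone sketch rather than actual memory.c
code: struct iommu_ops, FALLBACK_PAGE_SIZE (standing in for
TARGET_PAGE_SIZE) and example_page_sizes() are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define FALLBACK_PAGE_SIZE 4096ULL  /* stand-in for TARGET_PAGE_SIZE */

/* Hypothetical, cut-down version of the IOMMU ops table. */
struct iommu_ops {
    uint64_t (*get_page_sizes)(void);  /* optional; may be NULL */
};

/* Example callback: an IOMMU supporting 64K and 16M pages. */
static uint64_t example_page_sizes(void)
{
    return (1ULL << 16) | (1ULL << 24);
}

/* Wrapper with fallback, mirroring the helper added by the patch. */
static uint64_t iommu_get_page_sizes(const struct iommu_ops *ops)
{
    if (ops && ops->get_page_sizes) {
        return ops->get_page_sizes();
    }
    return FALLBACK_PAGE_SIZE;
}

/* Replay granularity: the smallest page size present in the mask. */
static uint64_t replay_granularity(const struct iommu_ops *ops)
{
    uint64_t mask = iommu_get_page_sizes(ops);

    assert(mask != 0);
    return mask & (~mask + 1);  /* isolate the lowest set bit */
}
```

With the wrapper in place the replay path needs no assertion on the
callback; an IOMMU without get_page_sizes() simply replays at the
fallback granularity.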

>      for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
>          iotlb = mr->iommu_ops->translate(mr, addr, is_write);
>          if (iotlb.perm != IOMMU_NONE) {
> @@ -1544,6 +1546,15 @@ void memory_region_notify_iommu(MemoryRegion *mr,
>      notifier_list_notify(&mr->iommu_notify, &entry);
>  }
>  
> +uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr)
> +{
> +    assert(memory_region_is_iommu(mr));
> +    if (mr->iommu_ops && mr->iommu_ops->get_page_sizes) {
> +        return mr->iommu_ops->get_page_sizes(mr);
> +    }
> +    return TARGET_PAGE_SIZE;
> +}
> +
>  void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client)
>  {
>      uint8_t mask = 1 << client;

* Re: [Qemu-devel] [PATCH qemu v15 11/17] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 11/17] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
@ 2016-04-06  6:05   ` David Gibson
  2016-04-20  8:51     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 38+ messages in thread
From: David Gibson @ 2016-04-06  6:05 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On Mon, Apr 04, 2016 at 07:33:40PM +1000, Alexey Kardashevskiy wrote:
> This makes use of the new "memory registering" feature. The idea is
> to give userspace the ability to notify the host kernel about pages
> which are going to be used for DMA. Having this information, the host
> kernel can pin them all once per user process, do locked pages
> accounting (once) and not spend time on doing that in real time with
> possible failures which cannot be handled nicely in some cases.
> 
> This adds a prereg memory listener which listens on address_space_memory
> and notifies a VFIO container about memory which needs to be
> pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
> 
> As there is no per-IOMMU-type release() callback anymore, this stores
> the IOMMU type in the container so vfio_listener_release() can device

s/device/determine/ ?

> if it needs to unregister @prereg_listener.
> 
> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> not call it when v2 is detected and enabled.
> 
> This does not change the guest visible interface.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v15:
> * banned unaligned sections
> * added a vfio_prereg_gpa_to_ua() helper
> 
> v14:
> * s/free_container_exit/listener_release_exit/g
> * added "if memory_region_is_iommu()" to vfio_prereg_listener_skipped_section
> ---
>  hw/vfio/Makefile.objs         |   1 +
>  hw/vfio/common.c              |  38 +++++++++---
>  hw/vfio/prereg.c              | 138 ++++++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |   4 ++
>  trace-events                  |   2 +
>  5 files changed, 173 insertions(+), 10 deletions(-)
>  create mode 100644 hw/vfio/prereg.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> index ceddbb8..5800e0e 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
>  obj-$(CONFIG_SOFTMMU) += platform.o
>  obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
>  obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
> +obj-$(CONFIG_SOFTMMU) += prereg.o
>  endif
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 6bec419..3e9c579 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -493,6 +493,9 @@ static const MemoryListener vfio_memory_listener = {
>  static void vfio_listener_release(VFIOContainer *container)
>  {
>      memory_listener_unregister(&container->listener);
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +        memory_listener_unregister(&container->prereg_listener);
> +    }
>  }
>  
>  int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
> @@ -800,8 +803,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              goto free_container_exit;
>          }
>  
> -        ret = ioctl(fd, VFIO_SET_IOMMU,
> -                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
> +        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>          if (ret) {
>              error_report("vfio: failed to set iommu for container: %m");
>              ret = -errno;
> @@ -826,8 +829,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>          if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
>              container->iova_pgsizes = info.iova_pgsizes;
>          }
> -    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
> +               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
>          struct vfio_iommu_spapr_tce_info info;
> +        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
>  
>          ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>          if (ret) {
> @@ -835,7 +840,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              ret = -errno;
>              goto free_container_exit;
>          }
> -        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
> +        container->iommu_type =
> +            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>          if (ret) {
>              error_report("vfio: failed to set iommu for container: %m");
>              ret = -errno;
> @@ -847,11 +854,22 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>           * when container fd is closed so we do not call it explicitly
>           * in this file.
>           */
> -        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> -        if (ret) {
> -            error_report("vfio: failed to enable container: %m");
> -            ret = -errno;
> -            goto free_container_exit;
> +        if (!v2) {
> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> +            if (ret) {
> +                error_report("vfio: failed to enable container: %m");
> +                ret = -errno;
> +                goto free_container_exit;
> +            }
> +        } else {
> +            container->prereg_listener = vfio_prereg_listener;
> +
> +            memory_listener_register(&container->prereg_listener,
> +                                     &address_space_memory);
> +            if (container->error) {
> +                error_report("vfio: RAM memory listener initialization failed for container");
> +                goto listener_release_exit;
> +            }
>          }
>  
>          /*
> @@ -864,7 +882,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>          if (ret) {
>              error_report("vfio: VFIO_IOMMU_SPAPR_TCE_GET_INFO failed: %m");
>              ret = -errno;
> -            goto free_container_exit;
> +            goto listener_release_exit;
>          }
>          container->min_iova = info.dma32_window_start;
>          container->max_iova = container->min_iova + info.dma32_window_size - 1;
> diff --git a/hw/vfio/prereg.c b/hw/vfio/prereg.c
> new file mode 100644
> index 0000000..5f7fa30
> --- /dev/null
> +++ b/hw/vfio/prereg.c
> @@ -0,0 +1,138 @@
> +/*
> + * DMA memory preregistration
> + *
> + * Authors:
> + *  Alexey Kardashevskiy <aik@ozlabs.ru>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include <sys/ioctl.h>
> +#include <linux/vfio.h>
> +
> +#include "hw/vfio/vfio-common.h"
> +#include "qemu/error-report.h"
> +#include "trace.h"
> +
> +static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
> +{
> +    if (memory_region_is_iommu(section->mr)) {
> +        error_report("Cannot possibly preregister IOMMU memory");
> +        return true;
> +    }
> +
> +    return !memory_region_is_ram(section->mr) ||
> +            memory_region_is_skip_dump(section->mr);
> +}
> +
> +static void *vfio_prereg_gpa_to_ua(MemoryRegionSection *section, hwaddr gpa)
> +{
> +    return memory_region_get_ram_ptr(section->mr) +
> +        section->offset_within_region +
> +        (gpa - section->offset_within_address_space);
> +}
> +
> +static void vfio_prereg_listener_region_add(MemoryListener *listener,
> +                                            MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> +                                            prereg_listener);
> +    const hwaddr gpa = section->offset_within_address_space;
> +    Int128 llend;
> +    int ret;
> +    hwaddr page_mask = qemu_real_host_page_mask;
> +    struct vfio_iommu_spapr_register_memory reg = {
> +        .argsz = sizeof(reg),
> +        .flags = 0,
> +    };
> +
> +    if (vfio_prereg_listener_skipped_section(section)) {
> +        trace_vfio_listener_region_add_skip(
> +                section->offset_within_address_space,
> +                section->offset_within_address_space +
> +                int128_get64(int128_sub(section->size, int128_one())));
> +        return;
> +    }
> +
> +    if (unlikely((section->offset_within_address_space & ~page_mask) ||
> +                 (section->offset_within_region & ~page_mask) ||
> +                 (int128_get64(section->size) & ~page_mask))) {
> +        error_report("%s received unaligned region", __func__);
> +        return;
> +    }

You could round out the range to pre-register rather than just failing
here, but that can be changed later if we need it.
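
A minimal sketch of the rounding, assuming a power-of-two page size;
struct range and round_out() are hypothetical names, not existing QEMU
helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: a start address plus a length in bytes. */
struct range {
    uint64_t start;
    uint64_t size;
};

/*
 * Expand [start, start + size) outward to page boundaries.
 * page_size must be a power of two.  The result is assumed not to
 * cover the entire 64-bit space, since that length would not fit.
 */
static struct range round_out(uint64_t start, uint64_t size,
                              uint64_t page_size)
{
    uint64_t mask = page_size - 1;
    uint64_t first, last;
    struct range r;

    assert(page_size != 0 && (page_size & mask) == 0);
    assert(size > 0 && start <= UINT64_MAX - (size - 1));

    first = start & ~mask;             /* round the start down */
    last = (start + size - 1) | mask;  /* last byte of the last page */

    r.start = first;
    r.size = last - first + 1;
    return r;
}
```

Whether over-registering the extra bytes is acceptable depends on what
else shares those pages, so treat this as an option, not a request.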

> +    llend = int128_make64(section->offset_within_address_space);
> +    llend = int128_add(llend, section->size);
> +
> +    g_assert(!int128_ge(int128_make64(gpa), llend));
> +
> +    memory_region_ref(section->mr);
> +
> +    reg.vaddr = (__u64) vfio_prereg_gpa_to_ua(section, gpa);
> +    reg.size = int128_get64(llend) - gpa;
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
> +    trace_vfio_ram_register(reg.vaddr, reg.size, ret ? -errno : 0);
> +    if (ret) {
> +        /*
> +         * On the initfn path, store the first error in the container so we
> +         * can gracefully fail.  Runtime, there's not much we can do other
> +         * than throw a hardware error.
> +         */
> +        if (!container->initialized) {
> +            if (!container->error) {
> +                container->error = ret;
> +            }
> +        } else {
> +            hw_error("vfio: Memory registering failed, unable to continue");
> +        }
> +    }
> +}
> +
> +static void vfio_prereg_listener_region_del(MemoryListener *listener,
> +                                            MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> +                                            prereg_listener);
> +    const hwaddr gpa = section->offset_within_address_space;
> +    hwaddr end;
> +    int ret;
> +    hwaddr page_mask = qemu_real_host_page_mask;
> +    struct vfio_iommu_spapr_register_memory reg = {
> +        .argsz = sizeof(reg),
> +        .flags = 0,
> +    };
> +
> +    if (vfio_prereg_listener_skipped_section(section)) {
> +        trace_vfio_listener_region_del_skip(
> +                section->offset_within_address_space,
> +                section->offset_within_address_space +
> +                int128_get64(int128_sub(section->size, int128_one())));
> +        return;
> +    }
> +
> +    if (unlikely((section->offset_within_address_space & ~page_mask) ||
> +                 (section->offset_within_region & ~page_mask) ||
> +                 (int128_get64(section->size) & ~page_mask))) {
> +        error_report("%s received unaligned region", __func__);
> +        return;
> +    }
> +
> +    end = section->offset_within_address_space + int128_get64(section->size);

Why 64-bit math here, but 128 bit math in the region_add path?
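
To illustrate the concern with a standalone sketch (end64() is a
hypothetical helper, not code from the patch): an exclusive end computed
in 64 bits wraps to 0 for a section that ends exactly at 2^64, and the
"gpa >= end" check below would then return without unregistering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Exclusive end of [start, start + size) in 64-bit arithmetic.
 * *wrapped is set when the true end is 2^64 or beyond, in which case
 * the returned value is the end modulo 2^64.
 */
static uint64_t end64(uint64_t start, uint64_t size, bool *wrapped)
{
    uint64_t end = start + size;  /* unsigned addition wraps silently */

    *wrapped = end < start;
    return end;
}
```

The region_add path avoids this by keeping the end in an Int128 until it
has been checked.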

> +    if (gpa >= end) {
> +        return;
> +    }
> +
> +    reg.vaddr = (__u64) vfio_prereg_gpa_to_ua(section, gpa);
> +    reg.size = end - gpa;
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
> +    trace_vfio_ram_unregister(reg.vaddr, reg.size, ret ? -errno : 0);
> +}
> +
> +const MemoryListener vfio_prereg_listener = {
> +    .region_add = vfio_prereg_listener_region_add,
> +    .region_del = vfio_prereg_listener_region_del,
> +};
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index c9b6622..c72e45a 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -73,6 +73,8 @@ typedef struct VFIOContainer {
>      VFIOAddressSpace *space;
>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>      MemoryListener listener;
> +    MemoryListener prereg_listener;
> +    unsigned iommu_type;
>      int error;
>      bool initialized;
>      /*
> @@ -156,4 +158,6 @@ extern QLIST_HEAD(vfio_as_head, VFIOAddressSpace) vfio_address_spaces;
>  int vfio_get_region_info(VFIODevice *vbasedev, int index,
>                           struct vfio_region_info **info);
>  #endif
> +extern const MemoryListener vfio_prereg_listener;
> +
>  #endif /* !HW_VFIO_VFIO_COMMON_H */
> diff --git a/trace-events b/trace-events
> index 4335b9b..23ca0b9 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1736,6 +1736,8 @@ vfio_region_mmap(const char *name, unsigned long offset, unsigned long end) "Reg
>  vfio_region_exit(const char *name, int index) "Device %s, region %d"
>  vfio_region_finalize(const char *name, int index) "Device %s, region %d"
>  vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
> +vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> +vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>  
>  # hw/vfio/platform.c
>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"

* Re: [Qemu-devel] [PATCH qemu v15 13/17] vfio: Add host side DMA window capabilities
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 13/17] vfio: Add host side DMA window capabilities Alexey Kardashevskiy
@ 2016-04-06  7:10   ` David Gibson
  0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2016-04-06  7:10 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On Mon, Apr 04, 2016 at 07:33:42PM +1000, Alexey Kardashevskiy wrote:
> There are going to be multiple IOMMUs per container. This moves
> the single host IOMMU parameter set to a list of VFIOHostIOMMU.
                                                   ^^^
Haven't updated your commit message for the structure name change.

> This should cause no behavioral change and will be used later by
> the SPAPR TCE IOMMU v2 which will also add a vfio_host_iommu_del() helper.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Apart from that,

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
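
For reference, the two range checks the new helpers rely on (containment
for lookup, overlap for add) can be sketched on inclusive bounds as
follows. struct host_win, win_contains() and win_overlaps() are
illustrative names only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cut-down host window with inclusive IOVA bounds. */
struct host_win {
    uint64_t min_iova;
    uint64_t max_iova;
};

/* Lookup: the requested range must lie entirely inside the window. */
static bool win_contains(const struct host_win *w,
                         uint64_t min, uint64_t max)
{
    return w->min_iova <= min && max <= w->max_iova;
}

/* Add: reject a new window sharing at least one address with w. */
static bool win_overlaps(const struct host_win *w,
                         uint64_t min, uint64_t max)
{
    return w->min_iova <= max && min <= w->max_iova;
}
```

Working on inclusive bounds sidesteps computing a length, which would be
2^64 for the Type1 window [0, (hwaddr)-1] and so would not fit in 64 bits.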

> ---
> Changes:
> v15:
> * s/vfio_host_iommu_add/vfio_host_win_add/
> * s/VFIOHostIOMMU/VFIOHostDMAWindow/
> ---
>  hw/vfio/common.c              | 65 +++++++++++++++++++++++++++++++++----------
>  include/hw/vfio/vfio-common.h |  9 ++++--
>  2 files changed, 57 insertions(+), 17 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 3e9c579..ea79311 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -29,6 +29,7 @@
>  #include "exec/memory.h"
>  #include "hw/hw.h"
>  #include "qemu/error-report.h"
> +#include "qemu/range.h"
>  #include "sysemu/kvm.h"
>  #include "trace.h"
>  
> @@ -239,6 +240,45 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>      return -errno;
>  }
>  
> +static VFIOHostDMAWindow *vfio_host_win_lookup(VFIOContainer *container,
> +                                               hwaddr min_iova, hwaddr max_iova)
> +{
> +    VFIOHostDMAWindow *hostwin;
> +
> +    QLIST_FOREACH(hostwin, &container->hiommu_list, hiommu_next) {
> +        if (hostwin->min_iova <= min_iova && max_iova <= hostwin->max_iova) {
> +            return hostwin;
> +        }
> +    }
> +
> +    return NULL;
> +}
> +
> +static int vfio_host_win_add(VFIOContainer *container,
> +                             hwaddr min_iova, hwaddr max_iova,
> +                             uint64_t iova_pgsizes)
> +{
> +    VFIOHostDMAWindow *hostwin;
> +
> +    QLIST_FOREACH(hostwin, &container->hiommu_list, hiommu_next) {
> +        if (ranges_overlap(min_iova, max_iova - min_iova + 1,
> +                           hostwin->min_iova,
> +                           hostwin->max_iova - hostwin->min_iova + 1)) {
> +            error_report("%s: Overlapped IOMMU are not enabled", __func__);
> +            return -1;
> +        }
> +    }
> +
> +    hostwin = g_malloc0(sizeof(*hostwin));
> +
> +    hostwin->min_iova = min_iova;
> +    hostwin->max_iova = max_iova;
> +    hostwin->iova_pgsizes = iova_pgsizes;
> +    QLIST_INSERT_HEAD(&container->hiommu_list, hostwin, hiommu_next);
> +
> +    return 0;
> +}
> +
>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>  {
>      return (!memory_region_is_ram(section->mr) &&
> @@ -352,7 +392,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
>      }
>      end = int128_get64(llend);
>  
> -    if ((iova < container->min_iova) || ((end - 1) > container->max_iova)) {
> +    if (!vfio_host_win_lookup(container, iova, end - 1)) {
>          error_report("vfio: IOMMU container %p can't map guest IOVA region"
>                       " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
>                       container, iova, end - 1);
> @@ -367,10 +407,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
>  
>          trace_vfio_listener_region_add_iommu(iova, end - 1);
>          /*
> -         * FIXME: We should do some checking to see if the
> -         * capabilities of the host VFIO IOMMU are adequate to model
> -         * the guest IOMMU
> -         *
>           * FIXME: For VFIO iommu types which have KVM acceleration to
>           * avoid bouncing all map/unmaps through qemu this way, this
>           * would be the right place to wire that up (tell the KVM
> @@ -818,16 +854,14 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>           * existing Type1 IOMMUs generally support any IOVA we're
>           * going to actually try in practice.
>           */
> -        container->min_iova = 0;
> -        container->max_iova = (hwaddr)-1;
> -
> -        /* Assume just 4K IOVA page size */
> -        container->iova_pgsizes = 0x1000;
>          info.argsz = sizeof(info);
>          ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
>          /* Ignore errors */
>          if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
> -            container->iova_pgsizes = info.iova_pgsizes;
> +            vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes);
> +        } else {
> +            /* Assume just 4K IOVA page size */
> +            vfio_host_win_add(container, 0, (hwaddr)-1, 0x1000);
>          }
>      } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
>                 ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
> @@ -884,11 +918,12 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              ret = -errno;
>              goto listener_release_exit;
>          }
> -        container->min_iova = info.dma32_window_start;
> -        container->max_iova = container->min_iova + info.dma32_window_size - 1;
>  
> -        /* Assume just 4K IOVA pages for now */
> -        container->iova_pgsizes = 0x1000;
> +        /* The default table uses 4K pages */
> +        vfio_host_win_add(container, info.dma32_window_start,
> +                          info.dma32_window_start +
> +                          info.dma32_window_size - 1,
> +                          0x1000);
>      } else {
>          error_report("vfio: No available IOMMU models");
>          ret = -EINVAL;
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index c72e45a..8028bb8 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -82,9 +82,8 @@ typedef struct VFIOContainer {
>       * contiguous IOVA window.  We may need to generalize that in
>       * future
>       */
> -    hwaddr min_iova, max_iova;
> -    uint64_t iova_pgsizes;
>      QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
> +    QLIST_HEAD(, VFIOHostDMAWindow) hiommu_list;
>      QLIST_HEAD(, VFIOGroup) group_list;
>      QLIST_ENTRY(VFIOContainer) next;
>  } VFIOContainer;
> @@ -97,6 +96,12 @@ typedef struct VFIOGuestIOMMU {
>      QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>  } VFIOGuestIOMMU;
>  
> +typedef struct VFIOHostDMAWindow {
> +    hwaddr min_iova, max_iova;
> +    uint64_t iova_pgsizes;
> +    QLIST_ENTRY(VFIOHostDMAWindow) hiommu_next;
> +} VFIOHostDMAWindow;
> +
>  typedef struct VFIODeviceOps VFIODeviceOps;
>  
>  typedef struct VFIODevice {

* Re: [Qemu-devel] [PATCH qemu v15 14/17] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 14/17] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO Alexey Kardashevskiy
@ 2016-04-07  0:40   ` David Gibson
  2016-04-20  9:15     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 38+ messages in thread
From: David Gibson @ 2016-04-07  0:40 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On Mon, Apr 04, 2016 at 07:33:43PM +1000, Alexey Kardashevskiy wrote:
> The sPAPR TCE tables manage 2 copies when VFIO is using an IOMMU -
> a guest view of the table and a hardware TCE table. If there is no VFIO
> presence in the address space, then just the guest view is used; in
> this case, it is allocated in KVM. However since there is no
> support yet for VFIO in KVM TCE hypercalls, when we start using VFIO,
> we need to move the guest view from KVM to the userspace; and we need
> to do this for every IOMMU on a bus with VFIO devices.
> 
> This adds vfio_start/vfio_stop callbacks in MemoryRegionIOMMUOps to
> notifiy IOMMU about changing environment so it can reallocate the table
> to/from KVM or (when available) hook the IOMMU groups with the logical
> bus (LIOBN) in the KVM.
> 
> This removes explicit spapr_tce_set_need_vfio() call from PCI hotplug
> path as the new callbacks do this better - they notify IOMMU at
> the exact moment when the configuration is changed, and this also
> includes the case of PCI hot unplug.
> 
> As there can be multiple containers attached to the same PHB/LIOBN,
> this replaces the @need_vfio flag in sPAPRTCETable with the counter
> of VFIO users.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

This looks correct, but there's one remaining ugly.

> ---
> Changes:
> v15:
> * s/need_vfio/vfio_users/g
> ---
>  hw/ppc/spapr_iommu.c   | 30 ++++++++++++++++++++----------
>  hw/ppc/spapr_pci.c     |  6 ------
>  hw/vfio/common.c       |  9 +++++++++
>  include/exec/memory.h  |  4 ++++
>  include/hw/ppc/spapr.h |  2 +-
>  5 files changed, 34 insertions(+), 17 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index c945dba..ea09414 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -155,6 +155,16 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
>      return 1ULL << tcet->page_shift;
>  }
>  
> +static void spapr_tce_vfio_start(MemoryRegion *iommu)
> +{
> +    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), true);
> +}
> +
> +static void spapr_tce_vfio_stop(MemoryRegion *iommu)
> +{
> +    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), false);
> +}
> +
>  static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
>  static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
>  
> @@ -239,6 +249,8 @@ static const VMStateDescription vmstate_spapr_tce_table = {
>  static MemoryRegionIOMMUOps spapr_iommu_ops = {
>      .translate = spapr_tce_translate_iommu,
>      .get_page_sizes = spapr_tce_get_page_sizes,
> +    .vfio_start = spapr_tce_vfio_start,
> +    .vfio_stop = spapr_tce_vfio_stop,

Ok, so AFAICT these callbacks are called whenever a VFIO context is
added / removed from the gIOMMU's address space, and it's up to the
gIOMMU code to ref count that to see if there are any current vfio
users.  That makes "vfio_start" and "vfio_stop" not great names.

But.. better than changing the names would be to move the refcounting
to the generic code if you can manage it, so the individual gIOMMU
backends don't need to do it - they are just told when they need to
start / stop providing VFIO support.
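
A rough, untested sketch of what moving the refcount into generic code
could look like.  All names here (IOMMURegion, IOMMUOps,
iommu_vfio_get/put, the demo_* backend) are made up for illustration
and are not existing QEMU APIs; the point is only that the backend
callbacks fire on the 0 -> 1 and 1 -> 0 transitions:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for MemoryRegion and MemoryRegionIOMMUOps. */
typedef struct IOMMURegion IOMMURegion;

typedef struct IOMMUOps {
    void (*vfio_start)(IOMMURegion *iommu);  /* first VFIO user appeared */
    void (*vfio_stop)(IOMMURegion *iommu);   /* last VFIO user went away */
} IOMMUOps;

struct IOMMURegion {
    const IOMMUOps *ops;
    unsigned vfio_users;   /* owned by the generic code, not the backend */
};

/* Generic code: call once for every VFIO container that attaches. */
static void iommu_vfio_get(IOMMURegion *iommu)
{
    if (iommu->vfio_users++ == 0 && iommu->ops && iommu->ops->vfio_start) {
        iommu->ops->vfio_start(iommu);
    }
}

/* Generic code: call once for every VFIO container that detaches. */
static void iommu_vfio_put(IOMMURegion *iommu)
{
    assert(iommu->vfio_users > 0);
    if (--iommu->vfio_users == 0 && iommu->ops && iommu->ops->vfio_stop) {
        iommu->ops->vfio_stop(iommu);
    }
}

/* Toy backend that only counts how often it is told to start/stop. */
static int demo_starts, demo_stops;
static void demo_start(IOMMURegion *iommu) { (void)iommu; demo_starts++; }
static void demo_stop(IOMMURegion *iommu)  { (void)iommu; demo_stops++; }
static const IOMMUOps demo_ops = { demo_start, demo_stop };
```

With something along these lines the backend would never see a second
"start" while it is already providing VFIO support.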

>  };
>  
>  static int spapr_tce_table_realize(DeviceState *dev)
> @@ -248,7 +260,7 @@ static int spapr_tce_table_realize(DeviceState *dev)
>      char tmp[32];
>  
>      tcet->fd = -1;
> -    tcet->need_vfio = false;
> +    tcet->vfio_users = 0;
>      snprintf(tmp, sizeof(tmp), "tce-root-%x", tcet->liobn);
>      memory_region_init(&tcet->root, tcetobj, tmp, UINT64_MAX);
>  
> @@ -268,20 +280,18 @@ void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio)
>      size_t table_size = tcet->nb_table * sizeof(uint64_t);
>      void *newtable;
>  
> -    if (need_vfio == tcet->need_vfio) {
> -        /* Nothing to do */
> -        return;
> -    }
> +    tcet->vfio_users += need_vfio ? 1 : -1;
> +    g_assert(tcet->vfio_users >= 0);
> +    g_assert(tcet->table);
>  
> -    if (!need_vfio) {
> +    if (!tcet->vfio_users) {
>          /* FIXME: We don't support transition back to KVM accelerated
>           * TCEs yet */
>          return;
>      }
>  
> -    tcet->need_vfio = true;
> -
> -    if (tcet->fd < 0) {
> +    if (tcet->vfio_users > 1) {
> +        g_assert(tcet->fd < 0);
>          /* Table is already in userspace, nothing to be do */
>          return;
>      }
> @@ -327,7 +337,7 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
>                                          tcet->page_shift,
>                                          tcet->nb_table,
>                                          &tcet->fd,
> -                                        tcet->need_vfio);
> +                                        tcet->vfio_users != 0);
>  
>      memory_region_set_size(&tcet->iommu,
>                             (uint64_t)tcet->nb_table << tcet->page_shift);
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 5497a18..f864fde 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -1083,12 +1083,6 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>      void *fdt = NULL;
>      int fdt_start_offset = 0, fdt_size;
>  
> -    if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
> -        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
> -
> -        spapr_tce_set_need_vfio(tcet, true);
> -    }
> -
>      if (dev->hotplugged) {
>          fdt = create_device_tree(&fdt_size);
>          fdt_start_offset = spapr_create_pci_child_dt(phb, pdev, fdt, 0);
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index ea79311..5e5b77c 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -421,6 +421,9 @@ static void vfio_listener_region_add(MemoryListener *listener,
>          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>  
>          memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> +        if (section->mr->iommu_ops && section->mr->iommu_ops->vfio_start) {
> +            section->mr->iommu_ops->vfio_start(section->mr);
> +        }
>          memory_region_iommu_replay(giommu->iommu, &giommu->n,
>                                     false);
>  
> @@ -466,6 +469,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
>      VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>      hwaddr iova, end;
>      int ret;
> +    MemoryRegion *iommu = NULL;
>  
>      if (vfio_listener_skipped_section(section)) {
>          trace_vfio_listener_region_del_skip(
> @@ -487,6 +491,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
>          QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
>              if (giommu->iommu == section->mr) {
>                  memory_region_unregister_iommu_notifier(&giommu->n);
> +                iommu = giommu->iommu;
>                  QLIST_REMOVE(giommu, giommu_next);
>                  g_free(giommu);
>                  break;
> @@ -519,6 +524,10 @@ static void vfio_listener_region_del(MemoryListener *listener,
>                       "0x%"HWADDR_PRIx") = %d (%m)",
>                       container, iova, end - iova, ret);
>      }
> +
> +    if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
> +        iommu->iommu_ops->vfio_stop(section->mr);
> +    }
>  }
>  
>  static const MemoryListener vfio_memory_listener = {
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index eb5ce67..f1de133f 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -152,6 +152,10 @@ struct MemoryRegionIOMMUOps {
>      IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
>      /* Returns supported page sizes */
>      uint64_t (*get_page_sizes)(MemoryRegion *iommu);
> +    /* Called when VFIO starts using this */
> +    void (*vfio_start)(MemoryRegion *iommu);
> +    /* Called when VFIO stops using this */
> +    void (*vfio_stop)(MemoryRegion *iommu);
>  };
>  
>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 471eb4a..5c00e38 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -548,7 +548,7 @@ struct sPAPRTCETable {
>      uint32_t mig_nb_table;
>      uint64_t *mig_table;
>      bool bypass;
> -    bool need_vfio;
> +    int vfio_users;
>      int fd;
>      MemoryRegion root, iommu;
>      struct VIOsPAPRDevice *vdev; /* for @bypass migration compatibility only */


* Re: [Qemu-devel] [PATCH qemu v15 15/17] spapr_pci: Get rid of dma_loibn
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 15/17] spapr_pci: Get rid of dma_loibn Alexey Kardashevskiy
@ 2016-04-07  0:50   ` David Gibson
  2016-04-07  7:10     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 38+ messages in thread
From: David Gibson @ 2016-04-07  0:50 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

s/dma_loibn/dma_liobn/ in subject line.

On Mon, Apr 04, 2016 at 07:33:44PM +1000, Alexey Kardashevskiy wrote:
> We are going to have 2 DMA windows whose LIOBNs are calculated from
> the PHB index and the window number using the SPAPR_PCI_LIOBN macro
> so there is no actual use for dma_liobn.
> 
> This replaces dma_liobn with SPAPR_PCI_LIOBN. This marks it as unused
> in the migration stream. This renames dma_liobn to _dma_liobn as we have
> to keep the property for the CLI compatibility and we need a storage
> for it, although it has never really been used.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

This doesn't quite make sense.  We can't really do that without
entirely removing support for PHBs without an 'index' value.
Basically the idea of the PHB config parameters was that you either
specified just "index" or you specified *all* the relevant addresses.
Removing option 2 might be a reasonable idea, but it shouldn't just be
done as a side effect of this other change.  With this patch the
"specify everything" approach still has code, but can't work, because
such a device will never get a reasonable liobn (or worse, it might
get a duplicate liobn, because the index isn't verified in this mode).

Then again.. the "index" approach has also bitten us with the problem
of the not-quite-big-enough MMIO space per-PHB, so I'm not entirely
sure that making it the only choice is the right way to go either.

The short term approach to handle DDW might be to instead add a
dma64_liobn property.

> ---
> Changes:
> v15:
> * new to the series
> ---
>  hw/ppc/spapr_pci.c          | 17 ++++++-----------
>  include/hw/pci-host/spapr.h |  2 +-
>  2 files changed, 7 insertions(+), 12 deletions(-)
> 
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index f864fde..d4bdb27 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -1306,7 +1306,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>      if (sphb->index != (uint32_t)-1) {
>          hwaddr windows_base;
>  
> -        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn != (uint32_t)-1)
> +        if ((sphb->buid != (uint64_t)-1)
>              || (sphb->mem_win_addr != (hwaddr)-1)
>              || (sphb->io_win_addr != (hwaddr)-1)) {
>              error_setg(errp, "Either \"index\" or other parameters must"
> @@ -1321,7 +1321,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>          }
>  
>          sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
> -        sphb->dma_liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
>  
>          windows_base = SPAPR_PCI_WINDOW_BASE
>              + sphb->index * SPAPR_PCI_WINDOW_SPACING;
> @@ -1334,11 +1333,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>          return;
>      }
>  
> -    if (sphb->dma_liobn == (uint32_t)-1) {
> -        error_setg(errp, "LIOBN not specified for PHB");
> -        return;
> -    }
> -
>      if (sphb->mem_win_addr == (hwaddr)-1) {
>          error_setg(errp, "Memory window address not specified for PHB");
>          return;
> @@ -1453,7 +1447,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>          }
>      }
>  
> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
> +    tcet = spapr_tce_new_table(DEVICE(sphb), SPAPR_PCI_LIOBN(sphb->index, 0));
>      if (!tcet) {
>          error_setg(errp, "Unable to create TCE table for %s",
>                     sphb->dtbusname);
> @@ -1479,7 +1473,8 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
>  
>  void spapr_phb_dma_reset(sPAPRPHBState *sphb)
>  {
> -    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
> +    uint32_t liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
> +    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
>  
>      if (tcet && tcet->enabled) {
>          spapr_tce_table_disable(tcet);
> @@ -1507,7 +1502,7 @@ static void spapr_phb_reset(DeviceState *qdev)
>  static Property spapr_phb_properties[] = {
>      DEFINE_PROP_UINT32("index", sPAPRPHBState, index, -1),
>      DEFINE_PROP_UINT64("buid", sPAPRPHBState, buid, -1),
> -    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn, -1),
> +    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, _dma_liobn, -1),
>      DEFINE_PROP_UINT64("mem_win_addr", sPAPRPHBState, mem_win_addr, -1),
>      DEFINE_PROP_UINT64("mem_win_size", sPAPRPHBState, mem_win_size,
>                         SPAPR_PCI_MMIO_WIN_SIZE),
> @@ -1595,7 +1590,7 @@ static const VMStateDescription vmstate_spapr_pci = {
>      .post_load = spapr_pci_post_load,
>      .fields = (VMStateField[]) {
>          VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),
> -        VMSTATE_UINT32_EQUAL(dma_liobn, sPAPRPHBState),
> +        VMSTATE_UNUSED(4), /* former dma_liobn */
>          VMSTATE_UINT64_EQUAL(mem_win_addr, sPAPRPHBState),
>          VMSTATE_UINT64_EQUAL(mem_win_size, sPAPRPHBState),
>          VMSTATE_UINT64_EQUAL(io_win_addr, sPAPRPHBState),
> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> index 7848366..3fca1c3 100644
> --- a/include/hw/pci-host/spapr.h
> +++ b/include/hw/pci-host/spapr.h
> @@ -56,7 +56,7 @@ struct sPAPRPHBState {
>      hwaddr mem_win_addr, mem_win_size, io_win_addr, io_win_size;
>      MemoryRegion memwindow, iowindow, msiwindow;
>  
> -    uint32_t dma_liobn;
> +    uint32_t _dma_liobn;
>      hwaddr dma_win_addr, dma_win_size;
>      AddressSpace iommu_as;
>      MemoryRegion iommu_root;


* Re: [Qemu-devel] [PATCH qemu v15 16/17] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU
  2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 16/17] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU Alexey Kardashevskiy
@ 2016-04-07  1:10   ` David Gibson
  2016-04-20  9:43     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 38+ messages in thread
From: David Gibson @ 2016-04-07  1:10 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

Subject doesn't seem quite right, since you added at least minimal
support for the SPAPRv2 IOMMU in the prereg patch.

On Mon, Apr 04, 2016 at 07:33:45PM +1000, Alexey Kardashevskiy wrote:
> New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
> This adds ability to VFIO common code to dynamically allocate/remove
> DMA windows in the host kernel when new VFIO container is added/removed.
> 
> This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
> and adds just created IOMMU into the host IOMMU list; the opposite
> action is taken in vfio_listener_region_del.
> 
> When creating a new window, this uses euristic to decide on the TCE table
> levels number.

"heuristic" has an 'h' (yes, English spelling is stupid[0]).

[0] The historical reasons for that are kind of fascinating, though.

> This should cause no guest visible change in behavior.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v14:
> * new to the series
> 
> ---
> TODO:
> * export levels to PHB
> ---
>  hw/vfio/common.c | 115 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
>  trace-events     |   2 +
>  2 files changed, 107 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 5e5b77c..57a51df 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -279,6 +279,14 @@ static int vfio_host_win_add(VFIOContainer *container,
>      return 0;
>  }
>  
> +static void vfio_host_win_del(VFIOContainer *container, hwaddr min_iova)
> +{
> +    VFIOHostDMAWindow *hostwin = vfio_host_win_lookup(container, min_iova, 1);
> +
> +    g_assert(hostwin);
> +    QLIST_REMOVE(hostwin, hiommu_next);
> +}
> +
>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>  {
>      return (!memory_region_is_ram(section->mr) &&
> @@ -392,6 +400,63 @@ static void vfio_listener_region_add(MemoryListener *listener,
>      }
>      end = int128_get64(llend);
>  
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {

I think the "add region" path could do with being split out into a
different function - vfio_listener_region_add() is getting pretty
huge.

> +        unsigned entries, pages, pagesize = qemu_real_host_page_size;
> +        struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
> +
> +        trace_vfio_listener_region_add_iommu(iova, end - 1);
> +        if (section->mr->iommu_ops) {
> +            pagesize = section->mr->iommu_ops->get_page_sizes(section->mr);

Since you're querying the guest IOMMU here, I assume pagesize is
supposed to represent *guest* IOMMU pagesizes, in which case it should
default to TARGET_PAGE_SIZE, instead of qemu_real_host_page_size.
(didn't you already have a function which implemented that fallback?)
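
A minimal sketch of such a fallback helper, for illustration only.
The types and names below (Region, RegionIOMMUOps,
guest_iommu_page_sizes, DEMO_TARGET_PAGE_SIZE, demo_64k) are
hypothetical stand-ins, and the 4K target page size is an assumption:

```c
#include <stdint.h>

/* Hypothetical stand-ins for MemoryRegion / MemoryRegionIOMMUOps. */
#define DEMO_TARGET_PAGE_SIZE 0x1000ULL   /* assumption: 4K target pages */

typedef struct Region Region;

typedef struct RegionIOMMUOps {
    uint64_t (*get_page_sizes)(Region *mr);
} RegionIOMMUOps;

struct Region {
    const RegionIOMMUOps *iommu_ops;
};

/*
 * Ask the guest IOMMU for its page size mask; if the region has no
 * IOMMU ops (or no get_page_sizes hook), fall back to the target page
 * size rather than the host page size.
 */
static uint64_t guest_iommu_page_sizes(Region *mr)
{
    if (mr->iommu_ops && mr->iommu_ops->get_page_sizes) {
        return mr->iommu_ops->get_page_sizes(mr);
    }
    return DEMO_TARGET_PAGE_SIZE;
}

/* Toy guest IOMMU that only supports 64K pages. */
static uint64_t demo_64k(Region *mr)
{
    (void)mr;
    return 1ULL << 16;
}
```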

> +        }
> +        /*
> +         * FIXME: For VFIO iommu types which have KVM acceleration to
> +         * avoid bouncing all map/unmaps through qemu this way, this
> +         * would be the right place to wire that up (tell the KVM
> +         * device emulation the VFIO iommu handles to use).
> +         */
> +        create.window_size = int128_get64(section->size);
> +        create.page_shift = ctz64(pagesize);
> +        /*
> +         * SPAPR host supports multilevel TCE tables, there is some
> +         * euristic to decide how many levels we want for our table:

s/some euristic/a heuristic/
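
For reference, the table in the quoted comment just below (0..64
pages = 1 level; 65..4096 = 2; 4097..262144 = 3; anything larger = 4)
can be written out directly.  This is only a sketch of that table, not
the ctz64()/pow2ceil() expression from the patch, and
tce_levels_for_pages is a made-up name:

```c
#include <stdint.h>

/*
 * Number of TCE table levels for a table occupying 'pages' host pages.
 * Each extra level multiplies the covered range by 64, capped at 4.
 */
static unsigned tce_levels_for_pages(uint64_t pages)
{
    unsigned levels = 1;
    uint64_t limit = 64;

    while (levels < 4 && pages > limit) {
        limit *= 64;
        levels++;
    }
    return levels;
}
```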

> +         * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
> +         */
> +        entries = create.window_size >> create.page_shift;
> +        pages = (entries * sizeof(uint64_t)) / getpagesize();
> +        create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;
> +
> +        if (vfio_host_win_lookup(container, create.start_addr,
> +                                 create.start_addr + create.window_size - 1)) {
> +            goto fail;

Hmm.. if you successfully look up a host window, it seems to me you
shouldn't fail, but in fact don't even need to create a new window
(the removal path gets harder though, because you need to check if any
guest window requires that host window).

Requiring that the host windows exactly match the guest windows is
probably ok for a first version - except that in that case any overlap
should cause a failure, not just a complete inclusion.
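
To make the distinction concrete, a small sketch of the two tests on
inclusive ranges.  HostWin, win_contains and win_overlaps are
hypothetical names for this example, not the helpers from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t hwaddr;

/* Hypothetical trimmed-down host window: inclusive [min_iova, max_iova]. */
typedef struct HostWin {
    hwaddr min_iova, max_iova;
} HostWin;

/* True if [min, max] lies entirely inside the window. */
static bool win_contains(const HostWin *w, hwaddr min, hwaddr max)
{
    assert(min <= max);
    return w->min_iova <= min && max <= w->max_iova;
}

/* True if [min, max] shares at least one address with the window. */
static bool win_overlaps(const HostWin *w, hwaddr min, hwaddr max)
{
    assert(min <= max);
    return min <= w->max_iova && w->min_iova <= max;
}
```

A new window that straddles the end of an existing one passes the
"contains" test as not found, but the "overlaps" test catches it.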

> +        }
> +
> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> +        if (ret) {
> +            error_report("Failed to create a window, ret = %d (%m)", ret);
> +            goto fail;
> +        }
> +
> +        if (create.start_addr != section->offset_within_address_space) {
> +            struct vfio_iommu_spapr_tce_remove remove = {
> +                .argsz = sizeof(remove),
> +                .start_addr = create.start_addr
> +            };
> +            error_report("Host doesn't support DMA window at %"HWADDR_PRIx", must be %"PRIx64,
> +                         section->offset_within_address_space,
> +                         create.start_addr);
> +            ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> +            ret = -EINVAL;
> +            goto fail;
> +        }
> +        trace_vfio_spapr_create_window(create.page_shift,
> +                                       create.window_size,
> +                                       create.start_addr);
> +
> +        vfio_host_win_add(container, create.start_addr,
> +                          create.start_addr + create.window_size - 1,
> +                          1ULL << create.page_shift);
> +    }
> +
>      if (!vfio_host_win_lookup(container, iova, end - 1)) {
>          error_report("vfio: IOMMU container %p can't map guest IOVA region"
>                       " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
> @@ -525,6 +590,22 @@ static void vfio_listener_region_del(MemoryListener *listener,
>                       container, iova, end - iova, ret);
>      }
>  
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +        struct vfio_iommu_spapr_tce_remove remove = {
> +            .argsz = sizeof(remove),
> +            .start_addr = section->offset_within_address_space,
> +        };
> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> +        if (ret) {
> +            error_report("Failed to remove window at %"PRIx64,
> +                         remove.start_addr);
> +        }
> +
> +        vfio_host_win_del(container, section->offset_within_address_space);
> +
> +        trace_vfio_spapr_remove_window(remove.start_addr);
> +    }
> +
>      if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
>          iommu->iommu_ops->vfio_stop(section->mr);
>      }
> @@ -915,11 +996,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              }
>          }
>  
> -        /*
> -         * This only considers the host IOMMU's 32-bit window.  At
> -         * some point we need to add support for the optional 64-bit
> -         * window and dynamic windows
> -         */
>          info.argsz = sizeof(info);
>          ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
>          if (ret) {
> @@ -928,11 +1004,30 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              goto listener_release_exit;
>          }
>  
> -        /* The default table uses 4K pages */
> -        vfio_host_win_add(container, info.dma32_window_start,
> -                          info.dma32_window_start +
> -                          info.dma32_window_size - 1,
> -                          0x1000);
> +        if (v2) {
> +            /*
> +             * There is a default window in just created container.
> +             * To make region_add/del simpler, we better remove this
> +             * window now and let those iommu_listener callbacks
> +             * create/remove them when needed.
> +             */
> +            struct vfio_iommu_spapr_tce_remove remove = {
> +                .argsz = sizeof(remove),
> +                .start_addr = info.dma32_window_start,
> +            };
> +            ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> +            if (ret) {
> +                error_report("vfio: VFIO_IOMMU_SPAPR_TCE_REMOVE failed: %m");
> +                ret = -errno;
> +                goto free_container_exit;
> +            }
> +        } else {
> +            /* The default table uses 4K pages */
> +            vfio_host_win_add(container, info.dma32_window_start,
> +                              info.dma32_window_start +
> +                              info.dma32_window_size - 1,
> +                              0x1000);
> +        }
>      } else {
>          error_report("vfio: No available IOMMU models");
>          ret = -EINVAL;
> diff --git a/trace-events b/trace-events
> index 23ca0b9..5c651fa 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1738,6 +1738,8 @@ vfio_region_finalize(const char *name, int index) "Device %s, region %d"
>  vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
>  vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>  vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> +vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
> +vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
>  
>  # hw/vfio/platform.c
>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"


* Re: [Qemu-devel] [PATCH qemu v15 15/17] spapr_pci: Get rid of dma_loibn
  2016-04-07  0:50   ` David Gibson
@ 2016-04-07  7:10     ` Alexey Kardashevskiy
  2016-04-08  1:34       ` David Gibson
  0 siblings, 1 reply; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-04-07  7:10 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On 04/07/2016 10:50 AM, David Gibson wrote:
> s/dma_loibn/dma_liobn/ in subject line.
>
> On Mon, Apr 04, 2016 at 07:33:44PM +1000, Alexey Kardashevskiy wrote:
>> We are going to have 2 DMA windows whose LIOBNs are calculated from
>> the PHB index and the window number using the SPAPR_PCI_LIOBN macro
>> so there is no actual use for dma_liobn.
>>
>> This replaces dma_liobn with SPAPR_PCI_LIOBN. This marks it as unused
>> in the migration stream. This renames dma_liobn to _dma_liobn as we have
>> to keep the property for the CLI compatibility and we need a storage
>> for it, although it has never really been used.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>
> This doesn't quite make sense.  We can't really do that without
> entirely removing support for PHBs without an 'index' value.
> Basically the idea of the PHB config parameters was that you either
> specified just "index" or you specified *all* the relevant addresses.
> Removing option 2 might be a reasonable idea, but it shouldn't just be
> done as a side effect of this other change.  With this patch the
> "specify everything" approach still has code, but can't work, because
> such a device will never get a reasonable liobn (or worse, it might
> get a duplicate liobn, because the index isn't verified in this mode).
>
> Then again.. the "index" approach has also bitten us with the problem
> of the not-quite-big-enough MMIO space per-PHB,

Any new ideas on that front? Keep in mind that we would rather assign a
pool of interrupt numbers to a PHB based on its index than allocate them
from the global machine interrupt number space, so this is one more vote
for keeping the index.


> so I'm not entirely
> sure that making it the only choice is the right way to go either.

I can imagine the user wanting to change MMIO addresses (so they should 
remain properties) but changing LIOBN from the command line does not seem 
useful at all.


> The short term approach to handle DDW might be to instead add a
> dma64_liobn property.

So everywhere I want to loop through all TCE tables using
SPAPR_PCI_LIOBN(index, i), I cannot really do that if I have 2 separate
properties. Not very convenient.
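
To illustrate the kind of loop in question, a minimal sketch.  The
DEMO_* names and phb_liobns are invented for this example, and the
encoding in DEMO_PCI_LIOBN is assumed to mirror SPAPR_PCI_LIOBN rather
than copied from the tree:

```c
#include <stdint.h>

/* Assumption: same layout as SPAPR_PCI_LIOBN(phb_index, window_num). */
#define DEMO_PCI_LIOBN(phb_index, window_num) \
    (0x80000000u | ((uint32_t)(phb_index) << 8) | (uint32_t)(window_num))

/* Default 32-bit window plus one dynamic (DDW) window. */
#define DEMO_DMA_WINDOWS 2

/* Fill out[] with the LIOBN of every DMA window of one PHB. */
static void phb_liobns(uint32_t phb_index, uint32_t out[DEMO_DMA_WINDOWS])
{
    for (unsigned i = 0; i < DEMO_DMA_WINDOWS; i++) {
        out[i] = DEMO_PCI_LIOBN(phb_index, i);
    }
}
```

With per-window properties the loop body would need a lookup for each
window instead of one computed value.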



>
>> ---
>> Changes:
>> v15:
>> * new to the series
>> ---
>>   hw/ppc/spapr_pci.c          | 17 ++++++-----------
>>   include/hw/pci-host/spapr.h |  2 +-
>>   2 files changed, 7 insertions(+), 12 deletions(-)
>>
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index f864fde..d4bdb27 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -1306,7 +1306,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>       if (sphb->index != (uint32_t)-1) {
>>           hwaddr windows_base;
>>
>> -        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn != (uint32_t)-1)
>> +        if ((sphb->buid != (uint64_t)-1)
>>               || (sphb->mem_win_addr != (hwaddr)-1)
>>               || (sphb->io_win_addr != (hwaddr)-1)) {
>>               error_setg(errp, "Either \"index\" or other parameters must"
>> @@ -1321,7 +1321,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>           }
>>
>>           sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
>> -        sphb->dma_liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
>>
>>           windows_base = SPAPR_PCI_WINDOW_BASE
>>               + sphb->index * SPAPR_PCI_WINDOW_SPACING;
>> @@ -1334,11 +1333,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>           return;
>>       }
>>
>> -    if (sphb->dma_liobn == (uint32_t)-1) {
>> -        error_setg(errp, "LIOBN not specified for PHB");
>> -        return;
>> -    }
>> -
>>       if (sphb->mem_win_addr == (hwaddr)-1) {
>>           error_setg(errp, "Memory window address not specified for PHB");
>>           return;
>> @@ -1453,7 +1447,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>           }
>>       }
>>
>> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
>> +    tcet = spapr_tce_new_table(DEVICE(sphb), SPAPR_PCI_LIOBN(sphb->index, 0));
>>       if (!tcet) {
>>           error_setg(errp, "Unable to create TCE table for %s",
>>                      sphb->dtbusname);
>> @@ -1479,7 +1473,8 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
>>
>>   void spapr_phb_dma_reset(sPAPRPHBState *sphb)
>>   {
>> -    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
>> +    uint32_t liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
>> +    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
>>
>>       if (tcet && tcet->enabled) {
>>           spapr_tce_table_disable(tcet);
>> @@ -1507,7 +1502,7 @@ static void spapr_phb_reset(DeviceState *qdev)
>>   static Property spapr_phb_properties[] = {
>>       DEFINE_PROP_UINT32("index", sPAPRPHBState, index, -1),
>>       DEFINE_PROP_UINT64("buid", sPAPRPHBState, buid, -1),
>> -    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn, -1),
>> +    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, _dma_liobn, -1),
>>       DEFINE_PROP_UINT64("mem_win_addr", sPAPRPHBState, mem_win_addr, -1),
>>       DEFINE_PROP_UINT64("mem_win_size", sPAPRPHBState, mem_win_size,
>>                          SPAPR_PCI_MMIO_WIN_SIZE),
>> @@ -1595,7 +1590,7 @@ static const VMStateDescription vmstate_spapr_pci = {
>>       .post_load = spapr_pci_post_load,
>>       .fields = (VMStateField[]) {
>>           VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),
>> -        VMSTATE_UINT32_EQUAL(dma_liobn, sPAPRPHBState),
>> +        VMSTATE_UNUSED(4), /* former dma_liobn */
>>           VMSTATE_UINT64_EQUAL(mem_win_addr, sPAPRPHBState),
>>           VMSTATE_UINT64_EQUAL(mem_win_size, sPAPRPHBState),
>>           VMSTATE_UINT64_EQUAL(io_win_addr, sPAPRPHBState),
>> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
>> index 7848366..3fca1c3 100644
>> --- a/include/hw/pci-host/spapr.h
>> +++ b/include/hw/pci-host/spapr.h
>> @@ -56,7 +56,7 @@ struct sPAPRPHBState {
>>       hwaddr mem_win_addr, mem_win_size, io_win_addr, io_win_size;
>>       MemoryRegion memwindow, iowindow, msiwindow;
>>
>> -    uint32_t dma_liobn;
>> +    uint32_t _dma_liobn;
>>       hwaddr dma_win_addr, dma_win_size;
>>       AddressSpace iommu_as;
>>       MemoryRegion iommu_root;
>


-- 
Alexey

* Re: [Qemu-devel] [PATCH qemu v15 15/17] spapr_pci: Get rid of dma_loibn
  2016-04-07  7:10     ` Alexey Kardashevskiy
@ 2016-04-08  1:34       ` David Gibson
  0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2016-04-08  1:34 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alex Williamson, Alexander Graf

On Thu, Apr 07, 2016 at 05:10:53PM +1000, Alexey Kardashevskiy wrote:
> On 04/07/2016 10:50 AM, David Gibson wrote:
> >s/dma_loibn/dma_liobn/ in subject line.
> >
> >On Mon, Apr 04, 2016 at 07:33:44PM +1000, Alexey Kardashevskiy wrote:
> >>We are going to have 2 DMA windows which LIOBNs are calculated from
> >>the PHB index and the window number using the SPAPR_PCI_LIOBN macro
> >>so there is no actual use for dma_liobn.
> >>
> >>This replaces dma_liobn with SPAPR_PCI_LIOBN. This marks it as unused
> >>in the migration stream. This renames dma_liobn to _dma_liobn as we have
> >>to keep the property for the CLI compatibility and we need a storage
> >>for it, although it has never really been used.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >
> >This doesn't quite make sense.  We can't really do that without
> >entirely removing support for PHBs without an 'index' value.
> >Basically the idea of the PHB config parameters was that you either
> >specified just "index" or you specified *all* the relevant addresses.
> >Removing option 2 might be a reasonable idea, but it shouldn't just be
> >done as a side effect of this other change.  With this patch the
> >"specify everything" approach still has code, but can't work, because
> >such a device will never get a reasonable liobn (or worse, it might
> >get a duplicate liobn, because the index isn't verified in this mode).
> >
> >Then again.. the "index" approach has also bitten us with the problem
> >of the not-quite-big-enough MMIO space per-PHB,
> 
> Any new ideas on that front?

No, not really :(.

> I also keep in mind that we rather want to
> assign interrupt numbers pool to a PHB based on its index rather than
> allocate them from the global machine interrupt number space so this is one
> more vote for keeping the index.

True, if we removed the index option we'd have to have the irqs set
from a property as well, I think, which would just increase the pain.

> >so I'm not entirely
> >sure that making it the only choice is the right way to go either.
> 
> I can imagine the user wanting to change MMIO addresses (so they should
> remain properties) but changing LIOBN from the command line does not seem
> useful at all.

Sure, but that doesn't alter the fact that we shouldn't break the
no-index option without completely removing the no-index option.

> >The short term approach to handle DDW might be to instead add a
> >dma64_liobn property.
> 
> So everywhere where I want to have a loop through all TCE tables and use
> SPAPR_PCI_LIOBN(index, i), I cannot really do that if I have 2 separate
> properties. Not extremely convenient.

Hmm.. you could work around that.  Inside the structure you could have
a liobns[2] array which you step through, then the properties just set
the individual elements of the array.
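
A minimal sketch of the liobns[2] idea above, under stated assumptions:
the struct, the macro names and the helper functions are hypothetical
stand-ins, not the real sPAPRPHBState or QEMU property code. It only
shows that two separately settable values can still be walked in one
loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not the actual sPAPRPHBState. */
#define PHB_SKETCH_MAX_WINDOWS 2          /* default window + one 64-bit window */
#define PHB_SKETCH_LIOBN_UNSET UINT32_MAX /* mirrors the -1 property default */

struct phb_sketch {
    /*
     * liobns[0] would back the existing "liobn" property,
     * liobns[1] would back the proposed "dma64_liobn" property.
     */
    uint32_t liobns[PHB_SKETCH_MAX_WINDOWS];
};

static void phb_sketch_init(struct phb_sketch *phb)
{
    for (size_t i = 0; i < PHB_SKETCH_MAX_WINDOWS; i++) {
        phb->liobns[i] = PHB_SKETCH_LIOBN_UNSET;
    }
}

/* What a per-property setter would do: write exactly one array element. */
static void phb_sketch_set_liobn(struct phb_sketch *phb, size_t window,
                                 uint32_t liobn)
{
    if (window < PHB_SKETCH_MAX_WINDOWS) {
        phb->liobns[window] = liobn;
    }
}

/* Code that wants "all TCE tables" simply walks the array. */
static size_t phb_sketch_count_windows(const struct phb_sketch *phb)
{
    size_t n = 0;

    for (size_t i = 0; i < PHB_SKETCH_MAX_WINDOWS; i++) {
        if (phb->liobns[i] != PHB_SKETCH_LIOBN_UNSET) {
            n++;
        }
    }
    return n;
}
```

With this layout the no-index configuration keeps working, because
nothing has to be derived from the PHB index.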

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH qemu v15 11/17] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2016-04-06  6:05   ` David Gibson
@ 2016-04-20  8:51     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-04-20  8:51 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-devel, qemu-ppc, Alex Williamson, Alexander Graf

On 04/06/2016 04:05 PM, David Gibson wrote:
> On Mon, Apr 04, 2016 at 07:33:40PM +1000, Alexey Kardashevskiy wrote:
>> This makes use of the new "memory registering" feature. The idea is
>> to provide the userspace ability to notify the host kernel about pages
>> which are going to be used for DMA. Having this information, the host
>> kernel can pin them all once per user process, do locked pages
>> accounting (once) and not spent time on doing that in real time with
>> possible failures which cannot be handled nicely in some cases.
>>
>> This adds a prereg memory listener which listens on address_space_memory
>> and notifies a VFIO container about memory which needs to be
>> pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
>>
>> As there is no per-IOMMU-type release() callback anymore, this stores
>> the IOMMU type in the container so vfio_listener_release() can device
>
> s/device/determine/ ?

That was supposed to be "decide" but "determine" will also do :)


>
>> if it needs to unregister @prereg_listener.
>>
>> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
>> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
>> not call it when v2 is detected and enabled.
>>
>> This does not change the guest visible interface.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v15:
>> * banned unaligned sections
>> * added an vfio_prereg_gpa_to_ua() helper
>>
>> v14:
>> * s/free_container_exit/listener_release_exit/g
>> * added "if memory_region_is_iommu()" to vfio_prereg_listener_skipped_section
>> ---
>>   hw/vfio/Makefile.objs         |   1 +
>>   hw/vfio/common.c              |  38 +++++++++---
>>   hw/vfio/prereg.c              | 138 ++++++++++++++++++++++++++++++++++++++++++
>>   include/hw/vfio/vfio-common.h |   4 ++
>>   trace-events                  |   2 +
>>   5 files changed, 173 insertions(+), 10 deletions(-)
>>   create mode 100644 hw/vfio/prereg.c
>>
>> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
>> index ceddbb8..5800e0e 100644
>> --- a/hw/vfio/Makefile.objs
>> +++ b/hw/vfio/Makefile.objs
>> @@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
>>   obj-$(CONFIG_SOFTMMU) += platform.o
>>   obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
>>   obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
>> +obj-$(CONFIG_SOFTMMU) += prereg.o
>>   endif
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 6bec419..3e9c579 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -493,6 +493,9 @@ static const MemoryListener vfio_memory_listener = {
>>   static void vfio_listener_release(VFIOContainer *container)
>>   {
>>       memory_listener_unregister(&container->listener);
>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>> +        memory_listener_unregister(&container->prereg_listener);
>> +    }
>>   }
>>
>>   int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
>> @@ -800,8 +803,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>               goto free_container_exit;
>>           }
>>
>> -        ret = ioctl(fd, VFIO_SET_IOMMU,
>> -                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
>> +        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
>> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>>           if (ret) {
>>               error_report("vfio: failed to set iommu for container: %m");
>>               ret = -errno;
>> @@ -826,8 +829,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>           if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
>>               container->iova_pgsizes = info.iova_pgsizes;
>>           }
>> -    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
>> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
>> +               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
>>           struct vfio_iommu_spapr_tce_info info;
>> +        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
>>
>>           ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>>           if (ret) {
>> @@ -835,7 +840,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>               ret = -errno;
>>               goto free_container_exit;
>>           }
>> -        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
>> +        container->iommu_type =
>> +            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
>> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>>           if (ret) {
>>               error_report("vfio: failed to set iommu for container: %m");
>>               ret = -errno;
>> @@ -847,11 +854,22 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>            * when container fd is closed so we do not call it explicitly
>>            * in this file.
>>            */
>> -        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
>> -        if (ret) {
>> -            error_report("vfio: failed to enable container: %m");
>> -            ret = -errno;
>> -            goto free_container_exit;
>> +        if (!v2) {
>> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
>> +            if (ret) {
>> +                error_report("vfio: failed to enable container: %m");
>> +                ret = -errno;
>> +                goto free_container_exit;
>> +            }
>> +        } else {
>> +            container->prereg_listener = vfio_prereg_listener;
>> +
>> +            memory_listener_register(&container->prereg_listener,
>> +                                     &address_space_memory);
>> +            if (container->error) {
>> +                error_report("vfio: RAM memory listener initialization failed for container");
>> +                goto listener_release_exit;
>> +            }
>>           }
>>
>>           /*
>> @@ -864,7 +882,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>           if (ret) {
>>               error_report("vfio: VFIO_IOMMU_SPAPR_TCE_GET_INFO failed: %m");
>>               ret = -errno;
>> -            goto free_container_exit;
>> +            goto listener_release_exit;
>>           }
>>           container->min_iova = info.dma32_window_start;
>>           container->max_iova = container->min_iova + info.dma32_window_size - 1;
>> diff --git a/hw/vfio/prereg.c b/hw/vfio/prereg.c
>> new file mode 100644
>> index 0000000..5f7fa30
>> --- /dev/null
>> +++ b/hw/vfio/prereg.c
>> @@ -0,0 +1,138 @@
>> +/*
>> + * DMA memory preregistration
>> + *
>> + * Authors:
>> + *  Alexey Kardashevskiy <aik@ozlabs.ru>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.  See
>> + * the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include <sys/ioctl.h>
>> +#include <linux/vfio.h>
>> +
>> +#include "hw/vfio/vfio-common.h"
>> +#include "qemu/error-report.h"
>> +#include "trace.h"
>> +
>> +static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
>> +{
>> +    if (memory_region_is_iommu(section->mr)) {
>> +        error_report("Cannot possibly preregister IOMMU memory");
>> +        return true;
>> +    }
>> +
>> +    return !memory_region_is_ram(section->mr) ||
>> +            memory_region_is_skip_dump(section->mr);
>> +}
>> +
>> +static void *vfio_prereg_gpa_to_ua(MemoryRegionSection *section, hwaddr gpa)
>> +{
>> +    return memory_region_get_ram_ptr(section->mr) +
>> +        section->offset_within_region +
>> +        (gpa - section->offset_within_address_space);
>> +}
>> +
>> +static void vfio_prereg_listener_region_add(MemoryListener *listener,
>> +                                            MemoryRegionSection *section)
>> +{
>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>> +                                            prereg_listener);
>> +    const hwaddr gpa = section->offset_within_address_space;
>> +    Int128 llend;
>> +    int ret;
>> +    hwaddr page_mask = qemu_real_host_page_mask;
>> +    struct vfio_iommu_spapr_register_memory reg = {
>> +        .argsz = sizeof(reg),
>> +        .flags = 0,
>> +    };
>> +
>> +    if (vfio_prereg_listener_skipped_section(section)) {
>> +        trace_vfio_listener_region_add_skip(
>> +                section->offset_within_address_space,
>> +                section->offset_within_address_space +
>> +                int128_get64(int128_sub(section->size, int128_one())));
>> +        return;
>> +    }
>> +
>> +    if (unlikely((section->offset_within_address_space & ~page_mask) ||
>> +                 (section->offset_within_region & ~page_mask) ||
>> +                 (int128_get64(section->size) & ~page_mask))) {
>> +        error_report("%s received unaligned region", __func__);
>> +        return;
>> +    }
>
> You could round out the range to pre-register rather than just failing
> here, but that can be changed later if we need it.


Do we really want to support non-host-page-aligned guest RAM size?

I'd prefer fewer automated decisions at this stage really, regardless.
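
For reference, the "round out" alternative mentioned above could look
roughly like the sketch below. This is only an illustration: the struct
and function names are made up, the page size is assumed to be a power
of two, and start + size is assumed not to wrap around 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: a [start, start + size) byte range. */
struct range_sketch {
    uint64_t start;
    uint64_t size;
};

/*
 * Grow a range outwards so that both ends sit on page boundaries.
 * Assumes page_size is a power of two and that start + size plus
 * one page does not overflow 64 bits.
 */
static struct range_sketch round_out_to_pages(uint64_t start, uint64_t size,
                                              uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    const uint64_t mask = ~(page_size - 1);
    const uint64_t first = start & mask;                          /* round down */
    const uint64_t end = (start + size + page_size - 1) & mask;   /* round up */

    return (struct range_sketch){ .start = first, .size = end - first };
}
```

The cost of doing this silently is that more memory gets pinned than
the section actually covers, which is one argument for failing loudly
instead.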


>
>> +    llend = int128_make64(section->offset_within_address_space);
>> +    llend = int128_add(llend, section->size);
>> +
>> +    g_assert(!int128_ge(int128_make64(gpa), llend));
>> +
>> +    memory_region_ref(section->mr);
>> +
>> +    reg.vaddr = (__u64) vfio_prereg_gpa_to_ua(section, gpa);
>> +    reg.size = int128_get64(llend) - gpa;
>> +
>> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
>> +    trace_vfio_ram_register(reg.vaddr, reg.size, ret ? -errno : 0);
>> +    if (ret) {
>> +        /*
>> +         * On the initfn path, store the first error in the container so we
>> +         * can gracefully fail.  Runtime, there's not much we can do other
>> +         * than throw a hardware error.
>> +         */
>> +        if (!container->initialized) {
>> +            if (!container->error) {
>> +                container->error = ret;
>> +            }
>> +        } else {
>> +            hw_error("vfio: Memory registering failed, unable to continue");
>> +        }
>> +    }
>> +}
>> +
>> +static void vfio_prereg_listener_region_del(MemoryListener *listener,
>> +                                            MemoryRegionSection *section)
>> +{
>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>> +                                            prereg_listener);
>> +    const hwaddr gpa = section->offset_within_address_space;
>> +    hwaddr end;
>> +    int ret;
>> +    hwaddr page_mask = qemu_real_host_page_mask;
>> +    struct vfio_iommu_spapr_register_memory reg = {
>> +        .argsz = sizeof(reg),
>> +        .flags = 0,
>> +    };
>> +
>> +    if (vfio_prereg_listener_skipped_section(section)) {
>> +        trace_vfio_listener_region_del_skip(
>> +                section->offset_within_address_space,
>> +                section->offset_within_address_space +
>> +                int128_get64(int128_sub(section->size, int128_one())));
>> +        return;
>> +    }
>> +
>> +    if (unlikely((section->offset_within_address_space & ~page_mask) ||
>> +                 (section->offset_within_region & ~page_mask) ||
>> +                 (int128_get64(section->size) & ~page_mask))) {
>> +        error_report("%s received unaligned region", __func__);
>> +        return;
>> +    }
>> +
>> +    end = section->offset_within_address_space + int128_get64(section->size);
>
> Why 64-bit math here, but 128 bit math in the region_add path?


Hm. Good question. It was copied from the common VFIO code, where it has
been since 7532d3cbf1. Back in those days I did not even think about
hot(un)plugging VFIO devices, so region_del() was never executed; this
seems to be a bug both there and here. Yet another patch is coming then.
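
To illustrate why the 64-bit form is fragile: an exclusive end computed
as start + size wraps to zero for a section that reaches the top of the
64-bit space. The sketch below is plain C with hypothetical helper
names (it does not use QEMU's Int128 API) and shows two ways to stay
safe: detect the overflow, or work with the inclusive last byte.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Exclusive end of [start, start + size).
 * Returns false when the sum does not fit in 64 bits.
 */
static bool range_end_checked(uint64_t start, uint64_t size, uint64_t *end)
{
    if (size > UINT64_MAX - start) {
        return false;
    }
    *end = start + size;
    return true;
}

/*
 * Inclusive last byte of a non-empty range. This stays representable
 * even when the range ends exactly at the top of the 64-bit space.
 */
static bool range_last_checked(uint64_t start, uint64_t size, uint64_t *last)
{
    if (size == 0 || size - 1 > UINT64_MAX - start) {
        return false;
    }
    *last = start + (size - 1);
    return true;
}
```

For a 64KiB range ending at the very top of the address space, the
first helper reports an overflow while the second one yields
UINT64_MAX.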


>
>> +    if (gpa >= end) {
>> +        return;
>> +    }
>> +
>> +    reg.vaddr = (__u64) vfio_prereg_gpa_to_ua(section, gpa);
>> +    reg.size = end - gpa;
>> +
>> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
>> +    trace_vfio_ram_unregister(reg.vaddr, reg.size, ret ? -errno : 0);
>> +}
>> +
>> +const MemoryListener vfio_prereg_listener = {
>> +    .region_add = vfio_prereg_listener_region_add,
>> +    .region_del = vfio_prereg_listener_region_del,
>> +};
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index c9b6622..c72e45a 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -73,6 +73,8 @@ typedef struct VFIOContainer {
>>       VFIOAddressSpace *space;
>>       int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>>       MemoryListener listener;
>> +    MemoryListener prereg_listener;
>> +    unsigned iommu_type;
>>       int error;
>>       bool initialized;
>>       /*
>> @@ -156,4 +158,6 @@ extern QLIST_HEAD(vfio_as_head, VFIOAddressSpace) vfio_address_spaces;
>>   int vfio_get_region_info(VFIODevice *vbasedev, int index,
>>                            struct vfio_region_info **info);
>>   #endif
>> +extern const MemoryListener vfio_prereg_listener;
>> +
>>   #endif /* !HW_VFIO_VFIO_COMMON_H */
>> diff --git a/trace-events b/trace-events
>> index 4335b9b..23ca0b9 100644
>> --- a/trace-events
>> +++ b/trace-events
>> @@ -1736,6 +1736,8 @@ vfio_region_mmap(const char *name, unsigned long offset, unsigned long end) "Reg
>>   vfio_region_exit(const char *name, int index) "Device %s, region %d"
>>   vfio_region_finalize(const char *name, int index) "Device %s, region %d"
>>   vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
>> +vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>> +vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>>
>>   # hw/vfio/platform.c
>>   vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
>


-- 
Alexey

* Re: [Qemu-devel] [PATCH qemu v15 14/17] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO
  2016-04-07  0:40   ` David Gibson
@ 2016-04-20  9:15     ` Alexey Kardashevskiy
  2016-04-21  3:59       ` David Gibson
  0 siblings, 1 reply; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-04-20  9:15 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-devel, qemu-ppc, Alex Williamson, Alexander Graf

On 04/07/2016 10:40 AM, David Gibson wrote:
> On Mon, Apr 04, 2016 at 07:33:43PM +1000, Alexey Kardashevskiy wrote:
>> The sPAPR TCE tables manage 2 copies when VFIO is using an IOMMU -
>> a guest view of the table and a hardware TCE table. If there is no VFIO
>> presense in the address space, then just the guest view is used, if
>> this is the case, it is allocated in the KVM. However since there is no
>> support yet for VFIO in KVM TCE hypercalls, when we start using VFIO,
>> we need to move the guest view from KVM to the userspace; and we need
>> to do this for every IOMMU on a bus with VFIO devices.
>>
>> This adds vfio_start/vfio_stop callbacks in MemoryRegionIOMMUOps to
>> notifiy IOMMU about changing environment so it can reallocate the table
>> to/from KVM or (when available) hook the IOMMU groups with the logical
>> bus (LIOBN) in the KVM.
>>
>> This removes explicit spapr_tce_set_need_vfio() call from PCI hotplug
>> path as the new callbacks do this better - they notify IOMMU at
>> the exact moment when the configuration is changed, and this also
>> includes the case of PCI hot unplug.
>>
>> As there can be multiple containers attached to the same PHB/LIOBN,
>> this replaces the @need_vfio flag in sPAPRTCETable with the counter
>> of VFIO users.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>
> This looks correct, but there's one remaining ugly.
>
>> ---
>> Changes:
>> v15:
>> * s/need_vfio/vfio-Users/g
>> ---
>>   hw/ppc/spapr_iommu.c   | 30 ++++++++++++++++++++----------
>>   hw/ppc/spapr_pci.c     |  6 ------
>>   hw/vfio/common.c       |  9 +++++++++
>>   include/exec/memory.h  |  4 ++++
>>   include/hw/ppc/spapr.h |  2 +-
>>   5 files changed, 34 insertions(+), 17 deletions(-)
>>
>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>> index c945dba..ea09414 100644
>> --- a/hw/ppc/spapr_iommu.c
>> +++ b/hw/ppc/spapr_iommu.c
>> @@ -155,6 +155,16 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
>>       return 1ULL << tcet->page_shift;
>>   }
>>
>> +static void spapr_tce_vfio_start(MemoryRegion *iommu)
>> +{
>> +    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), true);
>> +}
>> +
>> +static void spapr_tce_vfio_stop(MemoryRegion *iommu)
>> +{
>> +    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), false);
>> +}
>> +
>>   static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
>>   static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
>>
>> @@ -239,6 +249,8 @@ static const VMStateDescription vmstate_spapr_tce_table = {
>>   static MemoryRegionIOMMUOps spapr_iommu_ops = {
>>       .translate = spapr_tce_translate_iommu,
>>       .get_page_sizes = spapr_tce_get_page_sizes,
>> +    .vfio_start = spapr_tce_vfio_start,
>> +    .vfio_stop = spapr_tce_vfio_stop,
>
> Ok, so AFAICT these callbacks are called whenever a VFIO context is
> added / removed from the gIOMMU's address space, and it's up to the
> gIOMMU code to ref count that to see if there are any current vfio
> users.  That makes "vfio_start" and "vfio_stop" not great names.
>
> But.. better than changing the names would be to move the refcounting
> to the generic code if you can manage it, so the individual gIOMMU
> backends don't need to - they just told when they need to start / stop
> providing VFIO support.

Everything is manageable...

This reference counting is needed for the case of >=2 containers: two
vfio_listener_region_add() calls will create two VFIOGuestIOMMU objects,
as these are per VFIOContainer. So VFIOGuestIOMMU is not the right place
for the reference counting; VFIOAddressSpace seems to be that place
(=> add a list of IOMMU MRs with a refcounter). Or even the IOMMU MR
itself. Or move the VFIOGuestIOMMU list from VFIOContainer to
VFIOAddressSpace, and then gIOMMU can handle the refcounting?
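
Wherever the counter ends up living, the generic side would do
something like the sketch below: count attached users and call into the
backend only on the 0 -> 1 and 1 -> 0 transitions. All names here are
hypothetical; this is not existing QEMU code, just the shape of the
idea.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical generic-side bookkeeping for one guest IOMMU region. */
struct giommu_users {
    unsigned count;                 /* VFIO containers currently attached */
    void (*start)(void *opaque);    /* backend hook: first user appeared */
    void (*stop)(void *opaque);     /* backend hook: last user went away */
    void *opaque;
};

static void giommu_users_get(struct giommu_users *u)
{
    if (u->count++ == 0 && u->start) {
        u->start(u->opaque);
    }
}

static void giommu_users_put(struct giommu_users *u)
{
    assert(u->count > 0);
    if (--u->count == 0 && u->stop) {
        u->stop(u->opaque);
    }
}

/* Toy backend that only records how often each hook fired. */
struct demo_backend {
    int starts;
    int stops;
};

static void demo_start(void *opaque)
{
    ((struct demo_backend *)opaque)->starts++;
}

static void demo_stop(void *opaque)
{
    ((struct demo_backend *)opaque)->stops++;
}
```

With two containers attached and then detached, the toy backend sees
exactly one start and one stop, so the backend would no longer need a
counter of its own.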


>
>>   };
>>
>>   static int spapr_tce_table_realize(DeviceState *dev)
>> @@ -248,7 +260,7 @@ static int spapr_tce_table_realize(DeviceState *dev)
>>       char tmp[32];
>>
>>       tcet->fd = -1;
>> -    tcet->need_vfio = false;
>> +    tcet->vfio_users = 0;
>>       snprintf(tmp, sizeof(tmp), "tce-root-%x", tcet->liobn);
>>       memory_region_init(&tcet->root, tcetobj, tmp, UINT64_MAX);
>>
>> @@ -268,20 +280,18 @@ void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio)
>>       size_t table_size = tcet->nb_table * sizeof(uint64_t);
>>       void *newtable;
>>
>> -    if (need_vfio == tcet->need_vfio) {
>> -        /* Nothing to do */
>> -        return;
>> -    }
>> +    tcet->vfio_users += need_vfio ? 1 : -1;
>> +    g_assert(tcet->vfio_users >= 0);
>> +    g_assert(tcet->table);
>>
>> -    if (!need_vfio) {
>> +    if (!tcet->vfio_users) {
>>           /* FIXME: We don't support transition back to KVM accelerated
>>            * TCEs yet */
>>           return;
>>       }
>>
>> -    tcet->need_vfio = true;
>> -
>> -    if (tcet->fd < 0) {
>> +    if (tcet->vfio_users > 1) {
>> +        g_assert(tcet->fd < 0);
>>           /* Table is already in userspace, nothing to be do */
>>           return;
>>       }
>> @@ -327,7 +337,7 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
>>                                           tcet->page_shift,
>>                                           tcet->nb_table,
>>                                           &tcet->fd,
>> -                                        tcet->need_vfio);
>> +                                        tcet->vfio_users != 0);
>>
>>       memory_region_set_size(&tcet->iommu,
>>                              (uint64_t)tcet->nb_table << tcet->page_shift);
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index 5497a18..f864fde 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -1083,12 +1083,6 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>>       void *fdt = NULL;
>>       int fdt_start_offset = 0, fdt_size;
>>
>> -    if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
>> -        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
>> -
>> -        spapr_tce_set_need_vfio(tcet, true);
>> -    }
>> -
>>       if (dev->hotplugged) {
>>           fdt = create_device_tree(&fdt_size);
>>           fdt_start_offset = spapr_create_pci_child_dt(phb, pdev, fdt, 0);
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index ea79311..5e5b77c 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -421,6 +421,9 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>           QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>>
>>           memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
>> +        if (section->mr->iommu_ops && section->mr->iommu_ops->vfio_start) {
>> +            section->mr->iommu_ops->vfio_start(section->mr);
>> +        }
>>           memory_region_iommu_replay(giommu->iommu, &giommu->n,
>>                                      false);
>>
>> @@ -466,6 +469,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>       VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>>       hwaddr iova, end;
>>       int ret;
>> +    MemoryRegion *iommu = NULL;
>>
>>       if (vfio_listener_skipped_section(section)) {
>>           trace_vfio_listener_region_del_skip(
>> @@ -487,6 +491,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>           QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
>>               if (giommu->iommu == section->mr) {
>>                   memory_region_unregister_iommu_notifier(&giommu->n);
>> +                iommu = giommu->iommu;
>>                   QLIST_REMOVE(giommu, giommu_next);
>>                   g_free(giommu);
>>                   break;
>> @@ -519,6 +524,10 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>                        "0x%"HWADDR_PRIx") = %d (%m)",
>>                        container, iova, end - iova, ret);
>>       }
>> +
>> +    if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
>> +        iommu->iommu_ops->vfio_stop(section->mr);
>> +    }
>>   }
>>
>>   static const MemoryListener vfio_memory_listener = {
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index eb5ce67..f1de133f 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -152,6 +152,10 @@ struct MemoryRegionIOMMUOps {
>>       IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
>>       /* Returns supported page sizes */
>>       uint64_t (*get_page_sizes)(MemoryRegion *iommu);
>> +    /* Called when VFIO starts using this */
>> +    void (*vfio_start)(MemoryRegion *iommu);
>> +    /* Called when VFIO stops using this */
>> +    void (*vfio_stop)(MemoryRegion *iommu);
>>   };
>>
>>   typedef struct CoalescedMemoryRange CoalescedMemoryRange;
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index 471eb4a..5c00e38 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -548,7 +548,7 @@ struct sPAPRTCETable {
>>       uint32_t mig_nb_table;
>>       uint64_t *mig_table;
>>       bool bypass;
>> -    bool need_vfio;
>> +    int vfio_users;
>>       int fd;
>>       MemoryRegion root, iommu;
>>       struct VIOsPAPRDevice *vdev; /* for @bypass migration compatibility only */
>


-- 
Alexey

* Re: [Qemu-devel] [PATCH qemu v15 16/17] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU
  2016-04-07  1:10   ` David Gibson
@ 2016-04-20  9:43     ` Alexey Kardashevskiy
  2016-04-21  4:03       ` David Gibson
  0 siblings, 1 reply; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-04-20  9:43 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-devel, qemu-ppc, Alex Williamson, Alexander Graf

On 04/07/2016 11:10 AM, David Gibson wrote:
> Subject doesn't seem quite right, since you added at least minimal
> support for the SPAPRv2 IOMMU in the prereg patch.
>
> On Mon, Apr 04, 2016 at 07:33:45PM +1000, Alexey Kardashevskiy wrote:
>> New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
>> This adds ability to VFIO common code to dynamically allocate/remove
>> DMA windows in the host kernel when new VFIO container is added/removed.
>>
>> This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
>> and adds just created IOMMU into the host IOMMU list; the opposite
>> action is taken in vfio_listener_region_del.
>>
>> When creating a new window, this uses euristic to decide on the TCE table
>> levels number.
>
> "heuristic" has an 'h' (yes, English spelling is stupid[0]).
>
> [0] The historical reasons for that are kind of fascinating, though.

I tried googling but could not quickly find the reasoning. Any hints on
what to google for? Or just a link with an explanation? :)


>
>> This should cause no guest visible change in behavior.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v14:
>> * new to the series
>>
>> ---
>> TODO:
>> * export levels to PHB
>> ---
>>   hw/vfio/common.c | 115 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
>>   trace-events     |   2 +
>>   2 files changed, 107 insertions(+), 10 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 5e5b77c..57a51df 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -279,6 +279,14 @@ static int vfio_host_win_add(VFIOContainer *container,
>>       return 0;
>>   }
>>
>> +static void vfio_host_win_del(VFIOContainer *container, hwaddr min_iova)
>> +{
>> +    VFIOHostDMAWindow *hostwin = vfio_host_win_lookup(container, min_iova, 1);
>> +
>> +    g_assert(hostwin);
>> +    QLIST_REMOVE(hostwin, hiommu_next);
>> +}
>> +
>>   static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>>   {
>>       return (!memory_region_is_ram(section->mr) &&
>> @@ -392,6 +400,63 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>       }
>>       end = int128_get64(llend);
>>
>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>
> I think the "add region" path could do with being split out into a
> different function - vfio_listener_region_add() is getting pretty
> huge.

It is big but not huge, and I am trying to avoid having functions with
"spapr" in their names in common.c: once they appear, we will start
discussing whether they should move to a separate file, and if they do,
then maybe some other code should too, etc...


>
>> +        unsigned entries, pages, pagesize = qemu_real_host_page_size;
>> +        struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
>> +
>> +        trace_vfio_listener_region_add_iommu(iova, end - 1);
>> +        if (section->mr->iommu_ops) {
>> +            pagesize = section->mr->iommu_ops->get_page_sizes(section->mr);
>
> Since you're querying the guest IOMMU here, I assume pagesize is
> supposed to represent *guest* IOMMU pagesizes, in which case it should
> default to TARGET_PAGE_SIZE, instead of qemu_real_host_page_size.
> (didn't you already have a function which implemented that fallback?)

Yes, will use the helper.


>> +        }
>> +        /*
>> +         * FIXME: For VFIO iommu types which have KVM acceleration to
>> +         * avoid bouncing all map/unmaps through qemu this way, this
>> +         * would be the right place to wire that up (tell the KVM
>> +         * device emulation the VFIO iommu handles to use).
>> +         */
>> +        create.window_size = int128_get64(section->size);
>> +        create.page_shift = ctz64(pagesize);
>> +        /*
>> +         * SPAPR host supports multilevel TCE tables, there is some
>> +         * euristic to decide how many levels we want for our table:
>
> s/some euristic/a heuristic/
>
>> +         * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
>> +         */
>> +        entries = create.window_size >> create.page_shift;
>> +        pages = (entries * sizeof(uint64_t)) / getpagesize();
>> +        create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;
>> +
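
For illustration only (not part of the patch): a minimal sketch that encodes
the table from the comment above directly. The helper names
tce_table_pages() and tce_levels_for_pages() are hypothetical, and the
rounding-up of the page count is an assumption (the posted code divides
without rounding). One tentative observation: pow2ceil(pages) - 1 is odd
whenever the rounded-up value is greater than 1, so ctz64() of it would be 0;
the sketch therefore uses explicit thresholds instead of reproducing the
posted expression.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of host pages needed to hold the TCE table,
 * assuming one 64-bit TCE per IOMMU page. Rounds up so that a partially
 * filled page still counts as one page (an assumption of this sketch).
 */
static uint64_t tce_table_pages(uint64_t window_size, unsigned page_shift,
                                uint64_t host_page_size)
{
    uint64_t entries = window_size >> page_shift;
    uint64_t bytes = entries * sizeof(uint64_t);

    return (bytes + host_page_size - 1) / host_page_size;
}

/*
 * Hypothetical helper: map a page count to a number of TCE table levels,
 * following the table in the patch comment:
 * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
 */
static unsigned tce_levels_for_pages(uint64_t pages)
{
    if (pages <= 64) {
        return 1;
    }
    if (pages <= 4096) {
        return 2;
    }
    if (pages <= 262144) {
        return 3;
    }
    return 4;
}
```

The thresholds are successive powers of 64, so a closed-form expression is
possible, but the explicit chain is easier to check against the comment.
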
>> +        if (vfio_host_win_lookup(container, create.start_addr,
>> +                                 create.start_addr + create.window_size - 1)) {
>> +            goto fail;
>
> Hmm.. if you successfully look up a host window, it seems to me you
> shouldn't fail, but in fact don't even need to create a new window
> (the removal path gets harder though, because you need to check if any
> guest window requires that host window).


At the moment, if the window is there, it is a failure in the environment I
am testing it in. And, having a host kernel which cannot allocate and map
windows randomly, it is unlikely that I'll have a setup where this spot
won't mean that something went wrong. The x86 case needs a lot more than
this anyway.


> Requiring that the host windows exactly match the guest windows is
> probably ok for a first version - except that in that case any overlap
> should cause a failure, not just a complete inclusion.

Yes, for now this should fail too, I'll fix it.
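
To make the distinction concrete, here is a minimal sketch of an "any
overlap" test on inclusive IOVA ranges, as opposed to a "fully contained"
test. Both helper names are hypothetical and are not taken from
hw/vfio/common.c; how vfio_host_win_lookup() would use such a check is left
open.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if the inclusive ranges [a_min, a_max] and
 * [b_min, b_max] share at least one address.
 */
static bool iova_ranges_overlap(uint64_t a_min, uint64_t a_max,
                                uint64_t b_min, uint64_t b_max)
{
    return a_min <= b_max && b_min <= a_max;
}

/*
 * Hypothetical helper, for contrast: true only if [b_min, b_max] lies
 * entirely inside [a_min, a_max].
 */
static bool iova_range_contains(uint64_t a_min, uint64_t a_max,
                                uint64_t b_min, uint64_t b_max)
{
    return a_min <= b_min && b_max <= a_max;
}
```

A new window that only partially intersects an existing one passes the
containment test as "not found" but is caught by the overlap test.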


>
>> +        }
>> +
>> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>> +        if (ret) {
>> +            error_report("Failed to create a window, ret = %d (%m)", ret);
>> +            goto fail;
>> +        }
>> +
>> +        if (create.start_addr != section->offset_within_address_space) {
>> +            struct vfio_iommu_spapr_tce_remove remove = {
>> +                .argsz = sizeof(remove),
>> +                .start_addr = create.start_addr
>> +            };
>> +            error_report("Host doesn't support DMA window at %"HWADDR_PRIx", must be %"PRIx64,
>> +                         section->offset_within_address_space,
>> +                         create.start_addr);
>> +            ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
>> +            ret = -EINVAL;
>> +            goto fail;
>> +        }
>> +        trace_vfio_spapr_create_window(create.page_shift,
>> +                                       create.window_size,
>> +                                       create.start_addr);
>> +
>> +        vfio_host_win_add(container, create.start_addr,
>> +                          create.start_addr + create.window_size - 1,
>> +                          1ULL << create.page_shift);
>> +    }
>> +
>>       if (!vfio_host_win_lookup(container, iova, end - 1)) {
>>           error_report("vfio: IOMMU container %p can't map guest IOVA region"
>>                        " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
>> @@ -525,6 +590,22 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>                        container, iova, end - iova, ret);
>>       }
>>
>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>> +        struct vfio_iommu_spapr_tce_remove remove = {
>> +            .argsz = sizeof(remove),
>> +            .start_addr = section->offset_within_address_space,
>> +        };
>> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
>> +        if (ret) {
>> +            error_report("Failed to remove window at %"PRIx64,
>> +                         remove.start_addr);
>> +        }
>> +
>> +        vfio_host_win_del(container, section->offset_within_address_space);
>> +
>> +        trace_vfio_spapr_remove_window(remove.start_addr);
>> +    }
>> +
>>       if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
>>           iommu->iommu_ops->vfio_stop(section->mr);
>>       }
>> @@ -915,11 +996,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>               }
>>           }
>>
>> -        /*
>> -         * This only considers the host IOMMU's 32-bit window.  At
>> -         * some point we need to add support for the optional 64-bit
>> -         * window and dynamic windows
>> -         */
>>           info.argsz = sizeof(info);
>>           ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
>>           if (ret) {
>> @@ -928,11 +1004,30 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>               goto listener_release_exit;
>>           }
>>
>> -        /* The default table uses 4K pages */
>> -        vfio_host_win_add(container, info.dma32_window_start,
>> -                          info.dma32_window_start +
>> -                          info.dma32_window_size - 1,
>> -                          0x1000);
>> +        if (v2) {
>> +            /*
>> +             * There is a default window in just created container.
>> +             * To make region_add/del simpler, we better remove this
>> +             * window now and let those iommu_listener callbacks
>> +             * create/remove them when needed.
>> +             */
>> +            struct vfio_iommu_spapr_tce_remove remove = {
>> +                .argsz = sizeof(remove),
>> +                .start_addr = info.dma32_window_start,
>> +            };
>> +            ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
>> +            if (ret) {
>> +                error_report("vfio: VFIO_IOMMU_SPAPR_TCE_REMOVE failed: %m");
>> +                ret = -errno;
>> +                goto free_container_exit;
>> +            }
>> +        } else {
>> +            /* The default table uses 4K pages */
>> +            vfio_host_win_add(container, info.dma32_window_start,
>> +                              info.dma32_window_start +
>> +                              info.dma32_window_size - 1,
>> +                              0x1000);
>> +        }
>>       } else {
>>           error_report("vfio: No available IOMMU models");
>>           ret = -EINVAL;
>> diff --git a/trace-events b/trace-events
>> index 23ca0b9..5c651fa 100644
>> --- a/trace-events
>> +++ b/trace-events
>> @@ -1738,6 +1738,8 @@ vfio_region_finalize(const char *name, int index) "Device %s, region %d"
>>   vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
>>   vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>>   vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>> +vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
>> +vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
>>
>>   # hw/vfio/platform.c
>>   vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
>


-- 
Alexey

* Re: [Qemu-devel] [PATCH qemu v15 14/17] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO
  2016-04-20  9:15     ` Alexey Kardashevskiy
@ 2016-04-21  3:59       ` David Gibson
  2016-04-21  4:22         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 38+ messages in thread
From: David Gibson @ 2016-04-21  3:59 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alex Williamson, Alexander Graf

On Wed, Apr 20, 2016 at 07:15:15PM +1000, Alexey Kardashevskiy wrote:
> On 04/07/2016 10:40 AM, David Gibson wrote:
> >On Mon, Apr 04, 2016 at 07:33:43PM +1000, Alexey Kardashevskiy wrote:
> >>The sPAPR TCE tables manage 2 copies when VFIO is using an IOMMU -
> >>a guest view of the table and a hardware TCE table. If there is no VFIO
> >>presence in the address space, then just the guest view is used, if
> >>this is the case, it is allocated in the KVM. However since there is no
> >>support yet for VFIO in KVM TCE hypercalls, when we start using VFIO,
> >>we need to move the guest view from KVM to the userspace; and we need
> >>to do this for every IOMMU on a bus with VFIO devices.
> >>
> >>This adds vfio_start/vfio_stop callbacks in MemoryRegionIOMMUOps to
> >>notify IOMMU about changing environment so it can reallocate the table
> >>to/from KVM or (when available) hook the IOMMU groups with the logical
> >>bus (LIOBN) in the KVM.
> >>
> >>This removes explicit spapr_tce_set_need_vfio() call from PCI hotplug
> >>path as the new callbacks do this better - they notify IOMMU at
> >>the exact moment when the configuration is changed, and this also
> >>includes the case of PCI hot unplug.
> >>
> >>As there can be multiple containers attached to the same PHB/LIOBN,
> >>this replaces the @need_vfio flag in sPAPRTCETable with the counter
> >>of VFIO users.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >
> >This looks correct, but there's one remaining ugly.
> >
> >>---
> >>Changes:
> >>v15:
> >>* s/need_vfio/vfio-Users/g
> >>---
> >>  hw/ppc/spapr_iommu.c   | 30 ++++++++++++++++++++----------
> >>  hw/ppc/spapr_pci.c     |  6 ------
> >>  hw/vfio/common.c       |  9 +++++++++
> >>  include/exec/memory.h  |  4 ++++
> >>  include/hw/ppc/spapr.h |  2 +-
> >>  5 files changed, 34 insertions(+), 17 deletions(-)
> >>
> >>diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> >>index c945dba..ea09414 100644
> >>--- a/hw/ppc/spapr_iommu.c
> >>+++ b/hw/ppc/spapr_iommu.c
> >>@@ -155,6 +155,16 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
> >>      return 1ULL << tcet->page_shift;
> >>  }
> >>
> >>+static void spapr_tce_vfio_start(MemoryRegion *iommu)
> >>+{
> >>+    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), true);
> >>+}
> >>+
> >>+static void spapr_tce_vfio_stop(MemoryRegion *iommu)
> >>+{
> >>+    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), false);
> >>+}
> >>+
> >>  static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
> >>  static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
> >>
> >>@@ -239,6 +249,8 @@ static const VMStateDescription vmstate_spapr_tce_table = {
> >>  static MemoryRegionIOMMUOps spapr_iommu_ops = {
> >>      .translate = spapr_tce_translate_iommu,
> >>      .get_page_sizes = spapr_tce_get_page_sizes,
> >>+    .vfio_start = spapr_tce_vfio_start,
> >>+    .vfio_stop = spapr_tce_vfio_stop,
> >
> >Ok, so AFAICT these callbacks are called whenever a VFIO context is
> >added / removed from the gIOMMU's address space, and it's up to the
> >gIOMMU code to ref count that to see if there are any current vfio
> >users.  That makes "vfio_start" and "vfio_stop" not great names.
> >
> >But.. better than changing the names would be to move the refcounting
> >to the generic code if you can manage it, so the individual gIOMMU
> >backends don't need to - they are just told when they need to start / stop
> >providing VFIO support.
> 
> Everything is manageable...
> 
> This referencing is needed for the case of >=2 containers so
> 2xvfio_listener_region_add will create 2xVFIOGuestIOMMU as they are per
> VFIOContainer so VFIOGuestIOMMU is not the right place for the reference
> counting, VFIOAddressSpace seems to be that place (=> add list of IOMMU MRs
> with refcounter). Or even IOMMU MR. Or move VFIOGuestIOMMU list from
> VFIOContainer to VFIOAddressSpace and then gIOMMU can handle
> refcounting?

I'm having a lot of trouble parsing that.  I think the ref counting has
to be per-giommu (because individual giommus could, in theory, be
mapped or unmapped from an address space).  But I think that should be
in the vfio core, rather than being necessary in every giommu
implementation.
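
As a rough sketch of that split (illustrative only; the struct and function
names below are hypothetical and do not exist in QEMU): the core keeps one
counter per guest IOMMU and invokes the backend's start/stop hooks only on
the 0 -> 1 and 1 -> 0 transitions, so the backend never counts users itself.
Where such a counter should live is exactly the open question in this thread
and is not settled by the sketch.

```c
#include <assert.h>

/* Hypothetical per-gIOMMU bookkeeping owned by the generic code. */
typedef struct GuestIOMMUUsers {
    unsigned users;                 /* number of VFIO users right now */
    void (*start)(void *opaque);    /* backend hook: first user appeared */
    void (*stop)(void *opaque);     /* backend hook: last user went away */
    void *opaque;                   /* backend state passed to the hooks */
} GuestIOMMUUsers;

/* Register one more VFIO user; notify the backend only for the first. */
static void giommu_users_get(GuestIOMMUUsers *g)
{
    if (g->users++ == 0 && g->start) {
        g->start(g->opaque);
    }
}

/* Drop one VFIO user; notify the backend only when none remain. */
static void giommu_users_put(GuestIOMMUUsers *g)
{
    assert(g->users > 0);
    if (--g->users == 0 && g->stop) {
        g->stop(g->opaque);
    }
}

/* Demo hook for experimenting: counts how often the backend is notified. */
static void count_event(void *opaque)
{
    ++*(int *)opaque;
}
```

With two containers on the same gIOMMU, the backend would see one start and
one stop in total, regardless of the order in which the containers come and
go.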

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v15 16/17] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU
  2016-04-20  9:43     ` Alexey Kardashevskiy
@ 2016-04-21  4:03       ` David Gibson
  0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2016-04-21  4:03 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alex Williamson, Alexander Graf

On Wed, Apr 20, 2016 at 07:43:41PM +1000, Alexey Kardashevskiy wrote:
> On 04/07/2016 11:10 AM, David Gibson wrote:
> >Subject doesn't seem quite right, since you added at least minimal
> >support for the SPAPRv2 IOMMU in the prereg patch.
> >
> >On Mon, Apr 04, 2016 at 07:33:45PM +1000, Alexey Kardashevskiy wrote:
> >>New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
> >>This adds ability to VFIO common code to dynamically allocate/remove
> >>DMA windows in the host kernel when new VFIO container is added/removed.
> >>
> >>This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
> >>and adds just created IOMMU into the host IOMMU list; the opposite
> >>action is taken in vfio_listener_region_del.
> >>
> >>When creating a new window, this uses euristic to decide on the TCE table
> >>levels number.
> >
> >"heuristic" has an 'h' (yes, English spelling is stupid[0]).
> >
> >[0] The historical reasons for that are kind of fascinating, though.
> 
> Tried googling, could not quickly spot the reasoning; any hints on what
> to google for? Or just a link with an explanation? :)

That wasn't a comment about heuristic specifically, but English
spelling in general.  I got most of the fascinating stuff I've seen
from:

http://www.amazon.com/Spell-Out-Enthralling-Extraordinary-Spelling/dp/1250056128/ref=la_B000AP940C_1_5/189-2461131-2789630?s=books&ie=UTF8&qid=1461211129&sr=1-5

> >>This should cause no guest visible change in behavior.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>---
> >>Changes:
> >>v14:
> >>* new to the series
> >>
> >>---
> >>TODO:
> >>* export levels to PHB
> >>---
> >>  hw/vfio/common.c | 115 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
> >>  trace-events     |   2 +
> >>  2 files changed, 107 insertions(+), 10 deletions(-)
> >>
> >>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>index 5e5b77c..57a51df 100644
> >>--- a/hw/vfio/common.c
> >>+++ b/hw/vfio/common.c
> >>@@ -279,6 +279,14 @@ static int vfio_host_win_add(VFIOContainer *container,
> >>      return 0;
> >>  }
> >>
> >>+static void vfio_host_win_del(VFIOContainer *container, hwaddr min_iova)
> >>+{
> >>+    VFIOHostDMAWindow *hostwin = vfio_host_win_lookup(container, min_iova, 1);
> >>+
> >>+    g_assert(hostwin);
> >>+    QLIST_REMOVE(hostwin, hiommu_next);
> >>+}
> >>+
> >>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> >>  {
> >>      return (!memory_region_is_ram(section->mr) &&
> >>@@ -392,6 +400,63 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >>      }
> >>      end = int128_get64(llend);
> >>
> >>+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> >
> >I think the "add region" path could do with being split out into a
> >different function - vfio_listener_region_add() is getting pretty
> >huge.
> 
> It is big but not huge and I am trying to avoid having functions with
> "spapr" in their names in common.c as once they appear, we will start having
> a discussion if they should move to a separate file and if they do, then
> maybe some other code should too, etc...

I don't think it needs to be a "spapr" named function.  The spapr
backend is the only one to support it for now, but other iommus could
support dynamic windows in future (for example, if I ever get time to
implement the "type2" converged interface I was thinking about, I'd
look to include that).

> >>+        unsigned entries, pages, pagesize = qemu_real_host_page_size;
> >>+        struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
> >>+
> >>+        trace_vfio_listener_region_add_iommu(iova, end - 1);
> >>+        if (section->mr->iommu_ops) {
> >>+            pagesize = section->mr->iommu_ops->get_page_sizes(section->mr);
> >
> >Since you're querying the guest IOMMU here, I assume pagesize is
> >supposed to represent *guest* IOMMU pagesizes, in which case it should
> >default to TARGET_PAGE_SIZE, instead of qemu_real_host_page_size.
> >(didn't you already have a function which implemented that fallback?)
> 
> Yes, will use the helper.
> 
> 
> >>+        }
> >>+        /*
> >>+         * FIXME: For VFIO iommu types which have KVM acceleration to
> >>+         * avoid bouncing all map/unmaps through qemu this way, this
> >>+         * would be the right place to wire that up (tell the KVM
> >>+         * device emulation the VFIO iommu handles to use).
> >>+         */
> >>+        create.window_size = int128_get64(section->size);
> >>+        create.page_shift = ctz64(pagesize);
> >>+        /*
> >>+         * SPAPR host supports multilevel TCE tables, there is some
> >>+         * euristic to decide how many levels we want for our table:
> >
> >s/some euristic/a heuristic/
> >
> >>+         * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
> >>+         */
> >>+        entries = create.window_size >> create.page_shift;
> >>+        pages = (entries * sizeof(uint64_t)) / getpagesize();
> >>+        create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;
> >>+
> >>+        if (vfio_host_win_lookup(container, create.start_addr,
> >>+                                 create.start_addr + create.window_size - 1)) {
> >>+            goto fail;
> >
> >Hmm.. if you successfully look up a host window, it seems to me you
> >shouldn't fail, but in fact don't even need to create a new window
> >(the removal path gets harder though, because you need to check if any
> >guest window requires that host window).
> 
> 
> At the moment, if the window is there, it is a failure in the environment I
> am testing it in. And, having a host kernel which cannot allocate and map
> windows randomly, it is unlikely that I'll have a setup where this spot
> won't mean that something went wrong. The x86 case needs a lot more than
> this anyway.

Hm, I suppose so.

> 
> 
> >Requiring that the host windows exactly match the guest windows is
> >probably ok for a first version - except that in that case any overlap
> >should cause a failure, not just a complete inclusion.
> 
> Yes, for now this should fail too, I'll fix it.
> 
> 
> >
> >>+        }
> >>+
> >>+        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> >>+        if (ret) {
> >>+            error_report("Failed to create a window, ret = %d (%m)", ret);
> >>+            goto fail;
> >>+        }
> >>+
> >>+        if (create.start_addr != section->offset_within_address_space) {
> >>+            struct vfio_iommu_spapr_tce_remove remove = {
> >>+                .argsz = sizeof(remove),
> >>+                .start_addr = create.start_addr
> >>+            };
> >>+            error_report("Host doesn't support DMA window at %"HWADDR_PRIx", must be %"PRIx64,
> >>+                         section->offset_within_address_space,
> >>+                         create.start_addr);
> >>+            ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> >>+            ret = -EINVAL;
> >>+            goto fail;
> >>+        }
> >>+        trace_vfio_spapr_create_window(create.page_shift,
> >>+                                       create.window_size,
> >>+                                       create.start_addr);
> >>+
> >>+        vfio_host_win_add(container, create.start_addr,
> >>+                          create.start_addr + create.window_size - 1,
> >>+                          1ULL << create.page_shift);
> >>+    }
> >>+
> >>      if (!vfio_host_win_lookup(container, iova, end - 1)) {
> >>          error_report("vfio: IOMMU container %p can't map guest IOVA region"
> >>                       " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
> >>@@ -525,6 +590,22 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >>                       container, iova, end - iova, ret);
> >>      }
> >>
> >>+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> >>+        struct vfio_iommu_spapr_tce_remove remove = {
> >>+            .argsz = sizeof(remove),
> >>+            .start_addr = section->offset_within_address_space,
> >>+        };
> >>+        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> >>+        if (ret) {
> >>+            error_report("Failed to remove window at %"PRIx64,
> >>+                         remove.start_addr);
> >>+        }
> >>+
> >>+        vfio_host_win_del(container, section->offset_within_address_space);
> >>+
> >>+        trace_vfio_spapr_remove_window(remove.start_addr);
> >>+    }
> >>+
> >>      if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
> >>          iommu->iommu_ops->vfio_stop(section->mr);
> >>      }
> >>@@ -915,11 +996,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>              }
> >>          }
> >>
> >>-        /*
> >>-         * This only considers the host IOMMU's 32-bit window.  At
> >>-         * some point we need to add support for the optional 64-bit
> >>-         * window and dynamic windows
> >>-         */
> >>          info.argsz = sizeof(info);
> >>          ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
> >>          if (ret) {
> >>@@ -928,11 +1004,30 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>              goto listener_release_exit;
> >>          }
> >>
> >>-        /* The default table uses 4K pages */
> >>-        vfio_host_win_add(container, info.dma32_window_start,
> >>-                          info.dma32_window_start +
> >>-                          info.dma32_window_size - 1,
> >>-                          0x1000);
> >>+        if (v2) {
> >>+            /*
> >>+             * There is a default window in just created container.
> >>+             * To make region_add/del simpler, we better remove this
> >>+             * window now and let those iommu_listener callbacks
> >>+             * create/remove them when needed.
> >>+             */
> >>+            struct vfio_iommu_spapr_tce_remove remove = {
> >>+                .argsz = sizeof(remove),
> >>+                .start_addr = info.dma32_window_start,
> >>+            };
> >>+            ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> >>+            if (ret) {
> >>+                error_report("vfio: VFIO_IOMMU_SPAPR_TCE_REMOVE failed: %m");
> >>+                ret = -errno;
> >>+                goto free_container_exit;
> >>+            }
> >>+        } else {
> >>+            /* The default table uses 4K pages */
> >>+            vfio_host_win_add(container, info.dma32_window_start,
> >>+                              info.dma32_window_start +
> >>+                              info.dma32_window_size - 1,
> >>+                              0x1000);
> >>+        }
> >>      } else {
> >>          error_report("vfio: No available IOMMU models");
> >>          ret = -EINVAL;
> >>diff --git a/trace-events b/trace-events
> >>index 23ca0b9..5c651fa 100644
> >>--- a/trace-events
> >>+++ b/trace-events
> >>@@ -1738,6 +1738,8 @@ vfio_region_finalize(const char *name, int index) "Device %s, region %d"
> >>  vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
> >>  vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> >>  vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> >>+vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
> >>+vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
> >>
> >>  # hw/vfio/platform.c
> >>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
> >
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v15 14/17] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO
  2016-04-21  3:59       ` David Gibson
@ 2016-04-21  4:22         ` Alexey Kardashevskiy
  2016-04-26  2:28           ` Alexey Kardashevskiy
  2016-04-27  6:39           ` David Gibson
  0 siblings, 2 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-04-21  4:22 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-devel, qemu-ppc, Alex Williamson, Alexander Graf

On 04/21/2016 01:59 PM, David Gibson wrote:
> On Wed, Apr 20, 2016 at 07:15:15PM +1000, Alexey Kardashevskiy wrote:
>> On 04/07/2016 10:40 AM, David Gibson wrote:
>>> On Mon, Apr 04, 2016 at 07:33:43PM +1000, Alexey Kardashevskiy wrote:
>>>> The sPAPR TCE tables manage 2 copies when VFIO is using an IOMMU -
>>>> a guest view of the table and a hardware TCE table. If there is no VFIO
>>>> presence in the address space, then just the guest view is used, if
>>>> this is the case, it is allocated in the KVM. However since there is no
>>>> support yet for VFIO in KVM TCE hypercalls, when we start using VFIO,
>>>> we need to move the guest view from KVM to the userspace; and we need
>>>> to do this for every IOMMU on a bus with VFIO devices.
>>>>
>>>> This adds vfio_start/vfio_stop callbacks in MemoryRegionIOMMUOps to
>>>> notify IOMMU about changing environment so it can reallocate the table
>>>> to/from KVM or (when available) hook the IOMMU groups with the logical
>>>> bus (LIOBN) in the KVM.
>>>>
>>>> This removes explicit spapr_tce_set_need_vfio() call from PCI hotplug
>>>> path as the new callbacks do this better - they notify IOMMU at
>>>> the exact moment when the configuration is changed, and this also
>>>> includes the case of PCI hot unplug.
>>>>
>>>> As there can be multiple containers attached to the same PHB/LIOBN,
>>>> this replaces the @need_vfio flag in sPAPRTCETable with the counter
>>>> of VFIO users.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>
>>> This looks correct, but there's one remaining ugly.
>>>
>>>> ---
>>>> Changes:
>>>> v15:
>>>> * s/need_vfio/vfio-Users/g
>>>> ---
>>>>   hw/ppc/spapr_iommu.c   | 30 ++++++++++++++++++++----------
>>>>   hw/ppc/spapr_pci.c     |  6 ------
>>>>   hw/vfio/common.c       |  9 +++++++++
>>>>   include/exec/memory.h  |  4 ++++
>>>>   include/hw/ppc/spapr.h |  2 +-
>>>>   5 files changed, 34 insertions(+), 17 deletions(-)
>>>>
>>>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>>>> index c945dba..ea09414 100644
>>>> --- a/hw/ppc/spapr_iommu.c
>>>> +++ b/hw/ppc/spapr_iommu.c
>>>> @@ -155,6 +155,16 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
>>>>       return 1ULL << tcet->page_shift;
>>>>   }
>>>>
>>>> +static void spapr_tce_vfio_start(MemoryRegion *iommu)
>>>> +{
>>>> +    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), true);
>>>> +}
>>>> +
>>>> +static void spapr_tce_vfio_stop(MemoryRegion *iommu)
>>>> +{
>>>> +    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), false);
>>>> +}
>>>> +
>>>>   static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
>>>>   static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
>>>>
>>>> @@ -239,6 +249,8 @@ static const VMStateDescription vmstate_spapr_tce_table = {
>>>>   static MemoryRegionIOMMUOps spapr_iommu_ops = {
>>>>       .translate = spapr_tce_translate_iommu,
>>>>       .get_page_sizes = spapr_tce_get_page_sizes,
>>>> +    .vfio_start = spapr_tce_vfio_start,
>>>> +    .vfio_stop = spapr_tce_vfio_stop,
>>>
>>> Ok, so AFAICT these callbacks are called whenever a VFIO context is
>>> added / removed from the gIOMMU's address space, and it's up to the
>>> gIOMMU code to ref count that to see if there are any current vfio
>>> users.  That makes "vfio_start" and "vfio_stop" not great names.
>>>
>>> But.. better than changing the names would be to move the refcounting
>>> to the generic code if you can manage it, so the individual gIOMMU
>>> backends don't need to - they are just told when they need to start / stop
>>> providing VFIO support.
>>
>> Everything is manageable...
>>
>> This referencing is needed for the case of >=2 containers so
>> 2xvfio_listener_region_add will create 2xVFIOGuestIOMMU as they are per
>> VFIOContainer so VFIOGuestIOMMU is not the right place for the reference
>> counting, VFIOAddressSpace seems to be that place (=> add list of IOMMU MRs
>> with refcounter). Or even IOMMU MR. Or move VFIOGuestIOMMU list from
>> VFIOContainer to VFIOAddressSpace and then gIOMMU can handle
>> refcounting?
>
> I'm having a lot of trouble parsing that.  I think the ref counting has
> to be per-giommu (because individual giommus could, in theory, be
> mapped or unmapped from an address space).


Example 1.
POWER8, no DDW, one QEMU PHB, 2 IOMMU groups, table sharing so just 1 
container, one TCE table (aka gIOMMU), one TCE table in KVM, no reference 
counting needed at all, simple.

Example 2.
POWER7, no DDW, one QEMU PHB, 2 IOMMU groups, no table sharing so there are 
2 containers but still one IOMMU MR which is added to each container so 
there are 2 gIOMMU objects. And there is still one TCE table in KVM (which 
is a guest view). Where do I put the reference counter which will count 
that there are 2 gIOMMUs per KVM TCE table in this example?


> But I think that should be
> in the vfio core, rather than being necessary in every giommu
> implementation.


I agree, I am just asking where exactly to put this counter.


-- 
Alexey

* Re: [Qemu-devel] [PATCH qemu v15 14/17] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO
  2016-04-21  4:22         ` Alexey Kardashevskiy
@ 2016-04-26  2:28           ` Alexey Kardashevskiy
  2016-04-27  6:39           ` David Gibson
  1 sibling, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-04-26  2:28 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-devel, qemu-ppc, Alex Williamson, Alexander Graf

On 04/21/2016 02:22 PM, Alexey Kardashevskiy wrote:
> On 04/21/2016 01:59 PM, David Gibson wrote:
>> On Wed, Apr 20, 2016 at 07:15:15PM +1000, Alexey Kardashevskiy wrote:
>>> On 04/07/2016 10:40 AM, David Gibson wrote:
>>>> On Mon, Apr 04, 2016 at 07:33:43PM +1000, Alexey Kardashevskiy wrote:
>>>>> The sPAPR TCE tables manage 2 copies when VFIO is using an IOMMU -
>>>>> a guest view of the table and a hardware TCE table. If there is no VFIO
>>>>> presence in the address space, then just the guest view is used, if
>>>>> this is the case, it is allocated in the KVM. However since there is no
>>>>> support yet for VFIO in KVM TCE hypercalls, when we start using VFIO,
>>>>> we need to move the guest view from KVM to the userspace; and we need
>>>>> to do this for every IOMMU on a bus with VFIO devices.
>>>>>
>>>>> This adds vfio_start/vfio_stop callbacks in MemoryRegionIOMMUOps to
>>>>> notify IOMMU about changing environment so it can reallocate the table
>>>>> to/from KVM or (when available) hook the IOMMU groups with the logical
>>>>> bus (LIOBN) in the KVM.
>>>>>
>>>>> This removes explicit spapr_tce_set_need_vfio() call from PCI hotplug
>>>>> path as the new callbacks do this better - they notify IOMMU at
>>>>> the exact moment when the configuration is changed, and this also
>>>>> includes the case of PCI hot unplug.
>>>>>
>>>>> As there can be multiple containers attached to the same PHB/LIOBN,
>>>>> this replaces the @need_vfio flag in sPAPRTCETable with the counter
>>>>> of VFIO users.
>>>>>
>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>
>>>> This looks correct, but there's one remaining ugly.
>>>>
>>>>> ---
>>>>> Changes:
>>>>> v15:
>>>>> * s/need_vfio/vfio-Users/g
>>>>> ---
>>>>>   hw/ppc/spapr_iommu.c   | 30 ++++++++++++++++++++----------
>>>>>   hw/ppc/spapr_pci.c     |  6 ------
>>>>>   hw/vfio/common.c       |  9 +++++++++
>>>>>   include/exec/memory.h  |  4 ++++
>>>>>   include/hw/ppc/spapr.h |  2 +-
>>>>>   5 files changed, 34 insertions(+), 17 deletions(-)
>>>>>
>>>>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>>>>> index c945dba..ea09414 100644
>>>>> --- a/hw/ppc/spapr_iommu.c
>>>>> +++ b/hw/ppc/spapr_iommu.c
>>>>> @@ -155,6 +155,16 @@ static uint64_t
>>>>> spapr_tce_get_page_sizes(MemoryRegion *iommu)
>>>>>       return 1ULL << tcet->page_shift;
>>>>>   }
>>>>>
>>>>> +static void spapr_tce_vfio_start(MemoryRegion *iommu)
>>>>> +{
>>>>> +    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable,
>>>>> iommu), true);
>>>>> +}
>>>>> +
>>>>> +static void spapr_tce_vfio_stop(MemoryRegion *iommu)
>>>>> +{
>>>>> +    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable,
>>>>> iommu), false);
>>>>> +}
>>>>> +
>>>>>   static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
>>>>>   static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
>>>>>
>>>>> @@ -239,6 +249,8 @@ static const VMStateDescription
>>>>> vmstate_spapr_tce_table = {
>>>>>   static MemoryRegionIOMMUOps spapr_iommu_ops = {
>>>>>       .translate = spapr_tce_translate_iommu,
>>>>>       .get_page_sizes = spapr_tce_get_page_sizes,
>>>>> +    .vfio_start = spapr_tce_vfio_start,
>>>>> +    .vfio_stop = spapr_tce_vfio_stop,
>>>>
>>>> Ok, so AFAICT these callbacks are called whenever a VFIO context is
>>>> added / removed from the gIOMMU's address space, and it's up to the
>>>> gIOMMU code to ref count that to see if there are any current vfio
>>>> users.  That makes "vfio_start" and "vfio_stop" not great names.
>>>>
>>>> But.. better than changing the names would be to move the refcounting
>>>> to the generic code if you can manage it, so the individual gIOMMU
>>>> backends don't need to - they just told when they need to start / stop
>>>> providing VFIO support.
>>>
>>> Everything is manageable...
>>>
>>> This referencing is needed for the case of >=2 containers so
>>> 2xvfio_listener_region_add will create 2xVFIOGuestIOMMU as they are per
>>> VFIOContainer so VFIOGuestIOMMU is not the right place for the reference
>>> counting, VFIOAddressSpace seems to be that place (=> add list of IOMMU MRs
>>> with refcounter). Or even IOMMU MR. Or move VFIOGuestIOMMU list from
>>> VFIOContainer to VFIOAddressSpace and then gIOMMU can handle
>>> refcounting?
>>
>> I'm having a lot of trouble parsing that.  I think the ref parsing has
>> to be per-giommu (because individual giommus could, in theory, be
>> mapped or unmapped from an address space).
>
>
> Example 1.
> POWER8, no DDW, one QEMU PHB, 2 IOMMU groups, table sharing so just 1
> container, one TCE table (aka gIOMMU), one TCE table in KVM, no reference
> counting needed at all, simple.
>
> Example 2.
> POWER7, no DDW, one QEMU PHB, 2 IOMMU groups, no table sharing so there are
> 2 containers but still one IOMMU MR which is added to each container so
> there are 2 gIOMMU objects. And there is still one TCE table in KVM (which
> is a guest view). Where do I put the reference counter which will count
> that there are 2 gIOMMUs per KVM TCE table in this example?


Ping?


>
>
>> But I think that should be
>> in the vfio core, rather than being necessary in every giommu
>> implementation.
>
>
> I agree, I am just asking where exactly to put this counter.
>
>


-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v15 14/17] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO
  2016-04-21  4:22         ` Alexey Kardashevskiy
  2016-04-26  2:28           ` Alexey Kardashevskiy
@ 2016-04-27  6:39           ` David Gibson
  2016-04-27  9:14             ` Alexey Kardashevskiy
  1 sibling, 1 reply; 38+ messages in thread
From: David Gibson @ 2016-04-27  6:39 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alex Williamson, Alexander Graf


On Thu, Apr 21, 2016 at 02:22:01PM +1000, Alexey Kardashevskiy wrote:
> On 04/21/2016 01:59 PM, David Gibson wrote:
> >On Wed, Apr 20, 2016 at 07:15:15PM +1000, Alexey Kardashevskiy wrote:
> >>On 04/07/2016 10:40 AM, David Gibson wrote:
> >>>On Mon, Apr 04, 2016 at 07:33:43PM +1000, Alexey Kardashevskiy wrote:
> >>>>The sPAPR TCE tables manage 2 copies when VFIO is using an IOMMU -
> >>>>a guest view of the table and a hardware TCE table. If there is no VFIO
> >>>>presense in the address space, then just the guest view is used, if
> >>>>this is the case, it is allocated in the KVM. However since there is no
> >>>>support yet for VFIO in KVM TCE hypercalls, when we start using VFIO,
> >>>>we need to move the guest view from KVM to the userspace; and we need
> >>>>to do this for every IOMMU on a bus with VFIO devices.
> >>>>
> >>>>This adds vfio_start/vfio_stop callbacks in MemoryRegionIOMMUOps to
> >>>>notifiy IOMMU about changing environment so it can reallocate the table
> >>>>to/from KVM or (when available) hook the IOMMU groups with the logical
> >>>>bus (LIOBN) in the KVM.
> >>>>
> >>>>This removes explicit spapr_tce_set_need_vfio() call from PCI hotplug
> >>>>path as the new callbacks do this better - they notify IOMMU at
> >>>>the exact moment when the configuration is changed, and this also
> >>>>includes the case of PCI hot unplug.
> >>>>
> >>>>As there can be multiple containers attached to the same PHB/LIOBN,
> >>>>this replaces the @need_vfio flag in sPAPRTCETable with the counter
> >>>>of VFIO users.
> >>>>
> >>>>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>
> >>>This looks correct, but there's one remaining ugly.
> >>>
> >>>>---
> >>>>Changes:
> >>>>v15:
> >>>>* s/need_vfio/vfio-Users/g
> >>>>---
> >>>>  hw/ppc/spapr_iommu.c   | 30 ++++++++++++++++++++----------
> >>>>  hw/ppc/spapr_pci.c     |  6 ------
> >>>>  hw/vfio/common.c       |  9 +++++++++
> >>>>  include/exec/memory.h  |  4 ++++
> >>>>  include/hw/ppc/spapr.h |  2 +-
> >>>>  5 files changed, 34 insertions(+), 17 deletions(-)
> >>>>
> >>>>diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> >>>>index c945dba..ea09414 100644
> >>>>--- a/hw/ppc/spapr_iommu.c
> >>>>+++ b/hw/ppc/spapr_iommu.c
> >>>>@@ -155,6 +155,16 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
> >>>>      return 1ULL << tcet->page_shift;
> >>>>  }
> >>>>
> >>>>+static void spapr_tce_vfio_start(MemoryRegion *iommu)
> >>>>+{
> >>>>+    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), true);
> >>>>+}
> >>>>+
> >>>>+static void spapr_tce_vfio_stop(MemoryRegion *iommu)
> >>>>+{
> >>>>+    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), false);
> >>>>+}
> >>>>+
> >>>>  static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
> >>>>  static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
> >>>>
> >>>>@@ -239,6 +249,8 @@ static const VMStateDescription vmstate_spapr_tce_table = {
> >>>>  static MemoryRegionIOMMUOps spapr_iommu_ops = {
> >>>>      .translate = spapr_tce_translate_iommu,
> >>>>      .get_page_sizes = spapr_tce_get_page_sizes,
> >>>>+    .vfio_start = spapr_tce_vfio_start,
> >>>>+    .vfio_stop = spapr_tce_vfio_stop,
> >>>
> >>>Ok, so AFAICT these callbacks are called whenever a VFIO context is
> >>>added / removed from the gIOMMU's address space, and it's up to the
> >>>gIOMMU code to ref count that to see if there are any current vfio
> >>>users.  That makes "vfio_start" and "vfio_stop" not great names.
> >>>
> >>>But.. better than changing the names would be to move the refcounting
> >>>to the generic code if you can manage it, so the individual gIOMMU
> >>>backends don't need to - they just told when they need to start / stop
> >>>providing VFIO support.
> >>
> >>Everything is manageable...
> >>
> >>This referencing is needed for the case of >=2 containers so
> >>2xvfio_listener_region_add will create 2xVFIOGuestIOMMU as they are per
> >>VFIOContainer so VFIOGuestIOMMU is not the right place for the reference
> >>counting, VFIOAddressSpace seems to be that place (=> add list of IOMMU MRs
> >>with refcounter). Or even IOMMU MR. Or move VFIOGuestIOMMU list from
> >>VFIOContainer to VFIOAddressSpace and then gIOMMU can handle
> >>refcounting?
> >
> >I'm having a lot of trouble parsing that.  I think the ref parsing has
> >to be per-giommu (because individual giommus could, in theory, be
> >mapped or unmapped from an address space).
> 
> 
> Example 1.
> POWER8, no DDW, one QEMU PHB, 2 IOMMU groups, table sharing so just 1
> container, one TCE table (aka gIOMMU), one TCE table in KVM, no reference
> counting needed at all, simple.
> 
> Example 2.
> POWER7, no DDW, one QEMU PHB, 2 IOMMU groups, no table sharing so there are
> 2 containers but still one IOMMU MR which is added to each container so
> there are 2 gIOMMU objects. And there is still one TCE table in KVM (which
> is a guest view). Where do I put the reference counter which will count that
> there are 2 gIOMMUs per KVM TCE table in this example?

Ah.. I'd forgotten that the gIOMMU object is per guest IOMMU window
*and* per container, not just per guest IOMMU window.

Ultimately it's the code implementing the guest side IOMMU which needs
to know if it is supporting VFIO or not, so in generic terms that
means per IOMMU-type MemoryRegion.

Essentially you need to count the number of VFIOGuestIOMMU objects
associated with each (gIOMMU) MemoryRegion, and notify the
MemoryRegion if that changes from zero to non-zero or vice versa.

I'd prefer if we can maintain that count from just the VFIO code and
just notify the gIOMMU code on zero / non-zero changes.  But I guess
we'd need approval from Paolo to add that count to the MemoryRegion.

The fallback would be similar to what you have - instead the
MemoryRegion gets notified whenever a VFIOGuestIOMMU is attached or
removed, and the MR (i.e. the guest side IOMMU code) has to maintain
the count itself.
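
The zero / non-zero notification scheme described above can be sketched
roughly as follows. This is only an illustration under stated assumptions:
SketchIOMMURegion and the sketch_* functions are hypothetical stand-ins, not
QEMU's MemoryRegion or any existing VFIO API, and where the counter really
lives is exactly the open question in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a guest IOMMU MemoryRegion; not QEMU's type. */
typedef struct SketchIOMMURegion {
    unsigned vfio_users;     /* attached VFIOGuestIOMMU-like objects */
    bool vfio_enabled;       /* what the backend was last told */
    unsigned notifications;  /* how many times the backend was notified */
} SketchIOMMURegion;

/* Hypothetical backend hook: called only on 0 <-> non-zero transitions. */
static void sketch_backend_notify(SketchIOMMURegion *mr, bool enable)
{
    mr->vfio_enabled = enable;
    mr->notifications++;
}

/* Generic code: one more VFIO user of this region. */
static void sketch_vfio_attach(SketchIOMMURegion *mr)
{
    if (mr->vfio_users++ == 0) {
        sketch_backend_notify(mr, true);   /* first user: start VFIO support */
    }
}

/* Generic code: one VFIO user of this region went away. */
static void sketch_vfio_detach(SketchIOMMURegion *mr)
{
    assert(mr->vfio_users > 0);
    if (--mr->vfio_users == 0) {
        sketch_backend_notify(mr, false);  /* last user: stop VFIO support */
    }
}
```

With two containers sharing one IOMMU MR (Example 2 above), the backend
would see a single "start" on the first attach and a single "stop" on the
last detach, and would not need a counter of its own.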

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH qemu v15 14/17] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO
  2016-04-27  6:39           ` David Gibson
@ 2016-04-27  9:14             ` Alexey Kardashevskiy
  2016-04-28  1:02               ` David Gibson
  0 siblings, 1 reply; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-04-27  9:14 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-devel, qemu-ppc, Alex Williamson, Alexander Graf, Paolo Bonzini

On 04/27/2016 04:39 PM, David Gibson wrote:
> On Thu, Apr 21, 2016 at 02:22:01PM +1000, Alexey Kardashevskiy wrote:
>> On 04/21/2016 01:59 PM, David Gibson wrote:
>>> On Wed, Apr 20, 2016 at 07:15:15PM +1000, Alexey Kardashevskiy wrote:
>>>> On 04/07/2016 10:40 AM, David Gibson wrote:
>>>>> On Mon, Apr 04, 2016 at 07:33:43PM +1000, Alexey Kardashevskiy wrote:
>>>>>> The sPAPR TCE tables manage 2 copies when VFIO is using an IOMMU -
>>>>>> a guest view of the table and a hardware TCE table. If there is no VFIO
>>>>>> presense in the address space, then just the guest view is used, if
>>>>>> this is the case, it is allocated in the KVM. However since there is no
>>>>>> support yet for VFIO in KVM TCE hypercalls, when we start using VFIO,
>>>>>> we need to move the guest view from KVM to the userspace; and we need
>>>>>> to do this for every IOMMU on a bus with VFIO devices.
>>>>>>
>>>>>> This adds vfio_start/vfio_stop callbacks in MemoryRegionIOMMUOps to
>>>>>> notifiy IOMMU about changing environment so it can reallocate the table
>>>>>> to/from KVM or (when available) hook the IOMMU groups with the logical
>>>>>> bus (LIOBN) in the KVM.
>>>>>>
>>>>>> This removes explicit spapr_tce_set_need_vfio() call from PCI hotplug
>>>>>> path as the new callbacks do this better - they notify IOMMU at
>>>>>> the exact moment when the configuration is changed, and this also
>>>>>> includes the case of PCI hot unplug.
>>>>>>
>>>>>> As there can be multiple containers attached to the same PHB/LIOBN,
>>>>>> this replaces the @need_vfio flag in sPAPRTCETable with the counter
>>>>>> of VFIO users.
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>
>>>>> This looks correct, but there's one remaining ugly.
>>>>>
>>>>>> ---
>>>>>> Changes:
>>>>>> v15:
>>>>>> * s/need_vfio/vfio-Users/g
>>>>>> ---
>>>>>>  hw/ppc/spapr_iommu.c   | 30 ++++++++++++++++++++----------
>>>>>>  hw/ppc/spapr_pci.c     |  6 ------
>>>>>>  hw/vfio/common.c       |  9 +++++++++
>>>>>>  include/exec/memory.h  |  4 ++++
>>>>>>  include/hw/ppc/spapr.h |  2 +-
>>>>>>  5 files changed, 34 insertions(+), 17 deletions(-)
>>>>>>
>>>>>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>>>>>> index c945dba..ea09414 100644
>>>>>> --- a/hw/ppc/spapr_iommu.c
>>>>>> +++ b/hw/ppc/spapr_iommu.c
>>>>>> @@ -155,6 +155,16 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
>>>>>>      return 1ULL << tcet->page_shift;
>>>>>>  }
>>>>>>
>>>>>> +static void spapr_tce_vfio_start(MemoryRegion *iommu)
>>>>>> +{
>>>>>> +    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), true);
>>>>>> +}
>>>>>> +
>>>>>> +static void spapr_tce_vfio_stop(MemoryRegion *iommu)
>>>>>> +{
>>>>>> +    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), false);
>>>>>> +}
>>>>>> +
>>>>>>  static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
>>>>>>  static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
>>>>>>
>>>>>> @@ -239,6 +249,8 @@ static const VMStateDescription vmstate_spapr_tce_table = {
>>>>>>  static MemoryRegionIOMMUOps spapr_iommu_ops = {
>>>>>>      .translate = spapr_tce_translate_iommu,
>>>>>>      .get_page_sizes = spapr_tce_get_page_sizes,
>>>>>> +    .vfio_start = spapr_tce_vfio_start,
>>>>>> +    .vfio_stop = spapr_tce_vfio_stop,
>>>>>
>>>>> Ok, so AFAICT these callbacks are called whenever a VFIO context is
>>>>> added / removed from the gIOMMU's address space, and it's up to the
>>>>> gIOMMU code to ref count that to see if there are any current vfio
>>>>> users.  That makes "vfio_start" and "vfio_stop" not great names.
>>>>>
>>>>> But.. better than changing the names would be to move the refcounting
>>>>> to the generic code if you can manage it, so the individual gIOMMU
>>>>> backends don't need to - they just told when they need to start / stop
>>>>> providing VFIO support.
>>>>
>>>> Everything is manageable...
>>>>
>>>> This referencing is needed for the case of >=2 containers so
>>>> 2xvfio_listener_region_add will create 2xVFIOGuestIOMMU as they are per
>>>> VFIOContainer so VFIOGuestIOMMU is not the right place for the reference
>>>> counting, VFIOAddressSpace seems to be that place (=> add list of IOMMU MRs
>>>> with refcounter). Or even IOMMU MR. Or move VFIOGuestIOMMU list from
>>>> VFIOContainer to VFIOAddressSpace and then gIOMMU can handle
>>>> refcounting?
>>>
>>> I'm having a lot of trouble parsing that.  I think the ref parsing has
>>> to be per-giommu (because individual giommus could, in theory, be
>>> mapped or unmapped from an address space).
>>
>>
>> Example 1.
>> POWER8, no DDW, one QEMU PHB, 2 IOMMU groups, table sharing so just 1
>> container, one TCE table (aka gIOMMU), one TCE table in KVM, no reference
>> counting needed at all, simple.
>>
>> Example 2.
>> POWER7, no DDW, one QEMU PHB, 2 IOMMU groups, no table sharing so there are
>> 2 containers but still one IOMMU MR which is added to each container so
>> there are 2 gIOMMU objects. And there is still one TCE table in KVM (which
>> is a guest view). Where do I put the reference counter which will count that
>> there are 2 gIOMMUs per KVM TCE table in this example?
>
> Ah.. I'd forgotten that the gIOMMU object is per guest IOMMU window
> *and* per container, not just per guest IOMMU window.
>
> Ultimately it's the code implementing the guest side IOMMU which needs
> to know if it is supporting VFIO or not, so in generic terms that
> means per IOMMU-type MemoryRegion.
>
> Essentially you need to count the number of VFIOGuestIOMMU objects
> associated with each (gIOMMU) MemoryRegion, and notify the
> MemoryRegion if that changes from zero to non-zero or vice versa.
>
> I'd prefer if we can maintain that count from just the VFIO code and
> just notify the gIOMMU code on zero / non-zero changes.  But I guess
> we'd need approval from Paolo to add that count to the MemoryRegion.


Why the MR? I could wrap the MR in a "VFIOIOMMUMR", add a counter, and keep
a list of these VFIOIOMMUMRs in VFIOAddressSpace.

I am adding Paolo, just in case :)
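
A rough sketch of the wrapper idea, assuming a plain singly linked list
hanging off the address space. All names here (SketchMR, SketchVFIOIOMMUMR,
SketchAddressSpace, sketch_as_get/put) are hypothetical and only illustrate
the shape of the proposal; they are not existing QEMU structures or
functions.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an IOMMU MemoryRegion. */
typedef struct SketchMR {
    bool vfio_enabled;
} SketchMR;

/* Wrapper carrying the per-MR user counter, kept outside the MR itself. */
typedef struct SketchVFIOIOMMUMR {
    SketchMR *mr;
    unsigned users;
    struct SketchVFIOIOMMUMR *next;
} SketchVFIOIOMMUMR;

/* Hypothetical stand-in for VFIOAddressSpace, owning the wrapper list. */
typedef struct SketchAddressSpace {
    SketchVFIOIOMMUMR *iommu_mrs;
} SketchAddressSpace;

/* A container starts using @mr: reuse the wrapper or create one. */
static void sketch_as_get(SketchAddressSpace *as, SketchMR *mr)
{
    SketchVFIOIOMMUMR *w;

    for (w = as->iommu_mrs; w; w = w->next) {
        if (w->mr == mr) {
            w->users++;
            return;
        }
    }
    w = calloc(1, sizeof(*w));
    if (!w) {
        abort();
    }
    w->mr = mr;
    w->users = 1;
    w->next = as->iommu_mrs;
    as->iommu_mrs = w;
    mr->vfio_enabled = true;        /* first user: notify the gIOMMU code */
}

/* A container stops using @mr: drop the wrapper with the last user. */
static void sketch_as_put(SketchAddressSpace *as, SketchMR *mr)
{
    SketchVFIOIOMMUMR **pw;

    for (pw = &as->iommu_mrs; *pw; pw = &(*pw)->next) {
        if ((*pw)->mr == mr) {
            SketchVFIOIOMMUMR *w = *pw;

            if (--w->users == 0) {
                *pw = w->next;
                free(w);
                mr->vfio_enabled = false;  /* last user: notify again */
            }
            return;
        }
    }
    abort();                        /* put without a matching get */
}
```

The cost of this layout is one extra object per IOMMU MR whose lifetime
tracks the VFIO usage of that MR; the benefit is that MemoryRegion itself
stays untouched.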


> The fallback would be similar to what you have - instead the
> MemoryRegion gets notified whenever a VFIOGuestIOMMU is attached or
> removed, and the MR (i.e. the guest side IOMMU code) has to maintain
> the count itself.






-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v15 14/17] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO
  2016-04-27  9:14             ` Alexey Kardashevskiy
@ 2016-04-28  1:02               ` David Gibson
  0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2016-04-28  1:02 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alex Williamson, Alexander Graf, Paolo Bonzini


On Wed, Apr 27, 2016 at 07:14:15PM +1000, Alexey Kardashevskiy wrote:
> On 04/27/2016 04:39 PM, David Gibson wrote:
> >On Thu, Apr 21, 2016 at 02:22:01PM +1000, Alexey Kardashevskiy wrote:
> >>On 04/21/2016 01:59 PM, David Gibson wrote:
> >>>On Wed, Apr 20, 2016 at 07:15:15PM +1000, Alexey Kardashevskiy wrote:
> >>>>On 04/07/2016 10:40 AM, David Gibson wrote:
> >>>>>On Mon, Apr 04, 2016 at 07:33:43PM +1000, Alexey Kardashevskiy wrote:
> >>>>>>The sPAPR TCE tables manage 2 copies when VFIO is using an IOMMU -
> >>>>>>a guest view of the table and a hardware TCE table. If there is no VFIO
> >>>>>>presense in the address space, then just the guest view is used, if
> >>>>>>this is the case, it is allocated in the KVM. However since there is no
> >>>>>>support yet for VFIO in KVM TCE hypercalls, when we start using VFIO,
> >>>>>>we need to move the guest view from KVM to the userspace; and we need
> >>>>>>to do this for every IOMMU on a bus with VFIO devices.
> >>>>>>
> >>>>>>This adds vfio_start/vfio_stop callbacks in MemoryRegionIOMMUOps to
> >>>>>>notifiy IOMMU about changing environment so it can reallocate the table
> >>>>>>to/from KVM or (when available) hook the IOMMU groups with the logical
> >>>>>>bus (LIOBN) in the KVM.
> >>>>>>
> >>>>>>This removes explicit spapr_tce_set_need_vfio() call from PCI hotplug
> >>>>>>path as the new callbacks do this better - they notify IOMMU at
> >>>>>>the exact moment when the configuration is changed, and this also
> >>>>>>includes the case of PCI hot unplug.
> >>>>>>
> >>>>>>As there can be multiple containers attached to the same PHB/LIOBN,
> >>>>>>this replaces the @need_vfio flag in sPAPRTCETable with the counter
> >>>>>>of VFIO users.
> >>>>>>
> >>>>>>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>>
> >>>>>This looks correct, but there's one remaining ugly.
> >>>>>
> >>>>>>---
> >>>>>>Changes:
> >>>>>>v15:
> >>>>>>* s/need_vfio/vfio-Users/g
> >>>>>>---
> >>>>>> hw/ppc/spapr_iommu.c   | 30 ++++++++++++++++++++----------
> >>>>>> hw/ppc/spapr_pci.c     |  6 ------
> >>>>>> hw/vfio/common.c       |  9 +++++++++
> >>>>>> include/exec/memory.h  |  4 ++++
> >>>>>> include/hw/ppc/spapr.h |  2 +-
> >>>>>> 5 files changed, 34 insertions(+), 17 deletions(-)
> >>>>>>
> >>>>>>diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> >>>>>>index c945dba..ea09414 100644
> >>>>>>--- a/hw/ppc/spapr_iommu.c
> >>>>>>+++ b/hw/ppc/spapr_iommu.c
> >>>>>>@@ -155,6 +155,16 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
> >>>>>>     return 1ULL << tcet->page_shift;
> >>>>>> }
> >>>>>>
> >>>>>>+static void spapr_tce_vfio_start(MemoryRegion *iommu)
> >>>>>>+{
> >>>>>>+    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), true);
> >>>>>>+}
> >>>>>>+
> >>>>>>+static void spapr_tce_vfio_stop(MemoryRegion *iommu)
> >>>>>>+{
> >>>>>>+    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), false);
> >>>>>>+}
> >>>>>>+
> >>>>>> static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
> >>>>>> static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
> >>>>>>
> >>>>>>@@ -239,6 +249,8 @@ static const VMStateDescription vmstate_spapr_tce_table = {
> >>>>>> static MemoryRegionIOMMUOps spapr_iommu_ops = {
> >>>>>>     .translate = spapr_tce_translate_iommu,
> >>>>>>     .get_page_sizes = spapr_tce_get_page_sizes,
> >>>>>>+    .vfio_start = spapr_tce_vfio_start,
> >>>>>>+    .vfio_stop = spapr_tce_vfio_stop,
> >>>>>
> >>>>>Ok, so AFAICT these callbacks are called whenever a VFIO context is
> >>>>>added / removed from the gIOMMU's address space, and it's up to the
> >>>>>gIOMMU code to ref count that to see if there are any current vfio
> >>>>>users.  That makes "vfio_start" and "vfio_stop" not great names.
> >>>>>
> >>>>>But.. better than changing the names would be to move the refcounting
> >>>>>to the generic code if you can manage it, so the individual gIOMMU
> >>>>>backends don't need to - they just told when they need to start / stop
> >>>>>providing VFIO support.
> >>>>
> >>>>Everything is manageable...
> >>>>
> >>>>This referencing is needed for the case of >=2 containers so
> >>>>2xvfio_listener_region_add will create 2xVFIOGuestIOMMU as they are per
> >>>>VFIOContainer so VFIOGuestIOMMU is not the right place for the reference
> >>>>counting, VFIOAddressSpace seems to be that place (=> add list of IOMMU MRs
> >>>>with refcounter). Or even IOMMU MR. Or move VFIOGuestIOMMU list from
> >>>>VFIOContainer to VFIOAddressSpace and then gIOMMU can handle
> >>>>refcounting?
> >>>
> >>>I'm having a lot of trouble parsing that.  I think the ref parsing has
> >>>to be per-giommu (because individual giommus could, in theory, be
> >>>mapped or unmapped from an address space).
> >>
> >>
> >>Example 1.
> >>POWER8, no DDW, one QEMU PHB, 2 IOMMU groups, table sharing so just 1
> >>container, one TCE table (aka gIOMMU), one TCE table in KVM, no reference
> >>counting needed at all, simple.
> >>
> >>Example 2.
> >>POWER7, no DDW, one QEMU PHB, 2 IOMMU groups, no table sharing so there are
> >>2 containers but still one IOMMU MR which is added to each container so
> >>there are 2 gIOMMU objects. And there is still one TCE table in KVM (which
> >>is a guest view). Where do I put the reference counter which will count that
> >>there are 2 gIOMMUs per KVM TCE table in this example?
> >
> >Ah.. I'd forgotten that the gIOMMU object is per guest IOMMU window
> >*and* per container, not just per guest IOMMU window.
> >
> >Ultimately it's the code implementing the guest side IOMMU which needs
> >to know if it is supporting VFIO or not, so in generic terms that
> >means per IOMMU-type MemoryRegion.
> >
> >Essentially you need to count the number of VFIOGuestIOMMU objects
> >associated with each (gIOMMU) MemoryRegion, and notify the
> >MemoryRegion if that changes from zero to non-zero or vice versa.
> >
> >I'd prefer if we can maintain that count from just the VFIO code and
> >just notify the gIOMMU code on zero / non-zero changes.  But I guess
> >we'd need approval from Paolo to add that count to the MemoryRegion.
> 
> 
> Why MR? I could wrap MR to "VFIOIOMMUMR", add a counter and keep a list of
> these VFIOIOMMUMRs in VFIOAddressSpace.

Ah, yes I guess we could.  It's just kinda ugly to have to keep
another object with the same lifetime around for one extra counter.

> I am adding Paolo, just for the case :)
> 
> 
> >The fallback would be similar to what you have - instead the
> >MemoryRegion gets notified whenever a VFIOGuestIOMMU is attached or
> >removed, and the MR (i.e. the guest side IOMMU code) has to maintain
> >the count itself.
> 
> 
> 
> 
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



end of thread, other threads:[~2016-04-28  1:24 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
2016-04-04  9:33 [Qemu-devel] [PATCH qemu v15 00/17] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 01/17] memory: Fix IOMMU replay base address Alexey Kardashevskiy
2016-04-05  1:34   ` David Gibson
2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 02/17] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 03/17] vfio: Check that IOMMU MR translates to system address space Alexey Kardashevskiy
2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 04/17] spapr_iommu: Move table allocation to helpers Alexey Kardashevskiy
2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 05/17] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 06/17] spapr_iommu: Finish renaming vfio_accel to need_vfio Alexey Kardashevskiy
2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 07/17] spapr_iommu: Migrate full state Alexey Kardashevskiy
2016-04-05  5:58   ` David Gibson
2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 08/17] spapr_iommu: Add root memory region Alexey Kardashevskiy
2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 09/17] spapr_pci: Reset DMA config on PHB reset Alexey Kardashevskiy
2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 10/17] memory: Add reporting of supported page sizes Alexey Kardashevskiy
2016-04-06  5:52   ` David Gibson
2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 11/17] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
2016-04-06  6:05   ` David Gibson
2016-04-20  8:51     ` Alexey Kardashevskiy
2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 12/17] spapr_pci: Add and export DMA resetting helper Alexey Kardashevskiy
2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 13/17] vfio: Add host side DMA window capabilities Alexey Kardashevskiy
2016-04-06  7:10   ` David Gibson
2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 14/17] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO Alexey Kardashevskiy
2016-04-07  0:40   ` David Gibson
2016-04-20  9:15     ` Alexey Kardashevskiy
2016-04-21  3:59       ` David Gibson
2016-04-21  4:22         ` Alexey Kardashevskiy
2016-04-26  2:28           ` Alexey Kardashevskiy
2016-04-27  6:39           ` David Gibson
2016-04-27  9:14             ` Alexey Kardashevskiy
2016-04-28  1:02               ` David Gibson
2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 15/17] spapr_pci: Get rid of dma_loibn Alexey Kardashevskiy
2016-04-07  0:50   ` David Gibson
2016-04-07  7:10     ` Alexey Kardashevskiy
2016-04-08  1:34       ` David Gibson
2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 16/17] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU Alexey Kardashevskiy
2016-04-07  1:10   ` David Gibson
2016-04-20  9:43     ` Alexey Kardashevskiy
2016-04-21  4:03       ` David Gibson
2016-04-04  9:33 ` [Qemu-devel] [PATCH qemu v15 17/17] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
