* [Qemu-devel] [PATCH qemu v13 00/16] spapr: vfio: Enable Dynamic DMA windows (DDW)
@ 2016-03-01  9:10 Alexey Kardashevskiy
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 01/16] memory: Fix IOMMU replay base address Alexey Kardashevskiy
                   ` (15 more replies)
  0 siblings, 16 replies; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-01  9:10 UTC
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc


Each Partitionable Endpoint (IOMMU group) has an address range on a PCI bus
where devices are allowed to do DMA. These ranges are called DMA windows.
By default, there is a single DMA window, 1 or 2GB in size, mapped at
address zero on a PCI bus.

PAPR defines a DDW RTAS API which allows pseries guests
to query the hypervisor about DDW support and capabilities (page size mask
for now). A pseries guest may request additional DMA windows (on top of
the default one) using this RTAS API.
The existing pseries Linux guests request an additional window as big as
the guest RAM and map the entire window, which effectively creates
a direct mapping of the guest memory to a PCI bus.
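
As a rough illustration of the sizes involved (a sketch, not code from this
series): one TCE maps one IOMMU page, so a window needs
window_size >> page_shift entries, which is the same arithmetic
spapr_phb_dma_window_enable() uses in patch 02. The helper names
ddw_nb_tces() and ddw_table_bytes() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: number of TCEs needed to
 * cover a DMA window of window_size bytes with IOMMU pages of
 * (1 << page_shift) bytes. One TCE maps exactly one IOMMU page.
 */
static uint64_t ddw_nb_tces(uint64_t window_size, uint32_t page_shift)
{
    assert(page_shift < 64);    /* shifting by 64 or more is undefined */
    return window_size >> page_shift;
}

/* Size in bytes of a table of 64-bit TCEs covering such a window. */
static uint64_t ddw_table_bytes(uint64_t window_size, uint32_t page_shift)
{
    return ddw_nb_tces(window_size, page_shift) * sizeof(uint64_t);
}
```

With 4K IOMMU pages a 1GB window needs 262144 TCEs (a 2MB table); a
hypothetical 64GB window with 64K pages needs 1048576 TCEs (8MB).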

This patchset reworks PPC64 IOMMU code and adds necessary structures
to support big windows on pseries.

This patchset is based on git://github.com/dgibson/qemu.git ,
tag ppc-for-2.6-20160229 plus the recently posted
"Allow EEH on spapr-pci-host-bridge devices" series.

The series was completely reworked so there is no changelog.


Please comment. Thanks!


Alexey Kardashevskiy (16):
  memory: Fix IOMMU replay base address
  spapr_pci: Move DMA window enablement to a helper
  spapr_iommu: Move table allocation to helpers
  spapr_iommu: Introduce "enabled" state for TCE table
  spapr_iommu: Add root memory region
  spapr_pci: Reset DMA config on PHB reset
  vfio, memory: Notify IOMMU about starting/stopping being used by VFIO
  memory: Add reporting of supported page sizes
  vfio: Generalize IOMMU memory listener
  vfio: Use different page size for different IOMMU types
  vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  vmstate: Define VARRAY with VMS_ALLOC
  spapr_iommu: Remove need_vfio flag from sPAPRTCETable
  spapr_pci: Add and export DMA resetting helper
  vfio: Move iova_pgsizes from container to guest IOMMU
  spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)

 hw/ppc/Makefile.objs          |   1 +
 hw/ppc/spapr.c                |   7 +-
 hw/ppc/spapr_iommu.c          | 187 ++++++++++++++++++++------
 hw/ppc/spapr_pci.c            | 119 +++++++++++++---
 hw/ppc/spapr_rtas_ddw.c       | 306 ++++++++++++++++++++++++++++++++++++++++++
 hw/ppc/spapr_vio.c            |   9 +-
 hw/vfio/Makefile.objs         |   1 +
 hw/vfio/common.c              | 213 ++++++++++++++++++++++-------
 hw/vfio/prereg.c              | 138 +++++++++++++++++++
 include/exec/memory.h         |  13 ++
 include/hw/pci-host/spapr.h   |  15 +++
 include/hw/ppc/spapr.h        |  35 +++--
 include/hw/vfio/vfio-common.h |  16 ++-
 include/migration/vmstate.h   |  10 ++
 memory.c                      |   9 ++
 trace-events                  |  10 +-
 16 files changed, 960 insertions(+), 129 deletions(-)
 create mode 100644 hw/ppc/spapr_rtas_ddw.c
 create mode 100644 hw/vfio/prereg.c

-- 
2.5.0.rc3

* [Qemu-devel] [PATCH qemu v13 01/16] memory: Fix IOMMU replay base address
  2016-03-01  9:10 [Qemu-devel] [PATCH qemu v13 00/16] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
@ 2016-03-01  9:10 ` Alexey Kardashevskiy
  2016-03-03  1:34   ` [Qemu-devel] [Qemu-ppc] " David Gibson
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 02/16] spapr_pci: Move DMA window enablement to a helper Alexey Kardashevskiy
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-01  9:10 UTC
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc

Since a788f227 "memory: Allow replay of IOMMU mapping notifications",
all existing IOMMU mappings are replayed when a new VFIO listener is
added. However, the base address of an IOMMU memory region (IOMMU MR)
is ignored. This is not a problem for the existing user (pseries), whose
default 32bit DMA window starts at 0, but it becomes one as soon as
there is another DMA window.

This stores the IOMMU's offset_within_address_space and adjusts
the IOVA before calling vfio_dma_map/vfio_dma_unmap.

As the IOMMU notifier expects an IOVA offset rather than the absolute
address, this also adjusts the IOVA in the sPAPR H_PUT_TCE handler before
calling the notifier(s).
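
The two adjustments are inverses of each other. A minimal standalone sketch
of the arithmetic, assuming simplified stand-in types (hwaddr_t,
struct dma_window and both function names are hypothetical, not QEMU API):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t hwaddr_t;      /* stand-in for QEMU's hwaddr */

/* Hypothetical, simplified description of one DMA window. */
struct dma_window {
    hwaddr_t bus_offset;        /* where the window starts on the PCI bus */
    uint32_t page_shift;        /* log2 of the IOMMU page size */
};

/*
 * Guest side (H_PUT_TCE): turn the absolute bus address (ioba) into a
 * page-aligned offset inside the window, which is what the notifier gets.
 */
static hwaddr_t window_relative_iova(const struct dma_window *w,
                                     hwaddr_t ioba)
{
    hwaddr_t page_mask = ~((1ULL << w->page_shift) - 1);

    assert(ioba >= w->bus_offset);
    return (ioba - w->bus_offset) & page_mask;
}

/*
 * VFIO side: add the offset of the IOMMU MR within the address space
 * back to obtain the absolute IOVA to map or unmap.
 */
static hwaddr_t absolute_iova(hwaddr_t offset_within_address_space,
                              hwaddr_t relative_iova)
{
    return offset_within_address_space + relative_iova;
}
```

For a window starting at 0 both functions leave a page-aligned address
unchanged, which is why the default 32bit window never exposed the problem.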

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/spapr_iommu.c          |  2 +-
 hw/vfio/common.c              | 14 ++++++++------
 include/hw/vfio/vfio-common.h |  1 +
 3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 7dd4588..277f289 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -277,7 +277,7 @@ static target_ulong put_tce_emu(sPAPRTCETable *tcet, target_ulong ioba,
     tcet->table[index] = tce;
 
     entry.target_as = &address_space_memory,
-    entry.iova = ioba & page_mask;
+    entry.iova = (ioba - tcet->bus_offset) & page_mask;
     entry.translated_addr = tce & page_mask;
     entry.addr_mask = ~page_mask;
     entry.perm = spapr_tce_iommu_access_flags(tce);
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 55e87d3..9bf4c3b 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -257,14 +257,14 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
     VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
     VFIOContainer *container = giommu->container;
     IOMMUTLBEntry *iotlb = data;
+    hwaddr iova = iotlb->iova + giommu->offset_within_address_space;
     MemoryRegion *mr;
     hwaddr xlat;
     hwaddr len = iotlb->addr_mask + 1;
     void *vaddr;
     int ret;
 
-    trace_vfio_iommu_map_notify(iotlb->iova,
-                                iotlb->iova + iotlb->addr_mask);
+    trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
 
     /*
      * The IOMMU TLB entry we have just covers translation through
@@ -291,21 +291,21 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
 
     if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
         vaddr = memory_region_get_ram_ptr(mr) + xlat;
-        ret = vfio_dma_map(container, iotlb->iova,
+        ret = vfio_dma_map(container, iova,
                            iotlb->addr_mask + 1, vaddr,
                            !(iotlb->perm & IOMMU_WO) || mr->readonly);
         if (ret) {
             error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx", %p) = %d (%m)",
-                         container, iotlb->iova,
+                         container, iova,
                          iotlb->addr_mask + 1, vaddr, ret);
         }
     } else {
-        ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
+        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
         if (ret) {
             error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx") = %d (%m)",
-                         container, iotlb->iova,
+                         container, iova,
                          iotlb->addr_mask + 1, ret);
         }
     }
@@ -377,6 +377,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
          */
         giommu = g_malloc0(sizeof(*giommu));
         giommu->iommu = section->mr;
+        giommu->offset_within_address_space =
+            section->offset_within_address_space;
         giommu->container = container;
         giommu->n.notify = vfio_iommu_map_notify;
         QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index f037f3c..9ffa681 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -80,6 +80,7 @@ typedef struct VFIOContainer {
 typedef struct VFIOGuestIOMMU {
     VFIOContainer *container;
     MemoryRegion *iommu;
+    hwaddr offset_within_address_space;
     Notifier n;
     QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
 } VFIOGuestIOMMU;
-- 
2.5.0.rc3

* [Qemu-devel] [PATCH qemu v13 02/16] spapr_pci: Move DMA window enablement to a helper
  2016-03-01  9:10 [Qemu-devel] [PATCH qemu v13 00/16] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 01/16] memory: Fix IOMMU replay base address Alexey Kardashevskiy
@ 2016-03-01  9:10 ` Alexey Kardashevskiy
  2016-03-03  1:40   ` [Qemu-devel] [Qemu-ppc] " David Gibson
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 03/16] spapr_iommu: Move table allocation to helpers Alexey Kardashevskiy
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-01  9:10 UTC
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc

We are going to have multiple DMA windows soon so let's start preparing.

This adds a new helper to create a DMA window and makes use of it in
sPAPRPHBState::realize().

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/spapr_pci.c | 40 +++++++++++++++++++++++++++-------------
 1 file changed, 27 insertions(+), 13 deletions(-)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 3d1145e..248f20a 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -803,6 +803,29 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
     return buf;
 }
 
+static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
+                                       uint32_t liobn, uint32_t page_shift,
+                                       uint64_t window_addr,
+                                       uint64_t window_size)
+{
+    sPAPRTCETable *tcet;
+    uint32_t nb_table = window_size >> page_shift;
+
+    if (!nb_table) {
+        return -1;
+    }
+
+    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, window_addr,
+                               page_shift, nb_table, false);
+    if (!tcet) {
+        return -1;
+    }
+
+    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
+                                spapr_tce_get_iommu(tcet));
+    return 0;
+}
+
 /* Macros to operate with address in OF binding to PCI */
 #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
 #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
@@ -1228,8 +1251,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     int i;
     PCIBus *bus;
     uint64_t msi_window_size = 4096;
-    sPAPRTCETable *tcet;
-    uint32_t nb_table;
 
     if (sphb->index != (uint32_t)-1) {
         hwaddr windows_base;
@@ -1381,18 +1402,11 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         }
     }
 
-    nb_table = sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT;
-    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn,
-                               0, SPAPR_TCE_PAGE_SHIFT, nb_table, false);
-    if (!tcet) {
-        error_setg(errp, "Unable to create TCE table for %s",
-                   sphb->dtbusname);
-        return;
-    }
-
     /* Register default 32bit DMA window */
-    memory_region_add_subregion(&sphb->iommu_root, sphb->dma_win_addr,
-                                spapr_tce_get_iommu(tcet));
+    if (spapr_phb_dma_window_enable(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
+                                    sphb->dma_win_addr, sphb->dma_win_size)) {
+        error_setg(errp, "Unable to create TCE table for %s", sphb->dtbusname);
+    }
 
     sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
 }
-- 
2.5.0.rc3

* [Qemu-devel] [PATCH qemu v13 03/16] spapr_iommu: Move table allocation to helpers
  2016-03-01  9:10 [Qemu-devel] [PATCH qemu v13 00/16] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 01/16] memory: Fix IOMMU replay base address Alexey Kardashevskiy
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 02/16] spapr_pci: Move DMA window enablement to a helper Alexey Kardashevskiy
@ 2016-03-01  9:10 ` Alexey Kardashevskiy
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 04/16] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-01  9:10 UTC
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc

At the moment the presence of vfio-pci devices on a bus affects the way
the guest view table is allocated. If there is no vfio-pci on a PHB
and the host kernel supports KVM acceleration of H_PUT_TCE, a table
is allocated in KVM. However, if there is vfio-pci and we do not yet have
KVM acceleration for these, the table has to be allocated by
the userspace. At the moment the table is allocated once at boot time
but the next patches will reallocate it.

This moves kvmppc_create_spapr_tce/g_malloc0 and their counterparts
to helpers.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr_iommu.c | 58 +++++++++++++++++++++++++++++++++++-----------------
 trace-events         |  2 +-
 2 files changed, 40 insertions(+), 20 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 277f289..8132f64 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -75,6 +75,37 @@ static IOMMUAccessFlags spapr_tce_iommu_access_flags(uint64_t tce)
     }
 }
 
+static uint64_t *spapr_tce_alloc_table(uint32_t liobn,
+                                       uint32_t page_shift,
+                                       uint32_t nb_table,
+                                       int *fd,
+                                       bool need_vfio)
+{
+    uint64_t *table = NULL;
+    uint64_t window_size = (uint64_t)nb_table << page_shift;
+
+    if (kvm_enabled() && !(window_size >> 32)) {
+        table = kvmppc_create_spapr_tce(liobn, window_size, fd, need_vfio);
+    }
+
+    if (!table) {
+        *fd = -1;
+        table = g_malloc0(nb_table * sizeof(uint64_t));
+    }
+
+    trace_spapr_iommu_new_table(liobn, table, *fd);
+
+    return table;
+}
+
+static void spapr_tce_free_table(uint64_t *table, int fd, uint32_t nb_table)
+{
+    if (!kvm_enabled() ||
+        (kvmppc_remove_spapr_tce(table, fd, nb_table) != 0)) {
+        g_free(table);
+    }
+}
+
 /* Called from RCU critical section */
 static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
                                                bool is_write)
@@ -141,21 +172,13 @@ static MemoryRegionIOMMUOps spapr_iommu_ops = {
 static int spapr_tce_table_realize(DeviceState *dev)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
-    uint64_t window_size = (uint64_t)tcet->nb_table << tcet->page_shift;
 
-    if (kvm_enabled() && !(window_size >> 32)) {
-        tcet->table = kvmppc_create_spapr_tce(tcet->liobn,
-                                              window_size,
-                                              &tcet->fd,
-                                              tcet->need_vfio);
-    }
-
-    if (!tcet->table) {
-        size_t table_size = tcet->nb_table * sizeof(uint64_t);
-        tcet->table = g_malloc0(table_size);
-    }
-
-    trace_spapr_iommu_new_table(tcet->liobn, tcet, tcet->table, tcet->fd);
+    tcet->fd = -1;
+    tcet->table = spapr_tce_alloc_table(tcet->liobn,
+                                        tcet->page_shift,
+                                        tcet->nb_table,
+                                        &tcet->fd,
+                                        tcet->need_vfio);
 
     memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
                              "iommu-spapr",
@@ -241,11 +264,8 @@ static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
 
     QLIST_REMOVE(tcet, list);
 
-    if (!kvm_enabled() ||
-        (kvmppc_remove_spapr_tce(tcet->table, tcet->fd,
-                                 tcet->nb_table) != 0)) {
-        g_free(tcet->table);
-    }
+    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
+    tcet->fd = -1;
 }
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
diff --git a/trace-events b/trace-events
index 075ec27..4b6ea70 100644
--- a/trace-events
+++ b/trace-events
@@ -1431,7 +1431,7 @@ spapr_iommu_pci_get(uint64_t liobn, uint64_t ioba, uint64_t ret, uint64_t tce) "
 spapr_iommu_pci_indirect(uint64_t liobn, uint64_t ioba, uint64_t tce, uint64_t iobaN, uint64_t tceN, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcelist=0x%"PRIx64" iobaN=0x%"PRIx64" tceN=0x%"PRIx64" ret=%"PRId64
 spapr_iommu_pci_stuff(uint64_t liobn, uint64_t ioba, uint64_t tce_value, uint64_t npages, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcevalue=0x%"PRIx64" npages=%"PRId64" ret=%"PRId64
 spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, unsigned pgsize) "liobn=%"PRIx64" 0x%"PRIx64" -> 0x%"PRIx64" perm=%u mask=%x"
-spapr_iommu_new_table(uint64_t liobn, void *tcet, void *table, int fd) "liobn=%"PRIx64" tcet=%p table=%p fd=%d"
+spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
 
 # hw/ppc/ppc.c
 ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
-- 
2.5.0.rc3

* [Qemu-devel] [PATCH qemu v13 04/16] spapr_iommu: Introduce "enabled" state for TCE table
  2016-03-01  9:10 [Qemu-devel] [PATCH qemu v13 00/16] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (2 preceding siblings ...)
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 03/16] spapr_iommu: Move table allocation to helpers Alexey Kardashevskiy
@ 2016-03-01  9:10 ` Alexey Kardashevskiy
  2016-03-03  3:00   ` [Qemu-devel] [Qemu-ppc] " David Gibson
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 05/16] spapr_iommu: Add root memory region Alexey Kardashevskiy
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-01  9:10 UTC
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc

Currently TCE tables are created once at start and their sizes never
change. We are going to change that by introducing Dynamic DMA windows
support, where the DMA configuration may change during guest execution.

This changes spapr_tce_new_table() to create an empty zero-size IOMMU
memory region (IOMMU MR). Only the LIOBN is assigned at creation time.
It will still be called once, when the owner object (VIO or PHB) is created.

This introduces an "enabled" state for TCE table objects with two
helper functions - spapr_tce_table_enable()/spapr_tce_table_disable().
- spapr_tce_table_enable() receives TCE table parameters, allocates
a guest view of the TCE table (in the user space or KVM) and
sets the correct size on the IOMMU MR.
- spapr_tce_table_disable() disposes of the table and resets the IOMMU MR
size.
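
The enable/disable pairing can be modelled in isolation as follows. This is
a simplified sketch rather than the code in the diff below:
struct tce_table_model and the model_*() functions are hypothetical, the
table is always allocated in userspace, and mr_size merely stands in for
the size of the IOMMU MR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, stripped-down stand-in for sPAPRTCETable. */
struct tce_table_model {
    bool enabled;
    uint32_t page_shift;
    uint32_t nb_table;
    uint64_t bus_offset;
    uint64_t *table;    /* guest view of the table (userspace only here) */
    uint64_t mr_size;   /* stands in for the IOMMU MR size */
};

/* Enable: record the parameters, allocate the table, size the region. */
static void model_enable(struct tce_table_model *t, uint32_t page_shift,
                         uint64_t bus_offset, uint32_t nb_table)
{
    uint64_t *table;

    assert(t != NULL);
    if (t->enabled || !nb_table) {
        return;         /* already enabled, or nothing to map */
    }
    table = calloc(nb_table, sizeof(*table));
    if (!table) {
        return;         /* allocation failed, stay disabled */
    }
    t->page_shift = page_shift;
    t->bus_offset = bus_offset;
    t->nb_table = nb_table;
    t->table = table;
    t->mr_size = (uint64_t)nb_table << page_shift;
    t->enabled = true;
}

/* Disable: free the table and return to the empty, zero-size state. */
static void model_disable(struct tce_table_model *t)
{
    assert(t != NULL);
    if (!t->enabled) {
        return;
    }
    free(t->table);
    t->table = NULL;
    t->mr_size = 0;
    t->page_shift = 0;
    t->bus_offset = 0;
    t->nb_table = 0;
    t->enabled = false;
}
```

Both functions are idempotent, so a reset handler could call the disable
path unconditionally and then re-enable the default window.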

This changes the PHB reset handler to do the default DMA initialization
instead of spapr_phb_realize(). This makes no difference now, but
later, with more than one DMA window, we will have to remove them all
and create the default one on a system reset.

No visible change in behaviour is expected, except that the actual table
will be reallocated on every reset. We might optimize this later.

The other way to implement this would be to dynamically create/remove
the TCE table QOM objects, but this would make migration impossible
as the migration code expects all QOM objects to exist at the receiver,
so we have to have TCE table objects created when migration begins.

spapr_tce_table_do_enable() is separated from spapr_tce_table_enable()
as later it will be called at the sPAPRTCETable post-migration stage, when
it already has all the properties set after the migration.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/spapr_iommu.c   | 80 +++++++++++++++++++++++++++++++++++---------------
 hw/ppc/spapr_pci.c     | 21 +++++++++----
 hw/ppc/spapr_vio.c     |  9 +++---
 include/hw/ppc/spapr.h | 10 +++----
 4 files changed, 80 insertions(+), 40 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 8132f64..e66e128 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -174,15 +174,8 @@ static int spapr_tce_table_realize(DeviceState *dev)
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
 
     tcet->fd = -1;
-    tcet->table = spapr_tce_alloc_table(tcet->liobn,
-                                        tcet->page_shift,
-                                        tcet->nb_table,
-                                        &tcet->fd,
-                                        tcet->need_vfio);
-
     memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
-                             "iommu-spapr",
-                             (uint64_t)tcet->nb_table << tcet->page_shift);
+                             "iommu-spapr", 0);
 
     QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
 
@@ -224,14 +217,10 @@ void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio)
     tcet->table = newtable;
 }
 
-sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
-                                   uint64_t bus_offset,
-                                   uint32_t page_shift,
-                                   uint32_t nb_table,
-                                   bool need_vfio)
+sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn)
 {
     sPAPRTCETable *tcet;
-    char tmp[64];
+    char tmp[32];
 
     if (spapr_tce_find_by_liobn(liobn)) {
         fprintf(stderr, "Attempted to create TCE table with duplicate"
@@ -239,16 +228,8 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
         return NULL;
     }
 
-    if (!nb_table) {
-        return NULL;
-    }
-
     tcet = SPAPR_TCE_TABLE(object_new(TYPE_SPAPR_TCE_TABLE));
     tcet->liobn = liobn;
-    tcet->bus_offset = bus_offset;
-    tcet->page_shift = page_shift;
-    tcet->nb_table = nb_table;
-    tcet->need_vfio = need_vfio;
 
     snprintf(tmp, sizeof(tmp), "tce-table-%x", liobn);
     object_property_add_child(OBJECT(owner), tmp, OBJECT(tcet), NULL);
@@ -258,14 +239,65 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
     return tcet;
 }
 
+static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
+{
+    if (!tcet->nb_table) {
+        return;
+    }
+
+    tcet->table = spapr_tce_alloc_table(tcet->liobn,
+                                        tcet->page_shift,
+                                        tcet->nb_table,
+                                        &tcet->fd,
+                                        tcet->need_vfio);
+
+    memory_region_set_size(&tcet->iommu,
+                           (uint64_t)tcet->nb_table << tcet->page_shift);
+
+    tcet->enabled = true;
+}
+
+void spapr_tce_table_enable(sPAPRTCETable *tcet,
+                            uint32_t page_shift, uint64_t bus_offset,
+                            uint32_t nb_table, bool need_vfio)
+{
+    if (tcet->enabled) {
+        return;
+    }
+
+    tcet->bus_offset = bus_offset;
+    tcet->page_shift = page_shift;
+    tcet->nb_table = nb_table;
+    tcet->need_vfio = need_vfio;
+
+    spapr_tce_table_do_enable(tcet);
+}
+
+static void spapr_tce_table_disable(sPAPRTCETable *tcet)
+{
+    if (!tcet->enabled) {
+        return;
+    }
+
+    memory_region_set_size(&tcet->iommu, 0);
+
+    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
+    tcet->fd = -1;
+    tcet->table = NULL;
+    tcet->enabled = false;
+    tcet->bus_offset = 0;
+    tcet->page_shift = 0;
+    tcet->nb_table = 0;
+    tcet->need_vfio = false;
+}
+
 static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
 
     QLIST_REMOVE(tcet, list);
 
-    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
-    tcet->fd = -1;
+    spapr_tce_table_disable(tcet);
 }
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 248f20a..c34a906 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -815,12 +815,13 @@ static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
         return -1;
     }
 
-    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, window_addr,
-                               page_shift, nb_table, false);
+    tcet = spapr_tce_find_by_liobn(liobn);
     if (!tcet) {
         return -1;
     }
 
+    spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table, false);
+
     memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
                                 spapr_tce_get_iommu(tcet));
     return 0;
@@ -1251,6 +1252,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     int i;
     PCIBus *bus;
     uint64_t msi_window_size = 4096;
+    sPAPRTCETable *tcet;
 
     if (sphb->index != (uint32_t)-1) {
         hwaddr windows_base;
@@ -1402,11 +1404,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         }
     }
 
+    /* DMA setup */
+    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
+    if (!tcet) {
+        error_report("No default TCE table for %s", sphb->dtbusname);
+        return;
+    }
+
     /* Register default 32bit DMA window */
-    if (spapr_phb_dma_window_enable(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
-                                    sphb->dma_win_addr, sphb->dma_win_size)) {
-        error_setg(errp, "Unable to create TCE table for %s", sphb->dtbusname);
-    }
+    spapr_phb_dma_window_enable(sphb, sphb->dma_liobn,
+                                SPAPR_TCE_PAGE_SHIFT,
+                                sphb->dma_win_addr,
+                                sphb->dma_win_size);
 
     sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
 }
diff --git a/hw/ppc/spapr_vio.c b/hw/ppc/spapr_vio.c
index 0f61a55..a745884 100644
--- a/hw/ppc/spapr_vio.c
+++ b/hw/ppc/spapr_vio.c
@@ -481,11 +481,10 @@ static void spapr_vio_busdev_realize(DeviceState *qdev, Error **errp)
         memory_region_add_subregion_overlap(&dev->mrroot, 0, &dev->mrbypass, 1);
         address_space_init(&dev->as, &dev->mrroot, qdev->id);
 
-        dev->tcet = spapr_tce_new_table(qdev, liobn,
-                                        0,
-                                        SPAPR_TCE_PAGE_SHIFT,
-                                        pc->rtce_window_size >>
-                                        SPAPR_TCE_PAGE_SHIFT, false);
+        dev->tcet = spapr_tce_new_table(qdev, liobn);
+        spapr_tce_table_enable(dev->tcet, SPAPR_TCE_PAGE_SHIFT, 0,
+                               pc->rtce_window_size >> SPAPR_TCE_PAGE_SHIFT,
+                               false);
         dev->tcet->vdev = dev;
         memory_region_add_subregion_overlap(&dev->mrroot, 0,
                                             spapr_tce_get_iommu(dev->tcet), 2);
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 098d85d..3e6bb84 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -539,6 +539,7 @@ typedef struct sPAPRTCETable sPAPRTCETable;
 
 struct sPAPRTCETable {
     DeviceState parent;
+    bool enabled;
     uint32_t liobn;
     uint32_t nb_table;
     uint64_t bus_offset;
@@ -566,11 +567,10 @@ void spapr_events_fdt_skel(void *fdt, uint32_t epow_irq);
 int spapr_h_cas_compose_response(sPAPRMachineState *sm,
                                  target_ulong addr, target_ulong size,
                                  bool cpu_update, bool memory_update);
-sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
-                                   uint64_t bus_offset,
-                                   uint32_t page_shift,
-                                   uint32_t nb_table,
-                                   bool need_vfio);
+sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn);
+void spapr_tce_table_enable(sPAPRTCETable *tcet,
+                            uint32_t page_shift, uint64_t bus_offset,
+                            uint32_t nb_table, bool vfio_accel);
 void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio);
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet);
-- 
2.5.0.rc3

* [Qemu-devel] [PATCH qemu v13 05/16] spapr_iommu: Add root memory region
  2016-03-01  9:10 [Qemu-devel] [PATCH qemu v13 00/16] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (3 preceding siblings ...)
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 04/16] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
@ 2016-03-01  9:10 ` Alexey Kardashevskiy
  2016-03-04  4:08   ` [Qemu-devel] [Qemu-ppc] " David Gibson
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 06/16] spapr_pci: Reset DMA config on PHB reset Alexey Kardashevskiy
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-01  9:10 UTC
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc

We are going to have multiple DMA windows at different offsets on
a PCI bus. For the sake of migration, we will pre-create as many TCE table
objects as there are windows supported.
So we need a way to map windows dynamically onto a PCI bus
when migration of a table is completed, but at this stage a TCE table
object does not have access to a PHB to ask it to map a DMA window
backed by the just-migrated TCE table.

This adds a "root" memory region (UINT64_MAX long) to the TCE object.
This new region is mapped on a PCI bus with overlapping enabled as
there will be one root MR per TCE table, each of them mapped at 0.
The actual IOMMU memory region is a subregion of the root region and
a TCE table enables/disables this subregion and maps it at
the specific offset inside the root MR, which is a 1:1 mapping of
a PCI address space.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Thomas Huth <thuth@redhat.com>
---
 hw/ppc/spapr_iommu.c   | 13 ++++++++++---
 hw/ppc/spapr_pci.c     |  5 +++--
 include/hw/ppc/spapr.h |  2 +-
 3 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index e66e128..ba9ddbb 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -172,10 +172,15 @@ static MemoryRegionIOMMUOps spapr_iommu_ops = {
 static int spapr_tce_table_realize(DeviceState *dev)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
+    Object *tcetobj = OBJECT(tcet);
+    char tmp[32];
 
     tcet->fd = -1;
-    memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
-                             "iommu-spapr", 0);
+    snprintf(tmp, sizeof(tmp), "tce-root-%x", tcet->liobn);
+    memory_region_init(&tcet->root, tcetobj, tmp, UINT64_MAX);
+
+    snprintf(tmp, sizeof(tmp), "tce-iommu-%x", tcet->liobn);
+    memory_region_init_iommu(&tcet->iommu, tcetobj, &spapr_iommu_ops, tmp, 0);
 
     QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
 
@@ -253,6 +258,7 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
 
     memory_region_set_size(&tcet->iommu,
                            (uint64_t)tcet->nb_table << tcet->page_shift);
+    memory_region_add_subregion(&tcet->root, tcet->bus_offset, &tcet->iommu);
 
     tcet->enabled = true;
 }
@@ -279,6 +285,7 @@ static void spapr_tce_table_disable(sPAPRTCETable *tcet)
         return;
     }
 
+    memory_region_del_subregion(&tcet->root, &tcet->iommu);
     memory_region_set_size(&tcet->iommu, 0);
 
     spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
@@ -302,7 +309,7 @@ static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
 {
-    return &tcet->iommu;
+    return &tcet->root;
 }
 
 static void spapr_tce_reset(DeviceState *dev)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index c34a906..7b40687 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -822,8 +822,6 @@ static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
 
     spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table, false);
 
-    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
-                                spapr_tce_get_iommu(tcet));
     return 0;
 }
 
@@ -1411,6 +1409,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         return;
     }
 
+    memory_region_add_subregion(&sphb->iommu_root, 0,
+                                spapr_tce_get_iommu(tcet));
+
     /* Register default 32bit DMA window */
     spapr_phb_dma_window_enable(sphb, sphb->dma_liobn,
                                 SPAPR_TCE_PAGE_SHIFT,
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 3e6bb84..bdf27ec 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -548,7 +548,7 @@ struct sPAPRTCETable {
     bool bypass;
     bool need_vfio;
     int fd;
-    MemoryRegion iommu;
+    MemoryRegion root, iommu;
     struct VIOsPAPRDevice *vdev; /* for @bypass migration compatibility only */
     QLIST_ENTRY(sPAPRTCETable) list;
 };
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v13 06/16] spapr_pci: Reset DMA config on PHB reset
  2016-03-01  9:10 [Qemu-devel] [PATCH qemu v13 00/16] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (4 preceding siblings ...)
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 05/16] spapr_iommu: Add root memory region Alexey Kardashevskiy
@ 2016-03-01  9:10 ` Alexey Kardashevskiy
  2016-03-03  3:02   ` [Qemu-devel] [Qemu-ppc] " David Gibson
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 07/16] vfio, memory: Notify IOMMU about starting/stopping being used by VFIO Alexey Kardashevskiy
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-01  9:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc

LoPAPR requires that during system reset all DMA windows are removed
and the default DMA32 window is created; this patch does that.

At the moment there is just one window supported so no change in
behaviour is expected.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/spapr_iommu.c   |  2 +-
 hw/ppc/spapr_pci.c     | 29 +++++++++++++++++++++++------
 include/hw/ppc/spapr.h |  1 +
 3 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index ba9ddbb..8a88a74 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -279,7 +279,7 @@ void spapr_tce_table_enable(sPAPRTCETable *tcet,
     spapr_tce_table_do_enable(tcet);
 }
 
-static void spapr_tce_table_disable(sPAPRTCETable *tcet)
+void spapr_tce_table_disable(sPAPRTCETable *tcet)
 {
     if (!tcet->enabled) {
         return;
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 7b40687..ee0fecf 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -825,6 +825,19 @@ static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
     return 0;
 }
 
+static int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
+{
+    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
+
+    if (!tcet) {
+        return -1;
+    }
+
+    spapr_tce_table_disable(tcet);
+
+    return 0;
+}
+
 /* Macros to operate with address in OF binding to PCI */
 #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
 #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
@@ -1412,12 +1425,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     memory_region_add_subregion(&sphb->iommu_root, 0,
                                 spapr_tce_get_iommu(tcet));
 
-    /* Register default 32bit DMA window */
-    spapr_phb_dma_window_enable(sphb, sphb->dma_liobn,
-                                SPAPR_TCE_PAGE_SHIFT,
-                                sphb->dma_win_addr,
-                                sphb->dma_win_size);
-
     sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
 }
 
@@ -1434,6 +1441,16 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
 
 static void spapr_phb_reset(DeviceState *qdev)
 {
+    sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
+
+    spapr_phb_dma_window_disable(sphb, sphb->dma_liobn);
+
+    /* Register default 32bit DMA window */
+    spapr_phb_dma_window_enable(sphb, sphb->dma_liobn,
+                                SPAPR_TCE_PAGE_SHIFT,
+                                sphb->dma_win_addr,
+                                sphb->dma_win_size);
+
     /* Reset the IOMMU state */
     object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL);
 
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index bdf27ec..8aa0c45 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -571,6 +571,7 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn);
 void spapr_tce_table_enable(sPAPRTCETable *tcet,
                             uint32_t page_shift, uint64_t bus_offset,
                             uint32_t nb_table, bool vfio_accel);
+void spapr_tce_table_disable(sPAPRTCETable *tcet);
 void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio);
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet);
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v13 07/16] vfio, memory: Notify IOMMU about starting/stopping being used by VFIO
  2016-03-01  9:10 [Qemu-devel] [PATCH qemu v13 00/16] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (5 preceding siblings ...)
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 06/16] spapr_pci: Reset DMA config on PHB reset Alexey Kardashevskiy
@ 2016-03-01  9:10 ` Alexey Kardashevskiy
  2016-03-03  5:28   ` [Qemu-devel] [Qemu-ppc] " David Gibson
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 08/16] memory: Add reporting of supported page sizes Alexey Kardashevskiy
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-01  9:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc

This adds a vfio_notify() callback to inform an IOMMU (and then its owner)
that VFIO started or stopped using the IOMMU. This is used by the pseries
machine to enable/disable in-kernel acceleration of TCE hypercalls.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/spapr_iommu.c   |  9 +++++++++
 hw/ppc/spapr_pci.c     | 14 ++++++++------
 hw/vfio/common.c       |  7 +++++++
 include/exec/memory.h  |  2 ++
 include/hw/ppc/spapr.h |  4 ++++
 5 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 8a88a74..67a8356 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -136,6 +136,13 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
     return ret;
 }
 
+static int spapr_tce_vfio_notify(MemoryRegion *iommu, bool attached)
+{
+    sPAPRTCETable *tcet = container_of(iommu, sPAPRTCETable, iommu);
+
+    return spapr_tce_vfio_notify_owner(tcet->owner, tcet, attached);
+}
+
 static int spapr_tce_table_post_load(void *opaque, int version_id)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
@@ -167,6 +174,7 @@ static const VMStateDescription vmstate_spapr_tce_table = {
 
 static MemoryRegionIOMMUOps spapr_iommu_ops = {
     .translate = spapr_tce_translate_iommu,
+    .vfio_notify = spapr_tce_vfio_notify,
 };
 
 static int spapr_tce_table_realize(DeviceState *dev)
@@ -235,6 +243,7 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn)
 
     tcet = SPAPR_TCE_TABLE(object_new(TYPE_SPAPR_TCE_TABLE));
     tcet->liobn = liobn;
+    tcet->owner = owner;
 
     snprintf(tmp, sizeof(tmp), "tce-table-%x", liobn);
     object_property_add_child(OBJECT(owner), tmp, OBJECT(tcet), NULL);
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index ee0fecf..b0cd148 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1084,6 +1084,14 @@ static int spapr_populate_pci_child_dt(PCIDevice *dev, void *fdt, int offset,
     return 0;
 }
 
+int spapr_tce_vfio_notify_owner(DeviceState *dev, sPAPRTCETable *tcet,
+                                bool attached)
+{
+    spapr_tce_set_need_vfio(tcet, attached);
+
+    return 0;
+}
+
 /* create OF node for pci device and required OF DT properties */
 static int spapr_create_pci_child_dt(sPAPRPHBState *phb, PCIDevice *dev,
                                      void *fdt, int node_offset)
@@ -1118,12 +1126,6 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
     void *fdt = NULL;
     int fdt_start_offset = 0, fdt_size;
 
-    if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
-        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
-
-        spapr_tce_set_need_vfio(tcet, true);
-    }
-
     if (dev->hotplugged) {
         fdt = create_device_tree(&fdt_size);
         fdt_start_offset = spapr_create_pci_child_dt(phb, pdev, fdt, 0);
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 9bf4c3b..ca3fd47 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -384,6 +384,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
         QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
 
         memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
+        giommu->iommu->iommu_ops->vfio_notify(section->mr, true);
         memory_region_iommu_replay(giommu->iommu, &giommu->n,
                                    vfio_container_granularity(container),
                                    false);
@@ -430,6 +431,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
     hwaddr iova, end;
     int ret;
+    MemoryRegion *iommu = NULL;
 
     if (vfio_listener_skipped_section(section)) {
         trace_vfio_listener_region_del_skip(
@@ -451,6 +453,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
         QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
             if (giommu->iommu == section->mr) {
                 memory_region_unregister_iommu_notifier(&giommu->n);
+                iommu = giommu->iommu;
                 QLIST_REMOVE(giommu, giommu_next);
                 g_free(giommu);
                 break;
@@ -483,6 +486,10 @@ static void vfio_listener_region_del(MemoryListener *listener,
                      "0x%"HWADDR_PRIx") = %d (%m)",
                      container, iova, end - iova, ret);
     }
+
+    if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_notify) {
+        iommu->iommu_ops->vfio_notify(section->mr, false);
+    }
 }
 
 static const MemoryListener vfio_memory_listener = {
diff --git a/include/exec/memory.h b/include/exec/memory.h
index d5284c2..9f82629 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -150,6 +150,8 @@ typedef struct MemoryRegionIOMMUOps MemoryRegionIOMMUOps;
 struct MemoryRegionIOMMUOps {
     /* Return a TLB entry that contains a given address. */
     IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
+    /* Called when VFIO starts/stops using this */
+    int (*vfio_notify)(MemoryRegion *iommu, bool attached);
 };
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 8aa0c45..5d2f8f4 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -550,6 +550,7 @@ struct sPAPRTCETable {
     int fd;
     MemoryRegion root, iommu;
     struct VIOsPAPRDevice *vdev; /* for @bypass migration compatibility only */
+    DeviceState *owner;
     QLIST_ENTRY(sPAPRTCETable) list;
 };
 
@@ -629,4 +630,7 @@ int spapr_rng_populate_dt(void *fdt);
  */
 #define SPAPR_LMB_FLAGS_ASSIGNED 0x00000008
 
+int spapr_tce_vfio_notify_owner(DeviceState *dev, sPAPRTCETable *tcet,
+                                bool attached);
+
 #endif /* !defined (__HW_SPAPR_H__) */
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v13 08/16] memory: Add reporting of supported page sizes
  2016-03-01  9:10 [Qemu-devel] [PATCH qemu v13 00/16] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (6 preceding siblings ...)
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 07/16] vfio, memory: Notify IOMMU about starting/stopping being used by VFIO Alexey Kardashevskiy
@ 2016-03-01  9:10 ` Alexey Kardashevskiy
  2016-03-03  5:33   ` [Qemu-devel] [Qemu-ppc] " David Gibson
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 09/16] vfio: Generalize IOMMU memory listener Alexey Kardashevskiy
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-01  9:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc

Every IOMMU has some granularity which MemoryRegionIOMMUOps::translate
uses when translating, however this information is not available outside
the translate context for various checks.

This adds a get_page_sizes callback to MemoryRegionIOMMUOps and
a wrapper for it so IOMMU users (such as VFIO) can know the actual
page size(s) used by an IOMMU.

qemu_real_host_page_size is used as the fallback.
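
A minimal sketch of the optional-callback-with-fallback pattern described
above, written outside QEMU. All Toy* names and TOY_HOST_PAGE_SIZE are
assumptions made for illustration; the 4 KiB value merely stands in for
whatever the host page size is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ToyIommu ToyIommu;

/* Hypothetical ops table; get_page_sizes is optional and may be NULL. */
typedef struct ToyIommuOps {
    uint64_t (*get_page_sizes)(ToyIommu *iommu);
} ToyIommuOps;

struct ToyIommu {
    const ToyIommuOps *ops;
    unsigned page_shift;
};

/* Assumed host page size, standing in for the real host value. */
#define TOY_HOST_PAGE_SIZE 4096ULL

/* An IOMMU with a single page size reports a bitmap with one bit set. */
static uint64_t toy_tce_get_page_sizes(ToyIommu *iommu)
{
    return 1ULL << iommu->page_shift;
}

/* Wrapper: ask the IOMMU if it can answer, else use the host page size. */
static uint64_t toy_iommu_get_page_sizes(ToyIommu *iommu)
{
    assert(iommu != NULL);

    if (iommu->ops && iommu->ops->get_page_sizes) {
        return iommu->ops->get_page_sizes(iommu);
    }
    return TOY_HOST_PAGE_SIZE;
}
```

Callers then never need to know whether a given IOMMU implements the
callback.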

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v4:
* s/1<<TARGET_PAGE_BITS/qemu_real_host_page_size/
---
 hw/ppc/spapr_iommu.c  |  8 ++++++++
 include/exec/memory.h | 11 +++++++++++
 memory.c              |  9 +++++++++
 3 files changed, 28 insertions(+)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 67a8356..4c52cf4 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -143,6 +143,13 @@ static int spapr_tce_vfio_notify(MemoryRegion *iommu, bool attached)
     return spapr_tce_vfio_notify_owner(tcet->owner, tcet, attached);
 }
 
+static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
+{
+    sPAPRTCETable *tcet = container_of(iommu, sPAPRTCETable, iommu);
+
+    return 1ULL << tcet->page_shift;
+}
+
 static int spapr_tce_table_post_load(void *opaque, int version_id)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
@@ -175,6 +182,7 @@ static const VMStateDescription vmstate_spapr_tce_table = {
 static MemoryRegionIOMMUOps spapr_iommu_ops = {
     .translate = spapr_tce_translate_iommu,
     .vfio_notify = spapr_tce_vfio_notify,
+    .get_page_sizes = spapr_tce_get_page_sizes,
 };
 
 static int spapr_tce_table_realize(DeviceState *dev)
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 9f82629..c34e67c 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -152,6 +152,8 @@ struct MemoryRegionIOMMUOps {
     IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
     /* Called when VFIO starts/stops using this */
     int (*vfio_notify)(MemoryRegion *iommu, bool attached);
+    /* Returns supported page sizes */
+    uint64_t (*get_page_sizes)(MemoryRegion *iommu);
 };
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
@@ -576,6 +578,15 @@ static inline bool memory_region_is_iommu(MemoryRegion *mr)
 
 
 /**
+ * memory_region_iommu_get_page_sizes: get supported page sizes in an iommu
+ *
+ * Returns %bitmap of supported page sizes for an iommu.
+ *
+ * @mr: the memory region being queried
+ */
+uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr);
+
+/**
  * memory_region_notify_iommu: notify a change in an IOMMU translation entry.
  *
  * @mr: the memory region that was changed
diff --git a/memory.c b/memory.c
index 0dd9695..5d8453d 100644
--- a/memory.c
+++ b/memory.c
@@ -1462,6 +1462,15 @@ void memory_region_notify_iommu(MemoryRegion *mr,
     notifier_list_notify(&mr->iommu_notify, &entry);
 }
 
+uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr)
+{
+    assert(memory_region_is_iommu(mr));
+    if (mr->iommu_ops && mr->iommu_ops->get_page_sizes) {
+        return mr->iommu_ops->get_page_sizes(mr);
+    }
+    return qemu_real_host_page_size;
+}
+
 void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client)
 {
     uint8_t mask = 1 << client;
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v13 09/16] vfio: Generalize IOMMU memory listener
  2016-03-01  9:10 [Qemu-devel] [PATCH qemu v13 00/16] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (7 preceding siblings ...)
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 08/16] memory: Add reporting of supported page sizes Alexey Kardashevskiy
@ 2016-03-01  9:10 ` Alexey Kardashevskiy
  2016-03-03  5:36   ` [Qemu-devel] [Qemu-ppc] " David Gibson
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 10/16] vfio: Use different page size for different IOMMU types Alexey Kardashevskiy
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-01  9:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc

At the moment VFIOContainer uses one memory listener which listens on
PCI address space for both Type1 and sPAPR IOMMUs. Soon we will need
another listener to listen on RAM; this will do DMA memory
pre-registration for sPAPR guests which basically pins all guest
pages in the host physical RAM.

This introduces VFIOMemoryListener, which is a wrapper for MemoryListener
that also stores a pointer to the container. This allows having multiple
memory listeners for the same container. This replaces the existing
@listener with @iommu_listener.
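
The wrapper idea can be sketched in standalone C as follows. The Toy* types
and toy_container_of() are hypothetical stand-ins, not the QEMU definitions,
and the second listener field only illustrates the "multiple listeners per
container" case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for MemoryListener. */
typedef struct ToyListener {
    int priority; /* placeholder field */
} ToyListener;

typedef struct ToyContainer ToyContainer;

/* Wrapper: the generic listener plus a back pointer to its container. */
typedef struct ToyMemoryListener {
    ToyListener listener;
    ToyContainer *container;
} ToyMemoryListener;

struct ToyContainer {
    int fd;
    ToyMemoryListener iommu_listener;
    ToyMemoryListener second_listener; /* a further listener, same container */
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* A callback only receives the generic listener; map it to the container. */
static ToyContainer *toy_listener_to_container(ToyListener *listener)
{
    ToyMemoryListener *vlistener;

    assert(listener != NULL);
    vlistener = toy_container_of(listener, ToyMemoryListener, listener);
    return vlistener->container;
}
```

With a single embedded listener the container could be found directly from
the member offset; once there are two, the explicit back pointer avoids
needing to know which member a callback was registered through.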

This should cause no change in behavior.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/vfio/common.c              | 41 +++++++++++++++++++++++++++++++----------
 include/hw/vfio/vfio-common.h |  9 ++++++++-
 2 files changed, 39 insertions(+), 11 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index ca3fd47..0e67a5a 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -318,10 +318,10 @@ static hwaddr vfio_container_granularity(VFIOContainer *container)
     return (hwaddr)1 << ctz64(container->iova_pgsizes);
 }
 
-static void vfio_listener_region_add(MemoryListener *listener,
+static void vfio_listener_region_add(VFIOMemoryListener *vlistener,
                                      MemoryRegionSection *section)
 {
-    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+    VFIOContainer *container = vlistener->container;
     hwaddr iova, end;
     Int128 llend;
     void *vaddr;
@@ -425,10 +425,10 @@ fail:
     }
 }
 
-static void vfio_listener_region_del(MemoryListener *listener,
+static void vfio_listener_region_del(VFIOMemoryListener *vlistener,
                                      MemoryRegionSection *section)
 {
-    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+    VFIOContainer *container = vlistener->container;
     hwaddr iova, end;
     int ret;
     MemoryRegion *iommu = NULL;
@@ -492,14 +492,33 @@ static void vfio_listener_region_del(MemoryListener *listener,
     }
 }
 
-static const MemoryListener vfio_memory_listener = {
-    .region_add = vfio_listener_region_add,
-    .region_del = vfio_listener_region_del,
+static void vfio_iommu_listener_region_add(MemoryListener *listener,
+                                           MemoryRegionSection *section)
+{
+    VFIOMemoryListener *vlistener = container_of(listener, VFIOMemoryListener,
+                                                 listener);
+
+    vfio_listener_region_add(vlistener, section);
+}
+
+
+static void vfio_iommu_listener_region_del(MemoryListener *listener,
+                                           MemoryRegionSection *section)
+{
+    VFIOMemoryListener *vlistener = container_of(listener, VFIOMemoryListener,
+                                                 listener);
+
+    vfio_listener_region_del(vlistener, section);
+}
+
+static const MemoryListener vfio_iommu_listener = {
+    .region_add = vfio_iommu_listener_region_add,
+    .region_del = vfio_iommu_listener_region_del,
 };
 
 static void vfio_listener_release(VFIOContainer *container)
 {
-    memory_listener_unregister(&container->listener);
+    memory_listener_unregister(&container->iommu_listener.listener);
 }
 
 int vfio_mmap_region(Object *obj, VFIORegion *region,
@@ -768,9 +787,11 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
         goto free_container_exit;
     }
 
-    container->listener = vfio_memory_listener;
+    container->iommu_listener.container = container;
+    container->iommu_listener.listener = vfio_iommu_listener;
 
-    memory_listener_register(&container->listener, container->space->as);
+    memory_listener_register(&container->iommu_listener.listener,
+                             container->space->as);
 
     if (container->error) {
         ret = container->error;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 9ffa681..b6b736c 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -57,12 +57,19 @@ typedef struct VFIOAddressSpace {
     QLIST_ENTRY(VFIOAddressSpace) list;
 } VFIOAddressSpace;
 
+typedef struct VFIOContainer VFIOContainer;
+
+typedef struct VFIOMemoryListener {
+    struct MemoryListener listener;
+    VFIOContainer *container;
+} VFIOMemoryListener;
+
 struct VFIOGroup;
 
 typedef struct VFIOContainer {
     VFIOAddressSpace *space;
     int fd; /* /dev/vfio/vfio, empowered by the attached groups */
-    MemoryListener listener;
+    VFIOMemoryListener iommu_listener;
     int error;
     bool initialized;
     /*
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v13 10/16] vfio: Use different page size for different IOMMU types
  2016-03-01  9:10 [Qemu-devel] [PATCH qemu v13 00/16] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (8 preceding siblings ...)
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 09/16] vfio: Generalize IOMMU memory listener Alexey Kardashevskiy
@ 2016-03-01  9:10 ` Alexey Kardashevskiy
  2016-03-03  6:08   ` [Qemu-devel] [Qemu-ppc] " David Gibson
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 11/16] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-01  9:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc

The existing memory listener is called on RAM or PCI address space,
which may have different page sizes.

This uses the new memory_region_iommu_get_page_sizes() for IOMMU regions
and falls back to qemu_real_host_page_mask for RAM.
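
For illustration, deriving an alignment mask from a page-size bitmap might
look like the standalone sketch below. toy_page_mask() and toy_round_up()
are hypothetical helpers, not QEMU functions; the trailing-zero count is
done with a 64-bit loop so that large page sizes are not truncated.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Build an address mask from a bitmap of supported page sizes, using the
 * smallest supported size (the lowest set bit).
 */
static uint64_t toy_page_mask(uint64_t page_sizes)
{
    unsigned shift = 0;

    assert(page_sizes != 0);
    while (!(page_sizes & 1)) {
        page_sizes >>= 1;
        shift++;
    }
    return ~((1ULL << shift) - 1);
}

/* Round an address up to the page boundary implied by the mask. */
static uint64_t toy_round_up(uint64_t addr, uint64_t page_mask)
{
    uint64_t page_size = ~page_mask + 1;

    return (addr + page_size - 1) & page_mask;
}
```

Using the smallest supported size keeps the alignment checks conservative
when an IOMMU region supports several page sizes.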

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
* uses the smallest page size for the mask, as an IOMMU MR can support
multiple page sizes
---
 hw/vfio/common.c | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0e67a5a..3aaa6b5 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -318,6 +318,16 @@ static hwaddr vfio_container_granularity(VFIOContainer *container)
     return (hwaddr)1 << ctz64(container->iova_pgsizes);
 }
 
+static hwaddr vfio_iommu_page_mask(MemoryRegion *mr)
+{
+    if (memory_region_is_iommu(mr)) {
+        int smallest = ctz64(memory_region_iommu_get_page_sizes(mr));
+
+        return ~((1ULL << smallest) - 1);
+    }
+    return qemu_real_host_page_mask;
+}
+
 static void vfio_listener_region_add(VFIOMemoryListener *vlistener,
                                      MemoryRegionSection *section)
 {
@@ -326,6 +336,7 @@ static void vfio_listener_region_add(VFIOMemoryListener *vlistener,
     Int128 llend;
     void *vaddr;
     int ret;
+    hwaddr page_mask = vfio_iommu_page_mask(section->mr);
 
     if (vfio_listener_skipped_section(section)) {
         trace_vfio_listener_region_add_skip(
@@ -335,16 +346,16 @@ static void vfio_listener_region_add(VFIOMemoryListener *vlistener,
         return;
     }
 
-    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
-                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
+    if (unlikely((section->offset_within_address_space & ~page_mask) !=
+                 (section->offset_within_region & ~page_mask))) {
         error_report("%s received unaligned region", __func__);
         return;
     }
 
-    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+    iova = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
     llend = int128_make64(section->offset_within_address_space);
     llend = int128_add(llend, section->size);
-    llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
+    llend = int128_and(llend, int128_exts64(page_mask));
 
     if (int128_ge(int128_make64(iova), llend)) {
         return;
@@ -432,6 +443,7 @@ static void vfio_listener_region_del(VFIOMemoryListener *vlistener,
     hwaddr iova, end;
     int ret;
     MemoryRegion *iommu = NULL;
+    hwaddr page_mask = vfio_iommu_page_mask(section->mr);
 
     if (vfio_listener_skipped_section(section)) {
         trace_vfio_listener_region_del_skip(
@@ -441,8 +453,8 @@ static void vfio_listener_region_del(VFIOMemoryListener *vlistener,
         return;
     }
 
-    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
-                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
+    if (unlikely((section->offset_within_address_space & ~page_mask) !=
+                 (section->offset_within_region & ~page_mask))) {
         error_report("%s received unaligned region", __func__);
         return;
     }
@@ -469,9 +481,9 @@ static void vfio_listener_region_del(VFIOMemoryListener *vlistener,
          */
     }
 
-    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+    iova = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
     end = (section->offset_within_address_space + int128_get64(section->size)) &
-          TARGET_PAGE_MASK;
+          page_mask;
 
     if (iova >= end) {
         return;
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v13 11/16] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2016-03-01  9:10 [Qemu-devel] [PATCH qemu v13 00/16] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (9 preceding siblings ...)
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 10/16] vfio: Use different page size for different IOMMU types Alexey Kardashevskiy
@ 2016-03-01  9:10 ` Alexey Kardashevskiy
  2016-03-03  6:30   ` [Qemu-devel] [Qemu-ppc] " David Gibson
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 12/16] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-01  9:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc

This makes use of the new "memory registering" feature. The idea is
to give userspace the ability to notify the host kernel about pages
which are going to be used for DMA. With this information, the host
kernel can pin them all once per user process, do locked pages
accounting (once), and not spend time doing that in real time with
possible failures which cannot be handled nicely in some cases.

This adds a prereg memory listener which listens on address_space_memory
and notifies a VFIO container about memory which needs to be
pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.

As there is no per-IOMMU-type release() callback anymore, this stores
the IOMMU type in the container so vfio_listener_release() can decide
whether it needs to unregister @prereg_listener.

The feature is only enabled for SPAPR IOMMU v2. Host kernel changes
are required. Since v2 neither needs nor supports VFIO_IOMMU_ENABLE, this
does not call it when v2 is detected and enabled.

This does not change the guest visible interface.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/vfio/Makefile.objs         |   1 +
 hw/vfio/common.c              |  39 +++++++++---
 hw/vfio/prereg.c              | 138 ++++++++++++++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-common.h |   4 ++
 trace-events                  |   2 +
 5 files changed, 175 insertions(+), 9 deletions(-)
 create mode 100644 hw/vfio/prereg.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index ceddbb8..5800e0e 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
 obj-$(CONFIG_SOFTMMU) += platform.o
 obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
 obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
+obj-$(CONFIG_SOFTMMU) += prereg.o
 endif
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 3aaa6b5..f2a03e0 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -531,6 +531,9 @@ static const MemoryListener vfio_iommu_listener = {
 static void vfio_listener_release(VFIOContainer *container)
 {
     memory_listener_unregister(&container->iommu_listener.listener);
+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+        memory_listener_unregister(&container->prereg_listener.listener);
+    }
 }
 
 int vfio_mmap_region(Object *obj, VFIORegion *region,
@@ -722,8 +725,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             goto free_container_exit;
         }
 
-        ret = ioctl(fd, VFIO_SET_IOMMU,
-                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
+        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
+        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
         if (ret) {
             error_report("vfio: failed to set iommu for container: %m");
             ret = -errno;
@@ -748,8 +751,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
         if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
             container->iova_pgsizes = info.iova_pgsizes;
         }
-    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
+    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
+               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
         struct vfio_iommu_spapr_tce_info info;
+        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
 
         ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
         if (ret) {
@@ -757,7 +762,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             ret = -errno;
             goto free_container_exit;
         }
-        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
+        container->iommu_type =
+            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
+        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
         if (ret) {
             error_report("vfio: failed to set iommu for container: %m");
             ret = -errno;
@@ -769,11 +776,25 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
          * when container fd is closed so we do not call it explicitly
          * in this file.
          */
-        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
-        if (ret) {
-            error_report("vfio: failed to enable container: %m");
-            ret = -errno;
-            goto free_container_exit;
+        if (!v2) {
+            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
+            if (ret) {
+                error_report("vfio: failed to enable container: %m");
+                ret = -errno;
+                goto free_container_exit;
+            }
+        } else {
+            container->prereg_listener.container = container;
+            container->prereg_listener.listener = vfio_prereg_listener;
+
+            memory_listener_register(&container->prereg_listener.listener,
+                                     &address_space_memory);
+            if (container->error) {
+                error_report("vfio: RAM memory listener initialization failed for container");
+                memory_listener_unregister(
+                    &container->prereg_listener.listener);
+                goto free_container_exit;
+            }
         }
 
         /*
diff --git a/hw/vfio/prereg.c b/hw/vfio/prereg.c
new file mode 100644
index 0000000..66cd696
--- /dev/null
+++ b/hw/vfio/prereg.c
@@ -0,0 +1,138 @@
+/*
+ * DMA memory preregistration
+ *
+ * Authors:
+ *  Alexey Kardashevskiy <aik@ozlabs.ru>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+#include <linux/vfio.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "hw/vfio/vfio.h"
+#include "qemu/error-report.h"
+#include "trace.h"
+
+static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
+{
+    return (!memory_region_is_ram(section->mr) &&
+            !memory_region_is_iommu(section->mr)) ||
+            memory_region_is_skip_dump(section->mr);
+}
+
+static void vfio_prereg_listener_region_add(MemoryListener *listener,
+                                            MemoryRegionSection *section)
+{
+    VFIOMemoryListener *vlistener = container_of(listener, VFIOMemoryListener,
+                                                 listener);
+    VFIOContainer *container = vlistener->container;
+    hwaddr iova;
+    Int128 llend;
+    int ret;
+    hwaddr page_mask = qemu_real_host_page_mask;
+    struct vfio_iommu_spapr_register_memory reg = {
+        .argsz = sizeof(reg),
+        .flags = 0,
+    };
+
+    if (vfio_prereg_listener_skipped_section(section)) {
+        trace_vfio_listener_region_add_skip(
+                section->offset_within_address_space,
+                section->offset_within_address_space +
+                int128_get64(int128_sub(section->size, int128_one())));
+        return;
+    }
+
+    if (unlikely((section->offset_within_address_space & ~page_mask) !=
+                 (section->offset_within_region & ~page_mask))) {
+        error_report("%s received unaligned region", __func__);
+        return;
+    }
+
+    iova = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
+    llend = int128_make64(section->offset_within_address_space);
+    llend = int128_add(llend, section->size);
+    llend = int128_and(llend, int128_exts64(page_mask));
+
+    if (int128_ge(int128_make64(iova), llend)) {
+        return;
+    }
+
+    memory_region_ref(section->mr);
+
+    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
+        section->offset_within_region +
+        (iova - section->offset_within_address_space);
+    reg.size = int128_get64(llend) - iova;
+
+    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
+    trace_vfio_ram_register(reg.vaddr, reg.size, ret ? -errno : 0);
+    if (ret) {
+        /*
+         * On the initfn path, store the first error in the container so we
+         * can gracefully fail.  Runtime, there's not much we can do other
+         * than throw a hardware error.
+         */
+        if (!container->initialized) {
+            if (!container->error) {
+                container->error = ret;
+            }
+        } else {
+            hw_error("vfio: DMA mapping failed, unable to continue");
+        }
+    }
+}
+
+static void vfio_prereg_listener_region_del(MemoryListener *listener,
+                                            MemoryRegionSection *section)
+{
+    VFIOMemoryListener *vlistener = container_of(listener, VFIOMemoryListener,
+                                                 listener);
+    VFIOContainer *container = vlistener->container;
+    hwaddr iova, end;
+    int ret;
+    hwaddr page_mask = qemu_real_host_page_mask;
+    struct vfio_iommu_spapr_register_memory reg = {
+        .argsz = sizeof(reg),
+        .flags = 0,
+    };
+
+    if (vfio_prereg_listener_skipped_section(section)) {
+        trace_vfio_listener_region_del_skip(
+                section->offset_within_address_space,
+                section->offset_within_address_space +
+                int128_get64(int128_sub(section->size, int128_one())));
+        return;
+    }
+
+    if (unlikely((section->offset_within_address_space & ~page_mask) !=
+                 (section->offset_within_region & ~page_mask))) {
+        error_report("%s received unaligned region", __func__);
+        return;
+    }
+
+    iova = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
+    end = (section->offset_within_address_space + int128_get64(section->size)) &
+        page_mask;
+
+    if (iova >= end) {
+        return;
+    }
+
+    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
+        section->offset_within_region +
+        (iova - section->offset_within_address_space);
+    reg.size = end - iova;
+
+    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
+    trace_vfio_ram_unregister(reg.vaddr, reg.size, ret ? -errno : 0);
+}
+
+const MemoryListener vfio_prereg_listener = {
+    .region_add = vfio_prereg_listener_region_add,
+    .region_del = vfio_prereg_listener_region_del,
+};
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index b6b736c..bcbc5cb 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -70,6 +70,8 @@ typedef struct VFIOContainer {
     VFIOAddressSpace *space;
     int fd; /* /dev/vfio/vfio, empowered by the attached groups */
     VFIOMemoryListener iommu_listener;
+    VFIOMemoryListener prereg_listener;
+    unsigned iommu_type;
     int error;
     bool initialized;
     /*
@@ -146,4 +148,6 @@ extern const MemoryRegionOps vfio_region_ops;
 extern QLIST_HEAD(vfio_group_head, VFIOGroup) vfio_group_list;
 extern QLIST_HEAD(vfio_as_head, VFIOAddressSpace) vfio_address_spaces;
 
+extern const MemoryListener vfio_prereg_listener;
+
 #endif /* !HW_VFIO_VFIO_COMMON_H */
diff --git a/trace-events b/trace-events
index 4b6ea70..f5335ec 100644
--- a/trace-events
+++ b/trace-events
@@ -1725,6 +1725,8 @@ vfio_disconnect_container(int fd) "close container->fd=%d"
 vfio_put_group(int fd) "close group->fd=%d"
 vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions, unsigned int num_irqs) "Device %s flags: %u, regions: %u, irqs: %u"
 vfio_put_base_device(int fd) "close vdev->fd=%d"
+vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
+vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
 
 # hw/vfio/platform.c
 vfio_platform_populate_regions(int region_index, unsigned long flag, unsigned long size, int fd, unsigned long offset) "- region %d flags = 0x%lx, size = 0x%lx, fd= %d, offset = 0x%lx"
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [Qemu-devel] [PATCH qemu v13 12/16] vmstate: Define VARRAY with VMS_ALLOC
  2016-03-01  9:10 [Qemu-devel] [PATCH qemu v13 00/16] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (10 preceding siblings ...)
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 11/16] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
@ 2016-03-01  9:10 ` Alexey Kardashevskiy
  2016-03-03  6:31   ` [Qemu-devel] [Qemu-ppc] " David Gibson
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 13/16] spapr_iommu: Remove need_vfio flag from sPAPRTCETable Alexey Kardashevskiy
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-01  9:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc

This allows dynamic allocation for migrating arrays.

The existing VMSTATE_VARRAY_UINT32 requires the array to be
pre-allocated; however, there are cases when the size is not known in
advance and there is no real need to enforce it.

This defines another variant of VMSTATE_VARRAY_UINT32 with the VMS_ALLOC
flag, which tells the receiving side to allocate memory for the array
before receiving the data.

The first user of it is a dynamic DMA window, whose existence and size
are totally dynamic.
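
As a rough illustration of the difference (a standalone sketch, not QEMU
code: DemoState and demo_load_varray() are hypothetical names, and the
real logic lives in the generic vmstate loader), the receiving side can
either require the destination array to exist already or allocate it
from the migrated element count:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical device state: 'table' stays NULL until it is needed. */
typedef struct {
    uint32_t nb_table;   /* element count, known before the array data */
    uint64_t *table;     /* array whose size is not known in advance */
} DemoState;

/*
 * Copy 'count' elements from 'wire' into s->table.
 * alloc == 0: the destination must already exist (pre-allocated case).
 * alloc != 0: the loader allocates the destination first (VMS_ALLOC-like).
 * Returns 0 on success, -1 on failure.
 */
int demo_load_varray(DemoState *s, const uint64_t *wire, uint32_t count,
                     int alloc)
{
    assert(s != NULL);

    if (alloc) {
        /* Drop any stale array and size the new one from the count. */
        free(s->table);
        s->table = count ? calloc(count, sizeof(*s->table)) : NULL;
        if (count && !s->table) {
            return -1;
        }
    } else if (count && !s->table) {
        /* Nothing to load into: the caller had to pre-allocate. */
        return -1;
    }

    if (count) {
        memcpy(s->table, wire, count * sizeof(*s->table));
    }
    s->nb_table = count;
    return 0;
}
```

With alloc == 0 a state whose table is still NULL cannot be loaded; with
alloc != 0 the same incoming data loads fine, which is the situation a
not-yet-created DMA window is in on the destination.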

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Thomas Huth <thuth@redhat.com>
---
 include/migration/vmstate.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 84ee355..1622638 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -386,6 +386,16 @@ extern const VMStateInfo vmstate_info_bitmap;
     .offset     = vmstate_offset_pointer(_state, _field, _type),     \
 }
 
+#define VMSTATE_VARRAY_UINT32_ALLOC(_field, _state, _field_num, _version, _info, _type) {\
+    .name       = (stringify(_field)),                               \
+    .version_id = (_version),                                        \
+    .num_offset = vmstate_offset_value(_state, _field_num, uint32_t),\
+    .info       = &(_info),                                          \
+    .size       = sizeof(_type),                                     \
+    .flags      = VMS_VARRAY_UINT32|VMS_POINTER|VMS_ALLOC,           \
+    .offset     = vmstate_offset_pointer(_state, _field, _type),     \
+}
+
 #define VMSTATE_VARRAY_UINT16_UNSAFE(_field, _state, _field_num, _version, _info, _type) {\
     .name       = (stringify(_field)),                               \
     .version_id = (_version),                                        \
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [Qemu-devel] [PATCH qemu v13 13/16] spapr_iommu: Remove need_vfio flag from sPAPRTCETable
  2016-03-01  9:10 [Qemu-devel] [PATCH qemu v13 00/16] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (11 preceding siblings ...)
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 12/16] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
@ 2016-03-01  9:10 ` Alexey Kardashevskiy
  2016-03-03  6:38   ` [Qemu-devel] [Qemu-ppc] " David Gibson
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 14/16] spapr_pci: Add and export DMA resetting helper Alexey Kardashevskiy
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-01  9:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc

sPAPRTCETable has a need_vfio flag which is passed to
kvmppc_create_spapr_tce() and controls whether to create a guest view
table in KVM, as this depends on the host kernel's ability to accelerate
H_PUT_TCE for VFIO devices. We used to set this flag at the moment
when sPAPRTCETable is created in spapr_tce_new_table() and
use it when the table is allocated in spapr_tce_table_realize().

Now we explicitly enable/disable DMA windows via spapr_tce_table_enable()
and spapr_tce_table_disable() and can pass this flag directly without
caching it in sPAPRTCETable.

This removes the flag. This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/spapr_iommu.c   | 13 +++++--------
 include/hw/ppc/spapr.h |  1 -
 2 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 4c52cf4..8aa2238 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -210,8 +210,9 @@ void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio)
 {
     size_t table_size = tcet->nb_table * sizeof(uint64_t);
     void *newtable;
+    bool tcet_can_vfio = tcet->fd < 0;
 
-    if (need_vfio == tcet->need_vfio) {
+    if (need_vfio == tcet_can_vfio) {
         /* Nothing to do */
         return;
     }
@@ -222,8 +223,6 @@ void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio)
         return;
     }
 
-    tcet->need_vfio = true;
-
     if (tcet->fd < 0) {
         /* Table is already in userspace, nothing to be do */
         return;
@@ -261,7 +260,7 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn)
     return tcet;
 }
 
-static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
+static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool need_vfio)
 {
     if (!tcet->nb_table) {
         return;
@@ -271,7 +270,7 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
                                         tcet->page_shift,
                                         tcet->nb_table,
                                         &tcet->fd,
-                                        tcet->need_vfio);
+                                        need_vfio);
 
     memory_region_set_size(&tcet->iommu,
                            (uint64_t)tcet->nb_table << tcet->page_shift);
@@ -291,9 +290,8 @@ void spapr_tce_table_enable(sPAPRTCETable *tcet,
     tcet->bus_offset = bus_offset;
     tcet->page_shift = page_shift;
     tcet->nb_table = nb_table;
-    tcet->need_vfio = need_vfio;
 
-    spapr_tce_table_do_enable(tcet);
+    spapr_tce_table_do_enable(tcet, need_vfio);
 }
 
 void spapr_tce_table_disable(sPAPRTCETable *tcet)
@@ -312,7 +310,6 @@ void spapr_tce_table_disable(sPAPRTCETable *tcet)
     tcet->bus_offset = 0;
     tcet->page_shift = 0;
     tcet->nb_table = 0;
-    tcet->need_vfio = false;
 }
 
 static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 5d2f8f4..505cb3a 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -546,7 +546,6 @@ struct sPAPRTCETable {
     uint32_t page_shift;
     uint64_t *table;
     bool bypass;
-    bool need_vfio;
     int fd;
     MemoryRegion root, iommu;
     struct VIOsPAPRDevice *vdev; /* for @bypass migration compatibility only */
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [Qemu-devel] [PATCH qemu v13 14/16] spapr_pci: Add and export DMA resetting helper
  2016-03-01  9:10 [Qemu-devel] [PATCH qemu v13 00/16] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (12 preceding siblings ...)
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 13/16] spapr_iommu: Remove need_vfio flag from sPAPRTCETable Alexey Kardashevskiy
@ 2016-03-01  9:10 ` Alexey Kardashevskiy
  2016-03-03  6:39   ` [Qemu-devel] [Qemu-ppc] " David Gibson
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 15/16] vfio: Move iova_pgsizes from container to guest IOMMU Alexey Kardashevskiy
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 16/16] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
  15 siblings, 1 reply; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-01  9:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc

This will be later used by the "ibm,reset-pe-dma-window" RTAS handler
which resets the DMA configuration to the defaults.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/spapr_pci.c          | 11 ++++++++---
 include/hw/pci-host/spapr.h |  2 ++
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index b0cd148..4c6e687 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1441,10 +1441,8 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
     return 0;
 }
 
-static void spapr_phb_reset(DeviceState *qdev)
+void spapr_phb_dma_reset(sPAPRPHBState *sphb)
 {
-    sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
-
     spapr_phb_dma_window_disable(sphb, sphb->dma_liobn);
 
     /* Register default 32bit DMA window */
@@ -1452,6 +1450,13 @@ static void spapr_phb_reset(DeviceState *qdev)
                                 SPAPR_TCE_PAGE_SHIFT,
                                 sphb->dma_win_addr,
                                 sphb->dma_win_size);
+}
+
+static void spapr_phb_reset(DeviceState *qdev)
+{
+    sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
+
+    spapr_phb_dma_reset(sphb);
 
     /* Reset the IOMMU state */
     object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL);
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 03ee006..7848366 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -147,4 +147,6 @@ static inline void spapr_phb_vfio_reset(DeviceState *qdev)
 }
 #endif
 
+void spapr_phb_dma_reset(sPAPRPHBState *sphb);
+
 #endif /* __HW_SPAPR_PCI_H__ */
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [Qemu-devel] [PATCH qemu v13 15/16] vfio: Move iova_pgsizes from container to guest IOMMU
  2016-03-01  9:10 [Qemu-devel] [PATCH qemu v13 00/16] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (13 preceding siblings ...)
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 14/16] spapr_pci: Add and export DMA resetting helper Alexey Kardashevskiy
@ 2016-03-01  9:10 ` Alexey Kardashevskiy
  2016-03-03 11:22   ` [Qemu-devel] [Qemu-ppc] " David Gibson
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 16/16] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
  15 siblings, 1 reply; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-01  9:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc

The page size is an attribute of an IOMMU, not of a container, as a
container may contain more than one IOMMU.

This moves iova_pgsizes from VFIOContainer to VFIOGuestIOMMU.
The following patch will use this.

This removes iova_pgsizes from Type1 IOMMU as it is not used there anyway;
when Type1 gets a guest-visible IOMMU, it will use VFIOGuestIOMMU's
iova_pgsizes.
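
For reference, the replay granularity is derived from the page size mask
by taking its lowest set bit, as vfio_container_granularity() does with
ctz64(). A standalone sketch of that computation (demo_ctz64() and
demo_iommu_granularity() are hypothetical stand-ins, not QEMU helpers):

```c
#include <assert.h>
#include <stdint.h>

/* Count trailing zero bits; the mask must have at least one bit set. */
static unsigned demo_ctz64(uint64_t mask)
{
    unsigned n = 0;

    assert(mask != 0);
    while (!(mask & 1)) {
        mask >>= 1;
        n++;
    }
    return n;
}

/*
 * Smallest page size advertised in a page size bitmask, where bit N set
 * means that pages of (1 << N) bytes are supported.
 */
uint64_t demo_iommu_granularity(uint64_t iova_pgsizes)
{
    return (uint64_t)1 << demo_ctz64(iova_pgsizes);
}
```

Since every guest IOMMU carries its own mask, two windows in the same
container (for example a 4K one and a 64K one) each get their own
granularity.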

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/vfio/common.c              | 16 ++++------------
 include/hw/vfio/vfio-common.h |  2 +-
 2 files changed, 5 insertions(+), 13 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index f2a03e0..42ef1eb 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -313,9 +313,9 @@ out:
     rcu_read_unlock();
 }
 
-static hwaddr vfio_container_granularity(VFIOContainer *container)
+static hwaddr vfio_container_granularity(VFIOGuestIOMMU *giommu)
 {
-    return (hwaddr)1 << ctz64(container->iova_pgsizes);
+    return (hwaddr)1 << ctz64(giommu->iova_pgsizes);
 }
 
 static hwaddr vfio_iommu_page_mask(MemoryRegion *mr)
@@ -392,12 +392,13 @@ static void vfio_listener_region_add(VFIOMemoryListener *vlistener,
             section->offset_within_address_space;
         giommu->container = container;
         giommu->n.notify = vfio_iommu_map_notify;
+        giommu->iova_pgsizes = section->mr->iommu_ops->get_page_sizes(section->mr);
         QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
 
         memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
         giommu->iommu->iommu_ops->vfio_notify(section->mr, true);
         memory_region_iommu_replay(giommu->iommu, &giommu->n,
-                                   vfio_container_granularity(container),
+                                   vfio_container_granularity(giommu),
                                    false);
 
         return;
@@ -743,14 +744,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
         container->min_iova = 0;
         container->max_iova = (hwaddr)-1;
 
-        /* Assume just 4K IOVA page size */
-        container->iova_pgsizes = 0x1000;
         info.argsz = sizeof(info);
         ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
-        /* Ignore errors */
-        if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
-            container->iova_pgsizes = info.iova_pgsizes;
-        }
     } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
                ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
         struct vfio_iommu_spapr_tce_info info;
@@ -811,9 +806,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
         }
         container->min_iova = info.dma32_window_start;
         container->max_iova = container->min_iova + info.dma32_window_size - 1;
-
-        /* Assume just 4K IOVA pages for now */
-        container->iova_pgsizes = 0x1000;
     } else {
         error_report("vfio: No available IOMMU models");
         ret = -EINVAL;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index bcbc5cb..48a1d7f 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -80,7 +80,6 @@ typedef struct VFIOContainer {
      * future
      */
     hwaddr min_iova, max_iova;
-    uint64_t iova_pgsizes;
     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
     QLIST_HEAD(, VFIOGroup) group_list;
     QLIST_ENTRY(VFIOContainer) next;
@@ -90,6 +89,7 @@ typedef struct VFIOGuestIOMMU {
     VFIOContainer *container;
     MemoryRegion *iommu;
     hwaddr offset_within_address_space;
+    uint64_t iova_pgsizes;
     Notifier n;
     QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
 } VFIOGuestIOMMU;
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [Qemu-devel] [PATCH qemu v13 16/16] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-03-01  9:10 [Qemu-devel] [PATCH qemu v13 00/16] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (14 preceding siblings ...)
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 15/16] vfio: Move iova_pgsizes from container to guest IOMMU Alexey Kardashevskiy
@ 2016-03-01  9:10 ` Alexey Kardashevskiy
  2016-03-04  4:51   ` [Qemu-devel] [Qemu-ppc] " David Gibson
  15 siblings, 1 reply; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-01  9:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc

This adds support for the Dynamic DMA Windows (DDW) option defined by
the SPAPR specification, which allows having additional DMA window(s).

This implements DDW for emulated and VFIO devices. As all TCE root regions
are mapped at 0 and 64bit long (and actual tables are child regions),
this replaces memory_region_add_subregion() with _overlap() to make
QEMU memory API happy.

This reserves RTAS token numbers for DDW calls.

This changes the TCE table migration descriptor to support dynamic
tables as from now on, PHB will create as many stub TCE table objects
as PHB can possibly support but not all of them might be initialized at
the time of migration because DDW might or might not be requested by
the guest.

The "ddw" property is enabled by default on a PHB but for compatibility
the pseries-2.5 machine and older disable it.

This implements DDW for VFIO. The host kernel support is required.
This adds a "levels" property to PHB to control the number of levels
in the actual TCE table allocated by the host kernel, 0 is the default
value to tell QEMU to calculate the correct value. Current hardware
supports up to 5 levels.

The existing Linux guests try to create one additional huge DMA window
with 64K or 16MB pages and map the entire guest RAM into it. If this
succeeds, the guest switches to dma_direct_ops and never calls TCE
hypercalls (H_PUT_TCE,...) again. This enables VFIO devices to use the
entire RAM
and not waste time on map/unmap later. This adds a "dma64_win_addr"
property which is a bus address for the 64bit window and by default
set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
uses and this allows having emulated and VFIO devices on the same bus.
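
To give a feel for the sizes involved: the patch sets the 64bit window
size to pow2ceil(ram_size) and computes the number of TCE entries as
window_size >> page_shift. A standalone sketch of that arithmetic
(demo_pow2ceil() and demo_nb_tce() are hypothetical names used only for
illustration, not the QEMU functions):

```c
#include <assert.h>
#include <stdint.h>

/* Round up to the next power of two; value must be in 1..2^63. */
uint64_t demo_pow2ceil(uint64_t value)
{
    uint64_t p = 1;

    assert(value != 0 && value <= (1ULL << 63));
    while (p < value) {
        p <<= 1;
    }
    return p;
}

/*
 * Number of TCE entries needed for a window which covers ram_size bytes
 * (rounded up to a power of two) with pages of (1 << page_shift) bytes.
 */
uint64_t demo_nb_tce(uint64_t ram_size, unsigned page_shift)
{
    assert(page_shift < 64);
    return demo_pow2ceil(ram_size) >> page_shift;
}
```

For example, a guest with 3GB of RAM gets a 4GB window, which takes
65536 entries with 64K pages or 256 entries with 16MB pages.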

This adds 4 RTAS handlers:
* ibm,query-pe-dma-window
* ibm,create-pe-dma-window
* ibm,remove-pe-dma-window
* ibm,reset-pe-dma-window
These are registered from type_init() callback.

These RTAS handlers are implemented in a separate file to avoid polluting
spapr_iommu.c with PCI.

TODO (which I have no idea how to implement properly):
1. check the host kernel actually supports SPAPR_PCI_DMA_MAX_WINDOWS
windows and 12/16/24 page shift;
2. fix container::min_iova, max_iova - as for now, they are useless,
and I'd expect IOMMU MR boundaries to serve this purpose really;
3. vfio_listener_region_add/vfio_listener_region_del explicitly
create/remove the huge DMA window as we do not have vfio_container_ioctl()
anymore; do we want to move these to some sort of callbacks? How, where?

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/Makefile.objs        |   1 +
 hw/ppc/spapr.c              |   7 +-
 hw/ppc/spapr_iommu.c        |  32 ++++-
 hw/ppc/spapr_pci.c          |  61 +++++++--
 hw/ppc/spapr_rtas_ddw.c     | 306 ++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/common.c            |  70 +++++++++-
 include/hw/pci-host/spapr.h |  13 ++
 include/hw/ppc/spapr.h      |  17 ++-
 trace-events                |   6 +
 9 files changed, 489 insertions(+), 24 deletions(-)
 create mode 100644 hw/ppc/spapr_rtas_ddw.c

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index c1ffc77..986b36f 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
 ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
 obj-y += spapr_pci_vfio.o
 endif
+obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
 # PowerPC 4xx boards
 obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
 obj-y += ppc4xx_pci.o
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index e9d4abf..2473217 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2370,7 +2370,12 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
  * pseries-2.5
  */
 #define SPAPR_COMPAT_2_5 \
-        HW_COMPAT_2_5
+        HW_COMPAT_2_5 \
+        {\
+            .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
+            .property = "ddw",\
+            .value    = stringify(off),\
+        },
 
 static void spapr_machine_2_5_instance_options(MachineState *machine)
 {
diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 8aa2238..e32f71b 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -150,6 +150,15 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
     return 1ULL << tcet->page_shift;
 }
 
+static void spapr_tce_table_pre_save(void *opaque)
+{
+    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
+
+    tcet->migtable = tcet->table;
+}
+
+static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel);
+
 static int spapr_tce_table_post_load(void *opaque, int version_id)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
@@ -158,22 +167,39 @@ static int spapr_tce_table_post_load(void *opaque, int version_id)
         spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
     }
 
+    if (tcet->enabled) {
+        if (!tcet->table) {
+            tcet->enabled = false;
+            /* VFIO does not migrate so pass vfio_accel == false */
+            spapr_tce_table_do_enable(tcet, false);
+        }
+        memcpy(tcet->table, tcet->migtable,
+               tcet->nb_table * sizeof(tcet->table[0]));
+        free(tcet->migtable);
+        tcet->migtable = NULL;
+    }
+
     return 0;
 }
 
 static const VMStateDescription vmstate_spapr_tce_table = {
     .name = "spapr_iommu",
-    .version_id = 2,
+    .version_id = 3,
     .minimum_version_id = 2,
+    .pre_save = spapr_tce_table_pre_save,
     .post_load = spapr_tce_table_post_load,
     .fields      = (VMStateField []) {
         /* Sanity check */
         VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
-        VMSTATE_UINT32_EQUAL(nb_table, sPAPRTCETable),
 
         /* IOMMU state */
+        VMSTATE_BOOL_V(enabled, sPAPRTCETable, 3),
+        VMSTATE_UINT64_V(bus_offset, sPAPRTCETable, 3),
+        VMSTATE_UINT32_V(page_shift, sPAPRTCETable, 3),
+        VMSTATE_UINT32(nb_table, sPAPRTCETable),
         VMSTATE_BOOL(bypass, sPAPRTCETable),
-        VMSTATE_VARRAY_UINT32(table, sPAPRTCETable, nb_table, 0, vmstate_info_uint64, uint64_t),
+        VMSTATE_VARRAY_UINT32_ALLOC(migtable, sPAPRTCETable, nb_table, 0,
+                                    vmstate_info_uint64, uint64_t),
 
         VMSTATE_END_OF_LIST()
     },
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 4c6e687..1bc0710 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -803,10 +803,10 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
     return buf;
 }
 
-static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
-                                       uint32_t liobn, uint32_t page_shift,
-                                       uint64_t window_addr,
-                                       uint64_t window_size)
+int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
+                                uint32_t liobn, uint32_t page_shift,
+                                uint64_t window_addr,
+                                uint64_t window_size)
 {
     sPAPRTCETable *tcet;
     uint32_t nb_table = window_size >> page_shift;
@@ -820,12 +820,16 @@ static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
         return -1;
     }
 
+    if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
+        return -1;
+    }
+
     spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table, false);
 
     return 0;
 }
 
-static int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
+int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
 {
     sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
 
@@ -1418,14 +1422,21 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     }
 
     /* DMA setup */
-    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
-    if (!tcet) {
-        error_report("No default TCE table for %s", sphb->dtbusname);
-        return;
-    }
+    sphb->windows_supported = SPAPR_PCI_DMA_MAX_WINDOWS;
+    sphb->page_size_mask = (1ULL << 12) | (1ULL << 16) | (1ULL << 24);
+    sphb->dma64_window_size = pow2ceil(ram_size);
 
-    memory_region_add_subregion(&sphb->iommu_root, 0,
-                                spapr_tce_get_iommu(tcet));
+    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
+        tcet = spapr_tce_new_table(DEVICE(sphb),
+                                   SPAPR_PCI_LIOBN(sphb->index, i));
+        if (!tcet) {
+            error_setg(errp, "Creating window#%d failed for %s",
+                       i, sphb->dtbusname);
+            return;
+        }
+        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
+                                            spapr_tce_get_iommu(tcet), 0);
+    }
 
     sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
 }
@@ -1443,7 +1454,11 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
 
 void spapr_phb_dma_reset(sPAPRPHBState *sphb)
 {
-    spapr_phb_dma_window_disable(sphb, sphb->dma_liobn);
+    int i;
+
+    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
+        spapr_phb_dma_window_disable(sphb, SPAPR_PCI_LIOBN(sphb->index, i));
+    }
 
     /* Register default 32bit DMA window */
     spapr_phb_dma_window_enable(sphb, sphb->dma_liobn,
@@ -1481,6 +1496,9 @@ static Property spapr_phb_properties[] = {
     /* Default DMA window is 0..1GB */
     DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
     DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
+    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
+                       0x800000000000000ULL),
+    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -1734,6 +1752,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     uint32_t interrupt_map_mask[] = {
         cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
     uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
+    uint32_t ddw_applicable[] = {
+        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
+        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
+        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
+    };
+    uint32_t ddw_extensions[] = {
+        cpu_to_be32(1),
+        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
+    };
     sPAPRTCETable *tcet;
     PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
     sPAPRFDT s_fdt;
@@ -1758,6 +1785,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
     _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
 
+    /* Dynamic DMA window */
+    if (phb->ddw_enabled) {
+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
+                         sizeof(ddw_applicable)));
+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
+                         &ddw_extensions, sizeof(ddw_extensions)));
+    }
+
     /* Build the interrupt-map, this must matches what is done
      * in pci_spapr_map_irq
      */
diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
new file mode 100644
index 0000000..b8ea910
--- /dev/null
+++ b/hw/ppc/spapr_rtas_ddw.c
@@ -0,0 +1,306 @@
+/*
+ * QEMU sPAPR Dynamic DMA windows support
+ *
+ * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License,
+ *  or (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/error-report.h"
+#include "hw/ppc/spapr.h"
+#include "hw/pci-host/spapr.h"
+#include "trace.h"
+
+static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
+{
+    sPAPRTCETable *tcet;
+
+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
+    if (tcet && tcet->enabled) {
+        ++*(unsigned *)opaque;
+    }
+    return 0;
+}
+
+static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
+{
+    unsigned ret = 0;
+
+    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
+
+    return ret;
+}
+
+static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
+{
+    sPAPRTCETable *tcet;
+
+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
+    if (tcet && !tcet->enabled) {
+        *(uint32_t *)opaque = tcet->liobn;
+        return 1;
+    }
+    return 0;
+}
+
+static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
+{
+    uint32_t liobn = 0;
+
+    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
+
+    return liobn;
+}
+
+static uint32_t spapr_query_mask(struct ppc_one_seg_page_size *sps,
+                                 uint64_t page_mask)
+{
+    int i, j;
+    uint32_t mask = 0;
+    const struct { int shift; uint32_t mask; } masks[] = {
+        { 12, RTAS_DDW_PGSIZE_4K },
+        { 16, RTAS_DDW_PGSIZE_64K },
+        { 24, RTAS_DDW_PGSIZE_16M },
+        { 25, RTAS_DDW_PGSIZE_32M },
+        { 26, RTAS_DDW_PGSIZE_64M },
+        { 27, RTAS_DDW_PGSIZE_128M },
+        { 28, RTAS_DDW_PGSIZE_256M },
+        { 34, RTAS_DDW_PGSIZE_16G },
+    };
+
+    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
+        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
+            if ((sps[i].page_shift == masks[j].shift) &&
+                    (page_mask & (1ULL << masks[j].shift))) {
+                mask |= masks[j].mask;
+            }
+        }
+    }
+
+    return mask;
+}
+
+static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
+                                         sPAPRMachineState *spapr,
+                                         uint32_t token, uint32_t nargs,
+                                         target_ulong args,
+                                         uint32_t nret, target_ulong rets)
+{
+    CPUPPCState *env = &cpu->env;
+    sPAPRPHBState *sphb;
+    uint64_t buid;
+    uint32_t avail, addr, pgmask = 0;
+    unsigned current;
+
+    if ((nargs != 3) || (nret != 5)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    current = spapr_phb_get_active_win_num(sphb);
+    avail = (sphb->windows_supported > current) ?
+            (sphb->windows_supported - current) : 0;
+
+    /* Work out supported page masks */
+    pgmask = spapr_query_mask(env->sps.sps, sphb->page_size_mask);
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    rtas_st(rets, 1, avail);
+
+    /*
+     * This is "Largest contiguous block of TCEs allocated specifically
+     * for (that is, are reserved for) this PE".
+     * Return the maximum number as if all RAM was in 4K pages.
+     */
+    rtas_st(rets, 2, sphb->dma64_window_size >> SPAPR_TCE_PAGE_SHIFT);
+    rtas_st(rets, 3, pgmask);
+    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
+
+    trace_spapr_iommu_ddw_query(buid, addr, avail, sphb->dma64_window_size,
+                                pgmask);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
+                                          sPAPRMachineState *spapr,
+                                          uint32_t token, uint32_t nargs,
+                                          target_ulong args,
+                                          uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    sPAPRTCETable *tcet = NULL;
+    uint32_t addr, page_shift, window_shift, liobn;
+    uint64_t buid;
+    long ret;
+
+    if ((nargs != 5) || (nret != 4)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    page_shift = rtas_ld(args, 3);
+    window_shift = rtas_ld(args, 4);
+    liobn = spapr_phb_get_free_liobn(sphb);
+
+    if (!liobn || !(sphb->page_size_mask & (1ULL << page_shift))) {
+        goto hw_error_exit;
+    }
+
+    if (window_shift < page_shift) {
+        goto param_error_exit;
+    }
+
+    ret = spapr_phb_dma_window_enable(sphb, liobn, page_shift,
+                                      sphb->dma64_window_addr,
+                                      1ULL << window_shift);
+    tcet = spapr_tce_find_by_liobn(liobn);
+    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
+                                 1ULL << window_shift,
+                                 tcet ? tcet->bus_offset : 0xbaadf00d,
+                                 liobn, ret);
+    if (ret || !tcet) {
+        goto hw_error_exit;
+    }
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    rtas_st(rets, 1, liobn);
+    rtas_st(rets, 2, tcet->bus_offset >> 32);
+    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
+
+    return;
+
+hw_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
+                                          sPAPRMachineState *spapr,
+                                          uint32_t token, uint32_t nargs,
+                                          target_ulong args,
+                                          uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    sPAPRTCETable *tcet;
+    uint32_t liobn;
+    long ret;
+
+    if ((nargs != 1) || (nret != 1)) {
+        goto param_error_exit;
+    }
+
+    liobn = rtas_ld(args, 0);
+    tcet = spapr_tce_find_by_liobn(liobn);
+    if (!tcet) {
+        goto param_error_exit;
+    }
+
+    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    ret = spapr_phb_dma_window_disable(sphb, liobn);
+    trace_spapr_iommu_ddw_remove(liobn, ret);
+    if (ret) {
+        goto hw_error_exit;
+    }
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    return;
+
+hw_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
+                                         sPAPRMachineState *spapr,
+                                         uint32_t token, uint32_t nargs,
+                                         target_ulong args,
+                                         uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    uint64_t buid;
+    uint32_t addr;
+    long ret = 0;
+
+    if ((nargs != 3) || (nret != 1)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    spapr_phb_dma_reset(sphb);
+    trace_spapr_iommu_ddw_reset(buid, addr, ret);
+    if (ret) {
+        goto hw_error_exit;
+    }
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+
+    return;
+
+hw_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void spapr_rtas_ddw_init(void)
+{
+    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
+                        "ibm,query-pe-dma-window",
+                        rtas_ibm_query_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
+                        "ibm,create-pe-dma-window",
+                        rtas_ibm_create_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
+                        "ibm,remove-pe-dma-window",
+                        rtas_ibm_remove_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
+                        "ibm,reset-pe-dma-window",
+                        rtas_ibm_reset_pe_dma_window);
+}
+
+type_init(spapr_rtas_ddw_init)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 42ef1eb..2332f8e 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -395,6 +395,39 @@ static void vfio_listener_region_add(VFIOMemoryListener *vlistener,
         giommu->iova_pgsizes = section->mr->iommu_ops->get_page_sizes(section->mr);
         QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
 
+        if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+            int ret;
+            struct vfio_iommu_spapr_tce_create create = {
+                .argsz = sizeof(create),
+                .page_shift = ctz64(giommu->iova_pgsizes),
+                .window_size = memory_region_size(section->mr),
+                .levels = 0,
+                .start_addr = 0,
+            };
+
+            /*
+             * Dynamic windows are supported, that means that there is no
+             * pre-created window and we have to create one.
+             */
+            if (!create.levels) {
+                unsigned entries = create.window_size >> create.page_shift;
+                unsigned pages = (entries * sizeof(uint64_t)) / getpagesize();
+                /* 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4 */
+                create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;
+            }
+            ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
+            if (ret) {
+                error_report("Failed to create a window");
+            }
+
+            if (create.start_addr != section->offset_within_address_space) {
+                error_report("Something went wrong!");
+            }
+            trace_vfio_spapr_create_window(create.page_shift,
+                                           create.window_size,
+                                           create.start_addr);
+        }
+
         memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
         giommu->iommu->iommu_ops->vfio_notify(section->mr, true);
         memory_region_iommu_replay(giommu->iommu, &giommu->n,
@@ -500,6 +533,18 @@ static void vfio_listener_region_del(VFIOMemoryListener *vlistener,
                      container, iova, end - iova, ret);
     }
 
+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+        struct vfio_iommu_spapr_tce_remove remove = {
+            .argsz = sizeof(remove),
+            .start_addr = section->offset_within_address_space,
+        };
+        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
+        if (ret) {
+            error_report("Failed to remove window");
+        }
+
+        trace_vfio_spapr_remove_window(remove.start_addr);
+    }
     if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_notify) {
         iommu->iommu_ops->vfio_notify(section->mr, false);
     }
@@ -792,11 +837,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             }
         }
 
-        /*
-         * This only considers the host IOMMU's 32-bit window.  At
-         * some point we need to add support for the optional 64-bit
-         * window and dynamic windows
-         */
         info.argsz = sizeof(info);
         ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
         if (ret) {
@@ -805,7 +845,25 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             goto free_container_exit;
         }
         container->min_iova = info.dma32_window_start;
-        container->max_iova = container->min_iova + info.dma32_window_size - 1;
+        container->max_iova = (hwaddr)-1;
+
+        if (v2) {
+            /*
+             * There is a default window in just created container.
+             * To make region_add/del happy, we better remove this window now
+             * and let those iommu_listener callbacks create them when needed.
+             */
+            struct vfio_iommu_spapr_tce_remove remove = {
+                .argsz = sizeof(remove),
+                .start_addr = info.dma32_window_start,
+            };
+            ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
+            if (ret) {
+                error_report("vfio: VFIO_IOMMU_SPAPR_TCE_REMOVE failed: %m");
+                ret = -errno;
+                goto free_container_exit;
+            }
+        }
     } else {
         error_report("vfio: No available IOMMU models");
         ret = -EINVAL;
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 7848366..855e458 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -71,6 +71,12 @@ struct sPAPRPHBState {
     spapr_pci_msi_mig *msi_devs;
 
     QLIST_ENTRY(sPAPRPHBState) list;
+
+    bool ddw_enabled;
+    uint32_t windows_supported;
+    uint64_t page_size_mask;
+    uint64_t dma64_window_addr;
+    uint64_t dma64_window_size;
 };
 
 #define SPAPR_PCI_MAX_INDEX          255
@@ -89,6 +95,8 @@ struct sPAPRPHBState {
 
 #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
 
+#define SPAPR_PCI_DMA_MAX_WINDOWS    2
+
 static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb, int pin)
 {
     sPAPRMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
@@ -148,5 +156,10 @@ static inline void spapr_phb_vfio_reset(DeviceState *qdev)
 #endif
 
 void spapr_phb_dma_reset(sPAPRPHBState *sphb);
+int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
+                                uint32_t liobn, uint32_t page_shift,
+                                uint64_t window_addr,
+                                uint64_t window_size);
+int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn);
 
 #endif /* __HW_SPAPR_PCI_H__ */
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 505cb3a..4f59d1b 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -417,6 +417,16 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
 #define RTAS_OUT_NOT_AUTHORIZED                 -9002
 #define RTAS_OUT_SYSPARM_PARAM_ERROR            -9999
 
+/* DDW pagesize mask values from ibm,query-pe-dma-window */
+#define RTAS_DDW_PGSIZE_4K       0x01
+#define RTAS_DDW_PGSIZE_64K      0x02
+#define RTAS_DDW_PGSIZE_16M      0x04
+#define RTAS_DDW_PGSIZE_32M      0x08
+#define RTAS_DDW_PGSIZE_64M      0x10
+#define RTAS_DDW_PGSIZE_128M     0x20
+#define RTAS_DDW_PGSIZE_256M     0x40
+#define RTAS_DDW_PGSIZE_16G      0x80
+
 /* RTAS tokens */
 #define RTAS_TOKEN_BASE      0x2000
 
@@ -458,8 +468,12 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
 #define RTAS_IBM_SET_SLOT_RESET                 (RTAS_TOKEN_BASE + 0x23)
 #define RTAS_IBM_CONFIGURE_PE                   (RTAS_TOKEN_BASE + 0x24)
 #define RTAS_IBM_SLOT_ERROR_DETAIL              (RTAS_TOKEN_BASE + 0x25)
+#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x26)
+#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x27)
+#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x28)
+#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x29)
 
-#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x26)
+#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x2A)
 
 /* RTAS ibm,get-system-parameter token values */
 #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
@@ -545,6 +559,7 @@ struct sPAPRTCETable {
     uint64_t bus_offset;
     uint32_t page_shift;
     uint64_t *table;
+    uint64_t *migtable;
     bool bypass;
     int fd;
     MemoryRegion root, iommu;
diff --git a/trace-events b/trace-events
index f5335ec..c7314b6 100644
--- a/trace-events
+++ b/trace-events
@@ -1432,6 +1432,10 @@ spapr_iommu_pci_indirect(uint64_t liobn, uint64_t ioba, uint64_t tce, uint64_t i
 spapr_iommu_pci_stuff(uint64_t liobn, uint64_t ioba, uint64_t tce_value, uint64_t npages, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcevalue=0x%"PRIx64" npages=%"PRId64" ret=%"PRId64
 spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, unsigned pgsize) "liobn=%"PRIx64" 0x%"PRIx64" -> 0x%"PRIx64" perm=%u mask=%x"
 spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
+spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32
+spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, uint64_t pg_size, uint64_t req_size, uint64_t start, uint32_t liobn, long ret) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%"PRIx64", requested=0x%"PRIx64", start addr=%"PRIx64", liobn=%"PRIx32", ret = %ld"
+spapr_iommu_ddw_remove(uint32_t liobn, long ret) "liobn=%"PRIx32", ret = %ld"
+spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr, long ret) "buid=%"PRIx64" addr=%"PRIx32", ret = %ld"
 
 # hw/ppc/ppc.c
 ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
@@ -1727,6 +1731,8 @@ vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions,
 vfio_put_base_device(int fd) "close vdev->fd=%d"
 vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
 vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
+vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
+vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
 
 # hw/vfio/platform.c
 vfio_platform_populate_regions(int region_index, unsigned long flag, unsigned long size, int fd, unsigned long offset) "- region %d flags = 0x%lx, size = 0x%lx, fd= %d, offset = 0x%lx"
-- 
2.5.0.rc3
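
For readers checking the multilevel TCE arithmetic in the hw/vfio/common.c
hunk above, here is a minimal standalone sketch that reproduces the ranges
listed in the code comment (0..64 pages = 1 level, 65..4096 = 2, and so on).
The helper names log2_ceil(), tce_table_pages() and tce_levels() are
hypothetical and not part of QEMU, and the host page size is passed in
explicitly instead of calling getpagesize(). Note that pow2ceil(pages) - 1
has all of its low bits set, so counting its trailing zeros does not yield
a log2; the sketch computes the log2 explicitly instead.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest n such that (1 << n) >= v; returns 0 for v <= 1. */
static unsigned log2_ceil(uint64_t v)
{
    unsigned n = 0;

    while (n < 63 && (1ULL << n) < v) {
        n++;
    }
    return n;
}

/* Hypothetical helper: host pages needed for a flat table of 8-byte TCEs. */
static uint64_t tce_table_pages(uint64_t window_size, unsigned page_shift,
                                uint64_t host_page_size)
{
    uint64_t entries = window_size >> page_shift;

    return entries * sizeof(uint64_t) / host_page_size;
}

/*
 * Hypothetical helper following the ranges in the patch comment:
 * 0..64 pages -> 1 level, 65..4096 -> 2, 4097..262144 -> 3,
 * then one more level per further factor of 64.
 */
static unsigned tce_levels(uint64_t pages)
{
    unsigned bits = log2_ceil(pages);

    return bits ? (bits - 1) / 6 + 1 : 1;
}
```

For example, a 1TB window with 64K IOMMU pages on a host with 64K pages
needs 2048 table pages, which this sketch places in two levels.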


* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 01/16] memory: Fix IOMMU replay base address
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 01/16] memory: Fix IOMMU replay base address Alexey Kardashevskiy
@ 2016-03-03  1:34   ` David Gibson
  0 siblings, 0 replies; 50+ messages in thread
From: David Gibson @ 2016-03-03  1:34 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On Tue, Mar 01, 2016 at 08:10:26PM +1100, Alexey Kardashevskiy wrote:
> Since a788f227 "memory: Allow replay of IOMMU mapping notifications",
> when a new VFIO listener is added, all existing IOMMU mappings are
> replayed. However, the base address of an IOMMU memory region (IOMMU MR)
> is ignored. This is not a problem for the existing user (pseries) with
> its default 32bit DMA window starting at 0, but it is a problem if there
> is another DMA window.
> 
> This stores the IOMMU's offset_within_address_space and adjusts
> the IOVA before calling vfio_dma_map/vfio_dma_unmap.
> 
> As the IOMMU notifier expects IOVA offset rather than the absolute
> address, this also adjusts IOVA in sPAPR H_PUT_TCE handler before
> calling notifier(s).
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

It's not worth reworking just for this, but it might be slightly
preferable for merge purposes to split out the fix to put_tce_emu
(spapr code) away from the other changes (vfio code).
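
For anyone following the address arithmetic, the two conversions described
in the commit message can be sketched standalone as below. This is only an
illustration: struct window and both function names are hypothetical
stand-ins, not QEMU types or API, and the window base is assumed to equal
the offset at which the IOMMU MR is mapped.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a DMA window; not a QEMU type. */
struct window {
    uint64_t bus_offset;   /* where the window starts on the PCI bus */
    uint32_t page_shift;   /* log2 of the IOMMU page size */
};

/*
 * H_PUT_TCE side: turn an absolute bus address into a page-aligned
 * offset inside the window, which is what the notifier expects.
 */
static uint64_t bus_to_window_offset(const struct window *w, uint64_t ioba)
{
    uint64_t page_mask = ~((1ULL << w->page_shift) - 1);

    return (ioba - w->bus_offset) & page_mask;
}

/* VFIO side: add the window base back before mapping or unmapping. */
static uint64_t window_offset_to_bus(const struct window *w, uint64_t iova)
{
    return iova + w->bus_offset;
}
```

With a window based at 0 both functions leave the page address unchanged,
which is why the missing adjustment went unnoticed with only the default
32bit window.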

> ---
>  hw/ppc/spapr_iommu.c          |  2 +-
>  hw/vfio/common.c              | 14 ++++++++------
>  include/hw/vfio/vfio-common.h |  1 +
>  3 files changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 7dd4588..277f289 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -277,7 +277,7 @@ static target_ulong put_tce_emu(sPAPRTCETable *tcet, target_ulong ioba,
>      tcet->table[index] = tce;
>  
>      entry.target_as = &address_space_memory,
> -    entry.iova = ioba & page_mask;
> +    entry.iova = (ioba - tcet->bus_offset) & page_mask;
>      entry.translated_addr = tce & page_mask;
>      entry.addr_mask = ~page_mask;
>      entry.perm = spapr_tce_iommu_access_flags(tce);
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 55e87d3..9bf4c3b 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -257,14 +257,14 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
>      VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>      VFIOContainer *container = giommu->container;
>      IOMMUTLBEntry *iotlb = data;
> +    hwaddr iova = iotlb->iova + giommu->offset_within_address_space;
>      MemoryRegion *mr;
>      hwaddr xlat;
>      hwaddr len = iotlb->addr_mask + 1;
>      void *vaddr;
>      int ret;
>  
> -    trace_vfio_iommu_map_notify(iotlb->iova,
> -                                iotlb->iova + iotlb->addr_mask);
> +    trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
>  
>      /*
>       * The IOMMU TLB entry we have just covers translation through
> @@ -291,21 +291,21 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
>  
>      if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
>          vaddr = memory_region_get_ram_ptr(mr) + xlat;
> -        ret = vfio_dma_map(container, iotlb->iova,
> +        ret = vfio_dma_map(container, iova,
>                             iotlb->addr_mask + 1, vaddr,
>                             !(iotlb->perm & IOMMU_WO) || mr->readonly);
>          if (ret) {
>              error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>                           "0x%"HWADDR_PRIx", %p) = %d (%m)",
> -                         container, iotlb->iova,
> +                         container, iova,
>                           iotlb->addr_mask + 1, vaddr, ret);
>          }
>      } else {
> -        ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
> +        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
>          if (ret) {
>              error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>                           "0x%"HWADDR_PRIx") = %d (%m)",
> -                         container, iotlb->iova,
> +                         container, iova,
>                           iotlb->addr_mask + 1, ret);
>          }
>      }
> @@ -377,6 +377,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
>           */
>          giommu = g_malloc0(sizeof(*giommu));
>          giommu->iommu = section->mr;
> +        giommu->offset_within_address_space =
> +            section->offset_within_address_space;
>          giommu->container = container;
>          giommu->n.notify = vfio_iommu_map_notify;
>          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index f037f3c..9ffa681 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -80,6 +80,7 @@ typedef struct VFIOContainer {
>  typedef struct VFIOGuestIOMMU {
>      VFIOContainer *container;
>      MemoryRegion *iommu;
> +    hwaddr offset_within_address_space;
>      Notifier n;
>      QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>  } VFIOGuestIOMMU;

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 02/16] spapr_pci: Move DMA window enablement to a helper
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 02/16] spapr_pci: Move DMA window enablement to a helper Alexey Kardashevskiy
@ 2016-03-03  1:40   ` David Gibson
  2016-03-10  5:47     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 50+ messages in thread
From: David Gibson @ 2016-03-03  1:40 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On Tue, Mar 01, 2016 at 08:10:27PM +1100, Alexey Kardashevskiy wrote:
> We are going to have multiple DMA windows soon so let's start preparing.
> 
> This adds a new helper to create a DMA window and makes use of it in
> sPAPRPHBState::realize().
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  hw/ppc/spapr_pci.c | 40 +++++++++++++++++++++++++++-------------
>  1 file changed, 27 insertions(+), 13 deletions(-)
> 
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 3d1145e..248f20a 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -803,6 +803,29 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
>      return buf;
>  }
>  
> +static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> +                                       uint32_t liobn, uint32_t page_shift,
> +                                       uint64_t window_addr,
> +                                       uint64_t window_size)
> +{
> +    sPAPRTCETable *tcet;
> +    uint32_t nb_table = window_size >> page_shift;
> +
> +    if (!nb_table) {
> +        return -1;
> +    }

The caller shouldn't do this, so this probably makes more sense as an
assert() than an error return.
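
A rough standalone sketch of the assert() variant, with the table setup
left out. The helper name window_entries() is hypothetical, and the second
assertion is an extra assumption guarding the narrowing to uint32_t; it is
not something the patch checks.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of TCE entries for a window.
 * A zero-entry window is treated as a caller bug, not a runtime error.
 */
static uint32_t window_entries(uint64_t window_size, uint32_t page_shift)
{
    uint64_t nb_table = window_size >> page_shift;

    assert(nb_table != 0);            /* caller must pass size >= page size */
    assert(nb_table <= UINT32_MAX);   /* keep the narrowing cast lossless */
    return (uint32_t)nb_table;
}
```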

> +
> +    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, window_addr,
> +                               page_shift, nb_table, false);
> +    if (!tcet) {
> +        return -1;
> +    }

Since you're adding error reporting, you might as well make it via the
error API instead of a return code.  That way if we want to add more
detailed error API reporting to spapr_tce_new_table() in future,
there's less to change.

> +
> +    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
> +                                spapr_tce_get_iommu(tcet));
> +    return 0;
> +}
> +
>  /* Macros to operate with address in OF binding to PCI */
>  #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
>  #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
> @@ -1228,8 +1251,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>      int i;
>      PCIBus *bus;
>      uint64_t msi_window_size = 4096;
> -    sPAPRTCETable *tcet;
> -    uint32_t nb_table;
>  
>      if (sphb->index != (uint32_t)-1) {
>          hwaddr windows_base;
> @@ -1381,18 +1402,11 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>          }
>      }
>  
> -    nb_table = sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT;
> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn,
> -                               0, SPAPR_TCE_PAGE_SHIFT, nb_table, false);
> -    if (!tcet) {
> -        error_setg(errp, "Unable to create TCE table for %s",
> -                   sphb->dtbusname);
> -        return;
> -    }
> -
>      /* Register default 32bit DMA window */
> -    memory_region_add_subregion(&sphb->iommu_root, sphb->dma_win_addr,
> -                                spapr_tce_get_iommu(tcet));
> +    if (spapr_phb_dma_window_enable(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
> +                                    sphb->dma_win_addr, sphb->dma_win_size)) {
> +        error_setg(errp, "Unable to create TCE table for %s", sphb->dtbusname);
> +    }
>  
>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>  }

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 04/16] spapr_iommu: Introduce "enabled" state for TCE table
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 04/16] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
@ 2016-03-03  3:00   ` David Gibson
  2016-03-10  7:39     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 50+ messages in thread
From: David Gibson @ 2016-03-03  3:00 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On Tue, Mar 01, 2016 at 08:10:29PM +1100, Alexey Kardashevskiy wrote:
> Currently TCE tables are created once at start and their sizes never
> change. We are going to change that by introducing Dynamic DMA windows
> support, where DMA configuration may change during guest execution.
> 
> This changes spapr_tce_new_table() to create an empty zero-size IOMMU
> memory region (IOMMU MR). Only LIOBN is assigned by the time of creation.
> It still will be called once at the owner object (VIO or PHB) creation.
> 
> This introduces an "enabled" state for TCE table objects with two
> helper functions - spapr_tce_table_enable()/spapr_tce_table_disable().
> - spapr_tce_table_enable() receives TCE table parameters, allocates
> a guest view of the TCE table (in the user space or KVM) and
> sets the correct size on the IOMMU MR.
> - spapr_tce_table_disable() disposes the table and resets the IOMMU MR
> size.
> 
> This changes the PHB reset handler to do the default DMA initialization
> instead of spapr_phb_realize(). This does not make a difference now but
> later with more than just one DMA window, we will have to remove them all
> and create the default one on a system reset.
> 
> No visible change in behaviour is expected except the actual table
> will be reallocated every reset. We might optimize this later.
> 
> The other way to implement this would be dynamically create/remove
> the TCE table QOM objects but this would make migration impossible
> as the migration code expects all QOM objects to exist at the receiver
> so we have to have TCE table objects created when migration begins.
> 
> spapr_tce_table_do_enable() is separated from spapr_tce_table_enable()
> as later it will be called at the sPAPRTCETable post-migration stage when
> it already has all the properties set after the migration.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Although there's one nit that could be improved:


> ---
>  hw/ppc/spapr_iommu.c   | 80 +++++++++++++++++++++++++++++++++++---------------
>  hw/ppc/spapr_pci.c     | 21 +++++++++----
>  hw/ppc/spapr_vio.c     |  9 +++---
>  include/hw/ppc/spapr.h | 10 +++----
>  4 files changed, 80 insertions(+), 40 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 8132f64..e66e128 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -174,15 +174,8 @@ static int spapr_tce_table_realize(DeviceState *dev)
>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
>  
>      tcet->fd = -1;
> -    tcet->table = spapr_tce_alloc_table(tcet->liobn,
> -                                        tcet->page_shift,
> -                                        tcet->nb_table,
> -                                        &tcet->fd,
> -                                        tcet->need_vfio);
> -
>      memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
> -                             "iommu-spapr",
> -                             (uint64_t)tcet->nb_table << tcet->page_shift);
> +                             "iommu-spapr", 0);
>  
>      QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
>  
> @@ -224,14 +217,10 @@ void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio)
>      tcet->table = newtable;
>  }
>  
> -sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
> -                                   uint64_t bus_offset,
> -                                   uint32_t page_shift,
> -                                   uint32_t nb_table,
> -                                   bool need_vfio)
> +sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn)
>  {
>      sPAPRTCETable *tcet;
> -    char tmp[64];
> +    char tmp[32];
>  
>      if (spapr_tce_find_by_liobn(liobn)) {
>          fprintf(stderr, "Attempted to create TCE table with duplicate"
> @@ -239,16 +228,8 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
>          return NULL;
>      }
>  
> -    if (!nb_table) {
> -        return NULL;
> -    }
> -
>      tcet = SPAPR_TCE_TABLE(object_new(TYPE_SPAPR_TCE_TABLE));
>      tcet->liobn = liobn;
> -    tcet->bus_offset = bus_offset;
> -    tcet->page_shift = page_shift;
> -    tcet->nb_table = nb_table;
> -    tcet->need_vfio = need_vfio;
>  
>      snprintf(tmp, sizeof(tmp), "tce-table-%x", liobn);
>      object_property_add_child(OBJECT(owner), tmp, OBJECT(tcet), NULL);
> @@ -258,14 +239,65 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
>      return tcet;
>  }
>  
> +static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
> +{
> +    if (!tcet->nb_table) {
> +        return;
> +    }
> +
> +    tcet->table = spapr_tce_alloc_table(tcet->liobn,
> +                                        tcet->page_shift,
> +                                        tcet->nb_table,
> +                                        &tcet->fd,
> +                                        tcet->need_vfio);
> +
> +    memory_region_set_size(&tcet->iommu,
> +                           (uint64_t)tcet->nb_table << tcet->page_shift);
> +
> +    tcet->enabled = true;
> +}
> +
> +void spapr_tce_table_enable(sPAPRTCETable *tcet,
> +                            uint32_t page_shift, uint64_t bus_offset,
> +                            uint32_t nb_table, bool need_vfio)
> +{
> +    if (tcet->enabled) {
> +        return;

If the given parameters are different from the current ones, treating
this as a no-op is rather misleading.  I gather that to resize the
window you're expected to disable, then re-enable.  In which case I
think it would be safer to actually throw some kind of error on a
double enable.


> +    }
> +
> +    tcet->bus_offset = bus_offset;
> +    tcet->page_shift = page_shift;
> +    tcet->nb_table = nb_table;
> +    tcet->need_vfio = need_vfio;
> +
> +    spapr_tce_table_do_enable(tcet);
> +}
> +
> +static void spapr_tce_table_disable(sPAPRTCETable *tcet)
> +{
> +    if (!tcet->enabled) {
> +        return;
> +    }
> +
> +    memory_region_set_size(&tcet->iommu, 0);
> +
> +    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
> +    tcet->fd = -1;
> +    tcet->table = NULL;
> +    tcet->enabled = false;
> +    tcet->bus_offset = 0;
> +    tcet->page_shift = 0;
> +    tcet->nb_table = 0;
> +    tcet->need_vfio = false;
> +}
> +
>  static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
>  {
>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
>  
>      QLIST_REMOVE(tcet, list);
>  
> -    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
> -    tcet->fd = -1;
> +    spapr_tce_table_disable(tcet);
>  }
>  
>  MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 248f20a..c34a906 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -815,12 +815,13 @@ static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>          return -1;
>      }
>  
> -    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, window_addr,
> -                               page_shift, nb_table, false);
> +    tcet = spapr_tce_find_by_liobn(liobn);
>      if (!tcet) {
>          return -1;
>      }
>  
> +    spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table, false);
> +
>      memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
>                                  spapr_tce_get_iommu(tcet));
>      return 0;
> @@ -1251,6 +1252,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>      int i;
>      PCIBus *bus;
>      uint64_t msi_window_size = 4096;
> +    sPAPRTCETable *tcet;
>  
>      if (sphb->index != (uint32_t)-1) {
>          hwaddr windows_base;
> @@ -1402,11 +1404,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>          }
>      }
>  
> +    /* DMA setup */
> +    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
> +    if (!tcet) {
> +        error_report("No default TCE table for %s", sphb->dtbusname);
> +        return;
> +    }
> +
>      /* Register default 32bit DMA window */
> -    if (spapr_phb_dma_window_enable(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
> -                                    sphb->dma_win_addr, sphb->dma_win_size)) {
> -        error_setg(errp, "Unable to create TCE table for %s", sphb->dtbusname);
> -    }
> +    spapr_phb_dma_window_enable(sphb, sphb->dma_liobn,
> +                                SPAPR_TCE_PAGE_SHIFT,
> +                                sphb->dma_win_addr,
> +                                sphb->dma_win_size);
>  
>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>  }
> diff --git a/hw/ppc/spapr_vio.c b/hw/ppc/spapr_vio.c
> index 0f61a55..a745884 100644
> --- a/hw/ppc/spapr_vio.c
> +++ b/hw/ppc/spapr_vio.c
> @@ -481,11 +481,10 @@ static void spapr_vio_busdev_realize(DeviceState *qdev, Error **errp)
>          memory_region_add_subregion_overlap(&dev->mrroot, 0, &dev->mrbypass, 1);
>          address_space_init(&dev->as, &dev->mrroot, qdev->id);
>  
> -        dev->tcet = spapr_tce_new_table(qdev, liobn,
> -                                        0,
> -                                        SPAPR_TCE_PAGE_SHIFT,
> -                                        pc->rtce_window_size >>
> -                                        SPAPR_TCE_PAGE_SHIFT, false);
> +        dev->tcet = spapr_tce_new_table(qdev, liobn);
> +        spapr_tce_table_enable(dev->tcet, SPAPR_TCE_PAGE_SHIFT, 0,
> +                               pc->rtce_window_size >> SPAPR_TCE_PAGE_SHIFT,
> +                               false);
>          dev->tcet->vdev = dev;
>          memory_region_add_subregion_overlap(&dev->mrroot, 0,
>                                              spapr_tce_get_iommu(dev->tcet), 2);
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 098d85d..3e6bb84 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -539,6 +539,7 @@ typedef struct sPAPRTCETable sPAPRTCETable;
>  
>  struct sPAPRTCETable {
>      DeviceState parent;
> +    bool enabled;
>      uint32_t liobn;
>      uint32_t nb_table;
>      uint64_t bus_offset;
> @@ -566,11 +567,10 @@ void spapr_events_fdt_skel(void *fdt, uint32_t epow_irq);
>  int spapr_h_cas_compose_response(sPAPRMachineState *sm,
>                                   target_ulong addr, target_ulong size,
>                                   bool cpu_update, bool memory_update);
> -sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
> -                                   uint64_t bus_offset,
> -                                   uint32_t page_shift,
> -                                   uint32_t nb_table,
> -                                   bool need_vfio);
> +sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn);
> +void spapr_tce_table_enable(sPAPRTCETable *tcet,
> +                            uint32_t page_shift, uint64_t bus_offset,
> +                            uint32_t nb_table, bool vfio_accel);
>  void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio);
>  
>  MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet);


* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 06/16] spapr_pci: Reset DMA config on PHB reset
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 06/16] spapr_pci: Reset DMA config on PHB reset Alexey Kardashevskiy
@ 2016-03-03  3:02   ` David Gibson
  0 siblings, 0 replies; 50+ messages in thread
From: David Gibson @ 2016-03-03  3:02 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On Tue, Mar 01, 2016 at 08:10:31PM +1100, Alexey Kardashevskiy wrote:
> LoPAPR dictates that during system reset all DMA windows must be removed
> and the default DMA32 window must be created so does the patch.
> 
> At the moment there is just one window supported so no change in
> behaviour is expected.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  hw/ppc/spapr_iommu.c   |  2 +-
>  hw/ppc/spapr_pci.c     | 29 +++++++++++++++++++++++------
>  include/hw/ppc/spapr.h |  1 +
>  3 files changed, 25 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index ba9ddbb..8a88a74 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -279,7 +279,7 @@ void spapr_tce_table_enable(sPAPRTCETable *tcet,
>      spapr_tce_table_do_enable(tcet);
>  }
>  
> -static void spapr_tce_table_disable(sPAPRTCETable *tcet)
> +void spapr_tce_table_disable(sPAPRTCETable *tcet)
>  {
>      if (!tcet->enabled) {
>          return;
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 7b40687..ee0fecf 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -825,6 +825,19 @@ static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>      return 0;
>  }
>  
> +static int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
> +{
> +    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
> +
> +    if (!tcet) {
> +        return -1;
> +    }
> +
> +    spapr_tce_table_disable(tcet);
> +
> +    return 0;
> +}
> +
>  /* Macros to operate with address in OF binding to PCI */
>  #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
>  #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
> @@ -1412,12 +1425,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>      memory_region_add_subregion(&sphb->iommu_root, 0,
>                                  spapr_tce_get_iommu(tcet));
>  
> -    /* Register default 32bit DMA window */
> -    spapr_phb_dma_window_enable(sphb, sphb->dma_liobn,
> -                                SPAPR_TCE_PAGE_SHIFT,
> -                                sphb->dma_win_addr,
> -                                sphb->dma_win_size);
> -
>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>  }
>  
> @@ -1434,6 +1441,16 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
>  
>  static void spapr_phb_reset(DeviceState *qdev)
>  {
> +    sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
> +
> +    spapr_phb_dma_window_disable(sphb, sphb->dma_liobn);
> +
> +    /* Register default 32bit DMA window */
> +    spapr_phb_dma_window_enable(sphb, sphb->dma_liobn,
> +                                SPAPR_TCE_PAGE_SHIFT,
> +                                sphb->dma_win_addr,
> +                                sphb->dma_win_size);
> +
>      /* Reset the IOMMU state */
>      object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL);
>  
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index bdf27ec..8aa0c45 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -571,6 +571,7 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn);
>  void spapr_tce_table_enable(sPAPRTCETable *tcet,
>                              uint32_t page_shift, uint64_t bus_offset,
>                              uint32_t nb_table, bool vfio_accel);
> +void spapr_tce_table_disable(sPAPRTCETable *tcet);
>  void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio);
>  
>  MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet);


* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 07/16] vfio, memory: Notify IOMMU about starting/stopping being used by VFIO
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 07/16] vfio, memory: Notify IOMMU about starting/stopping being used by VFIO Alexey Kardashevskiy
@ 2016-03-03  5:28   ` David Gibson
  2016-03-03  6:01     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 50+ messages in thread
From: David Gibson @ 2016-03-03  5:28 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On Tue, Mar 01, 2016 at 08:10:32PM +1100, Alexey Kardashevskiy wrote:
> This adds a vfio_notify() callback to inform an IOMMU (and then its owner)
> that VFIO started using the IOMMU. This is used by the pseries machine to
> enable/disable in-kernel acceleration of TCE hypercalls.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Hmm.. the current approach of having a hook when vfio-pci devices are
attached is pretty ugly, but what exactly is the case that it doesn't
handle and this approach does?

This two tiered notify system for a single bit is also kinda ugly.

> ---
>  hw/ppc/spapr_iommu.c   |  9 +++++++++
>  hw/ppc/spapr_pci.c     | 14 ++++++++------
>  hw/vfio/common.c       |  7 +++++++
>  include/exec/memory.h  |  2 ++
>  include/hw/ppc/spapr.h |  4 ++++
>  5 files changed, 30 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 8a88a74..67a8356 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -136,6 +136,13 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
>      return ret;
>  }
>  
> +static int spapr_tce_vfio_notify(MemoryRegion *iommu, bool attached)
> +{
> +    sPAPRTCETable *tcet = container_of(iommu, sPAPRTCETable, iommu);
> +
> +    return spapr_tce_vfio_notify_owner(tcet->owner, tcet, attached);

I'm guessing the "owner" is the PHB, but I'm not entirely clear.

Could you use the QOM parent to get the PHB instead of storing it
explicitly?

> +}
> +
>  static int spapr_tce_table_post_load(void *opaque, int version_id)
>  {
>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> @@ -167,6 +174,7 @@ static const VMStateDescription vmstate_spapr_tce_table = {
>  
>  static MemoryRegionIOMMUOps spapr_iommu_ops = {
>      .translate = spapr_tce_translate_iommu,
> +    .vfio_notify = spapr_tce_vfio_notify,
>  };
>  
>  static int spapr_tce_table_realize(DeviceState *dev)
> @@ -235,6 +243,7 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn)
>  
>      tcet = SPAPR_TCE_TABLE(object_new(TYPE_SPAPR_TCE_TABLE));
>      tcet->liobn = liobn;
> +    tcet->owner = owner;
>  
>      snprintf(tmp, sizeof(tmp), "tce-table-%x", liobn);
>      object_property_add_child(OBJECT(owner), tmp, OBJECT(tcet), NULL);
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index ee0fecf..b0cd148 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -1084,6 +1084,14 @@ static int spapr_populate_pci_child_dt(PCIDevice *dev, void *fdt, int offset,
>      return 0;
>  }
>  
> +int spapr_tce_vfio_notify_owner(DeviceState *dev, sPAPRTCETable *tcet,
> +                                bool attached)
> +{
> +    spapr_tce_set_need_vfio(tcet, attached);

Hmm.. you go to the trouble of storing owner in dev, then don't
actually use it.

> +    return 0;
> +}
> +
>  /* create OF node for pci device and required OF DT properties */
>  static int spapr_create_pci_child_dt(sPAPRPHBState *phb, PCIDevice *dev,
>                                       void *fdt, int node_offset)
> @@ -1118,12 +1126,6 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>      void *fdt = NULL;
>      int fdt_start_offset = 0, fdt_size;
>  
> -    if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
> -        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
> -
> -        spapr_tce_set_need_vfio(tcet, true);
> -    }
> -
>      if (dev->hotplugged) {
>          fdt = create_device_tree(&fdt_size);
>          fdt_start_offset = spapr_create_pci_child_dt(phb, pdev, fdt, 0);
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 9bf4c3b..ca3fd47 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -384,6 +384,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
>          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>  
>          memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> +        giommu->iommu->iommu_ops->vfio_notify(section->mr, true);
>          memory_region_iommu_replay(giommu->iommu, &giommu->n,
>                                     vfio_container_granularity(container),
>                                     false);
> @@ -430,6 +431,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
>      VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>      hwaddr iova, end;
>      int ret;
> +    MemoryRegion *iommu = NULL;
>  
>      if (vfio_listener_skipped_section(section)) {
>          trace_vfio_listener_region_del_skip(
> @@ -451,6 +453,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
>          QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
>              if (giommu->iommu == section->mr) {
>                  memory_region_unregister_iommu_notifier(&giommu->n);
> +                iommu = giommu->iommu;
>                  QLIST_REMOVE(giommu, giommu_next);
>                  g_free(giommu);
>                  break;
> @@ -483,6 +486,10 @@ static void vfio_listener_region_del(MemoryListener *listener,
>                       "0x%"HWADDR_PRIx") = %d (%m)",
>                       container, iova, end - iova, ret);
>      }
> +
> +    if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_notify) {
> +        iommu->iommu_ops->vfio_notify(section->mr, false);
> +    }

So, if an IOMMU is removed from the guest, this will turn off VFIO
enablement.  However, IIUC this won't get called in the more likely
case that the address space stays the same, but the VFIO device is
removed.

>  }
>  
>  static const MemoryListener vfio_memory_listener = {
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index d5284c2..9f82629 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -150,6 +150,8 @@ typedef struct MemoryRegionIOMMUOps MemoryRegionIOMMUOps;
>  struct MemoryRegionIOMMUOps {
>      /* Return a TLB entry that contains a given address. */
>      IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
> +    /* Called when VFIO starts/stops using this */
> +    int (*vfio_notify)(MemoryRegion *iommu, bool attached);
>  };
>  
>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 8aa0c45..5d2f8f4 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -550,6 +550,7 @@ struct sPAPRTCETable {
>      int fd;
>      MemoryRegion root, iommu;
>      struct VIOsPAPRDevice *vdev; /* for @bypass migration compatibility only */
> +    DeviceState *owner;
>      QLIST_ENTRY(sPAPRTCETable) list;
>  };
>  
> @@ -629,4 +630,7 @@ int spapr_rng_populate_dt(void *fdt);
>   */
>  #define SPAPR_LMB_FLAGS_ASSIGNED 0x00000008
>  
> +int spapr_tce_vfio_notify_owner(DeviceState *dev, sPAPRTCETable *tcet,
> +                                bool attached);
> +
>  #endif /* !defined (__HW_SPAPR_H__) */


* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 08/16] memory: Add reporting of supported page sizes
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 08/16] memory: Add reporting of supported page sizes Alexey Kardashevskiy
@ 2016-03-03  5:33   ` David Gibson
  0 siblings, 0 replies; 50+ messages in thread
From: David Gibson @ 2016-03-03  5:33 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On Tue, Mar 01, 2016 at 08:10:33PM +1100, Alexey Kardashevskiy wrote:
> Every IOMMU has some granularity which MemoryRegionIOMMUOps::translate
> uses when translating, however this information is not available outside
> the translate context for various checks.
> 
> This adds a get_page_sizes callback to MemoryRegionIOMMUOps and
> a wrapper for it so IOMMU users (such as VFIO) can know the actual
> page size(s) used by an IOMMU.
> 
> The qemu_real_host_page_mask is used as fallback.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

I'm going to have to see how this gets used to really analyze it.
But, a preliminary comment:

Once this is added, it should be possible to remove the explicit page
size parameter from the iommu_replay function (since it could be
derived from the IOMMU page sizes).
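
For illustration only, a minimal sketch of that derivation, assuming
the bitmap has bit N set when 2^N-byte pages are supported and that
replay would walk at the smallest supported size (toy_min_page_size()
is a made-up name, not an existing helper):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the smallest page size present in a page-size bitmap.
 * Equivalent to 1ULL << ctz64(page_sizes), written with plain
 * unsigned arithmetic so it needs no compiler builtins.
 */
static uint64_t toy_min_page_size(uint64_t page_sizes)
{
    assert(page_sizes != 0);

    /* Two's complement trick: isolate the lowest set bit. */
    return page_sizes & (~page_sizes + 1);
}
```

A replay loop could then step through the region in units of that
value instead of taking the granularity as an argument.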

> ---
> Changes:
> v4:
> * s/1<<TARGET_PAGE_BITS/qemu_real_host_page_size/
> ---
>  hw/ppc/spapr_iommu.c  |  8 ++++++++
>  include/exec/memory.h | 11 +++++++++++
>  memory.c              |  9 +++++++++
>  3 files changed, 28 insertions(+)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 67a8356..4c52cf4 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -143,6 +143,13 @@ static int spapr_tce_vfio_notify(MemoryRegion *iommu, bool attached)
>      return spapr_tce_vfio_notify_owner(tcet->owner, tcet, attached);
>  }
>  
> +static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
> +{
> +    sPAPRTCETable *tcet = container_of(iommu, sPAPRTCETable, iommu);
> +
> +    return 1ULL << tcet->page_shift;
> +}
> +
>  static int spapr_tce_table_post_load(void *opaque, int version_id)
>  {
>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> @@ -175,6 +182,7 @@ static const VMStateDescription vmstate_spapr_tce_table = {
>  static MemoryRegionIOMMUOps spapr_iommu_ops = {
>      .translate = spapr_tce_translate_iommu,
>      .vfio_notify = spapr_tce_vfio_notify,
> +    .get_page_sizes = spapr_tce_get_page_sizes,
>  };
>  
>  static int spapr_tce_table_realize(DeviceState *dev)
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 9f82629..c34e67c 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -152,6 +152,8 @@ struct MemoryRegionIOMMUOps {
>      IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
>      /* Called when VFIO starts/stops using this */
>      int (*vfio_notify)(MemoryRegion *iommu, bool attached);
> +    /* Returns supported page sizes */
> +    uint64_t (*get_page_sizes)(MemoryRegion *iommu);
>  };
>  
>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> @@ -576,6 +578,15 @@ static inline bool memory_region_is_iommu(MemoryRegion *mr)
>  
>  
>  /**
> + * memory_region_iommu_get_page_sizes: get supported page sizes in an iommu
> + *
> + * Returns %bitmap of supported page sizes for an iommu.
> + *
> + * @mr: the memory region being queried
> + */
> +uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr);
> +
> +/**
>   * memory_region_notify_iommu: notify a change in an IOMMU translation entry.
>   *
>   * @mr: the memory region that was changed
> diff --git a/memory.c b/memory.c
> index 0dd9695..5d8453d 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1462,6 +1462,15 @@ void memory_region_notify_iommu(MemoryRegion *mr,
>      notifier_list_notify(&mr->iommu_notify, &entry);
>  }
>  
> +uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr)
> +{
> +    assert(memory_region_is_iommu(mr));
> +    if (mr->iommu_ops && mr->iommu_ops->get_page_sizes) {
> +        return mr->iommu_ops->get_page_sizes(mr);
> +    }
> +    return qemu_real_host_page_size;
> +}
> +
>  void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client)
>  {
>      uint8_t mask = 1 << client;


* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 09/16] vfio: Generalize IOMMU memory listener
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 09/16] vfio: Generalize IOMMU memory listener Alexey Kardashevskiy
@ 2016-03-03  5:36   ` David Gibson
  2016-03-03  6:07     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 50+ messages in thread
From: David Gibson @ 2016-03-03  5:36 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On Tue, Mar 01, 2016 at 08:10:34PM +1100, Alexey Kardashevskiy wrote:
> At the moment VFIOContainer uses one memory listener which listens on
> PCI address space for both Type1 and sPAPR IOMMUs. Soon we will need
> another listener to listen on RAM; this will do DMA memory
> pre-registration for sPAPR guests which basically pins all guest
> pages in the host physical RAM.
> 
> This introduces VFIOMemoryListener which is wrapper for MemoryListener
> and stores a pointer to the container. This allows having multiple
> memory listeners for the same container. This replaces the existing
> @listener with @iommu_listener.
> 
> This should cause no change in behavior.

This is nonsense.

The two listeners you're talking about have (or should have) both a
different AS they're listening on, *and* different notification
functions.  Since they have nothing in common, there's no point trying
to build a common structure for them.
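
Roughly what I have in mind instead, as a toy model only (ToyListener
and ToyContainer are stand-ins for MemoryListener and VFIOContainer,
and "prereg_listener" is a hypothetical name for the second listener):
each listener is a plain member of the container with its own
callback, and each callback recovers the container with container_of
on its own member.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for MemoryListener: a single callback is enough here. */
typedef struct ToyListener {
    void (*region_add)(struct ToyListener *listener, int section);
} ToyListener;

/* Toy stand-in for VFIOContainer with two unrelated listeners. */
typedef struct ToyContainer {
    ToyListener iommu_listener;   /* would listen on the PCI address space */
    ToyListener prereg_listener;  /* would listen on RAM (hypothetical) */
    int iommu_adds;
    int prereg_adds;
} ToyContainer;

#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static void toy_iommu_region_add(ToyListener *listener, int section)
{
    ToyContainer *c = toy_container_of(listener, ToyContainer,
                                       iommu_listener);
    (void)section;
    c->iommu_adds++;
}

static void toy_prereg_region_add(ToyListener *listener, int section)
{
    ToyContainer *c = toy_container_of(listener, ToyContainer,
                                       prereg_listener);
    (void)section;
    c->prereg_adds++;
}

static void toy_container_init(ToyContainer *c)
{
    assert(c != NULL);

    c->iommu_listener.region_add = toy_iommu_region_add;
    c->prereg_listener.region_add = toy_prereg_region_add;
    c->iommu_adds = 0;
    c->prereg_adds = 0;
}
```

No wrapper carrying a back pointer is needed, because each callback
already knows which member it was registered through.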

> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  hw/vfio/common.c              | 41 +++++++++++++++++++++++++++++++----------
>  include/hw/vfio/vfio-common.h |  9 ++++++++-
>  2 files changed, 39 insertions(+), 11 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index ca3fd47..0e67a5a 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -318,10 +318,10 @@ static hwaddr vfio_container_granularity(VFIOContainer *container)
>      return (hwaddr)1 << ctz64(container->iova_pgsizes);
>  }
>  
> -static void vfio_listener_region_add(MemoryListener *listener,
> +static void vfio_listener_region_add(VFIOMemoryListener *vlistener,
>                                       MemoryRegionSection *section)
>  {
> -    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> +    VFIOContainer *container = vlistener->container;
>      hwaddr iova, end;
>      Int128 llend;
>      void *vaddr;
> @@ -425,10 +425,10 @@ fail:
>      }
>  }
>  
> -static void vfio_listener_region_del(MemoryListener *listener,
> +static void vfio_listener_region_del(VFIOMemoryListener *vlistener,
>                                       MemoryRegionSection *section)
>  {
> -    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> +    VFIOContainer *container = vlistener->container;
>      hwaddr iova, end;
>      int ret;
>      MemoryRegion *iommu = NULL;
> @@ -492,14 +492,33 @@ static void vfio_listener_region_del(MemoryListener *listener,
>      }
>  }
>  
> -static const MemoryListener vfio_memory_listener = {
> -    .region_add = vfio_listener_region_add,
> -    .region_del = vfio_listener_region_del,
> +static void vfio_iommu_listener_region_add(MemoryListener *listener,
> +                                           MemoryRegionSection *section)
> +{
> +    VFIOMemoryListener *vlistener = container_of(listener, VFIOMemoryListener,
> +                                                 listener);
> +
> +    vfio_listener_region_add(vlistener, section);
> +}
> +
> +
> +static void vfio_iommu_listener_region_del(MemoryListener *listener,
> +                                           MemoryRegionSection *section)
> +{
> +    VFIOMemoryListener *vlistener = container_of(listener, VFIOMemoryListener,
> +                                                 listener);
> +
> +    vfio_listener_region_del(vlistener, section);
> +}
> +
> +static const MemoryListener vfio_iommu_listener = {
> +    .region_add = vfio_iommu_listener_region_add,
> +    .region_del = vfio_iommu_listener_region_del,
>  };
>  
>  static void vfio_listener_release(VFIOContainer *container)
>  {
> -    memory_listener_unregister(&container->listener);
> +    memory_listener_unregister(&container->iommu_listener.listener);
>  }
>  
>  int vfio_mmap_region(Object *obj, VFIORegion *region,
> @@ -768,9 +787,11 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>          goto free_container_exit;
>      }
>  
> -    container->listener = vfio_memory_listener;
> +    container->iommu_listener.container = container;
> +    container->iommu_listener.listener = vfio_iommu_listener;
>  
> -    memory_listener_register(&container->listener, container->space->as);
> +    memory_listener_register(&container->iommu_listener.listener,
> +                             container->space->as);
>  
>      if (container->error) {
>          ret = container->error;
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 9ffa681..b6b736c 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -57,12 +57,19 @@ typedef struct VFIOAddressSpace {
>      QLIST_ENTRY(VFIOAddressSpace) list;
>  } VFIOAddressSpace;
>  
> +typedef struct VFIOContainer VFIOContainer;
> +
> +typedef struct VFIOMemoryListener {
> +    struct MemoryListener listener;
> +    VFIOContainer *container;
> +} VFIOMemoryListener;
> +
>  struct VFIOGroup;
>  
>  typedef struct VFIOContainer {
>      VFIOAddressSpace *space;
>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
> -    MemoryListener listener;
> +    VFIOMemoryListener iommu_listener;
>      int error;
>      bool initialized;
>      /*

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 07/16] vfio, memory: Notify IOMMU about starting/stopping being used by VFIO
  2016-03-03  5:28   ` [Qemu-devel] [Qemu-ppc] " David Gibson
@ 2016-03-03  6:01     ` Alexey Kardashevskiy
  2016-03-04  4:01       ` David Gibson
  0 siblings, 1 reply; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-03  6:01 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On 03/03/2016 04:28 PM, David Gibson wrote:
> On Tue, Mar 01, 2016 at 08:10:32PM +1100, Alexey Kardashevskiy wrote:
>> This adds a vfio_notify() callback to inform an IOMMU (and then its owner)
>> that VFIO started using the IOMMU. This is used by the pseries machine to
>> enable/disable in-kernel acceleration of TCE hypercalls.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>
> Hmm.. the current approach of having a hook when vfio-pci devices are
> attached is pretty ugly, but what exactly is the case that it doesn't
> handle and this approach does?

Sorry, I am not following you here. What hook do you mean here?

My hook covers the case where I want to enable/disable KVM acceleration.
Without these patches, I need to re-count how many vfio-pci devices are
there, and that gets even uglier with PCI hotplug/unplug...


> This two-tiered notify system for a single bit is also kinda ugly.
>
>> ---
>>   hw/ppc/spapr_iommu.c   |  9 +++++++++
>>   hw/ppc/spapr_pci.c     | 14 ++++++++------
>>   hw/vfio/common.c       |  7 +++++++
>>   include/exec/memory.h  |  2 ++
>>   include/hw/ppc/spapr.h |  4 ++++
>>   5 files changed, 30 insertions(+), 6 deletions(-)
>>
>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>> index 8a88a74..67a8356 100644
>> --- a/hw/ppc/spapr_iommu.c
>> +++ b/hw/ppc/spapr_iommu.c
>> @@ -136,6 +136,13 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
>>       return ret;
>>   }
>>
>> +static int spapr_tce_vfio_notify(MemoryRegion *iommu, bool attached)
>> +{
>> +    sPAPRTCETable *tcet = container_of(iommu, sPAPRTCETable, iommu);
>> +
>> +    return spapr_tce_vfio_notify_owner(tcet->owner, tcet, attached);
>
> I'm guessing the "owner" is the PHB, but I'm not entirely clear.
>
> Could you use the QOM parent to get the PHB instead of storing it
> explicitly?


I am pretty sure I am not allowed to use the QOM parent; this is why there
is no object_get_parent() helper.


>
>> +}
>> +
>>   static int spapr_tce_table_post_load(void *opaque, int version_id)
>>   {
>>       sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
>> @@ -167,6 +174,7 @@ static const VMStateDescription vmstate_spapr_tce_table = {
>>
>>   static MemoryRegionIOMMUOps spapr_iommu_ops = {
>>       .translate = spapr_tce_translate_iommu,
>> +    .vfio_notify = spapr_tce_vfio_notify,
>>   };
>>
>>   static int spapr_tce_table_realize(DeviceState *dev)
>> @@ -235,6 +243,7 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn)
>>
>>       tcet = SPAPR_TCE_TABLE(object_new(TYPE_SPAPR_TCE_TABLE));
>>       tcet->liobn = liobn;
>> +    tcet->owner = owner;
>>
>>       snprintf(tmp, sizeof(tmp), "tce-table-%x", liobn);
>>       object_property_add_child(OBJECT(owner), tmp, OBJECT(tcet), NULL);
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index ee0fecf..b0cd148 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -1084,6 +1084,14 @@ static int spapr_populate_pci_child_dt(PCIDevice *dev, void *fdt, int offset,
>>       return 0;
>>   }
>>
>> +int spapr_tce_vfio_notify_owner(DeviceState *dev, sPAPRTCETable *tcet,
>> +                                bool attached)
>> +{
>> +    spapr_tce_set_need_vfio(tcet, attached);
>
> Hmm.. you go to the trouble of storing owner in dev, then don't
> actually use it.


Yeah, I need to clean this up. I have already removed
spapr_tce_vfio_notify_owner() from my working branch and now call
spapr_tce_set_need_vfio() directly from spapr_tce_vfio_notify().


>
>> +    return 0;
>> +}
>> +
>>   /* create OF node for pci device and required OF DT properties */
>>   static int spapr_create_pci_child_dt(sPAPRPHBState *phb, PCIDevice *dev,
>>                                        void *fdt, int node_offset)
>> @@ -1118,12 +1126,6 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>>       void *fdt = NULL;
>>       int fdt_start_offset = 0, fdt_size;
>>
>> -    if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
>> -        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
>> -
>> -        spapr_tce_set_need_vfio(tcet, true);
>> -    }
>> -
>>       if (dev->hotplugged) {
>>           fdt = create_device_tree(&fdt_size);
>>           fdt_start_offset = spapr_create_pci_child_dt(phb, pdev, fdt, 0);
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 9bf4c3b..ca3fd47 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -384,6 +384,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>           QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>>
>>           memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
>> +        giommu->iommu->iommu_ops->vfio_notify(section->mr, true);
>>           memory_region_iommu_replay(giommu->iommu, &giommu->n,
>>                                      vfio_container_granularity(container),
>>                                      false);
>> @@ -430,6 +431,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>       VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>>       hwaddr iova, end;
>>       int ret;
>> +    MemoryRegion *iommu = NULL;
>>
>>       if (vfio_listener_skipped_section(section)) {
>>           trace_vfio_listener_region_del_skip(
>> @@ -451,6 +453,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>           QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
>>               if (giommu->iommu == section->mr) {
>>                   memory_region_unregister_iommu_notifier(&giommu->n);
>> +                iommu = giommu->iommu;
>>                   QLIST_REMOVE(giommu, giommu_next);
>>                   g_free(giommu);
>>                   break;
>> @@ -483,6 +486,10 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>                        "0x%"HWADDR_PRIx") = %d (%m)",
>>                        container, iova, end - iova, ret);
>>       }
>> +
>> +    if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_notify) {
>> +        iommu->iommu_ops->vfio_notify(section->mr, false);
>> +    }
>
> So, if an IOMMU is removed from the guest, this will turn off VFIO
> enablement.  However, IIUC this won't get called in the more likely
> case that the address space stays the same, but the VFIO device is
> removed.


When a VFIO device is removed, its listener gets removed and this is supposed
to end up calling vfio_listener_region_del().
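
Related to the two hunks above: the region_del path checks that
iommu_ops and vfio_notify exist before calling the hook, while the
region_add hunk calls giommu->iommu->iommu_ops->vfio_notify()
unconditionally. A minimal sketch of a shared guard for both paths,
assuming simplified stand-in types (vfio_notify_iommu(), the need_vfio
field and the example_* names are hypothetical and not part of the patch):

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the QEMU types; only the fields used here. */
typedef struct MemoryRegion MemoryRegion;

typedef struct MemoryRegionIOMMUOps {
    /* Optional hook; NULL for IOMMUs that do not care about VFIO. */
    int (*vfio_notify)(MemoryRegion *iommu, bool attached);
} MemoryRegionIOMMUOps;

struct MemoryRegion {
    const MemoryRegionIOMMUOps *iommu_ops;
    bool need_vfio; /* hypothetical state toggled by the hook */
};

/* Hypothetical helper: call the hook only if the region provides one. */
static int vfio_notify_iommu(MemoryRegion *iommu, bool attached)
{
    if (!iommu || !iommu->iommu_ops || !iommu->iommu_ops->vfio_notify) {
        return 0; /* nothing to notify; not treated as an error */
    }
    return iommu->iommu_ops->vfio_notify(iommu, attached);
}

/* Example hook, loosely modelled on spapr_tce_vfio_notify(). */
static int example_vfio_notify(MemoryRegion *iommu, bool attached)
{
    iommu->need_vfio = attached;
    return 0;
}

static const MemoryRegionIOMMUOps example_ops = {
    .vfio_notify = example_vfio_notify,
};
```

With such a helper, both vfio_listener_region_add() and
vfio_listener_region_del() could make the same call and differ only in
the value of the attached argument.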


>
>>   }
>>
>>   static const MemoryListener vfio_memory_listener = {
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index d5284c2..9f82629 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -150,6 +150,8 @@ typedef struct MemoryRegionIOMMUOps MemoryRegionIOMMUOps;
>>   struct MemoryRegionIOMMUOps {
>>       /* Return a TLB entry that contains a given address. */
>>       IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
>> +    /* Called when VFIO starts/stops using this */
>> +    int (*vfio_notify)(MemoryRegion *iommu, bool attached);
>>   };
>>
>>   typedef struct CoalescedMemoryRange CoalescedMemoryRange;
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index 8aa0c45..5d2f8f4 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -550,6 +550,7 @@ struct sPAPRTCETable {
>>       int fd;
>>       MemoryRegion root, iommu;
>>       struct VIOsPAPRDevice *vdev; /* for @bypass migration compatibility only */
>> +    DeviceState *owner;
>>       QLIST_ENTRY(sPAPRTCETable) list;
>>   };
>>
>> @@ -629,4 +630,7 @@ int spapr_rng_populate_dt(void *fdt);
>>    */
>>   #define SPAPR_LMB_FLAGS_ASSIGNED 0x00000008
>>
>> +int spapr_tce_vfio_notify_owner(DeviceState *dev, sPAPRTCETable *tcet,
>> +                                bool attached);
>> +
>>   #endif /* !defined (__HW_SPAPR_H__) */
>


-- 
Alexey


* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 09/16] vfio: Generalize IOMMU memory listener
  2016-03-03  5:36   ` [Qemu-devel] [Qemu-ppc] " David Gibson
@ 2016-03-03  6:07     ` Alexey Kardashevskiy
  2016-03-04  3:44       ` David Gibson
  0 siblings, 1 reply; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-03  6:07 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On 03/03/2016 04:36 PM, David Gibson wrote:
> On Tue, Mar 01, 2016 at 08:10:34PM +1100, Alexey Kardashevskiy wrote:
>> At the moment VFIOContainer uses one memory listener which listens on
>> PCI address space for both Type1 and sPAPR IOMMUs. Soon we will need
>> another listener to listen on RAM; this will do DMA memory
>> pre-registration for sPAPR guests which basically pins all guest
>> pages in the host physical RAM.
>>
>> This introduces VFIOMemoryListener which is a wrapper for MemoryListener
>> and stores a pointer to the container. This allows having multiple
>> memory listeners for the same container. This replaces the existing
>> @listener with @iommu_listener.
>>
>> This should cause no change in behavior.
>
> This is nonsense.
>
> The two listeners you're talking about have (or should have) both a
> different AS they're listening on,

They do have different AS.

> *and* different notification
> functions.

They do use totally different region_add/region_del, later in the series.

> Since they have nothing in common, there's no point trying
> to build a common structure for them.

They use the same VFIOContainer pointer. VFIOMemoryListener is made of 
MemoryListener and VFIOContainer, and that's it.

Ok, I'll get rid of VFIOMemoryListener. It is just hard sometimes to
understand which bits I have to reuse and which I do not, a constant argument...
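
For illustration, a sketch of how it could look without the wrapper,
assuming cut-down stand-in types rather than the real QEMU structures
(the example_* functions, the int section_id argument and the *_last
fields are hypothetical). Since each listener has its own callbacks,
each callback knows which member it is embedded in and can recover the
container with container_of() directly:

```c
#include <stddef.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Cut-down stand-in for MemoryListener: one callback, int section id. */
typedef struct MemoryListener MemoryListener;
struct MemoryListener {
    void (*region_add)(MemoryListener *listener, int section_id);
};

/* Two plain listeners embedded in the container, no wrapper struct. */
typedef struct VFIOContainer {
    int fd;
    MemoryListener iommu_listener;
    MemoryListener prereg_listener;
    int iommu_last;  /* last section seen by the IOMMU listener */
    int prereg_last; /* last section seen by the prereg listener */
} VFIOContainer;

static void example_iommu_region_add(MemoryListener *listener, int section_id)
{
    /* This callback is only ever installed in iommu_listener. */
    VFIOContainer *c = container_of(listener, VFIOContainer, iommu_listener);

    c->iommu_last = section_id;
}

static void example_prereg_region_add(MemoryListener *listener, int section_id)
{
    /* This callback is only ever installed in prereg_listener. */
    VFIOContainer *c = container_of(listener, VFIOContainer, prereg_listener);

    c->prereg_last = section_id;
}

static void example_container_init(VFIOContainer *c)
{
    c->fd = -1;
    c->iommu_listener.region_add = example_iommu_region_add;
    c->prereg_listener.region_add = example_prereg_region_add;
    c->iommu_last = -1;
    c->prereg_last = -1;
}
```

The trade-off is that a callback must never be installed in the other
listener member, otherwise container_of() computes the wrong base
address.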

>
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>   hw/vfio/common.c              | 41 +++++++++++++++++++++++++++++++----------
>>   include/hw/vfio/vfio-common.h |  9 ++++++++-
>>   2 files changed, 39 insertions(+), 11 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index ca3fd47..0e67a5a 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -318,10 +318,10 @@ static hwaddr vfio_container_granularity(VFIOContainer *container)
>>       return (hwaddr)1 << ctz64(container->iova_pgsizes);
>>   }
>>
>> -static void vfio_listener_region_add(MemoryListener *listener,
>> +static void vfio_listener_region_add(VFIOMemoryListener *vlistener,
>>                                        MemoryRegionSection *section)
>>   {
>> -    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>> +    VFIOContainer *container = vlistener->container;
>>       hwaddr iova, end;
>>       Int128 llend;
>>       void *vaddr;
>> @@ -425,10 +425,10 @@ fail:
>>       }
>>   }
>>
>> -static void vfio_listener_region_del(MemoryListener *listener,
>> +static void vfio_listener_region_del(VFIOMemoryListener *vlistener,
>>                                        MemoryRegionSection *section)
>>   {
>> -    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>> +    VFIOContainer *container = vlistener->container;
>>       hwaddr iova, end;
>>       int ret;
>>       MemoryRegion *iommu = NULL;
>> @@ -492,14 +492,33 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>       }
>>   }
>>
>> -static const MemoryListener vfio_memory_listener = {
>> -    .region_add = vfio_listener_region_add,
>> -    .region_del = vfio_listener_region_del,
>> +static void vfio_iommu_listener_region_add(MemoryListener *listener,
>> +                                           MemoryRegionSection *section)
>> +{
>> +    VFIOMemoryListener *vlistener = container_of(listener, VFIOMemoryListener,
>> +                                                 listener);
>> +
>> +    vfio_listener_region_add(vlistener, section);
>> +}
>> +
>> +
>> +static void vfio_iommu_listener_region_del(MemoryListener *listener,
>> +                                           MemoryRegionSection *section)
>> +{
>> +    VFIOMemoryListener *vlistener = container_of(listener, VFIOMemoryListener,
>> +                                                 listener);
>> +
>> +    vfio_listener_region_del(vlistener, section);
>> +}
>> +
>> +static const MemoryListener vfio_iommu_listener = {
>> +    .region_add = vfio_iommu_listener_region_add,
>> +    .region_del = vfio_iommu_listener_region_del,
>>   };
>>
>>   static void vfio_listener_release(VFIOContainer *container)
>>   {
>> -    memory_listener_unregister(&container->listener);
>> +    memory_listener_unregister(&container->iommu_listener.listener);
>>   }
>>
>>   int vfio_mmap_region(Object *obj, VFIORegion *region,
>> @@ -768,9 +787,11 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>           goto free_container_exit;
>>       }
>>
>> -    container->listener = vfio_memory_listener;
>> +    container->iommu_listener.container = container;
>> +    container->iommu_listener.listener = vfio_iommu_listener;
>>
>> -    memory_listener_register(&container->listener, container->space->as);
>> +    memory_listener_register(&container->iommu_listener.listener,
>> +                             container->space->as);
>>
>>       if (container->error) {
>>           ret = container->error;
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 9ffa681..b6b736c 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -57,12 +57,19 @@ typedef struct VFIOAddressSpace {
>>       QLIST_ENTRY(VFIOAddressSpace) list;
>>   } VFIOAddressSpace;
>>
>> +typedef struct VFIOContainer VFIOContainer;
>> +
>> +typedef struct VFIOMemoryListener {
>> +    struct MemoryListener listener;
>> +    VFIOContainer *container;
>> +} VFIOMemoryListener;
>> +
>>   struct VFIOGroup;
>>
>>   typedef struct VFIOContainer {
>>       VFIOAddressSpace *space;
>>       int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>> -    MemoryListener listener;
>> +    VFIOMemoryListener iommu_listener;
>>       int error;
>>       bool initialized;
>>       /*
>


-- 
Alexey


* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 10/16] vfio: Use different page size for different IOMMU types
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 10/16] vfio: Use different page size for different IOMMU types Alexey Kardashevskiy
@ 2016-03-03  6:08   ` David Gibson
  0 siblings, 0 replies; 50+ messages in thread
From: David Gibson @ 2016-03-03  6:08 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On Tue, Mar 01, 2016 at 08:10:35PM +1100, Alexey Kardashevskiy wrote:
> The existing memory listener is called on RAM or PCI address space
> which implies potentially different page size.
> 
> This uses new memory_region_iommu_get_page_sizes() for IOMMU regions
> or falls back to qemu_real_host_page_size if RAM.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

This doesn't seem right to me.. but neither does the original code.

As far as I can tell, these checks against page sizes are trying to
make sure that the regions are aligned in such a way that we can
actually map them in the host IOMMU.  But TARGET_PAGE_SIZE is a
property of the guest, rather than the host.

So, changing TARGET_PAGE_SIZE to qemu_real_host_page_size seems
correct to me for RAM regions.

But memory_region_iommu_get_page_sizes() is *not* the right choice for
IOMMU regions, because that gives you the granularity of the guest
IOMMU, whereas you need the granularity of the host IOMMU.

Unless I'm mistaking the purpose of these checks, which I hope Alex
can clarify for us when he gets back from holiday next week.
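
Whichever bitmap turns out to be the right input (guest IOMMU page
sizes or the host's container->iova_pgsizes), the arithmetic of the
checks is the same. A minimal sketch, assuming plain 64-bit values in
place of hwaddr/Int128 (so no overflow handling) and hypothetical
helper names. It isolates the lowest set bit with 64-bit operations
rather than ffs(), which takes an int and would drop page sizes at or
above 4GB:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Mask for the smallest page size present in a bitmap of supported
 * page sizes. The caller is assumed to pass a non-zero bitmap.
 */
static uint64_t page_mask_from_sizes(uint64_t pgsizes)
{
    uint64_t smallest = pgsizes & (~pgsizes + 1); /* lowest set bit */

    return ~(smallest - 1);
}

/*
 * Mirror of the listener checks: the section must have the same
 * sub-page offset in the address space and in the region, and must
 * still contain at least one whole page after trimming to page
 * boundaries. On success, [*start, *end) is the page-aligned range.
 */
static bool section_to_range(uint64_t as_off, uint64_t region_off,
                             uint64_t size, uint64_t page_mask,
                             uint64_t *start, uint64_t *end)
{
    uint64_t page_size = ~page_mask + 1;

    if ((as_off & ~page_mask) != (region_off & ~page_mask)) {
        return false; /* "received unaligned region" */
    }

    *start = (as_off + page_size - 1) & page_mask; /* round up */
    *end = (as_off + size) & page_mask;            /* round down */

    return *start < *end;
}
```

For a bitmap containing 4K, 64K and 16M, the mask is that of a 4K page,
and a section at 0x1800 of size 0x3000 is trimmed to [0x2000, 0x4000).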

> ---
> Changes:
> * uses the smallest page size for mask as IOMMU MR can support multiple
> page sizes
> ---
>  hw/vfio/common.c | 28 ++++++++++++++++++++--------
>  1 file changed, 20 insertions(+), 8 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 0e67a5a..3aaa6b5 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -318,6 +318,16 @@ static hwaddr vfio_container_granularity(VFIOContainer *container)
>      return (hwaddr)1 << ctz64(container->iova_pgsizes);
>  }
>  
> +static hwaddr vfio_iommu_page_mask(MemoryRegion *mr)
> +{
> +    if (memory_region_is_iommu(mr)) {
> +        int smallest = ffs(memory_region_iommu_get_page_sizes(mr)) - 1;
> +
> +        return ~((1ULL << smallest) - 1);
> +    }
> +    return qemu_real_host_page_mask;
> +}
> +
>  static void vfio_listener_region_add(VFIOMemoryListener *vlistener,
>                                       MemoryRegionSection *section)
>  {
> @@ -326,6 +336,7 @@ static void vfio_listener_region_add(VFIOMemoryListener *vlistener,
>      Int128 llend;
>      void *vaddr;
>      int ret;
> +    hwaddr page_mask = vfio_iommu_page_mask(section->mr);
>  
>      if (vfio_listener_skipped_section(section)) {
>          trace_vfio_listener_region_add_skip(
> @@ -335,16 +346,16 @@ static void vfio_listener_region_add(VFIOMemoryListener *vlistener,
>          return;
>      }
>  
> -    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
> -                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
> +    if (unlikely((section->offset_within_address_space & ~page_mask) !=
> +                 (section->offset_within_region & ~page_mask))) {
>          error_report("%s received unaligned region", __func__);
>          return;
>      }
>  
> -    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> +    iova = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
>      llend = int128_make64(section->offset_within_address_space);
>      llend = int128_add(llend, section->size);
> -    llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
> +    llend = int128_and(llend, int128_exts64(page_mask));
>  
>      if (int128_ge(int128_make64(iova), llend)) {
>          return;
> @@ -432,6 +443,7 @@ static void vfio_listener_region_del(VFIOMemoryListener *vlistener,
>      hwaddr iova, end;
>      int ret;
>      MemoryRegion *iommu = NULL;
> +    hwaddr page_mask = vfio_iommu_page_mask(section->mr);
>  
>      if (vfio_listener_skipped_section(section)) {
>          trace_vfio_listener_region_del_skip(
> @@ -441,8 +453,8 @@ static void vfio_listener_region_del(VFIOMemoryListener *vlistener,
>          return;
>      }
>  
> -    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
> -                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
> +    if (unlikely((section->offset_within_address_space & ~page_mask) !=
> +                 (section->offset_within_region & ~page_mask))) {
>          error_report("%s received unaligned region", __func__);
>          return;
>      }
> @@ -469,9 +481,9 @@ static void vfio_listener_region_del(VFIOMemoryListener *vlistener,
>           */
>      }
>  
> -    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> +    iova = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
>      end = (section->offset_within_address_space + int128_get64(section->size)) &
> -          TARGET_PAGE_MASK;
> +          page_mask;
>  
>      if (iova >= end) {
>          return;

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 11/16] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 11/16] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
@ 2016-03-03  6:30   ` David Gibson
  2016-03-15  2:53     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 50+ messages in thread
From: David Gibson @ 2016-03-03  6:30 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On Tue, Mar 01, 2016 at 08:10:36PM +1100, Alexey Kardashevskiy wrote:
> This makes use of the new "memory registering" feature. The idea is
> to give userspace the ability to notify the host kernel about pages
> which are going to be used for DMA. Having this information, the host
> kernel can pin them all once per user process, do locked pages
> accounting (once) and not spend time on doing that in real time with
> possible failures which cannot be handled nicely in some cases.
> 
> This adds a prereg memory listener which listens on address_space_memory
> and notifies a VFIO container about memory which needs to be
> pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
> 
> As there is no per-IOMMU-type release() callback anymore, this stores
> the IOMMU type in the container so vfio_listener_release() can decide
> if it needs to unregister @prereg_listener.
> 
> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> not call it when v2 is detected and enabled.
> 
> This does not change the guest visible interface.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  hw/vfio/Makefile.objs         |   1 +
>  hw/vfio/common.c              |  39 +++++++++---
>  hw/vfio/prereg.c              | 138 ++++++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |   4 ++
>  trace-events                  |   2 +
>  5 files changed, 175 insertions(+), 9 deletions(-)
>  create mode 100644 hw/vfio/prereg.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> index ceddbb8..5800e0e 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
>  obj-$(CONFIG_SOFTMMU) += platform.o
>  obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
>  obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
> +obj-$(CONFIG_SOFTMMU) += prereg.o
>  endif
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 3aaa6b5..f2a03e0 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -531,6 +531,9 @@ static const MemoryListener vfio_iommu_listener = {
>  static void vfio_listener_release(VFIOContainer *container)
>  {
>      memory_listener_unregister(&container->iommu_listener.listener);
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +        memory_listener_unregister(&container->prereg_listener.listener);
> +    }
>  }
>  
>  int vfio_mmap_region(Object *obj, VFIORegion *region,
> @@ -722,8 +725,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              goto free_container_exit;
>          }
>  
> -        ret = ioctl(fd, VFIO_SET_IOMMU,
> -                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
> +        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>          if (ret) {
>              error_report("vfio: failed to set iommu for container: %m");
>              ret = -errno;
> @@ -748,8 +751,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>          if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
>              container->iova_pgsizes = info.iova_pgsizes;
>          }
> -    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
> +               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
>          struct vfio_iommu_spapr_tce_info info;
> +        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
>  
>          ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>          if (ret) {
> @@ -757,7 +762,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              ret = -errno;
>              goto free_container_exit;
>          }
> -        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
> +        container->iommu_type =
> +            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);

It'd be nice to consolidate the setting of container->iommu_type and
then the SET_IOMMU ioctl() rather than having more or less duplicated
logic for Type1 and SPAPR, but it's not a big deal.
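
A rough sketch of what such a consolidation might look like, assuming a
table of candidate types probed in order of preference. Everything here
is hypothetical: the EX_* constants are placeholders and not the real
VFIO_*_IOMMU values, and the probe is passed in as a callback standing
in for ioctl(fd, VFIO_CHECK_EXTENSION, type):

```c
#include <stddef.h>

/* Placeholder identifiers, NOT the real VFIO_*_IOMMU constants. */
enum {
    EX_IOMMU_NONE = -1,
    EX_IOMMU_TYPE1,
    EX_IOMMU_TYPE1_V2,
    EX_IOMMU_SPAPR_TCE,
    EX_IOMMU_SPAPR_TCE_V2,
};

/* Stand-in for ioctl(fd, VFIO_CHECK_EXTENSION, type). */
typedef int (*ex_check_extension_fn)(int fd, int iommu_type);

/* Return the first supported type, preferring v2 within each family. */
static int ex_pick_iommu_type(int fd, ex_check_extension_fn check)
{
    static const int preferred[] = {
        EX_IOMMU_TYPE1_V2,
        EX_IOMMU_TYPE1,
        EX_IOMMU_SPAPR_TCE_V2,
        EX_IOMMU_SPAPR_TCE,
    };
    size_t i;

    for (i = 0; i < sizeof(preferred) / sizeof(preferred[0]); i++) {
        if (check(fd, preferred[i])) {
            return preferred[i];
        }
    }
    return EX_IOMMU_NONE;
}

/* Fake probe for illustration: "fd" is a bitmask of supported types. */
static int ex_fake_check_extension(int fd, int iommu_type)
{
    return (fd >> iommu_type) & 1;
}
```

The caller would then store the result in container->iommu_type and
issue a single VFIO_SET_IOMMU, leaving only the type-specific setup in
the Type1 and SPAPR branches.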

>          if (ret) {
>              error_report("vfio: failed to set iommu for container: %m");
>              ret = -errno;
> @@ -769,11 +776,25 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>           * when container fd is closed so we do not call it explicitly
>           * in this file.
>           */
> -        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> -        if (ret) {
> -            error_report("vfio: failed to enable container: %m");
> -            ret = -errno;
> -            goto free_container_exit;
> +        if (!v2) {
> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> +            if (ret) {
> +                error_report("vfio: failed to enable container: %m");
> +                ret = -errno;
> +                goto free_container_exit;
> +            }
> +        } else {
> +            container->prereg_listener.container = container;
> +            container->prereg_listener.listener = vfio_prereg_listener;
> +
> +            memory_listener_register(&container->prereg_listener.listener,
> +                                     &address_space_memory);

This assumes that the target address space of the (guest) IOMMU is
address_space_memory.  Which is fine - vfio already assumes that - but
it reminds me that it'd be nice to have an explicit check for that (I
guess it would have to go in vfio_iommu_map_notify()).  So that if
someone constructs a machine where that's not the case, it'll at least
be obvious why VFIO isn't working.

> +            if (container->error) {
> +                error_report("vfio: RAM memory listener initialization failed for container");
> +                memory_listener_unregister(
> +                    &container->prereg_listener.listener);
> +                goto free_container_exit;
> +            }
>          }

Looks like you don't have an error path which will handle the case
where the prereg listener is registered, but registering the normal
PCI AS listener fails - I believe you will fail to unregister the
prereg listener in that case.
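
To illustrate the kind of unwinding I mean, here is a sketch with mock
listeners, assuming the errors recorded by the callbacks are simply
passed in as arguments. All Ex* and ex_* names are hypothetical and
only model the control flow, not the real vfio_connect_container():

```c
#include <stdbool.h>

/* Mock listener: only tracks whether it is currently registered. */
typedef struct ExListener {
    bool registered;
} ExListener;

typedef struct ExContainer {
    ExListener iommu_listener;
    ExListener prereg_listener;
    bool is_v2; /* stands in for iommu_type == VFIO_SPAPR_TCE_v2_IOMMU */
    int error;
} ExContainer;

static void ex_listener_register(ExListener *l)
{
    l->registered = true;
}

static void ex_listener_unregister(ExListener *l)
{
    l->registered = false;
}

/*
 * prereg_err and iommu_err model the error a listener callback would
 * have stored in container->error during registration.
 */
static int ex_connect_listeners(ExContainer *c, int prereg_err, int iommu_err)
{
    if (c->is_v2) {
        ex_listener_register(&c->prereg_listener);
        c->error = prereg_err;
        if (c->error) {
            goto unregister_prereg;
        }
    }

    ex_listener_register(&c->iommu_listener);
    c->error = iommu_err;
    if (c->error) {
        goto unregister_iommu;
    }
    return 0;

unregister_iommu:
    ex_listener_unregister(&c->iommu_listener);
unregister_prereg:
    /* Reached from both failure points, so the prereg listener never leaks. */
    if (c->is_v2) {
        ex_listener_unregister(&c->prereg_listener);
    }
    return c->error;
}
```

The point is only that a failure of the second registration falls
through the label that undoes the first one.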

>          /*
> diff --git a/hw/vfio/prereg.c b/hw/vfio/prereg.c
> new file mode 100644
> index 0000000..66cd696
> --- /dev/null
> +++ b/hw/vfio/prereg.c
> @@ -0,0 +1,138 @@
> +/*
> + * DMA memory preregistration
> + *
> + * Authors:
> + *  Alexey Kardashevskiy <aik@ozlabs.ru>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include <sys/ioctl.h>
> +#include <linux/vfio.h>
> +
> +#include "hw/vfio/vfio-common.h"
> +#include "hw/vfio/vfio.h"
> +#include "qemu/error-report.h"
> +#include "trace.h"
> +
> +static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
> +{
> +    return (!memory_region_is_ram(section->mr) &&
> +            !memory_region_is_iommu(section->mr)) ||
> +            memory_region_is_skip_dump(section->mr);
> +}
> +
> +static void vfio_prereg_listener_region_add(MemoryListener *listener,
> +                                            MemoryRegionSection *section)
> +{
> +    VFIOMemoryListener *vlistener = container_of(listener, VFIOMemoryListener,
> +                                                 listener);
> +    VFIOContainer *container = vlistener->container;
> +    hwaddr iova;
> +    Int128 llend;
> +    int ret;
> +    hwaddr page_mask = qemu_real_host_page_mask;
> +    struct vfio_iommu_spapr_register_memory reg = {
> +        .argsz = sizeof(reg),
> +        .flags = 0,
> +    };
> +
> +    if (vfio_prereg_listener_skipped_section(section)) {
> +        trace_vfio_listener_region_add_skip(
> +                section->offset_within_address_space,
> +                section->offset_within_address_space +
> +                int128_get64(int128_sub(section->size, int128_one())));
> +        return;
> +    }

You should probably explicitly check for IOMMU regions and abort if
you find one.  An IOMMU AS appearing within address_space_memory would
be bad news.
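
As written, the skip predicate above lets IOMMU regions through to the
registration code. A sketch of a three-way classification instead,
assuming the three memory_region_is_*() results are passed in as
booleans (the enum and function names are hypothetical):

```c
#include <stdbool.h>

enum ex_prereg_action {
    EX_PREREG_REGISTER, /* plain RAM: preregister it */
    EX_PREREG_SKIP,     /* not RAM, or a "skip dump" (VFIO MMIO) region */
    EX_PREREG_FATAL,    /* IOMMU region in address_space_memory */
};

static enum ex_prereg_action ex_prereg_classify(bool is_ram, bool is_iommu,
                                                bool is_skip_dump)
{
    if (is_iommu) {
        /* Should never appear in the RAM address space; caller aborts. */
        return EX_PREREG_FATAL;
    }
    if (!is_ram || is_skip_dump) {
        return EX_PREREG_SKIP;
    }
    return EX_PREREG_REGISTER;
}
```

The listener would trace and return on EX_PREREG_SKIP, and report a
hardware error (or abort) on EX_PREREG_FATAL.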

> +    if (unlikely((section->offset_within_address_space & ~page_mask) !=
> +                 (section->offset_within_region & ~page_mask))) {
> +        error_report("%s received unaligned region", __func__);
> +        return;
> +    }
> +
> +    iova = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);

iova is a terrible name here.  This is *not* an IOVA, but a real
memory address.

> +    llend = int128_make64(section->offset_within_address_space);
> +    llend = int128_add(llend, section->size);
> +    llend = int128_and(llend, int128_exts64(page_mask));
> +
> +    if (int128_ge(int128_make64(iova), llend)) {
> +        return;

IIUC, if we get here something has gone horribly wrong in our machine
setup, and we should probably just abort.  Same goes for the similar
test in the IOVA listener, of course.
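
To spell out what that test is checking, here is a rough standalone
sketch of the page-alignment arithmetic.  The helper name is made up,
and it assumes plain 64-bit arithmetic that does not wrap (the patch
uses Int128 for the end address precisely to avoid that assumption):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of the patch: shrink the range
 * [start, start + size) inward to host page boundaries.
 * page_mask has the low (page offset) bits clear, e.g. ~0xfff for 4K. */
static bool page_aligned_range(uint64_t start, uint64_t size,
                               uint64_t page_mask,
                               uint64_t *first, uint64_t *end)
{
    uint64_t page_size = ~page_mask + 1;

    /* Equivalent of ROUND_UP(start, page_size) */
    *first = (start + page_size - 1) & page_mask;
    /* Round the end address down to a page boundary */
    *end = (start + size) & page_mask;

    /* False means the section does not cover even one whole page */
    return *first < *end;
}
```

A section that starts and ends inside the same page yields an empty
range, which is the case the quoted test returns on.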

> +    }
> +
> +    memory_region_ref(section->mr);
> +
> +    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
> +        section->offset_within_region +
> +        (iova - section->offset_within_address_space);
> +    reg.size = int128_get64(llend) - iova;
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
> +    trace_vfio_ram_register(reg.vaddr, reg.size, ret ? -errno : 0);
> +    if (ret) {
> +        /*
> +         * On the initfn path, store the first error in the container so we
> +         * can gracefully fail.  Runtime, there's not much we can do other
> +         * than throw a hardware error.
> +         */
> +        if (!container->initialized) {
> +            if (!container->error) {
> +                container->error = ret;
> +            }
> +        } else {
> +            hw_error("vfio: DMA mapping failed, unable to continue");

Wrong error message: what failed here is memory registration, not a
DMA mapping.

> +        }
> +    }
> +}
> +
> +static void vfio_prereg_listener_region_del(MemoryListener *listener,
> +                                            MemoryRegionSection *section)
> +{
> +    VFIOMemoryListener *vlistener = container_of(listener, VFIOMemoryListener,
> +                                                 listener);
> +    VFIOContainer *container = vlistener->container;
> +    hwaddr iova, end;
> +    int ret;
> +    hwaddr page_mask = qemu_real_host_page_mask;
> +    struct vfio_iommu_spapr_register_memory reg = {
> +        .argsz = sizeof(reg),
> +        .flags = 0,
> +    };
> +
> +    if (vfio_prereg_listener_skipped_section(section)) {
> +        trace_vfio_listener_region_del_skip(
> +                section->offset_within_address_space,
> +                section->offset_within_address_space +
> +                int128_get64(int128_sub(section->size, int128_one())));
> +        return;
> +    }
> +
> +    if (unlikely((section->offset_within_address_space & ~page_mask) !=
> +                 (section->offset_within_region & ~page_mask))) {
> +        error_report("%s received unaligned region", __func__);
> +        return;
> +    }
> +
> +    iova = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
> +    end = (section->offset_within_address_space + int128_get64(section->size)) &
> +        page_mask;
> +
> +    if (iova >= end) {
> +        return;
> +    }
> +
> +    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
> +        section->offset_within_region +
> +        (iova - section->offset_within_address_space);
> +    reg.size = end - iova;
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
> +    trace_vfio_ram_unregister(reg.vaddr, reg.size, ret ? -errno : 0);
> +}
> +
> +const MemoryListener vfio_prereg_listener = {
> +    .region_add = vfio_prereg_listener_region_add,
> +    .region_del = vfio_prereg_listener_region_del,
> +};
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index b6b736c..bcbc5cb 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -70,6 +70,8 @@ typedef struct VFIOContainer {
>      VFIOAddressSpace *space;
>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>      VFIOMemoryListener iommu_listener;
> +    VFIOMemoryListener prereg_listener;
> +    unsigned iommu_type;
>      int error;
>      bool initialized;
>      /*
> @@ -146,4 +148,6 @@ extern const MemoryRegionOps vfio_region_ops;
>  extern QLIST_HEAD(vfio_group_head, VFIOGroup) vfio_group_list;
>  extern QLIST_HEAD(vfio_as_head, VFIOAddressSpace) vfio_address_spaces;
>  
> +extern const MemoryListener vfio_prereg_listener;
> +
>  #endif /* !HW_VFIO_VFIO_COMMON_H */
> diff --git a/trace-events b/trace-events
> index 4b6ea70..f5335ec 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1725,6 +1725,8 @@ vfio_disconnect_container(int fd) "close container->fd=%d"
>  vfio_put_group(int fd) "close group->fd=%d"
>  vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions, unsigned int num_irqs) "Device %s flags: %u, regions: %u, irqs: %u"
>  vfio_put_base_device(int fd) "close vdev->fd=%d"
> +vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> +vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>  
>  # hw/vfio/platform.c
>  vfio_platform_populate_regions(int region_index, unsigned long flag, unsigned long size, int fd, unsigned long offset) "- region %d flags = 0x%lx, size = 0x%lx, fd= %d, offset = 0x%lx"

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 12/16] vmstate: Define VARRAY with VMS_ALLOC
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 12/16] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
@ 2016-03-03  6:31   ` David Gibson
  0 siblings, 0 replies; 50+ messages in thread
From: David Gibson @ 2016-03-03  6:31 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 2378 bytes --]

On Tue, Mar 01, 2016 at 08:10:37PM +1100, Alexey Kardashevskiy wrote:
> This allows dynamic allocation for migrating arrays.
> 
> Already existing VMSTATE_VARRAY_UINT32 requires an array to be
> pre-allocated, however there are cases when the size is not known in
> advance and there is no real need to enforce it.
> 
> This defines another variant of VMSTATE_VARRAY_UINT32 with VMS_ALLOC
> flag which tells the receiving side to allocate memory for the array
> before receiving the data.
> 
> The first user of it is a dynamic DMA window whose existence and size
> are totally dynamic.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> Reviewed-by: Thomas Huth <thuth@redhat.com>

This looks fine, but it might be worth sending separately, to seek
review from Juan or Dave Gilbert.

> ---
>  include/migration/vmstate.h | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> index 84ee355..1622638 100644
> --- a/include/migration/vmstate.h
> +++ b/include/migration/vmstate.h
> @@ -386,6 +386,16 @@ extern const VMStateInfo vmstate_info_bitmap;
>      .offset     = vmstate_offset_pointer(_state, _field, _type),     \
>  }
>  
> +#define VMSTATE_VARRAY_UINT32_ALLOC(_field, _state, _field_num, _version, _info, _type) {\
> +    .name       = (stringify(_field)),                               \
> +    .version_id = (_version),                                        \
> +    .num_offset = vmstate_offset_value(_state, _field_num, uint32_t),\
> +    .info       = &(_info),                                          \
> +    .size       = sizeof(_type),                                     \
> +    .flags      = VMS_VARRAY_UINT32|VMS_POINTER|VMS_ALLOC,           \
> +    .offset     = vmstate_offset_pointer(_state, _field, _type),     \
> +}
> +
>  #define VMSTATE_VARRAY_UINT16_UNSAFE(_field, _state, _field_num, _version, _info, _type) {\
>      .name       = (stringify(_field)),                               \
>      .version_id = (_version),                                        \

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 13/16] spapr_iommu: Remove need_vfio flag from sPAPRTCETable
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 13/16] spapr_iommu: Remove need_vfio flag from sPAPRTCETable Alexey Kardashevskiy
@ 2016-03-03  6:38   ` David Gibson
  0 siblings, 0 replies; 50+ messages in thread
From: David Gibson @ 2016-03-03  6:38 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 3926 bytes --]

On Tue, Mar 01, 2016 at 08:10:38PM +1100, Alexey Kardashevskiy wrote:
> sPAPRTCETable has a need_vfio flag which is passed to
> kvmppc_create_spapr_tce() and controls whether to create a guest view
> table in KVM as this depends on the host kernel ability to accelerate
> H_PUT_TCE for VFIO devices. We would set this flag at the moment
> when sPAPRTCETable is created in spapr_tce_new_table() and
> use when the table is allocated in spapr_tce_table_realize().
> 
> Now we explicitly enable/disable DMA windows via spapr_tce_table_enable()
> and spapr_tce_table_disable() and can pass this flag directly without
> caching it in sPAPRTCETable.
> 
> This removes the flag. This should cause no behavioural change.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  hw/ppc/spapr_iommu.c   | 13 +++++--------
>  include/hw/ppc/spapr.h |  1 -
>  2 files changed, 5 insertions(+), 9 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 4c52cf4..8aa2238 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -210,8 +210,9 @@ void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio)
>  {
>      size_t table_size = tcet->nb_table * sizeof(uint64_t);
>      void *newtable;
> +    bool tcet_can_vfio = tcet->fd < 0;
>  
> -    if (need_vfio == tcet->need_vfio) {
> +    if (need_vfio == tcet_can_vfio) {
>          /* Nothing to do */
>          return;
>      }
> @@ -222,8 +223,6 @@ void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio)
>          return;
>      }
>  
> -    tcet->need_vfio = true;
> -
>      if (tcet->fd < 0) {
>          /* Table is already in userspace, nothing to be do */
>          return;
> @@ -261,7 +260,7 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn)
>      return tcet;
>  }
>  
> -static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
> +static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool need_vfio)
>  {
>      if (!tcet->nb_table) {
>          return;
> @@ -271,7 +270,7 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
>                                          tcet->page_shift,
>                                          tcet->nb_table,
>                                          &tcet->fd,
> -                                        tcet->need_vfio);
> +                                        need_vfio);
>  
>      memory_region_set_size(&tcet->iommu,
>                             (uint64_t)tcet->nb_table << tcet->page_shift);
> @@ -291,9 +290,8 @@ void spapr_tce_table_enable(sPAPRTCETable *tcet,
>      tcet->bus_offset = bus_offset;
>      tcet->page_shift = page_shift;
>      tcet->nb_table = nb_table;
> -    tcet->need_vfio = need_vfio;
>  
> -    spapr_tce_table_do_enable(tcet);
> +    spapr_tce_table_do_enable(tcet, need_vfio);
>  }
>  
>  void spapr_tce_table_disable(sPAPRTCETable *tcet)
> @@ -312,7 +310,6 @@ void spapr_tce_table_disable(sPAPRTCETable *tcet)
>      tcet->bus_offset = 0;
>      tcet->page_shift = 0;
>      tcet->nb_table = 0;
> -    tcet->need_vfio = false;
>  }
>  
>  static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 5d2f8f4..505cb3a 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -546,7 +546,6 @@ struct sPAPRTCETable {
>      uint32_t page_shift;
>      uint64_t *table;
>      bool bypass;
> -    bool need_vfio;
>      int fd;
>      MemoryRegion root, iommu;
>      struct VIOsPAPRDevice *vdev; /* for @bypass migration compatibility only */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 14/16] spapr_pci: Add and export DMA resetting helper
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 14/16] spapr_pci: Add and export DMA resetting helper Alexey Kardashevskiy
@ 2016-03-03  6:39   ` David Gibson
  0 siblings, 0 replies; 50+ messages in thread
From: David Gibson @ 2016-03-03  6:39 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 2124 bytes --]

On Tue, Mar 01, 2016 at 08:10:39PM +1100, Alexey Kardashevskiy wrote:
> This will be later used by the "ibm,reset-pe-dma-window" RTAS handler
> which resets the DMA configuration to the defaults.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  hw/ppc/spapr_pci.c          | 11 ++++++++---
>  include/hw/pci-host/spapr.h |  2 ++
>  2 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index b0cd148..4c6e687 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -1441,10 +1441,8 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
>      return 0;
>  }
>  
> -static void spapr_phb_reset(DeviceState *qdev)
> +void spapr_phb_dma_reset(sPAPRPHBState *sphb)
>  {
> -    sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
> -
>      spapr_phb_dma_window_disable(sphb, sphb->dma_liobn);
>  
>      /* Register default 32bit DMA window */
> @@ -1452,6 +1450,13 @@ static void spapr_phb_reset(DeviceState *qdev)
>                                  SPAPR_TCE_PAGE_SHIFT,
>                                  sphb->dma_win_addr,
>                                  sphb->dma_win_size);
> +}
> +
> +static void spapr_phb_reset(DeviceState *qdev)
> +{
> +    sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
> +
> +    spapr_phb_dma_reset(sphb);
>  
>      /* Reset the IOMMU state */
>      object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL);
> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> index 03ee006..7848366 100644
> --- a/include/hw/pci-host/spapr.h
> +++ b/include/hw/pci-host/spapr.h
> @@ -147,4 +147,6 @@ static inline void spapr_phb_vfio_reset(DeviceState *qdev)
>  }
>  #endif
>  
> +void spapr_phb_dma_reset(sPAPRPHBState *sphb);
> +
>  #endif /* __HW_SPAPR_PCI_H__ */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 15/16] vfio: Move iova_pgsizes from container to guest IOMMU
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 15/16] vfio: Move iova_pgsizes from container to guest IOMMU Alexey Kardashevskiy
@ 2016-03-03 11:22   ` David Gibson
  2016-03-04  0:02     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 50+ messages in thread
From: David Gibson @ 2016-03-03 11:22 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 4540 bytes --]

On Tue, Mar 01, 2016 at 08:10:40PM +1100, Alexey Kardashevskiy wrote:
> The page size is an attribute of an IOMMU, not a container as a container
> may contain more than one IOMMU.
> 
> This moves iova_pgsizes from VFIOContainer to VFIOGuestIOMMU.
> The following patch will use this.
> 
> This removes iova_pgsizes from Type1 IOMMU as it is not used there anyway
> and when it will get guest visible IOMMU, it will use VFIOGuestIOMMU's
> iova_pgsizes.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Hmm.  This makes an important semantic change which I'm not sure is
wrong, but which certainly isn't adequately addressed in your commit
message.

The current iova_pgsizes is populated with information about the
*host* IOMMU, whereas you're replacing it with information about the
*guest* IOMMU.
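
To illustrate why that matters: the granularity helper takes the lowest
set bit of the page-size bitmap, so feeding it a different bitmap can
change the replay granularity.  A standalone sketch (the function name
is made up; the patch itself uses ctz64() and a shift):

```c
#include <stdint.h>

/* Smallest page size advertised in a page-size bitmap, where bit N set
 * means pages of size 1 << N are supported.  Returns 0 for an empty
 * bitmap instead of shifting by 64. */
static uint64_t min_page_size(uint64_t pgsizes)
{
    /* x & -x isolates the lowest set bit */
    return pgsizes & (~pgsizes + 1);
}
```

For example, a host bitmap of 0x1000 gives 4K granularity, while a
guest IOMMU that only advertises 64K pages (0x10000, a made-up value
here) would give 64K.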

> ---
>  hw/vfio/common.c              | 16 ++++------------
>  include/hw/vfio/vfio-common.h |  2 +-
>  2 files changed, 5 insertions(+), 13 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index f2a03e0..42ef1eb 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -313,9 +313,9 @@ out:
>      rcu_read_unlock();
>  }
>  
> -static hwaddr vfio_container_granularity(VFIOContainer *container)
> +static hwaddr vfio_container_granularity(VFIOGuestIOMMU *giommu)
>  {
> -    return (hwaddr)1 << ctz64(container->iova_pgsizes);
> +    return (hwaddr)1 << ctz64(giommu->iova_pgsizes);
>  }
>  
>  static hwaddr vfio_iommu_page_mask(MemoryRegion *mr)
> @@ -392,12 +392,13 @@ static void vfio_listener_region_add(VFIOMemoryListener *vlistener,
>              section->offset_within_address_space;
>          giommu->container = container;
>          giommu->n.notify = vfio_iommu_map_notify;
> +        giommu->iova_pgsizes = section->mr->iommu_ops->get_page_sizes(section->mr);
>          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>  
>          memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
>          giommu->iommu->iommu_ops->vfio_notify(section->mr, true);
>          memory_region_iommu_replay(giommu->iommu, &giommu->n,
> -                                   vfio_container_granularity(container),
> +                                   vfio_container_granularity(giommu),
>                                     false);
>  
>          return;
> @@ -743,14 +744,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>          container->min_iova = 0;
>          container->max_iova = (hwaddr)-1;
>  
> -        /* Assume just 4K IOVA page size */
> -        container->iova_pgsizes = 0x1000;
>          info.argsz = sizeof(info);
>          ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
> -        /* Ignore errors */
> -        if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
> -            container->iova_pgsizes = info.iova_pgsizes;
> -        }
>      } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
>                 ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
>          struct vfio_iommu_spapr_tce_info info;
> @@ -811,9 +806,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>          }
>          container->min_iova = info.dma32_window_start;
>          container->max_iova = container->min_iova + info.dma32_window_size - 1;
> -
> -        /* Assume just 4K IOVA pages for now */
> -        container->iova_pgsizes = 0x1000;
>      } else {
>          error_report("vfio: No available IOMMU models");
>          ret = -EINVAL;
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index bcbc5cb..48a1d7f 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -80,7 +80,6 @@ typedef struct VFIOContainer {
>       * future
>       */
>      hwaddr min_iova, max_iova;
> -    uint64_t iova_pgsizes;
>      QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>      QLIST_HEAD(, VFIOGroup) group_list;
>      QLIST_ENTRY(VFIOContainer) next;
> @@ -90,6 +89,7 @@ typedef struct VFIOGuestIOMMU {
>      VFIOContainer *container;
>      MemoryRegion *iommu;
>      hwaddr offset_within_address_space;
> +    uint64_t iova_pgsizes;
>      Notifier n;
>      QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>  } VFIOGuestIOMMU;

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 15/16] vfio: Move iova_pgsizes from container to guest IOMMU
  2016-03-03 11:22   ` [Qemu-devel] [Qemu-ppc] " David Gibson
@ 2016-03-04  0:02     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-04  0:02 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On 03/03/2016 10:22 PM, David Gibson wrote:
> On Tue, Mar 01, 2016 at 08:10:40PM +1100, Alexey Kardashevskiy wrote:
>> The page size is an attribute of an IOMMU, not a container as a container
>> may contain more than one IOMMU.
>>
>> This moves iova_pgsizes from VFIOContainer to VFIOGuestIOMMU.
>> The following patch will use this.
>>
>> This removes iova_pgsizes from Type1 IOMMU as it is not used there anyway
>> and when it will get guest visible IOMMU, it will use VFIOGuestIOMMU's
>> iova_pgsizes.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>
> Hmm.  This makes an important semantic change which.. I'm not sure is
> wrong, but certainly isn't adequately addressed in your commit
> message.
>
> The current iova_pgsizes is populated with information about the
> *host* IOMMU, whereas you're replacing it with information about the
> *guest* IOMMU.


Ah, I did not realize that. Then it should not be a move but an additional
giommu->iova_pgsizes field. And this probably answers todo#1 in 16/16 about
page masks.



>
>> ---
>>   hw/vfio/common.c              | 16 ++++------------
>>   include/hw/vfio/vfio-common.h |  2 +-
>>   2 files changed, 5 insertions(+), 13 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index f2a03e0..42ef1eb 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -313,9 +313,9 @@ out:
>>       rcu_read_unlock();
>>   }
>>
>> -static hwaddr vfio_container_granularity(VFIOContainer *container)
>> +static hwaddr vfio_container_granularity(VFIOGuestIOMMU *giommu)
>>   {
>> -    return (hwaddr)1 << ctz64(container->iova_pgsizes);
>> +    return (hwaddr)1 << ctz64(giommu->iova_pgsizes);
>>   }
>>
>>   static hwaddr vfio_iommu_page_mask(MemoryRegion *mr)
>> @@ -392,12 +392,13 @@ static void vfio_listener_region_add(VFIOMemoryListener *vlistener,
>>               section->offset_within_address_space;
>>           giommu->container = container;
>>           giommu->n.notify = vfio_iommu_map_notify;
>> +        giommu->iova_pgsizes = section->mr->iommu_ops->get_page_sizes(section->mr);
>>           QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>>
>>           memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
>>           giommu->iommu->iommu_ops->vfio_notify(section->mr, true);
>>           memory_region_iommu_replay(giommu->iommu, &giommu->n,
>> -                                   vfio_container_granularity(container),
>> +                                   vfio_container_granularity(giommu),
>>                                      false);
>>
>>           return;
>> @@ -743,14 +744,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>           container->min_iova = 0;
>>           container->max_iova = (hwaddr)-1;
>>
>> -        /* Assume just 4K IOVA page size */
>> -        container->iova_pgsizes = 0x1000;
>>           info.argsz = sizeof(info);
>>           ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
>> -        /* Ignore errors */
>> -        if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
>> -            container->iova_pgsizes = info.iova_pgsizes;
>> -        }
>>       } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
>>                  ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
>>           struct vfio_iommu_spapr_tce_info info;
>> @@ -811,9 +806,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>           }
>>           container->min_iova = info.dma32_window_start;
>>           container->max_iova = container->min_iova + info.dma32_window_size - 1;
>> -
>> -        /* Assume just 4K IOVA pages for now */
>> -        container->iova_pgsizes = 0x1000;
>>       } else {
>>           error_report("vfio: No available IOMMU models");
>>           ret = -EINVAL;
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index bcbc5cb..48a1d7f 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -80,7 +80,6 @@ typedef struct VFIOContainer {
>>        * future
>>        */
>>       hwaddr min_iova, max_iova;
>> -    uint64_t iova_pgsizes;
>>       QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>>       QLIST_HEAD(, VFIOGroup) group_list;
>>       QLIST_ENTRY(VFIOContainer) next;
>> @@ -90,6 +89,7 @@ typedef struct VFIOGuestIOMMU {
>>       VFIOContainer *container;
>>       MemoryRegion *iommu;
>>       hwaddr offset_within_address_space;
>> +    uint64_t iova_pgsizes;
>>       Notifier n;
>>       QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>>   } VFIOGuestIOMMU;
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 09/16] vfio: Generalize IOMMU memory listener
  2016-03-03  6:07     ` Alexey Kardashevskiy
@ 2016-03-04  3:44       ` David Gibson
  0 siblings, 0 replies; 50+ messages in thread
From: David Gibson @ 2016-03-04  3:44 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 6829 bytes --]

On Thu, Mar 03, 2016 at 05:07:33PM +1100, Alexey Kardashevskiy wrote:
> On 03/03/2016 04:36 PM, David Gibson wrote:
> >On Tue, Mar 01, 2016 at 08:10:34PM +1100, Alexey Kardashevskiy wrote:
> >>At the moment VFIOContainer uses one memory listener which listens on
> >>PCI address space for both Type1 and sPAPR IOMMUs. Soon we will need
> >>another listener to listen on RAM; this will do DMA memory
> >>pre-registration for sPAPR guests which basically pins all guest
> >>pages in the host physical RAM.
> >>
> >>This introduces VFIOMemoryListener which is wrapper for MemoryListener
> >>and stores a pointer to the container. This allows having multiple
> >>memory listeners for the same container. This replaces the existing
> >>@listener with @iommu_listener.
> >>
> >>This should cause no change in behavior.
> >
> >This is nonsense.
> >
> >The two listeners you're talking about have (or should have) both a
> >different AS they're listening on,
> 
> They do have different AS.
> 
> >*and* different notification
> >functions.
> 
> They do use totally different region_add/region_del, later in the series.
> 
> >Since they have nothing in common, there's no point trying
> >to build a common structure for them.
> 
> They use the same VFIOContainer pointer. VFIOMemoryListener is made of
> MemoryListener and VFIOContainer, and that's it.

Right, but you don't need the container pointer.  In both cases you
can locate the VFIOContainer with container_of.  It's a different
container_of invocation in each case, but since they're different
callback functions, that's no problem.
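
Roughly what I have in mind, as a standalone sketch.  The structs are
cut-down stand-ins for the QEMU types, container_of is defined locally
so the snippet compiles on its own, and embedding prereg_listener as a
plain MemoryListener is my assumption about how the reworked patch
would look:

```c
#include <stddef.h>

/* Simplified stand-ins for the QEMU types; only the layout matters. */
typedef struct MemoryListener {
    int priority;
} MemoryListener;

typedef struct VFIOContainer {
    int fd;
    MemoryListener iommu_listener;   /* listens on the PCI address space */
    MemoryListener prereg_listener;  /* listens on RAM */
} VFIOContainer;

/* Local definition so the sketch is self-contained */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Each callback names the member it is registered through, so no
 * wrapper struct carrying a back pointer is needed. */
static VFIOContainer *container_from_iommu_listener(MemoryListener *l)
{
    return container_of(l, VFIOContainer, iommu_listener);
}

static VFIOContainer *container_from_prereg_listener(MemoryListener *l)
{
    return container_of(l, VFIOContainer, prereg_listener);
}
```

The only requirement is that each callback is registered solely on the
member it names in its container_of invocation.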

> Ok, I'll get rid of VFIOMemoryListener. It is just hard sometime to
> understand what bits I have to reuse and which I do not, constant
> argument...

I think the arguments for trying to re-use things here were based on
a misunderstanding of what the prereg listener was for, and therefore
not realizing that it has basically nothing in common with the regular
listener.

> 
> >
> >>
> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>---
> >>  hw/vfio/common.c              | 41 +++++++++++++++++++++++++++++++----------
> >>  include/hw/vfio/vfio-common.h |  9 ++++++++-
> >>  2 files changed, 39 insertions(+), 11 deletions(-)
> >>
> >>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>index ca3fd47..0e67a5a 100644
> >>--- a/hw/vfio/common.c
> >>+++ b/hw/vfio/common.c
> >>@@ -318,10 +318,10 @@ static hwaddr vfio_container_granularity(VFIOContainer *container)
> >>      return (hwaddr)1 << ctz64(container->iova_pgsizes);
> >>  }
> >>
> >>-static void vfio_listener_region_add(MemoryListener *listener,
> >>+static void vfio_listener_region_add(VFIOMemoryListener *vlistener,
> >>                                       MemoryRegionSection *section)
> >>  {
> >>-    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> >>+    VFIOContainer *container = vlistener->container;
> >>      hwaddr iova, end;
> >>      Int128 llend;
> >>      void *vaddr;
> >>@@ -425,10 +425,10 @@ fail:
> >>      }
> >>  }
> >>
> >>-static void vfio_listener_region_del(MemoryListener *listener,
> >>+static void vfio_listener_region_del(VFIOMemoryListener *vlistener,
> >>                                       MemoryRegionSection *section)
> >>  {
> >>-    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> >>+    VFIOContainer *container = vlistener->container;
> >>      hwaddr iova, end;
> >>      int ret;
> >>      MemoryRegion *iommu = NULL;
> >>@@ -492,14 +492,33 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >>      }
> >>  }
> >>
> >>-static const MemoryListener vfio_memory_listener = {
> >>-    .region_add = vfio_listener_region_add,
> >>-    .region_del = vfio_listener_region_del,
> >>+static void vfio_iommu_listener_region_add(MemoryListener *listener,
> >>+                                           MemoryRegionSection *section)
> >>+{
> >>+    VFIOMemoryListener *vlistener = container_of(listener, VFIOMemoryListener,
> >>+                                                 listener);
> >>+
> >>+    vfio_listener_region_add(vlistener, section);
> >>+}
> >>+
> >>+
> >>+static void vfio_iommu_listener_region_del(MemoryListener *listener,
> >>+                                           MemoryRegionSection *section)
> >>+{
> >>+    VFIOMemoryListener *vlistener = container_of(listener, VFIOMemoryListener,
> >>+                                                 listener);
> >>+
> >>+    vfio_listener_region_del(vlistener, section);
> >>+}
> >>+
> >>+static const MemoryListener vfio_iommu_listener = {
> >>+    .region_add = vfio_iommu_listener_region_add,
> >>+    .region_del = vfio_iommu_listener_region_del,
> >>  };
> >>
> >>  static void vfio_listener_release(VFIOContainer *container)
> >>  {
> >>-    memory_listener_unregister(&container->listener);
> >>+    memory_listener_unregister(&container->iommu_listener.listener);
> >>  }
> >>
> >>  int vfio_mmap_region(Object *obj, VFIORegion *region,
> >>@@ -768,9 +787,11 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>          goto free_container_exit;
> >>      }
> >>
> >>-    container->listener = vfio_memory_listener;
> >>+    container->iommu_listener.container = container;
> >>+    container->iommu_listener.listener = vfio_iommu_listener;
> >>
> >>-    memory_listener_register(&container->listener, container->space->as);
> >>+    memory_listener_register(&container->iommu_listener.listener,
> >>+                             container->space->as);
> >>
> >>      if (container->error) {
> >>          ret = container->error;
> >>diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >>index 9ffa681..b6b736c 100644
> >>--- a/include/hw/vfio/vfio-common.h
> >>+++ b/include/hw/vfio/vfio-common.h
> >>@@ -57,12 +57,19 @@ typedef struct VFIOAddressSpace {
> >>      QLIST_ENTRY(VFIOAddressSpace) list;
> >>  } VFIOAddressSpace;
> >>
> >>+typedef struct VFIOContainer VFIOContainer;
> >>+
> >>+typedef struct VFIOMemoryListener {
> >>+    struct MemoryListener listener;
> >>+    VFIOContainer *container;
> >>+} VFIOMemoryListener;
> >>+
> >>  struct VFIOGroup;
> >>
> >>  typedef struct VFIOContainer {
> >>      VFIOAddressSpace *space;
> >>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
> >>-    MemoryListener listener;
> >>+    VFIOMemoryListener iommu_listener;
> >>      int error;
> >>      bool initialized;
> >>      /*
> >
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 07/16] vfio, memory: Notify IOMMU about starting/stopping being used by VFIO
  2016-03-03  6:01     ` Alexey Kardashevskiy
@ 2016-03-04  4:01       ` David Gibson
  0 siblings, 0 replies; 50+ messages in thread
From: David Gibson @ 2016-03-04  4:01 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 8992 bytes --]

On Thu, Mar 03, 2016 at 05:01:31PM +1100, Alexey Kardashevskiy wrote:
> On 03/03/2016 04:28 PM, David Gibson wrote:
> >On Tue, Mar 01, 2016 at 08:10:32PM +1100, Alexey Kardashevskiy wrote:
> >>This adds a vfio_notify() callback to inform an IOMMU (and then its owner)
> >>that VFIO started using the IOMMU. This is used by the pseries machine to
> >>enable/disable in-kernel acceleration of TCE hypercalls.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >
> >Hmm.. the current approach of having a hook when vfio-pci devices are
> >attached is pretty ugly, but what exactly is the case that it doesn't
> >handle and this approach does?
> 
> Sorry, I am not following you here. What hook do you mean here?
> 
> My hook fixes the case when I want to enable/disable KVM acceleration,
> without these patches, I need to re-count how many vfio-pci devices are
> there and it is more ugly with PCI hotplug/unplug...
> 
> 
> >This two tiered notify system for a single bit is also kinda ugly.
> >
> >>---
> >>  hw/ppc/spapr_iommu.c   |  9 +++++++++
> >>  hw/ppc/spapr_pci.c     | 14 ++++++++------
> >>  hw/vfio/common.c       |  7 +++++++
> >>  include/exec/memory.h  |  2 ++
> >>  include/hw/ppc/spapr.h |  4 ++++
> >>  5 files changed, 30 insertions(+), 6 deletions(-)
> >>
> >>diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> >>index 8a88a74..67a8356 100644
> >>--- a/hw/ppc/spapr_iommu.c
> >>+++ b/hw/ppc/spapr_iommu.c
> >>@@ -136,6 +136,13 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
> >>      return ret;
> >>  }
> >>
> >>+static int spapr_tce_vfio_notify(MemoryRegion *iommu, bool attached)
> >>+{
> >>+    sPAPRTCETable *tcet = container_of(iommu, sPAPRTCETable, iommu);
> >>+
> >>+    return spapr_tce_vfio_notify_owner(tcet->owner, tcet, attached);
> >
> >I'm guessing the "owner" is the PHB, but I'm not entirely clear.
> >
> >Could you use the QOM parent to get the PHB instead of storing it
> >explicitly?
> 
> 
> I am pretty sure I am not allowed to use the QOM parent, this is why there
> is no object_get_parent() helper.

Hmm.. I thought I had this discussion before and accessing the qom
parent from qmp was bad, but it was ok for internal code use.  But I
may be getting muddled with older qdev stuff.

> 
> >
> >>+}
> >>+
> >>  static int spapr_tce_table_post_load(void *opaque, int version_id)
> >>  {
> >>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> >>@@ -167,6 +174,7 @@ static const VMStateDescription vmstate_spapr_tce_table = {
> >>
> >>  static MemoryRegionIOMMUOps spapr_iommu_ops = {
> >>      .translate = spapr_tce_translate_iommu,
> >>+    .vfio_notify = spapr_tce_vfio_notify,
> >>  };
> >>
> >>  static int spapr_tce_table_realize(DeviceState *dev)
> >>@@ -235,6 +243,7 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn)
> >>
> >>      tcet = SPAPR_TCE_TABLE(object_new(TYPE_SPAPR_TCE_TABLE));
> >>      tcet->liobn = liobn;
> >>+    tcet->owner = owner;
> >>
> >>      snprintf(tmp, sizeof(tmp), "tce-table-%x", liobn);
> >>      object_property_add_child(OBJECT(owner), tmp, OBJECT(tcet), NULL);
> >>diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> >>index ee0fecf..b0cd148 100644
> >>--- a/hw/ppc/spapr_pci.c
> >>+++ b/hw/ppc/spapr_pci.c
> >>@@ -1084,6 +1084,14 @@ static int spapr_populate_pci_child_dt(PCIDevice *dev, void *fdt, int offset,
> >>      return 0;
> >>  }
> >>
> >>+int spapr_tce_vfio_notify_owner(DeviceState *dev, sPAPRTCETable *tcet,
> >>+                                bool attached)
> >>+{
> >>+    spapr_tce_set_need_vfio(tcet, attached);
> >
> >Hmm.. you go to the trouble of storing owner in dev, then don't
> >actually use it.
> 
> 
> Yeah, I need to clean this, I removed spapr_tce_vfio_notify_owner() from my
> working branch already and call spapr_tce_set_need_vfio() directly from
> spapr_tce_vfio_notify().

Ok.

> 
> 
> >
> >>+    return 0;
> >>+}
> >>+
> >>  /* create OF node for pci device and required OF DT properties */
> >>  static int spapr_create_pci_child_dt(sPAPRPHBState *phb, PCIDevice *dev,
> >>                                       void *fdt, int node_offset)
> >>@@ -1118,12 +1126,6 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
> >>      void *fdt = NULL;
> >>      int fdt_start_offset = 0, fdt_size;
> >>
> >>-    if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
> >>-        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
> >>-
> >>-        spapr_tce_set_need_vfio(tcet, true);
> >>-    }
> >>-
> >>      if (dev->hotplugged) {
> >>          fdt = create_device_tree(&fdt_size);
> >>          fdt_start_offset = spapr_create_pci_child_dt(phb, pdev, fdt, 0);
> >>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>index 9bf4c3b..ca3fd47 100644
> >>--- a/hw/vfio/common.c
> >>+++ b/hw/vfio/common.c
> >>@@ -384,6 +384,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >>          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
> >>
> >>          memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> >>+        giommu->iommu->iommu_ops->vfio_notify(section->mr, true);
> >>          memory_region_iommu_replay(giommu->iommu, &giommu->n,
> >>                                     vfio_container_granularity(container),
> >>                                     false);
> >>@@ -430,6 +431,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >>      VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> >>      hwaddr iova, end;
> >>      int ret;
> >>+    MemoryRegion *iommu = NULL;
> >>
> >>      if (vfio_listener_skipped_section(section)) {
> >>          trace_vfio_listener_region_del_skip(
> >>@@ -451,6 +453,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >>          QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> >>              if (giommu->iommu == section->mr) {
> >>                  memory_region_unregister_iommu_notifier(&giommu->n);
> >>+                iommu = giommu->iommu;
> >>                  QLIST_REMOVE(giommu, giommu_next);
> >>                  g_free(giommu);
> >>                  break;
> >>@@ -483,6 +486,10 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >>                       "0x%"HWADDR_PRIx") = %d (%m)",
> >>                       container, iova, end - iova, ret);
> >>      }
> >>+
> >>+    if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_notify) {
> >>+        iommu->iommu_ops->vfio_notify(section->mr, false);
> >>+    }
> >
> >So, if an IOMMU is removed from the guest, this will turn off VFIO
> >enablement.  However, IIUC this won't get called in the more likely
> >case that the address space stays the same, but the VFIO device is
> >removed.
> 
> 
> When VFIO device is removed, its listener gets removed and this is supposed
> to end up calling vfio_listener_region_del().

Ah, ok region_del() gets called on all existing regions when the
listener is removed.  I guess that makes sense since region_add() is
called on the existing regions when the listener is registered.
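
A minimal sketch of the replay behaviour described above, assuming a
heavily simplified model: the types and helpers below (Listener,
AddressSpaceModel, listener_register(), listener_unregister()) are
hypothetical stand-ins, not QEMU's MemoryListener API. Registering walks
the existing regions and calls region_add() for each; unregistering walks
them again and calls region_del(), which is what would let a VFIO listener
drop its "in use" state when the device goes away.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_REGIONS 8

/* Hypothetical, simplified stand-in for a memory listener. */
typedef struct Listener {
    void (*region_add)(struct Listener *l, int region_id);
    void (*region_del)(struct Listener *l, int region_id);
    int attached; /* number of regions this listener currently tracks */
} Listener;

/* Hypothetical address space: just a flat list of region ids. */
typedef struct AddressSpaceModel {
    int regions[MAX_REGIONS];
    size_t nr_regions;
} AddressSpaceModel;

/* Example callbacks that only count how many regions are tracked. */
static void count_add(Listener *l, int region_id)
{
    (void)region_id;
    l->attached++;
}

static void count_del(Listener *l, int region_id)
{
    (void)region_id;
    l->attached--;
}

/* Registration replays region_add() for every region already present. */
static void listener_register(Listener *l, const AddressSpaceModel *as)
{
    for (size_t i = 0; i < as->nr_regions; i++) {
        if (l->region_add) {
            l->region_add(l, as->regions[i]);
        }
    }
}

/* Removal replays region_del() for every region, in reverse order. */
static void listener_unregister(Listener *l, const AddressSpaceModel *as)
{
    for (size_t i = as->nr_regions; i > 0; i--) {
        if (l->region_del) {
            l->region_del(l, as->regions[i - 1]);
        }
    }
}
```

With three regions present, registering a counting listener leaves
attached at 3 and unregistering brings it back to 0, mirroring the
symmetry between the two paths.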



> 
> 
> >
> >>  }
> >>
> >>  static const MemoryListener vfio_memory_listener = {
> >>diff --git a/include/exec/memory.h b/include/exec/memory.h
> >>index d5284c2..9f82629 100644
> >>--- a/include/exec/memory.h
> >>+++ b/include/exec/memory.h
> >>@@ -150,6 +150,8 @@ typedef struct MemoryRegionIOMMUOps MemoryRegionIOMMUOps;
> >>  struct MemoryRegionIOMMUOps {
> >>      /* Return a TLB entry that contains a given address. */
> >>      IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
> >>+    /* Called when VFIO starts/stops using this */
> >>+    int (*vfio_notify)(MemoryRegion *iommu, bool attached);
> >>  };
> >>
> >>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> >>diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >>index 8aa0c45..5d2f8f4 100644
> >>--- a/include/hw/ppc/spapr.h
> >>+++ b/include/hw/ppc/spapr.h
> >>@@ -550,6 +550,7 @@ struct sPAPRTCETable {
> >>      int fd;
> >>      MemoryRegion root, iommu;
> >>      struct VIOsPAPRDevice *vdev; /* for @bypass migration compatibility only */
> >>+    DeviceState *owner;
> >>      QLIST_ENTRY(sPAPRTCETable) list;
> >>  };
> >>
> >>@@ -629,4 +630,7 @@ int spapr_rng_populate_dt(void *fdt);
> >>   */
> >>  #define SPAPR_LMB_FLAGS_ASSIGNED 0x00000008
> >>
> >>+int spapr_tce_vfio_notify_owner(DeviceState *dev, sPAPRTCETable *tcet,
> >>+                                bool attached);
> >>+
> >>  #endif /* !defined (__HW_SPAPR_H__) */
> >
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 05/16] spapr_iommu: Add root memory region
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 05/16] spapr_iommu: Add root memory region Alexey Kardashevskiy
@ 2016-03-04  4:08   ` David Gibson
  0 siblings, 0 replies; 50+ messages in thread
From: David Gibson @ 2016-03-04  4:08 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 4771 bytes --]

On Tue, Mar 01, 2016 at 08:10:30PM +1100, Alexey Kardashevskiy wrote:
> We are going to have multiple DMA windows at different offsets on
> a PCI bus. For the sake of migration, we will pre-create as many TCE table
> objects as there are windows supported.
> So we need a way to map windows dynamically onto a PCI bus
> when migration of a table is completed but at this stage a TCE table
> object does not have access to a PHB to ask it to map a DMA window
> backed by just migrated TCE table.
> 
> This adds a "root" memory region (UINT64_MAX long) to the TCE object.
> This new region is mapped on a PCI bus with enabled overlapping as
> there will be one root MR per TCE table, each of them mapped at 0.
> The actual IOMMU memory region is a subregion of the root region and
> a TCE table enables/disables this subregion and maps it at
> the specific offset inside the root MR which is 1:1 mapping of
> a PCI address space.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> Reviewed-by: Thomas Huth <thuth@redhat.com>
> ---
>  hw/ppc/spapr_iommu.c   | 13 ++++++++++---
>  hw/ppc/spapr_pci.c     |  5 +++--
>  include/hw/ppc/spapr.h |  2 +-
>  3 files changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index e66e128..ba9ddbb 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -172,10 +172,15 @@ static MemoryRegionIOMMUOps spapr_iommu_ops = {
>  static int spapr_tce_table_realize(DeviceState *dev)
>  {
>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
> +    Object *tcetobj = OBJECT(tcet);
> +    char tmp[32];
>  
>      tcet->fd = -1;
> -    memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
> -                             "iommu-spapr", 0);
> +    snprintf(tmp, sizeof(tmp), "tce-root-%x", tcet->liobn);
> +    memory_region_init(&tcet->root, tcetobj, tmp, UINT64_MAX);
> +
> +    snprintf(tmp, sizeof(tmp), "tce-iommu-%x", tcet->liobn);
> +    memory_region_init_iommu(&tcet->iommu, tcetobj, &spapr_iommu_ops, tmp, 0);
>  
>      QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
>  
> @@ -253,6 +258,7 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
>  
>      memory_region_set_size(&tcet->iommu,
>                             (uint64_t)tcet->nb_table << tcet->page_shift);
> +    memory_region_add_subregion(&tcet->root, tcet->bus_offset, &tcet->iommu);
>  
>      tcet->enabled = true;
>  }
> @@ -279,6 +285,7 @@ static void spapr_tce_table_disable(sPAPRTCETable *tcet)
>          return;
>      }
>  
> +    memory_region_del_subregion(&tcet->root, &tcet->iommu);
>      memory_region_set_size(&tcet->iommu, 0);
>  
>      spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
> @@ -302,7 +309,7 @@ static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
>  
>  MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
>  {
> -    return &tcet->iommu;
> +    return &tcet->root;
>  }
>  
>  static void spapr_tce_reset(DeviceState *dev)
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index c34a906..7b40687 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -822,8 +822,6 @@ static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>  
>      spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table, false);
>  
> -    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
> -                                spapr_tce_get_iommu(tcet));
>      return 0;
>  }
>  
> @@ -1411,6 +1409,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>          return;
>      }
>  
> +    memory_region_add_subregion(&sphb->iommu_root, 0,
> +                                spapr_tce_get_iommu(tcet));
> +

Logically this patch should add the _overlap() option rather than a
later one, yes?


>      /* Register default 32bit DMA window */
>      spapr_phb_dma_window_enable(sphb, sphb->dma_liobn,
>                                  SPAPR_TCE_PAGE_SHIFT,
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 3e6bb84..bdf27ec 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -548,7 +548,7 @@ struct sPAPRTCETable {
>      bool bypass;
>      bool need_vfio;
>      int fd;
> -    MemoryRegion iommu;
> +    MemoryRegion root, iommu;
>      struct VIOsPAPRDevice *vdev; /* for @bypass migration compatibility only */
>      QLIST_ENTRY(sPAPRTCETable) list;
>  };

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 16/16] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 16/16] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
@ 2016-03-04  4:51   ` David Gibson
  2016-03-11  9:03     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 50+ messages in thread
From: David Gibson @ 2016-03-04  4:51 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 35843 bytes --]

On Tue, Mar 01, 2016 at 08:10:41PM +1100, Alexey Kardashevskiy wrote:
> This adds support for Dynamic DMA Windows (DDW) option defined by
> the SPAPR specification, which allows having additional DMA window(s).
> 
> This implements DDW for emulated and VFIO devices. As all TCE root regions
> are mapped at 0 and 64bit long (and actual tables are child regions),
> this replaces memory_region_add_subregion() with _overlap() to make
> QEMU memory API happy.
> 
> This reserves RTAS token numbers for DDW calls.
> 
> This changes the TCE table migration descriptor to support dynamic
> tables as from now on, PHB will create as many stub TCE table objects
> as PHB can possibly support but not all of them might be initialized at
> the time of migration because DDW might or might not be requested by
> the guest.
> 
> The "ddw" property is enabled by default on a PHB but for compatibility
> the pseries-2.5 machine and older disable it.
> 
> This implements DDW for VFIO. The host kernel support is required.
> This adds a "levels" property to PHB to control the number of levels
> in the actual TCE table allocated by the host kernel, 0 is the default
> value to tell QEMU to calculate the correct value. Current hardware
> supports up to 5 levels.
> 
> The existing Linux guests try creating one additional huge DMA window
> with 64K or 16MB pages and map the entire guest RAM to it. If this succeeds,
> the guest switches to dma_direct_ops and never calls TCE hypercalls
> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
> and not waste time on map/unmap later. This adds a "dma64_win_addr"
> property which is a bus address for the 64bit window and by default
> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
> uses and this allows having emulated and VFIO devices on the same bus.
> 
> This adds 4 RTAS handlers:
> * ibm,query-pe-dma-window
> * ibm,create-pe-dma-window
> * ibm,remove-pe-dma-window
> * ibm,reset-pe-dma-window
> These are registered from type_init() callback.
> 
> These RTAS handlers are implemented in a separate file to avoid polluting
> spapr_iommu.c with PCI.
> 
> TODO (which I have no idea how to implement properly):
> 1. check the host kernel actually supports SPAPR_PCI_DMA_MAX_WINDOWS
> windows and 12/16/24 page shift;

As noted in a different subthread, this information is there in the
container.

> 2. fix container::min_iova, max_iova - as for now, they are useless,
> and I'd expect IOMMU MR boundaries to serve this purpose really;

This seems to show a similar confusion of concepts to #1.
container::min_iova, container::max_iova advertise limitations of the
host IOMMU, the IOMMU MR boundaries show constraints of the guest
IOMMU.  You need to verify the guest constraints against the host
constraints.

A more flexible method than min/max iova will be necessary though, now
that the host IOMMU allows more flexible configurations than a single
window.
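
To make the guest-versus-host check concrete, here is a rough sketch that
assumes the host IOMMU can be described as a list of usable IOVA ranges
instead of a single min/max pair. IovaRange and guest_window_fits_host()
are hypothetical names for illustration only; nothing like them exists in
the posted series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one IOVA range the host IOMMU can map. */
typedef struct IovaRange {
    uint64_t start;
    uint64_t last; /* inclusive, so a range ending at 2^64 - 1 is expressible */
} IovaRange;

/*
 * Return true if the guest window [win_start, win_start + win_size) lies
 * entirely inside one of the host ranges.
 */
static bool guest_window_fits_host(const IovaRange *host, size_t nr_host,
                                   uint64_t win_start, uint64_t win_size)
{
    uint64_t win_last;

    if (win_size == 0) {
        return false;
    }
    win_last = win_start + (win_size - 1);
    if (win_last < win_start) {
        return false; /* the window wraps around the 64bit space */
    }
    for (size_t i = 0; i < nr_host; i++) {
        if (win_start >= host[i].start && win_last <= host[i].last) {
            return true;
        }
    }
    return false;
}
```

The guest IOMMU MR boundaries would supply win_start and win_size, while
the host side would fill in the range list; a window that spans two host
ranges is rejected in this sketch.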

> 3. vfio_listener_region_add/vfio_listener_region_del do explicitly
> create/remove huge DMA window as we do not have vfio_container_ioctl()
> anymore, do we want to move these to some sort of callbacks? How, where?
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> 
> # Conflicts:
> #	include/hw/pci-host/spapr.h
> 
> # Conflicts:
> #	hw/vfio/common.c
> ---
>  hw/ppc/Makefile.objs        |   1 +
>  hw/ppc/spapr.c              |   7 +-
>  hw/ppc/spapr_iommu.c        |  32 ++++-
>  hw/ppc/spapr_pci.c          |  61 +++++++--
>  hw/ppc/spapr_rtas_ddw.c     | 306 ++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/common.c            |  70 +++++++++-
>  include/hw/pci-host/spapr.h |  13 ++
>  include/hw/ppc/spapr.h      |  17 ++-
>  trace-events                |   6 +
>  9 files changed, 489 insertions(+), 24 deletions(-)
>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
> 
> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> index c1ffc77..986b36f 100644
> --- a/hw/ppc/Makefile.objs
> +++ b/hw/ppc/Makefile.objs
> @@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>  obj-y += spapr_pci_vfio.o
>  endif
> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>  # PowerPC 4xx boards
>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>  obj-y += ppc4xx_pci.o
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index e9d4abf..2473217 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -2370,7 +2370,12 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
>   * pseries-2.5
>   */
>  #define SPAPR_COMPAT_2_5 \
> -        HW_COMPAT_2_5
> +        HW_COMPAT_2_5 \
> +        {\
> +            .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
> +            .property = "ddw",\
> +            .value    = stringify(off),\
> +        },
>  
>  static void spapr_machine_2_5_instance_options(MachineState *machine)
>  {
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 8aa2238..e32f71b 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -150,6 +150,15 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
>      return 1ULL << tcet->page_shift;
>  }
>  
> +static void spapr_tce_table_pre_save(void *opaque)
> +{
> +    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> +
> +    tcet->migtable = tcet->table;
> +}
> +
> +static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel);
> +
>  static int spapr_tce_table_post_load(void *opaque, int version_id)
>  {
>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> @@ -158,22 +167,39 @@ static int spapr_tce_table_post_load(void *opaque, int version_id)
>          spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
>      }
>  
> +    if (tcet->enabled) {
> +        if (!tcet->table) {
> +            tcet->enabled = false;
> +            /* VFIO does not migrate so pass vfio_accel == false */
> +            spapr_tce_table_do_enable(tcet, false);
> +        }

What if there was an existing table, but its size doesn't match that
in the incoming migration?  Don't you need to free() it and
re-allocate?  IIUC this would happen in practice if you migrated a
guest which had removed the default window and replaced it with one of
a different size.

> +        memcpy(tcet->table, tcet->migtable,
> +               tcet->nb_table * sizeof(tcet->table[0]));
> +        free(tcet->migtable);
> +        tcet->migtable = NULL;
> +    }

Likewise, what if your incoming migration is of a guest which has
completely removed the default window?  Don't you need to free the
existing default table?
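
Both cases could be handled along the following lines. This is only a
sketch under simplifying assumptions: TableModel and table_model_load()
are hypothetical, the table is plain heap memory (no KVM fd, no VFIO), and
the incoming state is passed as arguments instead of going through
VMState.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, reduced view of a TCE table on the destination. */
typedef struct TableModel {
    uint64_t *table;   /* live table, NULL when the window is disabled */
    uint32_t nb_table; /* number of entries in the live table */
} TableModel;

/*
 * Apply incoming migration state to the destination table.
 * Returns 0 on success, -1 if the new table cannot be allocated.
 */
static int table_model_load(TableModel *t, const uint64_t *incoming,
                            uint32_t incoming_nb, int incoming_enabled)
{
    /* Source removed the window: drop any table created at realize time. */
    if (!incoming_enabled || incoming_nb == 0) {
        free(t->table);
        t->table = NULL;
        t->nb_table = 0;
        return 0;
    }

    /* No table yet, or the size differs: free and re-allocate. */
    if (!t->table || t->nb_table != incoming_nb) {
        uint64_t *fresh = calloc(incoming_nb, sizeof(*fresh));

        if (!fresh) {
            return -1;
        }
        free(t->table);
        t->table = fresh;
        t->nb_table = incoming_nb;
    }

    memcpy(t->table, incoming, (size_t)incoming_nb * sizeof(t->table[0]));
    return 0;
}
```

The point of the sketch is the ordering: compare the sizes before the
memcpy(), and treat "disabled on the source" as a reason to free whatever
the destination created by default.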

>      return 0;
>  }
>  
>  static const VMStateDescription vmstate_spapr_tce_table = {
>      .name = "spapr_iommu",
> -    .version_id = 2,
> +    .version_id = 3,
>      .minimum_version_id = 2,
> +    .pre_save = spapr_tce_table_pre_save,
>      .post_load = spapr_tce_table_post_load,
>      .fields      = (VMStateField []) {
>          /* Sanity check */
>          VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
> -        VMSTATE_UINT32_EQUAL(nb_table, sPAPRTCETable),
>  
>          /* IOMMU state */
> +        VMSTATE_BOOL_V(enabled, sPAPRTCETable, 3),
> +        VMSTATE_UINT64_V(bus_offset, sPAPRTCETable, 3),
> +        VMSTATE_UINT32_V(page_shift, sPAPRTCETable, 3),
> +        VMSTATE_UINT32(nb_table, sPAPRTCETable),
>          VMSTATE_BOOL(bypass, sPAPRTCETable),
> -        VMSTATE_VARRAY_UINT32(table, sPAPRTCETable, nb_table, 0, vmstate_info_uint64, uint64_t),
> +        VMSTATE_VARRAY_UINT32_ALLOC(migtable, sPAPRTCETable, nb_table, 0,
> +                                    vmstate_info_uint64, uint64_t),
>  
>          VMSTATE_END_OF_LIST()
>      },
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 4c6e687..1bc0710 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -803,10 +803,10 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
>      return buf;
>  }
>  
> -static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> -                                       uint32_t liobn, uint32_t page_shift,
> -                                       uint64_t window_addr,
> -                                       uint64_t window_size)
> +int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> +                                uint32_t liobn, uint32_t page_shift,
> +                                uint64_t window_addr,
> +                                uint64_t window_size)
>  {
>      sPAPRTCETable *tcet;
>      uint32_t nb_table = window_size >> page_shift;
> @@ -820,12 +820,16 @@ static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>          return -1;
>      }
>  
> +    if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
> +        return -1;
> +    }
> +
>      spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table, false);
>  
>      return 0;
>  }
>  
> -static int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
> +int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
>  {
>      sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
>  
> @@ -1418,14 +1422,21 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>      }
>  
>      /* DMA setup */
> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
> -    if (!tcet) {
> -        error_report("No default TCE table for %s", sphb->dtbusname);
> -        return;
> -    }
> +    sphb->windows_supported = SPAPR_PCI_DMA_MAX_WINDOWS;
> +    sphb->page_size_mask = (1ULL << 12) | (1ULL << 16) | (1ULL << 24);
> +    sphb->dma64_window_size = pow2ceil(ram_size);

Why do you need this value?  Isn't the size of the dma64 window
supplied when you create it with RTAS?  It makes more sense to me to
validate the value at that point rather than here where you have to
use a global.

Plus.. if your machine allows hotplug memory you probably need
maxram_size, rather than ram_size here.
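
A sketch of what validating at creation time might look like, assuming
the limit is derived from maxram_size when memory hotplug is possible.
pow2ceil_u64(), dma64_window_limit() and ddw_size_ok() are hypothetical
helpers written for this illustration; they are not QEMU functions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round up to the next power of two.
 * Returns 0 if the result does not fit in 64 bits.
 */
static uint64_t pow2ceil_u64(uint64_t value)
{
    uint64_t result = 1;

    if (value > (UINT64_C(1) << 63)) {
        return 0;
    }
    while (result < value) {
        result <<= 1;
    }
    return result;
}

/* Largest window worth allowing: cover RAM including hotpluggable memory. */
static uint64_t dma64_window_limit(uint64_t ram_size, uint64_t maxram_size)
{
    uint64_t top = maxram_size > ram_size ? maxram_size : ram_size;

    return pow2ceil_u64(top);
}

/* Check a size requested by the guest when the window is created. */
static int ddw_size_ok(uint64_t requested, uint64_t ram_size,
                       uint64_t maxram_size)
{
    uint64_t limit = dma64_window_limit(ram_size, maxram_size);

    return requested != 0 && limit != 0 && requested <= limit;
}
```

For a guest with 3GB of RAM and a 6GB hotplug ceiling this gives an 8GB
limit, so a 4GB request passes and a 16GB request is refused.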

>  
> -    memory_region_add_subregion(&sphb->iommu_root, 0,
> -                                spapr_tce_get_iommu(tcet));
> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> +        tcet = spapr_tce_new_table(DEVICE(sphb),
> +                                   SPAPR_PCI_LIOBN(sphb->index, i));
> +        if (!tcet) {
> +            error_setg(errp, "Creating window#%d failed for %s",
> +                       i, sphb->dtbusname);
> +            return;
> +        }
> +        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
> +                                            spapr_tce_get_iommu(tcet), 0);
> +    }
>  
>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>  }
> @@ -1443,7 +1454,11 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
>  
>  void spapr_phb_dma_reset(sPAPRPHBState *sphb)
>  {
> -    spapr_phb_dma_window_disable(sphb, sphb->dma_liobn);
> +    int i;
> +
> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> +        spapr_phb_dma_window_disable(sphb, SPAPR_PCI_LIOBN(sphb->index, i));
> +    }
>  
>      /* Register default 32bit DMA window */
>      spapr_phb_dma_window_enable(sphb, sphb->dma_liobn,
> @@ -1481,6 +1496,9 @@ static Property spapr_phb_properties[] = {
>      /* Default DMA window is 0..1GB */
>      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
>      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
> +                       0x800000000000000ULL),
> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> @@ -1734,6 +1752,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      uint32_t interrupt_map_mask[] = {
>          cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
>      uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
> +    uint32_t ddw_applicable[] = {
> +        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
> +        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
> +        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
> +    };
> +    uint32_t ddw_extensions[] = {
> +        cpu_to_be32(1),
> +        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
> +    };
>      sPAPRTCETable *tcet;
>      PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
>      sPAPRFDT s_fdt;
> @@ -1758,6 +1785,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
>  
> +    /* Dynamic DMA window */
> +    if (phb->ddw_enabled) {
> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
> +                         sizeof(ddw_applicable)));
> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
> +                         &ddw_extensions, sizeof(ddw_extensions)));
> +    }
> +
>      /* Build the interrupt-map, this must matches what is done
>       * in pci_spapr_map_irq
>       */
> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
> new file mode 100644
> index 0000000..b8ea910
> --- /dev/null
> +++ b/hw/ppc/spapr_rtas_ddw.c
> @@ -0,0 +1,306 @@
> +/*
> + * QEMU sPAPR Dynamic DMA windows support
> + *
> + * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License,
> + *  or (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/error-report.h"
> +#include "hw/ppc/spapr.h"
> +#include "hw/pci-host/spapr.h"
> +#include "trace.h"
> +
> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
> +{
> +    sPAPRTCETable *tcet;
> +
> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> +    if (tcet && tcet->enabled) {
> +        ++*(unsigned *)opaque;
> +    }
> +    return 0;
> +}
> +
> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
> +{
> +    unsigned ret = 0;
> +
> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
> +
> +    return ret;
> +}
> +
> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
> +{
> +    sPAPRTCETable *tcet;
> +
> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> +    if (tcet && !tcet->enabled) {
> +        *(uint32_t *)opaque = tcet->liobn;
> +        return 1;
> +    }
> +    return 0;
> +}
> +
> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
> +{
> +    uint32_t liobn = 0;
> +
> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
> +
> +    return liobn;
> +}
> +
> +static uint32_t spapr_query_mask(struct ppc_one_seg_page_size *sps,
> +                                 uint64_t page_mask)
> +{
> +    int i, j;
> +    uint32_t mask = 0;
> +    const struct { int shift; uint32_t mask; } masks[] = {
> +        { 12, RTAS_DDW_PGSIZE_4K },
> +        { 16, RTAS_DDW_PGSIZE_64K },
> +        { 24, RTAS_DDW_PGSIZE_16M },
> +        { 25, RTAS_DDW_PGSIZE_32M },
> +        { 26, RTAS_DDW_PGSIZE_64M },
> +        { 27, RTAS_DDW_PGSIZE_128M },
> +        { 28, RTAS_DDW_PGSIZE_256M },
> +        { 34, RTAS_DDW_PGSIZE_16G },
> +    };
> +
> +    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
> +        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
> +            if ((sps[i].page_shift == masks[j].shift) &&
> +                    (page_mask & (1ULL << masks[j].shift))) {
> +                mask |= masks[j].mask;
> +            }
> +        }
> +    }

Hmm... checking against the list of page sizes supported by the vcpu
seems conceptually wrong, although it's probably correct in practice.
Is there a way of checking directly against the page sizes supported by
the host IOMMU?
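
For illustration, a standalone sketch of what a direct translation could
look like, skipping the vcpu page size list entirely. The
RTAS_DDW_PGSIZE_* values are the ones this patch adds to spapr.h;
ddw_mask_from_iommu_pgsizes() and its iommu_pgsizes argument (bit N set
meaning 2^N byte pages are supported) are made-up names, and where that
bitmap would come from on the host side is left open.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* DDW page size mask values, as added to spapr.h by this patch */
#define RTAS_DDW_PGSIZE_4K       0x01
#define RTAS_DDW_PGSIZE_64K      0x02
#define RTAS_DDW_PGSIZE_16M      0x04
#define RTAS_DDW_PGSIZE_32M      0x08
#define RTAS_DDW_PGSIZE_64M      0x10
#define RTAS_DDW_PGSIZE_128M     0x20
#define RTAS_DDW_PGSIZE_256M     0x40
#define RTAS_DDW_PGSIZE_16G      0x80

/*
 * Hypothetical helper: translate an IOMMU page size bitmap straight
 * into the RTAS mask, without consulting the vcpu page sizes.
 */
uint32_t ddw_mask_from_iommu_pgsizes(uint64_t iommu_pgsizes)
{
    static const struct { int shift; uint32_t mask; } masks[] = {
        { 12, RTAS_DDW_PGSIZE_4K },
        { 16, RTAS_DDW_PGSIZE_64K },
        { 24, RTAS_DDW_PGSIZE_16M },
        { 25, RTAS_DDW_PGSIZE_32M },
        { 26, RTAS_DDW_PGSIZE_64M },
        { 27, RTAS_DDW_PGSIZE_128M },
        { 28, RTAS_DDW_PGSIZE_256M },
        { 34, RTAS_DDW_PGSIZE_16G },
    };
    uint32_t mask = 0;
    size_t i;

    for (i = 0; i < sizeof(masks) / sizeof(masks[0]); i++) {
        /* Report a size only if the IOMMU bitmap has that bit set */
        if (iommu_pgsizes & (1ULL << masks[i].shift)) {
            mask |= masks[i].mask;
        }
    }
    return mask;
}
```

Page sizes that have no RTAS encoding (8K, say) are simply dropped from
the result.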

> +    return mask;
> +}
> +
> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
> +                                         sPAPRMachineState *spapr,
> +                                         uint32_t token, uint32_t nargs,
> +                                         target_ulong args,
> +                                         uint32_t nret, target_ulong rets)
> +{
> +    CPUPPCState *env = &cpu->env;
> +    sPAPRPHBState *sphb;
> +    uint64_t buid;
> +    uint32_t avail, addr, pgmask = 0;
> +    unsigned current;
> +
> +    if ((nargs != 3) || (nret != 5)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    current = spapr_phb_get_active_win_num(sphb);
> +    avail = (sphb->windows_supported > current) ?
> +            (sphb->windows_supported - current) : 0;

sphb->windows_supported < current indicates a bug in qemu, surely?  So
you should be able to do without the ?:.
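
Roughly what I have in mind, as a standalone sketch;
ddw_windows_available() is a made-up name and the two arguments stand
in for sphb->windows_supported and the active window count.

```c
#include <assert.h>

/* Hypothetical helper: number of windows the guest may still create. */
unsigned ddw_windows_available(unsigned supported, unsigned active)
{
    /* More active windows than supported means qemu itself miscounted */
    assert(active <= supported);
    return supported - active;
}
```

With the invariant asserted, the plain subtraction cannot wrap.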

> +
> +    /* Work out supported page masks */
> +    pgmask = spapr_query_mask(env->sps.sps, sphb->page_size_mask);
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    rtas_st(rets, 1, avail);
> +
> +    /*
> +     * This is "Largest contiguous block of TCEs allocated specifically
> +     * for (that is, are reserved for) this PE".
> +     * Return the maximum number as all RAM was in 4K pages.
> +     */
> +    rtas_st(rets, 2, sphb->dma64_window_size >> SPAPR_TCE_PAGE_SHIFT);
> +    rtas_st(rets, 3, pgmask);
> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
> +
> +    trace_spapr_iommu_ddw_query(buid, addr, avail, sphb->dma64_window_size,
> +                                pgmask);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
> +                                          sPAPRMachineState *spapr,
> +                                          uint32_t token, uint32_t nargs,
> +                                          target_ulong args,
> +                                          uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    sPAPRTCETable *tcet = NULL;
> +    uint32_t addr, page_shift, window_shift, liobn;
> +    uint64_t buid;
> +    long ret;
> +
> +    if ((nargs != 5) || (nret != 4)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    page_shift = rtas_ld(args, 3);
> +    window_shift = rtas_ld(args, 4);
> +    liobn = spapr_phb_get_free_liobn(sphb);
> +
> +    if (!liobn || !(sphb->page_size_mask & (1ULL << page_shift))) {
> +        goto hw_error_exit;
> +    }
> +
> +    if (window_shift < page_shift) {
> +        goto param_error_exit;
> +    }
> +
> +    ret = spapr_phb_dma_window_enable(sphb, liobn, page_shift,
> +                                      sphb->dma64_window_addr,
> +                                      1ULL << window_shift);
> +    tcet = spapr_tce_find_by_liobn(liobn);
> +    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
> +                                 1ULL << window_shift,
> +                                 tcet ? tcet->bus_offset : 0xbaadf00d,
> +                                 liobn, ret);
> +    if (ret || !tcet) {
> +        goto hw_error_exit;
> +    }

!ret && !tcet would indicate a qemu bug, surely, so an assert would
make more sense.

> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    rtas_st(rets, 1, liobn);
> +    rtas_st(rets, 2, tcet->bus_offset >> 32);
> +    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
> +
> +    return;
> +
> +hw_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
> +                                          sPAPRMachineState *spapr,
> +                                          uint32_t token, uint32_t nargs,
> +                                          target_ulong args,
> +                                          uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    sPAPRTCETable *tcet;
> +    uint32_t liobn;
> +    long ret;
> +
> +    if ((nargs != 1) || (nret != 1)) {
> +        goto param_error_exit;
> +    }
> +
> +    liobn = rtas_ld(args, 0);
> +    tcet = spapr_tce_find_by_liobn(liobn);
> +    if (!tcet) {
> +        goto param_error_exit;
> +    }
> +
> +    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    ret = spapr_phb_dma_window_disable(sphb, liobn);
> +    trace_spapr_iommu_ddw_remove(liobn, ret);
> +    if (ret) {
> +        goto hw_error_exit;
> +    }
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    return;
> +
> +hw_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
> +                                         sPAPRMachineState *spapr,
> +                                         uint32_t token, uint32_t nargs,
> +                                         target_ulong args,
> +                                         uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    uint64_t buid;
> +    uint32_t addr;
> +    long ret = 0;

ret is never assigned a value other than 0; remove it.

> +    if ((nargs != 3) || (nret != 1)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    spapr_phb_dma_reset(sphb);
> +    trace_spapr_iommu_ddw_reset(buid, addr, ret);
> +    if (ret) {
> +        goto hw_error_exit;
> +    }
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +
> +    return;
> +
> +hw_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void spapr_rtas_ddw_init(void)
> +{
> +    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
> +                        "ibm,query-pe-dma-window",
> +                        rtas_ibm_query_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
> +                        "ibm,create-pe-dma-window",
> +                        rtas_ibm_create_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
> +                        "ibm,remove-pe-dma-window",
> +                        rtas_ibm_remove_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
> +                        "ibm,reset-pe-dma-window",
> +                        rtas_ibm_reset_pe_dma_window);
> +}
> +
> +type_init(spapr_rtas_ddw_init)
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 42ef1eb..2332f8e 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -395,6 +395,39 @@ static void vfio_listener_region_add(VFIOMemoryListener *vlistener,
>          giommu->iova_pgsizes = section->mr->iommu_ops->get_page_sizes(section->mr);
>          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);

It might make this easier to review if the guest side (non-VFIO) and
VFIO parts were in different patches.

> +        if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {

Might want to split this stuff out into a "new guest iommu" helper.
It would want to first check if the guest IOMMU can be supported with
the existing host IOMMU windows.  If not, and the host IOMMU supports
it (i.e. SPAPR_TCE_v2_IOMMU) it would attempt to create a new host
window.
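
A rough standalone sketch of the flow I mean. HostWindow, HostIommu and
new_guest_iommu() are hypothetical stand-ins, not existing qemu
structures; the place where real code would issue
VFIO_IOMMU_SPAPR_TCE_CREATE is only marked with a comment, and the page
size compatibility check is left out to keep it short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HOST_WINDOWS 4

/* Hypothetical stand-in for one host IOMMU window */
typedef struct HostWindow {
    uint64_t start;   /* first bus address covered */
    uint64_t size;    /* length in bytes */
} HostWindow;

/* Hypothetical stand-in for the container's view of the host IOMMU */
typedef struct HostIommu {
    bool dynamic;     /* true if the host can add windows (v2 style) */
    HostWindow windows[MAX_HOST_WINDOWS];
    size_t nr_windows;
} HostIommu;

/* True if [start, start + size) lies entirely inside the host window */
static bool window_covers(const HostWindow *w, uint64_t start, uint64_t size)
{
    return start >= w->start && size <= w->size &&
           start - w->start <= w->size - size;
}

/* Returns 0 if the guest window ends up backed by a host window */
int new_guest_iommu(HostIommu *host, uint64_t start, uint64_t size)
{
    size_t i;

    /* First see whether an existing host window already does the job */
    for (i = 0; i < host->nr_windows; i++) {
        if (window_covers(&host->windows[i], start, size)) {
            return 0;
        }
    }

    /* Otherwise we can only proceed if the host IOMMU is dynamic */
    if (!host->dynamic || host->nr_windows == MAX_HOST_WINDOWS) {
        return -1;
    }

    /* Real code would ask the kernel to create the window here */
    host->windows[host->nr_windows].start = start;
    host->windows[host->nr_windows].size = size;
    host->nr_windows++;
    return 0;
}
```

A guest window that fits the default 32-bit window needs no new host
window at all; only a request outside every existing window reaches the
create step, and a non-dynamic host fails it cleanly.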

> +            int ret;
> +            struct vfio_iommu_spapr_tce_create create = {
> +                .argsz = sizeof(create),
> +                .page_shift = ctz64(giommu->iova_pgsizes),
> +                .window_size = memory_region_size(section->mr),
> +                .levels = 0,
> +                .start_addr = 0,
> +            };
> +
> +            /*
> +             * Dynamic windows are supported, that means that there is no
> +             * pre-created window and we have to create one.
> +             */
> +            if (!create.levels) {

This test will always be true.

> +                unsigned entries = create.window_size >> create.page_shift;
> +                unsigned pages = (entries * sizeof(uint64_t)) / getpagesize();
> +                /* 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4 */
> +                create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;

Hmm.. does it make more sense for qemu to apply this heuristic, or the kernel?
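
Wherever it ends up, the expression itself looks suspect to me: for any
pow2ceil(pages) >= 2, pow2ceil(pages) - 1 has its lowest bit set, so
ctz64() of it is 0 and levels comes out as 1. Below is a standalone
sketch of the mapping that the comment describes (0..64 = 1,
65..4096 = 2, 4097..262144 = 3, more = 4). ceil_log2_u64() and
tce_levels_for_pages() are made-up names, and this is my reading of the
intent rather than a tested replacement.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest n such that (1 << n) >= pages; 0 for pages <= 1 */
unsigned ceil_log2_u64(uint64_t pages)
{
    unsigned n = 0;

    while (n < 63 && (1ULL << n) < pages) {
        n++;
    }
    return n;
}

/*
 * Hypothetical helper: one extra level for every factor of 64 in the
 * number of table pages, capped at 4 as in the patch comment.
 */
unsigned tce_levels_for_pages(uint64_t pages)
{
    unsigned bits = ceil_log2_u64(pages);
    unsigned levels;

    if (bits == 0) {
        return 1;   /* zero or one page: a single level is enough */
    }
    levels = (bits - 1) / 6 + 1;
    return levels > 4 ? 4 : levels;
}
```

For example, 4096 table pages gives bits = 12 and (12 - 1) / 6 + 1 = 2,
while 4097 pages rounds up to bits = 13 and gives 3.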

> +            }
> +            ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> +            if (ret) {
> +                error_report("Failed to create a window");
> +            }
> +
> +            if (create.start_addr != section->offset_within_address_space) {
> +                error_report("Something went wrong!");

Shouldn't you at least set start_addr before the ioctl() as a hint to
the kernel?

> +            }
> +            trace_vfio_spapr_create_window(create.page_shift,
> +                                           create.window_size,
> +                                           create.start_addr);
> +        }
> +
>          memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
>          giommu->iommu->iommu_ops->vfio_notify(section->mr, true);
>          memory_region_iommu_replay(giommu->iommu, &giommu->n,
> @@ -500,6 +533,18 @@ static void vfio_listener_region_del(VFIOMemoryListener *vlistener,
>                       container, iova, end - iova, ret);
>      }
>  
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +        struct vfio_iommu_spapr_tce_remove remove = {
> +            .argsz = sizeof(remove),
> +            .start_addr = section->offset_within_address_space,
> +        };
> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> +        if (ret) {
> +            error_report("Failed to remove window");
> +        }
> +
> +        trace_vfio_spapr_remove_window(remove.start_addr);
> +    }
>      if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_notify) {
>          iommu->iommu_ops->vfio_notify(section->mr, false);
>      }
> @@ -792,11 +837,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              }
>          }
>  
> -        /*
> -         * This only considers the host IOMMU's 32-bit window.  At
> -         * some point we need to add support for the optional 64-bit
> -         * window and dynamic windows
> -         */
>          info.argsz = sizeof(info);
>          ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
>          if (ret) {
> @@ -805,7 +845,25 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              goto free_container_exit;
>          }
>          container->min_iova = info.dma32_window_start;
> -        container->max_iova = container->min_iova + info.dma32_window_size - 1;
> +        container->max_iova = (hwaddr)-1;

Rather than hacking min/max iova here, I think it makes more sense for
the "create new host window" path to *replace* the tests against
min/max iova in the add_region path.  Basically the min/max iova tests
are a (rather dumb) check of whether the new guest window is
compatible with the host windows.  When the host windows are dynamic a
static test doesn't make sense and should be replaced by the code to
create a new host window (and error if it can't make a matching one).

> +
> +        if (v2) {
> +            /*
> +             * There is a default window in just created container.
> +             * To make region_add/del happy, we better remove this window now
> +             * and let those iommu_listener callbacks create them when needed.
> +             */
> +            struct vfio_iommu_spapr_tce_remove remove = {
> +                .argsz = sizeof(remove),
> +                .start_addr = info.dma32_window_start,
> +            };
> +            ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> +            if (ret) {
> +                error_report("vfio: VFIO_IOMMU_SPAPR_TCE_REMOVE failed: %m");
> +                ret = -errno;
> +                goto free_container_exit;
> +            }
> +        }
>      } else {
>          error_report("vfio: No available IOMMU models");
>          ret = -EINVAL;
> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> index 7848366..855e458 100644
> --- a/include/hw/pci-host/spapr.h
> +++ b/include/hw/pci-host/spapr.h
> @@ -71,6 +71,12 @@ struct sPAPRPHBState {
>      spapr_pci_msi_mig *msi_devs;
>  
>      QLIST_ENTRY(sPAPRPHBState) list;
> +
> +    bool ddw_enabled;
> +    uint32_t windows_supported;
> +    uint64_t page_size_mask;
> +    uint64_t dma64_window_addr;
> +    uint64_t dma64_window_size;
>  };
>  
>  #define SPAPR_PCI_MAX_INDEX          255
> @@ -89,6 +95,8 @@ struct sPAPRPHBState {
>  
>  #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
>  
> +#define SPAPR_PCI_DMA_MAX_WINDOWS    2
> +
>  static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb, int pin)
>  {
>      sPAPRMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> @@ -148,5 +156,10 @@ static inline void spapr_phb_vfio_reset(DeviceState *qdev)
>  #endif
>  
>  void spapr_phb_dma_reset(sPAPRPHBState *sphb);
> +int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> +                                uint32_t liobn, uint32_t page_shift,
> +                                uint64_t window_addr,
> +                                uint64_t window_size);
> +int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn);
>  
>  #endif /* __HW_SPAPR_PCI_H__ */
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 505cb3a..4f59d1b 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -417,6 +417,16 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>  #define RTAS_OUT_NOT_AUTHORIZED                 -9002
>  #define RTAS_OUT_SYSPARM_PARAM_ERROR            -9999
>  
> +/* DDW pagesize mask values from ibm,query-pe-dma-window */
> +#define RTAS_DDW_PGSIZE_4K       0x01
> +#define RTAS_DDW_PGSIZE_64K      0x02
> +#define RTAS_DDW_PGSIZE_16M      0x04
> +#define RTAS_DDW_PGSIZE_32M      0x08
> +#define RTAS_DDW_PGSIZE_64M      0x10
> +#define RTAS_DDW_PGSIZE_128M     0x20
> +#define RTAS_DDW_PGSIZE_256M     0x40
> +#define RTAS_DDW_PGSIZE_16G      0x80
> +
>  /* RTAS tokens */
>  #define RTAS_TOKEN_BASE      0x2000
>  
> @@ -458,8 +468,12 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>  #define RTAS_IBM_SET_SLOT_RESET                 (RTAS_TOKEN_BASE + 0x23)
>  #define RTAS_IBM_CONFIGURE_PE                   (RTAS_TOKEN_BASE + 0x24)
>  #define RTAS_IBM_SLOT_ERROR_DETAIL              (RTAS_TOKEN_BASE + 0x25)
> +#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x26)
> +#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x27)
> +#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x28)
> +#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x29)
>  
> -#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x26)
> +#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x2A)
>  
>  /* RTAS ibm,get-system-parameter token values */
>  #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
> @@ -545,6 +559,7 @@ struct sPAPRTCETable {
>      uint64_t bus_offset;
>      uint32_t page_shift;
>      uint64_t *table;
> +    uint64_t *migtable;
>      bool bypass;
>      int fd;
>      MemoryRegion root, iommu;
> diff --git a/trace-events b/trace-events
> index f5335ec..c7314b6 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1432,6 +1432,10 @@ spapr_iommu_pci_indirect(uint64_t liobn, uint64_t ioba, uint64_t tce, uint64_t i
>  spapr_iommu_pci_stuff(uint64_t liobn, uint64_t ioba, uint64_t tce_value, uint64_t npages, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcevalue=0x%"PRIx64" npages=%"PRId64" ret=%"PRId64
>  spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, unsigned pgsize) "liobn=%"PRIx64" 0x%"PRIx64" -> 0x%"PRIx64" perm=%u mask=%x"
>  spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
> +spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32
> +spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, uint64_t pg_size, uint64_t req_size, uint64_t start, uint32_t liobn, long ret) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%"PRIx64", requested=0x%"PRIx64", start addr=%"PRIx64", liobn=%"PRIx32", ret = %ld"
> +spapr_iommu_ddw_remove(uint32_t liobn, long ret) "liobn=%"PRIx32", ret = %ld"
> +spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr, long ret) "buid=%"PRIx64" addr=%"PRIx32", ret = %ld"
>  
>  # hw/ppc/ppc.c
>  ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
> @@ -1727,6 +1731,8 @@ vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions,
>  vfio_put_base_device(int fd) "close vdev->fd=%d"
>  vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>  vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> +vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
> +vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
>  
>  # hw/vfio/platform.c
>  vfio_platform_populate_regions(int region_index, unsigned long flag, unsigned long size, int fd, unsigned long offset) "- region %d flags = 0x%lx, size = 0x%lx, fd= %d, offset = 0x%lx"

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 02/16] spapr_pci: Move DMA window enablement to a helper
  2016-03-03  1:40   ` [Qemu-devel] [Qemu-ppc] " David Gibson
@ 2016-03-10  5:47     ` Alexey Kardashevskiy
  2016-03-15  5:30       ` David Gibson
  0 siblings, 1 reply; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-10  5:47 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On 03/03/2016 12:40 PM, David Gibson wrote:
> On Tue, Mar 01, 2016 at 08:10:27PM +1100, Alexey Kardashevskiy wrote:
>> We are going to have multiple DMA windows soon so let's start preparing.
>>
>> This adds a new helper to create a DMA window and makes use of it in
>> sPAPRPHBState::realize().
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>   hw/ppc/spapr_pci.c | 40 +++++++++++++++++++++++++++-------------
>>   1 file changed, 27 insertions(+), 13 deletions(-)
>>
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index 3d1145e..248f20a 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -803,6 +803,29 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
>>       return buf;
>>   }
>>
>> +static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>> +                                       uint32_t liobn, uint32_t page_shift,
>> +                                       uint64_t window_addr,
>> +                                       uint64_t window_size)
>> +{
>> +    sPAPRTCETable *tcet;
>> +    uint32_t nb_table = window_size >> page_shift;
>> +
>> +    if (!nb_table) {
>> +        return -1;
>> +    }
>
> The caller shouldn't do this, so this probably makes more sense as an
> assert() than an error return.


@dma_win_size is a PHB property, so the command line can set it to zero.
Where is that supposed to fail? When DMA won't work? That would not be
obvious to the user. Should I check dma_win_size==0 where we check
mem_win_addr/...?
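
To make the question concrete, a standalone sketch of the kind of check
I mean; check_dma_window_size() is a made-up helper, not something in
the tree, and it assumes page_shift is below 64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical property check: returns NULL if the window holds at
 * least one TCE, otherwise a message for the caller to report.
 */
const char *check_dma_window_size(uint64_t win_size, unsigned page_shift)
{
    if ((win_size >> page_shift) == 0) {
        return "DMA window is smaller than one TCE page";
    }
    return NULL;
}
```

With that done next to the other property checks, a zero nb_table
inside the helper could only come from a qemu bug, and an assert()
there would be fine.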

>
>> +
>> +    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, window_addr,
>> +                               page_shift, nb_table, false);
>> +    if (!tcet) {
>> +        return -1;
>> +    }
>
> Since you're adding error reporting, you might as well make it via the
> error API instead of a return code.  That way if we want to add more
> detailed error API reporting to spapr_tce_new_table() in future,
> there's less to change.

Well, the table allocation is the only real thing which may fail there and 
spapr_phb_realize() does not pass Error to the callers so 
spapr_phb_dma_window_enable() would be the first one to propagate an error 
and it just seems a bit over-engineered. Should I still do that?


>> +
>> +    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
>> +                                spapr_tce_get_iommu(tcet));
>> +    return 0;
>> +}
>> +
>>   /* Macros to operate with address in OF binding to PCI */
>>   #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
>>   #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
>> @@ -1228,8 +1251,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>       int i;
>>       PCIBus *bus;
>>       uint64_t msi_window_size = 4096;
>> -    sPAPRTCETable *tcet;
>> -    uint32_t nb_table;
>>
>>       if (sphb->index != (uint32_t)-1) {
>>           hwaddr windows_base;
>> @@ -1381,18 +1402,11 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>           }
>>       }
>>
>> -    nb_table = sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT;
>> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn,
>> -                               0, SPAPR_TCE_PAGE_SHIFT, nb_table, false);
>> -    if (!tcet) {
>> -        error_setg(errp, "Unable to create TCE table for %s",
>> -                   sphb->dtbusname);
>> -        return;
>> -    }
>> -
>>       /* Register default 32bit DMA window */
>> -    memory_region_add_subregion(&sphb->iommu_root, sphb->dma_win_addr,
>> -                                spapr_tce_get_iommu(tcet));
>> +    if (spapr_phb_dma_window_enable(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
>> +                                    sphb->dma_win_addr, sphb->dma_win_size)) {
>> +        error_setg(errp, "Unable to create TCE table for %s", sphb->dtbusname);
>> +    }
>>
>>       sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>>   }
>


-- 
Alexey

* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 04/16] spapr_iommu: Introduce "enabled" state for TCE table
  2016-03-03  3:00   ` [Qemu-devel] [Qemu-ppc] " David Gibson
@ 2016-03-10  7:39     ` Alexey Kardashevskiy
  2016-03-15  5:32       ` David Gibson
  0 siblings, 1 reply; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-10  7:39 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On 03/03/2016 02:00 PM, David Gibson wrote:
> On Tue, Mar 01, 2016 at 08:10:29PM +1100, Alexey Kardashevskiy wrote:
>> Currently TCE tables are created once at start and their sizes never
>> change. We are going to change that by introducing Dynamic DMA windows
>> support, where DMA configuration may change during guest execution.
>>
>> This changes spapr_tce_new_table() to create an empty zero-size IOMMU
>> memory region (IOMMU MR). Only LIOBN is assigned by the time of creation.
>> It still will be called once at the owner object (VIO or PHB) creation.
>>
>> This introduces an "enabled" state for TCE table objects with two
>> helper functions - spapr_tce_table_enable()/spapr_tce_table_disable().
>> - spapr_tce_table_enable() receives TCE table parameters, allocates
>> a guest view of the TCE table (in the user space or KVM) and
>> sets the correct size on the IOMMU MR.
>> - spapr_tce_table_disable() disposes of the table and resets the IOMMU MR
>> size.
>>
>> This changes the PHB reset handler to do the default DMA initialization
>> instead of spapr_phb_realize(). This does not make a difference now but
>> later with more than just one DMA window, we will have to remove them all
>> and create the default one on a system reset.
>>
>> No visible change in behaviour is expected except the actual table
>> will be reallocated every reset. We might optimize this later.
>>
>> The other way to implement this would be to dynamically create/remove
>> the TCE table QOM objects, but this would make migration impossible
>> as the migration code expects all QOM objects to exist at the receiver
>> so we have to have TCE table objects created when migration begins.
>>
>> spapr_tce_table_do_enable() is separated from spapr_tce_table_enable()
>> as later it will be called at the sPAPRTCETable post-migration stage when
>> it already has all the properties set after the migration.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>
> Although there's one nit that could be improved:
>
>
>> ---
>>   hw/ppc/spapr_iommu.c   | 80 +++++++++++++++++++++++++++++++++++---------------
>>   hw/ppc/spapr_pci.c     | 21 +++++++++----
>>   hw/ppc/spapr_vio.c     |  9 +++---
>>   include/hw/ppc/spapr.h | 10 +++----
>>   4 files changed, 80 insertions(+), 40 deletions(-)
>>
>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>> index 8132f64..e66e128 100644
>> --- a/hw/ppc/spapr_iommu.c
>> +++ b/hw/ppc/spapr_iommu.c
>> @@ -174,15 +174,8 @@ static int spapr_tce_table_realize(DeviceState *dev)
>>       sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
>>
>>       tcet->fd = -1;
>> -    tcet->table = spapr_tce_alloc_table(tcet->liobn,
>> -                                        tcet->page_shift,
>> -                                        tcet->nb_table,
>> -                                        &tcet->fd,
>> -                                        tcet->need_vfio);
>> -
>>       memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
>> -                             "iommu-spapr",
>> -                             (uint64_t)tcet->nb_table << tcet->page_shift);
>> +                             "iommu-spapr", 0);
>>
>>       QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
>>
>> @@ -224,14 +217,10 @@ void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio)
>>       tcet->table = newtable;
>>   }
>>
>> -sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
>> -                                   uint64_t bus_offset,
>> -                                   uint32_t page_shift,
>> -                                   uint32_t nb_table,
>> -                                   bool need_vfio)
>> +sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn)
>>   {
>>       sPAPRTCETable *tcet;
>> -    char tmp[64];
>> +    char tmp[32];
>>
>>       if (spapr_tce_find_by_liobn(liobn)) {
>>           fprintf(stderr, "Attempted to create TCE table with duplicate"
>> @@ -239,16 +228,8 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
>>           return NULL;
>>       }
>>
>> -    if (!nb_table) {
>> -        return NULL;
>> -    }
>> -
>>       tcet = SPAPR_TCE_TABLE(object_new(TYPE_SPAPR_TCE_TABLE));
>>       tcet->liobn = liobn;
>> -    tcet->bus_offset = bus_offset;
>> -    tcet->page_shift = page_shift;
>> -    tcet->nb_table = nb_table;
>> -    tcet->need_vfio = need_vfio;
>>
>>       snprintf(tmp, sizeof(tmp), "tce-table-%x", liobn);
>>       object_property_add_child(OBJECT(owner), tmp, OBJECT(tcet), NULL);
>> @@ -258,14 +239,65 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
>>       return tcet;
>>   }
>>
>> +static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
>> +{
>> +    if (!tcet->nb_table) {
>> +        return;
>> +    }
>> +
>> +    tcet->table = spapr_tce_alloc_table(tcet->liobn,
>> +                                        tcet->page_shift,
>> +                                        tcet->nb_table,
>> +                                        &tcet->fd,
>> +                                        tcet->need_vfio);
>> +
>> +    memory_region_set_size(&tcet->iommu,
>> +                           (uint64_t)tcet->nb_table << tcet->page_shift);
>> +
>> +    tcet->enabled = true;
>> +}
>> +
>> +void spapr_tce_table_enable(sPAPRTCETable *tcet,
>> +                            uint32_t page_shift, uint64_t bus_offset,
>> +                            uint32_t nb_table, bool need_vfio)
>> +{
>> +    if (tcet->enabled) {
>> +        return;
>
> If the given parameters are different from the current ones, treating
> this as a no-op is rather misleading.  I gather that to resize the
> window you're expected to disable, then re-enable.  In which case I
> think it would be safer to actually throw some kind of error on a
> double enable.

I'll add here

  error_report("Warning: trying to enable already enabled TCE table");

...




>
>
>> +    }
>> +
>> +    tcet->bus_offset = bus_offset;
>> +    tcet->page_shift = page_shift;
>> +    tcet->nb_table = nb_table;
>> +    tcet->need_vfio = need_vfio;
>> +
>> +    spapr_tce_table_do_enable(tcet);
>> +}
>> +
>> +static void spapr_tce_table_disable(sPAPRTCETable *tcet)
>> +{
>> +    if (!tcet->enabled) {

...and

error_report("Warning: trying to disable already disabled TCE table");

or g_assert()?
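
For comparison, a standalone sketch of the stricter variant, where a
double enable or double disable is reported to the caller instead of
being silently ignored. TceTable is a cut-down, hypothetical stand-in
for sPAPRTCETable, and the int return values are an assumption; the
functions in the patch return void.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for sPAPRTCETable */
typedef struct TceTable {
    bool enabled;
    uint64_t bus_offset;
    uint32_t page_shift;
    uint32_t nb_table;
} TceTable;

/* Returns 0 on success, -1 if the table is already enabled */
int tce_table_enable(TceTable *t, uint32_t page_shift,
                     uint64_t bus_offset, uint32_t nb_table)
{
    if (t->enabled) {
        /* Resizing means disable first, then enable again */
        return -1;
    }
    t->bus_offset = bus_offset;
    t->page_shift = page_shift;
    t->nb_table = nb_table;
    t->enabled = true;
    return 0;
}

/* Returns 0 on success, -1 if the table is already disabled */
int tce_table_disable(TceTable *t)
{
    if (!t->enabled) {
        return -1;
    }
    t->bus_offset = 0;
    t->page_shift = 0;
    t->nb_table = 0;
    t->enabled = false;
    return 0;
}
```

A rejected second enable leaves the first set of parameters untouched,
so the caller cannot end up believing the window was resized.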


>> +        return;
>> +    }
>> +
>> +    memory_region_set_size(&tcet->iommu, 0);
>> +
>> +    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
>> +    tcet->fd = -1;
>> +    tcet->table = NULL;
>> +    tcet->enabled = false;
>> +    tcet->bus_offset = 0;
>> +    tcet->page_shift = 0;
>> +    tcet->nb_table = 0;
>> +    tcet->need_vfio = false;
>> +}
>> +
>>   static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
>>   {
>>       sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
>>
>>       QLIST_REMOVE(tcet, list);
>>
>> -    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
>> -    tcet->fd = -1;
>> +    spapr_tce_table_disable(tcet);
>>   }
>>
>>   MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index 248f20a..c34a906 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -815,12 +815,13 @@ static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>>           return -1;
>>       }
>>
>> -    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, window_addr,
>> -                               page_shift, nb_table, false);
>> +    tcet = spapr_tce_find_by_liobn(liobn);
>>       if (!tcet) {
>>           return -1;
>>       }
>>
>> +    spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table, false);
>> +
>>       memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
>>                                   spapr_tce_get_iommu(tcet));
>>       return 0;
>> @@ -1251,6 +1252,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>       int i;
>>       PCIBus *bus;
>>       uint64_t msi_window_size = 4096;
>> +    sPAPRTCETable *tcet;
>>
>>       if (sphb->index != (uint32_t)-1) {
>>           hwaddr windows_base;
>> @@ -1402,11 +1404,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>           }
>>       }
>>
>> +    /* DMA setup */
>> +    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
>> +    if (!tcet) {
>> +        error_report("No default TCE table for %s", sphb->dtbusname);
>> +        return;
>> +    }
>> +
>>       /* Register default 32bit DMA window */
>> -    if (spapr_phb_dma_window_enable(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
>> -                                    sphb->dma_win_addr, sphb->dma_win_size)) {
>> -        error_setg(errp, "Unable to create TCE table for %s", sphb->dtbusname);
>> -    }
>> +    spapr_phb_dma_window_enable(sphb, sphb->dma_liobn,
>> +                                SPAPR_TCE_PAGE_SHIFT,
>> +                                sphb->dma_win_addr,
>> +                                sphb->dma_win_size);
>>
>>       sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>>   }
>> diff --git a/hw/ppc/spapr_vio.c b/hw/ppc/spapr_vio.c
>> index 0f61a55..a745884 100644
>> --- a/hw/ppc/spapr_vio.c
>> +++ b/hw/ppc/spapr_vio.c
>> @@ -481,11 +481,10 @@ static void spapr_vio_busdev_realize(DeviceState *qdev, Error **errp)
>>           memory_region_add_subregion_overlap(&dev->mrroot, 0, &dev->mrbypass, 1);
>>           address_space_init(&dev->as, &dev->mrroot, qdev->id);
>>
>> -        dev->tcet = spapr_tce_new_table(qdev, liobn,
>> -                                        0,
>> -                                        SPAPR_TCE_PAGE_SHIFT,
>> -                                        pc->rtce_window_size >>
>> -                                        SPAPR_TCE_PAGE_SHIFT, false);
>> +        dev->tcet = spapr_tce_new_table(qdev, liobn);
>> +        spapr_tce_table_enable(dev->tcet, SPAPR_TCE_PAGE_SHIFT, 0,
>> +                               pc->rtce_window_size >> SPAPR_TCE_PAGE_SHIFT,
>> +                               false);
>>           dev->tcet->vdev = dev;
>>           memory_region_add_subregion_overlap(&dev->mrroot, 0,
>>                                               spapr_tce_get_iommu(dev->tcet), 2);
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index 098d85d..3e6bb84 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -539,6 +539,7 @@ typedef struct sPAPRTCETable sPAPRTCETable;
>>
>>   struct sPAPRTCETable {
>>       DeviceState parent;
>> +    bool enabled;
>>       uint32_t liobn;
>>       uint32_t nb_table;
>>       uint64_t bus_offset;
>> @@ -566,11 +567,10 @@ void spapr_events_fdt_skel(void *fdt, uint32_t epow_irq);
>>   int spapr_h_cas_compose_response(sPAPRMachineState *sm,
>>                                    target_ulong addr, target_ulong size,
>>                                    bool cpu_update, bool memory_update);
>> -sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
>> -                                   uint64_t bus_offset,
>> -                                   uint32_t page_shift,
>> -                                   uint32_t nb_table,
>> -                                   bool need_vfio);
>> +sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn);
>> +void spapr_tce_table_enable(sPAPRTCETable *tcet,
>> +                            uint32_t page_shift, uint64_t bus_offset,
>> +                            uint32_t nb_table, bool vfio_accel);
>>   void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio);
>>
>>   MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet);
>


-- 
Alexey


* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 16/16] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-03-04  4:51   ` [Qemu-devel] [Qemu-ppc] " David Gibson
@ 2016-03-11  9:03     ` Alexey Kardashevskiy
  2016-03-15  5:53       ` David Gibson
  0 siblings, 1 reply; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-11  9:03 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On 03/04/2016 03:51 PM, David Gibson wrote:
> On Tue, Mar 01, 2016 at 08:10:41PM +1100, Alexey Kardashevskiy wrote:
>> This adds support for Dynamic DMA Windows (DDW) option defined by
>> the SPAPR specification which allows having additional DMA window(s).
>>
>> This implements DDW for emulated and VFIO devices. As all TCE root regions
>> are mapped at 0 and 64bit long (and actual tables are child regions),
>> this replaces memory_region_add_subregion() with _overlap() to make
>> QEMU memory API happy.
>>
>> This reserves RTAS token numbers for DDW calls.
>>
>> This changes the TCE table migration descriptor to support dynamic
>> tables as from now on, PHB will create as many stub TCE table objects
>> as PHB can possibly support but not all of them might be initialized at
>> the time of migration because DDW might or might not be requested by
>> the guest.
>>
>> The "ddw" property is enabled by default on a PHB but for compatibility
>> the pseries-2.5 machine and older disable it.
>>
>> This implements DDW for VFIO. The host kernel support is required.
>> This adds a "levels" property to PHB to control the number of levels
>> in the actual TCE table allocated by the host kernel, 0 is the default
>> value to tell QEMU to calculate the correct value. Current hardware
>> supports up to 5 levels.
>>
>> The existing linux guests try creating one additional huge DMA window
>> with 64K or 16MB pages and map the entire guest RAM to it. If that succeeds,
>> the guest switches to dma_direct_ops and never calls TCE hypercalls
>> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
>> and not waste time on map/unmap later. This adds a "dma64_win_addr"
>> property which is a bus address for the 64bit window and by default
>> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
>> uses and this allows having emulated and VFIO devices on the same bus.
>>
>> This adds 4 RTAS handlers:
>> * ibm,query-pe-dma-window
>> * ibm,create-pe-dma-window
>> * ibm,remove-pe-dma-window
>> * ibm,reset-pe-dma-window
>> These are registered from type_init() callback.
>>
>> These RTAS handlers are implemented in a separate file to avoid polluting
>> spapr_iommu.c with PCI.
>>
>> TODO (which I have no idea how to implement properly):
>> 1. check the host kernel actually supports SPAPR_PCI_DMA_MAX_WINDOWS
>> windows and 12/16/24 page shift;
>
> As noted in a different subthread, this information is there in the
> container.

Well, I rather want this in rtas_ibm_query_pe_dma_window() to report to the 
guest the supported page sizes but I cannot because of missing 
vfio_container_ioctl().

I guess I'll just make page_size_mask, windows_supported and 
dma64_window_start PHB properties, set them to what I think the host 
supports, and if the host does not support something, QEMU will just 
fail quickly and for an obvious reason.


>> 2. fix container::min_iova, max_iova - as for now, they are useless,
>> and I'd expect IOMMU MR boundaries to serve this purpose really;
>
> This seems to show a similar confusion of concepts to #1.
> container::min_iova, container::max_iova advertise limitations of the
> host IOMMU, the IOMMU MR boundaries show constraints of the guest
> IOMMU.  You need to verify the guest constraints against the host
> constraints.
>
> A more flexible method than min/max iova will be necessary though, now
> that the host IOMMU allows more flexible configurations than a single
> window.
>
>> 3. vfio_listener_region_add/vfio_listener_region_del do explicitly
>> create/remove huge DMA window as we do not have vfio_container_ioctl()
>> anymore, do we want to move these to some sort of callbacks? How, where?
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>
>> # Conflicts:
>> #	include/hw/pci-host/spapr.h
>>
>> # Conflicts:
>> #	hw/vfio/common.c
>> ---
>>   hw/ppc/Makefile.objs        |   1 +
>>   hw/ppc/spapr.c              |   7 +-
>>   hw/ppc/spapr_iommu.c        |  32 ++++-
>>   hw/ppc/spapr_pci.c          |  61 +++++++--
>>   hw/ppc/spapr_rtas_ddw.c     | 306 ++++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/common.c            |  70 +++++++++-
>>   include/hw/pci-host/spapr.h |  13 ++
>>   include/hw/ppc/spapr.h      |  17 ++-
>>   trace-events                |   6 +
>>   9 files changed, 489 insertions(+), 24 deletions(-)
>>   create mode 100644 hw/ppc/spapr_rtas_ddw.c
>>
>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>> index c1ffc77..986b36f 100644
>> --- a/hw/ppc/Makefile.objs
>> +++ b/hw/ppc/Makefile.objs
>> @@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
>>   ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>>   obj-y += spapr_pci_vfio.o
>>   endif
>> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>>   # PowerPC 4xx boards
>>   obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>>   obj-y += ppc4xx_pci.o
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index e9d4abf..2473217 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -2370,7 +2370,12 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
>>    * pseries-2.5
>>    */
>>   #define SPAPR_COMPAT_2_5 \
>> -        HW_COMPAT_2_5
>> +        HW_COMPAT_2_5 \
>> +        {\
>> +            .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
>> +            .property = "ddw",\
>> +            .value    = stringify(off),\
>> +        },
>>
>>   static void spapr_machine_2_5_instance_options(MachineState *machine)
>>   {
>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>> index 8aa2238..e32f71b 100644
>> --- a/hw/ppc/spapr_iommu.c
>> +++ b/hw/ppc/spapr_iommu.c
>> @@ -150,6 +150,15 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
>>       return 1ULL << tcet->page_shift;
>>   }
>>
>> +static void spapr_tce_table_pre_save(void *opaque)
>> +{
>> +    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
>> +
>> +    tcet->migtable = tcet->table;
>> +}
>> +
>> +static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel);
>> +
>>   static int spapr_tce_table_post_load(void *opaque, int version_id)
>>   {
>>       sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
>> @@ -158,22 +167,39 @@ static int spapr_tce_table_post_load(void *opaque, int version_id)
>>           spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
>>       }
>>
>> +    if (tcet->enabled) {
>> +        if (!tcet->table) {
>> +            tcet->enabled = false;
>> +            /* VFIO does not migrate so pass vfio_accel == false */
>> +            spapr_tce_table_do_enable(tcet, false);
>> +        }
>
> What if there was an existing table, but its size doesn't match that
> in the incoming migration?  Don't you need to free() it and
> re-allocate?  IIUC this would happen in practice if you migrated a
> guest which had removed the default window and replaced it with one of
> a different size.
>
>> +        memcpy(tcet->table, tcet->migtable,
>> +               tcet->nb_table * sizeof(tcet->table[0]));
>> +        free(tcet->migtable);
>> +        tcet->migtable = NULL;
>> +    }
>
> Likewise, what if your incoming migration is of a guest which has
> completely removed the default window?  Don't you need to free the
> existing default table?
>
>>       return 0;
>>   }
>>
>>   static const VMStateDescription vmstate_spapr_tce_table = {
>>       .name = "spapr_iommu",
>> -    .version_id = 2,
>> +    .version_id = 3,
>>       .minimum_version_id = 2,
>> +    .pre_save = spapr_tce_table_pre_save,
>>       .post_load = spapr_tce_table_post_load,
>>       .fields      = (VMStateField []) {
>>           /* Sanity check */
>>           VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
>> -        VMSTATE_UINT32_EQUAL(nb_table, sPAPRTCETable),
>>
>>           /* IOMMU state */
>> +        VMSTATE_BOOL_V(enabled, sPAPRTCETable, 3),
>> +        VMSTATE_UINT64_V(bus_offset, sPAPRTCETable, 3),
>> +        VMSTATE_UINT32_V(page_shift, sPAPRTCETable, 3),
>> +        VMSTATE_UINT32(nb_table, sPAPRTCETable),
>>           VMSTATE_BOOL(bypass, sPAPRTCETable),
>> -        VMSTATE_VARRAY_UINT32(table, sPAPRTCETable, nb_table, 0, vmstate_info_uint64, uint64_t),
>> +        VMSTATE_VARRAY_UINT32_ALLOC(migtable, sPAPRTCETable, nb_table, 0,
>> +                                    vmstate_info_uint64, uint64_t),
>>
>>           VMSTATE_END_OF_LIST()
>>       },
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index 4c6e687..1bc0710 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -803,10 +803,10 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
>>       return buf;
>>   }
>>
>> -static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>> -                                       uint32_t liobn, uint32_t page_shift,
>> -                                       uint64_t window_addr,
>> -                                       uint64_t window_size)
>> +int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>> +                                uint32_t liobn, uint32_t page_shift,
>> +                                uint64_t window_addr,
>> +                                uint64_t window_size)
>>   {
>>       sPAPRTCETable *tcet;
>>       uint32_t nb_table = window_size >> page_shift;
>> @@ -820,12 +820,16 @@ static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>>           return -1;
>>       }
>>
>> +    if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
>> +        return -1;
>> +    }
>> +
>>       spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table, false);
>>
>>       return 0;
>>   }
>>
>> -static int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
>> +int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
>>   {
>>       sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
>>
>> @@ -1418,14 +1422,21 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>       }
>>
>>       /* DMA setup */
>> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
>> -    if (!tcet) {
>> -        error_report("No default TCE table for %s", sphb->dtbusname);
>> -        return;
>> -    }
>> +    sphb->windows_supported = SPAPR_PCI_DMA_MAX_WINDOWS;
>> +    sphb->page_size_mask = (1ULL << 12) | (1ULL << 16) | (1ULL << 24);
>> +    sphb->dma64_window_size = pow2ceil(ram_size);
>
> Why do you need this value?  Isn't the size of the dma64 window
> supplied when you create it with RTAS?  It makes more sense to me to
> validate the value at that point rather than here where you have to
> use a global.
>
> Plus.. if your machine allows hotplug memory you probably need
> maxram_size, rather than ram_size here.
>
>>
>> -    memory_region_add_subregion(&sphb->iommu_root, 0,
>> -                                spapr_tce_get_iommu(tcet));
>> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>> +        tcet = spapr_tce_new_table(DEVICE(sphb),
>> +                                   SPAPR_PCI_LIOBN(sphb->index, i));
>> +        if (!tcet) {
>> +            error_setg(errp, "Creating window#%d failed for %s",
>> +                       i, sphb->dtbusname);
>> +            return;
>> +        }
>> +        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
>> +                                            spapr_tce_get_iommu(tcet), 0);
>> +    }
>>
>>       sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>>   }
>> @@ -1443,7 +1454,11 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
>>
>>   void spapr_phb_dma_reset(sPAPRPHBState *sphb)
>>   {
>> -    spapr_phb_dma_window_disable(sphb, sphb->dma_liobn);
>> +    int i;
>> +
>> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>> +        spapr_phb_dma_window_disable(sphb, SPAPR_PCI_LIOBN(sphb->index, i));
>> +    }
>>
>>       /* Register default 32bit DMA window */
>>       spapr_phb_dma_window_enable(sphb, sphb->dma_liobn,
>> @@ -1481,6 +1496,9 @@ static Property spapr_phb_properties[] = {
>>       /* Default DMA window is 0..1GB */
>>       DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
>>       DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
>> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
>> +                       0x800000000000000ULL),
>> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
>>       DEFINE_PROP_END_OF_LIST(),
>>   };
>>
>> @@ -1734,6 +1752,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>       uint32_t interrupt_map_mask[] = {
>>           cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
>>       uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
>> +    uint32_t ddw_applicable[] = {
>> +        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
>> +        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
>> +        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
>> +    };
>> +    uint32_t ddw_extensions[] = {
>> +        cpu_to_be32(1),
>> +        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
>> +    };
>>       sPAPRTCETable *tcet;
>>       PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
>>       sPAPRFDT s_fdt;
>> @@ -1758,6 +1785,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>       _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
>>       _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
>>
>> +    /* Dynamic DMA window */
>> +    if (phb->ddw_enabled) {
>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
>> +                         sizeof(ddw_applicable)));
>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
>> +                         &ddw_extensions, sizeof(ddw_extensions)));
>> +    }
>> +
>>       /* Build the interrupt-map, this must matches what is done
>>        * in pci_spapr_map_irq
>>        */
>> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
>> new file mode 100644
>> index 0000000..b8ea910
>> --- /dev/null
>> +++ b/hw/ppc/spapr_rtas_ddw.c
>> @@ -0,0 +1,306 @@
>> +/*
>> + * QEMU sPAPR Dynamic DMA windows support
>> + *
>> + * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
>> + *
>> + *  This program is free software; you can redistribute it and/or modify
>> + *  it under the terms of the GNU General Public License as published by
>> + *  the Free Software Foundation; either version 2 of the License,
>> + *  or (at your option) any later version.
>> + *
>> + *  This program is distributed in the hope that it will be useful,
>> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + *  GNU General Public License for more details.
>> + *
>> + *  You should have received a copy of the GNU General Public License
>> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/error-report.h"
>> +#include "hw/ppc/spapr.h"
>> +#include "hw/pci-host/spapr.h"
>> +#include "trace.h"
>> +
>> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
>> +{
>> +    sPAPRTCETable *tcet;
>> +
>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>> +    if (tcet && tcet->enabled) {
>> +        ++*(unsigned *)opaque;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
>> +{
>> +    unsigned ret = 0;
>> +
>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
>> +
>> +    return ret;
>> +}
>> +
>> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
>> +{
>> +    sPAPRTCETable *tcet;
>> +
>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>> +    if (tcet && !tcet->enabled) {
>> +        *(uint32_t *)opaque = tcet->liobn;
>> +        return 1;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
>> +{
>> +    uint32_t liobn = 0;
>> +
>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
>> +
>> +    return liobn;
>> +}
>> +
>> +static uint32_t spapr_query_mask(struct ppc_one_seg_page_size *sps,
>> +                                 uint64_t page_mask)
>> +{
>> +    int i, j;
>> +    uint32_t mask = 0;
>> +    const struct { int shift; uint32_t mask; } masks[] = {
>> +        { 12, RTAS_DDW_PGSIZE_4K },
>> +        { 16, RTAS_DDW_PGSIZE_64K },
>> +        { 24, RTAS_DDW_PGSIZE_16M },
>> +        { 25, RTAS_DDW_PGSIZE_32M },
>> +        { 26, RTAS_DDW_PGSIZE_64M },
>> +        { 27, RTAS_DDW_PGSIZE_128M },
>> +        { 28, RTAS_DDW_PGSIZE_256M },
>> +        { 34, RTAS_DDW_PGSIZE_16G },
>> +    };
>> +
>> +    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
>> +        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
>> +            if ((sps[i].page_shift == masks[j].shift) &&
>> +                    (page_mask & (1ULL << masks[j].shift))) {
>> +                mask |= masks[j].mask;
>> +            }
>> +        }
>> +    }
>
> Hmm... checking against the list of page sizes supported by the vcpu
> seems conceptually wrong, although it's probably correct in practice.
> Is there a way of checking directly against the pagesizes supported by
> the host IOMMU?


VFIO_IOMMU_SPAPR_TCE_GET_INFO returns the mask but since 
vfio_container_ioctl() is gone, there is no direct way of knowing it here; 
it is now hidden in hw/vfio/common.c.

Anyway, the host IOMMU always supports 4K|64K|16M. QEMU may or may not use 
huge pages for the guest RAM, and this defines whether H_PUT_TCE for a 16M 
page succeeds or fails.
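
To illustrate the alternative discussed here, below is a minimal sketch of deriving the RTAS page size mask from an IOMMU page size mask alone, without consulting the vcpu page size list. The helper name ddw_query_mask() and the DDW_PGSIZE_* constants are hypothetical stand-ins for this example; the patch itself uses the RTAS_DDW_PGSIZE_* definitions and spapr_query_mask().

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative flag values (an assumption of this sketch); the patch
 * takes the real ones from RTAS_DDW_PGSIZE_* in its headers.
 */
enum {
    DDW_PGSIZE_4K  = 0x01,
    DDW_PGSIZE_64K = 0x02,
    DDW_PGSIZE_16M = 0x04,
};

/*
 * Hypothetical helper: bit N of iommu_page_mask is set when the IOMMU
 * supports pages of 2^N bytes. Returns the matching DDW flag mask.
 */
static uint32_t ddw_query_mask(uint64_t iommu_page_mask)
{
    static const struct { unsigned shift; uint32_t flag; } map[] = {
        { 12, DDW_PGSIZE_4K },
        { 16, DDW_PGSIZE_64K },
        { 24, DDW_PGSIZE_16M },
    };
    uint32_t mask = 0;
    size_t i;

    for (i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
        if (iommu_page_mask & (1ULL << map[i].shift)) {
            mask |= map[i].flag;
        }
    }
    return mask;
}
```

With the 4K|64K|16M mask mentioned above, all three flags would be reported; whether a 16M mapping then succeeds still depends on the guest RAM backing.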


>
>> +    return mask;
>> +}
>> +
>> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
>> +                                         sPAPRMachineState *spapr,
>> +                                         uint32_t token, uint32_t nargs,
>> +                                         target_ulong args,
>> +                                         uint32_t nret, target_ulong rets)
>> +{
>> +    CPUPPCState *env = &cpu->env;
>> +    sPAPRPHBState *sphb;
>> +    uint64_t buid;
>> +    uint32_t avail, addr, pgmask = 0;
>> +    unsigned current;
>> +
>> +    if ((nargs != 3) || (nret != 5)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    current = spapr_phb_get_active_win_num(sphb);
>> +    avail = (sphb->windows_supported > current) ?
>> +            (sphb->windows_supported - current) : 0;
>
> sphb->windows_supported < current indicates a bug in qemu, surely?  So
> you should be able to do without the ?:.
>
>> +
>> +    /* Work out supported page masks */
>> +    pgmask = spapr_query_mask(env->sps.sps, sphb->page_size_mask);
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +    rtas_st(rets, 1, avail);
>> +
>> +    /*
>> +     * This is "Largest contiguous block of TCEs allocated specifically
>> +     * for (that is, are reserved for) this PE".
>> +     * Return the maximum number as all RAM was in 4K pages.
>> +     */
>> +    rtas_st(rets, 2, sphb->dma64_window_size >> SPAPR_TCE_PAGE_SHIFT);
>> +    rtas_st(rets, 3, pgmask);
>> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
>> +
>> +    trace_spapr_iommu_ddw_query(buid, addr, avail, sphb->dma64_window_size,
>> +                                pgmask);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
>> +                                          sPAPRMachineState *spapr,
>> +                                          uint32_t token, uint32_t nargs,
>> +                                          target_ulong args,
>> +                                          uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    sPAPRTCETable *tcet = NULL;
>> +    uint32_t addr, page_shift, window_shift, liobn;
>> +    uint64_t buid;
>> +    long ret;
>> +
>> +    if ((nargs != 5) || (nret != 4)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    page_shift = rtas_ld(args, 3);
>> +    window_shift = rtas_ld(args, 4);
>> +    liobn = spapr_phb_get_free_liobn(sphb);
>> +
>> +    if (!liobn || !(sphb->page_size_mask & (1ULL << page_shift))) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    if (window_shift < page_shift) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    ret = spapr_phb_dma_window_enable(sphb, liobn, page_shift,
>> +                                      sphb->dma64_window_addr,
>> +                                      1ULL << window_shift);
>> +    tcet = spapr_tce_find_by_liobn(liobn);
>> +    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
>> +                                 1ULL << window_shift,
>> +                                 tcet ? tcet->bus_offset : 0xbaadf00d,
>> +                                 liobn, ret);
>> +    if (ret || !tcet) {
>> +        goto hw_error_exit;
>> +    }
>
> !ret && !tcet indicates a qemu bug, surely, an assert would make more
> sense.

Heh. That is correct. Although spapr_phb_dma_window_enable() eventually 
calls vfio_listener_region_add(), which can fail as it calls the host 
VFIO IOMMU driver, there is no nice way of delivering that error here...


>
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +    rtas_st(rets, 1, liobn);
>> +    rtas_st(rets, 2, tcet->bus_offset >> 32);
>> +    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
>> +
>> +    return;
>> +
>> +hw_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
>> +                                          sPAPRMachineState *spapr,
>> +                                          uint32_t token, uint32_t nargs,
>> +                                          target_ulong args,
>> +                                          uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    sPAPRTCETable *tcet;
>> +    uint32_t liobn;
>> +    long ret;
>> +
>> +    if ((nargs != 1) || (nret != 1)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    liobn = rtas_ld(args, 0);
>> +    tcet = spapr_tce_find_by_liobn(liobn);
>> +    if (!tcet) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    ret = spapr_phb_dma_window_disable(sphb, liobn);
>> +    trace_spapr_iommu_ddw_remove(liobn, ret);
>> +    if (ret) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +    return;
>> +
>> +hw_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
>> +                                         sPAPRMachineState *spapr,
>> +                                         uint32_t token, uint32_t nargs,
>> +                                         target_ulong args,
>> +                                         uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    uint64_t buid;
>> +    uint32_t addr;
>> +    long ret = 0;
>
> ret is never assigned a value other than 0; remove it.
>
>> +    if ((nargs != 3) || (nret != 1)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    spapr_phb_dma_reset(sphb);
>> +    trace_spapr_iommu_ddw_reset(buid, addr, ret);
>> +    if (ret) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +
>> +    return;
>> +
>> +hw_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void spapr_rtas_ddw_init(void)
>> +{
>> +    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
>> +                        "ibm,query-pe-dma-window",
>> +                        rtas_ibm_query_pe_dma_window);
>> +    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
>> +                        "ibm,create-pe-dma-window",
>> +                        rtas_ibm_create_pe_dma_window);
>> +    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
>> +                        "ibm,remove-pe-dma-window",
>> +                        rtas_ibm_remove_pe_dma_window);
>> +    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
>> +                        "ibm,reset-pe-dma-window",
>> +                        rtas_ibm_reset_pe_dma_window);
>> +}
>> +
>> +type_init(spapr_rtas_ddw_init)
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 42ef1eb..2332f8e 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -395,6 +395,39 @@ static void vfio_listener_region_add(VFIOMemoryListener *vlistener,
>>           giommu->iova_pgsizes = section->mr->iommu_ops->get_page_sizes(section->mr);
>>           QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>
> It might make this easier to review if the guest side (non-VFIO) and
> VFIO parts were in different patches.
>
>> +        if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>
> Might want to split this stuff out into a "new guest iommu" helper.
> It would want to first check if the guest IOMMU can be supported with
> the existing host IOMMU windows.  If not, and the host IOMMU supports
> it (i.e. SPAPR_TCE_v2_IOMMU) it would attempt to create a new host
> window.
>
>> +            int ret;
>> +            struct vfio_iommu_spapr_tce_create create = {
>> +                .argsz = sizeof(create),
>> +                .page_shift = ctz64(giommu->iova_pgsizes),
>> +                .window_size = memory_region_size(section->mr),
>> +                .levels = 0,
>> +                .start_addr = 0,
>> +            };
>> +
>> +            /*
>> +             * Dynamic windows are supported, that means that there is no
>> +             * pre-created window and we have to create one.
>> +             */
>> +            if (!create.levels) {
>
> This test will always be true.
>
>> +                unsigned entries = create.window_size >> create.page_shift;
>> +                unsigned pages = (entries * sizeof(uint64_t)) / getpagesize();
>> +                /* 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4 */
>> +                create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;
>
> Hmm.. does it make more sense for qemu to apply this heuristic, or the kernel?


If something can be done safely in userspace, why would we want to put 
it in the kernel?
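
For reference, here is a minimal userspace sketch of the levels heuristic, written to follow the thresholds in the quoted code comment (0..64 pages = 1 level, 65..4096 = 2, 4097..262144 = 3, more = 4) rather than the bit-trick expression in the patch. The function name tce_levels_for_pages() and the cap of 4 levels are assumptions taken from that comment, not from the kernel interface.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of TCE table levels for a table that
 * occupies "pages" host pages. Each extra level multiplies the
 * covered page count by 64, per the thresholds in the patch comment.
 */
static unsigned tce_levels_for_pages(uint64_t pages)
{
    unsigned levels = 1;
    uint64_t limit = 64;

    /* Cap at 4 levels, the largest value listed in the comment. */
    while (pages > limit && levels < 4) {
        limit *= 64;
        levels++;
    }
    return levels;
}
```

A caller would first compute pages as (window_size >> page_shift) * sizeof(uint64_t) / host_page_size, as the quoted hunk does.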



>> +            }
>> +            ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>> +            if (ret) {
>> +                error_report("Failed to create a window");
>> +            }
>> +
>> +            if (create.start_addr != section->offset_within_address_space) {
>> +                error_report("Something went wrong!");
>
> Shouldn't you at least set start_addr before the ioctl() as a hint to
> the kernel?


The kernel does not take hints. At least on POWER8 (maybe it will on POWER9).


-- 
Alexey


* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 11/16] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2016-03-03  6:30   ` [Qemu-devel] [Qemu-ppc] " David Gibson
@ 2016-03-15  2:53     ` Alexey Kardashevskiy
  2016-03-15  5:42       ` David Gibson
  0 siblings, 1 reply; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-15  2:53 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On 03/03/2016 05:30 PM, David Gibson wrote:
> On Tue, Mar 01, 2016 at 08:10:36PM +1100, Alexey Kardashevskiy wrote:
>> This makes use of the new "memory registering" feature. The idea is
>> to provide the userspace ability to notify the host kernel about pages
>> which are going to be used for DMA. Having this information, the host
>> kernel can pin them all once per user process, do locked pages
>> accounting (once) and not spend time on doing that in real time with
>> possible failures which cannot be handled nicely in some cases.
>>
>> This adds a prereg memory listener which listens on address_space_memory
>> and notifies a VFIO container about memory which needs to be
>> pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
>>
>> As there is no per-IOMMU-type release() callback anymore, this stores
>> the IOMMU type in the container so vfio_listener_release() can decide
>> if it needs to unregister @prereg_listener.
>>
>> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
>> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
>> not call it when v2 is detected and enabled.
>>
>> This does not change the guest visible interface.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>   hw/vfio/Makefile.objs         |   1 +
>>   hw/vfio/common.c              |  39 +++++++++---
>>   hw/vfio/prereg.c              | 138 ++++++++++++++++++++++++++++++++++++++++++
>>   include/hw/vfio/vfio-common.h |   4 ++
>>   trace-events                  |   2 +
>>   5 files changed, 175 insertions(+), 9 deletions(-)
>>   create mode 100644 hw/vfio/prereg.c
>>
>> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
>> index ceddbb8..5800e0e 100644
>> --- a/hw/vfio/Makefile.objs
>> +++ b/hw/vfio/Makefile.objs
>> @@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
>>   obj-$(CONFIG_SOFTMMU) += platform.o
>>   obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
>>   obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
>> +obj-$(CONFIG_SOFTMMU) += prereg.o
>>   endif
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 3aaa6b5..f2a03e0 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -531,6 +531,9 @@ static const MemoryListener vfio_iommu_listener = {
>>   static void vfio_listener_release(VFIOContainer *container)
>>   {
>>       memory_listener_unregister(&container->iommu_listener.listener);
>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>> +        memory_listener_unregister(&container->prereg_listener.listener);
>> +    }
>>   }
>>
>>   int vfio_mmap_region(Object *obj, VFIORegion *region,
>> @@ -722,8 +725,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>               goto free_container_exit;
>>           }
>>
>> -        ret = ioctl(fd, VFIO_SET_IOMMU,
>> -                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
>> +        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
>> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>>           if (ret) {
>>               error_report("vfio: failed to set iommu for container: %m");
>>               ret = -errno;
>> @@ -748,8 +751,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>           if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
>>               container->iova_pgsizes = info.iova_pgsizes;
>>           }
>> -    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
>> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
>> +               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
>>           struct vfio_iommu_spapr_tce_info info;
>> +        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
>>
>>           ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>>           if (ret) {
>> @@ -757,7 +762,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>               ret = -errno;
>>               goto free_container_exit;
>>           }
>> -        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
>> +        container->iommu_type =
>> +            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
>> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>
> It'd be nice to consolidate the setting of container->iommu_type and
> then the SET_IOMMU ioctl() rather than having more or less duplicated
> logic for Type1 and SPAPR, but it's not a big deal.


Maybe, but I cannot think of a nice way of doing this.


>
>>           if (ret) {
>>               error_report("vfio: failed to set iommu for container: %m");
>>               ret = -errno;
>> @@ -769,11 +776,25 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>            * when container fd is closed so we do not call it explicitly
>>            * in this file.
>>            */
>> -        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
>> -        if (ret) {
>> -            error_report("vfio: failed to enable container: %m");
>> -            ret = -errno;
>> -            goto free_container_exit;
>> +        if (!v2) {
>> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
>> +            if (ret) {
>> +                error_report("vfio: failed to enable container: %m");
>> +                ret = -errno;
>> +                goto free_container_exit;
>> +            }
>> +        } else {
>> +            container->prereg_listener.container = container;
>> +            container->prereg_listener.listener = vfio_prereg_listener;
>> +
>> +            memory_listener_register(&container->prereg_listener.listener,
>> +                                     &address_space_memory);
>
> This assumes that the target address space of the (guest) IOMMU is
> address_space_memory.  Which is fine - vfio already assumes that - but
> it reminds me that it'd be nice to have an explicit check for that (I
> guess it would have to go in vfio_iommu_map_notify()).  So that if
> someone constructs a machine where that's not the case, it'll at least
> be obvious why VFIO isn't working.

Ok, I'll add a small patch for this in the next respin.


>
>> +            if (container->error) {
>> +                error_report("vfio: RAM memory listener initialization failed for container");
>> +                memory_listener_unregister(
>> +                    &container->prereg_listener.listener);
>> +                goto free_container_exit;
>> +            }
>>           }
>
> Looks like you don't have an error path which will handle the case
> where the prereg listener is registered, but registering the normal
> PCI AS listener fails - I believe you will fail to unregister the
> prereg listener in that case.


In this case, control goes to the listener_release_exit: label, which calls
vfio_listener_release(), which unregisters both listeners (see the hunk a
few chunks above).
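
A toy model of that unwind path, reduced to plain C so it can be run on its
own. The struct, the enum and the fail_iommu switch are hypothetical
stand-ins for the QEMU types; only the shape of the goto-based cleanup
mirrors the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the VFIO IOMMU types. */
enum iommu_type { TYPE1, SPAPR_V1, SPAPR_V2 };

struct container {
    enum iommu_type type;
    bool iommu_listener_registered;
    bool prereg_listener_registered;
};

/* Mirrors vfio_listener_release(): the prereg listener only exists on v2. */
static void listener_release(struct container *c)
{
    c->iommu_listener_registered = false;
    if (c->type == SPAPR_V2) {
        c->prereg_listener_registered = false;
    }
}

/* fail_iommu simulates an error reported while the IOMMU listener is set up. */
static int connect_container(struct container *c, bool fail_iommu)
{
    int ret = 0;

    if (c->type == SPAPR_V2) {
        c->prereg_listener_registered = true;
    }

    c->iommu_listener_registered = true;
    if (fail_iommu) {
        ret = -1;
        goto listener_release_exit;
    }
    return 0;

listener_release_exit:
    /* One exit label drops both listeners. */
    listener_release(c);
    return ret;
}
```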



>>           /*
>> diff --git a/hw/vfio/prereg.c b/hw/vfio/prereg.c
>> new file mode 100644
>> index 0000000..66cd696
>> --- /dev/null
>> +++ b/hw/vfio/prereg.c
>> @@ -0,0 +1,138 @@
>> +/*
>> + * DMA memory preregistration
>> + *
>> + * Authors:
>> + *  Alexey Kardashevskiy <aik@ozlabs.ru>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.  See
>> + * the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include <sys/ioctl.h>
>> +#include <linux/vfio.h>
>> +
>> +#include "hw/vfio/vfio-common.h"
>> +#include "hw/vfio/vfio.h"
>> +#include "qemu/error-report.h"
>> +#include "trace.h"
>> +
>> +static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
>> +{
>> +    return (!memory_region_is_ram(section->mr) &&
>> +            !memory_region_is_iommu(section->mr)) ||
>> +            memory_region_is_skip_dump(section->mr);
>> +}
>> +
>> +static void vfio_prereg_listener_region_add(MemoryListener *listener,
>> +                                            MemoryRegionSection *section)
>> +{
>> +    VFIOMemoryListener *vlistener = container_of(listener, VFIOMemoryListener,
>> +                                                 listener);
>> +    VFIOContainer *container = vlistener->container;
>> +    hwaddr iova;
>> +    Int128 llend;
>> +    int ret;
>> +    hwaddr page_mask = qemu_real_host_page_mask;
>> +    struct vfio_iommu_spapr_register_memory reg = {
>> +        .argsz = sizeof(reg),
>> +        .flags = 0,
>> +    };
>> +
>> +    if (vfio_prereg_listener_skipped_section(section)) {
>> +        trace_vfio_listener_region_add_skip(
>> +                section->offset_within_address_space,
>> +                section->offset_within_address_space +
>> +                int128_get64(int128_sub(section->size, int128_one())));
>> +        return;
>> +    }
>
> You should probably explicitly check for IOMMU regions and abort if
> you find one.  An IOMMU AS appearing within address_space_memory would
> be bad news.


Oh, vfio_prereg_listener_skipped_section() lets IOMMU regions through via
memory_region_is_iommu(); I'll remove that check.
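
Roughly, the predicate would then reduce to the following. This is a
standalone sketch: struct region_kind is a hypothetical stand-in for the
MemoryRegion queries, and whether an IOMMU region should trigger an abort
rather than a skip is left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the MemoryRegion properties the listener checks. */
struct region_kind {
    bool is_ram;
    bool is_iommu;
    bool is_skip_dump;
};

/*
 * Skip everything that is not plain RAM, plus RAM marked "skip dump"
 * (VFIO MMIO regions). An IOMMU region is not RAM, so it no longer
 * gets through to the register ioctl.
 */
static bool prereg_skipped(const struct region_kind *r)
{
    return !r->is_ram || r->is_skip_dump;
}
```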



>
>> +    if (unlikely((section->offset_within_address_space & ~page_mask) !=
>> +                 (section->offset_within_region & ~page_mask))) {
>> +        error_report("%s received unaligned region", __func__);
>> +        return;
>> +    }
>> +
>> +    iova = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
>
> iova is a terrible name here.  This is *not* an IOVA, but a real
> memory address.


I'll do s/iova/gpa/


>
>> +    llend = int128_make64(section->offset_within_address_space);
>> +    llend = int128_add(llend, section->size);
>> +    llend = int128_and(llend, int128_exts64(page_mask));
>> +
>> +    if (int128_ge(int128_make64(iova), llend)) {
>> +        return;
>
> IIUC, if we get here something has gone horribly wrong in our machine
> setup, and we should probably just abort.  Same goes for the similar
> test in the IOVA listener, of course.


I do not have a good candidate for the message though. Maybe
hw_error("vfio: Broken section alignment");
?
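
For reference, the range computation that this check guards can be written
as a standalone function. This is only a sketch: page_aligned_span() is a
hypothetical name, page_size is assumed to be a power of two, and overflow
of start + size is ignored (the patch uses Int128 for that addition).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Shrink [start, start + size) to whole pages: round the start up and the
 * end down. Returns false when no complete page fits, which is the case
 * the listener currently returns from silently.
 */
static bool page_aligned_span(uint64_t start, uint64_t size,
                              uint64_t page_size,
                              uint64_t *gpa, uint64_t *len)
{
    uint64_t mask = ~(page_size - 1);
    uint64_t first = (start + page_size - 1) & mask;
    uint64_t end = (start + size) & mask;

    if (first >= end) {
        return false;
    }
    *gpa = first;
    *len = end - first;
    return true;
}
```

A section that starts mid-page and is shorter than a page yields an empty
span, so an abort at that point would fire for any such section.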


>
>> +    }
>> +
>> +    memory_region_ref(section->mr);
>> +
>> +    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
>> +        section->offset_within_region +
>> +        (iova - section->offset_within_address_space);
>> +    reg.size = int128_get64(llend) - iova;
>> +
>> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
>> +    trace_vfio_ram_register(reg.vaddr, reg.size, ret ? -errno : 0);
>> +    if (ret) {
>> +        /*
>> +         * On the initfn path, store the first error in the container so we
>> +         * can gracefully fail.  Runtime, there's not much we can do other
>> +         * than throw a hardware error.
>> +         */
>> +        if (!container->initialized) {
>> +            if (!container->error) {
>> +                container->error = ret;
>> +            }
>> +        } else {
>> +            hw_error("vfio: DMA mapping failed, unable to continue");
>
> Wrong error message.
>
>> +        }
>> +    }
>> +}
>> +
>> +static void vfio_prereg_listener_region_del(MemoryListener *listener,
>> +                                            MemoryRegionSection *section)
>> +{
>> +    VFIOMemoryListener *vlistener = container_of(listener, VFIOMemoryListener,
>> +                                                 listener);
>> +    VFIOContainer *container = vlistener->container;
>> +    hwaddr iova, end;
>> +    int ret;
>> +    hwaddr page_mask = qemu_real_host_page_mask;
>> +    struct vfio_iommu_spapr_register_memory reg = {
>> +        .argsz = sizeof(reg),
>> +        .flags = 0,
>> +    };
>> +
>> +    if (vfio_prereg_listener_skipped_section(section)) {
>> +        trace_vfio_listener_region_del_skip(
>> +                section->offset_within_address_space,
>> +                section->offset_within_address_space +
>> +                int128_get64(int128_sub(section->size, int128_one())));
>> +        return;
>> +    }
>> +
>> +    if (unlikely((section->offset_within_address_space & ~page_mask) !=
>> +                 (section->offset_within_region & ~page_mask))) {
>> +        error_report("%s received unaligned region", __func__);
>> +        return;
>> +    }
>> +
>> +    iova = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
>> +    end = (section->offset_within_address_space + int128_get64(section->size)) &
>> +        page_mask;
>> +
>> +    if (iova >= end) {
>> +        return;
>> +    }
>> +
>> +    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
>> +        section->offset_within_region +
>> +        (iova - section->offset_within_address_space);
>> +    reg.size = end - iova;
>> +
>> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
>> +    trace_vfio_ram_unregister(reg.vaddr, reg.size, ret ? -errno : 0);
>> +}
>> +
>> +const MemoryListener vfio_prereg_listener = {
>> +    .region_add = vfio_prereg_listener_region_add,
>> +    .region_del = vfio_prereg_listener_region_del,
>> +};
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index b6b736c..bcbc5cb 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -70,6 +70,8 @@ typedef struct VFIOContainer {
>>       VFIOAddressSpace *space;
>>       int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>>       VFIOMemoryListener iommu_listener;
>> +    VFIOMemoryListener prereg_listener;
>> +    unsigned iommu_type;
>>       int error;
>>       bool initialized;
>>       /*
>> @@ -146,4 +148,6 @@ extern const MemoryRegionOps vfio_region_ops;
>>   extern QLIST_HEAD(vfio_group_head, VFIOGroup) vfio_group_list;
>>   extern QLIST_HEAD(vfio_as_head, VFIOAddressSpace) vfio_address_spaces;
>>
>> +extern const MemoryListener vfio_prereg_listener;
>> +
>>   #endif /* !HW_VFIO_VFIO_COMMON_H */
>> diff --git a/trace-events b/trace-events
>> index 4b6ea70..f5335ec 100644
>> --- a/trace-events
>> +++ b/trace-events
>> @@ -1725,6 +1725,8 @@ vfio_disconnect_container(int fd) "close container->fd=%d"
>>   vfio_put_group(int fd) "close group->fd=%d"
>>   vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions, unsigned int num_irqs) "Device %s flags: %u, regions: %u, irqs: %u"
>>   vfio_put_base_device(int fd) "close vdev->fd=%d"
>> +vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>> +vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>>
>>   # hw/vfio/platform.c
>>   vfio_platform_populate_regions(int region_index, unsigned long flag, unsigned long size, int fd, unsigned long offset) "- region %d flags = 0x%lx, size = 0x%lx, fd= %d, offset = 0x%lx"
>


-- 
Alexey


* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 02/16] spapr_pci: Move DMA window enablement to a helper
  2016-03-10  5:47     ` Alexey Kardashevskiy
@ 2016-03-15  5:30       ` David Gibson
  0 siblings, 0 replies; 50+ messages in thread
From: David Gibson @ 2016-03-15  5:30 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel


On Thu, Mar 10, 2016 at 04:47:04PM +1100, Alexey Kardashevskiy wrote:
> On 03/03/2016 12:40 PM, David Gibson wrote:
> >On Tue, Mar 01, 2016 at 08:10:27PM +1100, Alexey Kardashevskiy wrote:
> >>We are going to have multiple DMA windows soon so let's start preparing.
> >>
> >>This adds a new helper to create a DMA window and makes use of it in
> >>sPAPRPHBState::realize().
> >>
> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>---
> >>  hw/ppc/spapr_pci.c | 40 +++++++++++++++++++++++++++-------------
> >>  1 file changed, 27 insertions(+), 13 deletions(-)
> >>
> >>diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> >>index 3d1145e..248f20a 100644
> >>--- a/hw/ppc/spapr_pci.c
> >>+++ b/hw/ppc/spapr_pci.c
> >>@@ -803,6 +803,29 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
> >>      return buf;
> >>  }
> >>
> >>+static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> >>+                                       uint32_t liobn, uint32_t page_shift,
> >>+                                       uint64_t window_addr,
> >>+                                       uint64_t window_size)
> >>+{
> >>+    sPAPRTCETable *tcet;
> >>+    uint32_t nb_table = window_size >> page_shift;
> >>+
> >>+    if (!nb_table) {
> >>+        return -1;
> >>+    }
> >
> >The caller shouldn't do this, so this probably makes more sense as an
> >assert() than an error return.
> 
> 
> @dma_win_size is a PHB property so the CLI can set it to zero - where is it
> supposed to fail? When DMA won't work? This will not be really obvious to
> the user. Check dma_win_size==0 where we check mem_win_addr/...?

Ah.. good point.  It could be checked in the caller, but it doesn't
make a lot of sense to.

> >>+
> >>+    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, window_addr,
> >>+                               page_shift, nb_table, false);
> >>+    if (!tcet) {
> >>+        return -1;
> >>+    }
> >
> >Since you're adding error reporting, you might as well make it via the
> >error API instead of a return code.  That way if we want to add more
> >detailed error API reporting to spapr_tce_new_table() in future,
> >there's less to change.
> 
> Well, the table allocation is the only real thing which may fail there and
> spapr_phb_realize() does not pass Error to the callers

?? I'm not sure what you mean by that.

> so
> spapr_phb_dma_window_enable() would be the first one to propagate an error
> and it just seems a bit over engineered. Should I still do that?

Yes.

You can argue it's overengineered, but better we move towards
overengineered in a consistent way, than continue to use a mishmash of
error codes and the error api.


> 
> 
> >>+
> >>+    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
> >>+                                spapr_tce_get_iommu(tcet));
> >>+    return 0;
> >>+}
> >>+
> >>  /* Macros to operate with address in OF binding to PCI */
> >>  #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
> >>  #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
> >>@@ -1228,8 +1251,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >>      int i;
> >>      PCIBus *bus;
> >>      uint64_t msi_window_size = 4096;
> >>-    sPAPRTCETable *tcet;
> >>-    uint32_t nb_table;
> >>
> >>      if (sphb->index != (uint32_t)-1) {
> >>          hwaddr windows_base;
> >>@@ -1381,18 +1402,11 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >>          }
> >>      }
> >>
> >>-    nb_table = sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT;
> >>-    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn,
> >>-                               0, SPAPR_TCE_PAGE_SHIFT, nb_table, false);
> >>-    if (!tcet) {
> >>-        error_setg(errp, "Unable to create TCE table for %s",
> >>-                   sphb->dtbusname);
> >>-        return;
> >>-    }
> >>-
> >>      /* Register default 32bit DMA window */
> >>-    memory_region_add_subregion(&sphb->iommu_root, sphb->dma_win_addr,
> >>-                                spapr_tce_get_iommu(tcet));
> >>+    if (spapr_phb_dma_window_enable(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
> >>+                                    sphb->dma_win_addr, sphb->dma_win_size)) {
> >>+        error_setg(errp, "Unable to create TCE table for %s", sphb->dtbusname);
> >>+    }
> >>
> >>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
> >>  }
> >
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 04/16] spapr_iommu: Introduce "enabled" state for TCE table
  2016-03-10  7:39     ` Alexey Kardashevskiy
@ 2016-03-15  5:32       ` David Gibson
  0 siblings, 0 replies; 50+ messages in thread
From: David Gibson @ 2016-03-15  5:32 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel


On Thu, Mar 10, 2016 at 06:39:39PM +1100, Alexey Kardashevskiy wrote:
> On 03/03/2016 02:00 PM, David Gibson wrote:
> >On Tue, Mar 01, 2016 at 08:10:29PM +1100, Alexey Kardashevskiy wrote:
> >>Currently TCE tables are created once at start and their sizes never
> >>change. We are going to change that by introducing Dynamic DMA windows
> >>support, where the DMA configuration may change during guest execution.
> >>
> >>This changes spapr_tce_new_table() to create an empty zero-size IOMMU
> >>memory region (IOMMU MR). Only LIOBN is assigned by the time of creation.
> >>It still will be called once at the owner object (VIO or PHB) creation.
> >>
> >>This introduces an "enabled" state for TCE table objects with two
> >>helper functions - spapr_tce_table_enable()/spapr_tce_table_disable().
> >>- spapr_tce_table_enable() receives TCE table parameters, allocates
> >>a guest view of the TCE table (in the user space or KVM) and
> >>sets the correct size on the IOMMU MR.
> >>- spapr_tce_table_disable() disposes the table and resets the IOMMU MR
> >>size.
> >>
> >>This changes the PHB reset handler to do the default DMA initialization
> >>instead of spapr_phb_realize(). This makes no difference now but
> >>later with more than just one DMA window, we will have to remove them all
> >>and create the default one on a system reset.
> >>
> >>No visible change in behaviour is expected except the actual table
> >>will be reallocated every reset. We might optimize this later.
> >>
> >>The other way to implement this would be to dynamically create/remove
> >>the TCE table QOM objects but this would make migration impossible
> >>as the migration code expects all QOM objects to exist at the receiver
> >>so we have to have TCE table objects created when migration begins.
> >>
> >>spapr_tce_table_do_enable() is separated from spapr_tce_table_enable()
> >>as later it will be called at the sPAPRTCETable post-migration stage when
> >>it already has all the properties set after the migration.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >
> >Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> >
> >Although there's one nit that could be improved:
> >
> >
> >>---
> >>  hw/ppc/spapr_iommu.c   | 80 +++++++++++++++++++++++++++++++++++---------------
> >>  hw/ppc/spapr_pci.c     | 21 +++++++++----
> >>  hw/ppc/spapr_vio.c     |  9 +++---
> >>  include/hw/ppc/spapr.h | 10 +++----
> >>  4 files changed, 80 insertions(+), 40 deletions(-)
> >>
> >>diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> >>index 8132f64..e66e128 100644
> >>--- a/hw/ppc/spapr_iommu.c
> >>+++ b/hw/ppc/spapr_iommu.c
> >>@@ -174,15 +174,8 @@ static int spapr_tce_table_realize(DeviceState *dev)
> >>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
> >>
> >>      tcet->fd = -1;
> >>-    tcet->table = spapr_tce_alloc_table(tcet->liobn,
> >>-                                        tcet->page_shift,
> >>-                                        tcet->nb_table,
> >>-                                        &tcet->fd,
> >>-                                        tcet->need_vfio);
> >>-
> >>      memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
> >>-                             "iommu-spapr",
> >>-                             (uint64_t)tcet->nb_table << tcet->page_shift);
> >>+                             "iommu-spapr", 0);
> >>
> >>      QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
> >>
> >>@@ -224,14 +217,10 @@ void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio)
> >>      tcet->table = newtable;
> >>  }
> >>
> >>-sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
> >>-                                   uint64_t bus_offset,
> >>-                                   uint32_t page_shift,
> >>-                                   uint32_t nb_table,
> >>-                                   bool need_vfio)
> >>+sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn)
> >>  {
> >>      sPAPRTCETable *tcet;
> >>-    char tmp[64];
> >>+    char tmp[32];
> >>
> >>      if (spapr_tce_find_by_liobn(liobn)) {
> >>          fprintf(stderr, "Attempted to create TCE table with duplicate"
> >>@@ -239,16 +228,8 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
> >>          return NULL;
> >>      }
> >>
> >>-    if (!nb_table) {
> >>-        return NULL;
> >>-    }
> >>-
> >>      tcet = SPAPR_TCE_TABLE(object_new(TYPE_SPAPR_TCE_TABLE));
> >>      tcet->liobn = liobn;
> >>-    tcet->bus_offset = bus_offset;
> >>-    tcet->page_shift = page_shift;
> >>-    tcet->nb_table = nb_table;
> >>-    tcet->need_vfio = need_vfio;
> >>
> >>      snprintf(tmp, sizeof(tmp), "tce-table-%x", liobn);
> >>      object_property_add_child(OBJECT(owner), tmp, OBJECT(tcet), NULL);
> >>@@ -258,14 +239,65 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
> >>      return tcet;
> >>  }
> >>
> >>+static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
> >>+{
> >>+    if (!tcet->nb_table) {
> >>+        return;
> >>+    }
> >>+
> >>+    tcet->table = spapr_tce_alloc_table(tcet->liobn,
> >>+                                        tcet->page_shift,
> >>+                                        tcet->nb_table,
> >>+                                        &tcet->fd,
> >>+                                        tcet->need_vfio);
> >>+
> >>+    memory_region_set_size(&tcet->iommu,
> >>+                           (uint64_t)tcet->nb_table << tcet->page_shift);
> >>+
> >>+    tcet->enabled = true;
> >>+}
> >>+
> >>+void spapr_tce_table_enable(sPAPRTCETable *tcet,
> >>+                            uint32_t page_shift, uint64_t bus_offset,
> >>+                            uint32_t nb_table, bool need_vfio)
> >>+{
> >>+    if (tcet->enabled) {
> >>+        return;
> >
> >If the given parameters are different from the current ones, treating
> >this as a no-op is rather misleading.  I gather that to resize the
> >window you're expected to disable, then re-enable.  In which case I
> >think it would be safer to actually throw some kind of error on a
> >double enable.
> 
> I'll add here
> 
>  error_report("Warning: trying to enable already enabled TCE table");
> 
> ...
> 
> 
> 
> 
> >
> >
> >>+    }
> >>+
> >>+    tcet->bus_offset = bus_offset;
> >>+    tcet->page_shift = page_shift;
> >>+    tcet->nb_table = nb_table;
> >>+    tcet->need_vfio = need_vfio;
> >>+
> >>+    spapr_tce_table_do_enable(tcet);
> >>+}
> >>+
> >>+static void spapr_tce_table_disable(sPAPRTCETable *tcet)
> >>+{
> >>+    if (!tcet->enabled) {
> 
> ...and
> 
> error_report("Warning: trying to disable already disabled TCE table");

That sounds good.

> or g_assert()?

Erm.. only if this can't be triggered by user actions.
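
To make the distinction concrete, here is a standalone sketch of the
"report and refuse" variant. struct tce_table and both functions are
hypothetical stand-ins for sPAPRTCETable and its helpers, and the return
codes are an assumption added for the example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the relevant sPAPRTCETable fields. */
struct tce_table {
    bool enabled;
    uint32_t page_shift;
    uint32_t nb_table;
};

/* Returns 0 on success, -1 if the table is already enabled. */
static int tce_table_enable(struct tce_table *t,
                            uint32_t page_shift, uint32_t nb_table)
{
    if (t->enabled) {
        /* Reported, not asserted: this may be reachable from user input. */
        fprintf(stderr,
                "Warning: trying to enable already enabled TCE table\n");
        return -1;
    }
    t->page_shift = page_shift;
    t->nb_table = nb_table;
    t->enabled = true;
    return 0;
}

/* Returns 0 on success, -1 if the table is already disabled. */
static int tce_table_disable(struct tce_table *t)
{
    if (!t->enabled) {
        fprintf(stderr,
                "Warning: trying to disable already disabled TCE table\n");
        return -1;
    }
    t->enabled = false;
    t->nb_table = 0;
    return 0;
}
```

A second enable with different parameters leaves the first configuration
untouched, so resizing has to go through disable and then enable.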

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 11/16] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2016-03-15  2:53     ` Alexey Kardashevskiy
@ 2016-03-15  5:42       ` David Gibson
  2016-03-17  5:04         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 50+ messages in thread
From: David Gibson @ 2016-03-15  5:42 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel


On Tue, Mar 15, 2016 at 01:53:48PM +1100, Alexey Kardashevskiy wrote:
> On 03/03/2016 05:30 PM, David Gibson wrote:
> >On Tue, Mar 01, 2016 at 08:10:36PM +1100, Alexey Kardashevskiy wrote:
> >>This makes use of the new "memory registering" feature. The idea is
> >>to provide the userspace ability to notify the host kernel about pages
> >>which are going to be used for DMA. Having this information, the host
> >>kernel can pin them all once per user process, do locked pages
> >>accounting (once) and not spend time on doing that in real time with
> >>possible failures which cannot be handled nicely in some cases.
> >>
> >>This adds a prereg memory listener which listens on address_space_memory
> >>and notifies a VFIO container about memory which needs to be
> >>pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
> >>
> >>As there is no per-IOMMU-type release() callback anymore, this stores
> >>the IOMMU type in the container so vfio_listener_release() can decide
> >>if it needs to unregister @prereg_listener.
> >>
> >>The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> >>are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> >>not call it when v2 is detected and enabled.
> >>
> >>This does not change the guest visible interface.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>---
> >>  hw/vfio/Makefile.objs         |   1 +
> >>  hw/vfio/common.c              |  39 +++++++++---
> >>  hw/vfio/prereg.c              | 138 ++++++++++++++++++++++++++++++++++++++++++
> >>  include/hw/vfio/vfio-common.h |   4 ++
> >>  trace-events                  |   2 +
> >>  5 files changed, 175 insertions(+), 9 deletions(-)
> >>  create mode 100644 hw/vfio/prereg.c
> >>
> >>diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> >>index ceddbb8..5800e0e 100644
> >>--- a/hw/vfio/Makefile.objs
> >>+++ b/hw/vfio/Makefile.objs
> >>@@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
> >>  obj-$(CONFIG_SOFTMMU) += platform.o
> >>  obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
> >>  obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
> >>+obj-$(CONFIG_SOFTMMU) += prereg.o
> >>  endif
> >>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>index 3aaa6b5..f2a03e0 100644
> >>--- a/hw/vfio/common.c
> >>+++ b/hw/vfio/common.c
> >>@@ -531,6 +531,9 @@ static const MemoryListener vfio_iommu_listener = {
> >>  static void vfio_listener_release(VFIOContainer *container)
> >>  {
> >>      memory_listener_unregister(&container->iommu_listener.listener);
> >>+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> >>+        memory_listener_unregister(&container->prereg_listener.listener);
> >>+    }
> >>  }
> >>
> >>  int vfio_mmap_region(Object *obj, VFIORegion *region,
> >>@@ -722,8 +725,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>              goto free_container_exit;
> >>          }
> >>
> >>-        ret = ioctl(fd, VFIO_SET_IOMMU,
> >>-                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
> >>+        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
> >>+        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
> >>          if (ret) {
> >>              error_report("vfio: failed to set iommu for container: %m");
> >>              ret = -errno;
> >>@@ -748,8 +751,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>          if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
> >>              container->iova_pgsizes = info.iova_pgsizes;
> >>          }
> >>-    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> >>+    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
> >>+               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
> >>          struct vfio_iommu_spapr_tce_info info;
> >>+        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
> >>
> >>          ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
> >>          if (ret) {
> >>@@ -757,7 +762,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>              ret = -errno;
> >>              goto free_container_exit;
> >>          }
> >>-        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
> >>+        container->iommu_type =
> >>+            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
> >>+        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
> >
> >It'd be nice to consolidate the setting of container->iommu_type and
> >then the SET_IOMMU ioctl() rather than having more or less duplicated
> >logic for Type1 and SPAPR, but it's not a big deal.
> 
> 
> Maybe, but I cannot think of a nice way of doing this.
> 
> 
> >
> >>          if (ret) {
> >>              error_report("vfio: failed to set iommu for container: %m");
> >>              ret = -errno;
> >>@@ -769,11 +776,25 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>           * when container fd is closed so we do not call it explicitly
> >>           * in this file.
> >>           */
> >>-        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> >>-        if (ret) {
> >>-            error_report("vfio: failed to enable container: %m");
> >>-            ret = -errno;
> >>-            goto free_container_exit;
> >>+        if (!v2) {
> >>+            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> >>+            if (ret) {
> >>+                error_report("vfio: failed to enable container: %m");
> >>+                ret = -errno;
> >>+                goto free_container_exit;
> >>+            }
> >>+        } else {
> >>+            container->prereg_listener.container = container;
> >>+            container->prereg_listener.listener = vfio_prereg_listener;
> >>+
> >>+            memory_listener_register(&container->prereg_listener.listener,
> >>+                                     &address_space_memory);
> >
> >This assumes that the target address space of the (guest) IOMMU is
> >address_space_memory.  Which is fine - vfio already assumes that - but
> >it reminds me that it'd be nice to have an explicit check for that (I
> >guess it would have to go in vfio_iommu_map_notify()).  So that if
> >someone constructs a machine where that's not the case, it'll at least
> >be obvious why VFIO isn't working.
> 
> Ok, I'll add a small patch for this in the next respin.

Ok.

> >>+            if (container->error) {
> >>+                error_report("vfio: RAM memory listener initialization failed for container");
> >>+                memory_listener_unregister(
> >>+                    &container->prereg_listener.listener);
> >>+                goto free_container_exit;
> >>+            }
> >>          }
> >
> >Looks like you don't have an error path which will handle the case
> >where the prereg listener is registered, but registering the normal
> >PCI AS listener fails - I believe you will fail to unregister the
> >prereg listener in that case.
> 
> 
> In this case, the control goes to listener_release_exit: which calls
> vfio_listener_release() which unregisters both listeners (it is a few chunks
> above).

Ah.. yes.  In which case this could also jump to listener_release_exit
and avoid the explicit unreg(), yes?
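
A minimal sketch of that single-exit shape, using toy stand-ins rather
than the real QEMU types (struct toy_container, toy_listener_release()
and toy_connect() are hypothetical names for illustration only).
Whether the real vfio_listener_release() is safe to call before the
IOMMU listener has been registered is not checked here.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the listener bookkeeping in VFIOContainer. */
struct toy_container {
    bool iommu_registered;
    bool prereg_registered;
    int error;
};

/* Plays the role of vfio_listener_release(): drops every listener in one place. */
static void toy_listener_release(struct toy_container *c)
{
    c->iommu_registered = false;
    c->prereg_registered = false;
}

/* Returns 0 on success, or the recorded negative error on failure. */
static int toy_connect(struct toy_container *c, int prereg_error)
{
    int ret;

    c->prereg_registered = true;
    c->error = prereg_error;    /* pretend the listener callback stored this */
    if (c->error) {
        ret = c->error;
        goto listener_release_exit;    /* no explicit unregister needed here */
    }

    c->iommu_registered = true;
    return 0;

listener_release_exit:
    toy_listener_release(c);
    return ret;
}
```

With this shape, every failure after the first registration shares one
cleanup label, so adding another listener later only touches the
release helper.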

> >>          /*
> >>diff --git a/hw/vfio/prereg.c b/hw/vfio/prereg.c
> >>new file mode 100644
> >>index 0000000..66cd696
> >>--- /dev/null
> >>+++ b/hw/vfio/prereg.c
> >>@@ -0,0 +1,138 @@
> >>+/*
> >>+ * DMA memory preregistration
> >>+ *
> >>+ * Authors:
> >>+ *  Alexey Kardashevskiy <aik@ozlabs.ru>
> >>+ *
> >>+ * This work is licensed under the terms of the GNU GPL, version 2.  See
> >>+ * the COPYING file in the top-level directory.
> >>+ */
> >>+
> >>+#include "qemu/osdep.h"
> >>+#include <sys/ioctl.h>
> >>+#include <linux/vfio.h>
> >>+
> >>+#include "hw/vfio/vfio-common.h"
> >>+#include "hw/vfio/vfio.h"
> >>+#include "qemu/error-report.h"
> >>+#include "trace.h"
> >>+
> >>+static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
> >>+{
> >>+    return (!memory_region_is_ram(section->mr) &&
> >>+            !memory_region_is_iommu(section->mr)) ||
> >>+            memory_region_is_skip_dump(section->mr);
> >>+}
> >>+
> >>+static void vfio_prereg_listener_region_add(MemoryListener *listener,
> >>+                                            MemoryRegionSection *section)
> >>+{
> >>+    VFIOMemoryListener *vlistener = container_of(listener, VFIOMemoryListener,
> >>+                                                 listener);
> >>+    VFIOContainer *container = vlistener->container;
> >>+    hwaddr iova;
> >>+    Int128 llend;
> >>+    int ret;
> >>+    hwaddr page_mask = qemu_real_host_page_mask;
> >>+    struct vfio_iommu_spapr_register_memory reg = {
> >>+        .argsz = sizeof(reg),
> >>+        .flags = 0,
> >>+    };
> >>+
> >>+    if (vfio_prereg_listener_skipped_section(section)) {
> >>+        trace_vfio_listener_region_add_skip(
> >>+                section->offset_within_address_space,
> >>+                section->offset_within_address_space +
> >>+                int128_get64(int128_sub(section->size, int128_one())));
> >>+        return;
> >>+    }
> >
> >You should probably explicitly check for IOMMU regions and abort if
> >you find one.  An IOMMU AS appearing within address_space_memory would
> >be bad news.
> 
> 
> Oh, vfio_prereg_listener_skipped_section() allows memory_region_is_iommu(),
> I'll remove it.

Well, that's part of it.

But IOMMU regions appearing here shouldn't just be ignored - they
should be treated as a fatal error.
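
A rough sketch of the three-way decision being asked for, with the
region reduced to three flags.  struct toy_region, enum toy_action and
toy_prereg_classify() are made-up names for illustration; in QEMU the
flags would come from memory_region_is_iommu(), memory_region_is_ram()
and memory_region_is_skip_dump().

```c
#include <stdbool.h>

/* Hypothetical flattened view of a MemoryRegion. */
struct toy_region {
    bool is_ram;
    bool is_iommu;
    bool is_skip_dump;
};

enum toy_action {
    TOY_REGISTER,   /* plain RAM: preregister it with the host kernel */
    TOY_SKIP,       /* not RAM, or VFIO MMIO ("skip dump"): ignore */
    TOY_FATAL       /* IOMMU region inside the memory address space */
};

static enum toy_action toy_prereg_classify(const struct toy_region *mr)
{
    /* Checked first so that an IOMMU region is never silently skipped. */
    if (mr->is_iommu) {
        return TOY_FATAL;
    }
    if (!mr->is_ram || mr->is_skip_dump) {
        return TOY_SKIP;
    }
    return TOY_REGISTER;
}
```

The caller would turn TOY_FATAL into an abort or hw_error() rather
than returning quietly.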

> 
> 
> 
> >
> >>+    if (unlikely((section->offset_within_address_space & ~page_mask) !=
> >>+                 (section->offset_within_region & ~page_mask))) {
> >>+        error_report("%s received unaligned region", __func__);
> >>+        return;
> >>+    }
> >>+
> >>+    iova = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
> >
> >iova is a terrible name here.  This is *not* an IOVA, but a real
> >memory address.
> 
> 
> I'll do s/iova/gpa/

Good.

> 
> 
> >
> >>+    llend = int128_make64(section->offset_within_address_space);
> >>+    llend = int128_add(llend, section->size);
> >>+    llend = int128_and(llend, int128_exts64(page_mask));
> >>+
> >>+    if (int128_ge(int128_make64(iova), llend)) {
> >>+        return;
> >
> >IIUC, if we get here something has gone horribly wrong in our machine
> >setup, and we should probably just abort.  Same goes for the similar
> >test in the IOVA listener, of course.
> 
> 
> I do not have a good message candidate though. Maybe
> hw_error("vfio: Broken section alignment");
> ?

Just an assert should be fine.
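
For illustration, a self-contained sketch of the range computation
with the early return replaced by an assert.  TOY_ROUND_UP, struct
toy_range and toy_page_align() are local, hypothetical names; the
sketch uses plain 64-bit arithmetic and so ignores the overflow case
that the patch handles with Int128.

```c
#include <assert.h>
#include <stdint.h>

/* Local helper; d must be a power of two. */
#define TOY_ROUND_UP(n, d) (((n) + (d) - 1) & ~((uint64_t)(d) - 1))

struct toy_range {
    uint64_t start;
    uint64_t size;
};

/*
 * Shrink [offset, offset + size) to the largest page-aligned range it
 * contains.  An empty result means the section layout is broken, so it
 * is treated as a programming error rather than silently ignored.
 */
static struct toy_range toy_page_align(uint64_t offset, uint64_t size,
                                       uint64_t page_size)
{
    uint64_t page_mask = ~(page_size - 1);
    uint64_t start = TOY_ROUND_UP(offset, page_size);
    uint64_t end = (offset + size) & page_mask;

    assert(start < end);

    return (struct toy_range){ .start = start, .size = end - start };
}
```

For example, with 64K pages a section at 0x1000 of size 0x30000 is
trimmed to start 0x10000 and size 0x20000.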

> 
> 
> >
> >>+    }
> >>+
> >>+    memory_region_ref(section->mr);
> >>+
> >>+    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
> >>+        section->offset_within_region +
> >>+        (iova - section->offset_within_address_space);
> >>+    reg.size = int128_get64(llend) - iova;
> >>+
> >>+    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
> >>+    trace_vfio_ram_register(reg.vaddr, reg.size, ret ? -errno : 0);
> >>+    if (ret) {
> >>+        /*
> >>+         * On the initfn path, store the first error in the container so we
> >>+         * can gracefully fail.  Runtime, there's not much we can do other
> >>+         * than throw a hardware error.
> >>+         */
> >>+        if (!container->initialized) {
> >>+            if (!container->error) {
> >>+                container->error = ret;
> >>+            }
> >>+        } else {
> >>+            hw_error("vfio: DMA mapping failed, unable to continue");
> >
> >Wrong error message.
> >
> >>+        }
> >>+    }
> >>+}
> >>+
> >>+static void vfio_prereg_listener_region_del(MemoryListener *listener,
> >>+                                            MemoryRegionSection *section)
> >>+{
> >>+    VFIOMemoryListener *vlistener = container_of(listener, VFIOMemoryListener,
> >>+                                                 listener);
> >>+    VFIOContainer *container = vlistener->container;
> >>+    hwaddr iova, end;
> >>+    int ret;
> >>+    hwaddr page_mask = qemu_real_host_page_mask;
> >>+    struct vfio_iommu_spapr_register_memory reg = {
> >>+        .argsz = sizeof(reg),
> >>+        .flags = 0,
> >>+    };
> >>+
> >>+    if (vfio_prereg_listener_skipped_section(section)) {
> >>+        trace_vfio_listener_region_del_skip(
> >>+                section->offset_within_address_space,
> >>+                section->offset_within_address_space +
> >>+                int128_get64(int128_sub(section->size, int128_one())));
> >>+        return;
> >>+    }
> >>+
> >>+    if (unlikely((section->offset_within_address_space & ~page_mask) !=
> >>+                 (section->offset_within_region & ~page_mask))) {
> >>+        error_report("%s received unaligned region", __func__);
> >>+        return;
> >>+    }
> >>+
> >>+    iova = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
> >>+    end = (section->offset_within_address_space + int128_get64(section->size)) &
> >>+        page_mask;
> >>+
> >>+    if (iova >= end) {
> >>+        return;
> >>+    }
> >>+
> >>+    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
> >>+        section->offset_within_region +
> >>+        (iova - section->offset_within_address_space);
> >>+    reg.size = end - iova;
> >>+
> >>+    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
> >>+    trace_vfio_ram_unregister(reg.vaddr, reg.size, ret ? -errno : 0);
> >>+}
> >>+
> >>+const MemoryListener vfio_prereg_listener = {
> >>+    .region_add = vfio_prereg_listener_region_add,
> >>+    .region_del = vfio_prereg_listener_region_del,
> >>+};
> >>diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >>index b6b736c..bcbc5cb 100644
> >>--- a/include/hw/vfio/vfio-common.h
> >>+++ b/include/hw/vfio/vfio-common.h
> >>@@ -70,6 +70,8 @@ typedef struct VFIOContainer {
> >>      VFIOAddressSpace *space;
> >>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
> >>      VFIOMemoryListener iommu_listener;
> >>+    VFIOMemoryListener prereg_listener;
> >>+    unsigned iommu_type;
> >>      int error;
> >>      bool initialized;
> >>      /*
> >>@@ -146,4 +148,6 @@ extern const MemoryRegionOps vfio_region_ops;
> >>  extern QLIST_HEAD(vfio_group_head, VFIOGroup) vfio_group_list;
> >>  extern QLIST_HEAD(vfio_as_head, VFIOAddressSpace) vfio_address_spaces;
> >>
> >>+extern const MemoryListener vfio_prereg_listener;
> >>+
> >>  #endif /* !HW_VFIO_VFIO_COMMON_H */
> >>diff --git a/trace-events b/trace-events
> >>index 4b6ea70..f5335ec 100644
> >>--- a/trace-events
> >>+++ b/trace-events
> >>@@ -1725,6 +1725,8 @@ vfio_disconnect_container(int fd) "close container->fd=%d"
> >>  vfio_put_group(int fd) "close group->fd=%d"
> >>  vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions, unsigned int num_irqs) "Device %s flags: %u, regions: %u, irqs: %u"
> >>  vfio_put_base_device(int fd) "close vdev->fd=%d"
> >>+vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> >>+vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> >>
> >>  # hw/vfio/platform.c
> >>  vfio_platform_populate_regions(int region_index, unsigned long flag, unsigned long size, int fd, unsigned long offset) "- region %d flags = 0x%lx, size = 0x%lx, fd= %d, offset = 0x%lx"
> >
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 16/16] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-03-11  9:03     ` Alexey Kardashevskiy
@ 2016-03-15  5:53       ` David Gibson
  0 siblings, 0 replies; 50+ messages in thread
From: David Gibson @ 2016-03-15  5:53 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel


On Fri, Mar 11, 2016 at 08:03:43PM +1100, Alexey Kardashevskiy wrote:
> On 03/04/2016 03:51 PM, David Gibson wrote:
> >On Tue, Mar 01, 2016 at 08:10:41PM +1100, Alexey Kardashevskiy wrote:
> >>This adds support for Dynamic DMA Windows (DDW) option defined by
> >>the SPAPR specification which allows having additional DMA window(s).
> >>
> >>This implements DDW for emulated and VFIO devices. As all TCE root regions
> >>are mapped at 0 and 64bit long (and actual tables are child regions),
> >>this replaces memory_region_add_subregion() with _overlap() to make
> >>QEMU memory API happy.
> >>
> >>This reserves RTAS token numbers for DDW calls.
> >>
> >>This changes the TCE table migration descriptor to support dynamic
> >>tables as from now on, PHB will create as many stub TCE table objects
> >>as PHB can possibly support but not all of them might be initialized at
> >>the time of migration because DDW might or might not be requested by
> >>the guest.
> >>
> >>The "ddw" property is enabled by default on a PHB but for compatibility
> >>the pseries-2.5 machine and older disable it.
> >>
> >>This implements DDW for VFIO. The host kernel support is required.
> >>This adds a "levels" property to PHB to control the number of levels
> >>in the actual TCE table allocated by the host kernel, 0 is the default
> >>value to tell QEMU to calculate the correct value. Current hardware
> >>supports up to 5 levels.
> >>
> >>The existing Linux guests try creating one additional huge DMA window
> >>with 64K or 16MB pages and map the entire guest RAM to it. If this succeeds,
> >>the guest switches to dma_direct_ops and never calls TCE hypercalls
> >>(H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
> >>and not waste time on map/unmap later. This adds a "dma64_win_addr"
> >>property which is a bus address for the 64bit window and by default
> >>set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
> >>uses and this allows having emulated and VFIO devices on the same bus.
> >>
> >>This adds 4 RTAS handlers:
> >>* ibm,query-pe-dma-window
> >>* ibm,create-pe-dma-window
> >>* ibm,remove-pe-dma-window
> >>* ibm,reset-pe-dma-window
> >>These are registered from type_init() callback.
> >>
> >>These RTAS handlers are implemented in a separate file to avoid polluting
> >>spapr_iommu.c with PCI.
> >>
> >>TODO (which I have no idea how to implement properly):
> >>1. check the host kernel actually supports SPAPR_PCI_DMA_MAX_WINDOWS
> >>windows and 12/16/24 page shift;
> >
> >As noted in a different subthread, this information is there in the
> >container.
> 
> Well, I rather want this in rtas_ibm_query_pe_dma_window() to report to the
> guest the supported page sizes but I cannot because of missing
> vfio_container_ioctl().

You'll need to add a new interface(s) in the VFIO code to retrieve
this.  It should take an AddressSpace and return the minimum
capabilities that can be simultaneously supported by all attached
containers.
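
As a sketch only, the "minimum capabilities" for page sizes could be
the intersection of the per-container masks.  toy_common_pgsizes() is
a hypothetical helper; the real interface would walk the containers
attached to an AddressSpace and read each container's iova_pgsizes
rather than take an array.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Each mask has bit N set when pages of size (1 << N) are supported.
 * Returns the page sizes supported by every container, or 0 when no
 * container is attached.
 */
static uint64_t toy_common_pgsizes(const uint64_t *masks, size_t n)
{
    uint64_t common = ~0ULL;
    size_t i;

    for (i = 0; i < n; i++) {
        common &= masks[i];
    }

    return n ? common : 0;
}
```

The same reduction would apply to other capabilities, for example
taking the minimum of the supported window counts.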

> I guess I'll just make page_size_mask, windows_supported and
> dma64_window_start PHB properties, set them to what I think the host
> supports and if the host does not support something, then QEMU will just
> fail quickly, and it will be obvious why.

Actually.. that's a better idea.  In general I think it makes for
saner handling of compatibility in future if you make the guest
properties directly settable and check whether they're possible on the
host, rather than trying to autoset the guest capabilities to match
the host.
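
Roughly, the check then becomes "is the guest configuration a subset
of what the host offers".  A sketch under that assumption;
toy_phb_config_ok() and its parameters are illustrative names, not an
existing QEMU function.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * guest_pgsizes / guest_windows: values set through PHB properties.
 * host_pgsizes / host_windows: what the host IOMMU reports.
 * Returns true only when every guest page size is available on the
 * host and the host can provide at least as many windows.
 */
static bool toy_phb_config_ok(uint64_t guest_pgsizes, unsigned guest_windows,
                              uint64_t host_pgsizes, unsigned host_windows)
{
    bool pgsizes_ok = (guest_pgsizes & ~host_pgsizes) == 0;
    bool windows_ok = guest_windows <= host_windows;

    return pgsizes_ok && windows_ok;
}
```

A false result would be reported as a configuration error when the
device is set up, instead of quietly trimming what the guest sees.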

> >>2. fix container::min_iova, max_iova - as for now, they are useless,
> >>and I'd expect IOMMU MR boundaries to serve this purpose really;
> >
> >This seems to show a similar confusion of concepts to #1.
> >container::min_iova, container::max_iova advertise limitations of the
> >host IOMMU, the IOMMU MR boundaries show constraints of the guest
> >IOMMU.  You need to verify the guest constraints against the host
> >constraints.
> >
> >A more flexible method than min/max iova will be necessary though, now
> >that the host IOMMU allows more flexible configurations than a single
> >window.
> >
> >>3. vfio_listener_region_add/vfio_listener_region_del do explicitly
> >>create/remove huge DMA window as we do not have vfio_container_ioctl()
> >>anymore, do we want to move these to some sort of callbacks? How, where?
> >>
> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>
> >># Conflicts:
> >>#	include/hw/pci-host/spapr.h
> >>
> >># Conflicts:
> >>#	hw/vfio/common.c
> >>---
> >>  hw/ppc/Makefile.objs        |   1 +
> >>  hw/ppc/spapr.c              |   7 +-
> >>  hw/ppc/spapr_iommu.c        |  32 ++++-
> >>  hw/ppc/spapr_pci.c          |  61 +++++++--
> >>  hw/ppc/spapr_rtas_ddw.c     | 306 ++++++++++++++++++++++++++++++++++++++++++++
> >>  hw/vfio/common.c            |  70 +++++++++-
> >>  include/hw/pci-host/spapr.h |  13 ++
> >>  include/hw/ppc/spapr.h      |  17 ++-
> >>  trace-events                |   6 +
> >>  9 files changed, 489 insertions(+), 24 deletions(-)
> >>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
> >>
> >>diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> >>index c1ffc77..986b36f 100644
> >>--- a/hw/ppc/Makefile.objs
> >>+++ b/hw/ppc/Makefile.objs
> >>@@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
> >>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
> >>  obj-y += spapr_pci_vfio.o
> >>  endif
> >>+obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
> >>  # PowerPC 4xx boards
> >>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
> >>  obj-y += ppc4xx_pci.o
> >>diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>index e9d4abf..2473217 100644
> >>--- a/hw/ppc/spapr.c
> >>+++ b/hw/ppc/spapr.c
> >>@@ -2370,7 +2370,12 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
> >>   * pseries-2.5
> >>   */
> >>  #define SPAPR_COMPAT_2_5 \
> >>-        HW_COMPAT_2_5
> >>+        HW_COMPAT_2_5 \
> >>+        {\
> >>+            .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
> >>+            .property = "ddw",\
> >>+            .value    = stringify(off),\
> >>+        },
> >>
> >>  static void spapr_machine_2_5_instance_options(MachineState *machine)
> >>  {
> >>diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> >>index 8aa2238..e32f71b 100644
> >>--- a/hw/ppc/spapr_iommu.c
> >>+++ b/hw/ppc/spapr_iommu.c
> >>@@ -150,6 +150,15 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
> >>      return 1ULL << tcet->page_shift;
> >>  }
> >>
> >>+static void spapr_tce_table_pre_save(void *opaque)
> >>+{
> >>+    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> >>+
> >>+    tcet->migtable = tcet->table;
> >>+}
> >>+
> >>+static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel);
> >>+
> >>  static int spapr_tce_table_post_load(void *opaque, int version_id)
> >>  {
> >>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> >>@@ -158,22 +167,39 @@ static int spapr_tce_table_post_load(void *opaque, int version_id)
> >>          spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
> >>      }
> >>
> >>+    if (tcet->enabled) {
> >>+        if (!tcet->table) {
> >>+            tcet->enabled = false;
> >>+            /* VFIO does not migrate so pass vfio_accel == false */
> >>+            spapr_tce_table_do_enable(tcet, false);
> >>+        }
> >
> >What if there was an existing table, but its size doesn't match that
> >in the incoming migration?  Don't you need to free() it and
> >re-allocate?  IIUC this would happen in practice if you migrated a
> >guest which had removed the default window and replaced it with one of
> >a different size.
> >
> >>+        memcpy(tcet->table, tcet->migtable,
> >>+               tcet->nb_table * sizeof(tcet->table[0]));
> >>+        free(tcet->migtable);
> >>+        tcet->migtable = NULL;
> >>+    }
> >
> >Likewise, what if your incoming migration is of a guest which has
> >completely removed the default window?  Don't you need to free the
> >existing default table?
> >
> >>      return 0;
> >>  }
> >>
> >>  static const VMStateDescription vmstate_spapr_tce_table = {
> >>      .name = "spapr_iommu",
> >>-    .version_id = 2,
> >>+    .version_id = 3,
> >>      .minimum_version_id = 2,
> >>+    .pre_save = spapr_tce_table_pre_save,
> >>      .post_load = spapr_tce_table_post_load,
> >>      .fields      = (VMStateField []) {
> >>          /* Sanity check */
> >>          VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
> >>-        VMSTATE_UINT32_EQUAL(nb_table, sPAPRTCETable),
> >>
> >>          /* IOMMU state */
> >>+        VMSTATE_BOOL_V(enabled, sPAPRTCETable, 3),
> >>+        VMSTATE_UINT64_V(bus_offset, sPAPRTCETable, 3),
> >>+        VMSTATE_UINT32_V(page_shift, sPAPRTCETable, 3),
> >>+        VMSTATE_UINT32(nb_table, sPAPRTCETable),
> >>          VMSTATE_BOOL(bypass, sPAPRTCETable),
> >>-        VMSTATE_VARRAY_UINT32(table, sPAPRTCETable, nb_table, 0, vmstate_info_uint64, uint64_t),
> >>+        VMSTATE_VARRAY_UINT32_ALLOC(migtable, sPAPRTCETable, nb_table, 0,
> >>+                                    vmstate_info_uint64, uint64_t),
> >>
> >>          VMSTATE_END_OF_LIST()
> >>      },
> >>diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> >>index 4c6e687..1bc0710 100644
> >>--- a/hw/ppc/spapr_pci.c
> >>+++ b/hw/ppc/spapr_pci.c
> >>@@ -803,10 +803,10 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
> >>      return buf;
> >>  }
> >>
> >>-static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> >>-                                       uint32_t liobn, uint32_t page_shift,
> >>-                                       uint64_t window_addr,
> >>-                                       uint64_t window_size)
> >>+int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> >>+                                uint32_t liobn, uint32_t page_shift,
> >>+                                uint64_t window_addr,
> >>+                                uint64_t window_size)
> >>  {
> >>      sPAPRTCETable *tcet;
> >>      uint32_t nb_table = window_size >> page_shift;
> >>@@ -820,12 +820,16 @@ static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> >>          return -1;
> >>      }
> >>
> >>+    if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
> >>+        return -1;
> >>+    }
> >>+
> >>      spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table, false);
> >>
> >>      return 0;
> >>  }
> >>
> >>-static int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
> >>+int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
> >>  {
> >>      sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
> >>
> >>@@ -1418,14 +1422,21 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >>      }
> >>
> >>      /* DMA setup */
> >>-    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
> >>-    if (!tcet) {
> >>-        error_report("No default TCE table for %s", sphb->dtbusname);
> >>-        return;
> >>-    }
> >>+    sphb->windows_supported = SPAPR_PCI_DMA_MAX_WINDOWS;
> >>+    sphb->page_size_mask = (1ULL << 12) | (1ULL << 16) | (1ULL << 24);
> >>+    sphb->dma64_window_size = pow2ceil(ram_size);
> >
> >Why do you need this value?  Isn't the size of the dma64 window
> >supplied when you create it with RTAS?  It makes more sense to me to
> >validate the value at that point rather than here where you have to
> >use a global.
> >
> >Plus.. if your machine allows hotplug memory you probably need
> >maxram_size, rather than ram_size here.
> >
> >>
> >>-    memory_region_add_subregion(&sphb->iommu_root, 0,
> >>-                                spapr_tce_get_iommu(tcet));
> >>+    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> >>+        tcet = spapr_tce_new_table(DEVICE(sphb),
> >>+                                   SPAPR_PCI_LIOBN(sphb->index, i));
> >>+        if (!tcet) {
> >>+            error_setg(errp, "Creating window#%d failed for %s",
> >>+                       i, sphb->dtbusname);
> >>+            return;
> >>+        }
> >>+        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
> >>+                                            spapr_tce_get_iommu(tcet), 0);
> >>+    }
> >>
> >>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
> >>  }
> >>@@ -1443,7 +1454,11 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
> >>
> >>  void spapr_phb_dma_reset(sPAPRPHBState *sphb)
> >>  {
> >>-    spapr_phb_dma_window_disable(sphb, sphb->dma_liobn);
> >>+    int i;
> >>+
> >>+    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> >>+        spapr_phb_dma_window_disable(sphb, SPAPR_PCI_LIOBN(sphb->index, i));
> >>+    }
> >>
> >>      /* Register default 32bit DMA window */
> >>      spapr_phb_dma_window_enable(sphb, sphb->dma_liobn,
> >>@@ -1481,6 +1496,9 @@ static Property spapr_phb_properties[] = {
> >>      /* Default DMA window is 0..1GB */
> >>      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
> >>      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
> >>+    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
> >>+                       0x800000000000000ULL),
> >>+    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
> >>      DEFINE_PROP_END_OF_LIST(),
> >>  };
> >>
> >>@@ -1734,6 +1752,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
> >>      uint32_t interrupt_map_mask[] = {
> >>          cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
> >>      uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
> >>+    uint32_t ddw_applicable[] = {
> >>+        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
> >>+        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
> >>+        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
> >>+    };
> >>+    uint32_t ddw_extensions[] = {
> >>+        cpu_to_be32(1),
> >>+        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
> >>+    };
> >>      sPAPRTCETable *tcet;
> >>      PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
> >>      sPAPRFDT s_fdt;
> >>@@ -1758,6 +1785,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
> >>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
> >>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
> >>
> >>+    /* Dynamic DMA window */
> >>+    if (phb->ddw_enabled) {
> >>+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
> >>+                         sizeof(ddw_applicable)));
> >>+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
> >>+                         &ddw_extensions, sizeof(ddw_extensions)));
> >>+    }
> >>+
> >>      /* Build the interrupt-map, this must matches what is done
> >>       * in pci_spapr_map_irq
> >>       */
> >>diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
> >>new file mode 100644
> >>index 0000000..b8ea910
> >>--- /dev/null
> >>+++ b/hw/ppc/spapr_rtas_ddw.c
> >>@@ -0,0 +1,306 @@
> >>+/*
> >>+ * QEMU sPAPR Dynamic DMA windows support
> >>+ *
> >>+ * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
> >>+ *
> >>+ *  This program is free software; you can redistribute it and/or modify
> >>+ *  it under the terms of the GNU General Public License as published by
> >>+ *  the Free Software Foundation; either version 2 of the License,
> >>+ *  or (at your option) any later version.
> >>+ *
> >>+ *  This program is distributed in the hope that it will be useful,
> >>+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> >>+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >>+ *  GNU General Public License for more details.
> >>+ *
> >>+ *  You should have received a copy of the GNU General Public License
> >>+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
> >>+ */
> >>+
> >>+#include "qemu/osdep.h"
> >>+#include "qemu/error-report.h"
> >>+#include "hw/ppc/spapr.h"
> >>+#include "hw/pci-host/spapr.h"
> >>+#include "trace.h"
> >>+
> >>+static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
> >>+{
> >>+    sPAPRTCETable *tcet;
> >>+
> >>+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> >>+    if (tcet && tcet->enabled) {
> >>+        ++*(unsigned *)opaque;
> >>+    }
> >>+    return 0;
> >>+}
> >>+
> >>+static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
> >>+{
> >>+    unsigned ret = 0;
> >>+
> >>+    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
> >>+
> >>+    return ret;
> >>+}
> >>+
> >>+static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
> >>+{
> >>+    sPAPRTCETable *tcet;
> >>+
> >>+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> >>+    if (tcet && !tcet->enabled) {
> >>+        *(uint32_t *)opaque = tcet->liobn;
> >>+        return 1;
> >>+    }
> >>+    return 0;
> >>+}
> >>+
> >>+static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
> >>+{
> >>+    uint32_t liobn = 0;
> >>+
> >>+    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
> >>+
> >>+    return liobn;
> >>+}
> >>+
> >>+static uint32_t spapr_query_mask(struct ppc_one_seg_page_size *sps,
> >>+                                 uint64_t page_mask)
> >>+{
> >>+    int i, j;
> >>+    uint32_t mask = 0;
> >>+    const struct { int shift; uint32_t mask; } masks[] = {
> >>+        { 12, RTAS_DDW_PGSIZE_4K },
> >>+        { 16, RTAS_DDW_PGSIZE_64K },
> >>+        { 24, RTAS_DDW_PGSIZE_16M },
> >>+        { 25, RTAS_DDW_PGSIZE_32M },
> >>+        { 26, RTAS_DDW_PGSIZE_64M },
> >>+        { 27, RTAS_DDW_PGSIZE_128M },
> >>+        { 28, RTAS_DDW_PGSIZE_256M },
> >>+        { 34, RTAS_DDW_PGSIZE_16G },
> >>+    };
> >>+
> >>+    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
> >>+        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
> >>+            if ((sps[i].page_shift == masks[j].shift) &&
> >>+                    (page_mask & (1ULL << masks[j].shift))) {
> >>+                mask |= masks[j].mask;
> >>+            }
> >>+        }
> >>+    }
> >
> >Hmm... checking against the list of page sizes supported by the vcpu
> >seems conceptually wrong, although it's probably correct in practice.
> >Is there a way of checking directly against the pagesizes supported by
> >the host IOMMU?
> 
> 
> VFIO_IOMMU_SPAPR_TCE_GET_INFO returns the mask but since
> vfio_container_ioctl() is gone, there is no direct way of knowing it here,
> it is hidden now in hw/vfio/common.c.
> 
> Anyway the host IOMMU always supports 4K|64K|16M. QEMU may or may not use
> huge pages for the guest RAM, this defines whether H_PUT_TCE for 16M page
> succeeds or fails.

Ah, so you need to check against both the host IOMMU supported
pagesizes and the host pagesize backing RAM.  So.. the full set of
pagesizes in the VCPU isn't relevant, just the minimum page size used
to back RAM.

So I think you'll need something inside VFIO that acts as a variant of
kvm_fixup_page_sizes() checking the host supported IOMMU page sizes
against the RAM pagesize.  Then you'll need some interface to check
the guest IOMMU page sizes against that list.
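
A sketch of the filtering step, assuming the rule is simply "an IOMMU
page may not be larger than the page size backing guest RAM".
toy_usable_iommu_pgsizes() is a hypothetical name and the rule itself
is an assumption drawn from the discussion above, not from existing
code.

```c
#include <stdint.h>

/*
 * iommu_pgsizes: bit N set when the host IOMMU supports (1 << N) pages.
 * ram_page_size: page size backing guest RAM, a power of two.
 * Returns the IOMMU page sizes that are no larger than the RAM page size.
 */
static uint64_t toy_usable_iommu_pgsizes(uint64_t iommu_pgsizes,
                                         uint64_t ram_page_size)
{
    /* All bits up to and including the RAM page size bit. */
    uint64_t up_to_ram = ram_page_size | (ram_page_size - 1);

    return iommu_pgsizes & up_to_ram;
}
```

With a host IOMMU offering 4K|64K|16M, 64K-backed RAM leaves 4K|64K,
while 16M huge pages leave all three sizes available.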


* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 11/16] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2016-03-15  5:42       ` David Gibson
@ 2016-03-17  5:04         ` Alexey Kardashevskiy
  2016-03-17  6:10           ` David Gibson
  0 siblings, 1 reply; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-17  5:04 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On 03/15/2016 04:42 PM, David Gibson wrote:
> On Tue, Mar 15, 2016 at 01:53:48PM +1100, Alexey Kardashevskiy wrote:
>> On 03/03/2016 05:30 PM, David Gibson wrote:
>>> On Tue, Mar 01, 2016 at 08:10:36PM +1100, Alexey Kardashevskiy wrote:
>>>> This makes use of the new "memory registering" feature. The idea is
>>>> to provide the userspace ability to notify the host kernel about pages
>>>> which are going to be used for DMA. Having this information, the host
>>>> kernel can pin them all once per user process, do locked pages
>>>> accounting (once) and not spend time on doing that in real time with
>>>> possible failures which cannot be handled nicely in some cases.
>>>>
>>>> This adds a prereg memory listener which listens on address_space_memory
>>>> and notifies a VFIO container about memory which needs to be
>>>> pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
>>>>
>>>> As there is no per-IOMMU-type release() callback anymore, this stores
>>>> the IOMMU type in the container so vfio_listener_release() can decide
>>>> if it needs to unregister @prereg_listener.
>>>>
>>>> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
>>>> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
>>>> not call it when v2 is detected and enabled.
>>>>
>>>> This does not change the guest visible interface.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>>   hw/vfio/Makefile.objs         |   1 +
>>>>   hw/vfio/common.c              |  39 +++++++++---
>>>>   hw/vfio/prereg.c              | 138 ++++++++++++++++++++++++++++++++++++++++++
>>>>   include/hw/vfio/vfio-common.h |   4 ++
>>>>   trace-events                  |   2 +
>>>>   5 files changed, 175 insertions(+), 9 deletions(-)
>>>>   create mode 100644 hw/vfio/prereg.c
>>>>
>>>> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
>>>> index ceddbb8..5800e0e 100644
>>>> --- a/hw/vfio/Makefile.objs
>>>> +++ b/hw/vfio/Makefile.objs
>>>> @@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
>>>>   obj-$(CONFIG_SOFTMMU) += platform.o
>>>>   obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
>>>>   obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
>>>> +obj-$(CONFIG_SOFTMMU) += prereg.o
>>>>   endif
>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>> index 3aaa6b5..f2a03e0 100644
>>>> --- a/hw/vfio/common.c
>>>> +++ b/hw/vfio/common.c
>>>> @@ -531,6 +531,9 @@ static const MemoryListener vfio_iommu_listener = {
>>>>   static void vfio_listener_release(VFIOContainer *container)
>>>>   {
>>>>       memory_listener_unregister(&container->iommu_listener.listener);
>>>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>>> +        memory_listener_unregister(&container->prereg_listener.listener);
>>>> +    }
>>>>   }
>>>>
>>>>   int vfio_mmap_region(Object *obj, VFIORegion *region,
>>>> @@ -722,8 +725,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>>>               goto free_container_exit;
>>>>           }
>>>>
>>>> -        ret = ioctl(fd, VFIO_SET_IOMMU,
>>>> -                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
>>>> +        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
>>>> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>>>>           if (ret) {
>>>>               error_report("vfio: failed to set iommu for container: %m");
>>>>               ret = -errno;
>>>> @@ -748,8 +751,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>>>           if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
>>>>               container->iova_pgsizes = info.iova_pgsizes;
>>>>           }
>>>> -    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
>>>> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
>>>> +               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
>>>>           struct vfio_iommu_spapr_tce_info info;
>>>> +        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
>>>>
>>>>           ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>>>>           if (ret) {
>>>> @@ -757,7 +762,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>>>               ret = -errno;
>>>>               goto free_container_exit;
>>>>           }
>>>> -        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
>>>> +        container->iommu_type =
>>>> +            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
>>>> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>>>
>>> It'd be nice to consolidate the setting of container->iommu_type and
>>> then the SET_IOMMU ioctl() rather than having more or less duplicated
>>> logic for Type1 and SPAPR, but it's not a big deal.
>>
>>
>> Maybe, but I cannot think of a nice way of doing this.
>>
>>
>>>
>>>>           if (ret) {
>>>>               error_report("vfio: failed to set iommu for container: %m");
>>>>               ret = -errno;
>>>> @@ -769,11 +776,25 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>>>            * when container fd is closed so we do not call it explicitly
>>>>            * in this file.
>>>>            */
>>>> -        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
>>>> -        if (ret) {
>>>> -            error_report("vfio: failed to enable container: %m");
>>>> -            ret = -errno;
>>>> -            goto free_container_exit;
>>>> +        if (!v2) {
>>>> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
>>>> +            if (ret) {
>>>> +                error_report("vfio: failed to enable container: %m");
>>>> +                ret = -errno;
>>>> +                goto free_container_exit;
>>>> +            }
>>>> +        } else {
>>>> +            container->prereg_listener.container = container;
>>>> +            container->prereg_listener.listener = vfio_prereg_listener;
>>>> +
>>>> +            memory_listener_register(&container->prereg_listener.listener,
>>>> +                                     &address_space_memory);
>>>
>>> This assumes that the target address space of the (guest) IOMMU is
>>> address_space_memory.  Which is fine - vfio already assumes that - but
>>> it reminds me that it'd be nice to have an explicit check for that (I
>>> guess it would have to go in vfio_iommu_map_notify()).  So that if
>>> someone constructs a machine where that's not the case, it'll at least
>>> be obvious why VFIO isn't working.
>>
>> Ok, I'll add a small patch for this in the next respin.
>
> Ok.
>
>>>> +            if (container->error) {
>>>> +                error_report("vfio: RAM memory listener initialization failed for container");
>>>> +                memory_listener_unregister(
>>>> +                    &container->prereg_listener.listener);
>>>> +                goto free_container_exit;
>>>> +            }
>>>>           }
>>>
>>> Looks like you don't have an error path which will handle the case
>>> where the prereg listener is registered, but registering the normal
>>> PCI AS listener fails - I believe you will fail to unregister the
>>> prereg listener in that case.
>>
>>
>> In this case, the control goes to listener_release_exit: which calls
>> vfio_listener_release() which unregisters both listeners (it is a few chunks
>> above).
>
> Ah.. yes.  In which case this could also jump to listener_release_exit
> and avoid the explicit unreg(), yes?


Sorry, I do not follow you here. It does jump to listener_release_exit already.


>
>>>>           /*
>>>> diff --git a/hw/vfio/prereg.c b/hw/vfio/prereg.c
>>>> new file mode 100644
>>>> index 0000000..66cd696
>>>> --- /dev/null
>>>> +++ b/hw/vfio/prereg.c
>>>> @@ -0,0 +1,138 @@
>>>> +/*
>>>> + * DMA memory preregistration
>>>> + *
>>>> + * Authors:
>>>> + *  Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> + *
>>>> + * This work is licensed under the terms of the GNU GPL, version 2.  See
>>>> + * the COPYING file in the top-level directory.
>>>> + */
>>>> +
>>>> +#include "qemu/osdep.h"
>>>> +#include <sys/ioctl.h>
>>>> +#include <linux/vfio.h>
>>>> +
>>>> +#include "hw/vfio/vfio-common.h"
>>>> +#include "hw/vfio/vfio.h"
>>>> +#include "qemu/error-report.h"
>>>> +#include "trace.h"
>>>> +
>>>> +static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
>>>> +{
>>>> +    return (!memory_region_is_ram(section->mr) &&
>>>> +            !memory_region_is_iommu(section->mr)) ||
>>>> +            memory_region_is_skip_dump(section->mr);
>>>> +}
>>>> +
>>>> +static void vfio_prereg_listener_region_add(MemoryListener *listener,
>>>> +                                            MemoryRegionSection *section)
>>>> +{
>>>> +    VFIOMemoryListener *vlistener = container_of(listener, VFIOMemoryListener,
>>>> +                                                 listener);
>>>> +    VFIOContainer *container = vlistener->container;
>>>> +    hwaddr iova;
>>>> +    Int128 llend;
>>>> +    int ret;
>>>> +    hwaddr page_mask = qemu_real_host_page_mask;
>>>> +    struct vfio_iommu_spapr_register_memory reg = {
>>>> +        .argsz = sizeof(reg),
>>>> +        .flags = 0,
>>>> +    };
>>>> +
>>>> +    if (vfio_prereg_listener_skipped_section(section)) {
>>>> +        trace_vfio_listener_region_add_skip(
>>>> +                section->offset_within_address_space,
>>>> +                section->offset_within_address_space +
>>>> +                int128_get64(int128_sub(section->size, int128_one())));
>>>> +        return;
>>>> +    }
>>>
>>> You should probably explicitly check for IOMMU regions and abort if
>>> you find one.  An IOMMU AS appearing within address_space_memory would
>>> be bad news.
>>
>>
>> Oh, vfio_prereg_listener_skipped_section() allows memory_region_is_iommu(),
>> I'll remove it.
>
> Well, that's part.
>
> But IOMMU regions appearing here shouldn't just be ignored - they
> should be treated as a fatal error.


Is it always an error when they appear in &address_space_memory? If so,
we should do this check somewhere in memory_region_transaction_commit()
(memory_region_add_subregion() cannot check this, as no AS is available
there).




-- 
Alexey

* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 11/16] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2016-03-17  5:04         ` Alexey Kardashevskiy
@ 2016-03-17  6:10           ` David Gibson
  2016-03-17  9:23             ` Alexey Kardashevskiy
  0 siblings, 1 reply; 50+ messages in thread
From: David Gibson @ 2016-03-17  6:10 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On Thu, Mar 17, 2016 at 04:04:29PM +1100, Alexey Kardashevskiy wrote:
> On 03/15/2016 04:42 PM, David Gibson wrote:
> >On Tue, Mar 15, 2016 at 01:53:48PM +1100, Alexey Kardashevskiy wrote:
> >>On 03/03/2016 05:30 PM, David Gibson wrote:
> >>>On Tue, Mar 01, 2016 at 08:10:36PM +1100, Alexey Kardashevskiy wrote:
> >>>>This makes use of the new "memory registering" feature. The idea is
> >>>>to provide the userspace ability to notify the host kernel about pages
> >>>>which are going to be used for DMA. Having this information, the host
> >>>>kernel can pin them all once per user process, do locked pages
> >>>>accounting (once) and not spend time on doing that in real time with
> >>>>possible failures which cannot be handled nicely in some cases.
> >>>>
> >>>>This adds a prereg memory listener which listens on address_space_memory
> >>>>and notifies a VFIO container about memory which needs to be
> >>>>pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
> >>>>
> >>>>As there is no per-IOMMU-type release() callback anymore, this stores
> >>>>the IOMMU type in the container so vfio_listener_release() can decide
> >>>>if it needs to unregister @prereg_listener.
> >>>>
> >>>>The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> >>>>are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> >>>>not call it when v2 is detected and enabled.
> >>>>
> >>>>This does not change the guest visible interface.
> >>>>
> >>>>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>---
> >>>>  hw/vfio/Makefile.objs         |   1 +
> >>>>  hw/vfio/common.c              |  39 +++++++++---
> >>>>  hw/vfio/prereg.c              | 138 ++++++++++++++++++++++++++++++++++++++++++
> >>>>  include/hw/vfio/vfio-common.h |   4 ++
> >>>>  trace-events                  |   2 +
> >>>>  5 files changed, 175 insertions(+), 9 deletions(-)
> >>>>  create mode 100644 hw/vfio/prereg.c
> >>>>
> >>>>diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> >>>>index ceddbb8..5800e0e 100644
> >>>>--- a/hw/vfio/Makefile.objs
> >>>>+++ b/hw/vfio/Makefile.objs
> >>>>@@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
> >>>>  obj-$(CONFIG_SOFTMMU) += platform.o
> >>>>  obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
> >>>>  obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
> >>>>+obj-$(CONFIG_SOFTMMU) += prereg.o
> >>>>  endif
> >>>>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>>>index 3aaa6b5..f2a03e0 100644
> >>>>--- a/hw/vfio/common.c
> >>>>+++ b/hw/vfio/common.c
> >>>>@@ -531,6 +531,9 @@ static const MemoryListener vfio_iommu_listener = {
> >>>>  static void vfio_listener_release(VFIOContainer *container)
> >>>>  {
> >>>>      memory_listener_unregister(&container->iommu_listener.listener);
> >>>>+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> >>>>+        memory_listener_unregister(&container->prereg_listener.listener);
> >>>>+    }
> >>>>  }
> >>>>
> >>>>  int vfio_mmap_region(Object *obj, VFIORegion *region,
> >>>>@@ -722,8 +725,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>>>              goto free_container_exit;
> >>>>          }
> >>>>
> >>>>-        ret = ioctl(fd, VFIO_SET_IOMMU,
> >>>>-                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
> >>>>+        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
> >>>>+        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
> >>>>          if (ret) {
> >>>>              error_report("vfio: failed to set iommu for container: %m");
> >>>>              ret = -errno;
> >>>>@@ -748,8 +751,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>>>          if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
> >>>>              container->iova_pgsizes = info.iova_pgsizes;
> >>>>          }
> >>>>-    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> >>>>+    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
> >>>>+               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
> >>>>          struct vfio_iommu_spapr_tce_info info;
> >>>>+        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
> >>>>
> >>>>          ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
> >>>>          if (ret) {
> >>>>@@ -757,7 +762,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>>>              ret = -errno;
> >>>>              goto free_container_exit;
> >>>>          }
> >>>>-        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
> >>>>+        container->iommu_type =
> >>>>+            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
> >>>>+        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
> >>>
> >>>It'd be nice to consolidate the setting of container->iommu_type and
> >>>then the SET_IOMMU ioctl() rather than having more or less duplicated
> >>>logic for Type1 and SPAPR, but it's not a big deal.
> >>
> >>
> >>Maybe, but I cannot think of a nice way of doing this.
> >>
> >>
> >>>
> >>>>          if (ret) {
> >>>>              error_report("vfio: failed to set iommu for container: %m");
> >>>>              ret = -errno;
> >>>>@@ -769,11 +776,25 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>>>           * when container fd is closed so we do not call it explicitly
> >>>>           * in this file.
> >>>>           */
> >>>>-        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> >>>>-        if (ret) {
> >>>>-            error_report("vfio: failed to enable container: %m");
> >>>>-            ret = -errno;
> >>>>-            goto free_container_exit;
> >>>>+        if (!v2) {
> >>>>+            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> >>>>+            if (ret) {
> >>>>+                error_report("vfio: failed to enable container: %m");
> >>>>+                ret = -errno;
> >>>>+                goto free_container_exit;
> >>>>+            }
> >>>>+        } else {
> >>>>+            container->prereg_listener.container = container;
> >>>>+            container->prereg_listener.listener = vfio_prereg_listener;
> >>>>+
> >>>>+            memory_listener_register(&container->prereg_listener.listener,
> >>>>+                                     &address_space_memory);
> >>>
> >>>This assumes that the target address space of the (guest) IOMMU is
> >>>address_space_memory.  Which is fine - vfio already assumes that - but
> >>>it reminds me that it'd be nice to have an explicit check for that (I
> >>>guess it would have to go in vfio_iommu_map_notify()).  So that if
> >>>someone constructs a machine where that's not the case, it'll at least
> >>>be obvious why VFIO isn't working.
> >>
> >>Ok, I'll add a small patch for this in the next respin.
> >
> >Ok.
> >
> >>>>+            if (container->error) {
> >>>>+                error_report("vfio: RAM memory listener initialization failed for container");
> >>>>+                memory_listener_unregister(
> >>>>+                    &container->prereg_listener.listener);
> >>>>+                goto free_container_exit;
> >>>>+            }
> >>>>          }
> >>>
> >>>Looks like you don't have an error path which will handle the case
> >>>where the prereg listener is registered, but registering the normal
> >>>PCI AS listener fails - I believe you will fail to unregister the
> >>>prereg listener in that case.
> >>
> >>
> >>In this case, the control goes to listener_release_exit: which calls
> >>vfio_listener_release() which unregisters both listeners (it is a few chunks
> >>above).
> >
> >Ah.. yes.  In which case this could also jump to listener_release_exit
> >and avoid the explicit unreg(), yes?
> 
> 
> Sorry, I do not follow you here. It does jump to
> listener_release_exit already.

I mean you can use the listener_release_exit label on failure of the
prereg listener as well as failure of the regular listener.
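
To make the suggestion concrete, here is a toy model in C of a single
cleanup label shared by both failure points. It is only a sketch: the
struct, the flags and the toy_* functions are hypothetical stand-ins,
not the QEMU code, and the release helper here tracks what was actually
registered, which the vfio_listener_release() in the quoted patch does
not do for the regular listener.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the container state relevant to cleanup. */
struct toy_container {
    bool prereg_registered;
    bool iommu_registered;
};

/* Unregister only what was registered, so any failure point may call it. */
void toy_release(struct toy_container *c)
{
    if (c->iommu_registered) {
        c->iommu_registered = false;
    }
    if (c->prereg_registered) {
        c->prereg_registered = false;
    }
}

/* The two boolean arguments simulate a failure of either listener. */
int toy_connect(struct toy_container *c, bool prereg_fails, bool iommu_fails)
{
    int ret;

    c->prereg_registered = true;        /* prereg listener goes first */
    if (prereg_fails) {
        ret = -1;
        goto listener_release_exit;     /* no explicit unregister here */
    }

    c->iommu_registered = true;         /* then the regular listener */
    if (iommu_fails) {
        ret = -2;
        goto listener_release_exit;
    }

    return 0;

listener_release_exit:
    toy_release(c);
    return ret;
}
```

Whether the real release helper is safe to call before the regular
listener has been registered is not something this sketch settles.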

> >>>>          /*
> >>>>diff --git a/hw/vfio/prereg.c b/hw/vfio/prereg.c
> >>>>new file mode 100644
> >>>>index 0000000..66cd696
> >>>>--- /dev/null
> >>>>+++ b/hw/vfio/prereg.c
> >>>>@@ -0,0 +1,138 @@
> >>>>+/*
> >>>>+ * DMA memory preregistration
> >>>>+ *
> >>>>+ * Authors:
> >>>>+ *  Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>+ *
> >>>>+ * This work is licensed under the terms of the GNU GPL, version 2.  See
> >>>>+ * the COPYING file in the top-level directory.
> >>>>+ */
> >>>>+
> >>>>+#include "qemu/osdep.h"
> >>>>+#include <sys/ioctl.h>
> >>>>+#include <linux/vfio.h>
> >>>>+
> >>>>+#include "hw/vfio/vfio-common.h"
> >>>>+#include "hw/vfio/vfio.h"
> >>>>+#include "qemu/error-report.h"
> >>>>+#include "trace.h"
> >>>>+
> >>>>+static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
> >>>>+{
> >>>>+    return (!memory_region_is_ram(section->mr) &&
> >>>>+            !memory_region_is_iommu(section->mr)) ||
> >>>>+            memory_region_is_skip_dump(section->mr);
> >>>>+}
> >>>>+
> >>>>+static void vfio_prereg_listener_region_add(MemoryListener *listener,
> >>>>+                                            MemoryRegionSection *section)
> >>>>+{
> >>>>+    VFIOMemoryListener *vlistener = container_of(listener, VFIOMemoryListener,
> >>>>+                                                 listener);
> >>>>+    VFIOContainer *container = vlistener->container;
> >>>>+    hwaddr iova;
> >>>>+    Int128 llend;
> >>>>+    int ret;
> >>>>+    hwaddr page_mask = qemu_real_host_page_mask;
> >>>>+    struct vfio_iommu_spapr_register_memory reg = {
> >>>>+        .argsz = sizeof(reg),
> >>>>+        .flags = 0,
> >>>>+    };
> >>>>+
> >>>>+    if (vfio_prereg_listener_skipped_section(section)) {
> >>>>+        trace_vfio_listener_region_add_skip(
> >>>>+                section->offset_within_address_space,
> >>>>+                section->offset_within_address_space +
> >>>>+                int128_get64(int128_sub(section->size, int128_one())));
> >>>>+        return;
> >>>>+    }
> >>>
> >>>You should probably explicitly check for IOMMU regions and abort if
> >>>you find one.  An IOMMU AS appearing within address_space_memory would
> >>>be bad news.
> >>
> >>
> >>Oh, vfio_prereg_listener_skipped_section() allows memory_region_is_iommu(),
> >>I'll remove it.
> >
> >Well, that's part.
> >
> >But IOMMU regions appearing here shouldn't just be ignored - they
> >should be treated as a fatal error.
> 
> 
> Is this always an error when they appear in &address_space_memory? Because
> if it is so, we should do this check somewhere in
> memory_region_transaction_commit() (memory_region_add_subregion() cannot
> check this, there is no AS).

Hmm.. not necessarily, although it would certainly be strange.  I can
imagine some sort of off-board memory device which incorporates an
"IO"MMU which will remap requests both from the cpu and from DMA
devices.

But, that wouldn't work with VFIO, since we no longer know where the
host memory is which is backing address_space_memory to preregister
it.  Well.. it might be possible by looking at the *target* AS of an
iommu registered in address_space_memory, setting up a listener on
that, and keeping on going until you reach a real memory region.

The case is so unlikely that it's not worth actually implementing the
code for.  But I think it is worth a sanity check.
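
A sanity check along these lines could look roughly like the sketch
below. It is a self-contained toy in C, not QEMU code: struct toy_region
and prereg_classify() are hypothetical, and the three flags merely stand
in for memory_region_is_iommu(), memory_region_is_ram() and
memory_region_is_skip_dump().

```c
#include <stdbool.h>

/* What the prereg listener should do with a section (hypothetical). */
enum prereg_action {
    PREREG_SKIP,      /* not RAM, or a "skip dump" region: ignore it */
    PREREG_REGISTER,  /* ordinary RAM: preregister it with the kernel */
    PREREG_FATAL      /* IOMMU region in address_space_memory: give up */
};

/* Hypothetical stand-in for the MemoryRegion properties that matter. */
struct toy_region {
    bool is_ram;
    bool is_iommu;
    bool is_skip_dump;
};

enum prereg_action prereg_classify(const struct toy_region *mr)
{
    /* Check for an IOMMU first so it is never silently skipped. */
    if (mr->is_iommu) {
        return PREREG_FATAL;
    }
    if (!mr->is_ram || mr->is_skip_dump) {
        return PREREG_SKIP;
    }
    return PREREG_REGISTER;
}
```

The caller would report an error and abort on PREREG_FATAL, which makes
the unsupported configuration obvious instead of leaving VFIO quietly
broken.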

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 11/16] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2016-03-17  6:10           ` David Gibson
@ 2016-03-17  9:23             ` Alexey Kardashevskiy
  2016-03-21  4:53               ` David Gibson
  0 siblings, 1 reply; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-17  9:23 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On 03/17/2016 05:10 PM, David Gibson wrote:
> On Thu, Mar 17, 2016 at 04:04:29PM +1100, Alexey Kardashevskiy wrote:
>> On 03/15/2016 04:42 PM, David Gibson wrote:
>>> On Tue, Mar 15, 2016 at 01:53:48PM +1100, Alexey Kardashevskiy wrote:
>>>> On 03/03/2016 05:30 PM, David Gibson wrote:
>>>>> On Tue, Mar 01, 2016 at 08:10:36PM +1100, Alexey Kardashevskiy wrote:
>>>>>> This makes use of the new "memory registering" feature. The idea is
>>>>>> to provide the userspace ability to notify the host kernel about pages
>>>>>> which are going to be used for DMA. Having this information, the host
>>>>>> kernel can pin them all once per user process, do locked pages
>>>>>> accounting (once) and not spend time on doing that in real time with
>>>>>> possible failures which cannot be handled nicely in some cases.
>>>>>>
>>>>>> This adds a prereg memory listener which listens on address_space_memory
>>>>>> and notifies a VFIO container about memory which needs to be
>>>>>> pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
>>>>>>
>>>>>> As there is no per-IOMMU-type release() callback anymore, this stores
>>>>>> the IOMMU type in the container so vfio_listener_release() can decide
>>>>>> if it needs to unregister @prereg_listener.
>>>>>>
>>>>>> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
>>>>>> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
>>>>>> not call it when v2 is detected and enabled.
>>>>>>
>>>>>> This does not change the guest visible interface.
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>> ---
>>>>>>   hw/vfio/Makefile.objs         |   1 +
>>>>>>   hw/vfio/common.c              |  39 +++++++++---
>>>>>>   hw/vfio/prereg.c              | 138 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>   include/hw/vfio/vfio-common.h |   4 ++
>>>>>>   trace-events                  |   2 +
>>>>>>   5 files changed, 175 insertions(+), 9 deletions(-)
>>>>>>   create mode 100644 hw/vfio/prereg.c
>>>>>>
>>>>>> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
>>>>>> index ceddbb8..5800e0e 100644
>>>>>> --- a/hw/vfio/Makefile.objs
>>>>>> +++ b/hw/vfio/Makefile.objs
>>>>>> @@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
>>>>>>   obj-$(CONFIG_SOFTMMU) += platform.o
>>>>>>   obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
>>>>>>   obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
>>>>>> +obj-$(CONFIG_SOFTMMU) += prereg.o
>>>>>>   endif
>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>>>> index 3aaa6b5..f2a03e0 100644
>>>>>> --- a/hw/vfio/common.c
>>>>>> +++ b/hw/vfio/common.c
>>>>>> @@ -531,6 +531,9 @@ static const MemoryListener vfio_iommu_listener = {
>>>>>>   static void vfio_listener_release(VFIOContainer *container)
>>>>>>   {
>>>>>>       memory_listener_unregister(&container->iommu_listener.listener);
>>>>>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>>>>> +        memory_listener_unregister(&container->prereg_listener.listener);
>>>>>> +    }
>>>>>>   }
>>>>>>
>>>>>>   int vfio_mmap_region(Object *obj, VFIORegion *region,
>>>>>> @@ -722,8 +725,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>>>>>               goto free_container_exit;
>>>>>>           }
>>>>>>
>>>>>> -        ret = ioctl(fd, VFIO_SET_IOMMU,
>>>>>> -                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
>>>>>> +        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
>>>>>> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>>>>>>           if (ret) {
>>>>>>               error_report("vfio: failed to set iommu for container: %m");
>>>>>>               ret = -errno;
>>>>>> @@ -748,8 +751,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>>>>>           if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
>>>>>>               container->iova_pgsizes = info.iova_pgsizes;
>>>>>>           }
>>>>>> -    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
>>>>>> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
>>>>>> +               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
>>>>>>           struct vfio_iommu_spapr_tce_info info;
>>>>>> +        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
>>>>>>
>>>>>>           ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>>>>>>           if (ret) {
>>>>>> @@ -757,7 +762,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>>>>>               ret = -errno;
>>>>>>               goto free_container_exit;
>>>>>>           }
>>>>>> -        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
>>>>>> +        container->iommu_type =
>>>>>> +            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
>>>>>> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>>>>>
>>>>> It'd be nice to consolidate the setting of container->iommu_type and
>>>>> then the SET_IOMMU ioctl() rather than having more or less duplicated
>>>>> logic for Type1 and SPAPR, but it's not a big deal.
>>>>
>>>>
>>>> Maybe, but I cannot think of a nice way of doing this.
>>>>
>>>>
>>>>>
>>>>>>           if (ret) {
>>>>>>               error_report("vfio: failed to set iommu for container: %m");
>>>>>>               ret = -errno;
>>>>>> @@ -769,11 +776,25 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>>>>>            * when container fd is closed so we do not call it explicitly
>>>>>>            * in this file.
>>>>>>            */
>>>>>> -        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
>>>>>> -        if (ret) {
>>>>>> -            error_report("vfio: failed to enable container: %m");
>>>>>> -            ret = -errno;
>>>>>> -            goto free_container_exit;
>>>>>> +        if (!v2) {
>>>>>> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
>>>>>> +            if (ret) {
>>>>>> +                error_report("vfio: failed to enable container: %m");
>>>>>> +                ret = -errno;
>>>>>> +                goto free_container_exit;
>>>>>> +            }
>>>>>> +        } else {
>>>>>> +            container->prereg_listener.container = container;
>>>>>> +            container->prereg_listener.listener = vfio_prereg_listener;
>>>>>> +
>>>>>> +            memory_listener_register(&container->prereg_listener.listener,
>>>>>> +                                     &address_space_memory);
>>>>>
>>>>> This assumes that the target address space of the (guest) IOMMU is
>>>>> address_space_memory.  Which is fine - vfio already assumes that - but
>>>>> it reminds me that it'd be nice to have an explicit check for that (I
>>>>> guess it would have to go in vfio_iommu_map_notify()).  So that if
>>>>> someone constructs a machine where that's not the case, it'll at least
>>>>> be obvious why VFIO isn't working.
>>>>
>>>> Ok, I'll add a small patch for this in the next respin.
>>>
>>> Ok.
>>>
>>>>>> +            if (container->error) {
>>>>>> +                error_report("vfio: RAM memory listener initialization failed for container");
>>>>>> +                memory_listener_unregister(
>>>>>> +                    &container->prereg_listener.listener);
>>>>>> +                goto free_container_exit;
>>>>>> +            }
>>>>>>           }
>>>>>
>>>>> Looks like you don't have an error path which will handle the case
>>>>> where the prereg listener is registered, but registering the normal
>>>>> PCI AS listener fails - I believe you will fail to unregister the
>>>>> prereg listener in that case.
>>>>
>>>>
>>>> In this case, the control goes to listener_release_exit: which calls
>>>> vfio_listener_release() which unregisters both listeners (it is a few chunks
>>>> above).
>>>
>>> Ah.. yes.  In which case this could also jump to listener_release_exit
>>> and avoid the explicit unreg(), yes?
>>
>>
>> Sorry, I do not follow you here. It does jump to
>> listener_release_exit already.
>
> I mean you can use the listener_release_exit label on failure of the
> prereg listener as well as failure of the regular listener.


When vfio_prereg_listener fails, vfio_memory_listener is not registered yet,
so it just looks cleaner to jump further to free_container_exit than to
listener_release_exit, doesn't it?



>>>>>>           /*
>>>>>> diff --git a/hw/vfio/prereg.c b/hw/vfio/prereg.c
>>>>>> new file mode 100644
>>>>>> index 0000000..66cd696
>>>>>> --- /dev/null
>>>>>> +++ b/hw/vfio/prereg.c
>>>>>> @@ -0,0 +1,138 @@
>>>>>> +/*
>>>>>> + * DMA memory preregistration
>>>>>> + *
>>>>>> + * Authors:
>>>>>> + *  Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>> + *
>>>>>> + * This work is licensed under the terms of the GNU GPL, version 2.  See
>>>>>> + * the COPYING file in the top-level directory.
>>>>>> + */
>>>>>> +
>>>>>> +#include "qemu/osdep.h"
>>>>>> +#include <sys/ioctl.h>
>>>>>> +#include <linux/vfio.h>
>>>>>> +
>>>>>> +#include "hw/vfio/vfio-common.h"
>>>>>> +#include "hw/vfio/vfio.h"
>>>>>> +#include "qemu/error-report.h"
>>>>>> +#include "trace.h"
>>>>>> +
>>>>>> +static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
>>>>>> +{
>>>>>> +    return (!memory_region_is_ram(section->mr) &&
>>>>>> +            !memory_region_is_iommu(section->mr)) ||
>>>>>> +            memory_region_is_skip_dump(section->mr);
>>>>>> +}
>>>>>> +
>>>>>> +static void vfio_prereg_listener_region_add(MemoryListener *listener,
>>>>>> +                                            MemoryRegionSection *section)
>>>>>> +{
>>>>>> +    VFIOMemoryListener *vlistener = container_of(listener, VFIOMemoryListener,
>>>>>> +                                                 listener);
>>>>>> +    VFIOContainer *container = vlistener->container;
>>>>>> +    hwaddr iova;
>>>>>> +    Int128 llend;
>>>>>> +    int ret;
>>>>>> +    hwaddr page_mask = qemu_real_host_page_mask;
>>>>>> +    struct vfio_iommu_spapr_register_memory reg = {
>>>>>> +        .argsz = sizeof(reg),
>>>>>> +        .flags = 0,
>>>>>> +    };
>>>>>> +
>>>>>> +    if (vfio_prereg_listener_skipped_section(section)) {
>>>>>> +        trace_vfio_listener_region_add_skip(
>>>>>> +                section->offset_within_address_space,
>>>>>> +                section->offset_within_address_space +
>>>>>> +                int128_get64(int128_sub(section->size, int128_one())));
>>>>>> +        return;
>>>>>> +    }
>>>>>
>>>>> You should probably explicitly check for IOMMU regions and abort if
>>>>> you find one.  An IOMMU AS appearing within address_space_memory would
>>>>> be bad news.
>>>>
>>>>
>>>> Oh, vfio_prereg_listener_skipped_section() allows memory_region_is_iommu(),
>>>> I'll remove it.
>>>
>>> Well, that's part.
>>>
>>> But IOMMU regions appearing here shouldn't just be ignored - they
>>> should be treated as a fatal error.
>>
>>
>> Is this always an error when they appear in &address_space_memory? Because
>> if it is so, we should do this check somewhere in
>> memory_region_transaction_commit() (memory_region_add_subregion() cannot
>> check this, there is no AS).
>
> Hmm.. not necessarily, although it would certainly be strange.  I can
> imagine some sort of off-board memory device which incorporates an
> "IO"MMU which will remap requests both from the cpu and from DMA
> devices.
>
> But, that wouldn't work with VFIO, since we no longer know where the
> host memory is which is backing address_space_memory to preregister
> it.  Well.. it might be possible by looking at the *target* AS of an
> iommu registered in address_space_memory, setting up a listener on
> that, and keeping on going until you reach a real memory region.
>
> The case is so unlikely that it's not worth actually implementing the
> code for.  But I think it is worth a sanity check.


Ok then. g_assert() or hw_error()?



-- 
Alexey


* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 11/16] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2016-03-17  9:23             ` Alexey Kardashevskiy
@ 2016-03-21  4:53               ` David Gibson
  2016-03-21  6:08                 ` Alexey Kardashevskiy
  0 siblings, 1 reply; 50+ messages in thread
From: David Gibson @ 2016-03-21  4:53 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel


On Thu, Mar 17, 2016 at 08:23:35PM +1100, Alexey Kardashevskiy wrote:
> On 03/17/2016 05:10 PM, David Gibson wrote:
> >On Thu, Mar 17, 2016 at 04:04:29PM +1100, Alexey Kardashevskiy wrote:
> >>On 03/15/2016 04:42 PM, David Gibson wrote:
> >>>On Tue, Mar 15, 2016 at 01:53:48PM +1100, Alexey Kardashevskiy wrote:
> >>>>On 03/03/2016 05:30 PM, David Gibson wrote:
> >>>>>On Tue, Mar 01, 2016 at 08:10:36PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>This makes use of the new "memory registering" feature. The idea is
> >>>>>>to provide the userspace ability to notify the host kernel about pages
> >>>>>>which are going to be used for DMA. Having this information, the host
> >>>>>>kernel can pin them all once per user process, do locked pages
> >>>>>>accounting (once) and not spend time on doing that in real time with
> >>>>>>possible failures which cannot be handled nicely in some cases.
> >>>>>>
> >>>>>>This adds a prereg memory listener which listens on address_space_memory
> >>>>>>and notifies a VFIO container about memory which needs to be
> >>>>>>pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
> >>>>>>
> >>>>>>As there is no per-IOMMU-type release() callback anymore, this stores
> >>>>>>the IOMMU type in the container so vfio_listener_release() can decide
> >>>>>>if it needs to unregister @prereg_listener.
> >>>>>>
> >>>>>>The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> >>>>>>are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> >>>>>>not call it when v2 is detected and enabled.
> >>>>>>
> >>>>>>This does not change the guest visible interface.
> >>>>>>
> >>>>>>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>>>---
> >>>>>>  hw/vfio/Makefile.objs         |   1 +
> >>>>>>  hw/vfio/common.c              |  39 +++++++++---
> >>>>>>  hw/vfio/prereg.c              | 138 ++++++++++++++++++++++++++++++++++++++++++
> >>>>>>  include/hw/vfio/vfio-common.h |   4 ++
> >>>>>>  trace-events                  |   2 +
> >>>>>>  5 files changed, 175 insertions(+), 9 deletions(-)
> >>>>>>  create mode 100644 hw/vfio/prereg.c
> >>>>>>
> >>>>>>diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> >>>>>>index ceddbb8..5800e0e 100644
> >>>>>>--- a/hw/vfio/Makefile.objs
> >>>>>>+++ b/hw/vfio/Makefile.objs
> >>>>>>@@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
> >>>>>>  obj-$(CONFIG_SOFTMMU) += platform.o
> >>>>>>  obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
> >>>>>>  obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
> >>>>>>+obj-$(CONFIG_SOFTMMU) += prereg.o
> >>>>>>  endif
> >>>>>>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>>>>>index 3aaa6b5..f2a03e0 100644
> >>>>>>--- a/hw/vfio/common.c
> >>>>>>+++ b/hw/vfio/common.c
> >>>>>>@@ -531,6 +531,9 @@ static const MemoryListener vfio_iommu_listener = {
> >>>>>>  static void vfio_listener_release(VFIOContainer *container)
> >>>>>>  {
> >>>>>>      memory_listener_unregister(&container->iommu_listener.listener);
> >>>>>>+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> >>>>>>+        memory_listener_unregister(&container->prereg_listener.listener);
> >>>>>>+    }
> >>>>>>  }
> >>>>>>
> >>>>>>  int vfio_mmap_region(Object *obj, VFIORegion *region,
> >>>>>>@@ -722,8 +725,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>>>>>              goto free_container_exit;
> >>>>>>          }
> >>>>>>
> >>>>>>-        ret = ioctl(fd, VFIO_SET_IOMMU,
> >>>>>>-                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
> >>>>>>+        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
> >>>>>>+        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
> >>>>>>          if (ret) {
> >>>>>>              error_report("vfio: failed to set iommu for container: %m");
> >>>>>>              ret = -errno;
> >>>>>>@@ -748,8 +751,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>>>>>          if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
> >>>>>>              container->iova_pgsizes = info.iova_pgsizes;
> >>>>>>          }
> >>>>>>-    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> >>>>>>+    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
> >>>>>>+               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
> >>>>>>          struct vfio_iommu_spapr_tce_info info;
> >>>>>>+        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
> >>>>>>
> >>>>>>          ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
> >>>>>>          if (ret) {
> >>>>>>@@ -757,7 +762,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>>>>>              ret = -errno;
> >>>>>>              goto free_container_exit;
> >>>>>>          }
> >>>>>>-        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
> >>>>>>+        container->iommu_type =
> >>>>>>+            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
> >>>>>>+        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
> >>>>>
> >>>>>It'd be nice to consolidate the setting of container->iommu_type and
> >>>>>then the SET_IOMMU ioctl() rather than having more or less duplicated
> >>>>>logic for Type1 and SPAPR, but it's not a big deal.
> >>>>
> >>>>
> >>>>Maybe, but I cannot think of any nice way of doing this.
> >>>>
> >>>>
> >>>>>
> >>>>>>          if (ret) {
> >>>>>>              error_report("vfio: failed to set iommu for container: %m");
> >>>>>>              ret = -errno;
> >>>>>>@@ -769,11 +776,25 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>>>>>           * when container fd is closed so we do not call it explicitly
> >>>>>>           * in this file.
> >>>>>>           */
> >>>>>>-        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> >>>>>>-        if (ret) {
> >>>>>>-            error_report("vfio: failed to enable container: %m");
> >>>>>>-            ret = -errno;
> >>>>>>-            goto free_container_exit;
> >>>>>>+        if (!v2) {
> >>>>>>+            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> >>>>>>+            if (ret) {
> >>>>>>+                error_report("vfio: failed to enable container: %m");
> >>>>>>+                ret = -errno;
> >>>>>>+                goto free_container_exit;
> >>>>>>+            }
> >>>>>>+        } else {
> >>>>>>+            container->prereg_listener.container = container;
> >>>>>>+            container->prereg_listener.listener = vfio_prereg_listener;
> >>>>>>+
> >>>>>>+            memory_listener_register(&container->prereg_listener.listener,
> >>>>>>+                                     &address_space_memory);
> >>>>>
> >>>>>This assumes that the target address space of the (guest) IOMMU is
> >>>>>address_space_memory.  Which is fine - vfio already assumes that - but
> >>>>>it reminds me that it'd be nice to have an explicit check for that (I
> >>>>>guess it would have to go in vfio_iommu_map_notify()).  So that if
> >>>>>someone constructs a machine where that's not the case, it'll at least
> >>>>>be obvious why VFIO isn't working.
> >>>>
> >>>>Ok, I'll add a small patch for this in the next respin.
> >>>
> >>>Ok.
> >>>
> >>>>>>+            if (container->error) {
> >>>>>>+                error_report("vfio: RAM memory listener initialization failed for container");
> >>>>>>+                memory_listener_unregister(
> >>>>>>+                    &container->prereg_listener.listener);
> >>>>>>+                goto free_container_exit;
> >>>>>>+            }
> >>>>>>          }
> >>>>>
> >>>>>Looks like you don't have an error path which will handle the case
> >>>>>where the prereg listener is registered, but registering the normal
> >>>>>PCI AS listener fails - I believe you will fail to unregister the
> >>>>>prereg listener in that case.
> >>>>
> >>>>
> >>>>In this case, the control goes to listener_release_exit: which calls
> >>>>vfio_listener_release() which unregisters both listeners (it is a few chunks
> >>>>above).
> >>>
> >>>Ah.. yes.  In which case this could also jump to listener_release_exit
> >>>and avoid the explicit unreg(), yes?
> >>
> >>
> >>Sorry, I do not follow you here. It does jump to
> >>listener_release_exit already.
> >
> >I mean you can use the listener_release_exit label on failure of the
> >prereg listener as well as failure of the regular listener.
> 
> 
> When vfio_prereg_listener fails, vfio_memory_listener is not registered - it
> just looks cleaner to jump further to free_container_exit than to
> listener_release_exit, doesn't it?

If I was writing from scratch, I'd probably do it like that.  But the
existing failure path for the PCI address space listener goes to the
label which (optionally) cleans up the PCI address space listener,
which should not be necessary if registration has failed.

I'm suggesting using the same label on the other listener registration
failure path for consistency.
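
To make the suggestion concrete, here is a minimal, self-contained sketch
of a single shared error label. The struct and function names are stand-ins
rather than the real QEMU types, and the `registered` flag is an assumption
added for illustration so that releasing a listener which never registered
is harmless:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for a memory listener; 'registered' is a hypothetical guard. */
struct listener {
    bool registered;
};

/* Stand-in for the container holding both listeners. */
struct container {
    struct listener prereg_listener;
    struct listener iommu_listener;
    bool error;
};

static void listener_register(struct listener *l)
{
    assert(!l->registered);
    l->registered = true;
}

static void listener_unregister(struct listener *l)
{
    /* No-op for a listener that was never registered. */
    if (l->registered) {
        l->registered = false;
    }
}

/* Plays the role of vfio_listener_release(): drop every listener. */
static void listener_release(struct container *c)
{
    listener_unregister(&c->iommu_listener);
    listener_unregister(&c->prereg_listener);
}

/*
 * fail_prereg / fail_iommu simulate a listener reporting an error while it
 * is being registered. Returns 0 on success, -1 on failure.
 */
static int connect_container(struct container *c,
                             bool fail_prereg, bool fail_iommu)
{
    listener_register(&c->prereg_listener);
    c->error = fail_prereg;
    if (c->error) {
        fprintf(stderr, "prereg listener initialization failed\n");
        goto listener_release_exit;
    }

    listener_register(&c->iommu_listener);
    c->error = fail_iommu;
    if (c->error) {
        fprintf(stderr, "iommu listener initialization failed\n");
        goto listener_release_exit;
    }
    return 0;

listener_release_exit:
    /* Both failure paths share the same cleanup. */
    listener_release(c);
    return -1;
}
```

With one label, neither failure site needs its own explicit unregister
call, at the cost of requiring the release helper to tolerate listeners
that were never registered.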

> 
> 
> 
> >>>>>>          /*
> >>>>>>diff --git a/hw/vfio/prereg.c b/hw/vfio/prereg.c
> >>>>>>new file mode 100644
> >>>>>>index 0000000..66cd696
> >>>>>>--- /dev/null
> >>>>>>+++ b/hw/vfio/prereg.c
> >>>>>>@@ -0,0 +1,138 @@
> >>>>>>+/*
> >>>>>>+ * DMA memory preregistration
> >>>>>>+ *
> >>>>>>+ * Authors:
> >>>>>>+ *  Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>>>+ *
> >>>>>>+ * This work is licensed under the terms of the GNU GPL, version 2.  See
> >>>>>>+ * the COPYING file in the top-level directory.
> >>>>>>+ */
> >>>>>>+
> >>>>>>+#include "qemu/osdep.h"
> >>>>>>+#include <sys/ioctl.h>
> >>>>>>+#include <linux/vfio.h>
> >>>>>>+
> >>>>>>+#include "hw/vfio/vfio-common.h"
> >>>>>>+#include "hw/vfio/vfio.h"
> >>>>>>+#include "qemu/error-report.h"
> >>>>>>+#include "trace.h"
> >>>>>>+
> >>>>>>+static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
> >>>>>>+{
> >>>>>>+    return (!memory_region_is_ram(section->mr) &&
> >>>>>>+            !memory_region_is_iommu(section->mr)) ||
> >>>>>>+            memory_region_is_skip_dump(section->mr);
> >>>>>>+}
> >>>>>>+
> >>>>>>+static void vfio_prereg_listener_region_add(MemoryListener *listener,
> >>>>>>+                                            MemoryRegionSection *section)
> >>>>>>+{
> >>>>>>+    VFIOMemoryListener *vlistener = container_of(listener, VFIOMemoryListener,
> >>>>>>+                                                 listener);
> >>>>>>+    VFIOContainer *container = vlistener->container;
> >>>>>>+    hwaddr iova;
> >>>>>>+    Int128 llend;
> >>>>>>+    int ret;
> >>>>>>+    hwaddr page_mask = qemu_real_host_page_mask;
> >>>>>>+    struct vfio_iommu_spapr_register_memory reg = {
> >>>>>>+        .argsz = sizeof(reg),
> >>>>>>+        .flags = 0,
> >>>>>>+    };
> >>>>>>+
> >>>>>>+    if (vfio_prereg_listener_skipped_section(section)) {
> >>>>>>+        trace_vfio_listener_region_add_skip(
> >>>>>>+                section->offset_within_address_space,
> >>>>>>+                section->offset_within_address_space +
> >>>>>>+                int128_get64(int128_sub(section->size, int128_one())));
> >>>>>>+        return;
> >>>>>>+    }
> >>>>>
> >>>>>You should probably explicitly check for IOMMU regions and abort if
> >>>>>you find one.  An IOMMU AS appearing within address_space_memory would
> >>>>>be bad news.
> >>>>
> >>>>
> >>>>Oh, vfio_prereg_listener_skipped_section() allows memory_region_is_iommu(),
> >>>>I'll remove it.
> >>>
> >>>Well, that's part.
> >>>
> >>>But IOMMU regions appearing here shouldn't just be ignored - they
> >>>should be treated as a fatal error.
> >>
> >>
> >>Is this always an error when they appear in &address_space_memory? Because
> >>if it is so, we should do this check somewhere in
> >>memory_region_transaction_commit() (memory_region_add_subregion() cannot
> >>check this, there is no AS).
> >
> >Hmm.. not necessarily, although it would certainly be strange.  I can
> >imagine some sort of off-board memory device which incorporates an
> >"IO"MMU which will remap requests both from the cpu and from DMA
> >devices.
> >
> >But, that wouldn't work with VFIO, since we no longer know where the
> >host memory is which is backing address_space_memory to preregister
> >it.  Well.. it might be possible by looking at the *target* AS of an
> >iommu registered in address_space_memory, setting up a listener on
> >that, and keeping on going until you reach a real memory region.
> >
> >The case is so unlikely that it's not worth actually implementing the
> >code for.  But I think it is worth a sanity check.
> 
> 
> Ok then. g_assert() or hw_error()?

error_report(), I think.
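
Roughly, the check could be shaped like the sketch below. This is only an
illustration: `struct region`, `enum prereg_action` and `prereg_classify()`
are hypothetical stand-ins for the queries the listener makes on
section->mr, and fprintf() stands in for error_report():

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the properties queried on section->mr. */
struct region {
    bool is_ram;
    bool is_iommu;
    bool is_skip_dump;
};

enum prereg_action {
    PREREG_SKIP,      /* not RAM, or a "skip dump" (VFIO MMIO) region */
    PREREG_REGISTER,  /* ordinary RAM: preregister it */
    PREREG_ERROR,     /* IOMMU region in system memory: report and refuse */
};

static enum prereg_action prereg_classify(const struct region *mr)
{
    /* Check IOMMU regions first so they are reported, not silently skipped. */
    if (mr->is_iommu) {
        fprintf(stderr,
                "prereg: unexpected IOMMU region in system memory\n");
        return PREREG_ERROR;
    }
    if (!mr->is_ram || mr->is_skip_dump) {
        return PREREG_SKIP;
    }
    return PREREG_REGISTER;
}
```

The point is only the ordering: the IOMMU test comes before the generic
skip test, so such a region produces a visible error instead of being
dropped quietly.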

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 11/16] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2016-03-21  4:53               ` David Gibson
@ 2016-03-21  6:08                 ` Alexey Kardashevskiy
  0 siblings, 0 replies; 50+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-21  6:08 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On 03/21/2016 03:53 PM, David Gibson wrote:
> On Thu, Mar 17, 2016 at 08:23:35PM +1100, Alexey Kardashevskiy wrote:
>> On 03/17/2016 05:10 PM, David Gibson wrote:
>>> On Thu, Mar 17, 2016 at 04:04:29PM +1100, Alexey Kardashevskiy wrote:
>>>> On 03/15/2016 04:42 PM, David Gibson wrote:
>>>>> On Tue, Mar 15, 2016 at 01:53:48PM +1100, Alexey Kardashevskiy wrote:
>>>>>> On 03/03/2016 05:30 PM, David Gibson wrote:
>>>>>>> On Tue, Mar 01, 2016 at 08:10:36PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>> This makes use of the new "memory registering" feature. The idea is
>>>>>>>> to provide the userspace ability to notify the host kernel about pages
>>>>>>>> which are going to be used for DMA. Having this information, the host
>>>>>>>> kernel can pin them all once per user process, do locked pages
>>>>>>>> accounting (once) and not spend time on doing that in real time with
>>>>>>>> possible failures which cannot be handled nicely in some cases.
>>>>>>>>
>>>>>>>> This adds a prereg memory listener which listens on address_space_memory
>>>>>>>> and notifies a VFIO container about memory which needs to be
>>>>>>>> pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
>>>>>>>>
>>>>>>>> As there is no per-IOMMU-type release() callback anymore, this stores
>>>>>>>> the IOMMU type in the container so vfio_listener_release() can decide
>>>>>>>> if it needs to unregister @prereg_listener.
>>>>>>>>
>>>>>>>> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
>>>>>>>> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
>>>>>>>> not call it when v2 is detected and enabled.
>>>>>>>>
>>>>>>>> This does not change the guest visible interface.
>>>>>>>>
>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>> ---
>>>>>>>>   hw/vfio/Makefile.objs         |   1 +
>>>>>>>>   hw/vfio/common.c              |  39 +++++++++---
>>>>>>>>   hw/vfio/prereg.c              | 138 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>   include/hw/vfio/vfio-common.h |   4 ++
>>>>>>>>   trace-events                  |   2 +
>>>>>>>>   5 files changed, 175 insertions(+), 9 deletions(-)
>>>>>>>>   create mode 100644 hw/vfio/prereg.c
>>>>>>>>
>>>>>>>> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
>>>>>>>> index ceddbb8..5800e0e 100644
>>>>>>>> --- a/hw/vfio/Makefile.objs
>>>>>>>> +++ b/hw/vfio/Makefile.objs
>>>>>>>> @@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
>>>>>>>>   obj-$(CONFIG_SOFTMMU) += platform.o
>>>>>>>>   obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
>>>>>>>>   obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
>>>>>>>> +obj-$(CONFIG_SOFTMMU) += prereg.o
>>>>>>>>   endif
>>>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>>>>>> index 3aaa6b5..f2a03e0 100644
>>>>>>>> --- a/hw/vfio/common.c
>>>>>>>> +++ b/hw/vfio/common.c
>>>>>>>> @@ -531,6 +531,9 @@ static const MemoryListener vfio_iommu_listener = {
>>>>>>>>   static void vfio_listener_release(VFIOContainer *container)
>>>>>>>>   {
>>>>>>>>       memory_listener_unregister(&container->iommu_listener.listener);
>>>>>>>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>>>>>>> +        memory_listener_unregister(&container->prereg_listener.listener);
>>>>>>>> +    }
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   int vfio_mmap_region(Object *obj, VFIORegion *region,
>>>>>>>> @@ -722,8 +725,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>>>>>>>               goto free_container_exit;
>>>>>>>>           }
>>>>>>>>
>>>>>>>> -        ret = ioctl(fd, VFIO_SET_IOMMU,
>>>>>>>> -                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
>>>>>>>> +        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
>>>>>>>> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>>>>>>>>           if (ret) {
>>>>>>>>               error_report("vfio: failed to set iommu for container: %m");
>>>>>>>>               ret = -errno;
>>>>>>>> @@ -748,8 +751,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>>>>>>>           if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
>>>>>>>>               container->iova_pgsizes = info.iova_pgsizes;
>>>>>>>>           }
>>>>>>>> -    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
>>>>>>>> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
>>>>>>>> +               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
>>>>>>>>           struct vfio_iommu_spapr_tce_info info;
>>>>>>>> +        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
>>>>>>>>
>>>>>>>>           ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>>>>>>>>           if (ret) {
>>>>>>>> @@ -757,7 +762,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>>>>>>>               ret = -errno;
>>>>>>>>               goto free_container_exit;
>>>>>>>>           }
>>>>>>>> -        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
>>>>>>>> +        container->iommu_type =
>>>>>>>> +            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
>>>>>>>> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>>>>>>>
>>>>>>> It'd be nice to consolidate the setting of container->iommu_type and
>>>>>>> then the SET_IOMMU ioctl() rather than having more or less duplicated
>>>>>>> logic for Type1 and SPAPR, but it's not a big deal.
>>>>>>
>>>>>>
>>>>>> Maybe, but I cannot think of any nice way of doing this.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>           if (ret) {
>>>>>>>>               error_report("vfio: failed to set iommu for container: %m");
>>>>>>>>               ret = -errno;
>>>>>>>> @@ -769,11 +776,25 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>>>>>>>            * when container fd is closed so we do not call it explicitly
>>>>>>>>            * in this file.
>>>>>>>>            */
>>>>>>>> -        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
>>>>>>>> -        if (ret) {
>>>>>>>> -            error_report("vfio: failed to enable container: %m");
>>>>>>>> -            ret = -errno;
>>>>>>>> -            goto free_container_exit;
>>>>>>>> +        if (!v2) {
>>>>>>>> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
>>>>>>>> +            if (ret) {
>>>>>>>> +                error_report("vfio: failed to enable container: %m");
>>>>>>>> +                ret = -errno;
>>>>>>>> +                goto free_container_exit;
>>>>>>>> +            }
>>>>>>>> +        } else {
>>>>>>>> +            container->prereg_listener.container = container;
>>>>>>>> +            container->prereg_listener.listener = vfio_prereg_listener;
>>>>>>>> +
>>>>>>>> +            memory_listener_register(&container->prereg_listener.listener,
>>>>>>>> +                                     &address_space_memory);
>>>>>>>
>>>>>>> This assumes that the target address space of the (guest) IOMMU is
>>>>>>> address_space_memory.  Which is fine - vfio already assumes that - but
>>>>>>> it reminds me that it'd be nice to have an explicit check for that (I
>>>>>>> guess it would have to go in vfio_iommu_map_notify()).  So that if
>>>>>>> someone constructs a machine where that's not the case, it'll at least
>>>>>>> be obvious why VFIO isn't working.
>>>>>>
>>>>>> Ok, I'll add a small patch for this in the next respin.
>>>>>
>>>>> Ok.
>>>>>
>>>>>>>> +            if (container->error) {
>>>>>>>> +                error_report("vfio: RAM memory listener initialization failed for container");
>>>>>>>> +                memory_listener_unregister(
>>>>>>>> +                    &container->prereg_listener.listener);
>>>>>>>> +                goto free_container_exit;
>>>>>>>> +            }
>>>>>>>>           }
>>>>>>>
>>>>>>> Looks like you don't have an error path which will handle the case
>>>>>>> where the prereg listener is registered, but registering the normal
>>>>>>> PCI AS listener fails - I believe you will fail to unregister the
>>>>>>> prereg listener in that case.
>>>>>>
>>>>>>
>>>>>> In this case, the control goes to listener_release_exit: which calls
>>>>>> vfio_listener_release() which unregisters both listeners (it is a few chunks
>>>>>> above).
>>>>>
>>>>> Ah.. yes.  In which case this could also jump to listener_release_exit
>>>>> and avoid the explicit unreg(), yes?
>>>>
>>>>
>>>> Sorry, I do not follow you here. It does jump to
>>>> listener_release_exit already.
>>>
>>> I mean you can use the listener_release_exit label on failure of the
>>> prereg listener as well as failure of the regular listener.
>>
>>
>> When vfio_prereg_listener fails, vfio_memory_listener is not registered - it
>> just looks cleaner to jump further to free_container_exit than to
>> listener_release_exit, doesn't it?
>
> If I was writing from scratch, I'd probably do it like that.  But the
> existing failure path for the PCI address space listener goes to the
> label which (optionally) cleans up the PCI address space listener,
> which should not be necessary if registration has failed.


vfio_listener_release() unconditionally calls memory_listener_unregister() 
which unconditionally calls QTAILQ_REMOVE. Is it considered safe (the code 
looks ok) to call QTAILQ_REMOVE() on something which has not been 
QTAILQ_INSERT_BEFORE()'d?
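
For comparison, with the BSD <sys/queue.h> TAILQ macros (which QTAILQ is
modelled on), removing an element that was never inserted writes through
that element's own prev pointer, so it is only as safe as whatever that
pointer happens to contain. One way to make an unconditional unregister
harmless is a guard, sketched below. The `registered` flag and all names
here are assumptions for illustration, not the existing QEMU code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/queue.h>

struct listener {
    bool registered;              /* hypothetical guard flag */
    TAILQ_ENTRY(listener) link;
};

TAILQ_HEAD(listener_list, listener);

static void listener_register(struct listener_list *head, struct listener *l)
{
    assert(!l->registered);
    TAILQ_INSERT_TAIL(head, l, link);
    l->registered = true;
}

static void listener_unregister(struct listener_list *head,
                                struct listener *l)
{
    /* Never touch the queue for an element that is not on it. */
    if (!l->registered) {
        return;
    }
    TAILQ_REMOVE(head, l, link);
    l->registered = false;
}

/* Count the elements currently linked into the queue. */
static size_t listener_count(struct listener_list *head)
{
    size_t n = 0;
    struct listener *l;

    TAILQ_FOREACH(l, head, link) {
        n++;
    }
    return n;
}
```

With the guard in place, a release helper can call the unregister function
for every listener regardless of which ones were actually registered.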




-- 
Alexey


end of thread, other threads:[~2016-03-21  6:08 UTC | newest]

Thread overview: 50+ messages
2016-03-01  9:10 [Qemu-devel] [PATCH qemu v13 00/16] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 01/16] memory: Fix IOMMU replay base address Alexey Kardashevskiy
2016-03-03  1:34   ` [Qemu-devel] [Qemu-ppc] " David Gibson
2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 02/16] spapr_pci: Move DMA window enablement to a helper Alexey Kardashevskiy
2016-03-03  1:40   ` [Qemu-devel] [Qemu-ppc] " David Gibson
2016-03-10  5:47     ` Alexey Kardashevskiy
2016-03-15  5:30       ` David Gibson
2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 03/16] spapr_iommu: Move table allocation to helpers Alexey Kardashevskiy
2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 04/16] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
2016-03-03  3:00   ` [Qemu-devel] [Qemu-ppc] " David Gibson
2016-03-10  7:39     ` Alexey Kardashevskiy
2016-03-15  5:32       ` David Gibson
2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 05/16] spapr_iommu: Add root memory region Alexey Kardashevskiy
2016-03-04  4:08   ` [Qemu-devel] [Qemu-ppc] " David Gibson
2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 06/16] spapr_pci: Reset DMA config on PHB reset Alexey Kardashevskiy
2016-03-03  3:02   ` [Qemu-devel] [Qemu-ppc] " David Gibson
2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 07/16] vfio, memory: Notify IOMMU about starting/stopping being used by VFIO Alexey Kardashevskiy
2016-03-03  5:28   ` [Qemu-devel] [Qemu-ppc] " David Gibson
2016-03-03  6:01     ` Alexey Kardashevskiy
2016-03-04  4:01       ` David Gibson
2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 08/16] memory: Add reporting of supported page sizes Alexey Kardashevskiy
2016-03-03  5:33   ` [Qemu-devel] [Qemu-ppc] " David Gibson
2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 09/16] vfio: Generalize IOMMU memory listener Alexey Kardashevskiy
2016-03-03  5:36   ` [Qemu-devel] [Qemu-ppc] " David Gibson
2016-03-03  6:07     ` Alexey Kardashevskiy
2016-03-04  3:44       ` David Gibson
2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 10/16] vfio: Use different page size for different IOMMU types Alexey Kardashevskiy
2016-03-03  6:08   ` [Qemu-devel] [Qemu-ppc] " David Gibson
2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 11/16] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
2016-03-03  6:30   ` [Qemu-devel] [Qemu-ppc] " David Gibson
2016-03-15  2:53     ` Alexey Kardashevskiy
2016-03-15  5:42       ` David Gibson
2016-03-17  5:04         ` Alexey Kardashevskiy
2016-03-17  6:10           ` David Gibson
2016-03-17  9:23             ` Alexey Kardashevskiy
2016-03-21  4:53               ` David Gibson
2016-03-21  6:08                 ` Alexey Kardashevskiy
2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 12/16] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
2016-03-03  6:31   ` [Qemu-devel] [Qemu-ppc] " David Gibson
2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 13/16] spapr_iommu: Remove need_vfio flag from sPAPRTCETable Alexey Kardashevskiy
2016-03-03  6:38   ` [Qemu-devel] [Qemu-ppc] " David Gibson
2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 14/16] spapr_pci: Add and export DMA resetting helper Alexey Kardashevskiy
2016-03-03  6:39   ` [Qemu-devel] [Qemu-ppc] " David Gibson
2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 15/16] vfio: Move iova_pgsizes from container to guest IOMMU Alexey Kardashevskiy
2016-03-03 11:22   ` [Qemu-devel] [Qemu-ppc] " David Gibson
2016-03-04  0:02     ` Alexey Kardashevskiy
2016-03-01  9:10 ` [Qemu-devel] [PATCH qemu v13 16/16] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
2016-03-04  4:51   ` [Qemu-devel] [Qemu-ppc] " David Gibson
2016-03-11  9:03     ` Alexey Kardashevskiy
2016-03-15  5:53       ` David Gibson
