* [Qemu-devel] [PATCH qemu v14 00/18] spapr: vfio: Enable Dynamic DMA windows (DDW)
@ 2016-03-21  7:46 Alexey Kardashevskiy
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 01/18] memory: Fix IOMMU replay base address Alexey Kardashevskiy
                   ` (17 more replies)
  0 siblings, 18 replies; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-21  7:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, David Gibson

Each Partitionable Endpoint (IOMMU group) has an address range on a PCI bus
where devices are allowed to do DMA. These ranges are called DMA windows.
By default, there is a single DMA window, 1GB or 2GB in size, mapped at zero
on the PCI bus.

PAPR defines a DDW RTAS API which allows pseries guests to query
the hypervisor about DDW support and capabilities (page size mask
for now). A pseries guest may request DMA windows in addition to
the default one using this RTAS API.
Existing pseries Linux guests request an additional window as big as
the guest RAM and map the entire window, which effectively creates
a direct mapping of guest memory to the PCI bus.

This patchset reworks PPC64 IOMMU code and adds necessary structures
to support big windows on pseries.

This patchset is based on the upstream QEMU.

This implements the comments from v13; another change is that the patches
are smaller and the VFIO code is slightly better separated from the sPAPR
and common code.

This includes "vmstate: Define VARRAY with VMS_ALLOC" as the patchset needs
it; it has been posted separately but has been neither accepted nor
rejected so far.

Please comment. Thanks!

Alexey Kardashevskiy (18):
  memory: Fix IOMMU replay base address
  vmstate: Define VARRAY with VMS_ALLOC
  spapr_pci: Move DMA window enablement to a helper
  spapr_iommu: Move table allocation to helpers
  spapr_iommu: Introduce "enabled" state for TCE table
  spapr_iommu: Finish renaming vfio_accel to need_vfio
  spapr_iommu: Realloc table during migration
  spapr_iommu: Migrate full state
  spapr_iommu: Add root memory region
  spapr_pci: Reset DMA config on PHB reset
  memory: Add reporting of supported page sizes
  vfio: Check that IOMMU MR translates to system address space
  vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  spapr_pci: Add and export DMA resetting helper
  vfio: Add host side IOMMU capabilities
  spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being
    used by VFIO
  vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU
  spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)

 hw/ppc/Makefile.objs          |   1 +
 hw/ppc/spapr.c                |   7 +-
 hw/ppc/spapr_iommu.c          | 202 ++++++++++++++++++++++------
 hw/ppc/spapr_pci.c            | 125 +++++++++++++++---
 hw/ppc/spapr_rtas_ddw.c       | 300 ++++++++++++++++++++++++++++++++++++++++++
 hw/ppc/spapr_vio.c            |   8 +-
 hw/vfio/Makefile.objs         |   1 +
 hw/vfio/common.c              | 241 +++++++++++++++++++++++++++------
 hw/vfio/prereg.c              | 137 +++++++++++++++++++
 include/exec/memory.h         |  22 +++-
 include/hw/pci-host/spapr.h   |  15 +++
 include/hw/ppc/spapr.h        |  31 ++++-
 include/hw/vfio/vfio-common.h |  14 +-
 include/migration/vmstate.h   |  10 ++
 memory.c                      |  17 ++-
 target-ppc/kvm_ppc.h          |   2 +-
 trace-events                  |  10 +-
 17 files changed, 1018 insertions(+), 125 deletions(-)
 create mode 100644 hw/ppc/spapr_rtas_ddw.c
 create mode 100644 hw/vfio/prereg.c

-- 
2.5.0.rc3

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Qemu-devel] [PATCH qemu v14 01/18] memory: Fix IOMMU replay base address
  2016-03-21  7:46 [Qemu-devel] [PATCH qemu v14 00/18] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
@ 2016-03-21  7:46 ` Alexey Kardashevskiy
  2016-03-22  0:49   ` David Gibson
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 02/18] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-21  7:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, David Gibson

Since a788f227 "memory: Allow replay of IOMMU mapping notifications",
all existing IOMMU mappings are replayed when a new VFIO listener is
added. However, the base address of the IOMMU memory region (IOMMU MR)
is ignored. This is not a problem for the existing user (pseries),
whose default 32bit DMA window starts at 0, but it is a problem if
there is another DMA window.

This stores the IOMMU MR's offset_within_address_space and adjusts
the IOVA before calling vfio_dma_map/vfio_dma_unmap.

As the IOMMU notifier expects an IOVA offset rather than the absolute
address, this also adjusts the IOVA in the sPAPR H_PUT_TCE handler before
calling the notifier(s).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr_iommu.c          |  2 +-
 hw/vfio/common.c              | 14 ++++++++------
 include/hw/vfio/vfio-common.h |  1 +
 3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 7dd4588..277f289 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -277,7 +277,7 @@ static target_ulong put_tce_emu(sPAPRTCETable *tcet, target_ulong ioba,
     tcet->table[index] = tce;
 
     entry.target_as = &address_space_memory,
-    entry.iova = ioba & page_mask;
+    entry.iova = (ioba - tcet->bus_offset) & page_mask;
     entry.translated_addr = tce & page_mask;
     entry.addr_mask = ~page_mask;
     entry.perm = spapr_tce_iommu_access_flags(tce);
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index fb588d8..d45e2db 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -257,14 +257,14 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
     VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
     VFIOContainer *container = giommu->container;
     IOMMUTLBEntry *iotlb = data;
+    hwaddr iova = iotlb->iova + giommu->offset_within_address_space;
     MemoryRegion *mr;
     hwaddr xlat;
     hwaddr len = iotlb->addr_mask + 1;
     void *vaddr;
     int ret;
 
-    trace_vfio_iommu_map_notify(iotlb->iova,
-                                iotlb->iova + iotlb->addr_mask);
+    trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
 
     /*
      * The IOMMU TLB entry we have just covers translation through
@@ -291,21 +291,21 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
 
     if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
         vaddr = memory_region_get_ram_ptr(mr) + xlat;
-        ret = vfio_dma_map(container, iotlb->iova,
+        ret = vfio_dma_map(container, iova,
                            iotlb->addr_mask + 1, vaddr,
                            !(iotlb->perm & IOMMU_WO) || mr->readonly);
         if (ret) {
             error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx", %p) = %d (%m)",
-                         container, iotlb->iova,
+                         container, iova,
                          iotlb->addr_mask + 1, vaddr, ret);
         }
     } else {
-        ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
+        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
         if (ret) {
             error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx") = %d (%m)",
-                         container, iotlb->iova,
+                         container, iova,
                          iotlb->addr_mask + 1, ret);
         }
     }
@@ -377,6 +377,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
          */
         giommu = g_malloc0(sizeof(*giommu));
         giommu->iommu = section->mr;
+        giommu->offset_within_address_space =
+            section->offset_within_address_space;
         giommu->container = container;
         giommu->n.notify = vfio_iommu_map_notify;
         QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index eb0e1b0..5341e05 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -90,6 +90,7 @@ typedef struct VFIOContainer {
 typedef struct VFIOGuestIOMMU {
     VFIOContainer *container;
     MemoryRegion *iommu;
+    hwaddr offset_within_address_space;
     Notifier n;
     QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
 } VFIOGuestIOMMU;
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v14 02/18] vmstate: Define VARRAY with VMS_ALLOC
  2016-03-21  7:46 [Qemu-devel] [PATCH qemu v14 00/18] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 01/18] memory: Fix IOMMU replay base address Alexey Kardashevskiy
@ 2016-03-21  7:46 ` Alexey Kardashevskiy
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 03/18] spapr_pci: Move DMA window enablement to a helper Alexey Kardashevskiy
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-21  7:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, David Gibson

This allows dynamic allocation of migrated arrays.

The already existing VMSTATE_VARRAY_UINT32 requires an array to be
pre-allocated; however, there are cases when the size is not known in
advance and there is no real need to enforce it.

This defines another variant of VMSTATE_VARRAY_UINT32 with the VMS_ALLOC
flag, which tells the receiving side to allocate memory for the array
before receiving the data.

The first user of it is a dynamic DMA window, whose existence and size
are totally dynamic.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Thomas Huth <thuth@redhat.com>
---
 include/migration/vmstate.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 84ee355..1622638 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -386,6 +386,16 @@ extern const VMStateInfo vmstate_info_bitmap;
     .offset     = vmstate_offset_pointer(_state, _field, _type),     \
 }
 
+#define VMSTATE_VARRAY_UINT32_ALLOC(_field, _state, _field_num, _version, _info, _type) {\
+    .name       = (stringify(_field)),                               \
+    .version_id = (_version),                                        \
+    .num_offset = vmstate_offset_value(_state, _field_num, uint32_t),\
+    .info       = &(_info),                                          \
+    .size       = sizeof(_type),                                     \
+    .flags      = VMS_VARRAY_UINT32|VMS_POINTER|VMS_ALLOC,           \
+    .offset     = vmstate_offset_pointer(_state, _field, _type),     \
+}
+
 #define VMSTATE_VARRAY_UINT16_UNSAFE(_field, _state, _field_num, _version, _info, _type) {\
     .name       = (stringify(_field)),                               \
     .version_id = (_version),                                        \
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v14 03/18] spapr_pci: Move DMA window enablement to a helper
  2016-03-21  7:46 [Qemu-devel] [PATCH qemu v14 00/18] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 01/18] memory: Fix IOMMU replay base address Alexey Kardashevskiy
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 02/18] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
@ 2016-03-21  7:46 ` Alexey Kardashevskiy
  2016-03-22  1:02   ` David Gibson
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 04/18] spapr_iommu: Move table allocation to helpers Alexey Kardashevskiy
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-21  7:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, David Gibson

We are going to have multiple DMA windows soon, so let's start preparing.

This adds a new helper to create a DMA window and makes use of it in
sPAPRPHBState::realize().

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v14:
* replaced "int" return to Error* in spapr_phb_dma_window_enable()
---
 hw/ppc/spapr_pci.c | 47 ++++++++++++++++++++++++++++++++++-------------
 1 file changed, 34 insertions(+), 13 deletions(-)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 79baa7b..18332bf 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -803,6 +803,33 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
     return buf;
 }
 
+static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
+                                       uint32_t liobn,
+                                       uint32_t page_shift,
+                                       uint64_t window_addr,
+                                       uint64_t window_size,
+                                       Error **errp)
+{
+    sPAPRTCETable *tcet;
+    uint32_t nb_table = window_size >> page_shift;
+
+    if (!nb_table) {
+        error_setg(errp, "Zero size table");
+        return;
+    }
+
+    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, window_addr,
+                               page_shift, nb_table, false);
+    if (!tcet) {
+        error_setg(errp, "Unable to create TCE table liobn %x for %s",
+                   liobn, sphb->dtbusname);
+        return;
+    }
+
+    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
+                                spapr_tce_get_iommu(tcet));
+}
+
 /* Macros to operate with address in OF binding to PCI */
 #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
 #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
@@ -1307,8 +1334,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     int i;
     PCIBus *bus;
     uint64_t msi_window_size = 4096;
-    sPAPRTCETable *tcet;
-    uint32_t nb_table;
+    Error *local_err = NULL;
 
     if (sphb->index != (uint32_t)-1) {
         hwaddr windows_base;
@@ -1460,18 +1486,13 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         }
     }
 
-    nb_table = sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT;
-    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn,
-                               0, SPAPR_TCE_PAGE_SHIFT, nb_table, false);
-    if (!tcet) {
-        error_setg(errp, "Unable to create TCE table for %s",
-                   sphb->dtbusname);
-        return;
-    }
-
     /* Register default 32bit DMA window */
-    memory_region_add_subregion(&sphb->iommu_root, sphb->dma_win_addr,
-                                spapr_tce_get_iommu(tcet));
+    spapr_phb_dma_window_enable(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
+                                sphb->dma_win_addr, sphb->dma_win_size,
+                                &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+    }
 
     sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
 }
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v14 04/18] spapr_iommu: Move table allocation to helpers
  2016-03-21  7:46 [Qemu-devel] [PATCH qemu v14 00/18] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (2 preceding siblings ...)
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 03/18] spapr_pci: Move DMA window enablement to a helper Alexey Kardashevskiy
@ 2016-03-21  7:46 ` Alexey Kardashevskiy
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 05/18] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-21  7:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, David Gibson

At the moment the presence of vfio-pci devices on a bus affects the way
the guest view of the table is allocated. If there is no vfio-pci device
on a PHB and the host kernel supports KVM acceleration of H_PUT_TCE,
the table is allocated in KVM. However, if there is a vfio-pci device
and we do not yet have KVM acceleration for that case, the table has to
be allocated by userspace. At the moment the table is allocated once at
boot time, but subsequent patches will reallocate it.

This moves kvmppc_create_spapr_tce/g_malloc0 and their counterparts
to helpers.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr_iommu.c | 58 +++++++++++++++++++++++++++++++++++-----------------
 trace-events         |  2 +-
 2 files changed, 40 insertions(+), 20 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 277f289..8132f64 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -75,6 +75,37 @@ static IOMMUAccessFlags spapr_tce_iommu_access_flags(uint64_t tce)
     }
 }
 
+static uint64_t *spapr_tce_alloc_table(uint32_t liobn,
+                                       uint32_t page_shift,
+                                       uint32_t nb_table,
+                                       int *fd,
+                                       bool need_vfio)
+{
+    uint64_t *table = NULL;
+    uint64_t window_size = (uint64_t)nb_table << page_shift;
+
+    if (kvm_enabled() && !(window_size >> 32)) {
+        table = kvmppc_create_spapr_tce(liobn, window_size, fd, need_vfio);
+    }
+
+    if (!table) {
+        *fd = -1;
+        table = g_malloc0(nb_table * sizeof(uint64_t));
+    }
+
+    trace_spapr_iommu_new_table(liobn, table, *fd);
+
+    return table;
+}
+
+static void spapr_tce_free_table(uint64_t *table, int fd, uint32_t nb_table)
+{
+    if (!kvm_enabled() ||
+        (kvmppc_remove_spapr_tce(table, fd, nb_table) != 0)) {
+        g_free(table);
+    }
+}
+
 /* Called from RCU critical section */
 static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
                                                bool is_write)
@@ -141,21 +172,13 @@ static MemoryRegionIOMMUOps spapr_iommu_ops = {
 static int spapr_tce_table_realize(DeviceState *dev)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
-    uint64_t window_size = (uint64_t)tcet->nb_table << tcet->page_shift;
 
-    if (kvm_enabled() && !(window_size >> 32)) {
-        tcet->table = kvmppc_create_spapr_tce(tcet->liobn,
-                                              window_size,
-                                              &tcet->fd,
-                                              tcet->need_vfio);
-    }
-
-    if (!tcet->table) {
-        size_t table_size = tcet->nb_table * sizeof(uint64_t);
-        tcet->table = g_malloc0(table_size);
-    }
-
-    trace_spapr_iommu_new_table(tcet->liobn, tcet, tcet->table, tcet->fd);
+    tcet->fd = -1;
+    tcet->table = spapr_tce_alloc_table(tcet->liobn,
+                                        tcet->page_shift,
+                                        tcet->nb_table,
+                                        &tcet->fd,
+                                        tcet->need_vfio);
 
     memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
                              "iommu-spapr",
@@ -241,11 +264,8 @@ static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
 
     QLIST_REMOVE(tcet, list);
 
-    if (!kvm_enabled() ||
-        (kvmppc_remove_spapr_tce(tcet->table, tcet->fd,
-                                 tcet->nb_table) != 0)) {
-        g_free(tcet->table);
-    }
+    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
+    tcet->fd = -1;
 }
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
diff --git a/trace-events b/trace-events
index d494de1..6a94736 100644
--- a/trace-events
+++ b/trace-events
@@ -1430,7 +1430,7 @@ spapr_iommu_pci_get(uint64_t liobn, uint64_t ioba, uint64_t ret, uint64_t tce) "
 spapr_iommu_pci_indirect(uint64_t liobn, uint64_t ioba, uint64_t tce, uint64_t iobaN, uint64_t tceN, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcelist=0x%"PRIx64" iobaN=0x%"PRIx64" tceN=0x%"PRIx64" ret=%"PRId64
 spapr_iommu_pci_stuff(uint64_t liobn, uint64_t ioba, uint64_t tce_value, uint64_t npages, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcevalue=0x%"PRIx64" npages=%"PRId64" ret=%"PRId64
 spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, unsigned pgsize) "liobn=%"PRIx64" 0x%"PRIx64" -> 0x%"PRIx64" perm=%u mask=%x"
-spapr_iommu_new_table(uint64_t liobn, void *tcet, void *table, int fd) "liobn=%"PRIx64" tcet=%p table=%p fd=%d"
+spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
 
 # hw/ppc/ppc.c
 ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v14 05/18] spapr_iommu: Introduce "enabled" state for TCE table
  2016-03-21  7:46 [Qemu-devel] [PATCH qemu v14 00/18] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (3 preceding siblings ...)
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 04/18] spapr_iommu: Move table allocation to helpers Alexey Kardashevskiy
@ 2016-03-21  7:46 ` Alexey Kardashevskiy
  2016-03-22  1:11   ` David Gibson
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 06/18] spapr_iommu: Finish renaming vfio_accel to need_vfio Alexey Kardashevskiy
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-21  7:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, David Gibson

Currently TCE tables are created once at start and their sizes never
change. We are going to change that by introducing Dynamic DMA windows
support, where the DMA configuration may change during guest execution.

This changes spapr_tce_new_table() to create an empty zero-size IOMMU
memory region (IOMMU MR). Only the LIOBN is assigned at creation time.
It will still be called once at the owner object (VIO or PHB) creation.

This introduces an "enabled" state for TCE table objects with two
helper functions - spapr_tce_table_enable()/spapr_tce_table_disable().
- spapr_tce_table_enable() receives the TCE table parameters, allocates
a guest view of the TCE table (in userspace or KVM) and
sets the correct size on the IOMMU MR.
- spapr_tce_table_disable() disposes of the table and resets the IOMMU MR
size.

This changes the PHB reset handler to do the default DMA initialization
instead of spapr_phb_realize(). This makes no difference now, but later,
with more than one DMA window, we will have to remove them all
and create the default one on a system reset.

No visible change in behaviour is expected except that the actual table
will be reallocated on every reset. We might optimize this later.

The other way to implement this would be to dynamically create/remove
the TCE table QOM objects, but that would make migration impossible
as the migration code expects all QOM objects to exist at the receiver,
so we have to have the TCE table objects created when migration begins.

spapr_tce_table_do_enable() is separated from spapr_tce_table_enable()
as later it will be called at the sPAPRTCETable post-migration stage when
it already has all the properties set after the migration; the same is
done for spapr_tce_table_disable().

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v14:
* added spapr_tce_table_do_disable(), will make difference in following
patch with fully dynamic table migration
---
 hw/ppc/spapr_iommu.c   | 86 ++++++++++++++++++++++++++++++++++++--------------
 hw/ppc/spapr_pci.c     | 13 ++++++--
 hw/ppc/spapr_vio.c     |  8 ++---
 include/hw/ppc/spapr.h | 10 +++---
 4 files changed, 81 insertions(+), 36 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 8132f64..9bcd3f6 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -17,6 +17,7 @@
  * License along with this library; if not, see <http://www.gnu.org/licenses/>.
  */
 #include "qemu/osdep.h"
+#include "qemu/error-report.h"
 #include "hw/hw.h"
 #include "sysemu/kvm.h"
 #include "hw/qdev.h"
@@ -174,15 +175,9 @@ static int spapr_tce_table_realize(DeviceState *dev)
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
 
     tcet->fd = -1;
-    tcet->table = spapr_tce_alloc_table(tcet->liobn,
-                                        tcet->page_shift,
-                                        tcet->nb_table,
-                                        &tcet->fd,
-                                        tcet->need_vfio);
-
+    tcet->need_vfio = false;
     memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
-                             "iommu-spapr",
-                             (uint64_t)tcet->nb_table << tcet->page_shift);
+                             "iommu-spapr", 0);
 
     QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
 
@@ -224,14 +219,10 @@ void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio)
     tcet->table = newtable;
 }
 
-sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
-                                   uint64_t bus_offset,
-                                   uint32_t page_shift,
-                                   uint32_t nb_table,
-                                   bool need_vfio)
+sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn)
 {
     sPAPRTCETable *tcet;
-    char tmp[64];
+    char tmp[32];
 
     if (spapr_tce_find_by_liobn(liobn)) {
         fprintf(stderr, "Attempted to create TCE table with duplicate"
@@ -239,16 +230,8 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
         return NULL;
     }
 
-    if (!nb_table) {
-        return NULL;
-    }
-
     tcet = SPAPR_TCE_TABLE(object_new(TYPE_SPAPR_TCE_TABLE));
     tcet->liobn = liobn;
-    tcet->bus_offset = bus_offset;
-    tcet->page_shift = page_shift;
-    tcet->nb_table = nb_table;
-    tcet->need_vfio = need_vfio;
 
     snprintf(tmp, sizeof(tmp), "tce-table-%x", liobn);
     object_property_add_child(OBJECT(owner), tmp, OBJECT(tcet), NULL);
@@ -258,14 +241,69 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
     return tcet;
 }
 
+static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
+{
+    if (!tcet->nb_table) {
+        return;
+    }
+
+    tcet->table = spapr_tce_alloc_table(tcet->liobn,
+                                        tcet->page_shift,
+                                        tcet->nb_table,
+                                        &tcet->fd,
+                                        tcet->need_vfio);
+
+    memory_region_set_size(&tcet->iommu,
+                           (uint64_t)tcet->nb_table << tcet->page_shift);
+
+    tcet->enabled = true;
+}
+
+void spapr_tce_table_enable(sPAPRTCETable *tcet,
+                            uint32_t page_shift, uint64_t bus_offset,
+                            uint32_t nb_table)
+{
+    if (tcet->enabled) {
+        error_report("Warning: trying to enable already enabled TCE table");
+        return;
+    }
+
+    tcet->bus_offset = bus_offset;
+    tcet->page_shift = page_shift;
+    tcet->nb_table = nb_table;
+
+    spapr_tce_table_do_enable(tcet);
+}
+
+static void spapr_tce_table_do_disable(sPAPRTCETable *tcet)
+{
+    memory_region_set_size(&tcet->iommu, 0);
+
+    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
+    tcet->fd = -1;
+    tcet->table = NULL;
+    tcet->enabled = false;
+    tcet->bus_offset = 0;
+    tcet->page_shift = 0;
+    tcet->nb_table = 0;
+}
+
+static void spapr_tce_table_disable(sPAPRTCETable *tcet)
+{
+    if (!tcet->enabled) {
+        error_report("Warning: trying to disable already disabled TCE table");
+        return;
+    }
+    spapr_tce_table_do_disable(tcet);
+}
+
 static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
 
     QLIST_REMOVE(tcet, list);
 
-    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
-    tcet->fd = -1;
+    spapr_tce_table_disable(tcet);
 }
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 18332bf..df5f7b9 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -818,14 +818,15 @@ static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
         return;
     }
 
-    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, window_addr,
-                               page_shift, nb_table, false);
+    tcet = spapr_tce_find_by_liobn(liobn);
     if (!tcet) {
         error_setg(errp, "Unable to create TCE table liobn %x for %s",
                    liobn, sphb->dtbusname);
         return;
     }
 
+    spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table);
+
     memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
                                 spapr_tce_get_iommu(tcet));
 }
@@ -1335,6 +1336,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     PCIBus *bus;
     uint64_t msi_window_size = 4096;
     Error *local_err = NULL;
+    sPAPRTCETable *tcet;
 
     if (sphb->index != (uint32_t)-1) {
         hwaddr windows_base;
@@ -1486,6 +1488,13 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         }
     }
 
+    /* DMA setup */
+    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
+    if (!tcet) {
+        error_report("No default TCE table for %s", sphb->dtbusname);
+        return;
+    }
+
     /* Register default 32bit DMA window */
     spapr_phb_dma_window_enable(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
                                 sphb->dma_win_addr, sphb->dma_win_size,
diff --git a/hw/ppc/spapr_vio.c b/hw/ppc/spapr_vio.c
index 0f61a55..7f57290 100644
--- a/hw/ppc/spapr_vio.c
+++ b/hw/ppc/spapr_vio.c
@@ -481,11 +481,9 @@ static void spapr_vio_busdev_realize(DeviceState *qdev, Error **errp)
         memory_region_add_subregion_overlap(&dev->mrroot, 0, &dev->mrbypass, 1);
         address_space_init(&dev->as, &dev->mrroot, qdev->id);
 
-        dev->tcet = spapr_tce_new_table(qdev, liobn,
-                                        0,
-                                        SPAPR_TCE_PAGE_SHIFT,
-                                        pc->rtce_window_size >>
-                                        SPAPR_TCE_PAGE_SHIFT, false);
+        dev->tcet = spapr_tce_new_table(qdev, liobn);
+        spapr_tce_table_enable(dev->tcet, SPAPR_TCE_PAGE_SHIFT, 0,
+                               pc->rtce_window_size >> SPAPR_TCE_PAGE_SHIFT);
         dev->tcet->vdev = dev;
         memory_region_add_subregion_overlap(&dev->mrroot, 0,
                                             spapr_tce_get_iommu(dev->tcet), 2);
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 098d85d..75b0b55 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -539,6 +539,7 @@ typedef struct sPAPRTCETable sPAPRTCETable;
 
 struct sPAPRTCETable {
     DeviceState parent;
+    bool enabled;
     uint32_t liobn;
     uint32_t nb_table;
     uint64_t bus_offset;
@@ -566,11 +567,10 @@ void spapr_events_fdt_skel(void *fdt, uint32_t epow_irq);
 int spapr_h_cas_compose_response(sPAPRMachineState *sm,
                                  target_ulong addr, target_ulong size,
                                  bool cpu_update, bool memory_update);
-sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
-                                   uint64_t bus_offset,
-                                   uint32_t page_shift,
-                                   uint32_t nb_table,
-                                   bool need_vfio);
+sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn);
+void spapr_tce_table_enable(sPAPRTCETable *tcet,
+                            uint32_t page_shift, uint64_t bus_offset,
+                            uint32_t nb_table);
 void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio);
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet);
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [Qemu-devel] [PATCH qemu v14 06/18] spapr_iommu: Finish renaming vfio_accel to need_vfio
  2016-03-21  7:46 [Qemu-devel] [PATCH qemu v14 00/18] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (4 preceding siblings ...)
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 05/18] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
@ 2016-03-21  7:46 ` Alexey Kardashevskiy
  2016-03-22  1:18   ` David Gibson
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 07/18] spapr_iommu: Realloc table during migration Alexey Kardashevskiy
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-21  7:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, David Gibson

6a81dd17 "spapr_iommu: Rename vfio_accel parameter" renamed vfio_accel
flag everywhere but one spot was missed.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 target-ppc/kvm_ppc.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target-ppc/kvm_ppc.h b/target-ppc/kvm_ppc.h
index fc79312..3b2090e 100644
--- a/target-ppc/kvm_ppc.h
+++ b/target-ppc/kvm_ppc.h
@@ -163,7 +163,7 @@ static inline bool kvmppc_spapr_use_multitce(void)
 
 static inline void *kvmppc_create_spapr_tce(uint32_t liobn,
                                             uint32_t window_size, int *fd,
-                                            bool vfio_accel)
+                                            bool need_vfio)
 {
     return NULL;
 }
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v14 07/18] spapr_iommu: Realloc table during migration
  2016-03-21  7:46 [Qemu-devel] [PATCH qemu v14 00/18] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (5 preceding siblings ...)
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 06/18] spapr_iommu: Finish renaming vfio_accel to need_vfio Alexey Kardashevskiy
@ 2016-03-21  7:46 ` Alexey Kardashevskiy
  2016-03-22  1:23   ` David Gibson
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 08/18] spapr_iommu: Migrate full state Alexey Kardashevskiy
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-21  7:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, David Gibson

The source guest could have reallocated the default TCE table and may
therefore migrate a bigger or smaller table than the destination created.
This adds reallocation in post_load() for the case where the default
table size differs between source and destination.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v14:
* new to the series
---
 hw/ppc/spapr_iommu.c   | 36 ++++++++++++++++++++++++++++++++++--
 include/hw/ppc/spapr.h |  2 ++
 2 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 9bcd3f6..549cd94 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -137,6 +137,16 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
     return ret;
 }
 
+static void spapr_tce_table_pre_save(void *opaque)
+{
+    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
+
+    tcet->mig_table = tcet->table;
+}
+
+static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
+static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
+
 static int spapr_tce_table_post_load(void *opaque, int version_id)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
@@ -145,6 +155,26 @@ static int spapr_tce_table_post_load(void *opaque, int version_id)
         spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
     }
 
+    if (tcet->enabled) {
+        if (tcet->nb_table != tcet->mig_nb_table) {
+            if (tcet->nb_table) {
+                spapr_tce_table_do_disable(tcet);
+            }
+            tcet->nb_table = tcet->mig_nb_table;
+            spapr_tce_table_do_enable(tcet);
+        }
+
+        memcpy(tcet->table, tcet->mig_table,
+               tcet->nb_table * sizeof(tcet->table[0]));
+
+        free(tcet->mig_table);
+        tcet->mig_table = NULL;
+
+    } else if (tcet->table) {
+        /* Destination guest has a default table but source does not -> free */
+        spapr_tce_table_do_disable(tcet);
+    }
+
     return 0;
 }
 
@@ -152,15 +182,17 @@ static const VMStateDescription vmstate_spapr_tce_table = {
     .name = "spapr_iommu",
     .version_id = 2,
     .minimum_version_id = 2,
+    .pre_save = spapr_tce_table_pre_save,
     .post_load = spapr_tce_table_post_load,
     .fields      = (VMStateField []) {
         /* Sanity check */
         VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
-        VMSTATE_UINT32_EQUAL(nb_table, sPAPRTCETable),
 
         /* IOMMU state */
+        VMSTATE_UINT32(mig_nb_table, sPAPRTCETable),
         VMSTATE_BOOL(bypass, sPAPRTCETable),
-        VMSTATE_VARRAY_UINT32(table, sPAPRTCETable, nb_table, 0, vmstate_info_uint64, uint64_t),
+        VMSTATE_VARRAY_UINT32_ALLOC(mig_table, sPAPRTCETable, nb_table, 0,
+                                    vmstate_info_uint64, uint64_t),
 
         VMSTATE_END_OF_LIST()
     },
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 75b0b55..c1ea49c 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -545,6 +545,8 @@ struct sPAPRTCETable {
     uint64_t bus_offset;
     uint32_t page_shift;
     uint64_t *table;
+    uint32_t mig_nb_table;
+    uint64_t *mig_table;
     bool bypass;
     bool need_vfio;
     int fd;
-- 
2.5.0.rc3

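
The post_load() reallocation above can be modelled outside QEMU. Below is
a minimal sketch using a hypothetical ToyTCETable stand-in for
sPAPRTCETable; the type and helper names are illustrative only, not QEMU
API:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for sPAPRTCETable (hypothetical, for illustration). */
typedef struct {
    int enabled;
    uint32_t nb_table;
    uint64_t *table;
    uint32_t mig_nb_table;
    uint64_t *mig_table;
} ToyTCETable;

static void toy_do_enable(ToyTCETable *t)
{
    t->table = calloc(t->nb_table, sizeof(t->table[0]));
    t->enabled = 1;
}

static void toy_do_disable(ToyTCETable *t)
{
    free(t->table);
    t->table = NULL;
    t->enabled = 0;
}

/* Mirrors the post_load() flow: realloc when sizes differ, then copy
 * the migrated entries and release the temporary migration buffer. */
static int toy_post_load(ToyTCETable *t)
{
    if (t->enabled) {
        if (t->nb_table != t->mig_nb_table) {
            if (t->nb_table) {
                toy_do_disable(t);
            }
            t->nb_table = t->mig_nb_table;
            toy_do_enable(t);
        }
        memcpy(t->table, t->mig_table, t->nb_table * sizeof(t->table[0]));
        free(t->mig_table);
        t->mig_table = NULL;
    } else if (t->table) {
        /* Destination has a default table but source does not -> free */
        toy_do_disable(t);
    }
    return 0;
}
```

The key point mirrored from the patch: the destination's default table is
thrown away and reallocated at the source's size before the migrated
entries are copied in.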

* [Qemu-devel] [PATCH qemu v14 08/18] spapr_iommu: Migrate full state
  2016-03-21  7:46 [Qemu-devel] [PATCH qemu v14 00/18] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (6 preceding siblings ...)
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 07/18] spapr_iommu: Realloc table during migration Alexey Kardashevskiy
@ 2016-03-21  7:46 ` Alexey Kardashevskiy
  2016-03-22  1:31   ` David Gibson
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 09/18] spapr_iommu: Add root memory region Alexey Kardashevskiy
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-21  7:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, David Gibson

This adds the @bus_offset, @page_shift and @enabled members to the
migration stream. These cannot change without dynamic DMA windows, so
no change in behaviour is expected.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v14:
* new to the series
---
 hw/ppc/spapr_iommu.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 549cd94..5ea5948 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -180,7 +180,7 @@ static int spapr_tce_table_post_load(void *opaque, int version_id)
 
 static const VMStateDescription vmstate_spapr_tce_table = {
     .name = "spapr_iommu",
-    .version_id = 2,
+    .version_id = 3,
     .minimum_version_id = 2,
     .pre_save = spapr_tce_table_pre_save,
     .post_load = spapr_tce_table_post_load,
@@ -189,6 +189,9 @@ static const VMStateDescription vmstate_spapr_tce_table = {
         VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
 
         /* IOMMU state */
+        VMSTATE_BOOL_V(enabled, sPAPRTCETable, 3),
+        VMSTATE_UINT64_V(bus_offset, sPAPRTCETable, 3),
+        VMSTATE_UINT32_V(page_shift, sPAPRTCETable, 3),
         VMSTATE_UINT32(mig_nb_table, sPAPRTCETable),
         VMSTATE_BOOL(bypass, sPAPRTCETable),
         VMSTATE_VARRAY_UINT32_ALLOC(mig_table, sPAPRTCETable, nb_table, 0,
-- 
2.5.0.rc3

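
The effect of the `_V` field macros used above can be captured by a tiny
model (illustrative only, not the VMState implementation): a field tagged
with version V is only carried by streams whose version_id is at least V,
which is how the new members stay compatible with version-2 streams.

```c
#include <assert.h>

/* Toy model of VMSTATE_*_V versioning: a field tagged with
 * field_version is only present in streams of that version or newer. */
static int toy_field_present(int stream_version, int field_version)
{
    return stream_version >= field_version;
}
```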

* [Qemu-devel] [PATCH qemu v14 09/18] spapr_iommu: Add root memory region
  2016-03-21  7:46 [Qemu-devel] [PATCH qemu v14 00/18] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (7 preceding siblings ...)
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 08/18] spapr_iommu: Migrate full state Alexey Kardashevskiy
@ 2016-03-21  7:46 ` Alexey Kardashevskiy
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 10/18] spapr_pci: Reset DMA config on PHB reset Alexey Kardashevskiy
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-21  7:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, David Gibson

We are going to have multiple DMA windows at different offsets on
a PCI bus. For the sake of migration, we will pre-create as many TCE
table objects as there are supported windows.
So we need a way to map windows dynamically onto a PCI bus
when migration of a table has completed, but at that stage a TCE table
object has no access to a PHB which it could ask to map a DMA window
backed by the just-migrated TCE table.

This adds a "root" memory region (UINT64_MAX long) to the TCE object.
This new region is mapped on a PCI bus with overlapping enabled, as
there will be one root MR per TCE table, each of them mapped at 0.
The actual IOMMU memory region is a subregion of the root region;
a TCE table enables/disables this subregion and maps it at
the specific offset inside the root MR, which is a 1:1 mapping of
the PCI address space.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Thomas Huth <thuth@redhat.com>
---
 hw/ppc/spapr_iommu.c   | 13 ++++++++++---
 hw/ppc/spapr_pci.c     |  6 +++---
 include/hw/ppc/spapr.h |  2 +-
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 5ea5948..481ce3c 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -208,11 +208,16 @@ static MemoryRegionIOMMUOps spapr_iommu_ops = {
 static int spapr_tce_table_realize(DeviceState *dev)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
+    Object *tcetobj = OBJECT(tcet);
+    char tmp[32];
 
     tcet->fd = -1;
     tcet->need_vfio = false;
-    memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
-                             "iommu-spapr", 0);
+    snprintf(tmp, sizeof(tmp), "tce-root-%x", tcet->liobn);
+    memory_region_init(&tcet->root, tcetobj, tmp, UINT64_MAX);
+
+    snprintf(tmp, sizeof(tmp), "tce-iommu-%x", tcet->liobn);
+    memory_region_init_iommu(&tcet->iommu, tcetobj, &spapr_iommu_ops, tmp, 0);
 
     QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
 
@@ -290,6 +295,7 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
 
     memory_region_set_size(&tcet->iommu,
                            (uint64_t)tcet->nb_table << tcet->page_shift);
+    memory_region_add_subregion(&tcet->root, tcet->bus_offset, &tcet->iommu);
 
     tcet->enabled = true;
 }
@@ -312,6 +318,7 @@ void spapr_tce_table_enable(sPAPRTCETable *tcet,
 
 static void spapr_tce_table_do_disable(sPAPRTCETable *tcet)
 {
+    memory_region_del_subregion(&tcet->root, &tcet->iommu);
     memory_region_set_size(&tcet->iommu, 0);
 
     spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
@@ -343,7 +350,7 @@ static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
 {
-    return &tcet->iommu;
+    return &tcet->root;
 }
 
 static void spapr_tce_reset(DeviceState *dev)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index df5f7b9..f1d49d5 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -826,9 +826,6 @@ static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
     }
 
     spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table);
-
-    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
-                                spapr_tce_get_iommu(tcet));
 }
 
 /* Macros to operate with address in OF binding to PCI */
@@ -1495,6 +1492,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         return;
     }
 
+    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
+                                        spapr_tce_get_iommu(tcet), 0);
+
     /* Register default 32bit DMA window */
     spapr_phb_dma_window_enable(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
                                 sphb->dma_win_addr, sphb->dma_win_size,
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index c1ea49c..e9cdfe3 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -550,7 +550,7 @@ struct sPAPRTCETable {
     bool bypass;
     bool need_vfio;
     int fd;
-    MemoryRegion iommu;
+    MemoryRegion root, iommu;
     struct VIOsPAPRDevice *vdev; /* for @bypass migration compatibility only */
     QLIST_ENTRY(sPAPRTCETable) list;
 };
-- 
2.5.0.rc3

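
The root-MR-plus-subregion arrangement above can be sketched as a toy
bus-address lookup. This is a conceptual model only, not the QEMU
MemoryRegion API: the root spans the whole 64-bit bus space, and the
window only resolves addresses while it is enabled at its bus offset.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a DMA window inside a full-range root region. */
typedef struct {
    int enabled;
    uint64_t bus_offset;   /* where the window sits inside the root MR */
    uint64_t size;         /* window size in bytes */
} ToyWindow;

/* A bus address hits only if it falls inside an enabled window. */
static int toy_window_hit(const ToyWindow *w, uint64_t bus_addr)
{
    return w->enabled &&
           bus_addr >= w->bus_offset &&
           bus_addr < w->bus_offset + w->size;
}
```

Enabling the table corresponds to adding the IOMMU subregion at
bus_offset; disabling removes it, and lookups simply stop hitting.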

* [Qemu-devel] [PATCH qemu v14 10/18] spapr_pci: Reset DMA config on PHB reset
  2016-03-21  7:46 [Qemu-devel] [PATCH qemu v14 00/18] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (8 preceding siblings ...)
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 09/18] spapr_iommu: Add root memory region Alexey Kardashevskiy
@ 2016-03-21  7:46 ` Alexey Kardashevskiy
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 11/18] memory: Add reporting of supported page sizes Alexey Kardashevskiy
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-21  7:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, David Gibson

LoPAPR dictates that during a system reset all DMA windows must be
removed and the default DMA32 window must be recreated; this is what
the patch does.

At the moment only one window is supported, so no change in
behaviour is expected.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr_iommu.c   |  2 +-
 hw/ppc/spapr_pci.c     | 38 +++++++++++++++++++++++++++++---------
 include/hw/ppc/spapr.h |  1 +
 3 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 481ce3c..dd662da 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -330,7 +330,7 @@ static void spapr_tce_table_do_disable(sPAPRTCETable *tcet)
     tcet->nb_table = 0;
 }
 
-static void spapr_tce_table_disable(sPAPRTCETable *tcet)
+void spapr_tce_table_disable(sPAPRTCETable *tcet)
 {
     if (!tcet->enabled) {
         error_report("Warning: trying to disable already disabled TCE table");
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index f1d49d5..1e53dad 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -828,6 +828,19 @@ static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
     spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table);
 }
 
+static int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
+{
+    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
+
+    if (!tcet) {
+        return -1;
+    }
+
+    spapr_tce_table_disable(tcet);
+
+    return 0;
+}
+
 /* Macros to operate with address in OF binding to PCI */
 #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
 #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
@@ -1332,7 +1345,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     int i;
     PCIBus *bus;
     uint64_t msi_window_size = 4096;
-    Error *local_err = NULL;
     sPAPRTCETable *tcet;
 
     if (sphb->index != (uint32_t)-1) {
@@ -1495,14 +1507,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
                                         spapr_tce_get_iommu(tcet), 0);
 
-    /* Register default 32bit DMA window */
-    spapr_phb_dma_window_enable(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
-                                sphb->dma_win_addr, sphb->dma_win_size,
-                                &local_err);
-    if (local_err) {
-        error_propagate(errp, local_err);
-    }
-
     sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
 }
 
@@ -1519,6 +1523,22 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
 
 static void spapr_phb_reset(DeviceState *qdev)
 {
+    sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
+    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
+    Error *local_err = NULL;
+
+    if (tcet && tcet->enabled) {
+        spapr_phb_dma_window_disable(sphb, sphb->dma_liobn);
+    }
+
+    /* Register default 32bit DMA window */
+    spapr_phb_dma_window_enable(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
+                                sphb->dma_win_addr, sphb->dma_win_size,
+                                &local_err);
+    if (local_err) {
+        error_report_err(local_err);
+    }
+
     /* Reset the IOMMU state */
     object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL);
 
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index e9cdfe3..471eb4a 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -573,6 +573,7 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn);
 void spapr_tce_table_enable(sPAPRTCETable *tcet,
                             uint32_t page_shift, uint64_t bus_offset,
                             uint32_t nb_table);
+void spapr_tce_table_disable(sPAPRTCETable *tcet);
 void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio);
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet);
-- 
2.5.0.rc3

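
The reset sequence above — disable whatever window is live, then
recreate the default one — can be sketched with a toy window state
(illustrative names, not the spapr_pci implementation):

```c
#include <assert.h>

/* Toy sketch of the PHB reset flow. */
typedef struct {
    int enabled;
    unsigned nb_table;
} ToyWin;

static void toy_phb_reset(ToyWin *w, unsigned default_nb_table)
{
    if (w->enabled) {
        /* corresponds to spapr_phb_dma_window_disable() */
        w->enabled = 0;
        w->nb_table = 0;
    }
    /* corresponds to spapr_phb_dma_window_enable() for the
     * default 32-bit DMA window */
    w->nb_table = default_nb_table;
    w->enabled = 1;
}
```

Whatever state the guest left the window in, reset always ends with
exactly the default window enabled.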

* [Qemu-devel] [PATCH qemu v14 11/18] memory: Add reporting of supported page sizes
  2016-03-21  7:46 [Qemu-devel] [PATCH qemu v14 00/18] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (9 preceding siblings ...)
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 10/18] spapr_pci: Reset DMA config on PHB reset Alexey Kardashevskiy
@ 2016-03-21  7:46 ` Alexey Kardashevskiy
  2016-03-22  3:02   ` David Gibson
  2016-03-21  7:47 ` [Qemu-devel] [PATCH qemu v14 12/18] vfio: Check that IOMMU MR translates to system address space Alexey Kardashevskiy
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-21  7:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, David Gibson

Every IOMMU has some granularity which MemoryRegionIOMMUOps::translate
uses when translating; however, this information is not available
outside the translate context for the various checks that need it.

This adds a get_page_sizes callback to MemoryRegionIOMMUOps and
a wrapper for it so IOMMU users (such as VFIO) can know the actual
page size(s) used by an IOMMU.

qemu_real_host_page_size is used as a fallback.

This removes vfio_container_granularity() and uses the new callback in
memory_region_iommu_replay() when replaying IOMMU mappings on an added
IOMMU memory region.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v14:
* removed vfio_container_granularity(), changed memory_region_iommu_replay()

v4:
* s/1<<TARGET_PAGE_BITS/qemu_real_host_page_size/
---
 hw/ppc/spapr_iommu.c  |  8 ++++++++
 hw/vfio/common.c      |  6 ------
 include/exec/memory.h | 18 ++++++++++++++----
 memory.c              | 17 ++++++++++++++---
 4 files changed, 36 insertions(+), 13 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index dd662da..6dc3c45 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -144,6 +144,13 @@ static void spapr_tce_table_pre_save(void *opaque)
     tcet->mig_table = tcet->table;
 }
 
+static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
+{
+    sPAPRTCETable *tcet = container_of(iommu, sPAPRTCETable, iommu);
+
+    return 1ULL << tcet->page_shift;
+}
+
 static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
 static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
 
@@ -203,6 +210,7 @@ static const VMStateDescription vmstate_spapr_tce_table = {
 
 static MemoryRegionIOMMUOps spapr_iommu_ops = {
     .translate = spapr_tce_translate_iommu,
+    .get_page_sizes = spapr_tce_get_page_sizes,
 };
 
 static int spapr_tce_table_realize(DeviceState *dev)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index d45e2db..55723c9 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -313,11 +313,6 @@ out:
     rcu_read_unlock();
 }
 
-static hwaddr vfio_container_granularity(VFIOContainer *container)
-{
-    return (hwaddr)1 << ctz64(container->iova_pgsizes);
-}
-
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
@@ -385,7 +380,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
 
         memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
         memory_region_iommu_replay(giommu->iommu, &giommu->n,
-                                   vfio_container_granularity(container),
                                    false);
 
         return;
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 2de7898..eb5ce67 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -150,6 +150,8 @@ typedef struct MemoryRegionIOMMUOps MemoryRegionIOMMUOps;
 struct MemoryRegionIOMMUOps {
     /* Return a TLB entry that contains a given address. */
     IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
+    /* Returns supported page sizes */
+    uint64_t (*get_page_sizes)(MemoryRegion *iommu);
 };
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
@@ -573,6 +575,15 @@ static inline bool memory_region_is_iommu(MemoryRegion *mr)
 
 
 /**
+ * memory_region_iommu_get_page_sizes: get supported page sizes in an iommu
+ *
+ * Returns %bitmap of supported page sizes for an iommu.
+ *
+ * @mr: the memory region being queried
+ */
+uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr);
+
+/**
  * memory_region_notify_iommu: notify a change in an IOMMU translation entry.
  *
  * @mr: the memory region that was changed
@@ -596,16 +607,15 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n);
 
 /**
  * memory_region_iommu_replay: replay existing IOMMU translations to
- * a notifier
+ * a notifier with the minimum page granularity returned by
+ * mr->iommu_ops->get_page_sizes().
  *
  * @mr: the memory region to observe
  * @n: the notifier to which to replay iommu mappings
- * @granularity: Minimum page granularity to replay notifications for
  * @is_write: Whether to treat the replay as a translate "write"
  *     through the iommu
  */
-void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n,
-                                hwaddr granularity, bool is_write);
+void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write);
 
 /**
  * memory_region_unregister_iommu_notifier: unregister a notifier for
diff --git a/memory.c b/memory.c
index 95f7209..64a84d3 100644
--- a/memory.c
+++ b/memory.c
@@ -1512,12 +1512,14 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n)
     notifier_list_add(&mr->iommu_notify, n);
 }
 
-void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n,
-                                hwaddr granularity, bool is_write)
+void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write)
 {
-    hwaddr addr;
+    hwaddr addr, granularity;
     IOMMUTLBEntry iotlb;
 
+    g_assert(mr->iommu_ops && mr->iommu_ops->get_page_sizes);
+    granularity = (hwaddr)1 << ctz64(mr->iommu_ops->get_page_sizes(mr));
+
     for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
         iotlb = mr->iommu_ops->translate(mr, addr, is_write);
         if (iotlb.perm != IOMMU_NONE) {
@@ -1544,6 +1546,15 @@ void memory_region_notify_iommu(MemoryRegion *mr,
     notifier_list_notify(&mr->iommu_notify, &entry);
 }
 
+uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr)
+{
+    assert(memory_region_is_iommu(mr));
+    if (mr->iommu_ops && mr->iommu_ops->get_page_sizes) {
+        return mr->iommu_ops->get_page_sizes(mr);
+    }
+    return qemu_real_host_page_size;
+}
+
 void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client)
 {
     uint8_t mask = 1 << client;
-- 
2.5.0.rc3

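
The replay granularity computed in memory_region_iommu_replay() above —
the smallest supported page size, taken as 1 << ctz64(mask) — can be
checked in isolation. A sketch, using the GCC/Clang builtin in place of
QEMU's ctz64():

```c
#include <assert.h>
#include <stdint.h>

/* Minimum page granularity from a supported-page-sizes bitmap:
 * the lowest set bit is the smallest supported page size. */
static uint64_t toy_min_granularity(uint64_t page_sizes_mask)
{
    /* __builtin_ctzll mirrors QEMU's ctz64() for non-zero masks. */
    return (uint64_t)1 << __builtin_ctzll(page_sizes_mask);
}
```

Replaying at the smallest page size guarantees no translation entry is
skipped, at the cost of more translate calls when larger pages are also
supported.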

* [Qemu-devel] [PATCH qemu v14 12/18] vfio: Check that IOMMU MR translates to system address space
  2016-03-21  7:46 [Qemu-devel] [PATCH qemu v14 00/18] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (10 preceding siblings ...)
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 11/18] memory: Add reporting of supported page sizes Alexey Kardashevskiy
@ 2016-03-21  7:47 ` Alexey Kardashevskiy
  2016-03-22  3:05   ` David Gibson
  2016-03-21  7:47 ` [Qemu-devel] [PATCH qemu v14 13/18] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-21  7:47 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, David Gibson

At the moment an IOMMU MR only translates to the system memory address
space. However, if some new code changes this, we will need a clear
indication of why things are not working, so add the check here.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v14:
* new to the series
---
 hw/vfio/common.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 55723c9..9587c25 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -266,6 +266,12 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
 
     trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
 
+    if (iotlb->target_as != &address_space_memory) {
+        error_report("Wrong target AS \"%s\", only system memory is allowed",
+                     iotlb->target_as->name?iotlb->target_as->name:"noname");
+        return;
+    }
+
     /*
      * The IOMMU TLB entry we have just covers translation through
      * this IOMMU to its immediate target.  We need to translate
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v14 13/18] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2016-03-21  7:46 [Qemu-devel] [PATCH qemu v14 00/18] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (11 preceding siblings ...)
  2016-03-21  7:47 ` [Qemu-devel] [PATCH qemu v14 12/18] vfio: Check that IOMMU MR translates to system address space Alexey Kardashevskiy
@ 2016-03-21  7:47 ` Alexey Kardashevskiy
  2016-03-22  4:04   ` David Gibson
  2016-03-21  7:47 ` [Qemu-devel] [PATCH qemu v14 14/18] spapr_pci: Add and export DMA resetting helper Alexey Kardashevskiy
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-21  7:47 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, David Gibson

This makes use of the new "memory registering" feature. The idea is
to give userspace the ability to notify the host kernel about pages
which are going to be used for DMA. With this information, the host
kernel can pin them all once per user process, do locked-pages
accounting (once), and not spend time doing that at map time, with
possible failures which cannot be handled nicely in some cases.

This adds a prereg memory listener which listens on address_space_memory
and notifies a VFIO container about memory which needs to be
pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.

As there is no per-IOMMU-type release() callback anymore, this stores
the IOMMU type in the container so vfio_listener_release() can decide
whether it needs to unregister @prereg_listener.

The feature is only enabled for SPAPR IOMMU v2; changes to the host
kernel are required. Since v2 does not need/support VFIO_IOMMU_ENABLE,
this does not call it when v2 is detected and enabled.

This does not change the guest visible interface.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v14:
* s/free_container_exit/listener_release_exit/g
* added "if memory_region_is_iommu()" to vfio_prereg_listener_skipped_section
---
 hw/vfio/Makefile.objs         |   1 +
 hw/vfio/common.c              |  38 +++++++++---
 hw/vfio/prereg.c              | 137 ++++++++++++++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-common.h |   4 ++
 trace-events                  |   2 +
 5 files changed, 172 insertions(+), 10 deletions(-)
 create mode 100644 hw/vfio/prereg.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index ceddbb8..5800e0e 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
 obj-$(CONFIG_SOFTMMU) += platform.o
 obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
 obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
+obj-$(CONFIG_SOFTMMU) += prereg.o
 endif
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 9587c25..a8deb16 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -493,6 +493,9 @@ static const MemoryListener vfio_memory_listener = {
 static void vfio_listener_release(VFIOContainer *container)
 {
     memory_listener_unregister(&container->listener);
+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+        memory_listener_unregister(&container->prereg_listener);
+    }
 }
 
 int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
@@ -800,8 +803,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             goto free_container_exit;
         }
 
-        ret = ioctl(fd, VFIO_SET_IOMMU,
-                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
+        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
+        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
         if (ret) {
             error_report("vfio: failed to set iommu for container: %m");
             ret = -errno;
@@ -826,8 +829,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
         if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
             container->iova_pgsizes = info.iova_pgsizes;
         }
-    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
+    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
+               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
         struct vfio_iommu_spapr_tce_info info;
+        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
 
         ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
         if (ret) {
@@ -835,7 +840,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             ret = -errno;
             goto free_container_exit;
         }
-        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
+        container->iommu_type =
+            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
+        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
         if (ret) {
             error_report("vfio: failed to set iommu for container: %m");
             ret = -errno;
@@ -847,11 +854,22 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
          * when container fd is closed so we do not call it explicitly
          * in this file.
          */
-        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
-        if (ret) {
-            error_report("vfio: failed to enable container: %m");
-            ret = -errno;
-            goto free_container_exit;
+        if (!v2) {
+            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
+            if (ret) {
+                error_report("vfio: failed to enable container: %m");
+                ret = -errno;
+                goto free_container_exit;
+            }
+        } else {
+            container->prereg_listener = vfio_prereg_listener;
+
+            memory_listener_register(&container->prereg_listener,
+                                     &address_space_memory);
+            if (container->error) {
+                error_report("vfio: RAM memory listener initialization failed for container");
+                goto listener_release_exit;
+            }
         }
 
         /*
@@ -864,7 +882,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
         if (ret) {
             error_report("vfio: VFIO_IOMMU_SPAPR_TCE_GET_INFO failed: %m");
             ret = -errno;
-            goto free_container_exit;
+            goto listener_release_exit;
         }
         container->min_iova = info.dma32_window_start;
         container->max_iova = container->min_iova + info.dma32_window_size - 1;
diff --git a/hw/vfio/prereg.c b/hw/vfio/prereg.c
new file mode 100644
index 0000000..36c9ff5
--- /dev/null
+++ b/hw/vfio/prereg.c
@@ -0,0 +1,137 @@
+/*
+ * DMA memory preregistration
+ *
+ * Authors:
+ *  Alexey Kardashevskiy <aik@ozlabs.ru>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+#include <linux/vfio.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "qemu/error-report.h"
+#include "trace.h"
+
+static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
+{
+    if (memory_region_is_iommu(section->mr)) {
+        error_report("Cannot possibly preregister IOMMU memory");
+        return true;
+    }
+
+    return !memory_region_is_ram(section->mr) ||
+            memory_region_is_skip_dump(section->mr);
+}
+
+static void vfio_prereg_listener_region_add(MemoryListener *listener,
+                                            MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            prereg_listener);
+    hwaddr gpa;
+    Int128 llend;
+    int ret;
+    hwaddr page_mask = qemu_real_host_page_mask;
+    struct vfio_iommu_spapr_register_memory reg = {
+        .argsz = sizeof(reg),
+        .flags = 0,
+    };
+
+    if (vfio_prereg_listener_skipped_section(section)) {
+        trace_vfio_listener_region_add_skip(
+                section->offset_within_address_space,
+                section->offset_within_address_space +
+                int128_get64(int128_sub(section->size, int128_one())));
+        return;
+    }
+
+    if (unlikely((section->offset_within_address_space & ~page_mask) !=
+                 (section->offset_within_region & ~page_mask))) {
+        error_report("%s received unaligned region", __func__);
+        return;
+    }
+
+    gpa = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
+    llend = int128_make64(section->offset_within_address_space);
+    llend = int128_add(llend, section->size);
+    llend = int128_and(llend, int128_exts64(page_mask));
+
+    g_assert(!int128_ge(int128_make64(gpa), llend));
+
+    memory_region_ref(section->mr);
+
+    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
+        section->offset_within_region +
+        (gpa - section->offset_within_address_space);
+    reg.size = int128_get64(llend) - gpa;
+
+    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
+    trace_vfio_ram_register(reg.vaddr, reg.size, ret ? -errno : 0);
+    if (ret) {
+        /*
+         * On the initfn path, store the first error in the container so we
+         * can gracefully fail.  Runtime, there's not much we can do other
+         * than throw a hardware error.
+         */
+        if (!container->initialized) {
+            if (!container->error) {
+                container->error = ret;
+            }
+        } else {
+            hw_error("vfio: Memory registering failed, unable to continue");
+        }
+    }
+}
+
+static void vfio_prereg_listener_region_del(MemoryListener *listener,
+                                            MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            prereg_listener);
+    hwaddr gpa, end;
+    int ret;
+    hwaddr page_mask = qemu_real_host_page_mask;
+    struct vfio_iommu_spapr_register_memory reg = {
+        .argsz = sizeof(reg),
+        .flags = 0,
+    };
+
+    if (vfio_prereg_listener_skipped_section(section)) {
+        trace_vfio_listener_region_del_skip(
+                section->offset_within_address_space,
+                section->offset_within_address_space +
+                int128_get64(int128_sub(section->size, int128_one())));
+        return;
+    }
+
+    if (unlikely((section->offset_within_address_space & ~page_mask) !=
+                 (section->offset_within_region & ~page_mask))) {
+        error_report("%s received unaligned region", __func__);
+        return;
+    }
+
+    gpa = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
+    end = (section->offset_within_address_space + int128_get64(section->size)) &
+        page_mask;
+
+    if (gpa >= end) {
+        return;
+    }
+
+    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
+        section->offset_within_region +
+        (gpa - section->offset_within_address_space);
+    reg.size = end - gpa;
+
+    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
+    trace_vfio_ram_unregister(reg.vaddr, reg.size, ret ? -errno : 0);
+}
+
+const MemoryListener vfio_prereg_listener = {
+    .region_add = vfio_prereg_listener_region_add,
+    .region_del = vfio_prereg_listener_region_del,
+};
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 5341e05..b861eec 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -73,6 +73,8 @@ typedef struct VFIOContainer {
     VFIOAddressSpace *space;
     int fd; /* /dev/vfio/vfio, empowered by the attached groups */
     MemoryListener listener;
+    MemoryListener prereg_listener;
+    unsigned iommu_type;
     int error;
     bool initialized;
     /*
@@ -156,4 +158,6 @@ extern QLIST_HEAD(vfio_as_head, VFIOAddressSpace) vfio_address_spaces;
 int vfio_get_region_info(VFIODevice *vbasedev, int index,
                          struct vfio_region_info **info);
 #endif
+extern const MemoryListener vfio_prereg_listener;
+
 #endif /* !HW_VFIO_VFIO_COMMON_H */
diff --git a/trace-events b/trace-events
index 6a94736..cc619e1 100644
--- a/trace-events
+++ b/trace-events
@@ -1734,6 +1734,8 @@ vfio_region_mmap(const char *name, unsigned long offset, unsigned long end) "Reg
 vfio_region_exit(const char *name, int index) "Device %s, region %d"
 vfio_region_finalize(const char *name, int index) "Device %s, region %d"
 vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
+vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
+vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
 
 # hw/vfio/platform.c
 vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [Qemu-devel] [PATCH qemu v14 14/18] spapr_pci: Add and export DMA resetting helper
  2016-03-21  7:46 [Qemu-devel] [PATCH qemu v14 00/18] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (12 preceding siblings ...)
  2016-03-21  7:47 ` [Qemu-devel] [PATCH qemu v14 13/18] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
@ 2016-03-21  7:47 ` Alexey Kardashevskiy
  2016-03-21  7:47 ` [Qemu-devel] [PATCH qemu v14 15/18] vfio: Add host side IOMMU capabilities Alexey Kardashevskiy
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-21  7:47 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, David Gibson

This will later be used by the "ibm,reset-pe-dma-window" RTAS handler,
which resets the DMA configuration to the defaults.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr_pci.c          | 10 ++++++++--
 include/hw/pci-host/spapr.h |  2 ++
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 1e53dad..bfcafdf 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1521,9 +1521,8 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
     return 0;
 }
 
-static void spapr_phb_reset(DeviceState *qdev)
+void spapr_phb_dma_reset(sPAPRPHBState *sphb)
 {
-    sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
     sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
     Error *local_err = NULL;
 
@@ -1538,6 +1537,13 @@ static void spapr_phb_reset(DeviceState *qdev)
     if (local_err) {
         error_report_err(local_err);
     }
+}
+
+static void spapr_phb_reset(DeviceState *qdev)
+{
+    sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
+
+    spapr_phb_dma_reset(sphb);
 
     /* Reset the IOMMU state */
     object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL);
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 03ee006..7848366 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -147,4 +147,6 @@ static inline void spapr_phb_vfio_reset(DeviceState *qdev)
 }
 #endif
 
+void spapr_phb_dma_reset(sPAPRPHBState *sphb);
+
 #endif /* __HW_SPAPR_PCI_H__ */
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v14 15/18] vfio: Add host side IOMMU capabilities
  2016-03-21  7:46 [Qemu-devel] [PATCH qemu v14 00/18] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (13 preceding siblings ...)
  2016-03-21  7:47 ` [Qemu-devel] [PATCH qemu v14 14/18] spapr_pci: Add and export DMA resetting helper Alexey Kardashevskiy
@ 2016-03-21  7:47 ` Alexey Kardashevskiy
  2016-03-22  4:20   ` David Gibson
  2016-03-21  7:47 ` [Qemu-devel] [PATCH qemu v14 16/18] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO Alexey Kardashevskiy
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-21  7:47 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, David Gibson

There are going to be multiple IOMMUs per container. This moves
the single host IOMMU parameter set to a list of VFIOHostIOMMU structures.

This should cause no behavioral change and will be used later by
the SPAPR TCE IOMMU v2 code, which will also add a vfio_host_iommu_del() helper.
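To illustrate the idea, here is a minimal standalone sketch of the
containment and overlap checks the per-container window list relies on.
The types and names (HostWindow, window_lookup, windows_overlap) are
simplified stand-ins for illustration only, not QEMU's actual definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t hwaddr;

/* Simplified stand-in for VFIOHostIOMMU: one host DMA window. */
typedef struct HostWindow {
    hwaddr min_iova, max_iova;     /* inclusive bounds */
    uint64_t iova_pgsizes;         /* supported page size mask */
    struct HostWindow *next;
} HostWindow;

/* A lookup succeeds only if a single window fully contains
 * [min, max], mirroring vfio_host_iommu_lookup() in the patch. */
static HostWindow *window_lookup(HostWindow *head, hwaddr min, hwaddr max)
{
    for (HostWindow *w = head; w != NULL; w = w->next) {
        if (w->min_iova <= min && max <= w->max_iova) {
            return w;
        }
    }
    return NULL;
}

/* Adding a window must fail if it overlaps an existing one; this is
 * the inclusive-range form of the ranges_overlap() test. */
static bool windows_overlap(const HostWindow *a, hwaddr min, hwaddr max)
{
    return a->min_iova <= max && min <= a->max_iova;
}
```

A request spanning two windows deliberately fails the lookup: the caller
must find one window that covers the whole IOVA range.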

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/vfio/common.c              | 65 +++++++++++++++++++++++++++++++++----------
 include/hw/vfio/vfio-common.h |  9 ++++--
 2 files changed, 57 insertions(+), 17 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index a8deb16..b257655 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -29,6 +29,7 @@
 #include "exec/memory.h"
 #include "hw/hw.h"
 #include "qemu/error-report.h"
+#include "qemu/range.h"
 #include "sysemu/kvm.h"
 #include "trace.h"
 
@@ -239,6 +240,45 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
     return -errno;
 }
 
+static VFIOHostIOMMU *vfio_host_iommu_lookup(VFIOContainer *container,
+                                             hwaddr min_iova, hwaddr max_iova)
+{
+    VFIOHostIOMMU *hiommu;
+
+    QLIST_FOREACH(hiommu, &container->hiommu_list, hiommu_next) {
+        if (hiommu->min_iova <= min_iova && max_iova <= hiommu->max_iova) {
+            return hiommu;
+        }
+    }
+
+    return NULL;
+}
+
+static int vfio_host_iommu_add(VFIOContainer *container,
+                               hwaddr min_iova, hwaddr max_iova,
+                               uint64_t iova_pgsizes)
+{
+    VFIOHostIOMMU *hiommu;
+
+    QLIST_FOREACH(hiommu, &container->hiommu_list, hiommu_next) {
+        if (ranges_overlap(min_iova, max_iova - min_iova + 1,
+                           hiommu->min_iova,
+                           hiommu->max_iova - hiommu->min_iova + 1)) {
+            error_report("%s: Overlapping IOMMU windows are not supported",
+                         __func__);
+            return -1;
+        }
+    }
+
+    hiommu = g_malloc0(sizeof(*hiommu));
+
+    hiommu->min_iova = min_iova;
+    hiommu->max_iova = max_iova;
+    hiommu->iova_pgsizes = iova_pgsizes;
+    QLIST_INSERT_HEAD(&container->hiommu_list, hiommu, hiommu_next);
+
+    return 0;
+}
+
 static bool vfio_listener_skipped_section(MemoryRegionSection *section)
 {
     return (!memory_region_is_ram(section->mr) &&
@@ -352,7 +392,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
     }
     end = int128_get64(llend);
 
-    if ((iova < container->min_iova) || ((end - 1) > container->max_iova)) {
+    if (!vfio_host_iommu_lookup(container, iova, end - 1)) {
         error_report("vfio: IOMMU container %p can't map guest IOVA region"
                      " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
                      container, iova, end - 1);
@@ -367,10 +407,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
 
         trace_vfio_listener_region_add_iommu(iova, end - 1);
         /*
-         * FIXME: We should do some checking to see if the
-         * capabilities of the host VFIO IOMMU are adequate to model
-         * the guest IOMMU
-         *
          * FIXME: For VFIO iommu types which have KVM acceleration to
          * avoid bouncing all map/unmaps through qemu this way, this
          * would be the right place to wire that up (tell the KVM
@@ -818,16 +854,14 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
          * existing Type1 IOMMUs generally support any IOVA we're
          * going to actually try in practice.
          */
-        container->min_iova = 0;
-        container->max_iova = (hwaddr)-1;
-
-        /* Assume just 4K IOVA page size */
-        container->iova_pgsizes = 0x1000;
         info.argsz = sizeof(info);
         ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
         /* Ignore errors */
         if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
-            container->iova_pgsizes = info.iova_pgsizes;
+            vfio_host_iommu_add(container, 0, (hwaddr)-1, info.iova_pgsizes);
+        } else {
+            /* Assume just 4K IOVA page size */
+            vfio_host_iommu_add(container, 0, (hwaddr)-1, 0x1000);
         }
     } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
                ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
@@ -884,11 +918,12 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             ret = -errno;
             goto listener_release_exit;
         }
-        container->min_iova = info.dma32_window_start;
-        container->max_iova = container->min_iova + info.dma32_window_size - 1;
 
-        /* Assume just 4K IOVA pages for now */
-        container->iova_pgsizes = 0x1000;
+        /* The default table uses 4K pages */
+        vfio_host_iommu_add(container, info.dma32_window_start,
+                            info.dma32_window_start +
+                            info.dma32_window_size - 1,
+                            0x1000);
     } else {
         error_report("vfio: No available IOMMU models");
         ret = -EINVAL;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index b861eec..1b98e33 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -82,9 +82,8 @@ typedef struct VFIOContainer {
      * contiguous IOVA window.  We may need to generalize that in
      * future
      */
-    hwaddr min_iova, max_iova;
-    uint64_t iova_pgsizes;
     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
+    QLIST_HEAD(, VFIOHostIOMMU) hiommu_list;
     QLIST_HEAD(, VFIOGroup) group_list;
     QLIST_ENTRY(VFIOContainer) next;
 } VFIOContainer;
@@ -97,6 +96,12 @@ typedef struct VFIOGuestIOMMU {
     QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
 } VFIOGuestIOMMU;
 
+typedef struct VFIOHostIOMMU {
+    hwaddr min_iova, max_iova;
+    uint64_t iova_pgsizes;
+    QLIST_ENTRY(VFIOHostIOMMU) hiommu_next;
+} VFIOHostIOMMU;
+
 typedef struct VFIODeviceOps VFIODeviceOps;
 
 typedef struct VFIODevice {
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v14 16/18] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO
  2016-03-21  7:46 [Qemu-devel] [PATCH qemu v14 00/18] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (14 preceding siblings ...)
  2016-03-21  7:47 ` [Qemu-devel] [PATCH qemu v14 15/18] vfio: Add host side IOMMU capabilities Alexey Kardashevskiy
@ 2016-03-21  7:47 ` Alexey Kardashevskiy
  2016-03-22  4:45   ` David Gibson
  2016-03-21  7:47 ` [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU Alexey Kardashevskiy
  2016-03-21  7:47 ` [Qemu-devel] [PATCH qemu v14 18/18] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
  17 siblings, 1 reply; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-21  7:47 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, David Gibson

The sPAPR TCE tables manage two copies when VFIO is using an IOMMU:
a guest view of the table and a hardware TCE table. If there is no VFIO
presence in the address space, only the guest view is used and it is
allocated in KVM. However, since there is no support yet for VFIO in
KVM TCE hypercalls, when we start using VFIO we need to move the guest
view from KVM to userspace; and we need to do this for every IOMMU on
a bus with VFIO devices.

This adds vfio_start/vfio_stop callbacks to MemoryRegionIOMMUOps to
notify the IOMMU about a changing environment so it can reallocate
the table to/from KVM or (when available) hook the IOMMU groups up
with the logical bus (LIOBN) in KVM.

This removes the explicit spapr_tce_set_need_vfio() call from the PCI
hotplug path as the new callbacks do this better: they notify the IOMMU
at the exact moment the configuration changes, and this also covers
the case of PCI hot unplug.
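The callback dispatch described above can be sketched as follows. This is
a simplified standalone model for illustration: the ops structure and hook
names follow the patch, but the surrounding types (IOMMUMemoryRegion,
region_add/region_del, the need_vfio flag) are illustrative stand-ins,
not QEMU's real definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct IOMMUMemoryRegion IOMMUMemoryRegion;

typedef struct MemoryRegionIOMMUOps {
    void (*vfio_start)(IOMMUMemoryRegion *iommu);
    void (*vfio_stop)(IOMMUMemoryRegion *iommu);
} MemoryRegionIOMMUOps;

struct IOMMUMemoryRegion {
    const MemoryRegionIOMMUOps *ops;
    bool need_vfio;                /* guest view kept in userspace? */
};

/* The sPAPR TCE hooks simply toggle the "need VFIO" state. */
static void tce_vfio_start(IOMMUMemoryRegion *iommu) { iommu->need_vfio = true; }
static void tce_vfio_stop(IOMMUMemoryRegion *iommu)  { iommu->need_vfio = false; }

static const MemoryRegionIOMMUOps tce_ops = { tce_vfio_start, tce_vfio_stop };

/* The VFIO listener invokes the hooks only when they are provided,
 * mirroring the NULL checks added to vfio_listener_region_add/del. */
static void region_add(IOMMUMemoryRegion *iommu)
{
    if (iommu->ops && iommu->ops->vfio_start) {
        iommu->ops->vfio_start(iommu);
    }
}

static void region_del(IOMMUMemoryRegion *iommu)
{
    if (iommu->ops && iommu->ops->vfio_stop) {
        iommu->ops->vfio_stop(iommu);
    }
}
```

Because the hooks fire from the listener, hot unplug is covered for free:
region_del runs whenever a VFIO-backed region goes away.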

TODO: split into 2 or 3 patches, per maintainership area.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/spapr_iommu.c  | 12 ++++++++++++
 hw/ppc/spapr_pci.c    |  6 ------
 hw/vfio/common.c      |  9 +++++++++
 include/exec/memory.h |  4 ++++
 4 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 6dc3c45..702075d 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -151,6 +151,16 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
     return 1ULL << tcet->page_shift;
 }
 
+static void spapr_tce_vfio_start(MemoryRegion *iommu)
+{
+    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), true);
+}
+
+static void spapr_tce_vfio_stop(MemoryRegion *iommu)
+{
+    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), false);
+}
+
 static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
 static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
 
@@ -211,6 +221,8 @@ static const VMStateDescription vmstate_spapr_tce_table = {
 static MemoryRegionIOMMUOps spapr_iommu_ops = {
     .translate = spapr_tce_translate_iommu,
     .get_page_sizes = spapr_tce_get_page_sizes,
+    .vfio_start = spapr_tce_vfio_start,
+    .vfio_stop = spapr_tce_vfio_stop,
 };
 
 static int spapr_tce_table_realize(DeviceState *dev)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index bfcafdf..af99a36 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1121,12 +1121,6 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
     void *fdt = NULL;
     int fdt_start_offset = 0, fdt_size;
 
-    if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
-        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
-
-        spapr_tce_set_need_vfio(tcet, true);
-    }
-
     if (dev->hotplugged) {
         fdt = create_device_tree(&fdt_size);
         fdt_start_offset = spapr_create_pci_child_dt(phb, pdev, fdt, 0);
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index b257655..4e873b7 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -421,6 +421,9 @@ static void vfio_listener_region_add(MemoryListener *listener,
         QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
 
         memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
+        if (section->mr->iommu_ops && section->mr->iommu_ops->vfio_start) {
+            section->mr->iommu_ops->vfio_start(section->mr);
+        }
         memory_region_iommu_replay(giommu->iommu, &giommu->n,
                                    false);
 
@@ -466,6 +469,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
     hwaddr iova, end;
     int ret;
+    MemoryRegion *iommu = NULL;
 
     if (vfio_listener_skipped_section(section)) {
         trace_vfio_listener_region_del_skip(
@@ -487,6 +491,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
         QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
             if (giommu->iommu == section->mr) {
                 memory_region_unregister_iommu_notifier(&giommu->n);
+                iommu = giommu->iommu;
                 QLIST_REMOVE(giommu, giommu_next);
                 g_free(giommu);
                 break;
@@ -519,6 +524,10 @@ static void vfio_listener_region_del(MemoryListener *listener,
                      "0x%"HWADDR_PRIx") = %d (%m)",
                      container, iova, end - iova, ret);
     }
+
+    if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
+        iommu->iommu_ops->vfio_stop(section->mr);
+    }
 }
 
 static const MemoryListener vfio_memory_listener = {
diff --git a/include/exec/memory.h b/include/exec/memory.h
index eb5ce67..f1de133f 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -152,6 +152,10 @@ struct MemoryRegionIOMMUOps {
     IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
     /* Returns supported page sizes */
     uint64_t (*get_page_sizes)(MemoryRegion *iommu);
+    /* Called when VFIO starts using this */
+    void (*vfio_start)(MemoryRegion *iommu);
+    /* Called when VFIO stops using this */
+    void (*vfio_stop)(MemoryRegion *iommu);
 };
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU
  2016-03-21  7:46 [Qemu-devel] [PATCH qemu v14 00/18] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (15 preceding siblings ...)
  2016-03-21  7:47 ` [Qemu-devel] [PATCH qemu v14 16/18] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO Alexey Kardashevskiy
@ 2016-03-21  7:47 ` Alexey Kardashevskiy
  2016-03-22  5:14   ` David Gibson
  2016-03-21  7:47 ` [Qemu-devel] [PATCH qemu v14 18/18] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
  17 siblings, 1 reply; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-21  7:47 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, David Gibson

The new VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window
management. This adds the ability for VFIO common code to dynamically
allocate/remove DMA windows in the host kernel when a new VFIO
container is added/removed.

This adds a VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
and adds the just-created IOMMU into the host IOMMU list; the opposite
action is taken in vfio_listener_region_del.

When creating a new window, this uses a heuristic to decide on the number
of TCE table levels.

This should cause no guest visible change in behavior.
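The intent of the level heuristic can be sketched with a small standalone
helper: each TCE table level resolves up to 64 entries, so the level count
grows with log64 of the entry count. This is an illustrative model of the
ranges stated in the patch comment, not the exact bit-twiddling expression
used in the code:

```c
#include <assert.h>

/* Level count for a window with the given number of TCE entries,
 * following the stated ranges:
 * 0..64 -> 1; 65..4096 -> 2; 4097..262144 -> 3; 262145.. -> 4
 * (still larger windows would get correspondingly more levels). */
static unsigned tce_levels(unsigned long entries)
{
    unsigned levels = 1;
    unsigned long span = 64;   /* entries one level can address */

    while (entries > span) {
        levels++;
        span *= 64;
    }
    return levels;
}
```

For example, a window of 4096 entries fits in two levels, while 4097
entries tips over into three.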

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v14:
* new to the series

---
TODO:
* export levels to PHB
---
 hw/vfio/common.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
 trace-events     |   2 ++
 2 files changed, 105 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 4e873b7..421d6eb 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -279,6 +279,14 @@ static int vfio_host_iommu_add(VFIOContainer *container,
     return 0;
 }
 
+static void vfio_host_iommu_del(VFIOContainer *container, hwaddr min_iova)
+{
+    VFIOHostIOMMU *hiommu = vfio_host_iommu_lookup(container, min_iova, 0x1000);
+
+    g_assert(hiommu);
+    QLIST_REMOVE(hiommu, hiommu_next);
+}
+
 static bool vfio_listener_skipped_section(MemoryRegionSection *section)
 {
     return (!memory_region_is_ram(section->mr) &&
@@ -392,6 +400,61 @@ static void vfio_listener_region_add(MemoryListener *listener,
     }
     end = int128_get64(llend);
 
+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+        unsigned entries, pages;
+        struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
+
+        g_assert(section->mr->iommu_ops);
+        g_assert(memory_region_is_iommu(section->mr));
+
+        trace_vfio_listener_region_add_iommu(iova, end - 1);
+        /*
+         * FIXME: For VFIO iommu types which have KVM acceleration to
+         * avoid bouncing all map/unmaps through qemu this way, this
+         * would be the right place to wire that up (tell the KVM
+         * device emulation the VFIO iommu handles to use).
+         */
+        create.window_size = memory_region_size(section->mr);
+        create.page_shift =
+                ctz64(section->mr->iommu_ops->get_page_sizes(section->mr));
+        /*
+         * SPAPR host supports multilevel TCE tables, there is some
+         * heuristic to decide how many levels we want for our table:
+         * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
+         */
+        entries = create.window_size >> create.page_shift;
+        pages = (entries * sizeof(uint64_t)) / getpagesize();
+        create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;
+
+        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
+        if (ret) {
+            error_report("Failed to create a window, ret = %d (%m)", ret);
+            goto fail;
+        }
+
+        if (create.start_addr != section->offset_within_address_space ||
+            vfio_host_iommu_lookup(container, create.start_addr,
+                                   create.start_addr + create.window_size - 1)) {
+            struct vfio_iommu_spapr_tce_remove remove = {
+                .argsz = sizeof(remove),
+                .start_addr = create.start_addr
+            };
+            error_report("Host doesn't support DMA window at %"HWADDR_PRIx", must be %"PRIx64,
+                         section->offset_within_address_space,
+                         create.start_addr);
+            ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
+            ret = -EINVAL;
+            goto fail;
+        }
+        trace_vfio_spapr_create_window(create.page_shift,
+                                       create.window_size,
+                                       create.start_addr);
+
+        vfio_host_iommu_add(container, create.start_addr,
+                            create.start_addr + create.window_size - 1,
+                            1ULL << create.page_shift);
+    }
+
     if (!vfio_host_iommu_lookup(container, iova, end - 1)) {
         error_report("vfio: IOMMU container %p can't map guest IOVA region"
                      " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
@@ -525,6 +588,22 @@ static void vfio_listener_region_del(MemoryListener *listener,
                      container, iova, end - iova, ret);
     }
 
+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+        struct vfio_iommu_spapr_tce_remove remove = {
+            .argsz = sizeof(remove),
+            .start_addr = section->offset_within_address_space,
+        };
+        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
+        if (ret) {
+            error_report("Failed to remove window at %"PRIx64,
+                         remove.start_addr);
+        }
+
+        vfio_host_iommu_del(container, section->offset_within_address_space);
+
+        trace_vfio_spapr_remove_window(remove.start_addr);
+    }
+
     if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
         iommu->iommu_ops->vfio_stop(section->mr);
     }
@@ -928,11 +1007,30 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             goto listener_release_exit;
         }
 
-        /* The default table uses 4K pages */
-        vfio_host_iommu_add(container, info.dma32_window_start,
-                            info.dma32_window_start +
-                            info.dma32_window_size - 1,
-                            0x1000);
+        if (v2) {
+            /*
+             * There is a default window in just created container.
+             * To make region_add/del simpler, we better remove this
+             * window now and let those iommu_listener callbacks
+             * create/remove them when needed.
+             */
+            struct vfio_iommu_spapr_tce_remove remove = {
+                .argsz = sizeof(remove),
+                .start_addr = info.dma32_window_start,
+            };
+            ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
+            if (ret) {
+                error_report("vfio: VFIO_IOMMU_SPAPR_TCE_REMOVE failed: %m");
+                ret = -errno;
+                goto free_container_exit;
+            }
+        } else {
+            /* The default table uses 4K pages */
+            vfio_host_iommu_add(container, info.dma32_window_start,
+                                info.dma32_window_start +
+                                info.dma32_window_size - 1,
+                                0x1000);
+        }
     } else {
         error_report("vfio: No available IOMMU models");
         ret = -EINVAL;
diff --git a/trace-events b/trace-events
index cc619e1..f2b75a3 100644
--- a/trace-events
+++ b/trace-events
@@ -1736,6 +1736,8 @@ vfio_region_finalize(const char *name, int index) "Device %s, region %d"
 vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
 vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
 vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
+vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
+vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
 
 # hw/vfio/platform.c
 vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [Qemu-devel] [PATCH qemu v14 18/18] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-03-21  7:46 [Qemu-devel] [PATCH qemu v14 00/18] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (16 preceding siblings ...)
  2016-03-21  7:47 ` [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU Alexey Kardashevskiy
@ 2016-03-21  7:47 ` Alexey Kardashevskiy
  2016-03-23  2:13   ` David Gibson
  17 siblings, 1 reply; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-21  7:47 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, David Gibson

This adds support for the Dynamic DMA Windows (DDW) option defined by
the sPAPR specification, which allows a guest to have additional DMA window(s).

This implements DDW for emulated and VFIO devices.
This reserves RTAS token numbers for DDW calls.

This changes the TCE table migration descriptor to support dynamic
tables: from now on, the PHB will create as many stub TCE table objects
as it can possibly support, but not all of them may be initialized at
the time of migration because DDW may or may not have been requested by
the guest.

The "ddw" property is enabled by default on a PHB, but for compatibility
the pseries-2.5 and older machines disable it.

This implements DDW for VFIO; host kernel support is required.
This adds a "levels" property to the PHB to control the number of levels
in the actual TCE table allocated by the host kernel; 0, the default
value, tells QEMU to calculate the correct value. Current hardware
supports up to 5 levels.

Existing Linux guests try to create one additional huge DMA window
with 64K or 16MB pages and map the entire guest RAM into it. If that
succeeds, the guest switches to dma_direct_ops and never calls the TCE
hypercalls (H_PUT_TCE, ...) again. This enables VFIO devices to use the
entire RAM without wasting time on map/unmap later. This adds a
"dma64_win_addr" property, the bus address of the 64-bit window; it
defaults to 0x800.0000.0000.0000 as this is what modern POWER8 hardware
uses, and this allows having emulated and VFIO devices on the same bus.

This adds 4 RTAS handlers:
* ibm,query-pe-dma-window
* ibm,create-pe-dma-window
* ibm,remove-pe-dma-window
* ibm,reset-pe-dma-window
These are registered from type_init() callback.

These RTAS handlers are implemented in a separate file to avoid polluting
spapr_iommu.c with PCI code.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/Makefile.objs        |   1 +
 hw/ppc/spapr.c              |   7 +-
 hw/ppc/spapr_pci.c          |  73 ++++++++---
 hw/ppc/spapr_rtas_ddw.c     | 300 ++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/common.c            |   5 -
 include/hw/pci-host/spapr.h |  13 ++
 include/hw/ppc/spapr.h      |  16 ++-
 trace-events                |   4 +
 8 files changed, 395 insertions(+), 24 deletions(-)
 create mode 100644 hw/ppc/spapr_rtas_ddw.c

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index c1ffc77..986b36f 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
 ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
 obj-y += spapr_pci_vfio.o
 endif
+obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
 # PowerPC 4xx boards
 obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
 obj-y += ppc4xx_pci.o
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index d0bb423..ef4c637 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2362,7 +2362,12 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
  * pseries-2.5
  */
 #define SPAPR_COMPAT_2_5 \
-        HW_COMPAT_2_5
+        HW_COMPAT_2_5 \
+        {\
+            .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
+            .property = "ddw",\
+            .value    = stringify(off),\
+        },
 
 static void spapr_machine_2_5_instance_options(MachineState *machine)
 {
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index af99a36..3bb294a 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -803,12 +803,12 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
     return buf;
 }
 
-static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
-                                       uint32_t liobn,
-                                       uint32_t page_shift,
-                                       uint64_t window_addr,
-                                       uint64_t window_size,
-                                       Error **errp)
+void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
+                                 uint32_t liobn,
+                                 uint32_t page_shift,
+                                 uint64_t window_addr,
+                                 uint64_t window_size,
+                                 Error **errp)
 {
     sPAPRTCETable *tcet;
     uint32_t nb_table = window_size >> page_shift;
@@ -825,10 +825,16 @@ static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
         return;
     }
 
+    if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
+        error_setg(errp,
+                   "Attempt to use second window when DDW is disabled on PHB");
+        return;
+    }
+
     spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table);
 }
 
-static int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
+int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
 {
     sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
 
@@ -1492,14 +1498,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     }
 
     /* DMA setup */
-    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
-    if (!tcet) {
-        error_report("No default TCE table for %s", sphb->dtbusname);
-        return;
-    }
 
-    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
-                                        spapr_tce_get_iommu(tcet), 0);
+    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
+        tcet = spapr_tce_new_table(DEVICE(sphb),
+                                   SPAPR_PCI_LIOBN(sphb->index, i));
+        if (!tcet) {
+            error_setg(errp, "Creating window#%d failed for %s",
+                       i, sphb->dtbusname);
+            return;
+        }
+        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
+                                            spapr_tce_get_iommu(tcet), 0);
+    }
 
     sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
 }
@@ -1517,11 +1527,16 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
 
 void spapr_phb_dma_reset(sPAPRPHBState *sphb)
 {
-    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
     Error *local_err = NULL;
+    int i;
 
-    if (tcet && tcet->enabled) {
-        spapr_phb_dma_window_disable(sphb, sphb->dma_liobn);
+    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
+        uint32_t liobn = SPAPR_PCI_LIOBN(sphb->index, i);
+        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
+
+        if (tcet && tcet->enabled) {
+            spapr_phb_dma_window_disable(sphb, liobn);
+        }
     }
 
     /* Register default 32bit DMA window */
@@ -1562,6 +1577,13 @@ static Property spapr_phb_properties[] = {
     /* Default DMA window is 0..1GB */
     DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
     DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
+    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
+                       0x800000000000000ULL),
+    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
+    DEFINE_PROP_UINT32("windows", sPAPRPHBState, windows_supported,
+                       SPAPR_PCI_DMA_MAX_WINDOWS),
+    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
+                       (1ULL << 12) | (1ULL << 16) | (1ULL << 24)),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -1815,6 +1837,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     uint32_t interrupt_map_mask[] = {
         cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
     uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
+    uint32_t ddw_applicable[] = {
+        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
+        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
+        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
+    };
+    uint32_t ddw_extensions[] = {
+        cpu_to_be32(1),
+        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
+    };
     sPAPRTCETable *tcet;
     PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
     sPAPRFDT s_fdt;
@@ -1839,6 +1870,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
     _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
 
+    /* Dynamic DMA window */
+    if (phb->ddw_enabled) {
+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
+                         sizeof(ddw_applicable)));
+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
+                         &ddw_extensions, sizeof(ddw_extensions)));
+    }
+
     /* Build the interrupt-map, this must matches what is done
      * in pci_spapr_map_irq
      */
diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
new file mode 100644
index 0000000..37f805f
--- /dev/null
+++ b/hw/ppc/spapr_rtas_ddw.c
@@ -0,0 +1,300 @@
+/*
+ * QEMU sPAPR Dynamic DMA windows support
+ *
+ * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License,
+ *  or (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/error-report.h"
+#include "hw/ppc/spapr.h"
+#include "hw/pci-host/spapr.h"
+#include "trace.h"
+
+static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
+{
+    sPAPRTCETable *tcet;
+
+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
+    if (tcet && tcet->enabled) {
+        ++*(unsigned *)opaque;
+    }
+    return 0;
+}
+
+static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
+{
+    unsigned ret = 0;
+
+    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
+
+    return ret;
+}
+
+static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
+{
+    sPAPRTCETable *tcet;
+
+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
+    if (tcet && !tcet->enabled) {
+        *(uint32_t *)opaque = tcet->liobn;
+        return 1;
+    }
+    return 0;
+}
+
+static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
+{
+    uint32_t liobn = 0;
+
+    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
+
+    return liobn;
+}
+
+static uint32_t spapr_query_mask(struct ppc_one_seg_page_size *sps,
+                                 uint64_t page_mask)
+{
+    int i, j;
+    uint32_t mask = 0;
+    const struct { int shift; uint32_t mask; } masks[] = {
+        { 12, RTAS_DDW_PGSIZE_4K },
+        { 16, RTAS_DDW_PGSIZE_64K },
+        { 24, RTAS_DDW_PGSIZE_16M },
+        { 25, RTAS_DDW_PGSIZE_32M },
+        { 26, RTAS_DDW_PGSIZE_64M },
+        { 27, RTAS_DDW_PGSIZE_128M },
+        { 28, RTAS_DDW_PGSIZE_256M },
+        { 34, RTAS_DDW_PGSIZE_16G },
+    };
+
+    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
+        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
+            if ((sps[i].page_shift == masks[j].shift) &&
+                    (page_mask & (1ULL << masks[j].shift))) {
+                mask |= masks[j].mask;
+            }
+        }
+    }
+
+    return mask;
+}
+
+static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
+                                         sPAPRMachineState *spapr,
+                                         uint32_t token, uint32_t nargs,
+                                         target_ulong args,
+                                         uint32_t nret, target_ulong rets)
+{
+    CPUPPCState *env = &cpu->env;
+    sPAPRPHBState *sphb;
+    uint64_t buid, max_window_size;
+    uint32_t avail, addr, pgmask = 0;
+
+    if ((nargs != 3) || (nret != 5)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    /* Work out supported page masks */
+    pgmask = spapr_query_mask(env->sps.sps, sphb->page_size_mask);
+
+    /*
+     * This is "Largest contiguous block of TCEs allocated specifically
+     * for (that is, are reserved for) this PE".
+     * Return the maximum number, i.e. the maximum supported RAM size in 4K pages.
+     */
+    max_window_size = MACHINE(spapr)->maxram_size >> SPAPR_TCE_PAGE_SHIFT;
+
+    avail = sphb->windows_supported - spapr_phb_get_active_win_num(sphb);
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    rtas_st(rets, 1, avail);
+    rtas_st(rets, 2, max_window_size);
+    rtas_st(rets, 3, pgmask);
+    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
+
+    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
+                                          sPAPRMachineState *spapr,
+                                          uint32_t token, uint32_t nargs,
+                                          target_ulong args,
+                                          uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    sPAPRTCETable *tcet = NULL;
+    uint32_t addr, page_shift, window_shift, liobn;
+    uint64_t buid;
+    Error *local_err = NULL;
+
+    if ((nargs != 5) || (nret != 4)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    page_shift = rtas_ld(args, 3);
+    window_shift = rtas_ld(args, 4);
+    liobn = spapr_phb_get_free_liobn(sphb);
+
+    if (!liobn || !(sphb->page_size_mask & (1ULL << page_shift)) ||
+        spapr_phb_get_active_win_num(sphb) == sphb->windows_supported) {
+        goto hw_error_exit;
+    }
+
+    if (window_shift < page_shift) {
+        goto param_error_exit;
+    }
+
+    spapr_phb_dma_window_enable(sphb, liobn, page_shift,
+                                sphb->dma64_window_addr,
+                                1ULL << window_shift, &local_err);
+    if (local_err) {
+        error_report_err(local_err);
+        goto hw_error_exit;
+    }
+
+    tcet = spapr_tce_find_by_liobn(liobn);
+    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
+                                 1ULL << window_shift,
+                                 tcet ? tcet->bus_offset : 0xbaadf00d, liobn);
+    if (local_err || !tcet) {
+        goto hw_error_exit;
+    }
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    rtas_st(rets, 1, liobn);
+    rtas_st(rets, 2, tcet->bus_offset >> 32);
+    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
+
+    return;
+
+hw_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
+                                          sPAPRMachineState *spapr,
+                                          uint32_t token, uint32_t nargs,
+                                          target_ulong args,
+                                          uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    sPAPRTCETable *tcet;
+    uint32_t liobn;
+    long ret;
+
+    if ((nargs != 1) || (nret != 1)) {
+        goto param_error_exit;
+    }
+
+    liobn = rtas_ld(args, 0);
+    tcet = spapr_tce_find_by_liobn(liobn);
+    if (!tcet) {
+        goto param_error_exit;
+    }
+
+    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
+    if (!sphb || !sphb->ddw_enabled || !spapr_phb_get_active_win_num(sphb)) {
+        goto param_error_exit;
+    }
+
+    ret = spapr_phb_dma_window_disable(sphb, liobn);
+    trace_spapr_iommu_ddw_remove(liobn, ret);
+    if (ret) {
+        goto hw_error_exit;
+    }
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    return;
+
+hw_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
+                                         sPAPRMachineState *spapr,
+                                         uint32_t token, uint32_t nargs,
+                                         target_ulong args,
+                                         uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    uint64_t buid;
+    uint32_t addr;
+
+    if ((nargs != 3) || (nret != 1)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    spapr_phb_dma_reset(sphb);
+    trace_spapr_iommu_ddw_reset(buid, addr);
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void spapr_rtas_ddw_init(void)
+{
+    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
+                        "ibm,query-pe-dma-window",
+                        rtas_ibm_query_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
+                        "ibm,create-pe-dma-window",
+                        rtas_ibm_create_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
+                        "ibm,remove-pe-dma-window",
+                        rtas_ibm_remove_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
+                        "ibm,reset-pe-dma-window",
+                        rtas_ibm_reset_pe_dma_window);
+}
+
+type_init(spapr_rtas_ddw_init)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 421d6eb..b0ea146 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -994,11 +994,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             }
         }
 
-        /*
-         * This only considers the host IOMMU's 32-bit window.  At
-         * some point we need to add support for the optional 64-bit
-         * window and dynamic windows
-         */
         info.argsz = sizeof(info);
         ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
         if (ret) {
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 7848366..e81b751 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -71,6 +71,11 @@ struct sPAPRPHBState {
     spapr_pci_msi_mig *msi_devs;
 
     QLIST_ENTRY(sPAPRPHBState) list;
+
+    bool ddw_enabled;
+    uint32_t windows_supported;
+    uint64_t page_size_mask;
+    uint64_t dma64_window_addr;
 };
 
 #define SPAPR_PCI_MAX_INDEX          255
@@ -89,6 +94,8 @@ struct sPAPRPHBState {
 
 #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
 
+#define SPAPR_PCI_DMA_MAX_WINDOWS    2
+
 static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb, int pin)
 {
     sPAPRMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
@@ -148,5 +155,11 @@ static inline void spapr_phb_vfio_reset(DeviceState *qdev)
 #endif
 
 void spapr_phb_dma_reset(sPAPRPHBState *sphb);
+void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
+                                uint32_t liobn, uint32_t page_shift,
+                                uint64_t window_addr,
+                                uint64_t window_size,
+                                 Error **errp);
+int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn);
 
 #endif /* __HW_SPAPR_PCI_H__ */
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 471eb4a..41b32c6 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -417,6 +417,16 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
 #define RTAS_OUT_NOT_AUTHORIZED                 -9002
 #define RTAS_OUT_SYSPARM_PARAM_ERROR            -9999
 
+/* DDW pagesize mask values from ibm,query-pe-dma-window */
+#define RTAS_DDW_PGSIZE_4K       0x01
+#define RTAS_DDW_PGSIZE_64K      0x02
+#define RTAS_DDW_PGSIZE_16M      0x04
+#define RTAS_DDW_PGSIZE_32M      0x08
+#define RTAS_DDW_PGSIZE_64M      0x10
+#define RTAS_DDW_PGSIZE_128M     0x20
+#define RTAS_DDW_PGSIZE_256M     0x40
+#define RTAS_DDW_PGSIZE_16G      0x80
+
 /* RTAS tokens */
 #define RTAS_TOKEN_BASE      0x2000
 
@@ -458,8 +468,12 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
 #define RTAS_IBM_SET_SLOT_RESET                 (RTAS_TOKEN_BASE + 0x23)
 #define RTAS_IBM_CONFIGURE_PE                   (RTAS_TOKEN_BASE + 0x24)
 #define RTAS_IBM_SLOT_ERROR_DETAIL              (RTAS_TOKEN_BASE + 0x25)
+#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x26)
+#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x27)
+#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x28)
+#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x29)
 
-#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x26)
+#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x2A)
 
 /* RTAS ibm,get-system-parameter token values */
 #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
diff --git a/trace-events b/trace-events
index f2b75a3..e68d0e4 100644
--- a/trace-events
+++ b/trace-events
@@ -1431,6 +1431,10 @@ spapr_iommu_pci_indirect(uint64_t liobn, uint64_t ioba, uint64_t tce, uint64_t i
 spapr_iommu_pci_stuff(uint64_t liobn, uint64_t ioba, uint64_t tce_value, uint64_t npages, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcevalue=0x%"PRIx64" npages=%"PRId64" ret=%"PRId64
 spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, unsigned pgsize) "liobn=%"PRIx64" 0x%"PRIx64" -> 0x%"PRIx64" perm=%u mask=%x"
 spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
+spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32
+spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, uint64_t pg_size, uint64_t req_size, uint64_t start, uint32_t liobn) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%"PRIx64", requested=0x%"PRIx64", start addr=%"PRIx64", liobn=%"PRIx32
+spapr_iommu_ddw_remove(uint32_t liobn, long ret) "liobn=%"PRIx32", ret = %ld"
+spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr) "buid=%"PRIx64" addr=%"PRIx32
 
 # hw/ppc/ppc.c
 ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
-- 
2.5.0.rc3


* Re: [Qemu-devel] [PATCH qemu v14 01/18] memory: Fix IOMMU replay base address
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 01/18] memory: Fix IOMMU replay base address Alexey Kardashevskiy
@ 2016-03-22  0:49   ` David Gibson
  2016-03-22  3:12     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 64+ messages in thread
From: David Gibson @ 2016-03-22  0:49 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 6033 bytes --]

On Mon, Mar 21, 2016 at 06:46:49PM +1100, Alexey Kardashevskiy wrote:
> Since a788f227 "memory: Allow replay of IOMMU mapping notifications",
> when a new VFIO listener is added, all existing IOMMU mappings are
> replayed. However, the base address of an IOMMU memory region
> (IOMMU MR) is ignored. This is not a problem for the existing user
> (pseries) with its default 32-bit DMA window starting at 0, but it
> is if there is another DMA window.
> 
> This stores the IOMMU's offset_within_address_space and adjusts
> the IOVA before calling vfio_dma_map/vfio_dma_unmap.
> 
> As the IOMMU notifier expects IOVA offset rather than the absolute
> address, this also adjusts IOVA in sPAPR H_PUT_TCE handler before
> calling notifier(s).
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

On a closer look, I realised this still isn't quite correct, although
I don't think any cases which would break it exist or are planned.

> ---
>  hw/ppc/spapr_iommu.c          |  2 +-
>  hw/vfio/common.c              | 14 ++++++++------
>  include/hw/vfio/vfio-common.h |  1 +
>  3 files changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 7dd4588..277f289 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -277,7 +277,7 @@ static target_ulong put_tce_emu(sPAPRTCETable *tcet, target_ulong ioba,
>      tcet->table[index] = tce;
>  
>      entry.target_as = &address_space_memory,
> -    entry.iova = ioba & page_mask;
> +    entry.iova = (ioba - tcet->bus_offset) & page_mask;
>      entry.translated_addr = tce & page_mask;
>      entry.addr_mask = ~page_mask;
>      entry.perm = spapr_tce_iommu_access_flags(tce);

This bit's right.

> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index fb588d8..d45e2db 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -257,14 +257,14 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
>      VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>      VFIOContainer *container = giommu->container;
>      IOMMUTLBEntry *iotlb = data;
> +    hwaddr iova = iotlb->iova + giommu->offset_within_address_space;

This bit might be right, depending on how you define giommu->offset_within_address_space.

>      MemoryRegion *mr;
>      hwaddr xlat;
>      hwaddr len = iotlb->addr_mask + 1;
>      void *vaddr;
>      int ret;
>  
> -    trace_vfio_iommu_map_notify(iotlb->iova,
> -                                iotlb->iova + iotlb->addr_mask);
> +    trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
>  
>      /*
>       * The IOMMU TLB entry we have just covers translation through
> @@ -291,21 +291,21 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
>  
>      if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
>          vaddr = memory_region_get_ram_ptr(mr) + xlat;
> -        ret = vfio_dma_map(container, iotlb->iova,
> +        ret = vfio_dma_map(container, iova,
>                             iotlb->addr_mask + 1, vaddr,
>                             !(iotlb->perm & IOMMU_WO) || mr->readonly);
>          if (ret) {
>              error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>                           "0x%"HWADDR_PRIx", %p) = %d (%m)",
> -                         container, iotlb->iova,
> +                         container, iova,
>                           iotlb->addr_mask + 1, vaddr, ret);
>          }
>      } else {
> -        ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
> +        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
>          if (ret) {
>              error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>                           "0x%"HWADDR_PRIx") = %d (%m)",
> -                         container, iotlb->iova,
> +                         container, iova,
>                           iotlb->addr_mask + 1, ret);
>          }
>      }

This is fine.

> @@ -377,6 +377,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
>           */
>          giommu = g_malloc0(sizeof(*giommu));
>          giommu->iommu = section->mr;
> +        giommu->offset_within_address_space =
> +            section->offset_within_address_space;

But here there's a problem.  The iova in IOMMUTLBEntry is relative to
the IOMMU MemoryRegion, but - at least in theory - only a subsection
of that MemoryRegion could be mapped into the AddressSpace.

So, to find the IOVA within the AddressSpace from the IOVA within the
MemoryRegion, you need to first subtract the section's offset within
the MemoryRegion, then add the section's offset within the
AddressSpace.

You could precalculate the combined delta here, but...

>          giommu->container = container;
>          giommu->n.notify = vfio_iommu_map_notify;
>          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index eb0e1b0..5341e05 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -90,6 +90,7 @@ typedef struct VFIOContainer {
>  typedef struct VFIOGuestIOMMU {
>      VFIOContainer *container;
>      MemoryRegion *iommu;
> +    hwaddr offset_within_address_space;

...it might be simpler to replace both the iommu and
offset_within_address_space fields here with a pointer to the
MemoryRegionSection instead, which should give you all the info you
need.

It might also be worth adding Paolo to the CC for this patch, since he
knows the MemoryRegion stuff better than anyone.

>      Notifier n;
>      QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>  } VFIOGuestIOMMU;

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH qemu v14 03/18] spapr_pci: Move DMA window enablement to a helper
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 03/18] spapr_pci: Move DMA window enablement to a helper Alexey Kardashevskiy
@ 2016-03-22  1:02   ` David Gibson
  2016-03-22  3:17     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 64+ messages in thread
From: David Gibson @ 2016-03-22  1:02 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 4242 bytes --]

On Mon, Mar 21, 2016 at 06:46:51PM +1100, Alexey Kardashevskiy wrote:
> We are going to have multiple DMA windows soon so let's start preparing.
> 
> This adds a new helper to create a DMA window and makes use of it in
> sPAPRPHBState::realize().
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

With one tweak...

> ---
> Changes:
> v14:
> * replaced "int" return with Error* in spapr_phb_dma_window_enable()
> ---
>  hw/ppc/spapr_pci.c | 47 ++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 34 insertions(+), 13 deletions(-)
> 
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 79baa7b..18332bf 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -803,6 +803,33 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
>      return buf;
>  }
>  
> +static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> +                                       uint32_t liobn,
> +                                       uint32_t page_shift,
> +                                       uint64_t window_addr,
> +                                       uint64_t window_size,
> +                                       Error **errp)
> +{
> +    sPAPRTCETable *tcet;
> +    uint32_t nb_table = window_size >> page_shift;
> +
> +    if (!nb_table) {
> +        error_setg(errp, "Zero size table");
> +        return;
> +    }
> +
> +    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, window_addr,
> +                               page_shift, nb_table, false);
> +    if (!tcet) {
> +        error_setg(errp, "Unable to create TCE table liobn %x for %s",
> +                   liobn, sphb->dtbusname);
> +        return;
> +    }
> +
> +    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
> +                                spapr_tce_get_iommu(tcet));
> +}
> +
>  /* Macros to operate with address in OF binding to PCI */
>  #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
>  #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
> @@ -1307,8 +1334,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>      int i;
>      PCIBus *bus;
>      uint64_t msi_window_size = 4096;
> -    sPAPRTCETable *tcet;
> -    uint32_t nb_table;
> +    Error *local_err = NULL;
>  
>      if (sphb->index != (uint32_t)-1) {
>          hwaddr windows_base;
> @@ -1460,18 +1486,13 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>          }
>      }
>  
> -    nb_table = sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT;
> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn,
> -                               0, SPAPR_TCE_PAGE_SHIFT, nb_table, false);
> -    if (!tcet) {
> -        error_setg(errp, "Unable to create TCE table for %s",
> -                   sphb->dtbusname);
> -        return;
> -    }
> -
>      /* Register default 32bit DMA window */
> -    memory_region_add_subregion(&sphb->iommu_root, sphb->dma_win_addr,
> -                                spapr_tce_get_iommu(tcet));
> +    spapr_phb_dma_window_enable(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
> +                                sphb->dma_win_addr, sphb->dma_win_size,
> +                                &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);

Should be a return; here so we don't continue if there's an error.

Actually.. that's not really right, we should be cleaning up all setup
we've done already on the failure path.  Without that I think we'll
leak some objects on a failed device_add.

But.. there are already a bunch of cases here that will do that, so we
can clean that up separately.  Probably the sanest way would be to add
an unrealize function() that can handle a partially realized object
and make sure it's called on all the error paths.
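The cleanup pattern being suggested can be sketched in isolation. This is a minimal, self-contained stand-in (the names phb_realize/phb_unrealize and the two-resource layout are hypothetical, not QEMU code): unrealize() tolerates a partially realized object, so every error path in realize() can funnel through the one cleanup routine without leaking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct {
    void *dma_window;   /* stands in for the TCE table / subregion */
    void *msi_table;    /* stands in for the MSI hash table */
} PhbState;

/* Safe on a partially realized object: frees only what was set up,
 * and leaves the struct in a state where calling it again is a no-op. */
static void phb_unrealize(PhbState *s)
{
    free(s->msi_table);
    s->msi_table = NULL;
    free(s->dma_window);
    s->dma_window = NULL;
}

/* fail_at simulates an error at a given setup step (0 = no failure). */
static bool phb_realize(PhbState *s, int fail_at)
{
    s->dma_window = NULL;
    s->msi_table = NULL;

    s->dma_window = malloc(64);
    if (fail_at == 1 || !s->dma_window) {
        phb_unrealize(s);       /* single cleanup path, no leak */
        return false;
    }
    s->msi_table = malloc(64);
    if (fail_at == 2 || !s->msi_table) {
        phb_unrealize(s);
        return false;
    }
    return true;
}
```

The point is that each error path makes one call rather than hand-rolling its own partial teardown, which is exactly what keeps a failed device_add from leaking.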

> +    }
>  
>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>  }


* Re: [Qemu-devel] [PATCH qemu v14 05/18] spapr_iommu: Introduce "enabled" state for TCE table
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 05/18] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
@ 2016-03-22  1:11   ` David Gibson
  0 siblings, 0 replies; 64+ messages in thread
From: David Gibson @ 2016-03-22  1:11 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel


On Mon, Mar 21, 2016 at 06:46:53PM +1100, Alexey Kardashevskiy wrote:
> Currently TCE tables are created once at start and their sizes never
> change. We are going to change that by introducing Dynamic DMA windows
> support, where the DMA configuration may change during guest execution.
> 
> This changes spapr_tce_new_table() to create an empty zero-size IOMMU
> memory region (IOMMU MR). Only LIOBN is assigned by the time of creation.
> It still will be called once at the owner object (VIO or PHB) creation.
> 
> This introduces an "enabled" state for TCE table objects with two
> helper functions - spapr_tce_table_enable()/spapr_tce_table_disable().
> - spapr_tce_table_enable() receives TCE table parameters, allocates
> a guest view of the TCE table (in the user space or KVM) and
> sets the correct size on the IOMMU MR.
> - spapr_tce_table_disable() disposes the table and resets the IOMMU MR
> size.
> 
> This changes the PHB reset handler to do the default DMA initialization
> instead of spapr_phb_realize(). This makes no difference now, but later,
> with more than just one DMA window, we will have to remove them all
> and create the default one on a system reset.
> 
> No visible change in behaviour is expected, except that the actual table
> will be reallocated on every reset. We might optimize this later.
> 
> The other way to implement this would be to dynamically create/remove
> the TCE table QOM objects, but this would make migration impossible,
> as the migration code expects all QOM objects to exist at the receiver,
> so the TCE table objects have to be created before migration begins.
> 
> spapr_tce_table_do_enable() is separated from spapr_tce_table_enable()
> as later it will be called at the sPAPRTCETable post-migration stage when
> it already has all the properties set after the migration; the same is
> done for spapr_tce_table_disable().
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

R-b stands, but I noticed one nit:

> ---
> Changes:
> v14:
> * added spapr_tce_table_do_disable(), will make difference in following
> patch with fully dynamic table migration
> ---
>  hw/ppc/spapr_iommu.c   | 86 ++++++++++++++++++++++++++++++++++++--------------
>  hw/ppc/spapr_pci.c     | 13 ++++++--
>  hw/ppc/spapr_vio.c     |  8 ++---
>  include/hw/ppc/spapr.h | 10 +++---
>  4 files changed, 81 insertions(+), 36 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 8132f64..9bcd3f6 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -17,6 +17,7 @@
>   * License along with this library; if not, see <http://www.gnu.org/licenses/>.
>   */
>  #include "qemu/osdep.h"
> +#include "qemu/error-report.h"
>  #include "hw/hw.h"
>  #include "sysemu/kvm.h"
>  #include "hw/qdev.h"
> @@ -174,15 +175,9 @@ static int spapr_tce_table_realize(DeviceState *dev)
>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
>  
>      tcet->fd = -1;
> -    tcet->table = spapr_tce_alloc_table(tcet->liobn,
> -                                        tcet->page_shift,
> -                                        tcet->nb_table,
> -                                        &tcet->fd,
> -                                        tcet->need_vfio);
> -
> +    tcet->need_vfio = false;
>      memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
> -                             "iommu-spapr",
> -                             (uint64_t)tcet->nb_table << tcet->page_shift);
> +                             "iommu-spapr", 0);
>  
>      QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
>  
> @@ -224,14 +219,10 @@ void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio)
>      tcet->table = newtable;
>  }
>  
> -sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
> -                                   uint64_t bus_offset,
> -                                   uint32_t page_shift,
> -                                   uint32_t nb_table,
> -                                   bool need_vfio)
> +sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn)
>  {
>      sPAPRTCETable *tcet;
> -    char tmp[64];
> +    char tmp[32];
>  
>      if (spapr_tce_find_by_liobn(liobn)) {
>          fprintf(stderr, "Attempted to create TCE table with duplicate"
> @@ -239,16 +230,8 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
>          return NULL;
>      }
>  
> -    if (!nb_table) {
> -        return NULL;
> -    }
> -
>      tcet = SPAPR_TCE_TABLE(object_new(TYPE_SPAPR_TCE_TABLE));
>      tcet->liobn = liobn;
> -    tcet->bus_offset = bus_offset;
> -    tcet->page_shift = page_shift;
> -    tcet->nb_table = nb_table;
> -    tcet->need_vfio = need_vfio;
>  
>      snprintf(tmp, sizeof(tmp), "tce-table-%x", liobn);
>      object_property_add_child(OBJECT(owner), tmp, OBJECT(tcet), NULL);
> @@ -258,14 +241,69 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
>      return tcet;
>  }
>  
> +static void spapr_tce_table_do_enable(sPAPRTCETable *tcet)
> +{
> +    if (!tcet->nb_table) {
> +        return;
> +    }
> +
> +    tcet->table = spapr_tce_alloc_table(tcet->liobn,
> +                                        tcet->page_shift,
> +                                        tcet->nb_table,
> +                                        &tcet->fd,
> +                                        tcet->need_vfio);
> +
> +    memory_region_set_size(&tcet->iommu,
> +                           (uint64_t)tcet->nb_table << tcet->page_shift);
> +
> +    tcet->enabled = true;
> +}
> +
> +void spapr_tce_table_enable(sPAPRTCETable *tcet,
> +                            uint32_t page_shift, uint64_t bus_offset,
> +                            uint32_t nb_table)
> +{
> +    if (tcet->enabled) {
> +        error_report("Warning: trying to enable already enabled TCE table");
> +        return;
> +    }
> +
> +    tcet->bus_offset = bus_offset;
> +    tcet->page_shift = page_shift;
> +    tcet->nb_table = nb_table;
> +
> +    spapr_tce_table_do_enable(tcet);
> +}
> +
> +static void spapr_tce_table_do_disable(sPAPRTCETable *tcet)
> +{
> +    memory_region_set_size(&tcet->iommu, 0);
> +
> +    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
> +    tcet->fd = -1;
> +    tcet->table = NULL;
> +    tcet->enabled = false;
> +    tcet->bus_offset = 0;
> +    tcet->page_shift = 0;
> +    tcet->nb_table = 0;
> +}
> +
> +static void spapr_tce_table_disable(sPAPRTCETable *tcet)
> +{
> +    if (!tcet->enabled) {
> +        error_report("Warning: trying to disable already disabled TCE table");
> +        return;
> +    }
> +    spapr_tce_table_do_disable(tcet);
> +}
> +
>  static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
>  {
>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
>  
>      QLIST_REMOVE(tcet, list);
>  
> -    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
> -    tcet->fd = -1;
> +    spapr_tce_table_disable(tcet);
>  }
>  
>  MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 18332bf..df5f7b9 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -818,14 +818,15 @@ static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>          return;
>      }
>  
> -    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, window_addr,
> -                               page_shift, nb_table, false);
> +    tcet = spapr_tce_find_by_liobn(liobn);
>      if (!tcet) {
>          error_setg(errp, "Unable to create TCE table liobn %x for %s",
>                     liobn, sphb->dtbusname);
>          return;
>      }
>  
> +    spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table);
> +
>      memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
>                                  spapr_tce_get_iommu(tcet));
>  }
> @@ -1335,6 +1336,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>      PCIBus *bus;
>      uint64_t msi_window_size = 4096;
>      Error *local_err = NULL;
> +    sPAPRTCETable *tcet;
>  
>      if (sphb->index != (uint32_t)-1) {
>          hwaddr windows_base;
> @@ -1486,6 +1488,13 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>          }
>      }
>  
> +    /* DMA setup */
> +    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
> +    if (!tcet) {
> +        error_report("No default TCE table for %s", sphb->dtbusname);
> +        return;

You have an errp in this function, so you should use error_setg() rather
than just error_report().
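The distinction can be shown with a stripped-down stand-in for QEMU's error API (the Error type and error_setg() here are simplified sketches of qapi/error, not the real implementation): error_report() only logs to the terminal, while error_setg() hands the failure back through errp so the caller of realize() — e.g. device_add — actually sees it.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for QEMU's Error / error_setg(). */
typedef struct Error {
    const char *msg;
} Error;

static Error err_storage;

static void error_setg(Error **errp, const char *msg)
{
    if (errp) {
        err_storage.msg = msg;
        *errp = &err_storage;
    }
}

/* Hypothetical realize sketch: the TCE table lookup "fails". */
static void spapr_phb_realize_sketch(Error **errp)
{
    void *tcet = NULL;   /* pretend spapr_tce_new_table() returned NULL */
    if (!tcet) {
        /* error_report() would only print; error_setg() propagates the
         * failure to whoever called realize. */
        error_setg(errp, "No default TCE table");
        return;
    }
}
```

With error_report() alone, local_err would stay NULL and the caller would wrongly treat realize as having succeeded.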

> +    }
> +
>      /* Register default 32bit DMA window */
>      spapr_phb_dma_window_enable(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
>                                  sphb->dma_win_addr, sphb->dma_win_size,
> diff --git a/hw/ppc/spapr_vio.c b/hw/ppc/spapr_vio.c
> index 0f61a55..7f57290 100644
> --- a/hw/ppc/spapr_vio.c
> +++ b/hw/ppc/spapr_vio.c
> @@ -481,11 +481,9 @@ static void spapr_vio_busdev_realize(DeviceState *qdev, Error **errp)
>          memory_region_add_subregion_overlap(&dev->mrroot, 0, &dev->mrbypass, 1);
>          address_space_init(&dev->as, &dev->mrroot, qdev->id);
>  
> -        dev->tcet = spapr_tce_new_table(qdev, liobn,
> -                                        0,
> -                                        SPAPR_TCE_PAGE_SHIFT,
> -                                        pc->rtce_window_size >>
> -                                        SPAPR_TCE_PAGE_SHIFT, false);
> +        dev->tcet = spapr_tce_new_table(qdev, liobn);
> +        spapr_tce_table_enable(dev->tcet, SPAPR_TCE_PAGE_SHIFT, 0,
> +                               pc->rtce_window_size >> SPAPR_TCE_PAGE_SHIFT);
>          dev->tcet->vdev = dev;
>          memory_region_add_subregion_overlap(&dev->mrroot, 0,
>                                              spapr_tce_get_iommu(dev->tcet), 2);
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 098d85d..75b0b55 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -539,6 +539,7 @@ typedef struct sPAPRTCETable sPAPRTCETable;
>  
>  struct sPAPRTCETable {
>      DeviceState parent;
> +    bool enabled;
>      uint32_t liobn;
>      uint32_t nb_table;
>      uint64_t bus_offset;
> @@ -566,11 +567,10 @@ void spapr_events_fdt_skel(void *fdt, uint32_t epow_irq);
>  int spapr_h_cas_compose_response(sPAPRMachineState *sm,
>                                   target_ulong addr, target_ulong size,
>                                   bool cpu_update, bool memory_update);
> -sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
> -                                   uint64_t bus_offset,
> -                                   uint32_t page_shift,
> -                                   uint32_t nb_table,
> -                                   bool need_vfio);
> +sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn);
> +void spapr_tce_table_enable(sPAPRTCETable *tcet,
> +                            uint32_t page_shift, uint64_t bus_offset,
> +                            uint32_t nb_table);
>  void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio);
>  
>  MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet);


* Re: [Qemu-devel] [PATCH qemu v14 06/18] spapr_iommu: Finish renaming vfio_accel to need_vfio
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 06/18] spapr_iommu: Finish renaming vfio_accel to need_vfio Alexey Kardashevskiy
@ 2016-03-22  1:18   ` David Gibson
  0 siblings, 0 replies; 64+ messages in thread
From: David Gibson @ 2016-03-22  1:18 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel


On Mon, Mar 21, 2016 at 06:46:54PM +1100, Alexey Kardashevskiy wrote:
> 6a81dd17 "spapr_iommu: Rename vfio_accel parameter" renamed the vfio_accel
> flag everywhere, but one spot was missed.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>


> ---
>  target-ppc/kvm_ppc.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/target-ppc/kvm_ppc.h b/target-ppc/kvm_ppc.h
> index fc79312..3b2090e 100644
> --- a/target-ppc/kvm_ppc.h
> +++ b/target-ppc/kvm_ppc.h
> @@ -163,7 +163,7 @@ static inline bool kvmppc_spapr_use_multitce(void)
>  
>  static inline void *kvmppc_create_spapr_tce(uint32_t liobn,
>                                              uint32_t window_size, int *fd,
> -                                            bool vfio_accel)
> +                                            bool need_vfio)
>  {
>      return NULL;
>  }


* Re: [Qemu-devel] [PATCH qemu v14 07/18] spapr_iommu: Realloc table during migration
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 07/18] spapr_iommu: Realloc table during migration Alexey Kardashevskiy
@ 2016-03-22  1:23   ` David Gibson
  0 siblings, 0 replies; 64+ messages in thread
From: David Gibson @ 2016-03-22  1:23 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel


On Mon, Mar 21, 2016 at 06:46:55PM +1100, Alexey Kardashevskiy wrote:
> The source guest could have reallocated the default TCE table and
> migrated a bigger/smaller table. This adds reallocation in post_load()
> if the default table size differs between source and destination.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v14:
> * new to the series
> ---
>  hw/ppc/spapr_iommu.c   | 36 ++++++++++++++++++++++++++++++++++--
>  include/hw/ppc/spapr.h |  2 ++
>  2 files changed, 36 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 9bcd3f6..549cd94 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -137,6 +137,16 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
>      return ret;
>  }
>  
> +static void spapr_tce_table_pre_save(void *opaque)
> +{
> +    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> +
> +    tcet->mig_table = tcet->table;

Don't you need to set mig_nb_table here as well?  I can't see anywhere
else it's initialized.
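The fix being asked for is that pre_save must snapshot both the table pointer and its element count, since the VMState field list transmits mig_nb_table and post_load consumes it. A pared-down sketch (the struct is a hypothetical stand-in for sPAPRTCETable, not the real definition):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for sPAPRTCETable: live state plus the
 * migration shadow fields that actually go on the wire. */
typedef struct {
    uint64_t *table;
    uint32_t nb_table;
    uint64_t *mig_table;
    uint32_t mig_nb_table;
} TceTable;

static void tce_table_pre_save(TceTable *t)
{
    t->mig_table = t->table;
    t->mig_nb_table = t->nb_table;   /* the assignment the review says is missing */
}
```

Without the second assignment, mig_nb_table would be sent uninitialized (or stale), and the destination's resize-on-mismatch logic in post_load would act on garbage.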

> +}
> +
> +static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
> +static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
> +
>  static int spapr_tce_table_post_load(void *opaque, int version_id)
>  {
>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> @@ -145,6 +155,26 @@ static int spapr_tce_table_post_load(void *opaque, int version_id)
>          spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
>      }
>  
> +    if (tcet->enabled) {
> +        if (tcet->nb_table != tcet->mig_nb_table) {
> +            if (tcet->nb_table) {
> +                spapr_tce_table_do_disable(tcet);
> +            }
> +            tcet->nb_table = tcet->mig_nb_table;
> +            spapr_tce_table_do_enable(tcet);
> +        }
> +
> +        memcpy(tcet->table, tcet->mig_table,
> +               tcet->nb_table * sizeof(tcet->table[0]));
> +
> +        free(tcet->mig_table);
> +        tcet->mig_table = NULL;
> +
> +    } else if (tcet->table) {
> +        /* Destination guest has a default table but source does not -> free */
> +        spapr_tce_table_do_disable(tcet);
> +    }
> +

Clunky, but I don't know of a better way.

>      return 0;
>  }
>  
> @@ -152,15 +182,17 @@ static const VMStateDescription vmstate_spapr_tce_table = {
>      .name = "spapr_iommu",
>      .version_id = 2,
>      .minimum_version_id = 2,
> +    .pre_save = spapr_tce_table_pre_save,
>      .post_load = spapr_tce_table_post_load,
>      .fields      = (VMStateField []) {
>          /* Sanity check */
>          VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
> -        VMSTATE_UINT32_EQUAL(nb_table, sPAPRTCETable),
>  
>          /* IOMMU state */
> +        VMSTATE_UINT32(mig_nb_table, sPAPRTCETable),
>          VMSTATE_BOOL(bypass, sPAPRTCETable),
> -        VMSTATE_VARRAY_UINT32(table, sPAPRTCETable, nb_table, 0, vmstate_info_uint64, uint64_t),
> +        VMSTATE_VARRAY_UINT32_ALLOC(mig_table, sPAPRTCETable, nb_table, 0,
> +                                    vmstate_info_uint64, uint64_t),
>  
>          VMSTATE_END_OF_LIST()
>      },
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 75b0b55..c1ea49c 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -545,6 +545,8 @@ struct sPAPRTCETable {
>      uint64_t bus_offset;
>      uint32_t page_shift;
>      uint64_t *table;
> +    uint32_t mig_nb_table;
> +    uint64_t *mig_table;
>      bool bypass;
>      bool need_vfio;
>      int fd;


* Re: [Qemu-devel] [PATCH qemu v14 08/18] spapr_iommu: Migrate full state
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 08/18] spapr_iommu: Migrate full state Alexey Kardashevskiy
@ 2016-03-22  1:31   ` David Gibson
  0 siblings, 0 replies; 64+ messages in thread
From: David Gibson @ 2016-03-22  1:31 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel


On Mon, Mar 21, 2016 at 06:46:56PM +1100, Alexey Kardashevskiy wrote:
> This adds @bus_offset, @page_shift, @enabled members to migration stream.
> These cannot change without dynamic DMA windows so no change in
> behavior is expected.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

I think you should combine this patch with the previous one.  They're
both simple, and the functions in the previous one check
tcet->enabled, which doesn't make a lot of sense if you're not
migrating that value.

The version bump here looks correct, but it will break migration of
(for example) a pseries-2.5 VM running under qemu-2.7 back into
qemu-2.5.  That sort of backwards migration isn't considered
essential, but it is nice to have (and it's something RH cares about
downstream).

So, if possible it would be preferable to do the migration in a
backwards compatible way.  The standard trick for that seems to be to
add an optional section with the extra info, and make the "needed"
function return true iff the parameters differ from the defaults.
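The "needed" trick can be sketched generically. This mirrors the spirit of QEMU's VMStateDescription subsection convention, but the types, field choices, and the default value here are hypothetical simplifications: the predicate returns true only when the state cannot be reconstructed from defaults, so a stream sent to an older destination simply omits the subsection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the extra sPAPRTCETable state added in v3. */
typedef struct {
    bool enabled;
    uint64_t bus_offset;
    uint32_t page_shift;
} TceState;

#define DEFAULT_PAGE_SHIFT 12u   /* hypothetical default (4K pages) */

/* Return true iff the subsection must be transmitted, i.e. the
 * parameters differ from what the destination would assume anyway. */
static bool spapr_tce_ddw_needed(const TceState *s)
{
    return s->enabled &&
           (s->bus_offset != 0 || s->page_shift != DEFAULT_PAGE_SHIFT);
}
```

A guest that never requested a dynamic window keeps the defaults, the subsection is skipped, and backwards migration to a QEMU that has never heard of it still works.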

> ---
> Changes:
> v14:
> * new to the series
> ---
>  hw/ppc/spapr_iommu.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 549cd94..5ea5948 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -180,7 +180,7 @@ static int spapr_tce_table_post_load(void *opaque, int version_id)
>  
>  static const VMStateDescription vmstate_spapr_tce_table = {
>      .name = "spapr_iommu",
> -    .version_id = 2,
> +    .version_id = 3,
>      .minimum_version_id = 2,
>      .pre_save = spapr_tce_table_pre_save,
>      .post_load = spapr_tce_table_post_load,
> @@ -189,6 +189,9 @@ static const VMStateDescription vmstate_spapr_tce_table = {
>          VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
>  
>          /* IOMMU state */
> +        VMSTATE_BOOL_V(enabled, sPAPRTCETable, 3),
> +        VMSTATE_UINT64_V(bus_offset, sPAPRTCETable, 3),
> +        VMSTATE_UINT32_V(page_shift, sPAPRTCETable, 3),
>          VMSTATE_UINT32(mig_nb_table, sPAPRTCETable),
>          VMSTATE_BOOL(bypass, sPAPRTCETable),
>          VMSTATE_VARRAY_UINT32_ALLOC(mig_table, sPAPRTCETable, nb_table, 0,


* Re: [Qemu-devel] [PATCH qemu v14 11/18] memory: Add reporting of supported page sizes
  2016-03-21  7:46 ` [Qemu-devel] [PATCH qemu v14 11/18] memory: Add reporting of supported page sizes Alexey Kardashevskiy
@ 2016-03-22  3:02   ` David Gibson
  0 siblings, 0 replies; 64+ messages in thread
From: David Gibson @ 2016-03-22  3:02 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel


On Mon, Mar 21, 2016 at 06:46:59PM +1100, Alexey Kardashevskiy wrote:
> Every IOMMU has some granularity which MemoryRegionIOMMUOps::translate
> uses when translating; however, this information is not available outside
> the translate context for various checks.
> 
> This adds a get_page_sizes callback to MemoryRegionIOMMUOps and
> a wrapper for it so IOMMU users (such as VFIO) can know the actual
> page size(s) used by an IOMMU.
> 
> The qemu_real_host_page_mask is used as fallback.

You're still mismatching concepts here.  The MemoryRegionIOMMUOps
represents a guest IOMMU, so falling back to qemu_real_host_page_mask
(a host property) makes no sense.  I think what you want is to fall
back to TARGET_PAGE_SIZE.

> This removes vfio_container_granularity() and uses new callback in
> memory_region_iommu_replay() when replaying IOMMU mappings on added
> IOMMU memory region.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v14:
> * removed vfio_container_granularity(), changed memory_region_iommu_replay()
> 
> v4:
> * s/1<<TARGET_PAGE_BITS/qemu_real_host_page_size/
> ---
>  hw/ppc/spapr_iommu.c  |  8 ++++++++
>  hw/vfio/common.c      |  6 ------
>  include/exec/memory.h | 18 ++++++++++++++----
>  memory.c              | 17 ++++++++++++++---
>  4 files changed, 36 insertions(+), 13 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index dd662da..6dc3c45 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -144,6 +144,13 @@ static void spapr_tce_table_pre_save(void *opaque)
>      tcet->mig_table = tcet->table;
>  }
>  
> +static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
> +{
> +    sPAPRTCETable *tcet = container_of(iommu, sPAPRTCETable, iommu);
> +
> +    return 1ULL << tcet->page_shift;
> +}
> +
>  static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
>  static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
>  
> @@ -203,6 +210,7 @@ static const VMStateDescription vmstate_spapr_tce_table = {
>  
>  static MemoryRegionIOMMUOps spapr_iommu_ops = {
>      .translate = spapr_tce_translate_iommu,
> +    .get_page_sizes = spapr_tce_get_page_sizes,
>  };
>  
>  static int spapr_tce_table_realize(DeviceState *dev)
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index d45e2db..55723c9 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -313,11 +313,6 @@ out:
>      rcu_read_unlock();
>  }
>  
> -static hwaddr vfio_container_granularity(VFIOContainer *container)
> -{
> -    return (hwaddr)1 << ctz64(container->iova_pgsizes);
> -}
> -
>  static void vfio_listener_region_add(MemoryListener *listener,
>                                       MemoryRegionSection *section)
>  {
> @@ -385,7 +380,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
>  
>          memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
>          memory_region_iommu_replay(giommu->iommu, &giommu->n,
> -                                   vfio_container_granularity(container),
>                                     false);
>  
>          return;
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 2de7898..eb5ce67 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -150,6 +150,8 @@ typedef struct MemoryRegionIOMMUOps MemoryRegionIOMMUOps;
>  struct MemoryRegionIOMMUOps {
>      /* Return a TLB entry that contains a given address. */
>      IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
> +    /* Returns supported page sizes */
> +    uint64_t (*get_page_sizes)(MemoryRegion *iommu);
>  };
>  
>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> @@ -573,6 +575,15 @@ static inline bool memory_region_is_iommu(MemoryRegion *mr)
>  
>  
>  /**
> + * memory_region_iommu_get_page_sizes: get supported page sizes in an iommu
> + *
> + * Returns %bitmap of supported page sizes for an iommu.
> + *
> + * @mr: the memory region being queried
> + */
> +uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr);
> +
> +/**
>   * memory_region_notify_iommu: notify a change in an IOMMU translation entry.
>   *
>   * @mr: the memory region that was changed
> @@ -596,16 +607,15 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n);
>  
>  /**
>   * memory_region_iommu_replay: replay existing IOMMU translations to
> - * a notifier
> + * a notifier with the minimum page granularity returned by
> + * mr->iommu_ops->get_page_sizes().
>   *
>   * @mr: the memory region to observe
>   * @n: the notifier to which to replay iommu mappings
> - * @granularity: Minimum page granularity to replay notifications for
>   * @is_write: Whether to treat the replay as a translate "write"
>   *     through the iommu
>   */
> -void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n,
> -                                hwaddr granularity, bool is_write);
> +void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write);
>  
>  /**
>   * memory_region_unregister_iommu_notifier: unregister a notifier for
> diff --git a/memory.c b/memory.c
> index 95f7209..64a84d3 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1512,12 +1512,14 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n)
>      notifier_list_add(&mr->iommu_notify, n);
>  }
>  
> -void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n,
> -                                hwaddr granularity, bool is_write)
> +void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write)
>  {
> -    hwaddr addr;
> +    hwaddr addr, granularity;
>      IOMMUTLBEntry iotlb;
>  
> +    g_assert(mr->iommu_ops && mr->iommu_ops->get_page_sizes);
> +    granularity = (hwaddr)1 << ctz64(mr->iommu_ops->get_page_sizes(mr));
> +
>      for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
>          iotlb = mr->iommu_ops->translate(mr, addr, is_write);
>          if (iotlb.perm != IOMMU_NONE) {
> @@ -1544,6 +1546,15 @@ void memory_region_notify_iommu(MemoryRegion *mr,
>      notifier_list_notify(&mr->iommu_notify, &entry);
>  }
>  
> +uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr)
> +{
> +    assert(memory_region_is_iommu(mr));
> +    if (mr->iommu_ops && mr->iommu_ops->get_page_sizes) {
> +        return mr->iommu_ops->get_page_sizes(mr);
> +    }
> +    return qemu_real_host_page_size;
> +}
> +
>  void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client)
>  {
>      uint8_t mask = 1 << client;


* Re: [Qemu-devel] [PATCH qemu v14 12/18] vfio: Check that IOMMU MR translates to system address space
  2016-03-21  7:47 ` [Qemu-devel] [PATCH qemu v14 12/18] vfio: Check that IOMMU MR translates to system address space Alexey Kardashevskiy
@ 2016-03-22  3:05   ` David Gibson
  2016-03-22 15:47     ` Alex Williamson
  0 siblings, 1 reply; 64+ messages in thread
From: David Gibson @ 2016-03-22  3:05 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1529 bytes --]

On Mon, Mar 21, 2016 at 06:47:00PM +1100, Alexey Kardashevskiy wrote:
> At the moment an IOMMU MR only translates to the system memory address
> space. However, if some new code changes this, we will need a clear
> indication of why it is not working, so here is the check.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Alex, any chance we could merge this quickly, since it is a reasonable
sanity check even without the rest of the changes.

> ---
> Changes:
> v14:
> * new to the series
> ---
>  hw/vfio/common.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 55723c9..9587c25 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -266,6 +266,12 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
>  
>      trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
>  
> +    if (iotlb->target_as != &address_space_memory) {
> +        error_report("Wrong target AS \"%s\", only system memory is allowed",
> +                     iotlb->target_as->name ? iotlb->target_as->name : "noname");
> +        return;
> +    }
> +
>      /*
>       * The IOMMU TLB entry we have just covers translation through
>       * this IOMMU to its immediate target.  We need to translate




* Re: [Qemu-devel] [PATCH qemu v14 01/18] memory: Fix IOMMU replay base address
  2016-03-22  0:49   ` David Gibson
@ 2016-03-22  3:12     ` Alexey Kardashevskiy
  2016-03-22  3:26       ` David Gibson
  0 siblings, 1 reply; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-22  3:12 UTC (permalink / raw)
  To: David Gibson; +Cc: Paolo Bonzini, Alex Williamson, qemu-ppc, qemu-devel

On 03/22/2016 11:49 AM, David Gibson wrote:
> On Mon, Mar 21, 2016 at 06:46:49PM +1100, Alexey Kardashevskiy wrote:
>> Since a788f227 "memory: Allow replay of IOMMU mapping notifications",
>> when a new VFIO listener is added, all existing IOMMU mappings are
>> replayed. However, there is a problem: the base address of an IOMMU
>> memory region (IOMMU MR) is ignored. This is not a problem for the
>> existing user (pseries) with its default 32-bit DMA window starting
>> at 0, but it is if there is another DMA window.
>>
>> This stores the IOMMU's offset_within_address_space and adjusts
>> the IOVA before calling vfio_dma_map/vfio_dma_unmap.
>>
>> As the IOMMU notifier expects IOVA offset rather than the absolute
>> address, this also adjusts IOVA in sPAPR H_PUT_TCE handler before
>> calling notifier(s).
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>
> On a closer look, I realised this still isn't quite correct, although
> I don't think any cases which would break it exist or are planned.
>
>> ---
>>   hw/ppc/spapr_iommu.c          |  2 +-
>>   hw/vfio/common.c              | 14 ++++++++------
>>   include/hw/vfio/vfio-common.h |  1 +
>>   3 files changed, 10 insertions(+), 7 deletions(-)
>>
>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>> index 7dd4588..277f289 100644
>> --- a/hw/ppc/spapr_iommu.c
>> +++ b/hw/ppc/spapr_iommu.c
>> @@ -277,7 +277,7 @@ static target_ulong put_tce_emu(sPAPRTCETable *tcet, target_ulong ioba,
>>       tcet->table[index] = tce;
>>
>>       entry.target_as = &address_space_memory,
>> -    entry.iova = ioba & page_mask;
>> +    entry.iova = (ioba - tcet->bus_offset) & page_mask;
>>       entry.translated_addr = tce & page_mask;
>>       entry.addr_mask = ~page_mask;
>>       entry.perm = spapr_tce_iommu_access_flags(tce);
>
> This bit's right/
>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index fb588d8..d45e2db 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -257,14 +257,14 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
>>       VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>       VFIOContainer *container = giommu->container;
>>       IOMMUTLBEntry *iotlb = data;
>> +    hwaddr iova = iotlb->iova + giommu->offset_within_address_space;
>
> This bit might be right, depending on how you define giommu->offset_within_address_space.
>
>>       MemoryRegion *mr;
>>       hwaddr xlat;
>>       hwaddr len = iotlb->addr_mask + 1;
>>       void *vaddr;
>>       int ret;
>>
>> -    trace_vfio_iommu_map_notify(iotlb->iova,
>> -                                iotlb->iova + iotlb->addr_mask);
>> +    trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
>>
>>       /*
>>        * The IOMMU TLB entry we have just covers translation through
>> @@ -291,21 +291,21 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
>>
>>       if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
>>           vaddr = memory_region_get_ram_ptr(mr) + xlat;
>> -        ret = vfio_dma_map(container, iotlb->iova,
>> +        ret = vfio_dma_map(container, iova,
>>                              iotlb->addr_mask + 1, vaddr,
>>                              !(iotlb->perm & IOMMU_WO) || mr->readonly);
>>           if (ret) {
>>               error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>>                            "0x%"HWADDR_PRIx", %p) = %d (%m)",
>> -                         container, iotlb->iova,
>> +                         container, iova,
>>                            iotlb->addr_mask + 1, vaddr, ret);
>>           }
>>       } else {
>> -        ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
>> +        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
>>           if (ret) {
>>               error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>>                            "0x%"HWADDR_PRIx") = %d (%m)",
>> -                         container, iotlb->iova,
>> +                         container, iova,
>>                            iotlb->addr_mask + 1, ret);
>>           }
>>       }
>
> This is fine.
>
>> @@ -377,6 +377,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>            */
>>           giommu = g_malloc0(sizeof(*giommu));
>>           giommu->iommu = section->mr;
>> +        giommu->offset_within_address_space =
>> +            section->offset_within_address_space;
>
> But here there's a problem.  The iova in IOMMUTLBEntry is relative to
> the IOMMU MemoryRegion, but - at least in theory - only a subsection
> of that MemoryRegion could be mapped into the AddressSpace.

But the IOMMU MR stays the same: its size and offset do not change, and the
iova is relative to its start, so why does it matter if only a portion of it
is mapped?


> So, to find the IOVA within the AddressSpace, from the IOVA within the
> MemoryRegion, you need to first subtract the section's offset within
> the MemoryRegion, then add the section's offset within the
> AddressSpace.
>
> You could precalculate the combined delta here, but...
>
>
>>           giommu->container = container;
>>           giommu->n.notify = vfio_iommu_map_notify;
>>           QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index eb0e1b0..5341e05 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -90,6 +90,7 @@ typedef struct VFIOContainer {
>>   typedef struct VFIOGuestIOMMU {
>>       VFIOContainer *container;
>>       MemoryRegion *iommu;
>> +    hwaddr offset_within_address_space;
>
> ...it might be simpler to replace both the iommu and
> offset_within_address_space fields here with a pointer to the
> MemoryRegionSection instead, which should give you all the info you
> need.


MemoryRegionSection is allocated on the stack in listener_add_address_space()
and seems to be a temporary object in general, so a pointer to it cannot be
stored safely.


>
> It might also be worth adding Paolo to the CC for this patch, since he
> knows the MemoryRegion stuff better than anyone.


Right, I have added him to cc: now.

>
>>       Notifier n;
>>       QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>>   } VFIOGuestIOMMU;
>


-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v14 03/18] spapr_pci: Move DMA window enablement to a helper
  2016-03-22  1:02   ` David Gibson
@ 2016-03-22  3:17     ` Alexey Kardashevskiy
  2016-03-22  3:28       ` David Gibson
  0 siblings, 1 reply; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-22  3:17 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On 03/22/2016 12:02 PM, David Gibson wrote:
> On Mon, Mar 21, 2016 at 06:46:51PM +1100, Alexey Kardashevskiy wrote:
>> We are going to have multiple DMA windows soon so let's start preparing.
>>
>> This adds a new helper to create a DMA window and makes use of it in
>> sPAPRPHBState::realize().
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>
> With one tweak..
>
>> ---
>> Changes:
>> v14:
>> * replaced "int" return to Error* in spapr_phb_dma_window_enable()
>> ---
>>   hw/ppc/spapr_pci.c | 47 ++++++++++++++++++++++++++++++++++-------------
>>   1 file changed, 34 insertions(+), 13 deletions(-)
>>
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index 79baa7b..18332bf 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -803,6 +803,33 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
>>       return buf;
>>   }
>>
>> +static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>> +                                       uint32_t liobn,
>> +                                       uint32_t page_shift,
>> +                                       uint64_t window_addr,
>> +                                       uint64_t window_size,
>> +                                       Error **errp)
>> +{
>> +    sPAPRTCETable *tcet;
>> +    uint32_t nb_table = window_size >> page_shift;
>> +
>> +    if (!nb_table) {
>> +        error_setg(errp, "Zero size table");
>> +        return;
>> +    }
>> +
>> +    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, window_addr,
>> +                               page_shift, nb_table, false);
>> +    if (!tcet) {
>> +        error_setg(errp, "Unable to create TCE table liobn %x for %s",
>> +                   liobn, sphb->dtbusname);
>> +        return;
>> +    }
>> +
>> +    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
>> +                                spapr_tce_get_iommu(tcet));
>> +}
>> +
>>   /* Macros to operate with address in OF binding to PCI */
>>   #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
>>   #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
>> @@ -1307,8 +1334,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>       int i;
>>       PCIBus *bus;
>>       uint64_t msi_window_size = 4096;
>> -    sPAPRTCETable *tcet;
>> -    uint32_t nb_table;
>> +    Error *local_err = NULL;
>>
>>       if (sphb->index != (uint32_t)-1) {
>>           hwaddr windows_base;
>> @@ -1460,18 +1486,13 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>           }
>>       }
>>
>> -    nb_table = sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT;
>> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn,
>> -                               0, SPAPR_TCE_PAGE_SHIFT, nb_table, false);
>> -    if (!tcet) {
>> -        error_setg(errp, "Unable to create TCE table for %s",
>> -                   sphb->dtbusname);
>> -        return;
>> -    }
>> -
>>       /* Register default 32bit DMA window */
>> -    memory_region_add_subregion(&sphb->iommu_root, sphb->dma_win_addr,
>> -                                spapr_tce_get_iommu(tcet));
>> +    spapr_phb_dma_window_enable(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
>> +                                sphb->dma_win_addr, sphb->dma_win_size,
>> +                                &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>
> Should be a return; here so we don't continue if there's an error.
>
> Actually.. that's not really right, we should be cleaning up all setup
> we've done already on the failure path.  Without that I think we'll
> leak some objects on a failed device_add.
>
> But.. there are already a bunch of cases here that will do that, so we
> can clean that up separately.  Probably the sanest way would be to add
> an unrealize function() that can handle a partially realized object
> and make sure it's called on all the error paths.


So what do I do with this patch right now: leave it as is, add the "return",
implement unrealize(), ...? In practice, being unable to create a PHB is a
fatal error today, as we do not have PHB hotplug yet (which is what
unrealize() would be for).


>
>> +    }
>>
>>       sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>>   }
>




* Re: [Qemu-devel] [PATCH qemu v14 01/18] memory: Fix IOMMU replay base address
  2016-03-22  3:12     ` Alexey Kardashevskiy
@ 2016-03-22  3:26       ` David Gibson
  2016-03-22  4:28         ` Alexey Kardashevskiy
  2016-03-23 10:58         ` Paolo Bonzini
  0 siblings, 2 replies; 64+ messages in thread
From: David Gibson @ 2016-03-22  3:26 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Paolo Bonzini, Alex Williamson, qemu-ppc, qemu-devel


On Tue, Mar 22, 2016 at 02:12:30PM +1100, Alexey Kardashevskiy wrote:
> On 03/22/2016 11:49 AM, David Gibson wrote:
> >On Mon, Mar 21, 2016 at 06:46:49PM +1100, Alexey Kardashevskiy wrote:
> >>Since a788f227 "memory: Allow replay of IOMMU mapping notifications",
> >>when a new VFIO listener is added, all existing IOMMU mappings are
> >>replayed. However, there is a problem: the base address of an IOMMU
> >>memory region (IOMMU MR) is ignored. This is not a problem for the
> >>existing user (pseries) with its default 32-bit DMA window starting
> >>at 0, but it is if there is another DMA window.
> >>
> >>This stores the IOMMU's offset_within_address_space and adjusts
> >>the IOVA before calling vfio_dma_map/vfio_dma_unmap.
> >>
> >>As the IOMMU notifier expects IOVA offset rather than the absolute
> >>address, this also adjusts IOVA in sPAPR H_PUT_TCE handler before
> >>calling notifier(s).
> >>
> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> >
> >On a closer look, I realised this still isn't quite correct, although
> >I don't think any cases which would break it exist or are planned.
> >
> >>---
> >>  hw/ppc/spapr_iommu.c          |  2 +-
> >>  hw/vfio/common.c              | 14 ++++++++------
> >>  include/hw/vfio/vfio-common.h |  1 +
> >>  3 files changed, 10 insertions(+), 7 deletions(-)
> >>
> >>diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> >>index 7dd4588..277f289 100644
> >>--- a/hw/ppc/spapr_iommu.c
> >>+++ b/hw/ppc/spapr_iommu.c
> >>@@ -277,7 +277,7 @@ static target_ulong put_tce_emu(sPAPRTCETable *tcet, target_ulong ioba,
> >>      tcet->table[index] = tce;
> >>
> >>      entry.target_as = &address_space_memory,
> >>-    entry.iova = ioba & page_mask;
> >>+    entry.iova = (ioba - tcet->bus_offset) & page_mask;
> >>      entry.translated_addr = tce & page_mask;
> >>      entry.addr_mask = ~page_mask;
> >>      entry.perm = spapr_tce_iommu_access_flags(tce);
> >
> >This bit's right/
> >
> >>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>index fb588d8..d45e2db 100644
> >>--- a/hw/vfio/common.c
> >>+++ b/hw/vfio/common.c
> >>@@ -257,14 +257,14 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
> >>      VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
> >>      VFIOContainer *container = giommu->container;
> >>      IOMMUTLBEntry *iotlb = data;
> >>+    hwaddr iova = iotlb->iova + giommu->offset_within_address_space;
> >
> >This bit might be right, depending on how you define giommu->offset_within_address_space.
> >
> >>      MemoryRegion *mr;
> >>      hwaddr xlat;
> >>      hwaddr len = iotlb->addr_mask + 1;
> >>      void *vaddr;
> >>      int ret;
> >>
> >>-    trace_vfio_iommu_map_notify(iotlb->iova,
> >>-                                iotlb->iova + iotlb->addr_mask);
> >>+    trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
> >>
> >>      /*
> >>       * The IOMMU TLB entry we have just covers translation through
> >>@@ -291,21 +291,21 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
> >>
> >>      if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> >>          vaddr = memory_region_get_ram_ptr(mr) + xlat;
> >>-        ret = vfio_dma_map(container, iotlb->iova,
> >>+        ret = vfio_dma_map(container, iova,
> >>                             iotlb->addr_mask + 1, vaddr,
> >>                             !(iotlb->perm & IOMMU_WO) || mr->readonly);
> >>          if (ret) {
> >>              error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> >>                           "0x%"HWADDR_PRIx", %p) = %d (%m)",
> >>-                         container, iotlb->iova,
> >>+                         container, iova,
> >>                           iotlb->addr_mask + 1, vaddr, ret);
> >>          }
> >>      } else {
> >>-        ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
> >>+        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
> >>          if (ret) {
> >>              error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> >>                           "0x%"HWADDR_PRIx") = %d (%m)",
> >>-                         container, iotlb->iova,
> >>+                         container, iova,
> >>                           iotlb->addr_mask + 1, ret);
> >>          }
> >>      }
> >
> >This is fine.
> >
> >>@@ -377,6 +377,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >>           */
> >>          giommu = g_malloc0(sizeof(*giommu));
> >>          giommu->iommu = section->mr;
> >>+        giommu->offset_within_address_space =
> >>+            section->offset_within_address_space;
> >
> >But here there's a problem.  The iova in IOMMUTLBEntry is relative to
> >the IOMMU MemoryRegion, but - at least in theory - only a subsection
> >of that MemoryRegion could be mapped into the AddressSpace.
> 
> But the IOMMU MR stays the same: its size and offset do not change, and the
> iova is relative to its start, so why does it matter if only a portion of it
> is mapped?

Because the portion mapped may not sit at the start of the MR.  For
example, if you had a 2G MR and the second half were mapped at address 0
in the AS, then an IOMMUTLBEntry iova of 1G would translate to AS
address 0.

> >So, to find the IOVA within the AddressSpace, from the IOVA within the
> >MemoryRegion, you need to first subtract the section's offset within
> >the MemoryRegion, then add the section's offset within the
> >AddressSpace.
> >
> >You could precalculate the combined delta here, but...
> >
> >
> >>          giommu->container = container;
> >>          giommu->n.notify = vfio_iommu_map_notify;
> >>          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
> >>diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >>index eb0e1b0..5341e05 100644
> >>--- a/include/hw/vfio/vfio-common.h
> >>+++ b/include/hw/vfio/vfio-common.h
> >>@@ -90,6 +90,7 @@ typedef struct VFIOContainer {
> >>  typedef struct VFIOGuestIOMMU {
> >>      VFIOContainer *container;
> >>      MemoryRegion *iommu;
> >>+    hwaddr offset_within_address_space;
> >
> >...it might be simpler to replace both the iommu and
> >offset_within_address_space fields here with a pointer to the
> >MemoryRegionSection instead, which should give you all the info you
> >need.
> 
> 
> MemoryRegionSection is allocated on the stack in listener_add_address_space()
> and seems to be a temporary object in general, so a pointer to it cannot be
> stored safely.

Ah, right, I guess you'll have to store the delta, then.

> 
> 
> >
> >It might also be worth adding Paolo to the CC for this patch, since he
> >knows the MemoryRegion stuff better than anyone.
> 
> 
> Right, I added him in cc: now.
> 
> >
> >>      Notifier n;
> >>      QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
> >>  } VFIOGuestIOMMU;
> >
> 
> 




* Re: [Qemu-devel] [PATCH qemu v14 03/18] spapr_pci: Move DMA window enablement to a helper
  2016-03-22  3:17     ` Alexey Kardashevskiy
@ 2016-03-22  3:28       ` David Gibson
  0 siblings, 0 replies; 64+ messages in thread
From: David Gibson @ 2016-03-22  3:28 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel


On Tue, Mar 22, 2016 at 02:17:24PM +1100, Alexey Kardashevskiy wrote:
> On 03/22/2016 12:02 PM, David Gibson wrote:
> >On Mon, Mar 21, 2016 at 06:46:51PM +1100, Alexey Kardashevskiy wrote:
> >>We are going to have multiple DMA windows soon so let's start preparing.
> >>
> >>This adds a new helper to create a DMA window and makes use of it in
> >>sPAPRPHBState::realize().
> >>
> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >
> >Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> >
> >With one tweak..
> >
> >>---
> >>Changes:
> >>v14:
> >>* replaced "int" return to Error* in spapr_phb_dma_window_enable()
> >>---
> >>  hw/ppc/spapr_pci.c | 47 ++++++++++++++++++++++++++++++++++-------------
> >>  1 file changed, 34 insertions(+), 13 deletions(-)
> >>
> >>diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> >>index 79baa7b..18332bf 100644
> >>--- a/hw/ppc/spapr_pci.c
> >>+++ b/hw/ppc/spapr_pci.c
> >>@@ -803,6 +803,33 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
> >>      return buf;
> >>  }
> >>
> >>+static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> >>+                                       uint32_t liobn,
> >>+                                       uint32_t page_shift,
> >>+                                       uint64_t window_addr,
> >>+                                       uint64_t window_size,
> >>+                                       Error **errp)
> >>+{
> >>+    sPAPRTCETable *tcet;
> >>+    uint32_t nb_table = window_size >> page_shift;
> >>+
> >>+    if (!nb_table) {
> >>+        error_setg(errp, "Zero size table");
> >>+        return;
> >>+    }
> >>+
> >>+    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, window_addr,
> >>+                               page_shift, nb_table, false);
> >>+    if (!tcet) {
> >>+        error_setg(errp, "Unable to create TCE table liobn %x for %s",
> >>+                   liobn, sphb->dtbusname);
> >>+        return;
> >>+    }
> >>+
> >>+    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
> >>+                                spapr_tce_get_iommu(tcet));
> >>+}
> >>+
> >>  /* Macros to operate with address in OF binding to PCI */
> >>  #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
> >>  #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
> >>@@ -1307,8 +1334,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >>      int i;
> >>      PCIBus *bus;
> >>      uint64_t msi_window_size = 4096;
> >>-    sPAPRTCETable *tcet;
> >>-    uint32_t nb_table;
> >>+    Error *local_err = NULL;
> >>
> >>      if (sphb->index != (uint32_t)-1) {
> >>          hwaddr windows_base;
> >>@@ -1460,18 +1486,13 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >>          }
> >>      }
> >>
> >>-    nb_table = sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT;
> >>-    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn,
> >>-                               0, SPAPR_TCE_PAGE_SHIFT, nb_table, false);
> >>-    if (!tcet) {
> >>-        error_setg(errp, "Unable to create TCE table for %s",
> >>-                   sphb->dtbusname);
> >>-        return;
> >>-    }
> >>-
> >>      /* Register default 32bit DMA window */
> >>-    memory_region_add_subregion(&sphb->iommu_root, sphb->dma_win_addr,
> >>-                                spapr_tce_get_iommu(tcet));
> >>+    spapr_phb_dma_window_enable(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
> >>+                                sphb->dma_win_addr, sphb->dma_win_size,
> >>+                                &local_err);
> >>+    if (local_err) {
> >>+        error_propagate(errp, local_err);
> >
> >Should be a return; here so we don't continue if there's an error.
> >
> >Actually.. that's not really right, we should be cleaning up all setup
> >we've done already on the failure path.  Without that I think we'll
> >leak some objects on a failed device_add.
> >
> >But.. there are already a bunch of cases here that will do that, so we
> >can clean that up separately.  Probably the sanest way would be to add
> >an unrealize function() that can handle a partially realized object
> >and make sure it's called on all the error paths.
> 
> 
> So what do I do with this patch right now: leave it as is, add the "return",
> implement unrealize(), ...? In practice, being unable to create a PHB is a
> fatal error today, as we do not have PHB hotplug yet (which is what
> unrealize() would be for).

Add the return for now, since the series will need a respin anyway.
If you have time it'd be great if you could do an unrealize() patch
that cleans up the existing failure paths, but that would be separate
from this series.




* Re: [Qemu-devel] [PATCH qemu v14 13/18] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2016-03-21  7:47 ` [Qemu-devel] [PATCH qemu v14 13/18] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
@ 2016-03-22  4:04   ` David Gibson
  0 siblings, 0 replies; 64+ messages in thread
From: David Gibson @ 2016-03-22  4:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel


On Mon, Mar 21, 2016 at 06:47:01PM +1100, Alexey Kardashevskiy wrote:
> This makes use of the new "memory registering" feature. The idea is
> to provide userspace with the ability to notify the host kernel about pages
> which are going to be used for DMA. Having this information, the host
> kernel can pin them all once per user process, do locked pages
> accounting (once), and not spend time doing that at runtime with
> possible failures which cannot be handled nicely in some cases.
> 
> This adds a prereg memory listener which listens on address_space_memory
> and notifies a VFIO container about memory which needs to be
> pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
> 
> As there is no per-IOMMU-type release() callback anymore, this stores
> the IOMMU type in the container so vfio_listener_release() can decide
> whether it needs to unregister @prereg_listener.
> 
> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> not call it when v2 is detected and enabled.
> 
> This does not change the guest visible interface.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v14:
> * s/free_container_exit/listener_release_exit/g
> * added "if memory_region_is_iommu()" to vfio_prereg_listener_skipped_section
> ---
>  hw/vfio/Makefile.objs         |   1 +
>  hw/vfio/common.c              |  38 +++++++++---
>  hw/vfio/prereg.c              | 137 ++++++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |   4 ++
>  trace-events                  |   2 +
>  5 files changed, 172 insertions(+), 10 deletions(-)
>  create mode 100644 hw/vfio/prereg.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> index ceddbb8..5800e0e 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
>  obj-$(CONFIG_SOFTMMU) += platform.o
>  obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
>  obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
> +obj-$(CONFIG_SOFTMMU) += prereg.o
>  endif
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 9587c25..a8deb16 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -493,6 +493,9 @@ static const MemoryListener vfio_memory_listener = {
>  static void vfio_listener_release(VFIOContainer *container)
>  {
>      memory_listener_unregister(&container->listener);
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +        memory_listener_unregister(&container->prereg_listener);
> +    }
>  }
>  
>  int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
> @@ -800,8 +803,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              goto free_container_exit;
>          }
>  
> -        ret = ioctl(fd, VFIO_SET_IOMMU,
> -                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
> +        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>          if (ret) {
>              error_report("vfio: failed to set iommu for container: %m");
>              ret = -errno;
> @@ -826,8 +829,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>          if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
>              container->iova_pgsizes = info.iova_pgsizes;
>          }
> -    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
> +               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
>          struct vfio_iommu_spapr_tce_info info;
> +        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
>  
>          ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>          if (ret) {
> @@ -835,7 +840,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              ret = -errno;
>              goto free_container_exit;
>          }
> -        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
> +        container->iommu_type =
> +            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>          if (ret) {
>              error_report("vfio: failed to set iommu for container: %m");
>              ret = -errno;
> @@ -847,11 +854,22 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>           * when container fd is closed so we do not call it explicitly
>           * in this file.
>           */
> -        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> -        if (ret) {
> -            error_report("vfio: failed to enable container: %m");
> -            ret = -errno;
> -            goto free_container_exit;
> +        if (!v2) {
> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> +            if (ret) {
> +                error_report("vfio: failed to enable container: %m");
> +                ret = -errno;
> +                goto free_container_exit;
> +            }
> +        } else {
> +            container->prereg_listener = vfio_prereg_listener;
> +
> +            memory_listener_register(&container->prereg_listener,
> +                                     &address_space_memory);
> +            if (container->error) {
> +                error_report("vfio: RAM memory listener initialization failed for container");
> +                goto listener_release_exit;
> +            }
>          }
>  
>          /*
> @@ -864,7 +882,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>          if (ret) {
>              error_report("vfio: VFIO_IOMMU_SPAPR_TCE_GET_INFO failed: %m");
>              ret = -errno;
> -            goto free_container_exit;
> +            goto listener_release_exit;
>          }
>          container->min_iova = info.dma32_window_start;
>          container->max_iova = container->min_iova + info.dma32_window_size - 1;
> diff --git a/hw/vfio/prereg.c b/hw/vfio/prereg.c
> new file mode 100644
> index 0000000..36c9ff5
> --- /dev/null
> +++ b/hw/vfio/prereg.c
> @@ -0,0 +1,137 @@
> +/*
> + * DMA memory preregistration
> + *
> + * Authors:
> + *  Alexey Kardashevskiy <aik@ozlabs.ru>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include <sys/ioctl.h>
> +#include <linux/vfio.h>
> +
> +#include "hw/vfio/vfio-common.h"
> +#include "qemu/error-report.h"
> +#include "trace.h"
> +
> +static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
> +{
> +    if (memory_region_is_iommu(section->mr)) {
> +        error_report("Cannot possibly preregister IOMMU memory");
> +        return true;
> +    }
> +
> +    return !memory_region_is_ram(section->mr) ||
> +            memory_region_is_skip_dump(section->mr);
> +}
> +
> +static void vfio_prereg_listener_region_add(MemoryListener *listener,
> +                                            MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> +                                            prereg_listener);
> +    hwaddr gpa;
> +    Int128 llend;
> +    int ret;
> +    hwaddr page_mask = qemu_real_host_page_mask;
> +    struct vfio_iommu_spapr_register_memory reg = {
> +        .argsz = sizeof(reg),
> +        .flags = 0,
> +    };
> +
> +    if (vfio_prereg_listener_skipped_section(section)) {
> +        trace_vfio_listener_region_add_skip(
> +                section->offset_within_address_space,
> +                section->offset_within_address_space +
> +                int128_get64(int128_sub(section->size, int128_one())));
> +        return;
> +    }
> +
> +    if (unlikely((section->offset_within_address_space & ~page_mask) !=
> +                 (section->offset_within_region & ~page_mask))) {
> +        error_report("%s received unaligned region", __func__);
> +        return;
> +    }
> +
> +    gpa = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);

If offset_within_address_space is not aligned to page_mask, you could
fail to preregister a little bit of the section.  Sounds like you
either want to prohibit that entirely (change the test above) or round
down to preregister a bit more than just the section.

> +    llend = int128_make64(section->offset_within_address_space);
> +    llend = int128_add(llend, section->size);
> +    llend = int128_and(llend, int128_exts64(page_mask));

Likewise here you round the end of the section down, which could leave
a bit un-preregistered.  Sounds like you want to ban that, or round up.
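Completely untested, but the round-outward variant for both ends could look something like this (plain uint64_t standing in for hwaddr, helper names invented for illustration; page_mask is assumed to have the QEMU convention of all high bits set, i.e. ~(page_size - 1)):

```c
#include <assert.h>
#include <stdint.h>

/* Round the start of the section down instead of up, and the end up
 * instead of down, so the whole section gets preregistered rather than
 * leaving unaligned slivers out. */
static uint64_t prereg_start(uint64_t offset_within_address_space,
                             uint64_t page_mask)
{
    /* round down, instead of ROUND_UP() as in the patch */
    return offset_within_address_space & page_mask;
}

static uint64_t prereg_end(uint64_t offset_within_address_space,
                           uint64_t size, uint64_t page_mask)
{
    uint64_t page_size = ~page_mask + 1;

    /* round up, instead of masking down as in the patch */
    return (offset_within_address_space + size + page_size - 1) & page_mask;
}
```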

> +    g_assert(!int128_ge(int128_make64(gpa), llend));
> +
> +    memory_region_ref(section->mr);
> +
> +    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
> +        section->offset_within_region +
> +        (gpa - section->offset_within_address_space);
> +    reg.size = int128_get64(llend) - gpa;
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
> +    trace_vfio_ram_register(reg.vaddr, reg.size, ret ? -errno : 0);
> +    if (ret) {
> +        /*
> +         * On the initfn path, store the first error in the container so we
> +         * can gracefully fail.  Runtime, there's not much we can do other
> +         * than throw a hardware error.
> +         */
> +        if (!container->initialized) {
> +            if (!container->error) {
> +                container->error = ret;
> +            }
> +        } else {
> +            hw_error("vfio: Memory registering failed, unable to continue");
> +        }
> +    }
> +}
> +
> +static void vfio_prereg_listener_region_del(MemoryListener *listener,
> +                                            MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> +                                            prereg_listener);
> +    hwaddr gpa, end;
> +    int ret;
> +    hwaddr page_mask = qemu_real_host_page_mask;
> +    struct vfio_iommu_spapr_register_memory reg = {
> +        .argsz = sizeof(reg),
> +        .flags = 0,
> +    };
> +
> +    if (vfio_prereg_listener_skipped_section(section)) {
> +        trace_vfio_listener_region_del_skip(
> +                section->offset_within_address_space,
> +                section->offset_within_address_space +
> +                int128_get64(int128_sub(section->size, int128_one())));
> +        return;
> +    }
> +
> +    if (unlikely((section->offset_within_address_space & ~page_mask) !=
> +                 (section->offset_within_region & ~page_mask))) {
> +        error_report("%s received unaligned region", __func__);
> +        return;
> +    }
> +
> +    gpa = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
> +    end = (section->offset_within_address_space + int128_get64(section->size)) &
> +        page_mask;
> +
> +    if (gpa >= end) {
> +        return;
> +    }
> +
> +    reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
> +        section->offset_within_region +
> +        (gpa - section->offset_within_address_space);
> +    reg.size = end - gpa;

Might be useful to have a helper function that does the address
calculations, so you can use the same one for region_add and
region_del.
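Untested sketch of what such a helper could look like; the struct and names here are toy stand-ins, not QEMU's actual MemoryRegionSection. It reproduces the inward-rounding calculation the patch currently duplicates in both listeners and reports whether anything remains after alignment:

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for the fields of MemoryRegionSection the calculation
 * needs. */
struct prereg_section {
    uint64_t offset_within_address_space;
    uint64_t offset_within_region;
    uint64_t size;
};

/* Compute the aligned [*gpa, *end) range to register or unregister.
 * Returns nonzero if the range is non-empty. */
static int vfio_prereg_range(const struct prereg_section *s,
                             uint64_t page_mask,
                             uint64_t *gpa, uint64_t *end)
{
    uint64_t page_size = ~page_mask + 1;

    /* same as ROUND_UP(ow_as, page_size) in the patch */
    *gpa = (s->offset_within_address_space + page_size - 1) & page_mask;
    /* same as masking llend down in the patch */
    *end = (s->offset_within_address_space + s->size) & page_mask;
    return *gpa < *end;
}
```

Then both region_add and region_del would just call it and bail out when it returns zero.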

> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
> +    trace_vfio_ram_unregister(reg.vaddr, reg.size, ret ? -errno : 0);
> +}
> +
> +const MemoryListener vfio_prereg_listener = {
> +    .region_add = vfio_prereg_listener_region_add,
> +    .region_del = vfio_prereg_listener_region_del,
> +};
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 5341e05..b861eec 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -73,6 +73,8 @@ typedef struct VFIOContainer {
>      VFIOAddressSpace *space;
>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>      MemoryListener listener;
> +    MemoryListener prereg_listener;
> +    unsigned iommu_type;
>      int error;
>      bool initialized;
>      /*
> @@ -156,4 +158,6 @@ extern QLIST_HEAD(vfio_as_head, VFIOAddressSpace) vfio_address_spaces;
>  int vfio_get_region_info(VFIODevice *vbasedev, int index,
>                           struct vfio_region_info **info);
>  #endif
> +extern const MemoryListener vfio_prereg_listener;
> +
>  #endif /* !HW_VFIO_VFIO_COMMON_H */
> diff --git a/trace-events b/trace-events
> index 6a94736..cc619e1 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1734,6 +1734,8 @@ vfio_region_mmap(const char *name, unsigned long offset, unsigned long end) "Reg
>  vfio_region_exit(const char *name, int index) "Device %s, region %d"
>  vfio_region_finalize(const char *name, int index) "Device %s, region %d"
>  vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
> +vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> +vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>  
>  # hw/vfio/platform.c
>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v14 15/18] vfio: Add host side IOMMU capabilities
  2016-03-21  7:47 ` [Qemu-devel] [PATCH qemu v14 15/18] vfio: Add host side IOMMU capabilities Alexey Kardashevskiy
@ 2016-03-22  4:20   ` David Gibson
  2016-03-22  6:47     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 64+ messages in thread
From: David Gibson @ 2016-03-22  4:20 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 7009 bytes --]

On Mon, Mar 21, 2016 at 06:47:03PM +1100, Alexey Kardashevskiy wrote:
> There are going to be multiple IOMMUs per container. This moves
> the single host IOMMU parameter set to a list of VFIOHostIOMMU.
> 
> This should cause no behavioral change and will be used later by
> the SPAPR TCE IOMMU v2 which will also add a vfio_host_iommu_del() helper.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

This looks ok except for the name.  Calling each window a separate
"host IOMMU" is misleading.  The different windows the container
supports might be implemented by different IOMMUs on the host side, or
they might all be implemented by one IOMMU with multiple tables.

Better to call them host DMA windows, or maybe container DMA windows.

> ---
>  hw/vfio/common.c              | 65 +++++++++++++++++++++++++++++++++----------
>  include/hw/vfio/vfio-common.h |  9 ++++--
>  2 files changed, 57 insertions(+), 17 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index a8deb16..b257655 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -29,6 +29,7 @@
>  #include "exec/memory.h"
>  #include "hw/hw.h"
>  #include "qemu/error-report.h"
> +#include "qemu/range.h"
>  #include "sysemu/kvm.h"
>  #include "trace.h"
>  
> @@ -239,6 +240,45 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>      return -errno;
>  }
>  
> +static VFIOHostIOMMU *vfio_host_iommu_lookup(VFIOContainer *container,
> +                                             hwaddr min_iova, hwaddr max_iova)
> +{
> +    VFIOHostIOMMU *hiommu;
> +
> +    QLIST_FOREACH(hiommu, &container->hiommu_list, hiommu_next) {
> +        if (hiommu->min_iova <= min_iova && max_iova <= hiommu->max_iova) {
> +            return hiommu;
> +        }
> +    }
> +
> +    return NULL;
> +}
> +
> +static int vfio_host_iommu_add(VFIOContainer *container,
> +                               hwaddr min_iova, hwaddr max_iova,
> +                               uint64_t iova_pgsizes)
> +{
> +    VFIOHostIOMMU *hiommu;
> +
> +    QLIST_FOREACH(hiommu, &container->hiommu_list, hiommu_next) {
> +        if (ranges_overlap(min_iova, max_iova - min_iova + 1,
> +                           hiommu->min_iova,
> +                           hiommu->max_iova - hiommu->min_iova + 1)) {
> +            error_report("%s: Overlapped IOMMU are not enabled", __func__);
> +            return -1;
> +        }
> +    }
> +
> +    hiommu = g_malloc0(sizeof(*hiommu));
> +
> +    hiommu->min_iova = min_iova;
> +    hiommu->max_iova = max_iova;
> +    hiommu->iova_pgsizes = iova_pgsizes;
> +    QLIST_INSERT_HEAD(&container->hiommu_list, hiommu, hiommu_next);
> +
> +    return 0;
> +}
> +
>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>  {
>      return (!memory_region_is_ram(section->mr) &&
> @@ -352,7 +392,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
>      }
>      end = int128_get64(llend);
>  
> -    if ((iova < container->min_iova) || ((end - 1) > container->max_iova)) {
> +    if (!vfio_host_iommu_lookup(container, iova, end - 1)) {
>          error_report("vfio: IOMMU container %p can't map guest IOVA region"
>                       " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
>                       container, iova, end - 1);
> @@ -367,10 +407,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
>  
>          trace_vfio_listener_region_add_iommu(iova, end - 1);
>          /*
> -         * FIXME: We should do some checking to see if the
> -         * capabilities of the host VFIO IOMMU are adequate to model
> -         * the guest IOMMU
> -         *
>           * FIXME: For VFIO iommu types which have KVM acceleration to
>           * avoid bouncing all map/unmaps through qemu this way, this
>           * would be the right place to wire that up (tell the KVM
> @@ -818,16 +854,14 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>           * existing Type1 IOMMUs generally support any IOVA we're
>           * going to actually try in practice.
>           */
> -        container->min_iova = 0;
> -        container->max_iova = (hwaddr)-1;
> -
> -        /* Assume just 4K IOVA page size */
> -        container->iova_pgsizes = 0x1000;
>          info.argsz = sizeof(info);
>          ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
>          /* Ignore errors */
>          if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
> -            container->iova_pgsizes = info.iova_pgsizes;
> +            vfio_host_iommu_add(container, 0, (hwaddr)-1, info.iova_pgsizes);
> +        } else {
> +            /* Assume just 4K IOVA page size */
> +            vfio_host_iommu_add(container, 0, (hwaddr)-1, 0x1000);
>          }
>      } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
>                 ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
> @@ -884,11 +918,12 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              ret = -errno;
>              goto listener_release_exit;
>          }
> -        container->min_iova = info.dma32_window_start;
> -        container->max_iova = container->min_iova + info.dma32_window_size - 1;
>  
> -        /* Assume just 4K IOVA pages for now */
> -        container->iova_pgsizes = 0x1000;
> +        /* The default table uses 4K pages */
> +        vfio_host_iommu_add(container, info.dma32_window_start,
> +                            info.dma32_window_start +
> +                            info.dma32_window_size - 1,
> +                            0x1000);
>      } else {
>          error_report("vfio: No available IOMMU models");
>          ret = -EINVAL;
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index b861eec..1b98e33 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -82,9 +82,8 @@ typedef struct VFIOContainer {
>       * contiguous IOVA window.  We may need to generalize that in
>       * future
>       */
> -    hwaddr min_iova, max_iova;
> -    uint64_t iova_pgsizes;
>      QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
> +    QLIST_HEAD(, VFIOHostIOMMU) hiommu_list;
>      QLIST_HEAD(, VFIOGroup) group_list;
>      QLIST_ENTRY(VFIOContainer) next;
>  } VFIOContainer;
> @@ -97,6 +96,12 @@ typedef struct VFIOGuestIOMMU {
>      QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>  } VFIOGuestIOMMU;
>  
> +typedef struct VFIOHostIOMMU {
> +    hwaddr min_iova, max_iova;
> +    uint64_t iova_pgsizes;
> +    QLIST_ENTRY(VFIOHostIOMMU) hiommu_next;
> +} VFIOHostIOMMU;
> +
>  typedef struct VFIODeviceOps VFIODeviceOps;
>  
>  typedef struct VFIODevice {


* Re: [Qemu-devel] [PATCH qemu v14 01/18] memory: Fix IOMMU replay base address
  2016-03-22  3:26       ` David Gibson
@ 2016-03-22  4:28         ` Alexey Kardashevskiy
  2016-03-22  4:59           ` David Gibson
  2016-03-23 10:58         ` Paolo Bonzini
  1 sibling, 1 reply; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-22  4:28 UTC (permalink / raw)
  To: David Gibson; +Cc: Paolo Bonzini, Alex Williamson, qemu-ppc, qemu-devel

On 03/22/2016 02:26 PM, David Gibson wrote:
> On Tue, Mar 22, 2016 at 02:12:30PM +1100, Alexey Kardashevskiy wrote:
>> On 03/22/2016 11:49 AM, David Gibson wrote:
>>> On Mon, Mar 21, 2016 at 06:46:49PM +1100, Alexey Kardashevskiy wrote:
>>>> Since a788f227 "memory: Allow replay of IOMMU mapping notifications"
>>>> when new VFIO listener is added, all existing IOMMU mappings are
>>>> replayed. However there is a problem that the base address of
>>>> an IOMMU memory region (IOMMU MR) is ignored which is not a problem
>>>> for the existing user (which is pseries) with its default 32bit DMA
>>>> window starting at 0 but it is if there is another DMA window.
>>>>
>>>> This stores the IOMMU's offset_within_address_space and adjusts
>>>> the IOVA before calling vfio_dma_map/vfio_dma_unmap.
>>>>
>>>> As the IOMMU notifier expects IOVA offset rather than the absolute
>>>> address, this also adjusts IOVA in sPAPR H_PUT_TCE handler before
>>>> calling notifier(s).
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>>>
>>> On a closer look, I realised this still isn't quite correct, although
>>> I don't think any cases which would break it exist or are planned.
>>>
>>>> ---
>>>>   hw/ppc/spapr_iommu.c          |  2 +-
>>>>   hw/vfio/common.c              | 14 ++++++++------
>>>>   include/hw/vfio/vfio-common.h |  1 +
>>>>   3 files changed, 10 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>>>> index 7dd4588..277f289 100644
>>>> --- a/hw/ppc/spapr_iommu.c
>>>> +++ b/hw/ppc/spapr_iommu.c
>>>> @@ -277,7 +277,7 @@ static target_ulong put_tce_emu(sPAPRTCETable *tcet, target_ulong ioba,
>>>>       tcet->table[index] = tce;
>>>>
>>>>       entry.target_as = &address_space_memory,
>>>> -    entry.iova = ioba & page_mask;
>>>> +    entry.iova = (ioba - tcet->bus_offset) & page_mask;
>>>>       entry.translated_addr = tce & page_mask;
>>>>       entry.addr_mask = ~page_mask;
>>>>       entry.perm = spapr_tce_iommu_access_flags(tce);
>>>
>>> This bit's right/
>>>
>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>> index fb588d8..d45e2db 100644
>>>> --- a/hw/vfio/common.c
>>>> +++ b/hw/vfio/common.c
>>>> @@ -257,14 +257,14 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
>>>>       VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>>>       VFIOContainer *container = giommu->container;
>>>>       IOMMUTLBEntry *iotlb = data;
>>>> +    hwaddr iova = iotlb->iova + giommu->offset_within_address_space;
>>>
>>> This bit might be right, depending on how you define giommu->offset_within_address_space.
>>>
>>>>       MemoryRegion *mr;
>>>>       hwaddr xlat;
>>>>       hwaddr len = iotlb->addr_mask + 1;
>>>>       void *vaddr;
>>>>       int ret;
>>>>
>>>> -    trace_vfio_iommu_map_notify(iotlb->iova,
>>>> -                                iotlb->iova + iotlb->addr_mask);
>>>> +    trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
>>>>
>>>>       /*
>>>>        * The IOMMU TLB entry we have just covers translation through
>>>> @@ -291,21 +291,21 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
>>>>
>>>>       if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
>>>>           vaddr = memory_region_get_ram_ptr(mr) + xlat;
>>>> -        ret = vfio_dma_map(container, iotlb->iova,
>>>> +        ret = vfio_dma_map(container, iova,
>>>>                              iotlb->addr_mask + 1, vaddr,
>>>>                              !(iotlb->perm & IOMMU_WO) || mr->readonly);
>>>>           if (ret) {
>>>>               error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>>>>                            "0x%"HWADDR_PRIx", %p) = %d (%m)",
>>>> -                         container, iotlb->iova,
>>>> +                         container, iova,
>>>>                            iotlb->addr_mask + 1, vaddr, ret);
>>>>           }
>>>>       } else {
>>>> -        ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
>>>> +        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
>>>>           if (ret) {
>>>>               error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>>>>                            "0x%"HWADDR_PRIx") = %d (%m)",
>>>> -                         container, iotlb->iova,
>>>> +                         container, iova,
>>>>                            iotlb->addr_mask + 1, ret);
>>>>           }
>>>>       }
>>>
>>> This is fine.
>>>
>>>> @@ -377,6 +377,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>>>            */
>>>>           giommu = g_malloc0(sizeof(*giommu));
>>>>           giommu->iommu = section->mr;
>>>> +        giommu->offset_within_address_space =
>>>> +            section->offset_within_address_space;
>>>
>>> But here there's a problem.  The iova in IOMMUTLBEntry is relative to
>>> the IOMMU MemoryRegion, but - at least in theory - only a subsection
>>> of that MemoryRegion could be mapped into the AddressSpace.
>>
>> But the IOMMU MR stays the same - size, offset, and iova will be relative to
>> its start, so why does it matter if only a portion is mapped?
>
> Because the portion mapped may not sit at the start of the MR.  For
> example if you had a 2G MR, and the second half is mapped at address 0
> in the AS,

My imagination fails here. How could you do this in practice?

address_space_init(&as, &root)
memory_region_init(&mr, 2GB)
memory_region_add_subregion(&root, -1GB, &mr)

But offsets are unsigned.

In general, how would one map only half of an MR? Which memory_region_add_xxx() call does that?


> then an IOMMUTLBEntry iova of 1G would translate to an AS
> address 0.
 >
>
>>> So, to find the IOVA within the AddressSpace, from the IOVA within the
>>> MemoryRegion, you need to first subtract the section's offset within
>>> the MemoryRegion, then add the section's offset within the
>>> AddressSpace.
>>>
>>> You could precalculate the combined delta here, but...
>>>
>>>
>>>>           giommu->container = container;
>>>>           giommu->n.notify = vfio_iommu_map_notify;
>>>>           QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>>> index eb0e1b0..5341e05 100644
>>>> --- a/include/hw/vfio/vfio-common.h
>>>> +++ b/include/hw/vfio/vfio-common.h
>>>> @@ -90,6 +90,7 @@ typedef struct VFIOContainer {
>>>>   typedef struct VFIOGuestIOMMU {
>>>>       VFIOContainer *container;
>>>>       MemoryRegion *iommu;
>>>> +    hwaddr offset_within_address_space;
>>>
>>> ...it might be simpler to replace both the iommu and
>>> offset_within_address_space fields here with a pointer to the
>>> MemoryRegionSection instead, which should give you all the info you
>>> need.
>>
>>
>> MemoryRegionSection is allocated on stack in listener_add_address_space()
>> and seems to be in general some sort of temporary object.
>
> Ah, right, I guess you'll have to store the delta, then.
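For what it's worth, the stored-delta approach boils down to a couple of lines of arithmetic (toy types and invented names, untested). Unsigned wraparound makes a negative delta work out correctly:

```c
#include <assert.h>
#include <stdint.h>

/* Precompute the combined delta when the giommu is created: to go from
 * an MR-relative IOVA to an AS address, subtract the section's offset
 * within the MR and add its offset within the AS. */
static int64_t giommu_delta(uint64_t offset_within_address_space,
                            uint64_t offset_within_region)
{
    return (int64_t)(offset_within_address_space - offset_within_region);
}

/* Applied in the map notifier on each IOMMUTLBEntry. */
static uint64_t as_iova(uint64_t iotlb_iova, int64_t delta)
{
    return iotlb_iova + (uint64_t)delta;
}
```

This matches the 2G-MR example above: with the second half of the MR mapped at AS address 0, an MR-relative IOVA of 1G comes out as AS address 0.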
>
>>
>>
>>>
>>> It might also be worth adding Paolo to the CC for this patch, since he
>>> knows the MemoryRegion stuff better than anyone.
>>
>>
>> Right, I added him in cc: now.
>>
>>>
>>>>       Notifier n;
>>>>       QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>>>>   } VFIOGuestIOMMU;
>>>
>>
>>
>


-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v14 16/18] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO
  2016-03-21  7:47 ` [Qemu-devel] [PATCH qemu v14 16/18] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO Alexey Kardashevskiy
@ 2016-03-22  4:45   ` David Gibson
  2016-03-22  6:24     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 64+ messages in thread
From: David Gibson @ 2016-03-22  4:45 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel


On Mon, Mar 21, 2016 at 06:47:04PM +1100, Alexey Kardashevskiy wrote:
> The sPAPR TCE tables manage 2 copies when VFIO is using an IOMMU -
> a guest view of the table and a hardware TCE table. If there is no VFIO
> presence in the address space, then just the guest view is used; in
> this case, it is allocated in KVM. However, since there is no
> support yet for VFIO in KVM TCE hypercalls, when we start using VFIO
> we need to move the guest view from KVM to userspace, and we need
> to do this for every IOMMU on a bus with VFIO devices.
> 
> This adds vfio_start/vfio_stop callbacks in MemoryRegionIOMMUOps to
> notify the IOMMU about the changing environment so it can reallocate the
> table to/from KVM or (when available) hook the IOMMU groups up with the
> logical bus (LIOBN) in KVM.
> 
> This removes explicit spapr_tce_set_need_vfio() call from PCI hotplug
> path as the new callbacks do this better - they notify IOMMU at
> the exact moment when the configuration is changed, and this also
> includes the case of PCI hot unplug.
> 
> TODO: split into 2 or 3 patches, per maintainership area.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

I'm finding this one much easier to follow than the previous revision.

> ---
>  hw/ppc/spapr_iommu.c  | 12 ++++++++++++
>  hw/ppc/spapr_pci.c    |  6 ------
>  hw/vfio/common.c      |  9 +++++++++
>  include/exec/memory.h |  4 ++++
>  4 files changed, 25 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 6dc3c45..702075d 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -151,6 +151,16 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
>      return 1ULL << tcet->page_shift;
>  }
>  
> +static void spapr_tce_vfio_start(MemoryRegion *iommu)
> +{
> +    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), true);
> +}
> +
> +static void spapr_tce_vfio_stop(MemoryRegion *iommu)
> +{
> +    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), false);
> +}

Wonder if a single callback which takes a boolean might be a little
less clunky.
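Something like this, perhaps (toy stand-ins for MemoryRegion and the ops struct, names invented, untested):

```c
#include <assert.h>
#include <stdbool.h>

typedef struct ToyIOMMUMR ToyIOMMUMR;

struct ToyIOMMUMROps {
    /* One hook instead of separate vfio_start/vfio_stop callbacks */
    void (*vfio_notify)(ToyIOMMUMR *iommu, bool attached);
};

struct ToyIOMMUMR {
    const struct ToyIOMMUMROps *ops;
    bool need_vfio;
};

/* The sPAPR implementation would collapse to a single function,
 * standing in for spapr_tce_set_need_vfio(tcet, attached). */
static void toy_tce_vfio_notify(ToyIOMMUMR *iommu, bool attached)
{
    iommu->need_vfio = attached;
}
```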

>  static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
>  static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
>  
> @@ -211,6 +221,8 @@ static const VMStateDescription vmstate_spapr_tce_table = {
>  static MemoryRegionIOMMUOps spapr_iommu_ops = {
>      .translate = spapr_tce_translate_iommu,
>      .get_page_sizes = spapr_tce_get_page_sizes,
> +    .vfio_start = spapr_tce_vfio_start,
> +    .vfio_stop = spapr_tce_vfio_stop,
>  };
>  
>  static int spapr_tce_table_realize(DeviceState *dev)
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index bfcafdf..af99a36 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -1121,12 +1121,6 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>      void *fdt = NULL;
>      int fdt_start_offset = 0, fdt_size;
>  
> -    if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
> -        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
> -
> -        spapr_tce_set_need_vfio(tcet, true);
> -    }
> -
>      if (dev->hotplugged) {
>          fdt = create_device_tree(&fdt_size);
>          fdt_start_offset = spapr_create_pci_child_dt(phb, pdev, fdt, 0);
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index b257655..4e873b7 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -421,6 +421,9 @@ static void vfio_listener_region_add(MemoryListener *listener,
>          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>  
>          memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> +        if (section->mr->iommu_ops && section->mr->iommu_ops->vfio_start) {
> +            section->mr->iommu_ops->vfio_start(section->mr);
> +        }
>          memory_region_iommu_replay(giommu->iommu, &giommu->n,
>                                     false);
>  
> @@ -466,6 +469,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
>      VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>      hwaddr iova, end;
>      int ret;
> +    MemoryRegion *iommu = NULL;
>  
>      if (vfio_listener_skipped_section(section)) {
>          trace_vfio_listener_region_del_skip(
> @@ -487,6 +491,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
>          QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
>              if (giommu->iommu == section->mr) {
>                  memory_region_unregister_iommu_notifier(&giommu->n);
> +                iommu = giommu->iommu;
>                  QLIST_REMOVE(giommu, giommu_next);
>                  g_free(giommu);
>                  break;
> @@ -519,6 +524,10 @@ static void vfio_listener_region_del(MemoryListener *listener,
>                       "0x%"HWADDR_PRIx") = %d (%m)",
>                       container, iova, end - iova, ret);
>      }
> +
> +    if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
> +        iommu->iommu_ops->vfio_stop(section->mr);
> +    }

IIRC there can be multiple containers listening on the same PCI
address space.  In that case, this won't be correct, because once one
of the VFIO containers is removed, it will call vfio_stop, even though
the other VFIO container still needs the guest IOMMU to support it.

So I think you need some sort of refcounting here.
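Roughly along these lines (toy types, invented names, untested): vfio_start only switches the table on the first container, vfio_stop only switches it back on the last one.

```c
#include <assert.h>

struct toy_iommu {
    unsigned vfio_users;   /* number of containers using this IOMMU MR */
    int need_vfio;
};

static void toy_vfio_start(struct toy_iommu *mr)
{
    if (mr->vfio_users++ == 0) {
        /* first container: move the guest view out of KVM */
        mr->need_vfio = 1;
    }
}

static void toy_vfio_stop(struct toy_iommu *mr)
{
    assert(mr->vfio_users);
    if (--mr->vfio_users == 0) {
        /* last container gone: the table can go back to KVM */
        mr->need_vfio = 0;
    }
}
```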

>  }
>  
>  static const MemoryListener vfio_memory_listener = {
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index eb5ce67..f1de133f 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -152,6 +152,10 @@ struct MemoryRegionIOMMUOps {
>      IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
>      /* Returns supported page sizes */
>      uint64_t (*get_page_sizes)(MemoryRegion *iommu);
> +    /* Called when VFIO starts using this */
> +    void (*vfio_start)(MemoryRegion *iommu);
> +    /* Called when VFIO stops using this */
> +    void (*vfio_stop)(MemoryRegion *iommu);
>  };
>  
>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;


* Re: [Qemu-devel] [PATCH qemu v14 01/18] memory: Fix IOMMU replay base address
  2016-03-22  4:28         ` Alexey Kardashevskiy
@ 2016-03-22  4:59           ` David Gibson
  2016-03-22  7:19             ` Alexey Kardashevskiy
  0 siblings, 1 reply; 64+ messages in thread
From: David Gibson @ 2016-03-22  4:59 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Paolo Bonzini, Alex Williamson, qemu-ppc, qemu-devel


On Tue, Mar 22, 2016 at 03:28:52PM +1100, Alexey Kardashevskiy wrote:
> On 03/22/2016 02:26 PM, David Gibson wrote:
> >On Tue, Mar 22, 2016 at 02:12:30PM +1100, Alexey Kardashevskiy wrote:
> >>On 03/22/2016 11:49 AM, David Gibson wrote:
> >>>On Mon, Mar 21, 2016 at 06:46:49PM +1100, Alexey Kardashevskiy wrote:
> >>>>Since a788f227 "memory: Allow replay of IOMMU mapping notifications"
> >>>>when a new VFIO listener is added, all existing IOMMU mappings are
> >>>>replayed. However, the base address of an IOMMU memory region
> >>>>(IOMMU MR) is ignored. This is not a problem for the existing user
> >>>>(pseries) with its default 32-bit DMA window starting at 0, but it is
> >>>>if there is another DMA window.
> >>>>
> >>>>This stores the IOMMU's offset_within_address_space and adjusts
> >>>>the IOVA before calling vfio_dma_map/vfio_dma_unmap.
> >>>>
> >>>>As the IOMMU notifier expects IOVA offset rather than the absolute
> >>>>address, this also adjusts IOVA in sPAPR H_PUT_TCE handler before
> >>>>calling notifier(s).
> >>>>
> >>>>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> >>>
> >>>On a closer look, I realised this still isn't quite correct, although
> >>>I don't think any cases which would break it exist or are planned.
> >>>
> >>>>---
> >>>>  hw/ppc/spapr_iommu.c          |  2 +-
> >>>>  hw/vfio/common.c              | 14 ++++++++------
> >>>>  include/hw/vfio/vfio-common.h |  1 +
> >>>>  3 files changed, 10 insertions(+), 7 deletions(-)
> >>>>
> >>>>diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> >>>>index 7dd4588..277f289 100644
> >>>>--- a/hw/ppc/spapr_iommu.c
> >>>>+++ b/hw/ppc/spapr_iommu.c
> >>>>@@ -277,7 +277,7 @@ static target_ulong put_tce_emu(sPAPRTCETable *tcet, target_ulong ioba,
> >>>>      tcet->table[index] = tce;
> >>>>
> >>>>      entry.target_as = &address_space_memory,
> >>>>-    entry.iova = ioba & page_mask;
> >>>>+    entry.iova = (ioba - tcet->bus_offset) & page_mask;
> >>>>      entry.translated_addr = tce & page_mask;
> >>>>      entry.addr_mask = ~page_mask;
> >>>>      entry.perm = spapr_tce_iommu_access_flags(tce);
> >>>
> >>>This bit's right.
> >>>
> >>>>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>>>index fb588d8..d45e2db 100644
> >>>>--- a/hw/vfio/common.c
> >>>>+++ b/hw/vfio/common.c
> >>>>@@ -257,14 +257,14 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
> >>>>      VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
> >>>>      VFIOContainer *container = giommu->container;
> >>>>      IOMMUTLBEntry *iotlb = data;
> >>>>+    hwaddr iova = iotlb->iova + giommu->offset_within_address_space;
> >>>
> >>>This bit might be right, depending on how you define giommu->offset_within_address_space.
> >>>
> >>>>      MemoryRegion *mr;
> >>>>      hwaddr xlat;
> >>>>      hwaddr len = iotlb->addr_mask + 1;
> >>>>      void *vaddr;
> >>>>      int ret;
> >>>>
> >>>>-    trace_vfio_iommu_map_notify(iotlb->iova,
> >>>>-                                iotlb->iova + iotlb->addr_mask);
> >>>>+    trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
> >>>>
> >>>>      /*
> >>>>       * The IOMMU TLB entry we have just covers translation through
> >>>>@@ -291,21 +291,21 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
> >>>>
> >>>>      if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> >>>>          vaddr = memory_region_get_ram_ptr(mr) + xlat;
> >>>>-        ret = vfio_dma_map(container, iotlb->iova,
> >>>>+        ret = vfio_dma_map(container, iova,
> >>>>                             iotlb->addr_mask + 1, vaddr,
> >>>>                             !(iotlb->perm & IOMMU_WO) || mr->readonly);
> >>>>          if (ret) {
> >>>>              error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> >>>>                           "0x%"HWADDR_PRIx", %p) = %d (%m)",
> >>>>-                         container, iotlb->iova,
> >>>>+                         container, iova,
> >>>>                           iotlb->addr_mask + 1, vaddr, ret);
> >>>>          }
> >>>>      } else {
> >>>>-        ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
> >>>>+        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
> >>>>          if (ret) {
> >>>>              error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> >>>>                           "0x%"HWADDR_PRIx") = %d (%m)",
> >>>>-                         container, iotlb->iova,
> >>>>+                         container, iova,
> >>>>                           iotlb->addr_mask + 1, ret);
> >>>>          }
> >>>>      }
> >>>
> >>>This is fine.
> >>>
> >>>>@@ -377,6 +377,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >>>>           */
> >>>>          giommu = g_malloc0(sizeof(*giommu));
> >>>>          giommu->iommu = section->mr;
> >>>>+        giommu->offset_within_address_space =
> >>>>+            section->offset_within_address_space;
> >>>
> >>>But here there's a problem.  The iova in IOMMUTLBEntry is relative to
> >>>the IOMMU MemoryRegion, but - at least in theory - only a subsection
> >>>of that MemoryRegion could be mapped into the AddressSpace.
> >>
> >>But the IOMMU MR stays the same - size, offset, and iova will be relative to
> >>its start. Why does it matter if only a portion is mapped?
> >
> >Because the portion mapped may not sit at the start of the MR.  For
> >example if you had a 2G MR, and the second half is mapped at address 0
> >in the AS,
> 
> My imagination fails here. How could you do this in practice?
> 
> address_space_init(&as, &root)
> memory_region_init(&mr, 2GB)
> memory_region_add_subregion(&root, -1GB, &mr)
> 
> But offsets are unsigned.
> 
> In general, how to map only a half, what memory_region_add_xxx()
> does that?

I'm not totally sure, but I think you can do it with:

address_space_init(&as, &root)
memory_region_init(&mr0, 2GB)
memory_region_init_alias(&mr1, &mr0, 1GB, 1GB)
memory_region_add_subregion(&root, 0, &mr1)

But the point is that it's possible for offset_within_region to be
non-zero, and if it is, you need to take it into account.
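The arithmetic being described can be modelled standalone. This sketch (hypothetical names, not QEMU code) shows why both offsets matter: the IOMMUTLBEntry iova is relative to the start of the IOMMU MemoryRegion, but the mapped section may itself start at a non-zero offset within that region, so the container-visible IOVA needs both adjustments.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t hwaddr;

/* Illustrative model of a mapped section: a slice of a MemoryRegion
 * starting at offset_within_region, placed in the address space at
 * offset_within_address_space. */
typedef struct {
    hwaddr offset_within_region;        /* start of the mapped slice in the MR */
    hwaddr offset_within_address_space; /* where that slice sits in the AS */
} SectionModel;

/* Translate a region-relative IOMMUTLBEntry iova to the IOVA the VFIO
 * container should see. Subtracting offset_within_region is the step
 * the patch omits; it is a no-op only when the whole MR is mapped. */
static hwaddr container_iova(const SectionModel *s, hwaddr iotlb_iova)
{
    return iotlb_iova - s->offset_within_region
           + s->offset_within_address_space;
}
```

With the 2G MR example above (second half aliased at address 0), an entry at MR offset 1G+0x1000 must map to container IOVA 0x1000, not 1G+0x1000.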

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU
  2016-03-21  7:47 ` [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU Alexey Kardashevskiy
@ 2016-03-22  5:14   ` David Gibson
  2016-03-22  5:54     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 64+ messages in thread
From: David Gibson @ 2016-03-22  5:14 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel


On Mon, Mar 21, 2016 at 06:47:05PM +1100, Alexey Kardashevskiy wrote:
> The new VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
> This adds the ability to the VFIO common code to dynamically allocate/remove
> DMA windows in the host kernel when a new VFIO container is added/removed.
> 
> This adds a VFIO_IOMMU_SPAPR_TCE_CREATE ioctl call to vfio_listener_region_add
> and adds the just-created IOMMU into the host IOMMU list; the opposite
> action is taken in vfio_listener_region_del.
> 
> When creating a new window, this uses a heuristic to decide on the number
> of TCE table levels.
> 
> This should cause no guest-visible change in behavior.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v14:
> * new to the series
> 
> ---
> TODO:
> * export levels to PHB
> ---
>  hw/vfio/common.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
>  trace-events     |   2 ++
>  2 files changed, 105 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 4e873b7..421d6eb 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -279,6 +279,14 @@ static int vfio_host_iommu_add(VFIOContainer *container,
>      return 0;
>  }
>  
> +static void vfio_host_iommu_del(VFIOContainer *container, hwaddr min_iova)
> +{
> +    VFIOHostIOMMU *hiommu = vfio_host_iommu_lookup(container, min_iova, 0x1000);

The hard-coded 0x1000 looks dubious..

> +    g_assert(hiommu);
> +    QLIST_REMOVE(hiommu, hiommu_next);
> +}
> +
>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>  {
>      return (!memory_region_is_ram(section->mr) &&
> @@ -392,6 +400,61 @@ static void vfio_listener_region_add(MemoryListener *listener,
>      }
>      end = int128_get64(llend);
>  
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {

I think this would be clearer split out into a helper function,
vfio_create_host_window() or something.

> +        unsigned entries, pages;
> +        struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
> +
> +        g_assert(section->mr->iommu_ops);
> +        g_assert(memory_region_is_iommu(section->mr));

I don't think you need these asserts.  AFAICT the same logic should
work if a RAM MR was added directly to PCI address space - this would
create the new host window, then the existing code for adding a RAM MR
would map that block of RAM statically into the new window.

> +        trace_vfio_listener_region_add_iommu(iova, end - 1);
> +        /*
> +         * FIXME: For VFIO iommu types which have KVM acceleration to
> +         * avoid bouncing all map/unmaps through qemu this way, this
> +         * would be the right place to wire that up (tell the KVM
> +         * device emulation the VFIO iommu handles to use).
> +         */
> +        create.window_size = memory_region_size(section->mr);
> +        create.page_shift =
> +                ctz64(section->mr->iommu_ops->get_page_sizes(section->mr));

Ah.. except that I guess you'd need to fall back to host page size
here to handle a RAM MR.

> +        /*
> +         * SPAPR host supports multilevel TCE tables, there is some
> >>>>+         * heuristic to decide how many levels we want for our table:
> +         * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
> +         */
> +        entries = create.window_size >> create.page_shift;
> +        pages = (entries * sizeof(uint64_t)) / getpagesize();
> +        create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;
> +
> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> +        if (ret) {
> +            error_report("Failed to create a window, ret = %d (%m)", ret);
> +            goto fail;
> +        }
> +
> +        if (create.start_addr != section->offset_within_address_space ||
> +            vfio_host_iommu_lookup(container, create.start_addr,
> +                                   create.start_addr + create.window_size - 1)) {

Under what circumstances can this trigger?  Is the kernel ioctl
allowed to return a different window start address than the one
requested?

The second check looks very strange - if it returns true, doesn't that
mean you *do* have a host window which can accommodate this guest region,
which is what you want?

> +            struct vfio_iommu_spapr_tce_remove remove = {
> +                .argsz = sizeof(remove),
> +                .start_addr = create.start_addr
> +            };
> +            error_report("Host doesn't support DMA window at %"HWADDR_PRIx", must be %"PRIx64,
> +                         section->offset_within_address_space,
> +                         create.start_addr);
> +            ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> +            ret = -EINVAL;
> +            goto fail;
> +        }
> +        trace_vfio_spapr_create_window(create.page_shift,
> +                                       create.window_size,
> +                                       create.start_addr);
> +
> +        vfio_host_iommu_add(container, create.start_addr,
> +                            create.start_addr + create.window_size - 1,
> +                            1ULL << create.page_shift);
> +    }
> +
>      if (!vfio_host_iommu_lookup(container, iova, end - 1)) {
>          error_report("vfio: IOMMU container %p can't map guest IOVA region"
>                       " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
> @@ -525,6 +588,22 @@ static void vfio_listener_region_del(MemoryListener *listener,
>                       container, iova, end - iova, ret);
>      }
>  
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +        struct vfio_iommu_spapr_tce_remove remove = {
> +            .argsz = sizeof(remove),
> +            .start_addr = section->offset_within_address_space,
> +        };
> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> +        if (ret) {
> +            error_report("Failed to remove window at %"PRIx64,
> +                         remove.start_addr);
> +        }
> +
> +        vfio_host_iommu_del(container, section->offset_within_address_space);
> +
> +        trace_vfio_spapr_remove_window(remove.start_addr);
> +    }
> +
>      if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
>          iommu->iommu_ops->vfio_stop(section->mr);
>      }
> @@ -928,11 +1007,30 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              goto listener_release_exit;
>          }
>  
> -        /* The default table uses 4K pages */
> -        vfio_host_iommu_add(container, info.dma32_window_start,
> -                            info.dma32_window_start +
> -                            info.dma32_window_size - 1,
> -                            0x1000);
> +        if (v2) {
> +            /*
> +             * There is a default window in just created container.
> +             * To make region_add/del simpler, we better remove this
> +             * window now and let those iommu_listener callbacks
> +             * create/remove them when needed.
> +             */
> +            struct vfio_iommu_spapr_tce_remove remove = {
> +                .argsz = sizeof(remove),
> +                .start_addr = info.dma32_window_start,
> +            };
> +            ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> +            if (ret) {
> +                error_report("vfio: VFIO_IOMMU_SPAPR_TCE_REMOVE failed: %m");
> +                ret = -errno;
> +                goto free_container_exit;
> +            }
> +        } else {
> +            /* The default table uses 4K pages */
> +            vfio_host_iommu_add(container, info.dma32_window_start,
> +                                info.dma32_window_start +
> +                                info.dma32_window_size - 1,
> +                                0x1000);
> +        }
>      } else {
>          error_report("vfio: No available IOMMU models");
>          ret = -EINVAL;
> diff --git a/trace-events b/trace-events
> index cc619e1..f2b75a3 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1736,6 +1736,8 @@ vfio_region_finalize(const char *name, int index) "Device %s, region %d"
>  vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
>  vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>  vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> +vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
> +vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
>  
>  # hw/vfio/platform.c
>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU
  2016-03-22  5:14   ` David Gibson
@ 2016-03-22  5:54     ` Alexey Kardashevskiy
  2016-03-23  1:08       ` David Gibson
  0 siblings, 1 reply; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-22  5:54 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On 03/22/2016 04:14 PM, David Gibson wrote:
> On Mon, Mar 21, 2016 at 06:47:05PM +1100, Alexey Kardashevskiy wrote:
>> The new VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
>> This adds the ability to the VFIO common code to dynamically allocate/remove
>> DMA windows in the host kernel when a new VFIO container is added/removed.
>>
>> This adds a VFIO_IOMMU_SPAPR_TCE_CREATE ioctl call to vfio_listener_region_add
>> and adds the just-created IOMMU into the host IOMMU list; the opposite
>> action is taken in vfio_listener_region_del.
>>
>> When creating a new window, this uses a heuristic to decide on the number
>> of TCE table levels.
>>
>> This should cause no guest-visible change in behavior.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v14:
>> * new to the series
>>
>> ---
>> TODO:
>> * export levels to PHB
>> ---
>>   hw/vfio/common.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
>>   trace-events     |   2 ++
>>   2 files changed, 105 insertions(+), 5 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 4e873b7..421d6eb 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -279,6 +279,14 @@ static int vfio_host_iommu_add(VFIOContainer *container,
>>       return 0;
>>   }
>>
>> +static void vfio_host_iommu_del(VFIOContainer *container, hwaddr min_iova)
>> +{
>> +    VFIOHostIOMMU *hiommu = vfio_host_iommu_lookup(container, min_iova, 0x1000);
>
> The hard-coded 0x1000 looks dubious..

Well, that's the minimal page size...


>
>> +    g_assert(hiommu);
>> +    QLIST_REMOVE(hiommu, hiommu_next);
>> +}
>> +
>>   static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>>   {
>>       return (!memory_region_is_ram(section->mr) &&
>> @@ -392,6 +400,61 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>       }
>>       end = int128_get64(llend);
>>
>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>
> I think this would be clearer split out into a helper function,
> vfio_create_host_window() or something.


It would rather be vfio_spapr_create_host_window(), and we have been avoiding 
xxx_spapr_xxx so far. I could cut-and-paste the SPAPR PCI AS listener into a 
separate file, but that usually triggers more discussion and never ends well.
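What such a helper would compute can be modelled standalone. This sketch derives the number of TCE table levels from the window size and IOMMU page shift, implementing the mapping stated in the patch comment (0..64 backing pages = 1 level, 65..4096 = 2, 4097..262144 = 3, more = 4) directly with comparisons. The function name and the host page size parameter are assumptions for illustration, not the QEMU code.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of the table-depth heuristic from the patch comment:
 * size of the TCE-table backing store (in host pages) -> table levels. */
static unsigned spapr_window_levels(uint64_t window_size,
                                    unsigned page_shift,
                                    unsigned host_page_size)
{
    uint64_t entries = window_size >> page_shift;            /* number of TCEs */
    uint64_t pages = (entries * sizeof(uint64_t)) / host_page_size;

    if (pages <= 64) {
        return 1;
    } else if (pages <= 4096) {
        return 2;
    } else if (pages <= 262144) {
        return 3;
    }
    return 4;
}
```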



>> +        unsigned entries, pages;
>> +        struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
>> +
>> +        g_assert(section->mr->iommu_ops);
>> +        g_assert(memory_region_is_iommu(section->mr));
>
> I don't think you need these asserts.  AFAICT the same logic should
> work if a RAM MR was added directly to PCI address space - this would
> create the new host window, then the existing code for adding a RAM MR
> would map that block of RAM statically into the new window.

In what configuration/machine can we do that on SPAPR?


>> +        trace_vfio_listener_region_add_iommu(iova, end - 1);
>> +        /*
>> +         * FIXME: For VFIO iommu types which have KVM acceleration to
>> +         * avoid bouncing all map/unmaps through qemu this way, this
>> +         * would be the right place to wire that up (tell the KVM
>> +         * device emulation the VFIO iommu handles to use).
>> +         */
>> +        create.window_size = memory_region_size(section->mr);
>> +        create.page_shift =
>> +                ctz64(section->mr->iommu_ops->get_page_sizes(section->mr));
>
> Ah.. except that I guess you'd need to fall back to host page size
> here to handle a RAM MR.

Can you give an example of such RAM MR being added to PCI AS on SPAPR?


>> +        /*
>> +         * SPAPR host supports multilevel TCE tables, there is some
>> +         * heuristic to decide how many levels we want for our table:
>> +         * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
>> +         */
>> +        entries = create.window_size >> create.page_shift;
>> +        pages = (entries * sizeof(uint64_t)) / getpagesize();
>> +        create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;
>> +
>> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>> +        if (ret) {
>> +            error_report("Failed to create a window, ret = %d (%m)", ret);
>> +            goto fail;
>> +        }
>> +
>> +        if (create.start_addr != section->offset_within_address_space ||
>> +            vfio_host_iommu_lookup(container, create.start_addr,
>> +                                   create.start_addr + create.window_size - 1)) {
>
> Under what circumstances can this trigger?  Is the kernel ioctl
> allowed to return a different window start address than the one
> requested?

You already asked this some time ago :) The userspace cannot request an 
address; the host kernel returns one.


> The second check looks very strange - if it returns true, doesn't that
> mean you *do* have a host window which can accommodate this guest region,
> which is what you want?

This should not happen; this is what the check is for. I can make it an 
assert() or something like that.
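The check being discussed relies on an interval lookup: a new window is only consistent if no already-registered host window intersects it. A minimal model of that overlap test (hypothetical names, not the QEMU code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t hwaddr;

/* Illustrative model of a host DMA window as an inclusive IOVA range. */
typedef struct {
    hwaddr min_iova;
    hwaddr max_iova;
} HostWindowModel;

/* True if [start, end] intersects the registered window: this is the
 * condition a lookup over the window list would test for each entry. */
static bool windows_overlap(const HostWindowModel *w, hwaddr start, hwaddr end)
{
    return start <= w->max_iova && end >= w->min_iova;
}
```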


>
>> +            struct vfio_iommu_spapr_tce_remove remove = {
>> +                .argsz = sizeof(remove),
>> +                .start_addr = create.start_addr
>> +            };
>> +            error_report("Host doesn't support DMA window at %"HWADDR_PRIx", must be %"PRIx64,
>> +                         section->offset_within_address_space,
>> +                         create.start_addr);
>> +            ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
>> +            ret = -EINVAL;
>> +            goto fail;
>> +        }
>> +        trace_vfio_spapr_create_window(create.page_shift,
>> +                                       create.window_size,
>> +                                       create.start_addr);
>> +
>> +        vfio_host_iommu_add(container, create.start_addr,
>> +                            create.start_addr + create.window_size - 1,
>> +                            1ULL << create.page_shift);
>> +    }
>> +
>>       if (!vfio_host_iommu_lookup(container, iova, end - 1)) {
>>           error_report("vfio: IOMMU container %p can't map guest IOVA region"
>>                        " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
>> @@ -525,6 +588,22 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>                        container, iova, end - iova, ret);
>>       }
>>
>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>> +        struct vfio_iommu_spapr_tce_remove remove = {
>> +            .argsz = sizeof(remove),
>> +            .start_addr = section->offset_within_address_space,
>> +        };
>> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
>> +        if (ret) {
>> +            error_report("Failed to remove window at %"PRIx64,
>> +                         remove.start_addr);
>> +        }
>> +
>> +        vfio_host_iommu_del(container, section->offset_within_address_space);
>> +
>> +        trace_vfio_spapr_remove_window(remove.start_addr);
>> +    }
>> +
>>       if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
>>           iommu->iommu_ops->vfio_stop(section->mr);
>>       }
>> @@ -928,11 +1007,30 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>               goto listener_release_exit;
>>           }
>>
>> -        /* The default table uses 4K pages */
>> -        vfio_host_iommu_add(container, info.dma32_window_start,
>> -                            info.dma32_window_start +
>> -                            info.dma32_window_size - 1,
>> -                            0x1000);
>> +        if (v2) {
>> +            /*
>> +             * There is a default window in just created container.
>> +             * To make region_add/del simpler, we better remove this
>> +             * window now and let those iommu_listener callbacks
>> +             * create/remove them when needed.
>> +             */
>> +            struct vfio_iommu_spapr_tce_remove remove = {
>> +                .argsz = sizeof(remove),
>> +                .start_addr = info.dma32_window_start,
>> +            };
>> +            ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
>> +            if (ret) {
>> +                error_report("vfio: VFIO_IOMMU_SPAPR_TCE_REMOVE failed: %m");
>> +                ret = -errno;
>> +                goto free_container_exit;
>> +            }
>> +        } else {
>> +            /* The default table uses 4K pages */
>> +            vfio_host_iommu_add(container, info.dma32_window_start,
>> +                                info.dma32_window_start +
>> +                                info.dma32_window_size - 1,
>> +                                0x1000);
>> +        }
>>       } else {
>>           error_report("vfio: No available IOMMU models");
>>           ret = -EINVAL;
>> diff --git a/trace-events b/trace-events
>> index cc619e1..f2b75a3 100644
>> --- a/trace-events
>> +++ b/trace-events
>> @@ -1736,6 +1736,8 @@ vfio_region_finalize(const char *name, int index) "Device %s, region %d"
>>   vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
>>   vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>>   vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>> +vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
>> +vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
>>
>>   # hw/vfio/platform.c
>>   vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
>


-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v14 16/18] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO
  2016-03-22  4:45   ` David Gibson
@ 2016-03-22  6:24     ` Alexey Kardashevskiy
  2016-03-22 10:22       ` David Gibson
  0 siblings, 1 reply; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-22  6:24 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On 03/22/2016 03:45 PM, David Gibson wrote:
> On Mon, Mar 21, 2016 at 06:47:04PM +1100, Alexey Kardashevskiy wrote:
>> The sPAPR TCE tables manage two copies when VFIO is using an IOMMU -
>> a guest view of the table and a hardware TCE table. If there is no VFIO
>> presence in the address space, then just the guest view is used and,
>> in this case, it is allocated in KVM. However, since there is no
>> support yet for VFIO in the KVM TCE hypercalls, when we start using VFIO
>> we need to move the guest view from KVM to userspace; and we need
>> to do this for every IOMMU on a bus with VFIO devices.
>>
>> This adds vfio_start/vfio_stop callbacks in MemoryRegionIOMMUOps to
>> notify the IOMMU about the changing environment so it can reallocate the table
>> to/from KVM or (when available) hook the IOMMU groups with the logical
>> bus (LIOBN) in the KVM.
>>
>> This removes explicit spapr_tce_set_need_vfio() call from PCI hotplug
>> path as the new callbacks do this better - they notify IOMMU at
>> the exact moment when the configuration is changed, and this also
>> includes the case of PCI hot unplug.
>>
>> TODO: split into 2 or 3 patches, per maintainership area.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>
> I'm finding this one much easier to follow than the previous revision.
>
>> ---
>>   hw/ppc/spapr_iommu.c  | 12 ++++++++++++
>>   hw/ppc/spapr_pci.c    |  6 ------
>>   hw/vfio/common.c      |  9 +++++++++
>>   include/exec/memory.h |  4 ++++
>>   4 files changed, 25 insertions(+), 6 deletions(-)
>>
>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>> index 6dc3c45..702075d 100644
>> --- a/hw/ppc/spapr_iommu.c
>> +++ b/hw/ppc/spapr_iommu.c
>> @@ -151,6 +151,16 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
>>       return 1ULL << tcet->page_shift;
>>   }
>>
>> +static void spapr_tce_vfio_start(MemoryRegion *iommu)
>> +{
>> +    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), true);
>> +}
>> +
>> +static void spapr_tce_vfio_stop(MemoryRegion *iommu)
>> +{
>> +    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), false);
>> +}
>
> Wonder if a single callback which takes a boolean might be a little
> less clunky.

I have a feeling that at least once I was asked to do the opposite, and now 
we have take_ownership/release_ownership. This does not seem to be much 
different, and the existing names are more self-documenting than the 
previous vfio_notify() or whatever name I could think of.
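For comparison, the single-callback alternative could look like the sketch below. Everything here is illustrative: vfio_notify is a made-up name, and the structs are reduced models, not the QEMU MemoryRegionIOMMUOps.

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced model of an IOMMU MemoryRegion: one flag standing in for the
 * state spapr_tce_set_need_vfio() would toggle. */
typedef struct {
    bool need_vfio;
} MemoryRegionModel;

/* The alternative ops table: one callback taking a flag instead of the
 * vfio_start/vfio_stop pair. */
typedef struct {
    /* Called when VFIO starts (true) or stops (false) using this IOMMU */
    void (*vfio_notify)(MemoryRegionModel *iommu, bool attached);
} IOMMUOpsModel;

static void spapr_tce_vfio_notify(MemoryRegionModel *iommu, bool attached)
{
    iommu->need_vfio = attached;
}

static const IOMMUOpsModel spapr_iommu_ops_model = {
    .vfio_notify = spapr_tce_vfio_notify,
};
```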


>>   static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
>>   static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
>>
>> @@ -211,6 +221,8 @@ static const VMStateDescription vmstate_spapr_tce_table = {
>>   static MemoryRegionIOMMUOps spapr_iommu_ops = {
>>       .translate = spapr_tce_translate_iommu,
>>       .get_page_sizes = spapr_tce_get_page_sizes,
>> +    .vfio_start = spapr_tce_vfio_start,
>> +    .vfio_stop = spapr_tce_vfio_stop,
>>   };
>>
>>   static int spapr_tce_table_realize(DeviceState *dev)
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index bfcafdf..af99a36 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -1121,12 +1121,6 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>>       void *fdt = NULL;
>>       int fdt_start_offset = 0, fdt_size;
>>
>> -    if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
>> -        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
>> -
>> -        spapr_tce_set_need_vfio(tcet, true);
>> -    }
>> -
>>       if (dev->hotplugged) {
>>           fdt = create_device_tree(&fdt_size);
>>           fdt_start_offset = spapr_create_pci_child_dt(phb, pdev, fdt, 0);
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index b257655..4e873b7 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -421,6 +421,9 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>           QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>>
>>           memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
>> +        if (section->mr->iommu_ops && section->mr->iommu_ops->vfio_start) {
>> +            section->mr->iommu_ops->vfio_start(section->mr);
>> +        }
>>           memory_region_iommu_replay(giommu->iommu, &giommu->n,
>>                                      false);
>>
>> @@ -466,6 +469,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>       VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>>       hwaddr iova, end;
>>       int ret;
>> +    MemoryRegion *iommu = NULL;
>>
>>       if (vfio_listener_skipped_section(section)) {
>>           trace_vfio_listener_region_del_skip(
>> @@ -487,6 +491,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>           QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
>>               if (giommu->iommu == section->mr) {
>>                   memory_region_unregister_iommu_notifier(&giommu->n);
>> +                iommu = giommu->iommu;
>>                   QLIST_REMOVE(giommu, giommu_next);
>>                   g_free(giommu);
>>                   break;
>> @@ -519,6 +524,10 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>                        "0x%"HWADDR_PRIx") = %d (%m)",
>>                        container, iova, end - iova, ret);
>>       }
>> +
>> +    if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
>> +        iommu->iommu_ops->vfio_stop(section->mr);
>> +    }
>
> IIRC there can be multiple containers listening on the same PCI
> address space.  In that case, this won't be correct, because once one
> of the VFIO containers is removed, it will call vfio_stop, even though
> the other VFIO container still needs the guest IOMMU to support it.
>
> So I think you need some sort of refcounting here.


Right, missed this bit, good finding.


>
>>   }
>>
>>   static const MemoryListener vfio_memory_listener = {
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index eb5ce67..f1de133f 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -152,6 +152,10 @@ struct MemoryRegionIOMMUOps {
>>       IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
>>       /* Returns supported page sizes */
>>       uint64_t (*get_page_sizes)(MemoryRegion *iommu);
>> +    /* Called when VFIO starts using this */
>> +    void (*vfio_start)(MemoryRegion *iommu);
>> +    /* Called when VFIO stops using this */
>> +    void (*vfio_stop)(MemoryRegion *iommu);
>>   };
>>
>>   typedef struct CoalescedMemoryRange CoalescedMemoryRange;
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v14 15/18] vfio: Add host side IOMMU capabilities
  2016-03-22  4:20   ` David Gibson
@ 2016-03-22  6:47     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-22  6:47 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On 03/22/2016 03:20 PM, David Gibson wrote:
> On Mon, Mar 21, 2016 at 06:47:03PM +1100, Alexey Kardashevskiy wrote:
> > There are going to be multiple IOMMUs per container. This moves
>> the single host IOMMU parameter set to a list of VFIOHostIOMMU.
>>
>> This should cause no behavioral change and will be used later by
>> the SPAPR TCE IOMMU v2 which will also add a vfio_host_iommu_del() helper.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>
> This looks ok except for the name.  Calling each window a separate
> "host IOMMU" is misleading.  The different windows the container
> supports might be implemented by different IOMMUs on the host side, or
> they might be implemented by one IOMMU with multiple tables.
>
> Better to call them host DMA windows, or maybe container DMA windows.



VFIOHostDMAWindow it is then.




-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v14 01/18] memory: Fix IOMMU replay base address
  2016-03-22  4:59           ` David Gibson
@ 2016-03-22  7:19             ` Alexey Kardashevskiy
  2016-03-22 23:07               ` David Gibson
  0 siblings, 1 reply; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-22  7:19 UTC (permalink / raw)
  To: David Gibson; +Cc: Paolo Bonzini, Alex Williamson, qemu-ppc, qemu-devel

On 03/22/2016 03:59 PM, David Gibson wrote:
> On Tue, Mar 22, 2016 at 03:28:52PM +1100, Alexey Kardashevskiy wrote:
>> On 03/22/2016 02:26 PM, David Gibson wrote:
>>> On Tue, Mar 22, 2016 at 02:12:30PM +1100, Alexey Kardashevskiy wrote:
>>>> On 03/22/2016 11:49 AM, David Gibson wrote:
>>>>> On Mon, Mar 21, 2016 at 06:46:49PM +1100, Alexey Kardashevskiy wrote:
>>>>>> Since a788f227 "memory: Allow replay of IOMMU mapping notifications"
>>>>>> when a new VFIO listener is added, all existing IOMMU mappings are
>>>>>> replayed. However, the base address of an IOMMU memory region
>>>>>> (IOMMU MR) is ignored, which is not a problem for the existing user
>>>>>> (pseries) with its default 32-bit DMA window starting at 0, but it is
>>>>>> if there is another DMA window.
>>>>>>
>>>>>> This stores the IOMMU's offset_within_address_space and adjusts
>>>>>> the IOVA before calling vfio_dma_map/vfio_dma_unmap.
>>>>>>
>>>>>> As the IOMMU notifier expects IOVA offset rather than the absolute
>>>>>> address, this also adjusts IOVA in sPAPR H_PUT_TCE handler before
>>>>>> calling notifier(s).
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>>>>>
>>>>> On a closer look, I realised this still isn't quite correct, although
>>>>> I don't think any cases which would break it exist or are planned.
>>>>>
>>>>>> ---
>>>>>>   hw/ppc/spapr_iommu.c          |  2 +-
>>>>>>   hw/vfio/common.c              | 14 ++++++++------
>>>>>>   include/hw/vfio/vfio-common.h |  1 +
>>>>>>   3 files changed, 10 insertions(+), 7 deletions(-)
>>>>>>
>>>>>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>>>>>> index 7dd4588..277f289 100644
>>>>>> --- a/hw/ppc/spapr_iommu.c
>>>>>> +++ b/hw/ppc/spapr_iommu.c
>>>>>> @@ -277,7 +277,7 @@ static target_ulong put_tce_emu(sPAPRTCETable *tcet, target_ulong ioba,
>>>>>>       tcet->table[index] = tce;
>>>>>>
>>>>>>       entry.target_as = &address_space_memory,
>>>>>> -    entry.iova = ioba & page_mask;
>>>>>> +    entry.iova = (ioba - tcet->bus_offset) & page_mask;
>>>>>>       entry.translated_addr = tce & page_mask;
>>>>>>       entry.addr_mask = ~page_mask;
>>>>>>       entry.perm = spapr_tce_iommu_access_flags(tce);
>>>>>
>>>>> This bit's right.
>>>>>
>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>>>> index fb588d8..d45e2db 100644
>>>>>> --- a/hw/vfio/common.c
>>>>>> +++ b/hw/vfio/common.c
>>>>>> @@ -257,14 +257,14 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
>>>>>>       VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>>>>>       VFIOContainer *container = giommu->container;
>>>>>>       IOMMUTLBEntry *iotlb = data;
>>>>>> +    hwaddr iova = iotlb->iova + giommu->offset_within_address_space;
>>>>>
>>>>> This bit might be right, depending on how you define giommu->offset_within_address_space.
>>>>>
>>>>>>       MemoryRegion *mr;
>>>>>>       hwaddr xlat;
>>>>>>       hwaddr len = iotlb->addr_mask + 1;
>>>>>>       void *vaddr;
>>>>>>       int ret;
>>>>>>
>>>>>> -    trace_vfio_iommu_map_notify(iotlb->iova,
>>>>>> -                                iotlb->iova + iotlb->addr_mask);
>>>>>> +    trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
>>>>>>
>>>>>>       /*
>>>>>>        * The IOMMU TLB entry we have just covers translation through
>>>>>> @@ -291,21 +291,21 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
>>>>>>
>>>>>>       if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
>>>>>>           vaddr = memory_region_get_ram_ptr(mr) + xlat;
>>>>>> -        ret = vfio_dma_map(container, iotlb->iova,
>>>>>> +        ret = vfio_dma_map(container, iova,
>>>>>>                              iotlb->addr_mask + 1, vaddr,
>>>>>>                              !(iotlb->perm & IOMMU_WO) || mr->readonly);
>>>>>>           if (ret) {
>>>>>>               error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>>>>>>                            "0x%"HWADDR_PRIx", %p) = %d (%m)",
>>>>>> -                         container, iotlb->iova,
>>>>>> +                         container, iova,
>>>>>>                            iotlb->addr_mask + 1, vaddr, ret);
>>>>>>           }
>>>>>>       } else {
>>>>>> -        ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
>>>>>> +        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
>>>>>>           if (ret) {
>>>>>>               error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>>>>>>                            "0x%"HWADDR_PRIx") = %d (%m)",
>>>>>> -                         container, iotlb->iova,
>>>>>> +                         container, iova,
>>>>>>                            iotlb->addr_mask + 1, ret);
>>>>>>           }
>>>>>>       }
>>>>>
>>>>> This is fine.
>>>>>
>>>>>> @@ -377,6 +377,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>>>>>            */
>>>>>>           giommu = g_malloc0(sizeof(*giommu));
>>>>>>           giommu->iommu = section->mr;
>>>>>> +        giommu->offset_within_address_space =
>>>>>> +            section->offset_within_address_space;
>>>>>
>>>>> But here there's a problem.  The iova in IOMMUTLBEntry is relative to
>>>>> the IOMMU MemoryRegion, but - at least in theory - only a subsection
>>>>> of that MemoryRegion could be mapped into the AddressSpace.
>>>>
>>>> But the IOMMU MR stays the same - size, offset, and iova will be relative to
>>>> its start, why does it matter if only a portion is mapped?
>>>
>>> Because the portion mapped may not sit at the start of the MR.  For
>>> example if you had a 2G MR, and the second half is mapped at address 0
>>> in the AS,
>>
>> My imagination fails here. How could you do this in practice?
>>
>> address_space_init(&as, &root)
>> memory_region_init(&mr, 2GB)
>> memory_region_add_subregion(&root, -1GB, &mr)
>>
>> But offsets are unsigned.
>>
>> In general, how would one map only a half? Which memory_region_add_xxx()
>> call does that?
>
> I'm not totally sure, but I think you can do it with:


Ok. Got it. So, how about this:

s/offset_within_address_space/iommu_offset/

and

giommu->iommu_offset = section->offset_within_address_space -
section->offset_within_region;

?


>
> address_space_init(&as, &root)
> memory_region_init(&mr0, 2GB)
> memory_region_init_alias(&mr1, &mr0, 1GB, 1GB)
> memory_region_add_subregion(&root, 0, &mr1)
>
> But the point is that it's possible for offset_within_region to be
> non-zero, and if it is, you need to take it into account.

I was not arguing this, I was trying to imagine :)



-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v14 16/18] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO
  2016-03-22  6:24     ` Alexey Kardashevskiy
@ 2016-03-22 10:22       ` David Gibson
  0 siblings, 0 replies; 64+ messages in thread
From: David Gibson @ 2016-03-22 10:22 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On Tue, Mar 22, 2016 at 05:24:33PM +1100, Alexey Kardashevskiy wrote:
> On 03/22/2016 03:45 PM, David Gibson wrote:
> >On Mon, Mar 21, 2016 at 06:47:04PM +1100, Alexey Kardashevskiy wrote:
> >>The sPAPR TCE tables manage 2 copies when VFIO is using an IOMMU -
> >>a guest view of the table and a hardware TCE table. If there is no VFIO
> >>presence in the address space, then just the guest view is used; if
> >>this is the case, it is allocated in KVM. However, since there is no
> >>support yet for VFIO in KVM TCE hypercalls, when we start using VFIO,
> >>we need to move the guest view from KVM to userspace; and we need
> >>to do this for every IOMMU on a bus with VFIO devices.
> >>
> >>This adds vfio_start/vfio_stop callbacks in MemoryRegionIOMMUOps to
> >>notify the IOMMU about the changing environment so it can reallocate the table
> >>to/from KVM or (when available) hook the IOMMU groups with the logical
> >>bus (LIOBN) in the KVM.
> >>
> >>This removes the explicit spapr_tce_set_need_vfio() call from the PCI
> >>hotplug path, as the new callbacks do this better: they notify the IOMMU at
> >>the exact moment when the configuration is changed, and this also
> >>includes the case of PCI hot unplug.
> >>
> >>TODO: split into 2 or 3 patches, per maintainership area.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >
> >I'm finding this one much easier to follow than the previous revision.
> >
> >>---
> >>  hw/ppc/spapr_iommu.c  | 12 ++++++++++++
> >>  hw/ppc/spapr_pci.c    |  6 ------
> >>  hw/vfio/common.c      |  9 +++++++++
> >>  include/exec/memory.h |  4 ++++
> >>  4 files changed, 25 insertions(+), 6 deletions(-)
> >>
> >>diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> >>index 6dc3c45..702075d 100644
> >>--- a/hw/ppc/spapr_iommu.c
> >>+++ b/hw/ppc/spapr_iommu.c
> >>@@ -151,6 +151,16 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
> >>      return 1ULL << tcet->page_shift;
> >>  }
> >>
> >>+static void spapr_tce_vfio_start(MemoryRegion *iommu)
> >>+{
> >>+    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), true);
> >>+}
> >>+
> >>+static void spapr_tce_vfio_stop(MemoryRegion *iommu)
> >>+{
> >>+    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), false);
> >>+}
> >
> >Wonder if a single callback which takes a boolean might be a little
> >less clunky.
> 
> I have a feeling that at least once I was asked to do the opposite and now
> we have take_ownership/release_ownership. This does not seem to be much
> different and the existing names are more self-documenting than the previous
> vfio_notify() or whatever name I could think of.

Ok, leave it as is.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH qemu v14 12/18] vfio: Check that IOMMU MR translates to system address space
  2016-03-22  3:05   ` David Gibson
@ 2016-03-22 15:47     ` Alex Williamson
  2016-03-23  0:43       ` David Gibson
  2016-03-23  0:44       ` Alexey Kardashevskiy
  0 siblings, 2 replies; 64+ messages in thread
From: Alex Williamson @ 2016-03-22 15:47 UTC (permalink / raw)
  To: David Gibson; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel

On Tue, 22 Mar 2016 14:05:15 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Mon, Mar 21, 2016 at 06:47:00PM +1100, Alexey Kardashevskiy wrote:
> > At the moment an IOMMU MR only translates to system memory.
> > However, if some new code changes this, we will need a clear indication of
> > why it is not working, so here is the check.
> > 
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>  
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> 
> Alex, any chance we could merge this quickly, since it is a reasonable
> sanity check even without the rest of the changes.

It all sounds very theoretical to inspire some rush to merge it
quickly; is there any chance we could actually hit this currently?

> > ---
> > Changes:
> > v14:
> > * new to the series
> > ---
> >  hw/vfio/common.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > index 55723c9..9587c25 100644
> > --- a/hw/vfio/common.c
> > +++ b/hw/vfio/common.c
> > @@ -266,6 +266,12 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
> >  
> >      trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
> >  
> > +    if (iotlb->target_as != &address_space_memory) {
> > +        error_report("Wrong target AS \"%s\", only system memory is allowed",
> > +                     iotlb->target_as->name?iotlb->target_as->name:"noname");

Spaces please.

> > +        return;
> > +    }
> > +
> >      /*
> >       * The IOMMU TLB entry we have just covers translation through
> >       * this IOMMU to its immediate target.  We need to translate  
> 


* Re: [Qemu-devel] [PATCH qemu v14 01/18] memory: Fix IOMMU replay base address
  2016-03-22  7:19             ` Alexey Kardashevskiy
@ 2016-03-22 23:07               ` David Gibson
  0 siblings, 0 replies; 64+ messages in thread
From: David Gibson @ 2016-03-22 23:07 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Paolo Bonzini, Alex Williamson, qemu-ppc, qemu-devel

On Tue, Mar 22, 2016 at 06:19:40PM +1100, Alexey Kardashevskiy wrote:
> On 03/22/2016 03:59 PM, David Gibson wrote:
> >On Tue, Mar 22, 2016 at 03:28:52PM +1100, Alexey Kardashevskiy wrote:
> >>On 03/22/2016 02:26 PM, David Gibson wrote:
> >>>On Tue, Mar 22, 2016 at 02:12:30PM +1100, Alexey Kardashevskiy wrote:
> >>>>On 03/22/2016 11:49 AM, David Gibson wrote:
> >>>>>On Mon, Mar 21, 2016 at 06:46:49PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>Since a788f227 "memory: Allow replay of IOMMU mapping notifications"
> >>>>>>when a new VFIO listener is added, all existing IOMMU mappings are
> >>>>>>replayed. However, the base address of an IOMMU memory region
> >>>>>>(IOMMU MR) is ignored, which is not a problem for the existing user
> >>>>>>(pseries) with its default 32-bit DMA window starting at 0, but it is
> >>>>>>if there is another DMA window.
> >>>>>>
> >>>>>>This stores the IOMMU's offset_within_address_space and adjusts
> >>>>>>the IOVA before calling vfio_dma_map/vfio_dma_unmap.
> >>>>>>
> >>>>>>As the IOMMU notifier expects IOVA offset rather than the absolute
> >>>>>>address, this also adjusts IOVA in sPAPR H_PUT_TCE handler before
> >>>>>>calling notifier(s).
> >>>>>>
> >>>>>>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>>>Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> >>>>>
> >>>>>On a closer look, I realised this still isn't quite correct, although
> >>>>>I don't think any cases which would break it exist or are planned.
> >>>>>
> >>>>>>---
> >>>>>>  hw/ppc/spapr_iommu.c          |  2 +-
> >>>>>>  hw/vfio/common.c              | 14 ++++++++------
> >>>>>>  include/hw/vfio/vfio-common.h |  1 +
> >>>>>>  3 files changed, 10 insertions(+), 7 deletions(-)
> >>>>>>
> >>>>>>diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> >>>>>>index 7dd4588..277f289 100644
> >>>>>>--- a/hw/ppc/spapr_iommu.c
> >>>>>>+++ b/hw/ppc/spapr_iommu.c
> >>>>>>@@ -277,7 +277,7 @@ static target_ulong put_tce_emu(sPAPRTCETable *tcet, target_ulong ioba,
> >>>>>>      tcet->table[index] = tce;
> >>>>>>
> >>>>>>      entry.target_as = &address_space_memory,
> >>>>>>-    entry.iova = ioba & page_mask;
> >>>>>>+    entry.iova = (ioba - tcet->bus_offset) & page_mask;
> >>>>>>      entry.translated_addr = tce & page_mask;
> >>>>>>      entry.addr_mask = ~page_mask;
> >>>>>>      entry.perm = spapr_tce_iommu_access_flags(tce);
> >>>>>
> >>>>>This bit's right.
> >>>>>
> >>>>>>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>>>>>index fb588d8..d45e2db 100644
> >>>>>>--- a/hw/vfio/common.c
> >>>>>>+++ b/hw/vfio/common.c
> >>>>>>@@ -257,14 +257,14 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
> >>>>>>      VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
> >>>>>>      VFIOContainer *container = giommu->container;
> >>>>>>      IOMMUTLBEntry *iotlb = data;
> >>>>>>+    hwaddr iova = iotlb->iova + giommu->offset_within_address_space;
> >>>>>
> >>>>>This bit might be right, depending on how you define giommu->offset_within_address_space.
> >>>>>
> >>>>>>      MemoryRegion *mr;
> >>>>>>      hwaddr xlat;
> >>>>>>      hwaddr len = iotlb->addr_mask + 1;
> >>>>>>      void *vaddr;
> >>>>>>      int ret;
> >>>>>>
> >>>>>>-    trace_vfio_iommu_map_notify(iotlb->iova,
> >>>>>>-                                iotlb->iova + iotlb->addr_mask);
> >>>>>>+    trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
> >>>>>>
> >>>>>>      /*
> >>>>>>       * The IOMMU TLB entry we have just covers translation through
> >>>>>>@@ -291,21 +291,21 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
> >>>>>>
> >>>>>>      if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> >>>>>>          vaddr = memory_region_get_ram_ptr(mr) + xlat;
> >>>>>>-        ret = vfio_dma_map(container, iotlb->iova,
> >>>>>>+        ret = vfio_dma_map(container, iova,
> >>>>>>                             iotlb->addr_mask + 1, vaddr,
> >>>>>>                             !(iotlb->perm & IOMMU_WO) || mr->readonly);
> >>>>>>          if (ret) {
> >>>>>>              error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> >>>>>>                           "0x%"HWADDR_PRIx", %p) = %d (%m)",
> >>>>>>-                         container, iotlb->iova,
> >>>>>>+                         container, iova,
> >>>>>>                           iotlb->addr_mask + 1, vaddr, ret);
> >>>>>>          }
> >>>>>>      } else {
> >>>>>>-        ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
> >>>>>>+        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
> >>>>>>          if (ret) {
> >>>>>>              error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> >>>>>>                           "0x%"HWADDR_PRIx") = %d (%m)",
> >>>>>>-                         container, iotlb->iova,
> >>>>>>+                         container, iova,
> >>>>>>                           iotlb->addr_mask + 1, ret);
> >>>>>>          }
> >>>>>>      }
> >>>>>
> >>>>>This is fine.
> >>>>>
> >>>>>>@@ -377,6 +377,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >>>>>>           */
> >>>>>>          giommu = g_malloc0(sizeof(*giommu));
> >>>>>>          giommu->iommu = section->mr;
> >>>>>>+        giommu->offset_within_address_space =
> >>>>>>+            section->offset_within_address_space;
> >>>>>
> >>>>>But here there's a problem.  The iova in IOMMUTLBEntry is relative to
> >>>>>the IOMMU MemoryRegion, but - at least in theory - only a subsection
> >>>>>of that MemoryRegion could be mapped into the AddressSpace.
> >>>>
> >>>>But the IOMMU MR stays the same - size, offset, and iova will be relative to
> >>>>its start, why does it matter if only a portion is mapped?
> >>>
> >>>Because the portion mapped may not sit at the start of the MR.  For
> >>>example if you had a 2G MR, and the second half is mapped at address 0
> >>>in the AS,
> >>
> >>My imagination fails here. How could you do this in practice?
> >>
> >>address_space_init(&as, &root)
> >>memory_region_init(&mr, 2GB)
> >>memory_region_add_subregion(&root, -1GB, &mr)
> >>
> >>But offsets are unsigned.
> >>
> >>In general, how would one map only a half? Which memory_region_add_xxx()
> >>call does that?
> >
> >I'm not totally sure, but I think you can do it with:
> 
> 
> Ok. Got it. So, how about this:
> 
> s/offset_within_address_space/iommu_offset/
> 
> and
> 
> giommu->iommu_offset = section->offset_within_address_space -
> section->offset_within_region;

Yes, that should do it.

> 
> ?
> 
> 
> >
> >address_space_init(&as, &root)
> >memory_region_init(&mr0, 2GB)
> >memory_region_init_alias(&mr1, &mr0, 1GB, 1GB)
> >memory_region_add_subregion(&root, 0, &mr1)
> >
> >But the point is that it's possible for offset_within_region to be
> >non-zero, and if it is, you need to take it into account.
> 
> I was not arguing this, I was trying to imagine :)
> 
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH qemu v14 12/18] vfio: Check that IOMMU MR translates to system address space
  2016-03-22 15:47     ` Alex Williamson
@ 2016-03-23  0:43       ` David Gibson
  2016-03-23  0:44       ` Alexey Kardashevskiy
  1 sibling, 0 replies; 64+ messages in thread
From: David Gibson @ 2016-03-23  0:43 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel

On Tue, Mar 22, 2016 at 09:47:10AM -0600, Alex Williamson wrote:
> On Tue, 22 Mar 2016 14:05:15 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Mon, Mar 21, 2016 at 06:47:00PM +1100, Alexey Kardashevskiy wrote:
> > > At the moment an IOMMU MR only translates to system memory.
> > > However, if some new code changes this, we will need a clear indication of
> > > why it is not working, so here is the check.
> > > 
> > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>  
> > 
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > 
> > Alex, any chance we could merge this quickly, since it is a reasonable
> > sanity check even without the rest of the changes.
> 
> It all sounds very theoretical to inspire some rush to merge it
> quickly; is there any chance we could actually hit this currently?

Hm, I guess not.  Ok, let's leave it for now.

> 
> > > ---
> > > Changes:
> > > v14:
> > > * new to the series
> > > ---
> > >  hw/vfio/common.c | 6 ++++++
> > >  1 file changed, 6 insertions(+)
> > > 
> > > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > > index 55723c9..9587c25 100644
> > > --- a/hw/vfio/common.c
> > > +++ b/hw/vfio/common.c
> > > @@ -266,6 +266,12 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
> > >  
> > >      trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
> > >  
> > > +    if (iotlb->target_as != &address_space_memory) {
> > > +        error_report("Wrong target AS \"%s\", only system memory is allowed",
> > > +                     iotlb->target_as->name?iotlb->target_as->name:"noname");
> 
> Spaces please.
> 
> > > +        return;
> > > +    }
> > > +
> > >      /*
> > >       * The IOMMU TLB entry we have just covers translation through
> > >       * this IOMMU to its immediate target.  We need to translate  
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH qemu v14 12/18] vfio: Check that IOMMU MR translates to system address space
  2016-03-22 15:47     ` Alex Williamson
  2016-03-23  0:43       ` David Gibson
@ 2016-03-23  0:44       ` Alexey Kardashevskiy
  1 sibling, 0 replies; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-23  0:44 UTC (permalink / raw)
  To: Alex Williamson, David Gibson; +Cc: qemu-ppc, qemu-devel

On 03/23/2016 02:47 AM, Alex Williamson wrote:
> On Tue, 22 Mar 2016 14:05:15 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
>> On Mon, Mar 21, 2016 at 06:47:00PM +1100, Alexey Kardashevskiy wrote:
>>> At the moment an IOMMU MR only translates to system memory.
>>> However, if some new code changes this, we will need a clear indication of
>>> why it is not working, so here is the check.
>>>
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>
>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>>
>> Alex, any chance we could merge this quickly, since it is a reasonable
>> sanity check even without the rest of the changes.
>
> It all sounds very theoretical to inspire some rush to merge it
> quickly; is there any chance we could actually hit this currently?


The chances are as big as the chances that some platform starts supporting
VFIO soon; for those new folks such a check would be a good piece of
documentation, or at least a warning trigger to ask a question on the lists.



>>> ---
>>> Changes:
>>> v14:
>>> * new to the series
>>> ---
>>>   hw/vfio/common.c | 6 ++++++
>>>   1 file changed, 6 insertions(+)
>>>
>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>> index 55723c9..9587c25 100644
>>> --- a/hw/vfio/common.c
>>> +++ b/hw/vfio/common.c
>>> @@ -266,6 +266,12 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
>>>
>>>       trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
>>>
>>> +    if (iotlb->target_as != &address_space_memory) {
>>> +        error_report("Wrong target AS \"%s\", only system memory is allowed",
>>> +                     iotlb->target_as->name?iotlb->target_as->name:"noname");
>
> Spaces please.
>
>>> +        return;
>>> +    }
>>> +
>>>       /*
>>>        * The IOMMU TLB entry we have just covers translation through
>>>        * this IOMMU to its immediate target.  We need to translate
>>
>


-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU
  2016-03-22  5:54     ` Alexey Kardashevskiy
@ 2016-03-23  1:08       ` David Gibson
  2016-03-23  2:12         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 64+ messages in thread
From: David Gibson @ 2016-03-23  1:08 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On Tue, Mar 22, 2016 at 04:54:07PM +1100, Alexey Kardashevskiy wrote:
> On 03/22/2016 04:14 PM, David Gibson wrote:
> >On Mon, Mar 21, 2016 at 06:47:05PM +1100, Alexey Kardashevskiy wrote:
> >>New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
> >>This adds the ability for VFIO common code to dynamically allocate/remove
> >>DMA windows in the host kernel when a new VFIO container is added/removed.
> >>
> >>This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
> >>and adds the just-created IOMMU to the host IOMMU list; the opposite
> >>action is taken in vfio_listener_region_del.
> >>
> >>When creating a new window, this uses a heuristic to decide on the number
> >>of TCE table levels.
> >>
> >>This should cause no guest visible change in behavior.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>---
> >>Changes:
> >>v14:
> >>* new to the series
> >>
> >>---
> >>TODO:
> >>* export levels to PHB
> >>---
> >>  hw/vfio/common.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
> >>  trace-events     |   2 ++
> >>  2 files changed, 105 insertions(+), 5 deletions(-)
> >>
> >>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>index 4e873b7..421d6eb 100644
> >>--- a/hw/vfio/common.c
> >>+++ b/hw/vfio/common.c
> >>@@ -279,6 +279,14 @@ static int vfio_host_iommu_add(VFIOContainer *container,
> >>      return 0;
> >>  }
> >>
> >>+static void vfio_host_iommu_del(VFIOContainer *container, hwaddr min_iova)
> >>+{
> >>+    VFIOHostIOMMU *hiommu = vfio_host_iommu_lookup(container, min_iova, 0x1000);
> >
> >The hard-coded 0x1000 looks dubious..
> 
> Well, that's the minimal page size...

Really?  Some BookE CPUs support a 1KiB page size.

> >>+    g_assert(hiommu);
> >>+    QLIST_REMOVE(hiommu, hiommu_next);
> >>+}
> >>+
> >>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> >>  {
> >>      return (!memory_region_is_ram(section->mr) &&
> >>@@ -392,6 +400,61 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >>      }
> >>      end = int128_get64(llend);
> >>
> >>+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> >
> >I think this would be clearer split out into a helper function,
> >vfio_create_host_window() or something.
> 
> 
> It is rather vfio_spapr_create_host_window() and we were avoiding
> xxx_spapr_xxx so far. I'd cut-n-paste the SPAPR PCI AS listener to a
> separate file but this usually triggers more discussion and never ends well.
> 
> 
> 
> >>+        unsigned entries, pages;
> >>+        struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
> >>+
> >>+        g_assert(section->mr->iommu_ops);
> >>+        g_assert(memory_region_is_iommu(section->mr));
> >
> >I don't think you need these asserts.  AFAICT the same logic should
> >work if a RAM MR was added directly to PCI address space - this would
> >create the new host window, then the existing code for adding a RAM MR
> >would map that block of RAM statically into the new window.
> 
> In what configuration/machine can we do that on SPAPR?

spapr guests won't ever do that.  But you can run an x86 guest on a
powernv host and this situation could come up.  In any case there's no
point asserting if the code is correct anyway.

> >>+        trace_vfio_listener_region_add_iommu(iova, end - 1);
> >>+        /*
> >>+         * FIXME: For VFIO iommu types which have KVM acceleration to
> >>+         * avoid bouncing all map/unmaps through qemu this way, this
> >>+         * would be the right place to wire that up (tell the KVM
> >>+         * device emulation the VFIO iommu handles to use).
> >>+         */
> >>+        create.window_size = memory_region_size(section->mr);
> >>+        create.page_shift =
> >>+                ctz64(section->mr->iommu_ops->get_page_sizes(section->mr));
> >
> >Ah.. except that I guess you'd need to fall back to host page size
> >here to handle a RAM MR.
> 
> Can you give an example of such a RAM MR being added to PCI AS on
> SPAPR?

On spapr, no.  But you can run other machine types as guests (at least
with TCG) on a host with the spapr IOMMU.

> >>+        /*
> >>+         * SPAPR host supports multilevel TCE tables, there is some
> >>+         * heuristic to decide how many levels we want for our table:
> >>+         * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
> >>+         */
> >>+        entries = create.window_size >> create.page_shift;
> >>+        pages = (entries * sizeof(uint64_t)) / getpagesize();
> >>+        create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;
> >>+
> >>+        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> >>+        if (ret) {
> >>+            error_report("Failed to create a window, ret = %d (%m)", ret);
> >>+            goto fail;
> >>+        }
> >>+
> >>+        if (create.start_addr != section->offset_within_address_space ||
> >>+            vfio_host_iommu_lookup(container, create.start_addr,
> >>+                                   create.start_addr + create.window_size - 1)) {
> >
> >Under what circumstances can this trigger?  Is the kernel ioctl
> >allowed to return a different window start address than the one
> >requested?
> 
> You already asked this some time ago :) The userspace cannot request
> an address; the host kernel returns one.

Ok.  For generality it would be nice if you could succeed here as long
as the new host window covers the requested guest window, even if it
doesn't match exactly.  And for that matter to not request the new
window if the host already has a window covering the guest region.

> >The second check looks very strange - if it returns true doesn't that
> >mean you *do* have a host window which can accommodate this guest region,
> >which is what you want?
> 
> This should not happen; this is what this check is for. I can make it an
> assert() or something like this.

Oh.. I see.  Because you've done the ioctl, but not recorded the new
host window in the list yet.

No, I think the correct approach is to look for an existing host
window containing the requested guest window *before* you try to
create a new host window.  If one is already there, you can just carry
on.
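The lookup-before-create flow suggested here can be sketched as follows. This is a toy model with made-up types; the real code would keep the window list on the VFIOContainer and issue VFIO_IOMMU_SPAPR_TCE_CREATE where the comment indicates:

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-in for the per-container host window list. */
typedef struct HostWindow {
    uint64_t min_iova;
    uint64_t max_iova;
    struct HostWindow *next;
} HostWindow;

/* Return the existing window fully covering [min_iova, max_iova], if any. */
static HostWindow *host_window_covering(HostWindow *list,
                                        uint64_t min_iova, uint64_t max_iova)
{
    HostWindow *hw;

    for (hw = list; hw; hw = hw->next) {
        if (hw->min_iova <= min_iova && max_iova <= hw->max_iova) {
            return hw;
        }
    }
    return NULL;
}

/*
 * Look for a covering host window *before* creating a new one; only
 * fall through to creation when nothing matches.  In the real code the
 * creation step would be the VFIO_IOMMU_SPAPR_TCE_CREATE ioctl.
 */
static int ensure_host_window(HostWindow **list,
                              uint64_t min_iova, uint64_t max_iova)
{
    HostWindow *hw;

    if (host_window_covering(*list, min_iova, max_iova)) {
        return 0; /* already covered, just carry on */
    }
    hw = calloc(1, sizeof(*hw)); /* models a successful window creation */
    if (!hw) {
        return -1;
    }
    hw->min_iova = min_iova;
    hw->max_iova = max_iova;
    hw->next = *list;
    *list = hw;
    return 0;
}
```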

> >>+            struct vfio_iommu_spapr_tce_remove remove = {
> >>+                .argsz = sizeof(remove),
> >>+                .start_addr = create.start_addr
> >>+            };
> >>+            error_report("Host doesn't support DMA window at %"HWADDR_PRIx", must be %"PRIx64,
> >>+                         section->offset_within_address_space,
> >>+                         create.start_addr);
> >>+            ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> >>+            ret = -EINVAL;
> >>+            goto fail;
> >>+        }
> >>+        trace_vfio_spapr_create_window(create.page_shift,
> >>+                                       create.window_size,
> >>+                                       create.start_addr);
> >>+
> >>+        vfio_host_iommu_add(container, create.start_addr,
> >>+                            create.start_addr + create.window_size - 1,
> >>+                            1ULL << create.page_shift);
> >>+    }
> >>+
> >>      if (!vfio_host_iommu_lookup(container, iova, end - 1)) {
> >>          error_report("vfio: IOMMU container %p can't map guest IOVA region"
> >>                       " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
> >>@@ -525,6 +588,22 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >>                       container, iova, end - iova, ret);
> >>      }
> >>
> >>+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> >>+        struct vfio_iommu_spapr_tce_remove remove = {
> >>+            .argsz = sizeof(remove),
> >>+            .start_addr = section->offset_within_address_space,
> >>+        };
> >>+        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> >>+        if (ret) {
> >>+            error_report("Failed to remove window at %"PRIx64,
> >>+                         remove.start_addr);
> >>+        }
> >>+
> >>+        vfio_host_iommu_del(container, section->offset_within_address_space);
> >>+
> >>+        trace_vfio_spapr_remove_window(remove.start_addr);
> >>+    }
> >>+
> >>      if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
> >>          iommu->iommu_ops->vfio_stop(section->mr);
> >>      }
> >>@@ -928,11 +1007,30 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>              goto listener_release_exit;
> >>          }
> >>
> >>-        /* The default table uses 4K pages */
> >>-        vfio_host_iommu_add(container, info.dma32_window_start,
> >>-                            info.dma32_window_start +
> >>-                            info.dma32_window_size - 1,
> >>-                            0x1000);
> >>+        if (v2) {
> >>+            /*
> >>+             * There is a default window in a just-created container.
> >>+             * To make region_add/del simpler, we had better remove this
> >>+             * window now and let the iommu_listener callbacks
> >>+             * create/remove windows when needed.
> >>+             */
> >>+            struct vfio_iommu_spapr_tce_remove remove = {
> >>+                .argsz = sizeof(remove),
> >>+                .start_addr = info.dma32_window_start,
> >>+            };
> >>+            ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> >>+            if (ret) {
> >>+                error_report("vfio: VFIO_IOMMU_SPAPR_TCE_REMOVE failed: %m");
> >>+                ret = -errno;
> >>+                goto free_container_exit;
> >>+            }
> >>+        } else {
> >>+            /* The default table uses 4K pages */
> >>+            vfio_host_iommu_add(container, info.dma32_window_start,
> >>+                                info.dma32_window_start +
> >>+                                info.dma32_window_size - 1,
> >>+                                0x1000);
> >>+        }
> >>      } else {
> >>          error_report("vfio: No available IOMMU models");
> >>          ret = -EINVAL;
> >>diff --git a/trace-events b/trace-events
> >>index cc619e1..f2b75a3 100644
> >>--- a/trace-events
> >>+++ b/trace-events
> >>@@ -1736,6 +1736,8 @@ vfio_region_finalize(const char *name, int index) "Device %s, region %d"
> >>  vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
> >>  vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> >>  vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> >>+vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
> >>+vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
> >>
> >>  # hw/vfio/platform.c
> >>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
> >
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU
  2016-03-23  1:08       ` David Gibson
@ 2016-03-23  2:12         ` Alexey Kardashevskiy
  2016-03-23  2:53           ` David Gibson
  0 siblings, 1 reply; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-23  2:12 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On 03/23/2016 12:08 PM, David Gibson wrote:
> On Tue, Mar 22, 2016 at 04:54:07PM +1100, Alexey Kardashevskiy wrote:
>> On 03/22/2016 04:14 PM, David Gibson wrote:
>>> On Mon, Mar 21, 2016 at 06:47:05PM +1100, Alexey Kardashevskiy wrote:
>>>> New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
>>>> This adds the ability to the VFIO common code to dynamically allocate/remove
>>>> DMA windows in the host kernel when a new VFIO container is added/removed.
>>>>
>>>> This adds a VFIO_IOMMU_SPAPR_TCE_CREATE ioctl call to vfio_listener_region_add
>>>> and adds the just-created IOMMU window to the host IOMMU list; the opposite
>>>> action is taken in vfio_listener_region_del.
>>>>
>>>> When creating a new window, this uses a heuristic to decide on the number
>>>> of TCE table levels.
>>>>
>>>> This should cause no guest-visible change in behavior.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>> Changes:
>>>> v14:
>>>> * new to the series
>>>>
>>>> ---
>>>> TODO:
>>>> * export levels to PHB
>>>> ---
>>>>   hw/vfio/common.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>>   trace-events     |   2 ++
>>>>   2 files changed, 105 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>> index 4e873b7..421d6eb 100644
>>>> --- a/hw/vfio/common.c
>>>> +++ b/hw/vfio/common.c
>>>> @@ -279,6 +279,14 @@ static int vfio_host_iommu_add(VFIOContainer *container,
>>>>       return 0;
>>>>   }
>>>>
>>>> +static void vfio_host_iommu_del(VFIOContainer *container, hwaddr min_iova)
>>>> +{
>>>> +    VFIOHostIOMMU *hiommu = vfio_host_iommu_lookup(container, min_iova, 0x1000);
>>>
>>> The hard-coded 0x1000 looks dubious..
>>
>> Well, that's the minimal page size...
>
> Really?  Some BookE CPUs support 1KiB page size..


Hm. For IOMMU? Ok. s/0x1000/1/ should do then :)


>
>>>> +    g_assert(hiommu);
>>>> +    QLIST_REMOVE(hiommu, hiommu_next);
>>>> +}
>>>> +
>>>>   static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>>>>   {
>>>>       return (!memory_region_is_ram(section->mr) &&
>>>> @@ -392,6 +400,61 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>>>       }
>>>>       end = int128_get64(llend);
>>>>
>>>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>>
>>> I think this would be clearer split out into a helper function,
>>> vfio_create_host_window() or something.
>>
>>
>> It would rather be vfio_spapr_create_host_window(), and we have been
>> avoiding xxx_spapr_xxx so far. I'd cut-n-paste the SPAPR PCI AS listener
>> to a separate file, but this usually triggers more discussion and never
>> ends well.
>>
>>
>>
>>>> +        unsigned entries, pages;
>>>> +        struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
>>>> +
>>>> +        g_assert(section->mr->iommu_ops);
>>>> +        g_assert(memory_region_is_iommu(section->mr));
>>>
>>> I don't think you need these asserts.  AFAICT the same logic should
>>> work if a RAM MR was added directly to PCI address space - this would
>>> create the new host window, then the existing code for adding a RAM MR
>>> would map that block of RAM statically into the new window.
>>
>> In what configuration/machine can we do that on SPAPR?
>
> spapr guests won't ever do that.  But you can run an x86 guest on a
> powernv host and this situation could come up.


I am pretty sure VFIO won't work in this case anyway.

> In any case there's no point asserting if the code is correct anyway.

An assert here says (at least) "not tested" or "not expected to happen".


>
>>>> +        trace_vfio_listener_region_add_iommu(iova, end - 1);
>>>> +        /*
>>>> +         * FIXME: For VFIO iommu types which have KVM acceleration to
>>>> +         * avoid bouncing all map/unmaps through qemu this way, this
>>>> +         * would be the right place to wire that up (tell the KVM
>>>> +         * device emulation the VFIO iommu handles to use).
>>>> +         */
>>>> +        create.window_size = memory_region_size(section->mr);
>>>> +        create.page_shift =
>>>> +                ctz64(section->mr->iommu_ops->get_page_sizes(section->mr));
>>>
>>> Ah.. except that I guess you'd need to fall back to host page size
>>> here to handle a RAM MR.
>>
>> Can you give an example of such RAM MR being added to PCI AS on
>> SPAPR?
>
> On spapr, no.  But you can run other machine types as guests (at least
> with TCG) on a host with the spapr IOMMU.
>
>>>> +        /*
>>>> +         * SPAPR host supports multilevel TCE tables, there is some
>>>> +         * euristic to decide how many levels we want for our table:
>>>> +         * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
>>>> +         */
>>>> +        entries = create.window_size >> create.page_shift;
>>>> +        pages = (entries * sizeof(uint64_t)) / getpagesize();
>>>> +        create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;
>>>> +
>>>> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>>>> +        if (ret) {
>>>> +            error_report("Failed to create a window, ret = %d (%m)", ret);
>>>> +            goto fail;
>>>> +        }
>>>> +
>>>> +        if (create.start_addr != section->offset_within_address_space ||
>>>> +            vfio_host_iommu_lookup(container, create.start_addr,
>>>> +                                   create.start_addr + create.window_size - 1)) {
>>>
>>> Under what circumstances can this trigger?  Is the kernel ioctl
>>> allowed to return a different window start address than the one
>>> requested?
>>
>> You already asked this some time ago :) The userspace cannot request
>> an address; the host kernel returns one.
>
> Ok.  For generality it would be nice if you could succeed here as long
> as the new host window covers the requested guest window, even if it
> doesn't match exactly.  And for that matter to not request the new
> window if the host already has a window covering the guest region.


That would be dead code - when would it possibly work? I could construct an
artificial test, but the actual user that might appear later will likely be
so different that this won't help anyway.


>>> The second check looks very strange - if it returns true doesn't that
>>> mean you *do* have a host window which can accommodate this guest region,
>>> which is what you want?
>>
>> This should not happen; this is what this check is for. I can make it an
>> assert() or something like this.
>
> Oh.. I see.  Because you've done the ioctl, but not recorded the new
> host window in the list yet.
>
> No, I think the correct approach is to look for an existing host
> window containing the requested guest window *before* you try to
> create a new host window.  If one is already there, you can just carry
> on.

Right, I'll change this.


>
>>>> +            struct vfio_iommu_spapr_tce_remove remove = {
>>>> +                .argsz = sizeof(remove),
>>>> +                .start_addr = create.start_addr
>>>> +            };
>>>> +            error_report("Host doesn't support DMA window at %"HWADDR_PRIx", must be %"PRIx64,
>>>> +                         section->offset_within_address_space,
>>>> +                         create.start_addr);
>>>> +            ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
>>>> +            ret = -EINVAL;
>>>> +            goto fail;
>>>> +        }
>>>> +        trace_vfio_spapr_create_window(create.page_shift,
>>>> +                                       create.window_size,
>>>> +                                       create.start_addr);
>>>> +
>>>> +        vfio_host_iommu_add(container, create.start_addr,
>>>> +                            create.start_addr + create.window_size - 1,
>>>> +                            1ULL << create.page_shift);
>>>> +    }
>>>> +
>>>>       if (!vfio_host_iommu_lookup(container, iova, end - 1)) {
>>>>           error_report("vfio: IOMMU container %p can't map guest IOVA region"
>>>>                        " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
>>>> @@ -525,6 +588,22 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>>>                        container, iova, end - iova, ret);
>>>>       }
>>>>
>>>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>>> +        struct vfio_iommu_spapr_tce_remove remove = {
>>>> +            .argsz = sizeof(remove),
>>>> +            .start_addr = section->offset_within_address_space,
>>>> +        };
>>>> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
>>>> +        if (ret) {
>>>> +            error_report("Failed to remove window at %"PRIx64,
>>>> +                         remove.start_addr);
>>>> +        }
>>>> +
>>>> +        vfio_host_iommu_del(container, section->offset_within_address_space);
>>>> +
>>>> +        trace_vfio_spapr_remove_window(remove.start_addr);
>>>> +    }
>>>> +
>>>>       if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
>>>>           iommu->iommu_ops->vfio_stop(section->mr);
>>>>       }
>>>> @@ -928,11 +1007,30 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>>>               goto listener_release_exit;
>>>>           }
>>>>
>>>> -        /* The default table uses 4K pages */
>>>> -        vfio_host_iommu_add(container, info.dma32_window_start,
>>>> -                            info.dma32_window_start +
>>>> -                            info.dma32_window_size - 1,
>>>> -                            0x1000);
>>>> +        if (v2) {
>>>> +            /*
>>>> +             * There is a default window in just created container.
>>>> +             * To make region_add/del simpler, we better remove this
>>>> +             * window now and let those iommu_listener callbacks
>>>> +             * create/remove them when needed.
>>>> +             */
>>>> +            struct vfio_iommu_spapr_tce_remove remove = {
>>>> +                .argsz = sizeof(remove),
>>>> +                .start_addr = info.dma32_window_start,
>>>> +            };
>>>> +            ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
>>>> +            if (ret) {
>>>> +                error_report("vfio: VFIO_IOMMU_SPAPR_TCE_REMOVE failed: %m");
>>>> +                ret = -errno;
>>>> +                goto free_container_exit;
>>>> +            }
>>>> +        } else {
>>>> +            /* The default table uses 4K pages */
>>>> +            vfio_host_iommu_add(container, info.dma32_window_start,
>>>> +                                info.dma32_window_start +
>>>> +                                info.dma32_window_size - 1,
>>>> +                                0x1000);
>>>> +        }
>>>>       } else {
>>>>           error_report("vfio: No available IOMMU models");
>>>>           ret = -EINVAL;
>>>> diff --git a/trace-events b/trace-events
>>>> index cc619e1..f2b75a3 100644
>>>> --- a/trace-events
>>>> +++ b/trace-events
>>>> @@ -1736,6 +1736,8 @@ vfio_region_finalize(const char *name, int index) "Device %s, region %d"
>>>>   vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
>>>>   vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>>>>   vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>>>> +vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
>>>> +vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
>>>>
>>>>   # hw/vfio/platform.c
>>>>   vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
>>>
>>
>>
>


-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v14 18/18] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-03-21  7:47 ` [Qemu-devel] [PATCH qemu v14 18/18] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
@ 2016-03-23  2:13   ` David Gibson
  2016-03-23  3:28     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 64+ messages in thread
From: David Gibson @ 2016-03-23  2:13 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On Mon, Mar 21, 2016 at 06:47:06PM +1100, Alexey Kardashevskiy wrote:
> This adds support for the Dynamic DMA Windows (DDW) option defined by
> the SPAPR specification, which allows a guest to have additional DMA window(s).
> 
> This implements DDW for emulated and VFIO devices.
> This reserves RTAS token numbers for DDW calls.
> 
> This changes the TCE table migration descriptor to support dynamic
> tables: from now on, the PHB will create as many stub TCE table objects
> as it can possibly support, but not all of them may be initialized at
> the time of migration because DDW may or may not have been requested by
> the guest.
> 
> The "ddw" property is enabled by default on a PHB but for compatibility
> the pseries-2.5 machine and older disable it.
> 
> This implements DDW for VFIO. Host kernel support is required.
> This adds a "levels" property to PHB to control the number of levels
> in the actual TCE table allocated by the host kernel, 0 is the default
> value to tell QEMU to calculate the correct value. Current hardware
> supports up to 5 levels.
> 
> The existing Linux guests try creating one additional huge DMA window
> with 64K or 16MB pages and map the entire guest RAM to it. If that
> succeeds, the guest switches to dma_direct_ops and never calls TCE
> hypercalls (H_PUT_TCE, ...) again. This enables VFIO devices to use the
> entire RAM and not waste time on map/unmap later. This adds a
> "dma64_win_addr" property, which is the bus address of the 64bit window;
> by default it is set to 0x800.0000.0000.0000 as this is what the modern
> POWER8 hardware uses, and this allows having emulated and VFIO devices
> on the same bus.
> 
> This adds 4 RTAS handlers:
> * ibm,query-pe-dma-window
> * ibm,create-pe-dma-window
> * ibm,remove-pe-dma-window
> * ibm,reset-pe-dma-window
> These are registered from type_init() callback.
> 
> These RTAS handlers are implemented in a separate file to avoid polluting
> spapr_iommu.c with PCI.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  hw/ppc/Makefile.objs        |   1 +
>  hw/ppc/spapr.c              |   7 +-
>  hw/ppc/spapr_pci.c          |  73 ++++++++---
>  hw/ppc/spapr_rtas_ddw.c     | 300 ++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/common.c            |   5 -
>  include/hw/pci-host/spapr.h |  13 ++
>  include/hw/ppc/spapr.h      |  16 ++-
>  trace-events                |   4 +
>  8 files changed, 395 insertions(+), 24 deletions(-)
>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
> 
> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> index c1ffc77..986b36f 100644
> --- a/hw/ppc/Makefile.objs
> +++ b/hw/ppc/Makefile.objs
> @@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>  obj-y += spapr_pci_vfio.o
>  endif
> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>  # PowerPC 4xx boards
>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>  obj-y += ppc4xx_pci.o
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index d0bb423..ef4c637 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -2362,7 +2362,12 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
>   * pseries-2.5
>   */
>  #define SPAPR_COMPAT_2_5 \
> -        HW_COMPAT_2_5
> +        HW_COMPAT_2_5 \
> +        {\
> +            .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
> +            .property = "ddw",\
> +            .value    = stringify(off),\
> +        },
>  
>  static void spapr_machine_2_5_instance_options(MachineState *machine)
>  {
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index af99a36..3bb294a 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -803,12 +803,12 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
>      return buf;
>  }
>  
> -static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> -                                       uint32_t liobn,
> -                                       uint32_t page_shift,
> -                                       uint64_t window_addr,
> -                                       uint64_t window_size,
> -                                       Error **errp)
> +void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> +                                 uint32_t liobn,
> +                                 uint32_t page_shift,
> +                                 uint64_t window_addr,
> +                                 uint64_t window_size,
> +                                 Error **errp)
>  {
>      sPAPRTCETable *tcet;
>      uint32_t nb_table = window_size >> page_shift;
> @@ -825,10 +825,16 @@ static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>          return;
>      }
>  
> +    if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
> +        error_setg(errp,
> +                   "Attempt to use second window when DDW is disabled on PHB");
> +        return;
> +    }

This should never happen unless something is wrong with the tests in
the RTAS functions, yes?  In which case it should probably be an
assert().


>      spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table);
>  }
>  
> -static int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
> +int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
>  {
>      sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
>  
> @@ -1492,14 +1498,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>      }
>  
>      /* DMA setup */
> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
> -    if (!tcet) {
> -        error_report("No default TCE table for %s", sphb->dtbusname);
> -        return;
> -    }
>  
> -    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
> -                                        spapr_tce_get_iommu(tcet), 0);
> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> +        tcet = spapr_tce_new_table(DEVICE(sphb),
> +                                   SPAPR_PCI_LIOBN(sphb->index, i));
> +        if (!tcet) {
> +            error_setg(errp, "Creating window#%d failed for %s",
> +                       i, sphb->dtbusname);
> +            return;
> +        }
> +        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
> +                                            spapr_tce_get_iommu(tcet), 0);
> +    }
>  
>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>  }
> @@ -1517,11 +1527,16 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
>  
>  void spapr_phb_dma_reset(sPAPRPHBState *sphb)
>  {
> -    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
>      Error *local_err = NULL;
> +    int i;
>  
> -    if (tcet && tcet->enabled) {
> -        spapr_phb_dma_window_disable(sphb, sphb->dma_liobn);
> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> +        uint32_t liobn = SPAPR_PCI_LIOBN(sphb->index, i);
> +        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
> +
> +        if (tcet && tcet->enabled) {
> +            spapr_phb_dma_window_disable(sphb, liobn);
> +        }
>      }
>  
>      /* Register default 32bit DMA window */
> @@ -1562,6 +1577,13 @@ static Property spapr_phb_properties[] = {
>      /* Default DMA window is 0..1GB */
>      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
>      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
> +                       0x800000000000000ULL),
> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
> +    DEFINE_PROP_UINT32("windows", sPAPRPHBState, windows_supported,
> +                       SPAPR_PCI_DMA_MAX_WINDOWS),

What will happen if the user tries to set 'windows' larger than
SPAPR_PCI_DMA_MAX_WINDOWS?
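One answer would be a bounds check at realize time; the helper name below is invented and the constant's value is assumed for the sketch:

```c
#include <stdint.h>

/* Assumed value for this sketch; the real constant lives in the spapr headers. */
#define SPAPR_PCI_DMA_MAX_WINDOWS 2

/*
 * Illustrative check only: a user-supplied "windows" value above the
 * maximum would otherwise index past the per-PHB LIOBN range, so
 * realize could reject it up front instead.
 */
static int spapr_phb_check_windows(uint32_t windows_supported)
{
    if (windows_supported < 1 ||
        windows_supported > SPAPR_PCI_DMA_MAX_WINDOWS) {
        return -1; /* realize would set an Error and bail out */
    }
    return 0;
}
```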

> +    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
> +                       (1ULL << 12) | (1ULL << 16) | (1ULL << 24)),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> @@ -1815,6 +1837,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      uint32_t interrupt_map_mask[] = {
>          cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
>      uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
> +    uint32_t ddw_applicable[] = {
> +        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
> +        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
> +        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
> +    };
> +    uint32_t ddw_extensions[] = {
> +        cpu_to_be32(1),
> +        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
> +    };
>      sPAPRTCETable *tcet;
>      PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
>      sPAPRFDT s_fdt;
> @@ -1839,6 +1870,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
>  
> +    /* Dynamic DMA window */
> +    if (phb->ddw_enabled) {
> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
> +                         sizeof(ddw_applicable)));
> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
> +                         &ddw_extensions, sizeof(ddw_extensions)));
> +    }
> +
>      /* Build the interrupt-map, this must matches what is done
>       * in pci_spapr_map_irq
>       */
> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
> new file mode 100644
> index 0000000..37f805f
> --- /dev/null
> +++ b/hw/ppc/spapr_rtas_ddw.c
> @@ -0,0 +1,300 @@
> +/*
> + * QEMU sPAPR Dynamic DMA windows support
> + *
> + * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License,
> + *  or (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/error-report.h"
> +#include "hw/ppc/spapr.h"
> +#include "hw/pci-host/spapr.h"
> +#include "trace.h"
> +
> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
> +{
> +    sPAPRTCETable *tcet;
> +
> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> +    if (tcet && tcet->enabled) {
> +        ++*(unsigned *)opaque;
> +    }
> +    return 0;
> +}
> +
> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
> +{
> +    unsigned ret = 0;
> +
> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
> +
> +    return ret;
> +}
> +
> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
> +{
> +    sPAPRTCETable *tcet;
> +
> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> +    if (tcet && !tcet->enabled) {
> +        *(uint32_t *)opaque = tcet->liobn;
> +        return 1;
> +    }
> +    return 0;
> +}
> +
> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
> +{
> +    uint32_t liobn = 0;
> +
> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
> +
> +    return liobn;
> +}
> +
> +static uint32_t spapr_query_mask(struct ppc_one_seg_page_size *sps,
> +                                 uint64_t page_mask)
> +{
> +    int i, j;
> +    uint32_t mask = 0;
> +    const struct { int shift; uint32_t mask; } masks[] = {
> +        { 12, RTAS_DDW_PGSIZE_4K },
> +        { 16, RTAS_DDW_PGSIZE_64K },
> +        { 24, RTAS_DDW_PGSIZE_16M },
> +        { 25, RTAS_DDW_PGSIZE_32M },
> +        { 26, RTAS_DDW_PGSIZE_64M },
> +        { 27, RTAS_DDW_PGSIZE_128M },
> +        { 28, RTAS_DDW_PGSIZE_256M },
> +        { 34, RTAS_DDW_PGSIZE_16G },
> +    };
> +
> +    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
> +        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
> +            if ((sps[i].page_shift == masks[j].shift) &&
> +                    (page_mask & (1ULL << masks[j].shift))) {
> +                mask |= masks[j].mask;
> +            }
> +        }
> +    }
> +
> +    return mask;
> +}
> +
> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
> +                                         sPAPRMachineState *spapr,
> +                                         uint32_t token, uint32_t nargs,
> +                                         target_ulong args,
> +                                         uint32_t nret, target_ulong rets)
> +{
> +    CPUPPCState *env = &cpu->env;
> +    sPAPRPHBState *sphb;
> +    uint64_t buid, max_window_size;
> +    uint32_t avail, addr, pgmask = 0;
> +
> +    if ((nargs != 3) || (nret != 5)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    /* Work out supported page masks */
> +    pgmask = spapr_query_mask(env->sps.sps, sphb->page_size_mask);

There are a few potential problems here.  First you're just
arbitrarily picking the first entry in the sps array to filter
against, which doesn't seem right (except by accident).  It's a little
bit odd filtering against guest page sizes at all, although I get that
what you're really trying to do is filter against allowed host page sizes.

The other problem is that silently filtering capabilities based on the
host can be a pain for migration - I've made the mistake and had it
bite me in the past.  I think it would be safer to just check the
pagesizes requested in the property against what's possible and
outright fail if they don't match.  For convenience you could also set
according to host capabilities if the user doesn't specify anything,
but that would require explicit handling of the "default" case.

Remember that this code will be relevant for DDW with emulated
devices, even if VFIO is not in play at all.

All those considerations aside, it seems like it would make more sense
to do this filtering during device realize, rather than leaving it
until the guest queries.

> +    /*
> +     * This is "Largest contiguous block of TCEs allocated specifically
> +     * for (that is, are reserved for) this PE".
> +     * Return the maximum number, i.e. the maximum supported RAM size in 4K pages.
> +     */
> +    max_window_size = MACHINE(spapr)->maxram_size >> SPAPR_TCE_PAGE_SHIFT;

Will maxram_size always be enough?  There will sometimes be an
alignment gap between the "base" RAM and the hotpluggable RAM, meaning
that if everything is plugged the last RAM address will be beyond
maxram_size.  Will that require pushing this number up, or will the
guest "repack" the RAM layout when it maps it into the TCE tables?

> +    avail = sphb->windows_supported - spapr_phb_get_active_win_num(sphb);
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    rtas_st(rets, 1, avail);
> +    rtas_st(rets, 2, max_window_size);
> +    rtas_st(rets, 3, pgmask);
> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
> +
> +    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
> +                                          sPAPRMachineState *spapr,
> +                                          uint32_t token, uint32_t nargs,
> +                                          target_ulong args,
> +                                          uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    sPAPRTCETable *tcet = NULL;
> +    uint32_t addr, page_shift, window_shift, liobn;
> +    uint64_t buid;
> +    Error *local_err = NULL;
> +
> +    if ((nargs != 5) || (nret != 4)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    page_shift = rtas_ld(args, 3);
> +    window_shift = rtas_ld(args, 4);
> +    liobn = spapr_phb_get_free_liobn(sphb);
> +
> +    if (!liobn || !(sphb->page_size_mask & (1ULL << page_shift)) ||
> +        spapr_phb_get_active_win_num(sphb) == sphb->windows_supported) {
> +        goto hw_error_exit;
> +    }

Bad page sizes should be H_PARAM, not H_HARDWARE, no?

Also, is having no available LIOBNs H_HARDWARE, or H_RESOURCE?

> +
> +    if (window_shift < page_shift) {
> +        goto param_error_exit;
> +    }
> +
> +    spapr_phb_dma_window_enable(sphb, liobn, page_shift,
> +                                sphb->dma64_window_addr,
> +                                1ULL << window_shift, &local_err);
> +    if (local_err) {
> +        error_report_err(local_err);
> +        goto hw_error_exit;
> +    }
> +
> +    tcet = spapr_tce_find_by_liobn(liobn);
> +    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
> +                                 1ULL << window_shift,
> +                                 tcet ? tcet->bus_offset : 0xbaadf00d, liobn);
> +    if (local_err || !tcet) {
> +        goto hw_error_exit;
> +    }
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    rtas_st(rets, 1, liobn);
> +    rtas_st(rets, 2, tcet->bus_offset >> 32);
> +    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
> +
> +    return;
> +
> +hw_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
> +                                          sPAPRMachineState *spapr,
> +                                          uint32_t token, uint32_t nargs,
> +                                          target_ulong args,
> +                                          uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    sPAPRTCETable *tcet;
> +    uint32_t liobn;
> +    long ret;
> +
> +    if ((nargs != 1) || (nret != 1)) {
> +        goto param_error_exit;
> +    }
> +
> +    liobn = rtas_ld(args, 0);
> +    tcet = spapr_tce_find_by_liobn(liobn);
> +    if (!tcet) {
> +        goto param_error_exit;
> +    }
> +
> +    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
> +    if (!sphb || !sphb->ddw_enabled || !spapr_phb_get_active_win_num(sphb)) {

Checking spapr_phb_get_active_win_num() seems weird.  You already know
that this tcet is a child of the sphb.  Can't you just check its own
enabled property?

> +        goto param_error_exit;
> +    }
> +
> +    ret = spapr_phb_dma_window_disable(sphb, liobn);
> +    trace_spapr_iommu_ddw_remove(liobn, ret);
> +    if (ret) {
> +        goto hw_error_exit;
> +    }
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    return;
> +
> +hw_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
> +                                         sPAPRMachineState *spapr,
> +                                         uint32_t token, uint32_t nargs,
> +                                         target_ulong args,
> +                                         uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    uint64_t buid;
> +    uint32_t addr;
> +
> +    if ((nargs != 3) || (nret != 1)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    spapr_phb_dma_reset(sphb);
> +    trace_spapr_iommu_ddw_reset(buid, addr);
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void spapr_rtas_ddw_init(void)
> +{
> +    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
> +                        "ibm,query-pe-dma-window",
> +                        rtas_ibm_query_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
> +                        "ibm,create-pe-dma-window",
> +                        rtas_ibm_create_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
> +                        "ibm,remove-pe-dma-window",
> +                        rtas_ibm_remove_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
> +                        "ibm,reset-pe-dma-window",
> +                        rtas_ibm_reset_pe_dma_window);
> +}
> +
> +type_init(spapr_rtas_ddw_init)
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 421d6eb..b0ea146 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -994,11 +994,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              }
>          }
>  
> -        /*
> -         * This only considers the host IOMMU's 32-bit window.  At
> -         * some point we need to add support for the optional 64-bit
> -         * window and dynamic windows
> -         */

This seems like a stray change to VFIO code in a patch that's
otherwise only affecting spapr_pci.  Does this hunk belong in a
different patch?

>          info.argsz = sizeof(info);
>          ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
>          if (ret) {
> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> index 7848366..e81b751 100644
> --- a/include/hw/pci-host/spapr.h
> +++ b/include/hw/pci-host/spapr.h
> @@ -71,6 +71,11 @@ struct sPAPRPHBState {
>      spapr_pci_msi_mig *msi_devs;
>  
>      QLIST_ENTRY(sPAPRPHBState) list;
> +
> +    bool ddw_enabled;
> +    uint32_t windows_supported;
> +    uint64_t page_size_mask;
> +    uint64_t dma64_window_addr;
>  };
>  
>  #define SPAPR_PCI_MAX_INDEX          255
> @@ -89,6 +94,8 @@ struct sPAPRPHBState {
>  
>  #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
>  
> +#define SPAPR_PCI_DMA_MAX_WINDOWS    2
> +
>  static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb, int pin)
>  {
>      sPAPRMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> @@ -148,5 +155,11 @@ static inline void spapr_phb_vfio_reset(DeviceState *qdev)
>  #endif
>  
>  void spapr_phb_dma_reset(sPAPRPHBState *sphb);
> +void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> +                                uint32_t liobn, uint32_t page_shift,
> +                                uint64_t window_addr,
> +                                uint64_t window_size,
> +                                Error **errp);
> +int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn);
>  
>  #endif /* __HW_SPAPR_PCI_H__ */
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 471eb4a..41b32c6 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -417,6 +417,16 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>  #define RTAS_OUT_NOT_AUTHORIZED                 -9002
>  #define RTAS_OUT_SYSPARM_PARAM_ERROR            -9999
>  
> +/* DDW pagesize mask values from ibm,query-pe-dma-window */
> +#define RTAS_DDW_PGSIZE_4K       0x01
> +#define RTAS_DDW_PGSIZE_64K      0x02
> +#define RTAS_DDW_PGSIZE_16M      0x04
> +#define RTAS_DDW_PGSIZE_32M      0x08
> +#define RTAS_DDW_PGSIZE_64M      0x10
> +#define RTAS_DDW_PGSIZE_128M     0x20
> +#define RTAS_DDW_PGSIZE_256M     0x40
> +#define RTAS_DDW_PGSIZE_16G      0x80
> +
>  /* RTAS tokens */
>  #define RTAS_TOKEN_BASE      0x2000
>  
> @@ -458,8 +468,12 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>  #define RTAS_IBM_SET_SLOT_RESET                 (RTAS_TOKEN_BASE + 0x23)
>  #define RTAS_IBM_CONFIGURE_PE                   (RTAS_TOKEN_BASE + 0x24)
>  #define RTAS_IBM_SLOT_ERROR_DETAIL              (RTAS_TOKEN_BASE + 0x25)
> +#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x26)
> +#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x27)
> +#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x28)
> +#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x29)
>  
> -#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x26)
> +#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x2A)
>  
>  /* RTAS ibm,get-system-parameter token values */
>  #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
> diff --git a/trace-events b/trace-events
> index f2b75a3..e68d0e4 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1431,6 +1431,10 @@ spapr_iommu_pci_indirect(uint64_t liobn, uint64_t ioba, uint64_t tce, uint64_t i
>  spapr_iommu_pci_stuff(uint64_t liobn, uint64_t ioba, uint64_t tce_value, uint64_t npages, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcevalue=0x%"PRIx64" npages=%"PRId64" ret=%"PRId64
>  spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, unsigned pgsize) "liobn=%"PRIx64" 0x%"PRIx64" -> 0x%"PRIx64" perm=%u mask=%x"
>  spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
> +spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32
> +spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, uint64_t pg_size, uint64_t req_size, uint64_t start, uint32_t liobn) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%"PRIx64", requested=0x%"PRIx64", start addr=%"PRIx64", liobn=%"PRIx32
> +spapr_iommu_ddw_remove(uint32_t liobn, long ret) "liobn=%"PRIx32", ret = %ld"
> +spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr) "buid=%"PRIx64" addr=%"PRIx32
>  
>  # hw/ppc/ppc.c
>  ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU
  2016-03-23  2:12         ` Alexey Kardashevskiy
@ 2016-03-23  2:53           ` David Gibson
  2016-03-23  3:06             ` Alexey Kardashevskiy
  0 siblings, 1 reply; 64+ messages in thread
From: David Gibson @ 2016-03-23  2:53 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel


On Wed, Mar 23, 2016 at 01:12:59PM +1100, Alexey Kardashevskiy wrote:
> On 03/23/2016 12:08 PM, David Gibson wrote:
> >On Tue, Mar 22, 2016 at 04:54:07PM +1100, Alexey Kardashevskiy wrote:
> >>On 03/22/2016 04:14 PM, David Gibson wrote:
> >>>On Mon, Mar 21, 2016 at 06:47:05PM +1100, Alexey Kardashevskiy wrote:
> >>>>New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
> >>>>This adds the ability for VFIO common code to dynamically allocate/remove
> >>>>DMA windows in the host kernel when new VFIO container is added/removed.
> >>>>
> >>>>This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
> >>>>and adds the just-created IOMMU into the host IOMMU list; the opposite
> >>>>action is taken in vfio_listener_region_del.
> >>>>
> >>>>When creating a new window, this uses a heuristic to decide on the number
> >>>>of TCE table levels.
> >>>>
> >>>>This should cause no guest visible change in behavior.
> >>>>
> >>>>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>---
> >>>>Changes:
> >>>>v14:
> >>>>* new to the series
> >>>>
> >>>>---
> >>>>TODO:
> >>>>* export levels to PHB
> >>>>---
> >>>>  hw/vfio/common.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
> >>>>  trace-events     |   2 ++
> >>>>  2 files changed, 105 insertions(+), 5 deletions(-)
> >>>>
> >>>>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>>>index 4e873b7..421d6eb 100644
> >>>>--- a/hw/vfio/common.c
> >>>>+++ b/hw/vfio/common.c
> >>>>@@ -279,6 +279,14 @@ static int vfio_host_iommu_add(VFIOContainer *container,
> >>>>      return 0;
> >>>>  }
> >>>>
> >>>>+static void vfio_host_iommu_del(VFIOContainer *container, hwaddr min_iova)
> >>>>+{
> >>>>+    VFIOHostIOMMU *hiommu = vfio_host_iommu_lookup(container, min_iova, 0x1000);
> >>>
> >>>The hard-coded 0x1000 looks dubious..
> >>
> >>Well, that's the minimal page size...
> >
> >Really?  Some BookE CPUs support 1KiB page size..
> 
> Hm. For IOMMU? Ok. s/0x1000/1/ should do then :)

Uh.. actually I don't think those CPUs generally had an IOMMU.  But if
it's been done for CPU MMU I wouldn't count on it not being done for
IOMMU.

1 is a safer choice.

> 
> 
> >
> >>>>+    g_assert(hiommu);
> >>>>+    QLIST_REMOVE(hiommu, hiommu_next);
> >>>>+}
> >>>>+
> >>>>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> >>>>  {
> >>>>      return (!memory_region_is_ram(section->mr) &&
> >>>>@@ -392,6 +400,61 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >>>>      }
> >>>>      end = int128_get64(llend);
> >>>>
> >>>>+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> >>>
> >>>I think this would be clearer split out into a helper function,
> >>>vfio_create_host_window() or something.
> >>
> >>
> >>It is rather vfio_spapr_create_host_window() and we were avoiding
> >>xxx_spapr_xxx so far. I'd cut-n-paste the SPAPR PCI AS listener to a
> >>separate file but this usually triggers more discussion and never ends well.
> >>
> >>
> >>
> >>>>+        unsigned entries, pages;
> >>>>+        struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
> >>>>+
> >>>>+        g_assert(section->mr->iommu_ops);
> >>>>+        g_assert(memory_region_is_iommu(section->mr));
> >>>
> >>>I don't think you need these asserts.  AFAICT the same logic should
> >>>work if a RAM MR was added directly to PCI address space - this would
> >>>create the new host window, then the existing code for adding a RAM MR
> >>>would map that block of RAM statically into the new window.
> >>
> >>In what configuration/machine can we do that on SPAPR?
> >
> >spapr guests won't ever do that.  But you can run an x86 guest on a
> >powernv host and this situation could come up.
> 
> 
> I am pretty sure VFIO won't work in this case anyway.

I'm not.  There's no fundamental reason VFIO shouldn't work with TCG.

> >In any case there's no point asserting if the code is correct anyway.
> 
> Assert here says (at least) "not tested" or "not expected to
> happen".

Hmmm..

> 
> 
> >
> >>>>+        trace_vfio_listener_region_add_iommu(iova, end - 1);
> >>>>+        /*
> >>>>+         * FIXME: For VFIO iommu types which have KVM acceleration to
> >>>>+         * avoid bouncing all map/unmaps through qemu this way, this
> >>>>+         * would be the right place to wire that up (tell the KVM
> >>>>+         * device emulation the VFIO iommu handles to use).
> >>>>+         */
> >>>>+        create.window_size = memory_region_size(section->mr);
> >>>>+        create.page_shift =
> >>>>+                ctz64(section->mr->iommu_ops->get_page_sizes(section->mr));
> >>>
> >>>Ah.. except that I guess you'd need to fall back to host page size
> >>>here to handle a RAM MR.
> >>
> >>Can you give an example of such RAM MR being added to PCI AS on
> >>SPAPR?
> >
> >On spapr, no.  But you can run other machine types as guests (at least
> >with TCG) on a host with the spapr IOMMU.
> >
> >>>>+        /*
> >>>>+         * SPAPR host supports multilevel TCE tables, there is some
> >>>>+         * heuristic to decide how many levels we want for our table:
> >>>>+         * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
> >>>>+         */
> >>>>+        entries = create.window_size >> create.page_shift;
> >>>>+        pages = (entries * sizeof(uint64_t)) / getpagesize();
> >>>>+        create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;
> >>>>+
> >>>>+        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> >>>>+        if (ret) {
> >>>>+            error_report("Failed to create a window, ret = %d (%m)", ret);
> >>>>+            goto fail;
> >>>>+        }
> >>>>+
> >>>>+        if (create.start_addr != section->offset_within_address_space ||
> >>>>+            vfio_host_iommu_lookup(container, create.start_addr,
> >>>>+                                   create.start_addr + create.window_size - 1)) {
> >>>
> >>>Under what circumstances can this trigger?  Is the kernel ioctl
> >>>allowed to return a different window start address than the one
> >>>requested?
> >>
> >>You already asked this some time ago :) The userspace cannot request
> >>address, the host kernel returns one.
> >
> >Ok.  For generality it would be nice if you could succeed here as long
> >as the new host window covers the requested guest window, even if it
> >doesn't match exactly.  And for that matter to not request the new
> >window if the host already has a window covering the guest region.
> 
> 
> That would be dead code - when would it possibly work? I mean I could
> instrument an artificial test but the actual user which might appear later
> will likely be soooo different so this won't help anyway.

Hmm, I suppose.  It actually shouldn't be that hard to trigger a case
like this, if you just bumped the bridge's dma64 base address property
up a little bit - above the host kernel's base address, but small
enough that you can still easily fit the guest memory in.

> >>>The second check looks very strange - if it returns true doesn't that
> >>>mean you *do* have host window which can accomodate this guest region,
> >>>which is what you want?
> >>
> >>This should not happen, this is what this check is for. Can make it assert()
> >>or something like this.
> >
> >Oh.. I see.  Because you've done the ioctl, but not recorded the new
> >host window in the list yet.
> >
> >No, I think the correct approach is to look for an existing host
> >window containing the requested guest window *before* you try to
> >create a new host window.  If one is already there, you can just carry
> >on.
> 
> Right, I'll change this.
> 
> 
> >
> >>>>+            struct vfio_iommu_spapr_tce_remove remove = {
> >>>>+                .argsz = sizeof(remove),
> >>>>+                .start_addr = create.start_addr
> >>>>+            };
> >>>>+            error_report("Host doesn't support DMA window at %"HWADDR_PRIx", must be %"PRIx64,
> >>>>+                         section->offset_within_address_space,
> >>>>+                         create.start_addr);
> >>>>+            ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> >>>>+            ret = -EINVAL;
> >>>>+            goto fail;
> >>>>+        }
> >>>>+        trace_vfio_spapr_create_window(create.page_shift,
> >>>>+                                       create.window_size,
> >>>>+                                       create.start_addr);
> >>>>+
> >>>>+        vfio_host_iommu_add(container, create.start_addr,
> >>>>+                            create.start_addr + create.window_size - 1,
> >>>>+                            1ULL << create.page_shift);
> >>>>+    }
> >>>>+
> >>>>      if (!vfio_host_iommu_lookup(container, iova, end - 1)) {
> >>>>          error_report("vfio: IOMMU container %p can't map guest IOVA region"
> >>>>                       " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
> >>>>@@ -525,6 +588,22 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >>>>                       container, iova, end - iova, ret);
> >>>>      }
> >>>>
> >>>>+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> >>>>+        struct vfio_iommu_spapr_tce_remove remove = {
> >>>>+            .argsz = sizeof(remove),
> >>>>+            .start_addr = section->offset_within_address_space,
> >>>>+        };
> >>>>+        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> >>>>+        if (ret) {
> >>>>+            error_report("Failed to remove window at %"PRIx64,
> >>>>+                         remove.start_addr);
> >>>>+        }
> >>>>+
> >>>>+        vfio_host_iommu_del(container, section->offset_within_address_space);
> >>>>+
> >>>>+        trace_vfio_spapr_remove_window(remove.start_addr);
> >>>>+    }
> >>>>+
> >>>>      if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
> >>>>          iommu->iommu_ops->vfio_stop(section->mr);
> >>>>      }
> >>>>@@ -928,11 +1007,30 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>>>              goto listener_release_exit;
> >>>>          }
> >>>>
> >>>>-        /* The default table uses 4K pages */
> >>>>-        vfio_host_iommu_add(container, info.dma32_window_start,
> >>>>-                            info.dma32_window_start +
> >>>>-                            info.dma32_window_size - 1,
> >>>>-                            0x1000);
> >>>>+        if (v2) {
> >>>>+            /*
> >>>>+             * There is a default window in just created container.
> >>>>+             * To make region_add/del simpler, we better remove this
> >>>>+             * window now and let those iommu_listener callbacks
> >>>>+             * create/remove them when needed.
> >>>>+             */
> >>>>+            struct vfio_iommu_spapr_tce_remove remove = {
> >>>>+                .argsz = sizeof(remove),
> >>>>+                .start_addr = info.dma32_window_start,
> >>>>+            };
> >>>>+            ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> >>>>+            if (ret) {
> >>>>+                error_report("vfio: VFIO_IOMMU_SPAPR_TCE_REMOVE failed: %m");
> >>>>+                ret = -errno;
> >>>>+                goto free_container_exit;
> >>>>+            }
> >>>>+        } else {
> >>>>+            /* The default table uses 4K pages */
> >>>>+            vfio_host_iommu_add(container, info.dma32_window_start,
> >>>>+                                info.dma32_window_start +
> >>>>+                                info.dma32_window_size - 1,
> >>>>+                                0x1000);
> >>>>+        }
> >>>>      } else {
> >>>>          error_report("vfio: No available IOMMU models");
> >>>>          ret = -EINVAL;
> >>>>diff --git a/trace-events b/trace-events
> >>>>index cc619e1..f2b75a3 100644
> >>>>--- a/trace-events
> >>>>+++ b/trace-events
> >>>>@@ -1736,6 +1736,8 @@ vfio_region_finalize(const char *name, int index) "Device %s, region %d"
> >>>>  vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
> >>>>  vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> >>>>  vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> >>>>+vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
> >>>>+vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
> >>>>
> >>>>  # hw/vfio/platform.c
> >>>>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
> >>>
> >>
> >>
> >
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU
  2016-03-23  2:53           ` David Gibson
@ 2016-03-23  3:06             ` Alexey Kardashevskiy
  2016-03-23  6:03               ` David Gibson
  0 siblings, 1 reply; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-23  3:06 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On 03/23/2016 01:53 PM, David Gibson wrote:
> On Wed, Mar 23, 2016 at 01:12:59PM +1100, Alexey Kardashevskiy wrote:
>> On 03/23/2016 12:08 PM, David Gibson wrote:
>>> On Tue, Mar 22, 2016 at 04:54:07PM +1100, Alexey Kardashevskiy wrote:
>>>> On 03/22/2016 04:14 PM, David Gibson wrote:
>>>>> On Mon, Mar 21, 2016 at 06:47:05PM +1100, Alexey Kardashevskiy wrote:
>>>>>> New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
>>>>>> This adds the ability for VFIO common code to dynamically allocate/remove
>>>>>> DMA windows in the host kernel when new VFIO container is added/removed.
>>>>>>
>>>>>> This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
>>>>>> and adds the just-created IOMMU into the host IOMMU list; the opposite
>>>>>> action is taken in vfio_listener_region_del.
>>>>>>
>>>>>> When creating a new window, this uses a heuristic to decide on the number
>>>>>> of TCE table levels.
>>>>>>
>>>>>> This should cause no guest visible change in behavior.
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>> ---
>>>>>> Changes:
>>>>>> v14:
>>>>>> * new to the series
>>>>>>
>>>>>> ---
>>>>>> TODO:
>>>>>> * export levels to PHB
>>>>>> ---
>>>>>>   hw/vfio/common.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>>>>   trace-events     |   2 ++
>>>>>>   2 files changed, 105 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>>>> index 4e873b7..421d6eb 100644
>>>>>> --- a/hw/vfio/common.c
>>>>>> +++ b/hw/vfio/common.c
>>>>>> @@ -279,6 +279,14 @@ static int vfio_host_iommu_add(VFIOContainer *container,
>>>>>>       return 0;
>>>>>>   }
>>>>>>
>>>>>> +static void vfio_host_iommu_del(VFIOContainer *container, hwaddr min_iova)
>>>>>> +{
>>>>>> +    VFIOHostIOMMU *hiommu = vfio_host_iommu_lookup(container, min_iova, 0x1000);
>>>>>
>>>>> The hard-coded 0x1000 looks dubious..
>>>>
>>>> Well, that's the minimal page size...
>>>
>>> Really?  Some BookE CPUs support 1KiB page size..
>>
>> Hm. For IOMMU? Ok. s/0x1000/1/ should do then :)
>
> Uh.. actually I don't think those CPUs generally had an IOMMU.  But if
> it's been done for CPU MMU I wouldn't count on it not being done for
> IOMMU.
>
> 1 is a safer choice.
>
>>
>>
>>>
>>>>>> +    g_assert(hiommu);
>>>>>> +    QLIST_REMOVE(hiommu, hiommu_next);
>>>>>> +}
>>>>>> +
>>>>>>   static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>>>>>>   {
>>>>>>       return (!memory_region_is_ram(section->mr) &&
>>>>>> @@ -392,6 +400,61 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>>>>>       }
>>>>>>       end = int128_get64(llend);
>>>>>>
>>>>>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>>>>
>>>>> I think this would be clearer split out into a helper function,
>>>>> vfio_create_host_window() or something.
>>>>
>>>>
>>>> It is rather vfio_spapr_create_host_window() and we were avoiding
>>>> xxx_spapr_xxx so far. I'd cut-n-paste the SPAPR PCI AS listener to a
>>>> separate file but this usually triggers more discussion and never ends well.
>>>>
>>>>
>>>>
>>>>>> +        unsigned entries, pages;
>>>>>> +        struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
>>>>>> +
>>>>>> +        g_assert(section->mr->iommu_ops);
>>>>>> +        g_assert(memory_region_is_iommu(section->mr));
>>>>>
>>>>> I don't think you need these asserts.  AFAICT the same logic should
>>>>> work if a RAM MR was added directly to PCI address space - this would
>>>>> create the new host window, then the existing code for adding a RAM MR
>>>>> would map that block of RAM statically into the new window.
>>>>
>>>> In what configuration/machine can we do that on SPAPR?
>>>
>>> spapr guests won't ever do that.  But you can run an x86 guest on a
>>> powernv host and this situation could come up.
>>
>>
>> I am pretty sure VFIO won't work in this case anyway.
>
> I'm not.  There's no fundamental reason VFIO shouldn't work with TCG.

This is not about TCG (a pseries TCG guest works with VFIO on a powernv host); 
this is about things like the VFIO_IOMMU_GET_INFO vs. 
VFIO_IOMMU_SPAPR_TCE_GET_INFO ioctls. But yes, fundamentally, it can work.

Should I add such support in this patchset?


>
>>> In any case there's no point asserting if the code is correct anyway.
>>
>> Assert here says (at least) "not tested" or "not expected to
>> happen".
>
> Hmmm..
>
>>
>>
>>>
>>>>>> +        trace_vfio_listener_region_add_iommu(iova, end - 1);
>>>>>> +        /*
>>>>>> +         * FIXME: For VFIO iommu types which have KVM acceleration to
>>>>>> +         * avoid bouncing all map/unmaps through qemu this way, this
>>>>>> +         * would be the right place to wire that up (tell the KVM
>>>>>> +         * device emulation the VFIO iommu handles to use).
>>>>>> +         */
>>>>>> +        create.window_size = memory_region_size(section->mr);
>>>>>> +        create.page_shift =
>>>>>> +                ctz64(section->mr->iommu_ops->get_page_sizes(section->mr));
>>>>>
>>>>> Ah.. except that I guess you'd need to fall back to host page size
>>>>> here to handle a RAM MR.
>>>>
>>>> Can you give an example of such RAM MR being added to PCI AS on
>>>> SPAPR?
>>>
>>> On spapr, no.  But you can run other machine types as guests (at least
>>> with TCG) on a host with the spapr IOMMU.
>>>
>>>>>> +        /*
>>>>>> +         * SPAPR host supports multilevel TCE tables, there is some
>>>>>> +         * euristic to decide how many levels we want for our table:
>>>>>> +         * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
>>>>>> +         */
>>>>>> +        entries = create.window_size >> create.page_shift;
>>>>>> +        pages = (entries * sizeof(uint64_t)) / getpagesize();
>>>>>> +        create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;
>>>>>> +
>>>>>> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>>>>>> +        if (ret) {
>>>>>> +            error_report("Failed to create a window, ret = %d (%m)", ret);
>>>>>> +            goto fail;
>>>>>> +        }
>>>>>> +
>>>>>> +        if (create.start_addr != section->offset_within_address_space ||
>>>>>> +            vfio_host_iommu_lookup(container, create.start_addr,
>>>>>> +                                   create.start_addr + create.window_size - 1)) {
>>>>>
>>>>> Under what circumstances can this trigger?  Is the kernel ioctl
>>>>> allowed to return a different window start address than the one
>>>>> requested?
>>>>
>>>> You already asked this some time ago :) The userspace cannot request
>>>> address, the host kernel returns one.
>>>
>>> Ok.  For generality it would be nice if you could succeed here as long
>>> as the new host window covers the requested guest window, even if it
>>> doesn't match exactly.  And for that matter to not request the new
>>> window if the host already has a window covering the guest region.
>>
>>
>> That would be dead code - when would it possibly work? I mean I could
>> instrument an artificial test but the actual user which might appear later
>> will likely be soooo different so this won't help anyway.
>
> Hmm, I suppose.  It actually shouldn't be that hard to trigger a case
> like this, if you just bumped the bridge's dma64 base address property
> up a little bit - above the host kernel's base address, but small
> enough that you can still easily fit the guest memory in.


I can certainly test it today, but once committed, we will have to support 
it, which I am trying to avoid until we have a clear picture of what we are 
supporting here.
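For reference, the "levels" heuristic in the hunk quoted above (0..64 pages
of TCE table = 1 level, 65..4096 = 2, 4097..262144 = 3, and so on, i.e.
roughly 6 bits of indexing capacity per level) can be sketched as a
standalone function. This is an illustration of the ranges stated in the
comment, not necessarily the exact expression the kernel ends up seeing:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the multilevel TCE table heuristic discussed above: each
 * extra level multiplies the indexing capacity by 64 (2^6), so pick the
 * smallest number of levels whose capacity covers the number of backing
 * pages the TCE table itself occupies.  Ranges from the quoted comment:
 * 0..64 -> 1, 65..4096 -> 2, 4097..262144 -> 3, 262145.. -> 4.
 */
static unsigned tce_levels(uint64_t table_pages)
{
    unsigned levels = 1;
    uint64_t capacity = 64;     /* pages addressable with one level */

    while (table_pages > capacity) {
        levels++;
        capacity <<= 6;
    }
    return levels;
}
```

The loop form makes the boundary cases (exactly 64, exactly 4096) easy to
check against the comment, which the arithmetic one-liner in the patch
obscures.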


-- 
Alexey

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v14 18/18] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-03-23  2:13   ` David Gibson
@ 2016-03-23  3:28     ` Alexey Kardashevskiy
  2016-03-23  6:11       ` David Gibson
  0 siblings, 1 reply; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-23  3:28 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On 03/23/2016 01:13 PM, David Gibson wrote:
> On Mon, Mar 21, 2016 at 06:47:06PM +1100, Alexey Kardashevskiy wrote:
>> This adds support for Dynamic DMA Windows (DDW) option defined by
>> the SPAPR specification which allows to have additional DMA window(s)
>>
>> This implements DDW for emulated and VFIO devices.
>> This reserves RTAS token numbers for DDW calls.
>>
>> This changes the TCE table migration descriptor to support dynamic
>> tables as from now on, PHB will create as many stub TCE table objects
>> as PHB can possibly support but not all of them might be initialized at
>> the time of migration because DDW might or might not be requested by
>> the guest.
>>
>> The "ddw" property is enabled by default on a PHB but for compatibility
>> the pseries-2.5 machine and older disable it.
>>
>> This implements DDW for VFIO. The host kernel support is required.
>> This adds a "levels" property to PHB to control the number of levels
>> in the actual TCE table allocated by the host kernel, 0 is the default
>> value to tell QEMU to calculate the correct value. Current hardware
>> supports up to 5 levels.
>>
>> The existing linux guests try creating one additional huge DMA window
>> with 64K or 16MB pages and map the entire guest RAM to. If succeeded,
>> the guest switches to dma_direct_ops and never calls TCE hypercalls
>> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
>> and not waste time on map/unmap later. This adds a "dma64_win_addr"
>> property which is a bus address for the 64bit window and by default
>> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
>> uses and this allows having emulated and VFIO devices on the same bus.
>>
>> This adds 4 RTAS handlers:
>> * ibm,query-pe-dma-window
>> * ibm,create-pe-dma-window
>> * ibm,remove-pe-dma-window
>> * ibm,reset-pe-dma-window
>> These are registered from type_init() callback.
>>
>> These RTAS handlers are implemented in a separate file to avoid polluting
>> spapr_iommu.c with PCI.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>   hw/ppc/Makefile.objs        |   1 +
>>   hw/ppc/spapr.c              |   7 +-
>>   hw/ppc/spapr_pci.c          |  73 ++++++++---
>>   hw/ppc/spapr_rtas_ddw.c     | 300 ++++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/common.c            |   5 -
>>   include/hw/pci-host/spapr.h |  13 ++
>>   include/hw/ppc/spapr.h      |  16 ++-
>>   trace-events                |   4 +
>>   8 files changed, 395 insertions(+), 24 deletions(-)
>>   create mode 100644 hw/ppc/spapr_rtas_ddw.c
>>
>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>> index c1ffc77..986b36f 100644
>> --- a/hw/ppc/Makefile.objs
>> +++ b/hw/ppc/Makefile.objs
>> @@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
>>   ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>>   obj-y += spapr_pci_vfio.o
>>   endif
>> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>>   # PowerPC 4xx boards
>>   obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>>   obj-y += ppc4xx_pci.o
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index d0bb423..ef4c637 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -2362,7 +2362,12 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
>>    * pseries-2.5
>>    */
>>   #define SPAPR_COMPAT_2_5 \
>> -        HW_COMPAT_2_5
>> +        HW_COMPAT_2_5 \
>> +        {\
>> +            .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
>> +            .property = "ddw",\
>> +            .value    = stringify(off),\
>> +        },
>>
>>   static void spapr_machine_2_5_instance_options(MachineState *machine)
>>   {
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index af99a36..3bb294a 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -803,12 +803,12 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
>>       return buf;
>>   }
>>
>> -static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>> -                                       uint32_t liobn,
>> -                                       uint32_t page_shift,
>> -                                       uint64_t window_addr,
>> -                                       uint64_t window_size,
>> -                                       Error **errp)
>> +void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>> +                                 uint32_t liobn,
>> +                                 uint32_t page_shift,
>> +                                 uint64_t window_addr,
>> +                                 uint64_t window_size,
>> +                                 Error **errp)
>>   {
>>       sPAPRTCETable *tcet;
>>       uint32_t nb_table = window_size >> page_shift;
>> @@ -825,10 +825,16 @@ static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>>           return;
>>       }
>>
>> +    if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
>> +        error_setg(errp,
>> +                   "Attempt to use second window when DDW is disabled on PHB");
>> +        return;
>> +    }
>
> This should never happen unless something is wrong with the tests in
> the RTAS functions, yes?  In which case it should probably be an
> assert().

It should not. But this is called from the RTAS handler, so I'd really like 
to have a message rather than an assert() if that condition happens, here or 
in rtas_ibm_create_pe_dma_window().


>
>>       spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table);
>>   }
>>
>> -static int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
>> +int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
>>   {
>>       sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
>>
>> @@ -1492,14 +1498,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>       }
>>
>>       /* DMA setup */
>> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
>> -    if (!tcet) {
>> -        error_report("No default TCE table for %s", sphb->dtbusname);
>> -        return;
>> -    }
>>
>> -    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
>> -                                        spapr_tce_get_iommu(tcet), 0);
>> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>> +        tcet = spapr_tce_new_table(DEVICE(sphb),
>> +                                   SPAPR_PCI_LIOBN(sphb->index, i));
>> +        if (!tcet) {
>> +            error_setg(errp, "Creating window#%d failed for %s",
>> +                       i, sphb->dtbusname);
>> +            return;
>> +        }
>> +        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
>> +                                            spapr_tce_get_iommu(tcet), 0);
>> +    }
>>
>>       sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>>   }
>> @@ -1517,11 +1527,16 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
>>
>>   void spapr_phb_dma_reset(sPAPRPHBState *sphb)
>>   {
>> -    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
>>       Error *local_err = NULL;
>> +    int i;
>>
>> -    if (tcet && tcet->enabled) {
>> -        spapr_phb_dma_window_disable(sphb, sphb->dma_liobn);
>> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>> +        uint32_t liobn = SPAPR_PCI_LIOBN(sphb->index, i);
>> +        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
>> +
>> +        if (tcet && tcet->enabled) {
>> +            spapr_phb_dma_window_disable(sphb, liobn);
>> +        }
>>       }
>>
>>       /* Register default 32bit DMA window */
>> @@ -1562,6 +1577,13 @@ static Property spapr_phb_properties[] = {
>>       /* Default DMA window is 0..1GB */
>>       DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
>>       DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
>> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
>> +                       0x800000000000000ULL),
>> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
>> +    DEFINE_PROP_UINT32("windows", sPAPRPHBState, windows_supported,
>> +                       SPAPR_PCI_DMA_MAX_WINDOWS),
>
> What will happen if the user tries to set 'windows' larger than
> SPAPR_PCI_DMA_MAX_WINDOWS?


Oh. I need to replace SPAPR_PCI_DMA_MAX_WINDOWS with windows_supported 
everywhere; I missed that. Beyond that, the only effect is that more windows 
are advertised. The host VFIO IOMMU driver will fail to create the extra 
windows, but this is expected; for emulated devices, the extra windows will 
simply exist with no other consequences.
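The validation being asked about could be a simple range check at realize()
time, so an out-of-range "windows" value fails loudly instead of producing
LIOBNs the PHB cannot back. A minimal sketch; the macro value and the helper
name are assumptions for illustration, not code from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SPAPR_PCI_DMA_MAX_WINDOWS 2   /* value assumed for illustration */

/* Returns false if the user-supplied "windows" property exceeds what the
 * PHB can possibly support; realize() would then report an error instead
 * of silently advertising windows that can never be enabled. */
static bool spapr_phb_windows_valid(uint32_t windows_supported)
{
    return windows_supported >= 1 &&
           windows_supported <= SPAPR_PCI_DMA_MAX_WINDOWS;
}
```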




>> +    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
>> +                       (1ULL << 12) | (1ULL << 16) | (1ULL << 24)),
>>       DEFINE_PROP_END_OF_LIST(),
>>   };
>>
>> @@ -1815,6 +1837,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>       uint32_t interrupt_map_mask[] = {
>>           cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
>>       uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
>> +    uint32_t ddw_applicable[] = {
>> +        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
>> +        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
>> +        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
>> +    };
>> +    uint32_t ddw_extensions[] = {
>> +        cpu_to_be32(1),
>> +        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
>> +    };
>>       sPAPRTCETable *tcet;
>>       PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
>>       sPAPRFDT s_fdt;
>> @@ -1839,6 +1870,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>       _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
>>       _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
>>
>> +    /* Dynamic DMA window */
>> +    if (phb->ddw_enabled) {
>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
>> +                         sizeof(ddw_applicable)));
>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
>> +                         &ddw_extensions, sizeof(ddw_extensions)));
>> +    }
>> +
>>       /* Build the interrupt-map, this must matches what is done
>>        * in pci_spapr_map_irq
>>        */
>> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
>> new file mode 100644
>> index 0000000..37f805f
>> --- /dev/null
>> +++ b/hw/ppc/spapr_rtas_ddw.c
>> @@ -0,0 +1,300 @@
>> +/*
>> + * QEMU sPAPR Dynamic DMA windows support
>> + *
>> + * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
>> + *
>> + *  This program is free software; you can redistribute it and/or modify
>> + *  it under the terms of the GNU General Public License as published by
>> + *  the Free Software Foundation; either version 2 of the License,
>> + *  or (at your option) any later version.
>> + *
>> + *  This program is distributed in the hope that it will be useful,
>> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + *  GNU General Public License for more details.
>> + *
>> + *  You should have received a copy of the GNU General Public License
>> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/error-report.h"
>> +#include "hw/ppc/spapr.h"
>> +#include "hw/pci-host/spapr.h"
>> +#include "trace.h"
>> +
>> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
>> +{
>> +    sPAPRTCETable *tcet;
>> +
>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>> +    if (tcet && tcet->enabled) {
>> +        ++*(unsigned *)opaque;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
>> +{
>> +    unsigned ret = 0;
>> +
>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
>> +
>> +    return ret;
>> +}
>> +
>> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
>> +{
>> +    sPAPRTCETable *tcet;
>> +
>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>> +    if (tcet && !tcet->enabled) {
>> +        *(uint32_t *)opaque = tcet->liobn;
>> +        return 1;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
>> +{
>> +    uint32_t liobn = 0;
>> +
>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
>> +
>> +    return liobn;
>> +}
>> +
>> +static uint32_t spapr_query_mask(struct ppc_one_seg_page_size *sps,
>> +                                 uint64_t page_mask)
>> +{
>> +    int i, j;
>> +    uint32_t mask = 0;
>> +    const struct { int shift; uint32_t mask; } masks[] = {
>> +        { 12, RTAS_DDW_PGSIZE_4K },
>> +        { 16, RTAS_DDW_PGSIZE_64K },
>> +        { 24, RTAS_DDW_PGSIZE_16M },
>> +        { 25, RTAS_DDW_PGSIZE_32M },
>> +        { 26, RTAS_DDW_PGSIZE_64M },
>> +        { 27, RTAS_DDW_PGSIZE_128M },
>> +        { 28, RTAS_DDW_PGSIZE_256M },
>> +        { 34, RTAS_DDW_PGSIZE_16G },
>> +    };
>> +
>> +    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
>> +        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
>> +            if ((sps[i].page_shift == masks[j].shift) &&
>> +                    (page_mask & (1ULL << masks[j].shift))) {
>> +                mask |= masks[j].mask;
>> +            }
>> +        }
>> +    }
>> +
>> +    return mask;
>> +}
>> +
>> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
>> +                                         sPAPRMachineState *spapr,
>> +                                         uint32_t token, uint32_t nargs,
>> +                                         target_ulong args,
>> +                                         uint32_t nret, target_ulong rets)
>> +{
>> +    CPUPPCState *env = &cpu->env;
>> +    sPAPRPHBState *sphb;
>> +    uint64_t buid, max_window_size;
>> +    uint32_t avail, addr, pgmask = 0;
>> +
>> +    if ((nargs != 3) || (nret != 5)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    /* Work out supported page masks */
>> +    pgmask = spapr_query_mask(env->sps.sps, sphb->page_size_mask);
>
> There are a few potential problems here.  First you're just
> arbitrarily picking the first entry in the sps array to filter

Why first?  spapr_query_mask() has a loop 0..PPC_PAGE_SIZES_MAX_SZ.


> against, which doesn't seem right (except by accident).  It's a little
> bit odd filtering against guest page sizes at all, although I get what
> you're really trying to do is filter against allowed host page sizes.
>
> The other problem is that silently filtering capabilities based on the
> host can be a pain for migration - I've made the mistake and had it
> bite me in the past.  I think it would be safer to just check the
> pagesizes requested in the property against what's possible and
> outright fail if they don't match.  For convenience you could also set
> according to host capabilities if the user doesn't specify anything,
> but that would require explicit handling of the "default" case.


For migration purposes, both guests must be started the same way, either 
with or without hugepages enabled; this is already taken into account. Given 
that, the result of "query" won't differ.


> Remember that this code will be relevant for DDW with emulated
> devices, even if VFIO is not in play at all.
>
> All those considerations aside, it seems like it would make more sense
> to do this filtering during device realize, rather than leaving it
> until the guest queries.

The result will be the same: it only depends on whether hugepages are 
enabled, and that is decided at start time. But yes, it feels more accurate 
to do this in the PHB's realize(); I'll move it.
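Stripped of the guest SPS array, the filtering under discussion reduces to
intersecting the PHB's "pgsz" property with a set of known page shifts and
translating the surviving shifts into the RTAS query bits. A self-contained
sketch; the RTAS_DDW_PGSIZE_* bit values are assumed to follow the order of
the masks[] table quoted above (one bit per size, ascending):

```c
#include <assert.h>
#include <stdint.h>

/* Bit values assumed to mirror the masks[] table in the quoted
 * spapr_query_mask(): one bit per supported IOMMU page size. */
enum {
    RTAS_DDW_PGSIZE_4K   = 0x01,
    RTAS_DDW_PGSIZE_64K  = 0x02,
    RTAS_DDW_PGSIZE_16M  = 0x04,
    RTAS_DDW_PGSIZE_32M  = 0x08,
    RTAS_DDW_PGSIZE_64M  = 0x10,
    RTAS_DDW_PGSIZE_128M = 0x20,
    RTAS_DDW_PGSIZE_256M = 0x40,
    RTAS_DDW_PGSIZE_16G  = 0x80,
};

/* Translate a page-size bitmap (bit N set => 2^N bytes supported, as in
 * the PHB's "pgsz" property) into the ibm,query-pe-dma-window mask. */
static uint32_t ddw_query_mask(uint64_t page_size_mask)
{
    static const struct { int shift; uint32_t bit; } masks[] = {
        { 12, RTAS_DDW_PGSIZE_4K },
        { 16, RTAS_DDW_PGSIZE_64K },
        { 24, RTAS_DDW_PGSIZE_16M },
        { 25, RTAS_DDW_PGSIZE_32M },
        { 26, RTAS_DDW_PGSIZE_64M },
        { 27, RTAS_DDW_PGSIZE_128M },
        { 28, RTAS_DDW_PGSIZE_256M },
        { 34, RTAS_DDW_PGSIZE_16G },
    };
    uint32_t mask = 0;

    for (unsigned i = 0; i < sizeof(masks) / sizeof(masks[0]); i++) {
        if (page_size_mask & (1ULL << masks[i].shift)) {
            mask |= masks[i].bit;
        }
    }
    return mask;
}
```

With the default "pgsz" of (1ULL << 12) | (1ULL << 16) | (1ULL << 24) this
yields the 4K, 64K and 16M bits; shifts with no table entry (e.g. 20) are
simply dropped.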


>
>> +    /*
>> +     * This is "Largest contiguous block of TCEs allocated specifically
>> +     * for (that is, are reserved for) this PE".
>> +     * Return the maximum number as maximum supported RAM size was in 4K pages.
>> +     */
>> +    max_window_size = MACHINE(spapr)->maxram_size >> SPAPR_TCE_PAGE_SHIFT;
>
> Will maxram_size always be enough?  There will sometimes be an
> alignment gap between the "base" RAM and the hotpluggable RAM, meaning
> that if everything is plugged the last RAM address will be beyond
> maxram_size.  Will that require pushing this number up, or will the
> guest "repack" the RAM layout when it maps it into the TCE tables?


Hm. I do not know what the guest does to DDW on memory hotplug, but this is 
a valid point... Which QEMU helper returns the last available address in the 
system memory address space? Something like memblock_end_of_DRAM() in the 
kernel; I would use that instead.


>> +    avail = sphb->windows_supported - spapr_phb_get_active_win_num(sphb);
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +    rtas_st(rets, 1, avail);
>> +    rtas_st(rets, 2, max_window_size);
>> +    rtas_st(rets, 3, pgmask);
>> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
>> +
>> +    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
>> +                                          sPAPRMachineState *spapr,
>> +                                          uint32_t token, uint32_t nargs,
>> +                                          target_ulong args,
>> +                                          uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    sPAPRTCETable *tcet = NULL;
>> +    uint32_t addr, page_shift, window_shift, liobn;
>> +    uint64_t buid;
>> +    Error *local_err = NULL;
>> +
>> +    if ((nargs != 5) || (nret != 4)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    page_shift = rtas_ld(args, 3);
>> +    window_shift = rtas_ld(args, 4);
>> +    liobn = spapr_phb_get_free_liobn(sphb);
>> +
>> +    if (!liobn || !(sphb->page_size_mask & (1ULL << page_shift)) ||
>> +        spapr_phb_get_active_win_num(sphb) == sphb->windows_supported) {
>> +        goto hw_error_exit;
>> +    }
>
> Bad page sizes should be H_PARAM, not H_HARDWARE, no?

Correct, I'll fix it.

>
> Also is no available liobns H_HARDWARE, or H_RESOURCE?

LoPAPR says these calls should return -1 (hardware), -2 (function) or -3 
(privilege), not -16. So I pick -1 for this.
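For reference, the status convention just described can be written down as
constants; the first two symbolic names match QEMU's spapr.h, while
RTAS_OUT_FN_ERROR is an assumed spelling for the "-2 function" case:

```c
#include <assert.h>

/* LoPAPR status values for the ibm,*-pe-dma-window RTAS calls, per the
 * discussion above: 0 success, -1 hardware error, -2 function error,
 * -3 privilege/parameter error.  -16 ("out of resources") is not among
 * the values LoPAPR allows for these calls, hence -1 for "no free LIOBN".
 */
enum {
    RTAS_OUT_SUCCESS     = 0,
    RTAS_OUT_HW_ERROR    = -1,
    RTAS_OUT_FN_ERROR    = -2,   /* name assumed for illustration */
    RTAS_OUT_PARAM_ERROR = -3,
};
```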


>> +
>> +    if (window_shift < page_shift) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    spapr_phb_dma_window_enable(sphb, liobn, page_shift,
>> +                                sphb->dma64_window_addr,
>> +                                1ULL << window_shift, &local_err);
>> +    if (local_err) {
>> +        error_report_err(local_err);
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    tcet = spapr_tce_find_by_liobn(liobn);
>> +    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
>> +                                 1ULL << window_shift,
>> +                                 tcet ? tcet->bus_offset : 0xbaadf00d, liobn);
>> +    if (local_err || !tcet) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +    rtas_st(rets, 1, liobn);
>> +    rtas_st(rets, 2, tcet->bus_offset >> 32);
>> +    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
>> +
>> +    return;
>> +
>> +hw_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
>> +                                          sPAPRMachineState *spapr,
>> +                                          uint32_t token, uint32_t nargs,
>> +                                          target_ulong args,
>> +                                          uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    sPAPRTCETable *tcet;
>> +    uint32_t liobn;
>> +    long ret;
>> +
>> +    if ((nargs != 1) || (nret != 1)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    liobn = rtas_ld(args, 0);
>> +    tcet = spapr_tce_find_by_liobn(liobn);
>> +    if (!tcet) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
>> +    if (!sphb || !sphb->ddw_enabled || !spapr_phb_get_active_win_num(sphb)) {
>
> Checking spapr_phb_get_active_win_num() seems weird.  You already know
> that this tcet is a child of the sphb.  Can't you just check its own
> enabled property.
>
>> +        goto param_error_exit;
>> +    }
>> +
>> +    ret = spapr_phb_dma_window_disable(sphb, liobn);
>> +    trace_spapr_iommu_ddw_remove(liobn, ret);
>> +    if (ret) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +    return;
>> +
>> +hw_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
>> +                                         sPAPRMachineState *spapr,
>> +                                         uint32_t token, uint32_t nargs,
>> +                                         target_ulong args,
>> +                                         uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    uint64_t buid;
>> +    uint32_t addr;
>> +
>> +    if ((nargs != 3) || (nret != 1)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    spapr_phb_dma_reset(sphb);
>> +    trace_spapr_iommu_ddw_reset(buid, addr);
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void spapr_rtas_ddw_init(void)
>> +{
>> +    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
>> +                        "ibm,query-pe-dma-window",
>> +                        rtas_ibm_query_pe_dma_window);
>> +    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
>> +                        "ibm,create-pe-dma-window",
>> +                        rtas_ibm_create_pe_dma_window);
>> +    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
>> +                        "ibm,remove-pe-dma-window",
>> +                        rtas_ibm_remove_pe_dma_window);
>> +    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
>> +                        "ibm,reset-pe-dma-window",
>> +                        rtas_ibm_reset_pe_dma_window);
>> +}
>> +
>> +type_init(spapr_rtas_ddw_init)
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 421d6eb..b0ea146 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -994,11 +994,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>               }
>>           }
>>
>> -        /*
>> -         * This only considers the host IOMMU's 32-bit window.  At
>> -         * some point we need to add support for the optional 64-bit
>> -         * window and dynamic windows
>> -         */
>
> This seems like a stray change to VFIO code in a patch that's
> otherwise only affecting spapr_pci.  Does this hunk belong in a
> different patch?


It does :)



-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU
  2016-03-23  3:06             ` Alexey Kardashevskiy
@ 2016-03-23  6:03               ` David Gibson
  2016-03-24  0:03                 ` Alexey Kardashevskiy
  0 siblings, 1 reply; 64+ messages in thread
From: David Gibson @ 2016-03-23  6:03 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel


On Wed, Mar 23, 2016 at 02:06:36PM +1100, Alexey Kardashevskiy wrote:
> On 03/23/2016 01:53 PM, David Gibson wrote:
> >On Wed, Mar 23, 2016 at 01:12:59PM +1100, Alexey Kardashevskiy wrote:
> >>On 03/23/2016 12:08 PM, David Gibson wrote:
> >>>On Tue, Mar 22, 2016 at 04:54:07PM +1100, Alexey Kardashevskiy wrote:
> >>>>On 03/22/2016 04:14 PM, David Gibson wrote:
> >>>>>On Mon, Mar 21, 2016 at 06:47:05PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
> >>>>>>This adds ability to VFIO common code to dynamically allocate/remove
> >>>>>>DMA windows in the host kernel when new VFIO container is added/removed.
> >>>>>>
> >>>>>>This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
> >>>>>>and adds just created IOMMU into the host IOMMU list; the opposite
> >>>>>>action is taken in vfio_listener_region_del.
> >>>>>>
> >>>>>>When creating a new window, this uses euristic to decide on the TCE table
> >>>>>>levels number.
> >>>>>>
> >>>>>>This should cause no guest visible change in behavior.
> >>>>>>
> >>>>>>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>>>---
> >>>>>>Changes:
> >>>>>>v14:
> >>>>>>* new to the series
> >>>>>>
> >>>>>>---
> >>>>>>TODO:
> >>>>>>* export levels to PHB
> >>>>>>---
> >>>>>>  hw/vfio/common.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
> >>>>>>  trace-events     |   2 ++
> >>>>>>  2 files changed, 105 insertions(+), 5 deletions(-)
> >>>>>>
> >>>>>>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>>>>>index 4e873b7..421d6eb 100644
> >>>>>>--- a/hw/vfio/common.c
> >>>>>>+++ b/hw/vfio/common.c
> >>>>>>@@ -279,6 +279,14 @@ static int vfio_host_iommu_add(VFIOContainer *container,
> >>>>>>      return 0;
> >>>>>>  }
> >>>>>>
> >>>>>>+static void vfio_host_iommu_del(VFIOContainer *container, hwaddr min_iova)
> >>>>>>+{
> >>>>>>+    VFIOHostIOMMU *hiommu = vfio_host_iommu_lookup(container, min_iova, 0x1000);
> >>>>>
> >>>>>The hard-coded 0x1000 looks dubious..
> >>>>
> >>>>Well, that's the minimal page size...
> >>>
> >>>Really?  Some BookE CPUs support 1KiB page size..
> >>
> >>Hm. For IOMMU? Ok. s/0x1000/1/ should do then :)
> >
> >Uh.. actually I don't think those CPUs generally had an IOMMU.  But if
> >it's been done for CPU MMU I wouldn't count on it not being done for
> >IOMMU.
> >
> >1 is a safer choice.
> >
> >>
> >>
> >>>
> >>>>>>+    g_assert(hiommu);
> >>>>>>+    QLIST_REMOVE(hiommu, hiommu_next);
> >>>>>>+}
> >>>>>>+
> >>>>>>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> >>>>>>  {
> >>>>>>      return (!memory_region_is_ram(section->mr) &&
> >>>>>>@@ -392,6 +400,61 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >>>>>>      }
> >>>>>>      end = int128_get64(llend);
> >>>>>>
> >>>>>>+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> >>>>>
> >>>>>I think this would be clearer split out into a helper function,
> >>>>>vfio_create_host_window() or something.
> >>>>
> >>>>
> >>>>It is rather vfio_spapr_create_host_window() and we were avoiding
> >>>>xxx_spapr_xxx so far. I'd cut-n-paste the SPAPR PCI AS listener to a
> >>>>separate file but this usually triggers more discussion and never ends well.
> >>>>
> >>>>
> >>>>
> >>>>>>+        unsigned entries, pages;
> >>>>>>+        struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
> >>>>>>+
> >>>>>>+        g_assert(section->mr->iommu_ops);
> >>>>>>+        g_assert(memory_region_is_iommu(section->mr));
> >>>>>
> >>>>>I don't think you need these asserts.  AFAICT the same logic should
> >>>>>work if a RAM MR was added directly to PCI address space - this would
> >>>>>create the new host window, then the existing code for adding a RAM MR
> >>>>>would map that block of RAM statically into the new window.
> >>>>
> >>>>In what configuration/machine can we do that on SPAPR?
> >>>
> >>>spapr guests won't ever do that.  But you can run an x86 guest on a
> >>>powernv host and this situation could come up.
> >>
> >>
> >>I am pretty sure VFIO won't work in this case anyway.
> >
> >I'm not.  There's no fundamental reason VFIO shouldn't work with TCG.
> 
> This is not about TCG (pseries TCG guest works with VFIO on powernv host),
> this is about things like VFIO_IOMMU_GET_INFO vs.
> VFIO_IOMMU_SPAPR_TCE_GET_INFO ioctls but yes, fundamentally, it can work.
> 
> Should I add such support in this patchset?

Unless adding the generality is really complex, and so far I haven't
seen a reason for it to be.

> 
> 
> >
> >>>In any case there's no point asserting if the code is correct anyway.
> >>
> >>Assert here says (at least) "not tested" or "not expected to
> >>happen".
> >
> >Hmmm..
> >
> >>
> >>
> >>>
> >>>>>>+        trace_vfio_listener_region_add_iommu(iova, end - 1);
> >>>>>>+        /*
> >>>>>>+         * FIXME: For VFIO iommu types which have KVM acceleration to
> >>>>>>+         * avoid bouncing all map/unmaps through qemu this way, this
> >>>>>>+         * would be the right place to wire that up (tell the KVM
> >>>>>>+         * device emulation the VFIO iommu handles to use).
> >>>>>>+         */
> >>>>>>+        create.window_size = memory_region_size(section->mr);
> >>>>>>+        create.page_shift =
> >>>>>>+                ctz64(section->mr->iommu_ops->get_page_sizes(section->mr));
> >>>>>
> >>>>>Ah.. except that I guess you'd need to fall back to host page size
> >>>>>here to handle a RAM MR.
> >>>>
> >>>>Can you give an example of such RAM MR being added to PCI AS on
> >>>>SPAPR?
> >>>
> >>>On spapr, no.  But you can run other machine types as guests (at least
> >>>with TCG) on a host with the spapr IOMMU.
> >>>
> >>>>>>+        /*
> >>>>>>+         * The SPAPR host supports multilevel TCE tables; there is some
> >>>>>>+         * heuristic to decide how many levels we want for our table:
> >>>>>>+         * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
> >>>>>>+         */
> >>>>>>+        entries = create.window_size >> create.page_shift;
> >>>>>>+        pages = (entries * sizeof(uint64_t)) / getpagesize();
> >>>>>>+        create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;
> >>>>>>+
> >>>>>>+        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> >>>>>>+        if (ret) {
> >>>>>>+            error_report("Failed to create a window, ret = %d (%m)", ret);
> >>>>>>+            goto fail;
> >>>>>>+        }
> >>>>>>+
> >>>>>>+        if (create.start_addr != section->offset_within_address_space ||
> >>>>>>+            vfio_host_iommu_lookup(container, create.start_addr,
> >>>>>>+                                   create.start_addr + create.window_size - 1)) {
> >>>>>
> >>>>>Under what circumstances can this trigger?  Is the kernel ioctl
> >>>>>allowed to return a different window start address than the one
> >>>>>requested?
> >>>>
> >>>>You already asked this some time ago :) The userspace cannot request
> >>>>address, the host kernel returns one.
> >>>
> >>>Ok.  For generality it would be nice if you could succeed here as long
> >>>as the new host window covers the requested guest window, even if it
> >>>doesn't match exactly.  And for that matter to not request the new
> >>>window if the host already has a window covering the guest region.
> >>
> >>
> >>That would be dead code - when would it possibly work? I mean I could
> >>instrument an artificial test but the actual user which might appear later
> >>will likely be soooo different so this won't help anyway.
> >
> >Hmm, I suppose.  It actually shouldn't be that hard to trigger a case
> >like this, if you just bumped the bridge's dma64 base address property
> >up a little bit - above the host kernel's base address, but small
> >enough that you can still easily fit the guest memory in.
> 
> 
> I can test it today for sure, but once committed, we will have to support it,
> which I am trying to avoid until we get a clear picture of what we are
> supporting here.
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v14 18/18] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-03-23  3:28     ` Alexey Kardashevskiy
@ 2016-03-23  6:11       ` David Gibson
  2016-03-24  2:32         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 64+ messages in thread
From: David Gibson @ 2016-03-23  6:11 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 26478 bytes --]

On Wed, Mar 23, 2016 at 02:28:01PM +1100, Alexey Kardashevskiy wrote:
> On 03/23/2016 01:13 PM, David Gibson wrote:
> >On Mon, Mar 21, 2016 at 06:47:06PM +1100, Alexey Kardashevskiy wrote:
> >>This adds support for the Dynamic DMA Windows (DDW) option defined by
> >>the SPAPR specification, which allows a guest to have additional DMA window(s).
> >>
> >>This implements DDW for emulated and VFIO devices.
> >>This reserves RTAS token numbers for DDW calls.
> >>
> >>This changes the TCE table migration descriptor to support dynamic
> >>tables: from now on, the PHB will create as many stub TCE table objects
> >>as it can possibly support, but not all of them may be initialized at
> >>the time of migration because DDW may or may not have been requested by
> >>the guest.
> >>
> >>The "ddw" property is enabled by default on a PHB but for compatibility
> >>the pseries-2.5 machine and older disable it.
> >>
> >>This implements DDW for VFIO. The host kernel support is required.
> >>This adds a "levels" property to PHB to control the number of levels
> >>in the actual TCE table allocated by the host kernel, 0 is the default
> >>value to tell QEMU to calculate the correct value. Current hardware
> >>supports up to 5 levels.
> >>
> >>Existing Linux guests try to create one additional huge DMA window
> >>with 64K or 16MB pages and map the entire guest RAM into it. If this
> >>succeeds, the guest switches to dma_direct_ops and never calls TCE
> >>hypercalls (H_PUT_TCE, ...) again. This enables VFIO devices to use the
> >>entire RAM and not waste time on map/unmap later. This adds a "dma64_win_addr"
> >>property which is a bus address for the 64bit window and by default
> >>set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
> >>uses and this allows having emulated and VFIO devices on the same bus.
> >>
> >>This adds 4 RTAS handlers:
> >>* ibm,query-pe-dma-window
> >>* ibm,create-pe-dma-window
> >>* ibm,remove-pe-dma-window
> >>* ibm,reset-pe-dma-window
> >>These are registered from type_init() callback.
> >>
> >>These RTAS handlers are implemented in a separate file to avoid polluting
> >>spapr_iommu.c with PCI.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>---
> >>  hw/ppc/Makefile.objs        |   1 +
> >>  hw/ppc/spapr.c              |   7 +-
> >>  hw/ppc/spapr_pci.c          |  73 ++++++++---
> >>  hw/ppc/spapr_rtas_ddw.c     | 300 ++++++++++++++++++++++++++++++++++++++++++++
> >>  hw/vfio/common.c            |   5 -
> >>  include/hw/pci-host/spapr.h |  13 ++
> >>  include/hw/ppc/spapr.h      |  16 ++-
> >>  trace-events                |   4 +
> >>  8 files changed, 395 insertions(+), 24 deletions(-)
> >>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
> >>
> >>diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> >>index c1ffc77..986b36f 100644
> >>--- a/hw/ppc/Makefile.objs
> >>+++ b/hw/ppc/Makefile.objs
> >>@@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
> >>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
> >>  obj-y += spapr_pci_vfio.o
> >>  endif
> >>+obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
> >>  # PowerPC 4xx boards
> >>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
> >>  obj-y += ppc4xx_pci.o
> >>diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>index d0bb423..ef4c637 100644
> >>--- a/hw/ppc/spapr.c
> >>+++ b/hw/ppc/spapr.c
> >>@@ -2362,7 +2362,12 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
> >>   * pseries-2.5
> >>   */
> >>  #define SPAPR_COMPAT_2_5 \
> >>-        HW_COMPAT_2_5
> >>+        HW_COMPAT_2_5 \
> >>+        {\
> >>+            .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
> >>+            .property = "ddw",\
> >>+            .value    = stringify(off),\
> >>+        },
> >>
> >>  static void spapr_machine_2_5_instance_options(MachineState *machine)
> >>  {
> >>diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> >>index af99a36..3bb294a 100644
> >>--- a/hw/ppc/spapr_pci.c
> >>+++ b/hw/ppc/spapr_pci.c
> >>@@ -803,12 +803,12 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
> >>      return buf;
> >>  }
> >>
> >>-static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> >>-                                       uint32_t liobn,
> >>-                                       uint32_t page_shift,
> >>-                                       uint64_t window_addr,
> >>-                                       uint64_t window_size,
> >>-                                       Error **errp)
> >>+void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> >>+                                 uint32_t liobn,
> >>+                                 uint32_t page_shift,
> >>+                                 uint64_t window_addr,
> >>+                                 uint64_t window_size,
> >>+                                 Error **errp)
> >>  {
> >>      sPAPRTCETable *tcet;
> >>      uint32_t nb_table = window_size >> page_shift;
> >>@@ -825,10 +825,16 @@ static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> >>          return;
> >>      }
> >>
> >>+    if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
> >>+        error_setg(errp,
> >>+                   "Attempt to use second window when DDW is disabled on PHB");
> >>+        return;
> >>+    }
> >
> >This should never happen unless something is wrong with the tests in
> >the RTAS functions, yes?  In which case it should probably be an
> >assert().
> 
> This should not. But this is called from the RTAS caller so I'd really like
> to have a message rather than assert() if that condition happens, here or in
> rtas_ibm_create_pe_dma_window().

It should only be called from RTAS if ddw is enabled though, yes?

> 
> 
> >
> >>      spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table);
> >>  }
> >>
> >>-static int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
> >>+int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
> >>  {
> >>      sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
> >>
> >>@@ -1492,14 +1498,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >>      }
> >>
> >>      /* DMA setup */
> >>-    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
> >>-    if (!tcet) {
> >>-        error_report("No default TCE table for %s", sphb->dtbusname);
> >>-        return;
> >>-    }
> >>
> >>-    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
> >>-                                        spapr_tce_get_iommu(tcet), 0);
> >>+    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> >>+        tcet = spapr_tce_new_table(DEVICE(sphb),
> >>+                                   SPAPR_PCI_LIOBN(sphb->index, i));
> >>+        if (!tcet) {
> >>+            error_setg(errp, "Creating window#%d failed for %s",
> >>+                       i, sphb->dtbusname);
> >>+            return;
> >>+        }
> >>+        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
> >>+                                            spapr_tce_get_iommu(tcet), 0);
> >>+    }
> >>
> >>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
> >>  }
> >>@@ -1517,11 +1527,16 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
> >>
> >>  void spapr_phb_dma_reset(sPAPRPHBState *sphb)
> >>  {
> >>-    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
> >>      Error *local_err = NULL;
> >>+    int i;
> >>
> >>-    if (tcet && tcet->enabled) {
> >>-        spapr_phb_dma_window_disable(sphb, sphb->dma_liobn);
> >>+    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> >>+        uint32_t liobn = SPAPR_PCI_LIOBN(sphb->index, i);
> >>+        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
> >>+
> >>+        if (tcet && tcet->enabled) {
> >>+            spapr_phb_dma_window_disable(sphb, liobn);
> >>+        }
> >>      }
> >>
> >>      /* Register default 32bit DMA window */
> >>@@ -1562,6 +1577,13 @@ static Property spapr_phb_properties[] = {
> >>      /* Default DMA window is 0..1GB */
> >>      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
> >>      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
> >>+    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
> >>+                       0x800000000000000ULL),
> >>+    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
> >>+    DEFINE_PROP_UINT32("windows", sPAPRPHBState, windows_supported,
> >>+                       SPAPR_PCI_DMA_MAX_WINDOWS),
> >
> >What will happen if the user tries to set 'windows' larger than
> >SPAPR_PCI_DMA_MAX_WINDOWS?
> 
> 
> Oh. I need to replace SPAPR_PCI_DMA_MAX_WINDOWS with windows_supported
> everywhere, missed that. Besides that, there will be support for more
> windows, that's it. The host VFIO IOMMU driver will fail creating more
> windows but this is expected. For emulated windows, there will be more
> windows with no other consequences.

Hmm.. is there actually a reason to have the windows property?  Would
you be better off just using the compile-time constant for now?

> 
> 
> 
> 
> >>+    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
> >>+                       (1ULL << 12) | (1ULL << 16) | (1ULL << 24)),
> >>      DEFINE_PROP_END_OF_LIST(),
> >>  };
> >>
> >>@@ -1815,6 +1837,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
> >>      uint32_t interrupt_map_mask[] = {
> >>          cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
> >>      uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
> >>+    uint32_t ddw_applicable[] = {
> >>+        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
> >>+        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
> >>+        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
> >>+    };
> >>+    uint32_t ddw_extensions[] = {
> >>+        cpu_to_be32(1),
> >>+        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
> >>+    };
> >>      sPAPRTCETable *tcet;
> >>      PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
> >>      sPAPRFDT s_fdt;
> >>@@ -1839,6 +1870,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
> >>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
> >>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
> >>
> >>+    /* Dynamic DMA window */
> >>+    if (phb->ddw_enabled) {
> >>+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
> >>+                         sizeof(ddw_applicable)));
> >>+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
> >>+                         &ddw_extensions, sizeof(ddw_extensions)));
> >>+    }
> >>+
> >>      /* Build the interrupt-map, this must matches what is done
> >>       * in pci_spapr_map_irq
> >>       */
> >>diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
> >>new file mode 100644
> >>index 0000000..37f805f
> >>--- /dev/null
> >>+++ b/hw/ppc/spapr_rtas_ddw.c
> >>@@ -0,0 +1,300 @@
> >>+/*
> >>+ * QEMU sPAPR Dynamic DMA windows support
> >>+ *
> >>+ * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
> >>+ *
> >>+ *  This program is free software; you can redistribute it and/or modify
> >>+ *  it under the terms of the GNU General Public License as published by
> >>+ *  the Free Software Foundation; either version 2 of the License,
> >>+ *  or (at your option) any later version.
> >>+ *
> >>+ *  This program is distributed in the hope that it will be useful,
> >>+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> >>+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >>+ *  GNU General Public License for more details.
> >>+ *
> >>+ *  You should have received a copy of the GNU General Public License
> >>+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
> >>+ */
> >>+
> >>+#include "qemu/osdep.h"
> >>+#include "qemu/error-report.h"
> >>+#include "hw/ppc/spapr.h"
> >>+#include "hw/pci-host/spapr.h"
> >>+#include "trace.h"
> >>+
> >>+static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
> >>+{
> >>+    sPAPRTCETable *tcet;
> >>+
> >>+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> >>+    if (tcet && tcet->enabled) {
> >>+        ++*(unsigned *)opaque;
> >>+    }
> >>+    return 0;
> >>+}
> >>+
> >>+static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
> >>+{
> >>+    unsigned ret = 0;
> >>+
> >>+    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
> >>+
> >>+    return ret;
> >>+}
> >>+
> >>+static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
> >>+{
> >>+    sPAPRTCETable *tcet;
> >>+
> >>+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> >>+    if (tcet && !tcet->enabled) {
> >>+        *(uint32_t *)opaque = tcet->liobn;
> >>+        return 1;
> >>+    }
> >>+    return 0;
> >>+}
> >>+
> >>+static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
> >>+{
> >>+    uint32_t liobn = 0;
> >>+
> >>+    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
> >>+
> >>+    return liobn;
> >>+}
> >>+
> >>+static uint32_t spapr_query_mask(struct ppc_one_seg_page_size *sps,
> >>+                                 uint64_t page_mask)
> >>+{
> >>+    int i, j;
> >>+    uint32_t mask = 0;
> >>+    const struct { int shift; uint32_t mask; } masks[] = {
> >>+        { 12, RTAS_DDW_PGSIZE_4K },
> >>+        { 16, RTAS_DDW_PGSIZE_64K },
> >>+        { 24, RTAS_DDW_PGSIZE_16M },
> >>+        { 25, RTAS_DDW_PGSIZE_32M },
> >>+        { 26, RTAS_DDW_PGSIZE_64M },
> >>+        { 27, RTAS_DDW_PGSIZE_128M },
> >>+        { 28, RTAS_DDW_PGSIZE_256M },
> >>+        { 34, RTAS_DDW_PGSIZE_16G },
> >>+    };
> >>+
> >>+    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
> >>+        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
> >>+            if ((sps[i].page_shift == masks[j].shift) &&
> >>+                    (page_mask & (1ULL << masks[j].shift))) {
> >>+                mask |= masks[j].mask;
> >>+            }
> >>+        }
> >>+    }
> >>+
> >>+    return mask;
> >>+}
> >>+
> >>+static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
> >>+                                         sPAPRMachineState *spapr,
> >>+                                         uint32_t token, uint32_t nargs,
> >>+                                         target_ulong args,
> >>+                                         uint32_t nret, target_ulong rets)
> >>+{
> >>+    CPUPPCState *env = &cpu->env;
> >>+    sPAPRPHBState *sphb;
> >>+    uint64_t buid, max_window_size;
> >>+    uint32_t avail, addr, pgmask = 0;
> >>+
> >>+    if ((nargs != 3) || (nret != 5)) {
> >>+        goto param_error_exit;
> >>+    }
> >>+
> >>+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> >>+    addr = rtas_ld(args, 0);
> >>+    sphb = spapr_pci_find_phb(spapr, buid);
> >>+    if (!sphb || !sphb->ddw_enabled) {
> >>+        goto param_error_exit;
> >>+    }
> >>+
> >>+    /* Work out supported page masks */
> >>+    pgmask = spapr_query_mask(env->sps.sps, sphb->page_size_mask);
> >
> >There are a few potential problems here.  First you're just
> >arbitrarily picking the first entry in the sps array to filter
> 
> Why first?  spapr_query_mask() has a loop 0..PPC_PAGE_SIZES_MAX_SZ.

env->sps is a nested array, 0..PPC_PAGE_SIZES_MAX_SZ-1 for base page
sizes, then again for actual page sizes.  You're only examining the
first "row" of that table.  It kinda works because the 4k base page
size is first, which lists all the actual page size options.

> >against, which doesn't seem right (except by accident).  It's a little
> >bit odd filtering against guest page sizes at all, although I get what
> >you're really trying to do is filter against allowed host page sizes.
> >
> >The other problem is that silently filtering capabilities based on the
> >host can be a pain for migration - I've made the mistake and had it
> >bite me in the past.  I think it would be safer to just check the
> >pagesizes requested in the property against what's possible and
> >outright fail if they don't match.  For convenience you could also set
> >according to host capabilities if the user doesn't specify anything,
> >but that would require explicit handling of the "default" case.
> 
> 
> For the migration purposes, both guests should be started with or without
> hugepages enabled; this is taken into account already. Besides that, the
> result of "query" won't differ.

Hmm.. if you're migrating between TCG and KVM or between PR and HV
these could change as well.  I'm not sure that works at the moment,
but I'd prefer not to introduce any more barriers to it than we have
to.

> >Remember that this code will be relevant for DDW with emulated
> >devices, even if VFIO is not in play at all.
> >
> >All those considerations aside, it seems like it would make more sense
> >to do this filtering during device realize, rather than leaving it
> >until the guest queries.
> 
> The result will be the same, it only depends on whether hugepages are
> enabled or not and this happens at the start time. But yes, feels more
> accurate to do this in PHB realize(), I'll move it.
> 
> 
> >
> >>+    /*
> >>+     * This is "Largest contiguous block of TCEs allocated specifically
> >>+     * for (that is, are reserved for) this PE".
> >>+     * Return the maximum number as maximum supported RAM size was in 4K pages.
> >>+     */
> >>+    max_window_size = MACHINE(spapr)->maxram_size >> SPAPR_TCE_PAGE_SHIFT;
> >
> >Will maxram_size always be enough?  There will sometimes be an
> >alignment gap between the "base" RAM and the hotpluggable RAM, meaning
> >that if everything is plugged the last RAM address will be beyond
> >maxram_size.  Will that require pushing this number up, or will the
> >guest "repack" the RAM layout when it maps it into the TCE tables?
> 
> 
> Hm. I do not know what the guest does to DDW on memory hotplug but this is a
> valid point... Which QEMU helper returns the last available address in
> the system memory address space? Like memblock_end_of_DRAM() in the kernel,
> I would use that instead.

There is a last_ram_offset() but that's in the ram_addr_t address
space, which isn't necessarily the same as the physical address space
(though it's usually similar).  You can have a look at what we check
in (the TCG/PR version of) H_ENTER which needs to check this as well.

> 
> 
> >>+    avail = sphb->windows_supported - spapr_phb_get_active_win_num(sphb);
> >>+
> >>+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> >>+    rtas_st(rets, 1, avail);
> >>+    rtas_st(rets, 2, max_window_size);
> >>+    rtas_st(rets, 3, pgmask);
> >>+    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
> >>+
> >>+    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
> >>+    return;
> >>+
> >>+param_error_exit:
> >>+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> >>+}
> >>+
> >>+static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
> >>+                                          sPAPRMachineState *spapr,
> >>+                                          uint32_t token, uint32_t nargs,
> >>+                                          target_ulong args,
> >>+                                          uint32_t nret, target_ulong rets)
> >>+{
> >>+    sPAPRPHBState *sphb;
> >>+    sPAPRTCETable *tcet = NULL;
> >>+    uint32_t addr, page_shift, window_shift, liobn;
> >>+    uint64_t buid;
> >>+    Error *local_err = NULL;
> >>+
> >>+    if ((nargs != 5) || (nret != 4)) {
> >>+        goto param_error_exit;
> >>+    }
> >>+
> >>+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> >>+    addr = rtas_ld(args, 0);
> >>+    sphb = spapr_pci_find_phb(spapr, buid);
> >>+    if (!sphb || !sphb->ddw_enabled) {
> >>+        goto param_error_exit;
> >>+    }
> >>+
> >>+    page_shift = rtas_ld(args, 3);
> >>+    window_shift = rtas_ld(args, 4);
> >>+    liobn = spapr_phb_get_free_liobn(sphb);
> >>+
> >>+    if (!liobn || !(sphb->page_size_mask & (1ULL << page_shift)) ||
> >>+        spapr_phb_get_active_win_num(sphb) == sphb->windows_supported) {
> >>+        goto hw_error_exit;
> >>+    }
> >
> >Bad page sizes should be H_PARAM, not H_HARDWARE, no?
> 
> Correct, I'll fix it.
> 
> >
> >Also, is having no available LIOBNs H_HARDWARE, or H_RESOURCE?
> 
> LoPAPR says this should return -1 (hardware), -2 (function), -3 (privilege),
> not -16. So I pick -1 for this.

Ok.

> 
> 
> >>+
> >>+    if (window_shift < page_shift) {
> >>+        goto param_error_exit;
> >>+    }
> >>+
> >>+    spapr_phb_dma_window_enable(sphb, liobn, page_shift,
> >>+                                sphb->dma64_window_addr,
> >>+                                1ULL << window_shift, &local_err);
> >>+    if (local_err) {
> >>+        error_report_err(local_err);
> >>+        goto hw_error_exit;
> >>+    }
> >>+
> >>+    tcet = spapr_tce_find_by_liobn(liobn);
> >>+    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
> >>+                                 1ULL << window_shift,
> >>+                                 tcet ? tcet->bus_offset : 0xbaadf00d, liobn);
> >>+    if (local_err || !tcet) {
> >>+        goto hw_error_exit;
> >>+    }
> >>+
> >>+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> >>+    rtas_st(rets, 1, liobn);
> >>+    rtas_st(rets, 2, tcet->bus_offset >> 32);
> >>+    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
> >>+
> >>+    return;
> >>+
> >>+hw_error_exit:
> >>+    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> >>+    return;
> >>+
> >>+param_error_exit:
> >>+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> >>+}
> >>+
> >>+static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
> >>+                                          sPAPRMachineState *spapr,
> >>+                                          uint32_t token, uint32_t nargs,
> >>+                                          target_ulong args,
> >>+                                          uint32_t nret, target_ulong rets)
> >>+{
> >>+    sPAPRPHBState *sphb;
> >>+    sPAPRTCETable *tcet;
> >>+    uint32_t liobn;
> >>+    long ret;
> >>+
> >>+    if ((nargs != 1) || (nret != 1)) {
> >>+        goto param_error_exit;
> >>+    }
> >>+
> >>+    liobn = rtas_ld(args, 0);
> >>+    tcet = spapr_tce_find_by_liobn(liobn);
> >>+    if (!tcet) {
> >>+        goto param_error_exit;
> >>+    }
> >>+
> >>+    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
> >>+    if (!sphb || !sphb->ddw_enabled || !spapr_phb_get_active_win_num(sphb)) {
> >
> >Checking spapr_phb_get_active_win_num() seems weird.  You already know
> >that this tcet is a child of the sphb.  Can't you just check its own
> >enabled property.
> >
> >>+        goto param_error_exit;
> >>+    }
> >>+
> >>+    ret = spapr_phb_dma_window_disable(sphb, liobn);
> >>+    trace_spapr_iommu_ddw_remove(liobn, ret);
> >>+    if (ret) {
> >>+        goto hw_error_exit;
> >>+    }
> >>+
> >>+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> >>+    return;
> >>+
> >>+hw_error_exit:
> >>+    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> >>+    return;
> >>+
> >>+param_error_exit:
> >>+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> >>+}
> >>+
> >>+static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
> >>+                                         sPAPRMachineState *spapr,
> >>+                                         uint32_t token, uint32_t nargs,
> >>+                                         target_ulong args,
> >>+                                         uint32_t nret, target_ulong rets)
> >>+{
> >>+    sPAPRPHBState *sphb;
> >>+    uint64_t buid;
> >>+    uint32_t addr;
> >>+
> >>+    if ((nargs != 3) || (nret != 1)) {
> >>+        goto param_error_exit;
> >>+    }
> >>+
> >>+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> >>+    addr = rtas_ld(args, 0);
> >>+    sphb = spapr_pci_find_phb(spapr, buid);
> >>+    if (!sphb || !sphb->ddw_enabled) {
> >>+        goto param_error_exit;
> >>+    }
> >>+
> >>+    spapr_phb_dma_reset(sphb);
> >>+    trace_spapr_iommu_ddw_reset(buid, addr);
> >>+
> >>+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> >>+
> >>+    return;
> >>+
> >>+param_error_exit:
> >>+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> >>+}
> >>+
> >>+static void spapr_rtas_ddw_init(void)
> >>+{
> >>+    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
> >>+                        "ibm,query-pe-dma-window",
> >>+                        rtas_ibm_query_pe_dma_window);
> >>+    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
> >>+                        "ibm,create-pe-dma-window",
> >>+                        rtas_ibm_create_pe_dma_window);
> >>+    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
> >>+                        "ibm,remove-pe-dma-window",
> >>+                        rtas_ibm_remove_pe_dma_window);
> >>+    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
> >>+                        "ibm,reset-pe-dma-window",
> >>+                        rtas_ibm_reset_pe_dma_window);
> >>+}
> >>+
> >>+type_init(spapr_rtas_ddw_init)
> >>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>index 421d6eb..b0ea146 100644
> >>--- a/hw/vfio/common.c
> >>+++ b/hw/vfio/common.c
> >>@@ -994,11 +994,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>              }
> >>          }
> >>
> >>-        /*
> >>-         * This only considers the host IOMMU's 32-bit window.  At
> >>-         * some point we need to add support for the optional 64-bit
> >>-         * window and dynamic windows
> >>-         */
> >
> >This seems like a stray change to VFIO code in a patch that's
> >otherwise only affecting spapr_pci.  Does this hunk belong in a
> >different patch?
> 
> 
> It does :)
> 
> 
> 



* Re: [Qemu-devel] [PATCH qemu v14 01/18] memory: Fix IOMMU replay base address
  2016-03-22  3:26       ` David Gibson
  2016-03-22  4:28         ` Alexey Kardashevskiy
@ 2016-03-23 10:58         ` Paolo Bonzini
  1 sibling, 0 replies; 64+ messages in thread
From: Paolo Bonzini @ 2016-03-23 10:58 UTC (permalink / raw)
  To: David Gibson, Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel



On 22/03/2016 04:26, David Gibson wrote:
>> > >...it might be simpler to replace both the iommu and
>> > >offset_within_address_space fields here with a pointer to the
>> > >MemoryRegionSection instead, which should give you all the info you
>> > >need.
>> 
>> 
>> MemoryRegionSection is allocated on stack in listener_add_address_space()
>> and seems to be in general some sort of temporary object.

If you need the information in a MemoryRegionSection, by all means use
it.  For example users of hw/display/framebuffer.c store a
MemoryRegionSection (in that case, they get it from memory_region_find,
but it doesn't have to be that way).

Otherwise I agree with what David has said.

Paolo


* Re: [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU
  2016-03-23  6:03               ` David Gibson
@ 2016-03-24  0:03                 ` Alexey Kardashevskiy
  2016-03-24  9:10                   ` Alexey Kardashevskiy
  0 siblings, 1 reply; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-24  0:03 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On 03/23/2016 05:03 PM, David Gibson wrote:
> On Wed, Mar 23, 2016 at 02:06:36PM +1100, Alexey Kardashevskiy wrote:
>> On 03/23/2016 01:53 PM, David Gibson wrote:
>>> On Wed, Mar 23, 2016 at 01:12:59PM +1100, Alexey Kardashevskiy wrote:
>>>> On 03/23/2016 12:08 PM, David Gibson wrote:
>>>>> On Tue, Mar 22, 2016 at 04:54:07PM +1100, Alexey Kardashevskiy wrote:
>>>>>> On 03/22/2016 04:14 PM, David Gibson wrote:
>>>>>>> On Mon, Mar 21, 2016 at 06:47:05PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>> New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
> >>>>>>>> This adds the ability for VFIO common code to dynamically allocate/remove
> >>>>>>>> DMA windows in the host kernel when a new VFIO container is added/removed.
>>>>>>>>
>>>>>>>> This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
>>>>>>>> and adds just created IOMMU into the host IOMMU list; the opposite
>>>>>>>> action is taken in vfio_listener_region_del.
>>>>>>>>
> >>>>>>>> When creating a new window, this uses a heuristic to decide on the
> >>>>>>>> number of TCE table levels.
>>>>>>>>
>>>>>>>> This should cause no guest visible change in behavior.
>>>>>>>>
>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>> ---
>>>>>>>> Changes:
>>>>>>>> v14:
>>>>>>>> * new to the series
>>>>>>>>
>>>>>>>> ---
>>>>>>>> TODO:
>>>>>>>> * export levels to PHB
>>>>>>>> ---
>>>>>>>>   hw/vfio/common.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>>>>>>   trace-events     |   2 ++
>>>>>>>>   2 files changed, 105 insertions(+), 5 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>>>>>> index 4e873b7..421d6eb 100644
>>>>>>>> --- a/hw/vfio/common.c
>>>>>>>> +++ b/hw/vfio/common.c
>>>>>>>> @@ -279,6 +279,14 @@ static int vfio_host_iommu_add(VFIOContainer *container,
>>>>>>>>       return 0;
>>>>>>>>   }
>>>>>>>>
>>>>>>>> +static void vfio_host_iommu_del(VFIOContainer *container, hwaddr min_iova)
>>>>>>>> +{
>>>>>>>> +    VFIOHostIOMMU *hiommu = vfio_host_iommu_lookup(container, min_iova, 0x1000);
>>>>>>>
>>>>>>> The hard-coded 0x1000 looks dubious..
>>>>>>
>>>>>> Well, that's the minimal page size...
>>>>>
>>>>> Really?  Some BookE CPUs support 1KiB page size..
>>>>
>>>> Hm. For IOMMU? Ok. s/0x1000/1/ should do then :)
>>>
>>> Uh.. actually I don't think those CPUs generally had an IOMMU.  But if
>>> it's been done for CPU MMU I wouldn't count on it not being done for
>>> IOMMU.
>>>
>>> 1 is a safer choice.
>>>
>>>>
>>>>
>>>>>
>>>>>>>> +    g_assert(hiommu);
>>>>>>>> +    QLIST_REMOVE(hiommu, hiommu_next);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>   static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>>>>>>>>   {
>>>>>>>>       return (!memory_region_is_ram(section->mr) &&
>>>>>>>> @@ -392,6 +400,61 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>>>>>>>       }
>>>>>>>>       end = int128_get64(llend);
>>>>>>>>
>>>>>>>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>>>>>>
>>>>>>> I think this would be clearer split out into a helper function,
>>>>>>> vfio_create_host_window() or something.
>>>>>>
>>>>>>
>>>>>> It is rather vfio_spapr_create_host_window() and we were avoiding
>>>>>> xxx_spapr_xxx so far. I'd cut-n-paste the SPAPR PCI AS listener to a
>>>>>> separate file but this usually triggers more discussion and never ends well.
>>>>>>
>>>>>>
>>>>>>
>>>>>>>> +        unsigned entries, pages;
>>>>>>>> +        struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
>>>>>>>> +
>>>>>>>> +        g_assert(section->mr->iommu_ops);
>>>>>>>> +        g_assert(memory_region_is_iommu(section->mr));
>>>>>>>
>>>>>>> I don't think you need these asserts.  AFAICT the same logic should
>>>>>>> work if a RAM MR was added directly to PCI address space - this would
>>>>>>> create the new host window, then the existing code for adding a RAM MR
>>>>>>> would map that block of RAM statically into the new window.
>>>>>>
>>>>>> In what configuration/machine can we do that on SPAPR?
>>>>>
>>>>> spapr guests won't ever do that.  But you can run an x86 guest on a
>>>>> powernv host and this situation could come up.
>>>>
>>>>
>>>> I am pretty sure VFIO won't work in this case anyway.
>>>
>>> I'm not.  There's no fundamental reason VFIO shouldn't work with TCG.
>>
>> This is not about TCG (pseries TCG guest works with VFIO on powernv host),
>> this is about things like VFIO_IOMMU_GET_INFO vs.
>> VFIO_IOMMU_SPAPR_TCE_GET_INFO ioctls but yes, fundamentally, it can work.
>>
>> Should I add such support in this patchset?
>
> Unless adding the generality is really complex, and so far I haven't
> seen a reason for it to be.

Seriously? :(



-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v14 18/18] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-03-23  6:11       ` David Gibson
@ 2016-03-24  2:32         ` Alexey Kardashevskiy
  2016-03-29  5:22           ` David Gibson
  2016-03-31  3:19           ` David Gibson
  0 siblings, 2 replies; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-24  2:32 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On 03/23/2016 05:11 PM, David Gibson wrote:
> On Wed, Mar 23, 2016 at 02:28:01PM +1100, Alexey Kardashevskiy wrote:
>> On 03/23/2016 01:13 PM, David Gibson wrote:
>>> On Mon, Mar 21, 2016 at 06:47:06PM +1100, Alexey Kardashevskiy wrote:
>>>> This adds support for the Dynamic DMA Windows (DDW) option defined by
>>>> the SPAPR specification, which allows having additional DMA window(s).
>>>>
>>>> This implements DDW for emulated and VFIO devices.
>>>> This reserves RTAS token numbers for DDW calls.
>>>>
>>>> This changes the TCE table migration descriptor to support dynamic
>>>> tables as from now on, PHB will create as many stub TCE table objects
>>>> as PHB can possibly support but not all of them might be initialized at
>>>> the time of migration because DDW might or might not be requested by
>>>> the guest.
>>>>
>>>> The "ddw" property is enabled by default on a PHB but for compatibility
>>>> the pseries-2.5 machine and older disable it.
>>>>
>>>> This implements DDW for VFIO. The host kernel support is required.
>>>> This adds a "levels" property to PHB to control the number of levels
>>>> in the actual TCE table allocated by the host kernel, 0 is the default
>>>> value to tell QEMU to calculate the correct value. Current hardware
>>>> supports up to 5 levels.
>>>>
>>>> The existing Linux guests try creating one additional huge DMA window
>>>> with 64K or 16MB pages and map the entire guest RAM into it. If that succeeds,
>>>> the guest switches to dma_direct_ops and never calls TCE hypercalls
>>>> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
>>>> and not waste time on map/unmap later. This adds a "dma64_win_addr"
>>>> property which is a bus address for the 64bit window and by default
>>>> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
>>>> uses and this allows having emulated and VFIO devices on the same bus.
>>>>
>>>> This adds 4 RTAS handlers:
>>>> * ibm,query-pe-dma-window
>>>> * ibm,create-pe-dma-window
>>>> * ibm,remove-pe-dma-window
>>>> * ibm,reset-pe-dma-window
>>>> These are registered from type_init() callback.
>>>>
>>>> These RTAS handlers are implemented in a separate file to avoid polluting
>>>> spapr_iommu.c with PCI.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>>   hw/ppc/Makefile.objs        |   1 +
>>>>   hw/ppc/spapr.c              |   7 +-
>>>>   hw/ppc/spapr_pci.c          |  73 ++++++++---
>>>>   hw/ppc/spapr_rtas_ddw.c     | 300 ++++++++++++++++++++++++++++++++++++++++++++
>>>>   hw/vfio/common.c            |   5 -
>>>>   include/hw/pci-host/spapr.h |  13 ++
>>>>   include/hw/ppc/spapr.h      |  16 ++-
>>>>   trace-events                |   4 +
>>>>   8 files changed, 395 insertions(+), 24 deletions(-)
>>>>   create mode 100644 hw/ppc/spapr_rtas_ddw.c
>>>>
>>>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>>>> index c1ffc77..986b36f 100644
>>>> --- a/hw/ppc/Makefile.objs
>>>> +++ b/hw/ppc/Makefile.objs
>>>> @@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
>>>>   ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>>>>   obj-y += spapr_pci_vfio.o
>>>>   endif
>>>> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>>>>   # PowerPC 4xx boards
>>>>   obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>>>>   obj-y += ppc4xx_pci.o
>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>> index d0bb423..ef4c637 100644
>>>> --- a/hw/ppc/spapr.c
>>>> +++ b/hw/ppc/spapr.c
>>>> @@ -2362,7 +2362,12 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
>>>>    * pseries-2.5
>>>>    */
>>>>   #define SPAPR_COMPAT_2_5 \
>>>> -        HW_COMPAT_2_5
>>>> +        HW_COMPAT_2_5 \
>>>> +        {\
>>>> +            .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
>>>> +            .property = "ddw",\
>>>> +            .value    = stringify(off),\
>>>> +        },
>>>>
>>>>   static void spapr_machine_2_5_instance_options(MachineState *machine)
>>>>   {
>>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>>> index af99a36..3bb294a 100644
>>>> --- a/hw/ppc/spapr_pci.c
>>>> +++ b/hw/ppc/spapr_pci.c
>>>> @@ -803,12 +803,12 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
>>>>       return buf;
>>>>   }
>>>>
>>>> -static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>>>> -                                       uint32_t liobn,
>>>> -                                       uint32_t page_shift,
>>>> -                                       uint64_t window_addr,
>>>> -                                       uint64_t window_size,
>>>> -                                       Error **errp)
>>>> +void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>>>> +                                 uint32_t liobn,
>>>> +                                 uint32_t page_shift,
>>>> +                                 uint64_t window_addr,
>>>> +                                 uint64_t window_size,
>>>> +                                 Error **errp)
>>>>   {
>>>>       sPAPRTCETable *tcet;
>>>>       uint32_t nb_table = window_size >> page_shift;
>>>> @@ -825,10 +825,16 @@ static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>>>>           return;
>>>>       }
>>>>
>>>> +    if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
>>>> +        error_setg(errp,
>>>> +                   "Attempt to use second window when DDW is disabled on PHB");
>>>> +        return;
>>>> +    }
>>>
>>> This should never happen unless something is wrong with the tests in
>>> the RTAS functions, yes?  In which case it should probably be an
>>> assert().
>>
>> This should not. But this is called from the RTAS caller so I'd really like
>> to have a message rather than assert() if that condition happens, here or in
>> rtas_ibm_create_pe_dma_window().
>
> It should only be called from RTAS if ddw is enabled though, yes?


 From RTAS and from the PHB reset handler. Well. I will get rid of 
spapr_phb_dma_window_enable/spapr_phb_dma_window_disable, they are quite 
useless when I look at them now.


>>
>>>
>>>>       spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table);
>>>>   }
>>>>
>>>> -static int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
>>>> +int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
>>>>   {
>>>>       sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
>>>>
>>>> @@ -1492,14 +1498,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>>>       }
>>>>
>>>>       /* DMA setup */
>>>> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
>>>> -    if (!tcet) {
>>>> -        error_report("No default TCE table for %s", sphb->dtbusname);
>>>> -        return;
>>>> -    }
>>>>
>>>> -    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
>>>> -                                        spapr_tce_get_iommu(tcet), 0);
>>>> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>>>> +        tcet = spapr_tce_new_table(DEVICE(sphb),
>>>> +                                   SPAPR_PCI_LIOBN(sphb->index, i));
>>>> +        if (!tcet) {
>>>> +            error_setg(errp, "Creating window#%d failed for %s",
>>>> +                       i, sphb->dtbusname);
>>>> +            return;
>>>> +        }
>>>> +        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
>>>> +                                            spapr_tce_get_iommu(tcet), 0);
>>>> +    }
>>>>
>>>>       sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>>>>   }
>>>> @@ -1517,11 +1527,16 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
>>>>
>>>>   void spapr_phb_dma_reset(sPAPRPHBState *sphb)
>>>>   {
>>>> -    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
>>>>       Error *local_err = NULL;
>>>> +    int i;
>>>>
>>>> -    if (tcet && tcet->enabled) {
>>>> -        spapr_phb_dma_window_disable(sphb, sphb->dma_liobn);
>>>> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>>>> +        uint32_t liobn = SPAPR_PCI_LIOBN(sphb->index, i);
>>>> +        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
>>>> +
>>>> +        if (tcet && tcet->enabled) {
>>>> +            spapr_phb_dma_window_disable(sphb, liobn);
>>>> +        }
>>>>       }
>>>>
>>>>       /* Register default 32bit DMA window */
>>>> @@ -1562,6 +1577,13 @@ static Property spapr_phb_properties[] = {
>>>>       /* Default DMA window is 0..1GB */
>>>>       DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
>>>>       DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
>>>> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
>>>> +                       0x800000000000000ULL),
>>>> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
>>>> +    DEFINE_PROP_UINT32("windows", sPAPRPHBState, windows_supported,
>>>> +                       SPAPR_PCI_DMA_MAX_WINDOWS),
>>>
>>> What will happen if the user tries to set 'windows' larger than
>>> SPAPR_PCI_DMA_MAX_WINDOWS?
>>
>>
>> Oh. I need to replace SPAPR_PCI_DMA_MAX_WINDOWS with windows_supported
>> everywhere, missed that. Besides that, there will be support for more
>> windows, that's it. The host VFIO IOMMU driver will fail to create more
>> windows but this is expected. For emulated windows, there will be more
>> windows with no other consequences.
>
> Hmm.. is there actually a reason to have the windows property?  Would
> you be better off just using the compile time constant for now.


I am afraid it is going to be 2 DMA windows forever, as the other DMA 
TLB-ish facility that is coming does not use windows at all :)
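For context, the realize loop in the patch creates one stub TCE table per supported window up front, and ibm,create-pe-dma-window later grabs the first still-disabled one. A minimal sketch of that free-window scan (types simplified from the patch; the LIOBN values in the demo helpers are made up):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for sPAPRTCETable: the PHB realize loop in the
 * patch creates one such stub per possible window, and the DDW RTAS
 * calls enable/disable them lazily. */
typedef struct {
    uint32_t liobn;
    bool enabled;
} TceTable;

/* Mirrors spapr_phb_get_free_liobn() from the quoted patch: return the
 * LIOBN of the first still-disabled table, or 0 when every window is
 * already in use (ibm,create-pe-dma-window would then fail). */
static uint32_t get_free_liobn(const TceTable *tables, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!tables[i].enabled) {
            return tables[i].liobn;
        }
    }
    return 0;
}

static uint32_t demo_pick_free(void)
{
    const TceTable t[2] = { { 0x100, true }, { 0x101, false } };
    return get_free_liobn(t, 2);
}

static uint32_t demo_all_busy(void)
{
    const TceTable t[1] = { { 0x100, true } };
    return get_free_liobn(t, 1);
}
```

With two fixed windows the scan degenerates to "is the second table free", which is why a compile-time constant would also do.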


>>
>>
>>
>>
>>>> +    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
>>>> +                       (1ULL << 12) | (1ULL << 16) | (1ULL << 24)),
>>>>       DEFINE_PROP_END_OF_LIST(),
>>>>   };
>>>>
>>>> @@ -1815,6 +1837,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>>>       uint32_t interrupt_map_mask[] = {
>>>>           cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
>>>>       uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
>>>> +    uint32_t ddw_applicable[] = {
>>>> +        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
>>>> +        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
>>>> +        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
>>>> +    };
>>>> +    uint32_t ddw_extensions[] = {
>>>> +        cpu_to_be32(1),
>>>> +        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
>>>> +    };
>>>>       sPAPRTCETable *tcet;
>>>>       PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
>>>>       sPAPRFDT s_fdt;
>>>> @@ -1839,6 +1870,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>>>       _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
>>>>       _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
>>>>
>>>> +    /* Dynamic DMA window */
>>>> +    if (phb->ddw_enabled) {
>>>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
>>>> +                         sizeof(ddw_applicable)));
>>>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
>>>> +                         &ddw_extensions, sizeof(ddw_extensions)));
>>>> +    }
>>>> +
>>>>       /* Build the interrupt-map, this must matches what is done
>>>>        * in pci_spapr_map_irq
>>>>        */
>>>> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
>>>> new file mode 100644
>>>> index 0000000..37f805f
>>>> --- /dev/null
>>>> +++ b/hw/ppc/spapr_rtas_ddw.c
>>>> @@ -0,0 +1,300 @@
>>>> +/*
>>>> + * QEMU sPAPR Dynamic DMA windows support
>>>> + *
>>>> + * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
>>>> + *
>>>> + *  This program is free software; you can redistribute it and/or modify
>>>> + *  it under the terms of the GNU General Public License as published by
>>>> + *  the Free Software Foundation; either version 2 of the License,
>>>> + *  or (at your option) any later version.
>>>> + *
>>>> + *  This program is distributed in the hope that it will be useful,
>>>> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>>> + *  GNU General Public License for more details.
>>>> + *
>>>> + *  You should have received a copy of the GNU General Public License
>>>> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
>>>> + */
>>>> +
>>>> +#include "qemu/osdep.h"
>>>> +#include "qemu/error-report.h"
>>>> +#include "hw/ppc/spapr.h"
>>>> +#include "hw/pci-host/spapr.h"
>>>> +#include "trace.h"
>>>> +
>>>> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
>>>> +{
>>>> +    sPAPRTCETable *tcet;
>>>> +
>>>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>>>> +    if (tcet && tcet->enabled) {
>>>> +        ++*(unsigned *)opaque;
>>>> +    }
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
>>>> +{
>>>> +    unsigned ret = 0;
>>>> +
>>>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
>>>> +{
>>>> +    sPAPRTCETable *tcet;
>>>> +
>>>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>>>> +    if (tcet && !tcet->enabled) {
>>>> +        *(uint32_t *)opaque = tcet->liobn;
>>>> +        return 1;
>>>> +    }
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
>>>> +{
>>>> +    uint32_t liobn = 0;
>>>> +
>>>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
>>>> +
>>>> +    return liobn;
>>>> +}
>>>> +
>>>> +static uint32_t spapr_query_mask(struct ppc_one_seg_page_size *sps,
>>>> +                                 uint64_t page_mask)
>>>> +{
>>>> +    int i, j;
>>>> +    uint32_t mask = 0;
>>>> +    const struct { int shift; uint32_t mask; } masks[] = {
>>>> +        { 12, RTAS_DDW_PGSIZE_4K },
>>>> +        { 16, RTAS_DDW_PGSIZE_64K },
>>>> +        { 24, RTAS_DDW_PGSIZE_16M },
>>>> +        { 25, RTAS_DDW_PGSIZE_32M },
>>>> +        { 26, RTAS_DDW_PGSIZE_64M },
>>>> +        { 27, RTAS_DDW_PGSIZE_128M },
>>>> +        { 28, RTAS_DDW_PGSIZE_256M },
>>>> +        { 34, RTAS_DDW_PGSIZE_16G },
>>>> +    };
>>>> +
>>>> +    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
>>>> +        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
>>>> +            if ((sps[i].page_shift == masks[j].shift) &&
>>>> +                    (page_mask & (1ULL << masks[j].shift))) {
>>>> +                mask |= masks[j].mask;
>>>> +            }
>>>> +        }
>>>> +    }
>>>> +
>>>> +    return mask;
>>>> +}
>>>> +
>>>> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
>>>> +                                         sPAPRMachineState *spapr,
>>>> +                                         uint32_t token, uint32_t nargs,
>>>> +                                         target_ulong args,
>>>> +                                         uint32_t nret, target_ulong rets)
>>>> +{
>>>> +    CPUPPCState *env = &cpu->env;
>>>> +    sPAPRPHBState *sphb;
>>>> +    uint64_t buid, max_window_size;
>>>> +    uint32_t avail, addr, pgmask = 0;
>>>> +
>>>> +    if ((nargs != 3) || (nret != 5)) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>>>> +    addr = rtas_ld(args, 0);
>>>> +    sphb = spapr_pci_find_phb(spapr, buid);
>>>> +    if (!sphb || !sphb->ddw_enabled) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    /* Work out supported page masks */
>>>> +    pgmask = spapr_query_mask(env->sps.sps, sphb->page_size_mask);
>>>
>>> There are a few potential problems here.  First you're just
>>> arbitrarily picking the first entry in the sps array to filter
>>
>> Why first?  spapr_query_mask() has a loop 0..PPC_PAGE_SIZES_MAX_SZ.
>
> env->sps is a nested array, 0..PPC_PAGE_SIZES_MAX_SZ-1 for base page
> sizes, then again for actual page sizes.  You're only examing the
> first "row" of that table.  It kinda works because the 4k base page
> size is first, which lists all the actual page size options.

Ah. Right. So I need to walk through all of them, ok.
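The fix amounts to walking every row of the nested table rather than only the first. A sketch with simplified structures (the field names and 0-terminated encoding are assumptions for illustration, not the real target-ppc layout):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SZ 8

/* Simplified stand-in for the nested env->sps table: each base (segment)
 * page size row lists the actual page shifts usable with it. */
typedef struct {
    int page_shift;       /* base page shift */
    int enc[MAX_SZ];      /* actual page shifts, 0-terminated */
} SegPageSize;

/* Walk every row of the table, not only the first one, ORing all actual
 * page shifts into one bitmap. */
static uint64_t all_page_shifts(const SegPageSize *sps, size_t n)
{
    uint64_t mask = 0;

    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < MAX_SZ && sps[i].enc[j]; j++) {
            mask |= 1ULL << sps[i].enc[j];
        }
    }
    return mask;
}

static uint64_t demo_shift_walk(void)
{
    /* 4K base row offers 4K/64K/16M actual sizes; 64K row offers 64K/16M. */
    const SegPageSize sps[] = {
        { 12, { 12, 16, 24 } },
        { 16, { 16, 24 } },
    };
    return all_page_shifts(sps, 2);
}
```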


>>> against, which doesn't seem right (except by accident).  It's a little
>>> bit odd filtering against guest page sizes at all, although I get what
>>> you're really trying to do is filter against allowed host page sizes.
>>>
>>> The other problem is that silently filtering capabilities based on the
>>> host can be a pain for migration - I've made the mistake and had it
>>> bite me in the past.  I think it would be safer to just check the
>>> pagesizes requested in the property against what's possible and
>>> outright fail if they don't match.  For convenience you could also set
>>> according to host capabilities if the user doesn't specify anything,
>>> but that would require explicit handling of the "default" case.


What are the host capabilities here?

There is a page mask from the host IOMMU/PE which is 4K|64K|16M and many 
other sizes; this is always supported by IODA2.
And there is PAGE_SIZE and huge pages (but only with -mempath) - so, 64K 
and 16M (with -mempath).

And there is a "ddw-query" RTAS call which tells the guest if it can use 
16M or not. How do you suggest I advertise 16M to the guest? If I always 
advertise 16M and there is no -mempath, the guest won't try a smaller page size.

So - if the user wants 16M IOMMU pages, he has to use -mempath and in 
addition to that explicitly say -global spapr-pci-host-bridge.pgsz=16M, 
and by default enable only 4K and 64K (or just 4K?)? I am fine with this, 
it just means more work for libvirt folks.
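For reference, the patch's default "pgsz" property is (1ULL << 12) | (1ULL << 16) | (1ULL << 24), and the query call folds whatever ends up in that mask into RTAS flags. A reduced sketch of that folding (flag values copied from the masks[] table in the quoted patch; treat them as illustrative, not authoritative):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative RTAS DDW page size flags, as in the quoted masks[] table. */
#define RTAS_DDW_PGSIZE_4K   0x01
#define RTAS_DDW_PGSIZE_64K  0x02
#define RTAS_DDW_PGSIZE_16M  0x04

/* Fold a PHB "pgsz" bitmap of allowed page shifts into the RTAS flag
 * word returned by ibm,query-pe-dma-window (reduced to three sizes). */
static uint32_t ddw_pgsize_flags(uint64_t phb_page_mask)
{
    static const struct { int shift; uint32_t flag; } map[] = {
        { 12, RTAS_DDW_PGSIZE_4K },
        { 16, RTAS_DDW_PGSIZE_64K },
        { 24, RTAS_DDW_PGSIZE_16M },
    };
    uint32_t flags = 0;

    for (unsigned i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
        if (phb_page_mask & (1ULL << map[i].shift)) {
            flags |= map[i].flag;
        }
    }
    return flags;
}
```

Dropping 1ULL << 24 from the property is then enough to stop advertising 16M when -mempath is absent.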



>>
>> For the migration purposes, both guests should be started with or without
>> hugepages enabled; this is taken into account already. Besides that, the
>> result of "query" won't differ.
>
> Hmm.. if you're migrating between TCG and KVM or between PR and HV
> these could change as well.  I'm not sure that works at the moment,
> but I'd prefer not to introduce any more barriers to it than we have
> to.
>
>>> Remember that this code will be relevant for DDW with emulated
>>> devices, even if VFIO is not in play at all.
>>>
>>> All those considerations aside, it seems like it would make more sense
>>> to do this filtering during device realize, rather than leaving it
>>> until the guest queries.
>>
>> The result will be the same, it only depends on whether hugepages are
>> enabled or not and this happens at the start time. But yes, feels more
>> accurate to do this in PHB realize(), I'll move it.
>>
>>
>>>
>>>> +    /*
>>>> +     * This is "Largest contiguous block of TCEs allocated specifically
>>>> +     * for (that is, are reserved for) this PE".
>>>> +     * Return the maximum number as maximum supported RAM size was in 4K pages.
>>>> +     */
>>>> +    max_window_size = MACHINE(spapr)->maxram_size >> SPAPR_TCE_PAGE_SHIFT;
>>>
>>> Will maxram_size always be enough?  There will sometimes be an
>>> alignment gap between the "base" RAM and the hotpluggable RAM, meaning
>>> that if everything is plugged the last RAM address will be beyond
>>> maxram_size.  Will that require pushing this number up, or will the
>>> guest "repack" the RAM layout when it maps it into the TCE tables?
>>
>>
>> Hm. I do not know what the guest does to DDW on memory hotplug but this is a
>> valid point... What QEMU helper does return the last available address in
>> the system memory address space? Like memblock_end_of_DRAM() in the kernel,
>> I would use that instead.
>
> There is a last_ram_offset() but that's in the ram_addr_t address

What do you call the "ram_addr_t address space"? Is it non-pluggable memory, 
with machine->ram_size as its size?


> space, which isn't necessarily the same as the physical address space
> (though it's usually similar).


I looked at the code and it looks like machine->ram_size is always what 
came from "-m" and that it won't grow if some memory is hotplugged - is that 
correct?

And hotpluggable memory does not appear in the global ram_list as a RAMBlock?
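To make the alignment-gap concern concrete: with hotpluggable DIMMs placed at the next big boundary above base RAM, the last RAM address can exceed maxram_size. A toy calculation (the 1 GiB hotplug alignment is an assumption for illustration, not QEMU's actual placement policy):

```c
#include <assert.h>
#include <stdint.h>

#define GIB (1ULL << 30)

/* Toy layout: base RAM from -m, then hotpluggable DIMMs starting at the
 * next 1 GiB boundary above it. */
static uint64_t hotplug_base(uint64_t ram_size)
{
    return (ram_size + GIB - 1) & ~(GIB - 1);   /* align up to 1 GiB */
}

/* One past the last guest-physical RAM address once all pluggable
 * memory (maxram_size - ram_size) has been added. */
static uint64_t last_ram_addr(uint64_t ram_size, uint64_t maxram_size)
{
    return hotplug_base(ram_size) + (maxram_size - ram_size);
}
```

For -m 1.5G,maxmem=4G the DIMMs start at 2 GiB, so the last address lands at 4.5 GiB, i.e. past maxram_size, which is why sizing the window from maxram_size alone can undershoot.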


> You can have a look at what we check
> in (the TCG/PR version of) H_ENTER which needs to check this as well.

Ok, thanks for the pointer.


-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU
  2016-03-24  0:03                 ` Alexey Kardashevskiy
@ 2016-03-24  9:10                   ` Alexey Kardashevskiy
  2016-03-29  5:30                     ` David Gibson
  0 siblings, 1 reply; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-24  9:10 UTC (permalink / raw)
  To: David Gibson; +Cc: Paul Mackerras, Alex Williamson, qemu-ppc, qemu-devel

On 03/24/2016 11:03 AM, Alexey Kardashevskiy wrote:
> On 03/23/2016 05:03 PM, David Gibson wrote:
>> On Wed, Mar 23, 2016 at 02:06:36PM +1100, Alexey Kardashevskiy wrote:
>>> On 03/23/2016 01:53 PM, David Gibson wrote:
>>>> On Wed, Mar 23, 2016 at 01:12:59PM +1100, Alexey Kardashevskiy wrote:
>>>>> On 03/23/2016 12:08 PM, David Gibson wrote:
>>>>>> On Tue, Mar 22, 2016 at 04:54:07PM +1100, Alexey Kardashevskiy wrote:
>>>>>>> On 03/22/2016 04:14 PM, David Gibson wrote:
>>>>>>>> On Mon, Mar 21, 2016 at 06:47:05PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>>> New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window
>>>>>>>>> management.
>>>>>>>>> This adds the ability for VFIO common code to dynamically allocate/remove
>>>>>>>>> DMA windows in the host kernel when a new VFIO container is
>>>>>>>>> added/removed.
>>>>>>>>>
>>>>>>>>> This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to
>>>>>>>>> vfio_listener_region_add
>>>>>>>>> and adds just created IOMMU into the host IOMMU list; the opposite
>>>>>>>>> action is taken in vfio_listener_region_del.
>>>>>>>>>
>>>>>>>>> When creating a new window, this uses a heuristic to decide on
>>>>>>>>> the number of TCE table levels.
>>>>>>>>>
>>>>>>>>> This should cause no guest visible change in behavior.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>> ---
>>>>>>>>> Changes:
>>>>>>>>> v14:
>>>>>>>>> * new to the series
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>> TODO:
>>>>>>>>> * export levels to PHB
>>>>>>>>> ---
>>>>>>>>>   hw/vfio/common.c | 108
>>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>>>>>>>   trace-events     |   2 ++
>>>>>>>>>   2 files changed, 105 insertions(+), 5 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>>>>>>> index 4e873b7..421d6eb 100644
>>>>>>>>> --- a/hw/vfio/common.c
>>>>>>>>> +++ b/hw/vfio/common.c
>>>>>>>>> @@ -279,6 +279,14 @@ static int vfio_host_iommu_add(VFIOContainer
>>>>>>>>> *container,
>>>>>>>>>       return 0;
>>>>>>>>>   }
>>>>>>>>>
>>>>>>>>> +static void vfio_host_iommu_del(VFIOContainer *container, hwaddr
>>>>>>>>> min_iova)
>>>>>>>>> +{
>>>>>>>>> +    VFIOHostIOMMU *hiommu = vfio_host_iommu_lookup(container,
>>>>>>>>> min_iova, 0x1000);
>>>>>>>>
>>>>>>>> The hard-coded 0x1000 looks dubious..
>>>>>>>
>>>>>>> Well, that's the minimal page size...
>>>>>>
>>>>>> Really?  Some BookE CPUs support 1KiB page size..
>>>>>
>>>>> Hm. For IOMMU? Ok. s/0x1000/1/ should do then :)
>>>>
>>>> Uh.. actually I don't think those CPUs generally had an IOMMU.  But if
>>>> it's been done for CPU MMU I wouldn't count on it not being done for
>>>> IOMMU.
>>>>
>>>> 1 is a safer choice.
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>>>> +    g_assert(hiommu);
>>>>>>>>> +    QLIST_REMOVE(hiommu, hiommu_next);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>>   static bool vfio_listener_skipped_section(MemoryRegionSection
>>>>>>>>> *section)
>>>>>>>>>   {
>>>>>>>>>       return (!memory_region_is_ram(section->mr) &&
>>>>>>>>> @@ -392,6 +400,61 @@ static void
>>>>>>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>>>>>>       }
>>>>>>>>>       end = int128_get64(llend);
>>>>>>>>>
>>>>>>>>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>>>>>>>
>>>>>>>> I think this would be clearer split out into a helper function,
>>>>>>>> vfio_create_host_window() or something.
>>>>>>>
>>>>>>>
>>>>>>> It is rather vfio_spapr_create_host_window() and we were avoiding
>>>>>>> xxx_spapr_xxx so far. I'd cut-n-paste the SPAPR PCI AS listener to a
>>>>>>> separate file but this usually triggers more discussion and never
>>>>>>> ends well.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>> +        unsigned entries, pages;
>>>>>>>>> +        struct vfio_iommu_spapr_tce_create create = { .argsz =
>>>>>>>>> sizeof(create) };
>>>>>>>>> +
>>>>>>>>> +        g_assert(section->mr->iommu_ops);
>>>>>>>>> +        g_assert(memory_region_is_iommu(section->mr));
>>>>>>>>
>>>>>>>> I don't think you need these asserts.  AFAICT the same logic should
>>>>>>>> work if a RAM MR was added directly to PCI address space - this would
>>>>>>>> create the new host window, then the existing code for adding a RAM MR
>>>>>>>> would map that block of RAM statically into the new window.
>>>>>>>
>>>>>>> In what configuration/machine can we do that on SPAPR?
>>>>>>
>>>>>> spapr guests won't ever do that.  But you can run an x86 guest on a
>>>>>> powernv host and this situation could come up.
>>>>>
>>>>>
>>>>> I am pretty sure VFIO won't work in this case anyway.
>>>>
>>>> I'm not.  There's no fundamental reason VFIO shouldn't work with TCG.
>>>
>>> This is not about TCG (pseries TCG guest works with VFIO on powernv host),
>>> this is about things like VFIO_IOMMU_GET_INFO vs.
>>> VFIO_IOMMU_SPAPR_TCE_GET_INFO ioctls but yes, fundamentally, it can work.
>>>
>>> Should I add such support in this patchset?
>>
>> Unless adding the generality is really complex, and so far I haven't
>> seen a reason for it to be.
>
> Seriously? :(


So, I tried.

With the q35 machine (pc-i440fx-2.6 has an even worse memory tree), there are 
several RAM blocks - 0..0xc0000, then pc.rom, then RAM again up to 2GB, then 
a gap (PCI MMIO?), then the PC BIOS at 0xfffc0000 (which is RAM), then the 
rest of the RAM after 4GB:

memory-region: system
   0000000000000000-ffffffffffffffff (prio 0, RW): system
     0000000000000000-000000007fffffff (prio 0, RW): alias ram-below-4g 
@pc.ram 0000000000000000-000000007fffffff
     0000000000000000-ffffffffffffffff (prio -1, RW): pci
       00000000000c0000-00000000000dffff (prio 1, RW): pc.rom
       00000000000e0000-00000000000fffff (prio 1, R-): alias isa-bios 
@pc.bios 0000000000020000-000000000003ffff
       00000000febe0000-00000000febeffff (prio 1, RW): 0003:09:00.0 BAR 0
         00000000febe0000-00000000febeffff (prio 0, RW): 0003:09:00.0 BAR 0 
mmaps[0]
       00000000febf0000-00000000febf1fff (prio 1, RW): 0003:09:00.0 BAR 2
         00000000febf0000-00000000febf007f (prio 0, RW): msix-table
         00000000febf1000-00000000febf1007 (prio 0, RW): msix-pba [disabled]
       00000000febf2000-00000000febf2fff (prio 1, RW): ahci
       00000000fffc0000-00000000ffffffff (prio 0, R-): pc.bios
[...]
     0000000100000000-000000027fffffff (prio 0, RW): alias ram-above-4g 
@pc.ram 0000000080000000-00000001ffffffff


Ok, let's say we filter out "pc.bios", "pc.rom" and BARs (aka "skip dump"
regions) and we end up having RAM below 2GB and above 4GB - 2 windows.
Problem is that a window's size needs to be a power of two.

Ok, I change my test from -m8G to -m10G to have 2 aligned regions (2GB and
8GB); next problem - the second window needs to start from 1<<59, not 4G
or any other random offset.

Ok, I change my test to start with -m1G to have a single window.
Now it fails because here is what happens:
region_add: pc.ram: 0 40000000
region_del: pc.ram: 0 40000000
region_add: pc.ram: 0 c0000
The second "add" is not power of two -> fail. And I cannot avoid this - 
"pc.rom" is still there, it is a separate region so RAM gets split into 
smaller chunks. I do not know to to fix this properly.


So, in order to make it (x86/tcg on a powernv host) work at all, I had to
create a single huge window (as it is a single window - it starts from @0)
unconditionally in vfio_connect_container(), not in
vfio_listener_region_add(); and I added filtering (skip "pc.bios",
"pc.rom", BARs - these things start above 1GB). Then I could boot an
x86_64 guest and even pass through a Mellanox ConnectX-3; it brought 2
interfaces up and dhclient assigned IPs to them (it is quite amusing that
it even works).


So, I think I will replace assert() with:

unsigned pagesize = qemu_real_host_page_size;
if (section->mr->iommu_ops) {
    pagesize = section->mr->iommu_ops->get_page_sizes(section->mr);
}

but there is no practical use for this anyway.


What am I missing, and what else do I need to try to proceed with this
patch? Thanks.


-- 
Alexey

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v14 18/18] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-03-24  2:32         ` Alexey Kardashevskiy
@ 2016-03-29  5:22           ` David Gibson
  2016-03-29  6:23             ` Alexey Kardashevskiy
  2016-03-31  3:19           ` David Gibson
  1 sibling, 1 reply; 64+ messages in thread
From: David Gibson @ 2016-03-29  5:22 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 24765 bytes --]

On Thu, Mar 24, 2016 at 01:32:48PM +1100, Alexey Kardashevskiy wrote:
> On 03/23/2016 05:11 PM, David Gibson wrote:
> >On Wed, Mar 23, 2016 at 02:28:01PM +1100, Alexey Kardashevskiy wrote:
> >>On 03/23/2016 01:13 PM, David Gibson wrote:
> >>>On Mon, Mar 21, 2016 at 06:47:06PM +1100, Alexey Kardashevskiy wrote:
> >>>>This adds support for Dynamic DMA Windows (DDW) option defined by
> >>>>the SPAPR specification which allows to have additional DMA window(s)
> >>>>
> >>>>This implements DDW for emulated and VFIO devices.
> >>>>This reserves RTAS token numbers for DDW calls.
> >>>>
> >>>>This changes the TCE table migration descriptor to support dynamic
> >>>>tables as from now on, PHB will create as many stub TCE table objects
> >>>>as PHB can possibly support but not all of them might be initialized at
> >>>>the time of migration because DDW might or might not be requested by
> >>>>the guest.
> >>>>
> >>>>The "ddw" property is enabled by default on a PHB but for compatibility
> >>>>the pseries-2.5 machine and older disable it.
> >>>>
> >>>>This implements DDW for VFIO. The host kernel support is required.
> >>>>This adds a "levels" property to PHB to control the number of levels
> >>>>in the actual TCE table allocated by the host kernel, 0 is the default
> >>>>value to tell QEMU to calculate the correct value. Current hardware
> >>>>supports up to 5 levels.
> >>>>
> >>>>The existing linux guests try creating one additional huge DMA window
> >>>>with 64K or 16MB pages and map the entire guest RAM to. If succeeded,
> >>>>the guest switches to dma_direct_ops and never calls TCE hypercalls
> >>>>(H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
> >>>>and not waste time on map/unmap later. This adds a "dma64_win_addr"
> >>>>property which is a bus address for the 64bit window and by default
> >>>>set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
> >>>>uses and this allows having emulated and VFIO devices on the same bus.
> >>>>
> >>>>This adds 4 RTAS handlers:
> >>>>* ibm,query-pe-dma-window
> >>>>* ibm,create-pe-dma-window
> >>>>* ibm,remove-pe-dma-window
> >>>>* ibm,reset-pe-dma-window
> >>>>These are registered from type_init() callback.
> >>>>
> >>>>These RTAS handlers are implemented in a separate file to avoid polluting
> >>>>spapr_iommu.c with PCI.
> >>>>
> >>>>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>---
> >>>>  hw/ppc/Makefile.objs        |   1 +
> >>>>  hw/ppc/spapr.c              |   7 +-
> >>>>  hw/ppc/spapr_pci.c          |  73 ++++++++---
> >>>>  hw/ppc/spapr_rtas_ddw.c     | 300 ++++++++++++++++++++++++++++++++++++++++++++
> >>>>  hw/vfio/common.c            |   5 -
> >>>>  include/hw/pci-host/spapr.h |  13 ++
> >>>>  include/hw/ppc/spapr.h      |  16 ++-
> >>>>  trace-events                |   4 +
> >>>>  8 files changed, 395 insertions(+), 24 deletions(-)
> >>>>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
> >>>>
> >>>>diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> >>>>index c1ffc77..986b36f 100644
> >>>>--- a/hw/ppc/Makefile.objs
> >>>>+++ b/hw/ppc/Makefile.objs
> >>>>@@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
> >>>>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
> >>>>  obj-y += spapr_pci_vfio.o
> >>>>  endif
> >>>>+obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
> >>>>  # PowerPC 4xx boards
> >>>>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
> >>>>  obj-y += ppc4xx_pci.o
> >>>>diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>>>index d0bb423..ef4c637 100644
> >>>>--- a/hw/ppc/spapr.c
> >>>>+++ b/hw/ppc/spapr.c
> >>>>@@ -2362,7 +2362,12 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
> >>>>   * pseries-2.5
> >>>>   */
> >>>>  #define SPAPR_COMPAT_2_5 \
> >>>>-        HW_COMPAT_2_5
> >>>>+        HW_COMPAT_2_5 \
> >>>>+        {\
> >>>>+            .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
> >>>>+            .property = "ddw",\
> >>>>+            .value    = stringify(off),\
> >>>>+        },
> >>>>
> >>>>  static void spapr_machine_2_5_instance_options(MachineState *machine)
> >>>>  {
> >>>>diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> >>>>index af99a36..3bb294a 100644
> >>>>--- a/hw/ppc/spapr_pci.c
> >>>>+++ b/hw/ppc/spapr_pci.c
> >>>>@@ -803,12 +803,12 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
> >>>>      return buf;
> >>>>  }
> >>>>
> >>>>-static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> >>>>-                                       uint32_t liobn,
> >>>>-                                       uint32_t page_shift,
> >>>>-                                       uint64_t window_addr,
> >>>>-                                       uint64_t window_size,
> >>>>-                                       Error **errp)
> >>>>+void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> >>>>+                                 uint32_t liobn,
> >>>>+                                 uint32_t page_shift,
> >>>>+                                 uint64_t window_addr,
> >>>>+                                 uint64_t window_size,
> >>>>+                                 Error **errp)
> >>>>  {
> >>>>      sPAPRTCETable *tcet;
> >>>>      uint32_t nb_table = window_size >> page_shift;
> >>>>@@ -825,10 +825,16 @@ static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> >>>>          return;
> >>>>      }
> >>>>
> >>>>+    if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
> >>>>+        error_setg(errp,
> >>>>+                   "Attempt to use second window when DDW is disabled on PHB");
> >>>>+        return;
> >>>>+    }
> >>>
> >>>This should never happen unless something is wrong with the tests in
> >>>the RTAS functions, yes?  In which case it should probably be an
> >>>assert().
> >>
> >>This should not. But this is called from the RTAS caller so I'd really like
> >>to have a message rather than assert() if that condition happens, here or in
> >>rtas_ibm_create_pe_dma_window().
> >
> >It should only be called from RTAS if ddw is enabled though, yes?
> 
> 
> From RTAS and from the PHB reset handler. Well. I will get rid of
> spapr_phb_dma_window_enable/spapr_phb_dma_window_disable, they are quite
> useless when I look at them now.

Ok.

> >>>>      spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table);
> >>>>  }
> >>>>
> >>>>-static int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
> >>>>+int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
> >>>>  {
> >>>>      sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
> >>>>
> >>>>@@ -1492,14 +1498,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >>>>      }
> >>>>
> >>>>      /* DMA setup */
> >>>>-    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
> >>>>-    if (!tcet) {
> >>>>-        error_report("No default TCE table for %s", sphb->dtbusname);
> >>>>-        return;
> >>>>-    }
> >>>>
> >>>>-    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
> >>>>-                                        spapr_tce_get_iommu(tcet), 0);
> >>>>+    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> >>>>+        tcet = spapr_tce_new_table(DEVICE(sphb),
> >>>>+                                   SPAPR_PCI_LIOBN(sphb->index, i));
> >>>>+        if (!tcet) {
> >>>>+            error_setg(errp, "Creating window#%d failed for %s",
> >>>>+                       i, sphb->dtbusname);
> >>>>+            return;
> >>>>+        }
> >>>>+        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
> >>>>+                                            spapr_tce_get_iommu(tcet), 0);
> >>>>+    }
> >>>>
> >>>>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
> >>>>  }
> >>>>@@ -1517,11 +1527,16 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
> >>>>
> >>>>  void spapr_phb_dma_reset(sPAPRPHBState *sphb)
> >>>>  {
> >>>>-    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
> >>>>      Error *local_err = NULL;
> >>>>+    int i;
> >>>>
> >>>>-    if (tcet && tcet->enabled) {
> >>>>-        spapr_phb_dma_window_disable(sphb, sphb->dma_liobn);
> >>>>+    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> >>>>+        uint32_t liobn = SPAPR_PCI_LIOBN(sphb->index, i);
> >>>>+        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
> >>>>+
> >>>>+        if (tcet && tcet->enabled) {
> >>>>+            spapr_phb_dma_window_disable(sphb, liobn);
> >>>>+        }
> >>>>      }
> >>>>
> >>>>      /* Register default 32bit DMA window */
> >>>>@@ -1562,6 +1577,13 @@ static Property spapr_phb_properties[] = {
> >>>>      /* Default DMA window is 0..1GB */
> >>>>      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
> >>>>      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
> >>>>+    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
> >>>>+                       0x800000000000000ULL),
> >>>>+    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
> >>>>+    DEFINE_PROP_UINT32("windows", sPAPRPHBState, windows_supported,
> >>>>+                       SPAPR_PCI_DMA_MAX_WINDOWS),
> >>>
> >>>What will happen if the user tries to set 'windows' larger than
> >>>SPAPR_PCI_DMA_MAX_WINDOWS?
> >>
> >>
> >>Oh. I need to replace SPAPR_PCI_DMA_MAX_WINDOWS with windows_supported
> >>everywhere, missed that. Besides that, there will be support for more
> >>windows, that's it. The host VFIO IOMMU driver will fail creating more
> >>windows but this is expected. For emulated windows, there will be more
> >>windows with no other consequences.
> >
> >Hmm.. is there actually a reason to have the windows property?  Would
> >you be better off just using the compile time constant for now.
> 
> 
> I am afraid it is going to be 2 DMA windows forever as the other DMA tlb-ish
> facility coming does not use windows at all :)

That sounds like a reason not to have the property and leave it as a
compile time constant.

> >>>>+    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
> >>>>+                       (1ULL << 12) | (1ULL << 16) | (1ULL << 24)),
> >>>>      DEFINE_PROP_END_OF_LIST(),
> >>>>  };
> >>>>
> >>>>@@ -1815,6 +1837,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
> >>>>      uint32_t interrupt_map_mask[] = {
> >>>>          cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
> >>>>      uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
> >>>>+    uint32_t ddw_applicable[] = {
> >>>>+        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
> >>>>+        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
> >>>>+        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
> >>>>+    };
> >>>>+    uint32_t ddw_extensions[] = {
> >>>>+        cpu_to_be32(1),
> >>>>+        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
> >>>>+    };
> >>>>      sPAPRTCETable *tcet;
> >>>>      PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
> >>>>      sPAPRFDT s_fdt;
> >>>>@@ -1839,6 +1870,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
> >>>>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
> >>>>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
> >>>>
> >>>>+    /* Dynamic DMA window */
> >>>>+    if (phb->ddw_enabled) {
> >>>>+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
> >>>>+                         sizeof(ddw_applicable)));
> >>>>+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
> >>>>+                         &ddw_extensions, sizeof(ddw_extensions)));
> >>>>+    }
> >>>>+
> >>>>      /* Build the interrupt-map, this must matches what is done
> >>>>       * in pci_spapr_map_irq
> >>>>       */
> >>>>diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
> >>>>new file mode 100644
> >>>>index 0000000..37f805f
> >>>>--- /dev/null
> >>>>+++ b/hw/ppc/spapr_rtas_ddw.c
> >>>>@@ -0,0 +1,300 @@
> >>>>+/*
> >>>>+ * QEMU sPAPR Dynamic DMA windows support
> >>>>+ *
> >>>>+ * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
> >>>>+ *
> >>>>+ *  This program is free software; you can redistribute it and/or modify
> >>>>+ *  it under the terms of the GNU General Public License as published by
> >>>>+ *  the Free Software Foundation; either version 2 of the License,
> >>>>+ *  or (at your option) any later version.
> >>>>+ *
> >>>>+ *  This program is distributed in the hope that it will be useful,
> >>>>+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> >>>>+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >>>>+ *  GNU General Public License for more details.
> >>>>+ *
> >>>>+ *  You should have received a copy of the GNU General Public License
> >>>>+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
> >>>>+ */
> >>>>+
> >>>>+#include "qemu/osdep.h"
> >>>>+#include "qemu/error-report.h"
> >>>>+#include "hw/ppc/spapr.h"
> >>>>+#include "hw/pci-host/spapr.h"
> >>>>+#include "trace.h"
> >>>>+
> >>>>+static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
> >>>>+{
> >>>>+    sPAPRTCETable *tcet;
> >>>>+
> >>>>+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> >>>>+    if (tcet && tcet->enabled) {
> >>>>+        ++*(unsigned *)opaque;
> >>>>+    }
> >>>>+    return 0;
> >>>>+}
> >>>>+
> >>>>+static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
> >>>>+{
> >>>>+    unsigned ret = 0;
> >>>>+
> >>>>+    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
> >>>>+
> >>>>+    return ret;
> >>>>+}
> >>>>+
> >>>>+static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
> >>>>+{
> >>>>+    sPAPRTCETable *tcet;
> >>>>+
> >>>>+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> >>>>+    if (tcet && !tcet->enabled) {
> >>>>+        *(uint32_t *)opaque = tcet->liobn;
> >>>>+        return 1;
> >>>>+    }
> >>>>+    return 0;
> >>>>+}
> >>>>+
> >>>>+static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
> >>>>+{
> >>>>+    uint32_t liobn = 0;
> >>>>+
> >>>>+    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
> >>>>+
> >>>>+    return liobn;
> >>>>+}
> >>>>+
> >>>>+static uint32_t spapr_query_mask(struct ppc_one_seg_page_size *sps,
> >>>>+                                 uint64_t page_mask)
> >>>>+{
> >>>>+    int i, j;
> >>>>+    uint32_t mask = 0;
> >>>>+    const struct { int shift; uint32_t mask; } masks[] = {
> >>>>+        { 12, RTAS_DDW_PGSIZE_4K },
> >>>>+        { 16, RTAS_DDW_PGSIZE_64K },
> >>>>+        { 24, RTAS_DDW_PGSIZE_16M },
> >>>>+        { 25, RTAS_DDW_PGSIZE_32M },
> >>>>+        { 26, RTAS_DDW_PGSIZE_64M },
> >>>>+        { 27, RTAS_DDW_PGSIZE_128M },
> >>>>+        { 28, RTAS_DDW_PGSIZE_256M },
> >>>>+        { 34, RTAS_DDW_PGSIZE_16G },
> >>>>+    };
> >>>>+
> >>>>+    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
> >>>>+        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
> >>>>+            if ((sps[i].page_shift == masks[j].shift) &&
> >>>>+                    (page_mask & (1ULL << masks[j].shift))) {
> >>>>+                mask |= masks[j].mask;
> >>>>+            }
> >>>>+        }
> >>>>+    }
> >>>>+
> >>>>+    return mask;
> >>>>+}
> >>>>+
> >>>>+static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
> >>>>+                                         sPAPRMachineState *spapr,
> >>>>+                                         uint32_t token, uint32_t nargs,
> >>>>+                                         target_ulong args,
> >>>>+                                         uint32_t nret, target_ulong rets)
> >>>>+{
> >>>>+    CPUPPCState *env = &cpu->env;
> >>>>+    sPAPRPHBState *sphb;
> >>>>+    uint64_t buid, max_window_size;
> >>>>+    uint32_t avail, addr, pgmask = 0;
> >>>>+
> >>>>+    if ((nargs != 3) || (nret != 5)) {
> >>>>+        goto param_error_exit;
> >>>>+    }
> >>>>+
> >>>>+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> >>>>+    addr = rtas_ld(args, 0);
> >>>>+    sphb = spapr_pci_find_phb(spapr, buid);
> >>>>+    if (!sphb || !sphb->ddw_enabled) {
> >>>>+        goto param_error_exit;
> >>>>+    }
> >>>>+
> >>>>+    /* Work out supported page masks */
> >>>>+    pgmask = spapr_query_mask(env->sps.sps, sphb->page_size_mask);
> >>>
> >>>There are a few potential problems here.  First you're just
> >>>arbitrarily picking the first entry in the sps array to filter
> >>
> >>Why first?  spapr_query_mask() has a loop 0..PPC_PAGE_SIZES_MAX_SZ.
> >
> >env->sps is a nested array, 0..PPC_PAGE_SIZES_MAX_SZ-1 for base page
> >sizes, then again for actual page sizes.  You're only examing the
> >first "row" of that table.  It kinda works because the 4k base page
> >size is first, which lists all the actual page size options.
> 
> Ah. Right. So I need to walk through all of them, ok.

Well.. that would be closer to right, but even then, I don't think
that's really right.  What you want to advertise here is all guest
capabilities that are possible on the host.

So, the first question is what's the actual size of the chunks that
the host will need to map.  Let's call that the "effective IOMMU page
size".  That's going to be whichever is smaller of the guest IOMMU
page size, and the (host) RAM page size.

e.g. For a guest using 4k IOMMU mappings on a host with 64k page size,
the effective page size is 4kiB.

For a guest using 16MiB IOMMU mappings on a host with 64kiB page size,
the effective page size is 64kiB (because a 16MiB guest "page" could
be broken up into different real pages on the host).

The next question is what can you actually map in the host IOMMU.  It
should be able to handle any effective page size that's equal to, or
smaller than the smallest page size supported on the host.  That might
involve the host inserting several host TCEs for a single effective
page, but the VFIO DMA_MAP interface should already cover that because
it takes a length parameter.

> >>>against, which doesn't seem right (except by accident).  It's a little
> >>>bit odd filtering against guest page sizes at all, although I get what
> >>>you're really trying to do is filter against allowed host page sizes.
> >>>
> >>>The other problem is that silently filtering capabilities based on the
> >>>host can be a pain for migration - I've made the mistake and had it
> >>>bite me in the past.  I think it would be safer to just check the
> >>>pagesizes requested in the property against what's possible and
> >>>outright fail if they don't match.  For convenience you could also set
> >>>according to host capabilities if the user doesn't specify anything,
> >>>but that would require explicit handling of the "default" case.
> 
> 
> What are the host capabilities here?
> 
> There is a page mask from the host IOMMU/PE which is 4K|64K|16M and many
> other sizes; this is always supported by IODA2.
> And there is PAGE_SIZE and huge pages (but only with -mempath) - so, 64K
> and 16M (with -mempath).
> 
> And there is a "ddw-query" RTAS call which tells the guest if it can use 16M
> or not. How do you suggest I advertise 16M to the guest? If I always
> advertise 16M and there is no -mempath, the guest won't try smaller page
> size.

So, here's what I think you need to do:

  1. Start with the full list of IOMMU page sizes the virtual (guest)
     hardware supports (so, 4KiB, 64KiB & 16MiB, possibly modified by
     properties)
  2. For each entry in the list work out what the effective page size
     will be, by clamping to the RAM page size.
  3. Test if each effective page size is possible on the host (i.e. is
     <= at least one of the pagesizes advertised by the host IOMMU
     info ioctl).

For a first cut it's probably reasonable to only check for == (not <=)
the supported host pagesizes.  Ideally the host (inside the kernel)
would automatically create duplicate host TCEs if effective page size
< host IOMMU page size.  Obviously it would still need to use the
supplied guest page size for the guest view of the table.


> So - if the user wants 16M IOMMU pages, he has to use -mempath and in
> addition to that explicitly say -global spapr-pci-host-bridge.pgsz=16M, and
> by default enable only 4K and 64K (or just 4K?)? I am fine with this; it
> just means more work for the libvirt folks.

I think that's the safest option to start with.  Require that the user
explicitly list what pagesizes the guest IOMMU will support, and just
fail if the host can't do that.

We could potentially auto-filter (as we already do for CPU/MMU
pagesizes) once we've thought a bit more carefully about the possible
migration implications.


> >>For the migration purposes, both guests should be started with or without
> >>hugepages enabled; this is taken into account already. Besides that, the
> >>result of "query" won't differ.
> >
> >Hmm.. if you're migrating between TCG and KVM or between PR and HV
> >these could change as well.  I'm not sure that works at the moment,
> >but I'd prefer not to introduce any more barriers to it than we have
> >to.
> >
> >>>Remember that this code will be relevant for DDW with emulated
> >>>devices, even if VFIO is not in play at all.
> >>>
> >>>All those considerations aside, it seems like it would make more sense
> >>>to do this filtering during device realize, rather than leaving it
> >>>until the guest queries.
> >>
> >>The result will be the same, it only depends on whether hugepages are
> >>enabled or not and this happens at the start time. But yes, feels more
> >>accurate to do this in PHB realize(), I'll move it.
> >>
> >>
> >>>
> >>>>+    /*
> >>>>+     * This is "Largest contiguous block of TCEs allocated specifically
> >>>>+     * for (that is, are reserved for) this PE".
> >>>>+     * Return the maximum number as maximum supported RAM size was in 4K pages.
> >>>>+     */
> >>>>+    max_window_size = MACHINE(spapr)->maxram_size >> SPAPR_TCE_PAGE_SHIFT;
> >>>
> >>>Will maxram_size always be enough?  There will sometimes be an
> >>>alignment gap between the "base" RAM and the hotpluggable RAM, meaning
> >>>that if everything is plugged the last RAM address will be beyond
> >>>maxram_size.  Will that require pushing this number up, or will the
> >>>guest "repack" the RAM layout when it maps it into the TCE tables?
> >>
> >>
> >>Hm. I do not know what the guest does to DDW on memory hotplug but this is a
> >>valid point... What QEMU helper does return the last available address in
> >>the system memory address space? Like memblock_end_of_DRAM() in the kernel,
> >>I would use that instead.
> >
> >There is a last_ram_offset() but that's in the ram_addr_t address
> 
> What do you call "ram_addr_t address space"? Non-pluggable memory and
> machine->ram_size is its size?

There's an address space in which all RAMBlocks live - for internal
qemu tracking purposes - which is *different* from the guest physical
address space (although they'll match in simpler cases).  Variables of
type ram_addr_t are in that address space, whereas variables of type
hwaddr are in the guest physical address space.  Or at least they're
supposed to be: unfortunately there are some variables with the wrong
type about :(.

I forget exactly why the ram_addr_t space exists - I think it's
usually contiguous, even if the ramblocks aren't contiguous in GPA,
and it's used for tracking qemu's internal dirty bitmaps amongst other
things.

> >space, which isn't necessarily the same as the physical address space
> >(though it's usually similar).
> 
> 
> I looked at the code and it looks like machine->ram_size is always what came
> from "-m" and it won't grow if some memory was hotplugged, is that correct?

Yes, that's right.

> And hotpluggable memory does not appear in the global ram_list as a RAMBlock?

Well.. hotplugged memory should appear there.  Hotpluggable, but not
yet plugged memory won't, if that's what you mean.

> >You can have a look at what we check
> >in (the TCG/PR version of) H_ENTER which needs to check this as well.
> 
> Ok, thanks for the pointer.
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU
  2016-03-24  9:10                   ` Alexey Kardashevskiy
@ 2016-03-29  5:30                     ` David Gibson
  2016-03-29  5:44                       ` Alexey Kardashevskiy
  0 siblings, 1 reply; 64+ messages in thread
From: David Gibson @ 2016-03-29  5:30 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Paul Mackerras, Alex Williamson, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 9527 bytes --]

On Thu, Mar 24, 2016 at 08:10:44PM +1100, Alexey Kardashevskiy wrote:
> On 03/24/2016 11:03 AM, Alexey Kardashevskiy wrote:
> >On 03/23/2016 05:03 PM, David Gibson wrote:
> >>On Wed, Mar 23, 2016 at 02:06:36PM +1100, Alexey Kardashevskiy wrote:
> >>>On 03/23/2016 01:53 PM, David Gibson wrote:
> >>>>On Wed, Mar 23, 2016 at 01:12:59PM +1100, Alexey Kardashevskiy wrote:
> >>>>>On 03/23/2016 12:08 PM, David Gibson wrote:
> >>>>>>On Tue, Mar 22, 2016 at 04:54:07PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>On 03/22/2016 04:14 PM, David Gibson wrote:
> >>>>>>>>On Mon, Mar 21, 2016 at 06:47:05PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>>>New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window
> >>>>>>>>>management.
> >>>>>>>>>This adds ability to VFIO common code to dynamically allocate/remove
> >>>>>>>>>DMA windows in the host kernel when new VFIO container is
> >>>>>>>>>added/removed.
> >>>>>>>>>
> >>>>>>>>>This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to
> >>>>>>>>>vfio_listener_region_add
> >>>>>>>>>and adds just created IOMMU into the host IOMMU list; the opposite
> >>>>>>>>>action is taken in vfio_listener_region_del.
> >>>>>>>>>
> >>>>>>>>>When creating a new window, this uses a heuristic to decide on
> >>>>>>>>>the number of TCE table levels.
> >>>>>>>>>
> >>>>>>>>>This should cause no guest visible change in behavior.
> >>>>>>>>>
> >>>>>>>>>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>>>>>>---
> >>>>>>>>>Changes:
> >>>>>>>>>v14:
> >>>>>>>>>* new to the series
> >>>>>>>>>
> >>>>>>>>>---
> >>>>>>>>>TODO:
> >>>>>>>>>* export levels to PHB
> >>>>>>>>>---
> >>>>>>>>>  hw/vfio/common.c | 108
> >>>>>>>>>++++++++++++++++++++++++++++++++++++++++++++++++++++---
> >>>>>>>>>  trace-events     |   2 ++
> >>>>>>>>>  2 files changed, 105 insertions(+), 5 deletions(-)
> >>>>>>>>>
> >>>>>>>>>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>>>>>>>>index 4e873b7..421d6eb 100644
> >>>>>>>>>--- a/hw/vfio/common.c
> >>>>>>>>>+++ b/hw/vfio/common.c
> >>>>>>>>>@@ -279,6 +279,14 @@ static int vfio_host_iommu_add(VFIOContainer
> >>>>>>>>>*container,
> >>>>>>>>>      return 0;
> >>>>>>>>>  }
> >>>>>>>>>
> >>>>>>>>>+static void vfio_host_iommu_del(VFIOContainer *container, hwaddr
> >>>>>>>>>min_iova)
> >>>>>>>>>+{
> >>>>>>>>>+    VFIOHostIOMMU *hiommu = vfio_host_iommu_lookup(container,
> >>>>>>>>>min_iova, 0x1000);
> >>>>>>>>
> >>>>>>>>The hard-coded 0x1000 looks dubious..
> >>>>>>>
> >>>>>>>Well, that's the minimal page size...
> >>>>>>
> >>>>>>Really?  Some BookE CPUs support 1KiB page size..
> >>>>>
> >>>>>Hm. For IOMMU? Ok. s/0x1000/1/ should do then :)
> >>>>
> >>>>Uh.. actually I don't think those CPUs generally had an IOMMU.  But if
> >>>>it's been done for CPU MMU I wouldn't count on it not being done for
> >>>>IOMMU.
> >>>>
> >>>>1 is a safer choice.
> >>>>
> >>>>>
> >>>>>
> >>>>>>
> >>>>>>>>>+    g_assert(hiommu);
> >>>>>>>>>+    QLIST_REMOVE(hiommu, hiommu_next);
> >>>>>>>>>+}
> >>>>>>>>>+
> >>>>>>>>>  static bool vfio_listener_skipped_section(MemoryRegionSection
> >>>>>>>>>*section)
> >>>>>>>>>  {
> >>>>>>>>>      return (!memory_region_is_ram(section->mr) &&
> >>>>>>>>>@@ -392,6 +400,61 @@ static void
> >>>>>>>>>vfio_listener_region_add(MemoryListener *listener,
> >>>>>>>>>      }
> >>>>>>>>>      end = int128_get64(llend);
> >>>>>>>>>
> >>>>>>>>>+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> >>>>>>>>
> >>>>>>>>I think this would be clearer split out into a helper function,
> >>>>>>>>vfio_create_host_window() or something.
> >>>>>>>
> >>>>>>>
> >>>>>>>It is rather vfio_spapr_create_host_window() and we were avoiding
> >>>>>>>xxx_spapr_xxx so far. I'd cut-n-paste the SPAPR PCI AS listener to a
> >>>>>>>separate file but this usually triggers more discussion and never
> >>>>>>>ends well.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>>>+        unsigned entries, pages;
> >>>>>>>>>+        struct vfio_iommu_spapr_tce_create create = { .argsz =
> >>>>>>>>>sizeof(create) };
> >>>>>>>>>+
> >>>>>>>>>+        g_assert(section->mr->iommu_ops);
> >>>>>>>>>+        g_assert(memory_region_is_iommu(section->mr));
> >>>>>>>>
> >>>>>>>>I don't think you need these asserts.  AFAICT the same logic should
> >>>>>>>>work if a RAM MR was added directly to PCI address space - this would
> >>>>>>>>create the new host window, then the existing code for adding a RAM MR
> >>>>>>>>would map that block of RAM statically into the new window.
> >>>>>>>
> >>>>>>>In what configuration/machine can we do that on SPAPR?
> >>>>>>
> >>>>>>spapr guests won't ever do that.  But you can run an x86 guest on a
> >>>>>>powernv host and this situation could come up.
> >>>>>
> >>>>>
> >>>>>I am pretty sure VFIO won't work in this case anyway.
> >>>>
> >>>>I'm not.  There's no fundamental reason VFIO shouldn't work with TCG.
> >>>
> >>>This is not about TCG (pseries TCG guest works with VFIO on powernv host),
> >>>this is about things like VFIO_IOMMU_GET_INFO vs.
> >>>VFIO_IOMMU_SPAPR_TCE_GET_INFO ioctls but yes, fundamentally, it can work.
> >>>
> >>>Should I add such support in this patchset?
> >>
> >>Unless adding the generality is really complex, and so far I haven't
> >>seen a reason for it to be.
> >
> >Seriously? :(
> 
> 
> So, I tried.
> 
> With the q35 machine (pc-i440fx-2.6 has an even worse memory tree), there are
> several RAM blocks - 0..0xc0000, then pc.rom, then RAM again till 2GB, then
> a gap (PCI MMIO?), then PC BIOS at 0xfffc0000 (which is RAM), then after 4GB
> the rest of the RAM:
> 
> memory-region: system
>   0000000000000000-ffffffffffffffff (prio 0, RW): system
>     0000000000000000-000000007fffffff (prio 0, RW): alias ram-below-4g
> @pc.ram 0000000000000000-000000007fffffff
>     0000000000000000-ffffffffffffffff (prio -1, RW): pci
>       00000000000c0000-00000000000dffff (prio 1, RW): pc.rom
>       00000000000e0000-00000000000fffff (prio 1, R-): alias isa-bios
> @pc.bios 0000000000020000-000000000003ffff
>       00000000febe0000-00000000febeffff (prio 1, RW): 0003:09:00.0 BAR 0
>         00000000febe0000-00000000febeffff (prio 0, RW): 0003:09:00.0 BAR 0
> mmaps[0]
>       00000000febf0000-00000000febf1fff (prio 1, RW): 0003:09:00.0 BAR 2
>         00000000febf0000-00000000febf007f (prio 0, RW): msix-table
>         00000000febf1000-00000000febf1007 (prio 0, RW): msix-pba [disabled]
>       00000000febf2000-00000000febf2fff (prio 1, RW): ahci
>       00000000fffc0000-00000000ffffffff (prio 0, R-): pc.bios
> [...]
>     0000000100000000-000000027fffffff (prio 0, RW): alias ram-above-4g
> @pc.ram 0000000080000000-00000001ffffffff
> 
> 
> Ok, let's say we filter out "pc.bios", "pc.rom", BARs (aka "skip dump"
> regions) and we end up having RAM below 2GB and above 4GB, 2 windows.
> Problem is a window needs to be aligned to power of two.

Window size?  That should be ok - you can just round up the requested
size to a power of two.  We don't have to populate the whole host window.
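For illustration, the round-up could look like the following standalone
sketch (QEMU's pow2ceil() helper does much the same job; this version is
just for clarity):

```c
#include <assert.h>
#include <stdint.h>

/* Round a requested DMA window size up to the next power of two.
 * We don't have to populate the whole host window, so over-allocating
 * the window size itself is harmless. */
static uint64_t round_up_pow2(uint64_t size)
{
    uint64_t n = 1;

    while (n < size) {
        n <<= 1;
    }
    return n;
}
```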

> Ok, I change my test from -m8G to -m10G to have 2 aligned regions (2GB and
> 8GB); next problem - the second window needs to start at 1<<59, not 4GB or
> any other random offset.

Yeah, that's harder.  With the IODA is it always that the 32-bit
window is at 0 and the 64-bit window is at 1<<59, or can you have a
64-bit window at 0?

If it's the second, then this is theoretically possible, but the host
kernel, on seeing the second window request, would need to realise
that instead of adding a new host window it can extend the first
host window so that it covers both the guest windows.

That does sound reasonably tricky and I don't think we should try to
implement it any time soon.  BUT - we should design our *interfaces*
so that it's reasonable to add that in future.

> Ok, I change my test to start with -m1G to have a single window.
> Now it fails because here is what happens:
> region_add: pc.ram: 0 40000000
> region_del: pc.ram: 0 40000000
> region_add: pc.ram: 0 c0000
> The second "add" is not a power of two -> fail. And I cannot avoid this -
> "pc.rom" is still there, it is a separate region so RAM gets split into
> smaller chunks. I do not know how to fix this properly.
> 
> 
> So, in order to make it (x86/tcg on powernv host) work anyhow, I had to
> create a single huge window (as it is a single window - it starts from @0)
> unconditionally in vfio_connect_container(), not in
> vfio_listener_region_add(); and I added filtering (skip "pc.bios", "pc.rom",
> BARs - these things start above 1GB) and then I could boot an x86_64 guest
> and even pass a Mellanox ConnectX-3 through; it would bring 2 interfaces up and
> dhclient assigned IPs to them (which is quite amusing that it even works).
> 
> 
> So, I think I will replace assert() with:
> 
> unsigned pagesize = qemu_real_host_page_size;
> if (section->mr->iommu_ops) {
>     pagesize = section->mr->iommu_ops->get_page_sizes(section->mr);
> }
> 
> but there is no practical use for this anyway.

So, I think what you need here is to compute the effective IOMMU page
size (see other email for details).  That will be getrampagesize() for
RAM regions, and the minimum of that and the guest IOMMU page size for
IOMMU regions.
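A minimal sketch of that rule (getrampagesize() is the real QEMU helper;
the function and parameter names here are illustrative only):

```c
#include <assert.h>
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Effective IOMMU page size: the host RAM page size for plain RAM
 * regions, and the minimum of that and the guest IOMMU page size for
 * IOMMU regions.  Hypothetical helper, not the actual QEMU API. */
static uint64_t effective_iommu_page_size(uint64_t host_ram_pgsize,
                                          uint64_t guest_iommu_pgsize,
                                          int is_iommu_region)
{
    if (!is_iommu_region) {
        return host_ram_pgsize;
    }
    return MIN(host_ram_pgsize, guest_iommu_pgsize);
}
```

E.g. a guest using 4k IOMMU mappings on a 64k-page host gets an effective
page size of 4k, as described above.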

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU
  2016-03-29  5:30                     ` David Gibson
@ 2016-03-29  5:44                       ` Alexey Kardashevskiy
  2016-03-29  6:44                         ` David Gibson
  0 siblings, 1 reply; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-29  5:44 UTC (permalink / raw)
  To: David Gibson; +Cc: Paul Mackerras, Alex Williamson, qemu-ppc, qemu-devel

On 03/29/2016 04:30 PM, David Gibson wrote:
> On Thu, Mar 24, 2016 at 08:10:44PM +1100, Alexey Kardashevskiy wrote:
>> On 03/24/2016 11:03 AM, Alexey Kardashevskiy wrote:
>>> On 03/23/2016 05:03 PM, David Gibson wrote:
>>>> On Wed, Mar 23, 2016 at 02:06:36PM +1100, Alexey Kardashevskiy wrote:
>>>>> On 03/23/2016 01:53 PM, David Gibson wrote:
>>>>>> On Wed, Mar 23, 2016 at 01:12:59PM +1100, Alexey Kardashevskiy wrote:
>>>>>>> On 03/23/2016 12:08 PM, David Gibson wrote:
>>>>>>>> On Tue, Mar 22, 2016 at 04:54:07PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>>> On 03/22/2016 04:14 PM, David Gibson wrote:
>>>>>>>>>> On Mon, Mar 21, 2016 at 06:47:05PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>>>>> New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window
>>>>>>>>>>> management.
> >>>>>>>>>>> This adds the ability for VFIO common code to dynamically
> >>>>>>>>>>> allocate/remove DMA windows in the host kernel when a new VFIO
> >>>>>>>>>>> container is added/removed.
>>>>>>>>>>>
>>>>>>>>>>> This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to
>>>>>>>>>>> vfio_listener_region_add
>>>>>>>>>>> and adds just created IOMMU into the host IOMMU list; the opposite
>>>>>>>>>>> action is taken in vfio_listener_region_del.
>>>>>>>>>>>
> >>>>>>>>>>> When creating a new window, this uses a heuristic to decide on
> >>>>>>>>>>> the number of TCE table levels.
>>>>>>>>>>>
>>>>>>>>>>> This should cause no guest visible change in behavior.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>>> ---
>>>>>>>>>>> Changes:
>>>>>>>>>>> v14:
>>>>>>>>>>> * new to the series
>>>>>>>>>>>
>>>>>>>>>>> ---
>>>>>>>>>>> TODO:
>>>>>>>>>>> * export levels to PHB
>>>>>>>>>>> ---
>>>>>>>>>>>   hw/vfio/common.c | 108
>>>>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>>>>>>>>>   trace-events     |   2 ++
>>>>>>>>>>>   2 files changed, 105 insertions(+), 5 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>>>>>>>>> index 4e873b7..421d6eb 100644
>>>>>>>>>>> --- a/hw/vfio/common.c
>>>>>>>>>>> +++ b/hw/vfio/common.c
>>>>>>>>>>> @@ -279,6 +279,14 @@ static int vfio_host_iommu_add(VFIOContainer
>>>>>>>>>>> *container,
>>>>>>>>>>>       return 0;
>>>>>>>>>>>   }
>>>>>>>>>>>
>>>>>>>>>>> +static void vfio_host_iommu_del(VFIOContainer *container, hwaddr
>>>>>>>>>>> min_iova)
>>>>>>>>>>> +{
>>>>>>>>>>> +    VFIOHostIOMMU *hiommu = vfio_host_iommu_lookup(container,
>>>>>>>>>>> min_iova, 0x1000);
>>>>>>>>>>
>>>>>>>>>> The hard-coded 0x1000 looks dubious..
>>>>>>>>>
>>>>>>>>> Well, that's the minimal page size...
>>>>>>>>
>>>>>>>> Really?  Some BookE CPUs support 1KiB page size..
>>>>>>>
>>>>>>> Hm. For IOMMU? Ok. s/0x1000/1/ should do then :)
>>>>>>
>>>>>> Uh.. actually I don't think those CPUs generally had an IOMMU.  But if
>>>>>> it's been done for CPU MMU I wouldn't count on it not being done for
>>>>>> IOMMU.
>>>>>>
>>>>>> 1 is a safer choice.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>>> +    g_assert(hiommu);
>>>>>>>>>>> +    QLIST_REMOVE(hiommu, hiommu_next);
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>>   static bool vfio_listener_skipped_section(MemoryRegionSection
>>>>>>>>>>> *section)
>>>>>>>>>>>   {
>>>>>>>>>>>       return (!memory_region_is_ram(section->mr) &&
>>>>>>>>>>> @@ -392,6 +400,61 @@ static void
>>>>>>>>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>>>>>>>>       }
>>>>>>>>>>>       end = int128_get64(llend);
>>>>>>>>>>>
>>>>>>>>>>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>>>>>>>>>
>>>>>>>>>> I think this would be clearer split out into a helper function,
>>>>>>>>>> vfio_create_host_window() or something.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It is rather vfio_spapr_create_host_window() and we were avoiding
>>>>>>>>> xxx_spapr_xxx so far. I'd cut-n-paste the SPAPR PCI AS listener to a
>>>>>>>>> separate file but this usually triggers more discussion and never
>>>>>>>>> ends well.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> +        unsigned entries, pages;
>>>>>>>>>>> +        struct vfio_iommu_spapr_tce_create create = { .argsz =
>>>>>>>>>>> sizeof(create) };
>>>>>>>>>>> +
>>>>>>>>>>> +        g_assert(section->mr->iommu_ops);
>>>>>>>>>>> +        g_assert(memory_region_is_iommu(section->mr));
>>>>>>>>>>
>>>>>>>>>> I don't think you need these asserts.  AFAICT the same logic should
>>>>>>>>>> work if a RAM MR was added directly to PCI address space - this would
>>>>>>>>>> create the new host window, then the existing code for adding a RAM MR
>>>>>>>>>> would map that block of RAM statically into the new window.
>>>>>>>>>
>>>>>>>>> In what configuration/machine can we do that on SPAPR?
>>>>>>>>
>>>>>>>> spapr guests won't ever do that.  But you can run an x86 guest on a
>>>>>>>> powernv host and this situation could come up.
>>>>>>>
>>>>>>>
>>>>>>> I am pretty sure VFIO won't work in this case anyway.
>>>>>>
>>>>>> I'm not.  There's no fundamental reason VFIO shouldn't work with TCG.
>>>>>
>>>>> This is not about TCG (pseries TCG guest works with VFIO on powernv host),
>>>>> this is about things like VFIO_IOMMU_GET_INFO vs.
>>>>> VFIO_IOMMU_SPAPR_TCE_GET_INFO ioctls but yes, fundamentally, it can work.
>>>>>
>>>>> Should I add such support in this patchset?
>>>>
>>>> Unless adding the generality is really complex, and so far I haven't
>>>> seen a reason for it to be.
>>>
>>> Seriously? :(
>>
>>
>> So, I tried.
>>
>> With the q35 machine (pc-i440fx-2.6 has an even worse memory tree), there are
>> several RAM blocks - 0..0xc0000, then pc.rom, then RAM again till 2GB, then
>> a gap (PCI MMIO?), then PC BIOS at 0xfffc0000 (which is RAM), then after 4GB
>> the rest of the RAM:
>>
>> memory-region: system
>>    0000000000000000-ffffffffffffffff (prio 0, RW): system
>>      0000000000000000-000000007fffffff (prio 0, RW): alias ram-below-4g
>> @pc.ram 0000000000000000-000000007fffffff
>>      0000000000000000-ffffffffffffffff (prio -1, RW): pci
>>        00000000000c0000-00000000000dffff (prio 1, RW): pc.rom
>>        00000000000e0000-00000000000fffff (prio 1, R-): alias isa-bios
>> @pc.bios 0000000000020000-000000000003ffff
>>        00000000febe0000-00000000febeffff (prio 1, RW): 0003:09:00.0 BAR 0
>>          00000000febe0000-00000000febeffff (prio 0, RW): 0003:09:00.0 BAR 0
>> mmaps[0]
>>        00000000febf0000-00000000febf1fff (prio 1, RW): 0003:09:00.0 BAR 2
>>          00000000febf0000-00000000febf007f (prio 0, RW): msix-table
>>          00000000febf1000-00000000febf1007 (prio 0, RW): msix-pba [disabled]
>>        00000000febf2000-00000000febf2fff (prio 1, RW): ahci
>>        00000000fffc0000-00000000ffffffff (prio 0, R-): pc.bios
>> [...]
>>      0000000100000000-000000027fffffff (prio 0, RW): alias ram-above-4g
>> @pc.ram 0000000080000000-00000001ffffffff
>>
>>
>> Ok, let's say we filter out "pc.bios", "pc.rom", BARs (aka "skip dump"
>> regions) and we end up having RAM below 2GB and above 4GB, 2 windows.
>> Problem is a window needs to be aligned to power of two.
>
> Window size?  That should be ok - you can just round up the requested
> size to a power of two.  We don't have to populate the whole host window.
>
>> Ok, I change my test from -m8G to -m10G to have 2 aligned regions (2GB and
>> 8GB); next problem - the second window needs to start at 1<<59, not 4GB or
>> any other random offset.
>
> Yeah, that's harder.  With the IODA is it always that the 32-bit
> window is at 0 and the 64-bit window is at 1<<59, or can you have a
> 64-bit window at 0?

I can have a single window at 0. This is why my (hacked) version works at all.

IODA2 allows having 2 windows, each window can be backed with TCE table 
with variable pagesize or be a bypass, the window capabilities are exactly 
the same.
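To make the placement constraint concrete, here is a hypothetical check of
the window start addresses discussed above (the constant and names are
illustrative, not the actual QEMU or kernel API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* In the IODA2 layout discussed here, each PE has two DMA windows with
 * identical capabilities; in practice one starts at 0 (the 32-bit window)
 * and the other at 1ULL << 59 (the 64-bit window). */
#define DMA64_WINDOW_START (1ULL << 59)

static bool dma_window_start_valid(uint64_t start)
{
    return start == 0 || start == DMA64_WINDOW_START;
}
```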


> If it's the second, then this is theoretically possible, but the host
> kernel, on seeing the second window request, would need to realise
> that instead of adding a new host window it can extend the first
> host window so that it covers both the guest windows.
> That does sound reasonably tricky and I don't think we should try to
> implement it any time soon.  BUT - we should design our *interfaces*
> so that it's reasonable to add that in future.
>
>> Ok, I change my test to start with -m1G to have a single window.
>> Now it fails because here is what happens:
>> region_add: pc.ram: 0 40000000
>> region_del: pc.ram: 0 40000000
>> region_add: pc.ram: 0 c0000
>> The second "add" is not a power of two -> fail. And I cannot avoid this -
>> "pc.rom" is still there, it is a separate region so RAM gets split into
>> smaller chunks. I do not know how to fix this properly.


Still, how to fix this? Do I need to fix this now? Practically, when do I 
create this single huge window?

>>
>>
>> So, in order to make it (x86/tcg on powernv host) work anyhow, I had to
>> create a single huge window (as it is a single window - it starts from @0)
>> unconditionally in vfio_connect_container(), not in
>> vfio_listener_region_add(); and I added filtering (skip "pc.bios", "pc.rom",
>> BARs - these things start above 1GB) and then I could boot an x86_64 guest
>> and even pass a Mellanox ConnectX-3 through; it would bring 2 interfaces up and
>> dhclient assigned IPs to them (which is quite amusing that it even works).
>>
>>
>> So, I think I will replace assert() with:
>>
>> unsigned pagesize = qemu_real_host_page_size;
>> if (section->mr->iommu_ops) {
>>      pagesize = section->mr->iommu_ops->get_page_sizes(section->mr);
>> }
>>
>> but there is no practical use for this anyway.
>
> So, I think what you need here is to compute the effective IOMMU page
> size (see other email for details).  That will be getrampagesize() for
> RAM regions, and the minimum of that and the guest IOMMU page size for
> IOMMU regions.






-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v14 18/18] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-03-29  5:22           ` David Gibson
@ 2016-03-29  6:23             ` Alexey Kardashevskiy
  0 siblings, 0 replies; 64+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-29  6:23 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On 03/29/2016 04:22 PM, David Gibson wrote:
> On Thu, Mar 24, 2016 at 01:32:48PM +1100, Alexey Kardashevskiy wrote:
>> On 03/23/2016 05:11 PM, David Gibson wrote:
>>> On Wed, Mar 23, 2016 at 02:28:01PM +1100, Alexey Kardashevskiy wrote:
>>>> On 03/23/2016 01:13 PM, David Gibson wrote:
>>>>> On Mon, Mar 21, 2016 at 06:47:06PM +1100, Alexey Kardashevskiy wrote:
>>>>>> This adds support for Dynamic DMA Windows (DDW) option defined by
>>>>>> the SPAPR specification which allows having additional DMA window(s).
>>>>>>
>>>>>> This implements DDW for emulated and VFIO devices.
>>>>>> This reserves RTAS token numbers for DDW calls.
>>>>>>
>>>>>> This changes the TCE table migration descriptor to support dynamic
>>>>>> tables as from now on, PHB will create as many stub TCE table objects
>>>>>> as PHB can possibly support but not all of them might be initialized at
>>>>>> the time of migration because DDW might or might not be requested by
>>>>>> the guest.
>>>>>>
>>>>>> The "ddw" property is enabled by default on a PHB but for compatibility
>>>>>> the pseries-2.5 machine and older disable it.
>>>>>>
>>>>>> This implements DDW for VFIO. The host kernel support is required.
>>>>>> This adds a "levels" property to PHB to control the number of levels
>>>>>> in the actual TCE table allocated by the host kernel, 0 is the default
>>>>>> value to tell QEMU to calculate the correct value. Current hardware
>>>>>> supports up to 5 levels.
>>>>>>
>>>>>> The existing Linux guests try creating one additional huge DMA window
>>>>>> with 64K or 16MB pages and map the entire guest RAM into it. If this succeeds,
>>>>>> the guest switches to dma_direct_ops and never calls TCE hypercalls
>>>>>> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
>>>>>> and not waste time on map/unmap later. This adds a "dma64_win_addr"
>>>>>> property which is a bus address for the 64bit window and by default
>>>>>> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
>>>>>> uses and this allows having emulated and VFIO devices on the same bus.
>>>>>>
>>>>>> This adds 4 RTAS handlers:
>>>>>> * ibm,query-pe-dma-window
>>>>>> * ibm,create-pe-dma-window
>>>>>> * ibm,remove-pe-dma-window
>>>>>> * ibm,reset-pe-dma-window
>>>>>> These are registered from type_init() callback.
>>>>>>
>>>>>> These RTAS handlers are implemented in a separate file to avoid polluting
>>>>>> spapr_iommu.c with PCI.
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>> ---
>>>>>>   hw/ppc/Makefile.objs        |   1 +
>>>>>>   hw/ppc/spapr.c              |   7 +-
>>>>>>   hw/ppc/spapr_pci.c          |  73 ++++++++---
>>>>>>   hw/ppc/spapr_rtas_ddw.c     | 300 ++++++++++++++++++++++++++++++++++++++++++++
>>>>>>   hw/vfio/common.c            |   5 -
>>>>>>   include/hw/pci-host/spapr.h |  13 ++
>>>>>>   include/hw/ppc/spapr.h      |  16 ++-
>>>>>>   trace-events                |   4 +
>>>>>>   8 files changed, 395 insertions(+), 24 deletions(-)
>>>>>>   create mode 100644 hw/ppc/spapr_rtas_ddw.c
>>>>>>
>>>>>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>>>>>> index c1ffc77..986b36f 100644
>>>>>> --- a/hw/ppc/Makefile.objs
>>>>>> +++ b/hw/ppc/Makefile.objs
>>>>>> @@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
>>>>>>   ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>>>>>>   obj-y += spapr_pci_vfio.o
>>>>>>   endif
>>>>>> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>>>>>>   # PowerPC 4xx boards
>>>>>>   obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>>>>>>   obj-y += ppc4xx_pci.o
>>>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>>>> index d0bb423..ef4c637 100644
>>>>>> --- a/hw/ppc/spapr.c
>>>>>> +++ b/hw/ppc/spapr.c
>>>>>> @@ -2362,7 +2362,12 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
>>>>>>    * pseries-2.5
>>>>>>    */
>>>>>>   #define SPAPR_COMPAT_2_5 \
>>>>>> -        HW_COMPAT_2_5
>>>>>> +        HW_COMPAT_2_5 \
>>>>>> +        {\
>>>>>> +            .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
>>>>>> +            .property = "ddw",\
>>>>>> +            .value    = stringify(off),\
>>>>>> +        },
>>>>>>
>>>>>>   static void spapr_machine_2_5_instance_options(MachineState *machine)
>>>>>>   {
>>>>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>>>>> index af99a36..3bb294a 100644
>>>>>> --- a/hw/ppc/spapr_pci.c
>>>>>> +++ b/hw/ppc/spapr_pci.c
>>>>>> @@ -803,12 +803,12 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
>>>>>>       return buf;
>>>>>>   }
>>>>>>
>>>>>> -static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>>>>>> -                                       uint32_t liobn,
>>>>>> -                                       uint32_t page_shift,
>>>>>> -                                       uint64_t window_addr,
>>>>>> -                                       uint64_t window_size,
>>>>>> -                                       Error **errp)
>>>>>> +void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>>>>>> +                                 uint32_t liobn,
>>>>>> +                                 uint32_t page_shift,
>>>>>> +                                 uint64_t window_addr,
>>>>>> +                                 uint64_t window_size,
>>>>>> +                                 Error **errp)
>>>>>>   {
>>>>>>       sPAPRTCETable *tcet;
>>>>>>       uint32_t nb_table = window_size >> page_shift;
>>>>>> @@ -825,10 +825,16 @@ static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>>>>>>           return;
>>>>>>       }
>>>>>>
>>>>>> +    if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
>>>>>> +        error_setg(errp,
>>>>>> +                   "Attempt to use second window when DDW is disabled on PHB");
>>>>>> +        return;
>>>>>> +    }
>>>>>
>>>>> This should never happen unless something is wrong with the tests in
>>>>> the RTAS functions, yes?  In which case it should probably be an
>>>>> assert().
>>>>
>>>> This should not. But this is called from the RTAS caller so I'd really like
>>>> to have a message rather than assert() if that condition happens, here or in
>>>> rtas_ibm_create_pe_dma_window().
>>>
>>> It should only be called from RTAS if ddw is enabled though, yes?
>>
>>
>> From RTAS and from the PHB reset handler. Well, I will get rid of
>> spapr_phb_dma_window_enable/spapr_phb_dma_window_disable, they are quite
>> useless when I look at them now.
>
> Ok.
>
>>>>>>       spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table);
>>>>>>   }
>>>>>>
>>>>>> -static int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
>>>>>> +int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
>>>>>>   {
>>>>>>       sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
>>>>>>
>>>>>> @@ -1492,14 +1498,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>>>>>       }
>>>>>>
>>>>>>       /* DMA setup */
>>>>>> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
>>>>>> -    if (!tcet) {
>>>>>> -        error_report("No default TCE table for %s", sphb->dtbusname);
>>>>>> -        return;
>>>>>> -    }
>>>>>>
>>>>>> -    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
>>>>>> -                                        spapr_tce_get_iommu(tcet), 0);
>>>>>> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>>>>>> +        tcet = spapr_tce_new_table(DEVICE(sphb),
>>>>>> +                                   SPAPR_PCI_LIOBN(sphb->index, i));
>>>>>> +        if (!tcet) {
>>>>>> +            error_setg(errp, "Creating window#%d failed for %s",
>>>>>> +                       i, sphb->dtbusname);
>>>>>> +            return;
>>>>>> +        }
>>>>>> +        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
>>>>>> +                                            spapr_tce_get_iommu(tcet), 0);
>>>>>> +    }
>>>>>>
>>>>>>       sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>>>>>>   }
>>>>>> @@ -1517,11 +1527,16 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
>>>>>>
>>>>>>   void spapr_phb_dma_reset(sPAPRPHBState *sphb)
>>>>>>   {
>>>>>> -    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
>>>>>>       Error *local_err = NULL;
>>>>>> +    int i;
>>>>>>
>>>>>> -    if (tcet && tcet->enabled) {
>>>>>> -        spapr_phb_dma_window_disable(sphb, sphb->dma_liobn);
>>>>>> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>>>>>> +        uint32_t liobn = SPAPR_PCI_LIOBN(sphb->index, i);
>>>>>> +        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
>>>>>> +
>>>>>> +        if (tcet && tcet->enabled) {
>>>>>> +            spapr_phb_dma_window_disable(sphb, liobn);
>>>>>> +        }
>>>>>>       }
>>>>>>
>>>>>>       /* Register default 32bit DMA window */
>>>>>> @@ -1562,6 +1577,13 @@ static Property spapr_phb_properties[] = {
>>>>>>       /* Default DMA window is 0..1GB */
>>>>>>       DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
>>>>>>       DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
>>>>>> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
>>>>>> +                       0x800000000000000ULL),
>>>>>> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
>>>>>> +    DEFINE_PROP_UINT32("windows", sPAPRPHBState, windows_supported,
>>>>>> +                       SPAPR_PCI_DMA_MAX_WINDOWS),
>>>>>
>>>>> What will happen if the user tries to set 'windows' larger than
>>>>> SPAPR_PCI_DMA_MAX_WINDOWS?
>>>>
>>>>
>>>> Oh. I need to replace SPAPR_PCI_DMA_MAX_WINDOWS with windows_supported
>>>> everywhere, missed that. Besides that, there will be support for more
>>>> windows, that's it. The host VFIO IOMMU driver will fail creating more
>>>> windows but this is expected. For emulated windows, there will be more
>>>> windows with no other consequences.
>>>
>>> Hmm.. is there actually a reason to have the windows property?  Would
>>> you be better off just using the compile time constant for now.
>>
>>
>> I am afraid it is going to be 2 DMA windows forever as the upcoming DMA
>> TLB-ish facility does not use windows at all :)
>
> That sounds like a reason not to have the property and leave it as a
> compile time constant.
>
>>>>>> +    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
>>>>>> +                       (1ULL << 12) | (1ULL << 16) | (1ULL << 24)),
>>>>>>       DEFINE_PROP_END_OF_LIST(),
>>>>>>   };
>>>>>>
>>>>>> @@ -1815,6 +1837,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>>>>>       uint32_t interrupt_map_mask[] = {
>>>>>>           cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
>>>>>>       uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
>>>>>> +    uint32_t ddw_applicable[] = {
>>>>>> +        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
>>>>>> +        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
>>>>>> +        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
>>>>>> +    };
>>>>>> +    uint32_t ddw_extensions[] = {
>>>>>> +        cpu_to_be32(1),
>>>>>> +        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
>>>>>> +    };
>>>>>>       sPAPRTCETable *tcet;
>>>>>>       PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
>>>>>>       sPAPRFDT s_fdt;
>>>>>> @@ -1839,6 +1870,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>>>>>       _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
>>>>>>       _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
>>>>>>
>>>>>> +    /* Dynamic DMA window */
>>>>>> +    if (phb->ddw_enabled) {
>>>>>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
>>>>>> +                         sizeof(ddw_applicable)));
>>>>>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
>>>>>> +                         &ddw_extensions, sizeof(ddw_extensions)));
>>>>>> +    }
>>>>>> +
>>>>>>       /* Build the interrupt-map, this must matches what is done
>>>>>>        * in pci_spapr_map_irq
>>>>>>        */
>>>>>> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
>>>>>> new file mode 100644
>>>>>> index 0000000..37f805f
>>>>>> --- /dev/null
>>>>>> +++ b/hw/ppc/spapr_rtas_ddw.c
>>>>>> @@ -0,0 +1,300 @@
>>>>>> +/*
>>>>>> + * QEMU sPAPR Dynamic DMA windows support
>>>>>> + *
>>>>>> + * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
>>>>>> + *
>>>>>> + *  This program is free software; you can redistribute it and/or modify
>>>>>> + *  it under the terms of the GNU General Public License as published by
>>>>>> + *  the Free Software Foundation; either version 2 of the License,
>>>>>> + *  or (at your option) any later version.
>>>>>> + *
>>>>>> + *  This program is distributed in the hope that it will be useful,
>>>>>> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>>>> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>>>>> + *  GNU General Public License for more details.
>>>>>> + *
>>>>>> + *  You should have received a copy of the GNU General Public License
>>>>>> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
>>>>>> + */
>>>>>> +
>>>>>> +#include "qemu/osdep.h"
>>>>>> +#include "qemu/error-report.h"
>>>>>> +#include "hw/ppc/spapr.h"
>>>>>> +#include "hw/pci-host/spapr.h"
>>>>>> +#include "trace.h"
>>>>>> +
>>>>>> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
>>>>>> +{
>>>>>> +    sPAPRTCETable *tcet;
>>>>>> +
>>>>>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>>>>>> +    if (tcet && tcet->enabled) {
>>>>>> +        ++*(unsigned *)opaque;
>>>>>> +    }
>>>>>> +    return 0;
>>>>>> +}
>>>>>> +
>>>>>> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
>>>>>> +{
>>>>>> +    unsigned ret = 0;
>>>>>> +
>>>>>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
>>>>>> +
>>>>>> +    return ret;
>>>>>> +}
>>>>>> +
>>>>>> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
>>>>>> +{
>>>>>> +    sPAPRTCETable *tcet;
>>>>>> +
>>>>>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>>>>>> +    if (tcet && !tcet->enabled) {
>>>>>> +        *(uint32_t *)opaque = tcet->liobn;
>>>>>> +        return 1;
>>>>>> +    }
>>>>>> +    return 0;
>>>>>> +}
>>>>>> +
>>>>>> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
>>>>>> +{
>>>>>> +    uint32_t liobn = 0;
>>>>>> +
>>>>>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
>>>>>> +
>>>>>> +    return liobn;
>>>>>> +}
>>>>>> +
>>>>>> +static uint32_t spapr_query_mask(struct ppc_one_seg_page_size *sps,
>>>>>> +                                 uint64_t page_mask)
>>>>>> +{
>>>>>> +    int i, j;
>>>>>> +    uint32_t mask = 0;
>>>>>> +    const struct { int shift; uint32_t mask; } masks[] = {
>>>>>> +        { 12, RTAS_DDW_PGSIZE_4K },
>>>>>> +        { 16, RTAS_DDW_PGSIZE_64K },
>>>>>> +        { 24, RTAS_DDW_PGSIZE_16M },
>>>>>> +        { 25, RTAS_DDW_PGSIZE_32M },
>>>>>> +        { 26, RTAS_DDW_PGSIZE_64M },
>>>>>> +        { 27, RTAS_DDW_PGSIZE_128M },
>>>>>> +        { 28, RTAS_DDW_PGSIZE_256M },
>>>>>> +        { 34, RTAS_DDW_PGSIZE_16G },
>>>>>> +    };
>>>>>> +
>>>>>> +    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
>>>>>> +        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
>>>>>> +            if ((sps[i].page_shift == masks[j].shift) &&
>>>>>> +                    (page_mask & (1ULL << masks[j].shift))) {
>>>>>> +                mask |= masks[j].mask;
>>>>>> +            }
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +    return mask;
>>>>>> +}
>>>>>> +
>>>>>> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
>>>>>> +                                         sPAPRMachineState *spapr,
>>>>>> +                                         uint32_t token, uint32_t nargs,
>>>>>> +                                         target_ulong args,
>>>>>> +                                         uint32_t nret, target_ulong rets)
>>>>>> +{
>>>>>> +    CPUPPCState *env = &cpu->env;
>>>>>> +    sPAPRPHBState *sphb;
>>>>>> +    uint64_t buid, max_window_size;
>>>>>> +    uint32_t avail, addr, pgmask = 0;
>>>>>> +
>>>>>> +    if ((nargs != 3) || (nret != 5)) {
>>>>>> +        goto param_error_exit;
>>>>>> +    }
>>>>>> +
>>>>>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>>>>>> +    addr = rtas_ld(args, 0);
>>>>>> +    sphb = spapr_pci_find_phb(spapr, buid);
>>>>>> +    if (!sphb || !sphb->ddw_enabled) {
>>>>>> +        goto param_error_exit;
>>>>>> +    }
>>>>>> +
>>>>>> +    /* Work out supported page masks */
>>>>>> +    pgmask = spapr_query_mask(env->sps.sps, sphb->page_size_mask);
>>>>>
>>>>> There are a few potential problems here.  First you're just
>>>>> arbitrarily picking the first entry in the sps array to filter
>>>>
>>>> Why first?  spapr_query_mask() has a loop 0..PPC_PAGE_SIZES_MAX_SZ.
>>>
>>> env->sps is a nested array, 0..PPC_PAGE_SIZES_MAX_SZ-1 for base page
>>> sizes, then again for actual page sizes.  You're only examining the
>>> first "row" of that table.  It kinda works because the 4k base page
>>> size is first, which lists all the actual page size options.
>>
>> Ah. Right. So I need to walk through all of them, ok.
>
> Well.. that would be closer to right, but even then, I don't think
> that's really right.  What you want to advertise here is all guest
> capabilities that are possible on the host.
>
> So, the first question is what's the actual size of the chunks that
> the host will need to map.  Let's call that the "effective IOMMU page
> size".  That's going to be whichever is smaller of the guest IOMMU
> page size, and the (host) RAM page size.
>
> e.g. For a guest using 4k IOMMU mappings on a host with 64k page size,
> the effective page size is 4kiB.
>
> For a guest using 16MiB IOMMU mappings on a host with 64kiB page size,
> the effective page size is 64kiB (because a 16MiB guest "page" could
> be broken up into different real pages on the host).
>
> The next question is what can you actually map in the host IOMMU.  It
> should be able to handle any effective page size that's equal to, or
> smaller than the smallest page size supported on the host.  That might
> involve the host inserting several host TCEs for a single effective
> page, but the VFIO DMA_MAP interface should already cover that because
> it takes a length parameter.
>
>>>>> against, which doesn't seem right (except by accident).  It's a little
>>>>> bit odd filtering against guest page sizes at all, although I get what
>>>>> you're really trying to do is filter against allowed host page sizes.
>>>>>
>>>>> The other problem is that silently filtering capabilities based on the
>>>>> host can be a pain for migration - I've made the mistake and had it
>>>>> bite me in the past.  I think it would be safer to just check the
>>>>> pagesizes requested in the property against what's possible and
>>>>> outright fail if they don't match.  For convenience you could also set
>>>>> according to host capabilities if the user doesn't specify anything,
>>>>> but that would require explicit handling of the "default" case.
>>
>>
>> What are the host capabilities here?
>>
>> There is a page mask from the host IOMMU/PE which is 4K|64K|16M and many
>> other sizes; this is always supported by IODA2.
>> And there is PAGE_SIZE and huge pages (but only with -mempath) - so, 64K and
>> 16M (with -mempath).
>>
>> And there is a "ddw-query" RTAS call which tells the guest if it can use 16M
>> or not. How do you suggest I advertise 16M to the guest? If I always
>> advertise 16M and there is no -mempath, the guest won't try smaller page
>> size.
>
> So, here's what I think you need to do:
>
>    1. Start with the full list of IOMMU page sizes the virtual (guest)
>       hardware supports (so, 4KiB, 64KiB & 16MiB, possibly modified by
>       properties)
>    2. For each entry in the list work out what the effective page size
>       will be, by clamping to the RAM page size.
>    3. Test if each effective page size is possible on the host (i.e. is
>       <= at least one of the pagesizes advertised by the host IOMMU
>       info ioctl).
>
> For a first cut it's probably reasonable to only check for == (not <=)
> the supported host pagesizes.  Ideally the host (inside the kernel)
> would automatically create duplicate host TCEs if effective page size
> < host IOMMU page size.  Obviously it would still need to use the
> supplied guest page size for the guest view of the table.
>
>
>> So - if the user wants 16M IOMMU pages, he has to use -mempath and in
>> addition to that explicitly say -global spapr-pci-host-bridge.pgsz=16M, and
>> by default enable only 4K and 64K (or just 4K?)? I am fine with this, it
>> just means more work for libvirt folks.
>
> I think that's the safest option to start with.  Require that the user
> explicitly list what pagesizes the guest IOMMU will support, and just
> fail if the host can't do that.
>
> We could potentially auto-filter (as we already do for CPU/MMU
> pagesizes) once we've thought a bit more carefully about the possible
> migration implications.


For now I enable 4K and 64K for a guest and then rely on the kernel's 
ability to create what the guest requested.

In v15, I'll add 3 patches to auto-add 16MB to the allowed mask list; maybe 
it won't look too ugly.

-- 
Alexey

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU
  2016-03-29  5:44                       ` Alexey Kardashevskiy
@ 2016-03-29  6:44                         ` David Gibson
  0 siblings, 0 replies; 64+ messages in thread
From: David Gibson @ 2016-03-29  6:44 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Paul Mackerras, Alex Williamson, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 10847 bytes --]

On Tue, Mar 29, 2016 at 04:44:04PM +1100, Alexey Kardashevskiy wrote:
> On 03/29/2016 04:30 PM, David Gibson wrote:
> >On Thu, Mar 24, 2016 at 08:10:44PM +1100, Alexey Kardashevskiy wrote:
> >>On 03/24/2016 11:03 AM, Alexey Kardashevskiy wrote:
> >>>On 03/23/2016 05:03 PM, David Gibson wrote:
> >>>>On Wed, Mar 23, 2016 at 02:06:36PM +1100, Alexey Kardashevskiy wrote:
> >>>>>On 03/23/2016 01:53 PM, David Gibson wrote:
> >>>>>>On Wed, Mar 23, 2016 at 01:12:59PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>On 03/23/2016 12:08 PM, David Gibson wrote:
> >>>>>>>>On Tue, Mar 22, 2016 at 04:54:07PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>>>On 03/22/2016 04:14 PM, David Gibson wrote:
> >>>>>>>>>>On Mon, Mar 21, 2016 at 06:47:05PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>>>>>New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window
> >>>>>>>>>>>management.
> >>>>>>>>>>>This adds ability to VFIO common code to dynamically allocate/remove
> >>>>>>>>>>>DMA windows in the host kernel when new VFIO container is
> >>>>>>>>>>>added/removed.
> >>>>>>>>>>>
> >>>>>>>>>>>This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to
> >>>>>>>>>>>vfio_listener_region_add
> >>>>>>>>>>>and adds just created IOMMU into the host IOMMU list; the opposite
> >>>>>>>>>>>action is taken in vfio_listener_region_del.
> >>>>>>>>>>>
> >>>>>>>>>>>When creating a new window, this uses a heuristic to decide on
> >>>>>>>>>>>the number of TCE table levels.
> >>>>>>>>>>>
> >>>>>>>>>>>This should cause no guest visible change in behavior.
> >>>>>>>>>>>
> >>>>>>>>>>>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>>>>>>>>---
> >>>>>>>>>>>Changes:
> >>>>>>>>>>>v14:
> >>>>>>>>>>>* new to the series
> >>>>>>>>>>>
> >>>>>>>>>>>---
> >>>>>>>>>>>TODO:
> >>>>>>>>>>>* export levels to PHB
> >>>>>>>>>>>---
> >>>>>>>>>>>  hw/vfio/common.c | 108
> >>>>>>>>>>>++++++++++++++++++++++++++++++++++++++++++++++++++++---
> >>>>>>>>>>>  trace-events     |   2 ++
> >>>>>>>>>>>  2 files changed, 105 insertions(+), 5 deletions(-)
> >>>>>>>>>>>
> >>>>>>>>>>>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>>>>>>>>>>index 4e873b7..421d6eb 100644
> >>>>>>>>>>>--- a/hw/vfio/common.c
> >>>>>>>>>>>+++ b/hw/vfio/common.c
> >>>>>>>>>>>@@ -279,6 +279,14 @@ static int vfio_host_iommu_add(VFIOContainer
> >>>>>>>>>>>*container,
> >>>>>>>>>>>      return 0;
> >>>>>>>>>>>  }
> >>>>>>>>>>>
> >>>>>>>>>>>+static void vfio_host_iommu_del(VFIOContainer *container, hwaddr
> >>>>>>>>>>>min_iova)
> >>>>>>>>>>>+{
> >>>>>>>>>>>+    VFIOHostIOMMU *hiommu = vfio_host_iommu_lookup(container,
> >>>>>>>>>>>min_iova, 0x1000);
> >>>>>>>>>>
> >>>>>>>>>>The hard-coded 0x1000 looks dubious..
> >>>>>>>>>
> >>>>>>>>>Well, that's the minimal page size...
> >>>>>>>>
> >>>>>>>>Really?  Some BookE CPUs support 1KiB page size..
> >>>>>>>
> >>>>>>>Hm. For IOMMU? Ok. s/0x1000/1/ should do then :)
> >>>>>>
> >>>>>>Uh.. actually I don't think those CPUs generally had an IOMMU.  But if
> >>>>>>it's been done for CPU MMU I wouldn't count on it not being done for
> >>>>>>IOMMU.
> >>>>>>
> >>>>>>1 is a safer choice.
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>>>>>+    g_assert(hiommu);
> >>>>>>>>>>>+    QLIST_REMOVE(hiommu, hiommu_next);
> >>>>>>>>>>>+}
> >>>>>>>>>>>+
> >>>>>>>>>>>  static bool vfio_listener_skipped_section(MemoryRegionSection
> >>>>>>>>>>>*section)
> >>>>>>>>>>>  {
> >>>>>>>>>>>      return (!memory_region_is_ram(section->mr) &&
> >>>>>>>>>>>@@ -392,6 +400,61 @@ static void
> >>>>>>>>>>>vfio_listener_region_add(MemoryListener *listener,
> >>>>>>>>>>>      }
> >>>>>>>>>>>      end = int128_get64(llend);
> >>>>>>>>>>>
> >>>>>>>>>>>+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> >>>>>>>>>>
> >>>>>>>>>>I think this would be clearer split out into a helper function,
> >>>>>>>>>>vfio_create_host_window() or something.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>It is rather vfio_spapr_create_host_window() and we were avoiding
> >>>>>>>>>xxx_spapr_xxx so far. I'd cut-n-paste the SPAPR PCI AS listener to a
> >>>>>>>>>separate file but this usually triggers more discussion and never
> >>>>>>>>>ends well.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>>+        unsigned entries, pages;
> >>>>>>>>>>>+        struct vfio_iommu_spapr_tce_create create = { .argsz =
> >>>>>>>>>>>sizeof(create) };
> >>>>>>>>>>>+
> >>>>>>>>>>>+        g_assert(section->mr->iommu_ops);
> >>>>>>>>>>>+        g_assert(memory_region_is_iommu(section->mr));
> >>>>>>>>>>
> >>>>>>>>>>I don't think you need these asserts.  AFAICT the same logic should
> >>>>>>>>>>work if a RAM MR was added directly to PCI address space - this would
> >>>>>>>>>>create the new host window, then the existing code for adding a RAM MR
> >>>>>>>>>>would map that block of RAM statically into the new window.
> >>>>>>>>>
> >>>>>>>>>In what configuration/machine can we do that on SPAPR?
> >>>>>>>>
> >>>>>>>>spapr guests won't ever do that.  But you can run an x86 guest on a
> >>>>>>>>powernv host and this situation could come up.
> >>>>>>>
> >>>>>>>
> >>>>>>>I am pretty sure VFIO won't work in this case anyway.
> >>>>>>
> >>>>>>I'm not.  There's no fundamental reason VFIO shouldn't work with TCG.
> >>>>>
> >>>>>This is not about TCG (pseries TCG guest works with VFIO on powernv host),
> >>>>>this is about things like VFIO_IOMMU_GET_INFO vs.
> >>>>>VFIO_IOMMU_SPAPR_TCE_GET_INFO ioctls but yes, fundamentally, it can work.
> >>>>>
> >>>>>Should I add such support in this patchset?
> >>>>
> >>>>Unless adding the generality is really complex, and so far I haven't
> >>>>seen a reason for it to be.
> >>>
> >>>Seriously? :(
> >>
> >>
> >>So, I tried.
> >>
> >>With q35 machine (pc-i440fx-2.6 have even worse memory tree), there are
> >>several RAM blocks - 0..0xc0000, then pc.rom, then RAM again till 2GB, then
> >>a gap (PCI MMIO?), then PC BIOS at 0xfffc0000 (which is RAM), then after 4GB
> >>the rest of the RAM:
> >>
> >>memory-region: system
> >>   0000000000000000-ffffffffffffffff (prio 0, RW): system
> >>     0000000000000000-000000007fffffff (prio 0, RW): alias ram-below-4g
> >>@pc.ram 0000000000000000-000000007fffffff
> >>     0000000000000000-ffffffffffffffff (prio -1, RW): pci
> >>       00000000000c0000-00000000000dffff (prio 1, RW): pc.rom
> >>       00000000000e0000-00000000000fffff (prio 1, R-): alias isa-bios
> >>@pc.bios 0000000000020000-000000000003ffff
> >>       00000000febe0000-00000000febeffff (prio 1, RW): 0003:09:00.0 BAR 0
> >>         00000000febe0000-00000000febeffff (prio 0, RW): 0003:09:00.0 BAR 0
> >>mmaps[0]
> >>       00000000febf0000-00000000febf1fff (prio 1, RW): 0003:09:00.0 BAR 2
> >>         00000000febf0000-00000000febf007f (prio 0, RW): msix-table
> >>         00000000febf1000-00000000febf1007 (prio 0, RW): msix-pba [disabled]
> >>       00000000febf2000-00000000febf2fff (prio 1, RW): ahci
> >>       00000000fffc0000-00000000ffffffff (prio 0, R-): pc.bios
> >>[...]
> >>     0000000100000000-000000027fffffff (prio 0, RW): alias ram-above-4g
> >>@pc.ram 0000000080000000-00000001ffffffff
> >>
> >>
> >>Ok, let's say we filter out "pc.bios", "pc.rom", BARs (aka "skip dump"
> >>regions) and we end up having RAM below 2GB and above 4GB, 2 windows.
> >>The problem is that a window needs to be aligned to a power of two.
> >
> >Window size?  That should be ok - you can just round up the requested
> >size to a power of two.  We don't have to populate the whole host window.
> >
> >>Ok, I change my test from -m8G to -m10G to have 2 aligned regions (2GB and
> >>8GB); next problem - the second window needs to start from 1<<59, not 4G or
> >>any other random offset.
> >
> >Yeah, that's harder.  With the IODA is it always that the 32-bit
> >window is at 0 and the 64-bit window is at 1<<59, or can you have a
> >64-bit window at 0?
> 
> I can have a single window at 0. This is why my (hacked) version works at all.
> 
> IODA2 allows having 2 windows, each window can be backed with TCE table with
> variable pagesize or be a bypass, the window capabilities are exactly the
> same.

Ok.

> >If it's the second, then this is theoretically possible, but the host
> >kernel, on seeing the second window request, would need to see that
> >it can, instead of adding a new host window, extend the first
> >host window so that it covers both the guest windows.
> >That does sound reasonably tricky and I don't think we should try to
> >implement it any time soon.  BUT - we should design our *interfaces*
> >so that it's reasonable to add that in future.
> >
> >>Ok, I change my test to start with -m1G to have a single window.
> >>Now it fails because here is what happens:
> >>region_add: pc.ram: 0 40000000
> >>region_del: pc.ram: 0 40000000
> >>region_add: pc.ram: 0 c0000
> >>The second "add" is not power of two -> fail. And I cannot avoid this -
> >>"pc.rom" is still there, it is a separate region so RAM gets split into
> >>smaller chunks. I do not know how to fix this properly.
> 
> 
> Still, how to fix this? Do I need to fix this now? Practically, when do I
> create this single huge window?

I don't think you want to do this now.  For now, I think qemu should
just request a window matching the MR, and if the kernel can't supply
it we fail.

In future we can look at extending the kernel so that we try harder to
satisfy VFIO requests for new windows, by merging and/or extending
their requests to cover multiple areas.

> >>So, in order to make it (x86/tcg on powernv host) work anyhow, I had to
> >>create a single huge window (as it is a single window - it starts from @0)
> >>unconditionally in vfio_connect_container(), not in
> >>vfio_listener_region_add(); and I added filtering (skip "pc.bios", "pc.rom",
> >>BARs - these things start above 1GB) and then I could boot a x86_64 guest
> >>and even pass Mellanox ConnectX3, it would bring 2 interfaces up and
> >>dhclient assigned IPs to them (which is quite amusing that it even works).
> >>
> >>
> >>So, I think I will replace assert() with:
> >>
> >>unsigned pagesize = qemu_real_host_page_size;
> >>if (section->mr->iommu_ops) {
> >>     pagesize = section->mr->iommu_ops->get_page_sizes(section->mr);
> >>}
> >>
> >>but there is no practical use for this anyway.
> >
> >So, I think what you need here is to compute the effective IOMMU page
> >size (see other email for details).  That will be getrampagesize() for
> >RAM regions, and the mininum of that and the guest IOMMU page size for
> >IOMMU regions.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v14 18/18] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-03-24  2:32         ` Alexey Kardashevskiy
  2016-03-29  5:22           ` David Gibson
@ 2016-03-31  3:19           ` David Gibson
  1 sibling, 0 replies; 64+ messages in thread
From: David Gibson @ 2016-03-31  3:19 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 23094 bytes --]

On Thu, Mar 24, 2016 at 01:32:48PM +1100, Alexey Kardashevskiy wrote:
> On 03/23/2016 05:11 PM, David Gibson wrote:
> >On Wed, Mar 23, 2016 at 02:28:01PM +1100, Alexey Kardashevskiy wrote:
> >>On 03/23/2016 01:13 PM, David Gibson wrote:
> >>>On Mon, Mar 21, 2016 at 06:47:06PM +1100, Alexey Kardashevskiy wrote:
> >>>>This adds support for Dynamic DMA Windows (DDW) option defined by
> >>>>the SPAPR specification which allows to have additional DMA window(s)
> >>>>
> >>>>This implements DDW for emulated and VFIO devices.
> >>>>This reserves RTAS token numbers for DDW calls.
> >>>>
> >>>>This changes the TCE table migration descriptor to support dynamic
> >>>>tables: from now on, the PHB will create as many stub TCE table objects
> >>>>as it can possibly support, but not all of them may be initialized at
> >>>>the time of migration because DDW may or may not have been requested by
> >>>>the guest.
> >>>>
> >>>>The "ddw" property is enabled by default on a PHB but for compatibility
> >>>>the pseries-2.5 machine and older disable it.
> >>>>
> >>>>This implements DDW for VFIO. The host kernel support is required.
> >>>>This adds a "levels" property to PHB to control the number of levels
> >>>>in the actual TCE table allocated by the host kernel, 0 is the default
> >>>>value to tell QEMU to calculate the correct value. Current hardware
> >>>>supports up to 5 levels.
> >>>>
> >>>>The existing Linux guests try to create one additional huge DMA window
> >>>>with 64K or 16MB pages and map the entire guest RAM into it. If that
> >>>>succeeds, the guest switches to dma_direct_ops and never calls TCE hypercalls
> >>>>(H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
> >>>>and not waste time on map/unmap later. This adds a "dma64_win_addr"
> >>>>property which is a bus address for the 64bit window and by default
> >>>>set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
> >>>>uses and this allows having emulated and VFIO devices on the same bus.
> >>>>
> >>>>This adds 4 RTAS handlers:
> >>>>* ibm,query-pe-dma-window
> >>>>* ibm,create-pe-dma-window
> >>>>* ibm,remove-pe-dma-window
> >>>>* ibm,reset-pe-dma-window
> >>>>These are registered from type_init() callback.
> >>>>
> >>>>These RTAS handlers are implemented in a separate file to avoid polluting
> >>>>spapr_iommu.c with PCI.
> >>>>
> >>>>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>---
> >>>>  hw/ppc/Makefile.objs        |   1 +
> >>>>  hw/ppc/spapr.c              |   7 +-
> >>>>  hw/ppc/spapr_pci.c          |  73 ++++++++---
> >>>>  hw/ppc/spapr_rtas_ddw.c     | 300 ++++++++++++++++++++++++++++++++++++++++++++
> >>>>  hw/vfio/common.c            |   5 -
> >>>>  include/hw/pci-host/spapr.h |  13 ++
> >>>>  include/hw/ppc/spapr.h      |  16 ++-
> >>>>  trace-events                |   4 +
> >>>>  8 files changed, 395 insertions(+), 24 deletions(-)
> >>>>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
> >>>>
> >>>>diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> >>>>index c1ffc77..986b36f 100644
> >>>>--- a/hw/ppc/Makefile.objs
> >>>>+++ b/hw/ppc/Makefile.objs
> >>>>@@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
> >>>>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
> >>>>  obj-y += spapr_pci_vfio.o
> >>>>  endif
> >>>>+obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
> >>>>  # PowerPC 4xx boards
> >>>>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
> >>>>  obj-y += ppc4xx_pci.o
> >>>>diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>>>index d0bb423..ef4c637 100644
> >>>>--- a/hw/ppc/spapr.c
> >>>>+++ b/hw/ppc/spapr.c
> >>>>@@ -2362,7 +2362,12 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
> >>>>   * pseries-2.5
> >>>>   */
> >>>>  #define SPAPR_COMPAT_2_5 \
> >>>>-        HW_COMPAT_2_5
> >>>>+        HW_COMPAT_2_5 \
> >>>>+        {\
> >>>>+            .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
> >>>>+            .property = "ddw",\
> >>>>+            .value    = stringify(off),\
> >>>>+        },
> >>>>
> >>>>  static void spapr_machine_2_5_instance_options(MachineState *machine)
> >>>>  {
> >>>>diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> >>>>index af99a36..3bb294a 100644
> >>>>--- a/hw/ppc/spapr_pci.c
> >>>>+++ b/hw/ppc/spapr_pci.c
> >>>>@@ -803,12 +803,12 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
> >>>>      return buf;
> >>>>  }
> >>>>
> >>>>-static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> >>>>-                                       uint32_t liobn,
> >>>>-                                       uint32_t page_shift,
> >>>>-                                       uint64_t window_addr,
> >>>>-                                       uint64_t window_size,
> >>>>-                                       Error **errp)
> >>>>+void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> >>>>+                                 uint32_t liobn,
> >>>>+                                 uint32_t page_shift,
> >>>>+                                 uint64_t window_addr,
> >>>>+                                 uint64_t window_size,
> >>>>+                                 Error **errp)
> >>>>  {
> >>>>      sPAPRTCETable *tcet;
> >>>>      uint32_t nb_table = window_size >> page_shift;
> >>>>@@ -825,10 +825,16 @@ static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> >>>>          return;
> >>>>      }
> >>>>
> >>>>+    if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
> >>>>+        error_setg(errp,
> >>>>+                   "Attempt to use second window when DDW is disabled on PHB");
> >>>>+        return;
> >>>>+    }
> >>>
> >>>This should never happen unless something is wrong with the tests in
> >>>the RTAS functions, yes?  In which case it should probably be an
> >>>assert().
> >>
> >>This should not. But this is called from the RTAS caller so I'd really like
> >>to have a message rather than assert() if that condition happens, here or in
> >>rtas_ibm_create_pe_dma_window().
> >
> >It should only be called from RTAS if ddw is enabled though, yes?
> 
> 
> From RTAS and from the PHB reset handler. Well. I will get rid of
> spapr_phb_dma_window_enable/spapr_phb_dma_window_disable, they are quite
> useless when I look at them now.

Ok.

> >>>>      spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table);
> >>>>  }
> >>>>
> >>>>-static int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
> >>>>+int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
> >>>>  {
> >>>>      sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
> >>>>
> >>>>@@ -1492,14 +1498,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >>>>      }
> >>>>
> >>>>      /* DMA setup */
> >>>>-    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
> >>>>-    if (!tcet) {
> >>>>-        error_report("No default TCE table for %s", sphb->dtbusname);
> >>>>-        return;
> >>>>-    }
> >>>>
> >>>>-    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
> >>>>-                                        spapr_tce_get_iommu(tcet), 0);
> >>>>+    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> >>>>+        tcet = spapr_tce_new_table(DEVICE(sphb),
> >>>>+                                   SPAPR_PCI_LIOBN(sphb->index, i));
> >>>>+        if (!tcet) {
> >>>>+            error_setg(errp, "Creating window#%d failed for %s",
> >>>>+                       i, sphb->dtbusname);
> >>>>+            return;
> >>>>+        }
> >>>>+        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
> >>>>+                                            spapr_tce_get_iommu(tcet), 0);
> >>>>+    }
> >>>>
> >>>>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
> >>>>  }
> >>>>@@ -1517,11 +1527,16 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
> >>>>
> >>>>  void spapr_phb_dma_reset(sPAPRPHBState *sphb)
> >>>>  {
> >>>>-    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
> >>>>      Error *local_err = NULL;
> >>>>+    int i;
> >>>>
> >>>>-    if (tcet && tcet->enabled) {
> >>>>-        spapr_phb_dma_window_disable(sphb, sphb->dma_liobn);
> >>>>+    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> >>>>+        uint32_t liobn = SPAPR_PCI_LIOBN(sphb->index, i);
> >>>>+        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
> >>>>+
> >>>>+        if (tcet && tcet->enabled) {
> >>>>+            spapr_phb_dma_window_disable(sphb, liobn);
> >>>>+        }
> >>>>      }
> >>>>
> >>>>      /* Register default 32bit DMA window */
> >>>>@@ -1562,6 +1577,13 @@ static Property spapr_phb_properties[] = {
> >>>>      /* Default DMA window is 0..1GB */
> >>>>      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
> >>>>      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
> >>>>+    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
> >>>>+                       0x800000000000000ULL),
> >>>>+    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
> >>>>+    DEFINE_PROP_UINT32("windows", sPAPRPHBState, windows_supported,
> >>>>+                       SPAPR_PCI_DMA_MAX_WINDOWS),
> >>>
> >>>What will happen if the user tries to set 'windows' larger than
> >>>SPAPR_PCI_DMA_MAX_WINDOWS?
> >>
> >>
> >>Oh. I need to replace SPAPR_PCI_DMA_MAX_WINDOWS with windows_supported
> >>everywhere, missed that. Besides that, there will be support for more
> >>windows, that's it. The host VFIO IOMMU driver will fail creating more
> >>windows but this is expected. For emulated windows, there will be more
> >>windows with no other consequences.
> >
> >Hmm.. is there actually a reason to have the windows property?  Would
> >you be better off just using the compile time constant for now.
> 
> 
> I am afraid it is going to be 2 DMA windows forever as the other DMA tlb-ish
> facility coming does not use windows at all :)

Ok, that sounds like a vote for removing the property.

> >>>>+    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
> >>>>+                       (1ULL << 12) | (1ULL << 16) | (1ULL << 24)),
> >>>>      DEFINE_PROP_END_OF_LIST(),
> >>>>  };
> >>>>
> >>>>@@ -1815,6 +1837,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
> >>>>      uint32_t interrupt_map_mask[] = {
> >>>>          cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
> >>>>      uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
> >>>>+    uint32_t ddw_applicable[] = {
> >>>>+        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
> >>>>+        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
> >>>>+        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
> >>>>+    };
> >>>>+    uint32_t ddw_extensions[] = {
> >>>>+        cpu_to_be32(1),
> >>>>+        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
> >>>>+    };
> >>>>      sPAPRTCETable *tcet;
> >>>>      PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
> >>>>      sPAPRFDT s_fdt;
> >>>>@@ -1839,6 +1870,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
> >>>>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
> >>>>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
> >>>>
> >>>>+    /* Dynamic DMA window */
> >>>>+    if (phb->ddw_enabled) {
> >>>>+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
> >>>>+                         sizeof(ddw_applicable)));
> >>>>+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
> >>>>+                         &ddw_extensions, sizeof(ddw_extensions)));
> >>>>+    }
> >>>>+
> >>>>      /* Build the interrupt-map, this must matches what is done
> >>>>       * in pci_spapr_map_irq
> >>>>       */
> >>>>diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
> >>>>new file mode 100644
> >>>>index 0000000..37f805f
> >>>>--- /dev/null
> >>>>+++ b/hw/ppc/spapr_rtas_ddw.c
> >>>>@@ -0,0 +1,300 @@
> >>>>+/*
> >>>>+ * QEMU sPAPR Dynamic DMA windows support
> >>>>+ *
> >>>>+ * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
> >>>>+ *
> >>>>+ *  This program is free software; you can redistribute it and/or modify
> >>>>+ *  it under the terms of the GNU General Public License as published by
> >>>>+ *  the Free Software Foundation; either version 2 of the License,
> >>>>+ *  or (at your option) any later version.
> >>>>+ *
> >>>>+ *  This program is distributed in the hope that it will be useful,
> >>>>+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> >>>>+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >>>>+ *  GNU General Public License for more details.
> >>>>+ *
> >>>>+ *  You should have received a copy of the GNU General Public License
> >>>>+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
> >>>>+ */
> >>>>+
> >>>>+#include "qemu/osdep.h"
> >>>>+#include "qemu/error-report.h"
> >>>>+#include "hw/ppc/spapr.h"
> >>>>+#include "hw/pci-host/spapr.h"
> >>>>+#include "trace.h"
> >>>>+
> >>>>+static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
> >>>>+{
> >>>>+    sPAPRTCETable *tcet;
> >>>>+
> >>>>+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> >>>>+    if (tcet && tcet->enabled) {
> >>>>+        ++*(unsigned *)opaque;
> >>>>+    }
> >>>>+    return 0;
> >>>>+}
> >>>>+
> >>>>+static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
> >>>>+{
> >>>>+    unsigned ret = 0;
> >>>>+
> >>>>+    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
> >>>>+
> >>>>+    return ret;
> >>>>+}
> >>>>+
> >>>>+static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
> >>>>+{
> >>>>+    sPAPRTCETable *tcet;
> >>>>+
> >>>>+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> >>>>+    if (tcet && !tcet->enabled) {
> >>>>+        *(uint32_t *)opaque = tcet->liobn;
> >>>>+        return 1;
> >>>>+    }
> >>>>+    return 0;
> >>>>+}
> >>>>+
> >>>>+static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
> >>>>+{
> >>>>+    uint32_t liobn = 0;
> >>>>+
> >>>>+    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
> >>>>+
> >>>>+    return liobn;
> >>>>+}
> >>>>+
> >>>>+static uint32_t spapr_query_mask(struct ppc_one_seg_page_size *sps,
> >>>>+                                 uint64_t page_mask)
> >>>>+{
> >>>>+    int i, j;
> >>>>+    uint32_t mask = 0;
> >>>>+    const struct { int shift; uint32_t mask; } masks[] = {
> >>>>+        { 12, RTAS_DDW_PGSIZE_4K },
> >>>>+        { 16, RTAS_DDW_PGSIZE_64K },
> >>>>+        { 24, RTAS_DDW_PGSIZE_16M },
> >>>>+        { 25, RTAS_DDW_PGSIZE_32M },
> >>>>+        { 26, RTAS_DDW_PGSIZE_64M },
> >>>>+        { 27, RTAS_DDW_PGSIZE_128M },
> >>>>+        { 28, RTAS_DDW_PGSIZE_256M },
> >>>>+        { 34, RTAS_DDW_PGSIZE_16G },
> >>>>+    };
> >>>>+
> >>>>+    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
> >>>>+        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
> >>>>+            if ((sps[i].page_shift == masks[j].shift) &&
> >>>>+                    (page_mask & (1ULL << masks[j].shift))) {
> >>>>+                mask |= masks[j].mask;
> >>>>+            }
> >>>>+        }
> >>>>+    }
> >>>>+
> >>>>+    return mask;
> >>>>+}
> >>>>+
> >>>>+static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
> >>>>+                                         sPAPRMachineState *spapr,
> >>>>+                                         uint32_t token, uint32_t nargs,
> >>>>+                                         target_ulong args,
> >>>>+                                         uint32_t nret, target_ulong rets)
> >>>>+{
> >>>>+    CPUPPCState *env = &cpu->env;
> >>>>+    sPAPRPHBState *sphb;
> >>>>+    uint64_t buid, max_window_size;
> >>>>+    uint32_t avail, addr, pgmask = 0;
> >>>>+
> >>>>+    if ((nargs != 3) || (nret != 5)) {
> >>>>+        goto param_error_exit;
> >>>>+    }
> >>>>+
> >>>>+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> >>>>+    addr = rtas_ld(args, 0);
> >>>>+    sphb = spapr_pci_find_phb(spapr, buid);
> >>>>+    if (!sphb || !sphb->ddw_enabled) {
> >>>>+        goto param_error_exit;
> >>>>+    }
> >>>>+
> >>>>+    /* Work out supported page masks */
> >>>>+    pgmask = spapr_query_mask(env->sps.sps, sphb->page_size_mask);
> >>>
> >>>There are a few potential problems here.  First you're just
> >>>arbitrarily picking the first entry in the sps array to filter
> >>
> >>Why first?  spapr_query_mask() has a loop 0..PPC_PAGE_SIZES_MAX_SZ.
> >
> >env->sps is a nested array, 0..PPC_PAGE_SIZES_MAX_SZ-1 for base page
sizes, then again for actual page sizes.  You're only examining the
> >first "row" of that table.  It kinda works because the 4k base page
> >size is first, which lists all the actual page size options.
> 
> Ah. Right. So I need to walk through all of them, ok.

Yes.. well.. except that the more I think about it, the less looking
at the target page sizes seems relevant to what IOMMU page sizes can
be supported.

The fundamental constraints are that an allowed page size needs to be:
     1. Supported by the guest IOMMU
and  2. >= than the smallest available host IOMMU page size

(1) always applies; (2) is only relevant when VFIO is in the
picture.

This is somewhat similar to the filtering we need to do for CPU page
sizes, but I don't think it's close enough that it really makes sense
to re-use the sps table.
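
A rough sketch of the filtering I have in mind (purely illustrative --
the helper name and parameters are made up, this is not actual QEMU
code):

```c
#include <stdint.h>

/*
 * Constraint 1: the page size must be supported by the guest IOMMU.
 * Constraint 2: it must be >= the smallest available host IOMMU page
 *               size (only relevant when VFIO is in the picture).
 *
 * Both masks are bitmaps of page sizes, bit N meaning a 2^N page.
 */
static uint64_t filter_iommu_pagesizes(uint64_t guest_iommu_mask,
                                       uint64_t host_iommu_mask,
                                       int vfio_in_use)
{
    uint64_t allowed = guest_iommu_mask;          /* constraint 1 */

    if (vfio_in_use && host_iommu_mask) {
        /* smallest host IOMMU page size == lowest set bit */
        uint64_t min_host = host_iommu_mask & -host_iommu_mask;

        /* drop every guest size smaller than that */
        allowed &= ~(min_host - 1);               /* constraint 2 */
    }
    return allowed;
}
```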

> >>>against, which doesn't seem right (except by accident).  It's a little
> >>>bit odd filtering against guest page sizes at all, although I get what
> >>>you're really trying to do is filter against allowed host page sizes.
> >>>
> >>>The other problem is that silently filtering capabilities based on the
> >>>host can be a pain for migration - I've made the mistake and had it
> >>>bite me in the past.  I think it would be safer to just check the
> >>>pagesizes requested in the property against what's possible and
> >>>outright fail if they don't match.  For convenience you could also set
> >>>according to host capabilities if the user doesn't specify anything,
> >>>but that would require explicit handling of the "default" case.
> 
> 
> What are the host capabilities here?
> 
> There is a page mask from the host IOMMU/PE which is 4K|64K|16M and many
> other sizes; this is always supported by IODA2.
> And there is PAGE_SIZE and huge pages (but only with -mempath) - so, 64K and
> 16M (with -mempath).
> 
> And there is a "ddw-query" RTAS call which tells the guest if it can use 16M
> or not. How do you suggest I advertise 16M to the guest? If I always
> advertise 16M and there is no -mempath, the guest won't try smaller page
> size.
> 
> So - if the user wants 16M IOMMU pages, he has to use -mempath and in
> addition to that explicitly say -global spapr-pci-host-bridge.pgsz=16M, and
> by default enable only 4K and 64K (or just 4K?)? I am fine with this, it
> just means more work for libvirt folks.

So, what complicates this is the additional constraint that 1 guest
TCE maps to 1 host TCE.  That constraint isn't necessary for basic,
slow, VFIO operation: VFIO_DMA_MAP on a 16M guest page will map in
multiple TCEs on a 64k pagesized host TCE table.

The kernel acceleration does rely on the host IOMMU page size matching
the guest IOMMU page size, and the kernel DDW interface sort of does,
since we directly specify the page size (qemu could request a DDW
window with a smaller pagesize than the guest pagesize, and things
should still work ok - but the guest TCE table representation would
now have multiple host TCEs backing each entry).

Allowing those more complex cases is probably more trouble than it's
worth for the time being, so I think that replaces constraint (2) with
the stronger constraint:
    2a. == to one of the host IOMMU page sizes
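
With constraint (2a) the filtering degenerates to a plain intersection
of the two masks, i.e. something like this (again an illustrative
sketch, not actual QEMU code):

```c
#include <stdint.h>

/*
 * Constraint (1) AND constraint (2a): a page size is allowed only if
 * both the guest IOMMU and the host IOMMU support it exactly, so one
 * guest TCE always maps to exactly one host TCE.
 */
static uint64_t filter_exact_host_pagesizes(uint64_t guest_iommu_mask,
                                            uint64_t host_iommu_mask)
{
    return guest_iommu_mask & host_iommu_mask;
}
```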

> >>For the migration purposes, both guests should be started with or without
> >>hugepages enabled; this is taken into account already. Besides that, the
> >>result of "query" won't differ.
> >
> >Hmm.. if you're migrating between TCG and KVM or between PR and HV
> >these could change as well.  I'm not sure that works at the moment,
> >but I'd prefer not to introduce any more barriers to it than we have
> >to.
> >
> >>>Remember that this code will be relevant for DDW with emulated
> >>>devices, even if VFIO is not in play at all.
> >>>
> >>>All those considerations aside, it seems like it would make more sense
> >>>to do this filtering during device realize, rather than leaving it
> >>>until the guest queries.
> >>
> >>The result will be the same; it only depends on whether hugepages are
> >>enabled or not, and this happens at start time. But yes, feels more
> >>accurate to do this in PHB realize(), I'll move it.
> >>
> >>
> >>>
> >>>>+    /*
> >>>>+     * This is "Largest contiguous block of TCEs allocated specifically
> >>>>+     * for (that is, are reserved for) this PE".
> >>>>+     * Return the maximum number as maximum supported RAM size was in 4K pages.
> >>>>+     */
> >>>>+    max_window_size = MACHINE(spapr)->maxram_size >> SPAPR_TCE_PAGE_SHIFT;
> >>>
> >>>Will maxram_size always be enough?  There will sometimes be an
> >>>alignment gap between the "base" RAM and the hotpluggable RAM, meaning
> >>>that if everything is plugged the last RAM address will be beyond
> >>>maxram_size.  Will that require pushing this number up, or will the
> >>>guest "repack" the RAM layout when it maps it into the TCE tables?
> >>
> >>
> >>Hm. I do not know what the guest does to DDW on memory hotplug but this is a
> >>valid point... What QEMU helper does return the last available address in
> >>the system memory address space? Like memblock_end_of_DRAM() in the kernel,
> >>I would use that instead.
> >
> >There is a last_ram_offset() but that's in the ram_addr_t address
> 
> What do you mean by the "ram_addr_t address space"? Non-pluggable memory and
> machine->ram_size is its size?
> 
> 
> >space, which isn't necessarily the same as the physical address space
> >(though it's usually similar).
> 
> 
> I looked at the code and it looks like machine->ram_size is always what came
> from "-m" and it won't grow if some memory was hotplugged, is that correct?
> 
> And hotpluggable memory does not appear in the global ram_list as a RAMBlock?
> 
> 
> >You can have a look at what we check
> >in (the TCG/PR version of) H_ENTER which needs to check this as well.
> 
> Ok, thanks for the pointer.
> 
> 
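As an aside, a rough sketch of the bound I have in mind (field and
helper names made up for illustration -- the point is only that the
window must cover the end of the hotpluggable region, which can sit
past maxram_size because its base is aligned up past the end of base
RAM):

```c
#include <stdint.h>

#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((uint64_t)(a) - 1))

/*
 * Returns the number of TCE pages needed to cover all guest RAM,
 * including the alignment gap before the hotpluggable region.
 */
static uint64_t max_dma_window_pages(uint64_t ram_size,
                                     uint64_t maxram_size,
                                     uint64_t hotplug_align,
                                     unsigned tce_page_shift)
{
    /* hotpluggable RAM starts at the aligned end of base RAM */
    uint64_t hotplug_base = ALIGN_UP(ram_size, hotplug_align);
    uint64_t last_addr = hotplug_base + (maxram_size - ram_size);

    return last_addr >> tce_page_shift;
}
```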

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
