* [Qemu-devel] [PATCH qemu v17 01/12] vmstate: Define VARRAY with VMS_ALLOC
       [not found] <1464771463-37214-1-git-send-email-aik@ozlabs.ru>
@ 2016-06-01  8:57 ` Alexey Kardashevskiy
  2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 02/12] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
                   ` (21 subsequent siblings)
  22 siblings, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-01  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson

This allows dynamic allocation for migrating arrays.

The existing VMSTATE_VARRAY_UINT32 requires an array to be
pre-allocated; however, there are cases when the size is not known in
advance and there is no real need to enforce it.

This defines another variant of VMSTATE_VARRAY_UINT32 with the VMS_ALLOC
flag, which tells the receiving side to allocate memory for the array
before receiving the data.

The first user of it is a dynamic DMA window whose existence and size
are totally dynamic.
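
The difference between a pre-allocated and a receiver-allocated array
can be sketched outside of QEMU as follows. This is only an illustration
under stated assumptions: demo_state and demo_load_varray() are
hypothetical names, the "stream" is a plain in-memory array, and none of
this is the actual vmstate loader.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical device state; not a QEMU structure. */
struct demo_state {
    uint32_t num;     /* element count, already loaded from the stream */
    uint64_t *table;  /* NULL until somebody allocates it */
};

/*
 * Copy s->num elements from `stream` into s->table.
 * With alloc != 0 the receiver allocates the buffer first (the
 * VMS_ALLOC-like case); otherwise the caller must have pre-allocated it.
 * Returns 0 on success, -1 if there is no buffer to load into.
 */
int demo_load_varray(struct demo_state *s, const uint64_t *stream, int alloc)
{
    assert(s != NULL && stream != NULL);

    if (alloc) {
        /* Allocate at least one element so a zero count still succeeds. */
        s->table = calloc(s->num ? s->num : 1, sizeof(s->table[0]));
    }
    if (!s->table) {
        return -1;
    }
    memcpy(s->table, stream, (size_t)s->num * sizeof(s->table[0]));
    return 0;
}
```

Without the allocation step the load can only succeed if the
destination already knows the element count and has a buffer ready,
which is exactly the constraint this patch lifts.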

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Thomas Huth <thuth@redhat.com>
---
 include/migration/vmstate.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 30ecc44..6c65811 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -386,6 +386,16 @@ extern const VMStateInfo vmstate_info_bitmap;
     .offset     = vmstate_offset_pointer(_state, _field, _type),     \
 }
 
+#define VMSTATE_VARRAY_UINT32_ALLOC(_field, _state, _field_num, _version, _info, _type) {\
+    .name       = (stringify(_field)),                               \
+    .version_id = (_version),                                        \
+    .num_offset = vmstate_offset_value(_state, _field_num, uint32_t),\
+    .info       = &(_info),                                          \
+    .size       = sizeof(_type),                                     \
+    .flags      = VMS_VARRAY_UINT32|VMS_POINTER|VMS_ALLOC,           \
+    .offset     = vmstate_offset_pointer(_state, _field, _type),     \
+}
+
 #define VMSTATE_VARRAY_UINT16_UNSAFE(_field, _state, _field_num, _version, _info, _type) {\
     .name       = (stringify(_field)),                               \
     .version_id = (_version),                                        \
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v17 02/12] spapr_iommu: Introduce "enabled" state for TCE table
       [not found] <1464771463-37214-1-git-send-email-aik@ozlabs.ru>
  2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 01/12] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
@ 2016-06-01  8:57 ` Alexey Kardashevskiy
  2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 03/12] spapr_iommu: Migrate full state Alexey Kardashevskiy
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-01  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson

Currently TCE tables are created once at start and their sizes never
change. We are going to change that by introducing Dynamic DMA Windows
(DDW) support, where the DMA configuration may change during guest
execution.

This changes spapr_tce_new_table() to create an empty zero-size IOMMU
memory region (IOMMU MR). Only the LIOBN is assigned at creation time.
The function is still called once, when the owner object (VIO or PHB)
is created.

This introduces an "enabled" state for TCE table objects and adds some
helper functions:
- spapr_tce_table_enable() receives TCE table parameters, stores them in
sPAPRTCETable, allocates a guest view of the TCE table
(in user space or KVM) and sets the correct size on the IOMMU MR;
- spapr_tce_table_disable() disposes of the table and resets the IOMMU MR
size; it is static here and is made public in the following patch, where
it gets its first external user.
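
The enable/disable life cycle described above can be modelled with a
small standalone sketch. The demo_* names are hypothetical, mr_size
merely stands in for the IOMMU MR size, and plain calloc()/free()
replace the real table allocation (which may live in KVM); treat it as
an illustration, not as the code in this patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model of a TCE table object; nb_table == 0 means disabled. */
struct demo_tcet {
    uint64_t bus_offset;
    uint32_t page_shift;
    uint32_t nb_table;
    uint64_t *table;
    uint64_t mr_size;   /* stands in for the IOMMU memory region size */
};

/* Enable a disabled table; returns 0 on success, -1 otherwise. */
int demo_tcet_enable(struct demo_tcet *t, uint32_t page_shift,
                     uint64_t bus_offset, uint32_t nb_table)
{
    assert(t != NULL);

    if (t->nb_table || !nb_table) {
        return -1;      /* already enabled, or nothing to enable */
    }
    t->table = calloc(nb_table, sizeof(t->table[0]));
    if (!t->table) {
        return -1;
    }
    t->bus_offset = bus_offset;
    t->page_shift = page_shift;
    t->nb_table = nb_table;
    t->mr_size = (uint64_t)nb_table << page_shift;
    return 0;
}

/* Disable an enabled table; a no-op if it is already disabled. */
void demo_tcet_disable(struct demo_tcet *t)
{
    assert(t != NULL);

    if (!t->nb_table) {
        return;
    }
    free(t->table);
    t->table = NULL;
    t->bus_offset = 0;
    t->page_shift = 0;
    t->nb_table = 0;
    t->mr_size = 0;
}
```

Using nb_table as the only "enabled" indicator keeps the model to a
single source of truth, in line with the v17 note below about dropping
a separate @enabled flag.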

This changes the PHB reset handler to do the default DMA initialization
instead of spapr_phb_realize(). This makes no difference now, but
later, with more than one DMA window, we will have to remove them all
and create the default one on a system reset.

No visible change in behaviour is expected except that the actual table
will be reallocated on every reset. We might optimize this later.

The other way to implement this would be to dynamically create/remove
the TCE table QOM objects, but this would make migration impossible:
the migration code expects all QOM objects to exist at the receiver,
so the TCE table objects have to be created before migration begins.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v17:
* spapr_tce_table_unrealize() calls spapr_tce_table_do_disable() directly
* moved spapr_tce_table_disable() to next patch as it is not used here
* removed @enabled as nb_table indicates already if the table is enabled

v15:
* made adjustments after removing spapr_phb_dma_window_enable()

v14:
* added spapr_tce_table_do_disable(), which will make a difference in
the following patch with fully dynamic table migration
---
 hw/ppc/spapr_iommu.c   | 68 ++++++++++++++++++++++++++++++++------------------
 hw/ppc/spapr_pci.c     |  8 +++---
 hw/ppc/spapr_vio.c     |  8 +++---
 include/hw/ppc/spapr.h |  9 +++----
 4 files changed, 56 insertions(+), 37 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 96bb018..de63467 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -17,6 +17,7 @@
  * License along with this library; if not, see <http://www.gnu.org/licenses/>.
  */
 #include "qemu/osdep.h"
+#include "qemu/error-report.h"
 #include "hw/hw.h"
 #include "qemu/log.h"
 #include "sysemu/kvm.h"
@@ -175,15 +176,9 @@ static int spapr_tce_table_realize(DeviceState *dev)
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
 
     tcet->fd = -1;
-    tcet->table = spapr_tce_alloc_table(tcet->liobn,
-                                        tcet->page_shift,
-                                        tcet->nb_table,
-                                        &tcet->fd,
-                                        tcet->need_vfio);
-
+    tcet->need_vfio = false;
     memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
-                             "iommu-spapr",
-                             (uint64_t)tcet->nb_table << tcet->page_shift);
+                             "iommu-spapr", 0);
 
     QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
 
@@ -225,14 +220,10 @@ void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio)
     tcet->table = newtable;
 }
 
-sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
-                                   uint64_t bus_offset,
-                                   uint32_t page_shift,
-                                   uint32_t nb_table,
-                                   bool need_vfio)
+sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn)
 {
     sPAPRTCETable *tcet;
-    char tmp[64];
+    char tmp[32];
 
     if (spapr_tce_find_by_liobn(liobn)) {
         fprintf(stderr, "Attempted to create TCE table with duplicate"
@@ -240,16 +231,8 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
         return NULL;
     }
 
-    if (!nb_table) {
-        return NULL;
-    }
-
     tcet = SPAPR_TCE_TABLE(object_new(TYPE_SPAPR_TCE_TABLE));
     tcet->liobn = liobn;
-    tcet->bus_offset = bus_offset;
-    tcet->page_shift = page_shift;
-    tcet->nb_table = nb_table;
-    tcet->need_vfio = need_vfio;
 
     snprintf(tmp, sizeof(tmp), "tce-table-%x", liobn);
     object_property_add_child(OBJECT(owner), tmp, OBJECT(tcet), NULL);
@@ -259,14 +242,51 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
     return tcet;
 }
 
+void spapr_tce_table_enable(sPAPRTCETable *tcet,
+                            uint32_t page_shift, uint64_t bus_offset,
+                            uint32_t nb_table)
+{
+    if (tcet->nb_table) {
+        error_report("Warning: trying to enable already enabled TCE table");
+        return;
+    }
+
+    tcet->bus_offset = bus_offset;
+    tcet->page_shift = page_shift;
+    tcet->nb_table = nb_table;
+    tcet->table = spapr_tce_alloc_table(tcet->liobn,
+                                        tcet->page_shift,
+                                        tcet->nb_table,
+                                        &tcet->fd,
+                                        tcet->need_vfio);
+
+    memory_region_set_size(&tcet->iommu,
+                           (uint64_t)tcet->nb_table << tcet->page_shift);
+}
+
+static void spapr_tce_table_disable(sPAPRTCETable *tcet)
+{
+    if (!tcet->nb_table) {
+        return;
+    }
+
+    memory_region_set_size(&tcet->iommu, 0);
+
+    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
+    tcet->fd = -1;
+    tcet->table = NULL;
+    tcet->bus_offset = 0;
+    tcet->page_shift = 0;
+    tcet->nb_table = 0;
+}
+
 static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
 
     QLIST_REMOVE(tcet, list);
 
-    spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
-    tcet->fd = -1;
+    spapr_tce_table_disable(tcet);
 }
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 856aec7..7688ae0 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1463,8 +1463,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     }
 
     nb_table = sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT;
-    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn,
-                               0, SPAPR_TCE_PAGE_SHIFT, nb_table, false);
+    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
     if (!tcet) {
         error_setg(errp, "Unable to create TCE table for %s",
                    sphb->dtbusname);
@@ -1472,7 +1471,10 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     }
 
     /* Register default 32bit DMA window */
-    memory_region_add_subregion(&sphb->iommu_root, sphb->dma_win_addr,
+    spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
+                           nb_table);
+
+    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
                                 spapr_tce_get_iommu(tcet));
 
     sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
diff --git a/hw/ppc/spapr_vio.c b/hw/ppc/spapr_vio.c
index d084aed..3d9b9c6 100644
--- a/hw/ppc/spapr_vio.c
+++ b/hw/ppc/spapr_vio.c
@@ -483,11 +483,9 @@ static void spapr_vio_busdev_realize(DeviceState *qdev, Error **errp)
         memory_region_add_subregion_overlap(&dev->mrroot, 0, &dev->mrbypass, 1);
         address_space_init(&dev->as, &dev->mrroot, qdev->id);
 
-        dev->tcet = spapr_tce_new_table(qdev, liobn,
-                                        0,
-                                        SPAPR_TCE_PAGE_SHIFT,
-                                        pc->rtce_window_size >>
-                                        SPAPR_TCE_PAGE_SHIFT, false);
+        dev->tcet = spapr_tce_new_table(qdev, liobn);
+        spapr_tce_table_enable(dev->tcet, SPAPR_TCE_PAGE_SHIFT, 0,
+                               pc->rtce_window_size >> SPAPR_TCE_PAGE_SHIFT);
         dev->tcet->vdev = dev;
         memory_region_add_subregion_overlap(&dev->mrroot, 0,
                                             spapr_tce_get_iommu(dev->tcet), 2);
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 815d5ee..26c327d 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -561,11 +561,10 @@ void spapr_events_fdt_skel(void *fdt, uint32_t epow_irq);
 int spapr_h_cas_compose_response(sPAPRMachineState *sm,
                                  target_ulong addr, target_ulong size,
                                  bool cpu_update, bool memory_update);
-sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
-                                   uint64_t bus_offset,
-                                   uint32_t page_shift,
-                                   uint32_t nb_table,
-                                   bool need_vfio);
+sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn);
+void spapr_tce_table_enable(sPAPRTCETable *tcet,
+                            uint32_t page_shift, uint64_t bus_offset,
+                            uint32_t nb_table);
 void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio);
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet);
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v17 03/12] spapr_iommu: Migrate full state
       [not found] <1464771463-37214-1-git-send-email-aik@ozlabs.ru>
  2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 01/12] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
  2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 02/12] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
@ 2016-06-01  8:57 ` Alexey Kardashevskiy
  2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 04/12] spapr_iommu: Add root memory region Alexey Kardashevskiy
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-01  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson

The source guest could have reallocated the default TCE table and
migrated a bigger or smaller table. This adds reallocation in
post_load() if the default table size differs between source and
destination.
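
The reallocation rule can be sketched in isolation like this. It is a
simplified model under stated assumptions: demo_table and
demo_post_load() are hypothetical names, and the temporary buffer is
handled with malloc()/free() rather than whatever allocator the
migration code uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical destination-side table; nb == 0 means "no table". */
struct demo_table {
    uint32_t nb;
    uint64_t *entries;
};

/*
 * Apply a migrated table of mig_nb entries to dst. If the sizes differ,
 * the destination table is dropped and re-created with the migrated
 * size before the entries are copied. Takes ownership of mig_entries.
 * Returns 0 on success, -1 on allocation failure.
 */
int demo_post_load(struct demo_table *dst, uint64_t *mig_entries,
                   uint32_t mig_nb)
{
    assert(dst != NULL);

    if (mig_nb != dst->nb) {
        /* Size mismatch: dispose of the local table first. */
        free(dst->entries);
        dst->entries = NULL;
        dst->nb = 0;
    }

    if (mig_nb) {
        if (!dst->nb) {
            dst->entries = calloc(mig_nb, sizeof(dst->entries[0]));
            if (!dst->entries) {
                free(mig_entries);
                return -1;
            }
            dst->nb = mig_nb;
        }
        memcpy(dst->entries, mig_entries,
               (size_t)dst->nb * sizeof(dst->entries[0]));
    }

    free(mig_entries);
    return 0;
}
```

When the sizes already match, the existing allocation is reused and only
the contents are overwritten.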

This adds @bus_offset and @page_shift to the migration stream as
a subsection so that, when DDW is added, migration to older machines
will still be possible. As @bus_offset and @page_shift are not used yet,
this makes no change in behavior.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v17:
* removed @enabled from migration stream
* reworked spapr_tce_table_post_load()
* removed sob because of rework

v15:
* squashed "migrate full state" into this
* added missing tcet->mig_nb_table initialization in spapr_tce_table_pre_save()
* instead of bumping the version, moved extra parameters to subsection

v14:
* new to the series
---
 hw/ppc/spapr_iommu.c   | 65 +++++++++++++++++++++++++++++++++++++++++++++++---
 include/hw/ppc/spapr.h |  3 +++
 trace-events           |  2 ++
 3 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index de63467..28991bc 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -138,33 +138,92 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
     return ret;
 }
 
+static void spapr_tce_table_pre_save(void *opaque)
+{
+    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
+
+    tcet->mig_table = tcet->table;
+    tcet->mig_nb_table = tcet->nb_table;
+
+    trace_spapr_iommu_pre_save(tcet->liobn, tcet->mig_nb_table,
+                               tcet->bus_offset, tcet->page_shift);
+}
+
 static int spapr_tce_table_post_load(void *opaque, int version_id)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
+    uint32_t old_nb_table = tcet->nb_table;
+    uint64_t old_bus_offset = tcet->bus_offset;
+    uint32_t old_page_shift = tcet->page_shift;
 
     if (tcet->vdev) {
         spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
     }
 
+    if (tcet->mig_nb_table != tcet->nb_table) {
+        spapr_tce_table_disable(tcet);
+    }
+
+    if (tcet->mig_nb_table) {
+        if (!tcet->nb_table) {
+            spapr_tce_table_enable(tcet, old_page_shift, old_bus_offset,
+                                   tcet->mig_nb_table);
+        }
+
+        memcpy(tcet->table, tcet->mig_table,
+               tcet->nb_table * sizeof(tcet->table[0]));
+
+        free(tcet->mig_table);
+        tcet->mig_table = NULL;
+    }
+
+    trace_spapr_iommu_post_load(tcet->liobn, old_nb_table, tcet->nb_table,
+                                tcet->bus_offset, tcet->page_shift);
+
     return 0;
 }
 
+static bool spapr_tce_table_ex_needed(void *opaque)
+{
+    sPAPRTCETable *tcet = opaque;
+
+    return tcet->bus_offset || tcet->page_shift != 0xC;
+}
+
+static const VMStateDescription vmstate_spapr_tce_table_ex = {
+    .name = "spapr_iommu_ex",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .needed = spapr_tce_table_ex_needed,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT64(bus_offset, sPAPRTCETable),
+        VMSTATE_UINT32(page_shift, sPAPRTCETable),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
 static const VMStateDescription vmstate_spapr_tce_table = {
     .name = "spapr_iommu",
     .version_id = 2,
     .minimum_version_id = 2,
+    .pre_save = spapr_tce_table_pre_save,
     .post_load = spapr_tce_table_post_load,
     .fields      = (VMStateField []) {
         /* Sanity check */
         VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
-        VMSTATE_UINT32_EQUAL(nb_table, sPAPRTCETable),
 
         /* IOMMU state */
+        VMSTATE_UINT32(mig_nb_table, sPAPRTCETable),
         VMSTATE_BOOL(bypass, sPAPRTCETable),
-        VMSTATE_VARRAY_UINT32(table, sPAPRTCETable, nb_table, 0, vmstate_info_uint64, uint64_t),
+        VMSTATE_VARRAY_UINT32_ALLOC(mig_table, sPAPRTCETable, mig_nb_table, 0,
+                                    vmstate_info_uint64, uint64_t),
 
         VMSTATE_END_OF_LIST()
     },
+    .subsections = (const VMStateDescription*[]) {
+        &vmstate_spapr_tce_table_ex,
+        NULL
+    }
 };
 
 static MemoryRegionIOMMUOps spapr_iommu_ops = {
@@ -264,7 +323,7 @@ void spapr_tce_table_enable(sPAPRTCETable *tcet,
                            (uint64_t)tcet->nb_table << tcet->page_shift);
 }
 
-static void spapr_tce_table_disable(sPAPRTCETable *tcet)
+void spapr_tce_table_disable(sPAPRTCETable *tcet)
 {
     if (!tcet->nb_table) {
         return;
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 26c327d..f849714 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -539,6 +539,8 @@ struct sPAPRTCETable {
     uint64_t bus_offset;
     uint32_t page_shift;
     uint64_t *table;
+    uint32_t mig_nb_table;
+    uint64_t *mig_table;
     bool bypass;
     bool need_vfio;
     int fd;
@@ -565,6 +567,7 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn);
 void spapr_tce_table_enable(sPAPRTCETable *tcet,
                             uint32_t page_shift, uint64_t bus_offset,
                             uint32_t nb_table);
+void spapr_tce_table_disable(sPAPRTCETable *tcet);
 void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio);
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet);
diff --git a/trace-events b/trace-events
index b27d1da..de42012 100644
--- a/trace-events
+++ b/trace-events
@@ -1431,6 +1431,8 @@ spapr_iommu_pci_indirect(uint64_t liobn, uint64_t ioba, uint64_t tce, uint64_t i
 spapr_iommu_pci_stuff(uint64_t liobn, uint64_t ioba, uint64_t tce_value, uint64_t npages, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcevalue=0x%"PRIx64" npages=%"PRId64" ret=%"PRId64
 spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, unsigned pgsize) "liobn=%"PRIx64" 0x%"PRIx64" -> 0x%"PRIx64" perm=%u mask=%x"
 spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
+spapr_iommu_pre_save(uint64_t liobn, uint32_t nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
+spapr_iommu_post_load(uint64_t liobn, uint32_t pre_nb, uint32_t post_nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" => %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
 
 # hw/ppc/ppc.c
 ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v17 04/12] spapr_iommu: Add root memory region
       [not found] <1464771463-37214-1-git-send-email-aik@ozlabs.ru>
                   ` (2 preceding siblings ...)
  2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 03/12] spapr_iommu: Migrate full state Alexey Kardashevskiy
@ 2016-06-01  8:57 ` Alexey Kardashevskiy
  2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 05/12] spapr_pci: Reset DMA config on PHB reset Alexey Kardashevskiy
                   ` (18 subsequent siblings)
  22 siblings, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-01  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson

We are going to have multiple DMA windows at different offsets on
a PCI bus. For the sake of migration, we will pre-create as many TCE
table objects as there are windows supported.
So we need a way to map windows dynamically onto a PCI bus
when migration of a table is completed, but at this stage a TCE table
object does not have access to a PHB to ask it to map a DMA window
backed by the just-migrated TCE table.

This adds a "root" memory region (UINT64_MAX long) to the TCE object.
This new region is mapped on a PCI bus with overlapping enabled, as
there will be one root MR per TCE table, each of them mapped at 0.
The actual IOMMU memory region is a subregion of the root region;
a TCE table enables/disables this subregion and maps it at
the specific offset inside the root MR, which is a 1:1 mapping of
the PCI address space.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Thomas Huth <thuth@redhat.com>
---
 hw/ppc/spapr_iommu.c   | 13 ++++++++++---
 hw/ppc/spapr_pci.c     |  6 +++---
 include/hw/ppc/spapr.h |  2 +-
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 28991bc..a3cc572 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -233,11 +233,16 @@ static MemoryRegionIOMMUOps spapr_iommu_ops = {
 static int spapr_tce_table_realize(DeviceState *dev)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
+    Object *tcetobj = OBJECT(tcet);
+    char tmp[32];
 
     tcet->fd = -1;
     tcet->need_vfio = false;
-    memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
-                             "iommu-spapr", 0);
+    snprintf(tmp, sizeof(tmp), "tce-root-%x", tcet->liobn);
+    memory_region_init(&tcet->root, tcetobj, tmp, UINT64_MAX);
+
+    snprintf(tmp, sizeof(tmp), "tce-iommu-%x", tcet->liobn);
+    memory_region_init_iommu(&tcet->iommu, tcetobj, &spapr_iommu_ops, tmp, 0);
 
     QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
 
@@ -321,6 +326,7 @@ void spapr_tce_table_enable(sPAPRTCETable *tcet,
 
     memory_region_set_size(&tcet->iommu,
                            (uint64_t)tcet->nb_table << tcet->page_shift);
+    memory_region_add_subregion(&tcet->root, tcet->bus_offset, &tcet->iommu);
 }
 
 void spapr_tce_table_disable(sPAPRTCETable *tcet)
@@ -329,6 +335,7 @@ void spapr_tce_table_disable(sPAPRTCETable *tcet)
         return;
     }
 
+    memory_region_del_subregion(&tcet->root, &tcet->iommu);
     memory_region_set_size(&tcet->iommu, 0);
 
     spapr_tce_free_table(tcet->table, tcet->fd, tcet->nb_table);
@@ -350,7 +357,7 @@ static void spapr_tce_table_unrealize(DeviceState *dev, Error **errp)
 
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet)
 {
-    return &tcet->iommu;
+    return &tcet->root;
 }
 
 static void spapr_tce_reset(DeviceState *dev)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 7688ae0..a529eff 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1470,13 +1470,13 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         return;
     }
 
+    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
+                                        spapr_tce_get_iommu(tcet), 0);
+
     /* Register default 32bit DMA window */
     spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
                            nb_table);
 
-    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
-                                spapr_tce_get_iommu(tcet));
-
     sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
 }
 
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index f849714..971df3d 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -544,7 +544,7 @@ struct sPAPRTCETable {
     bool bypass;
     bool need_vfio;
     int fd;
-    MemoryRegion iommu;
+    MemoryRegion root, iommu;
     struct VIOsPAPRDevice *vdev; /* for @bypass migration compatibility only */
     QLIST_ENTRY(sPAPRTCETable) list;
 };
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v17 05/12] spapr_pci: Reset DMA config on PHB reset
       [not found] <1464771463-37214-1-git-send-email-aik@ozlabs.ru>
                   ` (3 preceding siblings ...)
  2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 04/12] spapr_iommu: Add root memory region Alexey Kardashevskiy
@ 2016-06-01  8:57 ` Alexey Kardashevskiy
  2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 06/12] memory: Add reporting of supported page sizes Alexey Kardashevskiy
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-01  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson

LoPAPR dictates that during system reset all DMA windows must be removed
and the default DMA32 window must be created; this patch does so.

At the moment there is just one window supported so no change in
behaviour is expected.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v17:
* due to " spapr_iommu: Introduce "enabled" state for TCE table" rework,
instead of making spapr_tce_table_disable() public, this just adds it
---
 hw/ppc/spapr_pci.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index a529eff..4a7be4d 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1310,7 +1310,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     PCIBus *bus;
     uint64_t msi_window_size = 4096;
     sPAPRTCETable *tcet;
-    uint32_t nb_table;
 
     if (sphb->index != (uint32_t)-1) {
         hwaddr windows_base;
@@ -1462,7 +1461,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         }
     }
 
-    nb_table = sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT;
     tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
     if (!tcet) {
         error_setg(errp, "Unable to create TCE table for %s",
@@ -1473,10 +1471,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
                                         spapr_tce_get_iommu(tcet), 0);
 
-    /* Register default 32bit DMA window */
-    spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
-                           nb_table);
-
     sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
 }
 
@@ -1493,6 +1487,17 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
 
 static void spapr_phb_reset(DeviceState *qdev)
 {
+    sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
+    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
+
+    if (tcet && tcet->nb_table) {
+        spapr_tce_table_disable(tcet);
+    }
+
+    /* Register default 32bit DMA window */
+    spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
+                           sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT);
+
     /* Reset the IOMMU state */
     object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL);
 
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v17 06/12] memory: Add reporting of supported page sizes
       [not found] <1464771463-37214-1-git-send-email-aik@ozlabs.ru>
                   ` (4 preceding siblings ...)
  2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 05/12] spapr_pci: Reset DMA config on PHB reset Alexey Kardashevskiy
@ 2016-06-01  8:57 ` Alexey Kardashevskiy
  2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 07/12] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2) Alexey Kardashevskiy
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-01  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson

Every IOMMU has some granularity which MemoryRegionIOMMUOps::translate
uses when translating; however, this information is not available
outside the translate context for various checks.

This adds a get_page_sizes callback to MemoryRegionIOMMUOps and
a wrapper for it so IOMMU users (such as VFIO) can know the actual
page size(s) used by an IOMMU.

As an IOMMU MR represents a guest IOMMU, this uses TARGET_PAGE_SIZE
as a fallback.
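
The callback-with-fallback pattern and the way a minimum granularity
falls out of a page size bitmap can be sketched as below. Everything
here is an assumption made for illustration: the demo_* names are
hypothetical, DEMO_TARGET_PAGE_SIZE is an arbitrary 4K stand-in for
TARGET_PAGE_SIZE, and taking the lowest set bit mirrors what the removed
vfio_container_granularity() did with ctz64().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for TARGET_PAGE_SIZE; 4K is an assumption for this sketch. */
#define DEMO_TARGET_PAGE_SIZE 4096ULL

/* Hypothetical callback type: returns a bitmap of supported page sizes. */
typedef uint64_t (*demo_get_page_sizes_fn)(void *opaque);

/* Example callback for an IOMMU that only supports 64K pages. */
uint64_t demo_64k_only(void *opaque)
{
    (void)opaque;
    return 1ULL << 16;
}

/* Wrapper: ask the callback if there is one, else use the fallback. */
uint64_t demo_iommu_page_sizes(demo_get_page_sizes_fn cb, void *opaque)
{
    return cb ? cb(opaque) : DEMO_TARGET_PAGE_SIZE;
}

/* Smallest supported page size, i.e. the lowest set bit of the bitmap. */
uint64_t demo_min_granularity(uint64_t page_sizes)
{
    assert(page_sizes != 0);
    return page_sizes & (~page_sizes + 1);
}
```

A caller that replays mappings would step through the region in units
of demo_min_granularity(demo_iommu_page_sizes(cb, opaque)).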

This removes vfio_container_granularity() and uses the new helper in
memory_region_iommu_replay() when replaying IOMMU mappings on an added
IOMMU memory region.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v16:
* used memory_region_iommu_get_page_sizes() instead of
mr->iommu_ops->get_page_sizes() in memory_region_iommu_replay()

v15:
* s/qemu_real_host_page_size/TARGET_PAGE_SIZE/ in memory_region_iommu_get_page_sizes

v14:
* removed vfio_container_granularity(), changed memory_region_iommu_replay()

v4:
* s/1<<TARGET_PAGE_BITS/qemu_real_host_page_size/
---
 hw/ppc/spapr_iommu.c  |  8 ++++++++
 hw/vfio/common.c      |  6 ------
 include/exec/memory.h | 18 ++++++++++++++----
 memory.c              | 16 +++++++++++++---
 4 files changed, 35 insertions(+), 13 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index a3cc572..90a45c0 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -149,6 +149,13 @@ static void spapr_tce_table_pre_save(void *opaque)
                                tcet->bus_offset, tcet->page_shift);
 }
 
+static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
+{
+    sPAPRTCETable *tcet = container_of(iommu, sPAPRTCETable, iommu);
+
+    return 1ULL << tcet->page_shift;
+}
+
 static int spapr_tce_table_post_load(void *opaque, int version_id)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
@@ -228,6 +235,7 @@ static const VMStateDescription vmstate_spapr_tce_table = {
 
 static MemoryRegionIOMMUOps spapr_iommu_ops = {
     .translate = spapr_tce_translate_iommu,
+    .get_page_sizes = spapr_tce_get_page_sizes,
 };
 
 static int spapr_tce_table_realize(DeviceState *dev)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index e51ed3a..f1a12b0 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -322,11 +322,6 @@ out:
     rcu_read_unlock();
 }
 
-static hwaddr vfio_container_granularity(VFIOContainer *container)
-{
-    return (hwaddr)1 << ctz64(container->iova_pgsizes);
-}
-
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
@@ -394,7 +389,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
 
         memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
         memory_region_iommu_replay(giommu->iommu, &giommu->n,
-                                   vfio_container_granularity(container),
                                    false);
 
         return;
diff --git a/include/exec/memory.h b/include/exec/memory.h
index f649697..bd9625f 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -149,6 +149,8 @@ typedef struct MemoryRegionIOMMUOps MemoryRegionIOMMUOps;
 struct MemoryRegionIOMMUOps {
     /* Return a TLB entry that contains a given address. */
     IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
+    /* Returns supported page sizes */
+    uint64_t (*get_page_sizes)(MemoryRegion *iommu);
 };
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
@@ -571,6 +573,15 @@ static inline bool memory_region_is_iommu(MemoryRegion *mr)
 
 
 /**
+ * memory_region_iommu_get_page_sizes: get supported page sizes in an iommu
+ *
+ * Returns %bitmap of supported page sizes for an iommu.
+ *
+ * @mr: the memory region being queried
+ */
+uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr);
+
+/**
  * memory_region_notify_iommu: notify a change in an IOMMU translation entry.
  *
  * @mr: the memory region that was changed
@@ -594,16 +605,15 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n);
 
 /**
  * memory_region_iommu_replay: replay existing IOMMU translations to
- * a notifier
+ * a notifier with the minimum page granularity returned by
+ * mr->iommu_ops->get_page_sizes().
  *
  * @mr: the memory region to observe
  * @n: the notifier to which to replay iommu mappings
- * @granularity: Minimum page granularity to replay notifications for
  * @is_write: Whether to treat the replay as a translate "write"
  *     through the iommu
  */
-void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n,
-                                hwaddr granularity, bool is_write);
+void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write);
 
 /**
  * memory_region_unregister_iommu_notifier: unregister a notifier for
diff --git a/memory.c b/memory.c
index 4e3cda8..761ae92 100644
--- a/memory.c
+++ b/memory.c
@@ -1500,12 +1500,22 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n)
     notifier_list_add(&mr->iommu_notify, n);
 }
 
-void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n,
-                                hwaddr granularity, bool is_write)
+uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr)
 {
-    hwaddr addr;
+    assert(memory_region_is_iommu(mr));
+    if (mr->iommu_ops && mr->iommu_ops->get_page_sizes) {
+        return mr->iommu_ops->get_page_sizes(mr);
+    }
+    return TARGET_PAGE_SIZE;
+}
+
+void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write)
+{
+    hwaddr addr, granularity;
     IOMMUTLBEntry iotlb;
 
+    granularity = (hwaddr)1 << ctz64(memory_region_iommu_get_page_sizes(mr));
+
     for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
         iotlb = mr->iommu_ops->translate(mr, addr, is_write);
         if (iotlb.perm != IOMMU_NONE) {
-- 
2.5.0.rc3

* [Qemu-devel] [PATCH qemu v17 07/12] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2)
       [not found] <1464771463-37214-1-git-send-email-aik@ozlabs.ru>
                   ` (5 preceding siblings ...)
  2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 06/12] memory: Add reporting of supported page sizes Alexey Kardashevskiy
@ 2016-06-01  8:57 ` Alexey Kardashevskiy
  2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 08/12] spapr_pci: Add and export DMA resetting helper Alexey Kardashevskiy
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-01  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson

This makes use of the new "memory registering" feature. The idea is
to give userspace the ability to notify the host kernel about pages
which are going to be used for DMA. With this information, the host
kernel can pin them all once per user process, do locked pages
accounting (once) and not spend time doing that at run time, where
failures cannot always be handled nicely.

This adds a prereg memory listener which listens on address_space_memory
and notifies a VFIO container about memory which needs to be
pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.

As there is no per-IOMMU-type release() callback anymore, this stores
the IOMMU type in the container so vfio_listener_release() can determine
if it needs to unregister @prereg_listener.

The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
not call it when v2 is detected and enabled.

This enforces guest RAM blocks to be host page size aligned; however
this is not new as KVM already requires memory slots to be host page
size aligned.
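
The alignment rule can be sketched in isolation. This is an
illustration only: struct section_geometry and
section_is_host_page_aligned() are hypothetical names, and the host
page size is passed in explicitly, whereas the patch derives its mask
from qemu_real_host_page_mask.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the three MemoryRegionSection fields the
 * prereg listener checks; this is not the QEMU type.
 */
struct section_geometry {
    uint64_t offset_within_address_space;
    uint64_t offset_within_region;
    uint64_t size;
};

/*
 * Returns true when all three values are multiples of host_page_size.
 * host_page_size is assumed to be a power of two.
 */
static bool section_is_host_page_aligned(const struct section_geometry *s,
                                         uint64_t host_page_size)
{
    uint64_t low_bits = host_page_size - 1;

    /* OR the values together so one mask test covers all of them */
    return ((s->offset_within_address_space |
             s->offset_within_region |
             s->size) & low_bits) == 0;
}
```

A section that is 4K aligned but not 64K aligned passes with a 4K host
page size and is rejected with a 64K one, which is the case the
listener reports as an unaligned region.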

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v17:
* s/prereg\.c/spapr.c/
* s/vfio_prereg_gpa_to_ua/vfio_prereg_gpa_to_vaddr/
* vfio_prereg_listener_skipped_section does hw_error() on IOMMUs

v16:
* switched to 64bit math everywhere as there is no chance to see
region_add on RAM blocks even remotely close to 1<<64bytes.

v15:
* banned unaligned sections
* added a vfio_prereg_gpa_to_ua() helper

v14:
* s/free_container_exit/listener_release_exit/g
* added "if memory_region_is_iommu()" to vfio_prereg_listener_skipped_section
---
 hw/vfio/Makefile.objs         |   1 +
 hw/vfio/common.c              |  38 +++++++++---
 hw/vfio/spapr.c               | 137 ++++++++++++++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-common.h |   4 ++
 trace-events                  |   2 +
 5 files changed, 172 insertions(+), 10 deletions(-)
 create mode 100644 hw/vfio/spapr.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index ceddbb8..c25e32b 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
 obj-$(CONFIG_SOFTMMU) += platform.o
 obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
 obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
+obj-$(CONFIG_SOFTMMU) += spapr.o
 endif
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index f1a12b0..770f630 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -504,6 +504,9 @@ static const MemoryListener vfio_memory_listener = {
 static void vfio_listener_release(VFIOContainer *container)
 {
     memory_listener_unregister(&container->listener);
+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+        memory_listener_unregister(&container->prereg_listener);
+    }
 }
 
 static struct vfio_info_cap_header *
@@ -862,8 +865,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             goto free_container_exit;
         }
 
-        ret = ioctl(fd, VFIO_SET_IOMMU,
-                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
+        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
+        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
         if (ret) {
             error_report("vfio: failed to set iommu for container: %m");
             ret = -errno;
@@ -888,8 +891,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
         if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
             container->iova_pgsizes = info.iova_pgsizes;
         }
-    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
+    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
+               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
         struct vfio_iommu_spapr_tce_info info;
+        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
 
         ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
         if (ret) {
@@ -897,7 +902,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             ret = -errno;
             goto free_container_exit;
         }
-        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
+        container->iommu_type =
+            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
+        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
         if (ret) {
             error_report("vfio: failed to set iommu for container: %m");
             ret = -errno;
@@ -909,11 +916,22 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
          * when container fd is closed so we do not call it explicitly
          * in this file.
          */
-        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
-        if (ret) {
-            error_report("vfio: failed to enable container: %m");
-            ret = -errno;
-            goto free_container_exit;
+        if (!v2) {
+            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
+            if (ret) {
+                error_report("vfio: failed to enable container: %m");
+                ret = -errno;
+                goto free_container_exit;
+            }
+        } else {
+            container->prereg_listener = vfio_prereg_listener;
+
+            memory_listener_register(&container->prereg_listener,
+                                     &address_space_memory);
+            if (container->error) {
+                error_report("vfio: RAM memory listener initialization failed for container");
+                goto listener_release_exit;
+            }
         }
 
         /*
@@ -926,7 +944,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
         if (ret) {
             error_report("vfio: VFIO_IOMMU_SPAPR_TCE_GET_INFO failed: %m");
             ret = -errno;
-            goto free_container_exit;
+            goto listener_release_exit;
         }
         container->min_iova = info.dma32_window_start;
         container->max_iova = container->min_iova + info.dma32_window_size - 1;
diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
new file mode 100644
index 0000000..f339472
--- /dev/null
+++ b/hw/vfio/spapr.c
@@ -0,0 +1,137 @@
+/*
+ * DMA memory preregistration
+ *
+ * Authors:
+ *  Alexey Kardashevskiy <aik@ozlabs.ru>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "cpu.h"
+#include <sys/ioctl.h>
+#include <linux/vfio.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "hw/hw.h"
+#include "qemu/error-report.h"
+#include "trace.h"
+
+static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
+{
+    if (memory_region_is_iommu(section->mr)) {
+        hw_error("Cannot possibly preregister IOMMU memory");
+    }
+
+    return !memory_region_is_ram(section->mr) ||
+            memory_region_is_skip_dump(section->mr);
+}
+
+static void *vfio_prereg_gpa_to_vaddr(MemoryRegionSection *section, hwaddr gpa)
+{
+    return memory_region_get_ram_ptr(section->mr) +
+        section->offset_within_region +
+        (gpa - section->offset_within_address_space);
+}
+
+static void vfio_prereg_listener_region_add(MemoryListener *listener,
+                                            MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            prereg_listener);
+    const hwaddr gpa = section->offset_within_address_space;
+    hwaddr end;
+    int ret;
+    hwaddr page_mask = qemu_real_host_page_mask;
+    struct vfio_iommu_spapr_register_memory reg = {
+        .argsz = sizeof(reg),
+        .flags = 0,
+    };
+
+    if (vfio_prereg_listener_skipped_section(section)) {
+        trace_vfio_listener_region_add_skip(
+                section->offset_within_address_space,
+                section->offset_within_address_space +
+                int128_get64(int128_sub(section->size, int128_one())));
+        return;
+    }
+
+    if (unlikely((section->offset_within_address_space & ~page_mask) ||
+                 (section->offset_within_region & ~page_mask) ||
+                 (int128_get64(section->size) & ~page_mask))) {
+        error_report("%s received unaligned region", __func__);
+        return;
+    }
+
+    end = section->offset_within_address_space + int128_get64(section->size);
+    g_assert(gpa < end);
+
+    memory_region_ref(section->mr);
+
+    reg.vaddr = (__u64) vfio_prereg_gpa_to_vaddr(section, gpa);
+    reg.size = end - gpa;
+
+    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
+    trace_vfio_ram_register(reg.vaddr, reg.size, ret ? -errno : 0);
+    if (ret) {
+        /*
+         * On the initfn path, store the first error in the container so we
+         * can gracefully fail.  Runtime, there's not much we can do other
+         * than throw a hardware error.
+         */
+        if (!container->initialized) {
+            if (!container->error) {
+                container->error = ret;
+            }
+        } else {
+            hw_error("vfio: Memory registering failed, unable to continue");
+        }
+    }
+}
+
+static void vfio_prereg_listener_region_del(MemoryListener *listener,
+                                            MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            prereg_listener);
+    const hwaddr gpa = section->offset_within_address_space;
+    hwaddr end;
+    int ret;
+    hwaddr page_mask = qemu_real_host_page_mask;
+    struct vfio_iommu_spapr_register_memory reg = {
+        .argsz = sizeof(reg),
+        .flags = 0,
+    };
+
+    if (vfio_prereg_listener_skipped_section(section)) {
+        trace_vfio_listener_region_del_skip(
+                section->offset_within_address_space,
+                section->offset_within_address_space +
+                int128_get64(int128_sub(section->size, int128_one())));
+        return;
+    }
+
+    if (unlikely((section->offset_within_address_space & ~page_mask) ||
+                 (section->offset_within_region & ~page_mask) ||
+                 (int128_get64(section->size) & ~page_mask))) {
+        error_report("%s received unaligned region", __func__);
+        return;
+    }
+
+    end = section->offset_within_address_space + int128_get64(section->size);
+    if (gpa >= end) {
+        return;
+    }
+
+    reg.vaddr = (__u64) vfio_prereg_gpa_to_vaddr(section, gpa);
+    reg.size = end - gpa;
+
+    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
+    trace_vfio_ram_unregister(reg.vaddr, reg.size, ret ? -errno : 0);
+}
+
+const MemoryListener vfio_prereg_listener = {
+    .region_add = vfio_prereg_listener_region_add,
+    .region_del = vfio_prereg_listener_region_del,
+};
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 0610377..405c3b2 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -73,6 +73,8 @@ typedef struct VFIOContainer {
     VFIOAddressSpace *space;
     int fd; /* /dev/vfio/vfio, empowered by the attached groups */
     MemoryListener listener;
+    MemoryListener prereg_listener;
+    unsigned iommu_type;
     int error;
     bool initialized;
     /*
@@ -158,4 +160,6 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
 int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
                              uint32_t subtype, struct vfio_region_info **info);
 #endif
+extern const MemoryListener vfio_prereg_listener;
+
 #endif /* !HW_VFIO_VFIO_COMMON_H */
diff --git a/trace-events b/trace-events
index de42012..ddb8676 100644
--- a/trace-events
+++ b/trace-events
@@ -1766,6 +1766,8 @@ vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps e
 vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Device %s region %d: %d sparse mmap entries"
 vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
+vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
+vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
 
 # hw/vfio/platform.c
 vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
-- 
2.5.0.rc3

* [Qemu-devel] [PATCH qemu v17 08/12] spapr_pci: Add and export DMA resetting helper
       [not found] <1464771463-37214-1-git-send-email-aik@ozlabs.ru>
                   ` (6 preceding siblings ...)
  2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 07/12] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2) Alexey Kardashevskiy
@ 2016-06-01  8:57 ` Alexey Kardashevskiy
  2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 09/12] vfio: Add host side DMA window capabilities Alexey Kardashevskiy
                   ` (14 subsequent siblings)
  22 siblings, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-01  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson

This will later be used by the "ibm,reset-pe-dma-window" RTAS handler,
which resets the DMA configuration to the defaults.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr_pci.c          | 10 ++++++++--
 include/hw/pci-host/spapr.h |  2 ++
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 4a7be4d..68de523 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1485,9 +1485,8 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
     return 0;
 }
 
-static void spapr_phb_reset(DeviceState *qdev)
+void spapr_phb_dma_reset(sPAPRPHBState *sphb)
 {
-    sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
     sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
 
     if (tcet && tcet->nb_table) {
@@ -1497,6 +1496,13 @@ static void spapr_phb_reset(DeviceState *qdev)
     /* Register default 32bit DMA window */
     spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
                            sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT);
+}
+
+static void spapr_phb_reset(DeviceState *qdev)
+{
+    sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
+
+    spapr_phb_dma_reset(sphb);
 
     /* Reset the IOMMU state */
     object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL);
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 03ee006..7848366 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -147,4 +147,6 @@ static inline void spapr_phb_vfio_reset(DeviceState *qdev)
 }
 #endif
 
+void spapr_phb_dma_reset(sPAPRPHBState *sphb);
+
 #endif /* __HW_SPAPR_PCI_H__ */
-- 
2.5.0.rc3

* [Qemu-devel] [PATCH qemu v17 09/12] vfio: Add host side DMA window capabilities
       [not found] <1464771463-37214-1-git-send-email-aik@ozlabs.ru>
                   ` (7 preceding siblings ...)
  2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 08/12] spapr_pci: Add and export DMA resetting helper Alexey Kardashevskiy
@ 2016-06-01  8:57 ` Alexey Kardashevskiy
  2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 10/12] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2) Alexey Kardashevskiy
                   ` (13 subsequent siblings)
  22 siblings, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-01  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson

There are going to be multiple IOMMUs per container. This moves
the single host IOMMU parameter set into a list of VFIOHostDMAWindow.

This should cause no behavioral change and will be used later by
the SPAPR TCE IOMMU v2 which will also add a vfio_host_win_del() helper.
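
The containment test behind the window lookup can be sketched outside
of QEMU. This is a simplified illustration: struct host_window and
host_window_lookup() are hypothetical names, and a plain array stands
in for the QLIST that the patch uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified window; not the QEMU VFIOHostDMAWindow type. */
struct host_window {
    uint64_t min_iova;
    uint64_t max_iova;      /* inclusive upper bound */
    uint64_t iova_pgsizes;
};

/*
 * Return the first window that fully contains [min_iova, max_iova],
 * or NULL if no single window covers the whole range.
 */
static const struct host_window *
host_window_lookup(const struct host_window *wins, size_t nr,
                   uint64_t min_iova, uint64_t max_iova)
{
    for (size_t i = 0; i < nr; i++) {
        if (wins[i].min_iova <= min_iova && max_iova <= wins[i].max_iova) {
            return &wins[i];
        }
    }
    return NULL;
}
```

A range that straddles the end of a window is not matched, so a region
must fit entirely inside one host window to be mappable.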

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v17:
* vfio_host_win_add() uses vfio_host_win_lookup() for overlap check and
aborts if any is found instead of returning an error (as recovery is not
possible anyway)
* hw_error() when overlapped iommu is detected

v16:
* adjusted commit log with changes from v15

v15:
* s/vfio_host_iommu_add/vfio_host_win_add/
* s/VFIOHostIOMMU/VFIOHostDMAWindow/
---
 hw/vfio/common.c              | 59 +++++++++++++++++++++++++++++++------------
 include/hw/vfio/vfio-common.h |  9 +++++--
 2 files changed, 50 insertions(+), 18 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 770f630..52b08fd 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -29,6 +29,7 @@
 #include "exec/memory.h"
 #include "hw/hw.h"
 #include "qemu/error-report.h"
+#include "qemu/range.h"
 #include "sysemu/kvm.h"
 #ifdef CONFIG_KVM
 #include "linux/kvm.h"
@@ -242,6 +243,38 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
     return -errno;
 }
 
+static VFIOHostDMAWindow *vfio_host_win_lookup(VFIOContainer *container,
+                                               hwaddr min_iova, hwaddr max_iova)
+{
+    VFIOHostDMAWindow *hostwin;
+
+    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
+        if (hostwin->min_iova <= min_iova && max_iova <= hostwin->max_iova) {
+            return hostwin;
+        }
+    }
+
+    return NULL;
+}
+
+static void vfio_host_win_add(VFIOContainer *container,
+                             hwaddr min_iova, hwaddr max_iova,
+                             uint64_t iova_pgsizes)
+{
+    VFIOHostDMAWindow *hostwin;
+
+    if (vfio_host_win_lookup(container, min_iova, max_iova)) {
+        hw_error("%s: Overlapped IOMMU are not enabled", __func__);
+    }
+
+    hostwin = g_malloc0(sizeof(*hostwin));
+
+    hostwin->min_iova = min_iova;
+    hostwin->max_iova = max_iova;
+    hostwin->iova_pgsizes = iova_pgsizes;
+    QLIST_INSERT_HEAD(&container->hostwin_list, hostwin, hostwin_next);
+}
+
 static bool vfio_listener_skipped_section(MemoryRegionSection *section)
 {
     return (!memory_region_is_ram(section->mr) &&
@@ -355,7 +388,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
     }
     end = int128_get64(int128_sub(llend, int128_one()));
 
-    if ((iova < container->min_iova) || (end > container->max_iova)) {
+    if (!vfio_host_win_lookup(container, iova, end)) {
         error_report("vfio: IOMMU container %p can't map guest IOVA region"
                      " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
                      container, iova, end);
@@ -370,10 +403,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
 
         trace_vfio_listener_region_add_iommu(iova, end);
         /*
-         * FIXME: We should do some checking to see if the
-         * capabilities of the host VFIO IOMMU are adequate to model
-         * the guest IOMMU
-         *
          * FIXME: For VFIO iommu types which have KVM acceleration to
          * avoid bouncing all map/unmaps through qemu this way, this
          * would be the right place to wire that up (tell the KVM
@@ -880,17 +909,14 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
          * existing Type1 IOMMUs generally support any IOVA we're
          * going to actually try in practice.
          */
-        container->min_iova = 0;
-        container->max_iova = (hwaddr)-1;
-
-        /* Assume just 4K IOVA page size */
-        container->iova_pgsizes = 0x1000;
         info.argsz = sizeof(info);
         ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
         /* Ignore errors */
-        if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
-            container->iova_pgsizes = info.iova_pgsizes;
+        if (ret || !(info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
+            /* Assume 4k IOVA page size */
+            info.iova_pgsizes = 4096;
         }
+        vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes);
     } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
                ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
         struct vfio_iommu_spapr_tce_info info;
@@ -946,11 +972,12 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             ret = -errno;
             goto listener_release_exit;
         }
-        container->min_iova = info.dma32_window_start;
-        container->max_iova = container->min_iova + info.dma32_window_size - 1;
 
-        /* Assume just 4K IOVA pages for now */
-        container->iova_pgsizes = 0x1000;
+        /* The default table uses 4K pages */
+        vfio_host_win_add(container, info.dma32_window_start,
+                          info.dma32_window_start +
+                          info.dma32_window_size - 1,
+                          0x1000);
     } else {
         error_report("vfio: No available IOMMU models");
         ret = -EINVAL;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 405c3b2..c76ddc4 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -82,9 +82,8 @@ typedef struct VFIOContainer {
      * contiguous IOVA window.  We may need to generalize that in
      * future
      */
-    hwaddr min_iova, max_iova;
-    uint64_t iova_pgsizes;
     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
+    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
     QLIST_HEAD(, VFIOGroup) group_list;
     QLIST_ENTRY(VFIOContainer) next;
 } VFIOContainer;
@@ -97,6 +96,12 @@ typedef struct VFIOGuestIOMMU {
     QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
 } VFIOGuestIOMMU;
 
+typedef struct VFIOHostDMAWindow {
+    hwaddr min_iova, max_iova;
+    uint64_t iova_pgsizes;
+    QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next;
+} VFIOHostDMAWindow;
+
 typedef struct VFIODeviceOps VFIODeviceOps;
 
 typedef struct VFIODevice {
-- 
2.5.0.rc3

* [Qemu-devel] [PATCH qemu v17 10/12] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)
       [not found] <1464771463-37214-1-git-send-email-aik@ozlabs.ru>
                   ` (8 preceding siblings ...)
  2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 09/12] vfio: Add host side DMA window capabilities Alexey Kardashevskiy
@ 2016-06-01  8:57 ` Alexey Kardashevskiy
  2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 11/12] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
                   ` (12 subsequent siblings)
  22 siblings, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-01  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson

The new VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window
management. This adds the ability for VFIO common code to dynamically
allocate/remove DMA windows in the host kernel when a new VFIO
container is added/removed.

This adds the VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to
vfio_listener_region_add and adds the just-created IOMMU to the host
IOMMU list; the opposite action is taken in vfio_listener_region_del.

When creating a new window, this uses a heuristic to decide on the
number of TCE table levels.
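
As a rough illustration of the level buckets, the sketch below encodes
the ranges quoted in the comment in vfio_spapr_create_window()
(0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4) as explicit
comparisons. It does not reproduce the patch's bit arithmetic. The
function names are hypothetical, and it assumes the ranges count the
host pages a flat table of 8-byte TCEs would occupy.

```c
#include <stdint.h>

/*
 * Host pages needed by a single-level table of 8-byte TCEs covering
 * window_size bytes with IOMMU pages of (1 << page_shift) bytes.
 * Never returns less than one page.
 */
static uint64_t tce_table_pages(uint64_t window_size, unsigned page_shift,
                                uint64_t host_page_size)
{
    uint64_t entries = window_size >> page_shift;
    uint64_t pages = (entries * sizeof(uint64_t)) / host_page_size;

    return pages ? pages : 1;
}

/* Map a page count to a level count using the ranges from the comment. */
static unsigned tce_levels_for_pages(uint64_t pages)
{
    if (pages <= 64) {
        return 1;
    }
    if (pages <= 4096) {
        return 2;
    }
    if (pages <= 262144) {
        return 3;
    }
    return 4;
}
```

For example, a 2GB window with 4K IOMMU pages on a 64K-page host needs
64 table pages and lands in the one-level bucket, while a 1TB window
with 64K IOMMU pages needs 2048 table pages and lands in the two-level
bucket.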

This should cause no guest visible change in behavior.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v17:
* moved spapr window create/remove helpers to separate file
* added hw_error() if vfio_host_win_del() failed

v16:
* used memory_region_iommu_get_page_sizes() in vfio_listener_region_add()
* enforced no intersections between windows

v14:
* new to the series
---
 hw/vfio/common.c              | 76 +++++++++++++++++++++++++++++++++++++------
 hw/vfio/spapr.c               | 70 +++++++++++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-common.h |  6 ++++
 trace-events                  |  2 ++
 4 files changed, 144 insertions(+), 10 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 52b08fd..7f55c26 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -275,6 +275,18 @@ static void vfio_host_win_add(VFIOContainer *container,
     QLIST_INSERT_HEAD(&container->hostwin_list, hostwin, hostwin_next);
 }
 
+static int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova)
+{
+    VFIOHostDMAWindow *hostwin = vfio_host_win_lookup(container, min_iova, 1);
+
+    if (!hostwin) {
+        return -1;
+    }
+    QLIST_REMOVE(hostwin, hostwin_next);
+
+    return 0;
+}
+
 static bool vfio_listener_skipped_section(MemoryRegionSection *section)
 {
     return (!memory_region_is_ram(section->mr) &&
@@ -388,6 +400,30 @@ static void vfio_listener_region_add(MemoryListener *listener,
     }
     end = int128_get64(int128_sub(llend, int128_one()));
 
+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+        VFIOHostDMAWindow *hostwin;
+        hwaddr pgsize = 0;
+
+        /* For now intersections are not allowed, we may relax this later */
+        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
+            if (ranges_overlap(hostwin->min_iova,
+                               hostwin->max_iova - hostwin->min_iova + 1,
+                               section->offset_within_address_space,
+                               int128_get64(section->size))) {
+                goto fail;
+            }
+        }
+
+        ret = vfio_spapr_create_window(container, section, &pgsize);
+        if (ret) {
+            goto fail;
+        }
+
+        vfio_host_win_add(container, section->offset_within_address_space,
+                          section->offset_within_address_space +
+                          int128_get64(section->size) - 1, pgsize);
+    }
+
     if (!vfio_host_win_lookup(container, iova, end)) {
         error_report("vfio: IOMMU container %p can't map guest IOVA region"
                      " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
@@ -523,6 +559,18 @@ static void vfio_listener_region_del(MemoryListener *listener,
                      "0x%"HWADDR_PRIx") = %d (%m)",
                      container, iova, int128_get64(llsize), ret);
     }
+
+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+        vfio_spapr_remove_window(container,
+                                 section->offset_within_address_space);
+        if (vfio_host_win_del(container,
+                              section->offset_within_address_space) < 0) {
+            hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
+                     __func__, section->offset_within_address_space);
+        }
+
+        trace_vfio_spapr_remove_window(section->offset_within_address_space);
+    }
 }
 
 static const MemoryListener vfio_memory_listener = {
@@ -960,11 +1008,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             }
         }
 
-        /*
-         * This only considers the host IOMMU's 32-bit window.  At
-         * some point we need to add support for the optional 64-bit
-         * window and dynamic windows
-         */
         info.argsz = sizeof(info);
         ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
         if (ret) {
@@ -973,11 +1016,24 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             goto listener_release_exit;
         }
 
-        /* The default table uses 4K pages */
-        vfio_host_win_add(container, info.dma32_window_start,
-                          info.dma32_window_start +
-                          info.dma32_window_size - 1,
-                          0x1000);
+        if (v2) {
+            /*
+             * There is a default window in just created container.
+             * To make region_add/del simpler, we better remove this
+             * window now and let those iommu_listener callbacks
+             * create/remove them when needed.
+             */
+            ret = vfio_spapr_remove_window(container, info.dma32_window_start);
+            if (ret) {
+                goto free_container_exit;
+            }
+        } else {
+            /* The default table uses 4K pages */
+            vfio_host_win_add(container, info.dma32_window_start,
+                              info.dma32_window_start +
+                              info.dma32_window_size - 1,
+                              0x1000);
+        }
     } else {
         error_report("vfio: No available IOMMU models");
         ret = -EINVAL;
diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
index f339472..0c784c4 100644
--- a/hw/vfio/spapr.c
+++ b/hw/vfio/spapr.c
@@ -135,3 +135,73 @@ const MemoryListener vfio_prereg_listener = {
     .region_add = vfio_prereg_listener_region_add,
     .region_del = vfio_prereg_listener_region_del,
 };
+
+int vfio_spapr_create_window(VFIOContainer *container,
+                             MemoryRegionSection *section,
+                             hwaddr *pgsize)
+{
+    int ret;
+    unsigned pagesizes = memory_region_iommu_get_page_sizes(section->mr);
+    unsigned pagesize = (hwaddr)1 << ctz64(pagesizes);
+    unsigned entries, pages;
+    struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
+
+    /*
+     * FIXME: For VFIO iommu types which have KVM acceleration to
+     * avoid bouncing all map/unmaps through qemu this way, this
+     * would be the right place to wire that up (tell the KVM
+     * device emulation the VFIO iommu handles to use).
+     */
+    create.window_size = int128_get64(section->size);
+    create.page_shift = ctz64(pagesize);
+    /*
+     * SPAPR host supports multilevel TCE tables, there is some
+     * heuristic to decide how many levels we want for our table:
+     * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
+     */
+    entries = create.window_size >> create.page_shift;
+    pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
+    pages = MAX(pow2ceil(pages) - 1, 1); /* Round up */
+    create.levels = ctz64(pages) / 6 + 1;
+
+    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
+    if (ret) {
+        error_report("Failed to create a window, ret = %d (%m)", ret);
+        return -errno;
+    }
+
+    if (create.start_addr != section->offset_within_address_space) {
+        vfio_spapr_remove_window(container, create.start_addr);
+
+        error_report("Host doesn't support DMA window at %"HWADDR_PRIx", must be %"PRIx64,
+                     section->offset_within_address_space,
+                     create.start_addr);
+        ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
+        return -EINVAL;
+    }
+    trace_vfio_spapr_create_window(create.page_shift,
+                                   create.window_size,
+                                   create.start_addr);
+    *pgsize = pagesize;
+
+    return 0;
+}
+
+int vfio_spapr_remove_window(VFIOContainer *container,
+                             hwaddr offset_within_address_space)
+{
+    struct vfio_iommu_spapr_tce_remove remove = {
+        .argsz = sizeof(remove),
+        .start_addr = offset_within_address_space,
+    };
+    int ret;
+
+    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
+    if (ret) {
+        error_report("Failed to remove window at %"PRIx64,
+                     remove.start_addr);
+        return -errno;
+    }
+
+    return 0;
+}
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index c76ddc4..7e80382 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -167,4 +167,10 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
 #endif
 extern const MemoryListener vfio_prereg_listener;
 
+int vfio_spapr_create_window(VFIOContainer *container,
+                             MemoryRegionSection *section,
+                             hwaddr *pgsize);
+int vfio_spapr_remove_window(VFIOContainer *container,
+                             hwaddr offset_within_address_space);
+
 #endif /* !HW_VFIO_VFIO_COMMON_H */
diff --git a/trace-events b/trace-events
index ddb8676..ec32c20 100644
--- a/trace-events
+++ b/trace-events
@@ -1768,6 +1768,8 @@ vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sp
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
 vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
 vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
+vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
+vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
 
 # hw/vfio/platform.c
 vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v17 11/12] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
       [not found] <1464771463-37214-1-git-send-email-aik@ozlabs.ru>
                   ` (9 preceding siblings ...)
  2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 10/12] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2) Alexey Kardashevskiy
@ 2016-06-01  8:57 ` Alexey Kardashevskiy
  2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 12/12] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping listening Alexey Kardashevskiy
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-01  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson

This adds support for the Dynamic DMA Windows (DDW) option defined by
the SPAPR specification, which allows having additional DMA window(s).

The "ddw" property is enabled by default on a PHB but for compatibility
the pseries-2.5 machine (TODO: update version) and older disable it.
This also creates a single DMA window for the older machines to
maintain backward migration.

This implements DDW for PHB with emulated and VFIO devices. Host
kernel support is required. The advertised IOMMU page sizes are 4K and
64K; 16M pages are supported but not advertised by default. To enable
them, the user has to specify the "pgsz" property for the PHB and
enable huge pages for RAM.
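
As a rough illustration of the "pgsz" mask: each supported IOMMU page
size is represented by the bit at position page_shift, so the default
(4K and 64K) is (1 << 12) | (1 << 16), and enabling 16M pages adds
bit 24. The sketch below is not code from this patch; the macro names
and ddw_page_shift_supported() are hypothetical and only mirror the
mask check done in the create handler.

```c
#include <stdbool.h>
#include <stdint.h>

/* Default mask from this patch: 4K (shift 12) and 64K (shift 16) pages. */
#define DDW_PGSZ_DEFAULT  ((1ULL << 12) | (1ULL << 16))
/* Same mask with 16M (shift 24) pages added, e.g. set via "pgsz". */
#define DDW_PGSZ_WITH_16M (DDW_PGSZ_DEFAULT | (1ULL << 24))

/* Hypothetical helper: is a page of (1 << page_shift) bytes in the mask? */
static bool ddw_page_shift_supported(uint64_t page_size_mask,
                                     unsigned page_shift)
{
    if (page_shift >= 64) {
        return false; /* avoid an undefined shift on a bogus guest value */
    }
    return (page_size_mask & (1ULL << page_shift)) != 0;
}
```

With the default mask a request for 16M pages (shift 24) is rejected,
while a mask of 0x1011000 accepts it.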

Existing Linux guests try to create one additional huge DMA window
with 64K or 16MB pages and map the entire guest RAM to it. If this
succeeds, the guest switches to dma_direct_ops and never calls TCE
hypercalls (H_PUT_TCE,...) again. This enables VFIO devices to use the
entire RAM and not waste time on map/unmap later. This adds a
"dma64_win_addr" property which is a bus address for the 64bit window
and by default set to 0x800.0000.0000.0000 as this is what the modern
POWER8 hardware uses and this allows having emulated and VFIO devices
on the same bus.
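
To make the window arithmetic concrete, here is a small sketch of how
a window request translates into a TCE count and how the 64bit bus
address is handed back in two 32bit RTAS cells. It is only an
illustration under the assumptions stated in the comments; the helper
names ddw_nb_tces() and ddw_split_addr() are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Default "dma64_win_addr" from this patch (bit 59 set). */
#define DMA64_WIN_ADDR 0x800000000000000ULL

/* Number of TCEs needed for a window of (1 << window_shift) bytes. */
static uint64_t ddw_nb_tces(uint32_t page_shift, uint32_t window_shift)
{
    /* A window smaller than one page makes no sense. */
    assert(window_shift >= page_shift && window_shift - page_shift < 64);
    return 1ULL << (window_shift - page_shift);
}

/* RTAS cells are 32 bits wide, so a 64bit bus address takes two cells. */
static void ddw_split_addr(uint64_t addr, uint32_t *hi, uint32_t *lo)
{
    *hi = (uint32_t)(addr >> 32);
    *lo = (uint32_t)(addr & 0xffffffffULL);
}
```

For example, a 4GB window (window_shift 32) needs 65536 TCEs with 64K
pages but only 256 TCEs with 16M pages.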

This adds 4 RTAS handlers:
* ibm,query-pe-dma-window
* ibm,create-pe-dma-window
* ibm,remove-pe-dma-window
* ibm,reset-pe-dma-window
These are registered from a type_init() callback.
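
For reference, the argument/return cell counts that each handler in
this patch checks before doing anything else can be summarised as a
table. The sketch below is standalone and not part of the patch; the
struct and function names are hypothetical. It also shows how the
handlers assemble the 64bit BUID from two 32bit argument cells.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Expected number of argument and return cells for one RTAS call. */
struct ddw_call_shape {
    const char *name;
    uint32_t nargs;
    uint32_t nret;
};

/* Values taken from the nargs/nret checks in the four handlers. */
static const struct ddw_call_shape ddw_calls[] = {
    { "ibm,query-pe-dma-window",  3, 5 },
    { "ibm,create-pe-dma-window", 5, 4 },
    { "ibm,remove-pe-dma-window", 1, 1 },
    { "ibm,reset-pe-dma-window",  3, 1 },
};

/* Hypothetical helper: does a call carry the expected cell counts? */
static bool ddw_call_shape_ok(const char *name, uint32_t nargs, uint32_t nret)
{
    for (size_t i = 0; i < sizeof(ddw_calls) / sizeof(ddw_calls[0]); i++) {
        if (strcmp(ddw_calls[i].name, name) == 0) {
            return ddw_calls[i].nargs == nargs && ddw_calls[i].nret == nret;
        }
    }
    return false; /* unknown call name */
}

/* The BUID arrives as two 32bit cells: high word first, then low word. */
static uint64_t ddw_buid_from_cells(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}
```

A call with any other cell count gets RTAS_OUT_PARAM_ERROR from the
handlers in this patch.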

These RTAS handlers are implemented in a separate file to avoid polluting
spapr_iommu.c with PCI.

This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v17:
* fixed: "query" did return non-page-shifted value when memory hotplug is enabled

v16:
* s/dma_liobn/dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS]/
* s/SPAPR_PCI_LIOBN()/dma_liobn[]/

v15:
* moved page mask filtering to PHB realize(), use "-mempath" to know
if there are huge pages
* fixed error reporting in RTAS handlers
* max window size now accounts for hotpluggable memory boundaries
---
 hw/ppc/Makefile.objs        |   1 +
 hw/ppc/spapr.c              |   5 +
 hw/ppc/spapr_pci.c          |  77 +++++++++---
 hw/ppc/spapr_rtas_ddw.c     | 293 ++++++++++++++++++++++++++++++++++++++++++++
 include/hw/pci-host/spapr.h |   8 +-
 include/hw/ppc/spapr.h      |  16 ++-
 trace-events                |   4 +
 7 files changed, 383 insertions(+), 21 deletions(-)
 create mode 100644 hw/ppc/spapr_rtas_ddw.c

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index c1ffc77..986b36f 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
 ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
 obj-y += spapr_pci_vfio.o
 endif
+obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
 # PowerPC 4xx boards
 obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
 obj-y += ppc4xx_pci.o
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 44e401a..6ddcda9 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2366,6 +2366,11 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
         .driver   = "spapr-vlan", \
         .property = "use-rx-buffer-pools", \
         .value    = "off", \
+    }, \
+    {\
+        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
+        .property = "ddw",\
+        .value    = stringify(off),\
     },
 
 static void spapr_machine_2_5_instance_options(MachineState *machine)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 68de523..bcf0360 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -35,6 +35,7 @@
 #include "hw/ppc/spapr.h"
 #include "hw/pci-host/spapr.h"
 #include "exec/address-spaces.h"
+#include "exec/ram_addr.h"
 #include <libfdt.h>
 #include "trace.h"
 #include "qemu/error-report.h"
@@ -45,6 +46,7 @@
 #include "hw/ppc/spapr_drc.h"
 #include "sysemu/device_tree.h"
 #include "sysemu/kvm.h"
+#include "sysemu/hostmem.h"
 
 #include "hw/vfio/vfio.h"
 
@@ -1088,7 +1090,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
     int fdt_start_offset = 0, fdt_size;
 
     if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
-        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
+        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
 
         spapr_tce_set_need_vfio(tcet, true);
     }
@@ -1310,11 +1312,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     PCIBus *bus;
     uint64_t msi_window_size = 4096;
     sPAPRTCETable *tcet;
+    const unsigned windows_supported =
+        sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
 
     if (sphb->index != (uint32_t)-1) {
         hwaddr windows_base;
 
-        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn != (uint32_t)-1)
+        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
+            || ((sphb->dma_liobn[1] != (uint32_t)-1) && (windows_supported > 1))
             || (sphb->mem_win_addr != (hwaddr)-1)
             || (sphb->io_win_addr != (hwaddr)-1)) {
             error_setg(errp, "Either \"index\" or other parameters must"
@@ -1329,7 +1334,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         }
 
         sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
-        sphb->dma_liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
+        for (i = 0; i < windows_supported; ++i) {
+            sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
+        }
 
         windows_base = SPAPR_PCI_WINDOW_BASE
             + sphb->index * SPAPR_PCI_WINDOW_SPACING;
@@ -1342,8 +1349,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    if (sphb->dma_liobn == (uint32_t)-1) {
-        error_setg(errp, "LIOBN not specified for PHB");
+    if ((sphb->dma_liobn[0] == (uint32_t)-1) ||
+        ((sphb->dma_liobn[1] == (uint32_t)-1) && (windows_supported > 1))) {
+        error_setg(errp, "LIOBN(s) not specified for PHB");
         return;
     }
 
@@ -1461,16 +1469,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         }
     }
 
-    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
-    if (!tcet) {
-        error_setg(errp, "Unable to create TCE table for %s",
-                   sphb->dtbusname);
-        return;
+    /* DMA setup */
+    for (i = 0; i < windows_supported; ++i) {
+        tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn[i]);
+        if (!tcet) {
+            error_setg(errp, "Creating window#%d failed for %s",
+                       i, sphb->dtbusname);
+            return;
+        }
+        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
+                                            spapr_tce_get_iommu(tcet), 0);
     }
 
-    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
-                                        spapr_tce_get_iommu(tcet), 0);
-
     sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
 }
 
@@ -1487,13 +1497,19 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
 
 void spapr_phb_dma_reset(sPAPRPHBState *sphb)
 {
-    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
+    int i;
+    sPAPRTCETable *tcet;
 
-    if (tcet && tcet->nb_table) {
-        spapr_tce_table_disable(tcet);
+    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
+        tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[i]);
+
+        if (tcet && tcet->nb_table) {
+            spapr_tce_table_disable(tcet);
+        }
     }
 
     /* Register default 32bit DMA window */
+    tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[0]);
     spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
                            sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT);
 }
@@ -1515,7 +1531,8 @@ static void spapr_phb_reset(DeviceState *qdev)
 static Property spapr_phb_properties[] = {
     DEFINE_PROP_UINT32("index", sPAPRPHBState, index, -1),
     DEFINE_PROP_UINT64("buid", sPAPRPHBState, buid, -1),
-    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn, -1),
+    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn[0], -1),
+    DEFINE_PROP_UINT32("liobn64", sPAPRPHBState, dma_liobn[1], -1),
     DEFINE_PROP_UINT64("mem_win_addr", sPAPRPHBState, mem_win_addr, -1),
     DEFINE_PROP_UINT64("mem_win_size", sPAPRPHBState, mem_win_size,
                        SPAPR_PCI_MMIO_WIN_SIZE),
@@ -1527,6 +1544,11 @@ static Property spapr_phb_properties[] = {
     /* Default DMA window is 0..1GB */
     DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
     DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
+    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
+                       0x800000000000000ULL),
+    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
+    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
+                       (1ULL << 12) | (1ULL << 16)),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -1603,7 +1625,7 @@ static const VMStateDescription vmstate_spapr_pci = {
     .post_load = spapr_pci_post_load,
     .fields = (VMStateField[]) {
         VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),
-        VMSTATE_UINT32_EQUAL(dma_liobn, sPAPRPHBState),
+        VMSTATE_UNUSED(4), /* dma_liobn */
         VMSTATE_UINT64_EQUAL(mem_win_addr, sPAPRPHBState),
         VMSTATE_UINT64_EQUAL(mem_win_size, sPAPRPHBState),
         VMSTATE_UINT64_EQUAL(io_win_addr, sPAPRPHBState),
@@ -1780,6 +1802,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     uint32_t interrupt_map_mask[] = {
         cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
     uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
+    uint32_t ddw_applicable[] = {
+        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
+        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
+        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
+    };
+    uint32_t ddw_extensions[] = {
+        cpu_to_be32(1),
+        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
+    };
     sPAPRTCETable *tcet;
     PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
     sPAPRFDT s_fdt;
@@ -1804,6 +1835,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
     _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
 
+    /* Dynamic DMA window */
+    if (phb->ddw_enabled) {
+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
+                         sizeof(ddw_applicable)));
+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
+                         &ddw_extensions, sizeof(ddw_extensions)));
+    }
+
     /* Build the interrupt-map, this must matches what is done
      * in pci_spapr_map_irq
      */
@@ -1827,7 +1866,7 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
                      sizeof(interrupt_map)));
 
-    tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
+    tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
     if (!tcet) {
         return -1;
     }
diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
new file mode 100644
index 0000000..17bbae0
--- /dev/null
+++ b/hw/ppc/spapr_rtas_ddw.c
@@ -0,0 +1,293 @@
+/*
+ * QEMU sPAPR Dynamic DMA windows support
+ *
+ * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License,
+ *  or (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "cpu.h"
+#include "qemu/error-report.h"
+#include "hw/ppc/spapr.h"
+#include "hw/pci-host/spapr.h"
+#include "trace.h"
+
+static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
+{
+    sPAPRTCETable *tcet;
+
+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
+    if (tcet && tcet->nb_table) {
+        ++*(unsigned *)opaque;
+    }
+    return 0;
+}
+
+static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
+{
+    unsigned ret = 0;
+
+    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
+
+    return ret;
+}
+
+static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
+{
+    sPAPRTCETable *tcet;
+
+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
+    if (tcet && !tcet->nb_table) {
+        *(uint32_t *)opaque = tcet->liobn;
+        return 1;
+    }
+    return 0;
+}
+
+static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
+{
+    uint32_t liobn = 0;
+
+    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
+
+    return liobn;
+}
+
+static uint32_t spapr_page_mask_to_query_mask(uint64_t page_mask)
+{
+    int i;
+    uint32_t mask = 0;
+    const struct { int shift; uint32_t mask; } masks[] = {
+        { 12, RTAS_DDW_PGSIZE_4K },
+        { 16, RTAS_DDW_PGSIZE_64K },
+        { 24, RTAS_DDW_PGSIZE_16M },
+        { 25, RTAS_DDW_PGSIZE_32M },
+        { 26, RTAS_DDW_PGSIZE_64M },
+        { 27, RTAS_DDW_PGSIZE_128M },
+        { 28, RTAS_DDW_PGSIZE_256M },
+        { 34, RTAS_DDW_PGSIZE_16G },
+    };
+
+    for (i = 0; i < ARRAY_SIZE(masks); ++i) {
+        if (page_mask & (1ULL << masks[i].shift)) {
+            mask |= masks[i].mask;
+        }
+    }
+
+    return mask;
+}
+
+static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
+                                         sPAPRMachineState *spapr,
+                                         uint32_t token, uint32_t nargs,
+                                         target_ulong args,
+                                         uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    uint64_t buid, max_window_size;
+    uint32_t avail, addr, pgmask = 0;
+    MachineState *machine = MACHINE(spapr);
+
+    if ((nargs != 3) || (nret != 5)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    /* Translate page mask to LoPAPR format */
+    pgmask = spapr_page_mask_to_query_mask(sphb->page_size_mask);
+
+    /*
+     * This is "Largest contiguous block of TCEs allocated specifically
+     * for (that is, are reserved for) this PE".
+     * Return the maximum number of TCEs as if all supported RAM used 4K pages.
+     */
+    if (machine->ram_size == machine->maxram_size) {
+        max_window_size = machine->ram_size;
+    } else {
+        MemoryHotplugState *hpms = &spapr->hotplug_memory;
+
+        max_window_size = hpms->base + memory_region_size(&hpms->mr);
+    }
+
+    avail = SPAPR_PCI_DMA_MAX_WINDOWS - spapr_phb_get_active_win_num(sphb);
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    rtas_st(rets, 1, avail);
+    rtas_st(rets, 2, max_window_size >> SPAPR_TCE_PAGE_SHIFT);
+    rtas_st(rets, 3, pgmask);
+    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
+
+    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
+                                          sPAPRMachineState *spapr,
+                                          uint32_t token, uint32_t nargs,
+                                          target_ulong args,
+                                          uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    sPAPRTCETable *tcet = NULL;
+    uint32_t addr, page_shift, window_shift, liobn;
+    uint64_t buid;
+
+    if ((nargs != 5) || (nret != 4)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    page_shift = rtas_ld(args, 3);
+    window_shift = rtas_ld(args, 4);
+    liobn = spapr_phb_get_free_liobn(sphb);
+
+    if (!(sphb->page_size_mask & (1ULL << page_shift)) ||
+        (window_shift < page_shift)) {
+        goto param_error_exit;
+    }
+
+    if (!liobn || !sphb->ddw_enabled ||
+        spapr_phb_get_active_win_num(sphb) == SPAPR_PCI_DMA_MAX_WINDOWS) {
+        goto hw_error_exit;
+    }
+
+    tcet = spapr_tce_find_by_liobn(liobn);
+    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
+                                 1ULL << window_shift,
+                                 tcet ? tcet->bus_offset : 0xbaadf00d, liobn);
+    if (!tcet) {
+        goto hw_error_exit;
+    }
+
+    spapr_tce_table_enable(tcet, page_shift, sphb->dma64_window_addr,
+                           1ULL << (window_shift - page_shift));
+    if (!tcet->nb_table) {
+        goto hw_error_exit;
+    }
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    rtas_st(rets, 1, liobn);
+    rtas_st(rets, 2, tcet->bus_offset >> 32);
+    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
+
+    return;
+
+hw_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
+                                          sPAPRMachineState *spapr,
+                                          uint32_t token, uint32_t nargs,
+                                          target_ulong args,
+                                          uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    sPAPRTCETable *tcet;
+    uint32_t liobn;
+
+    if ((nargs != 1) || (nret != 1)) {
+        goto param_error_exit;
+    }
+
+    liobn = rtas_ld(args, 0);
+    tcet = spapr_tce_find_by_liobn(liobn);
+    if (!tcet) {
+        goto param_error_exit;
+    }
+
+    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
+    if (!sphb || !sphb->ddw_enabled || !tcet->nb_table) {
+        goto param_error_exit;
+    }
+
+    spapr_tce_table_disable(tcet);
+    trace_spapr_iommu_ddw_remove(liobn);
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
+                                         sPAPRMachineState *spapr,
+                                         uint32_t token, uint32_t nargs,
+                                         target_ulong args,
+                                         uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    uint64_t buid;
+    uint32_t addr;
+
+    if ((nargs != 3) || (nret != 1)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    spapr_phb_dma_reset(sphb);
+    trace_spapr_iommu_ddw_reset(buid, addr);
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void spapr_rtas_ddw_init(void)
+{
+    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
+                        "ibm,query-pe-dma-window",
+                        rtas_ibm_query_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
+                        "ibm,create-pe-dma-window",
+                        rtas_ibm_create_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
+                        "ibm,remove-pe-dma-window",
+                        rtas_ibm_remove_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
+                        "ibm,reset-pe-dma-window",
+                        rtas_ibm_reset_pe_dma_window);
+}
+
+type_init(spapr_rtas_ddw_init)
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 7848366..36a370e 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -32,6 +32,8 @@
 #define SPAPR_PCI_HOST_BRIDGE(obj) \
     OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
 
+#define SPAPR_PCI_DMA_MAX_WINDOWS    2
+
 typedef struct sPAPRPHBState sPAPRPHBState;
 
 typedef struct spapr_pci_msi {
@@ -56,7 +58,7 @@ struct sPAPRPHBState {
     hwaddr mem_win_addr, mem_win_size, io_win_addr, io_win_size;
     MemoryRegion memwindow, iowindow, msiwindow;
 
-    uint32_t dma_liobn;
+    uint32_t dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS];
     hwaddr dma_win_addr, dma_win_size;
     AddressSpace iommu_as;
     MemoryRegion iommu_root;
@@ -71,6 +73,10 @@ struct sPAPRPHBState {
     spapr_pci_msi_mig *msi_devs;
 
     QLIST_ENTRY(sPAPRPHBState) list;
+
+    bool ddw_enabled;
+    uint64_t page_size_mask;
+    uint64_t dma64_window_addr;
 };
 
 #define SPAPR_PCI_MAX_INDEX          255
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 971df3d..59fad22 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -412,6 +412,16 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
 #define RTAS_OUT_NOT_AUTHORIZED                 -9002
 #define RTAS_OUT_SYSPARM_PARAM_ERROR            -9999
 
+/* DDW pagesize mask values from ibm,query-pe-dma-window */
+#define RTAS_DDW_PGSIZE_4K       0x01
+#define RTAS_DDW_PGSIZE_64K      0x02
+#define RTAS_DDW_PGSIZE_16M      0x04
+#define RTAS_DDW_PGSIZE_32M      0x08
+#define RTAS_DDW_PGSIZE_64M      0x10
+#define RTAS_DDW_PGSIZE_128M     0x20
+#define RTAS_DDW_PGSIZE_256M     0x40
+#define RTAS_DDW_PGSIZE_16G      0x80
+
 /* RTAS tokens */
 #define RTAS_TOKEN_BASE      0x2000
 
@@ -453,8 +463,12 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
 #define RTAS_IBM_SET_SLOT_RESET                 (RTAS_TOKEN_BASE + 0x23)
 #define RTAS_IBM_CONFIGURE_PE                   (RTAS_TOKEN_BASE + 0x24)
 #define RTAS_IBM_SLOT_ERROR_DETAIL              (RTAS_TOKEN_BASE + 0x25)
+#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x26)
+#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x27)
+#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x28)
+#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x29)
 
-#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x26)
+#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x2A)
 
 /* RTAS ibm,get-system-parameter token values */
 #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
diff --git a/trace-events b/trace-events
index ec32c20..dec80e4 100644
--- a/trace-events
+++ b/trace-events
@@ -1433,6 +1433,10 @@ spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, un
 spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
 spapr_iommu_pre_save(uint64_t liobn, uint32_t nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
 spapr_iommu_post_load(uint64_t liobn, uint32_t pre_nb, uint32_t post_nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" => %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
+spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32
+spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, uint64_t pg_size, uint64_t req_size, uint64_t start, uint32_t liobn) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%"PRIx64", requested=0x%"PRIx64", start addr=%"PRIx64", liobn=%"PRIx32
+spapr_iommu_ddw_remove(uint32_t liobn) "liobn=%"PRIx32
+spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr) "buid=%"PRIx64" addr=%"PRIx32
 
 # hw/ppc/ppc.c
 ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
-- 
2.5.0.rc3


* [Qemu-devel] [PATCH qemu v17 12/12] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping listening
       [not found] <1464771463-37214-1-git-send-email-aik@ozlabs.ru>
                   ` (10 preceding siblings ...)
  2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 11/12] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
@ 2016-06-01  8:57 ` Alexey Kardashevskiy
       [not found] ` <201606010902.u518wwmb029353@mx0a-001b2d01.pphosted.com>
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-01  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, Alexander Graf, David Gibson,
	Alex Williamson

The sPAPR TCE tables manage 2 copies when VFIO is using an IOMMU -
a guest view of the table and a hardware TCE table. If there is no VFIO
presence in the address space, then just the guest view is used; in
this case, it is allocated in KVM. However, since there is no
support yet for VFIO in KVM TCE hypercalls, when we start using VFIO,
we need to move the guest view from KVM to userspace; and we need
to do this for every IOMMU on a bus with VFIO devices.

This adds notify_started/notify_stopped callbacks to MemoryRegionIOMMUOps
to notify the IOMMU that listeners were set/removed. This allows the IOMMU
to take the necessary steps before actual notifications happen and do
proper cleanup when the last notifier is removed.
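
The intended edge-triggered behaviour can be sketched in isolation as
follows. This is a simplified standalone model, not the code from this
patch: the types, the singly linked list and the demo callbacks are
all assumptions made for illustration, whereas the patch itself relies
on QEMU's notifier list and QLIST_EMPTY().

```c
#include <stdbool.h>
#include <stddef.h>

struct iommu_region;

/* Optional callbacks, invoked only on first-add / last-remove edges. */
struct iommu_ops {
    void (*notify_started)(struct iommu_region *r);
    void (*notify_stopped)(struct iommu_region *r);
};

struct notifier {
    struct notifier *next;
};

struct iommu_region {
    const struct iommu_ops *ops;
    struct notifier *notifiers; /* list head, NULL when nobody listens */
    bool need_vfio;             /* demo state toggled by the callbacks */
    unsigned transitions;       /* how many times a callback fired */
};

static void region_register_notifier(struct iommu_region *r,
                                     struct notifier *n)
{
    /* First listener: let the IOMMU prepare before any notification. */
    if (r->notifiers == NULL && r->ops->notify_started) {
        r->ops->notify_started(r);
    }
    n->next = r->notifiers;
    r->notifiers = n;
}

static void region_unregister_notifier(struct iommu_region *r,
                                       struct notifier *n)
{
    bool removed = false;

    for (struct notifier **p = &r->notifiers; *p; p = &(*p)->next) {
        if (*p == n) {
            *p = n->next;
            n->next = NULL;
            removed = true;
            break;
        }
    }
    /* Last listener gone: let the IOMMU clean up. */
    if (removed && r->notifiers == NULL && r->ops->notify_stopped) {
        r->ops->notify_stopped(r);
    }
}

/* Demo callbacks standing in for the sPAPR ones. */
static void demo_started(struct iommu_region *r)
{
    r->need_vfio = true;
    r->transitions++;
}

static void demo_stopped(struct iommu_region *r)
{
    r->need_vfio = false;
    r->transitions++;
}

static const struct iommu_ops demo_ops = {
    .notify_started = demo_started,
    .notify_stopped = demo_stopped,
};
```

Registering a second notifier or removing one of several does not
call either callback; only the empty/non-empty transitions do.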

This implements the callbacks for the sPAPR IOMMU - notify_started()
reallocates the guest view in userspace, notify_stopped() does
the opposite.

This removes the explicit spapr_tce_set_need_vfio() call from the PCI
hotplug path as the new callbacks do this better - they notify the IOMMU
at the exact moment when the configuration is changed, and this also
covers the case of PCI hot unplug.

This adds MemoryRegion* to memory_region_unregister_iommu_notifier()
as we need iommu_ops to call notify_stopped() and Notifier* does not
store the owner.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v17:
* replaced IOMMU users counting with simple QLIST_EMPTY()
* renamed the callbacks
* removed requirement for region_del() to be called on memory_listener_unregister()

v16:
* added a use counter in VFIOAddressSpace->VFIOIOMMUMR

v15:
* s/need_vfio/vfio-Users/g
---
 hw/ppc/spapr_iommu.c  | 12 ++++++++++++
 hw/ppc/spapr_pci.c    |  6 ------
 hw/vfio/common.c      |  5 +++--
 include/exec/memory.h |  8 +++++++-
 memory.c              | 10 +++++++++-
 5 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 90a45c0..994a8a0 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -156,6 +156,16 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
     return 1ULL << tcet->page_shift;
 }
 
+static void spapr_tce_notify_started(MemoryRegion *iommu)
+{
+    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), true);
+}
+
+static void spapr_tce_notify_stopped(MemoryRegion *iommu)
+{
+    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), false);
+}
+
 static int spapr_tce_table_post_load(void *opaque, int version_id)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
@@ -236,6 +246,8 @@ static const VMStateDescription vmstate_spapr_tce_table = {
 static MemoryRegionIOMMUOps spapr_iommu_ops = {
     .translate = spapr_tce_translate_iommu,
     .get_page_sizes = spapr_tce_get_page_sizes,
+    .notify_started = spapr_tce_notify_started,
+    .notify_stopped = spapr_tce_notify_stopped,
 };
 
 static int spapr_tce_table_realize(DeviceState *dev)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index bcf0360..06ce902 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1089,12 +1089,6 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
     void *fdt = NULL;
     int fdt_start_offset = 0, fdt_size;
 
-    if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
-        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
-
-        spapr_tce_set_need_vfio(tcet, true);
-    }
-
     fdt = create_device_tree(&fdt_size);
     fdt_start_offset = spapr_create_pci_child_dt(phb, pdev, fdt, 0);
     if (!fdt_start_offset) {
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 7f55c26..356640e 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -522,7 +522,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
 
         QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
             if (giommu->iommu == section->mr) {
-                memory_region_unregister_iommu_notifier(&giommu->n);
+                memory_region_unregister_iommu_notifier(giommu->iommu,
+                                                        &giommu->n);
                 QLIST_REMOVE(giommu, giommu_next);
                 g_free(giommu);
                 break;
@@ -1094,7 +1095,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
         QLIST_REMOVE(container, next);
 
         QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
-            memory_region_unregister_iommu_notifier(&giommu->n);
+            memory_region_unregister_iommu_notifier(giommu->iommu, &giommu->n);
             QLIST_REMOVE(giommu, giommu_next);
             g_free(giommu);
         }
diff --git a/include/exec/memory.h b/include/exec/memory.h
index bd9625f..f08439b 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -151,6 +151,10 @@ struct MemoryRegionIOMMUOps {
     IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
     /* Returns supported page sizes */
     uint64_t (*get_page_sizes)(MemoryRegion *iommu);
+    /* Called when the first notifier is set */
+    void (*notify_started)(MemoryRegion *iommu);
+    /* Called when the last notifier is removed */
+    void (*notify_stopped)(MemoryRegion *iommu);
 };
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
@@ -619,9 +623,11 @@ void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write);
  * memory_region_unregister_iommu_notifier: unregister a notifier for
  * changes to IOMMU translation entries.
  *
+ * @mr: the memory region which was observed and for which notify_stopped()
+ *      needs to be called
  * @n: the notifier to be removed.
  */
-void memory_region_unregister_iommu_notifier(Notifier *n);
+void memory_region_unregister_iommu_notifier(MemoryRegion *mr, Notifier *n);
 
 /**
  * memory_region_name: get a memory region's name
diff --git a/memory.c b/memory.c
index 761ae92..ee41649 100644
--- a/memory.c
+++ b/memory.c
@@ -1497,6 +1497,10 @@ bool memory_region_is_logging(MemoryRegion *mr, uint8_t client)
 
 void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n)
 {
+    if (mr->iommu_ops->notify_started &&
+        QLIST_EMPTY(&mr->iommu_notify.notifiers)) {
+        mr->iommu_ops->notify_started(mr);
+    }
     notifier_list_add(&mr->iommu_notify, n);
 }
 
@@ -1530,9 +1534,13 @@ void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write)
     }
 }
 
-void memory_region_unregister_iommu_notifier(Notifier *n)
+void memory_region_unregister_iommu_notifier(MemoryRegion *mr, Notifier *n)
 {
     notifier_remove(n);
+    if (mr->iommu_ops->notify_stopped &&
+        QLIST_EMPTY(&mr->iommu_notify.notifiers)) {
+        mr->iommu_ops->notify_stopped(mr);
+    }
 }
 
 void memory_region_notify_iommu(MemoryRegion *mr,
-- 
2.5.0.rc3

* Re: [Qemu-devel] [PATCH qemu v17 06/12] memory: Add reporting of supported page sizes
       [not found] ` <201606010902.u518wwmb029353@mx0a-001b2d01.pphosted.com>
@ 2016-06-02  3:35   ` David Gibson
  2016-06-06 13:31     ` Paolo Bonzini
  0 siblings, 1 reply; 38+ messages in thread
From: David Gibson @ 2016-06-02  3:35 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, Alex Williamson, Paolo Bonzini

On Wed, Jun 01, 2016 at 06:57:37PM +1000, Alexey Kardashevskiy wrote:
> Every IOMMU has some granularity which MemoryRegionIOMMUOps::translate
> uses when translating, however this information is not available outside
> the translate context for various checks.
> 
> This adds a get_page_sizes callback to MemoryRegionIOMMUOps and
> a wrapper for it so IOMMU users (such as VFIO) can know the actual
> page size(s) used by an IOMMU.
> 
> As IOMMU MR represents a guest IOMMU, this uses TARGET_PAGE_SIZE
> as fallback.
> 
> This removes vfio_container_granularity() and uses new helper in
> memory_region_iommu_replay() when replaying IOMMU mappings on added
> IOMMU memory region.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Paolo,

Looks like you were left off the CC for this one.

I think this is ready to go - do you want to merge, comment or ack and
we'll take it either through my tree or Alex's?
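
For readers unfamiliar with the page-size bitmap convention used in the
quoted patch: each set bit N means a page size of 1 << N bytes is
supported, and the replay granularity is the smallest of them. A
standalone sketch of that derivation follows; min_page_size() is a
hypothetical name, and the lowest set bit is isolated with plain
arithmetic here rather than QEMU's ctz64().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the smallest page size encoded in a bitmap where bit N set
 * means "pages of (1 << N) bytes are supported".
 * x & (~x + 1) isolates the lowest set bit, which for a non-zero
 * bitmap equals (uint64_t)1 << ctz64(x).
 */
static uint64_t min_page_size(uint64_t page_sizes)
{
    assert(page_sizes != 0); /* an empty bitmap has no minimum */
    return page_sizes & (~page_sizes + 1);
}
```

For a bitmap advertising 4K, 64K and 16M pages this yields 4K, so a
replay loop stepping by that value visits every possible mapping.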

> ---
> Changes:
> v16:
> * used memory_region_iommu_get_page_sizes() instead of
> mr->iommu_ops->get_page_sizes() in memory_region_iommu_replay()
> 
> v15:
> * s/qemu_real_host_page_size/TARGET_PAGE_SIZE/ in memory_region_iommu_get_page_sizes
> 
> v14:
> * removed vfio_container_granularity(), changed memory_region_iommu_replay()
> 
> v4:
> * s/1<<TARGET_PAGE_BITS/qemu_real_host_page_size/
> ---
>  hw/ppc/spapr_iommu.c  |  8 ++++++++
>  hw/vfio/common.c      |  6 ------
>  include/exec/memory.h | 18 ++++++++++++++----
>  memory.c              | 16 +++++++++++++---
>  4 files changed, 35 insertions(+), 13 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index a3cc572..90a45c0 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -149,6 +149,13 @@ static void spapr_tce_table_pre_save(void *opaque)
>                                 tcet->bus_offset, tcet->page_shift);
>  }
>  
> +static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
> +{
> +    sPAPRTCETable *tcet = container_of(iommu, sPAPRTCETable, iommu);
> +
> +    return 1ULL << tcet->page_shift;
> +}
> +
>  static int spapr_tce_table_post_load(void *opaque, int version_id)
>  {
>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> @@ -228,6 +235,7 @@ static const VMStateDescription vmstate_spapr_tce_table = {
>  
>  static MemoryRegionIOMMUOps spapr_iommu_ops = {
>      .translate = spapr_tce_translate_iommu,
> +    .get_page_sizes = spapr_tce_get_page_sizes,
>  };
>  
>  static int spapr_tce_table_realize(DeviceState *dev)
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index e51ed3a..f1a12b0 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -322,11 +322,6 @@ out:
>      rcu_read_unlock();
>  }
>  
> -static hwaddr vfio_container_granularity(VFIOContainer *container)
> -{
> -    return (hwaddr)1 << ctz64(container->iova_pgsizes);
> -}
> -
>  static void vfio_listener_region_add(MemoryListener *listener,
>                                       MemoryRegionSection *section)
>  {
> @@ -394,7 +389,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
>  
>          memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
>          memory_region_iommu_replay(giommu->iommu, &giommu->n,
> -                                   vfio_container_granularity(container),
>                                     false);
>  
>          return;
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index f649697..bd9625f 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -149,6 +149,8 @@ typedef struct MemoryRegionIOMMUOps MemoryRegionIOMMUOps;
>  struct MemoryRegionIOMMUOps {
>      /* Return a TLB entry that contains a given address. */
>      IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
> +    /* Returns supported page sizes */
> +    uint64_t (*get_page_sizes)(MemoryRegion *iommu);
>  };
>  
>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> @@ -571,6 +573,15 @@ static inline bool memory_region_is_iommu(MemoryRegion *mr)
>  
>  
>  /**
> + * memory_region_iommu_get_page_sizes: get supported page sizes in an iommu
> + *
> + * Returns %bitmap of supported page sizes for an iommu.
> + *
> + * @mr: the memory region being queried
> + */
> +uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr);
> +
> +/**
>   * memory_region_notify_iommu: notify a change in an IOMMU translation entry.
>   *
>   * @mr: the memory region that was changed
> @@ -594,16 +605,15 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n);
>  
>  /**
>   * memory_region_iommu_replay: replay existing IOMMU translations to
> - * a notifier
> + * a notifier with the minimum page granularity returned by
> + * mr->iommu_ops->get_page_sizes().
>   *
>   * @mr: the memory region to observe
>   * @n: the notifier to which to replay iommu mappings
> - * @granularity: Minimum page granularity to replay notifications for
>   * @is_write: Whether to treat the replay as a translate "write"
>   *     through the iommu
>   */
> -void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n,
> -                                hwaddr granularity, bool is_write);
> +void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write);
>  
>  /**
>   * memory_region_unregister_iommu_notifier: unregister a notifier for
> diff --git a/memory.c b/memory.c
> index 4e3cda8..761ae92 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1500,12 +1500,22 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n)
>      notifier_list_add(&mr->iommu_notify, n);
>  }
>  
> -void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n,
> -                                hwaddr granularity, bool is_write)
> +uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr)
>  {
> -    hwaddr addr;
> +    assert(memory_region_is_iommu(mr));
> +    if (mr->iommu_ops && mr->iommu_ops->get_page_sizes) {
> +        return mr->iommu_ops->get_page_sizes(mr);
> +    }
> +    return TARGET_PAGE_SIZE;
> +}
> +
> +void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write)
> +{
> +    hwaddr addr, granularity;
>      IOMMUTLBEntry iotlb;
>  
> +    granularity = (hwaddr)1 << ctz64(memory_region_iommu_get_page_sizes(mr));
> +
>      for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
>          iotlb = mr->iommu_ops->translate(mr, addr, is_write);
>          if (iotlb.perm != IOMMU_NONE) {

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v17 07/12] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2)
       [not found] ` <201606010900.u518wvH7046287@mx0a-001b2d01.pphosted.com>
@ 2016-06-02  4:18   ` David Gibson
  0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2016-06-02  4:18 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, Alex Williamson

On Wed, Jun 01, 2016 at 06:57:38PM +1000, Alexey Kardashevskiy wrote:
> This makes use of the new "memory registering" feature. The idea is
> to provide the userspace ability to notify the host kernel about pages
> which are going to be used for DMA. Having this information, the host
> kernel can pin them all once per user process, do locked pages
> accounting (once) and not spent time on doing that in real time with
> possible failures which cannot be handled nicely in some cases.
> 
> This adds a prereg memory listener which listens on address_space_memory
> and notifies a VFIO container about memory which needs to be
> pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
> 
> As there is no per-IOMMU-type release() callback anymore, this stores
> the IOMMU type in the container so vfio_listener_release() can determine
> if it needs to unregister @prereg_listener.
> 
> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> not call it when v2 is detected and enabled.
> 
> This enforces guest RAM blocks to be host page size aligned; however
> this is not new as KVM already requires memory slots to be host page
> size aligned.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Alex, do you think this is ready to go now?

> ---
> Changes:
> v17:
> * s/prereg\.c/spapr.c/
> * s/vfio_prereg_gpa_to_ua/vfio_prereg_gpa_to_vaddr/
> * vfio_prereg_listener_skipped_section does hw_error() on IOMMUs
> 
> v16:
> * switched to 64bit math everywhere as there is no chance to see
> region_add on RAM blocks even remotely close to 1<<64bytes.
> 
> v15:
> * banned unaligned sections
> * added an vfio_prereg_gpa_to_ua() helper
> 
> v14:
> * s/free_container_exit/listener_release_exit/g
> * added "if memory_region_is_iommu()" to vfio_prereg_listener_skipped_section
> ---
>  hw/vfio/Makefile.objs         |   1 +
>  hw/vfio/common.c              |  38 +++++++++---
>  hw/vfio/spapr.c               | 137 ++++++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |   4 ++
>  trace-events                  |   2 +
>  5 files changed, 172 insertions(+), 10 deletions(-)
>  create mode 100644 hw/vfio/spapr.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> index ceddbb8..c25e32b 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
>  obj-$(CONFIG_SOFTMMU) += platform.o
>  obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
>  obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
> +obj-$(CONFIG_SOFTMMU) += spapr.o
>  endif
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index f1a12b0..770f630 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -504,6 +504,9 @@ static const MemoryListener vfio_memory_listener = {
>  static void vfio_listener_release(VFIOContainer *container)
>  {
>      memory_listener_unregister(&container->listener);
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +        memory_listener_unregister(&container->prereg_listener);
> +    }
>  }
>  
>  static struct vfio_info_cap_header *
> @@ -862,8 +865,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              goto free_container_exit;
>          }
>  
> -        ret = ioctl(fd, VFIO_SET_IOMMU,
> -                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
> +        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>          if (ret) {
>              error_report("vfio: failed to set iommu for container: %m");
>              ret = -errno;
> @@ -888,8 +891,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>          if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
>              container->iova_pgsizes = info.iova_pgsizes;
>          }
> -    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
> +               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
>          struct vfio_iommu_spapr_tce_info info;
> +        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
>  
>          ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>          if (ret) {
> @@ -897,7 +902,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              ret = -errno;
>              goto free_container_exit;
>          }
> -        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
> +        container->iommu_type =
> +            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>          if (ret) {
>              error_report("vfio: failed to set iommu for container: %m");
>              ret = -errno;
> @@ -909,11 +916,22 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>           * when container fd is closed so we do not call it explicitly
>           * in this file.
>           */
> -        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> -        if (ret) {
> -            error_report("vfio: failed to enable container: %m");
> -            ret = -errno;
> -            goto free_container_exit;
> +        if (!v2) {
> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> +            if (ret) {
> +                error_report("vfio: failed to enable container: %m");
> +                ret = -errno;
> +                goto free_container_exit;
> +            }
> +        } else {
> +            container->prereg_listener = vfio_prereg_listener;
> +
> +            memory_listener_register(&container->prereg_listener,
> +                                     &address_space_memory);
> +            if (container->error) {
> +                error_report("vfio: RAM memory listener initialization failed for container");
> +                goto listener_release_exit;
> +            }
>          }
>  
>          /*
> @@ -926,7 +944,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>          if (ret) {
>              error_report("vfio: VFIO_IOMMU_SPAPR_TCE_GET_INFO failed: %m");
>              ret = -errno;
> -            goto free_container_exit;
> +            goto listener_release_exit;
>          }
>          container->min_iova = info.dma32_window_start;
>          container->max_iova = container->min_iova + info.dma32_window_size - 1;
> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
> new file mode 100644
> index 0000000..f339472
> --- /dev/null
> +++ b/hw/vfio/spapr.c
> @@ -0,0 +1,137 @@
> +/*
> + * DMA memory preregistration
> + *
> + * Authors:
> + *  Alexey Kardashevskiy <aik@ozlabs.ru>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "cpu.h"
> +#include <sys/ioctl.h>
> +#include <linux/vfio.h>
> +
> +#include "hw/vfio/vfio-common.h"
> +#include "hw/hw.h"
> +#include "qemu/error-report.h"
> +#include "trace.h"
> +
> +static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
> +{
> +    if (memory_region_is_iommu(section->mr)) {
> +        hw_error("Cannot possibly preregister IOMMU memory");
> +    }
> +
> +    return !memory_region_is_ram(section->mr) ||
> +            memory_region_is_skip_dump(section->mr);
> +}
> +
> +static void *vfio_prereg_gpa_to_vaddr(MemoryRegionSection *section, hwaddr gpa)
> +{
> +    return memory_region_get_ram_ptr(section->mr) +
> +        section->offset_within_region +
> +        (gpa - section->offset_within_address_space);
> +}
> +
> +static void vfio_prereg_listener_region_add(MemoryListener *listener,
> +                                            MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> +                                            prereg_listener);
> +    const hwaddr gpa = section->offset_within_address_space;
> +    hwaddr end;
> +    int ret;
> +    hwaddr page_mask = qemu_real_host_page_mask;
> +    struct vfio_iommu_spapr_register_memory reg = {
> +        .argsz = sizeof(reg),
> +        .flags = 0,
> +    };
> +
> +    if (vfio_prereg_listener_skipped_section(section)) {
> +        trace_vfio_listener_region_add_skip(
> +                section->offset_within_address_space,
> +                section->offset_within_address_space +
> +                int128_get64(int128_sub(section->size, int128_one())));
> +        return;
> +    }
> +
> +    if (unlikely((section->offset_within_address_space & ~page_mask) ||
> +                 (section->offset_within_region & ~page_mask) ||
> +                 (int128_get64(section->size) & ~page_mask))) {
> +        error_report("%s received unaligned region", __func__);
> +        return;
> +    }
> +
> +    end = section->offset_within_address_space + int128_get64(section->size);
> +    g_assert(gpa < end);
> +
> +    memory_region_ref(section->mr);
> +
> +    reg.vaddr = (__u64) vfio_prereg_gpa_to_vaddr(section, gpa);
> +    reg.size = end - gpa;
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
> +    trace_vfio_ram_register(reg.vaddr, reg.size, ret ? -errno : 0);
> +    if (ret) {
> +        /*
> +         * On the initfn path, store the first error in the container so we
> +         * can gracefully fail.  Runtime, there's not much we can do other
> +         * than throw a hardware error.
> +         */
> +        if (!container->initialized) {
> +            if (!container->error) {
> +                container->error = ret;
> +            }
> +        } else {
> +            hw_error("vfio: Memory registering failed, unable to continue");
> +        }
> +    }
> +}
> +
> +static void vfio_prereg_listener_region_del(MemoryListener *listener,
> +                                            MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> +                                            prereg_listener);
> +    const hwaddr gpa = section->offset_within_address_space;
> +    hwaddr end;
> +    int ret;
> +    hwaddr page_mask = qemu_real_host_page_mask;
> +    struct vfio_iommu_spapr_register_memory reg = {
> +        .argsz = sizeof(reg),
> +        .flags = 0,
> +    };
> +
> +    if (vfio_prereg_listener_skipped_section(section)) {
> +        trace_vfio_listener_region_del_skip(
> +                section->offset_within_address_space,
> +                section->offset_within_address_space +
> +                int128_get64(int128_sub(section->size, int128_one())));
> +        return;
> +    }
> +
> +    if (unlikely((section->offset_within_address_space & ~page_mask) ||
> +                 (section->offset_within_region & ~page_mask) ||
> +                 (int128_get64(section->size) & ~page_mask))) {
> +        error_report("%s received unaligned region", __func__);
> +        return;
> +    }
> +
> +    end = section->offset_within_address_space + int128_get64(section->size);
> +    if (gpa >= end) {
> +        return;
> +    }
> +
> +    reg.vaddr = (__u64) vfio_prereg_gpa_to_vaddr(section, gpa);
> +    reg.size = end - gpa;
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
> +    trace_vfio_ram_unregister(reg.vaddr, reg.size, ret ? -errno : 0);
> +}
> +
> +const MemoryListener vfio_prereg_listener = {
> +    .region_add = vfio_prereg_listener_region_add,
> +    .region_del = vfio_prereg_listener_region_del,
> +};
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 0610377..405c3b2 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -73,6 +73,8 @@ typedef struct VFIOContainer {
>      VFIOAddressSpace *space;
>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>      MemoryListener listener;
> +    MemoryListener prereg_listener;
> +    unsigned iommu_type;
>      int error;
>      bool initialized;
>      /*
> @@ -158,4 +160,6 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>                               uint32_t subtype, struct vfio_region_info **info);
>  #endif
> +extern const MemoryListener vfio_prereg_listener;
> +
>  #endif /* !HW_VFIO_VFIO_COMMON_H */
> diff --git a/trace-events b/trace-events
> index de42012..ddb8676 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1766,6 +1766,8 @@ vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps e
>  vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Device %s region %d: %d sparse mmap entries"
>  vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
>  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
> +vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> +vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>  
>  # hw/vfio/platform.c
>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v17 08/12] spapr_pci: Add and export DMA resetting helper
       [not found] ` <201606010902.u518x15j023604@mx0a-001b2d01.pphosted.com>
@ 2016-06-02  4:19   ` David Gibson
  0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2016-06-02  4:19 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, Alex Williamson

On Wed, Jun 01, 2016 at 06:57:39PM +1000, Alexey Kardashevskiy wrote:
> This will be later used by the "ibm,reset-pe-dma-window" RTAS handler
> which resets the DMA configuration to the defaults.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Should be safe even without the rest of the series, so I've merged to
ppc-for-2.7.

> ---
>  hw/ppc/spapr_pci.c          | 10 ++++++++--
>  include/hw/pci-host/spapr.h |  2 ++
>  2 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 4a7be4d..68de523 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -1485,9 +1485,8 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
>      return 0;
>  }
>  
> -static void spapr_phb_reset(DeviceState *qdev)
> +void spapr_phb_dma_reset(sPAPRPHBState *sphb)
>  {
> -    sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
>      sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
>  
>      if (tcet && tcet->nb_table) {
> @@ -1497,6 +1496,13 @@ static void spapr_phb_reset(DeviceState *qdev)
>      /* Register default 32bit DMA window */
>      spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
>                             sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT);
> +}
> +
> +static void spapr_phb_reset(DeviceState *qdev)
> +{
> +    sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
> +
> +    spapr_phb_dma_reset(sphb);
>  
>      /* Reset the IOMMU state */
>      object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL);
> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> index 03ee006..7848366 100644
> --- a/include/hw/pci-host/spapr.h
> +++ b/include/hw/pci-host/spapr.h
> @@ -147,4 +147,6 @@ static inline void spapr_phb_vfio_reset(DeviceState *qdev)
>  }
>  #endif
>  
> +void spapr_phb_dma_reset(sPAPRPHBState *sphb);
> +
>  #endif /* __HW_SPAPR_PCI_H__ */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v17 09/12] vfio: Add host side DMA window capabilities
       [not found] ` <201606010901.u518wwEL029369@mx0a-001b2d01.pphosted.com>
@ 2016-06-03  7:23   ` David Gibson
  0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2016-06-03  7:23 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, Alex Williamson

On Wed, Jun 01, 2016 at 06:57:40PM +1000, Alexey Kardashevskiy wrote:
> There are going to be multiple IOMMUs per a container. This moves
> the single host IOMMU parameter set to a list of VFIOHostDMAWindow.
> 
> This should cause no behavioral change and will be used later by
> the SPAPR TCE IOMMU v2 which will also add a vfio_host_win_del() helper.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> Changes:
> v17:
> * vfio_host_win_add() uses vfio_host_win_lookup() for overlap check and
> aborts if any found instead of returning an error (as recovery is not
> possible anyway)
> * hw_error() when overlapped iommu is detected
> 
> v16:
> * adjusted commit log with changes from v15
> 
> v15:
> * s/vfio_host_iommu_add/vfio_host_win_add/
> * s/VFIOHostIOMMU/VFIOHostDMAWindow/
> ---
>  hw/vfio/common.c              | 59 +++++++++++++++++++++++++++++++------------
>  include/hw/vfio/vfio-common.h |  9 +++++--
>  2 files changed, 50 insertions(+), 18 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 770f630..52b08fd 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -29,6 +29,7 @@
>  #include "exec/memory.h"
>  #include "hw/hw.h"
>  #include "qemu/error-report.h"
> +#include "qemu/range.h"
>  #include "sysemu/kvm.h"
>  #ifdef CONFIG_KVM
>  #include "linux/kvm.h"
> @@ -242,6 +243,38 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>      return -errno;
>  }
>  
> +static VFIOHostDMAWindow *vfio_host_win_lookup(VFIOContainer *container,
> +                                               hwaddr min_iova, hwaddr max_iova)
> +{
> +    VFIOHostDMAWindow *hostwin;
> +
> +    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
> +        if (hostwin->min_iova <= min_iova && max_iova <= hostwin->max_iova) {

This is not an overlap test, but a strict inclusion test..

> +            return hostwin;
> +        }
> +    }
> +
> +    return NULL;
> +}
> +
> +static void vfio_host_win_add(VFIOContainer *container,
> +                             hwaddr min_iova, hwaddr max_iova,
> +                             uint64_t iova_pgsizes)
> +{
> +    VFIOHostDMAWindow *hostwin;
> +
> +    if (vfio_host_win_lookup(container, min_iova, max_iova)) {

..which means this no longer catches (partially) overlapping regions.
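
To make the distinction concrete, here is a minimal standalone sketch of
the two predicates on inclusive [min, max] ranges.  The Win type and the
win_includes()/win_overlaps() names are hypothetical stand-ins for
illustration, not code from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a host DMA window; both bounds are inclusive. */
typedef struct {
    uint64_t min_iova;
    uint64_t max_iova;
} Win;

/* Inclusion: the whole range [min, max] lies inside the window. */
static bool win_includes(const Win *w, uint64_t min, uint64_t max)
{
    return w->min_iova <= min && max <= w->max_iova;
}

/* Overlap: [min, max] shares at least one address with the window. */
static bool win_overlaps(const Win *w, uint64_t min, uint64_t max)
{
    return w->min_iova <= max && min <= w->max_iova;
}
```

A range that straddles the end of an existing window fails the inclusion
test but passes the overlap test, which is the case an "add" path
presumably wants to reject.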

> +        hw_error("%s: Overlapped IOMMU are not enabled", __func__);
> +    }
> +
> +    hostwin = g_malloc0(sizeof(*hostwin));
> +
> +    hostwin->min_iova = min_iova;
> +    hostwin->max_iova = max_iova;
> +    hostwin->iova_pgsizes = iova_pgsizes;
> +    QLIST_INSERT_HEAD(&container->hostwin_list, hostwin, hostwin_next);
> +}
> +
>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>  {
>      return (!memory_region_is_ram(section->mr) &&
> @@ -355,7 +388,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
>      }
>      end = int128_get64(int128_sub(llend, int128_one()));
>  
> -    if ((iova < container->min_iova) || (end > container->max_iova)) {
> +    if (!vfio_host_win_lookup(container, iova, end)) {
>          error_report("vfio: IOMMU container %p can't map guest IOVA region"
>                       " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
>                       container, iova, end);
> @@ -370,10 +403,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
>  
>          trace_vfio_listener_region_add_iommu(iova, end);
>          /*
> -         * FIXME: We should do some checking to see if the
> -         * capabilities of the host VFIO IOMMU are adequate to model
> -         * the guest IOMMU
> -         *
>           * FIXME: For VFIO iommu types which have KVM acceleration to
>           * avoid bouncing all map/unmaps through qemu this way, this
>           * would be the right place to wire that up (tell the KVM
> @@ -880,17 +909,14 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>           * existing Type1 IOMMUs generally support any IOVA we're
>           * going to actually try in practice.
>           */
> -        container->min_iova = 0;
> -        container->max_iova = (hwaddr)-1;
> -
> -        /* Assume just 4K IOVA page size */
> -        container->iova_pgsizes = 0x1000;
>          info.argsz = sizeof(info);
>          ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
>          /* Ignore errors */
> -        if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
> -            container->iova_pgsizes = info.iova_pgsizes;
> +        if (ret || !(info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
> +            /* Assume 4k IOVA page size */
> +            info.iova_pgsizes = 4096;
>          }
> +        vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes);
>      } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
>                 ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
>          struct vfio_iommu_spapr_tce_info info;
> @@ -946,11 +972,12 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              ret = -errno;
>              goto listener_release_exit;
>          }
> -        container->min_iova = info.dma32_window_start;
> -        container->max_iova = container->min_iova + info.dma32_window_size - 1;
>  
> -        /* Assume just 4K IOVA pages for now */
> -        container->iova_pgsizes = 0x1000;
> +        /* The default table uses 4K pages */
> +        vfio_host_win_add(container, info.dma32_window_start,
> +                          info.dma32_window_start +
> +                          info.dma32_window_size - 1,
> +                          0x1000);
>      } else {
>          error_report("vfio: No available IOMMU models");
>          ret = -EINVAL;
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 405c3b2..c76ddc4 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -82,9 +82,8 @@ typedef struct VFIOContainer {
>       * contiguous IOVA window.  We may need to generalize that in
>       * future
>       */
> -    hwaddr min_iova, max_iova;
> -    uint64_t iova_pgsizes;
>      QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
> +    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>      QLIST_HEAD(, VFIOGroup) group_list;
>      QLIST_ENTRY(VFIOContainer) next;
>  } VFIOContainer;
> @@ -97,6 +96,12 @@ typedef struct VFIOGuestIOMMU {
>      QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>  } VFIOGuestIOMMU;
>  
> +typedef struct VFIOHostDMAWindow {
> +    hwaddr min_iova, max_iova;
> +    uint64_t iova_pgsizes;
> +    QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next;
> +} VFIOHostDMAWindow;
> +
>  typedef struct VFIODeviceOps VFIODeviceOps;
>  
>  typedef struct VFIODevice {

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH qemu v17 10/12] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)
       [not found] ` <201606011012.u51A9A6i023070@mx0a-001b2d01.pphosted.com>
@ 2016-06-03  7:37   ` David Gibson
  2016-06-06  6:45     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 38+ messages in thread
From: David Gibson @ 2016-06-03  7:37 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, Alex Williamson

On Wed, Jun 01, 2016 at 06:57:41PM +1000, Alexey Kardashevskiy wrote:
> New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
> This adds ability to VFIO common code to dynamically allocate/remove
> DMA windows in the host kernel when new VFIO container is added/removed.
> 
> This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
> and adds just created IOMMU into the host IOMMU list; the opposite
> action is taken in vfio_listener_region_del.
> 
> When creating a new window, this uses heuristic to decide on the TCE table
> levels number.
> 
> This should cause no guest visible change in behavior.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v17:
> * moved spapr window create/remove helpers to separate file
> * added hw_error() if vfio_host_win_del() failed
> 
> v16:
> * used memory_region_iommu_get_page_sizes() in vfio_listener_region_add()
> * enforced no intersections between windows
> 
> v14:
> * new to the series
> ---
>  hw/vfio/common.c              | 76 +++++++++++++++++++++++++++++++++++++------
>  hw/vfio/spapr.c               | 70 +++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |  6 ++++
>  trace-events                  |  2 ++
>  4 files changed, 144 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 52b08fd..7f55c26 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -275,6 +275,18 @@ static void vfio_host_win_add(VFIOContainer *container,
>      QLIST_INSERT_HEAD(&container->hostwin_list, hostwin, hostwin_next);
>  }
>  
> +static int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova)
> +{
> +    VFIOHostDMAWindow *hostwin = vfio_host_win_lookup(container, min_iova, 1);

Hrm.. and for this case I think you want exact match, rather than
looking for range inclusion.
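
A rough sketch of what an exact-match lookup could look like, written
against a plain array rather than the QLIST so it stands alone.  The Win
type and win_find_exact() are hypothetical names used only here:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a host DMA window; both bounds are inclusive. */
typedef struct {
    uint64_t min_iova;
    uint64_t max_iova;
} Win;

/* Return the index of the window starting exactly at min_iova, or -1. */
static int win_find_exact(const Win *wins, size_t n, uint64_t min_iova)
{
    for (size_t i = 0; i < n; i++) {
        if (wins[i].min_iova == min_iova) {
            return (int)i;
        }
    }
    return -1;
}
```

With this shape, an address that merely falls inside some window does
not select that window for deletion; only its start address does.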

> +
> +    if (!hostwin) {
> +        return -1;
> +    }
> +    QLIST_REMOVE(hostwin, hostwin_next);
> +
> +    return 0;
> +}
> +
>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>  {
>      return (!memory_region_is_ram(section->mr) &&
> @@ -388,6 +400,30 @@ static void vfio_listener_region_add(MemoryListener *listener,
>      }
>      end = int128_get64(int128_sub(llend, int128_one()));
>  
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +        VFIOHostDMAWindow *hostwin;
> +        hwaddr pgsize = 0;
> +
> +        /* For now intersections are not allowed, we may relax this later */
> +        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
> +            if (ranges_overlap(hostwin->min_iova,
> +                               hostwin->max_iova - hostwin->min_iova + 1,
> +                               section->offset_within_address_space,
> +                               int128_get64(section->size))) {
> +                goto fail;
> +            }
> +        }
> +
> +        ret = vfio_spapr_create_window(container, section, &pgsize);
> +        if (ret) {
> +            goto fail;
> +        }
> +
> +        vfio_host_win_add(container, section->offset_within_address_space,
> +                          section->offset_within_address_space +
> +                          int128_get64(section->size) - 1, pgsize);
> +    }
> +
>      if (!vfio_host_win_lookup(container, iova, end)) {
>          error_report("vfio: IOMMU container %p can't map guest IOVA region"
>                       " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
> @@ -523,6 +559,18 @@ static void vfio_listener_region_del(MemoryListener *listener,
>                       "0x%"HWADDR_PRIx") = %d (%m)",
>                       container, iova, int128_get64(llsize), ret);
>      }
> +
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +        vfio_spapr_remove_window(container,
> +                                 section->offset_within_address_space);

Should check for error here.

> +        if (vfio_host_win_del(container,
> +                              section->offset_within_address_space) < 0) {
> +            hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
> +                     __func__, section->offset_within_address_space);

Personally I think assert() would be better here, but Alex doesn't
like them so I'm ok with this.

> +        }
> +
> +        trace_vfio_spapr_remove_window(section->offset_within_address_space);
> +    }
>  }
>  
>  static const MemoryListener vfio_memory_listener = {
> @@ -960,11 +1008,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              }
>          }
>  
> -        /*
> -         * This only considers the host IOMMU's 32-bit window.  At
> -         * some point we need to add support for the optional 64-bit
> -         * window and dynamic windows
> -         */
>          info.argsz = sizeof(info);
>          ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
>          if (ret) {
> @@ -973,11 +1016,24 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              goto listener_release_exit;
>          }
>  
> -        /* The default table uses 4K pages */
> -        vfio_host_win_add(container, info.dma32_window_start,
> -                          info.dma32_window_start +
> -                          info.dma32_window_size - 1,
> -                          0x1000);
> +        if (v2) {
> +            /*
> +             * There is a default window in just created container.
> +             * To make region_add/del simpler, we better remove this
> +             * window now and let those iommu_listener callbacks
> +             * create/remove them when needed.
> +             */
> +            ret = vfio_spapr_remove_window(container, info.dma32_window_start);
> +            if (ret) {
> +                goto free_container_exit;
> +            }
> +        } else {
> +            /* The default table uses 4K pages */
> +            vfio_host_win_add(container, info.dma32_window_start,
> +                              info.dma32_window_start +
> +                              info.dma32_window_size - 1,
> +                              0x1000);
> +        }
>      } else {
>          error_report("vfio: No available IOMMU models");
>          ret = -EINVAL;
> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
> index f339472..0c784c4 100644
> --- a/hw/vfio/spapr.c
> +++ b/hw/vfio/spapr.c
> @@ -135,3 +135,73 @@ const MemoryListener vfio_prereg_listener = {
>      .region_add = vfio_prereg_listener_region_add,
>      .region_del = vfio_prereg_listener_region_del,
>  };
> +
> +int vfio_spapr_create_window(VFIOContainer *container,
> +                             MemoryRegionSection *section,
> +                             hwaddr *pgsize)
> +{
> +    int ret;
> +    unsigned pagesizes = memory_region_iommu_get_page_sizes(section->mr);
> +    unsigned pagesize = (hwaddr)1 << ctz64(pagesizes);
> +    unsigned entries, pages;
> +    struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
> +
> +    /*
> +     * FIXME: For VFIO iommu types which have KVM acceleration to
> +     * avoid bouncing all map/unmaps through qemu this way, this
> +     * would be the right place to wire that up (tell the KVM
> +     * device emulation the VFIO iommu handles to use).
> +     */
> +    create.window_size = int128_get64(section->size);
> +    create.page_shift = ctz64(pagesize);

Doing a ctz on a value which is defined as 1 << n seems a bit
perverse.
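
One way to avoid the round trip, sketched standalone: take the shift
once from the page size bitmap and derive the size from it.  The helper
names below are hypothetical, and the loop is only a portable stand-in
for a count-trailing-zeros primitive:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Position of the lowest set bit in a page size bitmap.
 * Assumption: the caller guarantees pagesizes != 0.
 */
static unsigned lowest_page_shift(uint64_t pagesizes)
{
    unsigned shift = 0;

    while (!(pagesizes & 1)) {
        pagesizes >>= 1;
        shift++;
    }
    return shift;
}

/* Smallest supported page size, derived from the shift rather than vice versa. */
static uint64_t lowest_page_size(uint64_t pagesizes)
{
    return (uint64_t)1 << lowest_page_shift(pagesizes);
}
```

The shift then feeds page_shift directly and the size is computed from
it only where a size is actually needed.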

> +    /*
> +     * SPAPR host supports multilevel TCE tables, there is some
> +     * heuristic to decide how many levels we want for our table:
> +     * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
> +     */
> +    entries = create.window_size >> create.page_shift;
> +    pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
> +    pages = MAX(pow2ceil(pages) - 1, 1); /* Round up */
> +    create.levels = ctz64(pages) / 6 + 1;
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> +    if (ret) {
> +        error_report("Failed to create a window, ret = %d (%m)", ret);
> +        return -errno;
> +    }
> +
> +    if (create.start_addr != section->offset_within_address_space) {
> +        vfio_spapr_remove_window(container, create.start_addr);
> +
> +        error_report("Host doesn't support DMA window at %"HWADDR_PRIx", must be %"PRIx64,
> +                     section->offset_within_address_space,
> +                     create.start_addr);
> +        ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> +        return -EINVAL;
> +    }
> +    trace_vfio_spapr_create_window(create.page_shift,
> +                                   create.window_size,
> +                                   create.start_addr);
> +    *pgsize = pagesize;
> +
> +    return 0;
> +}
> +
> +int vfio_spapr_remove_window(VFIOContainer *container,
> +                             hwaddr offset_within_address_space)
> +{
> +    struct vfio_iommu_spapr_tce_remove remove = {
> +        .argsz = sizeof(remove),
> +        .start_addr = offset_within_address_space,
> +    };
> +    int ret;
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> +    if (ret) {
> +        error_report("Failed to remove window at %"PRIx64,
> +                     remove.start_addr);
> +        return -errno;
> +    }
> +
> +    return 0;
> +}
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index c76ddc4..7e80382 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -167,4 +167,10 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>  #endif
>  extern const MemoryListener vfio_prereg_listener;
>  
> +int vfio_spapr_create_window(VFIOContainer *container,
> +                             MemoryRegionSection *section,
> +                             hwaddr *pgsize);
> +int vfio_spapr_remove_window(VFIOContainer *container,
> +                             hwaddr offset_within_address_space);
> +
>  #endif /* !HW_VFIO_VFIO_COMMON_H */
> diff --git a/trace-events b/trace-events
> index ddb8676..ec32c20 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1768,6 +1768,8 @@ vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sp
>  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
>  vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>  vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> +vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
> +vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
>  
>  # hw/vfio/platform.c
>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
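
For reference, the mapping listed in the heuristic comment in
vfio_spapr_create_window() can be written out literally as below.  This
is only a sketch under two assumptions: that the ranges in the comment
count TCE table pages, and that the table size is entries * 8 bytes
divided by the host page size, as in the quoted code.  The function
names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Number of host pages needed for the TCE table, never less than 1. */
static uint64_t tce_table_pages(uint64_t window_size, unsigned page_shift,
                                uint64_t host_page_size)
{
    uint64_t entries = window_size >> page_shift;
    uint64_t pages = entries * sizeof(uint64_t) / host_page_size;

    return pages ? pages : 1;
}

/* Literal transcription of: 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; more = 4 */
static unsigned tce_levels_for_pages(uint64_t pages)
{
    if (pages <= 64) {
        return 1;
    }
    if (pages <= 4096) {
        return 2;
    }
    if (pages <= 262144) {
        return 3;
    }
    return 4;
}
```

Comparing the output of this table against the arithmetic form for a
few boundary values (64, 65, 4096, 4097) is a cheap way to check that
the two agree.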

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH qemu v17 06/12] memory: Add reporting of supported page sizes
       [not found] ` <201606010902.u51902Zl007699@mx0a-001b2d01.pphosted.com>
@ 2016-06-03 15:37   ` Alex Williamson
  0 siblings, 0 replies; 38+ messages in thread
From: Alex Williamson @ 2016-06-03 15:37 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson

On Wed,  1 Jun 2016 18:57:37 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> Every IOMMU has some granularity which MemoryRegionIOMMUOps::translate
> uses when translating, however this information is not available outside
> the translate context for various checks.
> 
> This adds a get_page_sizes callback to MemoryRegionIOMMUOps and
> a wrapper for it so IOMMU users (such as VFIO) can know the actual
> page size(s) used by an IOMMU.
> 
> As IOMMU MR represents a guest IOMMU, this uses TARGET_PAGE_SIZE
> as fallback.
> 
> This removes vfio_container_granularity() and uses new helper in
> memory_region_iommu_replay() when replaying IOMMU mappings on added
> IOMMU memory region.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> Changes:
> v16:
> * used memory_region_iommu_get_page_sizes() instead of
> mr->iommu_ops->get_page_sizes() in memory_region_iommu_replay()
> 
> v15:
> * s/qemu_real_host_page_size/TARGET_PAGE_SIZE/ in memory_region_iommu_get_page_sizes
> 
> v14:
> * removed vfio_container_granularity(), changed memory_region_iommu_replay()
> 
> v4:
> * s/1<<TARGET_PAGE_BITS/qemu_real_host_page_size/
> ---
>  hw/ppc/spapr_iommu.c  |  8 ++++++++
>  hw/vfio/common.c      |  6 ------
>  include/exec/memory.h | 18 ++++++++++++++----
>  memory.c              | 16 +++++++++++++---
>  4 files changed, 35 insertions(+), 13 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index a3cc572..90a45c0 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -149,6 +149,13 @@ static void spapr_tce_table_pre_save(void *opaque)
>                                 tcet->bus_offset, tcet->page_shift);
>  }
>  
> +static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
> +{
> +    sPAPRTCETable *tcet = container_of(iommu, sPAPRTCETable, iommu);
> +
> +    return 1ULL << tcet->page_shift;
> +}
> +
>  static int spapr_tce_table_post_load(void *opaque, int version_id)
>  {
>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> @@ -228,6 +235,7 @@ static const VMStateDescription vmstate_spapr_tce_table = {
>  
>  static MemoryRegionIOMMUOps spapr_iommu_ops = {
>      .translate = spapr_tce_translate_iommu,
> +    .get_page_sizes = spapr_tce_get_page_sizes,
>  };
>  
>  static int spapr_tce_table_realize(DeviceState *dev)
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index e51ed3a..f1a12b0 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -322,11 +322,6 @@ out:
>      rcu_read_unlock();
>  }
>  
> -static hwaddr vfio_container_granularity(VFIOContainer *container)
> -{
> -    return (hwaddr)1 << ctz64(container->iova_pgsizes);
> -}
> -
>  static void vfio_listener_region_add(MemoryListener *listener,
>                                       MemoryRegionSection *section)
>  {
> @@ -394,7 +389,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
>  
>          memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
>          memory_region_iommu_replay(giommu->iommu, &giommu->n,
> -                                   vfio_container_granularity(container),
>                                     false);

Nit: fix the now-unnecessary line wrap.  Otherwise,

Acked-by: Alex Williamson <alex.williamson@redhat.com>

>  
>          return;
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index f649697..bd9625f 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -149,6 +149,8 @@ typedef struct MemoryRegionIOMMUOps MemoryRegionIOMMUOps;
>  struct MemoryRegionIOMMUOps {
>      /* Return a TLB entry that contains a given address. */
>      IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
> +    /* Returns supported page sizes */
> +    uint64_t (*get_page_sizes)(MemoryRegion *iommu);
>  };
>  
>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> @@ -571,6 +573,15 @@ static inline bool memory_region_is_iommu(MemoryRegion *mr)
>  
>  
>  /**
> + * memory_region_iommu_get_page_sizes: get supported page sizes in an iommu
> + *
> + * Returns %bitmap of supported page sizes for an iommu.
> + *
> + * @mr: the memory region being queried
> + */
> +uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr);
> +
> +/**
>   * memory_region_notify_iommu: notify a change in an IOMMU translation entry.
>   *
>   * @mr: the memory region that was changed
> @@ -594,16 +605,15 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n);
>  
>  /**
>   * memory_region_iommu_replay: replay existing IOMMU translations to
> - * a notifier
> + * a notifier with the minimum page granularity returned by
> + * mr->iommu_ops->get_page_sizes().
>   *
>   * @mr: the memory region to observe
>   * @n: the notifier to which to replay iommu mappings
> - * @granularity: Minimum page granularity to replay notifications for
>   * @is_write: Whether to treat the replay as a translate "write"
>   *     through the iommu
>   */
> -void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n,
> -                                hwaddr granularity, bool is_write);
> +void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write);
>  
>  /**
>   * memory_region_unregister_iommu_notifier: unregister a notifier for
> diff --git a/memory.c b/memory.c
> index 4e3cda8..761ae92 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1500,12 +1500,22 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n)
>      notifier_list_add(&mr->iommu_notify, n);
>  }
>  
> -void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n,
> -                                hwaddr granularity, bool is_write)
> +uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr)
>  {
> -    hwaddr addr;
> +    assert(memory_region_is_iommu(mr));
> +    if (mr->iommu_ops && mr->iommu_ops->get_page_sizes) {
> +        return mr->iommu_ops->get_page_sizes(mr);
> +    }
> +    return TARGET_PAGE_SIZE;
> +}
> +
> +void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write)
> +{
> +    hwaddr addr, granularity;
>      IOMMUTLBEntry iotlb;
>  
> +    granularity = (hwaddr)1 << ctz64(memory_region_iommu_get_page_sizes(mr));
> +
>      for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
>          iotlb = mr->iommu_ops->translate(mr, addr, is_write);
>          if (iotlb.perm != IOMMU_NONE) {

* Re: [Qemu-devel] [PATCH qemu v17 07/12] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2)
       [not found] ` <201606010900.u51900Om007391@mx0a-001b2d01.pphosted.com>
@ 2016-06-03 16:13   ` Alex Williamson
  2016-06-06  6:04     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 38+ messages in thread
From: Alex Williamson @ 2016-06-03 16:13 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson

On Wed,  1 Jun 2016 18:57:38 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> This makes use of the new "memory registering" feature. The idea is
> to provide the userspace ability to notify the host kernel about pages
> which are going to be used for DMA. Having this information, the host
> kernel can pin them all once per user process, do locked pages
> accounting (once) and not spent time on doing that in real time with
> possible failures which cannot be handled nicely in some cases.
> 
> This adds a prereg memory listener which listens on address_space_memory
> and notifies a VFIO container about memory which needs to be
> pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
> 
> As there is no per-IOMMU-type release() callback anymore, this stores
> the IOMMU type in the container so vfio_listener_release() can determine
> if it needs to unregister @prereg_listener.
> 
> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> not call it when v2 is detected and enabled.
> 
> This enforces guest RAM blocks to be host page size aligned; however
> this is not new as KVM already requires memory slots to be host page
> size aligned.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v17:
> * s/prereg\.c/spapr.c/
> * s/vfio_prereg_gpa_to_ua/vfio_prereg_gpa_to_vaddr/
> * vfio_prereg_listener_skipped_section does hw_error() on IOMMUs
> 
> v16:
> * switched to 64bit math everywhere as there is no chance to see
> region_add on RAM blocks even remotely close to 1<<64bytes.
> 
> v15:
> * banned unaligned sections
> * added an vfio_prereg_gpa_to_ua() helper
> 
> v14:
> * s/free_container_exit/listener_release_exit/g
> * added "if memory_region_is_iommu()" to vfio_prereg_listener_skipped_section
> ---
>  hw/vfio/Makefile.objs         |   1 +
>  hw/vfio/common.c              |  38 +++++++++---
>  hw/vfio/spapr.c               | 137 ++++++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |   4 ++
>  trace-events                  |   2 +
>  5 files changed, 172 insertions(+), 10 deletions(-)
>  create mode 100644 hw/vfio/spapr.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> index ceddbb8..c25e32b 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
>  obj-$(CONFIG_SOFTMMU) += platform.o
>  obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
>  obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
> +obj-$(CONFIG_SOFTMMU) += spapr.o
>  endif
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index f1a12b0..770f630 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -504,6 +504,9 @@ static const MemoryListener vfio_memory_listener = {
>  static void vfio_listener_release(VFIOContainer *container)
>  {
>      memory_listener_unregister(&container->listener);
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +        memory_listener_unregister(&container->prereg_listener);
> +    }
>  }
>  
>  static struct vfio_info_cap_header *
> @@ -862,8 +865,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              goto free_container_exit;
>          }
>  
> -        ret = ioctl(fd, VFIO_SET_IOMMU,
> -                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
> +        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>          if (ret) {
>              error_report("vfio: failed to set iommu for container: %m");
>              ret = -errno;
> @@ -888,8 +891,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>          if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
>              container->iova_pgsizes = info.iova_pgsizes;
>          }
> -    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
> +               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
>          struct vfio_iommu_spapr_tce_info info;
> +        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
>  
>          ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>          if (ret) {
> @@ -897,7 +902,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              ret = -errno;
>              goto free_container_exit;
>          }
> -        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
> +        container->iommu_type =
> +            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>          if (ret) {
>              error_report("vfio: failed to set iommu for container: %m");
>              ret = -errno;
> @@ -909,11 +916,22 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>           * when container fd is closed so we do not call it explicitly
>           * in this file.
>           */
> -        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> -        if (ret) {
> -            error_report("vfio: failed to enable container: %m");
> -            ret = -errno;
> -            goto free_container_exit;
> +        if (!v2) {
> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> +            if (ret) {
> +                error_report("vfio: failed to enable container: %m");
> +                ret = -errno;
> +                goto free_container_exit;
> +            }
> +        } else {
> +            container->prereg_listener = vfio_prereg_listener;
> +
> +            memory_listener_register(&container->prereg_listener,
> +                                     &address_space_memory);
> +            if (container->error) {
> +                error_report("vfio: RAM memory listener initialization failed for container");
> +                goto listener_release_exit;

Why doesn't this goto free_container_exit?  A registration failure should
not need an unregister.

> +            }
>          }
>  
>          /*
> @@ -926,7 +944,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>          if (ret) {
>              error_report("vfio: VFIO_IOMMU_SPAPR_TCE_GET_INFO failed: %m");
>              ret = -errno;
> -            goto free_container_exit;
> +            goto listener_release_exit;

Looks like this will cause much badness when we try to do
memory_listener_unregister() on an empty listener struct for the main
listener.
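
The general shape being asked for is the usual one where each exit label
undoes only what has been set up by the time of the jump.  A standalone
sketch follows; the State struct, the fail_at parameter and the
prereg_release_exit label are hypothetical and exist only to model the
ordering, they are not taken from the patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of what has been set up so far. */
typedef struct {
    bool container_allocated;
    bool prereg_registered;
    bool main_registered;
} State;

/*
 * fail_at: 0 = success, 1 = prereg registration fails,
 *          2 = a step before the main listener fails,
 *          3 = a step after the main listener fails.
 */
static int connect_sketch(State *s, int fail_at)
{
    s->container_allocated = true;

    if (fail_at == 1) {
        goto free_container_exit;   /* nothing registered yet */
    }
    s->prereg_registered = true;

    if (fail_at == 2) {
        goto prereg_release_exit;   /* main listener not registered yet */
    }
    s->main_registered = true;

    if (fail_at == 3) {
        goto listener_release_exit;
    }
    return 0;

listener_release_exit:
    s->main_registered = false;     /* undo main listener */
prereg_release_exit:
    s->prereg_registered = false;   /* undo prereg listener */
free_container_exit:
    s->container_allocated = false; /* undo allocation */
    return -1;
}
```

In this layout no path ever unregisters a listener that was not
registered, which is the property the two comments above are about.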

>          }
>          container->min_iova = info.dma32_window_start;
>          container->max_iova = container->min_iova + info.dma32_window_size - 1;
> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
> new file mode 100644
> index 0000000..f339472
> --- /dev/null
> +++ b/hw/vfio/spapr.c
> @@ -0,0 +1,137 @@
> +/*
> + * DMA memory preregistration
> + *
> + * Authors:
> + *  Alexey Kardashevskiy <aik@ozlabs.ru>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "cpu.h"
> +#include <sys/ioctl.h>
> +#include <linux/vfio.h>
> +
> +#include "hw/vfio/vfio-common.h"
> +#include "hw/hw.h"
> +#include "qemu/error-report.h"
> +#include "trace.h"
> +
> +static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
> +{
> +    if (memory_region_is_iommu(section->mr)) {
> +        hw_error("Cannot possibly preregister IOMMU memory");
> +    }
> +
> +    return !memory_region_is_ram(section->mr) ||
> +            memory_region_is_skip_dump(section->mr);
> +}
> +
> +static void *vfio_prereg_gpa_to_vaddr(MemoryRegionSection *section, hwaddr gpa)
> +{
> +    return memory_region_get_ram_ptr(section->mr) +
> +        section->offset_within_region +
> +        (gpa - section->offset_within_address_space);
> +}
> +
> +static void vfio_prereg_listener_region_add(MemoryListener *listener,
> +                                            MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> +                                            prereg_listener);
> +    const hwaddr gpa = section->offset_within_address_space;
> +    hwaddr end;
> +    int ret;
> +    hwaddr page_mask = qemu_real_host_page_mask;
> +    struct vfio_iommu_spapr_register_memory reg = {
> +        .argsz = sizeof(reg),
> +        .flags = 0,
> +    };
> +
> +    if (vfio_prereg_listener_skipped_section(section)) {
> +        trace_vfio_listener_region_add_skip(
> +                section->offset_within_address_space,
> +                section->offset_within_address_space +
> +                int128_get64(int128_sub(section->size, int128_one())));

How will we know if this trace is related to the main listener or the
prereg listener?

> +        return;
> +    }
> +
> +    if (unlikely((section->offset_within_address_space & ~page_mask) ||
> +                 (section->offset_within_region & ~page_mask) ||
> +                 (int128_get64(section->size) & ~page_mask))) {
> +        error_report("%s received unaligned region", __func__);
> +        return;
> +    }
> +
> +    end = section->offset_within_address_space + int128_get64(section->size);
> +    g_assert(gpa < end);

This would imply a zero-sized region; can't you simply return?

> +
> +    memory_region_ref(section->mr);
> +
> +    reg.vaddr = (__u64) vfio_prereg_gpa_to_vaddr(section, gpa);
> +    reg.size = end - gpa;
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
> +    trace_vfio_ram_register(reg.vaddr, reg.size, ret ? -errno : 0);
> +    if (ret) {
> +        /*
> +         * On the initfn path, store the first error in the container so we
> +         * can gracefully fail.  Runtime, there's not much we can do other
> +         * than throw a hardware error.
> +         */
> +        if (!container->initialized) {
> +            if (!container->error) {
> +                container->error = ret;
> +            }
> +        } else {
> +            hw_error("vfio: Memory registering failed, unable to continue");
> +        }
> +    }
> +}
> +
> +static void vfio_prereg_listener_region_del(MemoryListener *listener,
> +                                            MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> +                                            prereg_listener);
> +    const hwaddr gpa = section->offset_within_address_space;
> +    hwaddr end;
> +    int ret;
> +    hwaddr page_mask = qemu_real_host_page_mask;
> +    struct vfio_iommu_spapr_register_memory reg = {
> +        .argsz = sizeof(reg),
> +        .flags = 0,
> +    };
> +
> +    if (vfio_prereg_listener_skipped_section(section)) {
> +        trace_vfio_listener_region_del_skip(
> +                section->offset_within_address_space,
> +                section->offset_within_address_space +
> +                int128_get64(int128_sub(section->size, int128_one())));

Again, indistinguishable from main listener trace.

> +        return;
> +    }
> +
> +    if (unlikely((section->offset_within_address_space & ~page_mask) ||
> +                 (section->offset_within_region & ~page_mask) ||
> +                 (int128_get64(section->size) & ~page_mask))) {
> +        error_report("%s received unaligned region", __func__);
> +        return;
> +    }
> +
> +    end = section->offset_within_address_space + int128_get64(section->size);
> +    if (gpa >= end) {
> +        return;

We simply return here, not sure why we need to g_assert above.
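
One way to make region_add and region_del treat the empty case the same
way is to share the range computation.  This is only a sketch under
assumptions: Section is a cut-down stand-in for MemoryRegionSection with
plain 64-bit fields, and prereg_section_range() is a hypothetical helper
that does not exist in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Cut-down, hypothetical stand-in for MemoryRegionSection. */
typedef struct Section {
    uint64_t offset_within_address_space;
    uint64_t offset_within_region;
    uint64_t size;
} Section;

/*
 * Compute the half-open range [*gpa, *end) covered by a section.
 * page_mask has the low (in-page) bits clear, as in the quoted code.
 * Returns false for an unaligned or empty section, so both callers can
 * simply return instead of one asserting and the other returning.
 */
static bool prereg_section_range(const Section *s, uint64_t page_mask,
                                 uint64_t *gpa, uint64_t *end)
{
    assert(s != NULL && gpa != NULL && end != NULL);

    if ((s->offset_within_address_space & ~page_mask) ||
        (s->offset_within_region & ~page_mask) ||
        (s->size & ~page_mask)) {
        return false;
    }

    *gpa = s->offset_within_address_space;
    *end = *gpa + s->size;

    /* False for size == 0, and also if the addition wrapped around. */
    return *gpa < *end;
}
```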

> +    }
> +
> +    reg.vaddr = (__u64) vfio_prereg_gpa_to_vaddr(section, gpa);
> +    reg.size = end - gpa;
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
> +    trace_vfio_ram_unregister(reg.vaddr, reg.size, ret ? -errno : 0);
> +}
> +
> +const MemoryListener vfio_prereg_listener = {
> +    .region_add = vfio_prereg_listener_region_add,
> +    .region_del = vfio_prereg_listener_region_del,
> +};
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 0610377..405c3b2 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -73,6 +73,8 @@ typedef struct VFIOContainer {
>      VFIOAddressSpace *space;
>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>      MemoryListener listener;
> +    MemoryListener prereg_listener;
> +    unsigned iommu_type;
>      int error;
>      bool initialized;
>      /*
> @@ -158,4 +160,6 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>                               uint32_t subtype, struct vfio_region_info **info);
>  #endif
> +extern const MemoryListener vfio_prereg_listener;
> +
>  #endif /* !HW_VFIO_VFIO_COMMON_H */
> diff --git a/trace-events b/trace-events
> index de42012..ddb8676 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1766,6 +1766,8 @@ vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps e
>  vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Device %s region %d: %d sparse mmap entries"
>  vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
>  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
> +vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> +vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"

This file loosely groups trace events by the source file that uses them;
these are not in common.c.

>  
>  # hw/vfio/platform.c
>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"

* Re: [Qemu-devel] [PATCH qemu v17 09/12] vfio: Add host side DMA window capabilities
       [not found] ` <201606010901.u518x843001647@mx0a-001b2d01.pphosted.com>
@ 2016-06-03 16:32   ` Alex Williamson
  0 siblings, 0 replies; 38+ messages in thread
From: Alex Williamson @ 2016-06-03 16:32 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson

On Wed,  1 Jun 2016 18:57:40 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> There are going to be multiple IOMMUs per container. This moves
> the single host IOMMU parameter set to a list of VFIOHostDMAWindow.
> 
> This should cause no behavioral change and will be used later by
> the SPAPR TCE IOMMU v2 which will also add a vfio_host_win_del() helper.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> Changes:
> v17:
> * vfio_host_win_add() uses vfio_host_win_lookup() for overlap check and
> aborts if any found instead of returning an error (as recovery is not
> possible anyway)
> * hw_error() when overlapped iommu is detected
> 
> v16:
> * adjusted commit log with changes from v15
> 
> v15:
> * s/vfio_host_iommu_add/vfio_host_win_add/
> * s/VFIOHostIOMMU/VFIOHostDMAWindow/
> ---
>  hw/vfio/common.c              | 59 +++++++++++++++++++++++++++++++------------
>  include/hw/vfio/vfio-common.h |  9 +++++--
>  2 files changed, 50 insertions(+), 18 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 770f630..52b08fd 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -29,6 +29,7 @@
>  #include "exec/memory.h"
>  #include "hw/hw.h"
>  #include "qemu/error-report.h"
> +#include "qemu/range.h"
>  #include "sysemu/kvm.h"
>  #ifdef CONFIG_KVM
>  #include "linux/kvm.h"
> @@ -242,6 +243,38 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>      return -errno;
>  }
>  
> +static VFIOHostDMAWindow *vfio_host_win_lookup(VFIOContainer *container,
> +                                               hwaddr min_iova, hwaddr max_iova)
> +{
> +    VFIOHostDMAWindow *hostwin;
> +
> +    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
> +        if (hostwin->min_iova <= min_iova && max_iova <= hostwin->max_iova) {
> +            return hostwin;
> +        }
> +    }
> +
> +    return NULL;
> +}
> +
> +static void vfio_host_win_add(VFIOContainer *container,
> +                             hwaddr min_iova, hwaddr max_iova,
> +                             uint64_t iova_pgsizes)
> +{
> +    VFIOHostDMAWindow *hostwin;
> +
> +    if (vfio_host_win_lookup(container, min_iova, max_iova)) {
> +        hw_error("%s: Overlapped IOMMU are not enabled", __func__);
> +    }
> +
> +    hostwin = g_malloc0(sizeof(*hostwin));
> +
> +    hostwin->min_iova = min_iova;
> +    hostwin->max_iova = max_iova;
> +    hostwin->iova_pgsizes = iova_pgsizes;
> +    QLIST_INSERT_HEAD(&container->hostwin_list, hostwin, hostwin_next);
> +}
> +
>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>  {
>      return (!memory_region_is_ram(section->mr) &&
> @@ -355,7 +388,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
>      }
>      end = int128_get64(int128_sub(llend, int128_one()));
>  
> -    if ((iova < container->min_iova) || (end > container->max_iova)) {
> +    if (!vfio_host_win_lookup(container, iova, end)) {
>          error_report("vfio: IOMMU container %p can't map guest IOVA region"
>                       " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
>                       container, iova, end);
> @@ -370,10 +403,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
>  
>          trace_vfio_listener_region_add_iommu(iova, end);
>          /*
> -         * FIXME: We should do some checking to see if the
> -         * capabilities of the host VFIO IOMMU are adequate to model
> -         * the guest IOMMU
> -         *
>           * FIXME: For VFIO iommu types which have KVM acceleration to
>           * avoid bouncing all map/unmaps through qemu this way, this
>           * would be the right place to wire that up (tell the KVM
> @@ -880,17 +909,14 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>           * existing Type1 IOMMUs generally support any IOVA we're
>           * going to actually try in practice.
>           */
> -        container->min_iova = 0;
> -        container->max_iova = (hwaddr)-1;
> -
> -        /* Assume just 4K IOVA page size */
> -        container->iova_pgsizes = 0x1000;
>          info.argsz = sizeof(info);
>          ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
>          /* Ignore errors */
> -        if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
> -            container->iova_pgsizes = info.iova_pgsizes;
> +        if (ret || !(info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
> +            /* Assume 4k IOVA page size */
> +            info.iova_pgsizes = 4096;
>          }
> +        vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes);
>      } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
>                 ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
>          struct vfio_iommu_spapr_tce_info info;
> @@ -946,11 +972,12 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              ret = -errno;
>              goto listener_release_exit;
>          }
> -        container->min_iova = info.dma32_window_start;
> -        container->max_iova = container->min_iova + info.dma32_window_size - 1;
>  
> -        /* Assume just 4K IOVA pages for now */
> -        container->iova_pgsizes = 0x1000;
> +        /* The default table uses 4K pages */
> +        vfio_host_win_add(container, info.dma32_window_start,
> +                          info.dma32_window_start +
> +                          info.dma32_window_size - 1,
> +                          0x1000);
>      } else {
>          error_report("vfio: No available IOMMU models");
>          ret = -EINVAL;
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 405c3b2..c76ddc4 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -82,9 +82,8 @@ typedef struct VFIOContainer {
>       * contiguous IOVA window.  We may need to generalize that in
>       * future
>       */
> -    hwaddr min_iova, max_iova;
> -    uint64_t iova_pgsizes;
>      QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
> +    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>      QLIST_HEAD(, VFIOGroup) group_list;
>      QLIST_ENTRY(VFIOContainer) next;
>  } VFIOContainer;
> @@ -97,6 +96,12 @@ typedef struct VFIOGuestIOMMU {
>      QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>  } VFIOGuestIOMMU;
>  
> +typedef struct VFIOHostDMAWindow {
> +    hwaddr min_iova, max_iova;

nit: let's not put several structure members on one line the way we do
local variables.  Modulo the bug David identified in
vfio_host_win_add(), this looks ok to me.
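
David's report is not quoted in this message, so the following is an
assumption about what it refers to: the quoted vfio_host_win_lookup()
matches only when the requested range is fully contained in an existing
window, which is not the same test as "overlaps an existing window".
The sketch contrasts the two predicates.  Window, window_contains() and
window_overlaps() are hypothetical names, with inclusive bounds as in
VFIOHostDMAWindow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Inclusive IOVA window [min_iova, max_iova]; hypothetical stand-in. */
typedef struct Window {
    uint64_t min_iova;
    uint64_t max_iova;
} Window;

/* Same predicate as the quoted lookup: the range lies inside the window. */
static bool window_contains(const Window *w,
                            uint64_t min_iova, uint64_t max_iova)
{
    assert(min_iova <= max_iova);
    return w->min_iova <= min_iova && max_iova <= w->max_iova;
}

/* Two inclusive ranges overlap if each starts no later than the other ends. */
static bool window_overlaps(const Window *w,
                            uint64_t min_iova, uint64_t max_iova)
{
    assert(min_iova <= max_iova);
    return w->min_iova <= max_iova && min_iova <= w->max_iova;
}
```

A range that straddles the end of a window, or one that encloses the
whole window, overlaps it without being contained in it, so a
containment test used as an overlap check would let both through.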

> +    uint64_t iova_pgsizes;
> +    QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next;
> +} VFIOHostDMAWindow;
> +
>  typedef struct VFIODeviceOps VFIODeviceOps;
>  
>  typedef struct VFIODevice {

* Re: [Qemu-devel] [PATCH qemu v17 10/12] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)
       [not found] ` <201606010901.u518x7AQ001537@mx0a-001b2d01.pphosted.com>
@ 2016-06-03 16:50   ` Alex Williamson
  0 siblings, 0 replies; 38+ messages in thread
From: Alex Williamson @ 2016-06-03 16:50 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson

On Wed,  1 Jun 2016 18:57:41 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
> This adds ability to VFIO common code to dynamically allocate/remove
> DMA windows in the host kernel when new VFIO container is added/removed.
> 
> This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
> and adds just created IOMMU into the host IOMMU list; the opposite
> action is taken in vfio_listener_region_del.
> 
> When creating a new window, this uses heuristic to decide on the TCE table
> levels number.
> 
> This should cause no guest visible change in behavior.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v17:
> * moved spapr window create/remove helpers to separate file
> * added hw_error() if vfio_host_win_del() failed
> 
> v16:
> * used memory_region_iommu_get_page_sizes() in vfio_listener_region_add()
> * enforced no intersections between windows
> 
> v14:
> * new to the series
> ---
>  hw/vfio/common.c              | 76 +++++++++++++++++++++++++++++++++++++------
>  hw/vfio/spapr.c               | 70 +++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |  6 ++++
>  trace-events                  |  2 ++
>  4 files changed, 144 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 52b08fd..7f55c26 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -275,6 +275,18 @@ static void vfio_host_win_add(VFIOContainer *container,
>      QLIST_INSERT_HEAD(&container->hostwin_list, hostwin, hostwin_next);
>  }
>  
> +static int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova)
> +{
> +    VFIOHostDMAWindow *hostwin = vfio_host_win_lookup(container, min_iova, 1);
> +

+1 to David's comment about requiring an exact match.
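
For illustration, an exact-match delete could look like the sketch below.
Note that the quoted call passes 1 as max_iova, so the containment lookup
succeeds for any window that starts at or below min_iova and ends at or
above 1.  The sketch uses a plain singly linked list rather than QLIST to
stay self-contained; HostWin, host_win_add() and host_win_del() are
hypothetical names.  It also frees the removed entry, which the quoted
helper does not do.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for VFIOHostDMAWindow on a singly linked list. */
typedef struct HostWin {
    uint64_t min_iova;
    uint64_t max_iova;
    struct HostWin *next;
} HostWin;

/* Push a new window [min_iova, max_iova] at the head of the list. */
static void host_win_add(HostWin **head, uint64_t min_iova, uint64_t max_iova)
{
    HostWin *w = calloc(1, sizeof(*w));

    assert(w != NULL);
    w->min_iova = min_iova;
    w->max_iova = max_iova;
    w->next = *head;
    *head = w;
}

/*
 * Remove the window whose bounds are exactly [min_iova, max_iova].
 * Returns 0 on success, -1 if no window matches exactly.
 */
static int host_win_del(HostWin **head, uint64_t min_iova, uint64_t max_iova)
{
    HostWin **pp;

    for (pp = head; *pp != NULL; pp = &(*pp)->next) {
        HostWin *w = *pp;

        if (w->min_iova == min_iova && w->max_iova == max_iova) {
            *pp = w->next;
            free(w);
            return 0;
        }
    }
    return -1;
}
```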

> +    if (!hostwin) {
> +        return -1;
> +    }
> +    QLIST_REMOVE(hostwin, hostwin_next);
> +
> +    return 0;
> +}
> +
>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>  {
>      return (!memory_region_is_ram(section->mr) &&
> @@ -388,6 +400,30 @@ static void vfio_listener_region_add(MemoryListener *listener,
>      }
>      end = int128_get64(int128_sub(llend, int128_one()));
>  
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +        VFIOHostDMAWindow *hostwin;
> +        hwaddr pgsize = 0;
> +
> +        /* For now intersections are not allowed, we may relax this later */
> +        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
> +            if (ranges_overlap(hostwin->min_iova,
> +                               hostwin->max_iova - hostwin->min_iova + 1,
> +                               section->offset_within_address_space,
> +                               int128_get64(section->size))) {
> +                goto fail;
> +            }
> +        }
> +
> +        ret = vfio_spapr_create_window(container, section, &pgsize);
> +        if (ret) {
> +            goto fail;
> +        }
> +
> +        vfio_host_win_add(container, section->offset_within_address_space,
> +                          section->offset_within_address_space +
> +                          int128_get64(section->size) - 1, pgsize);
> +    }
> +
>      if (!vfio_host_win_lookup(container, iova, end)) {
>          error_report("vfio: IOMMU container %p can't map guest IOVA region"
>                       " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
> @@ -523,6 +559,18 @@ static void vfio_listener_region_del(MemoryListener *listener,
>                       "0x%"HWADDR_PRIx") = %d (%m)",
>                       container, iova, int128_get64(llsize), ret);
>      }
> +
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +        vfio_spapr_remove_window(container,
> +                                 section->offset_within_address_space);
> +        if (vfio_host_win_del(container,
> +                              section->offset_within_address_space) < 0) {
> +            hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
> +                     __func__, section->offset_within_address_space);
> +        }
> +
> +        trace_vfio_spapr_remove_window(section->offset_within_address_space);

Trace within the function, as you do on the create_window side.

> +    }
>  }
>  
>  static const MemoryListener vfio_memory_listener = {
> @@ -960,11 +1008,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              }
>          }
>  
> -        /*
> -         * This only considers the host IOMMU's 32-bit window.  At
> -         * some point we need to add support for the optional 64-bit
> -         * window and dynamic windows
> -         */
>          info.argsz = sizeof(info);
>          ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
>          if (ret) {
> @@ -973,11 +1016,24 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              goto listener_release_exit;
>          }
>  
> -        /* The default table uses 4K pages */
> -        vfio_host_win_add(container, info.dma32_window_start,
> -                          info.dma32_window_start +
> -                          info.dma32_window_size - 1,
> -                          0x1000);
> +        if (v2) {
> +            /*
> +             * There is a default window in just created container.
> +             * To make region_add/del simpler, we better remove this
> +             * window now and let those iommu_listener callbacks
> +             * create/remove them when needed.
> +             */
> +            ret = vfio_spapr_remove_window(container, info.dma32_window_start);
> +            if (ret) {
> +                goto free_container_exit;
> +            }
> +        } else {
> +            /* The default table uses 4K pages */
> +            vfio_host_win_add(container, info.dma32_window_start,
> +                              info.dma32_window_start +
> +                              info.dma32_window_size - 1,
> +                              0x1000);
> +        }
>      } else {
>          error_report("vfio: No available IOMMU models");
>          ret = -EINVAL;
> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
> index f339472..0c784c4 100644
> --- a/hw/vfio/spapr.c
> +++ b/hw/vfio/spapr.c
> @@ -135,3 +135,73 @@ const MemoryListener vfio_prereg_listener = {
>      .region_add = vfio_prereg_listener_region_add,
>      .region_del = vfio_prereg_listener_region_del,
>  };
> +
> +int vfio_spapr_create_window(VFIOContainer *container,
> +                             MemoryRegionSection *section,
> +                             hwaddr *pgsize)
> +{
> +    int ret;
> +    unsigned pagesizes = memory_region_iommu_get_page_sizes(section->mr);
> +    unsigned pagesize = (hwaddr)1 << ctz64(pagesizes);
> +    unsigned entries, pages;
> +    struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
> +
> +    /*
> +     * FIXME: For VFIO iommu types which have KVM acceleration to
> +     * avoid bouncing all map/unmaps through qemu this way, this
> +     * would be the right place to wire that up (tell the KVM
> +     * device emulation the VFIO iommu handles to use).
> +     */
> +    create.window_size = int128_get64(section->size);
> +    create.page_shift = ctz64(pagesize);
> +    /*
> +     * SPAPR host supports multilevel TCE tables, there is some
> +     * heuristic to decide how many levels we want for our table:
> +     * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
> +     */
> +    entries = create.window_size >> create.page_shift;
> +    pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
> +    pages = MAX(pow2ceil(pages) - 1, 1); /* Round up */
> +    create.levels = ctz64(pages) / 6 + 1;
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> +    if (ret) {
> +        error_report("Failed to create a window, ret = %d (%m)", ret);
> +        return -errno;
> +    }
> +
> +    if (create.start_addr != section->offset_within_address_space) {
> +        vfio_spapr_remove_window(container, create.start_addr);
> +
> +        error_report("Host doesn't support DMA window at %"HWADDR_PRIx", must be %"PRIx64,
> +                     section->offset_within_address_space,
> +                     create.start_addr);
> +        ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> +        return -EINVAL;
> +    }
> +    trace_vfio_spapr_create_window(create.page_shift,
> +                                   create.window_size,
> +                                   create.start_addr);
> +    *pgsize = pagesize;
> +
> +    return 0;
> +}
> +
> +int vfio_spapr_remove_window(VFIOContainer *container,
> +                             hwaddr offset_within_address_space)
> +{
> +    struct vfio_iommu_spapr_tce_remove remove = {
> +        .argsz = sizeof(remove),
> +        .start_addr = offset_within_address_space,
> +    };
> +    int ret;
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> +    if (ret) {
> +        error_report("Failed to remove window at %"PRIx64,
> +                     remove.start_addr);
> +        return -errno;
> +    }
> +
> +    return 0;
> +}
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index c76ddc4..7e80382 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -167,4 +167,10 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>  #endif
>  extern const MemoryListener vfio_prereg_listener;
>  
> +int vfio_spapr_create_window(VFIOContainer *container,
> +                             MemoryRegionSection *section,
> +                             hwaddr *pgsize);
> +int vfio_spapr_remove_window(VFIOContainer *container,
> +                             hwaddr offset_within_address_space);
> +
>  #endif /* !HW_VFIO_VFIO_COMMON_H */
> diff --git a/trace-events b/trace-events
> index ddb8676..ec32c20 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1768,6 +1768,8 @@ vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sp
>  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
>  vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>  vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> +vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
> +vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64

These are also not in common.c.

>  
>  # hw/vfio/platform.c
>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
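
A side note on the TCE level heuristic quoted in
vfio_spapr_create_window(), worked through by hand rather than raised in
the review: pow2ceil(pages) - 1 has the form 2^k - 1, which is odd for
any k >= 1, so ctz64() of it is 0 and the computed level count appears to
stay at 1 for every window size.  The sketch below reproduces that
arithmetic next to the table from the code comment.  It assumes the
comment's ranges count TCE table pages, and pow2ceil_u64() and ctz_u64()
are local stand-ins assumed to behave like QEMU's pow2ceil() and ctz64()
for the values used here.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest power of two >= v, for 1 <= v <= 2^63 (local stand-in). */
static uint64_t pow2ceil_u64(uint64_t v)
{
    uint64_t p = 1;

    assert(v >= 1 && v <= (UINT64_C(1) << 63));
    while (p < v) {
        p <<= 1;
    }
    return p;
}

/* Count trailing zero bits; 64 for v == 0 (local stand-in). */
static int ctz_u64(uint64_t v)
{
    int n = 0;

    if (v == 0) {
        return 64;
    }
    while (!(v & 1)) {
        v >>= 1;
        n++;
    }
    return n;
}

static uint64_t max_u64(uint64_t a, uint64_t b)
{
    return a > b ? a : b;
}

/* Levels as the quoted code computes them from the table page count. */
static int levels_as_quoted(uint64_t pages)
{
    pages = max_u64(pages, 1);
    pages = max_u64(pow2ceil_u64(pages) - 1, 1);
    return ctz_u64(pages) / 6 + 1;
}

/* Levels according to the table in the quoted comment. */
static int levels_per_comment(uint64_t pages)
{
    if (pages <= 64) {
        return 1;
    }
    if (pages <= 4096) {
        return 2;
    }
    if (pages <= 262144) {
        return 3;
    }
    return 4;
}
```

Comparing levels_as_quoted(65) with levels_per_comment(65) is enough to
see the two disagree, if the reading of the comment above is right.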

* Re: [Qemu-devel] [PATCH qemu v17 12/12] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping listening
       [not found] ` <201606011013.u51A9ALx023064@mx0a-001b2d01.pphosted.com>
@ 2016-06-03 16:59   ` Alex Williamson
  0 siblings, 0 replies; 38+ messages in thread
From: Alex Williamson @ 2016-06-03 16:59 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson

On Wed,  1 Jun 2016 18:57:43 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> The sPAPR TCE tables manage 2 copies when VFIO is using an IOMMU -
> a guest view of the table and a hardware TCE table. If there is no VFIO
> presence in the address space, then just the guest view is used, if
> this is the case, it is allocated in the KVM. However since there is no
> support yet for VFIO in KVM TCE hypercalls, when we start using VFIO,
> we need to move the guest view from KVM to the userspace; and we need
> to do this for every IOMMU on a bus with VFIO devices.
> 
> This adds notify_started/notify_stopped callbacks in MemoryRegionIOMMUOps
> to notify IOMMU that listeners were set/removed. This allows IOMMU to
> take necessary steps before actual notifications happen and do proper
> cleanup when the last notifier is removed.
> 
> This implements the callbacks for the sPAPR IOMMU - notify_started()
> reallocates the guest view to the user space, notify_stopped() does
> the opposite.
> 
> This removes explicit spapr_tce_set_need_vfio() call from PCI hotplug
> path as the new callbacks do this better - they notify IOMMU at
> the exact moment when the configuration is changed, and this also
> includes the case of PCI hot unplug.
> 
> This adds MemoryRegion* to memory_region_unregister_iommu_notifier()
> as we need iommu_ops to call notify_stopped() and Notifier* does not
> store the owner.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> Changes:
> v17:
> * replaced IOMMU users counting with simple QLIST_EMPTY()
> * renamed the callbacks
> * removed requirement for region_del() to be called on memory_listener_unregister()
> 
> v16:
> * added a use counter in VFIOAddressSpace->VFIOIOMMUMR
> 
> v15:
> * s/need_vfio/vfio-Users/g
> ---

This should be two separate patches, patch #2:

>  hw/ppc/spapr_iommu.c  | 12 ++++++++++++
>  hw/ppc/spapr_pci.c    |  6 ------

patch #1:

>  hw/vfio/common.c      |  5 +++--
>  include/exec/memory.h |  8 +++++++-
>  memory.c              | 10 +++++++++-
>  5 files changed, 31 insertions(+), 10 deletions(-)

Otherwise

Acked-by: Alex Williamson <alex.williamson@redhat.com>

> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 90a45c0..994a8a0 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -156,6 +156,16 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
>      return 1ULL << tcet->page_shift;
>  }
>  
> +static void spapr_tce_notify_started(MemoryRegion *iommu)
> +{
> +    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), true);
> +}
> +
> +static void spapr_tce_notify_stopped(MemoryRegion *iommu)
> +{
> +    spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), false);
> +}
> +
>  static int spapr_tce_table_post_load(void *opaque, int version_id)
>  {
>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> @@ -236,6 +246,8 @@ static const VMStateDescription vmstate_spapr_tce_table = {
>  static MemoryRegionIOMMUOps spapr_iommu_ops = {
>      .translate = spapr_tce_translate_iommu,
>      .get_page_sizes = spapr_tce_get_page_sizes,
> +    .notify_started = spapr_tce_notify_started,
> +    .notify_stopped = spapr_tce_notify_stopped,
>  };
>  
>  static int spapr_tce_table_realize(DeviceState *dev)
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index bcf0360..06ce902 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -1089,12 +1089,6 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>      void *fdt = NULL;
>      int fdt_start_offset = 0, fdt_size;
>  
> -    if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
> -        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
> -
> -        spapr_tce_set_need_vfio(tcet, true);
> -    }
> -
>      fdt = create_device_tree(&fdt_size);
>      fdt_start_offset = spapr_create_pci_child_dt(phb, pdev, fdt, 0);
>      if (!fdt_start_offset) {
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 7f55c26..356640e 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -522,7 +522,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
>  
>          QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
>              if (giommu->iommu == section->mr) {
> -                memory_region_unregister_iommu_notifier(&giommu->n);
> +                memory_region_unregister_iommu_notifier(giommu->iommu,
> +                                                        &giommu->n);
>                  QLIST_REMOVE(giommu, giommu_next);
>                  g_free(giommu);
>                  break;
> @@ -1094,7 +1095,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>          QLIST_REMOVE(container, next);
>  
>          QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
> -            memory_region_unregister_iommu_notifier(&giommu->n);
> +            memory_region_unregister_iommu_notifier(giommu->iommu, &giommu->n);
>              QLIST_REMOVE(giommu, giommu_next);
>              g_free(giommu);
>          }
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index bd9625f..f08439b 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -151,6 +151,10 @@ struct MemoryRegionIOMMUOps {
>      IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
>      /* Returns supported page sizes */
>      uint64_t (*get_page_sizes)(MemoryRegion *iommu);
> +    /* Called when the first notifier is set */
> +    void (*notify_started)(MemoryRegion *iommu);
> +    /* Called when the last notifier is removed */
> +    void (*notify_stopped)(MemoryRegion *iommu);
>  };
>  
>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> @@ -619,9 +623,11 @@ void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write);
>   * memory_region_unregister_iommu_notifier: unregister a notifier for
>   * changes to IOMMU translation entries.
>   *
> + * @mr: the memory region which was observed and for which notify_stopped()
> + *      needs to be called
>   * @n: the notifier to be removed.
>   */
> -void memory_region_unregister_iommu_notifier(Notifier *n);
> +void memory_region_unregister_iommu_notifier(MemoryRegion *mr, Notifier *n);
>  
>  /**
>   * memory_region_name: get a memory region's name
> diff --git a/memory.c b/memory.c
> index 761ae92..ee41649 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1497,6 +1497,10 @@ bool memory_region_is_logging(MemoryRegion *mr, uint8_t client)
>  
>  void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n)
>  {
> +    if (mr->iommu_ops->notify_started &&
> +        QLIST_EMPTY(&mr->iommu_notify.notifiers)) {
> +        mr->iommu_ops->notify_started(mr);
> +    }
>      notifier_list_add(&mr->iommu_notify, n);
>  }
>  
> @@ -1530,9 +1534,13 @@ void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write)
>      }
>  }
>  
> -void memory_region_unregister_iommu_notifier(Notifier *n)
> +void memory_region_unregister_iommu_notifier(MemoryRegion *mr, Notifier *n)
>  {
>      notifier_remove(n);
> +    if (mr->iommu_ops->notify_stopped &&
> +        QLIST_EMPTY(&mr->iommu_notify.notifiers)) {
> +        mr->iommu_ops->notify_stopped(mr);
> +    }
>  }
>  
>  void memory_region_notify_iommu(MemoryRegion *mr,
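
The first/last notifier behaviour in the hunks above can be illustrated with a standalone sketch. This is not QEMU code: struct region, struct region_ops and the nr_notifiers counter are hypothetical stand-ins for MemoryRegion, MemoryRegionIOMMUOps and the notifier list, chosen only so the example compiles on its own.

```c
#include <assert.h>
#include <stddef.h>

struct region;

/* Hypothetical stand-in for the two new optional callbacks. */
struct region_ops {
    void (*notify_started)(struct region *r);   /* may be NULL */
    void (*notify_stopped)(struct region *r);   /* may be NULL */
};

/* Hypothetical stand-in for a memory region with IOMMU ops. */
struct region {
    const struct region_ops *ops;
    unsigned nr_notifiers;   /* replaces the real notifier list */
    int active;              /* toggled by the demo callbacks below */
};

static void region_register_notifier(struct region *r)
{
    /* Fire the hook only on the empty -> non-empty transition. */
    if (r->ops->notify_started && r->nr_notifiers == 0) {
        r->ops->notify_started(r);
    }
    r->nr_notifiers++;
}

static void region_unregister_notifier(struct region *r)
{
    assert(r->nr_notifiers > 0);
    r->nr_notifiers--;
    /* Fire the hook only once the last notifier is gone. */
    if (r->ops->notify_stopped && r->nr_notifiers == 0) {
        r->ops->notify_stopped(r);
    }
}

/* Demo callbacks that just record whether notification is active. */
static void demo_started(struct region *r) { r->active = 1; }
static void demo_stopped(struct region *r) { r->active = 0; }

static const struct region_ops demo_ops = { demo_started, demo_stopped };
```

As in the patch, the "stopped" check runs after the removal, which is why the unregister function needs the region passed in explicitly.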


* Re: [Qemu-devel] [PATCH qemu v17 11/12] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
       [not found] ` <201606010901.u518x1wF023568@mx0a-001b2d01.pphosted.com>
@ 2016-06-06  5:57   ` David Gibson
  2016-06-06  8:12     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 38+ messages in thread
From: David Gibson @ 2016-06-06  5:57 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, Alex Williamson

On Wed, Jun 01, 2016 at 06:57:42PM +1000, Alexey Kardashevskiy wrote:
> This adds support for the Dynamic DMA Windows (DDW) option defined by
> the SPAPR specification, which allows additional DMA window(s).
> 
> The "ddw" property is enabled by default on a PHB but for compatibility
> the pseries-2.5 machine (TODO: update version) and older disable it.

Looks like your todo is now todone, but you need to update the commit
message.

> This also creates a single DMA window for the older machines to
> maintain backward migration.
> 
> This implements DDW for PHB with emulated and VFIO devices. The host
> kernel support is required. The advertised IOMMU page sizes are 4K and
> 64K; 16M pages are supported but not advertised by default. To
> enable them, the user has to specify the "pgsz" property for the PHB
> and enable huge pages for RAM.
> 
> The existing Linux guests try creating one additional huge DMA window
> with 64K or 16MB pages and map the entire guest RAM to it. If this
> succeeds, the guest switches to dma_direct_ops and never calls TCE hypercalls
> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
> and not waste time on map/unmap later. This adds a "dma64_win_addr"
> property which is a bus address for the 64bit window and by default
> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
> uses and this allows having emulated and VFIO devices on the same bus.
> 
> This adds 4 RTAS handlers:
> * ibm,query-pe-dma-window
> * ibm,create-pe-dma-window
> * ibm,remove-pe-dma-window
> * ibm,reset-pe-dma-window
> These are registered from type_init() callback.
> 
> These RTAS handlers are implemented in a separate file to avoid polluting
> spapr_iommu.c with PCI.
> 
> This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Looks pretty close to ready.

There are a handful of nits and one real error noted below.

> ---
> Changes:
> v17:
> * fixed: "query" did return non-page-shifted value when memory hotplug is enabled
> 
> v16:
> * s/dma_liobn/dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS]/
> * s/SPAPR_PCI_LIOBN()/dma_liobn[]/
> 
> v15:
> * moved page mask filtering to PHB realize(), use "-mempath" to know
> if there are huge pages
> * fixed error reporting in RTAS handlers
> * max window size now accounts for hotpluggable memory boundaries
> ---
>  hw/ppc/Makefile.objs        |   1 +
>  hw/ppc/spapr.c              |   5 +
>  hw/ppc/spapr_pci.c          |  77 +++++++++---
>  hw/ppc/spapr_rtas_ddw.c     | 293 ++++++++++++++++++++++++++++++++++++++++++++
>  include/hw/pci-host/spapr.h |   8 +-
>  include/hw/ppc/spapr.h      |  16 ++-
>  trace-events                |   4 +
>  7 files changed, 383 insertions(+), 21 deletions(-)
>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
> 
> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> index c1ffc77..986b36f 100644
> --- a/hw/ppc/Makefile.objs
> +++ b/hw/ppc/Makefile.objs
> @@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>  obj-y += spapr_pci_vfio.o
>  endif
> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>  # PowerPC 4xx boards
>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>  obj-y += ppc4xx_pci.o
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 44e401a..6ddcda9 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -2366,6 +2366,11 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
>          .driver   = "spapr-vlan", \
>          .property = "use-rx-buffer-pools", \
>          .value    = "off", \
> +    }, \
> +    {\
> +        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
> +        .property = "ddw",\
> +        .value    = stringify(off),\
>      },
>  
>  static void spapr_machine_2_5_instance_options(MachineState *machine)
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 68de523..bcf0360 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -35,6 +35,7 @@
>  #include "hw/ppc/spapr.h"
>  #include "hw/pci-host/spapr.h"
>  #include "exec/address-spaces.h"
> +#include "exec/ram_addr.h"
>  #include <libfdt.h>
>  #include "trace.h"
>  #include "qemu/error-report.h"
> @@ -45,6 +46,7 @@
>  #include "hw/ppc/spapr_drc.h"
>  #include "sysemu/device_tree.h"
>  #include "sysemu/kvm.h"
> +#include "sysemu/hostmem.h"
>  
>  #include "hw/vfio/vfio.h"
>  
> @@ -1088,7 +1090,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>      int fdt_start_offset = 0, fdt_size;
>  
>      if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
> -        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
> +        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
>  
>          spapr_tce_set_need_vfio(tcet, true);
>      }

Hang on.. I thought you'd got rid of the need for this explicit
set_need_vfio() stuff.

> @@ -1310,11 +1312,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>      PCIBus *bus;
>      uint64_t msi_window_size = 4096;
>      sPAPRTCETable *tcet;
> +    const unsigned windows_supported =
> +        sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
>  
>      if (sphb->index != (uint32_t)-1) {
>          hwaddr windows_base;
>  
> -        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn != (uint32_t)-1)
> +        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
> +            || ((sphb->dma_liobn[1] != (uint32_t)-1) && (windows_supported > 1))
>              || (sphb->mem_win_addr != (hwaddr)-1)
>              || (sphb->io_win_addr != (hwaddr)-1)) {
>              error_setg(errp, "Either \"index\" or other parameters must"
> @@ -1329,7 +1334,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>          }
>  
>          sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
> -        sphb->dma_liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
> +        for (i = 0; i < windows_supported; ++i) {
> +            sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
> +        }
>  
>          windows_base = SPAPR_PCI_WINDOW_BASE
>              + sphb->index * SPAPR_PCI_WINDOW_SPACING;
> @@ -1342,8 +1349,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>          return;
>      }
>  
> -    if (sphb->dma_liobn == (uint32_t)-1) {
> -        error_setg(errp, "LIOBN not specified for PHB");
> +    if ((sphb->dma_liobn[0] == (uint32_t)-1) ||
> +        ((sphb->dma_liobn[1] == (uint32_t)-1) && (windows_supported > 1))) {
> +        error_setg(errp, "LIOBN(s) not specified for PHB");
>          return;
>      }

Hrm.. there's a bit of false generality here, since this would break
if windows_supported > 2, and dma_liobn[2] was not specified.  Not
urgent for the initial commit though.
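
For illustration only, a loop over the supported windows would remove the hardcoded index 1. This is an untested standalone sketch: MAX_WINDOWS, LIOBN_UNSET and both helper names are made up here and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_WINDOWS 2                   /* mirrors SPAPR_PCI_DMA_MAX_WINDOWS */
#define LIOBN_UNSET ((uint32_t)-1)      /* property default used by the PHB */

/* True if any of the first n LIOBNs is still at its unset default. */
static bool any_liobn_unset(const uint32_t *liobn, unsigned n)
{
    unsigned i;

    assert(n <= MAX_WINDOWS);
    for (i = 0; i < n; ++i) {
        if (liobn[i] == LIOBN_UNSET) {
            return true;
        }
    }
    return false;
}

/* True if any of the first n LIOBNs was set explicitly. */
static bool any_liobn_set(const uint32_t *liobn, unsigned n)
{
    unsigned i;

    assert(n <= MAX_WINDOWS);
    for (i = 0; i < n; ++i) {
        if (liobn[i] != LIOBN_UNSET) {
            return true;
        }
    }
    return false;
}
```

The "index" branch would then reject the configuration when any_liobn_set() is true, and the manual branch when any_liobn_unset() is true, both called with windows_supported as n.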

> @@ -1461,16 +1469,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>          }
>      }
>  
> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
> -    if (!tcet) {
> -        error_setg(errp, "Unable to create TCE table for %s",
> -                   sphb->dtbusname);
> -        return;
> +    /* DMA setup */
> +    for (i = 0; i < windows_supported; ++i) {
> +        tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn[i]);
> +        if (!tcet) {
> +            error_setg(errp, "Creating window#%d failed for %s",
> +                       i, sphb->dtbusname);
> +            return;
> +        }
> +        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
> +                                            spapr_tce_get_iommu(tcet), 0);
>      }
>  
> -    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
> -                                        spapr_tce_get_iommu(tcet), 0);
> -
>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>  }
>  
> @@ -1487,13 +1497,19 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
>  
>  void spapr_phb_dma_reset(sPAPRPHBState *sphb)
>  {
> -    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
> +    int i;
> +    sPAPRTCETable *tcet;
>  
> -    if (tcet && tcet->nb_table) {
> -        spapr_tce_table_disable(tcet);
> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> +        tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[i]);
> +
> +        if (tcet && tcet->nb_table) {
> +            spapr_tce_table_disable(tcet);
> +        }
>      }
>  
>      /* Register default 32bit DMA window */
> +    tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[0]);
>      spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
>                             sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT);
>  }
> @@ -1515,7 +1531,8 @@ static void spapr_phb_reset(DeviceState *qdev)
>  static Property spapr_phb_properties[] = {
>      DEFINE_PROP_UINT32("index", sPAPRPHBState, index, -1),
>      DEFINE_PROP_UINT64("buid", sPAPRPHBState, buid, -1),
> -    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn, -1),
> +    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn[0], -1),
> +    DEFINE_PROP_UINT32("liobn64", sPAPRPHBState, dma_liobn[1], -1),
>      DEFINE_PROP_UINT64("mem_win_addr", sPAPRPHBState, mem_win_addr, -1),
>      DEFINE_PROP_UINT64("mem_win_size", sPAPRPHBState, mem_win_size,
>                         SPAPR_PCI_MMIO_WIN_SIZE),
> @@ -1527,6 +1544,11 @@ static Property spapr_phb_properties[] = {
>      /* Default DMA window is 0..1GB */
>      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
>      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
> +                       0x800000000000000ULL),
> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
> +    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
> +                       (1ULL << 12) | (1ULL << 16)),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> @@ -1603,7 +1625,7 @@ static const VMStateDescription vmstate_spapr_pci = {
>      .post_load = spapr_pci_post_load,
>      .fields = (VMStateField[]) {
>          VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),
> -        VMSTATE_UINT32_EQUAL(dma_liobn, sPAPRPHBState),
> +        VMSTATE_UNUSED(4), /* dma_liobn */
>          VMSTATE_UINT64_EQUAL(mem_win_addr, sPAPRPHBState),
>          VMSTATE_UINT64_EQUAL(mem_win_size, sPAPRPHBState),
>          VMSTATE_UINT64_EQUAL(io_win_addr, sPAPRPHBState),
> @@ -1780,6 +1802,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      uint32_t interrupt_map_mask[] = {
>          cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
>      uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
> +    uint32_t ddw_applicable[] = {
> +        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
> +        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
> +        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
> +    };
> +    uint32_t ddw_extensions[] = {
> +        cpu_to_be32(1),
> +        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
> +    };
>      sPAPRTCETable *tcet;
>      PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
>      sPAPRFDT s_fdt;
> @@ -1804,6 +1835,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
>  
> +    /* Dynamic DMA window */
> +    if (phb->ddw_enabled) {
> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
> +                         sizeof(ddw_applicable)));
> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
> +                         &ddw_extensions, sizeof(ddw_extensions)));
> +    }
> +
>      /* Build the interrupt-map, this must matches what is done
>       * in pci_spapr_map_irq
>       */
> @@ -1827,7 +1866,7 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
>                       sizeof(interrupt_map)));
>  
> -    tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
> +    tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
>      if (!tcet) {
>          return -1;
>      }
> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
> new file mode 100644
> index 0000000..17bbae0
> --- /dev/null
> +++ b/hw/ppc/spapr_rtas_ddw.c
> @@ -0,0 +1,293 @@
> +/*
> + * QEMU sPAPR Dynamic DMA windows support
> + *
> + * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License,
> + *  or (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "cpu.h"
> +#include "qemu/error-report.h"
> +#include "hw/ppc/spapr.h"
> +#include "hw/pci-host/spapr.h"
> +#include "trace.h"
> +
> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
> +{
> +    sPAPRTCETable *tcet;
> +
> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> +    if (tcet && tcet->nb_table) {
> +        ++*(unsigned *)opaque;
> +    }
> +    return 0;
> +}
> +
> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
> +{
> +    unsigned ret = 0;
> +
> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
> +
> +    return ret;
> +}
> +
> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
> +{
> +    sPAPRTCETable *tcet;
> +
> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> +    if (tcet && !tcet->nb_table) {
> +        *(uint32_t *)opaque = tcet->liobn;
> +        return 1;
> +    }
> +    return 0;
> +}
> +
> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
> +{
> +    uint32_t liobn = 0;
> +
> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
> +
> +    return liobn;
> +}
> +
> +static uint32_t spapr_page_mask_to_query_mask(uint64_t page_mask)
> +{
> +    int i;
> +    uint32_t mask = 0;
> +    const struct { int shift; uint32_t mask; } masks[] = {
> +        { 12, RTAS_DDW_PGSIZE_4K },
> +        { 16, RTAS_DDW_PGSIZE_64K },
> +        { 24, RTAS_DDW_PGSIZE_16M },
> +        { 25, RTAS_DDW_PGSIZE_32M },
> +        { 26, RTAS_DDW_PGSIZE_64M },
> +        { 27, RTAS_DDW_PGSIZE_128M },
> +        { 28, RTAS_DDW_PGSIZE_256M },
> +        { 34, RTAS_DDW_PGSIZE_16G },
> +    };
> +
> +    for (i = 0; i < ARRAY_SIZE(masks); ++i) {
> +        if (page_mask & (1ULL << masks[i].shift)) {
> +            mask |= masks[i].mask;
> +        }
> +    }
> +
> +    return mask;
> +}
> +
> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
> +                                         sPAPRMachineState *spapr,
> +                                         uint32_t token, uint32_t nargs,
> +                                         target_ulong args,
> +                                         uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    uint64_t buid, max_window_size;
> +    uint32_t avail, addr, pgmask = 0;
> +    MachineState *machine = MACHINE(spapr);
> +
> +    if ((nargs != 3) || (nret != 5)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    /* Translate page mask to LoPAPR format */
> +    pgmask = spapr_page_mask_to_query_mask(sphb->page_size_mask);
> +
> +    /*
> +     * This is "Largest contiguous block of TCEs allocated specifically
> +     * for (that is, are reserved for) this PE".
> +     * Return the largest possible number, i.e. the maximum supported RAM
> +     * size expressed in 4K pages.
> +     */
> +    if (machine->ram_size == machine->maxram_size) {
> +        max_window_size = machine->ram_size;
> +    } else {
> +        MemoryHotplugState *hpms = &spapr->hotplug_memory;
> +
> +        max_window_size = hpms->base + memory_region_size(&hpms->mr);
> +    }
> +
> +    avail = SPAPR_PCI_DMA_MAX_WINDOWS - spapr_phb_get_active_win_num(sphb);
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    rtas_st(rets, 1, avail);
> +    rtas_st(rets, 2, max_window_size >> SPAPR_TCE_PAGE_SHIFT);
> +    rtas_st(rets, 3, pgmask);
> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
> +
> +    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
> +                                          sPAPRMachineState *spapr,
> +                                          uint32_t token, uint32_t nargs,
> +                                          target_ulong args,
> +                                          uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    sPAPRTCETable *tcet = NULL;
> +    uint32_t addr, page_shift, window_shift, liobn;
> +    uint64_t buid;
> +
> +    if ((nargs != 5) || (nret != 4)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    page_shift = rtas_ld(args, 3);
> +    window_shift = rtas_ld(args, 4);
> +    liobn = spapr_phb_get_free_liobn(sphb);
> +
> +    if (!(sphb->page_size_mask & (1ULL << page_shift)) ||
> +        (window_shift < page_shift)) {
> +        goto param_error_exit;
> +    }
> +
> +    if (!liobn || !sphb->ddw_enabled ||
> +        spapr_phb_get_active_win_num(sphb) == SPAPR_PCI_DMA_MAX_WINDOWS) {
> +        goto hw_error_exit;
> +    }
> +
> +    tcet = spapr_tce_find_by_liobn(liobn);
> +    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
> +                                 1ULL << window_shift,
> +                                 tcet ? tcet->bus_offset : 0xbaadf00d, liobn);
> +    if (!tcet) {
> +        goto hw_error_exit;
> +    }
> +
> +    spapr_tce_table_enable(tcet, page_shift, sphb->dma64_window_addr,

This looks like it's assuming you're creating the second 64-bit
window.  If the guest removed the default window then tried to
recreate it, that might not be the case.
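
One possible way around this, sketched standalone and untested: pick the bus address from the slot that is actually free instead of always using the 64-bit address. The sketch assumes each window slot has a fixed bus address, which the patch does not state; struct phb_sketch and both helpers are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WINDOWS 2   /* mirrors SPAPR_PCI_DMA_MAX_WINDOWS in the patch */

/* Hypothetical, trimmed-down view of the PHB state; not the real struct. */
struct phb_sketch {
    uint64_t win_base[MAX_WINDOWS]; /* [0] ~ dma_win_addr, [1] ~ dma64_window_addr */
    bool     in_use[MAX_WINDOWS];   /* true while the window has a TCE table */
};

/* Return the index of the first disabled window slot, or -1 if none. */
static int phb_free_slot(const struct phb_sketch *phb)
{
    int i;

    assert(phb != NULL);
    for (i = 0; i < MAX_WINDOWS; ++i) {
        if (!phb->in_use[i]) {
            return i;
        }
    }
    return -1;
}

/* Bus address for a window created in the given slot. */
static uint64_t phb_slot_base(const struct phb_sketch *phb, int slot)
{
    assert(slot >= 0 && slot < MAX_WINDOWS);
    return phb->win_base[slot];
}
```

With this shape, a guest that removes the default window and creates a new one gets slot 0 back at the default address, while the 64-bit address stays reserved for slot 1.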

> +                           1ULL << (window_shift - page_shift));
> +    if (!tcet->nb_table) {
> +        goto hw_error_exit;
> +    }
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    rtas_st(rets, 1, liobn);
> +    rtas_st(rets, 2, tcet->bus_offset >> 32);
> +    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
> +
> +    return;
> +
> +hw_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
> +                                          sPAPRMachineState *spapr,
> +                                          uint32_t token, uint32_t nargs,
> +                                          target_ulong args,
> +                                          uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    sPAPRTCETable *tcet;
> +    uint32_t liobn;
> +
> +    if ((nargs != 1) || (nret != 1)) {
> +        goto param_error_exit;
> +    }
> +
> +    liobn = rtas_ld(args, 0);
> +    tcet = spapr_tce_find_by_liobn(liobn);
> +    if (!tcet) {
> +        goto param_error_exit;
> +    }
> +
> +    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
> +    if (!sphb || !sphb->ddw_enabled || !tcet->nb_table) {
> +        goto param_error_exit;
> +    }
> +
> +    spapr_tce_table_disable(tcet);
> +    trace_spapr_iommu_ddw_remove(liobn);
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
> +                                         sPAPRMachineState *spapr,
> +                                         uint32_t token, uint32_t nargs,
> +                                         target_ulong args,
> +                                         uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    uint64_t buid;
> +    uint32_t addr;
> +
> +    if ((nargs != 3) || (nret != 1)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    spapr_phb_dma_reset(sphb);
> +    trace_spapr_iommu_ddw_reset(buid, addr);
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void spapr_rtas_ddw_init(void)
> +{
> +    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
> +                        "ibm,query-pe-dma-window",
> +                        rtas_ibm_query_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
> +                        "ibm,create-pe-dma-window",
> +                        rtas_ibm_create_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
> +                        "ibm,remove-pe-dma-window",
> +                        rtas_ibm_remove_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
> +                        "ibm,reset-pe-dma-window",
> +                        rtas_ibm_reset_pe_dma_window);
> +}
> +
> +type_init(spapr_rtas_ddw_init)
> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> index 7848366..36a370e 100644
> --- a/include/hw/pci-host/spapr.h
> +++ b/include/hw/pci-host/spapr.h
> @@ -32,6 +32,8 @@
>  #define SPAPR_PCI_HOST_BRIDGE(obj) \
>      OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
>  
> +#define SPAPR_PCI_DMA_MAX_WINDOWS    2
> +
>  typedef struct sPAPRPHBState sPAPRPHBState;
>  
>  typedef struct spapr_pci_msi {
> @@ -56,7 +58,7 @@ struct sPAPRPHBState {
>      hwaddr mem_win_addr, mem_win_size, io_win_addr, io_win_size;
>      MemoryRegion memwindow, iowindow, msiwindow;
>  
> -    uint32_t dma_liobn;
> +    uint32_t dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS];
>      hwaddr dma_win_addr, dma_win_size;
>      AddressSpace iommu_as;
>      MemoryRegion iommu_root;
> @@ -71,6 +73,10 @@ struct sPAPRPHBState {
>      spapr_pci_msi_mig *msi_devs;
>  
>      QLIST_ENTRY(sPAPRPHBState) list;
> +
> +    bool ddw_enabled;
> +    uint64_t page_size_mask;
> +    uint64_t dma64_window_addr;
>  };
>  
>  #define SPAPR_PCI_MAX_INDEX          255
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 971df3d..59fad22 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -412,6 +412,16 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>  #define RTAS_OUT_NOT_AUTHORIZED                 -9002
>  #define RTAS_OUT_SYSPARM_PARAM_ERROR            -9999
>  
> +/* DDW pagesize mask values from ibm,query-pe-dma-window */
> +#define RTAS_DDW_PGSIZE_4K       0x01
> +#define RTAS_DDW_PGSIZE_64K      0x02
> +#define RTAS_DDW_PGSIZE_16M      0x04
> +#define RTAS_DDW_PGSIZE_32M      0x08
> +#define RTAS_DDW_PGSIZE_64M      0x10
> +#define RTAS_DDW_PGSIZE_128M     0x20
> +#define RTAS_DDW_PGSIZE_256M     0x40
> +#define RTAS_DDW_PGSIZE_16G      0x80
> +
>  /* RTAS tokens */
>  #define RTAS_TOKEN_BASE      0x2000
>  
> @@ -453,8 +463,12 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>  #define RTAS_IBM_SET_SLOT_RESET                 (RTAS_TOKEN_BASE + 0x23)
>  #define RTAS_IBM_CONFIGURE_PE                   (RTAS_TOKEN_BASE + 0x24)
>  #define RTAS_IBM_SLOT_ERROR_DETAIL              (RTAS_TOKEN_BASE + 0x25)
> +#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x26)
> +#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x27)
> +#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x28)
> +#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x29)
>  
> -#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x26)
> +#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x2A)
>  
>  /* RTAS ibm,get-system-parameter token values */
>  #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
> diff --git a/trace-events b/trace-events
> index ec32c20..dec80e4 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1433,6 +1433,10 @@ spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, un
>  spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
>  spapr_iommu_pre_save(uint64_t liobn, uint32_t nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
>  spapr_iommu_post_load(uint64_t liobn, uint32_t pre_nb, uint32_t post_nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" => %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
> +spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32
> +spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, uint64_t pg_size, uint64_t req_size, uint64_t start, uint32_t liobn) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%"PRIx64", requested=0x%"PRIx64", start addr=%"PRIx64", liobn=%"PRIx32
> +spapr_iommu_ddw_remove(uint32_t liobn) "liobn=%"PRIx32
> +spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr) "buid=%"PRIx64" addr=%"PRIx32
>  
>  # hw/ppc/ppc.c
>  ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
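
The arithmetic the handlers above rely on can be checked in isolation. This is a standalone sketch, not code from the patch: the three helper names are invented for illustration. It shows how the TCE count follows from the two shifts passed to create, and how a 64-bit value such as the BUID or bus offset travels through 32-bit RTAS cells.

```c
#include <assert.h>
#include <stdint.h>

/* Number of TCEs for a window of 2^window_shift bytes with 2^page_shift pages. */
static uint64_t ddw_nb_tces(uint32_t page_shift, uint32_t window_shift)
{
    assert(window_shift >= page_shift);
    assert(window_shift - page_shift < 64);
    return 1ULL << (window_shift - page_shift);
}

/* Combine two 32-bit cells (high cell first) into one 64-bit value. */
static uint64_t cells_to_u64(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Split a 64-bit value into two 32-bit cells, high cell first. */
static void u64_to_cells(uint64_t v, uint32_t *hi, uint32_t *lo)
{
    *hi = (uint32_t)(v >> 32);
    *lo = (uint32_t)(v & 0xffffffffu);
}
```

For example, a 2GB window (window_shift 31) with 64K pages (page_shift 16) needs 1 << 15 = 32768 TCEs.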

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v17 07/12] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2)
  2016-06-03 16:13   ` [Qemu-devel] [PATCH qemu v17 07/12] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2) Alex Williamson
@ 2016-06-06  6:04     ` Alexey Kardashevskiy
  2016-06-06 17:20       ` Alex Williamson
  0 siblings, 1 reply; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-06  6:04 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson

On 04/06/16 02:13, Alex Williamson wrote:
> On Wed,  1 Jun 2016 18:57:38 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> This makes use of the new "memory registering" feature. The idea is
>> to give userspace the ability to notify the host kernel about pages
>> which are going to be used for DMA. Having this information, the host
>> kernel can pin them all once per user process, do locked pages
>> accounting (once) and not spend time on doing that in real time with
>> possible failures which cannot be handled nicely in some cases.
>>
>> This adds a prereg memory listener which listens on address_space_memory
>> and notifies a VFIO container about memory which needs to be
>> pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
>>
>> As there is no per-IOMMU-type release() callback anymore, this stores
>> the IOMMU type in the container so vfio_listener_release() can determine
>> if it needs to unregister @prereg_listener.
>>
>> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
>> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
>> not call it when v2 is detected and enabled.
>>
>> This enforces guest RAM blocks to be host page size aligned; however
>> this is not new as KVM already requires memory slots to be host page
>> size aligned.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v17:
>> * s/prereg\.c/spapr.c/
>> * s/vfio_prereg_gpa_to_ua/vfio_prereg_gpa_to_vaddr/
>> * vfio_prereg_listener_skipped_section does hw_error() on IOMMUs
>>
>> v16:
>> * switched to 64bit math everywhere as there is no chance to see
>> region_add on RAM blocks even remotely close to 1<<64bytes.
>>
>> v15:
>> * banned unaligned sections
>> * added a vfio_prereg_gpa_to_ua() helper
>>
>> v14:
>> * s/free_container_exit/listener_release_exit/g
>> * added "if memory_region_is_iommu()" to vfio_prereg_listener_skipped_section
>> ---
>>  hw/vfio/Makefile.objs         |   1 +
>>  hw/vfio/common.c              |  38 +++++++++---
>>  hw/vfio/spapr.c               | 137 ++++++++++++++++++++++++++++++++++++++++++
>>  include/hw/vfio/vfio-common.h |   4 ++
>>  trace-events                  |   2 +
>>  5 files changed, 172 insertions(+), 10 deletions(-)
>>  create mode 100644 hw/vfio/spapr.c
>>
>> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
>> index ceddbb8..c25e32b 100644
>> --- a/hw/vfio/Makefile.objs
>> +++ b/hw/vfio/Makefile.objs
>> @@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
>>  obj-$(CONFIG_SOFTMMU) += platform.o
>>  obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
>>  obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
>> +obj-$(CONFIG_SOFTMMU) += spapr.o
>>  endif
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index f1a12b0..770f630 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -504,6 +504,9 @@ static const MemoryListener vfio_memory_listener = {
>>  static void vfio_listener_release(VFIOContainer *container)
>>  {
>>      memory_listener_unregister(&container->listener);
>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>> +        memory_listener_unregister(&container->prereg_listener);
>> +    }
>>  }
>>  
>>  static struct vfio_info_cap_header *
>> @@ -862,8 +865,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>              goto free_container_exit;
>>          }
>>  
>> -        ret = ioctl(fd, VFIO_SET_IOMMU,
>> -                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
>> +        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
>> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>>          if (ret) {
>>              error_report("vfio: failed to set iommu for container: %m");
>>              ret = -errno;
>> @@ -888,8 +891,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>          if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
>>              container->iova_pgsizes = info.iova_pgsizes;
>>          }
>> -    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
>> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
>> +               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
>>          struct vfio_iommu_spapr_tce_info info;
>> +        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
>>  
>>          ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>>          if (ret) {
>> @@ -897,7 +902,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>              ret = -errno;
>>              goto free_container_exit;
>>          }
>> -        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
>> +        container->iommu_type =
>> +            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
>> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>>          if (ret) {
>>              error_report("vfio: failed to set iommu for container: %m");
>>              ret = -errno;
>> @@ -909,11 +916,22 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>           * when container fd is closed so we do not call it explicitly
>>           * in this file.
>>           */
>> -        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
>> -        if (ret) {
>> -            error_report("vfio: failed to enable container: %m");
>> -            ret = -errno;
>> -            goto free_container_exit;
>> +        if (!v2) {
>> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
>> +            if (ret) {
>> +                error_report("vfio: failed to enable container: %m");
>> +                ret = -errno;
>> +                goto free_container_exit;
>> +            }
>> +        } else {
>> +            container->prereg_listener = vfio_prereg_listener;
>> +
>> +            memory_listener_register(&container->prereg_listener,
>> +                                     &address_space_memory);
>> +            if (container->error) {
>> +                error_report("vfio: RAM memory listener initialization failed for container");
>> +                goto listener_release_exit;
> 
> Why doesn't this goto free_container_exit?  registration failure should
> not need an unregister.


The listener registration itself cannot possibly fail: it adds a listener to
the memory_listeners list, no matter what region_add() does.

I'll add an explicit
memory_listener_unregister(&container->prereg_listener) here.
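
To illustrate the point, here is a minimal standalone sketch (not QEMU code;
all struct and function names are hypothetical stand-ins) of the pattern:
registration always succeeds and replays existing sections through
region_add(), the callback stores the first error in the container, and the
caller checks that field and unregisters explicitly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for MemoryListener and VFIOContainer. */
struct listener {
    void (*region_add)(struct listener *l, int section);
    int registered;
};

struct container {
    struct listener prereg;
    int error;                  /* first error recorded by region_add() */
};

/* Registration has no failure path: it marks the listener as registered
 * and replays the already existing sections through region_add(). */
static void listener_register(struct listener *l, const int *sections,
                              size_t n)
{
    size_t i;

    assert(l->region_add != NULL);
    l->registered = 1;
    for (i = 0; i < n; i++) {
        l->region_add(l, sections[i]);
    }
}

static void listener_unregister(struct listener *l)
{
    l->registered = 0;
}

/* The callback cannot return an error, so it records the first one.
 * In this sketch a negative section value simulates a failure. */
static void prereg_region_add(struct listener *l, int section)
{
    struct container *c = (struct container *)
        ((char *)l - offsetof(struct container, prereg));

    if (section < 0 && !c->error) {
        c->error = section;
    }
}

/* Returns 0 on success; on failure the listener is unregistered again. */
static int container_setup(struct container *c, const int *sections,
                           size_t n)
{
    c->error = 0;
    c->prereg.region_add = prereg_region_add;
    listener_register(&c->prereg, sections, n);
    if (c->error) {
        listener_unregister(&c->prereg);
        return c->error;
    }
    return 0;
}
```

In this sketch the caller only learns about a failure by inspecting
c->error after registration, which is why the error path has to undo the
registration itself.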


> 
>> +            }
>>          }
>>  
>>          /*
>> @@ -926,7 +944,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>          if (ret) {
>>              error_report("vfio: VFIO_IOMMU_SPAPR_TCE_GET_INFO failed: %m");
>>              ret = -errno;
>> -            goto free_container_exit;
>> +            goto listener_release_exit;
> 
> Looks like this will cause much badness when we try to do
> memory_listener_unregister() on an empty listener struct for the main
> listener.


Oh. Bug. I'll add
 memory_listener_unregister(&container->prereg_listener) here.


> 
>>          }
>>          container->min_iova = info.dma32_window_start;
>>          container->max_iova = container->min_iova + info.dma32_window_size - 1;
>> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
>> new file mode 100644
>> index 0000000..f339472
>> --- /dev/null
>> +++ b/hw/vfio/spapr.c
>> @@ -0,0 +1,137 @@
>> +/*
>> + * DMA memory preregistration
>> + *
>> + * Authors:
>> + *  Alexey Kardashevskiy <aik@ozlabs.ru>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.  See
>> + * the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "cpu.h"
>> +#include <sys/ioctl.h>
>> +#include <linux/vfio.h>
>> +
>> +#include "hw/vfio/vfio-common.h"
>> +#include "hw/hw.h"
>> +#include "qemu/error-report.h"
>> +#include "trace.h"
>> +
>> +static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
>> +{
>> +    if (memory_region_is_iommu(section->mr)) {
>> +        hw_error("Cannot possibly preregister IOMMU memory");
>> +    }
>> +
>> +    return !memory_region_is_ram(section->mr) ||
>> +            memory_region_is_skip_dump(section->mr);
>> +}
>> +
>> +static void *vfio_prereg_gpa_to_vaddr(MemoryRegionSection *section, hwaddr gpa)
>> +{
>> +    return memory_region_get_ram_ptr(section->mr) +
>> +        section->offset_within_region +
>> +        (gpa - section->offset_within_address_space);
>> +}
>> +
>> +static void vfio_prereg_listener_region_add(MemoryListener *listener,
>> +                                            MemoryRegionSection *section)
>> +{
>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>> +                                            prereg_listener);
>> +    const hwaddr gpa = section->offset_within_address_space;
>> +    hwaddr end;
>> +    int ret;
>> +    hwaddr page_mask = qemu_real_host_page_mask;
>> +    struct vfio_iommu_spapr_register_memory reg = {
>> +        .argsz = sizeof(reg),
>> +        .flags = 0,
>> +    };
>> +
>> +    if (vfio_prereg_listener_skipped_section(section)) {
>> +        trace_vfio_listener_region_add_skip(
>> +                section->offset_within_address_space,
>> +                section->offset_within_address_space +
>> +                int128_get64(int128_sub(section->size, int128_one())));
> 
> How will we know if this trace is related to the main listener or the
> prereg listener?


By addresses it prints :)

Fair point, one question though:

trace_vfio_prereg_listener_region_add_skip or
trace_vfio_spapr_listener_region_add_skip ?

Should all symbols in this file get "spapr" instead of "prereg"?


> 
>> +        return;
>> +    }
>> +
>> +    if (unlikely((section->offset_within_address_space & ~page_mask) ||
>> +                 (section->offset_within_region & ~page_mask) ||
>> +                 (int128_get64(section->size) & ~page_mask))) {
>> +        error_report("%s received unaligned region", __func__);
>> +        return;
>> +    }
>> +
>> +    end = section->offset_within_address_space + int128_get64(section->size);
>> +    g_assert(gpa < end);
> 
> This would imply a zero-sized region, can't you simply return?

Zero-sized region or overflow, no?

When I copied this from vfio_listener_region_add(), I thought it was an
overflow check (which imho should have been assert() or hw_error(),
shouldn't it?). What am I missing?
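
A small standalone sketch (not QEMU code; the enum and function names are
made up for illustration) of why "gpa < end" being false can mean either
case: with 64-bit unsigned math, end == gpa for a zero size, and end <= gpa
when gpa + size wraps.

```c
#include <assert.h>
#include <stdint.h>

enum range_kind { RANGE_OK, RANGE_EMPTY, RANGE_WRAPS };

/* Classify [gpa, gpa + size) using the same unsigned arithmetic as
 * "end = gpa + size; if (gpa >= end) ...". */
static enum range_kind classify_range(uint64_t gpa, uint64_t size)
{
    uint64_t end = gpa + size;  /* unsigned wraparound is well defined */

    if (size == 0) {
        return RANGE_EMPTY;     /* end == gpa */
    }
    if (end <= gpa) {
        return RANGE_WRAPS;     /* gpa + size went past 2^64 */
    }
    return RANGE_OK;
}

/* Usage example: the three outcomes side by side. */
static void classify_range_demo(void)
{
    assert(classify_range(0x1000, 0x1000) == RANGE_OK);
    assert(classify_range(0x1000, 0) == RANGE_EMPTY);
    assert(classify_range(UINT64_MAX - 0xfff, 0x2000) == RANGE_WRAPS);
}
```

So a single "gpa >= end" test cannot tell the two cases apart; whether
either deserves an assert or a plain return is the policy question above.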


> 
>> +
>> +    memory_region_ref(section->mr);
>> +
>> +    reg.vaddr = (__u64) vfio_prereg_gpa_to_vaddr(section, gpa);
>> +    reg.size = end - gpa;
>> +
>> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
>> +    trace_vfio_ram_register(reg.vaddr, reg.size, ret ? -errno : 0);
>> +    if (ret) {
>> +        /*
>> +         * On the initfn path, store the first error in the container so we
>> +         * can gracefully fail.  Runtime, there's not much we can do other
>> +         * than throw a hardware error.
>> +         */
>> +        if (!container->initialized) {
>> +            if (!container->error) {
>> +                container->error = ret;
>> +            }
>> +        } else {
>> +            hw_error("vfio: Memory registering failed, unable to continue");
>> +        }
>> +    }
>> +}
>> +
>> +static void vfio_prereg_listener_region_del(MemoryListener *listener,
>> +                                            MemoryRegionSection *section)
>> +{
>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>> +                                            prereg_listener);
>> +    const hwaddr gpa = section->offset_within_address_space;
>> +    hwaddr end;
>> +    int ret;
>> +    hwaddr page_mask = qemu_real_host_page_mask;
>> +    struct vfio_iommu_spapr_register_memory reg = {
>> +        .argsz = sizeof(reg),
>> +        .flags = 0,
>> +    };
>> +
>> +    if (vfio_prereg_listener_skipped_section(section)) {
>> +        trace_vfio_listener_region_del_skip(
>> +                section->offset_within_address_space,
>> +                section->offset_within_address_space +
>> +                int128_get64(int128_sub(section->size, int128_one())));
> 
> Again, indistinguishable from main listener trace.
> 
>> +        return;
>> +    }
>> +
>> +    if (unlikely((section->offset_within_address_space & ~page_mask) ||
>> +                 (section->offset_within_region & ~page_mask) ||
>> +                 (int128_get64(section->size) & ~page_mask))) {
>> +        error_report("%s received unaligned region", __func__);
>> +        return;
>> +    }
>> +
>> +    end = section->offset_within_address_space + int128_get64(section->size);
>> +    if (gpa >= end) {
>> +        return;
> 
> We simply return here, not sure why we need to g_assert above.

Well, we won't get this far if this is the case - region_add() would fail
first.


> 
>> +    }
>> +
>> +    reg.vaddr = (__u64) vfio_prereg_gpa_to_vaddr(section, gpa);
>> +    reg.size = end - gpa;
>> +
>> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
>> +    trace_vfio_ram_unregister(reg.vaddr, reg.size, ret ? -errno : 0);
>> +}
>> +
>> +const MemoryListener vfio_prereg_listener = {
>> +    .region_add = vfio_prereg_listener_region_add,
>> +    .region_del = vfio_prereg_listener_region_del,
>> +};
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 0610377..405c3b2 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -73,6 +73,8 @@ typedef struct VFIOContainer {
>>      VFIOAddressSpace *space;
>>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>>      MemoryListener listener;
>> +    MemoryListener prereg_listener;
>> +    unsigned iommu_type;
>>      int error;
>>      bool initialized;
>>      /*
>> @@ -158,4 +160,6 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
>>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>>                               uint32_t subtype, struct vfio_region_info **info);
>>  #endif
>> +extern const MemoryListener vfio_prereg_listener;
>> +
>>  #endif /* !HW_VFIO_VFIO_COMMON_H */
>> diff --git a/trace-events b/trace-events
>> index de42012..ddb8676 100644
>> --- a/trace-events
>> +++ b/trace-events
>> @@ -1766,6 +1766,8 @@ vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps e
>>  vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Device %s region %d: %d sparse mmap entries"
>>  vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
>>  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
>> +vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>> +vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> 
> This file loosely calls out which file the trace is in, these are not
> in common.c.
> 
>>  
>>  # hw/vfio/platform.c
>>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
> 


-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v17 10/12] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)
  2016-06-03  7:37   ` [Qemu-devel] [PATCH qemu v17 10/12] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2) David Gibson
@ 2016-06-06  6:45     ` Alexey Kardashevskiy
  2016-06-08  5:56       ` David Gibson
  0 siblings, 1 reply; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-06  6:45 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-devel, qemu-ppc, Alexander Graf, Alex Williamson

On 03/06/16 17:37, David Gibson wrote:
> On Wed, Jun 01, 2016 at 06:57:41PM +1000, Alexey Kardashevskiy wrote:
>> The new VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
>> This adds the ability for VFIO common code to dynamically allocate/remove
>> DMA windows in the host kernel when a new VFIO container is added/removed.
>>
>> This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
>> and adds just created IOMMU into the host IOMMU list; the opposite
>> action is taken in vfio_listener_region_del.
>>
>> When creating a new window, this uses a heuristic to decide on the number
>> of TCE table levels.
>>
>> This should cause no guest visible change in behavior.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v17:
>> * moved spapr window create/remove helpers to separate file
>> * added hw_error() if vfio_host_win_del() failed
>>
>> v16:
>> * used memory_region_iommu_get_page_sizes() in vfio_listener_region_add()
>> * enforced no intersections between windows
>>
>> v14:
>> * new to the series
>> ---
>>  hw/vfio/common.c              | 76 +++++++++++++++++++++++++++++++++++++------
>>  hw/vfio/spapr.c               | 70 +++++++++++++++++++++++++++++++++++++++
>>  include/hw/vfio/vfio-common.h |  6 ++++
>>  trace-events                  |  2 ++
>>  4 files changed, 144 insertions(+), 10 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 52b08fd..7f55c26 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -275,6 +275,18 @@ static void vfio_host_win_add(VFIOContainer *container,
>>      QLIST_INSERT_HEAD(&container->hostwin_list, hostwin, hostwin_next);
>>  }
>>  
>> +static int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova)
>> +{
>> +    VFIOHostDMAWindow *hostwin = vfio_host_win_lookup(container, min_iova, 1);
> 
> Hrm.. and for this case I think you want exact match, rather than
> looking for range inclusion.

I suppose so, I'll change this.
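
For reference, a standalone sketch (not the QEMU helpers; struct and
function names are hypothetical) of the difference between the two lookups:
range inclusion finds any window covering the address, while an exact match
only finds the window that starts there, which is what a delete keyed on
min_iova needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct host_win {
    uint64_t min_iova;
    uint64_t max_iova;
};

/* Range inclusion: the window that covers [min_iova, max_iova]. */
static const struct host_win *win_lookup(const struct host_win *w, size_t n,
                                         uint64_t min_iova,
                                         uint64_t max_iova)
{
    size_t i;

    assert(min_iova <= max_iova);
    for (i = 0; i < n; i++) {
        if (w[i].min_iova <= min_iova && max_iova <= w[i].max_iova) {
            return &w[i];
        }
    }
    return NULL;
}

/* Exact match: only the window that starts exactly at min_iova. */
static const struct host_win *win_lookup_exact(const struct host_win *w,
                                               size_t n, uint64_t min_iova)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (w[i].min_iova == min_iova) {
            return &w[i];
        }
    }
    return NULL;
}
```

With a window at 0..0x3fffffff, an address such as 0x1000 is found by the
inclusive lookup but not by the exact one, so a delete using the exact
variant cannot remove a window by pointing into its middle.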


>> +
>> +    if (!hostwin) {
>> +        return -1;
>> +    }
>> +    QLIST_REMOVE(hostwin, hostwin_next);
>> +
>> +    return 0;
>> +}
>> +
>>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>>  {
>>      return (!memory_region_is_ram(section->mr) &&
>> @@ -388,6 +400,30 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>      }
>>      end = int128_get64(int128_sub(llend, int128_one()));
>>  
>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>> +        VFIOHostDMAWindow *hostwin;
>> +        hwaddr pgsize = 0;
>> +
>> +        /* For now intersections are not allowed, we may relax this later */
>> +        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
>> +            if (ranges_overlap(hostwin->min_iova,
>> +                               hostwin->max_iova - hostwin->min_iova + 1,
>> +                               section->offset_within_address_space,
>> +                               int128_get64(section->size))) {
>> +                goto fail;
>> +            }
>> +        }
>> +
>> +        ret = vfio_spapr_create_window(container, section, &pgsize);
>> +        if (ret) {
>> +            goto fail;
>> +        }
>> +
>> +        vfio_host_win_add(container, section->offset_within_address_space,
>> +                          section->offset_within_address_space +
>> +                          int128_get64(section->size) - 1, pgsize);
>> +    }
>> +
>>      if (!vfio_host_win_lookup(container, iova, end)) {
>>          error_report("vfio: IOMMU container %p can't map guest IOVA region"
>>                       " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
>> @@ -523,6 +559,18 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>                       "0x%"HWADDR_PRIx") = %d (%m)",
>>                       container, iova, int128_get64(llsize), ret);
>>      }
>> +
>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>> +        vfio_spapr_remove_window(container,
>> +                                 section->offset_within_address_space);
> 
> Should check for error here.


And do what here? vfio_spapr_remove_window() calls error_report() already
and I still want to remove the host window here.


> 
>> +        if (vfio_host_win_del(container,
>> +                              section->offset_within_address_space) < 0) {
>> +            hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
>> +                     __func__, section->offset_within_address_space);
> 
> Personally I think assert() would be better here, but Alex doesn't
> like them so I'm ok with this.
> 
>> +        }
>> +
>> +        trace_vfio_spapr_remove_window(section->offset_within_address_space);
>> +    }
>>  }
>>  
>>  static const MemoryListener vfio_memory_listener = {
>> @@ -960,11 +1008,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>              }
>>          }
>>  
>> -        /*
>> -         * This only considers the host IOMMU's 32-bit window.  At
>> -         * some point we need to add support for the optional 64-bit
>> -         * window and dynamic windows
>> -         */
>>          info.argsz = sizeof(info);
>>          ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
>>          if (ret) {
>> @@ -973,11 +1016,24 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>              goto listener_release_exit;
>>          }
>>  
>> -        /* The default table uses 4K pages */
>> -        vfio_host_win_add(container, info.dma32_window_start,
>> -                          info.dma32_window_start +
>> -                          info.dma32_window_size - 1,
>> -                          0x1000);
>> +        if (v2) {
>> +            /*
>> +             * There is a default window in just created container.
>> +             * To make region_add/del simpler, we better remove this
>> +             * window now and let those iommu_listener callbacks
>> +             * create/remove them when needed.
>> +             */
>> +            ret = vfio_spapr_remove_window(container, info.dma32_window_start);
>> +            if (ret) {
>> +                goto free_container_exit;
>> +            }
>> +        } else {
>> +            /* The default table uses 4K pages */
>> +            vfio_host_win_add(container, info.dma32_window_start,
>> +                              info.dma32_window_start +
>> +                              info.dma32_window_size - 1,
>> +                              0x1000);
>> +        }
>>      } else {
>>          error_report("vfio: No available IOMMU models");
>>          ret = -EINVAL;
>> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
>> index f339472..0c784c4 100644
>> --- a/hw/vfio/spapr.c
>> +++ b/hw/vfio/spapr.c
>> @@ -135,3 +135,73 @@ const MemoryListener vfio_prereg_listener = {
>>      .region_add = vfio_prereg_listener_region_add,
>>      .region_del = vfio_prereg_listener_region_del,
>>  };
>> +
>> +int vfio_spapr_create_window(VFIOContainer *container,
>> +                             MemoryRegionSection *section,
>> +                             hwaddr *pgsize)
>> +{
>> +    int ret;
>> +    unsigned pagesizes = memory_region_iommu_get_page_sizes(section->mr);
>> +    unsigned pagesize = (hwaddr)1 << ctz64(pagesizes);
>> +    unsigned entries, pages;
>> +    struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
>> +
>> +    /*
>> +     * FIXME: For VFIO iommu types which have KVM acceleration to
>> +     * avoid bouncing all map/unmaps through qemu this way, this
>> +     * would be the right place to wire that up (tell the KVM
>> +     * device emulation the VFIO iommu handles to use).
>> +     */
>> +    create.window_size = int128_get64(section->size);
>> +    create.page_shift = ctz64(pagesize);
> 
> Doing a ctz on a value which is defined as 1 << n seems a bit
> perverse.

Well, this way it felt more obvious that pagesize is a single page size,
not a mask. Not sure if memory_region_iommu_get_page_sizes() returning a
mask (rather than a page size) is a good idea after all...

I'll make it:

 create.page_shift = ctz64(pagesizes);

and (below):

 *pgsize = 1ULL << create.page_shift;


and remove pagesize.
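
A standalone sketch of that computation (not QEMU code; it uses the
GCC/Clang builtin __builtin_ctzll in place of QEMU's ctz64(), and the
function names are made up): the shift of the smallest supported page size
is the count of trailing zeros of the mask, and the page size is recovered
from the shift.

```c
#include <assert.h>
#include <stdint.h>

/* Shift of the smallest page size set in a non-zero page size mask. */
static unsigned smallest_page_shift(uint64_t pagesizes)
{
    assert(pagesizes != 0);     /* the builtin is undefined for 0 */
    return (unsigned)__builtin_ctzll(pagesizes);
}

/* Single page size corresponding to a shift. */
static uint64_t page_size_from_shift(unsigned shift)
{
    assert(shift < 64);
    return 1ULL << shift;
}
```

For a mask advertising 4K and 64K pages (0x11000) this yields a shift of 12
and a page size of 0x1000, without the intermediate single-page variable.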

>> +    /*
>> +     * SPAPR host supports multilevel TCE tables, there is some
>> +     * heuristic to decide how many levels we want for our table:
>> +     * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
>> +     */
>> +    entries = create.window_size >> create.page_shift;
>> +    pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
>> +    pages = MAX(pow2ceil(pages) - 1, 1); /* Round up */
>> +    create.levels = ctz64(pages) / 6 + 1;
>> +
>> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>> +    if (ret) {
>> +        error_report("Failed to create a window, ret = %d (%m)", ret);
>> +        return -errno;
>> +    }
>> +
>> +    if (create.start_addr != section->offset_within_address_space) {
>> +        vfio_spapr_remove_window(container, create.start_addr);
>> +
>> +        error_report("Host doesn't support DMA window at %"HWADDR_PRIx", must be %"PRIx64,
>> +                     section->offset_within_address_space,
>> +                     create.start_addr);
>> +        ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
>> +        return -EINVAL;
>> +    }
>> +    trace_vfio_spapr_create_window(create.page_shift,
>> +                                   create.window_size,
>> +                                   create.start_addr);
>> +    *pgsize = pagesize;
>> +
>> +    return 0;
>> +}
>> +
>> +int vfio_spapr_remove_window(VFIOContainer *container,
>> +                             hwaddr offset_within_address_space)
>> +{
>> +    struct vfio_iommu_spapr_tce_remove remove = {
>> +        .argsz = sizeof(remove),
>> +        .start_addr = offset_within_address_space,
>> +    };
>> +    int ret;
>> +
>> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
>> +    if (ret) {
>> +        error_report("Failed to remove window at %"PRIx64,
>> +                     remove.start_addr);
>> +        return -errno;
>> +    }
>> +
>> +    return 0;
>> +}
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index c76ddc4..7e80382 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -167,4 +167,10 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>>  #endif
>>  extern const MemoryListener vfio_prereg_listener;
>>  
>> +int vfio_spapr_create_window(VFIOContainer *container,
>> +                             MemoryRegionSection *section,
>> +                             hwaddr *pgsize);
>> +int vfio_spapr_remove_window(VFIOContainer *container,
>> +                             hwaddr offset_within_address_space);
>> +
>>  #endif /* !HW_VFIO_VFIO_COMMON_H */
>> diff --git a/trace-events b/trace-events
>> index ddb8676..ec32c20 100644
>> --- a/trace-events
>> +++ b/trace-events
>> @@ -1768,6 +1768,8 @@ vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sp
>>  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
>>  vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>>  vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>> +vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
>> +vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
>>  
>>  # hw/vfio/platform.c
>>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
> 


-- 
Alexey



* Re: [Qemu-devel] [PATCH qemu v17 11/12] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-06-06  5:57   ` [Qemu-devel] [PATCH qemu v17 11/12] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) David Gibson
@ 2016-06-06  8:12     ` Alexey Kardashevskiy
  2016-06-08  5:57       ` David Gibson
  0 siblings, 1 reply; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-06  8:12 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-devel, qemu-ppc, Alexander Graf, Alex Williamson

On 06/06/16 15:57, David Gibson wrote:
> On Wed, Jun 01, 2016 at 06:57:42PM +1000, Alexey Kardashevskiy wrote:
>> This adds support for the Dynamic DMA Windows (DDW) option defined by
>> the SPAPR specification, which allows additional DMA window(s).
>>
>> The "ddw" property is enabled by default on a PHB but for compatibility
>> the pseries-2.5 machine (TODO: update version) and older disable it.
> 
> Looks like your todo is now todone, but you need to update the commit
> message.
> 
>> This also creates a single DMA window for the older machines to
>> maintain backward migration.
>>
>> This implements DDW for PHB with emulated and VFIO devices. The host
>> kernel support is required. The advertised IOMMU page sizes are 4K and
>> 64K; 16M pages are supported but not advertised by default, in order to
>> enable them, the user has to specify "pgsz" property for PHB and
>> enable huge pages for RAM.
>>
>> The existing linux guests try creating one additional huge DMA window
>> with 64K or 16MB pages and map the entire guest RAM to it. If this succeeds,
>> the guest switches to dma_direct_ops and never calls TCE hypercalls
>> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
>> and not waste time on map/unmap later. This adds a "dma64_win_addr"
>> property which is a bus address for the 64bit window and by default
>> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
>> uses and this allows having emulated and VFIO devices on the same bus.
>>
>> This adds 4 RTAS handlers:
>> * ibm,query-pe-dma-window
>> * ibm,create-pe-dma-window
>> * ibm,remove-pe-dma-window
>> * ibm,reset-pe-dma-window
>> These are registered from type_init() callback.
>>
>> These RTAS handlers are implemented in a separate file to avoid polluting
>> spapr_iommu.c with PCI.
>>
>> This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> 
> Looks pretty close to ready.
> 
> There are a handful of nits and one real error noted below.
> 
>> ---
>> Changes:
>> v17:
>> * fixed: "query" did return non-page-shifted value when memory hotplug is enabled
>>
>> v16:
>> * s/dma_liobn/dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS]/
>> * s/SPAPR_PCI_LIOBN()/dma_liobn[]/
>>
>> v15:
>> * moved page mask filtering to PHB realize(), use "-mempath" to know
>> if there are huge pages
>> * fixed error reporting in RTAS handlers
>> * max window size accounts now hotpluggable memory boundaries
>> ---
>>  hw/ppc/Makefile.objs        |   1 +
>>  hw/ppc/spapr.c              |   5 +
>>  hw/ppc/spapr_pci.c          |  77 +++++++++---
>>  hw/ppc/spapr_rtas_ddw.c     | 293 ++++++++++++++++++++++++++++++++++++++++++++
>>  include/hw/pci-host/spapr.h |   8 +-
>>  include/hw/ppc/spapr.h      |  16 ++-
>>  trace-events                |   4 +
>>  7 files changed, 383 insertions(+), 21 deletions(-)
>>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
>>
>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>> index c1ffc77..986b36f 100644
>> --- a/hw/ppc/Makefile.objs
>> +++ b/hw/ppc/Makefile.objs
>> @@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
>>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>>  obj-y += spapr_pci_vfio.o
>>  endif
>> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>>  # PowerPC 4xx boards
>>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>>  obj-y += ppc4xx_pci.o
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index 44e401a..6ddcda9 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -2366,6 +2366,11 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
>>          .driver   = "spapr-vlan", \
>>          .property = "use-rx-buffer-pools", \
>>          .value    = "off", \
>> +    }, \
>> +    {\
>> +        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
>> +        .property = "ddw",\
>> +        .value    = stringify(off),\
>>      },
>>  
>>  static void spapr_machine_2_5_instance_options(MachineState *machine)
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index 68de523..bcf0360 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -35,6 +35,7 @@
>>  #include "hw/ppc/spapr.h"
>>  #include "hw/pci-host/spapr.h"
>>  #include "exec/address-spaces.h"
>> +#include "exec/ram_addr.h"
>>  #include <libfdt.h>
>>  #include "trace.h"
>>  #include "qemu/error-report.h"
>> @@ -45,6 +46,7 @@
>>  #include "hw/ppc/spapr_drc.h"
>>  #include "sysemu/device_tree.h"
>>  #include "sysemu/kvm.h"
>> +#include "sysemu/hostmem.h"
>>  
>>  #include "hw/vfio/vfio.h"
>>  
>> @@ -1088,7 +1090,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>>      int fdt_start_offset = 0, fdt_size;
>>  
>>      if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
>> -        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
>> +        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
>>  
>>          spapr_tce_set_need_vfio(tcet, true);
>>      }
> 
> Hang on.. I thought you'd got rid of the need for this explicit
> set_need_vfio() stuff.


It is in 12/12 (which I'll split in 2 halves when I respin this), I moved
it to the end as it is not essential for DDW itself.


> 
>> @@ -1310,11 +1312,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>      PCIBus *bus;
>>      uint64_t msi_window_size = 4096;
>>      sPAPRTCETable *tcet;
>> +    const unsigned windows_supported =
>> +        sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
>>  
>>      if (sphb->index != (uint32_t)-1) {
>>          hwaddr windows_base;
>>  
>> -        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn != (uint32_t)-1)
>> +        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
>> +            || ((sphb->dma_liobn[1] != (uint32_t)-1) && (windows_supported > 1))
>>              || (sphb->mem_win_addr != (hwaddr)-1)
>>              || (sphb->io_win_addr != (hwaddr)-1)) {
>>              error_setg(errp, "Either \"index\" or other parameters must"
>> @@ -1329,7 +1334,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>          }
>>  
>>          sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
>> -        sphb->dma_liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
>> +        for (i = 0; i < windows_supported; ++i) {
>> +            sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
>> +        }
>>  
>>          windows_base = SPAPR_PCI_WINDOW_BASE
>>              + sphb->index * SPAPR_PCI_WINDOW_SPACING;
>> @@ -1342,8 +1349,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>          return;
>>      }
>>  
>> -    if (sphb->dma_liobn == (uint32_t)-1) {
>> -        error_setg(errp, "LIOBN not specified for PHB");
>> +    if ((sphb->dma_liobn[0] == (uint32_t)-1) ||
>> +        ((sphb->dma_liobn[1] == (uint32_t)-1) && (windows_supported > 1))) {
>> +        error_setg(errp, "LIOBN(s) not specified for PHB");
>>          return;
>>      }
> 
> Hrm.. there's a bit of false generality here, since this would break
> if windows_supported > 2, and dma_liobn[2] was not specified.  Not
> urgent for the initial commit though.


Is s/windows_supported > 1/windows_supported == 2/ any better here?
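
Alternatively, a loop over the supported windows would not depend on the
count at all. A standalone sketch (not the actual realize() code; the
function name and LIOBN_UNSET macro are hypothetical, with (uint32_t)-1
taken from the checks in the hunk above):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LIOBN_UNSET ((uint32_t)-1)

/* True if every LIOBN for the supported windows has been specified. */
static bool liobns_all_set(const uint32_t *liobn, unsigned windows_supported)
{
    unsigned i;

    assert(liobn != NULL);
    for (i = 0; i < windows_supported; i++) {
        if (liobn[i] == LIOBN_UNSET) {
            return false;
        }
    }
    return true;
}
```

With only dma_liobn[0] set, this passes for one supported window and fails
for two, and it keeps working if the maximum is ever raised.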


> 
>> @@ -1461,16 +1469,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>          }
>>      }
>>  
>> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
>> -    if (!tcet) {
>> -        error_setg(errp, "Unable to create TCE table for %s",
>> -                   sphb->dtbusname);
>> -        return;
>> +    /* DMA setup */
>> +    for (i = 0; i < windows_supported; ++i) {
>> +        tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn[i]);
>> +        if (!tcet) {
>> +            error_setg(errp, "Creating window#%d failed for %s",
>> +                       i, sphb->dtbusname);
>> +            return;
>> +        }
>> +        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
>> +                                            spapr_tce_get_iommu(tcet), 0);
>>      }
>>  
>> -    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
>> -                                        spapr_tce_get_iommu(tcet), 0);
>> -
>>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>>  }
>>  
>> @@ -1487,13 +1497,19 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
>>  
>>  void spapr_phb_dma_reset(sPAPRPHBState *sphb)
>>  {
>> -    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
>> +    int i;
>> +    sPAPRTCETable *tcet;
>>  
>> -    if (tcet && tcet->nb_table) {
>> -        spapr_tce_table_disable(tcet);
>> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>> +        tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[i]);
>> +
>> +        if (tcet && tcet->nb_table) {
>> +            spapr_tce_table_disable(tcet);
>> +        }
>>      }
>>  
>>      /* Register default 32bit DMA window */
>> +    tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[0]);
>>      spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
>>                             sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT);
>>  }
>> @@ -1515,7 +1531,8 @@ static void spapr_phb_reset(DeviceState *qdev)
>>  static Property spapr_phb_properties[] = {
>>      DEFINE_PROP_UINT32("index", sPAPRPHBState, index, -1),
>>      DEFINE_PROP_UINT64("buid", sPAPRPHBState, buid, -1),
>> -    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn, -1),
>> +    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn[0], -1),
>> +    DEFINE_PROP_UINT32("liobn64", sPAPRPHBState, dma_liobn[1], -1),
>>      DEFINE_PROP_UINT64("mem_win_addr", sPAPRPHBState, mem_win_addr, -1),
>>      DEFINE_PROP_UINT64("mem_win_size", sPAPRPHBState, mem_win_size,
>>                         SPAPR_PCI_MMIO_WIN_SIZE),
>> @@ -1527,6 +1544,11 @@ static Property spapr_phb_properties[] = {
>>      /* Default DMA window is 0..1GB */
>>      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
>>      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
>> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
>> +                       0x800000000000000ULL),
>> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
>> +    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
>> +                       (1ULL << 12) | (1ULL << 16)),
>>      DEFINE_PROP_END_OF_LIST(),
>>  };
>>  
>> @@ -1603,7 +1625,7 @@ static const VMStateDescription vmstate_spapr_pci = {
>>      .post_load = spapr_pci_post_load,
>>      .fields = (VMStateField[]) {
>>          VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),
>> -        VMSTATE_UINT32_EQUAL(dma_liobn, sPAPRPHBState),
>> +        VMSTATE_UNUSED(4), /* dma_liobn */
>>          VMSTATE_UINT64_EQUAL(mem_win_addr, sPAPRPHBState),
>>          VMSTATE_UINT64_EQUAL(mem_win_size, sPAPRPHBState),
>>          VMSTATE_UINT64_EQUAL(io_win_addr, sPAPRPHBState),
>> @@ -1780,6 +1802,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>      uint32_t interrupt_map_mask[] = {
>>          cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
>>      uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
>> +    uint32_t ddw_applicable[] = {
>> +        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
>> +        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
>> +        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
>> +    };
>> +    uint32_t ddw_extensions[] = {
>> +        cpu_to_be32(1),
>> +        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
>> +    };
>>      sPAPRTCETable *tcet;
>>      PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
>>      sPAPRFDT s_fdt;
>> @@ -1804,6 +1835,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
>>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
>>  
>> +    /* Dynamic DMA window */
>> +    if (phb->ddw_enabled) {
>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
>> +                         sizeof(ddw_applicable)));
>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
>> +                         &ddw_extensions, sizeof(ddw_extensions)));
>> +    }
>> +
>>      /* Build the interrupt-map, this must matches what is done
>>       * in pci_spapr_map_irq
>>       */
>> @@ -1827,7 +1866,7 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>      _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
>>                       sizeof(interrupt_map)));
>>  
>> -    tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
>> +    tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
>>      if (!tcet) {
>>          return -1;
>>      }
>> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
>> new file mode 100644
>> index 0000000..17bbae0
>> --- /dev/null
>> +++ b/hw/ppc/spapr_rtas_ddw.c
>> @@ -0,0 +1,293 @@
>> +/*
>> + * QEMU sPAPR Dynamic DMA windows support
>> + *
>> + * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
>> + *
>> + *  This program is free software; you can redistribute it and/or modify
>> + *  it under the terms of the GNU General Public License as published by
>> + *  the Free Software Foundation; either version 2 of the License,
>> + *  or (at your option) any later version.
>> + *
>> + *  This program is distributed in the hope that it will be useful,
>> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + *  GNU General Public License for more details.
>> + *
>> + *  You should have received a copy of the GNU General Public License
>> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "cpu.h"
>> +#include "qemu/error-report.h"
>> +#include "hw/ppc/spapr.h"
>> +#include "hw/pci-host/spapr.h"
>> +#include "trace.h"
>> +
>> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
>> +{
>> +    sPAPRTCETable *tcet;
>> +
>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>> +    if (tcet && tcet->nb_table) {
>> +        ++*(unsigned *)opaque;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
>> +{
>> +    unsigned ret = 0;
>> +
>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
>> +
>> +    return ret;
>> +}
>> +
>> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
>> +{
>> +    sPAPRTCETable *tcet;
>> +
>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>> +    if (tcet && !tcet->nb_table) {
>> +        *(uint32_t *)opaque = tcet->liobn;
>> +        return 1;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
>> +{
>> +    uint32_t liobn = 0;
>> +
>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
>> +
>> +    return liobn;
>> +}
>> +
>> +static uint32_t spapr_page_mask_to_query_mask(uint64_t page_mask)
>> +{
>> +    int i;
>> +    uint32_t mask = 0;
>> +    const struct { int shift; uint32_t mask; } masks[] = {
>> +        { 12, RTAS_DDW_PGSIZE_4K },
>> +        { 16, RTAS_DDW_PGSIZE_64K },
>> +        { 24, RTAS_DDW_PGSIZE_16M },
>> +        { 25, RTAS_DDW_PGSIZE_32M },
>> +        { 26, RTAS_DDW_PGSIZE_64M },
>> +        { 27, RTAS_DDW_PGSIZE_128M },
>> +        { 28, RTAS_DDW_PGSIZE_256M },
>> +        { 34, RTAS_DDW_PGSIZE_16G },
>> +    };
>> +
>> +    for (i = 0; i < ARRAY_SIZE(masks); ++i) {
>> +        if (page_mask & (1ULL << masks[i].shift)) {
>> +            mask |= masks[i].mask;
>> +        }
>> +    }
>> +
>> +    return mask;
>> +}
>> +
>> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
>> +                                         sPAPRMachineState *spapr,
>> +                                         uint32_t token, uint32_t nargs,
>> +                                         target_ulong args,
>> +                                         uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    uint64_t buid, max_window_size;
>> +    uint32_t avail, addr, pgmask = 0;
>> +    MachineState *machine = MACHINE(spapr);
>> +
>> +    if ((nargs != 3) || (nret != 5)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    /* Translate page mask to LoPAPR format */
>> +    pgmask = spapr_page_mask_to_query_mask(sphb->page_size_mask);
>> +
>> +    /*
>> +     * This is "Largest contiguous block of TCEs allocated specifically
>> +     * for (that is, are reserved for) this PE".
>> +     * Return the maximum number as maximum supported RAM size was in 4K pages.
>> +     */
>> +    if (machine->ram_size == machine->maxram_size) {
>> +        max_window_size = machine->ram_size;
>> +    } else {
>> +        MemoryHotplugState *hpms = &spapr->hotplug_memory;
>> +
>> +        max_window_size = hpms->base + memory_region_size(&hpms->mr);
>> +    }
>> +
>> +    avail = SPAPR_PCI_DMA_MAX_WINDOWS - spapr_phb_get_active_win_num(sphb);
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +    rtas_st(rets, 1, avail);
>> +    rtas_st(rets, 2, max_window_size >> SPAPR_TCE_PAGE_SHIFT);
>> +    rtas_st(rets, 3, pgmask);
>> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
>> +
>> +    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
>> +                                          sPAPRMachineState *spapr,
>> +                                          uint32_t token, uint32_t nargs,
>> +                                          target_ulong args,
>> +                                          uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    sPAPRTCETable *tcet = NULL;
>> +    uint32_t addr, page_shift, window_shift, liobn;
>> +    uint64_t buid;
>> +
>> +    if ((nargs != 5) || (nret != 4)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    page_shift = rtas_ld(args, 3);
>> +    window_shift = rtas_ld(args, 4);
>> +    liobn = spapr_phb_get_free_liobn(sphb);
>> +
>> +    if (!(sphb->page_size_mask & (1ULL << page_shift)) ||
>> +        (window_shift < page_shift)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    if (!liobn || !sphb->ddw_enabled ||
>> +        spapr_phb_get_active_win_num(sphb) == SPAPR_PCI_DMA_MAX_WINDOWS) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    tcet = spapr_tce_find_by_liobn(liobn);
>> +    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
>> +                                 1ULL << window_shift,
>> +                                 tcet ? tcet->bus_offset : 0xbaadf00d, liobn);
>> +    if (!tcet) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    spapr_tce_table_enable(tcet, page_shift, sphb->dma64_window_addr,
> 
> This looks like it's assuming you're creating the second 64-bit
> window.  If the guest removed the default window then tried to
> recreate it, that might not be the case.

Yup, bug, was sitting there for a long time...
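
A possible shape for the fix, sketched here under the assumption that the
start address should follow the slot the free LIOBN occupies rather than
always being the 64-bit one. The struct and the function name
window_addr_for_liobn() are hypothetical stand-ins for the relevant
sPAPRPHBState fields, not what the next revision necessarily does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WINDOWS 2

/* Hypothetical stand-in for the DMA window fields of sPAPRPHBState. */
struct phb_windows {
    uint32_t dma_liobn[MAX_WINDOWS];
    uint64_t dma_win_addr;       /* start of the default 32-bit window */
    uint64_t dma64_window_addr;  /* start of the 64-bit window */
};

/*
 * Pick the bus address for a window from the slot its LIOBN belongs to:
 * slot 0 is the default window, slot 1 is the 64-bit window.
 * Returns 0 on success, -1 if the LIOBN is not one of this PHB's.
 */
static int window_addr_for_liobn(const struct phb_windows *phb,
                                 uint32_t liobn, uint64_t *addr)
{
    assert(phb != NULL && addr != NULL);

    if (liobn == phb->dma_liobn[0]) {
        *addr = phb->dma_win_addr;
        return 0;
    }
    if (liobn == phb->dma_liobn[1]) {
        *addr = phb->dma64_window_addr;
        return 0;
    }
    return -1;
}
```

This way a guest that removes the default window and then creates a new
one gets it back at dma_win_addr, instead of at the 64-bit offset.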





> 
>> +                           1ULL << (window_shift - page_shift));
>> +    if (!tcet->nb_table) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +    rtas_st(rets, 1, liobn);
>> +    rtas_st(rets, 2, tcet->bus_offset >> 32);
>> +    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
>> +
>> +    return;
>> +
>> +hw_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
>> +                                          sPAPRMachineState *spapr,
>> +                                          uint32_t token, uint32_t nargs,
>> +                                          target_ulong args,
>> +                                          uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    sPAPRTCETable *tcet;
>> +    uint32_t liobn;
>> +
>> +    if ((nargs != 1) || (nret != 1)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    liobn = rtas_ld(args, 0);
>> +    tcet = spapr_tce_find_by_liobn(liobn);
>> +    if (!tcet) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
>> +    if (!sphb || !sphb->ddw_enabled || !tcet->nb_table) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    spapr_tce_table_disable(tcet);
>> +    trace_spapr_iommu_ddw_remove(liobn);
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
>> +                                         sPAPRMachineState *spapr,
>> +                                         uint32_t token, uint32_t nargs,
>> +                                         target_ulong args,
>> +                                         uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    uint64_t buid;
>> +    uint32_t addr;
>> +
>> +    if ((nargs != 3) || (nret != 1)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    spapr_phb_dma_reset(sphb);
>> +    trace_spapr_iommu_ddw_reset(buid, addr);
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void spapr_rtas_ddw_init(void)
>> +{
>> +    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
>> +                        "ibm,query-pe-dma-window",
>> +                        rtas_ibm_query_pe_dma_window);
>> +    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
>> +                        "ibm,create-pe-dma-window",
>> +                        rtas_ibm_create_pe_dma_window);
>> +    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
>> +                        "ibm,remove-pe-dma-window",
>> +                        rtas_ibm_remove_pe_dma_window);
>> +    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
>> +                        "ibm,reset-pe-dma-window",
>> +                        rtas_ibm_reset_pe_dma_window);
>> +}
>> +
>> +type_init(spapr_rtas_ddw_init)
>> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
>> index 7848366..36a370e 100644
>> --- a/include/hw/pci-host/spapr.h
>> +++ b/include/hw/pci-host/spapr.h
>> @@ -32,6 +32,8 @@
>>  #define SPAPR_PCI_HOST_BRIDGE(obj) \
>>      OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
>>  
>> +#define SPAPR_PCI_DMA_MAX_WINDOWS    2
>> +
>>  typedef struct sPAPRPHBState sPAPRPHBState;
>>  
>>  typedef struct spapr_pci_msi {
>> @@ -56,7 +58,7 @@ struct sPAPRPHBState {
>>      hwaddr mem_win_addr, mem_win_size, io_win_addr, io_win_size;
>>      MemoryRegion memwindow, iowindow, msiwindow;
>>  
>> -    uint32_t dma_liobn;
>> +    uint32_t dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS];
>>      hwaddr dma_win_addr, dma_win_size;
>>      AddressSpace iommu_as;
>>      MemoryRegion iommu_root;
>> @@ -71,6 +73,10 @@ struct sPAPRPHBState {
>>      spapr_pci_msi_mig *msi_devs;
>>  
>>      QLIST_ENTRY(sPAPRPHBState) list;
>> +
>> +    bool ddw_enabled;
>> +    uint64_t page_size_mask;
>> +    uint64_t dma64_window_addr;
>>  };
>>  
>>  #define SPAPR_PCI_MAX_INDEX          255
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index 971df3d..59fad22 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -412,6 +412,16 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>>  #define RTAS_OUT_NOT_AUTHORIZED                 -9002
>>  #define RTAS_OUT_SYSPARM_PARAM_ERROR            -9999
>>  
>> +/* DDW pagesize mask values from ibm,query-pe-dma-window */
>> +#define RTAS_DDW_PGSIZE_4K       0x01
>> +#define RTAS_DDW_PGSIZE_64K      0x02
>> +#define RTAS_DDW_PGSIZE_16M      0x04
>> +#define RTAS_DDW_PGSIZE_32M      0x08
>> +#define RTAS_DDW_PGSIZE_64M      0x10
>> +#define RTAS_DDW_PGSIZE_128M     0x20
>> +#define RTAS_DDW_PGSIZE_256M     0x40
>> +#define RTAS_DDW_PGSIZE_16G      0x80
>> +
>>  /* RTAS tokens */
>>  #define RTAS_TOKEN_BASE      0x2000
>>  
>> @@ -453,8 +463,12 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>>  #define RTAS_IBM_SET_SLOT_RESET                 (RTAS_TOKEN_BASE + 0x23)
>>  #define RTAS_IBM_CONFIGURE_PE                   (RTAS_TOKEN_BASE + 0x24)
>>  #define RTAS_IBM_SLOT_ERROR_DETAIL              (RTAS_TOKEN_BASE + 0x25)
>> +#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x26)
>> +#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x27)
>> +#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x28)
>> +#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x29)
>>  
>> -#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x26)
>> +#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x2A)
>>  
>>  /* RTAS ibm,get-system-parameter token values */
>>  #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
>> diff --git a/trace-events b/trace-events
>> index ec32c20..dec80e4 100644
>> --- a/trace-events
>> +++ b/trace-events
>> @@ -1433,6 +1433,10 @@ spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, un
>>  spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
>>  spapr_iommu_pre_save(uint64_t liobn, uint32_t nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
>>  spapr_iommu_post_load(uint64_t liobn, uint32_t pre_nb, uint32_t post_nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" => %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
>> +spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32
>> +spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, uint64_t pg_size, uint64_t req_size, uint64_t start, uint32_t liobn) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%"PRIx64", requested=0x%"PRIx64", start addr=%"PRIx64", liobn=%"PRIx32
>> +spapr_iommu_ddw_remove(uint32_t liobn) "liobn=%"PRIx32
>> +spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr) "buid=%"PRIx64" addr=%"PRIx32
>>  
>>  # hw/ppc/ppc.c
>>  ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
> 


-- 
Alexey


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v17 06/12] memory: Add reporting of supported page sizes
  2016-06-02  3:35   ` [Qemu-devel] [PATCH qemu v17 06/12] memory: Add reporting of supported page sizes David Gibson
@ 2016-06-06 13:31     ` Paolo Bonzini
  2016-06-07  3:42       ` Alexey Kardashevskiy
  2016-06-08  6:00       ` David Gibson
  0 siblings, 2 replies; 38+ messages in thread
From: Paolo Bonzini @ 2016-06-06 13:31 UTC (permalink / raw)
  To: David Gibson, Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, Alex Williamson



On 02/06/2016 05:35, David Gibson wrote:
> On Wed, Jun 01, 2016 at 06:57:37PM +1000, Alexey Kardashevskiy wrote:
>> > Every IOMMU has some granularity which MemoryRegionIOMMUOps::translate
>> > uses when translating, however this information is not available outside
>> > the translate context for various checks.
>> > 
>> > This adds a get_page_sizes callback to MemoryRegionIOMMUOps and
>> > a wrapper for it so IOMMU users (such as VFIO) can know the actual
>> > page size(s) used by an IOMMU.
>> > 
>> > As an IOMMU MR represents a guest IOMMU, this uses TARGET_PAGE_SIZE
>> > as a fallback.
>> > 
>> > This removes vfio_container_granularity() and uses the new helper in
>> > memory_region_iommu_replay() when replaying IOMMU mappings on an added
>> > IOMMU memory region.
>> > 
>> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> Paolo,
> 
> Looks like you were left off the CC for this one.
> 
> I think this is ready to go - do you want to merge, comment or ack and
> we'll take it either through my tree or Alex's?

It's okay for you to merge, but the callback should be called
"get_page_size" or "get_replay_granularity".  The plural is weird.

Thanks,

Paolo
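
For reference, the replay granularity in the quoted patch is the smallest
page size present in the bitmap returned by the callback, computed there
as (hwaddr)1 << ctz64(...). Below is a small standalone sketch of the same
calculation in portable C, without QEMU's ctz64(); the function name
min_page_size() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Given a bitmap where bit N set means "page size 1 << N is supported",
 * return the smallest supported page size, i.e. the lowest set bit.
 * For a nonzero x this equals (uint64_t)1 << ctz64(x).
 */
static uint64_t min_page_size(uint64_t page_sizes)
{
    assert(page_sizes != 0);

    /* x & -x isolates the lowest set bit; written with ~x + 1 so the
     * negation is well defined on an unsigned type. */
    return page_sizes & (~page_sizes + 1);
}
```

A callback that returns a single page size, as spapr_tce_get_page_sizes()
does, is just the case where the bitmap has exactly one bit set, so the
result is that page size unchanged.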


>> > ---
>> > Changes:
>> > v16:
>> > * used memory_region_iommu_get_page_sizes() instead of
>> > mr->iommu_ops->get_page_sizes() in memory_region_iommu_replay()
>> > 
>> > v15:
>> > * s/qemu_real_host_page_size/TARGET_PAGE_SIZE/ in memory_region_iommu_get_page_sizes
>> > 
>> > v14:
>> > * removed vfio_container_granularity(), changed memory_region_iommu_replay()
>> > 
>> > v4:
>> > * s/1<<TARGET_PAGE_BITS/qemu_real_host_page_size/
>> > ---
>> >  hw/ppc/spapr_iommu.c  |  8 ++++++++
>> >  hw/vfio/common.c      |  6 ------
>> >  include/exec/memory.h | 18 ++++++++++++++----
>> >  memory.c              | 16 +++++++++++++---
>> >  4 files changed, 35 insertions(+), 13 deletions(-)
>> > 
>> > diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>> > index a3cc572..90a45c0 100644
>> > --- a/hw/ppc/spapr_iommu.c
>> > +++ b/hw/ppc/spapr_iommu.c
>> > @@ -149,6 +149,13 @@ static void spapr_tce_table_pre_save(void *opaque)
>> >                                 tcet->bus_offset, tcet->page_shift);
>> >  }
>> >  
>> > +static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
>> > +{
>> > +    sPAPRTCETable *tcet = container_of(iommu, sPAPRTCETable, iommu);
>> > +
>> > +    return 1ULL << tcet->page_shift;
>> > +}
>> > +
>> >  static int spapr_tce_table_post_load(void *opaque, int version_id)
>> >  {
>> >      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
>> > @@ -228,6 +235,7 @@ static const VMStateDescription vmstate_spapr_tce_table = {
>> >  
>> >  static MemoryRegionIOMMUOps spapr_iommu_ops = {
>> >      .translate = spapr_tce_translate_iommu,
>> > +    .get_page_sizes = spapr_tce_get_page_sizes,
>> >  };
>> >  
>> >  static int spapr_tce_table_realize(DeviceState *dev)
>> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> > index e51ed3a..f1a12b0 100644
>> > --- a/hw/vfio/common.c
>> > +++ b/hw/vfio/common.c
>> > @@ -322,11 +322,6 @@ out:
>> >      rcu_read_unlock();
>> >  }
>> >  
>> > -static hwaddr vfio_container_granularity(VFIOContainer *container)
>> > -{
>> > -    return (hwaddr)1 << ctz64(container->iova_pgsizes);
>> > -}
>> > -
>> >  static void vfio_listener_region_add(MemoryListener *listener,
>> >                                       MemoryRegionSection *section)
>> >  {
>> > @@ -394,7 +389,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
>> >  
>> >          memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
>> >          memory_region_iommu_replay(giommu->iommu, &giommu->n,
>> > -                                   vfio_container_granularity(container),
>> >                                     false);
>> >  
>> >          return;
>> > diff --git a/include/exec/memory.h b/include/exec/memory.h
>> > index f649697..bd9625f 100644
>> > --- a/include/exec/memory.h
>> > +++ b/include/exec/memory.h
>> > @@ -149,6 +149,8 @@ typedef struct MemoryRegionIOMMUOps MemoryRegionIOMMUOps;
>> >  struct MemoryRegionIOMMUOps {
>> >      /* Return a TLB entry that contains a given address. */
>> >      IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
>> > +    /* Returns supported page sizes */
>> > +    uint64_t (*get_page_sizes)(MemoryRegion *iommu);
>> >  };
>> >  
>> >  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
>> > @@ -571,6 +573,15 @@ static inline bool memory_region_is_iommu(MemoryRegion *mr)
>> >  
>> >  
>> >  /**
>> > + * memory_region_iommu_get_page_sizes: get supported page sizes in an iommu
>> > + *
>> > + * Returns %bitmap of supported page sizes for an iommu.
>> > + *
>> > + * @mr: the memory region being queried
>> > + */
>> > +uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr);
>> > +
>> > +/**
>> >   * memory_region_notify_iommu: notify a change in an IOMMU translation entry.
>> >   *
>> >   * @mr: the memory region that was changed
>> > @@ -594,16 +605,15 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n);
>> >  
>> >  /**
>> >   * memory_region_iommu_replay: replay existing IOMMU translations to
>> > - * a notifier
>> > + * a notifier with the minimum page granularity returned by
>> > + * mr->iommu_ops->get_page_sizes().
>> >   *
>> >   * @mr: the memory region to observe
>> >   * @n: the notifier to which to replay iommu mappings
>> > - * @granularity: Minimum page granularity to replay notifications for
>> >   * @is_write: Whether to treat the replay as a translate "write"
>> >   *     through the iommu
>> >   */
>> > -void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n,
>> > -                                hwaddr granularity, bool is_write);
>> > +void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write);
>> >  
>> >  /**
>> >   * memory_region_unregister_iommu_notifier: unregister a notifier for
>> > diff --git a/memory.c b/memory.c
>> > index 4e3cda8..761ae92 100644
>> > --- a/memory.c
>> > +++ b/memory.c
>> > @@ -1500,12 +1500,22 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n)
>> >      notifier_list_add(&mr->iommu_notify, n);
>> >  }
>> >  
>> > -void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n,
>> > -                                hwaddr granularity, bool is_write)
>> > +uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr)
>> >  {
>> > -    hwaddr addr;
>> > +    assert(memory_region_is_iommu(mr));
>> > +    if (mr->iommu_ops && mr->iommu_ops->get_page_sizes) {
>> > +        return mr->iommu_ops->get_page_sizes(mr);
>> > +    }
>> > +    return TARGET_PAGE_SIZE;
>> > +}
>> > +
>> > +void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write)
>> > +{
>> > +    hwaddr addr, granularity;
>> >      IOMMUTLBEntry iotlb;
>> >  
>> > +    granularity = (hwaddr)1 << ctz64(memory_region_iommu_get_page_sizes(mr));
>> > +
>> >      for (addr = 0; addr < memory_region_size(mr); addr += granularity) {

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v17 07/12] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2)
  2016-06-06  6:04     ` Alexey Kardashevskiy
@ 2016-06-06 17:20       ` Alex Williamson
  2016-06-07  3:10         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 38+ messages in thread
From: Alex Williamson @ 2016-06-06 17:20 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson

On Mon, 6 Jun 2016 16:04:57 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 04/06/16 02:13, Alex Williamson wrote:
> > On Wed,  1 Jun 2016 18:57:38 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> >> This makes use of the new "memory registering" feature. The idea is
> >> to give userspace the ability to notify the host kernel about pages
> >> which are going to be used for DMA. Having this information, the host
> >> kernel can pin them all once per user process, do locked pages
> >> accounting (once) and not spend time on doing that in real time with
> >> possible failures which cannot be handled nicely in some cases.
> >>
> >> This adds a prereg memory listener which listens on address_space_memory
> >> and notifies a VFIO container about memory which needs to be
> >> pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
> >>
> >> As there is no per-IOMMU-type release() callback anymore, this stores
> >> the IOMMU type in the container so vfio_listener_release() can determine
> >> if it needs to unregister @prereg_listener.
> >>
> >> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> >> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> >> not call it when v2 is detected and enabled.
> >>
> >> This requires guest RAM blocks to be host page size aligned; however
> >> this is not new as KVM already requires memory slots to be host page
> >> size aligned.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >> Changes:
> >> v17:
> >> * s/prereg\.c/spapr.c/
> >> * s/vfio_prereg_gpa_to_ua/vfio_prereg_gpa_to_vaddr/
> >> * vfio_prereg_listener_skipped_section does hw_error() on IOMMUs
> >>
> >> v16:
> >> * switched to 64bit math everywhere as there is no chance to see
> >> region_add on RAM blocks even remotely close to 1<<64 bytes.
> >>
> >> v15:
> >> * banned unaligned sections
> >> * added a vfio_prereg_gpa_to_ua() helper
> >>
> >> v14:
> >> * s/free_container_exit/listener_release_exit/g
> >> * added "if memory_region_is_iommu()" to vfio_prereg_listener_skipped_section
> >> ---
> >>  hw/vfio/Makefile.objs         |   1 +
> >>  hw/vfio/common.c              |  38 +++++++++---
> >>  hw/vfio/spapr.c               | 137 ++++++++++++++++++++++++++++++++++++++++++
> >>  include/hw/vfio/vfio-common.h |   4 ++
> >>  trace-events                  |   2 +
> >>  5 files changed, 172 insertions(+), 10 deletions(-)
> >>  create mode 100644 hw/vfio/spapr.c
> >>
> >> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> >> index ceddbb8..c25e32b 100644
> >> --- a/hw/vfio/Makefile.objs
> >> +++ b/hw/vfio/Makefile.objs
> >> @@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
> >>  obj-$(CONFIG_SOFTMMU) += platform.o
> >>  obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
> >>  obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
> >> +obj-$(CONFIG_SOFTMMU) += spapr.o
> >>  endif
> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> index f1a12b0..770f630 100644
> >> --- a/hw/vfio/common.c
> >> +++ b/hw/vfio/common.c
> >> @@ -504,6 +504,9 @@ static const MemoryListener vfio_memory_listener = {
> >>  static void vfio_listener_release(VFIOContainer *container)
> >>  {
> >>      memory_listener_unregister(&container->listener);
> >> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> >> +        memory_listener_unregister(&container->prereg_listener);
> >> +    }
> >>  }
> >>  
> >>  static struct vfio_info_cap_header *
> >> @@ -862,8 +865,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>              goto free_container_exit;
> >>          }
> >>  
> >> -        ret = ioctl(fd, VFIO_SET_IOMMU,
> >> -                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
> >> +        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
> >> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
> >>          if (ret) {
> >>              error_report("vfio: failed to set iommu for container: %m");
> >>              ret = -errno;
> >> @@ -888,8 +891,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>          if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
> >>              container->iova_pgsizes = info.iova_pgsizes;
> >>          }
> >> -    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> >> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
> >> +               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
> >>          struct vfio_iommu_spapr_tce_info info;
> >> +        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
> >>  
> >>          ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
> >>          if (ret) {
> >> @@ -897,7 +902,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>              ret = -errno;
> >>              goto free_container_exit;
> >>          }
> >> -        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
> >> +        container->iommu_type =
> >> +            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
> >> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
> >>          if (ret) {
> >>              error_report("vfio: failed to set iommu for container: %m");
> >>              ret = -errno;
> >> @@ -909,11 +916,22 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>           * when container fd is closed so we do not call it explicitly
> >>           * in this file.
> >>           */
> >> -        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> >> -        if (ret) {
> >> -            error_report("vfio: failed to enable container: %m");
> >> -            ret = -errno;
> >> -            goto free_container_exit;
> >> +        if (!v2) {
> >> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> >> +            if (ret) {
> >> +                error_report("vfio: failed to enable container: %m");
> >> +                ret = -errno;
> >> +                goto free_container_exit;
> >> +            }
> >> +        } else {
> >> +            container->prereg_listener = vfio_prereg_listener;
> >> +
> >> +            memory_listener_register(&container->prereg_listener,
> >> +                                     &address_space_memory);
> >> +            if (container->error) {
> >> +                error_report("vfio: RAM memory listener initialization failed for container");
> >> +                goto listener_release_exit;  
> > 
> > Why doesn't this goto free_container_exit?  registration failure should
> > not need an unregister.  
> 
> 
> The listener registration cannot possibly fail, it adds a listener into the
> memory_listeners list, no matter what region_add() does.

Oops, right.

> 
> I'll add an explicit
> memory_listener_unregister(&container->prereg_listener) here.

Ok.
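
To spell out the unwinding order we just agreed on, here is a standalone
sketch.  It is not the vfio code: Listener, Container, listener_register(),
listener_unregister() and connect_v2() are all hypothetical stand-ins, and
the only thing modelled is that registration itself cannot fail while the
region_add() callbacks it triggers may store an error in the container.

```c
#include <stdbool.h>

/* Hypothetical stand-in for MemoryListener: only tracks registration. */
typedef struct Listener {
    bool registered;
} Listener;

/* Hypothetical stand-in for VFIOContainer. */
typedef struct Container {
    Listener listener;         /* main listener, registered last */
    Listener prereg_listener;  /* preregistration listener, v2 only */
    int error;                 /* first error stored by region_add() */
} Container;

/*
 * Registration always succeeds, but the region_add() callbacks it runs
 * may record an error; region_add_error simulates that.
 */
static void listener_register(Container *c, Listener *l, int region_add_error)
{
    l->registered = true;
    if (region_add_error && !c->error) {
        c->error = region_add_error;
    }
}

static void listener_unregister(Listener *l)
{
    l->registered = false;
}

/* Returns 0 on success, or the stored error after unwinding. */
static int connect_v2(Container *c, int prereg_error)
{
    listener_register(c, &c->prereg_listener, prereg_error);
    if (c->error) {
        /* The main listener is not registered yet: undo only prereg. */
        listener_unregister(&c->prereg_listener);
        return c->error;
    }

    listener_register(c, &c->listener, 0);
    return 0;
}
```

The point is that the failure path touches only the listener that has
actually been registered at that moment.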

> >   
> >> +            }
> >>          }
> >>  
> >>          /*
> >> @@ -926,7 +944,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>          if (ret) {
> >>              error_report("vfio: VFIO_IOMMU_SPAPR_TCE_GET_INFO failed: %m");
> >>              ret = -errno;
> >> -            goto free_container_exit;
> >> +            goto listener_release_exit;  
> > 
> > Looks like this will cause much badness when we try to do
> > memory_listener_unregister() on an empty listener struct for the main
> > listener.  
> 
> 
> Oh. Bug. I'll add
>  memory_listener_unregister(&container->prereg_listener) here.
> 
> 
> >   
> >>          }
> >>          container->min_iova = info.dma32_window_start;
> >>          container->max_iova = container->min_iova + info.dma32_window_size - 1;
> >> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
> >> new file mode 100644
> >> index 0000000..f339472
> >> --- /dev/null
> >> +++ b/hw/vfio/spapr.c
> >> @@ -0,0 +1,137 @@
> >> +/*
> >> + * DMA memory preregistration
> >> + *
> >> + * Authors:
> >> + *  Alexey Kardashevskiy <aik@ozlabs.ru>
> >> + *
> >> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> >> + * the COPYING file in the top-level directory.
> >> + */
> >> +
> >> +#include "qemu/osdep.h"
> >> +#include "cpu.h"
> >> +#include <sys/ioctl.h>
> >> +#include <linux/vfio.h>
> >> +
> >> +#include "hw/vfio/vfio-common.h"
> >> +#include "hw/hw.h"
> >> +#include "qemu/error-report.h"
> >> +#include "trace.h"
> >> +
> >> +static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
> >> +{
> >> +    if (memory_region_is_iommu(section->mr)) {
> >> +        hw_error("Cannot possibly preregister IOMMU memory");
> >> +    }
> >> +
> >> +    return !memory_region_is_ram(section->mr) ||
> >> +            memory_region_is_skip_dump(section->mr);
> >> +}
> >> +
> >> +static void *vfio_prereg_gpa_to_vaddr(MemoryRegionSection *section, hwaddr gpa)
> >> +{
> >> +    return memory_region_get_ram_ptr(section->mr) +
> >> +        section->offset_within_region +
> >> +        (gpa - section->offset_within_address_space);
> >> +}
> >> +
> >> +static void vfio_prereg_listener_region_add(MemoryListener *listener,
> >> +                                            MemoryRegionSection *section)
> >> +{
> >> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> >> +                                            prereg_listener);
> >> +    const hwaddr gpa = section->offset_within_address_space;
> >> +    hwaddr end;
> >> +    int ret;
> >> +    hwaddr page_mask = qemu_real_host_page_mask;
> >> +    struct vfio_iommu_spapr_register_memory reg = {
> >> +        .argsz = sizeof(reg),
> >> +        .flags = 0,
> >> +    };
> >> +
> >> +    if (vfio_prereg_listener_skipped_section(section)) {
> >> +        trace_vfio_listener_region_add_skip(
> >> +                section->offset_within_address_space,
> >> +                section->offset_within_address_space +
> >> +                int128_get64(int128_sub(section->size, int128_one())));  
> > 
> > How will we know if this trace is related to the main listener or the
> > prereg listener?  
> 
> 
> By addresses it prints :)
> 
> Fair point, one question though:
> 
> trace_vfio_prereg_listener_region_add_skip or
> trace_vfio_spapr_listener_region_add_skip ?
> 
> Should all symbols in this file get "spapr" instead of "prereg"?

I prefer the trace match the function name.  I'm not convinced that
prereg won't become more pervasive, possibly used for some future type1
variant, but the current code is only partially generic in that sense,
hard coding spapr ioctls, which is why I objected to trying to pass it
off as generic.  However, I'm not sure it's worth spending much more
time renaming each function; that can be done when a second user arrives
and we try harder to really make it a general interface.
 
> >> +        return;
> >> +    }
> >> +
> >> +    if (unlikely((section->offset_within_address_space & ~page_mask) ||
> >> +                 (section->offset_within_region & ~page_mask) ||
> >> +                 (int128_get64(section->size) & ~page_mask))) {
> >> +        error_report("%s received unaligned region", __func__);
> >> +        return;
> >> +    }
> >> +
> >> +    end = section->offset_within_address_space + int128_get64(section->size);
> >> +    g_assert(gpa < end);  
> > 
> > This would imply a zero-sized region, can't you simply return?  
> 
> Zero-sized region or overflow, no?

Yes, but doesn't that imply a bogus MemoryRegionSection from the memory
API?  Can that happen or are we pointlessly re-sanitizing a condition
that cannot occur?

> When I copied this from vfio_listener_region_add(), I thought it is an
> overflow check (which imho should have been assert() or hwerror(), is not
> it? What do I miss?

The sort of consistency test that would justify an assert or hw_error()
doesn't seem like it belongs in a consumer of the API; the API should
enforce it elsewhere.
 
> >   
> >> +
> >> +    memory_region_ref(section->mr);
> >> +
> >> +    reg.vaddr = (__u64) vfio_prereg_gpa_to_vaddr(section, gpa);
> >> +    reg.size = end - gpa;
> >> +
> >> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
> >> +    trace_vfio_ram_register(reg.vaddr, reg.size, ret ? -errno : 0);
> >> +    if (ret) {
> >> +        /*
> >> +         * On the initfn path, store the first error in the container so we
> >> +         * can gracefully fail.  Runtime, there's not much we can do other
> >> +         * than throw a hardware error.
> >> +         */
> >> +        if (!container->initialized) {
> >> +            if (!container->error) {
> >> +                container->error = ret;
> >> +            }
> >> +        } else {
> >> +            hw_error("vfio: Memory registering failed, unable to continue");
> >> +        }
> >> +    }
> >> +}
> >> +
> >> +static void vfio_prereg_listener_region_del(MemoryListener *listener,
> >> +                                            MemoryRegionSection *section)
> >> +{
> >> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> >> +                                            prereg_listener);
> >> +    const hwaddr gpa = section->offset_within_address_space;
> >> +    hwaddr end;
> >> +    int ret;
> >> +    hwaddr page_mask = qemu_real_host_page_mask;
> >> +    struct vfio_iommu_spapr_register_memory reg = {
> >> +        .argsz = sizeof(reg),
> >> +        .flags = 0,
> >> +    };
> >> +
> >> +    if (vfio_prereg_listener_skipped_section(section)) {
> >> +        trace_vfio_listener_region_del_skip(
> >> +                section->offset_within_address_space,
> >> +                section->offset_within_address_space +
> >> +                int128_get64(int128_sub(section->size, int128_one())));  
> > 
> > Again, indistinguishable from main listener trace.
> >   
> >> +        return;
> >> +    }
> >> +
> >> +    if (unlikely((section->offset_within_address_space & ~page_mask) ||
> >> +                 (section->offset_within_region & ~page_mask) ||
> >> +                 (int128_get64(section->size) & ~page_mask))) {
> >> +        error_report("%s received unaligned region", __func__);
> >> +        return;
> >> +    }
> >> +
> >> +    end = section->offset_within_address_space + int128_get64(section->size);
> >> +    if (gpa >= end) {
> >> +        return;  
> > 
> > We simply return here, not sure why we need to g_assert above.  
> 
> Well, we won't get this far if this is the case - region_add() would fail
> first.

Then why test it at all, or why not make it an assert?  The point is, if
we can skip it here, why couldn't we skip it above?  If we assert
above, why can we skip it here, even though seeing it here would be
another unexpected inconsistency?  IMO, we can skip it in both places
just like the existing listener does.

> >   
> >> +    }
> >> +
> >> +    reg.vaddr = (__u64) vfio_prereg_gpa_to_vaddr(section, gpa);
> >> +    reg.size = end - gpa;
> >> +
> >> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
> >> +    trace_vfio_ram_unregister(reg.vaddr, reg.size, ret ? -errno : 0);
> >> +}
> >> +
> >> +const MemoryListener vfio_prereg_listener = {
> >> +    .region_add = vfio_prereg_listener_region_add,
> >> +    .region_del = vfio_prereg_listener_region_del,
> >> +};
> >> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >> index 0610377..405c3b2 100644
> >> --- a/include/hw/vfio/vfio-common.h
> >> +++ b/include/hw/vfio/vfio-common.h
> >> @@ -73,6 +73,8 @@ typedef struct VFIOContainer {
> >>      VFIOAddressSpace *space;
> >>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
> >>      MemoryListener listener;
> >> +    MemoryListener prereg_listener;
> >> +    unsigned iommu_type;
> >>      int error;
> >>      bool initialized;
> >>      /*
> >> @@ -158,4 +160,6 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
> >>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
> >>                               uint32_t subtype, struct vfio_region_info **info);
> >>  #endif
> >> +extern const MemoryListener vfio_prereg_listener;
> >> +
> >>  #endif /* !HW_VFIO_VFIO_COMMON_H */
> >> diff --git a/trace-events b/trace-events
> >> index de42012..ddb8676 100644
> >> --- a/trace-events
> >> +++ b/trace-events
> >> @@ -1766,6 +1766,8 @@ vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps e
> >>  vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Device %s region %d: %d sparse mmap entries"
> >>  vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
> >>  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
> >> +vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> >> +vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"  
> > 
> > This file loosely calls out which file the trace is in, these are not
> > in common.c.
> >   
> >>  
> >>  # hw/vfio/platform.c
> >>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"  
> >   
> 
> 


* Re: [Qemu-devel] [PATCH qemu v17 07/12] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2)
  2016-06-06 17:20       ` Alex Williamson
@ 2016-06-07  3:10         ` Alexey Kardashevskiy
  0 siblings, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-07  3:10 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, qemu-ppc, Alexander Graf, David Gibson

On 07/06/16 03:20, Alex Williamson wrote:
> On Mon, 6 Jun 2016 16:04:57 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 04/06/16 02:13, Alex Williamson wrote:
>>> On Wed,  1 Jun 2016 18:57:38 +1000
>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>   
>>>> This makes use of the new "memory registering" feature. The idea is
>>>> to provide the userspace ability to notify the host kernel about pages
>>>> which are going to be used for DMA. Having this information, the host
>>>> kernel can pin them all once per user process, do locked pages
>>>> accounting (once) and not spent time on doing that in real time with
>>>> possible failures which cannot be handled nicely in some cases.
>>>>
>>>> This adds a prereg memory listener which listens on address_space_memory
>>>> and notifies a VFIO container about memory which needs to be
>>>> pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
>>>>
>>>> As there is no per-IOMMU-type release() callback anymore, this stores
>>>> the IOMMU type in the container so vfio_listener_release() can determine
>>>> if it needs to unregister @prereg_listener.
>>>>
>>>> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
>>>> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
>>>> not call it when v2 is detected and enabled.
>>>>
>>>> This enforces guest RAM blocks to be host page size aligned; however
>>>> this is not new as KVM already requires memory slots to be host page
>>>> size aligned.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>> Changes:
>>>> v17:
>>>> * s/prereg\.c/spapr.c/
>>>> * s/vfio_prereg_gpa_to_ua/vfio_prereg_gpa_to_vaddr/
>>>> * vfio_prereg_listener_skipped_section does hw_error() on IOMMUs
>>>>
>>>> v16:
>>>> * switched to 64bit math everywhere as there is no chance to see
>>>> region_add on RAM blocks even remotely close to 1<<64bytes.
>>>>
>>>> v15:
>>>> * banned unaligned sections
>>>> * added an vfio_prereg_gpa_to_ua() helper
>>>>
>>>> v14:
>>>> * s/free_container_exit/listener_release_exit/g
>>>> * added "if memory_region_is_iommu()" to vfio_prereg_listener_skipped_section
>>>> ---
>>>>  hw/vfio/Makefile.objs         |   1 +
>>>>  hw/vfio/common.c              |  38 +++++++++---
>>>>  hw/vfio/spapr.c               | 137 ++++++++++++++++++++++++++++++++++++++++++
>>>>  include/hw/vfio/vfio-common.h |   4 ++
>>>>  trace-events                  |   2 +
>>>>  5 files changed, 172 insertions(+), 10 deletions(-)
>>>>  create mode 100644 hw/vfio/spapr.c
>>>>
>>>> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
>>>> index ceddbb8..c25e32b 100644
>>>> --- a/hw/vfio/Makefile.objs
>>>> +++ b/hw/vfio/Makefile.objs
>>>> @@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
>>>>  obj-$(CONFIG_SOFTMMU) += platform.o
>>>>  obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
>>>>  obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
>>>> +obj-$(CONFIG_SOFTMMU) += spapr.o
>>>>  endif
>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>> index f1a12b0..770f630 100644
>>>> --- a/hw/vfio/common.c
>>>> +++ b/hw/vfio/common.c
>>>> @@ -504,6 +504,9 @@ static const MemoryListener vfio_memory_listener = {
>>>>  static void vfio_listener_release(VFIOContainer *container)
>>>>  {
>>>>      memory_listener_unregister(&container->listener);
>>>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>>> +        memory_listener_unregister(&container->prereg_listener);
>>>> +    }
>>>>  }
>>>>  
>>>>  static struct vfio_info_cap_header *
>>>> @@ -862,8 +865,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>>>              goto free_container_exit;
>>>>          }
>>>>  
>>>> -        ret = ioctl(fd, VFIO_SET_IOMMU,
>>>> -                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
>>>> +        container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
>>>> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>>>>          if (ret) {
>>>>              error_report("vfio: failed to set iommu for container: %m");
>>>>              ret = -errno;
>>>> @@ -888,8 +891,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>>>          if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
>>>>              container->iova_pgsizes = info.iova_pgsizes;
>>>>          }
>>>> -    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
>>>> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
>>>> +               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
>>>>          struct vfio_iommu_spapr_tce_info info;
>>>> +        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
>>>>  
>>>>          ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>>>>          if (ret) {
>>>> @@ -897,7 +902,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>>>              ret = -errno;
>>>>              goto free_container_exit;
>>>>          }
>>>> -        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
>>>> +        container->iommu_type =
>>>> +            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
>>>> +        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>>>>          if (ret) {
>>>>              error_report("vfio: failed to set iommu for container: %m");
>>>>              ret = -errno;
>>>> @@ -909,11 +916,22 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>>>           * when container fd is closed so we do not call it explicitly
>>>>           * in this file.
>>>>           */
>>>> -        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
>>>> -        if (ret) {
>>>> -            error_report("vfio: failed to enable container: %m");
>>>> -            ret = -errno;
>>>> -            goto free_container_exit;
>>>> +        if (!v2) {
>>>> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
>>>> +            if (ret) {
>>>> +                error_report("vfio: failed to enable container: %m");
>>>> +                ret = -errno;
>>>> +                goto free_container_exit;
>>>> +            }
>>>> +        } else {
>>>> +            container->prereg_listener = vfio_prereg_listener;
>>>> +
>>>> +            memory_listener_register(&container->prereg_listener,
>>>> +                                     &address_space_memory);
>>>> +            if (container->error) {
>>>> +                error_report("vfio: RAM memory listener initialization failed for container");
>>>> +                goto listener_release_exit;  
>>>
>>> Why doesn't this goto free_container_exit?  registration failure should
>>> not need an unregister.  
>>
>>
>> The listener registration cannot possibly fail, it adds a listener into the
>> memory_listeners list, no matter what region_add() does.
> 
> Oops, right.
> 
>>
>> I'll add an explicit
>> memory_listener_unregister(&container->prereg_listener) here.
> 
> Ok.
> 
>>>   
>>>> +            }
>>>>          }
>>>>  
>>>>          /*
>>>> @@ -926,7 +944,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>>>          if (ret) {
>>>>              error_report("vfio: VFIO_IOMMU_SPAPR_TCE_GET_INFO failed: %m");
>>>>              ret = -errno;
>>>> -            goto free_container_exit;
>>>> +            goto listener_release_exit;  
>>>
>>> Looks like this will cause much badness when we try to do
>>> memory_listener_unregister() on an empty listener struct for the main
>>> listener.  
>>
>>
>> Oh. Bug. I'll add
>>  memory_listener_unregister(&container->prereg_listener) here.
>>
>>
>>>   
>>>>          }
>>>>          container->min_iova = info.dma32_window_start;
>>>>          container->max_iova = container->min_iova + info.dma32_window_size - 1;
>>>> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
>>>> new file mode 100644
>>>> index 0000000..f339472
>>>> --- /dev/null
>>>> +++ b/hw/vfio/spapr.c
>>>> @@ -0,0 +1,137 @@
>>>> +/*
>>>> + * DMA memory preregistration
>>>> + *
>>>> + * Authors:
>>>> + *  Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> + *
>>>> + * This work is licensed under the terms of the GNU GPL, version 2.  See
>>>> + * the COPYING file in the top-level directory.
>>>> + */
>>>> +
>>>> +#include "qemu/osdep.h"
>>>> +#include "cpu.h"
>>>> +#include <sys/ioctl.h>
>>>> +#include <linux/vfio.h>
>>>> +
>>>> +#include "hw/vfio/vfio-common.h"
>>>> +#include "hw/hw.h"
>>>> +#include "qemu/error-report.h"
>>>> +#include "trace.h"
>>>> +
>>>> +static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
>>>> +{
>>>> +    if (memory_region_is_iommu(section->mr)) {
>>>> +        hw_error("Cannot possibly preregister IOMMU memory");
>>>> +    }
>>>> +
>>>> +    return !memory_region_is_ram(section->mr) ||
>>>> +            memory_region_is_skip_dump(section->mr);
>>>> +}
>>>> +
>>>> +static void *vfio_prereg_gpa_to_vaddr(MemoryRegionSection *section, hwaddr gpa)
>>>> +{
>>>> +    return memory_region_get_ram_ptr(section->mr) +
>>>> +        section->offset_within_region +
>>>> +        (gpa - section->offset_within_address_space);
>>>> +}
>>>> +
>>>> +static void vfio_prereg_listener_region_add(MemoryListener *listener,
>>>> +                                            MemoryRegionSection *section)
>>>> +{
>>>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>>>> +                                            prereg_listener);
>>>> +    const hwaddr gpa = section->offset_within_address_space;
>>>> +    hwaddr end;
>>>> +    int ret;
>>>> +    hwaddr page_mask = qemu_real_host_page_mask;
>>>> +    struct vfio_iommu_spapr_register_memory reg = {
>>>> +        .argsz = sizeof(reg),
>>>> +        .flags = 0,
>>>> +    };
>>>> +
>>>> +    if (vfio_prereg_listener_skipped_section(section)) {
>>>> +        trace_vfio_listener_region_add_skip(
>>>> +                section->offset_within_address_space,
>>>> +                section->offset_within_address_space +
>>>> +                int128_get64(int128_sub(section->size, int128_one())));  
>>>
>>> How will we know if this trace is related to the main listener or the
>>> prereg listener?  
>>
>>
>> By addresses it prints :)
>>
>> Fair point, one question though:
>>
>> trace_vfio_prereg_listener_region_add_skip or
>> trace_vfio_spapr_listener_region_add_skip ?
>>
>> Should all symbols in this file get "spapr" instead of "prereg"?
> 
> I prefer the trace match the function name.  I'm not convinced that
> prereg won't become more pervasive, possibly used for some future type1
> variant, but the current code is only partially generic in that sense,
> hard coding spapr ioctls, which is why I objected to trying to pass it
> off as generic.  However, I'm not sure it's worth spending much more
> time renaming each function that can be done when a second user arrives
> and we try harder to really make it a general interface.


Ok, so I'll make it trace_vfio_prereg_listener_region_add_skip and keep
"prereg" in the functions which already have it.



>>>> +        return;
>>>> +    }
>>>> +
>>>> +    if (unlikely((section->offset_within_address_space & ~page_mask) ||
>>>> +                 (section->offset_within_region & ~page_mask) ||
>>>> +                 (int128_get64(section->size) & ~page_mask))) {
>>>> +        error_report("%s received unaligned region", __func__);
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    end = section->offset_within_address_space + int128_get64(section->size);
>>>> +    g_assert(gpa < end);  
>>>
>>> This would imply a zero-sized region, can't you simply return?  
>>
>> Zero-sized region or overflow, no?
> 
> Yes, but doesn't that imply a bogus MemoryRegionSection from the memory
> API?  Can that happen or are we pointlessly re-sanitizing a condition
> that cannot occur?


region_add() is called on flat view ranges, and render_memory_region() does
not add zero-size ranges, if I read the code correctly. Overflow can
still happen.


> 
>> When I copied this from vfio_listener_region_add(), I thought it is an
>> overflow check (which imho should have been assert() or hwerror(), is not
>> it? What do I miss?
> 
> That sort of consistency test that would justify an assert or hwerror
> doesn't seem like it belongs in a consumer of the API, the API should
> enforce it elsewhere.

The region_add()/region_del() API is 128-bit, so it cannot do the check (or
should it?)... Now it is quite confusing to me, because it looks like flat
ranges support the full 128-bit address space, which none of the actual
machines seem to use/allow/support.

Ok. So. I'll do s/g_assert(gpa < end)/if (gpa >= end) return/ for now.
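
For reference, a minimal standalone sketch of that check (not the patch
itself).  It assumes hwaddr is a plain 64-bit unsigned type, and
section_range_is_valid() is a hypothetical helper name used only for
illustration.  Because unsigned addition wraps, a single comparison covers
both the zero-size case and the overflow case.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t hwaddr;   /* assumption: 64-bit guest physical address */

/*
 * Hypothetical helper: returns true when [gpa, gpa + size) is a
 * non-empty range whose end does not wrap in 64 bits.
 */
static bool section_range_is_valid(hwaddr gpa, uint64_t size)
{
    hwaddr end = gpa + size;   /* unsigned addition wraps modulo 2^64 */

    /* size == 0 gives end == gpa; wraparound gives end < gpa. */
    return gpa < end;
}
```

Note that a range ending exactly at 2^64 is rejected as well, since end
wraps to 0 in that case; this matches what the existing listener's
"gpa >= end" test would do with 64-bit math.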


>>>   
>>>> +
>>>> +    memory_region_ref(section->mr);
>>>> +
>>>> +    reg.vaddr = (__u64) vfio_prereg_gpa_to_vaddr(section, gpa);
>>>> +    reg.size = end - gpa;
>>>> +
>>>> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
>>>> +    trace_vfio_ram_register(reg.vaddr, reg.size, ret ? -errno : 0);
>>>> +    if (ret) {
>>>> +        /*
>>>> +         * On the initfn path, store the first error in the container so we
>>>> +         * can gracefully fail.  Runtime, there's not much we can do other
>>>> +         * than throw a hardware error.
>>>> +         */
>>>> +        if (!container->initialized) {
>>>> +            if (!container->error) {
>>>> +                container->error = ret;
>>>> +            }
>>>> +        } else {
>>>> +            hw_error("vfio: Memory registering failed, unable to continue");
>>>> +        }
>>>> +    }
>>>> +}
>>>> +
>>>> +static void vfio_prereg_listener_region_del(MemoryListener *listener,
>>>> +                                            MemoryRegionSection *section)
>>>> +{
>>>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>>>> +                                            prereg_listener);
>>>> +    const hwaddr gpa = section->offset_within_address_space;
>>>> +    hwaddr end;
>>>> +    int ret;
>>>> +    hwaddr page_mask = qemu_real_host_page_mask;
>>>> +    struct vfio_iommu_spapr_register_memory reg = {
>>>> +        .argsz = sizeof(reg),
>>>> +        .flags = 0,
>>>> +    };
>>>> +
>>>> +    if (vfio_prereg_listener_skipped_section(section)) {
>>>> +        trace_vfio_listener_region_del_skip(
>>>> +                section->offset_within_address_space,
>>>> +                section->offset_within_address_space +
>>>> +                int128_get64(int128_sub(section->size, int128_one())));  
>>>
>>> Again, indistinguishable from main listener trace.
>>>   
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    if (unlikely((section->offset_within_address_space & ~page_mask) ||
>>>> +                 (section->offset_within_region & ~page_mask) ||
>>>> +                 (int128_get64(section->size) & ~page_mask))) {
>>>> +        error_report("%s received unaligned region", __func__);
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    end = section->offset_within_address_space + int128_get64(section->size);
>>>> +    if (gpa >= end) {
>>>> +        return;  
>>>
>>> We simply return here, not sure why we need to g_assert above.  
>>
>> Well, we won't get this far if this is the case - region_add() would fail
>> first.
> 
> Then why test it at all or why not make it an assert?  The point is if
> we can skip it here, why couldn't we skip it above.  If we assert
> above, why can we skip it here even though seeing it here would be
> another unexpected inconsistency.  IMO, we can skip it in both places
> just like the existing listener does.
> 
>>>   
>>>> +    }
>>>> +
>>>> +    reg.vaddr = (__u64) vfio_prereg_gpa_to_vaddr(section, gpa);
>>>> +    reg.size = end - gpa;
>>>> +
>>>> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
>>>> +    trace_vfio_ram_unregister(reg.vaddr, reg.size, ret ? -errno : 0);
>>>> +}
>>>> +
>>>> +const MemoryListener vfio_prereg_listener = {
>>>> +    .region_add = vfio_prereg_listener_region_add,
>>>> +    .region_del = vfio_prereg_listener_region_del,
>>>> +};
>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>>> index 0610377..405c3b2 100644
>>>> --- a/include/hw/vfio/vfio-common.h
>>>> +++ b/include/hw/vfio/vfio-common.h
>>>> @@ -73,6 +73,8 @@ typedef struct VFIOContainer {
>>>>      VFIOAddressSpace *space;
>>>>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>>>>      MemoryListener listener;
>>>> +    MemoryListener prereg_listener;
>>>> +    unsigned iommu_type;
>>>>      int error;
>>>>      bool initialized;
>>>>      /*
>>>> @@ -158,4 +160,6 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
>>>>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>>>>                               uint32_t subtype, struct vfio_region_info **info);
>>>>  #endif
>>>> +extern const MemoryListener vfio_prereg_listener;
>>>> +
>>>>  #endif /* !HW_VFIO_VFIO_COMMON_H */
>>>> diff --git a/trace-events b/trace-events
>>>> index de42012..ddb8676 100644
>>>> --- a/trace-events
>>>> +++ b/trace-events
>>>> @@ -1766,6 +1766,8 @@ vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps e
>>>>  vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Device %s region %d: %d sparse mmap entries"
>>>>  vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
>>>>  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
>>>> +vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>>>> +vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"  
>>>
>>> This file loosely calls out which file the trace is in, these are not
>>> in common.c.
>>>   
>>>>  
>>>>  # hw/vfio/platform.c
>>>>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"  
>>>   
>>
>>
> 


-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v17 06/12] memory: Add reporting of supported page sizes
  2016-06-06 13:31     ` Paolo Bonzini
@ 2016-06-07  3:42       ` Alexey Kardashevskiy
  2016-06-08  6:00       ` David Gibson
  1 sibling, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-07  3:42 UTC (permalink / raw)
  To: Paolo Bonzini, David Gibson
  Cc: qemu-devel, qemu-ppc, Alexander Graf, Alex Williamson

On 06/06/16 23:31, Paolo Bonzini wrote:
> 
> 
> On 02/06/2016 05:35, David Gibson wrote:
>> On Wed, Jun 01, 2016 at 06:57:37PM +1000, Alexey Kardashevskiy wrote:
>>>> Every IOMMU has some granularity which MemoryRegionIOMMUOps::translate
>>>> uses when translating, however this information is not available outside
>>>> the translate context for various checks.
>>>>
>>>> This adds a get_page_sizes callback to MemoryRegionIOMMUOps and
>>>> a wrapper for it so IOMMU users (such as VFIO) can know the actual
>>>> page size(s) used by an IOMMU.
>>>>
>>>> As IOMMU MR represents a guest IOMMU, this uses TARGET_PAGE_SIZE
>>>> as fallback.
>>>>
>>>> This removes vfio_container_granularity() and uses new helper in
>>>> memory_region_iommu_replay() when replaying IOMMU mappings on added
>>>> IOMMU memory region.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>> Paolo,
>>
>> Looks like you were left off the CC for this one.
>>
>> I think this is ready to go - do you want to merge, comment or ack and
>> we'll take it either through my tree or Alex's?
> 
> It's okay for you to merge, but the callback should be called
> "get_page_size" or "get_replay_granularity".  The plural is weird.

I'll repost next time with get_page_size() then. Thanks.
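
For reference, the optional callback plus fallback described in the commit message could be sketched roughly as below. This is only an illustration under assumptions: the Sketch* types, sketch_iommu_get_page_size() and SKETCH_TARGET_PAGE_SIZE are hypothetical stand-ins, not the actual MemoryRegionIOMMUOps definitions.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for TARGET_PAGE_SIZE; the real value is per-target. */
#define SKETCH_TARGET_PAGE_SIZE 4096ULL

/* Hypothetical, cut-down versions of the ops and region structures. */
typedef struct SketchIOMMUOps {
    /* Optional callback: returns the IOMMU page size in bytes. */
    uint64_t (*get_page_size)(void *opaque);
} SketchIOMMUOps;

typedef struct SketchIOMMURegion {
    const SketchIOMMUOps *ops;
    void *opaque;
} SketchIOMMURegion;

/* Wrapper: ask the IOMMU if it implements the callback, else fall back. */
static uint64_t sketch_iommu_get_page_size(const SketchIOMMURegion *mr)
{
    if (mr->ops && mr->ops->get_page_size) {
        return mr->ops->get_page_size(mr->opaque);
    }
    return SKETCH_TARGET_PAGE_SIZE;
}

/* Example callback for an IOMMU that uses 64K pages. */
static uint64_t sketch_64k_page_size(void *opaque)
{
    (void)opaque;
    return 1ULL << 16;
}
```

A caller such as VFIO would only ever go through the wrapper, so an IOMMU that does not implement the callback still yields a usable granularity.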



-- 
Alexey


* Re: [Qemu-devel] [PATCH qemu v17 10/12] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)
  2016-06-06  6:45     ` Alexey Kardashevskiy
@ 2016-06-08  5:56       ` David Gibson
  0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2016-06-08  5:56 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, Alex Williamson


On Mon, Jun 06, 2016 at 04:45:50PM +1000, Alexey Kardashevskiy wrote:
> On 03/06/16 17:37, David Gibson wrote:
> > On Wed, Jun 01, 2016 at 06:57:41PM +1000, Alexey Kardashevskiy wrote:
> >> The new VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
> >> This adds the ability for VFIO common code to dynamically allocate/remove
> >> DMA windows in the host kernel when a new VFIO container is added/removed.
> >>
> >> This adds the VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
> >> and adds the just-created IOMMU to the host IOMMU list; the opposite
> >> action is taken in vfio_listener_region_del.
> >>
> >> When creating a new window, this uses a heuristic to decide on the number
> >> of TCE table levels.
> >>
> >> This should cause no guest visible change in behavior.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >> Changes:
> >> v17:
> >> * moved spapr window create/remove helpers to separate file
> >> * added hw_error() if vfio_host_win_del() failed
> >>
> >> v16:
> >> * used memory_region_iommu_get_page_sizes() in vfio_listener_region_add()
> >> * enforced no intersections between windows
> >>
> >> v14:
> >> * new to the series
> >> ---
> >>  hw/vfio/common.c              | 76 +++++++++++++++++++++++++++++++++++++------
> >>  hw/vfio/spapr.c               | 70 +++++++++++++++++++++++++++++++++++++++
> >>  include/hw/vfio/vfio-common.h |  6 ++++
> >>  trace-events                  |  2 ++
> >>  4 files changed, 144 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> index 52b08fd..7f55c26 100644
> >> --- a/hw/vfio/common.c
> >> +++ b/hw/vfio/common.c
> >> @@ -275,6 +275,18 @@ static void vfio_host_win_add(VFIOContainer *container,
> >>      QLIST_INSERT_HEAD(&container->hostwin_list, hostwin, hostwin_next);
> >>  }
> >>  
> >> +static int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova)
> >> +{
> >> +    VFIOHostDMAWindow *hostwin = vfio_host_win_lookup(container, min_iova, 1);
> > 
> > Hrm.. and for this case I think you want exact match, rather than
> > looking for range inclusion.
> 
> I suppose so, I'll change this.
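
To make the distinction concrete, here is a minimal sketch of range-inclusion lookup versus exact-match deletion. SketchWindow and both helpers are hypothetical simplifications (a flat array instead of the QLIST, inclusive bounds assumed), not the code from the patch.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical, simplified window record (inclusive bounds). */
typedef struct SketchWindow {
    uint64_t min_iova;
    uint64_t max_iova;
    bool in_use;
} SketchWindow;

/* Range inclusion: any window that covers [min_iova, max_iova]. */
static SketchWindow *sketch_win_lookup(SketchWindow *wins, size_t n,
                                       uint64_t min_iova, uint64_t max_iova)
{
    for (size_t i = 0; i < n; i++) {
        if (wins[i].in_use &&
            wins[i].min_iova <= min_iova && max_iova <= wins[i].max_iova) {
            return &wins[i];
        }
    }
    return NULL;
}

/* Exact match: only the window that starts at min_iova may be deleted. */
static int sketch_win_del(SketchWindow *wins, size_t n, uint64_t min_iova)
{
    for (size_t i = 0; i < n; i++) {
        if (wins[i].in_use && wins[i].min_iova == min_iova) {
            wins[i].in_use = false;
            return 0;
        }
    }
    return -1;
}
```

With a window at 0x1000..0x1fff, a lookup for 0x1800 succeeds, but a delete request for 0x1800 is rejected; only a delete for 0x1000 removes the window.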
> 
> 
> >> +
> >> +    if (!hostwin) {
> >> +        return -1;
> >> +    }
> >> +    QLIST_REMOVE(hostwin, hostwin_next);
> >> +
> >> +    return 0;
> >> +}
> >> +
> >>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> >>  {
> >>      return (!memory_region_is_ram(section->mr) &&
> >> @@ -388,6 +400,30 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >>      }
> >>      end = int128_get64(int128_sub(llend, int128_one()));
> >>  
> >> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> >> +        VFIOHostDMAWindow *hostwin;
> >> +        hwaddr pgsize = 0;
> >> +
> >> +        /* For now intersections are not allowed, we may relax this later */
> >> +        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
> >> +            if (ranges_overlap(hostwin->min_iova,
> >> +                               hostwin->max_iova - hostwin->min_iova + 1,
> >> +                               section->offset_within_address_space,
> >> +                               int128_get64(section->size))) {
> >> +                goto fail;
> >> +            }
> >> +        }
> >> +
> >> +        ret = vfio_spapr_create_window(container, section, &pgsize);
> >> +        if (ret) {
> >> +            goto fail;
> >> +        }
> >> +
> >> +        vfio_host_win_add(container, section->offset_within_address_space,
> >> +                          section->offset_within_address_space +
> >> +                          int128_get64(section->size) - 1, pgsize);
> >> +    }
> >> +
> >>      if (!vfio_host_win_lookup(container, iova, end)) {
> >>          error_report("vfio: IOMMU container %p can't map guest IOVA region"
> >>                       " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
> >> @@ -523,6 +559,18 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >>                       "0x%"HWADDR_PRIx") = %d (%m)",
> >>                       container, iova, int128_get64(llsize), ret);
> >>      }
> >> +
> >> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> >> +        vfio_spapr_remove_window(container,
> >> +                                 section->offset_within_address_space);
> > 
> > Should check for error here.
> 
> 
> And do what here? vfio_spapr_remove_window() calls error_report() already
> and I still want to remove the host window here.

Hmm.. yes, I guess so.  Probably best to have a comment here saying
why it's safe to ignore an error you should usually test for.
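
Something along these lines, perhaps. This is a sketch with hypothetical names (SketchHostWin, sketch_region_del, the remove hook), assuming the host-side helper reports its own failure the way vfio_spapr_remove_window() is said to above.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical bookkeeping for one host window. */
typedef struct SketchHostWin {
    uint64_t start;
    bool tracked;
} SketchHostWin;

/* Hypothetical host-side removal hook; returns 0 or a negative value. */
typedef int (*sketch_remove_fn)(uint64_t start);

static void sketch_region_del(SketchHostWin *win, sketch_remove_fn host_remove)
{
    /*
     * The return value is deliberately ignored: the hook is assumed to
     * report its own failure, and the window must be dropped from our
     * bookkeeping whether or not the host managed to remove it.
     */
    (void)host_remove(win->start);

    win->tracked = false;
}

/* Stub that always fails, standing in for a failing ioctl. */
static int sketch_failing_remove(uint64_t start)
{
    fprintf(stderr, "sketch: failed to remove window at %llx\n",
            (unsigned long long)start);
    return -1;
}
```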

> > 
> >> +        if (vfio_host_win_del(container,
> >> +                              section->offset_within_address_space) < 0) {
> >> +            hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
> >> +                     __func__, section->offset_within_address_space);
> > 
> > Personally I think assert() would be better here, but Alex doesn't
> > like them so I'm ok with this.
> > 
> >> +        }
> >> +
> >> +        trace_vfio_spapr_remove_window(section->offset_within_address_space);
> >> +    }
> >>  }
> >>  
> >>  static const MemoryListener vfio_memory_listener = {
> >> @@ -960,11 +1008,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>              }
> >>          }
> >>  
> >> -        /*
> >> -         * This only considers the host IOMMU's 32-bit window.  At
> >> -         * some point we need to add support for the optional 64-bit
> >> -         * window and dynamic windows
> >> -         */
> >>          info.argsz = sizeof(info);
> >>          ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
> >>          if (ret) {
> >> @@ -973,11 +1016,24 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
> >>              goto listener_release_exit;
> >>          }
> >>  
> >> -        /* The default table uses 4K pages */
> >> -        vfio_host_win_add(container, info.dma32_window_start,
> >> -                          info.dma32_window_start +
> >> -                          info.dma32_window_size - 1,
> >> -                          0x1000);
> >> +        if (v2) {
> >> +            /*
> >> +             * There is a default window in just created container.
> >> +             * To make region_add/del simpler, we better remove this
> >> +             * window now and let those iommu_listener callbacks
> >> +             * create/remove them when needed.
> >> +             */
> >> +            ret = vfio_spapr_remove_window(container, info.dma32_window_start);
> >> +            if (ret) {
> >> +                goto free_container_exit;
> >> +            }
> >> +        } else {
> >> +            /* The default table uses 4K pages */
> >> +            vfio_host_win_add(container, info.dma32_window_start,
> >> +                              info.dma32_window_start +
> >> +                              info.dma32_window_size - 1,
> >> +                              0x1000);
> >> +        }
> >>      } else {
> >>          error_report("vfio: No available IOMMU models");
> >>          ret = -EINVAL;
> >> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
> >> index f339472..0c784c4 100644
> >> --- a/hw/vfio/spapr.c
> >> +++ b/hw/vfio/spapr.c
> >> @@ -135,3 +135,73 @@ const MemoryListener vfio_prereg_listener = {
> >>      .region_add = vfio_prereg_listener_region_add,
> >>      .region_del = vfio_prereg_listener_region_del,
> >>  };
> >> +
> >> +int vfio_spapr_create_window(VFIOContainer *container,
> >> +                             MemoryRegionSection *section,
> >> +                             hwaddr *pgsize)
> >> +{
> >> +    int ret;
> >> +    unsigned pagesizes = memory_region_iommu_get_page_sizes(section->mr);
> >> +    unsigned pagesize = (hwaddr)1 << ctz64(pagesizes);
> >> +    unsigned entries, pages;
> >> +    struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
> >> +
> >> +    /*
> >> +     * FIXME: For VFIO iommu types which have KVM acceleration to
> >> +     * avoid bouncing all map/unmaps through qemu this way, this
> >> +     * would be the right place to wire that up (tell the KVM
> >> +     * device emulation the VFIO iommu handles to use).
> >> +     */
> >> +    create.window_size = int128_get64(section->size);
> >> +    create.page_shift = ctz64(pagesize);
> > 
> > Doing a ctz on a value which is defined as 1 << n seems a bit
> > perverse.
> 
> Well, this way it felt more obvious that pagesize is a single page size,
> not a mask. Not sure if memory_region_iommu_get_page_sizes() returning a
> mask (rather than a page size) is a good idea after all...
> 
> I'll make it:
> 
>  create.page_shift = ctz64(pagesizes);
> 
> and (below):
> 
>  *pgsize = 1ULL << create.page_shift;

Yes, I think that's better.
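
As a small illustration of why the intermediate pagesize is redundant: the shift of the smallest supported page is just the index of the lowest set bit of the mask, and the size follows from the shift. The helpers below are hypothetical and use a portable loop; QEMU itself would use ctz64().

```c
#include <stdint.h>

/*
 * Hypothetical helper: index of the lowest set bit of a page size mask,
 * i.e. the shift of the smallest supported page size. Returns -1 for an
 * empty mask.
 */
static int sketch_min_page_shift(uint64_t pagesizes)
{
    int shift = 0;

    if (pagesizes == 0) {
        return -1;
    }
    while (!(pagesizes & 1)) {
        pagesizes >>= 1;
        shift++;
    }
    return shift;
}

/* Page size in bytes for a given shift, as in "1ULL << create.page_shift". */
static uint64_t sketch_page_size(int shift)
{
    return 1ULL << shift;
}
```

For a mask advertising 4K and 64K pages, the shift is 12 and the resulting page size is 4096.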

> and remove pagesize.
> 
> >> +    /*
> >> +     * SPAPR host supports multilevel TCE tables, there is some
> >> +     * heuristic to decide how many levels we want for our table:
> >> +     * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
> >> +     */
> >> +    entries = create.window_size >> create.page_shift;
> >> +    pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
> >> +    pages = MAX(pow2ceil(pages) - 1, 1); /* Round up */
> >> +    create.levels = ctz64(pages) / 6 + 1;
> >> +
> >> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> >> +    if (ret) {
> >> +        error_report("Failed to create a window, ret = %d (%m)", ret);
> >> +        return -errno;
> >> +    }
> >> +
> >> +    if (create.start_addr != section->offset_within_address_space) {
> >> +        vfio_spapr_remove_window(container, create.start_addr);
> >> +
> >> +        error_report("Host doesn't support DMA window at %"HWADDR_PRIx", must be %"PRIx64,
> >> +                     section->offset_within_address_space,
> >> +                     create.start_addr);
> >> +        ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> >> +        return -EINVAL;
> >> +    }
> >> +    trace_vfio_spapr_create_window(create.page_shift,
> >> +                                   create.window_size,
> >> +                                   create.start_addr);
> >> +    *pgsize = pagesize;
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +int vfio_spapr_remove_window(VFIOContainer *container,
> >> +                             hwaddr offset_within_address_space)
> >> +{
> >> +    struct vfio_iommu_spapr_tce_remove remove = {
> >> +        .argsz = sizeof(remove),
> >> +        .start_addr = offset_within_address_space,
> >> +    };
> >> +    int ret;
> >> +
> >> +    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> >> +    if (ret) {
> >> +        error_report("Failed to remove window at %"PRIx64,
> >> +                     remove.start_addr);
> >> +        return -errno;
> >> +    }
> >> +
> >> +    return 0;
> >> +}
> >> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >> index c76ddc4..7e80382 100644
> >> --- a/include/hw/vfio/vfio-common.h
> >> +++ b/include/hw/vfio/vfio-common.h
> >> @@ -167,4 +167,10 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
> >>  #endif
> >>  extern const MemoryListener vfio_prereg_listener;
> >>  
> >> +int vfio_spapr_create_window(VFIOContainer *container,
> >> +                             MemoryRegionSection *section,
> >> +                             hwaddr *pgsize);
> >> +int vfio_spapr_remove_window(VFIOContainer *container,
> >> +                             hwaddr offset_within_address_space);
> >> +
> >>  #endif /* !HW_VFIO_VFIO_COMMON_H */
> >> diff --git a/trace-events b/trace-events
> >> index ddb8676..ec32c20 100644
> >> --- a/trace-events
> >> +++ b/trace-events
> >> @@ -1768,6 +1768,8 @@ vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sp
> >>  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
> >>  vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> >>  vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> >> +vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
> >> +vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
> >>  
> >>  # hw/vfio/platform.c
> >>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
> > 
> 
> 
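
The TCE level heuristic described in the comment in vfio_spapr_create_window() (0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4) can be sketched by comparing the page count against the thresholds directly. The helper names are hypothetical, 8-byte TCE entries are assumed, and the host page size is passed in as a parameter where the patch calls getpagesize(). Note that a value of the form 2^n - 1 is odd, so counting its trailing zeros always gives 0; that is why this sketch avoids the ctz-based form.

```c
#include <stdint.h>

/*
 * Number of host pages needed to hold the TCE entries for a window,
 * assuming 8-byte entries; never less than one page.
 */
static uint64_t sketch_tce_table_pages(uint64_t window_size,
                                       unsigned page_shift,
                                       uint64_t host_page_size)
{
    uint64_t entries = window_size >> page_shift;
    uint64_t pages = (entries * sizeof(uint64_t)) / host_page_size;

    return pages ? pages : 1;
}

/* Levels per the comment: 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; more = 4 */
static unsigned sketch_tce_levels(uint64_t pages)
{
    if (pages <= 64) {
        return 1;
    }
    if (pages <= 4096) {
        return 2;
    }
    if (pages <= 262144) {
        return 3;
    }
    return 4;
}
```

For example, a 1GB window with 4K IOMMU pages on a host with 64K pages needs 32 table pages, which stays at one level; a 1TB window with 64K IOMMU pages needs 2048 table pages, which gives two levels.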




-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH qemu v17 11/12] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-06-06  8:12     ` Alexey Kardashevskiy
@ 2016-06-08  5:57       ` David Gibson
  2016-06-08  6:09         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 38+ messages in thread
From: David Gibson @ 2016-06-08  5:57 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, Alex Williamson


On Mon, Jun 06, 2016 at 06:12:58PM +1000, Alexey Kardashevskiy wrote:
> On 06/06/16 15:57, David Gibson wrote:
> > On Wed, Jun 01, 2016 at 06:57:42PM +1000, Alexey Kardashevskiy wrote:
> >> This adds support for Dynamic DMA Windows (DDW) option defined by
> >> the SPAPR specification which allows additional DMA window(s).
> >>
> >> The "ddw" property is enabled by default on a PHB but for compatibility
> >> the pseries-2.5 machine (TODO: update version) and older disable it.
> > 
> > Looks like your todo is now todone, but you need to update the commit
> > message.
> > 
> >> This also creates a single DMA window for the older machines to
> >> maintain backward migration.
> >>
> >> This implements DDW for PHB with emulated and VFIO devices. Host
> >> kernel support is required. The advertised IOMMU page sizes are 4K and
> >> 64K; 16M pages are supported but not advertised by default. To enable
> >> them, the user has to specify the "pgsz" property for the PHB and
> >> enable huge pages for RAM.
> >>
> >> Existing Linux guests try to create one additional huge DMA window
> >> with 64K or 16MB pages and map the entire guest RAM into it. If this
> >> succeeds, the guest switches to dma_direct_ops and never calls TCE
> >> hypercalls (H_PUT_TCE,...) again. This enables VFIO devices to use the
> >> entire RAM and not waste time on map/unmap later. This adds a
> >> "dma64_win_addr" property which is a bus address for the 64bit window
> >> and by default set to 0x800.0000.0000.0000 as this is what the modern
> >> POWER8 hardware uses and this allows having emulated and VFIO devices
> >> on the same bus.
> >>
> >> This adds 4 RTAS handlers:
> >> * ibm,query-pe-dma-window
> >> * ibm,create-pe-dma-window
> >> * ibm,remove-pe-dma-window
> >> * ibm,reset-pe-dma-window
> >> These are registered from a type_init() callback.
> >>
> >> These RTAS handlers are implemented in a separate file to avoid polluting
> >> spapr_iommu.c with PCI.
> >>
> >> This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > 
> > Looks pretty close to ready.
> > 
> > There are a handful of nits and one real error noted below.
> > 
> >> ---
> >> Changes:
> >> v17:
> >> * fixed: "query" returned a non-page-shifted value when memory hotplug is enabled
> >>
> >> v16:
> >> * s/dma_liobn/dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS]/
> >> * s/SPAPR_PCI_LIOBN()/dma_liobn[]/
> >>
> >> v15:
> >> * moved page mask filtering to PHB realize(), use "-mem-path" to know
> >> if there are huge pages
> >> * fixed error reporting in RTAS handlers
> >> * max window size now accounts for hotpluggable memory boundaries
> >> ---
> >>  hw/ppc/Makefile.objs        |   1 +
> >>  hw/ppc/spapr.c              |   5 +
> >>  hw/ppc/spapr_pci.c          |  77 +++++++++---
> >>  hw/ppc/spapr_rtas_ddw.c     | 293 ++++++++++++++++++++++++++++++++++++++++++++
> >>  include/hw/pci-host/spapr.h |   8 +-
> >>  include/hw/ppc/spapr.h      |  16 ++-
> >>  trace-events                |   4 +
> >>  7 files changed, 383 insertions(+), 21 deletions(-)
> >>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
> >>
> >> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> >> index c1ffc77..986b36f 100644
> >> --- a/hw/ppc/Makefile.objs
> >> +++ b/hw/ppc/Makefile.objs
> >> @@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
> >>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
> >>  obj-y += spapr_pci_vfio.o
> >>  endif
> >> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
> >>  # PowerPC 4xx boards
> >>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
> >>  obj-y += ppc4xx_pci.o
> >> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >> index 44e401a..6ddcda9 100644
> >> --- a/hw/ppc/spapr.c
> >> +++ b/hw/ppc/spapr.c
> >> @@ -2366,6 +2366,11 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
> >>          .driver   = "spapr-vlan", \
> >>          .property = "use-rx-buffer-pools", \
> >>          .value    = "off", \
> >> +    }, \
> >> +    {\
> >> +        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
> >> +        .property = "ddw",\
> >> +        .value    = stringify(off),\
> >>      },
> >>  
> >>  static void spapr_machine_2_5_instance_options(MachineState *machine)
> >> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> >> index 68de523..bcf0360 100644
> >> --- a/hw/ppc/spapr_pci.c
> >> +++ b/hw/ppc/spapr_pci.c
> >> @@ -35,6 +35,7 @@
> >>  #include "hw/ppc/spapr.h"
> >>  #include "hw/pci-host/spapr.h"
> >>  #include "exec/address-spaces.h"
> >> +#include "exec/ram_addr.h"
> >>  #include <libfdt.h>
> >>  #include "trace.h"
> >>  #include "qemu/error-report.h"
> >> @@ -45,6 +46,7 @@
> >>  #include "hw/ppc/spapr_drc.h"
> >>  #include "sysemu/device_tree.h"
> >>  #include "sysemu/kvm.h"
> >> +#include "sysemu/hostmem.h"
> >>  
> >>  #include "hw/vfio/vfio.h"
> >>  
> >> @@ -1088,7 +1090,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
> >>      int fdt_start_offset = 0, fdt_size;
> >>  
> >>      if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
> >> -        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
> >> +        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
> >>  
> >>          spapr_tce_set_need_vfio(tcet, true);
> >>      }
> > 
> > Hang on.. I thought you'd got rid of the need for this explicit
> > set_need_vfio() stuff.
> 
> 
> It is in 12/12 (which I'll split in 2 halves when I respin this), I moved
> it to the end as it is not essential for DDW itself.

Yes, sorry, I saw that shortly after writing this.

> >> @@ -1310,11 +1312,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >>      PCIBus *bus;
> >>      uint64_t msi_window_size = 4096;
> >>      sPAPRTCETable *tcet;
> >> +    const unsigned windows_supported =
> >> +        sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
> >>  
> >>      if (sphb->index != (uint32_t)-1) {
> >>          hwaddr windows_base;
> >>  
> >> -        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn != (uint32_t)-1)
> >> +        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
> >> +            || ((sphb->dma_liobn[1] != (uint32_t)-1) && (windows_supported > 1))
> >>              || (sphb->mem_win_addr != (hwaddr)-1)
> >>              || (sphb->io_win_addr != (hwaddr)-1)) {
> >>              error_setg(errp, "Either \"index\" or other parameters must"
> >> @@ -1329,7 +1334,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >>          }
> >>  
> >>          sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
> >> -        sphb->dma_liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
> >> +        for (i = 0; i < windows_supported; ++i) {
> >> +            sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
> >> +        }
> >>  
> >>          windows_base = SPAPR_PCI_WINDOW_BASE
> >>              + sphb->index * SPAPR_PCI_WINDOW_SPACING;
> >> @@ -1342,8 +1349,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >>          return;
> >>      }
> >>  
> >> -    if (sphb->dma_liobn == (uint32_t)-1) {
> >> -        error_setg(errp, "LIOBN not specified for PHB");
> >> +    if ((sphb->dma_liobn[0] == (uint32_t)-1) ||
> >> +        ((sphb->dma_liobn[1] == (uint32_t)-1) && (windows_supported > 1))) {
> >> +        error_setg(errp, "LIOBN(s) not specified for PHB");
> >>          return;
> >>      }
> > 
> > Hrm.. there's a bit of false generality here, since this would break
> > if windows_supported > 2, and dma_liobn[2] was not specified.  Not
> > urgent for the initial commit though.
> 
> 
> Is s/windows_supported > 1/windows_supported == 2/ any better here?

Not really.  Unless you also have a windows_supported > 2 case (which
could just error / abort / whatever).
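
One way to avoid hard-coding the second window is to loop over the LIOBNs actually in use and reject a window count the array cannot describe. This is a sketch only: SKETCH_MAX_WINDOWS stands in for SPAPR_PCI_DMA_MAX_WINDOWS, and sketch_liobns_valid() is a hypothetical helper, not something from the patch.

```c
#include <stdint.h>
#include <stdbool.h>

#define SKETCH_MAX_WINDOWS 2                 /* stand-in for SPAPR_PCI_DMA_MAX_WINDOWS */
#define SKETCH_LIOBN_UNSET ((uint32_t)-1)    /* "not specified" marker */

/*
 * True if every LIOBN that will be used is specified. A window count
 * larger than the array is treated as an error rather than silently
 * skipping the extra entries.
 */
static bool sketch_liobns_valid(const uint32_t dma_liobn[SKETCH_MAX_WINDOWS],
                                unsigned windows_supported)
{
    if (windows_supported > SKETCH_MAX_WINDOWS) {
        return false;
    }
    for (unsigned i = 0; i < windows_supported; i++) {
        if (dma_liobn[i] == SKETCH_LIOBN_UNSET) {
            return false;
        }
    }
    return true;
}
```

The realize() check would then collapse to a single call, and raising the maximum window count later would not require revisiting the condition.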

> >> @@ -1461,16 +1469,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >>          }
> >>      }
> >>  
> >> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
> >> -    if (!tcet) {
> >> -        error_setg(errp, "Unable to create TCE table for %s",
> >> -                   sphb->dtbusname);
> >> -        return;
> >> +    /* DMA setup */
> >> +    for (i = 0; i < windows_supported; ++i) {
> >> +        tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn[i]);
> >> +        if (!tcet) {
> >> +            error_setg(errp, "Creating window#%d failed for %s",
> >> +                       i, sphb->dtbusname);
> >> +            return;
> >> +        }
> >> +        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
> >> +                                            spapr_tce_get_iommu(tcet), 0);
> >>      }
> >>  
> >> -    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
> >> -                                        spapr_tce_get_iommu(tcet), 0);
> >> -
> >>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
> >>  }
> >>  
> >> @@ -1487,13 +1497,19 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
> >>  
> >>  void spapr_phb_dma_reset(sPAPRPHBState *sphb)
> >>  {
> >> -    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
> >> +    int i;
> >> +    sPAPRTCETable *tcet;
> >>  
> >> -    if (tcet && tcet->nb_table) {
> >> -        spapr_tce_table_disable(tcet);
> >> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> >> +        tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[i]);
> >> +
> >> +        if (tcet && tcet->nb_table) {
> >> +            spapr_tce_table_disable(tcet);
> >> +        }
> >>      }
> >>  
> >>      /* Register default 32bit DMA window */
> >> +    tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[0]);
> >>      spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
> >>                             sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT);
> >>  }
> >> @@ -1515,7 +1531,8 @@ static void spapr_phb_reset(DeviceState *qdev)
> >>  static Property spapr_phb_properties[] = {
> >>      DEFINE_PROP_UINT32("index", sPAPRPHBState, index, -1),
> >>      DEFINE_PROP_UINT64("buid", sPAPRPHBState, buid, -1),
> >> -    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn, -1),
> >> +    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn[0], -1),
> >> +    DEFINE_PROP_UINT32("liobn64", sPAPRPHBState, dma_liobn[1], -1),
> >>      DEFINE_PROP_UINT64("mem_win_addr", sPAPRPHBState, mem_win_addr, -1),
> >>      DEFINE_PROP_UINT64("mem_win_size", sPAPRPHBState, mem_win_size,
> >>                         SPAPR_PCI_MMIO_WIN_SIZE),
> >> @@ -1527,6 +1544,11 @@ static Property spapr_phb_properties[] = {
> >>      /* Default DMA window is 0..1GB */
> >>      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
> >>      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
> >> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
> >> +                       0x800000000000000ULL),
> >> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
> >> +    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
> >> +                       (1ULL << 12) | (1ULL << 16)),
> >>      DEFINE_PROP_END_OF_LIST(),
> >>  };
> >>  
> >> @@ -1603,7 +1625,7 @@ static const VMStateDescription vmstate_spapr_pci = {
> >>      .post_load = spapr_pci_post_load,
> >>      .fields = (VMStateField[]) {
> >>          VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),
> >> -        VMSTATE_UINT32_EQUAL(dma_liobn, sPAPRPHBState),
> >> +        VMSTATE_UNUSED(4), /* dma_liobn */
> >>          VMSTATE_UINT64_EQUAL(mem_win_addr, sPAPRPHBState),
> >>          VMSTATE_UINT64_EQUAL(mem_win_size, sPAPRPHBState),
> >>          VMSTATE_UINT64_EQUAL(io_win_addr, sPAPRPHBState),
> >> @@ -1780,6 +1802,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
> >>      uint32_t interrupt_map_mask[] = {
> >>          cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
> >>      uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
> >> +    uint32_t ddw_applicable[] = {
> >> +        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
> >> +        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
> >> +        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
> >> +    };
> >> +    uint32_t ddw_extensions[] = {
> >> +        cpu_to_be32(1),
> >> +        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
> >> +    };
> >>      sPAPRTCETable *tcet;
> >>      PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
> >>      sPAPRFDT s_fdt;
> >> @@ -1804,6 +1835,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
> >>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
> >>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
> >>  
> >> +    /* Dynamic DMA window */
> >> +    if (phb->ddw_enabled) {
> >> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
> >> +                         sizeof(ddw_applicable)));
> >> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
> >> +                         &ddw_extensions, sizeof(ddw_extensions)));
> >> +    }
> >> +
> >>      /* Build the interrupt-map, this must matches what is done
> >>       * in pci_spapr_map_irq
> >>       */
> >> @@ -1827,7 +1866,7 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
> >>      _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
> >>                       sizeof(interrupt_map)));
> >>  
> >> -    tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
> >> +    tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
> >>      if (!tcet) {
> >>          return -1;
> >>      }
> >> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
> >> new file mode 100644
> >> index 0000000..17bbae0
> >> --- /dev/null
> >> +++ b/hw/ppc/spapr_rtas_ddw.c
> >> @@ -0,0 +1,293 @@
> >> +/*
> >> + * QEMU sPAPR Dynamic DMA windows support
> >> + *
> >> + * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
> >> + *
> >> + *  This program is free software; you can redistribute it and/or modify
> >> + *  it under the terms of the GNU General Public License as published by
> >> + *  the Free Software Foundation; either version 2 of the License,
> >> + *  or (at your option) any later version.
> >> + *
> >> + *  This program is distributed in the hope that it will be useful,
> >> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >> + *  GNU General Public License for more details.
> >> + *
> >> + *  You should have received a copy of the GNU General Public License
> >> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
> >> + */
> >> +
> >> +#include "qemu/osdep.h"
> >> +#include "cpu.h"
> >> +#include "qemu/error-report.h"
> >> +#include "hw/ppc/spapr.h"
> >> +#include "hw/pci-host/spapr.h"
> >> +#include "trace.h"
> >> +
> >> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
> >> +{
> >> +    sPAPRTCETable *tcet;
> >> +
> >> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> >> +    if (tcet && tcet->nb_table) {
> >> +        ++*(unsigned *)opaque;
> >> +    }
> >> +    return 0;
> >> +}
> >> +
> >> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
> >> +{
> >> +    unsigned ret = 0;
> >> +
> >> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
> >> +{
> >> +    sPAPRTCETable *tcet;
> >> +
> >> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> >> +    if (tcet && !tcet->nb_table) {
> >> +        *(uint32_t *)opaque = tcet->liobn;
> >> +        return 1;
> >> +    }
> >> +    return 0;
> >> +}
> >> +
> >> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
> >> +{
> >> +    uint32_t liobn = 0;
> >> +
> >> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
> >> +
> >> +    return liobn;
> >> +}
> >> +
> >> +static uint32_t spapr_page_mask_to_query_mask(uint64_t page_mask)
> >> +{
> >> +    int i;
> >> +    uint32_t mask = 0;
> >> +    const struct { int shift; uint32_t mask; } masks[] = {
> >> +        { 12, RTAS_DDW_PGSIZE_4K },
> >> +        { 16, RTAS_DDW_PGSIZE_64K },
> >> +        { 24, RTAS_DDW_PGSIZE_16M },
> >> +        { 25, RTAS_DDW_PGSIZE_32M },
> >> +        { 26, RTAS_DDW_PGSIZE_64M },
> >> +        { 27, RTAS_DDW_PGSIZE_128M },
> >> +        { 28, RTAS_DDW_PGSIZE_256M },
> >> +        { 34, RTAS_DDW_PGSIZE_16G },
> >> +    };
> >> +
> >> +    for (i = 0; i < ARRAY_SIZE(masks); ++i) {
> >> +        if (page_mask & (1ULL << masks[i].shift)) {
> >> +            mask |= masks[i].mask;
> >> +        }
> >> +    }
> >> +
> >> +    return mask;
> >> +}
> >> +
> >> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
> >> +                                         sPAPRMachineState *spapr,
> >> +                                         uint32_t token, uint32_t nargs,
> >> +                                         target_ulong args,
> >> +                                         uint32_t nret, target_ulong rets)
> >> +{
> >> +    sPAPRPHBState *sphb;
> >> +    uint64_t buid, max_window_size;
> >> +    uint32_t avail, addr, pgmask = 0;
> >> +    MachineState *machine = MACHINE(spapr);
> >> +
> >> +    if ((nargs != 3) || (nret != 5)) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> >> +    addr = rtas_ld(args, 0);
> >> +    sphb = spapr_pci_find_phb(spapr, buid);
> >> +    if (!sphb || !sphb->ddw_enabled) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    /* Translate page mask to LoPAPR format */
> >> +    pgmask = spapr_page_mask_to_query_mask(sphb->page_size_mask);
> >> +
> >> +    /*
> >> +     * This is "Largest contiguous block of TCEs allocated specifically
> >> +     * for (that is, are reserved for) this PE".
> >> +     * Return the maximum, i.e. the maximum supported RAM size expressed
> >> +     * in 4K pages.
> >> +     */
> >> +    if (machine->ram_size == machine->maxram_size) {
> >> +        max_window_size = machine->ram_size;
> >> +    } else {
> >> +        MemoryHotplugState *hpms = &spapr->hotplug_memory;
> >> +
> >> +        max_window_size = hpms->base + memory_region_size(&hpms->mr);
> >> +    }
> >> +
> >> +    avail = SPAPR_PCI_DMA_MAX_WINDOWS - spapr_phb_get_active_win_num(sphb);
> >> +
> >> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> >> +    rtas_st(rets, 1, avail);
> >> +    rtas_st(rets, 2, max_window_size >> SPAPR_TCE_PAGE_SHIFT);
> >> +    rtas_st(rets, 3, pgmask);
> >> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
> >> +
> >> +    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
> >> +    return;
> >> +
> >> +param_error_exit:
> >> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> >> +}
> >> +
> >> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
> >> +                                          sPAPRMachineState *spapr,
> >> +                                          uint32_t token, uint32_t nargs,
> >> +                                          target_ulong args,
> >> +                                          uint32_t nret, target_ulong rets)
> >> +{
> >> +    sPAPRPHBState *sphb;
> >> +    sPAPRTCETable *tcet = NULL;
> >> +    uint32_t addr, page_shift, window_shift, liobn;
> >> +    uint64_t buid;
> >> +
> >> +    if ((nargs != 5) || (nret != 4)) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> >> +    addr = rtas_ld(args, 0);
> >> +    sphb = spapr_pci_find_phb(spapr, buid);
> >> +    if (!sphb || !sphb->ddw_enabled) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    page_shift = rtas_ld(args, 3);
> >> +    window_shift = rtas_ld(args, 4);
> >> +    liobn = spapr_phb_get_free_liobn(sphb);
> >> +
> >> +    if (!(sphb->page_size_mask & (1ULL << page_shift)) ||
> >> +        (window_shift < page_shift)) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    if (!liobn || !sphb->ddw_enabled ||
> >> +        spapr_phb_get_active_win_num(sphb) == SPAPR_PCI_DMA_MAX_WINDOWS) {
> >> +        goto hw_error_exit;
> >> +    }
> >> +
> >> +    tcet = spapr_tce_find_by_liobn(liobn);
> >> +    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
> >> +                                 1ULL << window_shift,
> >> +                                 tcet ? tcet->bus_offset : 0xbaadf00d, liobn);
> >> +    if (!tcet) {
> >> +        goto hw_error_exit;
> >> +    }
> >> +
> >> +    spapr_tce_table_enable(tcet, page_shift, sphb->dma64_window_addr,
> > 
> > This looks like it's assuming you're creating the second 64-bit
> > window.  If the guest removed the default window then tried to
> > recreate it, that might not be the case.
> 
> Yup, bug, was sitting there for a long time...
> 
> > 
> >> +                           1ULL << (window_shift - page_shift));
> >> +    if (!tcet->nb_table) {
> >> +        goto hw_error_exit;
> >> +    }
> >> +
> >> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> >> +    rtas_st(rets, 1, liobn);
> >> +    rtas_st(rets, 2, tcet->bus_offset >> 32);
> >> +    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
> >> +
> >> +    return;
> >> +
> >> +hw_error_exit:
> >> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> >> +    return;
> >> +
> >> +param_error_exit:
> >> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> >> +}
> >> +
> >> +static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
> >> +                                          sPAPRMachineState *spapr,
> >> +                                          uint32_t token, uint32_t nargs,
> >> +                                          target_ulong args,
> >> +                                          uint32_t nret, target_ulong rets)
> >> +{
> >> +    sPAPRPHBState *sphb;
> >> +    sPAPRTCETable *tcet;
> >> +    uint32_t liobn;
> >> +
> >> +    if ((nargs != 1) || (nret != 1)) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    liobn = rtas_ld(args, 0);
> >> +    tcet = spapr_tce_find_by_liobn(liobn);
> >> +    if (!tcet) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
> >> +    if (!sphb || !sphb->ddw_enabled || !tcet->nb_table) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    spapr_tce_table_disable(tcet);
> >> +    trace_spapr_iommu_ddw_remove(liobn);
> >> +
> >> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> >> +    return;
> >> +
> >> +param_error_exit:
> >> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> >> +}
> >> +
> >> +static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
> >> +                                         sPAPRMachineState *spapr,
> >> +                                         uint32_t token, uint32_t nargs,
> >> +                                         target_ulong args,
> >> +                                         uint32_t nret, target_ulong rets)
> >> +{
> >> +    sPAPRPHBState *sphb;
> >> +    uint64_t buid;
> >> +    uint32_t addr;
> >> +
> >> +    if ((nargs != 3) || (nret != 1)) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> >> +    addr = rtas_ld(args, 0);
> >> +    sphb = spapr_pci_find_phb(spapr, buid);
> >> +    if (!sphb || !sphb->ddw_enabled) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    spapr_phb_dma_reset(sphb);
> >> +    trace_spapr_iommu_ddw_reset(buid, addr);
> >> +
> >> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> >> +
> >> +    return;
> >> +
> >> +param_error_exit:
> >> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> >> +}
> >> +
> >> +static void spapr_rtas_ddw_init(void)
> >> +{
> >> +    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
> >> +                        "ibm,query-pe-dma-window",
> >> +                        rtas_ibm_query_pe_dma_window);
> >> +    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
> >> +                        "ibm,create-pe-dma-window",
> >> +                        rtas_ibm_create_pe_dma_window);
> >> +    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
> >> +                        "ibm,remove-pe-dma-window",
> >> +                        rtas_ibm_remove_pe_dma_window);
> >> +    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
> >> +                        "ibm,reset-pe-dma-window",
> >> +                        rtas_ibm_reset_pe_dma_window);
> >> +}
> >> +
> >> +type_init(spapr_rtas_ddw_init)
> >> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> >> index 7848366..36a370e 100644
> >> --- a/include/hw/pci-host/spapr.h
> >> +++ b/include/hw/pci-host/spapr.h
> >> @@ -32,6 +32,8 @@
> >>  #define SPAPR_PCI_HOST_BRIDGE(obj) \
> >>      OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
> >>  
> >> +#define SPAPR_PCI_DMA_MAX_WINDOWS    2
> >> +
> >>  typedef struct sPAPRPHBState sPAPRPHBState;
> >>  
> >>  typedef struct spapr_pci_msi {
> >> @@ -56,7 +58,7 @@ struct sPAPRPHBState {
> >>      hwaddr mem_win_addr, mem_win_size, io_win_addr, io_win_size;
> >>      MemoryRegion memwindow, iowindow, msiwindow;
> >>  
> >> -    uint32_t dma_liobn;
> >> +    uint32_t dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS];
> >>      hwaddr dma_win_addr, dma_win_size;
> >>      AddressSpace iommu_as;
> >>      MemoryRegion iommu_root;
> >> @@ -71,6 +73,10 @@ struct sPAPRPHBState {
> >>      spapr_pci_msi_mig *msi_devs;
> >>  
> >>      QLIST_ENTRY(sPAPRPHBState) list;
> >> +
> >> +    bool ddw_enabled;
> >> +    uint64_t page_size_mask;
> >> +    uint64_t dma64_window_addr;
> >>  };
> >>  
> >>  #define SPAPR_PCI_MAX_INDEX          255
> >> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >> index 971df3d..59fad22 100644
> >> --- a/include/hw/ppc/spapr.h
> >> +++ b/include/hw/ppc/spapr.h
> >> @@ -412,6 +412,16 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
> >>  #define RTAS_OUT_NOT_AUTHORIZED                 -9002
> >>  #define RTAS_OUT_SYSPARM_PARAM_ERROR            -9999
> >>  
> >> +/* DDW pagesize mask values from ibm,query-pe-dma-window */
> >> +#define RTAS_DDW_PGSIZE_4K       0x01
> >> +#define RTAS_DDW_PGSIZE_64K      0x02
> >> +#define RTAS_DDW_PGSIZE_16M      0x04
> >> +#define RTAS_DDW_PGSIZE_32M      0x08
> >> +#define RTAS_DDW_PGSIZE_64M      0x10
> >> +#define RTAS_DDW_PGSIZE_128M     0x20
> >> +#define RTAS_DDW_PGSIZE_256M     0x40
> >> +#define RTAS_DDW_PGSIZE_16G      0x80
> >> +
> >>  /* RTAS tokens */
> >>  #define RTAS_TOKEN_BASE      0x2000
> >>  
> >> @@ -453,8 +463,12 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
> >>  #define RTAS_IBM_SET_SLOT_RESET                 (RTAS_TOKEN_BASE + 0x23)
> >>  #define RTAS_IBM_CONFIGURE_PE                   (RTAS_TOKEN_BASE + 0x24)
> >>  #define RTAS_IBM_SLOT_ERROR_DETAIL              (RTAS_TOKEN_BASE + 0x25)
> >> +#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x26)
> >> +#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x27)
> >> +#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x28)
> >> +#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x29)
> >>  
> >> -#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x26)
> >> +#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x2A)
> >>  
> >>  /* RTAS ibm,get-system-parameter token values */
> >>  #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
> >> diff --git a/trace-events b/trace-events
> >> index ec32c20..dec80e4 100644
> >> --- a/trace-events
> >> +++ b/trace-events
> >> @@ -1433,6 +1433,10 @@ spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, un
> >>  spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
> >>  spapr_iommu_pre_save(uint64_t liobn, uint32_t nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
> >>  spapr_iommu_post_load(uint64_t liobn, uint32_t pre_nb, uint32_t post_nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" => %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
> >> +spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32
> >> +spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, uint64_t pg_size, uint64_t req_size, uint64_t start, uint32_t liobn) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%"PRIx64", requested=0x%"PRIx64", start addr=%"PRIx64", liobn=%"PRIx32
> >> +spapr_iommu_ddw_remove(uint32_t liobn) "liobn=%"PRIx32
> >> +spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr) "buid=%"PRIx64" addr=%"PRIx32
> >>  
> >>  # hw/ppc/ppc.c
> >>  ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
> > 
> 
> 




-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v17 06/12] memory: Add reporting of supported page sizes
  2016-06-06 13:31     ` Paolo Bonzini
  2016-06-07  3:42       ` Alexey Kardashevskiy
@ 2016-06-08  6:00       ` David Gibson
  2016-06-08  6:05         ` Alexey Kardashevskiy
  1 sibling, 1 reply; 38+ messages in thread
From: David Gibson @ 2016-06-08  6:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Alexey Kardashevskiy, qemu-devel, qemu-ppc, Alexander Graf,
	Alex Williamson


On Mon, Jun 06, 2016 at 03:31:04PM +0200, Paolo Bonzini wrote:
> 
> 
> On 02/06/2016 05:35, David Gibson wrote:
> > On Wed, Jun 01, 2016 at 06:57:37PM +1000, Alexey Kardashevskiy wrote:
> >> > Every IOMMU has some granularity which MemoryRegionIOMMUOps::translate
> >> > uses when translating, however this information is not available outside
> >> > the translate context for various checks.
> >> > 
> >> > This adds a get_page_sizes callback to MemoryRegionIOMMUOps and
> >> > a wrapper for it so IOMMU users (such as VFIO) can know the actual
> >> > page size(s) used by an IOMMU.
> >> > 
> >> > As IOMMU MR represents a guest IOMMU, this uses TARGET_PAGE_SIZE
> >> > as fallback.
> >> > 
> >> > This removes vfio_container_granularity() and uses new helper in
> >> > memory_region_iommu_replay() when replaying IOMMU mappings on added
> >> > IOMMU memory region.
> >> > 
> >> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > Paolo,
> > 
> > Looks like you were left off the CC for this one.
> > 
> > I think this is ready to go - do you want to merge, comment or ack and
> > we'll take it either through my tree or Alex's?
> 
> It's okay for you to merge, but the callback should be called
> "get_page_size" or "get_replay_granularity".  The plural is weird.

Hm, no, it really could return multiple page sizes if the logical
(guest side) IOMMU supports them.  That might be useful at some point
in the future.  For now, it's sufficient for the replay to use the
smallest pagesize.
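
To make "smallest pagesize" concrete for a bitmap return value, here is a
minimal sketch. smallest_page_size() is a hypothetical helper, not part of
the patch; the patch itself gets the same result with
(hwaddr)1 << ctz64(...) in memory_region_iommu_replay().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: given a bitmap in which
 * bit N set means "pages of 2^N bytes are supported", return the
 * smallest supported page size in bytes.
 */
static uint64_t smallest_page_size(uint64_t page_sizes)
{
    /* An empty bitmap has no smallest page size. */
    assert(page_sizes != 0);

    /* x & -x isolates the lowest set bit; ~x + 1 is -x for unsigned x. */
    return page_sizes & (~page_sizes + 1);
}
```

For a mask advertising 4K, 64K and 16M pages this yields 4096, which is the
step the replay loop would use.  With a single bit set, as spapr returns
today, the mask and the page size are the same number.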

> 
> Thanks,
> 
> Paolo
> 
> 
> >> > ---
> >> > Changes:
> >> > v16:
> >> > * used memory_region_iommu_get_page_sizes() instead of
> >> > mr->iommu_ops->get_page_sizes() in memory_region_iommu_replay()
> >> > 
> >> > v15:
> >> > * s/qemu_real_host_page_size/TARGET_PAGE_SIZE/ in memory_region_iommu_get_page_sizes
> >> > 
> >> > v14:
> >> > * removed vfio_container_granularity(), changed memory_region_iommu_replay()
> >> > 
> >> > v4:
> >> > * s/1<<TARGET_PAGE_BITS/qemu_real_host_page_size/
> >> > ---
> >> >  hw/ppc/spapr_iommu.c  |  8 ++++++++
> >> >  hw/vfio/common.c      |  6 ------
> >> >  include/exec/memory.h | 18 ++++++++++++++----
> >> >  memory.c              | 16 +++++++++++++---
> >> >  4 files changed, 35 insertions(+), 13 deletions(-)
> >> > 
> >> > diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> >> > index a3cc572..90a45c0 100644
> >> > --- a/hw/ppc/spapr_iommu.c
> >> > +++ b/hw/ppc/spapr_iommu.c
> >> > @@ -149,6 +149,13 @@ static void spapr_tce_table_pre_save(void *opaque)
> >> >                                 tcet->bus_offset, tcet->page_shift);
> >> >  }
> >> >  
> >> > +static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
> >> > +{
> >> > +    sPAPRTCETable *tcet = container_of(iommu, sPAPRTCETable, iommu);
> >> > +
> >> > +    return 1ULL << tcet->page_shift;
> >> > +}
> >> > +
> >> >  static int spapr_tce_table_post_load(void *opaque, int version_id)
> >> >  {
> >> >      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> >> > @@ -228,6 +235,7 @@ static const VMStateDescription vmstate_spapr_tce_table = {
> >> >  
> >> >  static MemoryRegionIOMMUOps spapr_iommu_ops = {
> >> >      .translate = spapr_tce_translate_iommu,
> >> > +    .get_page_sizes = spapr_tce_get_page_sizes,
> >> >  };
> >> >  
> >> >  static int spapr_tce_table_realize(DeviceState *dev)
> >> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> > index e51ed3a..f1a12b0 100644
> >> > --- a/hw/vfio/common.c
> >> > +++ b/hw/vfio/common.c
> >> > @@ -322,11 +322,6 @@ out:
> >> >      rcu_read_unlock();
> >> >  }
> >> >  
> >> > -static hwaddr vfio_container_granularity(VFIOContainer *container)
> >> > -{
> >> > -    return (hwaddr)1 << ctz64(container->iova_pgsizes);
> >> > -}
> >> > -
> >> >  static void vfio_listener_region_add(MemoryListener *listener,
> >> >                                       MemoryRegionSection *section)
> >> >  {
> >> > @@ -394,7 +389,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >> >  
> >> >          memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> >> >          memory_region_iommu_replay(giommu->iommu, &giommu->n,
> >> > -                                   vfio_container_granularity(container),
> >> >                                     false);
> >> >  
> >> >          return;
> >> > diff --git a/include/exec/memory.h b/include/exec/memory.h
> >> > index f649697..bd9625f 100644
> >> > --- a/include/exec/memory.h
> >> > +++ b/include/exec/memory.h
> >> > @@ -149,6 +149,8 @@ typedef struct MemoryRegionIOMMUOps MemoryRegionIOMMUOps;
> >> >  struct MemoryRegionIOMMUOps {
> >> >      /* Return a TLB entry that contains a given address. */
> >> >      IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool is_write);
> >> > +    /* Returns supported page sizes */
> >> > +    uint64_t (*get_page_sizes)(MemoryRegion *iommu);
> >> >  };
> >> >  
> >> >  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> >> > @@ -571,6 +573,15 @@ static inline bool memory_region_is_iommu(MemoryRegion *mr)
> >> >  
> >> >  
> >> >  /**
> >> > + * memory_region_iommu_get_page_sizes: get supported page sizes in an iommu
> >> > + *
> >> > + * Returns %bitmap of supported page sizes for an iommu.
> >> > + *
> >> > + * @mr: the memory region being queried
> >> > + */
> >> > +uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr);
> >> > +
> >> > +/**
> >> >   * memory_region_notify_iommu: notify a change in an IOMMU translation entry.
> >> >   *
> >> >   * @mr: the memory region that was changed
> >> > @@ -594,16 +605,15 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n);
> >> >  
> >> >  /**
> >> >   * memory_region_iommu_replay: replay existing IOMMU translations to
> >> > - * a notifier
> >> > + * a notifier with the minimum page granularity returned by
> >> > + * mr->iommu_ops->get_page_sizes().
> >> >   *
> >> >   * @mr: the memory region to observe
> >> >   * @n: the notifier to which to replay iommu mappings
> >> > - * @granularity: Minimum page granularity to replay notifications for
> >> >   * @is_write: Whether to treat the replay as a translate "write"
> >> >   *     through the iommu
> >> >   */
> >> > -void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n,
> >> > -                                hwaddr granularity, bool is_write);
> >> > +void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write);
> >> >  
> >> >  /**
> >> >   * memory_region_unregister_iommu_notifier: unregister a notifier for
> >> > diff --git a/memory.c b/memory.c
> >> > index 4e3cda8..761ae92 100644
> >> > --- a/memory.c
> >> > +++ b/memory.c
> >> > @@ -1500,12 +1500,22 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n)
> >> >      notifier_list_add(&mr->iommu_notify, n);
> >> >  }
> >> >  
> >> > -void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n,
> >> > -                                hwaddr granularity, bool is_write)
> >> > +uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr)
> >> >  {
> >> > -    hwaddr addr;
> >> > +    assert(memory_region_is_iommu(mr));
> >> > +    if (mr->iommu_ops && mr->iommu_ops->get_page_sizes) {
> >> > +        return mr->iommu_ops->get_page_sizes(mr);
> >> > +    }
> >> > +    return TARGET_PAGE_SIZE;
> >> > +}
> >> > +
> >> > +void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool is_write)
> >> > +{
> >> > +    hwaddr addr, granularity;
> >> >      IOMMUTLBEntry iotlb;
> >> >  
> >> > +    granularity = (hwaddr)1 << ctz64(memory_region_iommu_get_page_sizes(mr));
> >> > +
> >> >      for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v17 06/12] memory: Add reporting of supported page sizes
  2016-06-08  6:00       ` David Gibson
@ 2016-06-08  6:05         ` Alexey Kardashevskiy
  2016-06-14 21:41           ` Alexey Kardashevskiy
  2016-06-15  6:15           ` David Gibson
  0 siblings, 2 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-08  6:05 UTC (permalink / raw)
  To: David Gibson, Paolo Bonzini
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf


On 08/06/16 16:00, David Gibson wrote:
> On Mon, Jun 06, 2016 at 03:31:04PM +0200, Paolo Bonzini wrote:
>>
>>
>> On 02/06/2016 05:35, David Gibson wrote:
>>> On Wed, Jun 01, 2016 at 06:57:37PM +1000, Alexey Kardashevskiy wrote:
>>>>> Every IOMMU has some granularity which MemoryRegionIOMMUOps::translate
>>>>> uses when translating, however this information is not available outside
>>>>> the translate context for various checks.
>>>>>
>>>>> This adds a get_page_sizes callback to MemoryRegionIOMMUOps and
>>>>> a wrapper for it so IOMMU users (such as VFIO) can know the actual
>>>>> page size(s) used by an IOMMU.
>>>>>
>>>>> As IOMMU MR represents a guest IOMMU, this uses TARGET_PAGE_SIZE
>>>>> as fallback.
>>>>>
>>>>> This removes vfio_container_granularity() and uses new helper in
>>>>> memory_region_iommu_replay() when replaying IOMMU mappings on added
>>>>> IOMMU memory region.
>>>>>
>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>>> Paolo,
>>>
>>> Looks like you were left off the CC for this one.
>>>
>>> I think this is ready to go - do you want to merge, comment or ack and
>>> we'll take it either through my tree or Alex's?
>>
>> It's okay for you to merge, but the callback should be called
>> "get_page_size" or "get_replay_granularity".  The plural is weird.
> 
> Hm, no, it really could return multiple page sizes if the logical
> (guest side) IOMMU supports them.

It could, but it does not now, and I cannot see it coming in the near future,
so I am really confused about the naming and about what the callback should
return: one page size or a mask. Which should it be for now?


> That might be useful at some point
> in the future.  For now, it's sufficient for the replay to use the
> smallest pagesize.
> 


-- 
Alexey



* Re: [Qemu-devel] [PATCH qemu v17 11/12] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-06-08  5:57       ` David Gibson
@ 2016-06-08  6:09         ` Alexey Kardashevskiy
  2016-06-09  4:28           ` David Gibson
  0 siblings, 1 reply; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-08  6:09 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-devel, qemu-ppc, Alexander Graf, Alex Williamson


On 08/06/16 15:57, David Gibson wrote:
> On Mon, Jun 06, 2016 at 06:12:58PM +1000, Alexey Kardashevskiy wrote:
>> On 06/06/16 15:57, David Gibson wrote:
>>> On Wed, Jun 01, 2016 at 06:57:42PM +1000, Alexey Kardashevskiy wrote:
>>>> This adds support for the Dynamic DMA Windows (DDW) option defined by
>>>> the SPAPR specification, which allows additional DMA window(s).
>>>>
>>>> The "ddw" property is enabled by default on a PHB but for compatibility
>>>> the pseries-2.5 machine (TODO: update version) and older disable it.
>>>
>>> Looks like your todo is now todone, but you need to update the commit
>>> message.
>>>
>>>> This also creates a single DMA window for the older machines to
>>>> maintain backward migration.
>>>>
>>>> This implements DDW for PHB with emulated and VFIO devices. The host
>>>> kernel support is required. The advertised IOMMU page sizes are 4K and
>>>> 64K; 16M pages are supported but not advertised by default, in order to
>>>> enable them, the user has to specify "pgsz" property for PHB and
>>>> enable huge pages for RAM.
>>>>
>>>> The existing Linux guests try creating one additional huge DMA window
>>>> with 64K or 16MB pages and map the entire guest RAM to it. If this succeeds,
>>>> the guest switches to dma_direct_ops and never calls TCE hypercalls
>>>> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
>>>> and not waste time on map/unmap later. This adds a "dma64_win_addr"
>>>> property which is a bus address for the 64bit window and by default
>>>> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
>>>> uses and this allows having emulated and VFIO devices on the same bus.
>>>>
>>>> This adds 4 RTAS handlers:
>>>> * ibm,query-pe-dma-window
>>>> * ibm,create-pe-dma-window
>>>> * ibm,remove-pe-dma-window
>>>> * ibm,reset-pe-dma-window
>>>> These are registered from type_init() callback.
>>>>
>>>> These RTAS handlers are implemented in a separate file to avoid polluting
>>>> spapr_iommu.c with PCI.
>>>>
>>>> This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>
>>> Looks pretty close to ready.
>>>
>>> There are a handful of nits and one real error noted below.
>>>
>>>> ---
>>>> Changes:
>>>> v17:
>>>> * fixed: "query" did return non-page-shifted value when memory hotplug is enabled
>>>>
>>>> v16:
>>>> * s/dma_liobn/dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS]/
>>>> * s/SPAPR_PCI_LIOBN()/dma_liobn[]/
>>>>
>>>> v15:
>>>> * moved page mask filtering to PHB realize(), use "-mempath" to know
>>>> if there are huge pages
>>>> * fixed error reporting in RTAS handlers
>>>> * max window size accounts now hotpluggable memory boundaries
>>>> ---
>>>>  hw/ppc/Makefile.objs        |   1 +
>>>>  hw/ppc/spapr.c              |   5 +
>>>>  hw/ppc/spapr_pci.c          |  77 +++++++++---
>>>>  hw/ppc/spapr_rtas_ddw.c     | 293 ++++++++++++++++++++++++++++++++++++++++++++
>>>>  include/hw/pci-host/spapr.h |   8 +-
>>>>  include/hw/ppc/spapr.h      |  16 ++-
>>>>  trace-events                |   4 +
>>>>  7 files changed, 383 insertions(+), 21 deletions(-)
>>>>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
>>>>
>>>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>>>> index c1ffc77..986b36f 100644
>>>> --- a/hw/ppc/Makefile.objs
>>>> +++ b/hw/ppc/Makefile.objs
>>>> @@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
>>>>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>>>>  obj-y += spapr_pci_vfio.o
>>>>  endif
>>>> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>>>>  # PowerPC 4xx boards
>>>>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>>>>  obj-y += ppc4xx_pci.o
>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>> index 44e401a..6ddcda9 100644
>>>> --- a/hw/ppc/spapr.c
>>>> +++ b/hw/ppc/spapr.c
>>>> @@ -2366,6 +2366,11 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
>>>>          .driver   = "spapr-vlan", \
>>>>          .property = "use-rx-buffer-pools", \
>>>>          .value    = "off", \
>>>> +    }, \
>>>> +    {\
>>>> +        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
>>>> +        .property = "ddw",\
>>>> +        .value    = stringify(off),\
>>>>      },
>>>>  
>>>>  static void spapr_machine_2_5_instance_options(MachineState *machine)
>>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>>> index 68de523..bcf0360 100644
>>>> --- a/hw/ppc/spapr_pci.c
>>>> +++ b/hw/ppc/spapr_pci.c
>>>> @@ -35,6 +35,7 @@
>>>>  #include "hw/ppc/spapr.h"
>>>>  #include "hw/pci-host/spapr.h"
>>>>  #include "exec/address-spaces.h"
>>>> +#include "exec/ram_addr.h"
>>>>  #include <libfdt.h>
>>>>  #include "trace.h"
>>>>  #include "qemu/error-report.h"
>>>> @@ -45,6 +46,7 @@
>>>>  #include "hw/ppc/spapr_drc.h"
>>>>  #include "sysemu/device_tree.h"
>>>>  #include "sysemu/kvm.h"
>>>> +#include "sysemu/hostmem.h"
>>>>  
>>>>  #include "hw/vfio/vfio.h"
>>>>  
>>>> @@ -1088,7 +1090,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>>>>      int fdt_start_offset = 0, fdt_size;
>>>>  
>>>>      if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
>>>> -        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
>>>> +        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
>>>>  
>>>>          spapr_tce_set_need_vfio(tcet, true);
>>>>      }
>>>
>>> Hang on.. I thought you'd got rid of the need for this explicit
>>> set_need_vfio() stuff.
>>
>>
>> It is in 12/12 (which I'll split in 2 halves when I respin this), I moved
>> it to the end as it is not essential for DDW itself.
> 
> Yes, sorry, I saw that shortly after writing this.
> 
>>>> @@ -1310,11 +1312,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>>>      PCIBus *bus;
>>>>      uint64_t msi_window_size = 4096;
>>>>      sPAPRTCETable *tcet;
>>>> +    const unsigned windows_supported =
>>>> +        sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
>>>>  
>>>>      if (sphb->index != (uint32_t)-1) {
>>>>          hwaddr windows_base;
>>>>  
>>>> -        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn != (uint32_t)-1)
>>>> +        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
>>>> +            || ((sphb->dma_liobn[1] != (uint32_t)-1) && (windows_supported > 1))
>>>>              || (sphb->mem_win_addr != (hwaddr)-1)
>>>>              || (sphb->io_win_addr != (hwaddr)-1)) {
>>>>              error_setg(errp, "Either \"index\" or other parameters must"
>>>> @@ -1329,7 +1334,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>>>          }
>>>>  
>>>>          sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
>>>> -        sphb->dma_liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
>>>> +        for (i = 0; i < windows_supported; ++i) {
>>>> +            sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
>>>> +        }
>>>>  
>>>>          windows_base = SPAPR_PCI_WINDOW_BASE
>>>>              + sphb->index * SPAPR_PCI_WINDOW_SPACING;
>>>> @@ -1342,8 +1349,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>>>          return;
>>>>      }
>>>>  
>>>> -    if (sphb->dma_liobn == (uint32_t)-1) {
>>>> -        error_setg(errp, "LIOBN not specified for PHB");
>>>> +    if ((sphb->dma_liobn[0] == (uint32_t)-1) ||
>>>> +        ((sphb->dma_liobn[1] == (uint32_t)-1) && (windows_supported > 1))) {
>>>> +        error_setg(errp, "LIOBN(s) not specified for PHB");
>>>>          return;
>>>>      }
>>>
>>> Hrm.. there's a bit of false generality here, since this would break
>>> if windows_supported > 2, and dma_liobn[2] was not specified.  Not
>>> urgent for the initial commit though.
>>
>>
>> Is s/windows_supported > 1/windows_supported == 2/ any better here?
> 
> Not really.  Unless you also have a windows_supported > 2 case (which
> could just error / abort / whatever).


It cannot be >2 as SPAPR_PCI_DMA_MAX_WINDOWS is defined as "2" now and it
is quite unlikely to change in the future.

What do I do right now? :)


-- 
Alexey



* Re: [Qemu-devel] [PATCH qemu v17 11/12] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
  2016-06-08  6:09         ` Alexey Kardashevskiy
@ 2016-06-09  4:28           ` David Gibson
  0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2016-06-09  4:28 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Alexander Graf, Alex Williamson

[-- Attachment #1: Type: text/plain, Size: 9243 bytes --]

On Wed, Jun 08, 2016 at 04:09:57PM +1000, Alexey Kardashevskiy wrote:
> On 08/06/16 15:57, David Gibson wrote:
> > On Mon, Jun 06, 2016 at 06:12:58PM +1000, Alexey Kardashevskiy wrote:
> >> On 06/06/16 15:57, David Gibson wrote:
> >>> On Wed, Jun 01, 2016 at 06:57:42PM +1000, Alexey Kardashevskiy wrote:
> >>>> This adds support for the Dynamic DMA Windows (DDW) option defined by
> >>>> the SPAPR specification, which allows additional DMA window(s).
> >>>>
> >>>> The "ddw" property is enabled by default on a PHB but for compatibility
> >>>> the pseries-2.5 machine (TODO: update version) and older disable it.
> >>>
> >>> Looks like your todo is now todone, but you need to update the commit
> >>> message.
> >>>
> >>>> This also creates a single DMA window for the older machines to
> >>>> maintain backward migration.
> >>>>
> >>>> This implements DDW for PHB with emulated and VFIO devices. The host
> >>>> kernel support is required. The advertised IOMMU page sizes are 4K and
> >>>> 64K; 16M pages are supported but not advertised by default, in order to
> >>>> enable them, the user has to specify "pgsz" property for PHB and
> >>>> enable huge pages for RAM.
> >>>>
> >>>> Existing Linux guests try to create one additional huge DMA window
> >>>> with 64K or 16MB pages and map the entire guest RAM to it. If that succeeds,
> >>>> the guest switches to dma_direct_ops and never calls TCE hypercalls
> >>>> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
> >>>> and not waste time on map/unmap later. This adds a "dma64_win_addr"
> >>>> property which is a bus address for the 64bit window and by default
> >>>> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
> >>>> uses and this allows having emulated and VFIO devices on the same bus.
> >>>>
> >>>> This adds 4 RTAS handlers:
> >>>> * ibm,query-pe-dma-window
> >>>> * ibm,create-pe-dma-window
> >>>> * ibm,remove-pe-dma-window
> >>>> * ibm,reset-pe-dma-window
> >>>> These are registered from type_init() callback.
> >>>>
> >>>> These RTAS handlers are implemented in a separate file to avoid polluting
> >>>> spapr_iommu.c with PCI.
> >>>>
> >>>> This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs.
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>
> >>> Looks pretty close to ready.
> >>>
> >>> There are a handful of nits and one real error noted below.
> >>>
> >>>> ---
> >>>> Changes:
> >>>> v17:
> >>>> * fixed: "query" did return non-page-shifted value when memory hotplug is enabled
> >>>>
> >>>> v16:
> >>>> * s/dma_liobn/dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS]/
> >>>> * s/SPAPR_PCI_LIOBN()/dma_liobn[]/
> >>>>
> >>>> v15:
> >>>> * moved page mask filtering to PHB realize(), use "-mempath" to know
> >>>> if there are huge pages
> >>>> * fixed error reporting in RTAS handlers
> >>>> * max window size now accounts for hotpluggable memory boundaries
> >>>> ---
> >>>>  hw/ppc/Makefile.objs        |   1 +
> >>>>  hw/ppc/spapr.c              |   5 +
> >>>>  hw/ppc/spapr_pci.c          |  77 +++++++++---
> >>>>  hw/ppc/spapr_rtas_ddw.c     | 293 ++++++++++++++++++++++++++++++++++++++++++++
> >>>>  include/hw/pci-host/spapr.h |   8 +-
> >>>>  include/hw/ppc/spapr.h      |  16 ++-
> >>>>  trace-events                |   4 +
> >>>>  7 files changed, 383 insertions(+), 21 deletions(-)
> >>>>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
> >>>>
> >>>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> >>>> index c1ffc77..986b36f 100644
> >>>> --- a/hw/ppc/Makefile.objs
> >>>> +++ b/hw/ppc/Makefile.objs
> >>>> @@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
> >>>>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
> >>>>  obj-y += spapr_pci_vfio.o
> >>>>  endif
> >>>> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
> >>>>  # PowerPC 4xx boards
> >>>>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
> >>>>  obj-y += ppc4xx_pci.o
> >>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>>> index 44e401a..6ddcda9 100644
> >>>> --- a/hw/ppc/spapr.c
> >>>> +++ b/hw/ppc/spapr.c
> >>>> @@ -2366,6 +2366,11 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
> >>>>          .driver   = "spapr-vlan", \
> >>>>          .property = "use-rx-buffer-pools", \
> >>>>          .value    = "off", \
> >>>> +    }, \
> >>>> +    {\
> >>>> +        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
> >>>> +        .property = "ddw",\
> >>>> +        .value    = stringify(off),\
> >>>>      },
> >>>>  
> >>>>  static void spapr_machine_2_5_instance_options(MachineState *machine)
> >>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> >>>> index 68de523..bcf0360 100644
> >>>> --- a/hw/ppc/spapr_pci.c
> >>>> +++ b/hw/ppc/spapr_pci.c
> >>>> @@ -35,6 +35,7 @@
> >>>>  #include "hw/ppc/spapr.h"
> >>>>  #include "hw/pci-host/spapr.h"
> >>>>  #include "exec/address-spaces.h"
> >>>> +#include "exec/ram_addr.h"
> >>>>  #include <libfdt.h>
> >>>>  #include "trace.h"
> >>>>  #include "qemu/error-report.h"
> >>>> @@ -45,6 +46,7 @@
> >>>>  #include "hw/ppc/spapr_drc.h"
> >>>>  #include "sysemu/device_tree.h"
> >>>>  #include "sysemu/kvm.h"
> >>>> +#include "sysemu/hostmem.h"
> >>>>  
> >>>>  #include "hw/vfio/vfio.h"
> >>>>  
> >>>> @@ -1088,7 +1090,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
> >>>>      int fdt_start_offset = 0, fdt_size;
> >>>>  
> >>>>      if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
> >>>> -        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
> >>>> +        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
> >>>>  
> >>>>          spapr_tce_set_need_vfio(tcet, true);
> >>>>      }
> >>>
> >>> Hang on.. I thought you'd got rid of the need for this explicit
> >>> set_need_vfio() stuff.
> >>
> >>
> >> It is in 12/12 (which I'll split in 2 halves when I respin this), I moved
> >> it to the end as it is not essential for DDW itself.
> > 
> > Yes, sorry, I saw that shortly after writing this.
> > 
> >>>> @@ -1310,11 +1312,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >>>>      PCIBus *bus;
> >>>>      uint64_t msi_window_size = 4096;
> >>>>      sPAPRTCETable *tcet;
> >>>> +    const unsigned windows_supported =
> >>>> +        sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
> >>>>  
> >>>>      if (sphb->index != (uint32_t)-1) {
> >>>>          hwaddr windows_base;
> >>>>  
> >>>> -        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn != (uint32_t)-1)
> >>>> +        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
> >>>> +            || ((sphb->dma_liobn[1] != (uint32_t)-1) && (windows_supported > 1))
> >>>>              || (sphb->mem_win_addr != (hwaddr)-1)
> >>>>              || (sphb->io_win_addr != (hwaddr)-1)) {
> >>>>              error_setg(errp, "Either \"index\" or other parameters must"
> >>>> @@ -1329,7 +1334,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >>>>          }
> >>>>  
> >>>>          sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
> >>>> -        sphb->dma_liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
> >>>> +        for (i = 0; i < windows_supported; ++i) {
> >>>> +            sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
> >>>> +        }
> >>>>  
> >>>>          windows_base = SPAPR_PCI_WINDOW_BASE
> >>>>              + sphb->index * SPAPR_PCI_WINDOW_SPACING;
> >>>> @@ -1342,8 +1349,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >>>>          return;
> >>>>      }
> >>>>  
> >>>> -    if (sphb->dma_liobn == (uint32_t)-1) {
> >>>> -        error_setg(errp, "LIOBN not specified for PHB");
> >>>> +    if ((sphb->dma_liobn[0] == (uint32_t)-1) ||
> >>>> +        ((sphb->dma_liobn[1] == (uint32_t)-1) && (windows_supported > 1))) {
> >>>> +        error_setg(errp, "LIOBN(s) not specified for PHB");
> >>>>          return;
> >>>>      }
> >>>
> >>> Hrm.. there's a bit of false generality here, since this would break
> >>> if windows_supported > 2, and dma_liobn[2] was not specified.  Not
> >>> urgent for the initial commit though.
> >>
> >>
> >> Is s/windows_supported > 1/windows_supported == 2/ any better here?
> > 
> > Not really.  Unless you also have a windows_supported > 2 case (which
> > could just error / abort / whatever).
> 
> 
> It cannot be >2 as SPAPR_PCI_DMA_MAX_WINDOWS is defined as "2" now and it
> is quite unlikely to change in the future.

Yes, I know, but the value of MAX_WINDOWS isn't obvious if you're just
looking at this code.  My point is that a casual look at this code
suggests it will handle arbitrary windows_supported values, but that's
not actually the case.  This is generally bad practice.

> What do I do right now? :)

Well, as I said it's not urgent for the first commit.  Just leave it
as it is and we'll deal with the consequences later.
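
For illustration only, here is a minimal sketch of a check that would match its apparent generality: it loops over every supported window instead of hard-coding indices 0 and 1, and rejects an out-of-range count outright. The names phb_sketch, MAX_WINDOWS, LIOBN_UNSET and liobns_specified() are hypothetical stand-ins, not the definitions in hw/ppc/spapr_pci.c; only the (uint32_t)-1 "unset" convention and the limit of 2 come from the thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the QEMU definitions discussed in the thread. */
#define MAX_WINDOWS 2                 /* mirrors SPAPR_PCI_DMA_MAX_WINDOWS == 2 */
#define LIOBN_UNSET ((uint32_t)-1)    /* "not specified" marker used by the patch */

struct phb_sketch {
    uint32_t dma_liobn[MAX_WINDOWS];
};

/*
 * Returns true only if every LIOBN in [0, windows_supported) is set.
 * A windows_supported value outside [1, MAX_WINDOWS] is rejected
 * explicitly instead of being silently ignored.
 */
bool liobns_specified(const struct phb_sketch *phb,
                      unsigned windows_supported)
{
    if (windows_supported == 0 || windows_supported > MAX_WINDOWS) {
        return false;
    }
    for (unsigned i = 0; i < windows_supported; ++i) {
        if (phb->dma_liobn[i] == LIOBN_UNSET) {
            return false;
        }
    }
    return true;
}
```

With a shape like this, raising the window limit later would need no change to the check itself, and a bad count fails loudly rather than skipping entries.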

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH qemu v17 06/12] memory: Add reporting of supported page sizes
  2016-06-08  6:05         ` Alexey Kardashevskiy
@ 2016-06-14 21:41           ` Alexey Kardashevskiy
  2016-06-15  6:15           ` David Gibson
  1 sibling, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-06-14 21:41 UTC (permalink / raw)
  To: David Gibson, Paolo Bonzini
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

[-- Attachment #1: Type: text/plain, Size: 1987 bytes --]

On 08/06/16 16:05, Alexey Kardashevskiy wrote:
> On 08/06/16 16:00, David Gibson wrote:
>> On Mon, Jun 06, 2016 at 03:31:04PM +0200, Paolo Bonzini wrote:
>>>
>>>
>>> On 02/06/2016 05:35, David Gibson wrote:
>>>> On Wed, Jun 01, 2016 at 06:57:37PM +1000, Alexey Kardashevskiy wrote:
>>>>>> Every IOMMU has some granularity which MemoryRegionIOMMUOps::translate
>>>>>> uses when translating, however this information is not available outside
>>>>>> the translate context for various checks.
>>>>>>
>>>>>> This adds a get_page_sizes callback to MemoryRegionIOMMUOps and
>>>>>> a wrapper for it so IOMMU users (such as VFIO) can know the actual
>>>>>> page size(s) used by an IOMMU.
>>>>>>
>>>>>> As IOMMU MR represents a guest IOMMU, this uses TARGET_PAGE_SIZE
>>>>>> as fallback.
>>>>>>
>>>>>> This removes vfio_container_granularity() and uses new helper in
>>>>>> memory_region_iommu_replay() when replaying IOMMU mappings on added
>>>>>> IOMMU memory region.
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>>>> Paolo,
>>>>
>>>> Looks like you were left off the CC for this one.
>>>>
>>>> I think this is ready to go - do you want to merge, comment or ack and
>>>> we'll take it either through my tree or Alex's?
>>>
>>> It's okay for you to merge, but the callback should be called
>>> "get_page_size" or "get_replay_granularity".  The plural is weird.
>>
>> Hm, no, it really could return multiple page sizes if the logical
>> (guest side) IOMMU supports them.
> 
> It could but it does not now and I cannot see it coming in near future so I
> am really confused now about the naming and what the callback should return
> - one page size or a mask. What should it be now?


Paolo, David?


> 
>> That might be useful at some point
>> in the future.  For now, it's sufficient for the replay to use the
>> smallest pagesize.
>>
> 
> 


-- 
Alexey



* Re: [Qemu-devel] [PATCH qemu v17 06/12] memory: Add reporting of supported page sizes
  2016-06-08  6:05         ` Alexey Kardashevskiy
  2016-06-14 21:41           ` Alexey Kardashevskiy
@ 2016-06-15  6:15           ` David Gibson
  1 sibling, 0 replies; 38+ messages in thread
From: David Gibson @ 2016-06-15  6:15 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Paolo Bonzini, Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

[-- Attachment #1: Type: text/plain, Size: 2357 bytes --]

On Wed, Jun 08, 2016 at 04:05:48PM +1000, Alexey Kardashevskiy wrote:
> On 08/06/16 16:00, David Gibson wrote:
> > On Mon, Jun 06, 2016 at 03:31:04PM +0200, Paolo Bonzini wrote:
> >>
> >>
> >> On 02/06/2016 05:35, David Gibson wrote:
> >>> On Wed, Jun 01, 2016 at 06:57:37PM +1000, Alexey Kardashevskiy wrote:
> >>>>> Every IOMMU has some granularity which MemoryRegionIOMMUOps::translate
> >>>>> uses when translating, however this information is not available outside
> >>>>> the translate context for various checks.
> >>>>>
> >>>>> This adds a get_page_sizes callback to MemoryRegionIOMMUOps and
> >>>>> a wrapper for it so IOMMU users (such as VFIO) can know the actual
> >>>>> page size(s) used by an IOMMU.
> >>>>>
> >>>>> As IOMMU MR represents a guest IOMMU, this uses TARGET_PAGE_SIZE
> >>>>> as fallback.
> >>>>>
> >>>>> This removes vfio_container_granularity() and uses new helper in
> >>>>> memory_region_iommu_replay() when replaying IOMMU mappings on added
> >>>>> IOMMU memory region.
> >>>>>
> >>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> >>> Paolo,
> >>>
> >>> Looks like you were left off the CC for this one.
> >>>
> >>> I think this is ready to go - do you want to merge, comment or ack and
> >>> we'll take it either through my tree or Alex's?
> >>
> >> It's okay for you to merge, but the callback should be called
> >> "get_page_size" or "get_replay_granularity".  The plural is weird.
> > 
> > Hm, no, it really could return multiple page sizes if the logical
> > (guest side) IOMMU supports them.
> 
> It could but it does not now and I cannot see it coming in near future so I
> am really confused now about the naming and what the callback should return
> - one page size or a mask. What should it be now?

Eh, make it a single page size for now.  We can extend it to a mask if
we ever need it.  But for clarity, best to call it 'get_min_page_size'
or 'get_granularity'.
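
As a rough sketch of why a single value can later be extended to a mask without changing what the replay needs: if a mask had bit N set for each supported page size of 2^N bytes, the smallest page size is simply its lowest set bit. The function name and the mask encoding below are assumptions made for illustration; they are not part of MemoryRegionIOMMUOps.

```c
#include <stdint.h>

/*
 * Hypothetical helper: given a bitmask in which bit N set means
 * "pages of 2^N bytes are supported", return the smallest supported
 * page size in bytes, or 0 if the mask is empty.
 */
uint64_t min_page_size_from_mask(uint64_t page_size_mask)
{
    /* x & -x isolates the lowest set bit; written with ~x + 1 so the
     * negation is plainly an unsigned operation. */
    return page_size_mask & (~page_size_mask + 1);
}
```

For example, a mask covering the 4K, 64K and 16M sizes mentioned for DDW, (1ULL << 12) | (1ULL << 16) | (1ULL << 24), would yield 4096 as the granularity.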

> > That might be useful at some point
> > in the future.  For now, it's sufficient for the replay to use the
> > smallest pagesize.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


Thread overview: 38+ messages
     [not found] <1464771463-37214-1-git-send-email-aik@ozlabs.ru>
2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 01/12] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 02/12] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 03/12] spapr_iommu: Migrate full state Alexey Kardashevskiy
2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 04/12] spapr_iommu: Add root memory region Alexey Kardashevskiy
2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 05/12] spapr_pci: Reset DMA config on PHB reset Alexey Kardashevskiy
2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 06/12] memory: Add reporting of supported page sizes Alexey Kardashevskiy
2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 07/12] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2) Alexey Kardashevskiy
2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 08/12] spapr_pci: Add and export DMA resetting helper Alexey Kardashevskiy
2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 09/12] vfio: Add host side DMA window capabilities Alexey Kardashevskiy
2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 10/12] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2) Alexey Kardashevskiy
2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 11/12] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
2016-06-01  8:57 ` [Qemu-devel] [PATCH qemu v17 12/12] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping listening Alexey Kardashevskiy
     [not found] ` <201606010902.u518wwmb029353@mx0a-001b2d01.pphosted.com>
2016-06-02  3:35   ` [Qemu-devel] [PATCH qemu v17 06/12] memory: Add reporting of supported page sizes David Gibson
2016-06-06 13:31     ` Paolo Bonzini
2016-06-07  3:42       ` Alexey Kardashevskiy
2016-06-08  6:00       ` David Gibson
2016-06-08  6:05         ` Alexey Kardashevskiy
2016-06-14 21:41           ` Alexey Kardashevskiy
2016-06-15  6:15           ` David Gibson
     [not found] ` <201606010900.u518wvH7046287@mx0a-001b2d01.pphosted.com>
2016-06-02  4:18   ` [Qemu-devel] [PATCH qemu v17 07/12] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2) David Gibson
     [not found] ` <201606010902.u518x15j023604@mx0a-001b2d01.pphosted.com>
2016-06-02  4:19   ` [Qemu-devel] [PATCH qemu v17 08/12] spapr_pci: Add and export DMA resetting helper David Gibson
     [not found] ` <201606010901.u518wwEL029369@mx0a-001b2d01.pphosted.com>
2016-06-03  7:23   ` [Qemu-devel] [PATCH qemu v17 09/12] vfio: Add host side DMA window capabilities David Gibson
     [not found] ` <201606011012.u51A9A6i023070@mx0a-001b2d01.pphosted.com>
2016-06-03  7:37   ` [Qemu-devel] [PATCH qemu v17 10/12] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2) David Gibson
2016-06-06  6:45     ` Alexey Kardashevskiy
2016-06-08  5:56       ` David Gibson
     [not found] ` <201606010902.u51902Zl007699@mx0a-001b2d01.pphosted.com>
2016-06-03 15:37   ` [Qemu-devel] [PATCH qemu v17 06/12] memory: Add reporting of supported page sizes Alex Williamson
     [not found] ` <201606010900.u51900Om007391@mx0a-001b2d01.pphosted.com>
2016-06-03 16:13   ` [Qemu-devel] [PATCH qemu v17 07/12] vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2) Alex Williamson
2016-06-06  6:04     ` Alexey Kardashevskiy
2016-06-06 17:20       ` Alex Williamson
2016-06-07  3:10         ` Alexey Kardashevskiy
     [not found] ` <201606010901.u518x843001647@mx0a-001b2d01.pphosted.com>
2016-06-03 16:32   ` [Qemu-devel] [PATCH qemu v17 09/12] vfio: Add host side DMA window capabilities Alex Williamson
     [not found] ` <201606010901.u518x7AQ001537@mx0a-001b2d01.pphosted.com>
2016-06-03 16:50   ` [Qemu-devel] [PATCH qemu v17 10/12] vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2) Alex Williamson
     [not found] ` <201606011013.u51A9ALx023064@mx0a-001b2d01.pphosted.com>
2016-06-03 16:59   ` [Qemu-devel] [PATCH qemu v17 12/12] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping listening Alex Williamson
     [not found] ` <201606010901.u518x1wF023568@mx0a-001b2d01.pphosted.com>
2016-06-06  5:57   ` [Qemu-devel] [PATCH qemu v17 11/12] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) David Gibson
2016-06-06  8:12     ` Alexey Kardashevskiy
2016-06-08  5:57       ` David Gibson
2016-06-08  6:09         ` Alexey Kardashevskiy
2016-06-09  4:28           ` David Gibson
