* [Qemu-devel] [RFC PATCH 00/10] spapr: vfio: Enable Dynamic DMA windows (DDW)
@ 2014-07-31  9:34 Alexey Kardashevskiy
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 01/10] qom: Make object_child_foreach safe for objects removal Alexey Kardashevskiy
                   ` (10 more replies)
  0 siblings, 11 replies; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-31  9:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf

At the moment the sPAPR PHB supports only a single 32bit DMA window,
normally 1..2GB, which is not enough for high performance devices.

The PAPR spec allows creating additional window(s) to support 64bit
DMA and bigger page sizes.

This patchset adds DDW support for pseries. Host kernel changes are
required.

This was tested on a POWER8 system which allows one additional DMA window
which is mapped at 0x800.0000.0000.0000 and supports 16MB pages.
Existing guests check for the DDW capability in the PHB's device tree and,
if it is present, request an additional window, map the entire guest RAM
into it using H_PUT_TCE/... hypercalls once at boot time and switch to
direct DMA operations.

The TCE tables may still be quite big for guests backed with 64K pages but
they are reasonably small for guests backed by 16MB pages.
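
As a ballpark, mapping 64GB of guest RAM takes about a million TCE entries
with 64K pages but only 4096 entries with 16MB pages.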

Please comment. Thanks!




Alexey Kardashevskiy (10):
  qom: Make object_child_foreach safe for objects removal
  spapr_iommu: Disable in-kernel IOMMU tables for >4GB windows
  spapr_pci: Make find_phb()/find_dev() public
  spapr_iommu: Make spapr_tce_find_by_liobn() public
  linux headers update for DDW
  spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls support
  spapr: Add "ddw" machine option
  spapr_pci: Enable DDW
  spapr_pci_vfio: Enable DDW
  vfio: Enable DDW ioctls to VFIO IOMMU driver

 hw/misc/vfio.c              |   4 +
 hw/ppc/Makefile.objs        |   3 +
 hw/ppc/spapr.c              |  15 +++
 hw/ppc/spapr_iommu.c        |   8 +-
 hw/ppc/spapr_pci.c          |  87 +++++++++++--
 hw/ppc/spapr_pci_vfio.c     |  75 +++++++++++
 hw/ppc/spapr_rtas_ddw.c     | 296 ++++++++++++++++++++++++++++++++++++++++++++
 include/hw/pci-host/spapr.h |  27 ++++
 include/hw/ppc/spapr.h      |   7 +-
 linux-headers/linux/vfio.h  |  37 +++++-
 qom/object.c                |   4 +-
 trace-events                |   4 +
 vl.c                        |   4 +
 13 files changed, 552 insertions(+), 19 deletions(-)
 create mode 100644 hw/ppc/spapr_rtas_ddw.c

-- 
2.0.0


* [Qemu-devel] [RFC PATCH 01/10] qom: Make object_child_foreach safe for objects removal
  2014-07-31  9:34 [Qemu-devel] [RFC PATCH 00/10] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
@ 2014-07-31  9:34 ` Alexey Kardashevskiy
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 02/10] spapr_iommu: Disable in-kernel IOMMU tables for >4GB windows Alexey Kardashevskiy
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-31  9:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf

Currently object_child_foreach() uses QTAILQ_FOREACH() to walk
through the children, which makes it impossible to remove a child
from within the callback.

This makes object_child_foreach() use QTAILQ_FOREACH_SAFE() instead.
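
For reference, a minimal standalone sketch of why the _SAFE variant is
needed when the loop body may drop the current element (it builds only
inside the QEMU tree since it uses qemu/queue.h; the list and callback
names here are invented for illustration):

#include <stdio.h>
#include <stdlib.h>
#include "qemu/queue.h"

typedef struct Child {
    int id;
    QTAILQ_ENTRY(Child) node;
} Child;

static QTAILQ_HEAD(, Child) children = QTAILQ_HEAD_INITIALIZER(children);

/* A callback that removes the element it was called for */
static void drop_child(Child *c)
{
    QTAILQ_REMOVE(&children, c, node);
    free(c);
}

static void walk(void)
{
    Child *c, *next;

    /* Plain QTAILQ_FOREACH() would read c->node after drop_child() has
     * freed c; the _SAFE variant caches the next pointer beforehand. */
    QTAILQ_FOREACH_SAFE(c, &children, node, next) {
        if (c->id & 1) {
            drop_child(c);
        }
    }
}

int main(void)
{
    int i;

    for (i = 0; i < 4; i++) {
        Child *c = malloc(sizeof(*c));
        c->id = i;
        QTAILQ_INSERT_TAIL(&children, c, node);
    }
    walk();   /* safely drops children 1 and 3 from inside the loop */
    return 0;
}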

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

This went to Andreas's qom-next tree; it is included here for reference only.
---
 qom/object.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/qom/object.c b/qom/object.c
index 0e8267b..4a814dc 100644
--- a/qom/object.c
+++ b/qom/object.c
@@ -678,10 +678,10 @@ void object_class_foreach(void (*fn)(ObjectClass *klass, void *opaque),
 int object_child_foreach(Object *obj, int (*fn)(Object *child, void *opaque),
                          void *opaque)
 {
-    ObjectProperty *prop;
+    ObjectProperty *prop, *next;
     int ret = 0;
 
-    QTAILQ_FOREACH(prop, &obj->properties, node) {
+    QTAILQ_FOREACH_SAFE(prop, &obj->properties, node, next) {
         if (object_property_is_child(prop)) {
             ret = fn(prop->opaque, opaque);
             if (ret != 0) {
-- 
2.0.0


* [Qemu-devel] [RFC PATCH 02/10] spapr_iommu: Disable in-kernel IOMMU tables for >4GB windows
  2014-07-31  9:34 [Qemu-devel] [RFC PATCH 00/10] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 01/10] qom: Make object_child_foreach safe for objects removal Alexey Kardashevskiy
@ 2014-07-31  9:34 ` Alexey Kardashevskiy
  2014-08-12  1:17   ` David Gibson
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 03/10] spapr_pci: Make find_phb()/find_dev() public Alexey Kardashevskiy
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-31  9:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf

The existing KVM_CREATE_SPAPR_TCE ioctl only supports windows of up to 4GB.
We are going to add support for huge DMA windows, for which the ioctl would
create a window that is too small and the guest would unexpectedly fail
later.

This disables KVM_CREATE_SPAPR_TCE for windows bigger than 4GB. Since such
windows are normally mapped once at boot time, there will be no
performance impact.
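
For example, the default 32bit window of 1..2GB still fits into 32 bits
and keeps its in-kernel accelerated TCE table, while a dynamic window
covering tens of gigabytes of guest RAM does not and falls back to the
TCE table maintained by QEMU.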

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/spapr_iommu.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index f6e32a4..36f5d27 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -113,11 +113,11 @@ static MemoryRegionIOMMUOps spapr_iommu_ops = {
 static int spapr_tce_table_realize(DeviceState *dev)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
+    uint64_t window_size = tcet->nb_table << tcet->page_shift;
 
-    if (kvm_enabled()) {
+    if (kvm_enabled() && !(window_size >> 32)) {
         tcet->table = kvmppc_create_spapr_tce(tcet->liobn,
-                                              tcet->nb_table <<
-                                              tcet->page_shift,
+                                              window_size,
                                               &tcet->fd,
                                               tcet->vfio_accel);
     }
-- 
2.0.0


* [Qemu-devel] [RFC PATCH 03/10] spapr_pci: Make find_phb()/find_dev() public
  2014-07-31  9:34 [Qemu-devel] [RFC PATCH 00/10] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 01/10] qom: Make object_child_foreach safe for objects removal Alexey Kardashevskiy
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 02/10] spapr_iommu: Disable in-kernel IOMMU tables for >4GB windows Alexey Kardashevskiy
@ 2014-07-31  9:34 ` Alexey Kardashevskiy
  2014-08-11 11:39   ` Alexander Graf
  2014-08-12  1:19   ` David Gibson
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 04/10] spapr_iommu: Make spapr_tce_find_by_liobn() public Alexey Kardashevskiy
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-31  9:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf

This makes find_phb()/find_dev() public and renames them
to spapr_pci_find_phb()/spapr_pci_find_dev() as they are going to
be used from other parts of QEMU such as VFIO DDW (dynamic DMA windows),
VFIO PCI error injection or VFIO EEH handling - in all these
cases there are RTAS calls which address a device by BUID+config_addr
in IEEE1275 format.
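
For reference, a minimal standalone sketch of the config_addr decoding
these helpers rely on (the example value is made up):

#include <stdint.h>
#include <stdio.h>

/* Decode the RTAS/IEEE1275 config address cell the same way
 * spapr_pci_find_dev() does: bits 23..16 hold the bus number and
 * bits 15..8 the device/function number; the PHB itself is selected
 * separately by the 64bit BUID passed in two more cells. */
static void decode_config_addr(uint32_t config_addr)
{
    int bus_num = (config_addr >> 16) & 0xFF;
    int devfn = (config_addr >> 8) & 0xFF;

    printf("bus %d, device %d, function %d\n",
           bus_num, devfn >> 3, devfn & 7);
}

int main(void)
{
    decode_config_addr(0x00012800);  /* bus 1, device 5, function 0 */
    return 0;
}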

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/spapr_pci.c          | 22 +++++++++++-----------
 include/hw/pci-host/spapr.h |  4 ++++
 2 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 9ed39a9..230b59c 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -47,7 +47,7 @@
 #define RTAS_TYPE_MSI           1
 #define RTAS_TYPE_MSIX          2
 
-static sPAPRPHBState *find_phb(sPAPREnvironment *spapr, uint64_t buid)
+sPAPRPHBState *spapr_pci_find_phb(sPAPREnvironment *spapr, uint64_t buid)
 {
     sPAPRPHBState *sphb;
 
@@ -61,10 +61,10 @@ static sPAPRPHBState *find_phb(sPAPREnvironment *spapr, uint64_t buid)
     return NULL;
 }
 
-static PCIDevice *find_dev(sPAPREnvironment *spapr, uint64_t buid,
-                           uint32_t config_addr)
+PCIDevice *spapr_pci_find_dev(sPAPREnvironment *spapr, uint64_t buid,
+                              uint32_t config_addr)
 {
-    sPAPRPHBState *sphb = find_phb(spapr, buid);
+    sPAPRPHBState *sphb = spapr_pci_find_phb(spapr, buid);
     PCIHostState *phb = PCI_HOST_BRIDGE(sphb);
     int bus_num = (config_addr >> 16) & 0xFF;
     int devfn = (config_addr >> 8) & 0xFF;
@@ -95,7 +95,7 @@ static void finish_read_pci_config(sPAPREnvironment *spapr, uint64_t buid,
         return;
     }
 
-    pci_dev = find_dev(spapr, buid, addr);
+    pci_dev = spapr_pci_find_dev(spapr, buid, addr);
     addr = rtas_pci_cfgaddr(addr);
 
     if (!pci_dev || (addr % size) || (addr >= pci_config_size(pci_dev))) {
@@ -162,7 +162,7 @@ static void finish_write_pci_config(sPAPREnvironment *spapr, uint64_t buid,
         return;
     }
 
-    pci_dev = find_dev(spapr, buid, addr);
+    pci_dev = spapr_pci_find_dev(spapr, buid, addr);
     addr = rtas_pci_cfgaddr(addr);
 
     if (!pci_dev || (addr % size) || (addr >= pci_config_size(pci_dev))) {
@@ -281,9 +281,9 @@ static void rtas_ibm_change_msi(PowerPCCPU *cpu, sPAPREnvironment *spapr,
     }
 
     /* Fins sPAPRPHBState */
-    phb = find_phb(spapr, buid);
+    phb = spapr_pci_find_phb(spapr, buid);
     if (phb) {
-        pdev = find_dev(spapr, buid, config_addr);
+        pdev = spapr_pci_find_dev(spapr, buid, config_addr);
     }
     if (!phb || !pdev) {
         rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
@@ -377,9 +377,9 @@ static void rtas_ibm_query_interrupt_source_number(PowerPCCPU *cpu,
     spapr_pci_msi *msi;
 
     /* Find sPAPRPHBState */
-    phb = find_phb(spapr, buid);
+    phb = spapr_pci_find_phb(spapr, buid);
     if (phb) {
-        pdev = find_dev(spapr, buid, config_addr);
+        pdev = spapr_pci_find_dev(spapr, buid, config_addr);
     }
     if (!phb || !pdev) {
         rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
@@ -553,7 +553,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    if (find_phb(spapr, sphb->buid)) {
+    if (spapr_pci_find_phb(spapr, sphb->buid)) {
         error_setg(errp, "PCI host bridges must have unique BUIDs");
         return;
     }
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 32f0aa7..14c2ab0 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -122,4 +122,8 @@ void spapr_pci_msi_init(sPAPREnvironment *spapr, hwaddr addr);
 
 void spapr_pci_rtas_init(void);
 
+sPAPRPHBState *spapr_pci_find_phb(sPAPREnvironment *spapr, uint64_t buid);
+PCIDevice *spapr_pci_find_dev(sPAPREnvironment *spapr, uint64_t buid,
+                              uint32_t config_addr);
+
 #endif /* __HW_SPAPR_PCI_H__ */
-- 
2.0.0


* [Qemu-devel] [RFC PATCH 04/10] spapr_iommu: Make spapr_tce_find_by_liobn() public
  2014-07-31  9:34 [Qemu-devel] [RFC PATCH 00/10] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (2 preceding siblings ...)
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 03/10] spapr_pci: Make find_phb()/find_dev() public Alexey Kardashevskiy
@ 2014-07-31  9:34 ` Alexey Kardashevskiy
  2014-08-12  1:19   ` David Gibson
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 05/10] linux headers update for DDW Alexey Kardashevskiy
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-31  9:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf

At the moment spapr_tce_find_by_liobn() is used by the H_PUT_TCE/...
handlers to find an IOMMU (TCE table) by its LIOBN.

We are going to implement Dynamic DMA windows (DDW); the new code
will go to a new file and will use spapr_tce_find_by_liobn()
there too, so let's make it public.
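
One such user is the ibm,remove-pe-dma-window handler added later in this
series: it looks the TCE table up by the LIOBN the guest passes in and
then reaches the owning PHB via the table's QOM parent.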

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/spapr_iommu.c   | 2 +-
 include/hw/ppc/spapr.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 36f5d27..588d442 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -40,7 +40,7 @@ enum sPAPRTCEAccess {
 
 static QLIST_HEAD(spapr_tce_tables, sPAPRTCETable) spapr_tce_tables;
 
-static sPAPRTCETable *spapr_tce_find_by_liobn(uint32_t liobn)
+sPAPRTCETable *spapr_tce_find_by_liobn(uint32_t liobn)
 {
     sPAPRTCETable *tcet;
 
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 36e8e51..c9d6c6c 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -467,6 +467,7 @@ struct sPAPRTCETable {
     QLIST_ENTRY(sPAPRTCETable) list;
 };
 
+sPAPRTCETable *spapr_tce_find_by_liobn(uint32_t liobn);
 void spapr_events_init(sPAPREnvironment *spapr);
 void spapr_events_fdt_skel(void *fdt, uint32_t epow_irq);
 int spapr_h_cas_compose_response(target_ulong addr, target_ulong size);
-- 
2.0.0


* [Qemu-devel] [RFC PATCH 05/10] linux headers update for DDW
  2014-07-31  9:34 [Qemu-devel] [RFC PATCH 00/10] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (3 preceding siblings ...)
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 04/10] spapr_iommu: Make spapr_tce_find_by_liobn() public Alexey Kardashevskiy
@ 2014-07-31  9:34 ` Alexey Kardashevskiy
  2014-08-12  1:20   ` David Gibson
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls support Alexey Kardashevskiy
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-31  9:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 linux-headers/linux/vfio.h | 37 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 26c218e..f0aa97d 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -448,13 +448,48 @@ struct vfio_iommu_type1_dma_unmap {
  */
 struct vfio_iommu_spapr_tce_info {
 	__u32 argsz;
-	__u32 flags;			/* reserved for future use */
+	__u32 flags;
+#define VFIO_IOMMU_SPAPR_TCE_FLAG_DDW	1 /* Support dynamic windows */
 	__u32 dma32_window_start;	/* 32 bit window start (bytes) */
 	__u32 dma32_window_size;	/* 32 bit window size (bytes) */
 };
 
 #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
 
+/*
+ * Dynamic DMA windows
+ */
+struct vfio_iommu_spapr_tce_query {
+	__u32 argsz;
+	/* out */
+	__u32 windows_available;
+	__u32 page_size_mask;
+};
+#define VFIO_IOMMU_SPAPR_TCE_QUERY	_IO(VFIO_TYPE, VFIO_BASE + 17)
+
+struct vfio_iommu_spapr_tce_create {
+	__u32 argsz;
+	/* in */
+	__u32 page_shift;
+	__u32 window_shift;
+	/* out */
+	__u64 start_addr;
+
+};
+#define VFIO_IOMMU_SPAPR_TCE_CREATE	_IO(VFIO_TYPE, VFIO_BASE + 18)
+
+struct vfio_iommu_spapr_tce_remove {
+	__u32 argsz;
+	/* in */
+	__u64 start_addr;
+};
+#define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 19)
+
+struct vfio_iommu_spapr_tce_reset {
+	__u32 argsz;
+};
+#define VFIO_IOMMU_SPAPR_TCE_RESET	_IO(VFIO_TYPE, VFIO_BASE + 20)
+
 /* ***************************************************************** */
 
 #endif /* VFIO_H */
-- 
2.0.0


* [Qemu-devel] [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls support
  2014-07-31  9:34 [Qemu-devel] [RFC PATCH 00/10] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (4 preceding siblings ...)
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 05/10] linux headers update for DDW Alexey Kardashevskiy
@ 2014-07-31  9:34 ` Alexey Kardashevskiy
  2014-08-11 11:51   ` Alexander Graf
  2014-08-12  1:45   ` David Gibson
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 07/10] spapr: Add "ddw" machine option Alexey Kardashevskiy
                   ` (4 subsequent siblings)
  10 siblings, 2 replies; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-31  9:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf

This adds support for the Dynamic DMA Windows (DDW) option defined by
the sPAPR specification which allows having additional DMA window(s)
which can support page sizes other than 4K.

The existing implementation of DDW in the guest tries to create one huge
DMA window with 64K or 16MB pages and map the entire guest RAM into it.
If it succeeds, the guest switches to dma_direct_ops and never calls the
TCE hypercalls (H_PUT_TCE,...) again. This enables VFIO devices to use
the entire RAM and not waste time on map/unmap.

This adds 4 RTAS handlers:
* ibm,query-pe-dma-window
* ibm,create-pe-dma-window
* ibm,remove-pe-dma-window
* ibm,reset-pe-dma-window
These are registered from a type_init() callback.

These RTAS handlers are implemented in a separate file to avoid polluting
spapr_iommu.c with PHB-specific code.
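
For reference, the guest-visible calling convention implemented by the
handlers below is roughly:

  ibm,query-pe-dma-window   in:  config_addr, BUID hi, BUID lo
                            out: status, windows available, largest block
                                 (in 4K TCEs), page size mask, migration mask
  ibm,create-pe-dma-window  in:  config_addr, BUID hi, BUID lo, page_shift,
                                 window_shift
                            out: status, LIOBN, window start hi, window start lo
  ibm,remove-pe-dma-window  in:  LIOBN
                            out: status
  ibm,reset-pe-dma-window   in:  config_addr, BUID hi, BUID lo
                            out: status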

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/Makefile.objs        |   3 +
 hw/ppc/spapr_rtas_ddw.c     | 296 ++++++++++++++++++++++++++++++++++++++++++++
 include/hw/pci-host/spapr.h |  18 +++
 include/hw/ppc/spapr.h      |   6 +-
 trace-events                |   4 +
 5 files changed, 326 insertions(+), 1 deletion(-)
 create mode 100644 hw/ppc/spapr_rtas_ddw.c

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index edd44d0..9773294 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -7,6 +7,9 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o
 ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
 obj-y += spapr_pci_vfio.o
 endif
+ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES), yy)
+obj-y += spapr_rtas_ddw.o
+endif
 # PowerPC 4xx boards
 obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
 obj-y += ppc4xx_pci.o
diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
new file mode 100644
index 0000000..943af2c
--- /dev/null
+++ b/hw/ppc/spapr_rtas_ddw.c
@@ -0,0 +1,296 @@
+/*
+ * QEMU sPAPR Dynamic DMA windows support
+ *
+ * Copyright (c) 2014 Alexey Kardashevskiy, IBM Corporation.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License,
+ *  or (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "hw/ppc/spapr.h"
+#include "hw/pci-host/spapr.h"
+#include "trace.h"
+
+static inline uint32_t spapr_iommu_fixmask(uint32_t cur_mask,
+                                           struct ppc_one_seg_page_size *sps,
+                                           uint32_t query_mask,
+                                           int shift,
+                                           uint32_t add_mask)
+{
+    if ((sps->page_shift == shift) && (query_mask & add_mask)) {
+        cur_mask |= add_mask;
+    }
+    return cur_mask;
+}
+
+static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
+                                         sPAPREnvironment *spapr,
+                                         uint32_t token, uint32_t nargs,
+                                         target_ulong args,
+                                         uint32_t nret, target_ulong rets)
+{
+    CPUPPCState *env = &cpu->env;
+    sPAPRPHBState *sphb;
+    sPAPRPHBClass *spc;
+    uint64_t buid;
+    uint32_t addr, pgmask = 0;
+    uint32_t windows_available = 0, page_size_mask = 0;
+    long ret, i;
+
+    if ((nargs != 3) || (nret != 5)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb) {
+        goto param_error_exit;
+    }
+
+    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
+    if (!spc->ddw_query) {
+        goto hw_error_exit;
+    }
+
+    ret = spc->ddw_query(sphb, &windows_available, &page_size_mask);
+    trace_spapr_iommu_ddw_query(buid, addr, windows_available,
+                                page_size_mask, pgmask, ret);
+    if (ret) {
+        goto hw_error_exit;
+    }
+
+    /* DBG! */
+    if (!(page_size_mask & DDW_PGSIZE_16M)) {
+        goto hw_error_exit;
+    }
+
+    /* Work out biggest possible page size */
+    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
+        int j;
+        struct ppc_one_seg_page_size *sps = &env->sps.sps[i];
+        const struct { int shift; uint32_t mask; } masks[] = {
+            { 12, DDW_PGSIZE_4K },
+            { 16, DDW_PGSIZE_64K },
+            { 24, DDW_PGSIZE_16M },
+            { 25, DDW_PGSIZE_32M },
+            { 26, DDW_PGSIZE_64M },
+            { 27, DDW_PGSIZE_128M },
+            { 28, DDW_PGSIZE_256M },
+            { 34, DDW_PGSIZE_16G },
+        };
+        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
+            pgmask = spapr_iommu_fixmask(pgmask, sps, page_size_mask,
+                                         masks[j].shift, masks[j].mask);
+        }
+    }
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    rtas_st(rets, 1, windows_available);
+    /* Return maximum number as all RAM was 4K pages */
+    rtas_st(rets, 2, ram_size >> SPAPR_TCE_PAGE_SHIFT);
+    rtas_st(rets, 3, pgmask);
+    rtas_st(rets, 4, pgmask); /* DMA migration mask */
+    return;
+
+hw_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
+                                          sPAPREnvironment *spapr,
+                                          uint32_t token, uint32_t nargs,
+                                          target_ulong args,
+                                          uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    sPAPRPHBClass *spc;
+    sPAPRTCETable *tcet = NULL;
+    uint32_t addr, page_shift, window_shift, liobn;
+    uint64_t buid;
+    long ret;
+
+    if ((nargs != 5) || (nret != 4)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb) {
+        goto param_error_exit;
+    }
+
+    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
+    if (!spc->ddw_create) {
+        goto hw_error_exit;
+    }
+
+    page_shift = rtas_ld(args, 3);
+    window_shift = rtas_ld(args, 4);
+    liobn = sphb->dma_liobn + 0x10000;
+
+    ret = spc->ddw_create(sphb, page_shift, window_shift, liobn, &tcet);
+    trace_spapr_iommu_ddw_create(buid, addr, 1 << page_shift,
+                                 1 << window_shift,
+                                 tcet ? tcet->bus_offset : 0xbaadf00d,
+                                 liobn, ret);
+    if (ret || !tcet) {
+        goto hw_error_exit;
+    }
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    rtas_st(rets, 1, liobn);
+    rtas_st(rets, 2, tcet->bus_offset >> 32);
+    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
+    return;
+
+hw_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
+                                          sPAPREnvironment *spapr,
+                                          uint32_t token, uint32_t nargs,
+                                          target_ulong args,
+                                          uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    sPAPRPHBClass *spc;
+    sPAPRTCETable *tcet;
+    uint32_t liobn;
+    long ret;
+
+    if ((nargs != 1) || (nret != 1)) {
+        goto param_error_exit;
+    }
+
+    liobn = rtas_ld(args, 0);
+    tcet = spapr_tce_find_by_liobn(liobn);
+    if (!tcet) {
+        goto param_error_exit;
+    }
+
+    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
+    if (!sphb) {
+        goto param_error_exit;
+    }
+
+    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
+    if (!spc->ddw_remove) {
+        goto hw_error_exit;
+    }
+
+    ret = spc->ddw_remove(sphb, tcet);
+    trace_spapr_iommu_ddw_remove(liobn, ret);
+    if (ret) {
+        goto hw_error_exit;
+    }
+
+    object_unparent(OBJECT(tcet));
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    return;
+
+hw_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static int ddw_remove_tce_table_cb(Object *child, void *opaque)
+{
+    sPAPRTCETable *tcet;
+
+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
+    if (tcet && tcet->bus_offset) {
+        object_unparent(child);
+    }
+
+    return 0;
+}
+
+static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
+                                         sPAPREnvironment *spapr,
+                                         uint32_t token, uint32_t nargs,
+                                         target_ulong args,
+                                         uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    sPAPRPHBClass *spc;
+    uint64_t buid;
+    uint32_t addr;
+    long ret;
+
+    if ((nargs != 3) || (nret != 1)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb) {
+        goto param_error_exit;
+    }
+
+    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
+    if (!spc->ddw_reset) {
+        goto hw_error_exit;
+    }
+
+    ret = spc->ddw_reset(sphb);
+    trace_spapr_iommu_ddw_reset(buid, addr, ret);
+    if (ret) {
+        goto hw_error_exit;
+    }
+
+    object_child_foreach(OBJECT(sphb), ddw_remove_tce_table_cb, NULL);
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    return;
+
+hw_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void spapr_rtas_ddw_init(void)
+{
+    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
+                        "ibm,query-pe-dma-window",
+                        rtas_ibm_query_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
+                        "ibm,create-pe-dma-window",
+                        rtas_ibm_create_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
+                        "ibm,remove-pe-dma-window",
+                        rtas_ibm_remove_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
+                        "ibm,reset-pe-dma-window",
+                        rtas_ibm_reset_pe_dma_window);
+}
+
+type_init(spapr_rtas_ddw_init)
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 14c2ab0..119d326 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -49,6 +49,24 @@ struct sPAPRPHBClass {
     PCIHostBridgeClass parent_class;
 
     void (*finish_realize)(sPAPRPHBState *sphb, Error **errp);
+
+/* sPAPR spec defined pagesize mask values */
+#define DDW_PGSIZE_4K       0x01
+#define DDW_PGSIZE_64K      0x02
+#define DDW_PGSIZE_16M      0x04
+#define DDW_PGSIZE_32M      0x08
+#define DDW_PGSIZE_64M      0x10
+#define DDW_PGSIZE_128M     0x20
+#define DDW_PGSIZE_256M     0x40
+#define DDW_PGSIZE_16G      0x80
+
+    int (*ddw_query)(sPAPRPHBState *sphb, uint32_t *windows_available,
+                     uint32_t *page_size_mask);
+    int (*ddw_create)(sPAPRPHBState *sphb, uint32_t page_shift,
+                      uint32_t window_shift, uint32_t liobn,
+                      sPAPRTCETable **ptcet);
+    int (*ddw_remove)(sPAPRPHBState *sphb, sPAPRTCETable *tcet);
+    int (*ddw_reset)(sPAPRPHBState *sphb);
 };
 
 typedef struct spapr_pci_msi {
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index c9d6c6c..b4bfdda 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -383,8 +383,12 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
 #define RTAS_GET_SENSOR_STATE                   (RTAS_TOKEN_BASE + 0x1D)
 #define RTAS_IBM_CONFIGURE_CONNECTOR            (RTAS_TOKEN_BASE + 0x1E)
 #define RTAS_IBM_OS_TERM                        (RTAS_TOKEN_BASE + 0x1F)
+#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x20)
+#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x21)
+#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x22)
+#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x23)
 
-#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x20)
+#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x24)
 
 /* RTAS ibm,get-system-parameter token values */
 #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
diff --git a/trace-events b/trace-events
index 11a17a8..5b54fbd 100644
--- a/trace-events
+++ b/trace-events
@@ -1213,6 +1213,10 @@ spapr_iommu_indirect(uint64_t liobn, uint64_t ioba, uint64_t tce, uint64_t iobaN
 spapr_iommu_stuff(uint64_t liobn, uint64_t ioba, uint64_t tce_value, uint64_t npages, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcevalue=0x%"PRIx64" npages=%"PRId64" ret=%"PRId64
 spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, unsigned pgsize) "liobn=%"PRIx64" 0x%"PRIx64" -> 0x%"PRIx64" perm=%u mask=%x"
 spapr_iommu_new_table(uint64_t liobn, void *tcet, void *table, int fd) "liobn=%"PRIx64" tcet=%p table=%p fd=%d"
+spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, uint32_t wa, uint32_t pgz, uint32_t pgz_fixed, long ret) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, sizes %"PRIx32", fixed %"PRIx32", ret = %ld"
+spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, unsigned long long pg_size, unsigned long long req_size, uint64_t start, uint32_t liobn, long ret) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%llx, requested=0x%llx, start addr=%"PRIx64", liobn=%"PRIx32", ret = %ld"
+spapr_iommu_ddw_remove(uint32_t liobn, long ret) "liobn=%"PRIx32", ret = %ld"
+spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr, long ret) "buid=%"PRIx64" addr=%"PRIx32", ret = %ld"
 
 # hw/ppc/ppc.c
 ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
-- 
2.0.0


* [Qemu-devel] [RFC PATCH 07/10] spapr: Add "ddw" machine option
  2014-07-31  9:34 [Qemu-devel] [RFC PATCH 00/10] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (5 preceding siblings ...)
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls support Alexey Kardashevskiy
@ 2014-07-31  9:34 ` Alexey Kardashevskiy
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 08/10] spapr_pci: Enable DDW Alexey Kardashevskiy
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-31  9:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf

This adds a new "ddw" machine option to control the presence
of the Dynamic DMA windows (DDW) feature.

This option will be used by pseries to decide whether or not to put
the DDW RTAS tokens into the PHB device tree nodes.

This is not a PHB property because there is no way to change
the emulated PHB properties at start time. Also there is no point
in enabling DDW only for some PHBs because for emulated PHBs
it does not add any noticeable overhead and for VFIO the very first
DDW-capable PHB will pin the entire guest memory and the others won't
change that anyhow.
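
As a usage note: a later patch in this series reads the option with
qemu_opt_get_bool(..., "ddw", true), so DDW advertising is on by default
and can be switched off with e.g. "-machine pseries,ddw=off".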

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/spapr.c | 15 +++++++++++++++
 vl.c           |  4 ++++
 2 files changed, 19 insertions(+)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 364a1e1..192e398 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -101,6 +101,7 @@ struct sPAPRMachineState {
 
     /*< public >*/
     char *kvm_type;
+    bool ddw_supported;
 };
 
 sPAPREnvironment *spapr;
@@ -1633,10 +1634,24 @@ static void spapr_set_kvm_type(Object *obj, const char *value, Error **errp)
     sm->kvm_type = g_strdup(value);
 }
 
+static bool spapr_machine_get_ddw(Object *obj, Error **errp)
+{
+    sPAPRMachineState *sms = SPAPR_MACHINE(obj);
+    return sms->ddw_supported;
+}
+
+static void spapr_machine_set_ddw(Object *obj, bool value, Error **errp)
+{
+    sPAPRMachineState *sms = SPAPR_MACHINE(obj);
+    sms->ddw_supported = value;
+}
+
 static void spapr_machine_initfn(Object *obj)
 {
     object_property_add_str(obj, "kvm-type",
                             spapr_get_kvm_type, spapr_set_kvm_type, NULL);
+    object_property_add_bool(obj, "ddw", spapr_machine_get_ddw,
+                             spapr_machine_set_ddw, NULL);
 }
 
 static void spapr_machine_class_init(ObjectClass *oc, void *data)
diff --git a/vl.c b/vl.c
index fe451aa..e53eaeb 100644
--- a/vl.c
+++ b/vl.c
@@ -383,6 +383,10 @@ static QemuOptsList qemu_machine_opts = {
             .name = "kvm-type",
             .type = QEMU_OPT_STRING,
             .help = "Specifies the KVM virtualization mode (HV, PR)",
+        }, {
+            .name = "ddw",
+            .type = QEMU_OPT_BOOL,
+            .help = "Enable Dynamic DMA windows support (pseries only)",
         },{
             .name = PC_MACHINE_MAX_RAM_BELOW_4G,
             .type = QEMU_OPT_SIZE,
-- 
2.0.0


* [Qemu-devel] [RFC PATCH 08/10] spapr_pci: Enable DDW
  2014-07-31  9:34 [Qemu-devel] [RFC PATCH 00/10] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (6 preceding siblings ...)
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 07/10] spapr: Add "ddw" machine option Alexey Kardashevskiy
@ 2014-07-31  9:34 ` Alexey Kardashevskiy
  2014-08-11 11:59   ` Alexander Graf
  2014-08-12  2:10   ` David Gibson
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: " Alexey Kardashevskiy
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-31  9:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf

This implements DDW for the emulated PHB.

This advertises DDW in the device tree.
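
Specifically, spapr_populate_pci_dt() below adds an "ibm,ddw-applicable"
property carrying the query/create/remove RTAS tokens and, when a reset
callback is available, an "ibm,ddw-extensions" property holding an
extension count of 1 followed by the reset token.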

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

DDW has not been tested here as QEMU does not implement any 64bit-DMA-capable
device and existing Linux guests do not use DDW for 32bit DMA.
---
 hw/ppc/spapr_pci.c          | 65 +++++++++++++++++++++++++++++++++++++++++++++
 include/hw/pci-host/spapr.h |  5 ++++
 2 files changed, 70 insertions(+)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 230b59c..d1f4c86 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -22,6 +22,7 @@
  * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
  * THE SOFTWARE.
  */
+#include "sysemu/sysemu.h"
 #include "hw/hw.h"
 #include "hw/pci/pci.h"
 #include "hw/pci/msi.h"
@@ -650,6 +651,8 @@ static void spapr_phb_finish_realize(sPAPRPHBState *sphb, Error **errp)
     /* Register default 32bit DMA window */
     memory_region_add_subregion(&sphb->iommu_root, 0,
                                 spapr_tce_get_iommu(tcet));
+
+    sphb->ddw_supported = true;
 }
 
 static int spapr_phb_children_reset(Object *child, void *opaque)
@@ -781,6 +784,42 @@ static const char *spapr_phb_root_bus_path(PCIHostState *host_bridge,
     return sphb->dtbusname;
 }
 
+static int spapr_pci_ddw_query(sPAPRPHBState *sphb,
+                               uint32_t *windows_available,
+                               uint32_t *page_size_mask)
+{
+    *windows_available = 1;
+    *page_size_mask = DDW_PGSIZE_16M;
+
+    return 0;
+}
+
+static int spapr_pci_ddw_create(sPAPRPHBState *sphb, uint32_t page_shift,
+                                uint32_t window_shift, uint32_t liobn,
+                                sPAPRTCETable **ptcet)
+{
+    *ptcet = spapr_tce_new_table(DEVICE(sphb), liobn, SPAPR_PCI_TCE64_START,
+                                 page_shift, 1 << (window_shift - page_shift),
+                                 true);
+    if (!*ptcet) {
+        return -1;
+    }
+    memory_region_add_subregion(&sphb->iommu_root, (*ptcet)->bus_offset,
+                                spapr_tce_get_iommu(*ptcet));
+
+    return 0;
+}
+
+static int spapr_pci_ddw_remove(sPAPRPHBState *sphb, sPAPRTCETable *tcet)
+{
+    return 0;
+}
+
+static int spapr_pci_ddw_reset(sPAPRPHBState *sphb)
+{
+    return 0;
+}
+
 static void spapr_phb_class_init(ObjectClass *klass, void *data)
 {
     PCIHostBridgeClass *hc = PCI_HOST_BRIDGE_CLASS(klass);
@@ -795,6 +834,10 @@ static void spapr_phb_class_init(ObjectClass *klass, void *data)
     set_bit(DEVICE_CATEGORY_BRIDGE, dc->categories);
     dc->cannot_instantiate_with_device_add_yet = false;
     spc->finish_realize = spapr_phb_finish_realize;
+    spc->ddw_query = spapr_pci_ddw_query;
+    spc->ddw_create = spapr_pci_ddw_create;
+    spc->ddw_remove = spapr_pci_ddw_remove;
+    spc->ddw_reset = spapr_pci_ddw_reset;
 }
 
 static const TypeInfo spapr_phb_info = {
@@ -878,6 +921,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     uint32_t interrupt_map_mask[] = {
         cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
     uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
+    uint32_t ddw_applicable[] = {
+        RTAS_IBM_QUERY_PE_DMA_WINDOW,
+        RTAS_IBM_CREATE_PE_DMA_WINDOW,
+        RTAS_IBM_REMOVE_PE_DMA_WINDOW
+    };
+    uint32_t ddw_extensions[] = { 1, RTAS_IBM_RESET_PE_DMA_WINDOW };
+    sPAPRPHBClass *spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(phb);
+    QemuOpts *machine_opts = qemu_get_machine_opts();
 
     /* Start populating the FDT */
     sprintf(nodename, "pci@%" PRIx64, phb->buid);
@@ -907,6 +958,20 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
     _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
 
+    /* Dynamic DMA window */
+    if (qemu_opt_get_bool(machine_opts, "ddw", true) &&
+        phb->ddw_supported &&
+        spc->ddw_query && spc->ddw_create && spc->ddw_remove) {
+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
+                         sizeof(ddw_applicable)));
+
+        if (spc->ddw_reset) {
+            /* When enabled, the guest will remove the default 32bit window */
+            _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
+                             &ddw_extensions, sizeof(ddw_extensions)));
+        }
+    }
+
     /* Build the interrupt-map, this must matches what is done
      * in pci_spapr_map_irq
      */
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 119d326..2046356 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -103,6 +103,8 @@ struct sPAPRPHBState {
     int32_t msi_devs_num;
     spapr_pci_msi_mig *msi_devs;
 
+    bool ddw_supported;
+
     QLIST_ENTRY(sPAPRPHBState) list;
 };
 
@@ -125,6 +127,9 @@ struct sPAPRPHBVFIOState {
 
 #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
 
+/* Default 64bit dynamic window offset */
+#define SPAPR_PCI_TCE64_START        0x8000000000000000ULL
+
 static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb, int pin)
 {
     return xics_get_qirq(spapr->icp, phb->lsi_table[pin].irq);
-- 
2.0.0


* [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: Enable DDW
  2014-07-31  9:34 [Qemu-devel] [RFC PATCH 00/10] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (7 preceding siblings ...)
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 08/10] spapr_pci: Enable DDW Alexey Kardashevskiy
@ 2014-07-31  9:34 ` Alexey Kardashevskiy
  2014-08-11 12:02   ` Alexander Graf
  2014-08-12  2:14   ` David Gibson
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 10/10] vfio: Enable DDW ioctls to VFIO IOMMU driver Alexey Kardashevskiy
  2014-08-05  1:30 ` [Qemu-devel] [RFC PATCH 00/10] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
  10 siblings, 2 replies; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-31  9:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf

This implements DDW for VFIO. Host kernel support is required for this.
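
In the callbacks below, ddw_query and ddw_create forward to the host
kernel via vfio_container_ioctl(): the create path first asks the kernel
to create the window (VFIO_IOMMU_SPAPR_TCE_CREATE) and then instantiates
a guest-visible TCE table at the start address the kernel returned, while
remove and reset are thin wrappers around the corresponding ioctls.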

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/spapr_pci_vfio.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
index d3bddf2..dc443e2 100644
--- a/hw/ppc/spapr_pci_vfio.c
+++ b/hw/ppc/spapr_pci_vfio.c
@@ -69,6 +69,77 @@ static void spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
     /* Register default 32bit DMA window */
     memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
                                 spapr_tce_get_iommu(tcet));
+
+    sphb->ddw_supported = !!(info.flags & VFIO_IOMMU_SPAPR_TCE_FLAG_DDW);
+}
+
+static int spapr_pci_vfio_ddw_query(sPAPRPHBState *sphb,
+                                    uint32_t *windows_available,
+                                    uint32_t *page_size_mask)
+{
+    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
+    struct vfio_iommu_spapr_tce_query query = { .argsz = sizeof(query) };
+    int ret;
+
+    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
+                               VFIO_IOMMU_SPAPR_TCE_QUERY, &query);
+    if (ret) {
+        return ret;
+    }
+
+    *windows_available = query.windows_available;
+    *page_size_mask = query.page_size_mask;
+
+    return ret;
+}
+
+static int spapr_pci_vfio_ddw_create(sPAPRPHBState *sphb, uint32_t page_shift,
+                                     uint32_t window_shift, uint32_t liobn,
+                                     sPAPRTCETable **ptcet)
+{
+    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
+    struct vfio_iommu_spapr_tce_create create = {
+        .argsz = sizeof(create),
+        .page_shift = page_shift,
+        .window_shift = window_shift,
+        .start_addr = 0
+    };
+    int ret;
+
+    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
+                               VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
+    if (ret) {
+        return ret;
+    }
+
+    *ptcet = spapr_tce_new_table(DEVICE(sphb), liobn, create.start_addr,
+                                 page_shift, 1 << (window_shift - page_shift),
+                                 true);
+    memory_region_add_subregion(&sphb->iommu_root, (*ptcet)->bus_offset,
+                                spapr_tce_get_iommu(*ptcet));
+
+    return ret;
+}
+
+static int spapr_pci_vfio_ddw_remove(sPAPRPHBState *sphb, sPAPRTCETable *tcet)
+{
+    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
+    struct vfio_iommu_spapr_tce_remove remove = {
+        .argsz = sizeof(remove),
+        .start_addr = tcet->bus_offset
+    };
+
+    return vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
+                                VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
+}
+
+static int spapr_pci_vfio_ddw_reset(sPAPRPHBState *sphb)
+{
+    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
+    struct vfio_iommu_spapr_tce_reset reset = { .argsz = sizeof(reset) };
+
+    return vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
+                                VFIO_IOMMU_SPAPR_TCE_RESET, &reset);
 }
 
 static void spapr_phb_vfio_reset(DeviceState *qdev)
@@ -84,6 +155,10 @@ static void spapr_phb_vfio_class_init(ObjectClass *klass, void *data)
     dc->props = spapr_phb_vfio_properties;
     dc->reset = spapr_phb_vfio_reset;
     spc->finish_realize = spapr_phb_vfio_finish_realize;
+    spc->ddw_query = spapr_pci_vfio_ddw_query;
+    spc->ddw_create = spapr_pci_vfio_ddw_create;
+    spc->ddw_remove = spapr_pci_vfio_ddw_remove;
+    spc->ddw_reset = spapr_pci_vfio_ddw_reset;
 }
 
 static const TypeInfo spapr_phb_vfio_info = {
-- 
2.0.0


* [Qemu-devel] [RFC PATCH 10/10] vfio: Enable DDW ioctls to VFIO IOMMU driver
  2014-07-31  9:34 [Qemu-devel] [RFC PATCH 00/10] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (8 preceding siblings ...)
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: " Alexey Kardashevskiy
@ 2014-07-31  9:34 ` Alexey Kardashevskiy
  2014-08-05  1:30 ` [Qemu-devel] [RFC PATCH 00/10] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
  10 siblings, 0 replies; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-31  9:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf

This enables DDW RTAS-related ioctls in VFIO.
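
For reference, a rough sketch of what these ioctls look like from a plain
userspace VFIO client; this is not part of the series, it assumes the
updated vfio.h from the earlier patch and a container fd that already has
a group attached with the sPAPR TCE IOMMU selected, and it skips the usual
VFIO setup:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* 'container' is an open /dev/vfio/vfio fd with a group attached and the
 * sPAPR TCE IOMMU type already set - the usual VFIO setup is omitted. */
static int query_ddw(int container)
{
    struct vfio_iommu_spapr_tce_query query;

    memset(&query, 0, sizeof(query));
    query.argsz = sizeof(query);
    if (ioctl(container, VFIO_IOMMU_SPAPR_TCE_QUERY, &query)) {
        return -1;
    }
    printf("windows available: %u, page size mask: 0x%x\n",
           query.windows_available, query.page_size_mask);
    return 0;
}

static int create_ddw(int container, uint64_t *start_addr)
{
    struct vfio_iommu_spapr_tce_create create;

    memset(&create, 0, sizeof(create));
    create.argsz = sizeof(create);
    create.page_shift = 24;     /* 16MB pages */
    create.window_shift = 36;   /* 64GB window, an arbitrary example size */
    if (ioctl(container, VFIO_IOMMU_SPAPR_TCE_CREATE, &create)) {
        return -1;
    }
    *start_addr = create.start_addr;  /* where the new window was placed */
    return 0;
}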

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/misc/vfio.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
index 0b9eba0..e7b4d6e 100644
--- a/hw/misc/vfio.c
+++ b/hw/misc/vfio.c
@@ -4437,6 +4437,10 @@ int vfio_container_ioctl(AddressSpace *as, int32_t groupid,
     switch (req) {
     case VFIO_CHECK_EXTENSION:
     case VFIO_IOMMU_SPAPR_TCE_GET_INFO:
+    case VFIO_IOMMU_SPAPR_TCE_QUERY:
+    case VFIO_IOMMU_SPAPR_TCE_CREATE:
+    case VFIO_IOMMU_SPAPR_TCE_REMOVE:
+    case VFIO_IOMMU_SPAPR_TCE_RESET:
         break;
     default:
         /* Return an error on unknown requests */
-- 
2.0.0


* Re: [Qemu-devel] [RFC PATCH 00/10] spapr: vfio: Enable Dynamic DMA windows (DDW)
  2014-07-31  9:34 [Qemu-devel] [RFC PATCH 00/10] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (9 preceding siblings ...)
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 10/10] vfio: Enable DDW ioctls to VFIO IOMMU driver Alexey Kardashevskiy
@ 2014-08-05  1:30 ` Alexey Kardashevskiy
  2014-08-10 23:50   ` Alexey Kardashevskiy
  10 siblings, 1 reply; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-08-05  1:30 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alex Williamson, qemu-ppc, Alexander Graf

On 07/31/2014 07:34 PM, Alexey Kardashevskiy wrote:
> At the moment the sPAPR PHB supports only a single 32bit DMA window,
> normally 1..2GB, which is not enough for high performance devices.
> 
> The PAPR spec allows creating additional window(s) to support 64bit
> DMA and bigger page sizes.
> 
> This patchset adds DDW support for pseries. Host kernel changes are
> required.
> 
> This was tested on a POWER8 system which allows one additional DMA window
> which is mapped at 0x800.0000.0000.0000 and supports 16MB pages.
> Existing guests check for the DDW capability in the PHB's device tree and,
> if it is present, request an additional window, map the entire guest RAM
> into it using H_PUT_TCE/... hypercalls once at boot time and switch to
> direct DMA operations.
> 
> The TCE tables may still be quite big for guests backed with 64K pages but
> they are reasonably small for guests backed by 16MB pages.
> 
> Please comment. Thanks!


Alexander Graf, ping!



> Alexey Kardashevskiy (10):
>   qom: Make object_child_foreach safe for objects removal
>   spapr_iommu: Disable in-kernel IOMMU tables for >4GB windows
>   spapr_pci: Make find_phb()/find_dev() public
>   spapr_iommu: Make spapr_tce_find_by_liobn() public
>   linux headers update for DDW
>   spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls support
>   spapr: Add "ddw" machine option
>   spapr_pci: Enable DDW
>   spapr_pci_vfio: Enable DDW
>   vfio: Enable DDW ioctls to VFIO IOMMU driver
> 
>  hw/misc/vfio.c              |   4 +
>  hw/ppc/Makefile.objs        |   3 +
>  hw/ppc/spapr.c              |  15 +++
>  hw/ppc/spapr_iommu.c        |   8 +-
>  hw/ppc/spapr_pci.c          |  87 +++++++++++--
>  hw/ppc/spapr_pci_vfio.c     |  75 +++++++++++
>  hw/ppc/spapr_rtas_ddw.c     | 296 ++++++++++++++++++++++++++++++++++++++++++++
>  include/hw/pci-host/spapr.h |  27 ++++
>  include/hw/ppc/spapr.h      |   7 +-
>  linux-headers/linux/vfio.h  |  37 +++++-
>  qom/object.c                |   4 +-
>  trace-events                |   4 +
>  vl.c                        |   4 +
>  13 files changed, 552 insertions(+), 19 deletions(-)
>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
> 


-- 
Alexey


* Re: [Qemu-devel] [RFC PATCH 00/10] spapr: vfio: Enable Dynamic DMA windows (DDW)
  2014-08-05  1:30 ` [Qemu-devel] [RFC PATCH 00/10] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
@ 2014-08-10 23:50   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-08-10 23:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alex Williamson, qemu-ppc, Alexander Graf

On 08/05/2014 11:30 AM, Alexey Kardashevskiy wrote:
> On 07/31/2014 07:34 PM, Alexey Kardashevskiy wrote:
>> At the moment the sPAPR PHB supports only a single 32bit DMA window,
>> normally 1..2GB, which is not enough for high performance devices.
>>
>> The PAPR spec allows creating additional window(s) to support 64bit
>> DMA and bigger page sizes.
>>
>> This patchset adds DDW support for pseries. Host kernel changes are
>> required.
>>
>> This was tested on a POWER8 system which allows one additional DMA window
>> which is mapped at 0x800.0000.0000.0000 and supports 16MB pages.
>> Existing guests check for the DDW capability in the PHB's device tree and,
>> if it is present, request an additional window, map the entire guest RAM
>> into it using H_PUT_TCE/... hypercalls once at boot time and switch to
>> direct DMA operations.
>>
>> The TCE tables may still be quite big for guests backed with 64K pages but
>> they are reasonably small for guests backed by 16MB pages.
>>
>> Please comment. Thanks!
> 
> 
> Alexander Graf, ping!


Ping?


> 
> 
> 
>> Alexey Kardashevskiy (10):
>>   qom: Make object_child_foreach safe for objects removal
>>   spapr_iommu: Disable in-kernel IOMMU tables for >4GB windows
>>   spapr_pci: Make find_phb()/find_dev() public
>>   spapr_iommu: Make spapr_tce_find_by_liobn() public
>>   linux headers update for DDW
>>   spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls support
>>   spapr: Add "ddw" machine option
>>   spapr_pci: Enable DDW
>>   spapr_pci_vfio: Enable DDW
>>   vfio: Enable DDW ioctls to VFIO IOMMU driver
>>
>>  hw/misc/vfio.c              |   4 +
>>  hw/ppc/Makefile.objs        |   3 +
>>  hw/ppc/spapr.c              |  15 +++
>>  hw/ppc/spapr_iommu.c        |   8 +-
>>  hw/ppc/spapr_pci.c          |  87 +++++++++++--
>>  hw/ppc/spapr_pci_vfio.c     |  75 +++++++++++
>>  hw/ppc/spapr_rtas_ddw.c     | 296 ++++++++++++++++++++++++++++++++++++++++++++
>>  include/hw/pci-host/spapr.h |  27 ++++
>>  include/hw/ppc/spapr.h      |   7 +-
>>  linux-headers/linux/vfio.h  |  37 +++++-
>>  qom/object.c                |   4 +-
>>  trace-events                |   4 +
>>  vl.c                        |   4 +
>>  13 files changed, 552 insertions(+), 19 deletions(-)
>>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
>>
> 
> 


-- 
Alexey


* Re: [Qemu-devel] [RFC PATCH 03/10] spapr_pci: Make find_phb()/find_dev() public
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 03/10] spapr_pci: Make find_phb()/find_dev() public Alexey Kardashevskiy
@ 2014-08-11 11:39   ` Alexander Graf
  2014-08-11 14:56     ` Alexey Kardashevskiy
  2014-08-12  1:19   ` David Gibson
  1 sibling, 1 reply; 55+ messages in thread
From: Alexander Graf @ 2014-08-11 11:39 UTC (permalink / raw)
  To: Alexey Kardashevskiy, qemu-devel; +Cc: Alex Williamson, qemu-ppc


On 31.07.14 11:34, Alexey Kardashevskiy wrote:
> This makes find_phb()/find_dev() public and renames them
> to spapr_pci_find_phb()/spapr_pci_find_dev() as they are going to
> be used from other parts of QEMU such as VFIO DDW (dynamic DMA windows),
> VFIO PCI error injection or VFIO EEH handling - in all these
> cases there are RTAS calls which address a device by BUID+config_addr
> in IEEE1275 format.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Is there any particular reason these RTAS calls can't get handled inside 
of spapr_pci.c? After all, if they work on PCI granularity, they are 
semantically bound to the PCI PHB emulation.


Alex


* Re: [Qemu-devel] [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls support
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls support Alexey Kardashevskiy
@ 2014-08-11 11:51   ` Alexander Graf
  2014-08-11 15:34     ` Alexey Kardashevskiy
  2014-08-12  1:45   ` David Gibson
  1 sibling, 1 reply; 55+ messages in thread
From: Alexander Graf @ 2014-08-11 11:51 UTC (permalink / raw)
  To: Alexey Kardashevskiy, qemu-devel; +Cc: Alex Williamson, qemu-ppc


On 31.07.14 11:34, Alexey Kardashevskiy wrote:
> This adds support for the Dynamic DMA Windows (DDW) option defined by
> the sPAPR specification which allows having additional DMA window(s)
> which can support page sizes other than 4K.
>
> The existing implementation of DDW in the guest tries to create one huge
> DMA window with 64K or 16MB pages and map the entire guest RAM into it.
> If it succeeds, the guest switches to dma_direct_ops and never calls the
> TCE hypercalls (H_PUT_TCE,...) again. This enables VFIO devices to use
> the entire RAM and not waste time on map/unmap.
>
> This adds 4 RTAS handlers:
> * ibm,query-pe-dma-window
> * ibm,create-pe-dma-window
> * ibm,remove-pe-dma-window
> * ibm,reset-pe-dma-window
> These are registered from a type_init() callback.
>
> These RTAS handlers are implemented in a separate file to avoid polluting
> spapr_iommu.c with PHB-specific code.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>   hw/ppc/Makefile.objs        |   3 +
>   hw/ppc/spapr_rtas_ddw.c     | 296 ++++++++++++++++++++++++++++++++++++++++++++
>   include/hw/pci-host/spapr.h |  18 +++
>   include/hw/ppc/spapr.h      |   6 +-
>   trace-events                |   4 +
>   5 files changed, 326 insertions(+), 1 deletion(-)
>   create mode 100644 hw/ppc/spapr_rtas_ddw.c
>
> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> index edd44d0..9773294 100644
> --- a/hw/ppc/Makefile.objs
> +++ b/hw/ppc/Makefile.objs
> @@ -7,6 +7,9 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o
>   ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>   obj-y += spapr_pci_vfio.o
>   endif
> +ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES), yy)
> +obj-y += spapr_rtas_ddw.o
> +endif
>   # PowerPC 4xx boards
>   obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>   obj-y += ppc4xx_pci.o
> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
> new file mode 100644
> index 0000000..943af2c
> --- /dev/null
> +++ b/hw/ppc/spapr_rtas_ddw.c
> @@ -0,0 +1,296 @@
> +/*
> + * QEMU sPAPR Dynamic DMA windows support
> + *
> + * Copyright (c) 2014 Alexey Kardashevskiy, IBM Corporation.
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License,
> + *  or (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "hw/ppc/spapr.h"
> +#include "hw/pci-host/spapr.h"
> +#include "trace.h"
> +
> +static inline uint32_t spapr_iommu_fixmask(uint32_t cur_mask,
> +                                           struct ppc_one_seg_page_size *sps,
> +                                           uint32_t query_mask,
> +                                           int shift,
> +                                           uint32_t add_mask)
> +{
> +    if ((sps->page_shift == shift) && (query_mask & add_mask)) {
> +        cur_mask |= add_mask;
> +    }
> +    return cur_mask;
> +}
> +
> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
> +                                         sPAPREnvironment *spapr,
> +                                         uint32_t token, uint32_t nargs,
> +                                         target_ulong args,
> +                                         uint32_t nret, target_ulong rets)
> +{
> +    CPUPPCState *env = &cpu->env;
> +    sPAPRPHBState *sphb;
> +    sPAPRPHBClass *spc;
> +    uint64_t buid;
> +    uint32_t addr, pgmask = 0;
> +    uint32_t windows_available = 0, page_size_mask = 0;
> +    long ret, i;
> +
> +    if ((nargs != 3) || (nret != 5)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb) {
> +        goto param_error_exit;
> +    }
> +
> +    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
> +    if (!spc->ddw_query) {
> +        goto hw_error_exit;
> +    }
> +
> +    ret = spc->ddw_query(sphb, &windows_available, &page_size_mask);
> +    trace_spapr_iommu_ddw_query(buid, addr, windows_available,
> +                                page_size_mask, pgmask, ret);
> +    if (ret) {
> +        goto hw_error_exit;
> +    }
> +
> +    /* DBG! */
> +    if (!(page_size_mask & DDW_PGSIZE_16M)) {
> +        goto hw_error_exit;
> +    }
> +
> +    /* Work out biggest possible page size */
> +    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
> +        int j;
> +        struct ppc_one_seg_page_size *sps = &env->sps.sps[i];
> +        const struct { int shift; uint32_t mask; } masks[] = {
> +            { 12, DDW_PGSIZE_4K },
> +            { 16, DDW_PGSIZE_64K },
> +            { 24, DDW_PGSIZE_16M },
> +            { 25, DDW_PGSIZE_32M },
> +            { 26, DDW_PGSIZE_64M },
> +            { 27, DDW_PGSIZE_128M },
> +            { 28, DDW_PGSIZE_256M },
> +            { 34, DDW_PGSIZE_16G },
> +        };
> +        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
> +            pgmask = spapr_iommu_fixmask(pgmask, sps, page_size_mask,
> +                                         masks[j].shift, masks[j].mask);
> +        }
> +    }
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    rtas_st(rets, 1, windows_available);
> +    /* Return maximum number as all RAM was 4K pages */
> +    rtas_st(rets, 2, ram_size >> SPAPR_TCE_PAGE_SHIFT);
> +    rtas_st(rets, 3, pgmask);
> +    rtas_st(rets, 4, pgmask); /* DMA migration mask */
> +    return;
> +
> +hw_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
> +                                          sPAPREnvironment *spapr,
> +                                          uint32_t token, uint32_t nargs,
> +                                          target_ulong args,
> +                                          uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    sPAPRPHBClass *spc;
> +    sPAPRTCETable *tcet = NULL;
> +    uint32_t addr, page_shift, window_shift, liobn;
> +    uint64_t buid;
> +    long ret;
> +
> +    if ((nargs != 5) || (nret != 4)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb) {
> +        goto param_error_exit;
> +    }
> +
> +    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
> +    if (!spc->ddw_create) {
> +        goto hw_error_exit;
> +    }
> +
> +    page_shift = rtas_ld(args, 3);
> +    window_shift = rtas_ld(args, 4);
> +    liobn = sphb->dma_liobn + 0x10000;

What offset is this?

> +
> +    ret = spc->ddw_create(sphb, page_shift, window_shift, liobn, &tcet);
> +    trace_spapr_iommu_ddw_create(buid, addr, 1 << page_shift,
> +                                 1 << window_shift,

1ULL? Otherwise 16G pages (and windows) won't work.
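To spell the concern out with a quick illustration (not code from the patch):

    uint32_t window_shift = 34;            /* a 16G window */
    uint64_t bad  = 1 << window_shift;     /* undefined behaviour: the shift
                                              is done on a 32bit int */
    uint64_t good = 1ULL << window_shift;  /* 0x400000000, i.e. 16G */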


Alex

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 08/10] spapr_pci: Enable DDW
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 08/10] spapr_pci: Enable DDW Alexey Kardashevskiy
@ 2014-08-11 11:59   ` Alexander Graf
  2014-08-11 15:26     ` Alexey Kardashevskiy
  2014-08-12  2:10   ` David Gibson
  1 sibling, 1 reply; 55+ messages in thread
From: Alexander Graf @ 2014-08-11 11:59 UTC (permalink / raw)
  To: Alexey Kardashevskiy, qemu-devel; +Cc: Alex Williamson, qemu-ppc


On 31.07.14 11:34, Alexey Kardashevskiy wrote:
> This implements DDW for emulated PHB.
>
> This advertises DDW in device tree.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>
> The DDW has not been tested as QEMU does not implement any 64bit DMA capable
> device and existing linux guests do not use DDW for 32bit DMA.

Can't you just add the pci config space bit for it to the e1000 
emulation? That one should be pretty safe, no?

> ---
>   hw/ppc/spapr_pci.c          | 65 +++++++++++++++++++++++++++++++++++++++++++++
>   include/hw/pci-host/spapr.h |  5 ++++
>   2 files changed, 70 insertions(+)
>
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 230b59c..d1f4c86 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -22,6 +22,7 @@
>    * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>    * THE SOFTWARE.
>    */
> +#include "sysemu/sysemu.h"
>   #include "hw/hw.h"
>   #include "hw/pci/pci.h"
>   #include "hw/pci/msi.h"
> @@ -650,6 +651,8 @@ static void spapr_phb_finish_realize(sPAPRPHBState *sphb, Error **errp)
>       /* Register default 32bit DMA window */
>       memory_region_add_subregion(&sphb->iommu_root, 0,
>                                   spapr_tce_get_iommu(tcet));
> +
> +    sphb->ddw_supported = true;

Unconditionally?

Also, can't you make the ddw enable/disable flow go set-only? Basically 
have the flag in the machine struct if you must, but then on every PHB 
instantiation you set a QOM property that sets ddw_supported accordingly?

Also keep in mind that we will have to at least disable ddw by default 
for existing machine types to maintain backwards compatibility.

>   }
>   
>   static int spapr_phb_children_reset(Object *child, void *opaque)
> @@ -781,6 +784,42 @@ static const char *spapr_phb_root_bus_path(PCIHostState *host_bridge,
>       return sphb->dtbusname;
>   }
>   
> +static int spapr_pci_ddw_query(sPAPRPHBState *sphb,
> +                               uint32_t *windows_available,
> +                               uint32_t *page_size_mask)
> +{
> +    *windows_available = 1;
> +    *page_size_mask = DDW_PGSIZE_16M;
> +
> +    return 0;
> +}
> +
> +static int spapr_pci_ddw_create(sPAPRPHBState *sphb, uint32_t page_shift,
> +                                uint32_t window_shift, uint32_t liobn,
> +                                sPAPRTCETable **ptcet)
> +{
> +    *ptcet = spapr_tce_new_table(DEVICE(sphb), liobn, SPAPR_PCI_TCE64_START,
> +                                 page_shift, 1 << (window_shift - page_shift),
> +                                 true);
> +    if (!*ptcet) {
> +        return -1;
> +    }
> +    memory_region_add_subregion(&sphb->iommu_root, (*ptcet)->bus_offset,
> +                                spapr_tce_get_iommu(*ptcet));
> +
> +    return 0;
> +}
> +
> +static int spapr_pci_ddw_remove(sPAPRPHBState *sphb, sPAPRTCETable *tcet)
> +{
> +    return 0;
> +}
> +
> +static int spapr_pci_ddw_reset(sPAPRPHBState *sphb)
> +{
> +    return 0;
> +}
> +
>   static void spapr_phb_class_init(ObjectClass *klass, void *data)
>   {
>       PCIHostBridgeClass *hc = PCI_HOST_BRIDGE_CLASS(klass);
> @@ -795,6 +834,10 @@ static void spapr_phb_class_init(ObjectClass *klass, void *data)
>       set_bit(DEVICE_CATEGORY_BRIDGE, dc->categories);
>       dc->cannot_instantiate_with_device_add_yet = false;
>       spc->finish_realize = spapr_phb_finish_realize;
> +    spc->ddw_query = spapr_pci_ddw_query;
> +    spc->ddw_create = spapr_pci_ddw_create;
> +    spc->ddw_remove = spapr_pci_ddw_remove;
> +    spc->ddw_reset = spapr_pci_ddw_reset;
>   }
>   
>   static const TypeInfo spapr_phb_info = {
> @@ -878,6 +921,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>       uint32_t interrupt_map_mask[] = {
>           cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
>       uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
> +    uint32_t ddw_applicable[] = {
> +        RTAS_IBM_QUERY_PE_DMA_WINDOW,
> +        RTAS_IBM_CREATE_PE_DMA_WINDOW,
> +        RTAS_IBM_REMOVE_PE_DMA_WINDOW
> +    };
> +    uint32_t ddw_extensions[] = { 1, RTAS_IBM_RESET_PE_DMA_WINDOW };
> +    sPAPRPHBClass *spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(phb);
> +    QemuOpts *machine_opts = qemu_get_machine_opts();
>   
>       /* Start populating the FDT */
>       sprintf(nodename, "pci@%" PRIx64, phb->buid);
> @@ -907,6 +958,20 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>       _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
>       _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
>   
> +    /* Dynamic DMA window */
> +    if (qemu_opt_get_bool(machine_opts, "ddw", true) &&
> +        phb->ddw_supported &&

Yeah, just rename this to ddw_enabled and expose it via QOM. Make it 
impossible to set to true for PHBs that don't support ddw.


Alex

> +        spc->ddw_query && spc->ddw_create && spc->ddw_remove) {
> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
> +                         sizeof(ddw_applicable)));
> +
> +        if (spc->ddw_reset) {
> +            /* When enabled, the guest will remove the default 32bit window */
> +            _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
> +                             &ddw_extensions, sizeof(ddw_extensions)));
> +        }
> +    }
> +
>       /* Build the interrupt-map, this must matches what is done
>        * in pci_spapr_map_irq
>        */
> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> index 119d326..2046356 100644
> --- a/include/hw/pci-host/spapr.h
> +++ b/include/hw/pci-host/spapr.h
> @@ -103,6 +103,8 @@ struct sPAPRPHBState {
>       int32_t msi_devs_num;
>       spapr_pci_msi_mig *msi_devs;
>   
> +    bool ddw_supported;
> +
>       QLIST_ENTRY(sPAPRPHBState) list;
>   };
>   
> @@ -125,6 +127,9 @@ struct sPAPRPHBVFIOState {
>   
>   #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
>   
> +/* Default 64bit dynamic window offset */
> +#define SPAPR_PCI_TCE64_START        0x8000000000000000ULL
> +
>   static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb, int pin)
>   {
>       return xics_get_qirq(spapr->icp, phb->lsi_table[pin].irq);

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: Enable DDW
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: " Alexey Kardashevskiy
@ 2014-08-11 12:02   ` Alexander Graf
  2014-08-11 15:01     ` Alexey Kardashevskiy
  2014-08-12  2:14   ` David Gibson
  1 sibling, 1 reply; 55+ messages in thread
From: Alexander Graf @ 2014-08-11 12:02 UTC (permalink / raw)
  To: Alexey Kardashevskiy, qemu-devel; +Cc: Alex Williamson, qemu-ppc


On 31.07.14 11:34, Alexey Kardashevskiy wrote:
> This implements DDW for VFIO. Host kernel support is required for this.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>   hw/ppc/spapr_pci_vfio.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 75 insertions(+)
>
> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
> index d3bddf2..dc443e2 100644
> --- a/hw/ppc/spapr_pci_vfio.c
> +++ b/hw/ppc/spapr_pci_vfio.c
> @@ -69,6 +69,77 @@ static void spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
>       /* Register default 32bit DMA window */
>       memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
>                                   spapr_tce_get_iommu(tcet));
> +
> +    sphb->ddw_supported = !!(info.flags & VFIO_IOMMU_SPAPR_TCE_FLAG_DDW);
> +}
> +
> +static int spapr_pci_vfio_ddw_query(sPAPRPHBState *sphb,
> +                                    uint32_t *windows_available,
> +                                    uint32_t *page_size_mask)
> +{
> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
> +    struct vfio_iommu_spapr_tce_query query = { .argsz = sizeof(query) };
> +    int ret;
> +
> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
> +                               VFIO_IOMMU_SPAPR_TCE_QUERY, &query);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    *windows_available = query.windows_available;
> +    *page_size_mask = query.page_size_mask;
> +
> +    return ret;
> +}
> +
> +static int spapr_pci_vfio_ddw_create(sPAPRPHBState *sphb, uint32_t page_shift,
> +                                     uint32_t window_shift, uint32_t liobn,
> +                                     sPAPRTCETable **ptcet)
> +{
> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
> +    struct vfio_iommu_spapr_tce_create create = {
> +        .argsz = sizeof(create),
> +        .page_shift = page_shift,
> +        .window_shift = window_shift,
> +        .start_addr = 0
> +    };
> +    int ret;
> +
> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
> +                               VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    *ptcet = spapr_tce_new_table(DEVICE(sphb), liobn, create.start_addr,
> +                                 page_shift, 1 << (window_shift - page_shift),

I spot a 1 without ULL again - this time it might work out ok, but 
please just always use ULL when you pass around addresses.

Please walk me through the abstraction levels on what honoring each page 
size means. If I use THP, what page size granularity can I use for 
TCE entries?


Alex

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 03/10] spapr_pci: Make find_phb()/find_dev() public
  2014-08-11 11:39   ` Alexander Graf
@ 2014-08-11 14:56     ` Alexey Kardashevskiy
  2014-08-11 17:16       ` Alexander Graf
  0 siblings, 1 reply; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-08-11 14:56 UTC (permalink / raw)
  To: Alexander Graf, qemu-devel; +Cc: Alex Williamson, qemu-ppc

On 08/11/2014 09:39 PM, Alexander Graf wrote:
> 
> On 31.07.14 11:34, Alexey Kardashevskiy wrote:
>> This makes find_phb()/find_dev() public and changed its names
>> to spapr_pci_find_phb()/spapr_pci_find_dev() as they are going to
>> be used from other parts of QEMU such as VFIO DDW (dynamic DMA window)
>> or VFIO PCI error injection or VFIO EEH handling - in all these
>> cases there are RTAS calls which are addressed to BUID+config_addr
>> in IEEE1275 format.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> 
> Is there any particular reason these RTAS calls can't get handled inside of
> spapr_pci.c? After all, if they work on PCI granularity, they are
> semantically bound to the PCI PHB emulation.

Creation - yes, addressed to PHB BUID. Deletion - no, addressed to LIOBN
which has nothing to do with PCI.



-- 
Alexey

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: Enable DDW
  2014-08-11 12:02   ` Alexander Graf
@ 2014-08-11 15:01     ` Alexey Kardashevskiy
  2014-08-11 17:30       ` Alexander Graf
  0 siblings, 1 reply; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-08-11 15:01 UTC (permalink / raw)
  To: Alexander Graf, qemu-devel; +Cc: Alex Williamson, qemu-ppc

On 08/11/2014 10:02 PM, Alexander Graf wrote:
> 
> On 31.07.14 11:34, Alexey Kardashevskiy wrote:
>> This implements DDW for VFIO. Host kernel support is required for this.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>   hw/ppc/spapr_pci_vfio.c | 75
>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 75 insertions(+)
>>
>> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
>> index d3bddf2..dc443e2 100644
>> --- a/hw/ppc/spapr_pci_vfio.c
>> +++ b/hw/ppc/spapr_pci_vfio.c
>> @@ -69,6 +69,77 @@ static void
>> spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
>>       /* Register default 32bit DMA window */
>>       memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
>>                                   spapr_tce_get_iommu(tcet));
>> +
>> +    sphb->ddw_supported = !!(info.flags & VFIO_IOMMU_SPAPR_TCE_FLAG_DDW);
>> +}
>> +
>> +static int spapr_pci_vfio_ddw_query(sPAPRPHBState *sphb,
>> +                                    uint32_t *windows_available,
>> +                                    uint32_t *page_size_mask)
>> +{
>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>> +    struct vfio_iommu_spapr_tce_query query = { .argsz = sizeof(query) };
>> +    int ret;
>> +
>> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
>> +                               VFIO_IOMMU_SPAPR_TCE_QUERY, &query);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    *windows_available = query.windows_available;
>> +    *page_size_mask = query.page_size_mask;
>> +
>> +    return ret;
>> +}
>> +
>> +static int spapr_pci_vfio_ddw_create(sPAPRPHBState *sphb, uint32_t
>> page_shift,
>> +                                     uint32_t window_shift, uint32_t liobn,
>> +                                     sPAPRTCETable **ptcet)
>> +{
>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>> +    struct vfio_iommu_spapr_tce_create create = {
>> +        .argsz = sizeof(create),
>> +        .page_shift = page_shift,
>> +        .window_shift = window_shift,
>> +        .start_addr = 0
>> +    };
>> +    int ret;
>> +
>> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
>> +                               VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    *ptcet = spapr_tce_new_table(DEVICE(sphb), liobn, create.start_addr,
>> +                                 page_shift, 1 << (window_shift -
>> page_shift),
> 
> I spot a 1 without ULL again - this time it might work out ok, but please
> just always use ULL when you pass around addresses.

My bad. I keep forgetting this; I'll adjust my own checkpatch.py :)


> 
> Please walk me though the abstraction levels on what each page size
> honoration means. If I use THP, what page size granularity can I use for
> TCE entries?


[RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls support

+        const struct { int shift; uint32_t mask; } masks[] = {
+            { 12, DDW_PGSIZE_4K },
+            { 16, DDW_PGSIZE_64K },
+            { 24, DDW_PGSIZE_16M },
+            { 25, DDW_PGSIZE_32M },
+            { 26, DDW_PGSIZE_64M },
+            { 27, DDW_PGSIZE_128M },
+            { 28, DDW_PGSIZE_256M },
+            { 34, DDW_PGSIZE_16G },
+        };


Supported page sizes are returned by the host kernel via "query". For 16MB
pages, the returned page size mask will be
DDW_PGSIZE_4K|DDW_PGSIZE_64K|DDW_PGSIZE_16M.
Or did I not understand the question...



-- 
Alexey

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 08/10] spapr_pci: Enable DDW
  2014-08-11 11:59   ` Alexander Graf
@ 2014-08-11 15:26     ` Alexey Kardashevskiy
  2014-08-11 17:29       ` Alexander Graf
  0 siblings, 1 reply; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-08-11 15:26 UTC (permalink / raw)
  To: Alexander Graf, qemu-devel; +Cc: Alex Williamson, qemu-ppc

On 08/11/2014 09:59 PM, Alexander Graf wrote:
> 
> On 31.07.14 11:34, Alexey Kardashevskiy wrote:
>> This implements DDW for emulated PHB.
>>
>> This advertises DDW in device tree.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>
>> The DDW has not been tested as QEMU does not implement any 64bit DMA capable
>> device and existing linux guests do not use DDW for 32bit DMA.
> 
> Can't you just add the pci config space bit for it to the e1000 emulation?

Sorry, I am not following you here. What bit in config space can enable
64bit DMA?

I tried patching the guest driver; that did not work, so I did not dig further.

> That one should be pretty safe, no?
> 
>> ---
>>   hw/ppc/spapr_pci.c          | 65
>> +++++++++++++++++++++++++++++++++++++++++++++
>>   include/hw/pci-host/spapr.h |  5 ++++
>>   2 files changed, 70 insertions(+)
>>
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index 230b59c..d1f4c86 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -22,6 +22,7 @@
>>    * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
>> DEALINGS IN
>>    * THE SOFTWARE.
>>    */
>> +#include "sysemu/sysemu.h"
>>   #include "hw/hw.h"
>>   #include "hw/pci/pci.h"
>>   #include "hw/pci/msi.h"
>> @@ -650,6 +651,8 @@ static void spapr_phb_finish_realize(sPAPRPHBState
>> *sphb, Error **errp)
>>       /* Register default 32bit DMA window */
>>       memory_region_add_subregion(&sphb->iommu_root, 0,
>>                                   spapr_tce_get_iommu(tcet));
>> +
>> +    sphb->ddw_supported = true;
> 
> Unconditionally?


Yes. Why not? I cannot think of any case when we would not want this. In
practice there is very little chance it will ever be used anyway :) There
is still a machine option to disable it completely.


> Also, can't you make the ddw enable/disable flow go set-only? Basically
> have the flag in the machine struct if you must, but then on every PHB
> instantiation you set a QOM property that sets ddw_supported respectively?

Uff. Very confusing review comments today :)

For VFIO, ddw_supported comes from the host kernel and totally depends on
hardware.

For emulated, there is just one emulated PHB (yes, there can be many, but
no one seems to be using more in reality) and what you suggest seems too
complicated.

This DDW thing is not really dynamic in the way the existing Linux guest
uses it. At boot time the guest driver looks at the DMA mask and, only if
it is >32bit, creates DDW once; after that the window remains active while
the guest is running.
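Roughly, that boot-time flow looks like this (pseudocode only; the helper
names are made up, not the actual guest driver functions):

    if (dma_mask_of(dev) > DMA_BIT_MASK(32) && ddw_capable(dev)) {
        create_ddw(dev);            /* ibm,create-pe-dma-window, once */
        map_all_guest_ram(dev);     /* H_PUT_TCE/... for the entire RAM */
        use_direct_dma_ops(dev);    /* no more TCE hypercalls after this */
    }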


> Also keep in mind that we will have to at least disable ddw by default for
> existing machine types to maintain backwards compatibility.


Where exactly does defaulting to "on" break compatibility?



>>   }
>>     static int spapr_phb_children_reset(Object *child, void *opaque)
>> @@ -781,6 +784,42 @@ static const char
>> *spapr_phb_root_bus_path(PCIHostState *host_bridge,
>>       return sphb->dtbusname;
>>   }
>>   +static int spapr_pci_ddw_query(sPAPRPHBState *sphb,
>> +                               uint32_t *windows_available,
>> +                               uint32_t *page_size_mask)
>> +{
>> +    *windows_available = 1;
>> +    *page_size_mask = DDW_PGSIZE_16M;
>> +
>> +    return 0;
>> +}
>> +
>> +static int spapr_pci_ddw_create(sPAPRPHBState *sphb, uint32_t page_shift,
>> +                                uint32_t window_shift, uint32_t liobn,
>> +                                sPAPRTCETable **ptcet)
>> +{
>> +    *ptcet = spapr_tce_new_table(DEVICE(sphb), liobn,
>> SPAPR_PCI_TCE64_START,
>> +                                 page_shift, 1 << (window_shift -
>> page_shift),
>> +                                 true);
>> +    if (!*ptcet) {
>> +        return -1;
>> +    }
>> +    memory_region_add_subregion(&sphb->iommu_root, (*ptcet)->bus_offset,
>> +                                spapr_tce_get_iommu(*ptcet));
>> +
>> +    return 0;
>> +}
>> +
>> +static int spapr_pci_ddw_remove(sPAPRPHBState *sphb, sPAPRTCETable *tcet)
>> +{
>> +    return 0;
>> +}
>> +
>> +static int spapr_pci_ddw_reset(sPAPRPHBState *sphb)
>> +{
>> +    return 0;
>> +}
>> +
>>   static void spapr_phb_class_init(ObjectClass *klass, void *data)
>>   {
>>       PCIHostBridgeClass *hc = PCI_HOST_BRIDGE_CLASS(klass);
>> @@ -795,6 +834,10 @@ static void spapr_phb_class_init(ObjectClass *klass,
>> void *data)
>>       set_bit(DEVICE_CATEGORY_BRIDGE, dc->categories);
>>       dc->cannot_instantiate_with_device_add_yet = false;
>>       spc->finish_realize = spapr_phb_finish_realize;
>> +    spc->ddw_query = spapr_pci_ddw_query;
>> +    spc->ddw_create = spapr_pci_ddw_create;
>> +    spc->ddw_remove = spapr_pci_ddw_remove;
>> +    spc->ddw_reset = spapr_pci_ddw_reset;
>>   }
>>     static const TypeInfo spapr_phb_info = {
>> @@ -878,6 +921,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>       uint32_t interrupt_map_mask[] = {
>>           cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
>>       uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
>> +    uint32_t ddw_applicable[] = {
>> +        RTAS_IBM_QUERY_PE_DMA_WINDOW,
>> +        RTAS_IBM_CREATE_PE_DMA_WINDOW,
>> +        RTAS_IBM_REMOVE_PE_DMA_WINDOW
>> +    };
>> +    uint32_t ddw_extensions[] = { 1, RTAS_IBM_RESET_PE_DMA_WINDOW };
>> +    sPAPRPHBClass *spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(phb);
>> +    QemuOpts *machine_opts = qemu_get_machine_opts();
>>         /* Start populating the FDT */
>>       sprintf(nodename, "pci@%" PRIx64, phb->buid);
>> @@ -907,6 +958,20 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>       _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type",
>> 0x1));
>>       _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
>>   +    /* Dynamic DMA window */
>> +    if (qemu_opt_get_bool(machine_opts, "ddw", true) &&
>> +        phb->ddw_supported &&
> 
> Yeah, just rename this to ddw_enabled and expose it via QOM. Make it
> unsettable to true for PHBs that don't support ddw.
> 
> 
> Alex
> 
>> +        spc->ddw_query && spc->ddw_create && spc->ddw_remove) {
>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable",
>> &ddw_applicable,
>> +                         sizeof(ddw_applicable)));
>> +
>> +        if (spc->ddw_reset) {
>> +            /* When enabled, the guest will remove the default 32bit
>> window */
>> +            _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
>> +                             &ddw_extensions, sizeof(ddw_extensions)));
>> +        }
>> +    }
>> +
>>       /* Build the interrupt-map, this must matches what is done
>>        * in pci_spapr_map_irq
>>        */
>> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
>> index 119d326..2046356 100644
>> --- a/include/hw/pci-host/spapr.h
>> +++ b/include/hw/pci-host/spapr.h
>> @@ -103,6 +103,8 @@ struct sPAPRPHBState {
>>       int32_t msi_devs_num;
>>       spapr_pci_msi_mig *msi_devs;
>>   +    bool ddw_supported;
>> +
>>       QLIST_ENTRY(sPAPRPHBState) list;
>>   };
>>   @@ -125,6 +127,9 @@ struct sPAPRPHBVFIOState {
>>     #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
>>   +/* Default 64bit dynamic window offset */
>> +#define SPAPR_PCI_TCE64_START        0x8000000000000000ULL
>> +
>>   static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb,
>> int pin)
>>   {
>>       return xics_get_qirq(spapr->icp, phb->lsi_table[pin].irq);
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls support
  2014-08-11 11:51   ` Alexander Graf
@ 2014-08-11 15:34     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-08-11 15:34 UTC (permalink / raw)
  To: Alexander Graf, qemu-devel; +Cc: Alex Williamson, qemu-ppc

On 08/11/2014 09:51 PM, Alexander Graf wrote:
> 
> On 31.07.14 11:34, Alexey Kardashevskiy wrote:
>> This adds support for Dynamic DMA Windows (DDW) option defined by
>> the SPAPR specification which allows to have additional DMA window(s)
>> which can support page sizes other than 4K.
>>
>> The existing implementation of DDW in the guest tries to create one huge
>> DMA window with 64K or 16MB pages and map the entire guest RAM to. If it
>> succeeds, the guest switches to dma_direct_ops and never calls
>> TCE hypercalls (H_PUT_TCE,...) again. This enables VFIO devices to use
>> the entire RAM and not waste time on map/unmap.
>>
>> This adds 4 RTAS handlers:
>> * ibm,query-pe-dma-window
>> * ibm,create-pe-dma-window
>> * ibm,remove-pe-dma-window
>> * ibm,reset-pe-dma-window
>> These are registered from type_init() callback.
>>
>> These RTAS handlers are implemented in a separate file to avoid polluting
>> spapr_iommu.c with PHB.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>   hw/ppc/Makefile.objs        |   3 +
>>   hw/ppc/spapr_rtas_ddw.c     | 296
>> ++++++++++++++++++++++++++++++++++++++++++++
>>   include/hw/pci-host/spapr.h |  18 +++
>>   include/hw/ppc/spapr.h      |   6 +-
>>   trace-events                |   4 +
>>   5 files changed, 326 insertions(+), 1 deletion(-)
>>   create mode 100644 hw/ppc/spapr_rtas_ddw.c
>>
>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>> index edd44d0..9773294 100644
>> --- a/hw/ppc/Makefile.objs
>> +++ b/hw/ppc/Makefile.objs
>> @@ -7,6 +7,9 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o
>>   ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>>   obj-y += spapr_pci_vfio.o
>>   endif
>> +ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES), yy)
>> +obj-y += spapr_rtas_ddw.o
>> +endif
>>   # PowerPC 4xx boards
>>   obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>>   obj-y += ppc4xx_pci.o
>> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
>> new file mode 100644
>> index 0000000..943af2c
>> --- /dev/null
>> +++ b/hw/ppc/spapr_rtas_ddw.c
>> @@ -0,0 +1,296 @@
>> +/*
>> + * QEMU sPAPR Dynamic DMA windows support
>> + *
>> + * Copyright (c) 2014 Alexey Kardashevskiy, IBM Corporation.
>> + *
>> + *  This program is free software; you can redistribute it and/or modify
>> + *  it under the terms of the GNU General Public License as published by
>> + *  the Free Software Foundation; either version 2 of the License,
>> + *  or (at your option) any later version.
>> + *
>> + *  This program is distributed in the hope that it will be useful,
>> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + *  GNU General Public License for more details.
>> + *
>> + *  You should have received a copy of the GNU General Public License
>> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#include "hw/ppc/spapr.h"
>> +#include "hw/pci-host/spapr.h"
>> +#include "trace.h"
>> +
>> +static inline uint32_t spapr_iommu_fixmask(uint32_t cur_mask,
>> +                                           struct ppc_one_seg_page_size
>> *sps,
>> +                                           uint32_t query_mask,
>> +                                           int shift,
>> +                                           uint32_t add_mask)
>> +{
>> +    if ((sps->page_shift == shift) && (query_mask & add_mask)) {
>> +        cur_mask |= add_mask;
>> +    }
>> +    return cur_mask;
>> +}
>> +
>> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
>> +                                         sPAPREnvironment *spapr,
>> +                                         uint32_t token, uint32_t nargs,
>> +                                         target_ulong args,
>> +                                         uint32_t nret, target_ulong rets)
>> +{
>> +    CPUPPCState *env = &cpu->env;
>> +    sPAPRPHBState *sphb;
>> +    sPAPRPHBClass *spc;
>> +    uint64_t buid;
>> +    uint32_t addr, pgmask = 0;
>> +    uint32_t windows_available = 0, page_size_mask = 0;
>> +    long ret, i;
>> +
>> +    if ((nargs != 3) || (nret != 5)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
>> +    if (!spc->ddw_query) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    ret = spc->ddw_query(sphb, &windows_available, &page_size_mask);
>> +    trace_spapr_iommu_ddw_query(buid, addr, windows_available,
>> +                                page_size_mask, pgmask, ret);
>> +    if (ret) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    /* DBG! */
>> +    if (!(page_size_mask & DDW_PGSIZE_16M)) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    /* Work out biggest possible page size */
>> +    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
>> +        int j;
>> +        struct ppc_one_seg_page_size *sps = &env->sps.sps[i];
>> +        const struct { int shift; uint32_t mask; } masks[] = {
>> +            { 12, DDW_PGSIZE_4K },
>> +            { 16, DDW_PGSIZE_64K },
>> +            { 24, DDW_PGSIZE_16M },
>> +            { 25, DDW_PGSIZE_32M },
>> +            { 26, DDW_PGSIZE_64M },
>> +            { 27, DDW_PGSIZE_128M },
>> +            { 28, DDW_PGSIZE_256M },
>> +            { 34, DDW_PGSIZE_16G },
>> +        };
>> +        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
>> +            pgmask = spapr_iommu_fixmask(pgmask, sps, page_size_mask,
>> +                                         masks[j].shift, masks[j].mask);
>> +        }
>> +    }
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +    rtas_st(rets, 1, windows_available);
>> +    /* Return maximum number as all RAM was 4K pages */
>> +    rtas_st(rets, 2, ram_size >> SPAPR_TCE_PAGE_SHIFT);
>> +    rtas_st(rets, 3, pgmask);
>> +    rtas_st(rets, 4, pgmask); /* DMA migration mask */
>> +    return;
>> +
>> +hw_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
>> +                                          sPAPREnvironment *spapr,
>> +                                          uint32_t token, uint32_t nargs,
>> +                                          target_ulong args,
>> +                                          uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    sPAPRPHBClass *spc;
>> +    sPAPRTCETable *tcet = NULL;
>> +    uint32_t addr, page_shift, window_shift, liobn;
>> +    uint64_t buid;
>> +    long ret;
>> +
>> +    if ((nargs != 5) || (nret != 4)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
>> +    if (!spc->ddw_create) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    page_shift = rtas_ld(args, 3);
>> +    window_shift = rtas_ld(args, 4);
>> +    liobn = sphb->dma_liobn + 0x10000;
> 
> What offset is this?

Some new LIOBN. Can be +1. May be worth defining as a macro.
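Something along these lines, perhaps (just a sketch; the macro name is made
up):

    /* Give the DDW LIOBN offset a name instead of a magic number */
    #define SPAPR_DDW_LIOBN_OFFSET 0x10000

        liobn = sphb->dma_liobn + SPAPR_DDW_LIOBN_OFFSET;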


> 
>> +
>> +    ret = spc->ddw_create(sphb, page_shift, window_shift, liobn, &tcet);
>> +    trace_spapr_iommu_ddw_create(buid, addr, 1 << page_shift,
>> +                                 1 << window_shift,
> 
> 1ULL? Otherwise 16G pages (and windows) won't work.


Right. Thanks. I'll fix. 16_G_ _pages_ are not supported anyway though.



-- 
Alexey

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 03/10] spapr_pci: Make find_phb()/find_dev() public
  2014-08-11 14:56     ` Alexey Kardashevskiy
@ 2014-08-11 17:16       ` Alexander Graf
  0 siblings, 0 replies; 55+ messages in thread
From: Alexander Graf @ 2014-08-11 17:16 UTC (permalink / raw)
  To: Alexey Kardashevskiy, qemu-devel; +Cc: Alex Williamson, qemu-ppc


On 11.08.14 16:56, Alexey Kardashevskiy wrote:
> On 08/11/2014 09:39 PM, Alexander Graf wrote:
>> On 31.07.14 11:34, Alexey Kardashevskiy wrote:
>>> This makes find_phb()/find_dev() public and changed its names
>>> to spapr_pci_find_phb()/spapr_pci_find_dev() as they are going to
>>> be used from other parts of QEMU such as VFIO DDW (dynamic DMA window)
>>> or VFIO PCI error injection or VFIO EEH handling - in all these
>>> cases there are RTAS calls which are addressed to BUID+config_addr
>>> in IEEE1275 format.
>>>
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> Is there any particular reason these RTAS calls can't get handled inside of
>> spapr_pci.c? After all, if they work on PCI granularity, they are
>> semantically bound to the PCI PHB emulation.
> Creation - yes, addressed to PHB BUID. Deletion - no, addressed to LIOBN
> which has nothing to do with PCI.

Well, if there's no cleaner cut I guess the way you split it right now 
into a separate file is ok.


Alex

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 08/10] spapr_pci: Enable DDW
  2014-08-11 15:26     ` Alexey Kardashevskiy
@ 2014-08-11 17:29       ` Alexander Graf
  2014-08-12  0:13         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 55+ messages in thread
From: Alexander Graf @ 2014-08-11 17:29 UTC (permalink / raw)
  To: Alexey Kardashevskiy, qemu-devel; +Cc: Alex Williamson, qemu-ppc


On 11.08.14 17:26, Alexey Kardashevskiy wrote:
> On 08/11/2014 09:59 PM, Alexander Graf wrote:
>> On 31.07.14 11:34, Alexey Kardashevskiy wrote:
>>> This implements DDW for emulated PHB.
>>>
>>> This advertises DDW in device tree.
>>>
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>> ---
>>>
>>> The DDW has not been tested as QEMU does not implement any 64bit DMA capable
>>> device and existing linux guests do not use DDW for 32bit DMA.
>> Can't you just add the pci config space bit for it to the e1000 emulation?
> Sorry, I am not following you here. What bit in config space can enable
> 64bit DMA?

Apparently there's nothing at all required. The igb driver simply tries 
to use 64bit DMA masks.
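i.e. the probe path just asks for a 64bit mask and falls back to 32bit,
roughly like this (paraphrased, not the exact igb code):

    err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
    if (err) {
        /* no 64bit DMA available, fall back to 32bit */
        err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
    }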

>
> I tried patching the guest driver, that did not work so I did not dig further.

Which driver did you try it with?


>
>> That one should be pretty safe, no?
>>
>>> ---
>>>    hw/ppc/spapr_pci.c          | 65
>>> +++++++++++++++++++++++++++++++++++++++++++++
>>>    include/hw/pci-host/spapr.h |  5 ++++
>>>    2 files changed, 70 insertions(+)
>>>
>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>> index 230b59c..d1f4c86 100644
>>> --- a/hw/ppc/spapr_pci.c
>>> +++ b/hw/ppc/spapr_pci.c
>>> @@ -22,6 +22,7 @@
>>>     * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
>>> DEALINGS IN
>>>     * THE SOFTWARE.
>>>     */
>>> +#include "sysemu/sysemu.h"
>>>    #include "hw/hw.h"
>>>    #include "hw/pci/pci.h"
>>>    #include "hw/pci/msi.h"
>>> @@ -650,6 +651,8 @@ static void spapr_phb_finish_realize(sPAPRPHBState
>>> *sphb, Error **errp)
>>>        /* Register default 32bit DMA window */
>>>        memory_region_add_subregion(&sphb->iommu_root, 0,
>>>                                    spapr_tce_get_iommu(tcet));
>>> +
>>> +    sphb->ddw_supported = true;
>> Unconditionally?
>
> Yes. Why not? I cannot think of any case when we would not want this. In
> practice there is very little chance it will ever be used anyway :) There
> is still a machine option to disable it completely.
>
>
>> Also, can't you make the ddw enable/disable flow go set-only? Basically
>> have the flag in the machine struct if you must, but then on every PHB
>> instantiation you set a QOM property that sets ddw_supported respectively?
> Uff. Very confusing review comments today :)
>
> For VFIO, ddw_supported comes from the host kernel and totally depends on
> hardware.
>
> For emulated, there is just one emulated PHB (yes, can be many but noone
> seems to be using more in reality) and what you suggest seems to be too
> complicated.
>
> This DDW thing - it is not really dynamic in the way it is used by the
> existing linux guest. At the boot time the guest driver looks at DMA mask
> and only if it is >32bit, it creates DDW, once, and after that the windows
> remains active while the guest is running.

What I'm asking is that rather than having

   if (machine->ddw_enabled && phb->ddw_supported)

you instead only have

   if (phb->ddw_enabled)

which the machine sets to true if machine->ddw_enabled. If you make it a
QOM property you can control the setter, so at the point where the machine
wants to set it you can ignore a set-to-true if your vfio implementation
doesn't support ddw, leaving ddw_enabled as false.
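A minimal sketch of what that setter could look like (illustrative only,
assuming a ddw_enabled field in sPAPRPHBState; this is not code from the
series):

    static void spapr_phb_set_ddw_enabled(Object *obj, bool value,
                                          Error **errp)
    {
        sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(obj);

        /* Refuse to enable DDW when the backend (e.g. the VFIO host kernel)
         * cannot support it, so callers only ever test ddw_enabled. */
        if (value && !sphb->ddw_supported) {
            return;
        }
        sphb->ddw_enabled = value;
    }

    static bool spapr_phb_get_ddw_enabled(Object *obj, Error **errp)
    {
        return SPAPR_PCI_HOST_BRIDGE(obj)->ddw_enabled;
    }

The property would then be registered from the PHB's instance_init (e.g.
via object_property_add_bool()), and the machine would simply set it on
every PHB it creates when machine->ddw_enabled is true.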

>
>
>> Also keep in mind that we will have to at least disable ddw by default for
>> existing machine types to maintain backwards compatibility.
>
> Where exactly does the default setting "on" break in compatibility?

Different device tree? Different return values on rtas calls? These are 
guest-visible changes, so in theory we would have to make sure we don't 
change any of them.

Of course we can always consciously declare them unimportant enough that 
in reality they shouldn't have side effects we care about for hot and live 
migration, but there would have to be good reasoning for why we shouldn't 
have it disabled, rather than for why we should keep backwards 
compatibility.


Alex

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: Enable DDW
  2014-08-11 15:01     ` Alexey Kardashevskiy
@ 2014-08-11 17:30       ` Alexander Graf
  2014-08-12  0:03         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 55+ messages in thread
From: Alexander Graf @ 2014-08-11 17:30 UTC (permalink / raw)
  To: Alexey Kardashevskiy, qemu-devel; +Cc: Alex Williamson, qemu-ppc


On 11.08.14 17:01, Alexey Kardashevskiy wrote:
> On 08/11/2014 10:02 PM, Alexander Graf wrote:
>> On 31.07.14 11:34, Alexey Kardashevskiy wrote:
>>> This implements DDW for VFIO. Host kernel support is required for this.
>>>
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>> ---
>>>    hw/ppc/spapr_pci_vfio.c | 75
>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>    1 file changed, 75 insertions(+)
>>>
>>> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
>>> index d3bddf2..dc443e2 100644
>>> --- a/hw/ppc/spapr_pci_vfio.c
>>> +++ b/hw/ppc/spapr_pci_vfio.c
>>> @@ -69,6 +69,77 @@ static void
>>> spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
>>>        /* Register default 32bit DMA window */
>>>        memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
>>>                                    spapr_tce_get_iommu(tcet));
>>> +
>>> +    sphb->ddw_supported = !!(info.flags & VFIO_IOMMU_SPAPR_TCE_FLAG_DDW);
>>> +}
>>> +
>>> +static int spapr_pci_vfio_ddw_query(sPAPRPHBState *sphb,
>>> +                                    uint32_t *windows_available,
>>> +                                    uint32_t *page_size_mask)
>>> +{
>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>> +    struct vfio_iommu_spapr_tce_query query = { .argsz = sizeof(query) };
>>> +    int ret;
>>> +
>>> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
>>> +                               VFIO_IOMMU_SPAPR_TCE_QUERY, &query);
>>> +    if (ret) {
>>> +        return ret;
>>> +    }
>>> +
>>> +    *windows_available = query.windows_available;
>>> +    *page_size_mask = query.page_size_mask;
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static int spapr_pci_vfio_ddw_create(sPAPRPHBState *sphb, uint32_t
>>> page_shift,
>>> +                                     uint32_t window_shift, uint32_t liobn,
>>> +                                     sPAPRTCETable **ptcet)
>>> +{
>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>> +    struct vfio_iommu_spapr_tce_create create = {
>>> +        .argsz = sizeof(create),
>>> +        .page_shift = page_shift,
>>> +        .window_shift = window_shift,
>>> +        .start_addr = 0
>>> +    };
>>> +    int ret;
>>> +
>>> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
>>> +                               VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>>> +    if (ret) {
>>> +        return ret;
>>> +    }
>>> +
>>> +    *ptcet = spapr_tce_new_table(DEVICE(sphb), liobn, create.start_addr,
>>> +                                 page_shift, 1 << (window_shift -
>>> page_shift),
>> I spot a 1 without ULL again - this time it might work out ok, but please
>> just always use ULL when you pass around addresses.
> My bad. I keep forgetting this, I'll adjust my own checkpatch.py :)
>
>
>> Please walk me though the abstraction levels on what each page size
>> honoration means. If I use THP, what page size granularity can I use for
>> TCE entries?
>
> [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls support
>
> +        const struct { int shift; uint32_t mask; } masks[] = {
> +            { 12, DDW_PGSIZE_4K },
> +            { 16, DDW_PGSIZE_64K },
> +            { 24, DDW_PGSIZE_16M },
> +            { 25, DDW_PGSIZE_32M },
> +            { 26, DDW_PGSIZE_64M },
> +            { 27, DDW_PGSIZE_128M },
> +            { 28, DDW_PGSIZE_256M },
> +            { 34, DDW_PGSIZE_16G },
> +        };
>
>
> Supported page sizes are returned by the host kernel via "query". For 16MB
> pages, page shift will return DDW_PGSIZE_4K|DDW_PGSIZE_64K|DDW_PGSIZE_16M.
> Or I did not understand the question...

Why do we care about the sizes? Anything bigger than what we support 
should always work, no? What happens if the guest creates a 16MB map but 
my pages are 4kb mapped? Wouldn't the same logic be able to deal with 
16G pages?


Alex

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: Enable DDW
  2014-08-11 17:30       ` Alexander Graf
@ 2014-08-12  0:03         ` Alexey Kardashevskiy
  2014-08-12  9:37           ` Alexander Graf
  0 siblings, 1 reply; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-08-12  0:03 UTC (permalink / raw)
  To: Alexander Graf, qemu-devel; +Cc: Alex Williamson, qemu-ppc

On 08/12/2014 03:30 AM, Alexander Graf wrote:
> 
> On 11.08.14 17:01, Alexey Kardashevskiy wrote:
>> On 08/11/2014 10:02 PM, Alexander Graf wrote:
>>> On 31.07.14 11:34, Alexey Kardashevskiy wrote:
>>>> This implements DDW for VFIO. Host kernel support is required for this.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>>    hw/ppc/spapr_pci_vfio.c | 75
>>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>    1 file changed, 75 insertions(+)
>>>>
>>>> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
>>>> index d3bddf2..dc443e2 100644
>>>> --- a/hw/ppc/spapr_pci_vfio.c
>>>> +++ b/hw/ppc/spapr_pci_vfio.c
>>>> @@ -69,6 +69,77 @@ static void
>>>> spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
>>>>        /* Register default 32bit DMA window */
>>>>        memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
>>>>                                    spapr_tce_get_iommu(tcet));
>>>> +
>>>> +    sphb->ddw_supported = !!(info.flags & VFIO_IOMMU_SPAPR_TCE_FLAG_DDW);
>>>> +}
>>>> +
>>>> +static int spapr_pci_vfio_ddw_query(sPAPRPHBState *sphb,
>>>> +                                    uint32_t *windows_available,
>>>> +                                    uint32_t *page_size_mask)
>>>> +{
>>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>>> +    struct vfio_iommu_spapr_tce_query query = { .argsz = sizeof(query) };
>>>> +    int ret;
>>>> +
>>>> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
>>>> +                               VFIO_IOMMU_SPAPR_TCE_QUERY, &query);
>>>> +    if (ret) {
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    *windows_available = query.windows_available;
>>>> +    *page_size_mask = query.page_size_mask;
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static int spapr_pci_vfio_ddw_create(sPAPRPHBState *sphb, uint32_t
>>>> page_shift,
>>>> +                                     uint32_t window_shift, uint32_t
>>>> liobn,
>>>> +                                     sPAPRTCETable **ptcet)
>>>> +{
>>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>>> +    struct vfio_iommu_spapr_tce_create create = {
>>>> +        .argsz = sizeof(create),
>>>> +        .page_shift = page_shift,
>>>> +        .window_shift = window_shift,
>>>> +        .start_addr = 0
>>>> +    };
>>>> +    int ret;
>>>> +
>>>> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
>>>> +                               VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>>>> +    if (ret) {
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    *ptcet = spapr_tce_new_table(DEVICE(sphb), liobn, create.start_addr,
>>>> +                                 page_shift, 1 << (window_shift -
>>>> page_shift),
>>> I spot a 1 without ULL again - this time it might work out ok, but please
>>> just always use ULL when you pass around addresses.
>> My bad. I keep forgetting this, I'll adjust my own checkpatch.py :)
>>
>>
>>> Please walk me though the abstraction levels on what each page size
>>> honoration means. If I use THP, what page size granularity can I use for
>>> TCE entries?
>>
>> [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls
>> support
>>
>> +        const struct { int shift; uint32_t mask; } masks[] = {
>> +            { 12, DDW_PGSIZE_4K },
>> +            { 16, DDW_PGSIZE_64K },
>> +            { 24, DDW_PGSIZE_16M },
>> +            { 25, DDW_PGSIZE_32M },
>> +            { 26, DDW_PGSIZE_64M },
>> +            { 27, DDW_PGSIZE_128M },
>> +            { 28, DDW_PGSIZE_256M },
>> +            { 34, DDW_PGSIZE_16G },
>> +        };
>>
>>
>> Supported page sizes are returned by the host kernel via "query". For 16MB
>> pages, page shift will return DDW_PGSIZE_4K|DDW_PGSIZE_64K|DDW_PGSIZE_16M.
>> Or I did not understand the question...
> 
> Why do we care about the sizes? Anything bigger than what we support should
> always work, no? What happens if the guest creates a 16MB map but my pages
> are 4kb mapped? Wouldn't the same logic be able to deal with 16G pages?

It is DMA memory: if I split a "virtual" 16M page into a bunch of real 4K
pages, I have to make sure these 16M are physically contiguous, because
there will be one TCE entry for it and no further translation besides the
IOMMU. What am I missing?
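In other words (just the arithmetic, not code from the patch):

    /* One TCE at 16M granularity maps 16M of bus space to a single host
     * physical address; the hardware applies no further translation, so
     * the backing memory must be one physically contiguous chunk:
     *   16M / 4K = 1 << (24 - 12) = 4096 contiguous 4K host pages
     */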



-- 
Alexey

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 08/10] spapr_pci: Enable DDW
  2014-08-11 17:29       ` Alexander Graf
@ 2014-08-12  0:13         ` Alexey Kardashevskiy
  2014-08-12  3:59           ` Alexey Kardashevskiy
  0 siblings, 1 reply; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-08-12  0:13 UTC (permalink / raw)
  To: Alexander Graf, qemu-devel; +Cc: Alex Williamson, qemu-ppc

On 08/12/2014 03:29 AM, Alexander Graf wrote:
> 
> On 11.08.14 17:26, Alexey Kardashevskiy wrote:
>> On 08/11/2014 09:59 PM, Alexander Graf wrote:
>>> On 31.07.14 11:34, Alexey Kardashevskiy wrote:
>>>> This implements DDW for emulated PHB.
>>>>
>>>> This advertises DDW in device tree.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>>
>>>> The DDW has not been tested as QEMU does not implement any 64bit DMA
>>>> capable
>>>> device and existing linux guests do not use DDW for 32bit DMA.
>>> Can't you just add the pci config space bit for it to the e1000 emulation?
>> Sorry, I am not following you here. What bit in config space can enable
>> 64bit DMA?
> 
> Apparently there's nothing at all required. The igb driver simply tries to
> use 64bit DMA masks.

A driver should use 64bit addresses (unsigned long, u64) for DMA, not 32bit
(unsigned, u32).


> 
>>
>> I tried patching the guest driver, that did not work so I did not dig
>> further.
> 
> Which driver did you try it with?


drivers/net/ethernet/intel/e1000/e1000_main.c

I looked again: the driver only uses 64bit DMA if it is a PCI-X-capable
adapter, which the e1000 from QEMU is not.
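The relevant check in e1000_probe() is roughly this (paraphrased from
memory, not a verbatim quote of the kernel source):

    /* Only PCI-X adapters get a 64bit DMA mask; plain PCI adapters
     * (which is what QEMU emulates) stay on the 32bit mask. */
    pci_using_dac = 0;
    if ((hw->bus_type == e1000_bus_type_pcix) &&
        !dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64))) {
        pci_using_dac = 1;
    } else {
        err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
    }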


> 
> 
>>
>>> That one should be pretty safe, no?
>>>
>>>> ---
>>>>    hw/ppc/spapr_pci.c          | 65
>>>> +++++++++++++++++++++++++++++++++++++++++++++
>>>>    include/hw/pci-host/spapr.h |  5 ++++
>>>>    2 files changed, 70 insertions(+)
>>>>
>>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>>> index 230b59c..d1f4c86 100644
>>>> --- a/hw/ppc/spapr_pci.c
>>>> +++ b/hw/ppc/spapr_pci.c
>>>> @@ -22,6 +22,7 @@
>>>>     * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
>>>> DEALINGS IN
>>>>     * THE SOFTWARE.
>>>>     */
>>>> +#include "sysemu/sysemu.h"
>>>>    #include "hw/hw.h"
>>>>    #include "hw/pci/pci.h"
>>>>    #include "hw/pci/msi.h"
>>>> @@ -650,6 +651,8 @@ static void spapr_phb_finish_realize(sPAPRPHBState
>>>> *sphb, Error **errp)
>>>>        /* Register default 32bit DMA window */
>>>>        memory_region_add_subregion(&sphb->iommu_root, 0,
>>>>                                    spapr_tce_get_iommu(tcet));
>>>> +
>>>> +    sphb->ddw_supported = true;
>>> Unconditionally?
>>
>> Yes. Why not? I cannot think of any case when we would not want this. In
>> practice there is very little chance it will ever be used anyway :) There
>> is still a machine option to disable it completely.
>>
>>
>>> Also, can't you make the ddw enable/disable flow go set-only? Basically
>>> have the flag in the machine struct if you must, but then on every PHB
>>> instantiation you set a QOM property that sets ddw_supported respectively?
>> Uff. Very confusing review comments today :)
>>
>> For VFIO, ddw_supported comes from the host kernel and totally depends on
>> hardware.
>>
>> For emulated, there is just one emulated PHB (yes, can be many but noone
>> seems to be using more in reality) and what you suggest seems to be too
>> complicated.
>>
>> This DDW thing - it is not really dynamic in the way it is used by the
>> existing linux guest. At the boot time the guest driver looks at DMA mask
>> and only if it is >32bit, it creates DDW, once, and after that the windows
>> remains active while the guest is running.
> 
> What I'm asking is that rather than having
> 
>   if (machine->ddw_enabled && phb->ddw_supported)
> 
> to instead only have
> 
>   if (phb->ddw_enabled)
> 
> which gets set by the machine to true if machine->ddw_enabled. If you make
> it a qom property you can control the setter, so you can at the point when
> the machine wants to set it also ignore the set to true if your vfio
> implementation doesn't support ddw, leaving ddw_enabled as false.
> 
>>
>>
>>> Also keep in mind that we will have to at least disable ddw by default for
>>> existing machine types to maintain backwards compatibility.
>>
>> Where exactly does the default setting "on" break in compatibility?
> 
> Different device tree? Different return values on rtas calls? These are
> guest visible changes, so in theory we would have to make sure we don't
> change any of them.
>
> Of course we can always consciously declare them as unimportant enough that
> they in reality shouldn't have side effects we care about for hot and live
> migration, but there'd have to be a good reasoning on why we shouldn't have
> it disabled rather than why we should have backwards compatibility.

"hot" migration? What is that? :)

There is a machine option to disable it and migrate to an older guest
(which we do not support AFAIR, or do we?). If we migrate to a newer QEMU,
these DDW tokens will be missing in the destination guest's tree and DDW
won't be used, so everybody is happy. I really fail to see a scenario in
which I would not use DDW...


-- 
Alexey

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 02/10] spapr_iommu: Disable in-kernel IOMMU tables for >4GB windows
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 02/10] spapr_iommu: Disable in-kernel IOMMU tables for >4GB windows Alexey Kardashevskiy
@ 2014-08-12  1:17   ` David Gibson
  2014-08-12  7:32     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 55+ messages in thread
From: David Gibson @ 2014-08-12  1:17 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

[-- Attachment #1: Type: text/plain, Size: 2019 bytes --]

On Thu, Jul 31, 2014 at 07:34:06PM +1000, Alexey Kardashevskiy wrote:
> The existing KVM_CREATE_SPAPR_TCE ioctl only supports 4GB windows max.
> We are going to add huge DMA window support, so this will create a small
> window and unexpectedly fail later.

I'm not entirely clear on what you're saying here.  Are you saying
that the kernel interface silently truncates a window > 4G, rather
than failing?

If so, that's a kernel bug which should be addressed - obviously we'd
still need this as a workaround for older kernels, but it should be
treated as a workaround, not as the real fix.

> This disables KVM_CREATE_SPAPR_TCE for windows bigger than 4GB. Since
> those windows are normally mapped at boot time, there will be no
> performance impact.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  hw/ppc/spapr_iommu.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index f6e32a4..36f5d27 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -113,11 +113,11 @@ static MemoryRegionIOMMUOps spapr_iommu_ops = {
>  static int spapr_tce_table_realize(DeviceState *dev)
>  {
>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
> +    uint64_t window_size = tcet->nb_table << tcet->page_shift;
>  
> -    if (kvm_enabled()) {
> +    if (kvm_enabled() && !(window_size >> 32)) {
>          tcet->table = kvmppc_create_spapr_tce(tcet->liobn,
> -                                              tcet->nb_table <<
> -                                              tcet->page_shift,
> +                                              window_size,
>                                                &tcet->fd,
>                                                tcet->vfio_accel);
>      }

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 03/10] spapr_pci: Make find_phb()/find_dev() public
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 03/10] spapr_pci: Make find_phb()/find_dev() public Alexey Kardashevskiy
  2014-08-11 11:39   ` Alexander Graf
@ 2014-08-12  1:19   ` David Gibson
  1 sibling, 0 replies; 55+ messages in thread
From: David Gibson @ 2014-08-12  1:19 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

[-- Attachment #1: Type: text/plain, Size: 781 bytes --]

On Thu, Jul 31, 2014 at 07:34:07PM +1000, Alexey Kardashevskiy wrote:
> This makes find_phb()/find_dev() public and changes their names
> to spapr_pci_find_phb()/spapr_pci_find_dev() as they are going to
> be used from other parts of QEMU such as VFIO DDW (dynamic DMA window)
> or VFIO PCI error injection or VFIO EEH handling - in all these
> cases there are RTAS calls which are addressed to BUID+config_addr
> in IEEE1275 format.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Seems reasonable enough.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 04/10] spapr_iommu: Make spapr_tce_find_by_liobn() public
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 04/10] spapr_iommu: Make spapr_tce_find_by_liobn() public Alexey Kardashevskiy
@ 2014-08-12  1:19   ` David Gibson
  0 siblings, 0 replies; 55+ messages in thread
From: David Gibson @ 2014-08-12  1:19 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

[-- Attachment #1: Type: text/plain, Size: 666 bytes --]

On Thu, Jul 31, 2014 at 07:34:08PM +1000, Alexey Kardashevskiy wrote:
> At the moment spapr_tce_find_by_liobn() is used by H_PUT_TCE/...
> handlers to find an IOMMU by LIOBN.
> 
> We are going to implement Dynamic DMA windows (DDW), new code
> will go to a new file and we will use spapr_tce_find_by_liobn()
> there too so let's make it public.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 05/10] linux headers update for DDW
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 05/10] linux headers update for DDW Alexey Kardashevskiy
@ 2014-08-12  1:20   ` David Gibson
  2014-08-12  7:16     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 55+ messages in thread
From: David Gibson @ 2014-08-12  1:20 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

[-- Attachment #1: Type: text/plain, Size: 446 bytes --]

On Thu, Jul 31, 2014 at 07:34:09PM +1000, Alexey Kardashevskiy wrote:
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

It would be nice for your commit message to state exactly what kernel
version you pulled these updated headers in from.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls support
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls support Alexey Kardashevskiy
  2014-08-11 11:51   ` Alexander Graf
@ 2014-08-12  1:45   ` David Gibson
  2014-08-12  7:25     ` Alexey Kardashevskiy
  1 sibling, 1 reply; 55+ messages in thread
From: David Gibson @ 2014-08-12  1:45 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

[-- Attachment #1: Type: text/plain, Size: 16785 bytes --]

On Thu, Jul 31, 2014 at 07:34:10PM +1000, Alexey Kardashevskiy wrote:
> This adds support for the Dynamic DMA Windows (DDW) option defined by
> the SPAPR specification, which allows having additional DMA window(s)
> which can support page sizes other than 4K.
> 
> The existing implementation of DDW in the guest tries to create one huge
> DMA window with 64K or 16MB pages and map the entire guest RAM to it. If it
> succeeds, the guest switches to dma_direct_ops and never calls
> TCE hypercalls (H_PUT_TCE,...) again. This enables VFIO devices to use
> the entire RAM and not waste time on map/unmap.
> 
> This adds 4 RTAS handlers:
> * ibm,query-pe-dma-window
> * ibm,create-pe-dma-window
> * ibm,remove-pe-dma-window
> * ibm,reset-pe-dma-window
> These are registered from type_init() callback.
> 
> These RTAS handlers are implemented in a separate file to avoid polluting
> spapr_iommu.c with PHB-specific code.

[snip]
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  hw/ppc/Makefile.objs        |   3 +
>  hw/ppc/spapr_rtas_ddw.c     | 296 ++++++++++++++++++++++++++++++++++++++++++++
>  include/hw/pci-host/spapr.h |  18 +++
>  include/hw/ppc/spapr.h      |   6 +-
>  trace-events                |   4 +
>  5 files changed, 326 insertions(+), 1 deletion(-)
>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
> 
> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> index edd44d0..9773294 100644
> --- a/hw/ppc/Makefile.objs
> +++ b/hw/ppc/Makefile.objs
> @@ -7,6 +7,9 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o
>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>  obj-y += spapr_pci_vfio.o
>  endif
> +ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES), yy)
> +obj-y += spapr_rtas_ddw.o
> +endif
>  # PowerPC 4xx boards
>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>  obj-y += ppc4xx_pci.o
> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
> new file mode 100644
> index 0000000..943af2c
> --- /dev/null
> +++ b/hw/ppc/spapr_rtas_ddw.c
> @@ -0,0 +1,296 @@
> +/*
> + * QEMU sPAPR Dynamic DMA windows support
> + *
> + * Copyright (c) 2014 Alexey Kardashevskiy, IBM Corporation.
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License,
> + *  or (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "hw/ppc/spapr.h"
> +#include "hw/pci-host/spapr.h"
> +#include "trace.h"
> +
> +static inline uint32_t spapr_iommu_fixmask(uint32_t cur_mask,
> +                                           struct ppc_one_seg_page_size *sps,
> +                                           uint32_t query_mask,
> +                                           int shift,
> +                                           uint32_t add_mask)
> +{
> +    if ((sps->page_shift == shift) && (query_mask & add_mask)) {
> +        cur_mask |= add_mask;
> +    }
> +    return cur_mask;
> +}


> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
> +                                         sPAPREnvironment *spapr,
> +                                         uint32_t token, uint32_t nargs,
> +                                         target_ulong args,
> +                                         uint32_t nret, target_ulong rets)
> +{
> +    CPUPPCState *env = &cpu->env;
> +    sPAPRPHBState *sphb;
> +    sPAPRPHBClass *spc;
> +    uint64_t buid;
> +    uint32_t addr, pgmask = 0;
> +    uint32_t windows_available = 0, page_size_mask = 0;
> +    long ret, i;
> +
> +    if ((nargs != 3) || (nret != 5)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb) {
> +        goto param_error_exit;
> +    }
> +
> +    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
> +    if (!spc->ddw_query) {
> +        goto hw_error_exit;
> +    }
> +
> +    ret = spc->ddw_query(sphb, &windows_available, &page_size_mask);
> +    trace_spapr_iommu_ddw_query(buid, addr, windows_available,
> +                                page_size_mask, pgmask, ret);
> +    if (ret) {
> +        goto hw_error_exit;
> +    }
> +
> +    /* DBG! */
> +    if (!(page_size_mask & DDW_PGSIZE_16M)) {
> +        goto hw_error_exit;
> +    }

Does this still belong here?

> +
> +    /* Work out biggest possible page size */
> +    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
> +        int j;
> +        struct ppc_one_seg_page_size *sps = &env->sps.sps[i];
> +        const struct { int shift; uint32_t mask; } masks[] = {
> +            { 12, DDW_PGSIZE_4K },
> +            { 16, DDW_PGSIZE_64K },
> +            { 24, DDW_PGSIZE_16M },
> +            { 25, DDW_PGSIZE_32M },
> +            { 26, DDW_PGSIZE_64M },
> +            { 27, DDW_PGSIZE_128M },
> +            { 28, DDW_PGSIZE_256M },
> +            { 34, DDW_PGSIZE_16G },
> +        };
> +        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
> +            pgmask = spapr_iommu_fixmask(pgmask, sps, page_size_mask,
> +                                         masks[j].shift, masks[j].mask);
> +        }
> +    }

The function of this is kind of unclear.  I'm assuming this is
filtering the supported page sizes reported by the PHB by the possible
page sizes based on host page size or other constraints.  Is that
right?

I think you'd be better off folding the whole double loop into the
fixmask function.
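
Something along these lines, perhaps (completely untested sketch, just
reusing the names from the patch):

    /* Filter the PHB-reported page size mask by the page sizes the
     * guest CPU model actually supports - same logic as above, just
     * folded into a single helper */
    static uint32_t spapr_iommu_fixmask(CPUPPCState *env, uint32_t query_mask)
    {
        const struct { int shift; uint32_t mask; } masks[] = {
            { 12, DDW_PGSIZE_4K },
            { 16, DDW_PGSIZE_64K },
            { 24, DDW_PGSIZE_16M },
            { 25, DDW_PGSIZE_32M },
            { 26, DDW_PGSIZE_64M },
            { 27, DDW_PGSIZE_128M },
            { 28, DDW_PGSIZE_256M },
            { 34, DDW_PGSIZE_16G },
        };
        uint32_t pgmask = 0;
        int i, j;

        for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
            for (j = 0; j < ARRAY_SIZE(masks); j++) {
                if ((env->sps.sps[i].page_shift == masks[j].shift) &&
                    (query_mask & masks[j].mask)) {
                    pgmask |= masks[j].mask;
                }
            }
        }
        return pgmask;
    }

so the caller just becomes pgmask = spapr_iommu_fixmask(env, page_size_mask);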

> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    rtas_st(rets, 1, windows_available);
> +    /* Return maximum number as all RAM was 4K pages */
> +    rtas_st(rets, 2, ram_size >> SPAPR_TCE_PAGE_SHIFT);

I'm assuming this is the allowed size of the dynamic windows.
Shouldn't that be reported by a PHB callback, rather than hardcoded
here?

> +    rtas_st(rets, 3, pgmask);
> +    rtas_st(rets, 4, pgmask); /* DMA migration mask */
> +    return;
> +
> +hw_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
> +                                          sPAPREnvironment *spapr,
> +                                          uint32_t token, uint32_t nargs,
> +                                          target_ulong args,
> +                                          uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    sPAPRPHBClass *spc;
> +    sPAPRTCETable *tcet = NULL;
> +    uint32_t addr, page_shift, window_shift, liobn;
> +    uint64_t buid;
> +    long ret;
> +
> +    if ((nargs != 5) || (nret != 4)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb) {
> +        goto param_error_exit;
> +    }
> +
> +    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
> +    if (!spc->ddw_create) {
> +        goto hw_error_exit;
> +    }
> +
> +    page_shift = rtas_ld(args, 3);
> +    window_shift = rtas_ld(args, 4);
> +    liobn = sphb->dma_liobn + 0x10000;

Isn't using a fixed LIOBN here assuming you can only have a single DDW
per PHB?  That's true for now, but in theory shouldn't it be reported
by the PHB code itself?

> +    ret = spc->ddw_create(sphb, page_shift, window_shift, liobn, &tcet);
> +    trace_spapr_iommu_ddw_create(buid, addr, 1 << page_shift,
> +                                 1 << window_shift,

For large enough windows this will need to be 1ULL, regardless of the
page shift.
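
i.e. presumably (only the 1ULL is the point, the other arguments unchanged):

    trace_spapr_iommu_ddw_create(buid, addr, 1 << page_shift,
                                 1ULL << window_shift, ...);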

> +                                 tcet ? tcet->bus_offset : 0xbaadf00d,
> +                                 liobn, ret);
> +    if (ret || !tcet) {
> +        goto hw_error_exit;
> +    }
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    rtas_st(rets, 1, liobn);
> +    rtas_st(rets, 2, tcet->bus_offset >> 32);
> +    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
> +    return;
> +
> +hw_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
> +                                          sPAPREnvironment *spapr,
> +                                          uint32_t token, uint32_t nargs,
> +                                          target_ulong args,
> +                                          uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    sPAPRPHBClass *spc;
> +    sPAPRTCETable *tcet;
> +    uint32_t liobn;
> +    long ret;
> +
> +    if ((nargs != 1) || (nret != 1)) {
> +        goto param_error_exit;
> +    }
> +
> +    liobn = rtas_ld(args, 0);
> +    tcet = spapr_tce_find_by_liobn(liobn);
> +    if (!tcet) {
> +        goto param_error_exit;
> +    }
> +
> +    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
> +    if (!sphb) {
> +        goto param_error_exit;
> +    }
> +
> +    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
> +    if (!spc->ddw_remove) {
> +        goto hw_error_exit;
> +    }
> +
> +    ret = spc->ddw_remove(sphb, tcet);
> +    trace_spapr_iommu_ddw_remove(liobn, ret);
> +    if (ret) {
> +        goto hw_error_exit;
> +    }
> +
> +    object_unparent(OBJECT(tcet));
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    return;
> +
> +hw_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static int ddw_remove_tce_table_cb(Object *child, void *opaque)
> +{
> +    sPAPRTCETable *tcet;
> +
> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> +    if (tcet && tcet->bus_offset) {
> +        object_unparent(child);
> +    }
> +
> +    return 0;
> +}
> +
> +static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
> +                                         sPAPREnvironment *spapr,
> +                                         uint32_t token, uint32_t nargs,
> +                                         target_ulong args,
> +                                         uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    sPAPRPHBClass *spc;
> +    uint64_t buid;
> +    uint32_t addr;
> +    long ret;
> +
> +    if ((nargs != 3) || (nret != 1)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb) {
> +        goto param_error_exit;
> +    }
> +
> +    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
> +    if (!spc->ddw_reset) {
> +        goto hw_error_exit;
> +    }
> +
> +    ret = spc->ddw_reset(sphb);
> +    trace_spapr_iommu_ddw_reset(buid, addr, ret);
> +    if (ret) {
> +        goto hw_error_exit;
> +    }
> +
> +    object_child_foreach(OBJECT(sphb), ddw_remove_tce_table_cb, NULL);
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    return;
> +
> +hw_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void spapr_rtas_ddw_init(void)
> +{
> +    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
> +                        "ibm,query-pe-dma-window",
> +                        rtas_ibm_query_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
> +                        "ibm,create-pe-dma-window",
> +                        rtas_ibm_create_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
> +                        "ibm,remove-pe-dma-window",
> +                        rtas_ibm_remove_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
> +                        "ibm,reset-pe-dma-window",
> +                        rtas_ibm_reset_pe_dma_window);
> +}
> +
> +type_init(spapr_rtas_ddw_init)
> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> index 14c2ab0..119d326 100644
> --- a/include/hw/pci-host/spapr.h
> +++ b/include/hw/pci-host/spapr.h
> @@ -49,6 +49,24 @@ struct sPAPRPHBClass {
>      PCIHostBridgeClass parent_class;
>  
>      void (*finish_realize)(sPAPRPHBState *sphb, Error **errp);
> +
> +/* sPAPR spec defined pagesize mask values */
> +#define DDW_PGSIZE_4K       0x01
> +#define DDW_PGSIZE_64K      0x02
> +#define DDW_PGSIZE_16M      0x04
> +#define DDW_PGSIZE_32M      0x08
> +#define DDW_PGSIZE_64M      0x10
> +#define DDW_PGSIZE_128M     0x20
> +#define DDW_PGSIZE_256M     0x40
> +#define DDW_PGSIZE_16G      0x80
> +
> +    int (*ddw_query)(sPAPRPHBState *sphb, uint32_t *windows_available,
> +                     uint32_t *page_size_mask);
> +    int (*ddw_create)(sPAPRPHBState *sphb, uint32_t page_shift,
> +                      uint32_t window_shift, uint32_t liobn,
> +                      sPAPRTCETable **ptcet);
> +    int (*ddw_remove)(sPAPRPHBState *sphb, sPAPRTCETable *tcet);
> +    int (*ddw_reset)(sPAPRPHBState *sphb);
>  };
>  
>  typedef struct spapr_pci_msi {
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index c9d6c6c..b4bfdda 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -383,8 +383,12 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>  #define RTAS_GET_SENSOR_STATE                   (RTAS_TOKEN_BASE + 0x1D)
>  #define RTAS_IBM_CONFIGURE_CONNECTOR            (RTAS_TOKEN_BASE + 0x1E)
>  #define RTAS_IBM_OS_TERM                        (RTAS_TOKEN_BASE + 0x1F)
> +#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x20)
> +#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x21)
> +#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x22)
> +#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x23)
>  
> -#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x20)
> +#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x24)
>  
>  /* RTAS ibm,get-system-parameter token values */
>  #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
> diff --git a/trace-events b/trace-events
> index 11a17a8..5b54fbd 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1213,6 +1213,10 @@ spapr_iommu_indirect(uint64_t liobn, uint64_t ioba, uint64_t tce, uint64_t iobaN
>  spapr_iommu_stuff(uint64_t liobn, uint64_t ioba, uint64_t tce_value, uint64_t npages, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcevalue=0x%"PRIx64" npages=%"PRId64" ret=%"PRId64
>  spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, unsigned pgsize) "liobn=%"PRIx64" 0x%"PRIx64" -> 0x%"PRIx64" perm=%u mask=%x"
>  spapr_iommu_new_table(uint64_t liobn, void *tcet, void *table, int fd) "liobn=%"PRIx64" tcet=%p table=%p fd=%d"
> +spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, uint32_t wa, uint32_t pgz, uint32_t pgz_fixed, long ret) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, sizes %"PRIx32", fixed %"PRIx32", ret = %ld"
> +spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, unsigned long long pg_size, unsigned long long req_size, uint64_t start, uint32_t liobn, long ret) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%llx, requested=0x%llx, start addr=%"PRIx64", liobn=%"PRIx32", ret = %ld"
> +spapr_iommu_ddw_remove(uint32_t liobn, long ret) "liobn=%"PRIx32", ret = %ld"
> +spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr, long ret) "buid=%"PRIx64" addr=%"PRIx32", ret = %ld"
>  
>  # hw/ppc/ppc.c
>  ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 08/10] spapr_pci: Enable DDW
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 08/10] spapr_pci: Enable DDW Alexey Kardashevskiy
  2014-08-11 11:59   ` Alexander Graf
@ 2014-08-12  2:10   ` David Gibson
  1 sibling, 0 replies; 55+ messages in thread
From: David Gibson @ 2014-08-12  2:10 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

[-- Attachment #1: Type: text/plain, Size: 6588 bytes --]

On Thu, Jul 31, 2014 at 07:34:12PM +1000, Alexey Kardashevskiy wrote:
> This implements DDW for emulated PHB.
> 
> This advertises DDW in device tree.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> 
> The DDW has not been tested as QEMU does not implement any 64bit DMA capable
> device and existing linux guests do not use DDW for 32bit DMA.
> ---
>  hw/ppc/spapr_pci.c          | 65 +++++++++++++++++++++++++++++++++++++++++++++
>  include/hw/pci-host/spapr.h |  5 ++++
>  2 files changed, 70 insertions(+)
> 
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 230b59c..d1f4c86 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -22,6 +22,7 @@
>   * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>   * THE SOFTWARE.
>   */
> +#include "sysemu/sysemu.h"
>  #include "hw/hw.h"
>  #include "hw/pci/pci.h"
>  #include "hw/pci/msi.h"
> @@ -650,6 +651,8 @@ static void spapr_phb_finish_realize(sPAPRPHBState *sphb, Error **errp)
>      /* Register default 32bit DMA window */
>      memory_region_add_subregion(&sphb->iommu_root, 0,
>                                  spapr_tce_get_iommu(tcet));
> +
> +    sphb->ddw_supported = true;
>  }
>  
>  static int spapr_phb_children_reset(Object *child, void *opaque)
> @@ -781,6 +784,42 @@ static const char *spapr_phb_root_bus_path(PCIHostState *host_bridge,
>      return sphb->dtbusname;
>  }
>  
> +static int spapr_pci_ddw_query(sPAPRPHBState *sphb,
> +                               uint32_t *windows_available,
> +                               uint32_t *page_size_mask)
> +{
> +    *windows_available = 1;
> +    *page_size_mask = DDW_PGSIZE_16M;
> +
> +    return 0;
> +}
> +
> +static int spapr_pci_ddw_create(sPAPRPHBState *sphb, uint32_t page_shift,
> +                                uint32_t window_shift, uint32_t liobn,
> +                                sPAPRTCETable **ptcet)
> +{
> +    *ptcet = spapr_tce_new_table(DEVICE(sphb), liobn, SPAPR_PCI_TCE64_START,
> +                                 page_shift, 1 << (window_shift - page_shift),
> +                                 true);
> +    if (!*ptcet) {
> +        return -1;
> +    }
> +    memory_region_add_subregion(&sphb->iommu_root, (*ptcet)->bus_offset,
> +                                spapr_tce_get_iommu(*ptcet));
> +
> +    return 0;
> +}
> +
> +static int spapr_pci_ddw_remove(sPAPRPHBState *sphb, sPAPRTCETable *tcet)
> +{
> +    return 0;
> +}
> +
> +static int spapr_pci_ddw_reset(sPAPRPHBState *sphb)
> +{
> +    return 0;
> +}
> +
>  static void spapr_phb_class_init(ObjectClass *klass, void *data)
>  {
>      PCIHostBridgeClass *hc = PCI_HOST_BRIDGE_CLASS(klass);
> @@ -795,6 +834,10 @@ static void spapr_phb_class_init(ObjectClass *klass, void *data)
>      set_bit(DEVICE_CATEGORY_BRIDGE, dc->categories);
>      dc->cannot_instantiate_with_device_add_yet = false;
>      spc->finish_realize = spapr_phb_finish_realize;
> +    spc->ddw_query = spapr_pci_ddw_query;
> +    spc->ddw_create = spapr_pci_ddw_create;
> +    spc->ddw_remove = spapr_pci_ddw_remove;
> +    spc->ddw_reset = spapr_pci_ddw_reset;
>  }
>  
>  static const TypeInfo spapr_phb_info = {
> @@ -878,6 +921,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      uint32_t interrupt_map_mask[] = {
>          cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
>      uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
> +    uint32_t ddw_applicable[] = {
> +        RTAS_IBM_QUERY_PE_DMA_WINDOW,
> +        RTAS_IBM_CREATE_PE_DMA_WINDOW,
> +        RTAS_IBM_REMOVE_PE_DMA_WINDOW
> +    };
> +    uint32_t ddw_extensions[] = { 1, RTAS_IBM_RESET_PE_DMA_WINDOW };
> +    sPAPRPHBClass *spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(phb);
> +    QemuOpts *machine_opts = qemu_get_machine_opts();
>  
>      /* Start populating the FDT */
>      sprintf(nodename, "pci@%" PRIx64, phb->buid);
> @@ -907,6 +958,20 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
>  
> +    /* Dynamic DMA window */
> +    if (qemu_opt_get_bool(machine_opts, "ddw", true) &&

So, I think this is a rephrasing of agraf's objection.  This feels
like the wrong place to be checking the machine option.

I think the machine option should actually disable the DDW support on
the PHBs (i.e. cause the DDW RTAS calls to fail if attempted), rather
than just turn off advertisement in the device tree.

In fact it should probably stop the RTAS calls from being registered at
all; at the moment the RTAS tokens will still be advertised in the
device tree even if ddw is "disabled".
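
Something like this at the top of each rtas_ibm_*_pe_dma_window handler
would at least make the calls fail cleanly when ddw is off (untested
sketch; actually not registering them would presumably need the
registration moved out of the type_init() callback so it can see the
machine option):

    if (!qemu_opt_get_bool(qemu_get_machine_opts(), "ddw", true)) {
        /* ddw=off on the command line: refuse the call altogether */
        rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
        return;
    }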


> +        phb->ddw_supported &&
> +        spc->ddw_query && spc->ddw_create && spc->ddw_remove) {
> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
> +                         sizeof(ddw_applicable)));
> +
> +        if (spc->ddw_reset) {
> +            /* When enabled, the guest will remove the default 32bit window */
> +            _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
> +                             &ddw_extensions, sizeof(ddw_extensions)));
> +        }
> +    }
> +
>      /* Build the interrupt-map, this must matches what is done
>       * in pci_spapr_map_irq
>       */
> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> index 119d326..2046356 100644
> --- a/include/hw/pci-host/spapr.h
> +++ b/include/hw/pci-host/spapr.h
> @@ -103,6 +103,8 @@ struct sPAPRPHBState {
>      int32_t msi_devs_num;
>      spapr_pci_msi_mig *msi_devs;
>  
> +    bool ddw_supported;
> +
>      QLIST_ENTRY(sPAPRPHBState) list;
>  };
>  
> @@ -125,6 +127,9 @@ struct sPAPRPHBVFIOState {
>  
>  #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
>  
> +/* Default 64bit dynamic window offset */
> +#define SPAPR_PCI_TCE64_START        0x8000000000000000ULL
> +
>  static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb, int pin)
>  {
>      return xics_get_qirq(spapr->icp, phb->lsi_table[pin].irq);

Also, I think there might be a leak on system_reset here.  I can't see
anything that will clean up any secondary DDW TCE tables when the
whole system is reset.
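
Perhaps something along these lines hooked into the PHB reset path would
do (sketch only; spapr_phb_ddw_reset_cb is a made-up name, the filter is
the same "non-zero bus offset means DDW" test as ddw_remove_tce_table_cb()
in patch 06):

    static int spapr_phb_ddw_reset_cb(Object *child, void *opaque)
    {
        sPAPRTCETable *tcet = (sPAPRTCETable *)
            object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);

        /* Only the dynamically created windows have a non-zero offset */
        if (tcet && tcet->bus_offset) {
            object_unparent(child);
        }
        return 0;
    }

called as object_child_foreach(OBJECT(sphb), spapr_phb_ddw_reset_cb, NULL)
from the PHB reset handler.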

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: Enable DDW
  2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: " Alexey Kardashevskiy
  2014-08-11 12:02   ` Alexander Graf
@ 2014-08-12  2:14   ` David Gibson
  1 sibling, 0 replies; 55+ messages in thread
From: David Gibson @ 2014-08-12  2:14 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

[-- Attachment #1: Type: text/plain, Size: 4434 bytes --]

On Thu, Jul 31, 2014 at 07:34:13PM +1000, Alexey Kardashevskiy wrote:
> This implements DDW for VFIO. Host kernel support is required for this.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  hw/ppc/spapr_pci_vfio.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 75 insertions(+)
> 
> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
> index d3bddf2..dc443e2 100644
> --- a/hw/ppc/spapr_pci_vfio.c
> +++ b/hw/ppc/spapr_pci_vfio.c
> @@ -69,6 +69,77 @@ static void spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
>      /* Register default 32bit DMA window */
>      memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
>                                  spapr_tce_get_iommu(tcet));
> +
> +    sphb->ddw_supported = !!(info.flags & VFIO_IOMMU_SPAPR_TCE_FLAG_DDW);
> +}
> +
> +static int spapr_pci_vfio_ddw_query(sPAPRPHBState *sphb,
> +                                    uint32_t *windows_available,
> +                                    uint32_t *page_size_mask)
> +{
> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
> +    struct vfio_iommu_spapr_tce_query query = { .argsz = sizeof(query) };
> +    int ret;
> +
> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
> +                               VFIO_IOMMU_SPAPR_TCE_QUERY, &query);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    *windows_available = query.windows_available;
> +    *page_size_mask = query.page_size_mask;
> +
> +    return ret;
> +}
> +
> +static int spapr_pci_vfio_ddw_create(sPAPRPHBState *sphb, uint32_t page_shift,
> +                                     uint32_t window_shift, uint32_t liobn,
> +                                     sPAPRTCETable **ptcet)
> +{
> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
> +    struct vfio_iommu_spapr_tce_create create = {
> +        .argsz = sizeof(create),
> +        .page_shift = page_shift,
> +        .window_shift = window_shift,
> +        .start_addr = 0
> +    };
> +    int ret;
> +
> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
> +                               VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    *ptcet = spapr_tce_new_table(DEVICE(sphb), liobn, create.start_addr,
> +                                 page_shift, 1 << (window_shift - page_shift),
> +                                 true);
> +    memory_region_add_subregion(&sphb->iommu_root, (*ptcet)->bus_offset,
> +                                spapr_tce_get_iommu(*ptcet));
> +
> +    return ret;
> +}
> +
> +static int spapr_pci_vfio_ddw_remove(sPAPRPHBState *sphb, sPAPRTCETable *tcet)
> +{
> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
> +    struct vfio_iommu_spapr_tce_remove remove = {
> +        .argsz = sizeof(remove),
> +        .start_addr = tcet->bus_offset
> +    };
> +
> +    return vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
> +                                VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> +}
> +
> +static int spapr_pci_vfio_ddw_reset(sPAPRPHBState *sphb)
> +{
> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
> +    struct vfio_iommu_spapr_tce_reset reset = { .argsz = sizeof(reset) };
> +
> +    return vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
> +                                VFIO_IOMMU_SPAPR_TCE_RESET, &reset);
>  }
>  
>  static void spapr_phb_vfio_reset(DeviceState *qdev)
> @@ -84,6 +155,10 @@ static void spapr_phb_vfio_class_init(ObjectClass *klass, void *data)
>      dc->props = spapr_phb_vfio_properties;
>      dc->reset = spapr_phb_vfio_reset;
>      spc->finish_realize = spapr_phb_vfio_finish_realize;
> +    spc->ddw_query = spapr_pci_vfio_ddw_query;
> +    spc->ddw_create = spapr_pci_vfio_ddw_create;
> +    spc->ddw_remove = spapr_pci_vfio_ddw_remove;
> +    spc->ddw_reset = spapr_pci_vfio_ddw_reset;
>  }
>  
>  static const TypeInfo spapr_phb_vfio_info = {

As with the emulated version, I don't see anything which will reset
secondary TCE tables on a system reset.
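
Presumably the VFIO PHB reset hook could do both the host and the QEMU
side, roughly (sketch only, on top of whatever spapr_phb_vfio_reset()
already does; the child-walk callback would be the same one as sketched
for the emulated PHB):

    static void spapr_phb_vfio_reset(DeviceState *qdev)
    {
        sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);

        /* Drop host-side dynamic windows, then the QEMU-side tables */
        spapr_pci_vfio_ddw_reset(sphb);
        object_child_foreach(OBJECT(sphb), spapr_phb_ddw_reset_cb, NULL);
    }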

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 08/10] spapr_pci: Enable DDW
  2014-08-12  0:13         ` Alexey Kardashevskiy
@ 2014-08-12  3:59           ` Alexey Kardashevskiy
  2014-08-12  9:36             ` Alexander Graf
  0 siblings, 1 reply; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-08-12  3:59 UTC (permalink / raw)
  To: Alexander Graf, qemu-devel; +Cc: Alex Williamson, qemu-ppc

On 08/12/2014 10:13 AM, Alexey Kardashevskiy wrote:
> On 08/12/2014 03:29 AM, Alexander Graf wrote:
>>
>> On 11.08.14 17:26, Alexey Kardashevskiy wrote:
>>> On 08/11/2014 09:59 PM, Alexander Graf wrote:
>>>> On 31.07.14 11:34, Alexey Kardashevskiy wrote:
>>>>> This implements DDW for emulated PHB.
>>>>>
>>>>> This advertises DDW in device tree.
>>>>>
>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>> ---
>>>>>
>>>>> The DDW has not been tested as QEMU does not implement any 64bit DMA
>>>>> capable
>>>>> device and existing linux guests do not use DDW for 32bit DMA.
>>>> Can't you just add the pci config space bit for it to the e1000 emulation?
>>> Sorry, I am not following you here. What bit in config space can enable
>>> 64bit DMA?
>>
>> Apparently there's nothing at all required. The igb driver simply tries to
>> use 64bit DMA masks.
> 
> A driver should use 64bit addresses (unsigned long, u64) for DMA, not 32bit
> (unsigned, u32).
> 
> 
>>
>>>
>>> I tried patching the guest driver, that did not work so I did not dig
>>> further.
>>
>> Which driver did you try it with?
> 
> 
> drivers/net/ethernet/intel/e1000/e1000_main.c
> 
> I looked again, the driver uses 64bit DMA if it is a PCI-X-capable
> adapter, which the e1000 from QEMU is not.
> 
> 
>>
>>
>>>
>>>> That one should be pretty safe, no?
>>>>
>>>>> ---
>>>>>    hw/ppc/spapr_pci.c          | 65
>>>>> +++++++++++++++++++++++++++++++++++++++++++++
>>>>>    include/hw/pci-host/spapr.h |  5 ++++
>>>>>    2 files changed, 70 insertions(+)
>>>>>
>>>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>>>> index 230b59c..d1f4c86 100644
>>>>> --- a/hw/ppc/spapr_pci.c
>>>>> +++ b/hw/ppc/spapr_pci.c
>>>>> @@ -22,6 +22,7 @@
>>>>>     * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
>>>>> DEALINGS IN
>>>>>     * THE SOFTWARE.
>>>>>     */
>>>>> +#include "sysemu/sysemu.h"
>>>>>    #include "hw/hw.h"
>>>>>    #include "hw/pci/pci.h"
>>>>>    #include "hw/pci/msi.h"
>>>>> @@ -650,6 +651,8 @@ static void spapr_phb_finish_realize(sPAPRPHBState
>>>>> *sphb, Error **errp)
>>>>>        /* Register default 32bit DMA window */
>>>>>        memory_region_add_subregion(&sphb->iommu_root, 0,
>>>>>                                    spapr_tce_get_iommu(tcet));
>>>>> +
>>>>> +    sphb->ddw_supported = true;
>>>> Unconditionally?
>>>
>>> Yes. Why not? I cannot think of any case when we would not want this. In
>>> practice there is very little chance it will ever be used anyway :) There
>>> is still a machine option to disable it completely.
>>>
>>>
>>>> Also, can't you make the ddw enable/disable flow go set-only? Basically
>>>> have the flag in the machine struct if you must, but then on every PHB
>>>> instantiation you set a QOM property that sets ddw_supported respectively?
>>> Uff. Very confusing review comments today :)
>>>
>>> For VFIO, ddw_supported comes from the host kernel and totally depends on
>>> hardware.
>>>
>>> For emulated, there is just one emulated PHB (yes, can be many but no one
>>> seems to be using more in reality) and what you suggest seems to be too
>>> complicated.
>>>
>>> This DDW thing - it is not really dynamic in the way it is used by the
>>> existing linux guest. At the boot time the guest driver looks at DMA mask
>>> and only if it is >32bit, it creates DDW, once, and after that the window
>>> remains active while the guest is running.
>>
>> What I'm asking is that rather than having
>>
>>   if (machine->ddw_enabled && phb->ddw_supported)
>>
>> to instead only have
>>
>>   if (phb->ddw_enabled)
>>
>> which gets set by the machine to true if machine->ddw_enabled. If you make
>> it a qom property you can control the setter, so you can at the point when
>> the machine wants to set it also ignore the set to true if your vfio
>> implementation doesn't support ddw, leaving ddw_enabled as false.
>>
>>>
>>>
>>>> Also keep in mind that we will have to at least disable ddw by default for
>>>> existing machine types to maintain backwards compatibility.
>>>
>>> Where exactly does the default setting "on" break in compatibility?
>>
>> Different device tree? Different return values on rtas calls? These are
>> guest visible changes, so in theory we would have to make sure we don't
>> change any of them.
>>
>> Of course we can always consciously declare them as unimportant enough that
>> they in reality shouldn't have side effects we care about for hot and live
>> migration, but there'd have to be a good reasoning on why we shouldn't have
>> it disabled rather than why we should have backwards compatibility.
> 
> "hot" migration? What is that? :)
> 
> There is a machine option to disable it when migrating to an older QEMU
> (which we do not support AFAIR, or do we?). If we migrate to a newer QEMU,
> these DDW tokens will be missing from the destination guest's device tree
> and DDW won't be used, so everybody is happy. I really fail to see a
> scenario where I would not use DDW...

Ok, Paul explained. So by default "ddw" must be off for the pseries-2.0
machine and on for pseries-2.2 and we'll be fine, right?



-- 
Alexey

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 05/10] linux headers update for DDW
  2014-08-12  1:20   ` David Gibson
@ 2014-08-12  7:16     ` Alexey Kardashevskiy
  2014-08-13  3:23       ` David Gibson
  0 siblings, 1 reply; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-08-12  7:16 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On 08/12/2014 11:20 AM, David Gibson wrote:
> On Thu, Jul 31, 2014 at 07:34:09PM +1000, Alexey Kardashevskiy wrote:
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> 
> It would be nice for your commit message to state exactly what kernel
> version you pulled these updated headers in from.

There is no version; this is an "RFC" patchset and the kernel changes have
not reached upstream yet :)



-- 
Alexey

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls support
  2014-08-12  1:45   ` David Gibson
@ 2014-08-12  7:25     ` Alexey Kardashevskiy
  2014-08-13  3:27       ` David Gibson
  0 siblings, 1 reply; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-08-12  7:25 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On 08/12/2014 11:45 AM, David Gibson wrote:
> On Thu, Jul 31, 2014 at 07:34:10PM +1000, Alexey Kardashevskiy wrote:
>> This adds support for the Dynamic DMA Windows (DDW) option defined by
>> the SPAPR specification, which allows having additional DMA window(s)
>> which can support page sizes other than 4K.
>>
>> The existing implementation of DDW in the guest tries to create one huge
>> DMA window with 64K or 16MB pages and map the entire guest RAM to it. If it
>> succeeds, the guest switches to dma_direct_ops and never calls
>> TCE hypercalls (H_PUT_TCE,...) again. This enables VFIO devices to use
>> the entire RAM and not waste time on map/unmap.
>>
>> This adds 4 RTAS handlers:
>> * ibm,query-pe-dma-window
>> * ibm,create-pe-dma-window
>> * ibm,remove-pe-dma-window
>> * ibm,reset-pe-dma-window
>> These are registered from type_init() callback.
>>
>> These RTAS handlers are implemented in a separate file to avoid polluting
>> spapr_iommu.c with PHB-specific code.
> 
> [snip]
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  hw/ppc/Makefile.objs        |   3 +
>>  hw/ppc/spapr_rtas_ddw.c     | 296 ++++++++++++++++++++++++++++++++++++++++++++
>>  include/hw/pci-host/spapr.h |  18 +++
>>  include/hw/ppc/spapr.h      |   6 +-
>>  trace-events                |   4 +
>>  5 files changed, 326 insertions(+), 1 deletion(-)
>>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
>>
>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>> index edd44d0..9773294 100644
>> --- a/hw/ppc/Makefile.objs
>> +++ b/hw/ppc/Makefile.objs
>> @@ -7,6 +7,9 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o
>>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>>  obj-y += spapr_pci_vfio.o
>>  endif
>> +ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES), yy)
>> +obj-y += spapr_rtas_ddw.o
>> +endif
>>  # PowerPC 4xx boards
>>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>>  obj-y += ppc4xx_pci.o
>> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
>> new file mode 100644
>> index 0000000..943af2c
>> --- /dev/null
>> +++ b/hw/ppc/spapr_rtas_ddw.c
>> @@ -0,0 +1,296 @@
>> +/*
>> + * QEMU sPAPR Dynamic DMA windows support
>> + *
>> + * Copyright (c) 2014 Alexey Kardashevskiy, IBM Corporation.
>> + *
>> + *  This program is free software; you can redistribute it and/or modify
>> + *  it under the terms of the GNU General Public License as published by
>> + *  the Free Software Foundation; either version 2 of the License,
>> + *  or (at your option) any later version.
>> + *
>> + *  This program is distributed in the hope that it will be useful,
>> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + *  GNU General Public License for more details.
>> + *
>> + *  You should have received a copy of the GNU General Public License
>> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#include "hw/ppc/spapr.h"
>> +#include "hw/pci-host/spapr.h"
>> +#include "trace.h"
>> +
>> +static inline uint32_t spapr_iommu_fixmask(uint32_t cur_mask,
>> +                                           struct ppc_one_seg_page_size *sps,
>> +                                           uint32_t query_mask,
>> +                                           int shift,
>> +                                           uint32_t add_mask)
>> +{
>> +    if ((sps->page_shift == shift) && (query_mask & add_mask)) {
>> +        cur_mask |= add_mask;
>> +    }
>> +    return cur_mask;
>> +}
> 
> 
>> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
>> +                                         sPAPREnvironment *spapr,
>> +                                         uint32_t token, uint32_t nargs,
>> +                                         target_ulong args,
>> +                                         uint32_t nret, target_ulong rets)
>> +{
>> +    CPUPPCState *env = &cpu->env;
>> +    sPAPRPHBState *sphb;
>> +    sPAPRPHBClass *spc;
>> +    uint64_t buid;
>> +    uint32_t addr, pgmask = 0;
>> +    uint32_t windows_available = 0, page_size_mask = 0;
>> +    long ret, i;
>> +
>> +    if ((nargs != 3) || (nret != 5)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
>> +    if (!spc->ddw_query) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    ret = spc->ddw_query(sphb, &windows_available, &page_size_mask);
>> +    trace_spapr_iommu_ddw_query(buid, addr, windows_available,
>> +                                page_size_mask, pgmask, ret);
>> +    if (ret) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    /* DBG! */
>> +    if (!(page_size_mask & DDW_PGSIZE_16M)) {
>> +        goto hw_error_exit;
>> +    }
> 
> Does this still belong here?
> 
>> +
>> +    /* Work out biggest possible page size */
>> +    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
>> +        int j;
>> +        struct ppc_one_seg_page_size *sps = &env->sps.sps[i];
>> +        const struct { int shift; uint32_t mask; } masks[] = {
>> +            { 12, DDW_PGSIZE_4K },
>> +            { 16, DDW_PGSIZE_64K },
>> +            { 24, DDW_PGSIZE_16M },
>> +            { 25, DDW_PGSIZE_32M },
>> +            { 26, DDW_PGSIZE_64M },
>> +            { 27, DDW_PGSIZE_128M },
>> +            { 28, DDW_PGSIZE_256M },
>> +            { 34, DDW_PGSIZE_16G },
>> +        };
>> +        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
>> +            pgmask = spapr_iommu_fixmask(pgmask, sps, page_size_mask,
>> +                                         masks[j].shift, masks[j].mask);
>> +        }
>> +    }
> 
> The function of this is kind of unclear.  I'm assuming this is
> filtering the supported page sizes reported by the PHB by the possible
> page sizes based on host page size or other constraints.  Is that
> right?
> 
> I think you'd be better off folding the whole double loop into the
> fixmask function.
> 
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +    rtas_st(rets, 1, windows_available);
>> +    /* Return maximum number as all RAM was 4K pages */
>> +    rtas_st(rets, 2, ram_size >> SPAPR_TCE_PAGE_SHIFT);
> 
> I'm assuming this is the allowed size of the dynamic windows.
> Shouldn't that be reported by a PHB callback, rather than hardcoded
> here?

Why the PHB? This is DMA memory. @ram_size is the upper limit; we can allow
more only when we have memory hotplug (which we do not have), and the guest
can create smaller windows if it wants, so I do not really follow you here.


> 
>> +    rtas_st(rets, 3, pgmask);
>> +    rtas_st(rets, 4, pgmask); /* DMA migration mask */
>> +    return;
>> +
>> +hw_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
>> +                                          sPAPREnvironment *spapr,
>> +                                          uint32_t token, uint32_t nargs,
>> +                                          target_ulong args,
>> +                                          uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    sPAPRPHBClass *spc;
>> +    sPAPRTCETable *tcet = NULL;
>> +    uint32_t addr, page_shift, window_shift, liobn;
>> +    uint64_t buid;
>> +    long ret;
>> +
>> +    if ((nargs != 5) || (nret != 4)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
>> +    if (!spc->ddw_create) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    page_shift = rtas_ld(args, 3);
>> +    window_shift = rtas_ld(args, 4);
>> +    liobn = sphb->dma_liobn + 0x10000;
> 
> Isn't using a fixed LIOBN here assuming you can only have a single DDW
> per PHB?  That's true for now, but in theory shouldn't it be reported
> by the PHB code itself?


This should be a unique LIOBN, so it is not up to the PHB to choose. And we
cannot make it completely random for migration purposes. I'll make it
something like

#define SPAPR_DDW_LIOBN(sphb, windownum) ((sphb)->dma_liobn | (windownum))
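
(So the create handler would then pick something like
liobn = SPAPR_DDW_LIOBN(sphb, 1); for the first dynamic window; the
window numbering here is just for illustration.)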





-- 
Alexey

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 02/10] spapr_iommu: Disable in-kernel IOMMU tables for >4GB windows
  2014-08-12  1:17   ` David Gibson
@ 2014-08-12  7:32     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-08-12  7:32 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On 08/12/2014 11:17 AM, David Gibson wrote:
> On Thu, Jul 31, 2014 at 07:34:06PM +1000, Alexey Kardashevskiy wrote:
>> The existing KVM_CREATE_SPAPR_TCE ioctl only supports 4GB windows max.
>> We are going to add huge DMA window support, so this will create a small
>> window and unexpectedly fail later.
> 
> I'm not entirely clear on what you're saying here.  Are you saying
> that the kernel interface silently truncates a window > 4G, rather
> than failing?
> 
> If so, that's a kernel bug which should be addressed - obviously we'd
> still need this as a workaround for older kernels, but it should be
> treated as a workaround, not as the real fix.


This is an RFC patchset and I have a KVM_CREATE_SPAPR_TCE_64 patch for the
kernel, but since we are still deciding whether to allocate these tables in
userspace or not, I have not posted it.


>> This disables KVM_CREATE_SPAPR_TCE for windows bigger than 4GB. Since
>> those windows are normally mapped at boot time, there will be no
>> performance impact.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  hw/ppc/spapr_iommu.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>> index f6e32a4..36f5d27 100644
>> --- a/hw/ppc/spapr_iommu.c
>> +++ b/hw/ppc/spapr_iommu.c
>> @@ -113,11 +113,11 @@ static MemoryRegionIOMMUOps spapr_iommu_ops = {
>>  static int spapr_tce_table_realize(DeviceState *dev)
>>  {
>>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
>> +    uint64_t window_size = tcet->nb_table << tcet->page_shift;
>>  
>> -    if (kvm_enabled()) {
>> +    if (kvm_enabled() && !(window_size >> 32)) {
>>          tcet->table = kvmppc_create_spapr_tce(tcet->liobn,
>> -                                              tcet->nb_table <<
>> -                                              tcet->page_shift,
>> +                                              window_size,
>>                                                &tcet->fd,
>>                                                tcet->vfio_accel);
>>      }
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 08/10] spapr_pci: Enable DDW
  2014-08-12  3:59           ` Alexey Kardashevskiy
@ 2014-08-12  9:36             ` Alexander Graf
  0 siblings, 0 replies; 55+ messages in thread
From: Alexander Graf @ 2014-08-12  9:36 UTC (permalink / raw)
  To: Alexey Kardashevskiy, qemu-devel; +Cc: Alex Williamson, qemu-ppc


On 12.08.14 05:59, Alexey Kardashevskiy wrote:
> On 08/12/2014 10:13 AM, Alexey Kardashevskiy wrote:
>> On 08/12/2014 03:29 AM, Alexander Graf wrote:
>>> On 11.08.14 17:26, Alexey Kardashevskiy wrote:
>>>> On 08/11/2014 09:59 PM, Alexander Graf wrote:
>>>>> On 31.07.14 11:34, Alexey Kardashevskiy wrote:
>>>>>> This implements DDW for emulated PHB.
>>>>>>
>>>>>> This advertises DDW in device tree.
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>> ---
>>>>>>
>>>>>> The DDW has not been tested as QEMU does not implement any 64bit DMA
>>>>>> capable
>>>>>> device and existing linux guests do not use DDW for 32bit DMA.
>>>>> Can't you just add the pci config space bit for it to the e1000 emulation?
>>>> Sorry, I am not following you here. What bit in config space can enable
>>>> 64bit DMA?
>>> Apparently there's nothing at all required. The igb driver simply tries to
>>> use 64bit DMA masks.
>> A driver should use 64bit addresses (unsigned long, u64) for DMA, not 32bit
>> (unsigned, u32).
>>
>>
>>>> I tried patching the guest driver, that did not work so I did not dig
>>>> further.
>>> Which driver did you try it with?
>>
>> drivers/net/ethernet/intel/e1000/e1000_main.c
>>
>> I looked again, the driver uses 64bit DMA if it is a PCI-X-capable
>> adapter, which the e1000 from QEMU is not.

Does it decide this only based on the pci id? If so, fake it.

>>
>>
>>>
>>>>> That one should be pretty safe, no?
>>>>>
>>>>>> ---
>>>>>>     hw/ppc/spapr_pci.c          | 65
>>>>>> +++++++++++++++++++++++++++++++++++++++++++++
>>>>>>     include/hw/pci-host/spapr.h |  5 ++++
>>>>>>     2 files changed, 70 insertions(+)
>>>>>>
>>>>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>>>>> index 230b59c..d1f4c86 100644
>>>>>> --- a/hw/ppc/spapr_pci.c
>>>>>> +++ b/hw/ppc/spapr_pci.c
>>>>>> @@ -22,6 +22,7 @@
>>>>>>      * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
>>>>>> DEALINGS IN
>>>>>>      * THE SOFTWARE.
>>>>>>      */
>>>>>> +#include "sysemu/sysemu.h"
>>>>>>     #include "hw/hw.h"
>>>>>>     #include "hw/pci/pci.h"
>>>>>>     #include "hw/pci/msi.h"
>>>>>> @@ -650,6 +651,8 @@ static void spapr_phb_finish_realize(sPAPRPHBState
>>>>>> *sphb, Error **errp)
>>>>>>         /* Register default 32bit DMA window */
>>>>>>         memory_region_add_subregion(&sphb->iommu_root, 0,
>>>>>>                                     spapr_tce_get_iommu(tcet));
>>>>>> +
>>>>>> +    sphb->ddw_supported = true;
>>>>> Unconditionally?
>>>> Yes. Why not? I cannot think of any case when we would not want this. In
>>>> practice there is very little chance it will ever be used anyway :) There
>>>> is still a machine option to disable it completely.
>>>>
>>>>
>>>>> Also, can't you make the ddw enable/disable flow go set-only? Basically
>>>>> have the flag in the machine struct if you must, but then on every PHB
>>>>> instantiation you set a QOM property that sets ddw_supported respectively?
>>>> Uff. Very confusing review comments today :)
>>>>
>>>> For VFIO, ddw_supported comes from the host kernel and totally depends on
>>>> hardware.
>>>>
>>>> For emulated, there is just one emulated PHB (yes, can be many but no one
>>>> seems to be using more in reality) and what you suggest seems to be too
>>>> complicated.
>>>>
>>>> This DDW thing - it is not really dynamic in the way it is used by the
>>>> existing linux guest. At the boot time the guest driver looks at DMA mask
>>>> and only if it is >32bit, it creates DDW, once, and after that the window
>>>> remains active while the guest is running.
>>> What I'm asking is that rather than having
>>>
>>>    if (machine->ddw_enabled && phb->ddw_supported)
>>>
>>> to instead only have
>>>
>>>    if (phb->ddw_enabled)
>>>
>>> which gets set by the machine to true if machine->ddw_enabled. If you make
>>> it a qom property you can control the setter, so you can at the point when
>>> the machine wants to set it also ignore the set to true if your vfio
>>> implementation doesn't support ddw, leaving ddw_enabled as false.
>>>
>>>>
>>>>> Also keep in mind that we will have to at least disable ddw by default for
>>>>> existing machine types to maintain backwards compatibility.
>>>> Where exactly does the default setting "on" break in compatibility?
>>> Different device tree? Different return values on rtas calls? These are
>>> guest visible changes, so in theory we would have to make sure we don't
>>> change any of them.
>>>
>>> Of course we can always consciously declare them as unimportant enough that
>>> they in reality shouldn't have side effects we care about for hot and live
>>> migration, but there'd have to be a good reasoning on why we shouldn't have
>>> it disabled rather than why we should have backwards compatibility.
>> "hot" migration? What is that? :)
>>
>> There is a machine option to disable it and migrate to older guest (which
>> we do not support afair or do we?). If we migrate to newer QEMU, these DDW
>> tokens will be missing in the destination guest's tree and DDW won't be
>> used, everybody is happy. I really fail to see a scenario when I would not
>> use DDW...
> Ok, Paul explained. So by default "ddw" must be off for the pseries-2.0
> machine and on for pseries-2.2 and we'll be fine, right?

Yes.


Alex

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: Enable DDW
  2014-08-12  0:03         ` Alexey Kardashevskiy
@ 2014-08-12  9:37           ` Alexander Graf
  2014-08-12 15:10             ` Alexey Kardashevskiy
  0 siblings, 1 reply; 55+ messages in thread
From: Alexander Graf @ 2014-08-12  9:37 UTC (permalink / raw)
  To: Alexey Kardashevskiy, qemu-devel; +Cc: Alex Williamson, qemu-ppc


On 12.08.14 02:03, Alexey Kardashevskiy wrote:
> On 08/12/2014 03:30 AM, Alexander Graf wrote:
>> On 11.08.14 17:01, Alexey Kardashevskiy wrote:
>>> On 08/11/2014 10:02 PM, Alexander Graf wrote:
>>>> On 31.07.14 11:34, Alexey Kardashevskiy wrote:
>>>>> This implements DDW for VFIO. Host kernel support is required for this.
>>>>>
>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>> ---
>>>>>     hw/ppc/spapr_pci_vfio.c | 75
>>>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>     1 file changed, 75 insertions(+)
>>>>>
>>>>> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
>>>>> index d3bddf2..dc443e2 100644
>>>>> --- a/hw/ppc/spapr_pci_vfio.c
>>>>> +++ b/hw/ppc/spapr_pci_vfio.c
>>>>> @@ -69,6 +69,77 @@ static void
>>>>> spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
>>>>>         /* Register default 32bit DMA window */
>>>>>         memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
>>>>>                                     spapr_tce_get_iommu(tcet));
>>>>> +
>>>>> +    sphb->ddw_supported = !!(info.flags & VFIO_IOMMU_SPAPR_TCE_FLAG_DDW);
>>>>> +}
>>>>> +
>>>>> +static int spapr_pci_vfio_ddw_query(sPAPRPHBState *sphb,
>>>>> +                                    uint32_t *windows_available,
>>>>> +                                    uint32_t *page_size_mask)
>>>>> +{
>>>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>>>> +    struct vfio_iommu_spapr_tce_query query = { .argsz = sizeof(query) };
>>>>> +    int ret;
>>>>> +
>>>>> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
>>>>> +                               VFIO_IOMMU_SPAPR_TCE_QUERY, &query);
>>>>> +    if (ret) {
>>>>> +        return ret;
>>>>> +    }
>>>>> +
>>>>> +    *windows_available = query.windows_available;
>>>>> +    *page_size_mask = query.page_size_mask;
>>>>> +
>>>>> +    return ret;
>>>>> +}
>>>>> +
>>>>> +static int spapr_pci_vfio_ddw_create(sPAPRPHBState *sphb, uint32_t
>>>>> page_shift,
>>>>> +                                     uint32_t window_shift, uint32_t
>>>>> liobn,
>>>>> +                                     sPAPRTCETable **ptcet)
>>>>> +{
>>>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>>>> +    struct vfio_iommu_spapr_tce_create create = {
>>>>> +        .argsz = sizeof(create),
>>>>> +        .page_shift = page_shift,
>>>>> +        .window_shift = window_shift,
>>>>> +        .start_addr = 0
>>>>> +    };
>>>>> +    int ret;
>>>>> +
>>>>> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
>>>>> +                               VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>>>>> +    if (ret) {
>>>>> +        return ret;
>>>>> +    }
>>>>> +
>>>>> +    *ptcet = spapr_tce_new_table(DEVICE(sphb), liobn, create.start_addr,
>>>>> +                                 page_shift, 1 << (window_shift -
>>>>> page_shift),
>>>> I spot a 1 without ULL again - this time it might work out ok, but please
>>>> just always use ULL when you pass around addresses.
>>> My bad. I keep forgetting this, I'll adjust my own checkpatch.py :)
>>>
>>>
>>>> Please walk me though the abstraction levels on what each page size
>>>> honoration means. If I use THP, what page size granularity can I use for
>>>> TCE entries?
>>> [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls
>>> support
>>>
>>> +        const struct { int shift; uint32_t mask; } masks[] = {
>>> +            { 12, DDW_PGSIZE_4K },
>>> +            { 16, DDW_PGSIZE_64K },
>>> +            { 24, DDW_PGSIZE_16M },
>>> +            { 25, DDW_PGSIZE_32M },
>>> +            { 26, DDW_PGSIZE_64M },
>>> +            { 27, DDW_PGSIZE_128M },
>>> +            { 28, DDW_PGSIZE_256M },
>>> +            { 34, DDW_PGSIZE_16G },
>>> +        };
>>>
>>>
>>> Supported page sizes are returned by the host kernel via "query". For 16MB
>>> pages, page shift will return DDW_PGSIZE_4K|DDW_PGSIZE_64K|DDW_PGSIZE_16M.
>>> Or I did not understand the question...
>> Why do we care about the sizes? Anything bigger than what we support should
>> always work, no? What happens if the guest creates a 16MB map but my pages
>> are 4kb mapped? Wouldn't the same logic be able to deal with 16G pages?
> It is DMA memory, if I split "virtual" 16M page to a bunch of real 4K
> pages, I have to make sure these 16M are continuous - there will be one TCE
> entry for it and no more translations besides IOMMU. What do I miss now?

Who does the shadow translation where? Does it exist at all?


Alex

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: Enable DDW
  2014-08-12  9:37           ` Alexander Graf
@ 2014-08-12 15:10             ` Alexey Kardashevskiy
  2014-08-12 15:28               ` Alexander Graf
  0 siblings, 1 reply; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-08-12 15:10 UTC (permalink / raw)
  To: Alexander Graf, qemu-devel; +Cc: Alex Williamson, qemu-ppc

On 08/12/2014 07:37 PM, Alexander Graf wrote:
> 
> On 12.08.14 02:03, Alexey Kardashevskiy wrote:
>> On 08/12/2014 03:30 AM, Alexander Graf wrote:
>>> On 11.08.14 17:01, Alexey Kardashevskiy wrote:
>>>> On 08/11/2014 10:02 PM, Alexander Graf wrote:
>>>>> On 31.07.14 11:34, Alexey Kardashevskiy wrote:
>>>>>> This implements DDW for VFIO. Host kernel support is required for this.
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>> ---
>>>>>>     hw/ppc/spapr_pci_vfio.c | 75
>>>>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>     1 file changed, 75 insertions(+)
>>>>>>
>>>>>> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
>>>>>> index d3bddf2..dc443e2 100644
>>>>>> --- a/hw/ppc/spapr_pci_vfio.c
>>>>>> +++ b/hw/ppc/spapr_pci_vfio.c
>>>>>> @@ -69,6 +69,77 @@ static void
>>>>>> spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
>>>>>>         /* Register default 32bit DMA window */
>>>>>>         memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
>>>>>>                                     spapr_tce_get_iommu(tcet));
>>>>>> +
>>>>>> +    sphb->ddw_supported = !!(info.flags &
>>>>>> VFIO_IOMMU_SPAPR_TCE_FLAG_DDW);
>>>>>> +}
>>>>>> +
>>>>>> +static int spapr_pci_vfio_ddw_query(sPAPRPHBState *sphb,
>>>>>> +                                    uint32_t *windows_available,
>>>>>> +                                    uint32_t *page_size_mask)
>>>>>> +{
>>>>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>>>>> +    struct vfio_iommu_spapr_tce_query query = { .argsz =
>>>>>> sizeof(query) };
>>>>>> +    int ret;
>>>>>> +
>>>>>> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
>>>>>> +                               VFIO_IOMMU_SPAPR_TCE_QUERY, &query);
>>>>>> +    if (ret) {
>>>>>> +        return ret;
>>>>>> +    }
>>>>>> +
>>>>>> +    *windows_available = query.windows_available;
>>>>>> +    *page_size_mask = query.page_size_mask;
>>>>>> +
>>>>>> +    return ret;
>>>>>> +}
>>>>>> +
>>>>>> +static int spapr_pci_vfio_ddw_create(sPAPRPHBState *sphb, uint32_t
>>>>>> page_shift,
>>>>>> +                                     uint32_t window_shift, uint32_t
>>>>>> liobn,
>>>>>> +                                     sPAPRTCETable **ptcet)
>>>>>> +{
>>>>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>>>>> +    struct vfio_iommu_spapr_tce_create create = {
>>>>>> +        .argsz = sizeof(create),
>>>>>> +        .page_shift = page_shift,
>>>>>> +        .window_shift = window_shift,
>>>>>> +        .start_addr = 0
>>>>>> +    };
>>>>>> +    int ret;
>>>>>> +
>>>>>> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
>>>>>> +                               VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>>>>>> +    if (ret) {
>>>>>> +        return ret;
>>>>>> +    }
>>>>>> +
>>>>>> +    *ptcet = spapr_tce_new_table(DEVICE(sphb), liobn,
>>>>>> create.start_addr,
>>>>>> +                                 page_shift, 1 << (window_shift -
>>>>>> page_shift),
>>>>> I spot a 1 without ULL again - this time it might work out ok, but please
>>>>> just always use ULL when you pass around addresses.
>>>> My bad. I keep forgetting this, I'll adjust my own checkpatch.py :)
>>>>
>>>>
>>>>> Please walk me though the abstraction levels on what each page size
>>>>> honoration means. If I use THP, what page size granularity can I use for
>>>>> TCE entries?
>>>> [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls
>>>> support
>>>>
>>>> +        const struct { int shift; uint32_t mask; } masks[] = {
>>>> +            { 12, DDW_PGSIZE_4K },
>>>> +            { 16, DDW_PGSIZE_64K },
>>>> +            { 24, DDW_PGSIZE_16M },
>>>> +            { 25, DDW_PGSIZE_32M },
>>>> +            { 26, DDW_PGSIZE_64M },
>>>> +            { 27, DDW_PGSIZE_128M },
>>>> +            { 28, DDW_PGSIZE_256M },
>>>> +            { 34, DDW_PGSIZE_16G },
>>>> +        };
>>>>
>>>>
>>>> Supported page sizes are returned by the host kernel via "query". For 16MB
>>>> pages, page shift will return DDW_PGSIZE_4K|DDW_PGSIZE_64K|DDW_PGSIZE_16M.
>>>> Or I did not understand the question...
>>> Why do we care about the sizes? Anything bigger than what we support should
>>> always work, no? What happens if the guest creates a 16MB map but my pages
>>> are 4kb mapped? Wouldn't the same logic be able to deal with 16G pages?
>> It is DMA memory, if I split "virtual" 16M page to a bunch of real 4K
>> pages, I have to make sure these 16M are continuous - there will be one TCE
>> entry for it and no more translations besides IOMMU. What do I miss now?
> 
> Who does the shadow translation where? Does it exist at all?

IOMMU? I am not sure I am following you... This IOMMU will look as direct
DMA for the guest but the real IOMMU table is sparse and it is populated
via a bunch of H_PUT_TCE calls as the default small window.

There is a direct mapping in the host called "bypass window" but it is not
used here as sPAPR does not define that for paravirtualization.
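
To sketch the guest side of it (hypothetical helper names, not the real
pseries code): once the guest has created the 64bit window, it does
roughly this once at boot,

    /* h_put_tce() stands in for the H_PUT_TCE/H_PUT_TCE_INDIRECT
     * hypercall wrappers; one TCE per IOMMU page of guest RAM */
    for (addr = 0; addr < guest_ram_size; addr += 1UL << page_shift) {
        h_put_tce(liobn, window_base + addr, addr | TCE_RW_FLAGS);
    }
    /* after that DMA goes straight through, no more hypercalls */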


-- 
Alexey

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: Enable DDW
  2014-08-12 15:10             ` Alexey Kardashevskiy
@ 2014-08-12 15:28               ` Alexander Graf
  2014-08-13  0:18                 ` Alexey Kardashevskiy
  0 siblings, 1 reply; 55+ messages in thread
From: Alexander Graf @ 2014-08-12 15:28 UTC (permalink / raw)
  To: Alexey Kardashevskiy, qemu-devel; +Cc: Alex Williamson, qemu-ppc


On 12.08.14 17:10, Alexey Kardashevskiy wrote:
> On 08/12/2014 07:37 PM, Alexander Graf wrote:
>> On 12.08.14 02:03, Alexey Kardashevskiy wrote:
>>> On 08/12/2014 03:30 AM, Alexander Graf wrote:
>>>> On 11.08.14 17:01, Alexey Kardashevskiy wrote:
>>>>> On 08/11/2014 10:02 PM, Alexander Graf wrote:
>>>>>> On 31.07.14 11:34, Alexey Kardashevskiy wrote:
>>>>>>> This implements DDW for VFIO. Host kernel support is required for this.
>>>>>>>
>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>> ---
>>>>>>>      hw/ppc/spapr_pci_vfio.c | 75
>>>>>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>      1 file changed, 75 insertions(+)
>>>>>>>
>>>>>>> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
>>>>>>> index d3bddf2..dc443e2 100644
>>>>>>> --- a/hw/ppc/spapr_pci_vfio.c
>>>>>>> +++ b/hw/ppc/spapr_pci_vfio.c
>>>>>>> @@ -69,6 +69,77 @@ static void
>>>>>>> spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
>>>>>>>          /* Register default 32bit DMA window */
>>>>>>>          memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
>>>>>>>                                      spapr_tce_get_iommu(tcet));
>>>>>>> +
>>>>>>> +    sphb->ddw_supported = !!(info.flags &
>>>>>>> VFIO_IOMMU_SPAPR_TCE_FLAG_DDW);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static int spapr_pci_vfio_ddw_query(sPAPRPHBState *sphb,
>>>>>>> +                                    uint32_t *windows_available,
>>>>>>> +                                    uint32_t *page_size_mask)
>>>>>>> +{
>>>>>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>>>>>> +    struct vfio_iommu_spapr_tce_query query = { .argsz =
>>>>>>> sizeof(query) };
>>>>>>> +    int ret;
>>>>>>> +
>>>>>>> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
>>>>>>> +                               VFIO_IOMMU_SPAPR_TCE_QUERY, &query);
>>>>>>> +    if (ret) {
>>>>>>> +        return ret;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    *windows_available = query.windows_available;
>>>>>>> +    *page_size_mask = query.page_size_mask;
>>>>>>> +
>>>>>>> +    return ret;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static int spapr_pci_vfio_ddw_create(sPAPRPHBState *sphb, uint32_t
>>>>>>> page_shift,
>>>>>>> +                                     uint32_t window_shift, uint32_t
>>>>>>> liobn,
>>>>>>> +                                     sPAPRTCETable **ptcet)
>>>>>>> +{
>>>>>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>>>>>> +    struct vfio_iommu_spapr_tce_create create = {
>>>>>>> +        .argsz = sizeof(create),
>>>>>>> +        .page_shift = page_shift,
>>>>>>> +        .window_shift = window_shift,
>>>>>>> +        .start_addr = 0
>>>>>>> +    };
>>>>>>> +    int ret;
>>>>>>> +
>>>>>>> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
>>>>>>> +                               VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>>>>>>> +    if (ret) {
>>>>>>> +        return ret;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    *ptcet = spapr_tce_new_table(DEVICE(sphb), liobn,
>>>>>>> create.start_addr,
>>>>>>> +                                 page_shift, 1 << (window_shift -
>>>>>>> page_shift),
>>>>>> I spot a 1 without ULL again - this time it might work out ok, but please
>>>>>> just always use ULL when you pass around addresses.
>>>>> My bad. I keep forgetting this, I'll adjust my own checkpatch.py :)
>>>>>
>>>>>
>>>>>> Please walk me though the abstraction levels on what each page size
>>>>>> honoration means. If I use THP, what page size granularity can I use for
>>>>>> TCE entries?
>>>>> [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls
>>>>> support
>>>>>
>>>>> +        const struct { int shift; uint32_t mask; } masks[] = {
>>>>> +            { 12, DDW_PGSIZE_4K },
>>>>> +            { 16, DDW_PGSIZE_64K },
>>>>> +            { 24, DDW_PGSIZE_16M },
>>>>> +            { 25, DDW_PGSIZE_32M },
>>>>> +            { 26, DDW_PGSIZE_64M },
>>>>> +            { 27, DDW_PGSIZE_128M },
>>>>> +            { 28, DDW_PGSIZE_256M },
>>>>> +            { 34, DDW_PGSIZE_16G },
>>>>> +        };
>>>>>
>>>>>
>>>>> Supported page sizes are returned by the host kernel via "query". For 16MB
>>>>> pages, page shift will return DDW_PGSIZE_4K|DDW_PGSIZE_64K|DDW_PGSIZE_16M.
>>>>> Or I did not understand the question...
>>>> Why do we care about the sizes? Anything bigger than what we support should
>>>> always work, no? What happens if the guest creates a 16MB map but my pages
>>>> are 4kb mapped? Wouldn't the same logic be able to deal with 16G pages?
>>> It is DMA memory, if I split "virtual" 16M page to a bunch of real 4K
>>> pages, I have to make sure these 16M are continuous - there will be one TCE
>>> entry for it and no more translations besides IOMMU. What do I miss now?
>> Who does the shadow translation where? Does it exist at all?
> IOMMU? I am not sure I am following you... This IOMMU will look as direct
> DMA for the guest but the real IOMMU table is sparse and it is populated
> via a bunch of H_PUT_TCE calls as the default small window.
>
> There is a direct mapping in the host called "bypass window" but it is not
> used here as sPAPR does not define that for paravirtualization.

Ok, imagine I have 16MB of guest physical memory that is in reality 
backed by 256 64k pages on the host. The guest wants to create a 16M TCE 
entry for this (from its point of view contiguous) chunk of memory.

Do we allow this? Or do we force the guest to create 64k TCE entries?

If we allow it, why would we ever put any restriction at the upper end 
of TCE entry sizes? If we already implement enough logic to map things 
lazily around, we could as well have the guest create a 256M TCE entry 
and just split it on the host view to 64k TCE entries.
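
(Each 256M guest TCE would then fan out into 256M / 64k = 4096
host-side 64k entries.)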


Alex

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: Enable DDW
  2014-08-12 15:28               ` Alexander Graf
@ 2014-08-13  0:18                 ` Alexey Kardashevskiy
  2014-08-14 13:38                   ` Alexander Graf
  0 siblings, 1 reply; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-08-13  0:18 UTC (permalink / raw)
  To: Alexander Graf, qemu-devel; +Cc: Alex Williamson, qemu-ppc

On 08/13/2014 01:28 AM, Alexander Graf wrote:
> 
> On 12.08.14 17:10, Alexey Kardashevskiy wrote:
>> On 08/12/2014 07:37 PM, Alexander Graf wrote:
>>> On 12.08.14 02:03, Alexey Kardashevskiy wrote:
>>>> On 08/12/2014 03:30 AM, Alexander Graf wrote:
>>>>> On 11.08.14 17:01, Alexey Kardashevskiy wrote:
>>>>>> On 08/11/2014 10:02 PM, Alexander Graf wrote:
>>>>>>> On 31.07.14 11:34, Alexey Kardashevskiy wrote:
>>>>>>>> This implements DDW for VFIO. Host kernel support is required for
>>>>>>>> this.
>>>>>>>>
>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>> ---
>>>>>>>>      hw/ppc/spapr_pci_vfio.c | 75
>>>>>>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>      1 file changed, 75 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
>>>>>>>> index d3bddf2..dc443e2 100644
>>>>>>>> --- a/hw/ppc/spapr_pci_vfio.c
>>>>>>>> +++ b/hw/ppc/spapr_pci_vfio.c
>>>>>>>> @@ -69,6 +69,77 @@ static void
>>>>>>>> spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
>>>>>>>>          /* Register default 32bit DMA window */
>>>>>>>>          memory_region_add_subregion(&sphb->iommu_root,
>>>>>>>> tcet->bus_offset,
>>>>>>>>                                      spapr_tce_get_iommu(tcet));
>>>>>>>> +
>>>>>>>> +    sphb->ddw_supported = !!(info.flags &
>>>>>>>> VFIO_IOMMU_SPAPR_TCE_FLAG_DDW);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static int spapr_pci_vfio_ddw_query(sPAPRPHBState *sphb,
>>>>>>>> +                                    uint32_t *windows_available,
>>>>>>>> +                                    uint32_t *page_size_mask)
>>>>>>>> +{
>>>>>>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>>>>>>> +    struct vfio_iommu_spapr_tce_query query = { .argsz =
>>>>>>>> sizeof(query) };
>>>>>>>> +    int ret;
>>>>>>>> +
>>>>>>>> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
>>>>>>>> +                               VFIO_IOMMU_SPAPR_TCE_QUERY, &query);
>>>>>>>> +    if (ret) {
>>>>>>>> +        return ret;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    *windows_available = query.windows_available;
>>>>>>>> +    *page_size_mask = query.page_size_mask;
>>>>>>>> +
>>>>>>>> +    return ret;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static int spapr_pci_vfio_ddw_create(sPAPRPHBState *sphb, uint32_t
>>>>>>>> page_shift,
>>>>>>>> +                                     uint32_t window_shift, uint32_t
>>>>>>>> liobn,
>>>>>>>> +                                     sPAPRTCETable **ptcet)
>>>>>>>> +{
>>>>>>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>>>>>>> +    struct vfio_iommu_spapr_tce_create create = {
>>>>>>>> +        .argsz = sizeof(create),
>>>>>>>> +        .page_shift = page_shift,
>>>>>>>> +        .window_shift = window_shift,
>>>>>>>> +        .start_addr = 0
>>>>>>>> +    };
>>>>>>>> +    int ret;
>>>>>>>> +
>>>>>>>> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
>>>>>>>> +                               VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>>>>>>>> +    if (ret) {
>>>>>>>> +        return ret;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    *ptcet = spapr_tce_new_table(DEVICE(sphb), liobn,
>>>>>>>> create.start_addr,
>>>>>>>> +                                 page_shift, 1 << (window_shift -
>>>>>>>> page_shift),
>>>>>>> I spot a 1 without ULL again - this time it might work out ok, but
>>>>>>> please
>>>>>>> just always use ULL when you pass around addresses.
>>>>>> My bad. I keep forgetting this, I'll adjust my own checkpatch.py :)
>>>>>>
>>>>>>
>>>>>>> Please walk me though the abstraction levels on what each page size
>>>>>>> honoration means. If I use THP, what page size granularity can I use
>>>>>>> for
>>>>>>> TCE entries?
>>>>>> [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls
>>>>>> support
>>>>>>
>>>>>> +        const struct { int shift; uint32_t mask; } masks[] = {
>>>>>> +            { 12, DDW_PGSIZE_4K },
>>>>>> +            { 16, DDW_PGSIZE_64K },
>>>>>> +            { 24, DDW_PGSIZE_16M },
>>>>>> +            { 25, DDW_PGSIZE_32M },
>>>>>> +            { 26, DDW_PGSIZE_64M },
>>>>>> +            { 27, DDW_PGSIZE_128M },
>>>>>> +            { 28, DDW_PGSIZE_256M },
>>>>>> +            { 34, DDW_PGSIZE_16G },
>>>>>> +        };
>>>>>>
>>>>>>
>>>>>> Supported page sizes are returned by the host kernel via "query". For
>>>>>> 16MB
>>>>>> pages, page shift will return
>>>>>> DDW_PGSIZE_4K|DDW_PGSIZE_64K|DDW_PGSIZE_16M.
>>>>>> Or I did not understand the question...
>>>>> Why do we care about the sizes? Anything bigger than what we support
>>>>> should
>>>>> always work, no? What happens if the guest creates a 16MB map but my
>>>>> pages
>>>>> are 4kb mapped? Wouldn't the same logic be able to deal with 16G pages?
>>>> It is DMA memory, if I split "virtual" 16M page to a bunch of real 4K
>>>> pages, I have to make sure these 16M are continuous - there will be one
>>>> TCE
>>>> entry for it and no more translations besides IOMMU. What do I miss now?
>>> Who does the shadow translation where? Does it exist at all?
>> IOMMU? I am not sure I am following you... This IOMMU will look as direct
>> DMA for the guest but the real IOMMU table is sparse and it is populated
>> via a bunch of H_PUT_TCE calls as the default small window.
>>
>> There is a direct mapping in the host called "bypass window" but it is not
>> used here as sPAPR does not define that for paravirtualization.
> 
> Ok, imagine I have 16MB of guest physical memory that is in reality backed
> by 256 64k pages on the host. The guest wants to create a 16M TCE entry for
> this (from its point of view contiguous) chunk of memory.
> 
> Do we allow this?

No, we do not. We tell the guest what it can use.

> Or do we force the guest to create 64k TCE entries?

16MB TCE pages are only allowed if qemu is running with hugepages.


> If we allow it, why would we ever put any restriction at the upper end of
> TCE entry sizes? If we already implement enough logic to map things lazily
> around, we could as well have the guest create a 256M TCE entry and just
> split it on the host view to 64k TCE entries.

Oh, thiiiiiis is what you meant...

Well, we could, just for now current linux guests support 4K/64K/16M only
and they choose depending on what hypervisor supports - look at
enable_ddw() in the guest. What you suggest seems to be an unnecessary code
duplication for 16MB pages case. For bigger page sizes - for example, for
64GB guest, a TCE table with 16MB TCEs will be 32KB which is already
awesome enough, no?
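(That is 64GB / 16MB = 4096 TCEs at 8 bytes each, i.e. 32KB for the whole
table.)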




-- 
Alexey

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 05/10] linux headers update for DDW
  2014-08-12  7:16     ` Alexey Kardashevskiy
@ 2014-08-13  3:23       ` David Gibson
  0 siblings, 0 replies; 55+ messages in thread
From: David Gibson @ 2014-08-13  3:23 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

[-- Attachment #1: Type: text/plain, Size: 735 bytes --]

On Tue, Aug 12, 2014 at 05:16:44PM +1000, Alexey Kardashevskiy wrote:
> On 08/12/2014 11:20 AM, David Gibson wrote:
> > On Thu, Jul 31, 2014 at 07:34:09PM +1000, Alexey Kardashevskiy wrote:
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > 
> > It would be nice for your commit message to state exactly what kernel
> > version you pulled these updated headers in from.
> 
> There is no version, this is a "RFC" patchset and kernel changes did not
> reach upstream yet :)

Ok, the commit message should mention that.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls support
  2014-08-12  7:25     ` Alexey Kardashevskiy
@ 2014-08-13  3:27       ` David Gibson
  2014-08-14  8:29         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 55+ messages in thread
From: David Gibson @ 2014-08-13  3:27 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

[-- Attachment #1: Type: text/plain, Size: 3499 bytes --]

On Tue, Aug 12, 2014 at 05:25:29PM +1000, Alexey Kardashevskiy wrote:
> On 08/12/2014 11:45 AM, David Gibson wrote:
> > On Thu, Jul 31, 2014 at 07:34:10PM +1000, Alexey Kardashevskiy
> wrote:
[snip]
> > The function of this is kind of unclear.  I'm assuming this is
> > filtering the supported page sizes reported by the PHB by the possible
> > page sizes based on host page size or other constraints.  Is that
> > right?
> > 
> > I think you'd be better off folding the whole double loop into the
> > fixmask function.
> > 
> >> +
> >> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> >> +    rtas_st(rets, 1, windows_available);
> >> +    /* Return maximum number as all RAM was 4K pages */
> >> +    rtas_st(rets, 2, ram_size >> SPAPR_TCE_PAGE_SHIFT);
> > 
> > I'm assuming this is the allowed size of the dynamic windows.
> > Shouldn't that be reported by a PHB callback, rather than hardcoded
> > here?
> 
> Why PHB? This is DMA memory. @ram_size is the upper limit, we can make more
> only when we have memory hotplug (which we do not have) and the guest can
> create smaller windows if it wants so I do not really follow you here.

What I'm not clear on is what this RTAS return actually means.  Is it
saying the maximum size of the DMA window, or the maximum address
which can be mapped by that window?  Remember I don't have access to
PAPR documentation any more - nor do others reading these patches.

[snip]
> >> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
> >> +                                          sPAPREnvironment *spapr,
> >> +                                          uint32_t token, uint32_t nargs,
> >> +                                          target_ulong args,
> >> +                                          uint32_t nret, target_ulong rets)
> >> +{
> >> +    sPAPRPHBState *sphb;
> >> +    sPAPRPHBClass *spc;
> >> +    sPAPRTCETable *tcet = NULL;
> >> +    uint32_t addr, page_shift, window_shift, liobn;
> >> +    uint64_t buid;
> >> +    long ret;
> >> +
> >> +    if ((nargs != 5) || (nret != 4)) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> >> +    addr = rtas_ld(args, 0);
> >> +    sphb = spapr_pci_find_phb(spapr, buid);
> >> +    if (!sphb) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
> >> +    if (!spc->ddw_create) {
> >> +        goto hw_error_exit;
> >> +    }
> >> +
> >> +    page_shift = rtas_ld(args, 3);
> >> +    window_shift = rtas_ld(args, 4);
> >> +    liobn = sphb->dma_liobn + 0x10000;
> > 
> > Isn't using a fixed LIOBN here assuming you can only have a single DDW
> > per PHB?  That's true for now, but in theory shouldn't it be reported
> > by the PHB code itself?
> 
> 
> This should be a unique LIOBN so it is not up to PHB to choose. And we
> cannot make it completely random for migration purposes. I'll make it
> something like
> 
> #define SPAPR_DDW_LIOBN(sphb, windownum) ((sphb)->dma_liobn | windownum)

Ok.

Really, the assigned liobns should be included in the migration stream
if they're not already.  Relying on them being set consistently
at startup is going to be really fragile.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls support
  2014-08-13  3:27       ` David Gibson
@ 2014-08-14  8:29         ` Alexey Kardashevskiy
  2014-08-15  0:04           ` David Gibson
  0 siblings, 1 reply; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-08-14  8:29 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On 08/13/2014 01:27 PM, David Gibson wrote:
> On Tue, Aug 12, 2014 at 05:25:29PM +1000, Alexey Kardashevskiy wrote:
>> On 08/12/2014 11:45 AM, David Gibson wrote:
>>> On Thu, Jul 31, 2014 at 07:34:10PM +1000, Alexey Kardashevskiy
>> wrote:
> [snip]
>>> The function of this is kind of unclear.  I'm assuming this is
>>> filtering the supported page sizes reported by the PHB by the possible
>>> page sizes based on host page size or other constraints.  Is that
>>> right?
>>>
>>> I think you'd be better off folding the whole double loop into the
>>> fixmask function.
>>>
>>>> +
>>>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>>>> +    rtas_st(rets, 1, windows_available);
>>>> +    /* Return maximum number as all RAM was 4K pages */
>>>> +    rtas_st(rets, 2, ram_size >> SPAPR_TCE_PAGE_SHIFT);
>>>
>>> I'm assuming this is the allowed size of the dynamic windows.
>>> Shouldn't that be reported by a PHB callback, rather than hardcoded
>>> here?
>>
>> Why PHB? This is DMA memory. @ram_size is the upper limit, we can make more
>> only when we have memory hotplug (which we do not have) and the guest can
>> create smaller windows if it wants so I do not really follow you here.
> 
> What I'm not clear on is what this RTAS return actually means.  Is it
> saying the maximum size of the DMA window, or the maximum address
> which can be mapped by that window?  Remember I don't have access to
> PAPR documentation any more - nor do others reading these patches.


It is literally "Largest contiguous block of TCEs allocated specifically
for (that is, are reserved for) this PE". Which I understand as the maximum
number of TCEs.
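With the base 4K page size that is simply ram_size >> 12; a 2GB guest, for
example, gets 0x80000 TCEs reported, i.e. all of RAM counted in 4K pages.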



> [snip]
>>>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
>>>> +                                          sPAPREnvironment *spapr,
>>>> +                                          uint32_t token, uint32_t nargs,
>>>> +                                          target_ulong args,
>>>> +                                          uint32_t nret, target_ulong rets)
>>>> +{
>>>> +    sPAPRPHBState *sphb;
>>>> +    sPAPRPHBClass *spc;
>>>> +    sPAPRTCETable *tcet = NULL;
>>>> +    uint32_t addr, page_shift, window_shift, liobn;
>>>> +    uint64_t buid;
>>>> +    long ret;
>>>> +
>>>> +    if ((nargs != 5) || (nret != 4)) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>>>> +    addr = rtas_ld(args, 0);
>>>> +    sphb = spapr_pci_find_phb(spapr, buid);
>>>> +    if (!sphb) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
>>>> +    if (!spc->ddw_create) {
>>>> +        goto hw_error_exit;
>>>> +    }
>>>> +
>>>> +    page_shift = rtas_ld(args, 3);
>>>> +    window_shift = rtas_ld(args, 4);
>>>> +    liobn = sphb->dma_liobn + 0x10000;
>>>
>>> Isn't using a fixed LIOBN here assuming you can only have a single DDW
>>> per PHB?  That's true for now, but in theory shouldn't it be reported
>>> by the PHB code itself?
>>
>>
>> This should be a unique LIOBN so it is not up to PHB to choose. And we
>> cannot make it completely random for migration purposes. I'll make it
>> something like
>>
>> #define SPAPR_DDW_LIOBN(sphb, windownum) ((sphb)->dma_liobn | windownum)
> 
> Ok.
> 
> Really, the assigned liobns should be included in the migration stream
> if they're not already.

LIOBNs already migrate, liobn itself is an instance id of a TCE table
object in the migration stream.


>  Relying on them being set consistently
> at startup is going to be really fragile.


-- 
Alexey

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: Enable DDW
  2014-08-13  0:18                 ` Alexey Kardashevskiy
@ 2014-08-14 13:38                   ` Alexander Graf
  2014-08-15  0:09                     ` David Gibson
  2014-08-15  3:16                     ` Alexey Kardashevskiy
  0 siblings, 2 replies; 55+ messages in thread
From: Alexander Graf @ 2014-08-14 13:38 UTC (permalink / raw)
  To: Alexey Kardashevskiy, qemu-devel; +Cc: Alex Williamson, qemu-ppc


On 13.08.14 02:18, Alexey Kardashevskiy wrote:
> On 08/13/2014 01:28 AM, Alexander Graf wrote:
>> On 12.08.14 17:10, Alexey Kardashevskiy wrote:
>>> On 08/12/2014 07:37 PM, Alexander Graf wrote:
>>>> On 12.08.14 02:03, Alexey Kardashevskiy wrote:
>>>>> On 08/12/2014 03:30 AM, Alexander Graf wrote:
>>>>>> On 11.08.14 17:01, Alexey Kardashevskiy wrote:
>>>>>>> On 08/11/2014 10:02 PM, Alexander Graf wrote:
>>>>>>>> On 31.07.14 11:34, Alexey Kardashevskiy wrote:
>>>>>>>>> This implements DDW for VFIO. Host kernel support is required for
>>>>>>>>> this.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>> ---
>>>>>>>>>       hw/ppc/spapr_pci_vfio.c | 75
>>>>>>>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>       1 file changed, 75 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
>>>>>>>>> index d3bddf2..dc443e2 100644
>>>>>>>>> --- a/hw/ppc/spapr_pci_vfio.c
>>>>>>>>> +++ b/hw/ppc/spapr_pci_vfio.c
>>>>>>>>> @@ -69,6 +69,77 @@ static void
>>>>>>>>> spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
>>>>>>>>>           /* Register default 32bit DMA window */
>>>>>>>>>           memory_region_add_subregion(&sphb->iommu_root,
>>>>>>>>> tcet->bus_offset,
>>>>>>>>>                                       spapr_tce_get_iommu(tcet));
>>>>>>>>> +
>>>>>>>>> +    sphb->ddw_supported = !!(info.flags &
>>>>>>>>> VFIO_IOMMU_SPAPR_TCE_FLAG_DDW);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static int spapr_pci_vfio_ddw_query(sPAPRPHBState *sphb,
>>>>>>>>> +                                    uint32_t *windows_available,
>>>>>>>>> +                                    uint32_t *page_size_mask)
>>>>>>>>> +{
>>>>>>>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>>>>>>>> +    struct vfio_iommu_spapr_tce_query query = { .argsz =
>>>>>>>>> sizeof(query) };
>>>>>>>>> +    int ret;
>>>>>>>>> +
>>>>>>>>> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
>>>>>>>>> +                               VFIO_IOMMU_SPAPR_TCE_QUERY, &query);
>>>>>>>>> +    if (ret) {
>>>>>>>>> +        return ret;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    *windows_available = query.windows_available;
>>>>>>>>> +    *page_size_mask = query.page_size_mask;
>>>>>>>>> +
>>>>>>>>> +    return ret;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static int spapr_pci_vfio_ddw_create(sPAPRPHBState *sphb, uint32_t
>>>>>>>>> page_shift,
>>>>>>>>> +                                     uint32_t window_shift, uint32_t
>>>>>>>>> liobn,
>>>>>>>>> +                                     sPAPRTCETable **ptcet)
>>>>>>>>> +{
>>>>>>>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>>>>>>>> +    struct vfio_iommu_spapr_tce_create create = {
>>>>>>>>> +        .argsz = sizeof(create),
>>>>>>>>> +        .page_shift = page_shift,
>>>>>>>>> +        .window_shift = window_shift,
>>>>>>>>> +        .start_addr = 0
>>>>>>>>> +    };
>>>>>>>>> +    int ret;
>>>>>>>>> +
>>>>>>>>> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
>>>>>>>>> +                               VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>>>>>>>>> +    if (ret) {
>>>>>>>>> +        return ret;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    *ptcet = spapr_tce_new_table(DEVICE(sphb), liobn,
>>>>>>>>> create.start_addr,
>>>>>>>>> +                                 page_shift, 1 << (window_shift -
>>>>>>>>> page_shift),
>>>>>>>> I spot a 1 without ULL again - this time it might work out ok, but
>>>>>>>> please
>>>>>>>> just always use ULL when you pass around addresses.
>>>>>>> My bad. I keep forgetting this, I'll adjust my own checkpatch.py :)
>>>>>>>
>>>>>>>
>>>>>>>> Please walk me though the abstraction levels on what each page size
>>>>>>>> honoration means. If I use THP, what page size granularity can I use
>>>>>>>> for
>>>>>>>> TCE entries?
>>>>>>> [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls
>>>>>>> support
>>>>>>>
>>>>>>> +        const struct { int shift; uint32_t mask; } masks[] = {
>>>>>>> +            { 12, DDW_PGSIZE_4K },
>>>>>>> +            { 16, DDW_PGSIZE_64K },
>>>>>>> +            { 24, DDW_PGSIZE_16M },
>>>>>>> +            { 25, DDW_PGSIZE_32M },
>>>>>>> +            { 26, DDW_PGSIZE_64M },
>>>>>>> +            { 27, DDW_PGSIZE_128M },
>>>>>>> +            { 28, DDW_PGSIZE_256M },
>>>>>>> +            { 34, DDW_PGSIZE_16G },
>>>>>>> +        };
>>>>>>>
>>>>>>>
>>>>>>> Supported page sizes are returned by the host kernel via "query". For
>>>>>>> 16MB
>>>>>>> pages, page shift will return
>>>>>>> DDW_PGSIZE_4K|DDW_PGSIZE_64K|DDW_PGSIZE_16M.
>>>>>>> Or I did not understand the question...
>>>>>> Why do we care about the sizes? Anything bigger than what we support
>>>>>> should
>>>>>> always work, no? What happens if the guest creates a 16MB map but my
>>>>>> pages
>>>>>> are 4kb mapped? Wouldn't the same logic be able to deal with 16G pages?
>>>>> It is DMA memory, if I split "virtual" 16M page to a bunch of real 4K
>>>>> pages, I have to make sure these 16M are continuous - there will be one
>>>>> TCE
>>>>> entry for it and no more translations besides IOMMU. What do I miss now?
>>>> Who does the shadow translation where? Does it exist at all?
>>> IOMMU? I am not sure I am following you... This IOMMU will look as direct
>>> DMA for the guest but the real IOMMU table is sparse and it is populated
>>> via a bunch of H_PUT_TCE calls as the default small window.
>>>
>>> There is a direct mapping in the host called "bypass window" but it is not
>>> used here as sPAPR does not define that for paravirtualization.
>> Ok, imagine I have 16MB of guest physical memory that is in reality backed
>> by 256 64k pages on the host. The guest wants to create a 16M TCE entry for
>> this (from its point of view contiguous) chunk of memory.
>>
>> Do we allow this?
> No, we do not. We tell the guest what it can use.
>
>> Or do we force the guest to create 64k TCE entries?
> 16MB TCE pages are only allowed if qemu is running with hugepages.

That's unfortunate ;) but as long as we have to pin TCEd memory anyway, 
I guess it doesn't hurt as badly.

>
>
>> If we allow it, why would we ever put any restriction at the upper end of
>> TCE entry sizes? If we already implement enough logic to map things lazily
>> around, we could as well have the guest create a 256M TCE entry and just
>> split it on the host view to 64k TCE entries.
> Oh, thiiiiiis is what you meant...
>
> Well, we could, just for now current linux guests support 4K/64K/16M only
> and they choose depending on what hypervisor supports - look at
> enable_ddw() in the guest. What you suggest seems to be an unnecessary code
> duplication for 16MB pages case. For bigger page sizes - for example, for
> 64GB guest, a TCE table with 16MB TCEs will be 32KB which is already
> awesome enough, no?

In "normal" environments guests won't be backed by 16M pages, but by 64k 
pages with the occasional THP huge page merge that you can't rely on.

That's why I figured it'd be smart to support 16MB TCEs even when the 
underlying memory is only backed by 64k pages.


Alex

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls support
  2014-08-14  8:29         ` Alexey Kardashevskiy
@ 2014-08-15  0:04           ` David Gibson
  2014-08-15  3:09             ` Alexey Kardashevskiy
  0 siblings, 1 reply; 55+ messages in thread
From: David Gibson @ 2014-08-15  0:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

[-- Attachment #1: Type: text/plain, Size: 4239 bytes --]

On Thu, Aug 14, 2014 at 06:29:50PM +1000, Alexey Kardashevskiy wrote:
> On 08/13/2014 01:27 PM, David Gibson wrote:
> > On Tue, Aug 12, 2014 at 05:25:29PM +1000, Alexey Kardashevskiy wrote:
> >> On 08/12/2014 11:45 AM, David Gibson wrote:
> >>> On Thu, Jul 31, 2014 at 07:34:10PM +1000, Alexey Kardashevskiy
> >> wrote:
> > [snip]
> >>> The function of this is kind of unclear.  I'm assuming this is
> >>> filtering the supported page sizes reported by the PHB by the possible
> >>> page sizes based on host page size or other constraints.  Is that
> >>> right?
> >>>
> >>> I think you'd be better off folding the whole double loop into the
> >>> fixmask function.
> >>>
> >>>> +
> >>>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> >>>> +    rtas_st(rets, 1, windows_available);
> >>>> +    /* Return maximum number as all RAM was 4K pages */
> >>>> +    rtas_st(rets, 2, ram_size >> SPAPR_TCE_PAGE_SHIFT);
> >>>
> >>> I'm assuming this is the allowed size of the dynamic windows.
> >>> Shouldn't that be reported by a PHB callback, rather than hardcoded
> >>> here?
> >>
> >> Why PHB? This is DMA memory. @ram_size is the upper limit, we can make more
> >> only when we have memory hotplug (which we do not have) and the guest can
> >> create smaller windows if it wants so I do not really follow you here.
> > 
> > What I'm not clear on is what this RTAS return actually means.  Is it
> > saying the maximum size of the DMA window, or the maximum address
> > which can be mapped by that window?  Remember I don't have access to
> > PAPR documentation any more - nor do others reading these patches.
> 
> 
> It is literally "Largest contiguous block of TCEs allocated specifically
> for (that is, are reserved for) this PE". Which I understand as the maximum
> number of TCEs.

Ok, so essentially it's a property of the IOMMU.  Hrm, I guess
ram_size is good enough for now then.

[snip]
> > [snip]
> >>>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
> >>>> +                                          sPAPREnvironment *spapr,
> >>>> +                                          uint32_t token, uint32_t nargs,
> >>>> +                                          target_ulong args,
> >>>> +                                          uint32_t nret, target_ulong rets)
> >>>> +{
> >>>> +    sPAPRPHBState *sphb;
> >>>> +    sPAPRPHBClass *spc;
> >>>> +    sPAPRTCETable *tcet = NULL;
> >>>> +    uint32_t addr, page_shift, window_shift, liobn;
> >>>> +    uint64_t buid;
> >>>> +    long ret;
> >>>> +
> >>>> +    if ((nargs != 5) || (nret != 4)) {
> >>>> +        goto param_error_exit;
> >>>> +    }
> >>>> +
> >>>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> >>>> +    addr = rtas_ld(args, 0);
> >>>> +    sphb = spapr_pci_find_phb(spapr, buid);
> >>>> +    if (!sphb) {
> >>>> +        goto param_error_exit;
> >>>> +    }
> >>>> +
> >>>> +    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
> >>>> +    if (!spc->ddw_create) {
> >>>> +        goto hw_error_exit;
> >>>> +    }
> >>>> +
> >>>> +    page_shift = rtas_ld(args, 3);
> >>>> +    window_shift = rtas_ld(args, 4);
> >>>> +    liobn = sphb->dma_liobn + 0x10000;
> >>>
> >>> Isn't using a fixed LIOBN here assuming you can only have a single DDW
> >>> per PHB?  That's true for now, but in theory shouldn't it be reported
> >>> by the PHB code itself?
> >>
> >>
> >> This should be a unique LIOBN so it is not up to PHB to choose. And we
> >> cannot make it completely random for migration purposes. I'll make it
> >> something like
> >>
> >> #define SPAPR_DDW_LIOBN(sphb, windownum) ((sphb)->dma_liobn | windownum)
> > 
> > Ok.
> > 
> > Really, the assigned liobns should be included in the migration stream
> > if they're not already.
> 
> LIOBNs already migrate, liobn itself is an instance id of a TCE table
> object in the migration stream.

Ok, so couldn't we just add an alloc_liobn() function instead of
hardcoding how the liobns are constructed?
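
Something like this, say (purely a sketch of the idea, assuming the
SPAPR_DDW_LIOBN scheme above, not a concrete interface proposal):

    /* keep liobn assignment in one place instead of open-coding
     * "dma_liobn | windownum" at each call site; still deterministic,
     * which migration relies on */
    static uint32_t spapr_phb_alloc_liobn(sPAPRPHBState *sphb, int window_num)
    {
        return sphb->dma_liobn | window_num;
    }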

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: Enable DDW
  2014-08-14 13:38                   ` Alexander Graf
@ 2014-08-15  0:09                     ` David Gibson
  2014-08-15  3:22                       ` Alexey Kardashevskiy
  2014-08-15  3:16                     ` Alexey Kardashevskiy
  1 sibling, 1 reply; 55+ messages in thread
From: David Gibson @ 2014-08-15  0:09 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 8252 bytes --]

On Thu, Aug 14, 2014 at 03:38:45PM +0200, Alexander Graf wrote:
> 
> On 13.08.14 02:18, Alexey Kardashevskiy wrote:
> >On 08/13/2014 01:28 AM, Alexander Graf wrote:
> >>On 12.08.14 17:10, Alexey Kardashevskiy wrote:
> >>>On 08/12/2014 07:37 PM, Alexander Graf wrote:
> >>>>On 12.08.14 02:03, Alexey Kardashevskiy wrote:
> >>>>>On 08/12/2014 03:30 AM, Alexander Graf wrote:
> >>>>>>On 11.08.14 17:01, Alexey Kardashevskiy wrote:
> >>>>>>>On 08/11/2014 10:02 PM, Alexander Graf wrote:
> >>>>>>>>On 31.07.14 11:34, Alexey Kardashevskiy wrote:
> >>>>>>>>>This implements DDW for VFIO. Host kernel support is required for
> >>>>>>>>>this.
> >>>>>>>>>
> >>>>>>>>>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>>>>>>---
> >>>>>>>>>      hw/ppc/spapr_pci_vfio.c | 75
> >>>>>>>>>+++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>>>>      1 file changed, 75 insertions(+)
> >>>>>>>>>
> >>>>>>>>>diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
> >>>>>>>>>index d3bddf2..dc443e2 100644
> >>>>>>>>>--- a/hw/ppc/spapr_pci_vfio.c
> >>>>>>>>>+++ b/hw/ppc/spapr_pci_vfio.c
> >>>>>>>>>@@ -69,6 +69,77 @@ static void
> >>>>>>>>>spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
> >>>>>>>>>          /* Register default 32bit DMA window */
> >>>>>>>>>          memory_region_add_subregion(&sphb->iommu_root,
> >>>>>>>>>tcet->bus_offset,
> >>>>>>>>>                                      spapr_tce_get_iommu(tcet));
> >>>>>>>>>+
> >>>>>>>>>+    sphb->ddw_supported = !!(info.flags &
> >>>>>>>>>VFIO_IOMMU_SPAPR_TCE_FLAG_DDW);
> >>>>>>>>>+}
> >>>>>>>>>+
> >>>>>>>>>+static int spapr_pci_vfio_ddw_query(sPAPRPHBState *sphb,
> >>>>>>>>>+                                    uint32_t *windows_available,
> >>>>>>>>>+                                    uint32_t *page_size_mask)
> >>>>>>>>>+{
> >>>>>>>>>+    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
> >>>>>>>>>+    struct vfio_iommu_spapr_tce_query query = { .argsz =
> >>>>>>>>>sizeof(query) };
> >>>>>>>>>+    int ret;
> >>>>>>>>>+
> >>>>>>>>>+    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
> >>>>>>>>>+                               VFIO_IOMMU_SPAPR_TCE_QUERY, &query);
> >>>>>>>>>+    if (ret) {
> >>>>>>>>>+        return ret;
> >>>>>>>>>+    }
> >>>>>>>>>+
> >>>>>>>>>+    *windows_available = query.windows_available;
> >>>>>>>>>+    *page_size_mask = query.page_size_mask;
> >>>>>>>>>+
> >>>>>>>>>+    return ret;
> >>>>>>>>>+}
> >>>>>>>>>+
> >>>>>>>>>+static int spapr_pci_vfio_ddw_create(sPAPRPHBState *sphb, uint32_t
> >>>>>>>>>page_shift,
> >>>>>>>>>+                                     uint32_t window_shift, uint32_t
> >>>>>>>>>liobn,
> >>>>>>>>>+                                     sPAPRTCETable **ptcet)
> >>>>>>>>>+{
> >>>>>>>>>+    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
> >>>>>>>>>+    struct vfio_iommu_spapr_tce_create create = {
> >>>>>>>>>+        .argsz = sizeof(create),
> >>>>>>>>>+        .page_shift = page_shift,
> >>>>>>>>>+        .window_shift = window_shift,
> >>>>>>>>>+        .start_addr = 0
> >>>>>>>>>+    };
> >>>>>>>>>+    int ret;
> >>>>>>>>>+
> >>>>>>>>>+    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
> >>>>>>>>>+                               VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> >>>>>>>>>+    if (ret) {
> >>>>>>>>>+        return ret;
> >>>>>>>>>+    }
> >>>>>>>>>+
> >>>>>>>>>+    *ptcet = spapr_tce_new_table(DEVICE(sphb), liobn,
> >>>>>>>>>create.start_addr,
> >>>>>>>>>+                                 page_shift, 1 << (window_shift -
> >>>>>>>>>page_shift),
> >>>>>>>>I spot a 1 without ULL again - this time it might work out ok, but
> >>>>>>>>please
> >>>>>>>>just always use ULL when you pass around addresses.
> >>>>>>>My bad. I keep forgetting this, I'll adjust my own checkpatch.py :)
> >>>>>>>
> >>>>>>>
> >>>>>>>>Please walk me though the abstraction levels on what each page size
> >>>>>>>>honoration means. If I use THP, what page size granularity can I use
> >>>>>>>>for
> >>>>>>>>TCE entries?
> >>>>>>>[RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls
> >>>>>>>support
> >>>>>>>
> >>>>>>>+        const struct { int shift; uint32_t mask; } masks[] = {
> >>>>>>>+            { 12, DDW_PGSIZE_4K },
> >>>>>>>+            { 16, DDW_PGSIZE_64K },
> >>>>>>>+            { 24, DDW_PGSIZE_16M },
> >>>>>>>+            { 25, DDW_PGSIZE_32M },
> >>>>>>>+            { 26, DDW_PGSIZE_64M },
> >>>>>>>+            { 27, DDW_PGSIZE_128M },
> >>>>>>>+            { 28, DDW_PGSIZE_256M },
> >>>>>>>+            { 34, DDW_PGSIZE_16G },
> >>>>>>>+        };
> >>>>>>>
> >>>>>>>
> >>>>>>>Supported page sizes are returned by the host kernel via "query". For
> >>>>>>>16MB
> >>>>>>>pages, page shift will return
> >>>>>>>DDW_PGSIZE_4K|DDW_PGSIZE_64K|DDW_PGSIZE_16M.
> >>>>>>>Or I did not understand the question...
> >>>>>>Why do we care about the sizes? Anything bigger than what we support
> >>>>>>should
> >>>>>>always work, no? What happens if the guest creates a 16MB map but my
> >>>>>>pages
> >>>>>>are 4kb mapped? Wouldn't the same logic be able to deal with 16G pages?
> >>>>>It is DMA memory, if I split "virtual" 16M page to a bunch of real 4K
> >>>>>pages, I have to make sure these 16M are continuous - there will be one
> >>>>>TCE
> >>>>>entry for it and no more translations besides IOMMU. What do I miss now?
> >>>>Who does the shadow translation where? Does it exist at all?
> >>>IOMMU? I am not sure I am following you... This IOMMU will look as direct
> >>>DMA for the guest but the real IOMMU table is sparse and it is populated
> >>>via a bunch of H_PUT_TCE calls as the default small window.
> >>>
> >>>There is a direct mapping in the host called "bypass window" but it is not
> >>>used here as sPAPR does not define that for paravirtualization.
> >>Ok, imagine I have 16MB of guest physical memory that is in reality backed
> >>by 256 64k pages on the host. The guest wants to create a 16M TCE entry for
> >>this (from its point of view contiguous) chunk of memory.
> >>
> >>Do we allow this?
> >No, we do not. We tell the guest what it can use.
> >
> >>Or do we force the guest to create 64k TCE entries?
> >16MB TCE pages are only allowed if qemu is running with hugepages.
> 
> That's unfortunate ;) but as long as we have to pin TCEd memory anyway, I
> guess it doesn't hurt as badly.
> 
> >
> >
> >>If we allow it, why would we ever put any restriction at the upper end of
> >>TCE entry sizes? If we already implement enough logic to map things lazily
> >>around, we could as well have the guest create a 256M TCE entry and just
> >>split it on the host view to 64k TCE entries.
> >Oh, thiiiiiis is what you meant...
> >
> >Well, we could, just for now current linux guests support 4K/64K/16M only
> >and they choose depending on what hypervisor supports - look at
> >enable_ddw() in the guest. What you suggest seems to be an unnecessary code
> >duplication for 16MB pages case. For bigger page sizes - for example, for
> >64GB guest, a TCE table with 16MB TCEs will be 32KB which is already
> >awesome enough, no?
> 
> In "normal" environments guests won't be backed by 16M pages, but by 64k
> pages with the occasional THP huge page merge that you can't rely on.
> 
> That's why I figured it'd be smart to support 16MB TCEs even when the
> underlying memory is only backed by 64k pages.

That could work for emulated PCI devices, but not for VFIO.  With VFIO
the TCEs get passed through to the hardware, and so the pages mapped
must be physically contiguous, which can only happen if the guest is
backed by hugepages.

Well.. I guess you *could* fake it for VFIO, by making each guest
H_PUT_TCE result in many real TCEs being created.  But I think it's a
bad idea, because it would trigger the guest to map all RAM when not
hugepage backed, and that would mean the translated (host) TCE table
would be inordinately large.
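For instance, a 64GB guest mapped with 64k host TCEs would need
64GB / 64k = 1M entries, i.e. an 8MB translated table per window.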

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls support
  2014-08-15  0:04           ` David Gibson
@ 2014-08-15  3:09             ` Alexey Kardashevskiy
  2014-08-15  4:20               ` David Gibson
  0 siblings, 1 reply; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-08-15  3:09 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On 08/15/2014 10:04 AM, David Gibson wrote:
> On Thu, Aug 14, 2014 at 06:29:50PM +1000, Alexey Kardashevskiy wrote:
>> On 08/13/2014 01:27 PM, David Gibson wrote:
>>> On Tue, Aug 12, 2014 at 05:25:29PM +1000, Alexey Kardashevskiy wrote:
>>>> On 08/12/2014 11:45 AM, David Gibson wrote:
>>>>> On Thu, Jul 31, 2014 at 07:34:10PM +1000, Alexey Kardashevskiy
>>>> wrote:
>>> [snip]
>>>>> The function of this is kind of unclear.  I'm assuming this is
>>>>> filtering the supported page sizes reported by the PHB by the possible
>>>>> page sizes based on host page size or other constraints.  Is that
>>>>> right?
>>>>>
>>>>> I think you'd be better off folding the whole double loop into the
>>>>> fixmask function.
>>>>>
>>>>>> +
>>>>>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>>>>>> +    rtas_st(rets, 1, windows_available);
>>>>>> +    /* Return maximum number as all RAM was 4K pages */
>>>>>> +    rtas_st(rets, 2, ram_size >> SPAPR_TCE_PAGE_SHIFT);
>>>>>
>>>>> I'm assuming this is the allowed size of the dynamic windows.
>>>>> Shouldn't that be reported by a PHB callback, rather than hardcoded
>>>>> here?
>>>>
>>>> Why PHB? This is DMA memory. @ram_size is the upper limit, we can make more
>>>> only when we have memory hotplug (which we do not have) and the guest can
>>>> create smaller windows if it wants so I do not really follow you here.
>>>
>>> What I'm not clear on is what this RTAS return actually means.  Is it
>>> saying the maximum size of the DMA window, or the maximum address
>>> which can be mapped by that window?  Remember I don't have access to
>>> PAPR documentation any more - nor do others reading these patches.
>>
>>
>> It is literally "Largest contiguous block of TCEs allocated specifically
>> for (that is, are reserved for) this PE". Which I understand as the maximum
>> number of TCEs.
> 
> Ok, so essentially it's a property of the IOMMU.  Hrm, I guess
> ram_size is good enough for now then.
> 
> [snip]
>>> [snip]
>>>>>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
>>>>>> +                                          sPAPREnvironment *spapr,
>>>>>> +                                          uint32_t token, uint32_t nargs,
>>>>>> +                                          target_ulong args,
>>>>>> +                                          uint32_t nret, target_ulong rets)
>>>>>> +{
>>>>>> +    sPAPRPHBState *sphb;
>>>>>> +    sPAPRPHBClass *spc;
>>>>>> +    sPAPRTCETable *tcet = NULL;
>>>>>> +    uint32_t addr, page_shift, window_shift, liobn;
>>>>>> +    uint64_t buid;
>>>>>> +    long ret;
>>>>>> +
>>>>>> +    if ((nargs != 5) || (nret != 4)) {
>>>>>> +        goto param_error_exit;
>>>>>> +    }
>>>>>> +
>>>>>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>>>>>> +    addr = rtas_ld(args, 0);
>>>>>> +    sphb = spapr_pci_find_phb(spapr, buid);
>>>>>> +    if (!sphb) {
>>>>>> +        goto param_error_exit;
>>>>>> +    }
>>>>>> +
>>>>>> +    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
>>>>>> +    if (!spc->ddw_create) {
>>>>>> +        goto hw_error_exit;
>>>>>> +    }
>>>>>> +
>>>>>> +    page_shift = rtas_ld(args, 3);
>>>>>> +    window_shift = rtas_ld(args, 4);
>>>>>> +    liobn = sphb->dma_liobn + 0x10000;
>>>>>
>>>>> Isn't using a fixed LIOBN here assuming you can only have a single DDW
>>>>> per PHB?  That's true for now, but in theory shouldn't it be reported
>>>>> by the PHB code itself?
>>>>
>>>>
>>>> This should be a unique LIOBN so it is not up to PHB to choose. And we
>>>> cannot make it completely random for migration purposes. I'll make it
>>>> something like
>>>>
>>>> #define SPAPR_DDW_LIOBN(sphb, windownum) ((sphb)->dma_liobn | windownum)
>>>
>>> Ok.
>>>
>>> Really, the assigned liobns should be included in the migration stream
>>> if they're not already.
>>
>> LIOBNs already migrate, liobn itself is an instance id of a TCE table
>> object in the migration stream.
> 
> Ok, so couldn't we just add an alloc_liobn() function instead of
> hardcoding how the liobns are constructed?


No. If we did that, the exact numbers would depend on the device order on
the QEMU command line - a command line produced by libvirt from a handmade
XML and one produced by libvirt from the XML printed by "dumpxml" can have
the devices in a different order, so interrupt numbers, LIOBNs - all of
this gets broken.



-- 
Alexey

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: Enable DDW
  2014-08-14 13:38                   ` Alexander Graf
  2014-08-15  0:09                     ` David Gibson
@ 2014-08-15  3:16                     ` Alexey Kardashevskiy
  2014-08-15  7:37                       ` Alexander Graf
  1 sibling, 1 reply; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-08-15  3:16 UTC (permalink / raw)
  To: Alexander Graf, qemu-devel; +Cc: Alex Williamson, qemu-ppc

On 08/14/2014 11:38 PM, Alexander Graf wrote:
> 
> On 13.08.14 02:18, Alexey Kardashevskiy wrote:
>> On 08/13/2014 01:28 AM, Alexander Graf wrote:
>>> On 12.08.14 17:10, Alexey Kardashevskiy wrote:
>>>> On 08/12/2014 07:37 PM, Alexander Graf wrote:
>>>>> On 12.08.14 02:03, Alexey Kardashevskiy wrote:
>>>>>> On 08/12/2014 03:30 AM, Alexander Graf wrote:
>>>>>>> On 11.08.14 17:01, Alexey Kardashevskiy wrote:
>>>>>>>> On 08/11/2014 10:02 PM, Alexander Graf wrote:
>>>>>>>>> On 31.07.14 11:34, Alexey Kardashevskiy wrote:
>>>>>>>>>> This implements DDW for VFIO. Host kernel support is required for
>>>>>>>>>> this.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>> ---
>>>>>>>>>>       hw/ppc/spapr_pci_vfio.c | 75
>>>>>>>>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>       1 file changed, 75 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
>>>>>>>>>> index d3bddf2..dc443e2 100644
>>>>>>>>>> --- a/hw/ppc/spapr_pci_vfio.c
>>>>>>>>>> +++ b/hw/ppc/spapr_pci_vfio.c
>>>>>>>>>> @@ -69,6 +69,77 @@ static void
>>>>>>>>>> spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
>>>>>>>>>>           /* Register default 32bit DMA window */
>>>>>>>>>>           memory_region_add_subregion(&sphb->iommu_root,
>>>>>>>>>> tcet->bus_offset,
>>>>>>>>>>                                       spapr_tce_get_iommu(tcet));
>>>>>>>>>> +
>>>>>>>>>> +    sphb->ddw_supported = !!(info.flags &
>>>>>>>>>> VFIO_IOMMU_SPAPR_TCE_FLAG_DDW);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static int spapr_pci_vfio_ddw_query(sPAPRPHBState *sphb,
>>>>>>>>>> +                                    uint32_t *windows_available,
>>>>>>>>>> +                                    uint32_t *page_size_mask)
>>>>>>>>>> +{
>>>>>>>>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>>>>>>>>> +    struct vfio_iommu_spapr_tce_query query = { .argsz =
>>>>>>>>>> sizeof(query) };
>>>>>>>>>> +    int ret;
>>>>>>>>>> +
>>>>>>>>>> +    ret = vfio_container_ioctl(&sphb->iommu_as,
>>>>>>>>>> svphb->iommugroupid,
>>>>>>>>>> +                               VFIO_IOMMU_SPAPR_TCE_QUERY, &query);
>>>>>>>>>> +    if (ret) {
>>>>>>>>>> +        return ret;
>>>>>>>>>> +    }
>>>>>>>>>> +
>>>>>>>>>> +    *windows_available = query.windows_available;
>>>>>>>>>> +    *page_size_mask = query.page_size_mask;
>>>>>>>>>> +
>>>>>>>>>> +    return ret;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static int spapr_pci_vfio_ddw_create(sPAPRPHBState *sphb, uint32_t
>>>>>>>>>> page_shift,
>>>>>>>>>> +                                     uint32_t window_shift,
>>>>>>>>>> uint32_t
>>>>>>>>>> liobn,
>>>>>>>>>> +                                     sPAPRTCETable **ptcet)
>>>>>>>>>> +{
>>>>>>>>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>>>>>>>>> +    struct vfio_iommu_spapr_tce_create create = {
>>>>>>>>>> +        .argsz = sizeof(create),
>>>>>>>>>> +        .page_shift = page_shift,
>>>>>>>>>> +        .window_shift = window_shift,
>>>>>>>>>> +        .start_addr = 0
>>>>>>>>>> +    };
>>>>>>>>>> +    int ret;
>>>>>>>>>> +
>>>>>>>>>> +    ret = vfio_container_ioctl(&sphb->iommu_as,
>>>>>>>>>> svphb->iommugroupid,
>>>>>>>>>> +                               VFIO_IOMMU_SPAPR_TCE_CREATE,
>>>>>>>>>> &create);
>>>>>>>>>> +    if (ret) {
>>>>>>>>>> +        return ret;
>>>>>>>>>> +    }
>>>>>>>>>> +
>>>>>>>>>> +    *ptcet = spapr_tce_new_table(DEVICE(sphb), liobn,
>>>>>>>>>> create.start_addr,
>>>>>>>>>> +                                 page_shift, 1 << (window_shift -
>>>>>>>>>> page_shift),
>>>>>>>>> I spot a 1 without ULL again - this time it might work out ok, but
>>>>>>>>> please
>>>>>>>>> just always use ULL when you pass around addresses.
>>>>>>>> My bad. I keep forgetting this, I'll adjust my own checkpatch.py :)
>>>>>>>>
>>>>>>>>
>>>>>>>>> Please walk me though the abstraction levels on what each page size
>>>>>>>>> honoration means. If I use THP, what page size granularity can I use
>>>>>>>>> for
>>>>>>>>> TCE entries?
>>>>>>>> [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls
>>>>>>>> support
>>>>>>>>
>>>>>>>> +        const struct { int shift; uint32_t mask; } masks[] = {
>>>>>>>> +            { 12, DDW_PGSIZE_4K },
>>>>>>>> +            { 16, DDW_PGSIZE_64K },
>>>>>>>> +            { 24, DDW_PGSIZE_16M },
>>>>>>>> +            { 25, DDW_PGSIZE_32M },
>>>>>>>> +            { 26, DDW_PGSIZE_64M },
>>>>>>>> +            { 27, DDW_PGSIZE_128M },
>>>>>>>> +            { 28, DDW_PGSIZE_256M },
>>>>>>>> +            { 34, DDW_PGSIZE_16G },
>>>>>>>> +        };
>>>>>>>>
>>>>>>>>
>>>>>>>> Supported page sizes are returned by the host kernel via "query". For
>>>>>>>> 16MB
>>>>>>>> pages, page shift will return
>>>>>>>> DDW_PGSIZE_4K|DDW_PGSIZE_64K|DDW_PGSIZE_16M.
>>>>>>>> Or I did not understand the question...
>>>>>>> Why do we care about the sizes? Anything bigger than what we support
>>>>>>> should
>>>>>>> always work, no? What happens if the guest creates a 16MB map but my
>>>>>>> pages
>>>>>>> are 4kb mapped? Wouldn't the same logic be able to deal with 16G pages?
>>>>>> It is DMA memory, if I split "virtual" 16M page to a bunch of real 4K
>>>>>> pages, I have to make sure these 16M are continuous - there will be one
>>>>>> TCE
>>>>>> entry for it and no more translations besides IOMMU. What do I miss now?
>>>>> Who does the shadow translation where? Does it exist at all?
>>>> IOMMU? I am not sure I am following you... This IOMMU will look as direct
>>>> DMA for the guest but the real IOMMU table is sparse and it is populated
>>>> via a bunch of H_PUT_TCE calls as the default small window.
>>>>
>>>> There is a direct mapping in the host called "bypass window" but it is not
>>>> used here as sPAPR does not define that for paravirtualization.
>>> Ok, imagine I have 16MB of guest physical memory that is in reality backed
>>> by 256 64k pages on the host. The guest wants to create a 16M TCE entry for
>>> this (from its point of view contiguous) chunk of memory.
>>>
>>> Do we allow this?
>> No, we do not. We tell the guest what it can use.
>>
>>> Or do we force the guest to create 64k TCE entries?
>> 16MB TCE pages are only allowed if qemu is running with hugepages.
> 
> That's unfortunate ;) 


This is my limitation, not something the SPAPR spec mandates or anything
like that.

> but as long as we have to pin TCEd memory anyway, I
> guess it doesn't hurt as badly.

Yep.


> 
>>
>>
>>> If we allow it, why would we ever put any restriction at the upper end of
>>> TCE entry sizes? If we already implement enough logic to map things lazily
>>> around, we could as well have the guest create a 256M TCE entry and just
>>> split it on the host view to 64k TCE entries.
>> Oh, thiiiiiis is what you meant...
>>
>> Well, we could, just for now current linux guests support 4K/64K/16M only
>> and they choose depending on what hypervisor supports - look at
>> enable_ddw() in the guest. What you suggest seems to be an unnecessary code
>> duplication for 16MB pages case. For bigger page sizes - for example, for
>> 64GB guest, a TCE table with 16MB TCEs will be 32KB which is already
>> awesome enough, no?
> 
> In "normal" invironments guests won't be backed by 16M pages, but by 64k
> pages with the occasional THP huge page merge that you can't rely on.
> 
> That's why I figured it'd be smart to support 16MB TCEs even when the
> underlying memory is only backed by 64k pages.


The real TCE table will be using 64K pages anyway - whether we fake 16MB
pages to the guest and split them silently in QEMU/the host, or the guest
itself uses 64K pages, either way we end up with the bigger 64K TCE table,
so I do not see what we win by faking 16MB pages for the guest.

My point is that the code is already in the guest: it can do 64K and 16MB
pages. If it were not there, then yes, everything you suggest would make
sense to implement :)
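
Just to spell out the arithmetic behind the two table sizes mentioned in
this thread, a standalone snippet (not QEMU code), assuming 8 bytes per
TCE entry:

    #include <stdio.h>

    int main(void)
    {
        unsigned long long ram     = 64ULL << 30;       /* 64GB guest */
        unsigned long long tce_16m = (ram >> 24) * 8;   /* 16MB TCEs  */
        unsigned long long tce_64k = (ram >> 16) * 8;   /* 64K TCEs   */

        printf("16MB TCEs: %llu KB\n", tce_16m >> 10);  /* 32 KB */
        printf("64K TCEs:  %llu MB\n", tce_64k >> 20);  /* 8 MB  */
        return 0;
    }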



-- 
Alexey

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: Enable DDW
  2014-08-15  0:09                     ` David Gibson
@ 2014-08-15  3:22                       ` Alexey Kardashevskiy
  0 siblings, 0 replies; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-08-15  3:22 UTC (permalink / raw)
  To: David Gibson, Alexander Graf; +Cc: Alex Williamson, qemu-ppc, qemu-devel

On 08/15/2014 10:09 AM, David Gibson wrote:
> On Thu, Aug 14, 2014 at 03:38:45PM +0200, Alexander Graf wrote:
>>
>> On 13.08.14 02:18, Alexey Kardashevskiy wrote:
>>> On 08/13/2014 01:28 AM, Alexander Graf wrote:
>>>> On 12.08.14 17:10, Alexey Kardashevskiy wrote:
>>>>> On 08/12/2014 07:37 PM, Alexander Graf wrote:
>>>>>> On 12.08.14 02:03, Alexey Kardashevskiy wrote:
>>>>>>> On 08/12/2014 03:30 AM, Alexander Graf wrote:
>>>>>>>> On 11.08.14 17:01, Alexey Kardashevskiy wrote:
>>>>>>>>> On 08/11/2014 10:02 PM, Alexander Graf wrote:
>>>>>>>>>> On 31.07.14 11:34, Alexey Kardashevskiy wrote:
>>>>>>>>>>> This implements DDW for VFIO. Host kernel support is required for
>>>>>>>>>>> this.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>>> ---
>>>>>>>>>>>      hw/ppc/spapr_pci_vfio.c | 75
>>>>>>>>>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>>      1 file changed, 75 insertions(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
>>>>>>>>>>> index d3bddf2..dc443e2 100644
>>>>>>>>>>> --- a/hw/ppc/spapr_pci_vfio.c
>>>>>>>>>>> +++ b/hw/ppc/spapr_pci_vfio.c
>>>>>>>>>>> @@ -69,6 +69,77 @@ static void
>>>>>>>>>>> spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
>>>>>>>>>>>          /* Register default 32bit DMA window */
>>>>>>>>>>>          memory_region_add_subregion(&sphb->iommu_root,
>>>>>>>>>>> tcet->bus_offset,
>>>>>>>>>>>                                      spapr_tce_get_iommu(tcet));
>>>>>>>>>>> +
>>>>>>>>>>> +    sphb->ddw_supported = !!(info.flags &
>>>>>>>>>>> VFIO_IOMMU_SPAPR_TCE_FLAG_DDW);
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static int spapr_pci_vfio_ddw_query(sPAPRPHBState *sphb,
>>>>>>>>>>> +                                    uint32_t *windows_available,
>>>>>>>>>>> +                                    uint32_t *page_size_mask)
>>>>>>>>>>> +{
>>>>>>>>>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>>>>>>>>>> +    struct vfio_iommu_spapr_tce_query query = { .argsz =
>>>>>>>>>>> sizeof(query) };
>>>>>>>>>>> +    int ret;
>>>>>>>>>>> +
>>>>>>>>>>> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
>>>>>>>>>>> +                               VFIO_IOMMU_SPAPR_TCE_QUERY, &query);
>>>>>>>>>>> +    if (ret) {
>>>>>>>>>>> +        return ret;
>>>>>>>>>>> +    }
>>>>>>>>>>> +
>>>>>>>>>>> +    *windows_available = query.windows_available;
>>>>>>>>>>> +    *page_size_mask = query.page_size_mask;
>>>>>>>>>>> +
>>>>>>>>>>> +    return ret;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static int spapr_pci_vfio_ddw_create(sPAPRPHBState *sphb, uint32_t
>>>>>>>>>>> page_shift,
>>>>>>>>>>> +                                     uint32_t window_shift, uint32_t
>>>>>>>>>>> liobn,
>>>>>>>>>>> +                                     sPAPRTCETable **ptcet)
>>>>>>>>>>> +{
>>>>>>>>>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>>>>>>>>>> +    struct vfio_iommu_spapr_tce_create create = {
>>>>>>>>>>> +        .argsz = sizeof(create),
>>>>>>>>>>> +        .page_shift = page_shift,
>>>>>>>>>>> +        .window_shift = window_shift,
>>>>>>>>>>> +        .start_addr = 0
>>>>>>>>>>> +    };
>>>>>>>>>>> +    int ret;
>>>>>>>>>>> +
>>>>>>>>>>> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
>>>>>>>>>>> +                               VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>>>>>>>>>>> +    if (ret) {
>>>>>>>>>>> +        return ret;
>>>>>>>>>>> +    }
>>>>>>>>>>> +
>>>>>>>>>>> +    *ptcet = spapr_tce_new_table(DEVICE(sphb), liobn,
>>>>>>>>>>> create.start_addr,
>>>>>>>>>>> +                                 page_shift, 1 << (window_shift -
>>>>>>>>>>> page_shift),
>>>>>>>>>> I spot a 1 without ULL again - this time it might work out ok, but
>>>>>>>>>> please
>>>>>>>>>> just always use ULL when you pass around addresses.
>>>>>>>>> My bad. I keep forgetting this, I'll adjust my own checkpatch.py :)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Please walk me though the abstraction levels on what each page size
>>>>>>>>>> honoration means. If I use THP, what page size granularity can I use
>>>>>>>>>> for
>>>>>>>>>> TCE entries?
>>>>>>>>> [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls
>>>>>>>>> support
>>>>>>>>>
>>>>>>>>> +        const struct { int shift; uint32_t mask; } masks[] = {
>>>>>>>>> +            { 12, DDW_PGSIZE_4K },
>>>>>>>>> +            { 16, DDW_PGSIZE_64K },
>>>>>>>>> +            { 24, DDW_PGSIZE_16M },
>>>>>>>>> +            { 25, DDW_PGSIZE_32M },
>>>>>>>>> +            { 26, DDW_PGSIZE_64M },
>>>>>>>>> +            { 27, DDW_PGSIZE_128M },
>>>>>>>>> +            { 28, DDW_PGSIZE_256M },
>>>>>>>>> +            { 34, DDW_PGSIZE_16G },
>>>>>>>>> +        };
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Supported page sizes are returned by the host kernel via "query". For
>>>>>>>>> 16MB
>>>>>>>>> pages, page shift will return
>>>>>>>>> DDW_PGSIZE_4K|DDW_PGSIZE_64K|DDW_PGSIZE_16M.
>>>>>>>>> Or I did not understand the question...
>>>>>>>> Why do we care about the sizes? Anything bigger than what we support
>>>>>>>> should
>>>>>>>> always work, no? What happens if the guest creates a 16MB map but my
>>>>>>>> pages
>>>>>>>> are 4kb mapped? Wouldn't the same logic be able to deal with 16G pages?
>>>>>>> It is DMA memory, if I split "virtual" 16M page to a bunch of real 4K
>>>>>>> pages, I have to make sure these 16M are continuous - there will be one
>>>>>>> TCE
>>>>>>> entry for it and no more translations besides IOMMU. What do I miss now?
>>>>>> Who does the shadow translation where? Does it exist at all?
>>>>> IOMMU? I am not sure I am following you... This IOMMU will look as direct
>>>>> DMA for the guest but the real IOMMU table is sparse and it is populated
>>>>> via a bunch of H_PUT_TCE calls as the default small window.
>>>>>
>>>>> There is a direct mapping in the host called "bypass window" but it is not
>>>>> used here as sPAPR does not define that for paravirtualization.
>>>> Ok, imagine I have 16MB of guest physical memory that is in reality backed
>>>> by 256 64k pages on the host. The guest wants to create a 16M TCE entry for
>>>> this (from its point of view contiguous) chunk of memory.
>>>>
>>>> Do we allow this?
>>> No, we do not. We tell the guest what it can use.
>>>
>>>> Or do we force the guest to create 64k TCE entries?
>>> 16MB TCE pages are only allowed if qemu is running with hugepages.
>>
>> That's unfortunate ;) but as long as we have to pin TCEd memory anyway, I
>> guess it doesn't hurt as badly.
>>
>>>
>>>
>>>> If we allow it, why would we ever put any restriction at the upper end of
>>>> TCE entry sizes? If we already implement enough logic to map things lazily
>>>> around, we could as well have the guest create a 256M TCE entry and just
>>>> split it on the host view to 64k TCE entries.
>>> Oh, thiiiiiis is what you meant...
>>>
>>> Well, we could, just for now current linux guests support 4K/64K/16M only
>>> and they choose depending on what hypervisor supports - look at
>>> enable_ddw() in the guest. What you suggest seems to be an unnecessary code
>>> duplication for 16MB pages case. For bigger page sizes - for example, for
>>> 64GB guest, a TCE table with 16MB TCEs will be 32KB which is already
>>> awesome enough, no?
>>
>> In "normal" invironments guests won't be backed by 16M pages, but by 64k
>> pages with the occasional THP huge page merge that you can't rely on.
>>
>> That's why I figured it'd be smart to support 16MB TCEs even when the
>> underlying memory is only backed by 64k pages.
> 
> That could work for emulated PCI devices, but not for VFIO.  With VFIO
> the TCEs get passed through to the hardware, and so the pages mapped
> must be physically contiguous, which can only happen if the guest is
> backed by hugepages.
> 
> Well.. I guess you *could* fake it for VFIO, by making each guest
> H_PUT_TCE result in many real TCEs being created.  But I think it's a
> bad idea, because it would trigger the guest to map all RAM when not
> hugepage backed, and that would mean the translated (host) TCE table
> would be inordinately large.


inordinately? :) 64GB of RAM with 64K pages = 1 million TCEs, 8 bytes each,
so the whole table would be 8MB (which is about 0.01% of 64GB). Well,
allocating 8MB in one contiguous chunk might be a problem - but P8 allows
splitting tables into up to 4-level trees. Or it is just a single 16MB page.



-- 
Alexey

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls support
  2014-08-15  3:09             ` Alexey Kardashevskiy
@ 2014-08-15  4:20               ` David Gibson
  2014-08-15  5:27                 ` Alexey Kardashevskiy
  0 siblings, 1 reply; 55+ messages in thread
From: David Gibson @ 2014-08-15  4:20 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

[-- Attachment #1: Type: text/plain, Size: 5480 bytes --]

On Fri, Aug 15, 2014 at 01:09:20PM +1000, Alexey Kardashevskiy wrote:
> On 08/15/2014 10:04 AM, David Gibson wrote:
> > On Thu, Aug 14, 2014 at 06:29:50PM +1000, Alexey Kardashevskiy wrote:
> >> On 08/13/2014 01:27 PM, David Gibson wrote:
> >>> On Tue, Aug 12, 2014 at 05:25:29PM +1000, Alexey Kardashevskiy wrote:
> >>>> On 08/12/2014 11:45 AM, David Gibson wrote:
> >>>>> On Thu, Jul 31, 2014 at 07:34:10PM +1000, Alexey Kardashevskiy
> >>>> wrote:
> >>> [snip]
> >>>>> The function of this is kind of unclear.  I'm assuming this is
> >>>>> filtering the supported page sizes reported by the PHB by the possible
> >>>>> page sizes based on host page size or other constraints.  Is that
> >>>>> right?
> >>>>>
> >>>>> I think you'd be better off folding the whole double loop into the
> >>>>> fixmask function.
> >>>>>
> >>>>>> +
> >>>>>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> >>>>>> +    rtas_st(rets, 1, windows_available);
> >>>>>> +    /* Return maximum number as all RAM was 4K pages */
> >>>>>> +    rtas_st(rets, 2, ram_size >> SPAPR_TCE_PAGE_SHIFT);
> >>>>>
> >>>>> I'm assuming this is the allowed size of the dynamic windows.
> >>>>> Shouldn't that be reported by a PHB callback, rather than hardcoded
> >>>>> here?
> >>>>
> >>>> Why PHB? This is DMA memory. @ram_size is the upper limit, we can make more
> >>>> only when we have memory hotplug (which we do not have) and the guest can
> >>>> create smaller windows if it wants so I do not really follow you here.
> >>>
> >>> What I'm not clear on is what this RTAS return actually means.  Is it
> >>> saying the maximum size of the DMA window, or the maximum address
> >>> which can be mapped by that window?  Remember I don't have access to
> >>> PAPR documentation any more - nor do others reading these patches.
> >>
> >>
> >> It is literally "Largest contiguous block of TCEs allocated specifically
> >> for (that is, are reserved for) this PE". Which I understand as the maximum
> >> number of TCEs.
> > 
> > Ok, so essentially it's a property of the IOMMU.  Hrm, I guess
> > ram_size is good enough for now then.
> > 
> > [snip]
> >>> [snip]
> >>>>>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
> >>>>>> +                                          sPAPREnvironment *spapr,
> >>>>>> +                                          uint32_t token, uint32_t nargs,
> >>>>>> +                                          target_ulong args,
> >>>>>> +                                          uint32_t nret, target_ulong rets)
> >>>>>> +{
> >>>>>> +    sPAPRPHBState *sphb;
> >>>>>> +    sPAPRPHBClass *spc;
> >>>>>> +    sPAPRTCETable *tcet = NULL;
> >>>>>> +    uint32_t addr, page_shift, window_shift, liobn;
> >>>>>> +    uint64_t buid;
> >>>>>> +    long ret;
> >>>>>> +
> >>>>>> +    if ((nargs != 5) || (nret != 4)) {
> >>>>>> +        goto param_error_exit;
> >>>>>> +    }
> >>>>>> +
> >>>>>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> >>>>>> +    addr = rtas_ld(args, 0);
> >>>>>> +    sphb = spapr_pci_find_phb(spapr, buid);
> >>>>>> +    if (!sphb) {
> >>>>>> +        goto param_error_exit;
> >>>>>> +    }
> >>>>>> +
> >>>>>> +    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
> >>>>>> +    if (!spc->ddw_create) {
> >>>>>> +        goto hw_error_exit;
> >>>>>> +    }
> >>>>>> +
> >>>>>> +    page_shift = rtas_ld(args, 3);
> >>>>>> +    window_shift = rtas_ld(args, 4);
> >>>>>> +    liobn = sphb->dma_liobn + 0x10000;
> >>>>>
> >>>>> Isn't using a fixed LIOBN here assuming you can only have a single DDW
> >>>>> per PHB?  That's true for now, but in theory shouldn't it be reported
> >>>>> by the PHB code itself?
> >>>>
> >>>>
> >>>> This should be a unique LIOBN so it is not up to PHB to choose. And we
> >>>> cannot make it completely random for migration purposes. I'll make it
> >>>> something like
> >>>>
> >>>> #define SPAPR_DDW_LIOBN(sphb, windownum) ((sphb)->dma_liobn | windownum)
> >>>
> >>> Ok.
> >>>
> >>> Really, the assigned liobns should be included in the migration stream
> >>> if they're not already.
> >>
> >> LIOBNs already migrate, liobn itself is an instance id of a TCE table
> >> object in the migration stream.
> > 
> > Ok, so couldn't we just add an alloc_liobn() function instead of
> > hardcoding how the liobns are constructed?
> 
> 
> No. If we did so, exact numbers would depend on the device order in the
> QEMU command line - QEMU command line produced by libvirt from handmade XML
> and QEMU command line produced by libvirt from XML printed by "dumpxml" can
> have devices in different order, so interrupt numbers, LIOBNs - all of this
> gets broken.

Ah, duh.  Clearly I'm still swapping back in my knowledge of qemu
migration.

What I meant before is that qemu should reconstruct the TCE
tables / liobns based on the info in the migration stream, which
remains true, but that can't really be done without more broadly
addressing the fact that qemu doesn't transmit the hardware
configuration in the migration stream.

Ok, so what should probably be done here is to explicitly assign
fields in the LIOBN to "device type", "device id" (vio reg or PHB
buid), and per-device instance (DDW number for PCI).
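
Something along the lines of the layout suggested here could look as
follows; the field widths and names are purely an illustration of the
idea, not taken from PAPR or from the patches:

    #include <stdint.h>

    /* Illustrative only: carve the 32-bit LIOBN into fixed fields. */
    #define LIOBN_TYPE_SHIFT     28    /* device type: VIO, PCI, ...          */
    #define LIOBN_ID_SHIFT        8    /* device id, e.g. vio reg / PHB index */
    #define LIOBN_INSTANCE_MASK 0xff   /* per-device instance (DDW number)    */

    #define MAKE_LIOBN(type, id, n)                     \
        (((uint32_t)(type) << LIOBN_TYPE_SHIFT) |       \
         ((uint32_t)(id)   << LIOBN_ID_SHIFT)   |       \
         ((uint32_t)(n)    &  LIOBN_INSTANCE_MASK))

With a scheme like this the LIOBN depends only on stable identifiers, so
it does not change when devices are reordered on the command line.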

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls support
  2014-08-15  4:20               ` David Gibson
@ 2014-08-15  5:27                 ` Alexey Kardashevskiy
  2014-08-15  5:30                   ` David Gibson
  0 siblings, 1 reply; 55+ messages in thread
From: Alexey Kardashevskiy @ 2014-08-15  5:27 UTC (permalink / raw)
  To: David Gibson; +Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

On 08/15/2014 02:20 PM, David Gibson wrote:
> On Fri, Aug 15, 2014 at 01:09:20PM +1000, Alexey Kardashevskiy wrote:
>> On 08/15/2014 10:04 AM, David Gibson wrote:
>>> On Thu, Aug 14, 2014 at 06:29:50PM +1000, Alexey Kardashevskiy wrote:
>>>> On 08/13/2014 01:27 PM, David Gibson wrote:
>>>>> On Tue, Aug 12, 2014 at 05:25:29PM +1000, Alexey Kardashevskiy wrote:
>>>>>> On 08/12/2014 11:45 AM, David Gibson wrote:
>>>>>>> On Thu, Jul 31, 2014 at 07:34:10PM +1000, Alexey Kardashevskiy
>>>>>> wrote:
>>>>> [snip]
>>>>>>> The function of this is kind of unclear.  I'm assuming this is
>>>>>>> filtering the supported page sizes reported by the PHB by the possible
>>>>>>> page sizes based on host page size or other constraints.  Is that
>>>>>>> right?
>>>>>>>
>>>>>>> I think you'd be better off folding the whole double loop into the
>>>>>>> fixmask function.
>>>>>>>
>>>>>>>> +
>>>>>>>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>>>>>>>> +    rtas_st(rets, 1, windows_available);
>>>>>>>> +    /* Return maximum number as all RAM was 4K pages */
>>>>>>>> +    rtas_st(rets, 2, ram_size >> SPAPR_TCE_PAGE_SHIFT);
>>>>>>>
>>>>>>> I'm assuming this is the allowed size of the dynamic windows.
>>>>>>> Shouldn't that be reported by a PHB callback, rather than hardcoded
>>>>>>> here?
>>>>>>
>>>>>> Why PHB? This is DMA memory. @ram_size is the upper limit, we can make more
>>>>>> only when we have memory hotplug (which we do not have) and the guest can
>>>>>> create smaller windows if it wants so I do not really follow you here.
>>>>>
>>>>> What I'm not clear on is what this RTAS return actually means.  Is it
>>>>> saying the maximum size of the DMA window, or the maximum address
>>>>> which can be mapped by that window?  Remember I don't have access to
>>>>> PAPR documentation any more - nor do others reading these patches.
>>>>
>>>>
>>>> It is literally "Largest contiguous block of TCEs allocated specifically
>>>> for (that is, are reserved for) this PE". Which I understand as the maximum
>>>> number of TCEs.
>>>
>>> Ok, so essentially it's a property of the IOMMU.  Hrm, I guess
>>> ram_size is good enough for now then.
>>>
>>> [snip]
>>>>> [snip]
>>>>>>>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
>>>>>>>> +                                          sPAPREnvironment *spapr,
>>>>>>>> +                                          uint32_t token, uint32_t nargs,
>>>>>>>> +                                          target_ulong args,
>>>>>>>> +                                          uint32_t nret, target_ulong rets)
>>>>>>>> +{
>>>>>>>> +    sPAPRPHBState *sphb;
>>>>>>>> +    sPAPRPHBClass *spc;
>>>>>>>> +    sPAPRTCETable *tcet = NULL;
>>>>>>>> +    uint32_t addr, page_shift, window_shift, liobn;
>>>>>>>> +    uint64_t buid;
>>>>>>>> +    long ret;
>>>>>>>> +
>>>>>>>> +    if ((nargs != 5) || (nret != 4)) {
>>>>>>>> +        goto param_error_exit;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>>>>>>>> +    addr = rtas_ld(args, 0);
>>>>>>>> +    sphb = spapr_pci_find_phb(spapr, buid);
>>>>>>>> +    if (!sphb) {
>>>>>>>> +        goto param_error_exit;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
>>>>>>>> +    if (!spc->ddw_create) {
>>>>>>>> +        goto hw_error_exit;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    page_shift = rtas_ld(args, 3);
>>>>>>>> +    window_shift = rtas_ld(args, 4);
>>>>>>>> +    liobn = sphb->dma_liobn + 0x10000;
>>>>>>>
>>>>>>> Isn't using a fixed LIOBN here assuming you can only have a single DDW
>>>>>>> per PHB?  That's true for now, but in theory shouldn't it be reported
>>>>>>> by the PHB code itself?
>>>>>>
>>>>>>
>>>>>> This should be a unique LIOBN so it is not up to PHB to choose. And we
>>>>>> cannot make it completely random for migration purposes. I'll make it
>>>>>> something like
>>>>>>
>>>>>> #define SPAPR_DDW_LIOBN(sphb, windownum) ((sphb)->dma_liobn | windownum)
>>>>>
>>>>> Ok.
>>>>>
>>>>> Really, the assigned liobns should be included in the migration stream
>>>>> if they're not already.
>>>>
>>>> LIOBNs already migrate, liobn itself is an instance id of a TCE table
>>>> object in the migration stream.
>>>
>>> Ok, so couldn't we just add an alloc_liobn() function instead of
>>> hardcoding how the liobns are constructed?
>>
>>
>> No. If we did so, exact numbers would depend on the device order in the
>> QEMU command line - QEMU command line produced by libvirt from handmade XML
>> and QEMU command line produced by libvirt from XML printed by "dumpxml" can
>> have devices in different order, so interrupt numbers, LIOBNs - all of this
>> gets broken.
> 
> Ah, duh.  Clearly I'm still swapping back in my knowledge of qemu
> migration.
> 
> What I was meaning before is that qemu should reconstruct the TCE
> tables / liobns based on the info in the migration stream, which
> remains true, but it can't really be done without more broadly
> addressing the fact that qemu doesn't transmit the hardware
> configuration in the migration stream.
> 
> Ok, so what should probably be done here is to explicitly assign
> fields in the LIOBN to "device type", "device id" (vio reg or PHB
> buid), and per-device instance (DDW number for PCI).
> 

Yes, this is my new queue:

------------------------------ hw/ppc/spapr_pci.c ------------------------------
index 5c46c0d..17eb0d8 100644
@@ -527,11 +527,11 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
                        " be specified for PAPR PHB, not both");
             return;
         }

         sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
-        sphb->dma_liobn = SPAPR_PCI_BASE_LIOBN + sphb->index;
+        sphb->dma_liobn = SPAPR_PCI_LIOBN(sphb->index, 0);

         windows_base = SPAPR_PCI_WINDOW_BASE
             + sphb->index * SPAPR_PCI_WINDOW_SPACING;
         sphb->mem_win_addr = windows_base + SPAPR_PCI_MMIO_WIN_OFF;
         sphb->io_win_addr = windows_base + SPAPR_PCI_IO_WIN_OFF;

---------------------------- include/hw/ppc/spapr.h ----------------------------
index c9d6c6c..92c6e2c 100644
@@ -441,11 +441,11 @@ int spapr_rtas_device_tree_setup(void *fdt, hwaddr rtas_addr,
 #define SPAPR_TCE_PAGE_SHIFT   12
 #define SPAPR_TCE_PAGE_SIZE    (1ULL << SPAPR_TCE_PAGE_SHIFT)
 #define SPAPR_TCE_PAGE_MASK    (SPAPR_TCE_PAGE_SIZE - 1)

 #define SPAPR_VIO_BASE_LIOBN    0x00000000
-#define SPAPR_PCI_BASE_LIOBN    0x80000000
+#define SPAPR_PCI_LIOBN(i, n)   (0x80000000 | ((i) << 8) | (n))

 #define RTAS_ERROR_LOG_MAX      2048
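
With that macro the DDW code quoted earlier could presumably derive the
dynamic window's LIOBN from the same stable index instead of the
"dma_liobn + 0x10000" expression, along the lines of (illustrative only):

    liobn = SPAPR_PCI_LIOBN(sphb->index, window_num);  /* window_num = 1 for the first DDW */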




-- 
Alexey

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls support
  2014-08-15  5:27                 ` Alexey Kardashevskiy
@ 2014-08-15  5:30                   ` David Gibson
  0 siblings, 0 replies; 55+ messages in thread
From: David Gibson @ 2014-08-15  5:30 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Alexander Graf

[-- Attachment #1: Type: text/plain, Size: 7279 bytes --]

On Fri, Aug 15, 2014 at 03:27:42PM +1000, Alexey Kardashevskiy wrote:
> On 08/15/2014 02:20 PM, David Gibson wrote:
> > On Fri, Aug 15, 2014 at 01:09:20PM +1000, Alexey Kardashevskiy wrote:
> >> On 08/15/2014 10:04 AM, David Gibson wrote:
> >>> On Thu, Aug 14, 2014 at 06:29:50PM +1000, Alexey Kardashevskiy wrote:
> >>>> On 08/13/2014 01:27 PM, David Gibson wrote:
> >>>>> On Tue, Aug 12, 2014 at 05:25:29PM +1000, Alexey Kardashevskiy wrote:
> >>>>>> On 08/12/2014 11:45 AM, David Gibson wrote:
> >>>>>>> On Thu, Jul 31, 2014 at 07:34:10PM +1000, Alexey Kardashevskiy
> >>>>>> wrote:
> >>>>> [snip]
> >>>>>>> The function of this is kind of unclear.  I'm assuming this is
> >>>>>>> filtering the supported page sizes reported by the PHB by the possible
> >>>>>>> page sizes based on host page size or other constraints.  Is that
> >>>>>>> right?
> >>>>>>>
> >>>>>>> I think you'd be better off folding the whole double loop into the
> >>>>>>> fixmask function.
> >>>>>>>
> >>>>>>>> +
> >>>>>>>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> >>>>>>>> +    rtas_st(rets, 1, windows_available);
> >>>>>>>> +    /* Return maximum number as all RAM was 4K pages */
> >>>>>>>> +    rtas_st(rets, 2, ram_size >> SPAPR_TCE_PAGE_SHIFT);
> >>>>>>>
> >>>>>>> I'm assuming this is the allowed size of the dynamic windows.
> >>>>>>> Shouldn't that be reported by a PHB callback, rather than hardcoded
> >>>>>>> here?
> >>>>>>
> >>>>>> Why PHB? This is DMA memory. @ram_size is the upper limit, we can make more
> >>>>>> only when we have memory hotplug (which we do not have) and the guest can
> >>>>>> create smaller windows if it wants so I do not really follow you here.
> >>>>>
> >>>>> What I'm not clear on is what this RTAS return actually means.  Is it
> >>>>> saying the maximum size of the DMA window, or the maximum address
> >>>>> which can be mapped by that window?  Remember I don't have access to
> >>>>> PAPR documentation any more - nor do others reading these patches.
> >>>>
> >>>>
> >>>> It is literally "Largest contiguous block of TCEs allocated specifically
> >>>> for (that is, are reserved for) this PE". Which I understand as the maximum
> >>>> number of TCEs.
> >>>
> >>> Ok, so essentially it's a property of the IOMMU.  Hrm, I guess
> >>> ram_size is good enough for now then.
> >>>
> >>> [snip]
> >>>>> [snip]
> >>>>>>>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
> >>>>>>>> +                                          sPAPREnvironment *spapr,
> >>>>>>>> +                                          uint32_t token, uint32_t nargs,
> >>>>>>>> +                                          target_ulong args,
> >>>>>>>> +                                          uint32_t nret, target_ulong rets)
> >>>>>>>> +{
> >>>>>>>> +    sPAPRPHBState *sphb;
> >>>>>>>> +    sPAPRPHBClass *spc;
> >>>>>>>> +    sPAPRTCETable *tcet = NULL;
> >>>>>>>> +    uint32_t addr, page_shift, window_shift, liobn;
> >>>>>>>> +    uint64_t buid;
> >>>>>>>> +    long ret;
> >>>>>>>> +
> >>>>>>>> +    if ((nargs != 5) || (nret != 4)) {
> >>>>>>>> +        goto param_error_exit;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> >>>>>>>> +    addr = rtas_ld(args, 0);
> >>>>>>>> +    sphb = spapr_pci_find_phb(spapr, buid);
> >>>>>>>> +    if (!sphb) {
> >>>>>>>> +        goto param_error_exit;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    spc = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(sphb);
> >>>>>>>> +    if (!spc->ddw_create) {
> >>>>>>>> +        goto hw_error_exit;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    page_shift = rtas_ld(args, 3);
> >>>>>>>> +    window_shift = rtas_ld(args, 4);
> >>>>>>>> +    liobn = sphb->dma_liobn + 0x10000;
> >>>>>>>
> >>>>>>> Isn't using a fixed LIOBN here assuming you can only have a single DDW
> >>>>>>> per PHB?  That's true for now, but in theory shouldn't it be reported
> >>>>>>> by the PHB code itself?
> >>>>>>
> >>>>>>
> >>>>>> This should be a unique LIOBN so it is not up to PHB to choose. And we
> >>>>>> cannot make it completely random for migration purposes. I'll make it
> >>>>>> something like
> >>>>>>
> >>>>>> #define SPAPR_DDW_LIOBN(sphb, windownum) ((sphb)->dma_liobn | windownum)
> >>>>>
> >>>>> Ok.
> >>>>>
> >>>>> Really, the assigned liobns should be included in the migration stream
> >>>>> if they're not already.
> >>>>
> >>>> LIOBNs already migrate, liobn itself is an instance id of a TCE table
> >>>> object in the migration stream.
> >>>
> >>> Ok, so couldn't we just add an alloc_liobn() function instead of
> >>> hardcoding how the liobns are constructed?
> >>
> >>
> >> No. If we did so, exact numbers would depend on the device order in the
> >> QEMU command line - QEMU command line produced by libvirt from handmade XML
> >> and QEMU command line produced by libvirt from XML printed by "dumpxml" can
> >> have devices in different order, so interrupt numbers, LIOBNs - all of this
> >> gets broken.
> > 
> > Ah, duh.  Clearly I'm still swapping back in my knowledge of qemu
> > migration.
> > 
> > What I was meaning before is that qemu should reconstruct the TCE
> > tables / liobns based on the info in the migration stream, which
> > remains true, but it can't really be done without more broadly
> > addressing the fact that qemu doesn't transmit the hardware
> > configuration in the migration stream.
> > 
> > Ok, so what should probably be done here is to explicitly assign
> > fields in the LIOBN to "device type", "device id" (vio reg or PHB
> > buid), and per-device instance (DDW number for PCI).
> > 
> 
> Yes, this is my new queue:
> 
> ------------------------------ hw/ppc/spapr_pci.c
> ------------------------------
> index 5c46c0d..17eb0d8 100644
> @@ -527,11 +527,11 @@ static void spapr_phb_realize(DeviceState *dev, Error
> **errp)
>                         " be specified for PAPR PHB, not both");
>              return;
>          }
> 
>          sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
> -        sphb->dma_liobn = SPAPR_PCI_BASE_LIOBN + sphb->index;
> +        sphb->dma_liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
> 
>          windows_base = SPAPR_PCI_WINDOW_BASE
>              + sphb->index * SPAPR_PCI_WINDOW_SPACING;
>          sphb->mem_win_addr = windows_base + SPAPR_PCI_MMIO_WIN_OFF;
>          sphb->io_win_addr = windows_base + SPAPR_PCI_IO_WIN_OFF;
> 
> ---------------------------- include/hw/ppc/spapr.h
> ----------------------------
> index c9d6c6c..92c6e2c 100644
> @@ -441,11 +441,11 @@ int spapr_rtas_device_tree_setup(void *fdt, hwaddr
> rtas_addr,
>  #define SPAPR_TCE_PAGE_SHIFT   12
>  #define SPAPR_TCE_PAGE_SIZE    (1ULL << SPAPR_TCE_PAGE_SHIFT)
>  #define SPAPR_TCE_PAGE_MASK    (SPAPR_TCE_PAGE_SIZE - 1)
> 
>  #define SPAPR_VIO_BASE_LIOBN    0x00000000
> -#define SPAPR_PCI_BASE_LIOBN    0x80000000
> +#define SPAPR_PCI_LIOBN(i, n)   (0x80000000 | ((i) << 8) | (n))
> 
>  #define RTAS_ERROR_LOG_MAX      2048

Looks good.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: Enable DDW
  2014-08-15  3:16                     ` Alexey Kardashevskiy
@ 2014-08-15  7:37                       ` Alexander Graf
  0 siblings, 0 replies; 55+ messages in thread
From: Alexander Graf @ 2014-08-15  7:37 UTC (permalink / raw)
  To: Alexey Kardashevskiy, qemu-devel; +Cc: Alex Williamson, qemu-ppc


On 15.08.14 05:16, Alexey Kardashevskiy wrote:
> On 08/14/2014 11:38 PM, Alexander Graf wrote:
>> On 13.08.14 02:18, Alexey Kardashevskiy wrote:
>>> On 08/13/2014 01:28 AM, Alexander Graf wrote:
>>>> On 12.08.14 17:10, Alexey Kardashevskiy wrote:
>>>>> On 08/12/2014 07:37 PM, Alexander Graf wrote:
>>>>>> On 12.08.14 02:03, Alexey Kardashevskiy wrote:
>>>>>>> On 08/12/2014 03:30 AM, Alexander Graf wrote:
>>>>>>>> On 11.08.14 17:01, Alexey Kardashevskiy wrote:
>>>>>>>>> On 08/11/2014 10:02 PM, Alexander Graf wrote:
>>>>>>>>>> On 31.07.14 11:34, Alexey Kardashevskiy wrote:
>>>>>>>>>>> This implements DDW for VFIO. Host kernel support is required for
>>>>>>>>>>> this.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>>> ---
>>>>>>>>>>>        hw/ppc/spapr_pci_vfio.c | 75
>>>>>>>>>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>>        1 file changed, 75 insertions(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
>>>>>>>>>>> index d3bddf2..dc443e2 100644
>>>>>>>>>>> --- a/hw/ppc/spapr_pci_vfio.c
>>>>>>>>>>> +++ b/hw/ppc/spapr_pci_vfio.c
>>>>>>>>>>> @@ -69,6 +69,77 @@ static void
>>>>>>>>>>> spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
>>>>>>>>>>>            /* Register default 32bit DMA window */
>>>>>>>>>>>            memory_region_add_subregion(&sphb->iommu_root,
>>>>>>>>>>> tcet->bus_offset,
>>>>>>>>>>>                                        spapr_tce_get_iommu(tcet));
>>>>>>>>>>> +
>>>>>>>>>>> +    sphb->ddw_supported = !!(info.flags &
>>>>>>>>>>> VFIO_IOMMU_SPAPR_TCE_FLAG_DDW);
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static int spapr_pci_vfio_ddw_query(sPAPRPHBState *sphb,
>>>>>>>>>>> +                                    uint32_t *windows_available,
>>>>>>>>>>> +                                    uint32_t *page_size_mask)
>>>>>>>>>>> +{
>>>>>>>>>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>>>>>>>>>> +    struct vfio_iommu_spapr_tce_query query = { .argsz =
>>>>>>>>>>> sizeof(query) };
>>>>>>>>>>> +    int ret;
>>>>>>>>>>> +
>>>>>>>>>>> +    ret = vfio_container_ioctl(&sphb->iommu_as,
>>>>>>>>>>> svphb->iommugroupid,
>>>>>>>>>>> +                               VFIO_IOMMU_SPAPR_TCE_QUERY, &query);
>>>>>>>>>>> +    if (ret) {
>>>>>>>>>>> +        return ret;
>>>>>>>>>>> +    }
>>>>>>>>>>> +
>>>>>>>>>>> +    *windows_available = query.windows_available;
>>>>>>>>>>> +    *page_size_mask = query.page_size_mask;
>>>>>>>>>>> +
>>>>>>>>>>> +    return ret;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static int spapr_pci_vfio_ddw_create(sPAPRPHBState *sphb, uint32_t
>>>>>>>>>>> page_shift,
>>>>>>>>>>> +                                     uint32_t window_shift,
>>>>>>>>>>> uint32_t
>>>>>>>>>>> liobn,
>>>>>>>>>>> +                                     sPAPRTCETable **ptcet)
>>>>>>>>>>> +{
>>>>>>>>>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>>>>>>>>>> +    struct vfio_iommu_spapr_tce_create create = {
>>>>>>>>>>> +        .argsz = sizeof(create),
>>>>>>>>>>> +        .page_shift = page_shift,
>>>>>>>>>>> +        .window_shift = window_shift,
>>>>>>>>>>> +        .start_addr = 0
>>>>>>>>>>> +    };
>>>>>>>>>>> +    int ret;
>>>>>>>>>>> +
>>>>>>>>>>> +    ret = vfio_container_ioctl(&sphb->iommu_as,
>>>>>>>>>>> svphb->iommugroupid,
>>>>>>>>>>> +                               VFIO_IOMMU_SPAPR_TCE_CREATE,
>>>>>>>>>>> &create);
>>>>>>>>>>> +    if (ret) {
>>>>>>>>>>> +        return ret;
>>>>>>>>>>> +    }
>>>>>>>>>>> +
>>>>>>>>>>> +    *ptcet = spapr_tce_new_table(DEVICE(sphb), liobn,
>>>>>>>>>>> create.start_addr,
>>>>>>>>>>> +                                 page_shift, 1 << (window_shift -
>>>>>>>>>>> page_shift),
>>>>>>>>>> I spot a 1 without ULL again - this time it might work out ok, but
>>>>>>>>>> please
>>>>>>>>>> just always use ULL when you pass around addresses.
>>>>>>>>> My bad. I keep forgetting this, I'll adjust my own checkpatch.py :)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Please walk me though the abstraction levels on what each page size
>>>>>>>>>> honoration means. If I use THP, what page size granularity can I use
>>>>>>>>>> for
>>>>>>>>>> TCE entries?
>>>>>>>>> [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls
>>>>>>>>> support
>>>>>>>>>
>>>>>>>>> +        const struct { int shift; uint32_t mask; } masks[] = {
>>>>>>>>> +            { 12, DDW_PGSIZE_4K },
>>>>>>>>> +            { 16, DDW_PGSIZE_64K },
>>>>>>>>> +            { 24, DDW_PGSIZE_16M },
>>>>>>>>> +            { 25, DDW_PGSIZE_32M },
>>>>>>>>> +            { 26, DDW_PGSIZE_64M },
>>>>>>>>> +            { 27, DDW_PGSIZE_128M },
>>>>>>>>> +            { 28, DDW_PGSIZE_256M },
>>>>>>>>> +            { 34, DDW_PGSIZE_16G },
>>>>>>>>> +        };
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Supported page sizes are returned by the host kernel via "query". For
>>>>>>>>> 16MB
>>>>>>>>> pages, page shift will return
>>>>>>>>> DDW_PGSIZE_4K|DDW_PGSIZE_64K|DDW_PGSIZE_16M.
>>>>>>>>> Or I did not understand the question...
>>>>>>>> Why do we care about the sizes? Anything bigger than what we support
>>>>>>>> should
>>>>>>>> always work, no? What happens if the guest creates a 16MB map but my
>>>>>>>> pages
>>>>>>>> are 4kb mapped? Wouldn't the same logic be able to deal with 16G pages?
>>>>>>> It is DMA memory, if I split "virtual" 16M page to a bunch of real 4K
>>>>>>> pages, I have to make sure these 16M are continuous - there will be one
>>>>>>> TCE
>>>>>>> entry for it and no more translations besides IOMMU. What do I miss now?
>>>>>> Who does the shadow translation where? Does it exist at all?
>>>>> IOMMU? I am not sure I am following you... This IOMMU will look as direct
>>>>> DMA for the guest but the real IOMMU table is sparse and it is populated
>>>>> via a bunch of H_PUT_TCE calls as the default small window.
>>>>>
>>>>> There is a direct mapping in the host called "bypass window" but it is not
>>>>> used here as sPAPR does not define that for paravirtualization.
>>>> Ok, imagine I have 16MB of guest physical memory that is in reality backed
>>>> by 256 64k pages on the host. The guest wants to create a 16M TCE entry for
>>>> this (from its point of view contiguous) chunk of memory.
>>>>
>>>> Do we allow this?
>>> No, we do not. We tell the guest what it can use.
>>>
>>>> Or do we force the guest to create 64k TCE entries?
>>> 16MB TCE pages are only allowed if qemu is running with hugepages.
>> That's unfortunate ;)
>
> This is my limitation, not SPAPR spec or anything like that.
>
>> but as long as we have to pin TCEd memory anyway, I
>> guess it doesn't hurt as badly.
> Yep.
>
>
>>>
>>>> If we allow it, why would we ever put any restriction at the upper end of
>>>> TCE entry sizes? If we already implement enough logic to map things lazily
>>>> around, we could as well have the guest create a 256M TCE entry and just
>>>> split it on the host view to 64k TCE entries.
>>> Oh, thiiiiiis is what you meant...
>>>
>>> Well, we could, just for now current linux guests support 4K/64K/16M only
>>> and they choose depending on what hypervisor supports - look at
>>> enable_ddw() in the guest. What you suggest seems to be an unnecessary code
>>> duplication for 16MB pages case. For bigger page sizes - for example, for
>>> 64GB guest, a TCE table with 16MB TCEs will be 32KB which is already
>>> awesome enough, no?
>> In "normal" invironments guests won't be backed by 16M pages, but by 64k
>> pages with the occasional THP huge page merge that you can't rely on.
>>
>> That's why I figured it'd be smart to support 16MB TCEs even when the
>> underlying memory is only backed by 64k pages.
>
> Real TCE table will be using 64K pages anyway - whether we fake 16MB pages
> to the guest and split them silently in QEMU/host OR guest itself uses 64K
> pages, in any case we end up with a bigger 64K TCE table so I do not see
> what we win if we fake 16MB pages for the guest.

Well, what if you, for example, want to live migrate from a hugepage-backed 
guest to a non-hugepage-backed guest? In that case you have to migrate the 
TCEs as well, but the destination can't use the 16MB ones, right?

We also win in that we need fewer hypercalls to add TCEs, but I suppose 
that part is almost negligible.


Alex


> My point it the code is already in the guest, it can do 64K and 16MB pages,
> if it was not there, yes, everything you suggest would make sense to
> implement :)

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2014-08-15  7:37 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-31  9:34 [Qemu-devel] [RFC PATCH 00/10] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 01/10] qom: Make object_child_foreach safe for objects removal Alexey Kardashevskiy
2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 02/10] spapr_iommu: Disable in-kernel IOMMU tables for >4GB windows Alexey Kardashevskiy
2014-08-12  1:17   ` David Gibson
2014-08-12  7:32     ` Alexey Kardashevskiy
2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 03/10] spapr_pci: Make find_phb()/find_dev() public Alexey Kardashevskiy
2014-08-11 11:39   ` Alexander Graf
2014-08-11 14:56     ` Alexey Kardashevskiy
2014-08-11 17:16       ` Alexander Graf
2014-08-12  1:19   ` David Gibson
2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 04/10] spapr_iommu: Make spapr_tce_find_by_liobn() public Alexey Kardashevskiy
2014-08-12  1:19   ` David Gibson
2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 05/10] linux headers update for DDW Alexey Kardashevskiy
2014-08-12  1:20   ` David Gibson
2014-08-12  7:16     ` Alexey Kardashevskiy
2014-08-13  3:23       ` David Gibson
2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls support Alexey Kardashevskiy
2014-08-11 11:51   ` Alexander Graf
2014-08-11 15:34     ` Alexey Kardashevskiy
2014-08-12  1:45   ` David Gibson
2014-08-12  7:25     ` Alexey Kardashevskiy
2014-08-13  3:27       ` David Gibson
2014-08-14  8:29         ` Alexey Kardashevskiy
2014-08-15  0:04           ` David Gibson
2014-08-15  3:09             ` Alexey Kardashevskiy
2014-08-15  4:20               ` David Gibson
2014-08-15  5:27                 ` Alexey Kardashevskiy
2014-08-15  5:30                   ` David Gibson
2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 07/10] spapr: Add "ddw" machine option Alexey Kardashevskiy
2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 08/10] spapr_pci: Enable DDW Alexey Kardashevskiy
2014-08-11 11:59   ` Alexander Graf
2014-08-11 15:26     ` Alexey Kardashevskiy
2014-08-11 17:29       ` Alexander Graf
2014-08-12  0:13         ` Alexey Kardashevskiy
2014-08-12  3:59           ` Alexey Kardashevskiy
2014-08-12  9:36             ` Alexander Graf
2014-08-12  2:10   ` David Gibson
2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: " Alexey Kardashevskiy
2014-08-11 12:02   ` Alexander Graf
2014-08-11 15:01     ` Alexey Kardashevskiy
2014-08-11 17:30       ` Alexander Graf
2014-08-12  0:03         ` Alexey Kardashevskiy
2014-08-12  9:37           ` Alexander Graf
2014-08-12 15:10             ` Alexey Kardashevskiy
2014-08-12 15:28               ` Alexander Graf
2014-08-13  0:18                 ` Alexey Kardashevskiy
2014-08-14 13:38                   ` Alexander Graf
2014-08-15  0:09                     ` David Gibson
2014-08-15  3:22                       ` Alexey Kardashevskiy
2014-08-15  3:16                     ` Alexey Kardashevskiy
2014-08-15  7:37                       ` Alexander Graf
2014-08-12  2:14   ` David Gibson
2014-07-31  9:34 ` [Qemu-devel] [RFC PATCH 10/10] vfio: Enable DDW ioctls to VFIO IOMMU driver Alexey Kardashevskiy
2014-08-05  1:30 ` [Qemu-devel] [RFC PATCH 00/10] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
2014-08-10 23:50   ` Alexey Kardashevskiy
