All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [RFC 0/4] Improve PCI IO window orgnaization for pseries
@ 2016-10-06  3:03 David Gibson
  2016-10-06  3:03 ` [Qemu-devel] [RFC 1/4] spapr_pci: Delegate placement of PCI host bridges to machine type David Gibson
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: David Gibson @ 2016-10-06  3:03 UTC (permalink / raw)
  To: qemu-ppc
  Cc: qemu-devel, benh, thuth, lvivier, agraf, mst, aik, mdroth,
	nikunj, bharata, abologna, mpolednik, David Gibson

The current way we organize the IO windows into PCI space for the
pseries machine type has several problems.

  - It makes it difficult to create very large MMIO spaces which is
    necessary for certain PCI devices with very large BARs.  This
    problem has been known for a while.

  - More recently we discovered a more serious problem: it prevents
    more than 1TiB of RAM being added to a pseries guest.

  - It doesn't make very efficient use of address space.

Fixing this is complicated by keeping migration from old versionss
working and working out what things belong on which side of the
abstraction barrier between the machine type and the host bridge
device.

This patch addresses all these problems.  Patches 1-2/4 represent a
minimal fix for the most serious problem (the 1 TiB limit) - once
polished, I'll consider submiting these for the stable branch.  3-4/4
complete a more comprehensive fix.

David Gibson (4):
  spapr_pci: Delegate placement of PCI host bridges to machine type
  spapr: Adjust placement of PCI host bridge to allow > 1TiB RAM
  spapr_pci: Add a 64-bit MMIO window
  spapr: Improved placement of PCI host bridges in guest memory map

 hw/ppc/spapr.c              | 121 +++++++++++++++++++++++++++++++++++++++++++-
 hw/ppc/spapr_pci.c          |  90 ++++++++++++++++++++++----------
 include/hw/pci-host/spapr.h |  24 ++++-----
 include/hw/ppc/spapr.h      |   5 ++
 4 files changed, 199 insertions(+), 41 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Qemu-devel] [RFC 1/4] spapr_pci: Delegate placement of PCI host bridges to machine type
  2016-10-06  3:03 [Qemu-devel] [RFC 0/4] Improve PCI IO window orgnaization for pseries David Gibson
@ 2016-10-06  3:03 ` David Gibson
  2016-10-06  7:10   ` Laurent Vivier
                     ` (2 more replies)
  2016-10-06  3:03 ` [Qemu-devel] [RFC 2/4] spapr: Adjust placement of PCI host bridge to allow > 1TiB RAM David Gibson
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 20+ messages in thread
From: David Gibson @ 2016-10-06  3:03 UTC (permalink / raw)
  To: qemu-ppc
  Cc: qemu-devel, benh, thuth, lvivier, agraf, mst, aik, mdroth,
	nikunj, bharata, abologna, mpolednik, David Gibson

The 'spapr-pci-host-bridge' represents the virtual PCI host bridge (PHB)
for a PAPR guest.  Unlike on x86, it's routine on Power (both bare metal
and PAPR guests) to have numerous independent PHBs, each controlling a
separate PCI domain.

There are two ways of configuring the spapr-pci-host-bridge device: first
it can be done fully manually, specifying the locations and sizes of all
the IO windows.  This gives the most control, but is very awkward with 6
mandatory parameters.  Alternatively just an "index" can be specified
which essentially selects from an array of predefined PHB locations.
The PHB at index 0 is automatically created as the default PHB.

The current set of default locations causes some problems for guests with
large RAM (> 1 TiB) or PCI devices with very large BARs (e.g. big nVidia
GPGPU cards via VFIO).  Obviously, for migration we can only change the
locations on a new machine type, however.

This is awkward, because the placement is currently decided within the
spapr-pci-host-bridge code, so it breaks abstraction to look inside the
machine type version.

So, this patch delegates the "default mode" PHB placement from the
spapr-pci-host-bridge device back to the machine type via a public method
in sPAPRMachineClass.  It's still a bit ugly, but it's about the best we
can do.

For now, this just changes where the calculation is done.  It doesn't
change the actual location of the host bridges, or any other behaviour.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr.c              | 34 ++++++++++++++++++++++++++++++++++
 hw/ppc/spapr_pci.c          | 22 ++++++++--------------
 include/hw/pci-host/spapr.h | 11 +----------
 include/hw/ppc/spapr.h      |  4 ++++
 4 files changed, 47 insertions(+), 24 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 03e3803..f6e9c2a 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2370,6 +2370,39 @@ static HotpluggableCPUList *spapr_query_hotpluggable_cpus(MachineState *machine)
     return head;
 }
 
+static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
+                                uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
+                                hwaddr *mmio, hwaddr *mmio_size,
+                                unsigned n_dma, uint32_t *liobns, Error **errp)
+{
+    const uint64_t base_buid = 0x800000020000000ULL;
+    const hwaddr phb0_base = 0x10000000000ULL; /* 1 TiB */
+    const hwaddr phb_spacing = 0x1000000000ULL; /* 64 GiB */
+    const hwaddr mmio_offset = 0xa0000000; /* 2 GiB + 512 MiB */
+    const hwaddr pio_offset = 0x80000000; /* 2 GiB */
+    const uint32_t max_index = 255;
+
+    hwaddr phb_base;
+    int i;
+
+    if (index > max_index) {
+        error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
+                   max_index);
+        return;
+    }
+
+    *buid = base_buid + index;
+    for (i = 0; i < n_dma; ++i) {
+        liobns[i] = SPAPR_PCI_LIOBN(index, i);
+    }
+
+    phb_base = phb0_base + index * phb_spacing;
+    *pio = phb_base + pio_offset;
+    *pio_size = SPAPR_PCI_IO_WIN_SIZE;
+    *mmio = phb_base + mmio_offset;
+    *mmio_size = SPAPR_PCI_MMIO_WIN_SIZE;
+}
+
 static void spapr_machine_class_init(ObjectClass *oc, void *data)
 {
     MachineClass *mc = MACHINE_CLASS(oc);
@@ -2406,6 +2439,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
     mc->query_hotpluggable_cpus = spapr_query_hotpluggable_cpus;
     fwc->get_dev_path = spapr_get_fw_dev_path;
     nc->nmi_monitor_handler = spapr_nmi;
+    smc->phb_placement = spapr_phb_placement;
 }
 
 static const TypeInfo spapr_machine_info = {
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 4f00865..c0fc964 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1311,7 +1311,8 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
 
     if (sphb->index != (uint32_t)-1) {
-        hwaddr windows_base;
+        sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
+        Error *local_err = NULL;
 
         if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
             || (sphb->dma_liobn[1] != (uint32_t)-1 && windows_supported == 2)
@@ -1322,21 +1323,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
             return;
         }
 
-        if (sphb->index > SPAPR_PCI_MAX_INDEX) {
-            error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
-                       SPAPR_PCI_MAX_INDEX);
+        smc->phb_placement(spapr, sphb->index,
+                           &sphb->buid, &sphb->io_win_addr, &sphb->io_win_size,
+                           &sphb->mem_win_addr, &sphb->mem_win_size,
+                           windows_supported, sphb->dma_liobn, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
             return;
         }
-
-        sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
-        for (i = 0; i < windows_supported; ++i) {
-            sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
-        }
-
-        windows_base = SPAPR_PCI_WINDOW_BASE
-            + sphb->index * SPAPR_PCI_WINDOW_SPACING;
-        sphb->mem_win_addr = windows_base + SPAPR_PCI_MMIO_WIN_OFF;
-        sphb->io_win_addr = windows_base + SPAPR_PCI_IO_WIN_OFF;
     }
 
     if (sphb->buid == (uint64_t)-1) {
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 30dbd46..8c9ebfd 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -79,18 +79,9 @@ struct sPAPRPHBState {
     uint32_t numa_node;
 };
 
-#define SPAPR_PCI_MAX_INDEX          255
-
-#define SPAPR_PCI_BASE_BUID          0x800000020000000ULL
-
 #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
 
-#define SPAPR_PCI_WINDOW_BASE        0x10000000000ULL
-#define SPAPR_PCI_WINDOW_SPACING     0x1000000000ULL
-#define SPAPR_PCI_MMIO_WIN_OFF       0xA0000000
-#define SPAPR_PCI_MMIO_WIN_SIZE      (SPAPR_PCI_WINDOW_SPACING - \
-                                     SPAPR_PCI_MEM_WIN_BUS_OFFSET)
-#define SPAPR_PCI_IO_WIN_OFF         0x80000000
+#define SPAPR_PCI_MMIO_WIN_SIZE      0xf80000000
 #define SPAPR_PCI_IO_WIN_SIZE        0x10000
 
 #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 39dadaa..9e1c5a5 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -40,6 +40,10 @@ struct sPAPRMachineClass {
     bool dr_lmb_enabled;       /* enable dynamic-reconfig/hotplug of LMBs */
     bool use_ohci_by_default;  /* use USB-OHCI instead of XHCI */
     const char *tcg_default_cpu; /* which (TCG) CPU to simulate by default */
+    void (*phb_placement)(sPAPRMachineState *spapr, uint32_t index,
+                          uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
+                          hwaddr *mmio, hwaddr *mmio_size,
+                          unsigned n_dma, uint32_t *liobns, Error **errp);
 };
 
 /**
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [RFC 2/4] spapr: Adjust placement of PCI host bridge to allow > 1TiB RAM
  2016-10-06  3:03 [Qemu-devel] [RFC 0/4] Improve PCI IO window orgnaization for pseries David Gibson
  2016-10-06  3:03 ` [Qemu-devel] [RFC 1/4] spapr_pci: Delegate placement of PCI host bridges to machine type David Gibson
@ 2016-10-06  3:03 ` David Gibson
  2016-10-06  7:21   ` Laurent Vivier
  2016-10-06  9:36   ` Laurent Vivier
  2016-10-06  3:03 ` [Qemu-devel] [RFC 3/4] spapr_pci: Add a 64-bit MMIO window David Gibson
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 20+ messages in thread
From: David Gibson @ 2016-10-06  3:03 UTC (permalink / raw)
  To: qemu-ppc
  Cc: qemu-devel, benh, thuth, lvivier, agraf, mst, aik, mdroth,
	nikunj, bharata, abologna, mpolednik, David Gibson

Currently the default PCI host bridge for the 'pseries' machine type is
constructed with its IO windows in the 1TiB..(1TiB + 64GiB) range in
guest memory space.  This means that if > 1TiB of guest RAM is specified,
the RAM will collide with the PCI IO windows, causing serious problems.

Problems won't be obvious until guest RAM goes a bit beyond 1TiB, because
there's a little unused space at the bottom of the area reserved for PCI,
but essentially this means that > 1TiB of RAM has never worked with the
pseries machine type.

This patch fixes this by altering the placement of PHBs on large-RAM VMs.
Instead of always placing the first PHB at 1TiB, it is placed at the next
1 TiB boundary after the maximum RAM address.

Technically, this changes behaviour in a migration-breaking way for
existing machines with > 1TiB maximum memory, but since having > 1 TiB
memory was broken anyway, this seems like a reasonable trade-off.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index f6e9c2a..9f3e004 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2376,12 +2376,15 @@ static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
                                 unsigned n_dma, uint32_t *liobns, Error **errp)
 {
     const uint64_t base_buid = 0x800000020000000ULL;
-    const hwaddr phb0_base = 0x10000000000ULL; /* 1 TiB */
     const hwaddr phb_spacing = 0x1000000000ULL; /* 64 GiB */
     const hwaddr mmio_offset = 0xa0000000; /* 2 GiB + 512 MiB */
     const hwaddr pio_offset = 0x80000000; /* 2 GiB */
     const uint32_t max_index = 255;
+    const hwaddr phb0_alignment = 0x10000000000ULL; /* 1 TiB */
 
+    uint64_t max_hotplug_addr = spapr->hotplug_memory.base +
+        memory_region_size(&spapr->hotplug_memory.mr);
+    hwaddr phb0_base = QEMU_ALIGN_UP(max_hotplug_addr, phb0_alignment);
     hwaddr phb_base;
     int i;
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [RFC 3/4] spapr_pci: Add a 64-bit MMIO window
  2016-10-06  3:03 [Qemu-devel] [RFC 0/4] Improve PCI IO window orgnaization for pseries David Gibson
  2016-10-06  3:03 ` [Qemu-devel] [RFC 1/4] spapr_pci: Delegate placement of PCI host bridges to machine type David Gibson
  2016-10-06  3:03 ` [Qemu-devel] [RFC 2/4] spapr: Adjust placement of PCI host bridge to allow > 1TiB RAM David Gibson
@ 2016-10-06  3:03 ` David Gibson
  2016-10-06  3:03 ` [Qemu-devel] [RFC 4/4] spapr: Improved placement of PCI host bridges in guest memory map David Gibson
  2016-10-10 15:53 ` [Qemu-devel] [RFC 0/4] Improve PCI IO window orgnaization for pseries no-reply
  4 siblings, 0 replies; 20+ messages in thread
From: David Gibson @ 2016-10-06  3:03 UTC (permalink / raw)
  To: qemu-ppc
  Cc: qemu-devel, benh, thuth, lvivier, agraf, mst, aik, mdroth,
	nikunj, bharata, abologna, mpolednik, David Gibson

On real hardware, and under pHyp, the PCI host bridges on Power machines
typically advertise two outbound MMIO windows from the guest's physical
memory space to PCI memory space:
  - A 32-bit window which maps onto 2GiB..4GiB in the PCI address space
  - A 64-bit window which maps onto a large region somewhere high in PCI
    address space (traditionally this used an identity mapping from guest
    physical address to PCI address, but that's not always the case)

The qemu implementation in spapr-pci-host-bridge, however, only supports a
single outbound MMIO window, however.  At least some Linux versions expect
the two windows however, so we arranged this window to map onto the PCI
memory space from 2 GiB..~64 GiB, then advertised it as two contiguous
windows, the "32-bit" window from 2G..4G and the "64-bit" window from
4G..~64G.

This approach means, however, that the 64G window is not naturally aligned.
In turn this limits the size of the largest BAR we can map (which does have
to be naturally aligned) to roughly half of the total window.  With some
large nVidia GPGPU cards which have huge memory BARs, this is starting to
be a problem.

This patch adds true support for separate 32-bit and 64-bit outbound MMIO
windows to the spapr-pci-host-bridge implementation, each of which can
be independently configured.  The 32-bit window always maps to 2G.. in PCI
space, but the PCI address of the 64-bit window can be configured (it
defaults to the same as the guest physical address).

So as not to break possible existing configurations, as long as a 64-bit
window is not specified, a large single window can be specified.  This
will appear the same way to the guest as the old approach, although it's
now implemented by two contiguous memory regions rather than a single one.

For now, this only adds the possibility of 64-bit windows.  The default
configuration still uses the legacy mode.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr.c              | 12 ++++++---
 hw/ppc/spapr_pci.c          | 65 ++++++++++++++++++++++++++++++++++++---------
 include/hw/pci-host/spapr.h |  7 +++--
 include/hw/ppc/spapr.h      |  3 ++-
 4 files changed, 69 insertions(+), 18 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 9f3e004..864a48b 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2372,7 +2372,8 @@ static HotpluggableCPUList *spapr_query_hotpluggable_cpus(MachineState *machine)
 
 static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
                                 uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
-                                hwaddr *mmio, hwaddr *mmio_size,
+                                hwaddr *mmio32, hwaddr *mmio32_size,
+                                hwaddr *mmio64, hwaddr *mmio64_size,
                                 unsigned n_dma, uint32_t *liobns, Error **errp)
 {
     const uint64_t base_buid = 0x800000020000000ULL;
@@ -2402,8 +2403,13 @@ static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
     phb_base = phb0_base + index * phb_spacing;
     *pio = phb_base + pio_offset;
     *pio_size = SPAPR_PCI_IO_WIN_SIZE;
-    *mmio = phb_base + mmio_offset;
-    *mmio_size = SPAPR_PCI_MMIO_WIN_SIZE;
+    *mmio32 = phb_base + mmio_offset;
+    *mmio32_size = SPAPR_PCI_MMIO_WIN_SIZE;
+    /*
+     * We don't set the 64-bit MMIO window, relying on the PHB's
+     * fallback behaviour of automatically splitting a large "32-bit"
+     * window into contiguous 32-bit and 64-bit windows
+     */
 }
 
 static void spapr_machine_class_init(ObjectClass *oc, void *data)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index c0fc964..c4579c3 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1317,6 +1317,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
             || (sphb->dma_liobn[1] != (uint32_t)-1 && windows_supported == 2)
             || (sphb->mem_win_addr != (hwaddr)-1)
+            || (sphb->mem64_win_addr != (hwaddr)-1)
             || (sphb->io_win_addr != (hwaddr)-1)) {
             error_setg(errp, "Either \"index\" or other parameters must"
                        " be specified for PAPR PHB, not both");
@@ -1326,6 +1327,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         smc->phb_placement(spapr, sphb->index,
                            &sphb->buid, &sphb->io_win_addr, &sphb->io_win_size,
                            &sphb->mem_win_addr, &sphb->mem_win_size,
+                           &sphb->mem64_win_addr, &sphb->mem64_win_size,
                            windows_supported, sphb->dma_liobn, &local_err);
         if (local_err) {
             error_propagate(errp, local_err);
@@ -1354,6 +1356,37 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         return;
     }
 
+    if (sphb->mem64_win_size != 0) {
+        if (sphb->mem64_win_addr == (hwaddr)-1) {
+            error_setg(errp, "64-bit memory window address not specified for PHB");
+            return;
+        }
+
+        if (sphb->mem_win_size > SPAPR_PCI_MEM32_WIN_SIZE) {
+            error_setg(errp, "32-bit memory window of size 0x%"HWADDR_PRIx" (max 2 GiB)",
+                       sphb->mem_win_size);
+            return;
+        }
+
+        if (sphb->mem64_win_pciaddr == (hwaddr)-1) {
+            /* 64-bit window defaults to identity mapping */
+            sphb->mem64_win_pciaddr = sphb->mem64_win_addr;
+        }
+    } else if (sphb->mem_win_size > SPAPR_PCI_MEM32_WIN_SIZE) {
+        /*
+         * For compatibility with old configuration, if no 64-bit MMIO
+         * window is specified, but the ordinary (32-bit) memory
+         * window is specified as > 2GiB, we treat it as a 2GiB 32-bit
+         * window, with a 64-bit MMIO window following on immediately
+         * afterwards
+         */
+        sphb->mem64_win_size = sphb->mem_win_size - SPAPR_PCI_MEM32_WIN_SIZE;
+        sphb->mem64_win_addr = sphb->mem_win_addr + SPAPR_PCI_MEM32_WIN_SIZE;
+        sphb->mem64_win_pciaddr =
+            SPAPR_PCI_MEM_WIN_BUS_OFFSET + SPAPR_PCI_MEM32_WIN_SIZE;
+        sphb->mem_win_size = SPAPR_PCI_MEM32_WIN_SIZE;
+    }
+
     if (spapr_pci_find_phb(spapr, sphb->buid)) {
         error_setg(errp, "PCI host bridges must have unique BUIDs");
         return;
@@ -1367,12 +1400,19 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     sprintf(namebuf, "%s.mmio", sphb->dtbusname);
     memory_region_init(&sphb->memspace, OBJECT(sphb), namebuf, UINT64_MAX);
 
-    sprintf(namebuf, "%s.mmio-alias", sphb->dtbusname);
-    memory_region_init_alias(&sphb->memwindow, OBJECT(sphb),
+    sprintf(namebuf, "%s.mmio32-alias", sphb->dtbusname);
+    memory_region_init_alias(&sphb->mem32window, OBJECT(sphb),
                              namebuf, &sphb->memspace,
                              SPAPR_PCI_MEM_WIN_BUS_OFFSET, sphb->mem_win_size);
     memory_region_add_subregion(get_system_memory(), sphb->mem_win_addr,
-                                &sphb->memwindow);
+                                &sphb->mem32window);
+
+    sprintf(namebuf, "%s.mmio64-alias", sphb->dtbusname);
+    memory_region_init_alias(&sphb->mem64window, OBJECT(sphb),
+                             namebuf, &sphb->memspace,
+                             sphb->mem64_win_pciaddr, sphb->mem64_win_size);
+    memory_region_add_subregion(get_system_memory(), sphb->mem64_win_addr,
+                                &sphb->mem64window);
 
     /* Initialize IO regions */
     sprintf(namebuf, "%s.io", sphb->dtbusname);
@@ -1526,6 +1566,10 @@ static Property spapr_phb_properties[] = {
     DEFINE_PROP_UINT64("mem_win_addr", sPAPRPHBState, mem_win_addr, -1),
     DEFINE_PROP_UINT64("mem_win_size", sPAPRPHBState, mem_win_size,
                        SPAPR_PCI_MMIO_WIN_SIZE),
+    DEFINE_PROP_UINT64("mem64_win_addr", sPAPRPHBState, mem64_win_addr, -1),
+    DEFINE_PROP_UINT64("mem64_win_size", sPAPRPHBState, mem64_win_size, 0),
+    DEFINE_PROP_UINT64("mem64_win_pciaddr", sPAPRPHBState, mem64_win_pciaddr,
+                       -1),
     DEFINE_PROP_UINT64("io_win_addr", sPAPRPHBState, io_win_addr, -1),
     DEFINE_PROP_UINT64("io_win_size", sPAPRPHBState, io_win_size,
                        SPAPR_PCI_IO_WIN_SIZE),
@@ -1761,10 +1805,6 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     int bus_off, i, j, ret;
     char nodename[FDT_NAME_MAX];
     uint32_t bus_range[] = { cpu_to_be32(0), cpu_to_be32(0xff) };
-    const uint64_t mmiosize = memory_region_size(&phb->memwindow);
-    const uint64_t w32max = (1ULL << 32) - SPAPR_PCI_MEM_WIN_BUS_OFFSET;
-    const uint64_t w32size = MIN(w32max, mmiosize);
-    const uint64_t w64size = (mmiosize > w32size) ? (mmiosize - w32size) : 0;
     struct {
         uint32_t hi;
         uint64_t child;
@@ -1779,15 +1819,16 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
         {
             cpu_to_be32(b_ss(2)), cpu_to_be64(SPAPR_PCI_MEM_WIN_BUS_OFFSET),
             cpu_to_be64(phb->mem_win_addr),
-            cpu_to_be64(w32size),
+            cpu_to_be64(phb->mem_win_size),
         },
         {
-            cpu_to_be32(b_ss(3)), cpu_to_be64(1ULL << 32),
-            cpu_to_be64(phb->mem_win_addr + w32size),
-            cpu_to_be64(w64size)
+            cpu_to_be32(b_ss(3)), cpu_to_be64(phb->mem64_win_pciaddr),
+            cpu_to_be64(phb->mem64_win_addr),
+            cpu_to_be64(phb->mem64_win_size),
         },
     };
-    const unsigned sizeof_ranges = (w64size ? 3 : 2) * sizeof(ranges[0]);
+    const unsigned sizeof_ranges =
+        (phb->mem64_win_size ? 3 : 2) * sizeof(ranges[0]);
     uint64_t bus_reg[] = { cpu_to_be64(phb->buid), 0 };
     uint32_t interrupt_map_mask[] = {
         cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 8c9ebfd..5324d4c 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -53,8 +53,10 @@ struct sPAPRPHBState {
     bool dr_enabled;
 
     MemoryRegion memspace, iospace;
-    hwaddr mem_win_addr, mem_win_size, io_win_addr, io_win_size;
-    MemoryRegion memwindow, iowindow, msiwindow;
+    hwaddr mem_win_addr, mem_win_size, mem64_win_addr, mem64_win_size;
+    uint64_t mem64_win_pciaddr;
+    hwaddr io_win_addr, io_win_size;
+    MemoryRegion mem32window, mem64window, iowindow, msiwindow;
 
     uint32_t dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS];
     hwaddr dma_win_addr, dma_win_size;
@@ -80,6 +82,7 @@ struct sPAPRPHBState {
 };
 
 #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
+#define SPAPR_PCI_MEM32_WIN_SIZE     ((1ULL <<32) - SPAPR_PCI_MEM_WIN_BUS_OFFSET)
 
 #define SPAPR_PCI_MMIO_WIN_SIZE      0xf80000000
 #define SPAPR_PCI_IO_WIN_SIZE        0x10000
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 9e1c5a5..9054034 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -42,7 +42,8 @@ struct sPAPRMachineClass {
     const char *tcg_default_cpu; /* which (TCG) CPU to simulate by default */
     void (*phb_placement)(sPAPRMachineState *spapr, uint32_t index,
                           uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
-                          hwaddr *mmio, hwaddr *mmio_size,
+                          hwaddr *mmio32, hwaddr *mmio32_size,
+                          hwaddr *mmio64, hwaddr *mmio64_size,
                           unsigned n_dma, uint32_t *liobns, Error **errp);
 };
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [RFC 4/4] spapr: Improved placement of PCI host bridges in guest memory map
  2016-10-06  3:03 [Qemu-devel] [RFC 0/4] Improve PCI IO window orgnaization for pseries David Gibson
                   ` (2 preceding siblings ...)
  2016-10-06  3:03 ` [Qemu-devel] [RFC 3/4] spapr_pci: Add a 64-bit MMIO window David Gibson
@ 2016-10-06  3:03 ` David Gibson
  2016-10-10 15:53 ` [Qemu-devel] [RFC 0/4] Improve PCI IO window orgnaization for pseries no-reply
  4 siblings, 0 replies; 20+ messages in thread
From: David Gibson @ 2016-10-06  3:03 UTC (permalink / raw)
  To: qemu-ppc
  Cc: qemu-devel, benh, thuth, lvivier, agraf, mst, aik, mdroth,
	nikunj, bharata, abologna, mpolednik, David Gibson

Currently, the MMIO space for accessing PCI on pseries guests begins at
1 TiB in guest address space.  Each PCI host bridge (PHB) has a 64 GiB
chunk of address space in which it places its outbound PIO and 32-bit and
64-bit MMIO windows.

This scheme as several problems:
  - It limits guest RAM to 1 TiB (though we have a limited fix for this
    now)
  - It limits the total MMIO window to 64 GiB.  This is not always enough
    for some of the large nVidia GPGPU cards
  - Putting all the windows into a single 64 GiB area means that naturally
    aligning things within there will waste more address space.
In addition there was a miscalculation in some of the defaults, which meant
that the MMIO windows for each PHB actually slightly overran the 64 GiB
region for that PHB.  We got away without nasty consequences because
the overrun fit within an unused area at the beginning of the next PHB's
region, but it's not pretty.

This patch implements a new scheme which addresses those problems, and is
also closer to what bare metal hardware and pHyp guests generally use.

Because some guest versions (including most current distro kernels) can't
access PCI MMIO above 64 TiB, we put all the PCI windows between 32 TiB and
64 TiB.  This is broken into 1 TiB chunks.  The 1 TiB contains the PIO
(64 kiB) and 32-bit MMIO (2 GiB) windows for all of the PHBs.  Each
subsequent TiB chunk contains a naturally aligned 64-bit MMIO window for
one PHB each.

This reduces the number of allowed PHBs (without full manual configuration
of all the windows) from 256 to 31, but this should still be plenty in
practice.

We also change some of the default window sizes for manually configured
PHBs to saner values.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr.c              | 118 +++++++++++++++++++++++++++++++++++---------
 hw/ppc/spapr_pci.c          |   5 +-
 include/hw/pci-host/spapr.h |   8 ++-
 3 files changed, 106 insertions(+), 25 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 864a48b..d842306 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2376,22 +2376,42 @@ static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
                                 hwaddr *mmio64, hwaddr *mmio64_size,
                                 unsigned n_dma, uint32_t *liobns, Error **errp)
 {
+    /*
+     * New-style PHB window placement.
+     *
+     * Goals: Gives large (1TiB), naturally aligned 64-bit MMIO window
+     * for each PHB, in addition to 2GiB 32-bit MMIO and 64kiB PIO
+     * windows.
+     *
+     * Some guest kernels can't work with MMIO windows above 1<<46
+     * (64TiB), so we place up to 31 PHBs in the area 32TiB..64TiB
+     *
+     * 32TiB..33TiB contains the PIO and 32-bit MMIO windows for all
+     * PHBs.  33..34TiB has the 64-bit MMIO window for PHB0, 34..35
+     * has the 64-bit window for PHB1 and so forth.
+     */
     const uint64_t base_buid = 0x800000020000000ULL;
-    const hwaddr phb_spacing = 0x1000000000ULL; /* 64 GiB */
-    const hwaddr mmio_offset = 0xa0000000; /* 2 GiB + 512 MiB */
-    const hwaddr pio_offset = 0x80000000; /* 2 GiB */
-    const uint32_t max_index = 255;
-    const hwaddr phb0_alignment = 0x10000000000ULL; /* 1 TiB */
+    const hwaddr mmio64_win_size = (1ULL << 40); /* 1 TiB */
 
-    uint64_t max_hotplug_addr = spapr->hotplug_memory.base +
-        memory_region_size(&spapr->hotplug_memory.mr);
-    hwaddr phb0_base = QEMU_ALIGN_UP(max_hotplug_addr, phb0_alignment);
-    hwaddr phb_base;
+    int max_phbs = (SPAPR_PCI_LIMIT - SPAPR_PCI_BASE) / mmio64_win_size - 1;
+    hwaddr mmio32_base = SPAPR_PCI_BASE + SPAPR_PCI_MEM32_WIN_SIZE;
+    hwaddr mmio64_base = SPAPR_PCI_BASE + mmio64_win_size;
     int i;
 
-    if (index > max_index) {
+    /* Sanity check natural alignments */
+    assert((SPAPR_PCI_BASE % mmio64_win_size) == 0);
+    assert((SPAPR_PCI_LIMIT % mmio64_win_size) == 0);
+    assert((mmio64_win_size % SPAPR_PCI_MEM32_WIN_SIZE) == 0);
+    assert((SPAPR_PCI_MEM32_WIN_SIZE % SPAPR_PCI_IO_WIN_SIZE) == 0);
+    /* Sanity check bounds */
+    assert((SPAPR_PCI_BASE + max_phbs * SPAPR_PCI_IO_WIN_SIZE)
+           <= mmio32_base);
+    assert(mmio32_base + max_phbs * SPAPR_PCI_MEM32_WIN_SIZE
+           <= mmio64_base);
+
+    if (index >= max_phbs) {
         error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
-                   max_index);
+                   max_phbs - 1);
         return;
     }
 
@@ -2400,16 +2420,14 @@ static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
         liobns[i] = SPAPR_PCI_LIOBN(index, i);
     }
 
-    phb_base = phb0_base + index * phb_spacing;
-    *pio = phb_base + pio_offset;
+    *pio = SPAPR_PCI_BASE + index * SPAPR_PCI_IO_WIN_SIZE;
     *pio_size = SPAPR_PCI_IO_WIN_SIZE;
-    *mmio32 = phb_base + mmio_offset;
-    *mmio32_size = SPAPR_PCI_MMIO_WIN_SIZE;
-    /*
-     * We don't set the 64-bit MMIO window, relying on the PHB's
-     * fallback behaviour of automatically splitting a large "32-bit"
-     * window into contiguous 32-bit and 64-bit windows
-     */
+
+    *mmio32 = mmio32_base + index * SPAPR_PCI_MEM32_WIN_SIZE;
+    *mmio32_size = SPAPR_PCI_MEM32_WIN_SIZE;
+
+    *mmio64 = mmio64_base + index * mmio64_win_size;
+    *mmio64_size = mmio64_win_size;
 }
 
 static void spapr_machine_class_init(ObjectClass *oc, void *data)
@@ -2513,8 +2531,63 @@ DEFINE_SPAPR_MACHINE(2_8, "2.8", true);
 /*
  * pseries-2.7
  */
-#define SPAPR_COMPAT_2_7 \
-    HW_COMPAT_2_7 \
+#define SPAPR_COMPAT_2_7                            \
+    HW_COMPAT_2_7                                   \
+    {                                               \
+        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,     \
+        .property = "mem_win_size",                 \
+        .value    = stringify(SPAPR_PCI_2_7_MMIO_WIN_SIZE),\
+    },                                              \
+    {                                               \
+        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,     \
+        .property = "mem64_win_size",               \
+        .value    = "0",                            \
+    },
+
+        
+
+static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
+                              uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
+                              hwaddr *mmio32, hwaddr *mmio32_size,
+                              hwaddr *mmio64, hwaddr *mmio64_size,
+                              unsigned n_dma, uint32_t *liobns, Error **errp)
+{
+    /* Legacy PHB placement for pseries-2.7 and earlier machine types */
+    const uint64_t base_buid = 0x800000020000000ULL;
+    const hwaddr phb_spacing = 0x1000000000ULL; /* 64 GiB */
+    const hwaddr mmio_offset = 0xa0000000; /* 2 GiB + 512 MiB */
+    const hwaddr pio_offset = 0x80000000; /* 2 GiB */
+    const uint32_t max_index = 255;
+    const hwaddr phb0_alignment = 0x10000000000ULL; /* 1 TiB */
+
+    uint64_t max_hotplug_addr = spapr->hotplug_memory.base +
+        memory_region_size(&spapr->hotplug_memory.mr);
+    hwaddr phb0_base = QEMU_ALIGN_UP(max_hotplug_addr, phb0_alignment);
+    hwaddr phb_base;
+    int i;
+
+    if (index > max_index) {
+        error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
+                   max_index);
+        return;
+    }
+
+    *buid = base_buid + index;
+    for (i = 0; i < n_dma; ++i) {
+        liobns[i] = SPAPR_PCI_LIOBN(index, i);
+    }
+
+    phb_base = phb0_base + index * phb_spacing;
+    *pio = phb_base + pio_offset;
+    *pio_size = SPAPR_PCI_IO_WIN_SIZE;
+    *mmio32 = phb_base + mmio_offset;
+    *mmio32_size = SPAPR_PCI_2_7_MMIO_WIN_SIZE;
+    /*
+     * We don't set the 64-bit MMIO window, relying on the PHB's
+     * fallback behaviour of automatically splitting a large "32-bit"
+     * window into contiguous 32-bit and 64-bit windows
+     */
+}
 
 static void spapr_machine_2_7_instance_options(MachineState *machine)
 {
@@ -2527,6 +2600,7 @@ static void spapr_machine_2_7_class_options(MachineClass *mc)
     spapr_machine_2_8_class_options(mc);
     smc->tcg_default_cpu = "POWER7";
     SET_MACHINE_COMPAT(mc, SPAPR_COMPAT_2_7);
+    smc->phb_placement = phb_placement_2_7;
 }
 
 DEFINE_SPAPR_MACHINE(2_7, "2.7", false);
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index c4579c3..2cd9e6e 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1565,9 +1565,10 @@ static Property spapr_phb_properties[] = {
     DEFINE_PROP_UINT32("liobn64", sPAPRPHBState, dma_liobn[1], -1),
     DEFINE_PROP_UINT64("mem_win_addr", sPAPRPHBState, mem_win_addr, -1),
     DEFINE_PROP_UINT64("mem_win_size", sPAPRPHBState, mem_win_size,
-                       SPAPR_PCI_MMIO_WIN_SIZE),
+                       SPAPR_PCI_MEM32_WIN_SIZE),
     DEFINE_PROP_UINT64("mem64_win_addr", sPAPRPHBState, mem64_win_addr, -1),
-    DEFINE_PROP_UINT64("mem64_win_size", sPAPRPHBState, mem64_win_size, 0),
+    DEFINE_PROP_UINT64("mem64_win_size", sPAPRPHBState, mem64_win_size,
+                       SPAPR_PCI_MEM64_WIN_SIZE),
     DEFINE_PROP_UINT64("mem64_win_pciaddr", sPAPRPHBState, mem64_win_pciaddr,
                        -1),
     DEFINE_PROP_UINT64("io_win_addr", sPAPRPHBState, io_win_addr, -1),
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 5324d4c..239082b 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -83,8 +83,14 @@ struct sPAPRPHBState {
 
 #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
 #define SPAPR_PCI_MEM32_WIN_SIZE     ((1ULL <<32) - SPAPR_PCI_MEM_WIN_BUS_OFFSET)
+#define SPAPR_PCI_MEM64_WIN_SIZE     0x10000000000ULL /* 1 TiB */
 
-#define SPAPR_PCI_MMIO_WIN_SIZE      0xf80000000
+/* Without manual configuration, all PCI outbound windows will be
+ * within this range */
+#define SPAPR_PCI_BASE               (1ULL << 45) /* 32 TiB */
+#define SPAPR_PCI_LIMIT              (1ULL << 46) /* 64 TiB */
+
+#define SPAPR_PCI_2_7_MMIO_WIN_SIZE  0xf80000000
 #define SPAPR_PCI_IO_WIN_SIZE        0x10000
 
 #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC 1/4] spapr_pci: Delegate placement of PCI host bridges to machine type
  2016-10-06  3:03 ` [Qemu-devel] [RFC 1/4] spapr_pci: Delegate placement of PCI host bridges to machine type David Gibson
@ 2016-10-06  7:10   ` Laurent Vivier
  2016-10-06  8:11     ` David Gibson
  2016-10-06  9:36   ` Laurent Vivier
  2016-10-07  3:57   ` Alexey Kardashevskiy
  2 siblings, 1 reply; 20+ messages in thread
From: Laurent Vivier @ 2016-10-06  7:10 UTC (permalink / raw)
  To: David Gibson, qemu-ppc
  Cc: qemu-devel, benh, thuth, agraf, mst, aik, mdroth, nikunj,
	bharata, abologna, mpolednik



On 06/10/2016 05:03, David Gibson wrote:
> The 'spapr-pci-host-bridge' represents the virtual PCI host bridge (PHB)
> for a PAPR guest.  Unlike on x86, it's routine on Power (both bare metal
> and PAPR guests) to have numerous independent PHBs, each controlling a
> separate PCI domain.
> 
> There are two ways of configuring the spapr-pci-host-bridge device: first
> it can be done fully manually, specifying the locations and sizes of all
> the IO windows.  This gives the most control, but is very awkward with 6
> mandatory parameters.  Alternatively just an "index" can be specified
> which essentially selects from an array of predefined PHB locations.
> The PHB at index 0 is automatically created as the default PHB.
> 
> The current set of default locations causes some problems for guests with
> large RAM (> 1 TiB) or PCI devices with very large BARs (e.g. big nVidia
> GPGPU cards via VFIO).  Obviously, for migration we can only change the
> locations on a new machine type, however.
> 
> This is awkward, because the placement is currently decided within the
> spapr-pci-host-bridge code, so it breaks abstraction to look inside the
> machine type version.
> 
> So, this patch delegates the "default mode" PHB placement from the
> spapr-pci-host-bridge device back to the machine type via a public method
> in sPAPRMachineClass.  It's still a bit ugly, but it's about the best we
> can do.
> 
> For now, this just changes where the calculation is done.  It doesn't
> change the actual location of the host bridges, or any other behaviour.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/ppc/spapr.c              | 34 ++++++++++++++++++++++++++++++++++
>  hw/ppc/spapr_pci.c          | 22 ++++++++--------------
>  include/hw/pci-host/spapr.h | 11 +----------
>  include/hw/ppc/spapr.h      |  4 ++++
>  4 files changed, 47 insertions(+), 24 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 03e3803..f6e9c2a 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -2370,6 +2370,39 @@ static HotpluggableCPUList *spapr_query_hotpluggable_cpus(MachineState *machine)
>      return head;
>  }
>  
> +static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
> +                                uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
> +                                hwaddr *mmio, hwaddr *mmio_size,
> +                                unsigned n_dma, uint32_t *liobns, Error **errp)
> +{
> +    const uint64_t base_buid = 0x800000020000000ULL;
> +    const hwaddr phb0_base = 0x10000000000ULL; /* 1 TiB */
> +    const hwaddr phb_spacing = 0x1000000000ULL; /* 64 GiB */
> +    const hwaddr mmio_offset = 0xa0000000; /* 2 GiB + 512 MiB */
> +    const hwaddr pio_offset = 0x80000000; /* 2 GiB */
> +    const uint32_t max_index = 255;

Why do you use new "const" instead of already defined "#define" from
spapr.h?

Thanks,
Laurent

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC 2/4] spapr: Adjust placement of PCI host bridge to allow > 1TiB RAM
  2016-10-06  3:03 ` [Qemu-devel] [RFC 2/4] spapr: Adjust placement of PCI host bridge to allow > 1TiB RAM David Gibson
@ 2016-10-06  7:21   ` Laurent Vivier
  2016-10-06  8:46     ` David Gibson
  2016-10-06  9:36   ` Laurent Vivier
  1 sibling, 1 reply; 20+ messages in thread
From: Laurent Vivier @ 2016-10-06  7:21 UTC (permalink / raw)
  To: David Gibson, qemu-ppc
  Cc: qemu-devel, benh, thuth, agraf, mst, aik, mdroth, nikunj,
	bharata, abologna, mpolednik



On 06/10/2016 05:03, David Gibson wrote:
> Currently the default PCI host bridge for the 'pseries' machine type is
> constructed with its IO windows in the 1TiB..(1TiB + 64GiB) range in
> guest memory space.  This means that if > 1TiB of guest RAM is specified,
> the RAM will collide with the PCI IO windows, causing serious problems.
> 
> Problems won't be obvious until guest RAM goes a bit beyond 1TiB, because
> there's a little unused space at the bottom of the area reserved for PCI,
> but essentially this means that > 1TiB of RAM has never worked with the
> pseries machine type.
> 
> This patch fixes this by altering the placement of PHBs on large-RAM VMs.
> Instead of always placing the first PHB at 1TiB, it is placed at the next
> 1 TiB boundary after the maximum RAM address.
> 
> Technically, this changes behaviour in a migration-breaking way for
> existing machines with > 1TiB maximum memory, but since having > 1 TiB
> memory was broken anyway, this seems like a reasonable trade-off.

Perhaps you can add an SPAPR_COMPAT_XX property with PHB0 base to not
break compatibility? I think spapr without PCI card (only VIO, for
instance), should work with 1TiB+.

On another side, how the SPAPR kernel manages memory?
Is it possible to add an hole in RAM between 1TiB and 1TiB+64GiB to
allow the kernel to register the I/O space?

Laurent

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC 1/4] spapr_pci: Delegate placement of PCI host bridges to machine type
  2016-10-06  7:10   ` Laurent Vivier
@ 2016-10-06  8:11     ` David Gibson
  0 siblings, 0 replies; 20+ messages in thread
From: David Gibson @ 2016-10-06  8:11 UTC (permalink / raw)
  To: Laurent Vivier
  Cc: qemu-ppc, qemu-devel, benh, thuth, agraf, mst, aik, mdroth,
	nikunj, bharata, abologna, mpolednik

[-- Attachment #1: Type: text/plain, Size: 3528 bytes --]

On Thu, Oct 06, 2016 at 09:10:41AM +0200, Laurent Vivier wrote:
> 
> 
> On 06/10/2016 05:03, David Gibson wrote:
> > The 'spapr-pci-host-bridge' represents the virtual PCI host bridge (PHB)
> > for a PAPR guest.  Unlike on x86, it's routine on Power (both bare metal
> > and PAPR guests) to have numerous independent PHBs, each controlling a
> > separate PCI domain.
> > 
> > There are two ways of configuring the spapr-pci-host-bridge device: first
> > it can be done fully manually, specifying the locations and sizes of all
> > the IO windows.  This gives the most control, but is very awkward with 6
> > mandatory parameters.  Alternatively just an "index" can be specified
> > which essentially selects from an array of predefined PHB locations.
> > The PHB at index 0 is automatically created as the default PHB.
> > 
> > The current set of default locations causes some problems for guests with
> > large RAM (> 1 TiB) or PCI devices with very large BARs (e.g. big nVidia
> > GPGPU cards via VFIO).  Obviously, for migration we can only change the
> > locations on a new machine type, however.
> > 
> > This is awkward, because the placement is currently decided within the
> > spapr-pci-host-bridge code, so it breaks abstraction to look inside the
> > machine type version.
> > 
> > So, this patch delegates the "default mode" PHB placement from the
> > spapr-pci-host-bridge device back to the machine type via a public method
> > in sPAPRMachineClass.  It's still a bit ugly, but it's about the best we
> > can do.
> > 
> > For now, this just changes where the calculation is done.  It doesn't
> > change the actual location of the host bridges, or any other behaviour.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  hw/ppc/spapr.c              | 34 ++++++++++++++++++++++++++++++++++
> >  hw/ppc/spapr_pci.c          | 22 ++++++++--------------
> >  include/hw/pci-host/spapr.h | 11 +----------
> >  include/hw/ppc/spapr.h      |  4 ++++
> >  4 files changed, 47 insertions(+), 24 deletions(-)
> > 
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index 03e3803..f6e9c2a 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -2370,6 +2370,39 @@ static HotpluggableCPUList *spapr_query_hotpluggable_cpus(MachineState *machine)
> >      return head;
> >  }
> >  
> > +static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
> > +                                uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
> > +                                hwaddr *mmio, hwaddr *mmio_size,
> > +                                unsigned n_dma, uint32_t *liobns, Error **errp)
> > +{
> > +    const uint64_t base_buid = 0x800000020000000ULL;
> > +    const hwaddr phb0_base = 0x10000000000ULL; /* 1 TiB */
> > +    const hwaddr phb_spacing = 0x1000000000ULL; /* 64 GiB */
> > +    const hwaddr mmio_offset = 0xa0000000; /* 2 GiB + 512 MiB */
> > +    const hwaddr pio_offset = 0x80000000; /* 2 GiB */
> > +    const uint32_t max_index = 255;
> 
> Why do you use new "const" instead of already defined "#define" from
> spapr.h?

Because the point of this change is that all the knowledge about where
the PHBis placed is supposed to be restrictedto this function.  Making
these local consts enforces that.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC 2/4] spapr: Adjust placement of PCI host bridge to allow > 1TiB RAM
  2016-10-06  7:21   ` Laurent Vivier
@ 2016-10-06  8:46     ` David Gibson
  0 siblings, 0 replies; 20+ messages in thread
From: David Gibson @ 2016-10-06  8:46 UTC (permalink / raw)
  To: Laurent Vivier
  Cc: qemu-ppc, qemu-devel, benh, thuth, agraf, mst, aik, mdroth,
	nikunj, bharata, abologna, mpolednik

[-- Attachment #1: Type: text/plain, Size: 1859 bytes --]

On Thu, Oct 06, 2016 at 09:21:56AM +0200, Laurent Vivier wrote:
> 
> 
> On 06/10/2016 05:03, David Gibson wrote:
> > Currently the default PCI host bridge for the 'pseries' machine type is
> > constructed with its IO windows in the 1TiB..(1TiB + 64GiB) range in
> > guest memory space.  This means that if > 1TiB of guest RAM is specified,
> > the RAM will collide with the PCI IO windows, causing serious problems.
> > 
> > Problems won't be obvious until guest RAM goes a bit beyond 1TiB, because
> > there's a little unused space at the bottom of the area reserved for PCI,
> > but essentially this means that > 1TiB of RAM has never worked with the
> > pseries machine type.
> > 
> > This patch fixes this by altering the placement of PHBs on large-RAM VMs.
> > Instead of always placing the first PHB at 1TiB, it is placed at the next
> > 1 TiB boundary after the maximum RAM address.
> > 
> > Technically, this changes behaviour in a migration-breaking way for
> > existing machines with > 1TiB maximum memory, but since having > 1 TiB
> > memory was broken anyway, this seems like a reasonable trade-off.
> 
> Perhaps you can add an SPAPR_COMPAT_XX property with PHB0 base to not
> break compatibility? I think spapr without PCI card (only VIO, for
> instance), should work with 1TiB+.

No, it won't because we always create the default PHB even if we don't
actually have any PCI devices.

> On another side, how the SPAPR kernel manages memory?
> Is it possible to add an hole in RAM between 1TiB and 1TiB+64GiB to
> allow the kernel to register the I/O space?

Possible, yes, but I don't really see any advantage to it.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC 1/4] spapr_pci: Delegate placement of PCI host bridges to machine type
  2016-10-06  3:03 ` [Qemu-devel] [RFC 1/4] spapr_pci: Delegate placement of PCI host bridges to machine type David Gibson
  2016-10-06  7:10   ` Laurent Vivier
@ 2016-10-06  9:36   ` Laurent Vivier
  2016-10-06 23:51     ` David Gibson
  2016-10-07  3:57   ` Alexey Kardashevskiy
  2 siblings, 1 reply; 20+ messages in thread
From: Laurent Vivier @ 2016-10-06  9:36 UTC (permalink / raw)
  To: David Gibson, qemu-ppc
  Cc: qemu-devel, benh, thuth, agraf, mst, aik, mdroth, nikunj,
	bharata, abologna, mpolednik



On 06/10/2016 05:03, David Gibson wrote:
> The 'spapr-pci-host-bridge' represents the virtual PCI host bridge (PHB)
> for a PAPR guest.  Unlike on x86, it's routine on Power (both bare metal
> and PAPR guests) to have numerous independent PHBs, each controlling a
> separate PCI domain.
> 
> There are two ways of configuring the spapr-pci-host-bridge device: first
> it can be done fully manually, specifying the locations and sizes of all
> the IO windows.  This gives the most control, but is very awkward with 6
> mandatory parameters.  Alternatively just an "index" can be specified
> which essentially selects from an array of predefined PHB locations.
> The PHB at index 0 is automatically created as the default PHB.
> 
> The current set of default locations causes some problems for guests with
> large RAM (> 1 TiB) or PCI devices with very large BARs (e.g. big nVidia
> GPGPU cards via VFIO).  Obviously, for migration we can only change the
> locations on a new machine type, however.
> 
> This is awkward, because the placement is currently decided within the
> spapr-pci-host-bridge code, so it breaks abstraction to look inside the
> machine type version.
> 
> So, this patch delegates the "default mode" PHB placement from the
> spapr-pci-host-bridge device back to the machine type via a public method
> in sPAPRMachineClass.  It's still a bit ugly, but it's about the best we
> can do.
> 
> For now, this just changes where the calculation is done.  It doesn't
> change the actual location of the host bridges, or any other behaviour.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/ppc/spapr.c              | 34 ++++++++++++++++++++++++++++++++++
>  hw/ppc/spapr_pci.c          | 22 ++++++++--------------
>  include/hw/pci-host/spapr.h | 11 +----------
>  include/hw/ppc/spapr.h      |  4 ++++
>  4 files changed, 47 insertions(+), 24 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 03e3803..f6e9c2a 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -2370,6 +2370,39 @@ static HotpluggableCPUList *spapr_query_hotpluggable_cpus(MachineState *machine)
>      return head;
>  }
>  
> +static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
> +                                uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
> +                                hwaddr *mmio, hwaddr *mmio_size,
> +                                unsigned n_dma, uint32_t *liobns, Error **errp)
> +{
> +    const uint64_t base_buid = 0x800000020000000ULL;
> +    const hwaddr phb0_base = 0x10000000000ULL; /* 1 TiB */
> +    const hwaddr phb_spacing = 0x1000000000ULL; /* 64 GiB */
> +    const hwaddr mmio_offset = 0xa0000000; /* 2 GiB + 512 MiB */
> +    const hwaddr pio_offset = 0x80000000; /* 2 GiB */
> +    const uint32_t max_index = 255;
> +
> +    hwaddr phb_base;
> +    int i;
> +
> +    if (index > max_index) {
> +        error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
> +                   max_index);
> +        return;
> +    }
> +
> +    *buid = base_buid + index;
> +    for (i = 0; i < n_dma; ++i) {
> +        liobns[i] = SPAPR_PCI_LIOBN(index, i);
> +    }
> +
> +    phb_base = phb0_base + index * phb_spacing;
> +    *pio = phb_base + pio_offset;
> +    *pio_size = SPAPR_PCI_IO_WIN_SIZE;
> +    *mmio = phb_base + mmio_offset;
> +    *mmio_size = SPAPR_PCI_MMIO_WIN_SIZE;
> +}
> +
>  static void spapr_machine_class_init(ObjectClass *oc, void *data)
>  {
>      MachineClass *mc = MACHINE_CLASS(oc);
> @@ -2406,6 +2439,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
>      mc->query_hotpluggable_cpus = spapr_query_hotpluggable_cpus;
>      fwc->get_dev_path = spapr_get_fw_dev_path;
>      nc->nmi_monitor_handler = spapr_nmi;
> +    smc->phb_placement = spapr_phb_placement;
>  }
>  
>  static const TypeInfo spapr_machine_info = {
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 4f00865..c0fc964 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -1311,7 +1311,8 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>          sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
>  
>      if (sphb->index != (uint32_t)-1) {
> -        hwaddr windows_base;
> +        sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
> +        Error *local_err = NULL;
>  
>          if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
>              || (sphb->dma_liobn[1] != (uint32_t)-1 && windows_supported == 2)
> @@ -1322,21 +1323,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>              return;
>          }
>  
> -        if (sphb->index > SPAPR_PCI_MAX_INDEX) {
> -            error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
> -                       SPAPR_PCI_MAX_INDEX);
> +        smc->phb_placement(spapr, sphb->index,
> +                           &sphb->buid, &sphb->io_win_addr, &sphb->io_win_size,
> +                           &sphb->mem_win_addr, &sphb->mem_win_size,
> +                           windows_supported, sphb->dma_liobn, &local_err);


Why don't you pass "sphb" instead of "&sphb->buid, &sphb->io_win_addr,
&sphb->io_win_size, &sphb->mem_win_addr, &sphb->mem_win_size,
sphb->dma_liobn"?

Something like:

    smc->phb_placement(spapr, sphb, windows_supported, &local_err);

Thanks,
Laurent

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC 2/4] spapr: Adjust placement of PCI host bridge to allow > 1TiB RAM
  2016-10-06  3:03 ` [Qemu-devel] [RFC 2/4] spapr: Adjust placement of PCI host bridge to allow > 1TiB RAM David Gibson
  2016-10-06  7:21   ` Laurent Vivier
@ 2016-10-06  9:36   ` Laurent Vivier
  1 sibling, 0 replies; 20+ messages in thread
From: Laurent Vivier @ 2016-10-06  9:36 UTC (permalink / raw)
  To: David Gibson, qemu-ppc
  Cc: qemu-devel, benh, thuth, agraf, mst, aik, mdroth, nikunj,
	bharata, abologna, mpolednik



On 06/10/2016 05:03, David Gibson wrote:
> Currently the default PCI host bridge for the 'pseries' machine type is
> constructed with its IO windows in the 1TiB..(1TiB + 64GiB) range in
> guest memory space.  This means that if > 1TiB of guest RAM is specified,
> the RAM will collide with the PCI IO windows, causing serious problems.
> 
> Problems won't be obvious until guest RAM goes a bit beyond 1TiB, because
> there's a little unused space at the bottom of the area reserved for PCI,
> but essentially this means that > 1TiB of RAM has never worked with the
> pseries machine type.
> 
> This patch fixes this by altering the placement of PHBs on large-RAM VMs.
> Instead of always placing the first PHB at 1TiB, it is placed at the next
> 1 TiB boundary after the maximum RAM address.
> 
> Technically, this changes behaviour in a migration-breaking way for
> existing machines with > 1TiB maximum memory, but since having > 1 TiB
> memory was broken anyway, this seems like a reasonable trade-off.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

Reviewed-by: Laurent Vivier <lvivier@redhat.com>

> ---
>  hw/ppc/spapr.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index f6e9c2a..9f3e004 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -2376,12 +2376,15 @@ static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
>                                  unsigned n_dma, uint32_t *liobns, Error **errp)
>  {
>      const uint64_t base_buid = 0x800000020000000ULL;
> -    const hwaddr phb0_base = 0x10000000000ULL; /* 1 TiB */
>      const hwaddr phb_spacing = 0x1000000000ULL; /* 64 GiB */
>      const hwaddr mmio_offset = 0xa0000000; /* 2 GiB + 512 MiB */
>      const hwaddr pio_offset = 0x80000000; /* 2 GiB */
>      const uint32_t max_index = 255;
> +    const hwaddr phb0_alignment = 0x10000000000ULL; /* 1 TiB */
>  
> +    uint64_t max_hotplug_addr = spapr->hotplug_memory.base +
> +        memory_region_size(&spapr->hotplug_memory.mr);
> +    hwaddr phb0_base = QEMU_ALIGN_UP(max_hotplug_addr, phb0_alignment);
>      hwaddr phb_base;
>      int i;
>  
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC 1/4] spapr_pci: Delegate placement of PCI host bridges to machine type
  2016-10-06  9:36   ` Laurent Vivier
@ 2016-10-06 23:51     ` David Gibson
  0 siblings, 0 replies; 20+ messages in thread
From: David Gibson @ 2016-10-06 23:51 UTC (permalink / raw)
  To: Laurent Vivier
  Cc: qemu-ppc, qemu-devel, benh, thuth, agraf, mst, aik, mdroth,
	nikunj, bharata, abologna, mpolednik

[-- Attachment #1: Type: text/plain, Size: 6152 bytes --]

On Thu, Oct 06, 2016 at 11:36:12AM +0200, Laurent Vivier wrote:
> 
> 
> On 06/10/2016 05:03, David Gibson wrote:
> > The 'spapr-pci-host-bridge' represents the virtual PCI host bridge (PHB)
> > for a PAPR guest.  Unlike on x86, it's routine on Power (both bare metal
> > and PAPR guests) to have numerous independent PHBs, each controlling a
> > separate PCI domain.
> > 
> > There are two ways of configuring the spapr-pci-host-bridge device: first
> > it can be done fully manually, specifying the locations and sizes of all
> > the IO windows.  This gives the most control, but is very awkward with 6
> > mandatory parameters.  Alternatively just an "index" can be specified
> > which essentially selects from an array of predefined PHB locations.
> > The PHB at index 0 is automatically created as the default PHB.
> > 
> > The current set of default locations causes some problems for guests with
> > large RAM (> 1 TiB) or PCI devices with very large BARs (e.g. big nVidia
> > GPGPU cards via VFIO).  Obviously, for migration we can only change the
> > locations on a new machine type, however.
> > 
> > This is awkward, because the placement is currently decided within the
> > spapr-pci-host-bridge code, so it breaks abstraction to look inside the
> > machine type version.
> > 
> > So, this patch delegates the "default mode" PHB placement from the
> > spapr-pci-host-bridge device back to the machine type via a public method
> > in sPAPRMachineClass.  It's still a bit ugly, but it's about the best we
> > can do.
> > 
> > For now, this just changes where the calculation is done.  It doesn't
> > change the actual location of the host bridges, or any other behaviour.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  hw/ppc/spapr.c              | 34 ++++++++++++++++++++++++++++++++++
> >  hw/ppc/spapr_pci.c          | 22 ++++++++--------------
> >  include/hw/pci-host/spapr.h | 11 +----------
> >  include/hw/ppc/spapr.h      |  4 ++++
> >  4 files changed, 47 insertions(+), 24 deletions(-)
> > 
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index 03e3803..f6e9c2a 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -2370,6 +2370,39 @@ static HotpluggableCPUList *spapr_query_hotpluggable_cpus(MachineState *machine)
> >      return head;
> >  }
> >  
> > +static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
> > +                                uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
> > +                                hwaddr *mmio, hwaddr *mmio_size,
> > +                                unsigned n_dma, uint32_t *liobns, Error **errp)
> > +{
> > +    const uint64_t base_buid = 0x800000020000000ULL;
> > +    const hwaddr phb0_base = 0x10000000000ULL; /* 1 TiB */
> > +    const hwaddr phb_spacing = 0x1000000000ULL; /* 64 GiB */
> > +    const hwaddr mmio_offset = 0xa0000000; /* 2 GiB + 512 MiB */
> > +    const hwaddr pio_offset = 0x80000000; /* 2 GiB */
> > +    const uint32_t max_index = 255;
> > +
> > +    hwaddr phb_base;
> > +    int i;
> > +
> > +    if (index > max_index) {
> > +        error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
> > +                   max_index);
> > +        return;
> > +    }
> > +
> > +    *buid = base_buid + index;
> > +    for (i = 0; i < n_dma; ++i) {
> > +        liobns[i] = SPAPR_PCI_LIOBN(index, i);
> > +    }
> > +
> > +    phb_base = phb0_base + index * phb_spacing;
> > +    *pio = phb_base + pio_offset;
> > +    *pio_size = SPAPR_PCI_IO_WIN_SIZE;
> > +    *mmio = phb_base + mmio_offset;
> > +    *mmio_size = SPAPR_PCI_MMIO_WIN_SIZE;
> > +}
> > +
> >  static void spapr_machine_class_init(ObjectClass *oc, void *data)
> >  {
> >      MachineClass *mc = MACHINE_CLASS(oc);
> > @@ -2406,6 +2439,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
> >      mc->query_hotpluggable_cpus = spapr_query_hotpluggable_cpus;
> >      fwc->get_dev_path = spapr_get_fw_dev_path;
> >      nc->nmi_monitor_handler = spapr_nmi;
> > +    smc->phb_placement = spapr_phb_placement;
> >  }
> >  
> >  static const TypeInfo spapr_machine_info = {
> > diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> > index 4f00865..c0fc964 100644
> > --- a/hw/ppc/spapr_pci.c
> > +++ b/hw/ppc/spapr_pci.c
> > @@ -1311,7 +1311,8 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >          sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
> >  
> >      if (sphb->index != (uint32_t)-1) {
> > -        hwaddr windows_base;
> > +        sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
> > +        Error *local_err = NULL;
> >  
> >          if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
> >              || (sphb->dma_liobn[1] != (uint32_t)-1 && windows_supported == 2)
> > @@ -1322,21 +1323,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >              return;
> >          }
> >  
> > -        if (sphb->index > SPAPR_PCI_MAX_INDEX) {
> > -            error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
> > -                       SPAPR_PCI_MAX_INDEX);
> > +        smc->phb_placement(spapr, sphb->index,
> > +                           &sphb->buid, &sphb->io_win_addr, &sphb->io_win_size,
> > +                           &sphb->mem_win_addr, &sphb->mem_win_size,
> > +                           windows_supported, sphb->dma_liobn, &local_err);
> 
> 
> Why don't you pass "sphb" instead of "&sphb->buid, &sphb->io_win_addr,
> &sphb->io_win_size, &sphb->mem_win_addr, &sphb->mem_win_size,
> sphb->dma_liobn"?
> 
> Something like:
> 
>     smc->phb_placement(spapr, sphb, windows_supported, &local_err);

I thought about that, but I wasn't sure if it was safe for something
external to go setting properties on the PHB object while it was in
the middle of realize().

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC 1/4] spapr_pci: Delegate placement of PCI host bridges to machine type
  2016-10-06  3:03 ` [Qemu-devel] [RFC 1/4] spapr_pci: Delegate placement of PCI host bridges to machine type David Gibson
  2016-10-06  7:10   ` Laurent Vivier
  2016-10-06  9:36   ` Laurent Vivier
@ 2016-10-07  3:57   ` Alexey Kardashevskiy
  2016-10-07  5:10     ` David Gibson
  2 siblings, 1 reply; 20+ messages in thread
From: Alexey Kardashevskiy @ 2016-10-07  3:57 UTC (permalink / raw)
  To: David Gibson, qemu-ppc
  Cc: qemu-devel, benh, thuth, lvivier, agraf, mst, mdroth, nikunj,
	bharata, abologna, mpolednik

On 06/10/16 14:03, David Gibson wrote:
> The 'spapr-pci-host-bridge' represents the virtual PCI host bridge (PHB)
> for a PAPR guest.  Unlike on x86, it's routine on Power (both bare metal
> and PAPR guests) to have numerous independent PHBs, each controlling a
> separate PCI domain.
> 
> There are two ways of configuring the spapr-pci-host-bridge device: first
> it can be done fully manually, specifying the locations and sizes of all
> the IO windows.  This gives the most control, but is very awkward with 6
> mandatory parameters.  Alternatively just an "index" can be specified
> which essentially selects from an array of predefined PHB locations.
> The PHB at index 0 is automatically created as the default PHB.
> 
> The current set of default locations causes some problems for guests with
> large RAM (> 1 TiB) or PCI devices with very large BARs (e.g. big nVidia
> GPGPU cards via VFIO).  Obviously, for migration we can only change the
> locations on a new machine type, however.
> 
> This is awkward, because the placement is currently decided within the
> spapr-pci-host-bridge code, so it breaks abstraction to look inside the
> machine type version.
> 
> So, this patch delegates the "default mode" PHB placement from the
> spapr-pci-host-bridge device back to the machine type via a public method
> in sPAPRMachineClass.  It's still a bit ugly, but it's about the best we
> can do.
> 
> For now, this just changes where the calculation is done.  It doesn't
> change the actual location of the host bridges, or any other behaviour.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/ppc/spapr.c              | 34 ++++++++++++++++++++++++++++++++++
>  hw/ppc/spapr_pci.c          | 22 ++++++++--------------
>  include/hw/pci-host/spapr.h | 11 +----------
>  include/hw/ppc/spapr.h      |  4 ++++
>  4 files changed, 47 insertions(+), 24 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 03e3803..f6e9c2a 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -2370,6 +2370,39 @@ static HotpluggableCPUList *spapr_query_hotpluggable_cpus(MachineState *machine)
>      return head;
>  }
>  
> +static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
> +                                uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
> +                                hwaddr *mmio, hwaddr *mmio_size,
> +                                unsigned n_dma, uint32_t *liobns, Error **errp)
> +{
> +    const uint64_t base_buid = 0x800000020000000ULL;
> +    const hwaddr phb0_base = 0x10000000000ULL; /* 1 TiB */
> +    const hwaddr phb_spacing = 0x1000000000ULL; /* 64 GiB */
> +    const hwaddr mmio_offset = 0xa0000000; /* 2 GiB + 512 MiB */
> +    const hwaddr pio_offset = 0x80000000; /* 2 GiB */
> +    const uint32_t max_index = 255;
> +
> +    hwaddr phb_base;
> +    int i;
> +
> +    if (index > max_index) {
> +        error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
> +                   max_index);
> +        return;
> +    }
> +
> +    *buid = base_buid + index;
> +    for (i = 0; i < n_dma; ++i) {
> +        liobns[i] = SPAPR_PCI_LIOBN(index, i);
> +    }
> +
> +    phb_base = phb0_base + index * phb_spacing;
> +    *pio = phb_base + pio_offset;
> +    *pio_size = SPAPR_PCI_IO_WIN_SIZE;
> +    *mmio = phb_base + mmio_offset;
> +    *mmio_size = SPAPR_PCI_MMIO_WIN_SIZE;
> +}
> +
>  static void spapr_machine_class_init(ObjectClass *oc, void *data)
>  {
>      MachineClass *mc = MACHINE_CLASS(oc);
> @@ -2406,6 +2439,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
>      mc->query_hotpluggable_cpus = spapr_query_hotpluggable_cpus;
>      fwc->get_dev_path = spapr_get_fw_dev_path;
>      nc->nmi_monitor_handler = spapr_nmi;
> +    smc->phb_placement = spapr_phb_placement;
>  }
>  
>  static const TypeInfo spapr_machine_info = {
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 4f00865..c0fc964 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -1311,7 +1311,8 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>          sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
>  
>      if (sphb->index != (uint32_t)-1) {
> -        hwaddr windows_base;
> +        sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
> +        Error *local_err = NULL;
>  
>          if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
>              || (sphb->dma_liobn[1] != (uint32_t)-1 && windows_supported == 2)
> @@ -1322,21 +1323,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>              return;
>          }
>  
> -        if (sphb->index > SPAPR_PCI_MAX_INDEX) {
> -            error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
> -                       SPAPR_PCI_MAX_INDEX);
> +        smc->phb_placement(spapr, sphb->index,
> +                           &sphb->buid, &sphb->io_win_addr, &sphb->io_win_size,
> +                           &sphb->mem_win_addr, &sphb->mem_win_size,
> +                           windows_supported, sphb->dma_liobn, &local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);
>              return;
>          }
> -
> -        sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
> -        for (i = 0; i < windows_supported; ++i) {
> -            sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
> -        }
> -
> -        windows_base = SPAPR_PCI_WINDOW_BASE
> -            + sphb->index * SPAPR_PCI_WINDOW_SPACING;
> -        sphb->mem_win_addr = windows_base + SPAPR_PCI_MMIO_WIN_OFF;
> -        sphb->io_win_addr = windows_base + SPAPR_PCI_IO_WIN_OFF;
>      }
>  
>      if (sphb->buid == (uint64_t)-1) {
> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> index 30dbd46..8c9ebfd 100644
> --- a/include/hw/pci-host/spapr.h
> +++ b/include/hw/pci-host/spapr.h
> @@ -79,18 +79,9 @@ struct sPAPRPHBState {
>      uint32_t numa_node;
>  };
>  
> -#define SPAPR_PCI_MAX_INDEX          255
> -
> -#define SPAPR_PCI_BASE_BUID          0x800000020000000ULL
> -
>  #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
>  
> -#define SPAPR_PCI_WINDOW_BASE        0x10000000000ULL
> -#define SPAPR_PCI_WINDOW_SPACING     0x1000000000ULL
> -#define SPAPR_PCI_MMIO_WIN_OFF       0xA0000000
> -#define SPAPR_PCI_MMIO_WIN_SIZE      (SPAPR_PCI_WINDOW_SPACING - \
> -                                     SPAPR_PCI_MEM_WIN_BUS_OFFSET)
> -#define SPAPR_PCI_IO_WIN_OFF         0x80000000
> +#define SPAPR_PCI_MMIO_WIN_SIZE      0xf80000000
>  #define SPAPR_PCI_IO_WIN_SIZE        0x10000
>  
>  #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 39dadaa..9e1c5a5 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -40,6 +40,10 @@ struct sPAPRMachineClass {
>      bool dr_lmb_enabled;       /* enable dynamic-reconfig/hotplug of LMBs */
>      bool use_ohci_by_default;  /* use USB-OHCI instead of XHCI */
>      const char *tcg_default_cpu; /* which (TCG) CPU to simulate by default */
> +    void (*phb_placement)(sPAPRMachineState *spapr, uint32_t index,
> +                          uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
> +                          hwaddr *mmio, hwaddr *mmio_size,
> +                          unsigned n_dma, uint32_t *liobns, Error **errp);


I still like idea of getting rid of index better. In order to reduce the
command line length, you could not enable IO and MMIO32 by default, the
default size for MMIO64 could be set to 1TB or something like this, BUID
could be just a raw MMIO64 address value.

So essentially you would only have to specify MMIO64 and 2xLIOBN in the
command line which does not seem that much of overkill assuming that a
second (third,...) PHB is almost never used and even if it is, it will be
used for SRIOV VF (which can do 64bit) or virtio (the same).

With index and phb_placement(), QEMU will still auto-generate these IO/MMIO
addresses from the command line parameters which did hit us with migration
before as the QEMU command line on the source side and the destination side
is not exactly the same, and this might hit again...



>  };
>  
>  /**
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC 1/4] spapr_pci: Delegate placement of PCI host bridges to machine type
  2016-10-07  3:57   ` Alexey Kardashevskiy
@ 2016-10-07  5:10     ` David Gibson
  2016-10-07  5:34       ` Alexey Kardashevskiy
  2016-10-11  3:17       ` David Gibson
  0 siblings, 2 replies; 20+ messages in thread
From: David Gibson @ 2016-10-07  5:10 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-ppc, qemu-devel, benh, thuth, lvivier, agraf, mst, mdroth,
	nikunj, bharata, abologna, mpolednik

[-- Attachment #1: Type: text/plain, Size: 10150 bytes --]

On Fri, Oct 07, 2016 at 02:57:43PM +1100, Alexey Kardashevskiy wrote:
> On 06/10/16 14:03, David Gibson wrote:
> > The 'spapr-pci-host-bridge' represents the virtual PCI host bridge (PHB)
> > for a PAPR guest.  Unlike on x86, it's routine on Power (both bare metal
> > and PAPR guests) to have numerous independent PHBs, each controlling a
> > separate PCI domain.
> > 
> > There are two ways of configuring the spapr-pci-host-bridge device: first
> > it can be done fully manually, specifying the locations and sizes of all
> > the IO windows.  This gives the most control, but is very awkward with 6
> > mandatory parameters.  Alternatively just an "index" can be specified
> > which essentially selects from an array of predefined PHB locations.
> > The PHB at index 0 is automatically created as the default PHB.
> > 
> > The current set of default locations causes some problems for guests with
> > large RAM (> 1 TiB) or PCI devices with very large BARs (e.g. big nVidia
> > GPGPU cards via VFIO).  Obviously, for migration we can only change the
> > locations on a new machine type, however.
> > 
> > This is awkward, because the placement is currently decided within the
> > spapr-pci-host-bridge code, so it breaks abstraction to look inside the
> > machine type version.
> > 
> > So, this patch delegates the "default mode" PHB placement from the
> > spapr-pci-host-bridge device back to the machine type via a public method
> > in sPAPRMachineClass.  It's still a bit ugly, but it's about the best we
> > can do.
> > 
> > For now, this just changes where the calculation is done.  It doesn't
> > change the actual location of the host bridges, or any other behaviour.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  hw/ppc/spapr.c              | 34 ++++++++++++++++++++++++++++++++++
> >  hw/ppc/spapr_pci.c          | 22 ++++++++--------------
> >  include/hw/pci-host/spapr.h | 11 +----------
> >  include/hw/ppc/spapr.h      |  4 ++++
> >  4 files changed, 47 insertions(+), 24 deletions(-)
> > 
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index 03e3803..f6e9c2a 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -2370,6 +2370,39 @@ static HotpluggableCPUList *spapr_query_hotpluggable_cpus(MachineState *machine)
> >      return head;
> >  }
> >  
> > +static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
> > +                                uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
> > +                                hwaddr *mmio, hwaddr *mmio_size,
> > +                                unsigned n_dma, uint32_t *liobns, Error **errp)
> > +{
> > +    const uint64_t base_buid = 0x800000020000000ULL;
> > +    const hwaddr phb0_base = 0x10000000000ULL; /* 1 TiB */
> > +    const hwaddr phb_spacing = 0x1000000000ULL; /* 64 GiB */
> > +    const hwaddr mmio_offset = 0xa0000000; /* 2 GiB + 512 MiB */
> > +    const hwaddr pio_offset = 0x80000000; /* 2 GiB */
> > +    const uint32_t max_index = 255;
> > +
> > +    hwaddr phb_base;
> > +    int i;
> > +
> > +    if (index > max_index) {
> > +        error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
> > +                   max_index);
> > +        return;
> > +    }
> > +
> > +    *buid = base_buid + index;
> > +    for (i = 0; i < n_dma; ++i) {
> > +        liobns[i] = SPAPR_PCI_LIOBN(index, i);
> > +    }
> > +
> > +    phb_base = phb0_base + index * phb_spacing;
> > +    *pio = phb_base + pio_offset;
> > +    *pio_size = SPAPR_PCI_IO_WIN_SIZE;
> > +    *mmio = phb_base + mmio_offset;
> > +    *mmio_size = SPAPR_PCI_MMIO_WIN_SIZE;
> > +}
> > +
> >  static void spapr_machine_class_init(ObjectClass *oc, void *data)
> >  {
> >      MachineClass *mc = MACHINE_CLASS(oc);
> > @@ -2406,6 +2439,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
> >      mc->query_hotpluggable_cpus = spapr_query_hotpluggable_cpus;
> >      fwc->get_dev_path = spapr_get_fw_dev_path;
> >      nc->nmi_monitor_handler = spapr_nmi;
> > +    smc->phb_placement = spapr_phb_placement;
> >  }
> >  
> >  static const TypeInfo spapr_machine_info = {
> > diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> > index 4f00865..c0fc964 100644
> > --- a/hw/ppc/spapr_pci.c
> > +++ b/hw/ppc/spapr_pci.c
> > @@ -1311,7 +1311,8 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >          sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
> >  
> >      if (sphb->index != (uint32_t)-1) {
> > -        hwaddr windows_base;
> > +        sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
> > +        Error *local_err = NULL;
> >  
> >          if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
> >              || (sphb->dma_liobn[1] != (uint32_t)-1 && windows_supported == 2)
> > @@ -1322,21 +1323,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >              return;
> >          }
> >  
> > -        if (sphb->index > SPAPR_PCI_MAX_INDEX) {
> > -            error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
> > -                       SPAPR_PCI_MAX_INDEX);
> > +        smc->phb_placement(spapr, sphb->index,
> > +                           &sphb->buid, &sphb->io_win_addr, &sphb->io_win_size,
> > +                           &sphb->mem_win_addr, &sphb->mem_win_size,
> > +                           windows_supported, sphb->dma_liobn, &local_err);
> > +        if (local_err) {
> > +            error_propagate(errp, local_err);
> >              return;
> >          }
> > -
> > -        sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
> > -        for (i = 0; i < windows_supported; ++i) {
> > -            sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
> > -        }
> > -
> > -        windows_base = SPAPR_PCI_WINDOW_BASE
> > -            + sphb->index * SPAPR_PCI_WINDOW_SPACING;
> > -        sphb->mem_win_addr = windows_base + SPAPR_PCI_MMIO_WIN_OFF;
> > -        sphb->io_win_addr = windows_base + SPAPR_PCI_IO_WIN_OFF;
> >      }
> >  
> >      if (sphb->buid == (uint64_t)-1) {
> > diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> > index 30dbd46..8c9ebfd 100644
> > --- a/include/hw/pci-host/spapr.h
> > +++ b/include/hw/pci-host/spapr.h
> > @@ -79,18 +79,9 @@ struct sPAPRPHBState {
> >      uint32_t numa_node;
> >  };
> >  
> > -#define SPAPR_PCI_MAX_INDEX          255
> > -
> > -#define SPAPR_PCI_BASE_BUID          0x800000020000000ULL
> > -
> >  #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
> >  
> > -#define SPAPR_PCI_WINDOW_BASE        0x10000000000ULL
> > -#define SPAPR_PCI_WINDOW_SPACING     0x1000000000ULL
> > -#define SPAPR_PCI_MMIO_WIN_OFF       0xA0000000
> > -#define SPAPR_PCI_MMIO_WIN_SIZE      (SPAPR_PCI_WINDOW_SPACING - \
> > -                                     SPAPR_PCI_MEM_WIN_BUS_OFFSET)
> > -#define SPAPR_PCI_IO_WIN_OFF         0x80000000
> > +#define SPAPR_PCI_MMIO_WIN_SIZE      0xf80000000
> >  #define SPAPR_PCI_IO_WIN_SIZE        0x10000
> >  
> >  #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
> > diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> > index 39dadaa..9e1c5a5 100644
> > --- a/include/hw/ppc/spapr.h
> > +++ b/include/hw/ppc/spapr.h
> > @@ -40,6 +40,10 @@ struct sPAPRMachineClass {
> >      bool dr_lmb_enabled;       /* enable dynamic-reconfig/hotplug of LMBs */
> >      bool use_ohci_by_default;  /* use USB-OHCI instead of XHCI */
> >      const char *tcg_default_cpu; /* which (TCG) CPU to simulate by default */
> > +    void (*phb_placement)(sPAPRMachineState *spapr, uint32_t index,
> > +                          uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
> > +                          hwaddr *mmio, hwaddr *mmio_size,
> > +                          unsigned n_dma, uint32_t *liobns, Error **errp);
> 
> 
> I still like idea of getting rid of index better. In order to reduce the
> command line length, you could not enable IO and MMIO32 by default, the
> default size for MMIO64 could be set to 1TB or something like this, BUID
> could be just a raw MMIO64 address value.

Hm, interesting idea.  I do like the idea of making the BUID equal to
the MMIO64 address.  Obviously we'd still need addresses for the
default PHB, including 32-bit and PIO windows.

Trickier, we'd still need to support 'index' for compatibility with
older command lines.  I guess we could reserve index purely for the
"legacy" locations - 2.8+ machine types would create the default
bridge with explicit windows, rather than just using index 0.

> So essentially you would only have to specify MMIO64 and 2xLIOBN in the
> command line which does not seem that much of overkill assuming that a
> second (third,...) PHB is almost never used and even if it is, it will be
> used for SRIOV VF (which can do 64bit) or virtio (the same).

Hm, maybe.  3 values is still a bit of a PITA.  I guess liobn64 could
theoretically be optional, although you're going to want it for
basically every case where you're creating an additional PHB.

Or.. what if we insisted the mmio64 window had at least 4G alignment,
then we might be able to use mmio64 >> 32 (== buid >> 32) as the
liobn.  Hmm.. guess it would need to be 8G alignment, so we have one
extra bit.

..and I wonder if this might be easier to manage if we introduced a
new spapr-pci-host-bridge-2 subtype.

Let me think about this..

> With index and phb_placement(), QEMU will still auto-generate these IO/MMIO
> addresses from the command line parameters which did hit us with migration
> before as the QEMU command line on the source side and the destination side
> is not exactly the same, and this might hit again...

Uh.. I'm not quite following you.  What's the situation which would
break migration with the proposed scheme?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC 1/4] spapr_pci: Delegate placement of PCI host bridges to machine type
  2016-10-07  5:10     ` David Gibson
@ 2016-10-07  5:34       ` Alexey Kardashevskiy
  2016-10-07  9:17         ` David Gibson
  2016-10-11  3:17       ` David Gibson
  1 sibling, 1 reply; 20+ messages in thread
From: Alexey Kardashevskiy @ 2016-10-07  5:34 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, benh, thuth, lvivier, agraf, mst, mdroth,
	nikunj, bharata, abologna, mpolednik

[-- Attachment #1: Type: text/plain, Size: 10450 bytes --]

On 07/10/16 16:10, David Gibson wrote:
> On Fri, Oct 07, 2016 at 02:57:43PM +1100, Alexey Kardashevskiy wrote:
>> On 06/10/16 14:03, David Gibson wrote:
>>> The 'spapr-pci-host-bridge' represents the virtual PCI host bridge (PHB)
>>> for a PAPR guest.  Unlike on x86, it's routine on Power (both bare metal
>>> and PAPR guests) to have numerous independent PHBs, each controlling a
>>> separate PCI domain.
>>>
>>> There are two ways of configuring the spapr-pci-host-bridge device: first
>>> it can be done fully manually, specifying the locations and sizes of all
>>> the IO windows.  This gives the most control, but is very awkward with 6
>>> mandatory parameters.  Alternatively just an "index" can be specified
>>> which essentially selects from an array of predefined PHB locations.
>>> The PHB at index 0 is automatically created as the default PHB.
>>>
>>> The current set of default locations causes some problems for guests with
>>> large RAM (> 1 TiB) or PCI devices with very large BARs (e.g. big nVidia
>>> GPGPU cards via VFIO).  Obviously, for migration we can only change the
>>> locations on a new machine type, however.
>>>
>>> This is awkward, because the placement is currently decided within the
>>> spapr-pci-host-bridge code, so it breaks abstraction to look inside the
>>> machine type version.
>>>
>>> So, this patch delegates the "default mode" PHB placement from the
>>> spapr-pci-host-bridge device back to the machine type via a public method
>>> in sPAPRMachineClass.  It's still a bit ugly, but it's about the best we
>>> can do.
>>>
>>> For now, this just changes where the calculation is done.  It doesn't
>>> change the actual location of the host bridges, or any other behaviour.
>>>
>>> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
>>> ---
>>>  hw/ppc/spapr.c              | 34 ++++++++++++++++++++++++++++++++++
>>>  hw/ppc/spapr_pci.c          | 22 ++++++++--------------
>>>  include/hw/pci-host/spapr.h | 11 +----------
>>>  include/hw/ppc/spapr.h      |  4 ++++
>>>  4 files changed, 47 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>> index 03e3803..f6e9c2a 100644
>>> --- a/hw/ppc/spapr.c
>>> +++ b/hw/ppc/spapr.c
>>> @@ -2370,6 +2370,39 @@ static HotpluggableCPUList *spapr_query_hotpluggable_cpus(MachineState *machine)
>>>      return head;
>>>  }
>>>  
>>> +static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
>>> +                                uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
>>> +                                hwaddr *mmio, hwaddr *mmio_size,
>>> +                                unsigned n_dma, uint32_t *liobns, Error **errp)
>>> +{
>>> +    const uint64_t base_buid = 0x800000020000000ULL;
>>> +    const hwaddr phb0_base = 0x10000000000ULL; /* 1 TiB */
>>> +    const hwaddr phb_spacing = 0x1000000000ULL; /* 64 GiB */
>>> +    const hwaddr mmio_offset = 0xa0000000; /* 2 GiB + 512 MiB */
>>> +    const hwaddr pio_offset = 0x80000000; /* 2 GiB */
>>> +    const uint32_t max_index = 255;
>>> +
>>> +    hwaddr phb_base;
>>> +    int i;
>>> +
>>> +    if (index > max_index) {
>>> +        error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
>>> +                   max_index);
>>> +        return;
>>> +    }
>>> +
>>> +    *buid = base_buid + index;
>>> +    for (i = 0; i < n_dma; ++i) {
>>> +        liobns[i] = SPAPR_PCI_LIOBN(index, i);
>>> +    }
>>> +
>>> +    phb_base = phb0_base + index * phb_spacing;
>>> +    *pio = phb_base + pio_offset;
>>> +    *pio_size = SPAPR_PCI_IO_WIN_SIZE;
>>> +    *mmio = phb_base + mmio_offset;
>>> +    *mmio_size = SPAPR_PCI_MMIO_WIN_SIZE;
>>> +}
>>> +
>>>  static void spapr_machine_class_init(ObjectClass *oc, void *data)
>>>  {
>>>      MachineClass *mc = MACHINE_CLASS(oc);
>>> @@ -2406,6 +2439,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
>>>      mc->query_hotpluggable_cpus = spapr_query_hotpluggable_cpus;
>>>      fwc->get_dev_path = spapr_get_fw_dev_path;
>>>      nc->nmi_monitor_handler = spapr_nmi;
>>> +    smc->phb_placement = spapr_phb_placement;
>>>  }
>>>  
>>>  static const TypeInfo spapr_machine_info = {
>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>> index 4f00865..c0fc964 100644
>>> --- a/hw/ppc/spapr_pci.c
>>> +++ b/hw/ppc/spapr_pci.c
>>> @@ -1311,7 +1311,8 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>>          sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
>>>  
>>>      if (sphb->index != (uint32_t)-1) {
>>> -        hwaddr windows_base;
>>> +        sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
>>> +        Error *local_err = NULL;
>>>  
>>>          if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
>>>              || (sphb->dma_liobn[1] != (uint32_t)-1 && windows_supported == 2)
>>> @@ -1322,21 +1323,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>>              return;
>>>          }
>>>  
>>> -        if (sphb->index > SPAPR_PCI_MAX_INDEX) {
>>> -            error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
>>> -                       SPAPR_PCI_MAX_INDEX);
>>> +        smc->phb_placement(spapr, sphb->index,
>>> +                           &sphb->buid, &sphb->io_win_addr, &sphb->io_win_size,
>>> +                           &sphb->mem_win_addr, &sphb->mem_win_size,
>>> +                           windows_supported, sphb->dma_liobn, &local_err);
>>> +        if (local_err) {
>>> +            error_propagate(errp, local_err);
>>>              return;
>>>          }
>>> -
>>> -        sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
>>> -        for (i = 0; i < windows_supported; ++i) {
>>> -            sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
>>> -        }
>>> -
>>> -        windows_base = SPAPR_PCI_WINDOW_BASE
>>> -            + sphb->index * SPAPR_PCI_WINDOW_SPACING;
>>> -        sphb->mem_win_addr = windows_base + SPAPR_PCI_MMIO_WIN_OFF;
>>> -        sphb->io_win_addr = windows_base + SPAPR_PCI_IO_WIN_OFF;
>>>      }
>>>  
>>>      if (sphb->buid == (uint64_t)-1) {
>>> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
>>> index 30dbd46..8c9ebfd 100644
>>> --- a/include/hw/pci-host/spapr.h
>>> +++ b/include/hw/pci-host/spapr.h
>>> @@ -79,18 +79,9 @@ struct sPAPRPHBState {
>>>      uint32_t numa_node;
>>>  };
>>>  
>>> -#define SPAPR_PCI_MAX_INDEX          255
>>> -
>>> -#define SPAPR_PCI_BASE_BUID          0x800000020000000ULL
>>> -
>>>  #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
>>>  
>>> -#define SPAPR_PCI_WINDOW_BASE        0x10000000000ULL
>>> -#define SPAPR_PCI_WINDOW_SPACING     0x1000000000ULL
>>> -#define SPAPR_PCI_MMIO_WIN_OFF       0xA0000000
>>> -#define SPAPR_PCI_MMIO_WIN_SIZE      (SPAPR_PCI_WINDOW_SPACING - \
>>> -                                     SPAPR_PCI_MEM_WIN_BUS_OFFSET)
>>> -#define SPAPR_PCI_IO_WIN_OFF         0x80000000
>>> +#define SPAPR_PCI_MMIO_WIN_SIZE      0xf80000000
>>>  #define SPAPR_PCI_IO_WIN_SIZE        0x10000
>>>  
>>>  #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>>> index 39dadaa..9e1c5a5 100644
>>> --- a/include/hw/ppc/spapr.h
>>> +++ b/include/hw/ppc/spapr.h
>>> @@ -40,6 +40,10 @@ struct sPAPRMachineClass {
>>>      bool dr_lmb_enabled;       /* enable dynamic-reconfig/hotplug of LMBs */
>>>      bool use_ohci_by_default;  /* use USB-OHCI instead of XHCI */
>>>      const char *tcg_default_cpu; /* which (TCG) CPU to simulate by default */
>>> +    void (*phb_placement)(sPAPRMachineState *spapr, uint32_t index,
>>> +                          uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
>>> +                          hwaddr *mmio, hwaddr *mmio_size,
>>> +                          unsigned n_dma, uint32_t *liobns, Error **errp);
>>
>>
>> I still like idea of getting rid of index better. In order to reduce the
>> command line length, you could not enable IO and MMIO32 by default, the
>> default size for MMIO64 could be set to 1TB or something like this, BUID
>> could be just a raw MMIO64 address value.
> 
> Hm, interesting idea.  I do like the idea of making the BUID equal to
> the MMIO64 address.  Obviously we'd still need addresses for the
> default PHB, including 32-bit and PIO windows.
> 
> Trickier, we'd still need to support 'index' for compatibility with
> older command lines.  I guess we could reserve index purely for the
> "legacy" locations - 2.8+ machine types would create the default
> bridge with explicit windows, rather than just using index 0.
> 
>> So essentially you would only have to specify MMIO64 and 2xLIOBN in the
>> command line which does not seem that much of overkill assuming that a
>> second (third,...) PHB is almost never used and even if it is, it will be
>> used for SRIOV VF (which can do 64bit) or virtio (the same).
> 
> Hm, maybe.  3 values is still a bit of a PITA.  I guess liobn64 could
> theoretically be optional, although you're going to want it for
> basically every case where you're creating an additional PHB.
> 
> Or.. what if we insisted the mmio64 window had at least 4G alignment,
> then we might be able to use mmio64 >> 32 (== buid >> 32) as the
> liobn.  Hmm.. guess it would need to be 8G alignment, so we have one
> extra bit.


As you mentioned in 4/4, guests cannot access above 64 TiB, we could use
upper bits to store the window number.


> 
> ..and I wonder if this might be easier to manage if we introduced a
> new spapr-pci-host-bridge-2 subtype.
> 
> Let me think about this..
> 
>> With index and phb_placement(), QEMU will still auto-generate these IO/MMIO
>> addresses from the command line parameters which did hit us with migration
>> before as the QEMU command line on the source side and the destination side
>> is not exactly the same, and this might hit again...
> 
> Uh.. I'm not quite following you.  What's the situation which would
> break migration with the proposed scheme?

Well, I cannot think of any other device available on SPAPR which we could
map to the CPU address space but if it ever appears, we will have a problem
unless this new imaginary device does not pick an address from the CPU
space automatically.



-- 
Alexey


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC 1/4] spapr_pci: Delegate placement of PCI host bridges to machine type
  2016-10-07  5:34       ` Alexey Kardashevskiy
@ 2016-10-07  9:17         ` David Gibson
  2016-10-10  1:04           ` Alexey Kardashevskiy
  0 siblings, 1 reply; 20+ messages in thread
From: David Gibson @ 2016-10-07  9:17 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-ppc, qemu-devel, benh, thuth, lvivier, agraf, mst, mdroth,
	nikunj, bharata, abologna, mpolednik

[-- Attachment #1: Type: text/plain, Size: 11329 bytes --]

On Fri, Oct 07, 2016 at 04:34:59PM +1100, Alexey Kardashevskiy wrote:
> On 07/10/16 16:10, David Gibson wrote:
> > On Fri, Oct 07, 2016 at 02:57:43PM +1100, Alexey Kardashevskiy wrote:
> >> On 06/10/16 14:03, David Gibson wrote:
> >>> The 'spapr-pci-host-bridge' represents the virtual PCI host bridge (PHB)
> >>> for a PAPR guest.  Unlike on x86, it's routine on Power (both bare metal
> >>> and PAPR guests) to have numerous independent PHBs, each controlling a
> >>> separate PCI domain.
> >>>
> >>> There are two ways of configuring the spapr-pci-host-bridge device: first
> >>> it can be done fully manually, specifying the locations and sizes of all
> >>> the IO windows.  This gives the most control, but is very awkward with 6
> >>> mandatory parameters.  Alternatively just an "index" can be specified
> >>> which essentially selects from an array of predefined PHB locations.
> >>> The PHB at index 0 is automatically created as the default PHB.
> >>>
> >>> The current set of default locations causes some problems for guests with
> >>> large RAM (> 1 TiB) or PCI devices with very large BARs (e.g. big nVidia
> >>> GPGPU cards via VFIO).  Obviously, for migration we can only change the
> >>> locations on a new machine type, however.
> >>>
> >>> This is awkward, because the placement is currently decided within the
> >>> spapr-pci-host-bridge code, so it breaks abstraction to look inside the
> >>> machine type version.
> >>>
> >>> So, this patch delegates the "default mode" PHB placement from the
> >>> spapr-pci-host-bridge device back to the machine type via a public method
> >>> in sPAPRMachineClass.  It's still a bit ugly, but it's about the best we
> >>> can do.
> >>>
> >>> For now, this just changes where the calculation is done.  It doesn't
> >>> change the actual location of the host bridges, or any other behaviour.
> >>>
> >>> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> >>> ---
> >>>  hw/ppc/spapr.c              | 34 ++++++++++++++++++++++++++++++++++
> >>>  hw/ppc/spapr_pci.c          | 22 ++++++++--------------
> >>>  include/hw/pci-host/spapr.h | 11 +----------
> >>>  include/hw/ppc/spapr.h      |  4 ++++
> >>>  4 files changed, 47 insertions(+), 24 deletions(-)
> >>>
> >>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>> index 03e3803..f6e9c2a 100644
> >>> --- a/hw/ppc/spapr.c
> >>> +++ b/hw/ppc/spapr.c
> >>> @@ -2370,6 +2370,39 @@ static HotpluggableCPUList *spapr_query_hotpluggable_cpus(MachineState *machine)
> >>>      return head;
> >>>  }
> >>>  
> >>> +static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
> >>> +                                uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
> >>> +                                hwaddr *mmio, hwaddr *mmio_size,
> >>> +                                unsigned n_dma, uint32_t *liobns, Error **errp)
> >>> +{
> >>> +    const uint64_t base_buid = 0x800000020000000ULL;
> >>> +    const hwaddr phb0_base = 0x10000000000ULL; /* 1 TiB */
> >>> +    const hwaddr phb_spacing = 0x1000000000ULL; /* 64 GiB */
> >>> +    const hwaddr mmio_offset = 0xa0000000; /* 2 GiB + 512 MiB */
> >>> +    const hwaddr pio_offset = 0x80000000; /* 2 GiB */
> >>> +    const uint32_t max_index = 255;
> >>> +
> >>> +    hwaddr phb_base;
> >>> +    int i;
> >>> +
> >>> +    if (index > max_index) {
> >>> +        error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
> >>> +                   max_index);
> >>> +        return;
> >>> +    }
> >>> +
> >>> +    *buid = base_buid + index;
> >>> +    for (i = 0; i < n_dma; ++i) {
> >>> +        liobns[i] = SPAPR_PCI_LIOBN(index, i);
> >>> +    }
> >>> +
> >>> +    phb_base = phb0_base + index * phb_spacing;
> >>> +    *pio = phb_base + pio_offset;
> >>> +    *pio_size = SPAPR_PCI_IO_WIN_SIZE;
> >>> +    *mmio = phb_base + mmio_offset;
> >>> +    *mmio_size = SPAPR_PCI_MMIO_WIN_SIZE;
> >>> +}
> >>> +
> >>>  static void spapr_machine_class_init(ObjectClass *oc, void *data)
> >>>  {
> >>>      MachineClass *mc = MACHINE_CLASS(oc);
> >>> @@ -2406,6 +2439,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
> >>>      mc->query_hotpluggable_cpus = spapr_query_hotpluggable_cpus;
> >>>      fwc->get_dev_path = spapr_get_fw_dev_path;
> >>>      nc->nmi_monitor_handler = spapr_nmi;
> >>> +    smc->phb_placement = spapr_phb_placement;
> >>>  }
> >>>  
> >>>  static const TypeInfo spapr_machine_info = {
> >>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> >>> index 4f00865..c0fc964 100644
> >>> --- a/hw/ppc/spapr_pci.c
> >>> +++ b/hw/ppc/spapr_pci.c
> >>> @@ -1311,7 +1311,8 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >>>          sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
> >>>  
> >>>      if (sphb->index != (uint32_t)-1) {
> >>> -        hwaddr windows_base;
> >>> +        sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
> >>> +        Error *local_err = NULL;
> >>>  
> >>>          if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
> >>>              || (sphb->dma_liobn[1] != (uint32_t)-1 && windows_supported == 2)
> >>> @@ -1322,21 +1323,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >>>              return;
> >>>          }
> >>>  
> >>> -        if (sphb->index > SPAPR_PCI_MAX_INDEX) {
> >>> -            error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
> >>> -                       SPAPR_PCI_MAX_INDEX);
> >>> +        smc->phb_placement(spapr, sphb->index,
> >>> +                           &sphb->buid, &sphb->io_win_addr, &sphb->io_win_size,
> >>> +                           &sphb->mem_win_addr, &sphb->mem_win_size,
> >>> +                           windows_supported, sphb->dma_liobn, &local_err);
> >>> +        if (local_err) {
> >>> +            error_propagate(errp, local_err);
> >>>              return;
> >>>          }
> >>> -
> >>> -        sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
> >>> -        for (i = 0; i < windows_supported; ++i) {
> >>> -            sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
> >>> -        }
> >>> -
> >>> -        windows_base = SPAPR_PCI_WINDOW_BASE
> >>> -            + sphb->index * SPAPR_PCI_WINDOW_SPACING;
> >>> -        sphb->mem_win_addr = windows_base + SPAPR_PCI_MMIO_WIN_OFF;
> >>> -        sphb->io_win_addr = windows_base + SPAPR_PCI_IO_WIN_OFF;
> >>>      }
> >>>  
> >>>      if (sphb->buid == (uint64_t)-1) {
> >>> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> >>> index 30dbd46..8c9ebfd 100644
> >>> --- a/include/hw/pci-host/spapr.h
> >>> +++ b/include/hw/pci-host/spapr.h
> >>> @@ -79,18 +79,9 @@ struct sPAPRPHBState {
> >>>      uint32_t numa_node;
> >>>  };
> >>>  
> >>> -#define SPAPR_PCI_MAX_INDEX          255
> >>> -
> >>> -#define SPAPR_PCI_BASE_BUID          0x800000020000000ULL
> >>> -
> >>>  #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
> >>>  
> >>> -#define SPAPR_PCI_WINDOW_BASE        0x10000000000ULL
> >>> -#define SPAPR_PCI_WINDOW_SPACING     0x1000000000ULL
> >>> -#define SPAPR_PCI_MMIO_WIN_OFF       0xA0000000
> >>> -#define SPAPR_PCI_MMIO_WIN_SIZE      (SPAPR_PCI_WINDOW_SPACING - \
> >>> -                                     SPAPR_PCI_MEM_WIN_BUS_OFFSET)
> >>> -#define SPAPR_PCI_IO_WIN_OFF         0x80000000
> >>> +#define SPAPR_PCI_MMIO_WIN_SIZE      0xf80000000
> >>>  #define SPAPR_PCI_IO_WIN_SIZE        0x10000
> >>>  
> >>>  #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
> >>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >>> index 39dadaa..9e1c5a5 100644
> >>> --- a/include/hw/ppc/spapr.h
> >>> +++ b/include/hw/ppc/spapr.h
> >>> @@ -40,6 +40,10 @@ struct sPAPRMachineClass {
> >>>      bool dr_lmb_enabled;       /* enable dynamic-reconfig/hotplug of LMBs */
> >>>      bool use_ohci_by_default;  /* use USB-OHCI instead of XHCI */
> >>>      const char *tcg_default_cpu; /* which (TCG) CPU to simulate by default */
> >>> +    void (*phb_placement)(sPAPRMachineState *spapr, uint32_t index,
> >>> +                          uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
> >>> +                          hwaddr *mmio, hwaddr *mmio_size,
> >>> +                          unsigned n_dma, uint32_t *liobns, Error **errp);
> >>
> >>
> >> I still like idea of getting rid of index better. In order to reduce the
> >> command line length, you could not enable IO and MMIO32 by default, the
> >> default size for MMIO64 could be set to 1TB or something like this, BUID
> >> could be just a raw MMIO64 address value.
> > 
> > Hm, interesting idea.  I do like the idea of making the BUID equal to
> > the MMIO64 address.  Obviously we'd still need addresses for the
> > default PHB, including 32-bit and PIO windows.
> > 
> > Trickier, we'd still need to support 'index' for compatibility with
> > older command lines.  I guess we could reserve index purely for the
> > "legacy" locations - 2.8+ machine types would create the default
> > bridge with explicit windows, rather than just using index 0.
> > 
> >> So essentially you would only have to specify MMIO64 and 2xLIOBN in the
> >> command line which does not seem that much of overkill assuming that a
> >> second (third,...) PHB is almost never used and even if it is, it will be
> >> used for SRIOV VF (which can do 64bit) or virtio (the same).
> > 
> > Hm, maybe.  3 values is still a bit of a PITA.  I guess liobn64 could
> > theoretically be optional, although you're going to want it for
> > basically every case where you're creating an additional PHB.
> > 
> > Or.. what if we insisted the mmio64 window had at least 4G alignment,
> > then we might be able to use mmio64 >> 32 (== buid >> 32) as the
> > liobn.  Hmm.. guess it would need to be 8G alignment, so we have one
> > extra bit.
> 
> 
> As you mentioned in 4/4, guests cannot access above 64 TiB, we could use
> upper bits to store the window number.

*Some* guests can't access above 64 TiB, that's not an inherent
limitation.  So I'd rather not do that.

> 
> 
> > 
> > ..and I wonder if this might be easier to manage if we introduced a
> > new spapr-pci-host-bridge-2 subtype.
> > 
> > Let me think about this..
> > 
> >> With index and phb_placement(), QEMU will still auto-generate these IO/MMIO
> >> addresses from the command line parameters which did hit us with migration
> >> before as the QEMU command line on the source side and the destination side
> >> is not exactly the same, and this might hit again...
> > 
> > Uh.. I'm not quite following you.  What's the situation which would
> > break migration with the proposed scheme?
> 
> Well, I cannot think of any other device available on SPAPR which we could
> map to the CPU address space but if it ever appears, we will have a problem
> unless this new imaginary device does not pick an address from the CPU
> space automatically.

I followed that even less that the previous one, I can't even parse that.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC 1/4] spapr_pci: Delegate placement of PCI host bridges to machine type
  2016-10-07  9:17         ` David Gibson
@ 2016-10-10  1:04           ` Alexey Kardashevskiy
  2016-10-10  4:07             ` David Gibson
  0 siblings, 1 reply; 20+ messages in thread
From: Alexey Kardashevskiy @ 2016-10-10  1:04 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, benh, thuth, lvivier, agraf, mst, mdroth,
	nikunj, bharata, abologna, mpolednik

[-- Attachment #1: Type: text/plain, Size: 12135 bytes --]

On 07/10/16 20:17, David Gibson wrote:
> On Fri, Oct 07, 2016 at 04:34:59PM +1100, Alexey Kardashevskiy wrote:
>> On 07/10/16 16:10, David Gibson wrote:
>>> On Fri, Oct 07, 2016 at 02:57:43PM +1100, Alexey Kardashevskiy wrote:
>>>> On 06/10/16 14:03, David Gibson wrote:
>>>>> The 'spapr-pci-host-bridge' represents the virtual PCI host bridge (PHB)
>>>>> for a PAPR guest.  Unlike on x86, it's routine on Power (both bare metal
>>>>> and PAPR guests) to have numerous independent PHBs, each controlling a
>>>>> separate PCI domain.
>>>>>
>>>>> There are two ways of configuring the spapr-pci-host-bridge device: first
>>>>> it can be done fully manually, specifying the locations and sizes of all
>>>>> the IO windows.  This gives the most control, but is very awkward with 6
>>>>> mandatory parameters.  Alternatively just an "index" can be specified
>>>>> which essentially selects from an array of predefined PHB locations.
>>>>> The PHB at index 0 is automatically created as the default PHB.
>>>>>
>>>>> The current set of default locations causes some problems for guests with
>>>>> large RAM (> 1 TiB) or PCI devices with very large BARs (e.g. big nVidia
>>>>> GPGPU cards via VFIO).  Obviously, for migration we can only change the
>>>>> locations on a new machine type, however.
>>>>>
>>>>> This is awkward, because the placement is currently decided within the
>>>>> spapr-pci-host-bridge code, so it breaks abstraction to look inside the
>>>>> machine type version.
>>>>>
>>>>> So, this patch delegates the "default mode" PHB placement from the
>>>>> spapr-pci-host-bridge device back to the machine type via a public method
>>>>> in sPAPRMachineClass.  It's still a bit ugly, but it's about the best we
>>>>> can do.
>>>>>
>>>>> For now, this just changes where the calculation is done.  It doesn't
>>>>> change the actual location of the host bridges, or any other behaviour.
>>>>>
>>>>> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
>>>>> ---
>>>>>  hw/ppc/spapr.c              | 34 ++++++++++++++++++++++++++++++++++
>>>>>  hw/ppc/spapr_pci.c          | 22 ++++++++--------------
>>>>>  include/hw/pci-host/spapr.h | 11 +----------
>>>>>  include/hw/ppc/spapr.h      |  4 ++++
>>>>>  4 files changed, 47 insertions(+), 24 deletions(-)
>>>>>
>>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>>> index 03e3803..f6e9c2a 100644
>>>>> --- a/hw/ppc/spapr.c
>>>>> +++ b/hw/ppc/spapr.c
>>>>> @@ -2370,6 +2370,39 @@ static HotpluggableCPUList *spapr_query_hotpluggable_cpus(MachineState *machine)
>>>>>      return head;
>>>>>  }
>>>>>  
>>>>> +static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
>>>>> +                                uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
>>>>> +                                hwaddr *mmio, hwaddr *mmio_size,
>>>>> +                                unsigned n_dma, uint32_t *liobns, Error **errp)
>>>>> +{
>>>>> +    const uint64_t base_buid = 0x800000020000000ULL;
>>>>> +    const hwaddr phb0_base = 0x10000000000ULL; /* 1 TiB */
>>>>> +    const hwaddr phb_spacing = 0x1000000000ULL; /* 64 GiB */
>>>>> +    const hwaddr mmio_offset = 0xa0000000; /* 2 GiB + 512 MiB */
>>>>> +    const hwaddr pio_offset = 0x80000000; /* 2 GiB */
>>>>> +    const uint32_t max_index = 255;
>>>>> +
>>>>> +    hwaddr phb_base;
>>>>> +    int i;
>>>>> +
>>>>> +    if (index > max_index) {
>>>>> +        error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
>>>>> +                   max_index);
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    *buid = base_buid + index;
>>>>> +    for (i = 0; i < n_dma; ++i) {
>>>>> +        liobns[i] = SPAPR_PCI_LIOBN(index, i);
>>>>> +    }
>>>>> +
>>>>> +    phb_base = phb0_base + index * phb_spacing;
>>>>> +    *pio = phb_base + pio_offset;
>>>>> +    *pio_size = SPAPR_PCI_IO_WIN_SIZE;
>>>>> +    *mmio = phb_base + mmio_offset;
>>>>> +    *mmio_size = SPAPR_PCI_MMIO_WIN_SIZE;
>>>>> +}
>>>>> +
>>>>>  static void spapr_machine_class_init(ObjectClass *oc, void *data)
>>>>>  {
>>>>>      MachineClass *mc = MACHINE_CLASS(oc);
>>>>> @@ -2406,6 +2439,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
>>>>>      mc->query_hotpluggable_cpus = spapr_query_hotpluggable_cpus;
>>>>>      fwc->get_dev_path = spapr_get_fw_dev_path;
>>>>>      nc->nmi_monitor_handler = spapr_nmi;
>>>>> +    smc->phb_placement = spapr_phb_placement;
>>>>>  }
>>>>>  
>>>>>  static const TypeInfo spapr_machine_info = {
>>>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>>>> index 4f00865..c0fc964 100644
>>>>> --- a/hw/ppc/spapr_pci.c
>>>>> +++ b/hw/ppc/spapr_pci.c
>>>>> @@ -1311,7 +1311,8 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>>>>          sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
>>>>>  
>>>>>      if (sphb->index != (uint32_t)-1) {
>>>>> -        hwaddr windows_base;
>>>>> +        sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
>>>>> +        Error *local_err = NULL;
>>>>>  
>>>>>          if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
>>>>>              || (sphb->dma_liobn[1] != (uint32_t)-1 && windows_supported == 2)
>>>>> @@ -1322,21 +1323,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>>>>              return;
>>>>>          }
>>>>>  
>>>>> -        if (sphb->index > SPAPR_PCI_MAX_INDEX) {
>>>>> -            error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
>>>>> -                       SPAPR_PCI_MAX_INDEX);
>>>>> +        smc->phb_placement(spapr, sphb->index,
>>>>> +                           &sphb->buid, &sphb->io_win_addr, &sphb->io_win_size,
>>>>> +                           &sphb->mem_win_addr, &sphb->mem_win_size,
>>>>> +                           windows_supported, sphb->dma_liobn, &local_err);
>>>>> +        if (local_err) {
>>>>> +            error_propagate(errp, local_err);
>>>>>              return;
>>>>>          }
>>>>> -
>>>>> -        sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
>>>>> -        for (i = 0; i < windows_supported; ++i) {
>>>>> -            sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
>>>>> -        }
>>>>> -
>>>>> -        windows_base = SPAPR_PCI_WINDOW_BASE
>>>>> -            + sphb->index * SPAPR_PCI_WINDOW_SPACING;
>>>>> -        sphb->mem_win_addr = windows_base + SPAPR_PCI_MMIO_WIN_OFF;
>>>>> -        sphb->io_win_addr = windows_base + SPAPR_PCI_IO_WIN_OFF;
>>>>>      }
>>>>>  
>>>>>      if (sphb->buid == (uint64_t)-1) {
>>>>> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
>>>>> index 30dbd46..8c9ebfd 100644
>>>>> --- a/include/hw/pci-host/spapr.h
>>>>> +++ b/include/hw/pci-host/spapr.h
>>>>> @@ -79,18 +79,9 @@ struct sPAPRPHBState {
>>>>>      uint32_t numa_node;
>>>>>  };
>>>>>  
>>>>> -#define SPAPR_PCI_MAX_INDEX          255
>>>>> -
>>>>> -#define SPAPR_PCI_BASE_BUID          0x800000020000000ULL
>>>>> -
>>>>>  #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
>>>>>  
>>>>> -#define SPAPR_PCI_WINDOW_BASE        0x10000000000ULL
>>>>> -#define SPAPR_PCI_WINDOW_SPACING     0x1000000000ULL
>>>>> -#define SPAPR_PCI_MMIO_WIN_OFF       0xA0000000
>>>>> -#define SPAPR_PCI_MMIO_WIN_SIZE      (SPAPR_PCI_WINDOW_SPACING - \
>>>>> -                                     SPAPR_PCI_MEM_WIN_BUS_OFFSET)
>>>>> -#define SPAPR_PCI_IO_WIN_OFF         0x80000000
>>>>> +#define SPAPR_PCI_MMIO_WIN_SIZE      0xf80000000
>>>>>  #define SPAPR_PCI_IO_WIN_SIZE        0x10000
>>>>>  
>>>>>  #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
>>>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>>>>> index 39dadaa..9e1c5a5 100644
>>>>> --- a/include/hw/ppc/spapr.h
>>>>> +++ b/include/hw/ppc/spapr.h
>>>>> @@ -40,6 +40,10 @@ struct sPAPRMachineClass {
>>>>>      bool dr_lmb_enabled;       /* enable dynamic-reconfig/hotplug of LMBs */
>>>>>      bool use_ohci_by_default;  /* use USB-OHCI instead of XHCI */
>>>>>      const char *tcg_default_cpu; /* which (TCG) CPU to simulate by default */
>>>>> +    void (*phb_placement)(sPAPRMachineState *spapr, uint32_t index,
>>>>> +                          uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
>>>>> +                          hwaddr *mmio, hwaddr *mmio_size,
>>>>> +                          unsigned n_dma, uint32_t *liobns, Error **errp);
>>>>
>>>>
>>>> I still like idea of getting rid of index better. In order to reduce the
>>>> command line length, you could not enable IO and MMIO32 by default, the
>>>> default size for MMIO64 could be set to 1TB or something like this, BUID
>>>> could be just a raw MMIO64 address value.
>>>
>>> Hm, interesting idea.  I do like the idea of making the BUID equal to
>>> the MMIO64 address.  Obviously we'd still need addresses for the
>>> default PHB, including 32-bit and PIO windows.
>>>
>>> Trickier, we'd still need to support 'index' for compatibility with
>>> older command lines.  I guess we could reserve index purely for the
>>> "legacy" locations - 2.8+ machine types would create the default
>>> bridge with explicit windows, rather than just using index 0.
>>>
>>>> So essentially you would only have to specify MMIO64 and 2xLIOBN in the
>>>> command line which does not seem that much of overkill assuming that a
>>>> second (third,...) PHB is almost never used and even if it is, it will be
>>>> used for SRIOV VF (which can do 64bit) or virtio (the same).
>>>
>>> Hm, maybe.  3 values is still a bit of a PITA.  I guess liobn64 could
>>> theoretically be optional, although you're going to want it for
>>> basically every case where you're creating an additional PHB.
>>>
>>> Or.. what if we insisted the mmio64 window had at least 4G alignment,
>>> then we might be able to use mmio64 >> 32 (== buid >> 32) as the
>>> liobn.  Hmm.. guess it would need to be 8G alignment, so we have one
>>> extra bit.
>>
>>
>> As you mentioned in 4/4, guests cannot access above 64 TiB, we could use
>> upper bits to store the window number.
> 
> *Some* guests can't access above 64 TiB, that's not an inherent
> limitation.  So I'd rather not do that.
> 
>>
>>
>>>
>>> ..and I wonder if this might be easier to manage if we introduced a
>>> new spapr-pci-host-bridge-2 subtype.
>>>
>>> Let me think about this..
>>>
>>>> With index and phb_placement(), QEMU will still auto-generate these IO/MMIO
>>>> addresses from the command line parameters which did hit us with migration
>>>> before as the QEMU command line on the source side and the destination side
>>>> is not exactly the same, and this might hit again...
>>>
>>> Uh.. I'm not quite following you.  What's the situation which would
>>> break migration with the proposed scheme?
>>
>> Well, I cannot think of any other device available on SPAPR which we could
>> map to the CPU address space but if it ever appears, we will have a problem
>> unless this new imaginary device does not pick an address from the CPU
>> space automatically.
> 
> I followed that even less that the previous one, I can't even parse that.

For example, 2 ibm-veth devices, each with assigned MAC. The source libvirt
puts them in the QEMU command line in som order and they get automatically
assigned interrupts from a pool. Then we migrate and since libvirt does not
read the ibm-veth.irq QOM property on the source side and therefore does
not send it to the destination side to use in the QEMU command line, there
is a change that on the destination side our ibm-veth devices will get
different IRQ numbers and VMSTATE_UINT32_EQUAL(irq, VIOsPAPRDevice) will fail.

So either guest should be able to keep working with new IRQ numbers
(impossible) or we should make sure that libvirt sends them over or libvirt
always picks the correct number (derived from reg? should not depend on any
other hardware). The same applies to addresses where you map PHB MMIO.


ps. may be libvirt does send irq for vio devices now, dunno, but there was
a bug like this :)


-- 
Alexey


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC 1/4] spapr_pci: Delegate placement of PCI host bridges to machine type
  2016-10-10  1:04           ` Alexey Kardashevskiy
@ 2016-10-10  4:07             ` David Gibson
  0 siblings, 0 replies; 20+ messages in thread
From: David Gibson @ 2016-10-10  4:07 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-ppc, qemu-devel, benh, thuth, lvivier, agraf, mst, mdroth,
	nikunj, bharata, abologna, mpolednik

[-- Attachment #1: Type: text/plain, Size: 13370 bytes --]

On Mon, Oct 10, 2016 at 12:04:29PM +1100, Alexey Kardashevskiy wrote:
> On 07/10/16 20:17, David Gibson wrote:
> > On Fri, Oct 07, 2016 at 04:34:59PM +1100, Alexey Kardashevskiy wrote:
> >> On 07/10/16 16:10, David Gibson wrote:
> >>> On Fri, Oct 07, 2016 at 02:57:43PM +1100, Alexey Kardashevskiy wrote:
> >>>> On 06/10/16 14:03, David Gibson wrote:
> >>>>> The 'spapr-pci-host-bridge' represents the virtual PCI host bridge (PHB)
> >>>>> for a PAPR guest.  Unlike on x86, it's routine on Power (both bare metal
> >>>>> and PAPR guests) to have numerous independent PHBs, each controlling a
> >>>>> separate PCI domain.
> >>>>>
> >>>>> There are two ways of configuring the spapr-pci-host-bridge device: first
> >>>>> it can be done fully manually, specifying the locations and sizes of all
> >>>>> the IO windows.  This gives the most control, but is very awkward with 6
> >>>>> mandatory parameters.  Alternatively just an "index" can be specified
> >>>>> which essentially selects from an array of predefined PHB locations.
> >>>>> The PHB at index 0 is automatically created as the default PHB.
> >>>>>
> >>>>> The current set of default locations causes some problems for guests with
> >>>>> large RAM (> 1 TiB) or PCI devices with very large BARs (e.g. big nVidia
> >>>>> GPGPU cards via VFIO).  Obviously, for migration we can only change the
> >>>>> locations on a new machine type, however.
> >>>>>
> >>>>> This is awkward, because the placement is currently decided within the
> >>>>> spapr-pci-host-bridge code, so it breaks abstraction to look inside the
> >>>>> machine type version.
> >>>>>
> >>>>> So, this patch delegates the "default mode" PHB placement from the
> >>>>> spapr-pci-host-bridge device back to the machine type via a public method
> >>>>> in sPAPRMachineClass.  It's still a bit ugly, but it's about the best we
> >>>>> can do.
> >>>>>
> >>>>> For now, this just changes where the calculation is done.  It doesn't
> >>>>> change the actual location of the host bridges, or any other behaviour.
> >>>>>
> >>>>> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> >>>>> ---
> >>>>>  hw/ppc/spapr.c              | 34 ++++++++++++++++++++++++++++++++++
> >>>>>  hw/ppc/spapr_pci.c          | 22 ++++++++--------------
> >>>>>  include/hw/pci-host/spapr.h | 11 +----------
> >>>>>  include/hw/ppc/spapr.h      |  4 ++++
> >>>>>  4 files changed, 47 insertions(+), 24 deletions(-)
> >>>>>
> >>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>>>> index 03e3803..f6e9c2a 100644
> >>>>> --- a/hw/ppc/spapr.c
> >>>>> +++ b/hw/ppc/spapr.c
> >>>>> @@ -2370,6 +2370,39 @@ static HotpluggableCPUList *spapr_query_hotpluggable_cpus(MachineState *machine)
> >>>>>      return head;
> >>>>>  }
> >>>>>  
> >>>>> +static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
> >>>>> +                                uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
> >>>>> +                                hwaddr *mmio, hwaddr *mmio_size,
> >>>>> +                                unsigned n_dma, uint32_t *liobns, Error **errp)
> >>>>> +{
> >>>>> +    const uint64_t base_buid = 0x800000020000000ULL;
> >>>>> +    const hwaddr phb0_base = 0x10000000000ULL; /* 1 TiB */
> >>>>> +    const hwaddr phb_spacing = 0x1000000000ULL; /* 64 GiB */
> >>>>> +    const hwaddr mmio_offset = 0xa0000000; /* 2 GiB + 512 MiB */
> >>>>> +    const hwaddr pio_offset = 0x80000000; /* 2 GiB */
> >>>>> +    const uint32_t max_index = 255;
> >>>>> +
> >>>>> +    hwaddr phb_base;
> >>>>> +    int i;
> >>>>> +
> >>>>> +    if (index > max_index) {
> >>>>> +        error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
> >>>>> +                   max_index);
> >>>>> +        return;
> >>>>> +    }
> >>>>> +
> >>>>> +    *buid = base_buid + index;
> >>>>> +    for (i = 0; i < n_dma; ++i) {
> >>>>> +        liobns[i] = SPAPR_PCI_LIOBN(index, i);
> >>>>> +    }
> >>>>> +
> >>>>> +    phb_base = phb0_base + index * phb_spacing;
> >>>>> +    *pio = phb_base + pio_offset;
> >>>>> +    *pio_size = SPAPR_PCI_IO_WIN_SIZE;
> >>>>> +    *mmio = phb_base + mmio_offset;
> >>>>> +    *mmio_size = SPAPR_PCI_MMIO_WIN_SIZE;
> >>>>> +}
> >>>>> +
> >>>>>  static void spapr_machine_class_init(ObjectClass *oc, void *data)
> >>>>>  {
> >>>>>      MachineClass *mc = MACHINE_CLASS(oc);
> >>>>> @@ -2406,6 +2439,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
> >>>>>      mc->query_hotpluggable_cpus = spapr_query_hotpluggable_cpus;
> >>>>>      fwc->get_dev_path = spapr_get_fw_dev_path;
> >>>>>      nc->nmi_monitor_handler = spapr_nmi;
> >>>>> +    smc->phb_placement = spapr_phb_placement;
> >>>>>  }
> >>>>>  
> >>>>>  static const TypeInfo spapr_machine_info = {
> >>>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> >>>>> index 4f00865..c0fc964 100644
> >>>>> --- a/hw/ppc/spapr_pci.c
> >>>>> +++ b/hw/ppc/spapr_pci.c
> >>>>> @@ -1311,7 +1311,8 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >>>>>          sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
> >>>>>  
> >>>>>      if (sphb->index != (uint32_t)-1) {
> >>>>> -        hwaddr windows_base;
> >>>>> +        sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
> >>>>> +        Error *local_err = NULL;
> >>>>>  
> >>>>>          if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
> >>>>>              || (sphb->dma_liobn[1] != (uint32_t)-1 && windows_supported == 2)
> >>>>> @@ -1322,21 +1323,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >>>>>              return;
> >>>>>          }
> >>>>>  
> >>>>> -        if (sphb->index > SPAPR_PCI_MAX_INDEX) {
> >>>>> -            error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
> >>>>> -                       SPAPR_PCI_MAX_INDEX);
> >>>>> +        smc->phb_placement(spapr, sphb->index,
> >>>>> +                           &sphb->buid, &sphb->io_win_addr, &sphb->io_win_size,
> >>>>> +                           &sphb->mem_win_addr, &sphb->mem_win_size,
> >>>>> +                           windows_supported, sphb->dma_liobn, &local_err);
> >>>>> +        if (local_err) {
> >>>>> +            error_propagate(errp, local_err);
> >>>>>              return;
> >>>>>          }
> >>>>> -
> >>>>> -        sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
> >>>>> -        for (i = 0; i < windows_supported; ++i) {
> >>>>> -            sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
> >>>>> -        }
> >>>>> -
> >>>>> -        windows_base = SPAPR_PCI_WINDOW_BASE
> >>>>> -            + sphb->index * SPAPR_PCI_WINDOW_SPACING;
> >>>>> -        sphb->mem_win_addr = windows_base + SPAPR_PCI_MMIO_WIN_OFF;
> >>>>> -        sphb->io_win_addr = windows_base + SPAPR_PCI_IO_WIN_OFF;
> >>>>>      }
> >>>>>  
> >>>>>      if (sphb->buid == (uint64_t)-1) {
> >>>>> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> >>>>> index 30dbd46..8c9ebfd 100644
> >>>>> --- a/include/hw/pci-host/spapr.h
> >>>>> +++ b/include/hw/pci-host/spapr.h
> >>>>> @@ -79,18 +79,9 @@ struct sPAPRPHBState {
> >>>>>      uint32_t numa_node;
> >>>>>  };
> >>>>>  
> >>>>> -#define SPAPR_PCI_MAX_INDEX          255
> >>>>> -
> >>>>> -#define SPAPR_PCI_BASE_BUID          0x800000020000000ULL
> >>>>> -
> >>>>>  #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
> >>>>>  
> >>>>> -#define SPAPR_PCI_WINDOW_BASE        0x10000000000ULL
> >>>>> -#define SPAPR_PCI_WINDOW_SPACING     0x1000000000ULL
> >>>>> -#define SPAPR_PCI_MMIO_WIN_OFF       0xA0000000
> >>>>> -#define SPAPR_PCI_MMIO_WIN_SIZE      (SPAPR_PCI_WINDOW_SPACING - \
> >>>>> -                                     SPAPR_PCI_MEM_WIN_BUS_OFFSET)
> >>>>> -#define SPAPR_PCI_IO_WIN_OFF         0x80000000
> >>>>> +#define SPAPR_PCI_MMIO_WIN_SIZE      0xf80000000
> >>>>>  #define SPAPR_PCI_IO_WIN_SIZE        0x10000
> >>>>>  
> >>>>>  #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
> >>>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >>>>> index 39dadaa..9e1c5a5 100644
> >>>>> --- a/include/hw/ppc/spapr.h
> >>>>> +++ b/include/hw/ppc/spapr.h
> >>>>> @@ -40,6 +40,10 @@ struct sPAPRMachineClass {
> >>>>>      bool dr_lmb_enabled;       /* enable dynamic-reconfig/hotplug of LMBs */
> >>>>>      bool use_ohci_by_default;  /* use USB-OHCI instead of XHCI */
> >>>>>      const char *tcg_default_cpu; /* which (TCG) CPU to simulate by default */
> >>>>> +    void (*phb_placement)(sPAPRMachineState *spapr, uint32_t index,
> >>>>> +                          uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
> >>>>> +                          hwaddr *mmio, hwaddr *mmio_size,
> >>>>> +                          unsigned n_dma, uint32_t *liobns, Error **errp);
> >>>>
> >>>>
> >>>> I still like idea of getting rid of index better. In order to reduce the
> >>>> command line length, you could not enable IO and MMIO32 by default, the
> >>>> default size for MMIO64 could be set to 1TB or something like this, BUID
> >>>> could be just a raw MMIO64 address value.
> >>>
> >>> Hm, interesting idea.  I do like the idea of making the BUID equal to
> >>> the MMIO64 address.  Obviously we'd still need addresses for the
> >>> default PHB, including 32-bit and PIO windows.
> >>>
> >>> Trickier, we'd still need to support 'index' for compatibility with
> >>> older command lines.  I guess we could reserve index purely for the
> >>> "legacy" locations - 2.8+ machine types would create the default
> >>> bridge with explicit windows, rather than just using index 0.
> >>>
> >>>> So essentially you would only have to specify MMIO64 and 2xLIOBN in the
> >>>> command line which does not seem that much of overkill assuming that a
> >>>> second (third,...) PHB is almost never used and even if it is, it will be
> >>>> used for SRIOV VF (which can do 64bit) or virtio (the same).
> >>>
> >>> Hm, maybe.  3 values is still a bit of a PITA.  I guess liobn64 could
> >>> theoretically be optional, although you're going to want it for
> >>> basically every case where you're creating an additional PHB.
> >>>
> >>> Or.. what if we insisted the mmio64 window had at least 4G alignment,
> >>> then we might be able to use mmio64 >> 32 (== buid >> 32) as the
> >>> liobn.  Hmm.. guess it would need to be 8G alignment, so we have one
> >>> extra bit.
> >>
> >>
> >> As you mentioned in 4/4, guests cannot access above 64 TiB, we could use
> >> upper bits to store the window number.
> > 
> > *Some* guests can't access above 64 TiB, that's not an inherent
> > limitation.  So I'd rather not do that.
> > 
> >>
> >>
> >>>
> >>> ..and I wonder if this might be easier to manage if we introduced a
> >>> new spapr-pci-host-bridge-2 subtype.
> >>>
> >>> Let me think about this..
> >>>
> >>>> With index and phb_placement(), QEMU will still auto-generate these IO/MMIO
> >>>> addresses from the command line parameters which did hit us with migration
> >>>> before as the QEMU command line on the source side and the destination side
> >>>> is not exactly the same, and this might hit again...
> >>>
> >>> Uh.. I'm not quite following you.  What's the situation which would
> >>> break migration with the proposed scheme?
> >>
> >> Well, I cannot think of any other device available on SPAPR which we could
> >> map to the CPU address space but if it ever appears, we will have a problem
> >> unless this new imaginary device does not pick an address from the CPU
> >> space automatically.
> > 
> > I followed that even less that the previous one, I can't even parse that.
> 
> For example, 2 ibm-veth devices, each with assigned MAC. The source libvirt
> puts them in the QEMU command line in som order and they get automatically
> assigned interrupts from a pool. Then we migrate and since libvirt does not
> read the ibm-veth.irq QOM property on the source side and therefore does
> not send it to the destination side to use in the QEMU command line, there
> is a change that on the destination side our ibm-veth devices will get
> different IRQ numbers and VMSTATE_UINT32_EQUAL(irq, VIOsPAPRDevice) will fail.
> 
> So either guest should be able to keep working with new IRQ numbers
> (impossible) or we should make sure that libvirt sends them over or libvirt
> always picks the correct number (derived from reg? should not depend on any
> other hardware). The same applies to addresses where you map PHB
> MMIO.

Ah, right.  Yeah, we have a bunch of problems like this because I did
a bunch of irq allocation from a pool, before I was thinking about
migration.  I've tried to avoid that more recently, but it's hard to
see alternatives when you can have so many virtual devices.

For PHBs we already migrate the necessary information (lsi_table and
the equivalent stuff for MSI).

If we don't for VIO, then that's a bug, but I don't see that it's
really related to this rework of PHB placement.

> ps. may be libvirt does send irq for vio devices now, dunno, but there was
> a bug like this :)

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC 0/4] Improve PCI IO window orgnaization for pseries
  2016-10-06  3:03 [Qemu-devel] [RFC 0/4] Improve PCI IO window orgnaization for pseries David Gibson
                   ` (3 preceding siblings ...)
  2016-10-06  3:03 ` [Qemu-devel] [RFC 4/4] spapr: Improved placement of PCI host bridges in guest memory map David Gibson
@ 2016-10-10 15:53 ` no-reply
  4 siblings, 0 replies; 20+ messages in thread
From: no-reply @ 2016-10-10 15:53 UTC (permalink / raw)
  To: david
  Cc: famz, qemu-ppc, lvivier, thuth, mdroth, nikunj, mst, aik,
	qemu-devel, agraf, abologna, bharata, mpolednik

Hi,

Your series seems to have some coding style problems. See output below for
more information:

Message-id: 1475722987-18644-1-git-send-email-david@gibson.dropbear.id.au
Subject: [Qemu-devel] [RFC 0/4] Improve PCI IO window orgnaization for pseries
Type: series

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

# Useful git options
git config --local diff.renamelimit 0
git config --local diff.renames True

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
    echo "Checking PATCH $n/$total: $(git show --no-patch --format=%s $c)..."
    if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
        failed=1
        echo
    fi
    n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 - [tag update]      patchew/1475208369-13756-1-git-send-email-zhang.zhanghailiang@huawei.com -> patchew/1475208369-13756-1-git-send-email-zhang.zhanghailiang@huawei.com
 - [tag update]      patchew/1475219846-32609-1-git-send-email-pbonzini@redhat.com -> patchew/1475219846-32609-1-git-send-email-pbonzini@redhat.com
 - [tag update]      patchew/1475580648-6470-1-git-send-email-lvivier@redhat.com -> patchew/1475580648-6470-1-git-send-email-lvivier@redhat.com
 - [tag update]      patchew/1475617134-28180-1-git-send-email-wei@redhat.com -> patchew/1475617134-28180-1-git-send-email-wei@redhat.com
 - [tag update]      patchew/1475685327-22767-1-git-send-email-pbonzini@redhat.com -> patchew/1475685327-22767-1-git-send-email-pbonzini@redhat.com
 * [new tag]         patchew/1475722987-18644-1-git-send-email-david@gibson.dropbear.id.au -> patchew/1475722987-18644-1-git-send-email-david@gibson.dropbear.id.au
 - [tag update]      patchew/20161005130657.3399-1-rkrcmar@redhat.com -> patchew/20161005130657.3399-1-rkrcmar@redhat.com
 - [tag update]      patchew/57ea5f06.821e6b0a.4ee38.31ea@mx.google.com -> patchew/57ea5f06.821e6b0a.4ee38.31ea@mx.google.com
 - [tag update]      patchew/cover.bcfb6a7121f5fd92c9ebda9f199ff06cb0d4bc05.1473159543.git-series.james.hogan@imgtec.com -> patchew/cover.bcfb6a7121f5fd92c9ebda9f199ff06cb0d4bc05.1473159543.git-series.james.hogan@imgtec.com
Switched to a new branch 'test'
e147fe5 spapr: Improved placement of PCI host bridges in guest memory map
2858c98 spapr_pci: Add a 64-bit MMIO window
b5718f3 spapr: Adjust placement of PCI host bridge to allow > 1TiB RAM
10b7f69 spapr_pci: Delegate placement of PCI host bridges to machine type

=== OUTPUT BEGIN ===
Checking PATCH 1/4: spapr_pci: Delegate placement of PCI host bridges to machine type...
Checking PATCH 2/4: spapr: Adjust placement of PCI host bridge to allow > 1TiB RAM...
Checking PATCH 3/4: spapr_pci: Add a 64-bit MMIO window...
WARNING: line over 80 characters
#100: FILE: hw/ppc/spapr_pci.c:1361:
+            error_setg(errp, "64-bit memory window address not specified for PHB");

WARNING: line over 80 characters
#105: FILE: hw/ppc/spapr_pci.c:1366:
+            error_setg(errp, "32-bit memory window of size 0x%"HWADDR_PRIx" (max 2 GiB)",

WARNING: line over 80 characters
#220: FILE: include/hw/pci-host/spapr.h:85:
+#define SPAPR_PCI_MEM32_WIN_SIZE     ((1ULL <<32) - SPAPR_PCI_MEM_WIN_BUS_OFFSET)

ERROR: spaces required around that '<<' (ctx:WxV)
#220: FILE: include/hw/pci-host/spapr.h:85:
+#define SPAPR_PCI_MEM32_WIN_SIZE     ((1ULL <<32) - SPAPR_PCI_MEM_WIN_BUS_OFFSET)
                                             ^

total: 1 errors, 3 warnings, 166 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 4/4: spapr: Improved placement of PCI host bridges in guest memory map...
ERROR: Macros with multiple statements should be enclosed in a do - while loop
#133: FILE: hw/ppc/spapr.c:2534:
+#define SPAPR_COMPAT_2_7                            \
+    HW_COMPAT_2_7                                   \
+    {                                               \
+        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,     \
+        .property = "mem_win_size",                 \
+        .value    = stringify(SPAPR_PCI_2_7_MMIO_WIN_SIZE),\
+    },                                              \
+    {                                               \
+        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,     \
+        .property = "mem64_win_size",               \
+        .value    = "0",                            \
+    },

ERROR: trailing whitespace
#146: FILE: hw/ppc/spapr.c:2547:
+        $

total: 2 errors, 0 warnings, 175 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

=== OUTPUT END ===

Test command exited with code: 1


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@freelists.org

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC 1/4] spapr_pci: Delegate placement of PCI host bridges to machine type
  2016-10-07  5:10     ` David Gibson
  2016-10-07  5:34       ` Alexey Kardashevskiy
@ 2016-10-11  3:17       ` David Gibson
  1 sibling, 0 replies; 20+ messages in thread
From: David Gibson @ 2016-10-11  3:17 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-ppc, qemu-devel, benh, thuth, lvivier, agraf, mst, mdroth,
	nikunj, bharata, abologna, mpolednik

[-- Attachment #1: Type: text/plain, Size: 11272 bytes --]

On Fri, Oct 07, 2016 at 04:10:08PM +1100, David Gibson wrote:
> On Fri, Oct 07, 2016 at 02:57:43PM +1100, Alexey Kardashevskiy wrote:
> > On 06/10/16 14:03, David Gibson wrote:
> > > The 'spapr-pci-host-bridge' represents the virtual PCI host bridge (PHB)
> > > for a PAPR guest.  Unlike on x86, it's routine on Power (both bare metal
> > > and PAPR guests) to have numerous independent PHBs, each controlling a
> > > separate PCI domain.
> > > 
> > > There are two ways of configuring the spapr-pci-host-bridge device: first
> > > it can be done fully manually, specifying the locations and sizes of all
> > > the IO windows.  This gives the most control, but is very awkward with 6
> > > mandatory parameters.  Alternatively just an "index" can be specified
> > > which essentially selects from an array of predefined PHB locations.
> > > The PHB at index 0 is automatically created as the default PHB.
> > > 
> > > The current set of default locations causes some problems for guests with
> > > large RAM (> 1 TiB) or PCI devices with very large BARs (e.g. big nVidia
> > > GPGPU cards via VFIO).  Obviously, for migration we can only change the
> > > locations on a new machine type, however.
> > > 
> > > This is awkward, because the placement is currently decided within the
> > > spapr-pci-host-bridge code, so it breaks abstraction to look inside the
> > > machine type version.
> > > 
> > > So, this patch delegates the "default mode" PHB placement from the
> > > spapr-pci-host-bridge device back to the machine type via a public method
> > > in sPAPRMachineClass.  It's still a bit ugly, but it's about the best we
> > > can do.
> > > 
> > > For now, this just changes where the calculation is done.  It doesn't
> > > change the actual location of the host bridges, or any other behaviour.
> > > 
> > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > ---
> > >  hw/ppc/spapr.c              | 34 ++++++++++++++++++++++++++++++++++
> > >  hw/ppc/spapr_pci.c          | 22 ++++++++--------------
> > >  include/hw/pci-host/spapr.h | 11 +----------
> > >  include/hw/ppc/spapr.h      |  4 ++++
> > >  4 files changed, 47 insertions(+), 24 deletions(-)
> > > 
> > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > > index 03e3803..f6e9c2a 100644
> > > --- a/hw/ppc/spapr.c
> > > +++ b/hw/ppc/spapr.c
> > > @@ -2370,6 +2370,39 @@ static HotpluggableCPUList *spapr_query_hotpluggable_cpus(MachineState *machine)
> > >      return head;
> > >  }
> > >  
> > > +static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
> > > +                                uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
> > > +                                hwaddr *mmio, hwaddr *mmio_size,
> > > +                                unsigned n_dma, uint32_t *liobns, Error **errp)
> > > +{
> > > +    const uint64_t base_buid = 0x800000020000000ULL;
> > > +    const hwaddr phb0_base = 0x10000000000ULL; /* 1 TiB */
> > > +    const hwaddr phb_spacing = 0x1000000000ULL; /* 64 GiB */
> > > +    const hwaddr mmio_offset = 0xa0000000; /* 2 GiB + 512 MiB */
> > > +    const hwaddr pio_offset = 0x80000000; /* 2 GiB */
> > > +    const uint32_t max_index = 255;
> > > +
> > > +    hwaddr phb_base;
> > > +    int i;
> > > +
> > > +    if (index > max_index) {
> > > +        error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
> > > +                   max_index);
> > > +        return;
> > > +    }
> > > +
> > > +    *buid = base_buid + index;
> > > +    for (i = 0; i < n_dma; ++i) {
> > > +        liobns[i] = SPAPR_PCI_LIOBN(index, i);
> > > +    }
> > > +
> > > +    phb_base = phb0_base + index * phb_spacing;
> > > +    *pio = phb_base + pio_offset;
> > > +    *pio_size = SPAPR_PCI_IO_WIN_SIZE;
> > > +    *mmio = phb_base + mmio_offset;
> > > +    *mmio_size = SPAPR_PCI_MMIO_WIN_SIZE;
> > > +}
> > > +
> > >  static void spapr_machine_class_init(ObjectClass *oc, void *data)
> > >  {
> > >      MachineClass *mc = MACHINE_CLASS(oc);
> > > @@ -2406,6 +2439,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
> > >      mc->query_hotpluggable_cpus = spapr_query_hotpluggable_cpus;
> > >      fwc->get_dev_path = spapr_get_fw_dev_path;
> > >      nc->nmi_monitor_handler = spapr_nmi;
> > > +    smc->phb_placement = spapr_phb_placement;
> > >  }
> > >  
> > >  static const TypeInfo spapr_machine_info = {
> > > diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> > > index 4f00865..c0fc964 100644
> > > --- a/hw/ppc/spapr_pci.c
> > > +++ b/hw/ppc/spapr_pci.c
> > > @@ -1311,7 +1311,8 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> > >          sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
> > >  
> > >      if (sphb->index != (uint32_t)-1) {
> > > -        hwaddr windows_base;
> > > +        sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
> > > +        Error *local_err = NULL;
> > >  
> > >          if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
> > >              || (sphb->dma_liobn[1] != (uint32_t)-1 && windows_supported == 2)
> > > @@ -1322,21 +1323,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> > >              return;
> > >          }
> > >  
> > > -        if (sphb->index > SPAPR_PCI_MAX_INDEX) {
> > > -            error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
> > > -                       SPAPR_PCI_MAX_INDEX);
> > > +        smc->phb_placement(spapr, sphb->index,
> > > +                           &sphb->buid, &sphb->io_win_addr, &sphb->io_win_size,
> > > +                           &sphb->mem_win_addr, &sphb->mem_win_size,
> > > +                           windows_supported, sphb->dma_liobn, &local_err);
> > > +        if (local_err) {
> > > +            error_propagate(errp, local_err);
> > >              return;
> > >          }
> > > -
> > > -        sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
> > > -        for (i = 0; i < windows_supported; ++i) {
> > > -            sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
> > > -        }
> > > -
> > > -        windows_base = SPAPR_PCI_WINDOW_BASE
> > > -            + sphb->index * SPAPR_PCI_WINDOW_SPACING;
> > > -        sphb->mem_win_addr = windows_base + SPAPR_PCI_MMIO_WIN_OFF;
> > > -        sphb->io_win_addr = windows_base + SPAPR_PCI_IO_WIN_OFF;
> > >      }
> > >  
> > >      if (sphb->buid == (uint64_t)-1) {
> > > diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> > > index 30dbd46..8c9ebfd 100644
> > > --- a/include/hw/pci-host/spapr.h
> > > +++ b/include/hw/pci-host/spapr.h
> > > @@ -79,18 +79,9 @@ struct sPAPRPHBState {
> > >      uint32_t numa_node;
> > >  };
> > >  
> > > -#define SPAPR_PCI_MAX_INDEX          255
> > > -
> > > -#define SPAPR_PCI_BASE_BUID          0x800000020000000ULL
> > > -
> > >  #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
> > >  
> > > -#define SPAPR_PCI_WINDOW_BASE        0x10000000000ULL
> > > -#define SPAPR_PCI_WINDOW_SPACING     0x1000000000ULL
> > > -#define SPAPR_PCI_MMIO_WIN_OFF       0xA0000000
> > > -#define SPAPR_PCI_MMIO_WIN_SIZE      (SPAPR_PCI_WINDOW_SPACING - \
> > > -                                     SPAPR_PCI_MEM_WIN_BUS_OFFSET)
> > > -#define SPAPR_PCI_IO_WIN_OFF         0x80000000
> > > +#define SPAPR_PCI_MMIO_WIN_SIZE      0xf80000000
> > >  #define SPAPR_PCI_IO_WIN_SIZE        0x10000
> > >  
> > >  #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
> > > diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> > > index 39dadaa..9e1c5a5 100644
> > > --- a/include/hw/ppc/spapr.h
> > > +++ b/include/hw/ppc/spapr.h
> > > @@ -40,6 +40,10 @@ struct sPAPRMachineClass {
> > >      bool dr_lmb_enabled;       /* enable dynamic-reconfig/hotplug of LMBs */
> > >      bool use_ohci_by_default;  /* use USB-OHCI instead of XHCI */
> > >      const char *tcg_default_cpu; /* which (TCG) CPU to simulate by default */
> > > +    void (*phb_placement)(sPAPRMachineState *spapr, uint32_t index,
> > > +                          uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
> > > +                          hwaddr *mmio, hwaddr *mmio_size,
> > > +                          unsigned n_dma, uint32_t *liobns, Error **errp);
> > 
> > 
> > I still like idea of getting rid of index better. In order to reduce the
> > command line length, you could not enable IO and MMIO32 by default, the
> > default size for MMIO64 could be set to 1TB or something like this, BUID
> > could be just a raw MMIO64 address value.
> 
> Hm, interesting idea.  I do like the idea of making the BUID equal to
> the MMIO64 address.  Obviously we'd still need addresses for the
> default PHB, including 32-bit and PIO windows.
> 
> Trickier, we'd still need to support 'index' for compatibility with
> older command lines.  I guess we could reserve index purely for the
> "legacy" locations - 2.8+ machine types would create the default
> bridge with explicit windows, rather than just using index 0.
> 
> > So essentially you would only have to specify MMIO64 and 2xLIOBN in the
> > command line which does not seem that much of overkill assuming that a
> > second (third,...) PHB is almost never used and even if it is, it will be
> > used for SRIOV VF (which can do 64bit) or virtio (the same).
> 
> Hm, maybe.  3 values is still a bit of a PITA.  I guess liobn64 could
> theoretically be optional, although you're going to want it for
> basically every case where you're creating an additional PHB.
> 
> Or.. what if we insisted the mmio64 window had at least 4G alignment,
> then we might be able to use mmio64 >> 32 (== buid >> 32) as the
> liobn.  Hmm.. guess it would need to be 8G alignment, so we have one
> extra bit.
> 
> ..and I wonder if this might be easier to manage if we introduced a
> new spapr-pci-host-bridge-2 subtype.
> 
> Let me think about this..

Alas, no.  I still like the idea of getting rid of index in principle,
but I don't think it's workable in practice.  I had a discussion about
the libvirt consequences of this with Andrea Bolognani, and removing
index is just going to make life horrible for users.

What he pointed out boiled down to the fact that removing index means
we're making something outside qemu responsible for assigning the VM's
address space.  And the things above qemu don't really have enough
information do do that well.  We could do it in libvirt but it's (a) a
lot of work and (b) puts more knowledge of qemu's internal workings
into libvirt than we want.  The only real alternative is for libvirt
to punt the configuration to the user - but the user shouldn't have to
have knowledge of the whole spapr machine memory map in order to
create an extra PHB.

So, back to plan A.  I'll rebase and polish this series a bit and work
towards merging it ASAP.

Maybe we can get rid of index sometime in the future, but we can do it
on top of this interim measure.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2016-10-11  4:08 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-06  3:03 [Qemu-devel] [RFC 0/4] Improve PCI IO window orgnaization for pseries David Gibson
2016-10-06  3:03 ` [Qemu-devel] [RFC 1/4] spapr_pci: Delegate placement of PCI host bridges to machine type David Gibson
2016-10-06  7:10   ` Laurent Vivier
2016-10-06  8:11     ` David Gibson
2016-10-06  9:36   ` Laurent Vivier
2016-10-06 23:51     ` David Gibson
2016-10-07  3:57   ` Alexey Kardashevskiy
2016-10-07  5:10     ` David Gibson
2016-10-07  5:34       ` Alexey Kardashevskiy
2016-10-07  9:17         ` David Gibson
2016-10-10  1:04           ` Alexey Kardashevskiy
2016-10-10  4:07             ` David Gibson
2016-10-11  3:17       ` David Gibson
2016-10-06  3:03 ` [Qemu-devel] [RFC 2/4] spapr: Adjust placement of PCI host bridge to allow > 1TiB RAM David Gibson
2016-10-06  7:21   ` Laurent Vivier
2016-10-06  8:46     ` David Gibson
2016-10-06  9:36   ` Laurent Vivier
2016-10-06  3:03 ` [Qemu-devel] [RFC 3/4] spapr_pci: Add a 64-bit MMIO window David Gibson
2016-10-06  3:03 ` [Qemu-devel] [RFC 4/4] spapr: Improved placement of PCI host bridges in guest memory map David Gibson
2016-10-10 15:53 ` [Qemu-devel] [RFC 0/4] Improve PCI IO window orgnaization for pseries no-reply

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.