On 07/10/16 16:10, David Gibson wrote: > On Fri, Oct 07, 2016 at 02:57:43PM +1100, Alexey Kardashevskiy wrote: >> On 06/10/16 14:03, David Gibson wrote: >>> The 'spapr-pci-host-bridge' represents the virtual PCI host bridge (PHB) >>> for a PAPR guest. Unlike on x86, it's routine on Power (both bare metal >>> and PAPR guests) to have numerous independent PHBs, each controlling a >>> separate PCI domain. >>> >>> There are two ways of configuring the spapr-pci-host-bridge device: first >>> it can be done fully manually, specifying the locations and sizes of all >>> the IO windows. This gives the most control, but is very awkward with 6 >>> mandatory parameters. Alternatively just an "index" can be specified >>> which essentially selects from an array of predefined PHB locations. >>> The PHB at index 0 is automatically created as the default PHB. >>> >>> The current set of default locations causes some problems for guests with >>> large RAM (> 1 TiB) or PCI devices with very large BARs (e.g. big nVidia >>> GPGPU cards via VFIO). Obviously, for migration we can only change the >>> locations on a new machine type, however. >>> >>> This is awkward, because the placement is currently decided within the >>> spapr-pci-host-bridge code, so it breaks abstraction to look inside the >>> machine type version. >>> >>> So, this patch delegates the "default mode" PHB placement from the >>> spapr-pci-host-bridge device back to the machine type via a public method >>> in sPAPRMachineClass. It's still a bit ugly, but it's about the best we >>> can do. >>> >>> For now, this just changes where the calculation is done. It doesn't >>> change the actual location of the host bridges, or any other behaviour. >>> >>> Signed-off-by: David Gibson >>> --- >>> hw/ppc/spapr.c | 34 ++++++++++++++++++++++++++++++++++ >>> hw/ppc/spapr_pci.c | 22 ++++++++-------------- >>> include/hw/pci-host/spapr.h | 11 +---------- >>> include/hw/ppc/spapr.h | 4 ++++ >>> 4 files changed, 47 insertions(+), 24 deletions(-) >>> >>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c >>> index 03e3803..f6e9c2a 100644 >>> --- a/hw/ppc/spapr.c >>> +++ b/hw/ppc/spapr.c >>> @@ -2370,6 +2370,39 @@ static HotpluggableCPUList *spapr_query_hotpluggable_cpus(MachineState *machine) >>> return head; >>> } >>> >>> +static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index, >>> + uint64_t *buid, hwaddr *pio, hwaddr *pio_size, >>> + hwaddr *mmio, hwaddr *mmio_size, >>> + unsigned n_dma, uint32_t *liobns, Error **errp) >>> +{ >>> + const uint64_t base_buid = 0x800000020000000ULL; >>> + const hwaddr phb0_base = 0x10000000000ULL; /* 1 TiB */ >>> + const hwaddr phb_spacing = 0x1000000000ULL; /* 64 GiB */ >>> + const hwaddr mmio_offset = 0xa0000000; /* 2 GiB + 512 MiB */ >>> + const hwaddr pio_offset = 0x80000000; /* 2 GiB */ >>> + const uint32_t max_index = 255; >>> + >>> + hwaddr phb_base; >>> + int i; >>> + >>> + if (index > max_index) { >>> + error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)", >>> + max_index); >>> + return; >>> + } >>> + >>> + *buid = base_buid + index; >>> + for (i = 0; i < n_dma; ++i) { >>> + liobns[i] = SPAPR_PCI_LIOBN(index, i); >>> + } >>> + >>> + phb_base = phb0_base + index * phb_spacing; >>> + *pio = phb_base + pio_offset; >>> + *pio_size = SPAPR_PCI_IO_WIN_SIZE; >>> + *mmio = phb_base + mmio_offset; >>> + *mmio_size = SPAPR_PCI_MMIO_WIN_SIZE; >>> +} >>> + >>> static void spapr_machine_class_init(ObjectClass *oc, void *data) >>> { >>> MachineClass *mc = MACHINE_CLASS(oc); >>> @@ -2406,6 +2439,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data) >>> mc->query_hotpluggable_cpus = spapr_query_hotpluggable_cpus; >>> fwc->get_dev_path = spapr_get_fw_dev_path; >>> nc->nmi_monitor_handler = spapr_nmi; >>> + smc->phb_placement = spapr_phb_placement; >>> } >>> >>> static const TypeInfo spapr_machine_info = { >>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c >>> index 4f00865..c0fc964 100644 >>> --- a/hw/ppc/spapr_pci.c >>> +++ b/hw/ppc/spapr_pci.c >>> @@ -1311,7 +1311,8 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp) >>> sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1; >>> >>> if (sphb->index != (uint32_t)-1) { >>> - hwaddr windows_base; >>> + sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr); >>> + Error *local_err = NULL; >>> >>> if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1) >>> || (sphb->dma_liobn[1] != (uint32_t)-1 && windows_supported == 2) >>> @@ -1322,21 +1323,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp) >>> return; >>> } >>> >>> - if (sphb->index > SPAPR_PCI_MAX_INDEX) { >>> - error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)", >>> - SPAPR_PCI_MAX_INDEX); >>> + smc->phb_placement(spapr, sphb->index, >>> + &sphb->buid, &sphb->io_win_addr, &sphb->io_win_size, >>> + &sphb->mem_win_addr, &sphb->mem_win_size, >>> + windows_supported, sphb->dma_liobn, &local_err); >>> + if (local_err) { >>> + error_propagate(errp, local_err); >>> return; >>> } >>> - >>> - sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index; >>> - for (i = 0; i < windows_supported; ++i) { >>> - sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i); >>> - } >>> - >>> - windows_base = SPAPR_PCI_WINDOW_BASE >>> - + sphb->index * SPAPR_PCI_WINDOW_SPACING; >>> - sphb->mem_win_addr = windows_base + SPAPR_PCI_MMIO_WIN_OFF; >>> - sphb->io_win_addr = windows_base + SPAPR_PCI_IO_WIN_OFF; >>> } >>> >>> if (sphb->buid == (uint64_t)-1) { >>> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h >>> index 30dbd46..8c9ebfd 100644 >>> --- a/include/hw/pci-host/spapr.h >>> +++ b/include/hw/pci-host/spapr.h >>> @@ -79,18 +79,9 @@ struct sPAPRPHBState { >>> uint32_t numa_node; >>> }; >>> >>> -#define SPAPR_PCI_MAX_INDEX 255 >>> - >>> -#define SPAPR_PCI_BASE_BUID 0x800000020000000ULL >>> - >>> #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL >>> >>> -#define SPAPR_PCI_WINDOW_BASE 0x10000000000ULL >>> -#define SPAPR_PCI_WINDOW_SPACING 0x1000000000ULL >>> -#define SPAPR_PCI_MMIO_WIN_OFF 0xA0000000 >>> -#define SPAPR_PCI_MMIO_WIN_SIZE (SPAPR_PCI_WINDOW_SPACING - \ >>> - SPAPR_PCI_MEM_WIN_BUS_OFFSET) >>> -#define SPAPR_PCI_IO_WIN_OFF 0x80000000 >>> +#define SPAPR_PCI_MMIO_WIN_SIZE 0xf80000000 >>> #define SPAPR_PCI_IO_WIN_SIZE 0x10000 >>> >>> #define SPAPR_PCI_MSI_WINDOW 0x40000000000ULL >>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h >>> index 39dadaa..9e1c5a5 100644 >>> --- a/include/hw/ppc/spapr.h >>> +++ b/include/hw/ppc/spapr.h >>> @@ -40,6 +40,10 @@ struct sPAPRMachineClass { >>> bool dr_lmb_enabled; /* enable dynamic-reconfig/hotplug of LMBs */ >>> bool use_ohci_by_default; /* use USB-OHCI instead of XHCI */ >>> const char *tcg_default_cpu; /* which (TCG) CPU to simulate by default */ >>> + void (*phb_placement)(sPAPRMachineState *spapr, uint32_t index, >>> + uint64_t *buid, hwaddr *pio, hwaddr *pio_size, >>> + hwaddr *mmio, hwaddr *mmio_size, >>> + unsigned n_dma, uint32_t *liobns, Error **errp); >> >> >> I still like idea of getting rid of index better. In order to reduce the >> command line length, you could not enable IO and MMIO32 by default, the >> default size for MMIO64 could be set to 1TB or something like this, BUID >> could be just a raw MMIO64 address value. > > Hm, interesting idea. I do like the idea of making the BUID equal to > the MMIO64 address. Obviously we'd still need addresses for the > default PHB, including 32-bit and PIO windows. > > Trickier, we'd still need to support 'index' for compatibility with > older command lines. I guess we could reserve index purely for the > "legacy" locations - 2.8+ machine types would create the default > bridge with explicit windows, rather than just using index 0. > >> So essentially you would only have to specify MMIO64 and 2xLIOBN in the >> command line which does not seem that much of overkill assuming that a >> second (third,...) PHB is almost never used and even if it is, it will be >> used for SRIOV VF (which can do 64bit) or virtio (the same). > > Hm, maybe. 3 values is still a bit of a PITA. I guess liobn64 could > theoretically be optional, although you're going to want it for > basically every case where you're creating an additional PHB. > > Or.. what if we insisted the mmio64 window had at least 4G alignment, > then we might be able to use mmio64 >> 32 (== buid >> 32) as the > liobn. Hmm.. guess it would need to be 8G alignment, so we have one > extra bit. As you mentioned in 4/4, guests cannot access above 64 TiB, we could use upper bits to store the window number. > > ..and I wonder if this might be easier to manage if we introduced a > new spapr-pci-host-bridge-2 subtype. > > Let me think about this.. > >> With index and phb_placement(), QEMU will still auto-generate these IO/MMIO >> addresses from the command line parameters which did hit us with migration >> before as the QEMU command line on the source side and the destination side >> is not exactly the same, and this might hit again... > > Uh.. I'm not quite following you. What's the situation which would > break migration with the proposed scheme? Well, I cannot think of any other device available on SPAPR which we could map to the CPU address space but if it ever appears, we will have a problem unless this new imaginary device does not pick an address from the CPU space automatically. -- Alexey