From: David Gibson <david@gibson.dropbear.id.au>
To: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: qemu-ppc@nongnu.org, qemu-devel@nongnu.org,
	benh@kernel.crashing.org, thuth@redhat.com, lvivier@redhat.com,
	agraf@suse.de, mst@redhat.com, mdroth@linux.vnet.ibm.com,
	nikunj@linux.vnet.ibm.com, bharata@linux.vnet.ibm.com,
	abologna@redhat.com, mpolednik@redhat.com
Subject: Re: [Qemu-devel] [RFC 1/4] spapr_pci: Delegate placement of PCI host bridges to machine type
Date: Tue, 11 Oct 2016 14:17:34 +1100
Message-ID: <20161011031734.GC8952@umbus.fritz.box>
In-Reply-To: <20161007051008.GV18490@umbus.fritz.box>

On Fri, Oct 07, 2016 at 04:10:08PM +1100, David Gibson wrote:
> On Fri, Oct 07, 2016 at 02:57:43PM +1100, Alexey Kardashevskiy wrote:
> > On 06/10/16 14:03, David Gibson wrote:
> > > The 'spapr-pci-host-bridge' represents the virtual PCI host bridge (PHB)
> > > for a PAPR guest.  Unlike on x86, it's routine on Power (both bare metal
> > > and PAPR guests) to have numerous independent PHBs, each controlling a
> > > separate PCI domain.
> > > 
> > > There are two ways of configuring the spapr-pci-host-bridge device: first
> > > it can be done fully manually, specifying the locations and sizes of all
> > > the IO windows.  This gives the most control, but is very awkward with 6
> > > mandatory parameters.  Alternatively just an "index" can be specified
> > > which essentially selects from an array of predefined PHB locations.
> > > The PHB at index 0 is automatically created as the default PHB.
> > > 
> > > The current set of default locations causes some problems for guests with
> > > large RAM (> 1 TiB) or PCI devices with very large BARs (e.g. big nVidia
> > > GPGPU cards via VFIO).  Obviously, for migration we can only change the
> > > locations on a new machine type, however.
> > > 
> > > This is awkward, because the placement is currently decided within the
> > > spapr-pci-host-bridge code, so it breaks abstraction to look inside the
> > > machine type version.
> > > 
> > > So, this patch delegates the "default mode" PHB placement from the
> > > spapr-pci-host-bridge device back to the machine type via a public method
> > > in sPAPRMachineClass.  It's still a bit ugly, but it's about the best we
> > > can do.
> > > 
> > > For now, this just changes where the calculation is done.  It doesn't
> > > change the actual location of the host bridges, or any other behaviour.
> > > 
> > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > ---
> > >  hw/ppc/spapr.c              | 34 ++++++++++++++++++++++++++++++++++
> > >  hw/ppc/spapr_pci.c          | 22 ++++++++--------------
> > >  include/hw/pci-host/spapr.h | 11 +----------
> > >  include/hw/ppc/spapr.h      |  4 ++++
> > >  4 files changed, 47 insertions(+), 24 deletions(-)
> > > 
> > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > > index 03e3803..f6e9c2a 100644
> > > --- a/hw/ppc/spapr.c
> > > +++ b/hw/ppc/spapr.c
> > > @@ -2370,6 +2370,39 @@ static HotpluggableCPUList *spapr_query_hotpluggable_cpus(MachineState *machine)
> > >      return head;
> > >  }
> > >  
> > > +static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
> > > +                                uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
> > > +                                hwaddr *mmio, hwaddr *mmio_size,
> > > +                                unsigned n_dma, uint32_t *liobns, Error **errp)
> > > +{
> > > +    const uint64_t base_buid = 0x800000020000000ULL;
> > > +    const hwaddr phb0_base = 0x10000000000ULL; /* 1 TiB */
> > > +    const hwaddr phb_spacing = 0x1000000000ULL; /* 64 GiB */
> > > +    const hwaddr mmio_offset = 0xa0000000; /* 2 GiB + 512 MiB */
> > > +    const hwaddr pio_offset = 0x80000000; /* 2 GiB */
> > > +    const uint32_t max_index = 255;
> > > +
> > > +    hwaddr phb_base;
> > > +    int i;
> > > +
> > > +    if (index > max_index) {
> > > +        error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
> > > +                   max_index);
> > > +        return;
> > > +    }
> > > +
> > > +    *buid = base_buid + index;
> > > +    for (i = 0; i < n_dma; ++i) {
> > > +        liobns[i] = SPAPR_PCI_LIOBN(index, i);
> > > +    }
> > > +
> > > +    phb_base = phb0_base + index * phb_spacing;
> > > +    *pio = phb_base + pio_offset;
> > > +    *pio_size = SPAPR_PCI_IO_WIN_SIZE;
> > > +    *mmio = phb_base + mmio_offset;
> > > +    *mmio_size = SPAPR_PCI_MMIO_WIN_SIZE;
> > > +}
> > > +
> > >  static void spapr_machine_class_init(ObjectClass *oc, void *data)
> > >  {
> > >      MachineClass *mc = MACHINE_CLASS(oc);
> > > @@ -2406,6 +2439,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
> > >      mc->query_hotpluggable_cpus = spapr_query_hotpluggable_cpus;
> > >      fwc->get_dev_path = spapr_get_fw_dev_path;
> > >      nc->nmi_monitor_handler = spapr_nmi;
> > > +    smc->phb_placement = spapr_phb_placement;
> > >  }
> > >  
> > >  static const TypeInfo spapr_machine_info = {
> > > diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> > > index 4f00865..c0fc964 100644
> > > --- a/hw/ppc/spapr_pci.c
> > > +++ b/hw/ppc/spapr_pci.c
> > > @@ -1311,7 +1311,8 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> > >          sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
> > >  
> > >      if (sphb->index != (uint32_t)-1) {
> > > -        hwaddr windows_base;
> > > +        sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
> > > +        Error *local_err = NULL;
> > >  
> > >          if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
> > >              || (sphb->dma_liobn[1] != (uint32_t)-1 && windows_supported == 2)
> > > @@ -1322,21 +1323,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> > >              return;
> > >          }
> > >  
> > > -        if (sphb->index > SPAPR_PCI_MAX_INDEX) {
> > > -            error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
> > > -                       SPAPR_PCI_MAX_INDEX);
> > > +        smc->phb_placement(spapr, sphb->index,
> > > +                           &sphb->buid, &sphb->io_win_addr, &sphb->io_win_size,
> > > +                           &sphb->mem_win_addr, &sphb->mem_win_size,
> > > +                           windows_supported, sphb->dma_liobn, &local_err);
> > > +        if (local_err) {
> > > +            error_propagate(errp, local_err);
> > >              return;
> > >          }
> > > -
> > > -        sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
> > > -        for (i = 0; i < windows_supported; ++i) {
> > > -            sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
> > > -        }
> > > -
> > > -        windows_base = SPAPR_PCI_WINDOW_BASE
> > > -            + sphb->index * SPAPR_PCI_WINDOW_SPACING;
> > > -        sphb->mem_win_addr = windows_base + SPAPR_PCI_MMIO_WIN_OFF;
> > > -        sphb->io_win_addr = windows_base + SPAPR_PCI_IO_WIN_OFF;
> > >      }
> > >  
> > >      if (sphb->buid == (uint64_t)-1) {
> > > diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> > > index 30dbd46..8c9ebfd 100644
> > > --- a/include/hw/pci-host/spapr.h
> > > +++ b/include/hw/pci-host/spapr.h
> > > @@ -79,18 +79,9 @@ struct sPAPRPHBState {
> > >      uint32_t numa_node;
> > >  };
> > >  
> > > -#define SPAPR_PCI_MAX_INDEX          255
> > > -
> > > -#define SPAPR_PCI_BASE_BUID          0x800000020000000ULL
> > > -
> > >  #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
> > >  
> > > -#define SPAPR_PCI_WINDOW_BASE        0x10000000000ULL
> > > -#define SPAPR_PCI_WINDOW_SPACING     0x1000000000ULL
> > > -#define SPAPR_PCI_MMIO_WIN_OFF       0xA0000000
> > > -#define SPAPR_PCI_MMIO_WIN_SIZE      (SPAPR_PCI_WINDOW_SPACING - \
> > > -                                     SPAPR_PCI_MEM_WIN_BUS_OFFSET)
> > > -#define SPAPR_PCI_IO_WIN_OFF         0x80000000
> > > +#define SPAPR_PCI_MMIO_WIN_SIZE      0xf80000000
> > >  #define SPAPR_PCI_IO_WIN_SIZE        0x10000
> > >  
> > >  #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
> > > diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> > > index 39dadaa..9e1c5a5 100644
> > > --- a/include/hw/ppc/spapr.h
> > > +++ b/include/hw/ppc/spapr.h
> > > @@ -40,6 +40,10 @@ struct sPAPRMachineClass {
> > >      bool dr_lmb_enabled;       /* enable dynamic-reconfig/hotplug of LMBs */
> > >      bool use_ohci_by_default;  /* use USB-OHCI instead of XHCI */
> > >      const char *tcg_default_cpu; /* which (TCG) CPU to simulate by default */
> > > +    void (*phb_placement)(sPAPRMachineState *spapr, uint32_t index,
> > > +                          uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
> > > +                          hwaddr *mmio, hwaddr *mmio_size,
> > > +                          unsigned n_dma, uint32_t *liobns, Error **errp);
> > 
> > 
> > I still like idea of getting rid of index better. In order to reduce the
> > command line length, you could not enable IO and MMIO32 by default, the
> > default size for MMIO64 could be set to 1TB or something like this, BUID
> > could be just a raw MMIO64 address value.
> 
> Hm, interesting idea.  I do like the idea of making the BUID equal to
> the MMIO64 address.  Obviously we'd still need addresses for the
> default PHB, including 32-bit and PIO windows.
> 
> Trickier, we'd still need to support 'index' for compatibility with
> older command lines.  I guess we could reserve index purely for the
> "legacy" locations - 2.8+ machine types would create the default
> bridge with explicit windows, rather than just using index 0.
> 
> > So essentially you would only have to specify MMIO64 and 2xLIOBN in the
> > command line which does not seem that much of overkill assuming that a
> > second (third,...) PHB is almost never used and even if it is, it will be
> > used for SRIOV VF (which can do 64bit) or virtio (the same).
> 
> Hm, maybe.  3 values is still a bit of a PITA.  I guess liobn64 could
> theoretically be optional, although you're going to want it for
> basically every case where you're creating an additional PHB.
> 
> Or.. what if we insisted the mmio64 window had at least 4G alignment,
> then we might be able to use mmio64 >> 32 (== buid >> 32) as the
> liobn.  Hmm.. guess it would need to be 8G alignment, so we have one
> extra bit.
> 
> ..and I wonder if this might be easier to manage if we introduced a
> new spapr-pci-host-bridge-2 subtype.
> 
> Let me think about this..
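
For the record, the scheme I was musing about above would have looked
roughly like the sketch below.  This is illustration only, with made-up
variable names, and is not part of this series:

    /* Hypothetical replacement for the index-based placement: derive
     * BUID and LIOBNs from a user-supplied, 8 GiB aligned 64-bit MMIO
     * window address.
     */
    if (mmio64_addr & ((1ULL << 33) - 1)) {
        error_setg(errp, "64-bit MMIO window must be 8 GiB aligned");
        return;
    }
    buid = mmio64_addr;                    /* BUID == MMIO64 base */
    for (i = 0; i < n_dma; ++i) {
        /* The 8 GiB alignment leaves bit 0 clear after the shift,
         * giving the extra bit needed for the second DMA window */
        liobns[i] = (mmio64_addr >> 32) | i;
    }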

Alas, no.  I still like the idea of getting rid of index in principle,
but I don't think it's workable in practice.  I had a discussion about
the libvirt consequences of this with Andrea Bolognani, and removing
index is just going to make life horrible for users.

What he pointed out boiled down to the fact that removing index means
we're making something outside qemu responsible for assigning the VM's
address space.  And the things above qemu don't really have enough
information to do that well.  We could do it in libvirt, but it's (a) a
lot of work and (b) puts more knowledge of qemu's internal workings
into libvirt than we want.  The only real alternative is for libvirt
to punt the configuration to the user - but the user shouldn't have to
know the whole spapr machine memory map just to create an extra PHB.
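
To make the difference concrete: with index, adding a PHB is a single
property,

    -device spapr-pci-host-bridge,index=1

whereas without it the user (or libvirt) would have to spell out the
whole placement by hand, something like the line below (property names
as per the fields touched in the patch above, values purely
illustrative, plus a second LIOBN once DDW is in play):

    -device spapr-pci-host-bridge,buid=0x800000020000001,\
            liobn=0x80000100,io_win_addr=0x11080000000,\
            io_win_size=0x10000,mem_win_addr=0x110a0000000,\
            mem_win_size=0xf80000000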

So, back to plan A.  I'll rebase and polish this series a bit and work
towards merging it ASAP.

Maybe we can get rid of index sometime in the future, but we can do it
on top of this interim measure.
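
(For reference, the placement that spapr_phb_placement() above keeps
computing -- unchanged from the current defaults -- works out as
follows for index 1, which is also where the illustrative numbers
earlier in this mail come from:

    uint32_t index  = 1;
    uint64_t buid   = 0x800000020000000ULL + index;     /* 0x800000020000001 */
    hwaddr phb_base = 0x10000000000ULL                  /* 1 TiB ... */
                      + index * 0x1000000000ULL;        /* ... + 64 GiB */
    hwaddr io_win   = phb_base + 0x80000000;   /* + 2 GiB   = 0x11080000000 */
    hwaddr mem_win  = phb_base + 0xa0000000;   /* + 2.5 GiB = 0x110a0000000 */

with the window sizes taken straight from SPAPR_PCI_IO_WIN_SIZE and
SPAPR_PCI_MMIO_WIN_SIZE.)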

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
